Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.
QA276.B47 2001
519.5 dc21    00-031377
Acquisition Editor: Kathleen Boothby Sestak Editor in Chief: Sally Yagan Assistant Vice President of Production and Manufacturing: David W. Riccardi Executive Managing Editor: Kathleen Schiaparelli Senior Managing Editor: Linda Mihatov Behrens Production Editor: Bob Walters Manufacturing Buyer: Alan Fischer Manufacturing Manager: Trudy Pisciotti Marketing Manager: Angela Battle Marketing Assistant: Vince Jansen Director of Marketing: John Tweeddale Editorial Assistant: Joanne Wendelken Art Director: Jayne Conte Cover Design: Jayne Conte
© 2001, 1977 by Prentice-Hall, Inc. Upper Saddle River, New Jersey 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher. Printed in the United States of America 10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London; Prentice-Hall of Australia Pty. Limited, Sydney; Prentice-Hall of Canada Inc., Toronto; Prentice-Hall Hispanoamericana, S.A., Mexico; Prentice-Hall of India Private Limited, New Delhi; Prentice-Hall of Japan, Inc., Tokyo; Pearson Education Asia Pte. Ltd.; Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
." i
~I
"I
~ !
,'
,
I,
,.~~
_.

..
CONTENTS
PREFACE TO THE SECOND EDITION: VOLUME I

PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution-Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The χ², F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: Op and op Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I    The Standard Normal Distribution
  Table I'   Auxiliary Table of the Standard Normal Distribution
  Table II   t Distribution Critical Values
  Table III  χ² Distribution Critical Values
  Table IV   F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data, for example terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable. The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters.

(2) The construction of models for time series, temporal-spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.

(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not and some, though theoretically attractive, cannot be implemented in a human lifetime.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition.

Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions.

In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed.

Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on R^d as well as an elementary introduction to Hilbert space theory. Appendix B is as self-contained as possible with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field.

As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.

Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments.
Chapters 1-4 develop the basic principles and examples of statistics. Save for these changes of emphasis, the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. There is more material on Bayesian models and analysis.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference.

Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem. Robustness from an asymptotic theory point of view appears also.

Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

As in the first edition, problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept with an approximately equal number of new ones added, to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Finally, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading that we also star. Nevertheless, there are clear dependencies between starred sections that follow:
5.4.2 -> 5.4.3 -> 6.2 -> 6.3 -> 6.4 -> 6.5 -> 6.6

Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976 but has, with the field, taken on a new life.

Peter J. Bickel
bickel@stat.berkeley.edu

Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffé theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start, or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell, respectively. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum
Berkeley
1976
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Second Edition
1 1.1 and 6. (3) Arrays of scalars and/or characters as in contingency tablessee Chapter 6or more generally multifactor IDultiresponse data on a number of individuals.2. Subject matter specialists usually have to be principal guides in model formulation.1.5. (4) All of the above and more. for example. are to draw useful information from data using everything that we know. MODELS.1. produce data whose analysis is the ultimate object of the endeavor. A detailed discussion of the appropriateness of the models we shall discuss in particUlar situations is beyond the scope of this book. scientific or industrial. and/or characters. PARAMETERS AND STATISTICS Data and Models Most studies and experiments. digitized pictures or more routinely measurements of covariates and response on a set of n individualssee Example 1. as usual.1. AND PERFORMANCE CRITERIA 1. A generic source of trouble often called grf?ss errors is discussed in greater detail in the section on robustness (Section 3. but we will introduce general model diagnostic tools in Volume 2. A 1 . The goals of science and society. which statisticians share. we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading.1 DATA.Chapter 1 STATISTICAL MODELS. for example. ''The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. in particUlar. Data can consist of: (1) Vectors of scalars. and so on. Chapter 1.4 and Sections 2. In any case all our models are generic and. a single time series of measurements. Moreover. large scale or small. functions as in signal processing. trees as in evolutionary phylogenies. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.3). GOALS. (2) Matrices of scalars and/or characters. measurements.
In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2, Chapter 1.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information about θ, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population. The population is so large that, for modeling purposes, we approximate the actual process of sampling without replacement by sampling with replacement.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.

We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers 0, 1, ..., n corresponding to the number of defective items found. On this space we can define a random variable X given by X(k) = k, k = 0, 1, ..., n. If Nθ is the number of defective items in the population sampled, then by (A.13.6), X has an hypergeometric, H(Nθ, N, n), distribution; that is,

P[X = k] = (Nθ choose k)(N(1 − θ) choose n − k) / (N choose n)   (1.1.1)

if max(n − N(1 − θ), 0) ≤ k ≤ min(Nθ, n). The main difference that our model exhibits from the usual probability model is that Nθ is unknown and, in principle, can take on any value between 0 and N. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family {H(Nθ, N, n)} of probability distributions for X, any one of which could have generated the data actually observed.

Example 1.1.2. Sample from a Population. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which N = ∞, so that sampling with replacement replaces sampling without. Formally, if the measurements are scalar, we observe x1, ..., xn, which are modeled as realizations of X1, ..., Xn independent, identically distributed (i.i.d.) random variables with common unknown distribution function F. We often refer to such X1, ..., Xn as a random sample from F, and also write that X1, ..., Xn are i.i.d. as X with X ~ F, where "~" stands for "is distributed as." The model is fully described by the set F of distributions that we specify. The same model also arises naturally in situation (c). Here we can write the n determinations of μ as

Xi = μ + εi, 1 ≤ i ≤ n   (1.1.2)

where ε = (ε1, ..., εn)^T is the vector of random errors. What should we assume about the distribution of ε, which together with μ completely specifies the joint distribution of X1, ..., Xn? Of course, that depends on how the experiment is carried out. Given the description in (c), we postulate

(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, ε1, ..., εn are independent.
. tbe set of F's we postulate.... be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. Natural initial assumptions here are: (1) The x's and y's are realizations of Xl. patients improve even if they only think they are being treated. and Performance Criteria Chapter 1 (2) The distribution of the error at one determination is the same as that at another. We call the y's treatment observations.Yn. (72) population or equivalently F = {tP (':Ji:) : Jl E R. TwoSample Models. or alternatively all distributions with expectation O.. .3) and the model is alternatively specified by F. . A placebo is a substance such as water tpat is expected to have no effect on the disease and is used to correct for the welldocumented placebo effect.. Let Xl. The Gaussian distribution. respectively.1.Xn are a random sample and. cr > O} where tP is the standard normal distribution.. G E Q} where 9 is the set of all allowable error distributions that we postulate. or by {(I". Then if F is the N(I".t + ~... G) : J1 E R. £n are identically distributed. 0'2)." Now consider situation (d). the Xi are a sample from a N(J. All actual measurements are discrete rather than continuous. that is. whatever be J. There are absolute bounds on most quantitiesIOO ft high men are impossible. so that the model is specified by the set of possible (F. . The classical def~ult model is: (4) The common distribution of the errors is N(o. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. . if drug A is a standard or placebo. It is important to remember that these are assumptions at best only approximately valid. (2) Suppose that if treatment A had been administered to a subject response X would have been obtained. That is.. where 0'2 is unknown.. (3) The control responses are normally distributed. we refer to the x's as control observations. response y = X + ~ would be obtained where ~ does not depend on x.. will have none of this. '.~). . 0 This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations. then F(x) = G(x  1") (1. To specify this set more closely the critical constant treatment effect assumption is often made. E}.4 Statistical Models. if we let G be the distribution function of f 1 and F that of Xl. . for instance. Example 1. Goals.1. We call this the shift model with parameter ~. Often the final simplification is made..t and cr. 1 X Tn a sample from F. Equivalently Xl. Thus. cr 2 ) distribution. Then if treatment B had been administered to the same subject instead of treatment A. . and Y1. (3) The distribution of f is independent of J1. G) pairs. 1 X Tn . heights of individuals or log incomes. then G(·) F(' . . Heights are always nonnegative. we have specified the Gaussian two sample model with equal variances. . YI. 1 Yn a sample from G.2) distribution and G is the N(J. Commonly considered 9's are all distributions with center of symmetry 0.t. 0 ! I = .3. By convention. •. This implies that if F is the distribution of a control.
How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. The advantage of piling on assumptions such as (1)-(4) of Example 1.1.2 is that, if they are true, we know how to combine our measurements to estimate μ in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.4.1). The danger is that, if they are false, our analyses, though correct for the model written down, may be quite irrelevant to the experiment that was actually performed. As our examples suggest, there is tremendous variation in the degree of knowledge and control we have concerning experiments. In some applications we often have a tested theoretical model and the danger is small. For instance, the number of α particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. The number of defectives in the first example clearly has a hypergeometric distribution. In others, we can be reasonably secure about some aspects, but not others. For instance, in Example 1.1.2, we can ensure independence and identical distribution of the observations by using different, equally trained observers with no knowledge of each other's findings. However, we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. This will be done in Sections 3.5.3 and 6.6.

Experiments in medicine and the social sciences often pose particular difficulties. For instance, in Example 1.1.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. In this situation (and generally) it is important to randomize. That is, we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. All the severely ill patients might, for instance, have been assigned to B. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. Fortunately, the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1.3 when F, G are assumed arbitrary. Statistical methods for models of this kind are given in Volume 2.

Using our first three examples for illustrative purposes, we now define the elements of a statistical model. We are given a random experiment with sample space Ω. On this sample space we have defined a random vector X = (X1, ..., Xn). When ω is the outcome of the experiment, X(ω) is referred to as the observations or data. It is often convenient to identify the random vector X with its realization, the data X(ω). Since it is only X that we observe, we need only consider its probability distribution. This distribution is assumed to be a member of a family P of probability distributions on R^n. P is referred to as the model. A review of necessary concepts and notation from probability theory are given in the appendices. Thus, in Example 1.1.1, we observe X and the family P is that of all hypergeometric distributions with sample size n and population size N. In Example 1.1.2, if (1)-(4) hold, P is the family of all distributions according to which X1, ..., Xn are independent and identically distributed with a common N(μ, σ²) distribution.
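The randomization device just described, together with the shift model of Example 1.1.3, can be illustrated by a small simulation. The sketch below is in Python; the group sizes m and n and the values of μ, Δ, and σ are invented for illustration and are not taken from the text:

    import numpy as np

    rng = np.random.default_rng(0)

    m, n = 10, 10                       # control and treatment group sizes (illustrative)
    mu, delta, sigma = 50.0, 5.0, 2.0   # illustrative parameter values

    # Randomization: m of the m + n available subjects are assigned to drug A at random;
    # in a real trial these index sets determine whose responses are recorded as x's and y's.
    subjects = rng.permutation(m + n)
    control_subjects, treatment_subjects = subjects[:m], subjects[m:]

    # Constant-treatment-effect (shift) model: controls follow F = N(mu, sigma^2) and
    # treatments follow G with G(x) = F(x - delta), i.e. N(mu + delta, sigma^2).
    x = rng.normal(mu, sigma, size=m)            # control observations
    y = rng.normal(mu + delta, sigma, size=n)    # treatment observations

    print(y.mean() - x.mean())   # a natural estimate of the treatment effect delta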
1.1.2 Parametrizations and Parameters

To describe P we use a parametrization, that is, a map, θ → Pθ, from a space of labels, the parameter space Θ, to P, or equivalently write P = {Pθ : θ ∈ Θ}. Thus, in Example 1.1.1 we take θ to be the fraction of defectives in the shipment, Θ = {0, 1/N, ..., 1} and Pθ the H(Nθ, N, n) distribution. In Example 1.1.2 with assumptions (1)-(4) we have Θ = R × R+ and, if θ = (μ, σ²), Pθ the distribution on R^n with density ∏ from i = 1 to n of (1/σ)φ((xi − μ)/σ), where φ is the standard normal density. If, moreover, we know we are measuring a positive quantity in this model, we have Θ = R+ × R+. If, on the other hand, we only wish to make assumptions (1)-(3) with ε having expectation 0, we can take Θ = {(μ, G) : μ ∈ R, G with density g such that ∫ x g(x) dx = 0} and P(μ,G) has density ∏ from i = 1 to n of g(xi − μ).

When we can take Θ to be a nice subset of Euclidean space and the maps θ → Pθ are smooth, in senses to be made precise later, models P are called parametric. Models such as that of Example 1.1.2 with assumptions (1)-(3) are called semiparametric. Finally, models such as that of Example 1.1.3 with only (1) holding and F, G taken to be arbitrary are called nonparametric. It's important to note that even nonparametric models make substantial assumptions: in Example 1.1.3, that X1, ..., Xm are identically distributed as are Y1, ..., Yn, and that X1, ..., Xm are independent of each other and of Y1, ..., Yn. The only truly nonparametric but useless model for X ∈ R^n is to assume that its (joint) distribution can be anything.

Note that there are many ways of choosing a parametrization in these and all other problems. We may take any one-to-one function of θ as a new parameter. For instance, in Example 1.1.1 we can use the number of defectives in the population, Nθ, as a parameter, and in Example 1.1.2 we may parametrize the model by the first and second moments of the normal distribution of the observations (i.e., by (μ, μ² + σ²)). What parametrization we choose is usually suggested by the phenomenon we are modeling: θ is the fraction of defectives, μ is the unknown constant being measured. However, as we shall see later, the first parametrization we arrive at is not necessarily the one leading to the simplest analysis.

The critical problem with such parametrizations is that even with "infinite amounts of data," that is, knowledge of the true Pθ, parts of θ remain unknowable. Of even greater concern is the possibility that the parametrization is not one-to-one, that is, such that we can have θ1 ≠ θ2 and yet Pθ1 = Pθ2. Such parametrizations are called unidentifiable. For instance, still in this example, in (1.1.2) suppose that we permit G to be arbitrary. Then the map sending θ = (μ, G) into the distribution of (X1, ..., Xn) remains the same but Θ = {(μ, G) : μ ∈ R, G has (arbitrary) density g}. Now the parametrization is unidentifiable because, for example, μ = 0 and N(1, 1) errors lead to the same distribution of the observations as μ = 1 and N(0, 1) errors. To avoid this, as we shall see later, we will need to ensure that our parametrizations are identifiable, that is, θ1 ≠ θ2 implies Pθ1 ≠ Pθ2.

Dual to the notion of a parametrization, a map from some Θ to P, is that of a parameter, formally a map, ν, from P to another space N. A parameter is a feature ν(P) of the distribution of X. For instance, in Example 1.1.1, the fraction of defectives θ can be thought of as the mean of X/n. In Example 1.1.3 with assumptions (1)-(2) we are interested in Δ, which can be thought of as the difference in the means of the two populations of responses. In addition to the parameters of interest, there are also usually nuisance parameters, which correspond to other unknown features of the distribution of X. For instance, in Example 1.1.2, if the errors are normally distributed with unknown variance σ², then σ² is a nuisance parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter θ, which indexes the family P, that is, make θ → Pθ into a parametrization of P. Implicit in this description is the assumption that θ is a parameter in the sense we have just defined. But given a parametrization θ → Pθ, θ is a parameter if and only if the parametrization is identifiable; formally, we can define θ : P → Θ as the inverse of the map θ → Pθ, from Θ to its range P, iff the latter map is 1-1, that is, if Pθ1 = Pθ2 implies θ1 = θ2. More generally, a function q : Θ → N can be identified with a parameter ν(P) iff Pθ1 = Pθ2 implies q(θ1) = q(θ2), and then ν(Pθ) = q(θ).

Sometimes the choice of P starts by the consideration of a particular parameter. For instance, our interest in studying a population of incomes may precisely be in the mean income, as in Example 1.1.2 where μ denotes the mean income and, thus, the focus of the study. When we sample, say with replacement, and observe X1, ..., Xn independent with common distribution, it is natural to write (1.1.2) with E(εi) = 0. The (μ, G) parametrization of Example 1.1.2 is now well defined and identifiable by (1.1.3) and G = {G : ∫ x dG(x) = 0}. Similarly, instead of postulating a constant treatment effect Δ in Example 1.1.3, we can start by making the difference of the means, δ = μY − μX, the focus of the study. Then δ is identifiable whenever μX and μY exist.

Here are two points to note:

(1) A parameter can have many representations. For instance, in Example 1.1.2 with assumptions (1)-(4) the parameter of interest μ = μ(P) can be characterized as the mean of P, or the median of P, or the midpoint of the interquantile range of P, or more generally as the center of symmetry of P, as long as P is the set of all Gaussian distributions.

(2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). For instance, consider Example 1.1.2 again in which we assume the error ε to be Gaussian but with arbitrary mean Δ. Then P is parametrized by θ = (μ, Δ, σ²), where σ² is the variance of ε. As we have seen this parametrization is unidentifiable and neither μ nor Δ are parameters in the sense we've defined. But σ² = Var(Xi) evidently is, and so is μ + Δ.
1.1.3 Statistics as Functions on the Sample Space

Models and parametrizations are creations of the statistician, but the true values of parameters are secrets of nature. Our aim is to use the data inductively, to narrow down in useful ways our ideas of what the "true" P is. The link for us are things we can compute, statistics. Formally, a statistic T is a map from the sample space X to some space of values T, usually a Euclidean space. Informally, T(x) is what we can compute if we observe X = x. Thus, in Example 1.1.1, a common estimate of θ is the statistic T(x) = x/n, the fraction defective in the sample. In Example 1.1.2 a common estimate of μ is the statistic

T(X1, ..., Xn) = X̄ = (1/n) Σ from i = 1 to n of Xi

and a common estimate of σ² is the statistic

s² = (1/(n − 1)) Σ from i = 1 to n of (Xi − X̄)².

X̄ and s², evaluated at the observed data, are called the sample mean and sample variance. How we use statistics in estimation and other decision procedures is the subject of the next section.

For future reference we note that a statistic, just as a parameter, need not be real or Euclidean valued. For instance, a statistic we shall study extensively in Chapter 2 is the function-valued statistic F̂, called the empirical distribution function, defined for x ∈ R by

F̂(X1, ..., Xn)(x) = (1/n) Σ from i = 1 to n of 1(Xi ≤ x),

where (X1, ..., Xn) are a sample from a probability P on R and 1(A) is the indicator of the event A. This statistic takes values in the set of all distribution functions on R. It estimates the function-valued parameter F defined by its evaluation at x ∈ R,

F(P)(x) = P[X1 ≤ x].

Deciding which statistics are important is closely connected to deciding which parameters are important and, hence, can be related to model formulation as we saw earlier. For instance, consider situation (d) listed at the beginning of this section. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient, then our attention naturally focuses on estimating this constant. If, however, this difference depends on the patient in a complex manner (the effect of each drug is complex), we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure.

Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. Next this model, which now depends on the data, is used to decide what estimate of the measure of difference should be employed (cf. Mandel, 1964). Data-based model selection can make it difficult to ascertain or even assign a meaning to the accuracy of estimates or the probability of reaching correct conclusions. These issues will be discussed further in Volume 2. Nevertheless, we can draw guidelines from our numbers and cautiously proceed. In this volume we assume that the model has been selected prior to the current experiment. This selection is based on experience with previous similar experiments (cf. Lehmann, 1990).

There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. For instance, in situation (d) again, patients may be considered one at a time, sequentially, and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. The experimenter may, for example, assign the drugs alternatively to every other patient in the beginning and then, after a while, assign the drug that seems to be working better to a higher proportion of patients. Moreover, the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. Thus, the number of patients in the study (the sample size) is random. Problems such as these lie in the fields of sequential analysis and experimental design. They are not covered under our general model and will not be treated in this book. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more information.

Regular models. It will be convenient to assume(1) from now on that in any parametric model we consider either:

(1) All of the Pθ are continuous with densities p(x, θ); or

(2) All of the Pθ are discrete with frequency functions p(x, θ), and there exists a set {x1, x2, ...} that is independent of θ such that Σ from i = 1 to ∞ of p(xi, θ) = 1 for all θ.

Such models will be called regular parametric models. In the discrete case we will use both the terms frequency function and density for p(x, θ). See A.10.

Notation. When dependence on θ has to be observed, we shall denote the distribution corresponding to any particular parameter value θ by Pθ. Expectations calculated under the assumption that X ~ Pθ will be written Eθ. Distribution functions will be denoted by F(·, θ), density and frequency functions by p(·, θ). However, these and other subscripts and arguments will be omitted where no confusion can arise.
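The statistics introduced in this section are simple functions of the data. As a small illustration, the following Python sketch computes the sample mean, the sample variance, and the empirical distribution function at a point; the sample itself is simulated, so the particular numbers are only illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(10.0, 2.0, size=50)   # illustrative sample, not data from the text

    xbar = x.mean()                      # sample mean, the usual estimate of mu
    s2 = x.var(ddof=1)                   # sample variance, with divisor n - 1

    def ecdf(sample, t):
        # Empirical distribution function F_hat(t) = (1/n) * #{i : X_i <= t}.
        sample = np.asarray(sample)
        return np.mean(sample <= t)

    print(xbar, s2, ecdf(x, 10.0))       # F_hat(10) estimates F(P)(10) = P[X_1 <= 10]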
1.1.4 Examples, Regression Models

We end this section with two further important examples indicating the wide scope of the notions we have introduced. In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1.1.3. This is the stage for the following.

Example 1.1.4. Regression Models. We observe (z1, Y1), ..., (zn, Yn) where Y1, ..., Yn are independent. The distribution of the response Yi for the ith subject or case in the study is postulated to depend on certain characteristics zi of the ith subject. Thus, zi is a d-dimensional vector that gives characteristics such as sex, age, height, weight, and so on of the ith subject in the study. For instance, in Example 1.1.3 we could take z to be the treatment label and write our observations as (A, x1), ..., (A, xm), (B, y1), ..., (B, yn). This is obviously overkill but suppose that, in the study, drugs A and B are given at several dose levels. Then, d = 2 and zi can denote the pair (Treatment Label, Treatment Dose Level) for patient i.

In general, zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables, whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on zi. If we let f(yi | zi) denote the density of Yi for a subject with covariate vector zi, then the model is

(a) p(y1, ..., yn) = ∏ from i = 1 to n of f(yi | zi).

If we let μ(z) denote the expected value of a response with given covariate vector z, then we can write

(b) Yi = μ(zi) + εi, i = 1, ..., n,

where εi = Yi − E(Yi). Here μ(z) is an unknown function from R^d to R that we are interested in. We usually need to postulate more. A common (but often violated) assumption is

(1) The εi are identically distributed with distribution F. That is, the effect of z on Y is through μ(z) only. In the two-sample models this is implied by the constant treatment effect assumption. See Problem 1.1.8.

On the basis of subject matter knowledge and/or convenience it is usually postulated that

(2) μ(z) = g(β, z), where g is known except for a vector β = (β1, ..., βd)^T of unknowns.

The most common choice of g is the linear form

(3) g(β, z) = Σ from j = 1 to d of βj zj = z^T β, so that (b) becomes

(b') Yi = zi^T β + εi, i = 1, ..., n.

This is the linear model. Often the following final assumption is made:

(4) The distribution F of (1) is N(0, σ²) with σ² unknown.

Then we have the classical Gaussian linear model, which we can write in vector matrix form,

(c) Y ~ N(Zβ, σ²J),

where Z (n × d) = (z1^T, ..., zn^T)^T and J is the n × n identity. Clearly, Example 1.1.2 with assumptions (1)-(4) is a special case of this model. So is Example 1.1.3 with the Gaussian two-sample model μ(A) = μ, μ(B) = μ + Δ. In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. By varying the assumptions we obtain parametric models as with (1), (3) and (4) above, semiparametric as with (1) and (2) with F arbitrary, and nonparametric if we drop (1) and simply treat the zi as a label of the completely unknown distributions of Yi. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems.
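Estimation of β in the Gaussian linear model (c) is taken up in Chapter 2. As a small numerical preview, here is a sketch in Python in which the design matrix Z, the value of β, and the noise level are all invented for illustration, responses are generated according to (b'), and β is recovered by least squares:

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative design: n subjects, d = 2 covariates (an intercept and a dose level).
    n = 30
    Z = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])   # n x d matrix Z
    beta_true = np.array([1.0, 0.5])                                    # made-up value of beta
    y = Z @ beta_true + rng.normal(0.0, 1.0, size=n)                    # Y = Z beta + eps, as in (b')/(c)

    # Least squares estimate of beta.
    beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    print(beta_hat)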
. However. .5. . ' . . . we have p(edp(C21 e1)p(e31 e"e2)".Section 1. Parameters. . en' Using conditional probability theory and ei = {3ei1 + €i.{3Xj1 j=2 n (1.. .cn _d p(edp(e2 I edp(e3 I e2) . ..n ei = {3ei1 + €i. X n be the n determinations of a physical constant J..Il.t+ei. the conceptual issues of stationarity.. and the associated probability theory models and inference for dependent data are beyond the scope of this book. i ~ 2.1 Data. ..xn ) ~ f(x1 I') II f(xj . Example 1..n... . Xi ~ 1. X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. X n is P(X1. we start by finding the density of CI. .. say... .t.(3cd··· f(e n Because €i =  (3e n _1).. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence. i = 1. p(e n I end f(edf(c2 . Consider the model where Xi and assume = J.. An example would be. the elapsed times X I. . A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors. . Let XI. . 0'2) density.1.. To find the density p(Xt. Let j. model (a) assumes much more but it may be a reasonable first approximation in these situations. . .(3)I'). It is plausible that ei depends on Cil because long waves tend to be followed by long waves. save for a brief discussion in Volume 2.. ergodicity..(3) + (3X'l + 'i. .. ... . Xi . In fact we can write f.(1 . at best an approximation for the wave example. and Statistics 11 Finally.. The default assumption. C n where €i are independent identically distributed with density are dependent as are the X's. the model for X I. Measurement Model with Autoregressive Errors.. i = l. eo = a Here the errors el. is that N(O. . Xl = I' + '1. Of course. x n ). n..p(e" I e1. 0 . Models.t = E(Xi ) be the average time for an infinite series of records. . we give an example in which the responses are dependent. .
Summary. In this section we introduced the first basic notions and formalism of mathematical statistics: vector observations X with unknown probability distributions P ranging over models P. The notions of parametrization and identifiability are introduced. The general definition of parameters and statistics is given and the connection between parameters and parametrizations elucidated. This is done in the context of a number of classical examples, the most important of which is the workhorse of statistics, the regression model. We view statistical models as useful tools for learning from the outcomes of experiments and studies. They are useful in understanding how the outcomes can be used to draw inferences that go beyond the particular experiment. Models are approximations to the mechanisms generating the observations. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences.

1.2 BAYESIAN MODELS

Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data. There are situations in which most statisticians would agree that more can be said. For instance, it is possible that, in the past, we have had many shipments of size N that have subsequently been distributed. If the customers have provided accurate records of the number of defective items that they have found, we can construct a frequency distribution {π_0, ..., π_N} for the proportion θ of defectives in past shipments. That is, π_i is the frequency of shipments with i defective items, i = 0, ..., N. Now it is reasonable to suppose that the value of θ in the present shipment is the realization of a random variable θ with distribution given by

P[θ = i/N] = π_i, i = 0, ..., N.   (1.2.1)

Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable θ. Thus, in the inspection Example 1.1.1, given θ = i/N, X has the hypergeometric distribution H(i, N, n), so that

P[X = k, θ = i/N] = π_i (i choose k)(N − i choose n − k) / (N choose n).   (1.2.2)

This is an example of a Bayesian model. There is a substantial number of statisticians who feel that it is always reasonable, and indeed necessary, to think of the true value of the parameter θ as being the realization of a random variable θ with a known distribution. This distribution does not always correspond to an experiment that is physically realizable but rather is thought of as a measure of the beliefs of the experimenter concerning the true value of θ before he or she takes any data.
Thus, before the experiment is performed, the information or belief about the true value of the parameter is described by the prior distribution; after the value x has been obtained for X, the information about θ is described by the posterior distribution. If the statistician's prior beliefs have no objective basis, the resulting statistical inference becomes subjective. There is a wide range of viewpoints in the statistical community, from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example, in which the distribution of θ has an objective interpretation in terms of frequencies. The theory of the subjectivist school is expounded by L. J. Savage (1954), Raiffa and Schlaifer (1961), Lindley (1965), De Groot (1969), and Berger (1985). An interesting discussion of a variety of points of view on these questions may be found in Savage et al. (1962).(1)

(1) Our own point of view is that subjective elements, including the views of subject matter experts, are an essential element in all model building. However, insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). However, by giving θ a distribution purely as a theoretical tool to which no subjective significance is attached, we can obtain important and useful results and insights. We shall return to the Bayesian framework repeatedly in our discussion.

In this section we shall define and discuss the basic elements of Bayesian models. Suppose that we have a regular parametric model {P_θ : θ ∈ Θ}. To get a Bayesian model we introduce a random vector θ, whose range is contained in Θ, with density or frequency function π. The function π represents our belief or information about the parameter θ before the experiment and is called the prior density or frequency function. We now think of P_θ as the conditional distribution of X given θ = θ. The joint distribution of (θ, X) is that of the outcome of a random experiment in which we first select θ = θ according to π and then, given θ = θ, select X according to P_θ. If both X and θ are continuous or both are discrete, then by (B.1.3), (θ, X) is appropriately continuous or discrete with density or frequency function

f(θ, x) = π(θ) p(x, θ).   (1.2.3)

Because we now think of p(x, θ) as a conditional density or frequency function given θ = θ, we will denote it by p(x | θ) for the remainder of this section. Equation (1.2.2) is an example of (1.2.3). In the "mixed" cases, such as θ continuous and X discrete, the joint distribution is neither continuous nor discrete.

The most important feature of a Bayesian model is the conditional distribution of θ given X = x, which is called the posterior distribution of θ.

For a concrete illustration, let us turn again to Example 1.1.1. Suppose that N = 100 and that from past experience we believe that each item has probability .1 of being defective, independently of the other members of the shipment. This would lead to the prior distribution

π_i = (100 choose i)(0.1)^i (0.9)^{100−i}   (1.2.4)

for i = 0, 1, ..., 100.
Before sampling any items, the chance that a given shipment contains 20 or more bad items is, by the normal approximation with continuity correction (A.15.10),

P[100θ ≥ 20] = P[(100θ − 10)/√(100(0.1)(0.9)) ≥ (20 − 10 − 0.5)/√(100(0.1)(0.9))] ≈ 1 − Φ(9.5/3) ≈ 0.001.

Now suppose that a sample of 19 has been drawn, in which 10 defective items are found. To calculate the posterior probability P[100θ ≥ 20 | X = 10] we argue loosely as follows: if before the drawing each item was defective with probability .1 and good with probability .9 independently of the other items, this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Thus 100θ − X, the number of defectives left after the drawing, is independent of X and has a B(81, 0.1) distribution. Therefore,

P[100θ ≥ 20 | X = 10] = P[100θ − X ≥ 10 | X = 10]
                      = P[(100θ − X − 8.1)/√(81(0.1)(0.9)) ≥ (10 − 8.1 − 0.5)/√(81(0.1)(0.9))]
                      ≈ 1 − Φ(0.52) ≈ 0.30.

In general, to calculate the posterior, some variant of Bayes' rule (B.1.4) can be used. Specifically:

(i) The posterior distribution is discrete or continuous according as the prior distribution is discrete or continuous.

(ii) If we denote the corresponding (posterior) frequency function or density by π(θ | x), then

π(θ | x) = π(θ) p(x | θ) / Σ_t π(t) p(x | t)   if θ is discrete,
π(θ | x) = π(θ) p(x | θ) / ∫ π(t) p(x | t) dt  if θ is continuous.   (1.2.8)

In the cases where θ and X are both continuous or both discrete, this is precisely Bayes' rule applied to the joint distribution of (θ, X) given by (1.2.3).
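As a check on the loose argument above, the posterior can also be computed exactly from (1.2.8): the prior (1.2.4) is binomial over i = 0, ..., 100 and, given θ = i/100, X is hypergeometric. The sketch below carries this out numerically; the only ingredient not in the text is the use of scipy's hypergeometric distribution.

```python
import numpy as np
from scipy.stats import binom, hypergeom

N, n, x_obs = 100, 19, 10
i_vals = np.arange(N + 1)
prior = binom.pmf(i_vals, N, 0.1)              # pi_i of (1.2.4)
# P[X = 10 | theta = i/N]: hypergeometric, i defectives in a lot of N, sample size n
lik = hypergeom.pmf(x_obs, N, i_vals, n)
posterior = prior * lik
posterior /= posterior.sum()                    # Bayes' rule (1.2.8), discrete case
print(posterior[i_vals >= 20].sum())            # roughly 0.30, as in the text
```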
Here is an example.

Example 1.2.1. Bernoulli Trials. Suppose that X_1, ..., X_n are indicators of n Bernoulli trials with probability of success θ, where 0 < θ < 1. If we assume that θ has a priori distribution with density π, we obtain by (1.2.8) the posterior density of θ,

π(θ | x_1, ..., x_n) = π(θ) θ^k (1 − θ)^{n−k} / ∫_0^1 π(t) t^k (1 − t)^{n−k} dt   (1.2.9)

for 0 < θ < 1, where k = Σ_i x_i. Note that the posterior density depends on the data only through the total number of successes, Σ_i x_i. We also obtain the same posterior density if θ has prior density π and we only observe Σ_i X_i, which has a B(n, θ) distribution given θ = θ (Problem 1.2.9). We can thus write π(θ | k) for π(θ | x_1, ..., x_n).

To choose a prior π, we need a class of distributions that concentrate on the interval (0, 1). One such class is the two-parameter beta family. This class has the remarkable property that the resulting posterior distributions are again beta distributions. Specifically, upon substituting the β(r, s) density (B.2.11) in (1.2.9), we obtain

c θ^{k+r−1} (1 − θ)^{n−k+s−1}.   (1.2.10)

The proportionality constant c, which depends on k, n, r, and s only, must (see (B.2.11)) be B(k + r, n − k + s), where B(·, ·) is the beta function, and the posterior distribution of θ given Σ X_i = k is β(k + r, n − k + s).

As Figure B.2.2 indicates, the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions, though by no means all; for instance, non-U-shaped bimodal distributions are not permitted. Suppose, for instance, we are interested in the proportion θ of "geniuses" (IQ ≥ 160) in a particular city. To get information we take a sample of n individuals from the city, recording X_i = 0 or 1, i = 1, ..., n. If n is small compared to the size of the city, (A.15.13) leads us to assume that the number X of geniuses observed has approximately a B(n, θ) distribution. Now we may either have some information about the proportion of geniuses in similar cities of the country, or we may merely have prejudices that we are willing to express in the form of a prior distribution on θ. We may want to assume that θ has a density with maximum value at 0, such as that drawn with a dotted line in Figure B.2.2. Or else we may think that π(θ) concentrates its mass near a small number, say 0.05; then we can choose r and s in the β(r, s) distribution so that the mean is r/(r + s) = 0.05 and the variance is very small. The result might be a density such as the one marked with a solid line in Figure B.2.2. If we were interested in some proportion about which we have no information or belief, we might take θ to be uniformly distributed on (0, 1), which corresponds to using the beta distribution with r = s = 1.

A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. Such families are called conjugate. Evidently the beta family is conjugate to the binomial. Another, bigger conjugate family is that of finite mixtures of beta distributions (see Problem 1.2.16). We return to conjugate families in Section 1.6. □
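The beta-binomial update of Example 1.2.1 is easy to verify numerically. The sketch below (with illustrative choices of r, s, n, and k that are not from the text) compares the β(k + r, n − k + s) posterior with a direct evaluation of (1.2.9) on a grid.

```python
import numpy as np
from scipy.stats import beta

r, s = 1.0, 19.0          # prior mean r/(r+s) = 0.05, in the spirit of the genius example
n, k = 50, 4              # hypothetical data: 4 successes in 50 trials
theta = np.linspace(1e-4, 1 - 1e-4, 2000)

# Direct use of (1.2.9): prior density times likelihood, normalized numerically.
unnorm = beta.pdf(theta, r, s) * theta**k * (1 - theta)**(n - k)
numerical = unnorm / np.trapz(unnorm, theta)

conjugate = beta.pdf(theta, k + r, n - k + s)    # beta(k+r, n-k+s) posterior
print(np.max(np.abs(numerical - conjugate)))     # close to 0 up to grid error
```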
Summary. We present an elementary discussion of Bayesian models, introduce the notions of prior and posterior distributions, and give Bayes' rule. We also introduce, by example, the notion of a conjugate family of distributions.

1.3 THE DECISION THEORETIC FRAMEWORK

Given a statistical model, the information we want to draw from data can be put in various forms depending on the purposes of our analysis. We may wish to produce "best guesses" of the values of important parameters, for instance, the fraction defective θ in Example 1.1.1 or the physical constant μ in Example 1.1.2. These are estimation problems.

In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not. For instance, in Example 1.1.3, P's that correspond to no treatment effect (i.e., placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to permit the marketing of drugs that do no good. In Example 1.1.1, the receiver does not wish to keep "bad" shipments, those with θ > θ_0. Making determinations of "specialness" corresponds to testing significance. In testing problems we, at a first cut, state which is supported by the data: "specialness" or, as it is usually called, the "hypothesis," versus "nonspecialness" (the alternative). As these examples suggest, there are many problems of this type in which it is unclear which of two disjoint sets of P's, P_0 or its complement, is special, and the general testing problem is really one of discriminating between the two. For instance, a contractual agreement between shipper and receiver may penalize the return of "good" shipments, those with θ ≤ θ_0, whereas the receiver does not wish to keep "bad" shipments, θ > θ_0. The receiver then wants to discriminate and may be able to attach monetary costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment." Or, if μ_0 is the critical matter density in the universe, so that μ < μ_0 means the universe expands forever and μ > μ_0 corresponds to an eternal alternation of Big Bangs and expansions, then depending on one's philosophy one could take either the P's corresponding to μ < μ_0 or those corresponding to μ > μ_0 as special.

We may have other goals, as illustrated by the next two examples.

Example 1.3.1. Ranking. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not permitted). Thus, if there are k different brands, there are k! possible rankings or actions, one of which will be announced as more consistent with the data than the others. □

Example 1.3.2. Prediction. A very important class of situations arises when, as in Example 1.1.4, we have a vector z, say (age, sex, drug dose)^T, that can be used for prediction of a variable of interest Y, say a 50-year-old male patient's response to the level of a drug. Intuitively, and as we shall see formally later, a reasonable prediction rule for an unseen Y (the response of a new patient) is the function μ(z), the expected value of Y given z. Unfortunately μ(z) is unknown. However, if we have observations (z_i, Y_i), 1 ≤ i ≤ n, we can try to estimate the function μ(·). For instance, if we believe μ(z) = g(β, z), we can estimate β from our observations Y_i of g(β, z_i) and then plug our estimate of β into g. Note that we really want to estimate the function μ(·); our results will guide the selection of doses of drug for future patients.
□

In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. There are many possible choices. For instance, in Example 1.1.1 do we use the observed fraction of defectives
X/n as our estimate, or ignore the data and use historical information on past shipments, or combine them in some way? In Example 1.1.2, to estimate μ do we use the mean of the measurements, X̄ = (1/n) Σ_{i=1}^n X_i, or the median, defined as any value such that half the X_i are at least as large and half no bigger? The same type of question arises in all examples. The answer will depend on the model and, most significantly, on what criteria of performance we use.

Intuitively, in estimation we care how far off we are, in testing whether we are right or wrong, in ranking what mistakes we have made, and so on. In any case, whatever our choice of procedure, we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we are doing. For instance, in designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter; that is, we need a priori estimates of how well even the best procedure can do. In Example 1.1.3, even with the simplest Gaussian model it is intuitively clear, and will be made precise later, that a large σ² will force large m, n to give us a good chance of correctly deciding that the treatment effect is there. On the other hand, once a study is carried out we would probably want not only to estimate Δ but also to know how reliable our estimate is; that is, we would want a posteriori estimates of performance.

These examples motivate the decision theoretic framework: we need to (1) clarify the objectives of a study, (2) point to what the different possible actions are, (3) provide assessments of risk, accuracy, and reliability of statistical procedures, and (4) provide guidance in the choice of procedures for analyzing outcomes of experiments.

1.3.1 Components of the Decision Theory Framework

As in Section 1.1.2, we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. We usually take P to be parametrized, P = {P_θ : θ ∈ Θ}.

Action space. A new component is an action space A of actions or decisions or claims that we can contemplate making. Here are action spaces for our examples.

Estimation. If we are estimating a real parameter such as the fraction θ of defectives in Example 1.1.1, or μ in Example 1.1.2, it is natural to take A = R, though smaller spaces may serve equally well.

Testing. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or, in more usual language, the hypothesis H : P ∈ P_0, in which we identify P_0 with the set of "special" P's). By convention, A = {0, 1}, with 1 corresponding to rejection of H. Thus, in Example 1.1.3, taking action 1 would mean deciding that Δ ≠ 0.

Ranking. Here quite naturally A = {permutations (i_1, ..., i_k) of {1, ..., k}}. Thus, if we have three air conditioners, there are 3! = 6 possible rankings,

A = {(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)}.
Prediction. Here A is much larger. If Y is real, A = {a : a is a function from Z to R}, with a(z) representing the prediction we would make if the new, unobserved Y had covariate value z. Evidently Y could itself range over an arbitrary space Y, and then R would be replaced by Y in the definition of a(·). For instance, if Y = 0 or 1 corresponds to, say, "does not respond" and "responds," respectively, and z = (Treatment, Sex)^T, then a(B, M) would be our prediction of response or no response for a male given treatment B.

Loss function. Far more important than the choice of action space is the choice of loss function, defined as a function l : P × A → R_+. The interpretation of l(P, a) is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature," that is, the probability distribution producing the data, is P. As we shall see, although loss functions, as the name suggests, sometimes can genuinely be quantified in economic terms, they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. We write l(θ, a) when P is parametrized.

Estimation. In estimating a real-valued parameter ν(P), or q(θ) if P is parametrized, the most commonly used loss function is

Quadratic loss: l(P, a) = (ν(P) − a)²  (or l(θ, a) = (q(θ) − a)²).

Other choices that are, as we shall see (Section 5.1), less computationally convenient but perhaps more realistic in penalizing large errors less are

Absolute value loss: l(P, a) = |ν(P) − a|,

and truncated quadratic loss: l(P, a) = min{(ν(P) − a)², d²}. Closely related to the latter is what we shall call confidence interval loss,

l(P, a) = 0 if |ν(P) − a| < d, and l(P, a) = 1 otherwise.

This loss expresses the notion that all errors within the limits ±d are tolerable and errors outside these limits are equally intolerable. Although estimation loss functions are typically symmetric in ν and a, asymmetric loss functions can also be of importance. For instance, l(P, a) = 1(ν < a), which penalizes only overestimation, and by the same amount, arises naturally with lower confidence bounds, as discussed in the subsection on confidence bounds and intervals below.

If ν = (ν_1, ..., ν_d) = (q_1(θ), ..., q_d(θ)) and a = (a_1, ..., a_d) are vectors, examples of loss functions are

l(θ, a) = (1/d) Σ_j (a_j − ν_j)²   (squared Euclidean distance/d),
l(θ, a) = (1/d) Σ_j |a_j − ν_j|    (absolute distance/d),
l(θ, a) = max{|a_j − ν_j| : 1 ≤ j ≤ d}   (supremum distance).

We can also consider function-valued parameters. For instance, in the prediction Example 1.3.2, μ(·) is the parameter of interest. If we use a(·) as a predictor and the new z has marginal distribution Q, then it is natural to consider

l(P, a) = ∫ (μ(z) − a(z))² dQ(z),

the expected squared error if a is used. If, say, Q is the empirical distribution of the z_j in
the training set (z_1, Y_1), ..., (z_n, Y_n), this leads to the commonly considered

l(P, a) = (1/n) Σ_{j=1}^n (μ(z_j) − a(z_j))²,

which is just n^{-1} times the squared Euclidean distance between the prediction vector (a(z_1), ..., a(z_n))^T and the vector parameter (μ(z_1), ..., μ(z_n))^T.

Testing. We ask whether the parameter θ is in the subset Θ_0 or the subset Θ_1 of Θ, where {Θ_0, Θ_1} is a partition of Θ (or, equivalently, whether P ∈ P_0 or P ∈ P_1). If we take action a when the parameter is in Θ_a, we have made the correct decision and the loss is zero. Otherwise, the decision is wrong and the loss is taken to equal one. This 0-1 loss function can be written as

l(θ, a) = 0 if θ ∈ Θ_a  (the decision is correct),
l(θ, a) = 1 otherwise   (the decision is wrong).   (1.3.1)

Of course, other economic loss functions may be appropriate. For instance, in Example 1.1.1 suppose returning a shipment with θ ≤ θ_0 defectives results in a penalty of s dollars, whereas every defective item sold results in an r dollar replacement cost. Then an appropriate loss function is

l(θ, 1) = s if θ ≤ θ_0, and 0 if θ > θ_0   (return the shipment),
l(θ, 0) = rNθ                              (keep the shipment).

Decision procedures. We next give a representation of the process whereby the statistician uses the data to arrive at a decision. The data is a point X = x in the outcome or sample space X. We define a decision rule or procedure δ to be any function from the sample space taking its values in A. Using δ means that if X = x is observed, the statistician takes action δ(x).

Estimation. For the problem of estimating the constant μ in the measurement model, we implicitly discussed two estimates or decision rules: δ_1(x) = the sample mean x̄ and δ_2(x) = the sample median.

Testing. In Example 1.1.3, with the X's and Y's distributed as N(μ, σ²) and N(μ + Δ, σ²) respectively, suppose we are asking whether the treatment effect parameter Δ is 0 or not. A reasonable rule is to decide Δ = 0 if our estimate x̄ − ȳ is close to zero, and to decide Δ ≠ 0 if it is not. Here we mean close to zero relative to the variability in the experiment, that is, relative to the standard deviation σ. Later (in Chapter 4) we will show how to obtain an estimate σ̂ of σ from the data. The decision rule can then be written

δ(x, y) = 0 if |x̄ − ȳ|/σ̂ < c,
δ(x, y) = 1 if |x̄ − ȳ|/σ̂ ≥ c,   (1.3.2)
where c is a positive constant called the critical value. How do we choose c? We need the next concept of the decision theoretic framework, the risk function.

The risk function. We do not know the value of the loss l(P, δ(x)) because P is unknown. Moreover, we typically want procedures to have good properties not at just one particular x, but for a range of plausible x's. Thus, we regard l(P, δ(X)) as a random variable and introduce the risk function

R(P, δ) = E_P[l(P, δ(X))]

as the measure of the performance of the decision rule δ(x). That is, R maps P (or Θ) to R_+. If δ is the procedure used, θ is the true value of the parameter, and X = x is the outcome of the experiment, then the loss is l(P, δ(x)); R(·, δ) is our a priori measure of the performance of δ. We illustrate the computation of R and its a priori use in some examples.

Estimation. Suppose ν = ν(P) is the real parameter we wish to estimate and ν̂(X) is our estimator (our decision rule). If we use quadratic loss, our risk function is called the mean squared error (MSE) of ν̂ and is given by

MSE(ν̂) = R(P, ν̂) = E_P(ν̂(X) − ν(P))²,   (1.3.3)

where for simplicity the dependence on P is suppressed in MSE. The MSE depends on the variance of ν̂ and on what is called the bias of ν̂,

Bias(ν̂) = E(ν̂) − ν,

which can be thought of as the "long-run average error" of ν̂. A useful result is

Proposition 1.3.1. MSE(ν̂) = (Bias ν̂)² + Var(ν̂).

Proof. Write the error as

ν̂ − ν = [ν̂ − E(ν̂)] + [E(ν̂) − ν].

If we expand the square of the right-hand side, keeping the brackets intact, and take the expected value, the cross term will be zero because E[ν̂ − E(ν̂)] = 0. The other two terms are Var(ν̂) and (Bias ν̂)². (If one side is infinite, so is the other and the result is trivially true.) □

We next illustrate the computation and the a priori and a posteriori use of the risk function.
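Proposition 1.3.1 is easy to check by simulation. The sketch below (the sample size and parameter values are illustrative assumptions) estimates the MSE, bias, and variance of the sample median as an estimator of μ in the Gaussian measurement model and confirms that MSE ≈ Bias² + Var.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 1.0, 2.0, 15, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
est = np.median(samples, axis=1)           # the decision rule nu_hat(X)

mse  = np.mean((est - mu) ** 2)
bias = np.mean(est) - mu
var  = np.var(est)
print(mse, bias**2 + var)                  # the two numbers agree up to Monte Carlo error
```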
Example 1.3.3. Estimation of μ (Continued). Suppose X_1, ..., X_n are i.i.d. measurements of μ with N(0, σ²) errors, and assume quadratic loss. If we use the mean X̄ as our estimate of μ, then

Bias(X̄) = 0, Var(X̄) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n,

and, by Proposition 1.3.1,

MSE(X̄) = R(μ, X̄) = σ²/n,   (1.3.4)

which does not depend on μ. If the precision of the measuring instrument σ² is known and equal to σ_0², or realistically is known to satisfy σ² ≤ σ_0², then (1.3.4) can be used for an a priori estimate of the risk of X̄: if we want to be guaranteed MSE(X̄) ≤ ε², we can achieve this by taking at least n_0 = σ_0²/ε² measurements. If we have no idea of the value of σ², such planning is not possible, but having taken n measurements we can estimate σ², for instance by s² = (n − 1)^{-1} Σ_{i=1}^n (X_i − X̄)², and use the a posteriori estimate of risk s²/n, which is of course itself subject to random error. If we only assume, for instance, that the ε_i are i.i.d. with mean 0 and variance σ²(P), then R(P, X̄) = σ²(P)/n still.

Suppose that instead of quadratic loss we used the more natural(1) absolute value loss. Then

R(μ, X̄) = E|X̄ − μ| = (σ/√n) E|√n(X̄ − μ)/σ| = (σ/√n) ∫ |t| φ(t) dt = √(2/π) σ/√n.   (1.3.5)

This harder calculation already suggests why quadratic loss is favored. In fact, computational difficulties arise even with quadratic loss as soon as we think of estimates other than X̄. For instance, if X̂ = median(X_1, ..., X_n) (where we write the median of {a_1, ..., a_n} for any value such that half the a_i are at least as large and half no bigger), then even for quadratic loss the risk R(μ, X̂) = E(X̂ − μ)² can only be evaluated numerically (see Problem 1.3.6), or approximated asymptotically. 0

We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1.3.1 are useful for evaluating the performance of competing estimators.

Example 1.3.4. Let μ_0 denote the mean of a certain measurement included in the U.S. census, say, age or income. Next suppose we are interested in the mean μ of the same measurement for a certain area A of the United States. If we have no data for area A, a natural guess for μ would be μ_0, whereas if we have a random sample of measurements X_1, X_2, ..., X_n from area A, we may want to combine μ_0 and X̄ = n^{-1} Σ_{i=1}^n X_i into an estimator, for instance,

μ̃ = 0.2 μ_0 + 0.8 X̄.

The choice of the weights 0.2 and 0.8 can only be made on the basis of additional knowledge about demography or the economy; we shall return to it later through a formal Bayesian analysis using a normal prior, which illustrates a way of bringing in such knowledge.
Here we compare the performances of μ̃ and X̄ as estimators of μ using MSE. We easily find

Bias(μ̃) = 0.2 μ_0 + 0.8 μ − μ = 0.2(μ_0 − μ),
Var(μ̃) = (0.8)² Var(X̄) = (0.64) σ²/n,
MSE(μ̃) = 0.04(μ_0 − μ)² + (0.64) σ²/n.

Figure 1.3.1 gives the graphs of MSE(μ̃) and MSE(X̄) as functions of μ.

[Figure 1.3.1. The mean squared errors of X̄ and μ̃; the two MSE curves cross at μ = μ_0 ± 3σ/√n.]

If μ is close to μ_0, the risk R(μ, μ̃) of μ̃ is smaller than the risk R(μ, X̄) = σ²/n of X̄, with the minimum relative risk inf{MSE(μ̃)/MSE(X̄) : μ ∈ R} being 0.64, attained when μ = μ_0. Because we do not know the value of μ, neither estimator can be proclaimed as being better than the other. However, if we use as our criterion the maximum (over μ) of the MSE (the minimax criterion), then X̄ is optimal (see Section 3.3). □
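The comparison in Example 1.3.4 is easy to reproduce numerically. The sketch below (with hypothetical values of μ_0, σ, and n) evaluates MSE(μ̃) and MSE(X̄) over a grid of μ and locates the crossing points μ_0 ± 3σ/√n.

```python
import numpy as np

mu0, sigma, n = 50.0, 10.0, 25
mu = np.linspace(mu0 - 10, mu0 + 10, 401)

mse_xbar = np.full_like(mu, sigma**2 / n)                 # sigma^2/n, free of mu
mse_tilde = 0.04 * (mu0 - mu)**2 + 0.64 * sigma**2 / n    # from Example 1.3.4

better = mu[mse_tilde < mse_xbar]
print(better.min(), better.max())   # approx mu0 - 3*sigma/sqrt(n) and mu0 + 3*sigma/sqrt(n)
```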
Testing. The test rule (1.3.2) for deciding between the two values Δ = 0 and Δ ≠ 0 has risk

R(Δ, δ) = l(Δ, 0) P[δ(X, Y) = 0] + l(Δ, 1) P[δ(X, Y) = 1],

which in the case of 0-1 loss is

R(Δ, δ) = P[δ(X, Y) = 1] if Δ = 0, and R(Δ, δ) = P[δ(X, Y) = 0] if Δ ≠ 0.

In the general case, X and Θ denote the outcome and parameter space, respectively, and we are to decide whether θ ∈ Θ_0 or θ ∈ Θ_1, where Θ = Θ_0 ∪ Θ_1, Θ_0 ∩ Θ_1 = ∅. A test function is a decision rule δ(X) that equals 1 on a set C ⊂ X called the critical region and equals 0 on the complement of C; that is, δ(X) = 1[X ∈ C], where 1 denotes the indicator function. If δ(X) = 1 and we decide θ ∈ Θ_1 when in fact θ ∈ Θ_0, we call the error committed a Type I error, whereas if δ(X) = 0 and we decide θ ∈ Θ_0 when in fact θ ∈ Θ_1, we call the error a Type II error. Thus, when the 0-1 loss function is used, the risk of δ(X) is

R(θ, δ) = E(δ(X)) = P(δ(X) = 1) if θ ∈ Θ_0   (probability of Type I error),
R(θ, δ) = P(δ(X) = 0) if θ ∈ Θ_1              (probability of Type II error).   (1.3.6)

Finding good test functions corresponds to finding critical regions with small probabilities of error. In the Neyman-Pearson framework of statistical hypothesis testing, the focus is on first providing a small bound, say .05 or .01, on the probability of Type I error, and then trying to minimize the probability of a Type II error. For instance, in the treatments A and B example, we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding Δ ≠ 0 when Δ = 0), and then look for a procedure with low probability of proclaiming no difference when in fact one treatment is superior to the other (deciding Δ = 0 when Δ ≠ 0). This is not the only approach to testing. For instance, the economic loss function given after (1.3.1) and tests δ_k of the form "reject the shipment if and only if X > k" in Example 1.1.1 lead to (Problem 1.3.18)

R(θ, δ_k) = s P_θ[X > k] + rNθ P_θ[X ≤ k] if θ ≤ θ_0,
R(θ, δ_k) = rNθ P_θ[X ≤ k]                 if θ > θ_0.   (1.3.7)

We shall go into this further in Chapter 4.
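The error probabilities in (1.3.6) and the risk (1.3.7) are straightforward to compute for the shipment test δ_k = 1[X > k]. The sketch below uses hypothetical values of N, n, k, s, r, and θ_0, and the hypergeometric distribution of X from Example 1.1.1.

```python
import numpy as np
from scipy.stats import hypergeom

N, n, k = 100, 19, 3          # lot size, sample size, cutoff of "reject iff X > k"
s, r, theta0 = 500.0, 20.0, 0.10

for theta in np.arange(0.0, 0.31, 0.05):
    D = int(round(N * theta))                    # number of defectives in the lot
    p_reject = hypergeom.sf(k, N, D, n)          # P_theta[X > k]
    # 0-1 loss risk (1.3.6): Type I error if theta <= theta0, Type II error otherwise
    risk01 = p_reject if theta <= theta0 else 1 - p_reject
    # economic risk (1.3.7)
    econ = s * p_reject * (theta <= theta0) + r * N * theta * (1 - p_reject)
    print(round(theta, 2), round(risk01, 3), round(econ, 1))
```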
Confidence Bounds and Intervals

Decision theory enables us to think clearly about an important hybrid of testing and estimation: confidence bounds and intervals (and, more generally, regions). Suppose our primary interest in an estimation-type problem is to give an upper bound for the parameter ν. For instance, an accounting firm examining accounts receivable for a firm on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed. If (say) X represents the amount owed in the sample and ν is the unknown total amount owed, it is natural to seek ν̄(X) such that

P[ν̄(X) ≥ ν] ≥ 1 − α   (1.3.8)

for all possible distributions P of X. Here α is small, usually .05 or .01 or even .001. Such a ν̄ is called a (1 − α) upper confidence bound on ν. This corresponds to an a priori bound of α on the risk of ν̄(X) viewed as a decision procedure with action space R and loss function

l(P, a) = 0 if a ≥ ν(P), and l(P, a) = 1 if a < ν(P).

The 0-1 nature of this loss function makes it resemble a testing loss function and, as we shall see in Chapter 4, the connection is close. It is clear, though, that this formulation is inadequate, because by taking ν̄ ≡ ∞ we can achieve risk 0. What is missing is the fact that, though upper bounding is the primary goal, it is also important to get close to the truth; knowing that at most ∞ dollars are owed is of no use. The decision theoretic framework accommodates this by adding a component reflecting it. For instance, one can take

l(P, a) = a − ν(P) if a ≥ ν(P), and l(P, a) = c if a < ν(P),

for some constant c > 0. Typically, however, rather than this Lagrangian form, it is customary to first fix α in (1.3.8) and then see what one can do to control (say) R(P, ν̄) = E(ν̄(X) − ν(P))_+, where x_+ = x1(x > 0). We shall go into this further in Chapter 4. The same issue arises when we are interested in a confidence interval [ν_(X), ν̄(X)] for ν, defined by the requirement that

P[ν_(X) ≤ ν(P) ≤ ν̄(X)] ≥ 1 − α for all P ∈ P.

We next turn to the final topic of this section, general criteria for selecting "optimal" procedures.

1.3.2 Comparison of Decision Procedures

In this section we introduce a variety of concepts used in the comparison of decision procedures. We shall illustrate some of the relationships between these ideas using the following simple example, in which Θ has two members, A has three points, and the risk of all possible decision procedures can be computed and plotted. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model.

Example 1.3.5. Suppose we have two possible states of nature, which we represent by θ_1 and θ_2. For instance, a certain location either contains oil or does not, a patient either has a certain disease or does not, a component in a piece of equipment either works or does not work. Suppose that three possible actions, a_1, a_2, and a_3, are available. In the context of the foregoing examples, we could drill for oil, sell the location, or sell partial rights; we could operate, administer drugs, or wait and see; we could leave the component in, replace it, or repair it. Suppose the following loss function is decided on:

TABLE 1.3.1. The loss function l(θ, a)

                 (Drill) a_1   (Sell) a_2   (Partial rights) a_3
(Oil)    θ_1          0            10               5
(No oil) θ_2         12             1               6
Thus, if there is oil and we drill, the loss is zero, whereas if there is no oil and we drill, the loss is 12, and so on. Next, an experiment is conducted to obtain information about θ, resulting in the random variable X with possible values coded as 0 and 1 and frequency function p(x, θ) given by the following table. X may represent a certain geological rock formation: when there is oil, formation 0 occurs with frequency 0.3 and formation 1 with frequency 0.7, whereas when there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4.

TABLE 1.3.2. The frequency function p(x, θ); x denotes the rock formation

                 x = 0   x = 1
(Oil)    θ_1      0.3     0.7
(No oil) θ_2      0.6     0.4

We list all possible decision rules in the following table.

TABLE 1.3.3. Possible decision rules δ_i(x)

         δ_1  δ_2  δ_3  δ_4  δ_5  δ_6  δ_7  δ_8  δ_9
x = 0:   a_1  a_1  a_1  a_2  a_2  a_2  a_3  a_3  a_3
x = 1:   a_1  a_2  a_3  a_1  a_2  a_3  a_1  a_2  a_3

Here δ_1 represents "take action a_1 regardless of the value of X," δ_2 corresponds to "take action a_1 if X = 0, take action a_2 if X = 1," and so on. The risk of δ at θ is

R(θ, δ) = E[l(θ, δ(X))] = l(θ, a_1) P[δ(X) = a_1] + l(θ, a_2) P[δ(X) = a_2] + l(θ, a_3) P[δ(X) = a_3].

For instance,

R(θ_1, δ_2) = 0(0.3) + 10(0.7) = 7,
R(θ_2, δ_2) = 12(0.6) + 1(0.4) = 7.6.

The risk points (R(θ_1, δ_i), R(θ_2, δ_i)) are given in Table 1.3.4 and graphed in Figure 1.3.2.

TABLE 1.3.4. Risk points (R(θ_1, δ_i), R(θ_2, δ_i))

i              1     2     3     4     5     6     7     8     9
R(θ_1, δ_i)    0     7     3.5   3    10     6.5   1.5   8.5   5
R(θ_2, δ_i)   12     7.6   9.6   5.4   1     3.0   8.4   4.0   6

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection.
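Table 1.3.4 can be reproduced mechanically; a minimal sketch:

```python
import numpy as np

loss = np.array([[0.0, 10.0, 5.0],      # l(theta1, a1), l(theta1, a2), l(theta1, a3)
                 [12.0, 1.0, 6.0]])     # l(theta2, a1), l(theta2, a2), l(theta2, a3)
p = np.array([[0.3, 0.7],               # p(x=0 | theta1), p(x=1 | theta1)
              [0.6, 0.4]])              # p(x=0 | theta2), p(x=1 | theta2)

# Each nonrandomized rule assigns an action to x = 0 and to x = 1 (Table 1.3.3).
rules = [(a0, a1) for a0 in range(3) for a1 in range(3)]
risk = np.array([[p[t, 0] * loss[t, a0] + p[t, 1] * loss[t, a1]
                  for (a0, a1) in rules] for t in range(2)])
for i, (r1, r2) in enumerate(zip(risk[0], risk[1]), start=1):
    print(f"delta_{i}: R(theta1) = {r1:.1f}, R(theta2) = {r2:.1f}")
```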
[Figure 1.3.2. The risk points (R(θ_1, δ_i), R(θ_2, δ_i)), i = 1, ..., 9.]

1.3.3 Bayes and Minimax Criteria

The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing. We say that a procedure δ improves a procedure δ' if, and only if, R(θ, δ) ≤ R(θ, δ') for all θ, with strict inequality for some θ. It is easy to see that there is typically no rule δ that improves all others. For instance, consider δ_3 and δ_6 in our example: here R(θ_1, δ_3) < R(θ_1, δ_6) but R(θ_2, δ_3) > R(θ_2, δ_6), so neither improves the other. Similarly, in estimating θ ∈ R when X ~ N(θ, σ²), if we ignore the data and use the estimate δ(X) ≡ 0, we obtain MSE(δ) = θ². This absurd rule cannot be improved on at the value θ = 0, because E_0(δ²(X)) = 0 if and only if δ(X) = 0.

The problem of selecting good decision procedures has been attacked in a variety of ways.

(1) Narrow classes of procedures have been proposed using criteria such as considerations of symmetry, unbiasedness (for estimates and tests), or level of significance (for tests). Researchers have then sought procedures that improve all others within the class. We shall pursue this approach further in Chapter 3. Extensions of unbiasedness ideas may be found in Lehmann (1997). Symmetry (or invariance) restrictions are discussed in Ferguson (1967).

(2) A second major approach has been to compare risk functions by global crite-
ria rather than on a pointwise basis. We shall discuss the Bayes and minimax criteria.

Bayes: The Bayesian point of view leads to a natural global criterion. Recall that in the Bayesian model θ is the realization of a random variable or vector θ and that P_θ is the conditional distribution of X given θ = θ. In this framework R(θ, δ) is just E[l(θ, δ(X)) | θ = θ], the expected loss if we use δ and θ = θ. However, we need not stop at this point; we can proceed to calculate what we expect to lose on the average as θ varies. This quantity, which we shall call the Bayes risk of δ and denote by r(δ), is

r(δ) = E[R(θ, δ)] = E[l(θ, δ(X))].   (1.3.9)

The second identity is a consequence of the double expectation theorem (B.1.20) in Appendix B. If we adopt the Bayesian point of view, we prefer δ to δ' if, and only if, it has smaller Bayes risk. If there is a rule δ* that attains the minimum Bayes risk, that is, such that

r(δ*) = min{r(δ) : δ ∈ D},

then it is called a Bayes rule. Note that the Bayes approach leads us to compare procedures on the basis of

r(δ) = ∫ R(θ, δ) π(θ) dθ if θ is continuous with density π(θ),
r(δ) = Σ_i R(θ_i, δ) π(θ_i) if θ is discrete with frequency function π(θ_i).

Such comparisons make sense even if we do not interpret π as a prior density or frequency function, but only as a weight function for averaging the values of the function R(·, δ). For instance, in Example 1.3.5 we might feel that both values of the risk were equally important; it is then natural to compare procedures using the simple average ½[R(θ_1, δ) + R(θ_2, δ)]. But this is just the Bayes comparison where π places equal probability on θ_1 and θ_2.

To illustrate, suppose that in the oil drilling example an expert thinks the chance of finding oil is .2. Then we treat the parameter as a random variable θ with possible values θ_1, θ_2 and frequency function π(θ_1) = 0.2, π(θ_2) = 0.8. The Bayes risk of δ is, therefore,

r(δ) = 0.2 R(θ_1, δ) + 0.8 R(θ_2, δ).   (1.3.10)

Table 1.3.5 gives r(δ_i) specified by (1.3.10) and max_θ R(θ, δ_i) for the procedures of Table 1.3.3.

TABLE 1.3.5. Bayes and maximum risks of the procedures of Table 1.3.3

i              1      2      3      4      5     6     7      8     9
r(δ_i)         9.6    7.48   8.38   4.92   2.8   3.7   7.02   4.9   5.8
max_θ R(θ,δ_i) 12     7.6    9.6    5.4   10     6.5   8.4    8.5   6

From Table 1.3.5 we see that rule δ_5 is the unique Bayes rule for our prior; its Bayes risk is 2.8. The method of computing Bayes procedures by listing all available δ and their Bayes risks is impracticable in general. We postpone the consideration of posterior analysis, the only reasonable general computational method, to Section 3.2.
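Continuing the small computation above, the Bayes and maximum risks of Table 1.3.5 follow directly from the risk points of Table 1.3.4; the sketch below also reports the nonrandomized minimax rule discussed next.

```python
import numpy as np

# Risk points (R(theta1, delta_i), R(theta2, delta_i)) from Table 1.3.4
R = np.array([[0, 7, 3.5, 3, 10, 6.5, 1.5, 8.5, 5],
              [12, 7.6, 9.6, 5.4, 1, 3.0, 8.4, 4.0, 6]])
prior = np.array([0.2, 0.8])                  # pi(theta1), pi(theta2)

bayes_risk = prior @ R                        # r(delta_i) of (1.3.10)
max_risk = R.max(axis=0)
print("Bayes rule: delta_%d with r = %.2f" % (bayes_risk.argmin() + 1, bayes_risk.min()))
print("Nonrandomized minimax: delta_%d with max risk = %.1f"
      % (max_risk.argmin() + 1, max_risk.min()))
```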
Minimax: Instead of averaging the risk as the Bayesian does, we can look at the worst possible risk,

sup{R(θ, δ) : θ ∈ Θ}.

A procedure δ* for which

sup{R(θ, δ*) : θ ∈ Θ} = inf{sup{R(θ, δ) : θ ∈ Θ} : δ ∈ D}

is called minimax (it minimizes the maximum risk). This criterion of optimality is very conservative: it aims to give maximum protection against the worst that can happen. The principle would be compelling if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. The criterion comes from the general theory of two-person zero-sum games of von Neumann.(2) We briefly indicate "the game of decision theory": Nature (Player I) picks a point θ ∈ Θ independently of the statistician (Player II), who picks a decision procedure δ from D; Player II then pays Player I the amount R(θ, δ). Of course, Nature's intentions and degree of foreknowledge are not that clear, and most statisticians find the minimax principle too conservative to employ as a general rule. Nevertheless, in many cases the principle can lead to very reasonable procedures.

To illustrate computation of the minimax rule we turn to Table 1.3.5. From the listing of max(R(θ_1, δ), R(θ_2, δ)) we see that δ_4 is minimax among the nonrandomized rules, with a maximum risk of 5.4.

Randomized decision rules: Students of game theory will realize at this point that the statistician may be able to lower the maximum risk, without requiring any further information, by using a random mechanism to determine which rule to employ. For instance, suppose that in Example 1.3.5 we toss a fair coin and use δ_4 if the coin lands heads and δ_6 otherwise. Our expected risk would be

½[R(θ_1, δ_4) + R(θ_1, δ_6)] = 4.75 if θ = θ_1,
½[R(θ_2, δ_4) + R(θ_2, δ_6)] = 4.20 if θ = θ_2.

The maximum risk 4.75 is strictly less than the maximum risk 5.4 of δ_4. In general, a randomized decision procedure can be thought of as a random experiment whose outcomes are members of D; its maximum risk is the upper pure value of the game. For simplicity we shall discuss only randomized procedures that select among a finite set δ_1, ..., δ_q of nonrandomized procedures.
If the randomized procedure δ selects δ_i with probability λ_i, λ_i ≥ 0, Σ_{i=1}^q λ_i = 1, we define

R(θ, δ) = Σ_{i=1}^q λ_i R(θ, δ_i).   (1.3.11)

Similarly we can define, given a prior π on Θ, the Bayes risk of δ,

r(δ) = Σ_{i=1}^q λ_i E[R(θ, δ_i)].   (1.3.12)

A randomized Bayes procedure minimizes r(δ) among all randomized procedures; a randomized minimax procedure minimizes max_θ R(θ, δ) among all randomized procedures.

We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.3.5, and then indicate how much of what we learn carries over to the general case. As in Example 1.3.5, we represent the risk of any procedure δ by the vector (R(θ_1, δ), R(θ_2, δ)) and consider the risk set

S = {(R(θ_1, δ), R(θ_2, δ)) : δ ∈ D*},

where D* is the set of all procedures, including randomized ones. By (1.3.11), S is the convex hull of the risk points (R(θ_1, δ_i), R(θ_2, δ_i)), i = 1, ..., 9 (Figure 1.3.3); that is,

S = {(r_1, r_2) : r_1 = Σ λ_i R(θ_1, δ_i), r_2 = Σ λ_i R(θ_2, δ_i), λ_i ≥ 0, Σ λ_i = 1}.

Let γ = π(θ_1) and 1 − γ = π(θ_2), 0 < γ < 1. By (1.3.12), all rules having Bayes risk c correspond to points in S that lie on the line

γ r_1 + (1 − γ) r_2 = c.   (1.3.13)

As c varies, (1.3.13) defines a family of parallel lines with slope −γ/(1 − γ). Finding the Bayes rule corresponds to finding the smallest c for which the line (1.3.13) intersects S; this is the line with slope −γ/(1 − γ) that is tangent to S at the lower boundary of S. All points of S that are on the tangent are Bayes. Two cases arise:

(1) The tangent has a unique point of contact, a risk point corresponding to a nonrandomized rule. For instance, when γ = 0.2, this point is (10, 1), which is the risk point of the Bayes rule δ_5 (see Figure 1.3.3).

(2) The tangent is the line segment connecting two "nonrandomized" risk points δ_i and δ_j. A point (r_1, r_2) on this segment can be written

(λ R(θ_1, δ_i) + (1 − λ) R(θ_1, δ_j), λ R(θ_2, δ_i) + (1 − λ) R(θ_2, δ_j)), 0 ≤ λ ≤ 1.   (1.3.14)
The point (1.3.14) corresponds to the randomized rule, call it δ_λ, that chooses δ_i with probability λ and δ_j with probability 1 − λ,

δ_λ: use δ_i with probability λ, δ_j with probability 1 − λ, 0 ≤ λ ≤ 1.   (1.3.15)

Each one of these rules is, by (1.3.13), Bayes against π. In particular, we can choose two nonrandomized Bayes rules from this class, namely δ_i (take λ = 1) and δ_j (take λ = 0). Because changing the prior π corresponds to changing the slope −γ/(1 − γ) of the line given by (1.3.13), as γ ranges over (0, 1) the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (i.e., all points on the lower boundary of S whose tangents are the vertical axis or lines with nonpositive slopes).

To locate the risk point of the minimax rule, consider the family of squares

Q(c) = {(r_1, r_2) : 0 ≤ r_1 ≤ c, 0 ≤ r_2 ≤ c},   (1.3.16)

whose diagonal is the line r_1 = r_2. Let c* be the smallest c for which Q(c) ∩ S ≠ ∅ (i.e., the first square that touches S). Then Q(c*) ∩ S is either a point or a horizontal or vertical line segment. It is the set of risk points of minimax rules, because any point with smaller maximum risk would belong to Q(c) ∩ S with c < c*, contradicting the choice of c*.

[Figure 1.3.3. The convex hull S of the risk points (R(θ_1, δ_i), R(θ_2, δ_i)), i = 1, ..., 9, together with the square Q(c*); the point where Q(c*) first touches S is the risk point of the minimax rule.]
In our example, the first point of contact between the squares and S is the intersection of the line r_1 = r_2 with the line segment connecting the risk points of δ_4 and δ_6. By (1.3.14) with i = 4 and j = 6, the minimax rule is the mixture (1.3.15) with λ the solution of

3λ + 6.5(1 − λ) = 5.4λ + 3(1 − λ),

which yields λ ≈ 0.59. Note also that randomized Bayes procedures are mixtures of nonrandomized ones in the sense of (1.3.14).

There is another important concept that we want to discuss in the context of the risk set. A decision rule δ is said to be inadmissible if there exists another rule δ' such that δ' improves δ; all rules that are not inadmissible are called admissible. Using Table 1.3.4 we see, for instance, that δ_2 is inadmissible because δ_4 improves it: R(θ_1, δ_4) = 3 < 7 = R(θ_1, δ_2) and R(θ_2, δ_4) = 5.4 < 7.6 = R(θ_2, δ_2). To gain some insight into the class of all admissible procedures (randomized and nonrandomized) we again use the risk set. A rule δ with risk point (r_1, r_2) is admissible if, and only if, the set {(x, y) : x ≤ r_1, y ≤ r_2} has only (r_1, r_2) in common with S; that is, there is no (x, y) in S with x ≤ r_1 and y ≤ r_2 other than (r_1, r_2) itself. From the figure it is clear that such points must lie on the lower left boundary of S. Thus, the set of all lower left boundary points of S corresponds to the class of admissible rules and, thus, agrees with the set of risk points of Bayes procedures.

If Θ is finite with k members, we can represent the whole risk function of a procedure δ by a point in k-dimensional Euclidean space, and the risk set in general is

S = {(R(θ_1, δ), ..., R(θ_k, δ)) : δ ∈ D*},

where D* is the set of all randomized decision procedures; if k = 2 we can plot the set of all such points obtained by varying δ, as we have done.

The following features exhibited by the risk set of Example 1.3.5 can be shown to hold generally (see Ferguson, 1967, for instance):

(a) For any prior there is always a nonrandomized Bayes procedure if there is a randomized one.

(b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant.

(c) If Θ is finite(4) and minimax procedures exist, they are Bayes procedures.

(d) Under some conditions, all admissible procedures are Bayes procedures.

(e) If a Bayes prior has π(θ_i) > 0 for all i, then any Bayes procedure corresponding to π is admissible.
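The randomized minimax calculation at the start of this discussion is easy to verify numerically; the sketch below scans mixtures of δ_4 and δ_6 and recovers λ ≈ 0.59 together with the corresponding maximum risk.

```python
import numpy as np

r4 = np.array([3.0, 5.4])      # (R(theta1, delta4), R(theta2, delta4))
r6 = np.array([6.5, 3.0])      # (R(theta1, delta6), R(theta2, delta6))

lams = np.linspace(0, 1, 10001)
mix = np.outer(lams, r4) + np.outer(1 - lams, r6)   # risk points of the mixtures (1.3.14)
worst = mix.max(axis=1)
best = lams[worst.argmin()]
print(best, worst.min())        # approx 0.59 and the minimax value of the randomized rule
```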
If Θ is not finite, there are typically admissible procedures that are not Bayes. However, under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses), at least when procedures with the same risk function are identified. These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility. We stress that looking at randomized procedures is essential for these conclusions, although it usually turns out that all admissible procedures of interest are indeed nonrandomized. Other theorems are available characterizing larger but more manageable classes of procedures, which include the admissible rules; an important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson, 1967, Section 3.4). For more information on these topics, we refer to Blackwell and Girshick (1954) and Ferguson (1967).

Summary. We introduce the decision theoretic foundation of statistics, including the notions of action space, decision rule, loss function, and risk, through various examples including estimation, testing, ranking, and prediction. The basic bias-variance decomposition of mean square error is presented. The basic global comparison criteria, Bayes and minimax, are presented, as well as a discussion of optimality by restriction and notions of admissibility.

1.4 PREDICTION

The prediction Example 1.3.2 presented important situations in which a vector z of covariates can be used to predict an unseen response Y. Here are some further examples of the kind of situation that prompts our study in this section. A college admissions officer has available the College Board scores at entrance and first-year grade point averages of freshman classes for a period of several years; using this information, he wants to predict the first-year grade point averages of entering freshmen on the basis of their College Board scores. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. A government expert wants to predict the amount of heating oil needed next winter. A meteorologist wants to estimate the amount of rainfall in the coming spring. Similar problems abound in every field.

The frame we shall fit them into is the following. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. In terms of our preceding discussion, Z is the information that we have and Y the quantity to be predicted. For example, in the college admissions situation, Z would be the College Board score of an entering freshman and Y his or her first-year grade point average. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. We want to find a function g defined on the range of Z such that g(Z) (the predictor) is "close" to Y. Next we must specify what "close" means. One reasonable measure of distance is (g(Z) − Y)², the squared prediction error when g(Z) is used to predict Y. Since Y is not known, we turn to the mean squared prediction error (MSPE)

Δ²(Y, g(Z)) = E[g(Z) − Y]²,

or its square root √(E(g(Z) − Y)²). The MSPE is the measure traditionally used in the
mathematical theory of prediction, whose deeper results (see, for example, Grenander and Rosenblatt, 1957) presuppose it. The method that we employ to prove our elementary theorems does generalize to other measures of distance than Δ(Y, g(Z)), such as the mean absolute error E(|g(Z) − Y|) (Problems 1.4.7-1.4.11). See also Remark 1.4.5 and Section 3.2, where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss; just how widely applicable the notions of this section are will become apparent there.

The class G of possible predictors g may be the nonparametric class G_NP of all g : R^d → R, or it may be restricted to some subset of this class. In this section we consider G_NP and the class G_L of linear predictors of the form a + Σ_{j=1}^d b_j z_j.

We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information, or, equivalently, in which Z is a constant. In this situation all predictors are constant, and the best one is the number c_0 that minimizes E(Y − c)² as a function of c.

Lemma 1.4.1. E(Y − c)² is either ∞ for all c or is minimized uniquely by c = μ = E(Y). In fact, when EY² < ∞,

E(Y − c)² = Var Y + (c − μ)².   (1.4.1)

Proof. E(Y − c)² < ∞ for some c if and only if EY² < ∞, in which case it is finite for all c. Writing Y − c = (Y − μ) + (μ − c) and expanding, the cross term vanishes because E(Y − μ) = 0, and (1.4.1) follows. Because (1.4.1) has a unique minimum at c = μ, the lemma follows. □

Now we can solve the problem of finding the best MSPE predictor of Y given a vector Z. Let

μ(z) = E(Y | Z = z).   (1.4.2)

By the substitution theorem for conditional expectations (B.1.16),

E[(Y − g(Z))² | Z = z] = E[(Y − g(z))² | Z = z].   (1.4.3)

Because g(z) is a constant, Lemma 1.4.1 assures us that

E[(Y − g(z))² | Z = z] = E[(Y − μ(z))² | Z = z] + [g(z) − μ(z)]².   (1.4.4)

If we now take expectations of both sides and employ the double expectation theorem (B.1.20), we can conclude the following.

Theorem 1.4.1. If Z is any random vector and Y any random variable, then either E(Y − g(Z))² = ∞ for every function g, or

E(Y − μ(Z))² ≤ E(Y − g(Z))²
34 Statistical Models. (1. then by the iterated expectation theorem.4. that is. Property (1. let h(Z) be any function of Z. = I'(Z). then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <. That is.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV .5) is obtained by taking g(z) ~ E(Y) for all z.4.4.Il(Z) denote the random prediction error. . 1.5) An important special case of (1. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV .4. Suppose that Var Y < 00. Theorem 1.4.4. Proof.4.I'(Z))' + E(g(Z) 1'(Z»2 ( 1. To show (a).4. we can derive the following theorem. I'(Z) is the unique E(Y . Goals.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes.4.6) which is generally valid because if one side is infinite.4.E(U)] = O.1(c) is equivalent to (1.8) or equivalently unless Y is a function oJZ.1. Var(Y I z) = E([Y .4. 0 Note that Proposition 1. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z.4.E(V)J Let € = Y . then Var(E(Y I Z)) < Var Y. which will prove of importance in estimation theory.5) becomes. then we can write Proposition 1. and recall (B. so is the other.20). Vat Y = E(Var(Y I Z» + Var(E(Y I Z). (1. and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1. Properties (b) and (c) follow from (a).4. then (1. If E(IYI) < 00 but Z and Y are otherwise arbitrary. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O. = O. As a consequence of (1.2.6).E(v)IIU . when E(Y') < 00.g(Z)' = E(Y .E(Y I z)]' I z).6) and that (1. Infact.
Proof. The assertion (1.4.7) follows immediately from (1.4.6), because

E(Var(Y | Z)) = E(Y − E(Y | Z))² ≥ 0.

Equality in (1.4.7) can hold if, and only if, E(Y − E(Y | Z))² = 0, and this can hold if, and only if, Y = E(Y | Z). □

Example 1.4.1. An assembly line operates either at full, half, or quarter capacity. Within any given month the capacity status does not change. Each day there can be 0, 1, 2, or 3 shutdowns due to mechanical failure. We want to predict the number of failures for a given day knowing the state of the assembly line for the month. The following table gives the joint frequency function p(z, y) = P(Z = z, Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day; the row sums p_Z(z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state, whereas the column sums p_Y(y) yield the frequency of 0, 1, 2, or 3 failures among all days.

[Table of p(z, y) for z = 1, 1/2, 1/4 and y = 0, 1, 2, 3, with margins p_Z(z) and p_Y(y).]

By Theorem 1.4.1, the best predictor of the number of failures on a day in a month with capacity state z is

E(Y | Z = z) = Σ_{i=0}^3 i P[Y = i | Z = z] = Σ_{i=0}^3 i p(z, i)/p_Z(z),

evaluated at z = 1, 1/2, and 1/4 from the table. These fractional figures are not too meaningful as predictors of the natural-number values of Y. But this predictor is also the right one if we are trying to guess, as we reasonably might, the average number of failures per day in the given month, (1/30) Σ_{i=1}^{30} Y_i, because its conditional expectation given Z = z is again E(Y | Z = z).

The MSPE of the best predictor can be calculated in two ways. The first is direct:

E(Y − E(Y | Z))² = Σ_z Σ_{y=0}^3 (y − E(Y | Z = z))² p(z, y) = 0.885.

The second way is to use (1.4.6), writing

E(Y − E(Y | Z))² = Var Y − Var(E(Y | Z)) = E(Y²) − E[(E(Y | Z))²] = Σ_y y² p_Y(y) − Σ_z [E(Y | Z = z)]² p_Z(z) = 0.885,

as before. □
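The two computations in Example 1.4.1 are mechanical once the joint frequency table is available. The sketch below uses a made-up joint table of the same shape (it is not the table of the example) purely to illustrate the calculation.

```python
import numpy as np

z_states = [1.0, 0.5, 0.25]
y_vals = np.arange(4)
# Hypothetical joint frequencies p(z, y); rows are capacity states, columns y = 0,...,3.
p = np.array([[0.05, 0.05, 0.10, 0.30],
              [0.05, 0.05, 0.10, 0.05],
              [0.10, 0.10, 0.03, 0.02]])
assert abs(p.sum() - 1.0) < 1e-12

p_z = p.sum(axis=1)
cond_mean = (p * y_vals).sum(axis=1) / p_z              # E(Y | Z = z), the best predictor

# Direct MSPE computation
mspe_direct = ((y_vals - cond_mean[:, None])**2 * p).sum()
# Via (1.4.6): E(Y^2) - E[(E(Y|Z))^2]
p_y = p.sum(axis=0)
mspe_decomp = (y_vals**2 * p_y).sum() - (cond_mean**2 * p_z).sum()
print(cond_mean, mspe_direct, mspe_decomp)               # the two MSPE values agree
```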
6) we can also write Var /lO(Z) p = VarY .E(Y I Z p') (1.p')). Suppose Y and Z are bivariate normal random variables with the same mean .4. (1. If p > O. Therefore. . . The Bivariate Normal Distribution. can reasonably be thought of as a measure of dependence.L[E(Y I Z y .10) I I " The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. the best predictor of Y using Z is the linear function Example 1.J. E((Y . = Z))2 I Z = z) = u~(l _ = u~(1 _ p')./lz). Regression toward the mean.9) I . Theorem B.4. The larger this quantity the more dependent Z and Yare. The term regression was coined by Francis Galton and is based on the following observation. Because of (1. " " is independent of z. and Performance Criteria Chapter 1 The second way is to use (1.6) writing. a'k. the predictor is a monotone increasing function of Z indicating that large (small) values of Y tend to be associated with large (small) values of Z. which is the MSPE of the best constant predictor. = z)]'pz(z) 0. is usually called the regression (line) of Y on Z.2.2 tells us that the conditional dis /lO(Z) Because .4. o tribution of Y given Z = z is N City + p(Uy j UZ ) (z . (1 . this quantity is just rl'.4. 2 I . In the bivariate normal case.4. Goals. the MSPE of our predictor is given by. p) distribution.885 as before.36 Statistical Models. One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y.y as we would expect in the case of independence. O"~. Y) has a N(Jlz l f.11) I The line y = /lY + p(uy juz)(z ~ /lz). u~. If (Z. for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y. whereas its magnitude measures the degree of such dependence.LY. Similarly. = /lY + p(uyjuz)(Z ~ /lZ).4. which corresponds to the best predictor of Y given Z in the bivariate normal model.4. Thus. If p = 0. E(Y . E(Y ~ E(Y I Z))' VarY ~ Var(E(Y I Z)) E(Y') ~ E[(E(Y I Z))'] I>'PY(Y) .E(Y I Z))' (1. p < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence. the best predictor is just the constant f.
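A simulation makes the bivariate normal formulas concrete: the linear predictor μY + ρ(σY/σZ)(z − μZ) achieves MSPE σY²(1 − ρ²), while the constant predictor μY achieves σY². The parameter values in the sketch below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values for the sketch.
mu_z, mu_y = 68.0, 69.0
sigma_z, sigma_y, rho = 3.0, 3.0, 0.5

cov = [[sigma_z**2, rho * sigma_z * sigma_y],
       [rho * sigma_z * sigma_y, sigma_y**2]]
z, y = rng.multivariate_normal([mu_z, mu_y], cov, size=500_000).T

best = mu_y + rho * (sigma_y / sigma_z) * (z - mu_z)   # best predictor E(Y | Z)

print("MSPE of best predictor (simulated):", np.mean((y - best) ** 2))
print("sigma_y^2 (1 - rho^2)             :", sigma_y**2 * (1 - rho**2))
print("MSPE of constant predictor mu_y   :", np.mean((y - mu_y) ** 2))
```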
" Zd)T be a d x 1 covariate vector with mean JLz = (ttl. . or the average height of sbns whose fathers are the height Z. distribution (Section B. In Galton's case. in particular in Galton's studies.4.p)1" + pZ) = p'(T'. these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. ttd)T and suppose that (ZT.12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy..11). .. Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) . I"Y = E(Y). The variability of the predicted value about 11. .4.4.. the distribution of (Z. and positive correlation p. Yn ) from the population. YI ). We shall see how to do this in Chapter 2. Y) is unavailable and the regression line is estimated on the basis of a sample (Zl.4. so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y.p)tt + pZ. . The Multivariate Normal Distribution. coefficient ofdetennination or population Rsquared. . This quantity is called the multiple correlation coefficient (MCC). the best predictor E(Y I Z) ofY is the linear function (1.. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate. consequently. I"Y) T. . 0 Example 1.. Let Z = (Zl. tall fathers tend to have shorter sons. Thus. Note that in practice. E). should.5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf. variance cr2. ." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. uYYlz) where.8.. the MCC equals the square of the .Section 1.3. is closer to the population mean of heights tt than is the height of the father. (1 . Y» T T ~ E yZ and Uyy = Var(Y).. . One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. where the last identity follows from (1. Then the predicted height of the son. be less than that of the actual heights and indeed Var((! . We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y .COV(Zd. By (1. usual correlation coefficient p = UZy / crfyCT lz when d = 1. (Zn.4 Prediction 37 tt. there is "regression toward the mean. E zy = (COV(Z" Y). Thus.6) in which I' (I"~. Theorem B.8 = EziEzy anduYYlz ~ uyyEyzEziEzy. y)T has a (d + !) multivariate normal. N d +1 (I'.6).6.
Suppose that E(Z') and E(Y') are finite and Z and Y are not constant. The best linear predictor.209. when the distribution of (ZT. (Z~. We expand {Y . The natural class to begin with is that of linear combinations of components of Z. Goals. If we are willing to sacrifice absolute excellence.bZ)' = E(Y') + E(Z')(b  Therefore. yn)T See Sections 2.E(Z')b~. let Y and Z = (Zl.5%. and Performance Criteria Chapter 1 For example. P~.4.13) . Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6. The percentage reductions knowing the father's and both parent's heights are 20.2. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height.3%. l. Z2 = father's height). . Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z.. respectively.l Proof.1. In words.4. Y) a I VarZ.1 and 2.4. Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor. E(Y .335. In practice. we can avoid both objections by looking for a predictor that is best within a class of simple predictors. 29W· Then the strength of association between a girl's height and those of her mother and father.38 Statistical Models.ba) + Zba]}' to get ba )' . The problem of finding the best MSPE predictor is solved by Theorem 1.~ ~:~~). are P~. E(Y . . Y. y)T is unknown.)T. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1..:.zz ~ (.39 l. p~y = . and parents. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi.9% and 39.393.. We first do the onedimensional case.. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b .y = .3. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') .y = .bZ)' is uniquely minintized by b = ba.zy ~ (407. respectively.
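The heights illustration amounts to solving β = Σ_ZZ⁻¹Σ_ZY and forming the multiple correlation coefficient Σ_YZ Σ_ZZ⁻¹ Σ_ZY / σ_YY. The Python sketch below organizes that computation; the covariance entries are hypothetical stand-ins, so the printed values are not the .393, .209, .335 of the text.

```python
import numpy as np

# Hypothetical covariance structure for (Z1, Z2, Y): mother's height,
# father's height, girl's height.  Entries are stand-ins, not the text's.
Sigma_zz = np.array([[6.0, 1.0],
                     [1.0, 6.0]])
Sigma_zy = np.array([4.0, 3.0])
sigma_yy = 6.39
mu_z = np.array([63.0, 69.0])
mu_y = 64.0

beta = np.linalg.solve(Sigma_zz, Sigma_zy)     # beta = Sigma_ZZ^{-1} Sigma_ZY
mspe = sigma_yy - Sigma_zy @ beta              # sigma_YY - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY
mcc = (Sigma_zy @ beta) / sigma_yy             # population R-squared

def predict(z):
    """Best predictor mu_0(z) = mu_Y + (z - mu_Z)^T beta, as in (1.4.12)."""
    return mu_y + (np.asarray(z) - mu_z) @ beta

print("beta:", beta, " MSPE:", round(mspe, 3), " MCC:", round(mcc, 3))
print("predicted height at z = (64, 70):", round(predict([64.0, 70.0]), 2))
```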
in Example 1. it must coincide with the best linear predictor.E(Z)f[Z .E(Y) to conclude that b. Substituting this value of a in E(Y .1 the best linear predictor and best predictor differ (see Figure 1.4 Prediction 39 To prove the second assertion of the theorem note that by (lA.E(Y)) .b. That is. is the unique minimizing value.4.5). 0 Note that if E(Y I Z) is of the form a + bZ. This is in accordance with our evaluation of E(Y I Z) in Example 1. = I'Y + (Z  J. E(Y . I 1. E(Y .a . then a = a. Theorem 1.E(Z)J))1 exist. (1.boZ)' = 0. Note that R(a. A loss of about 5% is incurred by using the best linear predictor. E[Y . Therefore.a.a .b(Z . and b = b1 • because.13) we obtain tlte proof of the CauchySchwarz inequality (A.E(Y)]) = EziEzy.bZ) + (E(Y) . .l7) in the appendix.E(Z) and Y . whiclt corresponds to Y = boZo We could similarly obtain (A. In that example nothing is lost by using linear prediction.E(ZWW ' E([Z .2.4.. From (1.I'd Z )]' = 1 05 ElY I'(Z)j' . if the best predictor is linear.I1.4.Z)'. Best Multivariate Linear Predictor. 0 Remark 1.bZ)2 is uniquely minimized by taking a ~ E(Y) . OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z .I'l(Z)1' depends on the joint distribution P of only through the expectation J. This is because E(Y .bE(Z).14) Yl T Ep[Y .4.E(Z))]'.bZ)' ~ Var(Y .E(Z)IIZ .a)'. We can now apply the result on zero intercept linear predictors to the variables Z .4.4.bZ)2 we see that the b we seek minimizes E[(Y . b) X ~ (ZT.a .l). E(Y .4. and only if.Section 1. by (1. Let Po denote = . On the other hand.4.16) directly by calculating E(Y . then the Unique best MSPE predictor is I'dZ) Proof.t and covariance E of X.E(Z)J[Y . whatever be b.bE(Z) .1).1. If EY' and (E([Z .4.tzf [3.boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.
.4.4.d.15) .40 Statistical Models. E). 0 Remark 1. See Problem 1. By Proposition 1. We could also have established Theorem 1.4.. Zd are uncorrelated.4.1. However. . and R(a. . p~y = Corr2 (y. The line represents the best linear predictor y = 1. b) = Epo IY .17 for an overall measure of the strength of this relationship. b) is miuimized by (1.3.2.I'LCZ)). R(a.4. E(Zj[Y . that is.4.. b) is minimized by (1. b).. .75 1.4. Thus. case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y.00 z Figure 1. Y).3 to d > 1. and Performance Criteria Chapter 1 y .14).19. distribution and let Ro(a..I'I(Z)]'.4. We want to express Q: and f3 in terms of moments of (Z. Because P and Po have the same /L and E.4.. thus.45z. .I'(Z) and each of Zo.(a + ZT 13)]) = 0. the MCC gives the strength of the linear relationship between Z and Y. (1. Suppose the mooel for I'(Z) is linear. that is.50 0. the multivariate uormal. . I"(Z) = E(Y I Z) = a + ZT 13 for unknown Q: E R and f3 E Rd. .1(a). ~ Y . not necessarily nonnal.4.25 0.4. The three dots give the best predictor. j = 0. b) = Ro(a. N(/L. 2 • • 1 o o 0. .4 by extending the proof of Theorem 1.05 + 1. 0 Remark 1.3. Set Zo = 1.14). In the general. Ro(a.4. Goals. 0 Remark 1. A third approach using calculus is given in Problem 1. our new proof shows how secondmoment results sometimes can be established by "connecting" them to the nannal distribution. By Example 1.4. .4.
o Summary. usually R or Rk.6.' (l'e(Z).4.. g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] ~ O.16) (1.4. then by recording or taking into account only the value of T(X) we have a reduction of the data. With these concepts the results of this section are linked to the general Hilbert space results of Section 8. If T assigns the same value to different sample points. The range of T is any space of objects T.. but as we have seen in Sec~ tion 1.4. Using the distance D.(Y 19NP). we see that r(5) ~ MSPE for squared error loss 1(9. The notion of mean squared prediction error (MSPE) is introduced.4.4.5)'.. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear. then 90(Z) is called the projection of Y on the space 9 of functions of Z and we write go(Z) = . Recall that a statistic is any function of the observations generically denoted by T(X) or T.5(X»]. I'(Z» + ""'(Y.3. and it is shown that if we want to predict Y on the basis of information contained in a random vector Z. Moreover.(Y I 9).8) defined by r(5) = E[I(II.5) = (9 .. I'(Z» Y .4.4. we can conclude that I'(Z) = .2.12).2 and the Bayes risk (1.5 Sufficiency 41 Solving (1. the optimal MSPE predictor is the conditional expected value of Y given Z. Because the multivariate normal model is a linear model. I'dZ) = . Note that (1. Remark 1. 1. We return to this in Section 3. Consider the Bayesian model of Section 1. The optimal MSPE predictor in the multivariate normal distribution is presented.1.Section 1. and projection 1r notation..10.15) for a and f3 gives (1. Remark 1. If we identify II with Y and X with Z.I 0 and there is a 90 E 9 such that 0 < oc form a Hilbert go = arg inf{". the optimal MSPE predictor E(6 I X) is the Bayes procedure for squared error loss.16) is the Pythagorean identity.(Y 1ge) ~ "(I'(Z) 19d (1.17) ""'(Y.4.g(Z»): 9 E g).I'(Z) is orthogonal to I'(Z) and to I'dZ). I'dZ» = ". we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation.23). Thus.Thus.. We begin by fonnalizing what we mean by "a reduction of the data" X EX. . T(X) = X loses information about the Xi as soon as n > 1.5 SUFFICIENCY Once we have postulated a statistical model. When the class 9 of possible predictors 9 with Elg(Z)1 space as defined in Section B.(Y. can also be a set of functions.5. this gives a new derivation of (1.14) (Problem 1. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y.3.4.
X. the conditional distribution of XI/(X l + X.l. A machine produces n items in succession. For instance. once the value of a sufficient statistic T is known.Xn = xnl = 8'(1. . + X.8)n' (1. and Performance Criteria Chapter 1 Even T(X J. XI/(X l +X.) given Xl + X. .5).1.1) where Xi is 0 or 1 and t = L:~ I Xi.4). is sufficient. By (A. By (A.5. Then X = (Xl. X(n))' loses information about the labels of the Xi. We could then represent the data by a vector X = (Xl.1.l. we need only record one.l. . . recording at each stage whether the examined item was defective or not.'" . Y) where X is uniform on (0.X." it is important to notice that there are many others e.Xn ) is the record of n Bernoulli trials with probability 8.2. The total number of defective items observed. . We prove that T = X I + X 2 is sufficient for O. We give a decision theory interpretation that follows. whatever be 8. Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise. Using our discussion in Section B. The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information. . Each item produced is good with probability 0 and defective with probability 1. 0 In both of the foregoing examples considerable reduction has been achieved.5.9. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T. It follows that.2. . X.'" . is a statistic that maps many different values of (Xl. 16. X 2 the time between the arrival of the first and second customers. However. ~ t. 1). the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O.. T = L~~l Xi. = tis U(O.Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following. Therefore.)](X I + X. = t.. T is a sufficient statistic for O.) are the same and we can conclude that given Xl + X. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer. in the context of a model P = {Pe : (J E e}.0. whatever be 8.1 we had sampled the manufactured items in order.' • . Xl and X 2 are independent and identically distributed exponential random variables with parameter O. By Example B. Although the sufficient statistics we have obtained are "natural.3. 1) whatever be t.42 Statistical Models. Goals..) and Xl +X.'" .l we see that given Xl + X2 = t... P[X I = XI. the conditional distribution of Xl = [XI/(X l + X. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O.. Instead of keeping track of several numbers. Xl has aU(O. given that P is valid. • X n ) (X(I) . where 0 is unknown. t) distribution.Xn ) does not contain any further infonnation about 0 or equivalently P.5. . t) and Y = t .. Thus. the sample X = (Xl. Thus. o Example 1. Thus. when Xl + X. " . . suppose that in Example 1. Begin by noting that according to TheoremB.) is conditionally distributed as (X. (Xl.) and that of Xlt/(X l + X. are independent and the first of these statistics has a uniform distribution on (0. . Example 1. X n ) into the same number.
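Sufficiency in the Bernoulli example can also be seen by simulation: conditionally on T = Σ Xi = t, every 0-1 sequence with t ones should occur with the same frequency 1/C(n, t), whatever θ generated the data. The following sketch checks this for two different values of θ.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, t = 5, 3

def conditional_freqs(theta, n_sim=200_000):
    """Frequencies of the distinct 0-1 sequences among draws with sum == t."""
    x = (rng.random((n_sim, n)) < theta).astype(int)
    x = x[x.sum(axis=1) == t]
    seqs, counts = np.unique(x, axis=0, return_counts=True)
    return counts / counts.sum()

for theta in (0.3, 0.7):
    print(f"theta = {theta}: conditional frequencies ~",
          conditional_freqs(theta).round(3))
print("uniform value 1/C(n, t) =", round(1 / comb(n, t), 3))
```

Both runs give frequencies close to 1/C(5, 3) = 0.1, independently of θ.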
that will do the same job. Being told that the number of successes in five trials is three is the same as knowing that the difference between the number of successes and the number of failures is one. More generally, if T_1 and T_2 are any two statistics such that T_1(x) = T_1(y) if and only if T_2(x) = T_2(y), then T_1 and T_2 provide the same information and achieve the same reduction of the data. Such statistics are called equivalent.

In general, checking sufficiency directly is difficult because we need to compute the conditional distribution. Fortunately, a simple necessary and sufficient criterion for a statistic to be sufficient is available. This result was proved in various forms by Fisher, Neyman, and Halmos and Savage. It is often referred to as the factorization theorem for sufficient statistics.

Theorem 1.5.1. In a regular model, a statistic T(X) with range T is sufficient for θ if, and only if, there exists a function g(t, θ) defined for t in T and θ in Θ and a function h defined on X such that

p(x, θ) = g(T(x), θ) h(x)    (1.5.2)

for all x ∈ X, θ ∈ Θ.

Proof. We shall give the proof in the discrete case. The complete result is established, for instance, by Lehmann (1997, Section 2.6). Let (x_1, x_2, ...) be the set of possible realizations of X and let t_i = T(x_i). Then T is discrete and Σ_i P_θ[T = t_i] = 1 for every θ. To prove the sufficiency of (1.5.2), we need only show that P_θ[X = x_j | T = t_i] is independent of θ on each of the sets S_i = {θ : P_θ[T = t_i] > 0}, i = 1, 2, .... Now, if (1.5.2) holds,

P_θ[T = t_i] = Σ_{x : T(x) = t_i} p(x, θ) = g(t_i, θ) Σ_{x : T(x) = t_i} h(x).    (1.5.3)

By our definition of conditional probability in the discrete case, for θ ∈ S_i,

P_θ[X = x_j | T = t_i] = P_θ[X = x_j, T = t_i] / P_θ[T = t_i].

By (B.1.1) and (1.5.2),

P_θ[X = x_j, T = t_i] = g(t_i, θ) h(x_j) if T(x_j) = t_i, and = 0 if T(x_j) ≠ t_i.    (1.5.4)

Applying (1.5.3) we arrive at, for θ ∈ S_i,

P_θ[X = x_j | T = t_i] = h(x_j) / Σ_{x : T(x) = t_i} h(x) if T(x_j) = t_i, and = 0 otherwise,    (1.5.5)

which does not involve θ.
. and P(XI' . and both functions = 0 otherwise.. . > 0. n P(Xl. .4)).Xn ) is given by I. 0) by (B.2.9) if every Xi is an integer between 1 and B and p( X 11 • (1.Il) .44 Statistical Models. Conversely. . . _ _ _1 . Let 0 = (JL.8) = eneStift > 0.'t n /'[exp{ .3.. We may apply Theorem 1.. T is sufficient. . and Performance Criteria Chapter 1 Therefore. }][exp{ . .1 to conclude that T(X 1 . . . we need only keeep track of X(n) = max(X .5.. and h(xl •. O)h(x) (1. let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1.n /' exp{ . 1=1 (!.Xn are recorded. The population is sampled with replacement and n members of the population are observed and their labels Xl. . If X 1. . O} = 0 otherwise.5..10) P(Xl. which admits simple sufficient statistics and to which this example belongs. [27". Consider a population with () members labeled consecutively from I to B. both of which are unknown..O)=onexP[OLXil i=l (1..' (L x~ 1=1 1 n n 2JL LXi))]. •. The probability distribution of X "is given by (1.2 (continued). X n )..~.4. ..5..'" . .5.16.' L(Xi . ' .6) p(x.. Let Xl.JL)'} ~=l I n I! 2 . I x n ) = 1 if all the Xi are > 0.1. Common sense indicates that to get information about B.Xn ) is given by (see (A..O) where x(n) = onl{x(n) < O}.5..5. X(n) is a sufficient statistic for O.Xn ) = L~ IXi is sufficient. we can show that X(n) is sufficient.. then the joint density of (X" . 10 fact.5.5.2. are introduced in the next section. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2. T = T(x)] = g(T(x).'I.Xn.5. if T is sufficient.5. Goals.[27f. Estimating the Size of a Population. Expression (1.. 0 o Example 1. l X n are the interarrival times for n customers..X n JI) = 0 otherwise...3). 0 Example 1.X. 1.S.5. ~ Po[X = x.').xn }. By Theorem 1. = max(xl. Then the density of (X" .7) o Example 1.9) can be rewritten as " 1 Xn .. . Takeg(t. . A whole class of distributions.8) if all the Xi are > 0.
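Because T = (Σ Xi, Σ Xi²) is sufficient for the normal family, two samples with the same value of T have likelihoods whose ratio is free of θ (by the factorization theorem the ratio is h(x)/h(y)). The sketch below builds two different samples with identical T and checks that the log likelihood ratio is the same at several (μ, σ); the particular numbers used are arbitrary.

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.0, 2.0])                 # sum = 3, sum of squares = 5

# Construct a different sample y with the same (sum, sum of squares):
# fix y3 = 0.5 and solve y1 + y2 = s, y1 * y2 = (s^2 - ss) / 2.
s, ss = x.sum() - 0.5, (x**2).sum() - 0.25
prod = (s**2 - ss) / 2
y12 = np.roots([1.0, -s, prod]).real          # roots of u^2 - s u + prod
y = np.array([y12[0], y12[1], 0.5])

print("T(x) =", (x.sum(), (x**2).sum()), " T(y) =", (y.sum(), (y**2).sum()))

# The ratio p(x, theta) / p(y, theta) should not depend on theta.
for mu, sigma in [(0.0, 1.0), (3.0, 1.0), (-1.0, 2.5)]:
    lx = norm.logpdf(x, mu, sigma).sum()
    ly = norm.logpdf(y, mu, sigma).sum()
    print(f"mu = {mu}, sigma = {sigma}:  log ratio = {lx - ly:.6f}")
```

Here the log ratio is zero for every θ, because h(x) is constant for the normal family.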
5. Here is an example.1. o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting. Then 6 = {{31. Ezi 1'i) is sufficient for 6. respectively. a<.) i=l i=l is sufficient for B.6.2a I3.4 with d = 2.~ I: xn By the factorization theorem. By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B.5. . we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally. X n are independent identically N(I). 0 X Example 1. if T(X) is sufficient.~0)}(2"r~n exp{ . for any decision procedure o(x). we can. [1/(n i=l n n 1)1 I:(Xi . i= 1 = (lin) L~ 1 Xi. The first and second components of this vector are called the sample mean and the sample variance. X is sufficient. L~l x~) and () only and T(X 1 . Suppose X" . a 2 ).1.EZiY. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function.Zi)'} exp {Er. . a 2)T is identifiable (Problem 1.( 2?Ta ')~ exp p {E(fJ. Let o(X) = Xl' Using only X. 2:>.··· . Thus.X)']. in Example 1.. EYi 2 . 0') for allO. {32.xnJ)) is itself a function of (L~ upon applying Theorem 1.l) distributed.12) t ofT(X) and a Example 1.9) and _ (y.o) = R(O.5. . R(O. X n ) = [(lin) where I: Xi. Suppose. .5.5 Sufficiency 45 I Evidently P{Xl 1 • •• . An equivalent sufficient statistic in this situation that is frequently used IS S(X" ..5.' + 213 2a + 213.. I)) = exp{nI)(x .Section 1. with JLi following the linear regresssion model ("'J . that is. T = (EYi . Yi N(Jli.1 we can conclude that n n Xi. . X n ) = (I: Xi. (1. Specifically. that Y I .• Yn are independent. Then pix. 1 2 2 . where we assume diat the given constants {Zi} are not all identical..} + EY. I)) .
Goals. (T2) sample n > 2. L~ I xl) and S(X) = (X" . by the double expectation theorem.6') ~ E{E[R(O. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). Using Section B.. ).14.. ..6'(T))IT]} ~ E{E[R(O.5. (Kolmogorov). in that.12) follows along the lines of the preceding example: Given T(X) = t.2.2. Let SeX) be any other sufficient statistic. = 2:7 Definition. it is Bayes sufficient for every 11. 6'(X) and 6(X) have the same mean squared error..O) = g(S(x). T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x. if Xl.1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi.. Thus. 0 The proof of (1. Now draw 6' randomly from this conditional distribution. But T(X) provides a greater reduction of the data.46 Statistical Models. IfT(X) is sufficient for 0. X n ) are hoth sufficient. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. O)h(x) E:  .Xn is a N(p. and Performance Criteria Chapter 1 given X = t.5. Example 1. R(O.l and (1. 0) as p(x.6).6(X»)IT]} = R(O.1 (continued). In Example 1. n~l) distribution. .5. In this Bernoulli trials case. Equivalently. Then by the factorization theorem we can write p(x. ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X . choose T" = 15"' (X) from the normal N(t. This 6' (T(X)) will have the same risk as 6' (X) because. . T = I Xi was shown to be sufficient. This result and a partial converse is the subject of Problem 1. then T(X) = (L~ I Xi.5. Minimal sufficiency For any model there are many sufficient statistics: Thus. we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X .4. Theorem }. the same as the posterior distribution given T(X) = L~l Xi = k.6). we can find a transformation r such that T(X) = r(S(X)). 0 and X are independent given T(X). where k In this situation we call T Bayes sufficient. the distribution of 6(X) does not depend on O.
Lx is a map from the sample space X to the class T of functions {() t p(x. for a given 8. Now. 1/3)]}/21og 2. as a function of ().(t. The formula (1. it gives. the likelihood function (1. for example.o)nT ~g(S(x). (T'). we find T = r(S(x)) ~ (log[2ng(s(x).5 Sufficiency 47 Combining this with (1. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. when we think of Lx(B) as a function of (). For any two fixed (h and fh. ~ 2/3 and 0. 20' 20 I t. L x (()) gives the probability of observing the point x.) ~ (L Xi. L Xf) i=l i=l n n = (0 10 0. The likelihood function o The preceding example shows how we can use p(x.5. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient.5. Thus.11) is determined by the twodimensional sufficient statistic T Set 0 = (TI.5. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. ~ 1/3. 2/3)/g(S(x). It is a statistic whose values are functions. then Lx(O) = (27rO.) n/' exp {nO.O). However. Thus. if X = x. (T') example. ()) x E X}.O E e. take the log of both sides of this equation and solve for T.. L x (') determines (iI.T.8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. the "likelihood" or "plausibility" of various 8.Section 1. t2) because. In the discrete case. the ratio of both sides of the foregoing gives In particular. we find OT(1 .5. the statistic L takes on the value Lx. for a given observed x.) = (/".1).4 (continued). = 2 log Lx(O.)}. t. }exp ( I .1) nlog27r . T is minimally sufficient. 20. In this N(/". if we set 0.O)h(x) forall O. Example 1.
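The rule Posterior ∝ (Prior) × (Likelihood) is easy to verify numerically. The sketch below does so for Bernoulli trials with a Beta prior: a posterior computed on a grid from the proportionality rule is compared with the conjugate closed form Beta(a + k, b + n − k). The prior parameters and data are hypothetical.

```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 3.0            # Beta prior parameters (hypothetical)
n, k = 10, 7               # n Bernoulli trials, k successes observed

theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, a, b)
likelihood = theta**k * (1 - theta)**(n - k)       # L_x(theta), up to a constant

post_grid = prior * likelihood
post_grid /= post_grid.sum() * (theta[1] - theta[0])   # normalize numerically

post_exact = beta.pdf(theta, a + k, b + n - k)          # conjugate closed form

print("max |grid posterior - Beta(a+k, b+n-k)| =",
      np.abs(post_grid - post_exact).max())
```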
. OJ denote the frequency function or density of X. We show the following result: If T(X) is sufficient for 0.. . B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1. itself sufficient. If P specifies that X 11 •. Ax is the function valued statistic that at (J takes on the value p x.. . Let p(X.48 Statistical Models. a statistic closely related to L solves the minimal sufficiency problem in general.X)2) is sufficient. 1) (Problem 1.5. we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. . Suppose there exists 00 such that (x: p(x. L is minimal sufficient. 0 In fact. O)h(X). the ranks. as in the Example 1. 1 X n are a random sample. 0) = g(T(X).. The "irrelevant" part of the data We can always rewrite the original X as (T(X). l:~ I (Xi .5. Let Ax OJ c (x: p(x. X is sufficient. Thus. the residuals. 0) and heX) such that p(X.. 1) and Lx (1. in Example 1. By arguing as in Example 1. Lehmann.12 for a proof of this theorem of Dynkin.X. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid.4. or if T(X) ~ (X~l)' . SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x).13. _' . .5. L is a statistic that is equivalent to ((1' i 2) and.. . if a 2 = 1 is postulated.O) > for all B.Rn ).. For instance. and Scheffe. Consider an experiment with observation vector X = (Xl •. < X. If. hence. Then Ax is minimal sufficient. ifT(X) = X we can take SeX) ~ (Xl .5.5.5.5. We say that a statistic T (X) is sufficient for PEP. 0 the likelihood ratio of () to Bo. • X( n» is sufficient.. Thus.X). if the conditional distribution of X given T(X) = t does not involve O. but if in fact a 2 I 1 all information about a 2 is contained in the residuals. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O. .17).X Cn )' the order statistics. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t. (X( I)' .1. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X).Oo) > OJ = Lx~6o)' Thus. 1 X n ). a 2 is assumed unknown. or for the parameter O. We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X.Xn . Suppose that X has distribution in the class P ~ {Po : 0 E e}. S(X) = (R" . See Problem ((x:). hence. then for any decision procedure J(X). it is Bayes sufficient for O. . If T(X) is sufficient for 0.). (X. 1.. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1. Summary.5. where R i ~ Lj~t I(X. The likelihood function is defined for a given data vector of .1 (continued) we can show that T and. Goals.
observations x to be the function of θ defined by L_x(θ) = p(x, θ), θ ∈ Θ. If T(X) is sufficient for θ, then, by the factorization theorem, the likelihood ratio Λ_x(θ) = L_x(θ)/L_x(θ_0) depends on x through T(x) only; and if there is a value θ_0 ∈ Θ such that {x : p(x, θ) > 0} ⊂ {x : p(x, θ_0) > 0} for all θ ∈ Θ, then Λ_x(θ) is a minimally sufficient statistic.

1.6 EXPONENTIAL FAMILIES

The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property(1). Subsequently, many other common features of these families were discovered and they have become important in much of the modern theory of statistics.

Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable Y to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2. They will reappear in several connections in this book.

1.6.1 The One-Parameter Case

The family of distributions of a model {P_θ : θ ∈ Θ} is said to be a one-parameter exponential family if there exist real-valued functions η(θ) and B(θ) on Θ, and real-valued functions T and h on R^q, such that the density (frequency) functions p(x, θ) of the P_θ may be written

p(x, θ) = h(x) exp{η(θ)T(x) − B(θ)}    (1.6.1)

where x ∈ X ⊂ R^q. Note that the functions η, B, and T are not unique.

In a one-parameter exponential family the random variable T(X) is sufficient for θ. This is clear because we need only identify exp{η(θ)T(x) − B(θ)} with g(T(x), θ) and h(x) with itself in the factorization theorem. We shall refer to T as a natural sufficient statistic of the family.

Here are some examples.

Example 1.6.1. The Poisson Distribution. Let P_θ be the Poisson distribution with unknown mean θ, θ > 0. Then, for x ∈ {0, 1, 2, ...},

p(x, θ) = θ^x e^{−θ} / x! = (1/x!) exp{x log θ − θ}.    (1.6.2)
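The identification of the Poisson frequency function with the form (1.6.1), taking η(θ) = log θ, T(x) = x, B(θ) = θ and h(x) = 1/x!, can be checked numerically. A short sketch:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

theta = 2.5
x = np.arange(0, 15)

# Exponential family form (1.6.1)-(1.6.2): h(x) exp{eta(theta) T(x) - B(theta)}
# with T(x) = x, eta(theta) = log(theta), B(theta) = theta, h(x) = 1/x!.
log_p = -gammaln(x + 1) + x * np.log(theta) - theta

print("max |exponential family form - Poisson pmf| =",
      np.abs(np.exp(log_p) - poisson.pmf(x, theta)).max())
```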
6.O) f(z. q = 1.6.Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { .~z.T(x)=x. Example 1.. B) are the corresponding density (frequency) functions.6.1](0) = .) exp[~(O)T(x. (1. This is a oneparameter exponential family distribution with q = 2.6) . .Xm are independent and identically distributed with common distribution Pe. 1).~O'(y  z)' logO}.} . 0 Example 1. .6.(y I z) = <p(zW1<p((y . suppose Xl. 0 E e.. fonn a oneparameter exponential family as in (1. .1](O)=log(I~0). 1 X m ) considered as a random vector in Rmq and p(x. T(x) ~ x. y)T where Y = Z independent N(O. 0) distribution. If {pJ=». Goals. The Binomial Family.5) o = 2. Suppose X = (Z. 1 (1.6. Here is an example where q (1.6. Statistical Models. Z and W are f(x. 0 Then. is the family of distributions of X = (Xl' . for x E {O. B(O) = 0. the family of distributions of X is a oneparameter exponential family with q=I. 1. .B(O)=nlog(I0). o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families.O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l.~O'.h(X)=( : ) . Suppose X has a B(n.O) = f(z)f. and Performance Criteria Chapter 1 Therefore. Specifically. where the p. + OW.0)].3) I' . . . .O) IIh(x. B(O) = logO. .h(x) = ..y.2. .1](0) = logO.4) Therefore.. we have p(x.6.50 .) i=1 m B(8)] ( 1. Then > 0. p(x. n) < 0 < 1.3.1).. the Pe form a oneparameter exponential family with ~~I . h(x) = (2rr)1 exp { . I I! .T(x) = (y  z)'.' x.
\ fixed r fixed s fixed 1/2.Il)" . · . 1 X m). ..6.. h(m)(x) ~ i=l II h(xi). ..B(O)} for suitable h * . then qCm) = mq. . and h.' fixed I' fixed p fixed .1 Family of distributions .Section 1. .· . We leave the proof of these assertions to the reader. In the discrete case we can establish the following general result. •. i=l (1. For example.1.7) Note that the natural sufficient statistic T(m) is onedimensional whatever be m. . oneparameter exponential family with natural sufficient statistic T(m)(x) = Some other important examples are summarized in the following table. ily of distributions of a sample from any of the foregoin~ is just In our first Example 1. Therefore.\ 1"/'" (p (s 1) 1) 1) (r T(x) x (x . and pJ LT(Xi).8) . 1].6.1 the sufficient statistic T :m)(x1. Let {Pe } be a oneparameter exponential family of discrete distributions with corresponding functions T..\) (3(r. 1)(0) N(I'. .2) r(p.. I:::n I::n Theorem 1. then the m ) fonn a I Xi. fl. and h. . the m) fonn a oneparameter exponential family. X m ) corresponding to the oneparameter exponential fam1 T(Xd. This family of Poisson distIibutions is oneparameter exponential whatever be m. . pi pi I:::n TABLE 1.Xm ) = I Xi is distributed as P(mO). if X = (Xl.. 1 X m ) is a vector of independent and identically distributed P(O) random variables and m ) is the family of distributions of x.. then the family of distributions of the statistic T(X) is a oneparameter exponential family of discrete distributions whose frequency functions may be written h'Wexp{1)(O)t . If we use the superscript m to denote the corresponding T.6.x log x 10g(l x) logx The statistic T(m) (X I. B(m)(o) ~ mB(O). B.6 Exponential Families 51 where x = (x 1 .6. B...
Proof. By (1.6.1),

P_θ[T(X) = t] = Σ_{x : T(x) = t} p(x, θ) = Σ_{x : T(x) = t} h(x) exp[η(θ)T(x) − B(θ)]
             = exp[η(θ)t − B(θ)] Σ_{x : T(x) = t} h(x).

If we let h*(t) = Σ_{x : T(x) = t} h(x), the result follows. □

A similar theorem holds in the continuous case if the distributions of T(X) are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by η rather than θ. The exponential family then has the form

q(x, η) = h(x) exp[ηT(x) − A(η)],  x ∈ X ⊂ R^q,    (1.6.9)

where A(η) = log ∫ ... ∫ h(x) exp[ηT(x)] dx in the continuous case and the integral is replaced by a sum in the discrete case. If θ ∈ Θ and we set η = η(θ), then, because p(·, θ) integrates (sums) to one, A(η(θ)) = B(θ) is finite. Let E be the collection of all η such that A(η) is finite. Then, as we show in Section 1.6.4, E is either an interval or all of R, and the class of models (1.6.9) with η ranging over E contains the class of models with θ ∈ Θ. The model given by (1.6.9) with η ranging over E is called the canonical one-parameter exponential family generated by T and h. E is called the natural parameter space and T is called the natural sufficient statistic.

Example 1.6.1 (continued). The Poisson family in canonical form is

q(x, η) = (1/x!) exp{ηx − exp[η]},  x ∈ {0, 1, 2, ...},

where η = log θ. By definition,

exp{A(η)} = Σ_{x=0}^∞ e^{ηx}/x! = Σ_{x=0}^∞ (e^η)^x/x! = exp(e^η),

so that A(η) = e^η and E = R. □

Here is a useful result.

Theorem 1.6.2. If X is distributed according to (1.6.9) and η is an interior point of E, the moment-generating function of T(X) exists and is given by

M(s) = exp[A(s + η) − A(η)]

for s in some neighborhood of 0.
X n is a sample from a population with density p(x. is said to be a kparameter exponential family. B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 . . Proof.2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x). Example 1. being the integral of a density.<·or . 2 i=1 i=l n 1 n Here 1] = 1/20 2.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = .8 > O.8) = (il(x. . h on Rq such that the density (frequency) functions of the Po may be written as. Therefore.6. E(T(X)) ~ A'(1]).6. 02 ~ 1/21]. Koopman. _ . The rest of the theorem follows from the momentenerating property of M(s) (see Section A. This is known as the Rayleigh distribution. x> 0. x E X j=1 c Rq. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic. More generally. 1. ' e k p(x. It is used to model the density of "time until failure" for certain types of equipment.6 Exponential Families 53 Moreover.8) ~ (x/8 2)exp(_x 2/28 2). exp[A(s + 1]) . and Dannois were led in their investigations to the following family of distributions.Section 1. (1. We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) . if there exist realvalued functions 171. Var(T(X)) ~ A"(1]). nlog0 J..10) .j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.6. 1 Tk.8(0)]. is one. Now p(x. Direct computation of these moments is more complicated. We give the proof in the continuous case. and realvalued functions T 1. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]).A(1])] because the last factor.O) = h(x) exp[L 1]j(O)Tj (x) . Pitman.. c R k ..12).4 Suppose X 1> ••• . A family of distrib~tioos {PO: 0 E 8}. 0 Here is a typical application of this result.17k and B of 6.
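The Rayleigh computation, which obtains the mean 2nθ² and variance 4nθ⁴ of the natural sufficient statistic from A(η), can be checked by Monte Carlo. In the sketch below, scipy's rayleigh(scale = θ) has density (x/θ²)exp(−x²/2θ²), matching the example; the values of θ and n are arbitrary.

```python
import numpy as np
from scipy.stats import rayleigh

rng = np.random.default_rng(3)
theta, n, n_sim = 1.5, 10, 200_000

X = rayleigh.rvs(scale=theta, size=(n_sim, n), random_state=rng)
T = (X**2).sum(axis=1)                     # natural sufficient statistic sum X_i^2

print("mean of T:", T.mean(), "  theory 2 n theta^2:", 2 * n * theta**2)
print("var of T :", T.var(),  "  theory 4 n theta^4:", 4 * n * theta**4)
```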
. The density of Po may be written as e ~ {(I".I' 54 Statistical Models. i=l .. .Ot = fl.1. we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}..... If we observe a sample X = (X" .').Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1. X m ) from a N(I".5. 2 (".2.. + log(27r"'».(x) = x'.3.10).. +log(27r" »)].x .. Again. h(x) = I. . . .TdX))T is sufficient. .'l) = h(x)exp{TT(x)'l. I ! In the discrete case. = N(I".5. then the preceding discussion leads us to the natural sufficient statistic m m (LXi. and Performance Criteria Chapter 1 By Theorem 1. 0 Again it will be convenient to consider the "biggest" families.6. .LTk(Xi )) t=1 Example 1. In either case.') population. A(71) is defined in the same way except integrals over Rq are replaced by sums. J1 < 00. suppose X = (Xl. (]"2 > O}. which corresponds to a twOparameter exponential family with q = I.2(". the canonical kparameter exponential indexed by 71 = (fJI..(X) •.4). 1)2(0) = 2 " B(O) " 1"' 1 " T. Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi).6.O) =exp[".11) (]"2. . Goals. 8 2 = and I" 1 2' Tl(x) = x. . . . letting the model be Thus.'" . . .LX. It will be referred to as a natural sufficient statistic of the family. the vector T(X) = (T. q(x. Suppose lhat P. in the conlinuous case.6.Tk(x)f and...') : 00 < f1 x2 1 J12 2 p(x. i=l i=1 which we obtained in the previous section (Example 1.' ...A('l)}. . (1.').fJk)T rather than family generated by T and h is e. The Normal Family.x E X c Rq where T(x) = (T1 (x).
Section 1. L71 Aj = I}.. is not and all c. ./Ak) = ". 0) for 1 = (1. Suppose a<. . < OJ.~3): ~l E R.(x).~3 < OJ.6. .. .~l= (3. However 0. I yn)T can be put in canonical form with k = 3. where A is the simplex {A E R k : 0 < A. . .~. Yi rv N(Jli. A(1/) ~ ~[(~U2~.7. < l. 0. n. remedied by considering IV ~j = log(A.6. Now we can write the likelihood as k qo(x. . h(x) = I andt: ~ R x R.. .5.~·(x) . ."k. Linear Regression. where m../(7'.6.T./(7'. k}] with canonical parameter 0.. 0 + el) ~ qo(x. We observe the outcomes of n independent trials where each trial can end up in one of k possible categories.~2)]..5. In this example.~.(x) = L:~ I l[X i = j].n log j=l L exp(. From Exam~ pIe 1.~3 ~ 1/2(7'. . and t: = Rk . with J. . = n'Ezr Example 1.. . i = 1. 0) ~ exp{L . a 2 )...j = 1.6 Exponential Families 55 (x.5.)j.5 that Y I . A) = I1:~1 AJ'(X).'. E R k . ~l ~ /.. In this example.1. TT(x) = T(Y) ~ (EY"EY. ~3 and t: = {(~1.k..6. Multinomial Trials. Y n are independent.. 2. k ~ 2.. the density of Y = (Yb . Then p(x.k.kj.. .id.j j=l = 1. .. = 1/2(7'. . Example 1. I <j < k 1.5. as X and the sample space of each Xi is the k categories {I.'12 ~ (3. .= {(~1. x') = (T.. in Examples 1.(x»).Xn)'r where the Xi are i.4 and 1. Let T.. ~. We write the outcome vector as X = (Xl. 1/) =exp{T'f. (N(/" (7') continued}. j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I. E R.) + log(rr/.. . Example 1. and rewriting kl q(x. It will often be more convenient to work with unrestricted parameters. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj ./(7'. . A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)].'I2.)T. and AJ ~ P(X i ~ j). This can be identifiable because qo(x..EziY..~2): ~l E R._l)(x)1/nlog(l where + Le"')) j=1 .ti = f31 + (32Zi.. A E A.
17 e for details. . is an npararneter canonical exponential family with Yi by T(Yt . If the Ai are unrestricted. let Xt =integers from 0 to ni generated Vi < n.6. Logistic Regression. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' . Goals. this. ) 1(0 < However. .6. " Example 1. if c Re and 1/( 0) = BkxeO C R k . then all models for X are exponential families because they are submodels of the multinomial trials model.. . More10g(P'I[X = jI!P'IIX ~ k]). " . is unchanged.d. Here is an example of affine transformations of 6 and T.2. . h(y) = 07 I ( ~. 17). < . Ai). X n ) T where the Xi are i.6. k}] with canonical parameter TJ and £ = Rk~ 1.6...8.13) . from Example 1.). Here 'Ii = log 1\.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x. are identifiable. < X n be specified levels and (1. as X. Let Y. 0 < Ai < 1. 1]) is a k and h(x) = rr~ 1 l[xi E over. and 1] is a map from e to a subset of R k • Thus.6..6. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO. the parameters 'Ii = Note that the model for X Statistical Models. 1/(0») (1. and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I.7 and X = (X t.' A(1/) = L:7 1 ni 10g(l + e'·). 1 < i < n. See Problem 1. B(n.·. Similarly.56 Note that q(x. . then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h... be independent binomial. 1 <j < k . 11 E £ C R k } is an exponential family defined by p(x.12) taking on k values as in Example 1. 0 1. 1 < i < n.1.Y n ) ~ Y.6.i. where (J E e c R1. I < k. 0) ~ q(x.
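For the logistic regression example, the canonical form says that the log likelihood is Σ_i [y_i η_i − n_i log(1 + e^{η_i})] plus the θ-free term log h(y), with η_i = θ_1 + θ_2 x_i. The sketch below checks this against a direct binomial computation; the parameter values, dose levels, and counts are hypothetical.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

theta1, theta2 = -1.0, 0.8                     # hypothetical parameter values
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])        # substance levels
n = np.array([10, 10, 20, 20, 30])             # animals exposed at each level
y = np.array([2, 3, 9, 13, 26])                # deaths observed (made up)

eta = theta1 + theta2 * x                      # eta_i = theta_1 + theta_2 x_i
lam = 1.0 / (1.0 + np.exp(-eta))               # lambda_i = P(death at level x_i)

# Canonical exponential family form of the log likelihood.
loglik_expfam = (y * eta - n * np.log1p(np.exp(eta))).sum() + np.log(comb(n, y)).sum()

# Direct computation from the binomial frequency functions.
loglik_direct = binom.logpmf(y, n, lam).sum()

print(loglik_expfam, loglik_direct)            # the two agree
```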
.10. = '\501 and 'fJ2(0) = ~'\5B2. Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic.i/a. PIX < xl ~ [1 + exp{ (8. then this is the twoparameter canonical exponential family generated by lYIY = (L~l.1)T. where 1 is (1. with B = p" we can write where T 1 = ~~ I Xi. Suppose that YI ~ .. LocationScale Regression.. ..Xi)). E R.. so it is not a curved family. . which is less than k = n when n > 3. > O. + 8. that is.6. If each Iii ranges over R and each a.. T 2 = L:~ 1 X.6.6. y... .Section 1...12) with the range of 1J( 8) restricted to a subset of dimension l with I < k .. Then (and only then). suppose the ratio IJlI/a. Y.!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied..i.6.PIX < x])) 8.'V T(Y) = (Y" .6.). which is called the coefficient of variation or signaltonoise ratio. Gaussian with Fixed SignaltoNoise Ratio. .14) 8. ~ (1.x)W'.13) holds.x and (1.) = L ni log(1 + exp(8.6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization. This is a 0 In Example 1.a 2 ). SetM = B T .x = (Xl. '. N(Jl. In the nonnal case with Xl.xn)T. .6.i. Example 1.8.6.. Y n are independent.. .. o Curved exponential families Exponential families (1.d. The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance. Then.8. + 8.. n. }Ii N(J1..5 a 2nparamctercanonical exponential family model with fJi = P. I ii. ranges over (0.00). p(x.'. is a known constant '\0 > O.. generated by . this is by Example 1. Yn.1. a. + 8.8. x). the 8 parametrization has dimension 2.Xn i. Example 1.9. 'fJ1(B) curved exponential family with l = 1. i=I This model is sometimes applied in experiments to determine the toxicity of a substance. 8) in the 8 parametrization is a canonical exponential family.)T . . . i = 1. . However. . log(P[X < xl!(1 . and l1n+i = 1/2a. L:~ 1 Xl yi)T and h with A(8.
1. We return to curved exponential family models in Section 2.6. 83 > 0 (e.(0). 8.12.. Even more is true.5 that for any random vector Tkxl.8 note that (1._I11.).12) is not an exponential family model.6. n. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic. Supermode1s We have already noted that the exponential family structure is preserved under i. Bickel. Examples 1.3.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before. and Performance Criteria Chapter 1 and h(Y) = 1.6 are heteroscedastic and homoscedastic models. E R. for unknown parameters 8 1 E R.6. 1 < j < n.6.13) exhibits Y.. 1 Bj(O).d.6. Recall from Section B. 1988. Section 15.. and = Ee sTT . Sections 2. sampling. . Carroll and Ruppert.) = (Yj. and Snedecor and Cochran. Next suppose that (JLi. For 8 = (8 1 . say. as being distributed according to a twoparameter family generated by Tj(Y. yn)T is modeled by the exponential family generated by T(Y) ~. In Example 1. we define I I .6.5. 1978. then p(y. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~. TJ'(Y). Ii. 0) ~ q(y.1/(O)) as defined in (6.i. ~. a. 1. and B(O) = ~. but a curved exponential family model with 0 1=3. respectively.1 generalizes directly to kparameter families as does its continuous analogue. Let Yj .10 and 1. 1989.g. be independent.6.E Yj c Rq..2. Goals. . Thus. with an exponential family density Then Y = (Y1. We extend the statement of Theorem 1.4 Properties of Exponential Families Theorem 1.) and 1 h'(YJ)' with pararneter1/(O). M(s) as the momentgenerating function.8.(O)Tj'(Y) for some 11.8.58 Statistical Models.) depend on the value Zi of some covariate. 1 Tj(Y.10).
15) that a'7. ('70). ~ = 1 . Let P be a canonicaL kparameter exponential famiLy generated by (T.1 and Theorem 1. h) with corresponding natural parameter space E and function A(11).4). J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1. A(U'71 Which is (b).3.1 give a classical result in Example 1.8". .Section 1. t: and (a) follows.6. S > 0 with ~ + ~ = 1. then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) . Substitute ~ ~ a.6 Expbnential Families 59 V.a)'7rT(x)) and take logs of both sides to obtain.a)'72 E is proved in exactly the same way as Theorem 1.6.2. v(x).3 V.a)'72) < aA('7l) + (1 . T. (with 00 pennitted on either side).6.A('7o)} vaLidfor aLL s such that 110 + 5 E E. By the Holder inequality (B. Proof of Theorem 1. If '7" '72 E + (1 .6. for any u(x)..15) t: the righthand side of (1.. 8A 8A T" = A('7o) f/J. Theorem 1.6.a)A('72) Because (1. A where A( '70) = (8". .6.6. We prove (b) first. h(x) > 0.a.6.6. + (1 . u(x) ~ exp(a'7. Since 110 is an interior point this set of5 includes a baLL about O.3.6. ('70)) .9. 8".6.r'7o T(X) .3(c).T(x)). vex) = exp«1 .15) is finite. Corollary 1.r(T) = IICov(7~. Under the conditions ofTheorem 1.6. . Suppose '70' '71 E t: and 0 < a < 1. Fiually (c) 0 The formulae of Corollary 1. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. 1O)llkxk. ('70) II· The corollary follows immediately from Theorem B.A('70) = II 8".5. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~.1.
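The identity Var_η(T) = Ä(η) of Corollary 1.6.1 can be checked numerically for the canonical N(μ, σ²) family generated by T(x) = (x, x²), where η_1 = μ/σ², η_2 = −1/(2σ²) and A(η) = −η_1²/(4η_2) − (1/2)log(−2η_2) (the factor (2π)^{−1/2} sits in h). The sketch compares a finite-difference Hessian of A with the Monte Carlo covariance of T.

```python
import numpy as np

def A(eta1, eta2):
    """Log-normalizer of the canonical N(mu, sigma^2) family with T(x) = (x, x^2)."""
    return -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)

mu, sigma2 = 1.0, 2.0
eta = np.array([mu / sigma2, -1 / (2 * sigma2)])

# Finite-difference Hessian of A at eta.
eps = 1e-4
H = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H[i, j] = (A(*(eta + ei + ej)) - A(*(eta + ei - ej))
                   - A(*(eta - ei + ej)) + A(*(eta - ei - ej))) / (4 * eps**2)

# Monte Carlo covariance of T(X) = (X, X^2).
rng = np.random.default_rng(4)
X = rng.normal(mu, np.sqrt(sigma2), size=2_000_000)
C = np.cov(np.vstack([X, X**2]))

print("Hessian of A:\n", H.round(3))
print("Cov(T) by simulation:\n", C.round(3))
```

Both matrices are close to [[σ², 2μσ²], [2μσ², 4μ²σ² + 2σ⁴]], as the corollary predicts.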
Xl Y1 ).1. We establish the connection and other fundamental relationships in Theorem 1.6. + 9. k A(a) ~ nlog(Le"') j=l and k E>.8. (continued). there is a minimal dimension.60 Statistical Models. if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2. ajTj(X) = ak+d < 1 unless all aj are O. and ry. It is intuitively clear that k .lx ~ j] = AJ ~ e"'ILe a . such that h(x) > O. An exponential family is of rank k iff the generating statistic T is kdimensional and 1. 9" 9. T. using the a: parametrization. Suppose P = (q(x.4. (i) P is a/rank k.7 we can see that the multinomial family is of rank at most k . 2' Going back to Example 1. However.x. .4.1 is in fact its rank and this is seen in Theorem 1.~~) < 00 for all x. ifn ~ 1.6. (ii) 7) is a parameter (identifiable). ."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k. Goads. . Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i. 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open. (X)". Here. (iii) Var7)(T) is positive definite.. Theorem 1. Sintilarly.~. o The rank of an exponential family I . . Then the following are equivalent. Tk(X) are linearly independent with positive probability.6.7). However.6.(Tj(X» = P>.6. in Example 1. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 . Formally. II ! I Ii . P7)[L. p:x.6.(9) = 9. But the rank of the family is 1 and 8 1 and 82 are not identifiable.4 that follows.7. for all 9 because 0 < p«X. f=l . and Performance Criteria Chapter 1 Example 1.
TJ2)T(X) = A(TJ2) . ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0.6.6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. by our remarks in the discussion of rank. Conversely.(v).A(TJI)}h(x) ~ exp{TJ2T(x) ..4 hold and P is of rank k. A"(1]O) = 0 for some 1]0 implies c. The proof for k details left to a problem.A(TJ2))h(x). Suppose tlult the conditions of Theorem 1.) with probability 1 =~(i).. with probability 1. ranges over A(I'). 0 . hence.1)o)TT. (iii) = (iv) and the same discussion shows that (iii) . = Proof ofthe general case sketched I." Then ~(i) > 1 is then sketched with '* P"[a. 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 . Now (iii) =} A"(TJ) > 0 by Theorem 1. thus.~> .6. = Proof. A is defined on all ofE.T(x) . hence.1)o) : 1)0 + c(1). This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/. Thus. We.6. 0 Corollary 1.3.Section 1. which that T implies that A"(TJ) = 0 for all TJ and. (b) logq(x.. ~ (ii) ~ _~ (i) (ii) = P1). A'(TJ) is strictly monotone increasing and 11.. . have (i) . by Theorem 1. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0. A' is constant. all 1) = (~i) II. :. Proof. Note that. (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1). '" '.A(TJ.2. 1)) is a strictly concavefunction of1) On 1'.2 and. . We give a detailed proof for k = 1.'t·· . of 1)0· Let Q = (P1)o+o(1).6.. Apply the case k = 1 to Q to get ~ (ii) =~ (i). . because E is open.. Let ~ () denote "(. ". F!~ ..) is false. III. ~ P1)o some 1).. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where". for all T}. This is just a restatement of (iv) and (v) of the theorem.(ii) = (iii).T ~ a2] = 1 for al of O.' . Taking logs we obtain (TJI .
See Section 2.(Y).. Example 1. . and Performance Criteria Chapter 1 The relation in (a) is sometimes evident and the j. which is obviously a 11 function of (J1..21). E) = Idet(E) 1 1 2 / . .. By our supermodel discussion. E(X.Jl. N p (1J. .2 p p (1.. then X . the relation in (a) may be far from obvious (see Problem 1.62 Statistical Models. with mean J.a 2 ). h(Y) . We close the present discussion of exponential families with the following example. Recall that Y px 1 has a p variate Gaussian distribution. revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Yi.5 Conjugate Families of Prior Distributions In Section 1.1 (Y . B(9) = (log Idet(E) I+JlTEl Jl).6.. so that Theorem 1.1 '" 11"'.E). write p(x 19) for p(x.6. An important exponential family is based on the multivariate Gaussian distributions of Section 8. (1.17) The first two terms on the right in (1.16) Rewriting the exponent we obtain 10gf(Y.6. .p/2 I exp{ ..3.Jl)TE.6. (1.6. ..(9) LT... E)..Xn is a sample from the kparameter exponential family (1. the N(/l..ljh<i<.(Xi) i=l j=l i=l n • n nB(9)}. 0 ! = 1. the 8(11) 0) family is parametrized by E(X).I. This is a special case of conjugate families of priors. ifY 1.. .5. The p Van'ate Gaussian Family.29) that I) generate this family and that the rank of the family is indeed p(p + 3)/2.Jl)}. and that (....2 we considered beta prior distributions for the probability of success in n Bernoulli trials.6.. The corollary will prove very important in estimation theory. E) _~yTEIy + (E1JllY 2 I 2 (log Idet(E)1 + JlT I P log"..17) can be rewritten ( 2: 1 <i<j~p aijYiYj + ~ 2: aii r:2 ) + L(2: aij /lj)Yi i=l i=l j=l p where E.~p).. E).4 applies._. Y n are iid Np(Jl.6.{Y.Yp.)] eXP{L 1/. is open. Then p(xI9) = III h(x.10). Thus.2 (Y . which is a p x p symmetric matrix.Jl) . .6. E.18) _. For {N(/" "')}.tpx 1 and positive definite variance covariance matrix L::pxp' iff its density is f(Y. 0). .6. Suppose Xl. ~iYi Vi). as we always do in the Bayesian context. Goals. . where X is the Bernoulli trial.11.Yn)T follows the k = p(p + 3)/2 parameter exponential family with T = (EiYi .._. where we identify the second element ofT.t parametrization is close to the initial parametrization of classical P. 9 ~ (Jl. T (and h generalizing Example 1..X') = (JL.a 2 + J12).11.6..6. families to which the posterior after sampling also belongs.6. However. and. Jl. It may be shown (Problem 1."5) family by E(X). with its distinct p(p + I) /2 entries.
. n n )T and ex indicates that the two sides are proportional functions of (). .20) given by the last expression in (1. .1.Section 1. If p(xIO) is given by (1..6. Because two probability densities that are proportional must be equal.21) where S = (SI.O<W(tJ.6.6.20). The (k + I)parameter exponential/amity given by k ".6.6.6. We assume that rl is nonempty (see Problem 1. Suppose Xl. That is. where a = (I:~ 1 T1 (Xi). To choose a prior distribution for model defined by (1.(O) ex exp{L ~J(O)(LT.18).. .u5) sample.. be "parameters" and treating () as the variable of interest. .'''' I:~ 1 Tk(Xi).. . tk+ I) T and w(t) = 1= eXP{hiJ~J(O)' _='" _= 1 k ik+JB(II)}dO J···dOk (1.'" ..6.36).19) o {(iJ. . we consider the conjugate family of the (1.. then k n .. 1r(6jx) is the member of the exponential family (1. . X n become available..(Olx) ex p(xlO)"..(0) ~ exp{L ~j(O)fj .6.(Xi)+ f. tk+l) E 0.18) by letting 11.. where u6 is known and (j is unknown. then Proposition 1.6. is a conjugate prior to p(xIO) given by (1. and t j = L~ 1 Tj(xd.6.."" Sk+l)T = ( t} + ~TdxiL .) .6 Exponential Families 63 where () E e. n)T D It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.(0). .21) is an updating formula in the sense that as data Xl. k.18) and" by (1.6. Proof. . ik + ~Tk(Xi)' tk+l + 11. Forn = I e.Xn is a N(O. which is k~dimensional.21) and our assertion follows. let t = (tl.(lk+1 j=1 i=1 + n)B(O)) ex ". the parameter t of the prior distribution is updated to s = (t + a). . Example 1.. . (1..22) 02 p(xIO) ex exp{ 2 .2)' Uo 2uo Ox .6.6.fk+Jl.20) where t = (fJ.12.fk+IB(O) logw(t)) j=1 (1. .1.. Note that (1.6.6. .fkIJ)<oo) with integrals replaced by sums in the discrete case.6. 0 Remark 1. A conjugate exponential family is obtained from (1.20). j = 1.
we obtain 1r.26) (1.6. i. Using (1.6.30) that the Np()o" f) family with )0. W. 761) where '10 varies over RP.(S) = t. Tg is scalar with TO > 0 and I is the p x p identity matrix (Problem 1.. = rg(n)lrg so that w. = 1 . therefore.20) has density (1. Np(O.( 0 .6.64 Statistical Models. it can be shown (Problem 1. 0 These formulae can be generalized to the case X. 1r.6. + n.i. Moreover. Goals.6.6. (J Np(TJo.6. we find that 1r(Olx) is a normal density with mean Jl(s. = s.6.(n) = . Eo known. consists of all N(TJo.6.26) intuitively as ( 1.37). t.23) with (70 TO .6. 75) prior density.21).25) By (1.d.6.n) and variance = t. 1 < i < n. tl(S) = 1]0(10 2 TO 2 + S.(O) <X exp{ .23) Upon completing the square. Eo). (0) is defined only for t.6.. E RP and f symmetric positive definite is a conjugate family f'V .28) where W. = nTg(n)I(1~.6. the posterior has a density (1.24).(n) (g +n)I[s+ 1]0~~1 TO TO (1. > 0 and all t I and is the N (tI!t" (1~ It.W. Our conjugate family. we must have in the (t l •t2) parametrization "g (1. and Performance Criteria Chapter 1 This is a oneparameter exponential family with The conjugate twoparameter exponential family given by (1.24) Thus.) density.27) Note that we can rewrite (1. if we observe EX. If we start with aN(f/o. rJ) distributions where Tlo varies freely and is positive.) } 2a5 h t2 t1 2 (1. .
and Darmois. Eo). the family of distributions in this example and the family U(O.. which admit kdimensional sufficient statistics for all sample sizes. In fact. Some interesting results and a survey of the literature may be found in Brown (1986)..6. Summary. .. Despite the existence of classes of examples such as these. If bas a nonempty interior in R k and '10 E then T(X) has for X ~ P7Jo the momentgenerating function .20) except for p = 1 because Np(A. 0) = h(x) expl2:>j(I1)Ti (x) . with integrals replaced by sums in the discrete case. The set is called the natural parameter space. . O}) model of Example 1. The canonical kparameter exponentialfamily generated by T and h is T(X) = where A(7J) ~ log J:. is a kparameter exponential/amity of distributions if there are realvalued functions 'T}I . the conditions of Proposition 1.31 and 1. B) are not exponential.Section 1. . The set E is convex.10 is a special result of this type..5. and realvalued functions TI.32. See Problems 1. e. Pitman. 8 C R k .6 Exponential Families 65 but a richer one than we've defined in (1.20). to Np (8 .. (1.6.1 are often too restrictive.A(7Jo)} e e. . starting with Koopman.J: h(x)exp{TT(x)7J}dx in the continuous case. It is easy to see that one can consUUct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6. a theory has been built up that indicates that under suitable regularity conditions families of distributions. T k (X)) is called the natural sufficient statistic of the family. the map A . is not of the form L~ 1 T(Xi). for all s such that '10 + s is in Moreover E 7Jo [T(X)] = A(7Jo) and Var7Jo[T(X)] A( '10) where A and A denote the gradient and Hessian of A.6. h on Rq such that the density (frequency) function of Pe can be written as k p(x. " Tk. The natural sufficient statistic max( X 1 ~ . .6. Discussion Note that the uniform U( (I.'T}k and B on e. {PO: 0 E 8}.6. x j=1 E X c Rq.... .3 is not covered by this theory. which is onedimensional whatever be the sample size. must be kparameter exponential families.(s) = exp{A(7Jo + s) .29) (Tl (X)...Xn ).6.B(O)]. In the onedimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. ••• . 2. Problem 1. In fact.. r) is a p(p + 3)/2 rather than a p + I parameter family. . E 1 R is convex.
T" .L and variance a 2 . (iii) Var'7(T) is positive definite. (ii) '7 is identifiable. A family F of priOf distributions for a parameter vector () is called a conjugate family of priors to p(x I 8) if the posterior distribution of () given x is a member of F. . 1..1)) (O)t j .(0) = exp{L1)j(O)t) .29). If P is a canonical exponential family with E open. l T k are linearly independent with positive Po probability for some 8 E e. then the following are equivalent: (i) P is of rank k.) E R k+l : 0 < w < oo}.1 1. tk+1) 0 ~ {(t" .66 Statistical Models. (v) A is strictlY convex on f.. and Performance Criteria Chapter 1 An exponential family is said to be of rank k if T is kdimensional and 1... . Suppose that the measuring instrument is known to be biased to the positive side by 0. (iv) the map '7 ~ A('7) is I ' Ion 1'.parameter exponential family k ". Give a formal statement of the following models identifying the probability laws of the data and the parameter space. He wishes to use his observations to obtain some infonnation about J. tk+.6. is conjugate to the exponential family p(xl9) defined in (1.. Goals.L. State whether the model in question is parametric or nonparametric.. and t = (t" . The (k + 1). Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean J.7 PROBLEMS AND COMPLEMENTS Problems for Section 1.B(O)tk+l logw} j=l where w = 1:" 1: E exp{l:. (a) A geologist measures the diameters of a large number n of pebbles in an old stream bed. Assume that the errors are otherwise identically distributed nonnal random variables with known variance.1 units. (b) A measuring instrument is being used to obtain n independent determinations of a physical constant J.B(O)}dO.L and a 2 but has in advance no knowledge of the magnitudes of the two parameters. Ii II I .
. 1'2) and we observe Y X..2) where fLij = v + ai + Aj.. .1(c). Which of the following parametrizations are identifiable? (Prove or disprove.1.1..Xpb . Ab. .. v. ..2). = (al. restricted to p = (all' . .. then X is (b) As in Problem 1.."" a p .Xp are independent with X~ ruN(ai + v. . An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest. ...t. . i=l (c) X and Y are independent N (I' I . (b) The parametrization of Problem 1.. Can you perceive any difficulties in making statements about f1. (a) Let U be any random variable and V be any other nonnegative random variable. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A. and Po is the distribution of X (b) Same as (a) with Q = (X ll . Two groups of nl and n2 individuals. are sampled at random from a very large population.7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown.2 ).Xp ).ap ) : Lai = O}. 3..p. .for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A. Q p ) and (A I .. Show that Fu+v(t) < Fu(t) for every t. B = (1'1. Q p ) {(al. i = 1. .) < Fy(t) for every t. . bare independeht with Xi.. . Each .) (a) The parametrization of Problem 1. .. each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others.1 describe formaHy the foHowing model..) (a) Xl. Are the following parametrizations identifiable? (Prove or disprove.2) and N (1'2. j = 1.1. ~ N(l'ij. Once laid.1(d). 0. 2.. respectively.. (e) The parametrization of Problem l. 4...Section 1. = o.. ..l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.··. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. . . . (d) Xi..2 ) and Po is the distribution of Xll. At. 0. e (e) Same as (d) with (Qt. .
(e) Suppose X ~ N(I'.  .. I nl. e = = 1 if X < 1 and Y {J.4(1). is the distribution of X e = (0.Li < + . Which of the following models are regular? (Prove or disprove. Mk. + nk + ml + .. ..3(2) and 1.' I .2. 5.2.P(Y < c .p(0. Vk) is unknown.. The number n of graduate students entering a certain department is recorded.1.0.. ml .. 0 < Vi < 1.. Suppose the effect of a treatment is to increase the control response by a fixed amount (J. if and only if. I..k is proposed where 0 < 71" < I. 0'2). • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. VI. M k = mk] I! n. .:". 0 < J.. mk·r.0)... The following model is proposed. 68 Statistical Models.Li = 71"(1 M)iIJ. whenXis unifonnon {O. =X if X > 1. What assumptions underlie the simplification? 6. .· .k + VI + .. Both Y and p are said to be symmetric about c.LI ni + . .1. . Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. . 7. J.c has the same distribution as Y + c.t) = 1 .Lk VI n". but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown. 8.1. Goals.0'2) (d) Suppose the possible control responses in an experiment are 0.t). i = 1. • ...LI.. ..9 and they oecur with frequencies p(0. Show that Y .L.. nk·mI . 1 < i < k and (J = (J.. (b)P. + fl.. .. ... the density or frequency function p of Y satisfies p( c + t) ~ p( c . . is the distribution of X when X is unifonn on (0...c) = PrY > c . . I ! fl. o < J1 < 1.O}. . Hint: IfY . ... . k. then P(Y < t + c) = P( Y < t .2).. 2.t) fj>r all t.c bas the same distribution as Y + c..··· .. I . It is known that the drug either has no effect or lowers blood pressure.. + mk + r = n 1.9). and Performance Criteria Chapter 1 ":. V k mk P r where J.0. + Vk + P = 1. Let Po be the distribution of a treatment response. The simplification J. 0 = (1'.p(0... Consider the two sample models of Examples 1. Let N i be the number dropping out and M i the number graduating during year i..1).I nl . P8[N I .) (a) p. = n11MI = mI. In each of k subsequent years the number of students graduating and of students dropping out is recorded. (0).. }. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large. Let Y and Po is the distribution of Y. 0 < V < 1 are unknown. ..Nk = nk. Vi = (1 11")(1 ~ v)iI V for i = 1..
I} and N is known. Y = T I Y > j].N = 0. then C(·) = F(. Let c > 0 be a constant. r(j) = = PlY = j.7 Problems and Complements 69 (a) Show that if Y ~ X + o(X).Xn are observed i. eY and (c) Suppose a scale model holds for X. . Show that if we assume that o(x) + x is strictly increasing. o (a) Show that in this case. suppose X has a distribution F that is not necessarily normal. Suppose X I. .zp. . e) : p(j) > 0.tl).Znj?' (a) Show that ({31.tl and o(x) ~ 21' + tl. N. o(x) ~ 21' + tl . then satisfy a scale model with parameter e.tl) implies that o(x) tl. . eX o. That is. . (]"2) independent..2x yield the same distribution for the data (Xl. .d. the two cases o(x) . or equivalently.6.tl) does not imply the constant treatment effect assumption. .Section 1.. 0 'Lf 'Lf op(j) = I."" Yn ). that is. ... > 0.. 12. 11.Xn ). = (Zlj. are not collinear (linearly independent).. C(t} = F(tjo). Let X < 1'. {r(j) : j ~ 0. Yn denote the survival times of two groups of patients receiving treatments A and B. X m and Yi."').3 let Xl.. 0 < j < N.1. j = 0.. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. = 9. . The Lelunann 1\voSample Model. ci '" N(O.. 10. . fJp) are not identifiable if n pardffieters is larger than the number of observations. In Example 1. C). and (I'. ... . The Scale Model. . (b) Deduce that (th. t > O. ..2x and X ~ N(J1. .. (Yl. . (b) In part (a). Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. according to the distribution of X. Y. Sbow that {p(j) : j = 0..i. Sx(t) = ."') and o(x) is continuous. C are independent. Therefore. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. C).. e) vary freely over F ~ {(I'. Does X' Y' = yc satisfy a scale model? Does log X'. . e(j).. log y' satisfy a shift model? = Xc. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. .. I(T < C)) = j] PIC = j] PIT where T. .. ..'" . i . if the number of = (min(T. e(j) > 0. N}. Hint: Consider "hazard rates" for Y min(T.. then C(·) ~ F(. . N} are identifiable.{3p) are identifiable iff Zl. C(·) = F(· . p(j). 1 < i < n.
Specify the distribution of €i.ll) with scale parameter 5. (c) Under the assumptions of (b) above. " T k are unobservable and i.. I (b) Assume (1... = = (. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. t > 0. are called the survival functions.d..l3. For treatments A and E. The most common choice of 9 is the linear form g({3.(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. 1 P(X > t) =1 (. Moreover. then Yi* = g({3.2) is equivalent to Sy (t I z) = sf" (t). .h.(t) with C.i. A proportional hazard model.1) Equivalently. = exp{g(.  . log So(T) has an exponential distribution. 13.ho(t) is called the Cox proportional hazard model.1.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 . . hy(t) = c.. t > 0.I ~I .7. Show that hy(t) = c. Zj) + cj for some appropriate €j. Tk > t all occur. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(. thus.7.l3. . as T with survival function So. h(t I Zi) = f(t I Zi)jSy(t I Zi).(t) if and only if Sy(t) = Sg.. 12. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t). Hint: See Problem 1. (1. I' 70 StCitistical Models. respectively. Goals. z)) (1. t > 0.G(t).). So (T) has a U(O. Set C.00). and Performance CriteriCl Chapter 1 t Ii .. I) distribution. we have the Lehmann model Sy(t) = si.7.12. Also note that P(X > t) ~ Sort).l3.z)}. .2) and that Fo(t) = P(T < t) is known and strictly increasing. Sy(t) = Sg. k = a and b.. Hint: By Problem 8. Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t).{a(t).) Show that Sy(t) = S'.2. where TI". Then h.(t). Show that if So is continuous. show that there is an increasing function Q'(t) such that ifYt = Q'(Y. then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. ~ a5.. z) zT. . Survival beyond time t is modeled to occur if the events T} > t.) Show that (1. 7. (b) By extending (bjn) from the rationals to 5 E (0. F(t) and Sy(t) ~ P(Y > t) = 1 .. I (3p)T of unknowns..(t). (c) Suppose that T and Y have densities fort) andg(t). .
7 Problems and Complements 71 Hint: See Problems 1.2 0.000 is the minimum monthly salary for state workers.L . Here is an example in which the mean is extreme and the median is not.1.l (0. 1) distribution. have a N(O..i. .p and C.Section 1.8 0. Let Xl. where () > 0 and c = 2. . (b) Suppose Zl and Z.d.2 unknown. wbere the model {(F. the parameter of interest can be characterl ized as the median v = F. ) = ~.. . Are.. . still identifiable? If not.5) or mean I' = xdF(x) = Fl(u)du.11 and 1. J1 may be very much pulled in the direction of the longer tail of the density. 'I/1(±oo) = ±oo.i. J1 and v are regarded as centers of the distribution F. Yn be i.1.O) 1 . Yl . 1f(O2 ) = ~. Problems for Section 1.. .X n ). '1/1' > 0.p and ~ are identifi· able.2) distribution with .p(t)).d. Examples are the distribution of income and the distribution of wealth. In Example 1.1.2 1." .(x/c)e. When F is not symmetric. what parameters are? Hint: Ca) PIXI < tl = if>(.6 Let 1f be the prior frequency function of 0 defined by 1f(I1..12.. (b) Suppose X" . and for this reason. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. G. X n are independent with frequency fnnction p(x posterior1r(() I Xl. . x> c x<c 0.. 15. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0. and suppose that for given (). Find the . and Zl and Z{ are independent random variables. Observe that it depends only on l:~ I Xi· I 0). the median is preferred in this case. Consider a parameter space consisting of two points fh and ()2.2 with assumptions (t){4).v arbitrarily large. 14. Find the median v and the mean J1 for the values of () where the mean exists.Xm be i. Merging Opinions.4 1 0..G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. Show that both . Generally. have a N(O. Ca) Find the posterior frequency function 1f(0 I x). F.. Show how to choose () to make J. (a) Suppose Zl' Z.
1r(J)= 'u . < 0 < 1. Unllormon {'13} .[X ~ k] ~ (1. "X n ) = c(n 'n+a m) ' J=m. the outcome X has density p(x OJ ~ (2x/O').. " ••. ) ~ . 2.'" 1rajI Show that J + a. Suppose that for given 8 = 0. is used in the fonnula for p(x)? 2. .. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O.. (3) Find the posterior density of 8 when 71"( 0) ~ 1. ~ = (h I L~l Xi ~ = . 4'2'4 (b) Relative to (a). This is called the geometric distribution (9(0))...m+I. 1. }. 7r (d) Give the values of P(B n = 2 and 100. .2. Let 71" denote a prior density for 8.Xn = X n when 1r(B) = 1.. Compare these B's for n = 2 and 100.25.O)kO.2. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. where a> 1 and c(a) . and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 . Goals. X n are independent with the same distribution as X. For this convergence. Let X I. (b) Find the posterior density of 8 when 71"( OJ = 302 . Assume X I'V p(x) = l:. 3. ':. c(a).72 Statistical Models. . 71"1 (0 2 ) = . .. 4.8). . . 0 < x < O.0 < B < 1. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is .Xn are natural numbers between 1 and Band e= {I. X n be distributed as where XI.0 < 0 < 1. .'" .. 3. . k = 0. (d) Suppose Xl.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. 'IT .. (3) Suppose 8 has prior frequency. " J = [2:.75.)=1. does it matter which prior. what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta. l3(r. .. 7l" or 11"1. for given (J = B. Then P. I XI . .. Find the posteriordensityof8 given Xl = Xl. Show that the probability of this set tends to zero as n t 00. ... 0 (c) Find E(81 x) for the two priors in (a) and (b).=l1r(B i )p(x I 8i ).. Consider an experiment in which. . (J. .
11'0) distribution.. X n+ k ) be a sample from a population with density f(x I 0).n. Justify the following approximation to the posterior distribution where q. .(1...1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.2.1.1.• X n = Xn. (a) Show that the family of priors E... . .• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e.Section 1.. b) is the distribution of (aV loW)[1 + (aV IbW)]t."'tF·t'.. . Show rigorously using (1. XI = XI.and Complements 73 wherern = max(xl.. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N . Let (XI. Xn) + 5. . Show that a conjugate family of distributions for the Poisson family is the gamma family. If a and b are integers..1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •.'" . X n = Xn is that of (Y. In Example 1.7 Problems .1= n+r+s _ .Xn is a sample with Xi '"'' p(x I (})..f) = [~. I Xn+k) given . W.2. Xn) = Xl = m for all 1 as n + 00 whatever be a. f(x I f). D = NO has a B(N. (b) Suppose that max(xl. 6. 7. Va' W t . where {i E A and N E {l.. Show that the conditional distribution of (6. then !3( a..8) that if in Example 1..0 E Let 9 have prior density 1r. .··. 9." . Show in Example 1. Interpret this result.. .. . X n+ Il . is the standard normal distribution function and n _ T 2 x + n+r+s' a J.) = n+r+s Hint: Let!3( a. a regular model and integrable as a function of O. f3(r. s). where VI..c(b. n. .. S. Assume that A = {x : p( x I 8) > O} does not involve O. 10.. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is ..J:n).. b) denote the posterior distribution. ZI. are independent standard exponential. 11'0) distribution. . Suppose Xl.b> 1. where E~ 1 Xi = k. . Next use the central limit theorem and Slutsky's theorem.2."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.. Show that ?T( m I Xl.2. . .
)T. x> 0. compute the predictive and n + 00. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { ..i. <0<x and let7t(O) = 2exp{20}. 11. 82 I fL. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. and () = a.9). (N. Q'r)T.\ > 0.. (c) Find the posterior distribution of 0".2 is (called) the precision of the distribution of Xi.) u/' 0 < cx) L". given (x. 0 > O. N(I". Find I 12. LO. v > 0. X 1. Q'j > 0. . L. X and 5' are independent with X ~ N(I". (b) Discuss the behavior of the two predictive distributions as 15. . i). 9= (OJ. 1 X n . < 1. (12). "5) and N(Oo.J u.d. The posterior predictive distribution is the conditional distribution of X n +l given Xl.) be multinomial M(n.O the posterior density 7t( 0 Ix).1 < j < T. 01 )=1 j=l Uj < 1. O(t + v) has a X~+n distribution.74 where N' ~ N Statistical Models. 0 > O. . 14. 8 2 ) is such that vn(p. 0<0. 0> 0 a otherwise.") ~ . posterior predictive distribution. (. .~l . . . In a Bayesian model whereX l ..~vO}. Letp(x I OJ =exp{(xO)}..J t • I X. Goals. N. ... . . = 1. . and Performance Criteria Chapter 1 + nand ({.Xn.2) and we formally put 7t(I". 52) of J1.d.O... .. . . . D(a).. Next use Bayes rule. Let N = (N l . unconditionally.. . I(x I (J). x n ). The Dirichlet distribution. Show that if Xl... . X n are i. X n ..i. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8.) pix I 0) ~ ({" . .~X) '" tnI... a = (a1. Suppose p(x I 0) is the density of i.6 ""' 1r. 'V ...Xn + l arei. j=1 '\' • = 1. T6) densities. •• • .. The Dirichlet distribution is a conjugate prior for the multinomial. where Xi known. Xl . has density r("" a) IT • fa(u) ~ n' r(. (a) If f and 7t are the N(O. . vB has a XX distribution.d. Note that. j=1 • . then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I. given x.. ... Here HinJ: Given I" and ".. 13. Find the posterior distribution 7f(() I x) and show that if>' is an integer. . ../If (Ito.i. the predictive distribution is the marginal distribution of X n + l .
(b)p=lq=. a3. I b"9 }. = 0.3. . 8 = 0 or 8 > 0 for some parameter 8. . 1. Let the actions corresponding to deciding whether the loss function is given by (from Lehmann. I. Find the Bayes rule for case 2. = 11'. 159 of Table 1. a2.1.. .. See Example 1.3.1.) = 0. respectively and suppose . 1C(O.5. 159 for the preceding case (a).5. or 0 > a be penoted by I. .1.7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a).Section 1. (d) Suppose that 0 has prior 1C(OIl (a).q) and let when at..3 IN ~ n) is V( a + n). .3. a) is given by 0. then the posteriof7r(0 where n = (nt l · · · . a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St. 1957) °< °= a 0. n r ).5 and (ii) l' = 0.1. and the loss function l((J. (c) Find the minimax rule among the randomized rules. . (c) Find the minimax rule among bt.. (b) Find the minimax rule among {b 1 . Suppose the possible states of nature are (Jl. Suppose that in Example 1. Compute and plot the risk points (a)p=q= .3. the possible actions are al. (J2.. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. (J) given by O\x a (I .159 be the decision rules of Table 1. . 1C(02) l' = 0.5. 0..3.. (d) Suppose 0 has prior 1C(01) ~ 1'..p) (I . Find the Bayes rule when (i) 3. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x.3. Problems for Section 1.
j = 1. geographic locations or age groups).s.) c<f>( y'n(r . 1 < j < S. Show that the strata sample sizes that minimize MSE(!i2) are given by (1... < 00. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e.) r=2s=1. and MSEs of iiI and 'j1z. random variables Xlj"". . 1) distribution function. .8. and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J". I I I.. How should nj.76 StatistiCCII Models.j = I. are known (estimates will be used in a latet chapter). 1 < j < S. and <I> is the N(O. c<I>(y'n(sO))+b<I>(y'n(rO)). Goals. Assume that 0 < a.I L LXij. E. . Within the jth stratum we have a sample of i. I . Stratified sampling. are known.7.. (a) Compute the biases.. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj. 0<0 0=0 0 >0 R(O. 11 .(X) 0 b b+c 0 c b+c b 0 where b and c are positive.0)) +b<f>( y'n(s ..d. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. (b) Plot the risk function when b = c = 1. (. Suppose X is aN(B. variances.i. We want to estimate the mean J1. n = 1 and . 1 (l)r=s=l. < J' < s.0)). i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4. Weassurne that the s samples from different strata are independent. b<f>(y'ns) + b<I>(y'nr).3) .g.=l iii = N. . J". Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. 0 be chosen to make iiI unbiased? I (b) Neyman allocation.Xnjj.andastratumsamplemeanXj.. where <f> = 1 <I>.
0 + '. ° = 15. . (iii) F is normal. ali X '"'' F.. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier.3.5.2.2. respectively. (d) Find EIX .7.15. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = ..= 2:.bl for the situation in (i). l X n . b = '.=1 pp7J" aV. Use a numerical integration package.. Next note that the distribution of X involves Bernoulli and multinomial trials.2.0. Suppose that n is odd..5(0 + I) and S ~ B(n. 75. n ~ 1. and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii).) with nk given by (1. Hint: By Problem 1.5.0 .3.. . ~ 7. ~ ~ ~ . U(O. . a = ..b.3. plot RR for p = . (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).4.p). set () = 0 without loss of generality. (c) Same as (b) except when n ~ 1.. . [f the parameters of interest are the population mean and median of Xi .. Hint: See Problem B.2'. We want to estimate "the" median lJ of F. . I). P(X ~ b) = 1 . (b) Evaluate RR when n = 1. ...9. that X is the median of the sample. N(o..I 2:. and that n is odd.bl when n = compare it to EIX . Hint: See Problem B. .2p.1.45. Hint: Use Problem 1. ~ ~  ~ 6.0 ~ + 2'. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0.d..2. .13. > 0. ~ a) = P(X = c) = P. and = 1.5.. > k) (ii) F is uniform.20.25.bl. X n are i...5.5. 1). Also find EIX . Let X and X denote the sample mean and median. . (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i. Suppose that X I. . Each value has probability .i.40.Section 1. 5. p = .3.'.3) is N. . X n be a sample from a population with values 0. Let Xb and X b denote the sample mean and the sample median of the sample XI b.b. < ° P < 1. Let XI.=1 Pj(O'j where a. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift). where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~.
(o(X)). A decision rule 15 is said to be unbiased if E. defined by (J(6. In Problem 1.4. . Let Xl. . 8. A person in charge of ordering equipment needs to estimate and uses I . You may use the fact that E(X i .X)2 is an unbiased estimator of u 2 • Hint: Write (Xi .(1(6. . (a) Show that s:? = (n .X)2 keeping the square brackets intact.0): 6 E eo}. and only if.  . then expand (Xi . then this definition coincides with the definition of an unbiased estimate of e. then a test function is unbiased in this sense if. . 0) (J(6'.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. for all 6' E 8 1. = c L~ 1 (Xi . Find MSE(6) and MSE(P). satisfies > sup{{J(6.g.. It is known that in the state where the company is located.3. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. 10% have the characteristic. (i) Show that MSE(S2) (ii) Let 0'6 c~ .. Goals.[X .J.(1(6'.1)1 E~ 1 (Xi .Xn be a sample from a population with variance a 2 . ' . If the true 6 is 60 .3(a) with b = c = 1 and n = I.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. (b) Suppose Xi _ N(I'..L)4 = 3(12. the powerfunction. 9. I I 78 Statistical Models. i1 .2 ~~ I (Xi .X)2. (a)r = 8 =1 (b)r= ~s =1.a)2.1'])2.. and Performance Criteria Chapter 1 :! . e 8 ~ (.0(X))) for all 6. I . Which one of the rules is the better one from the Bayes point of view? 11.. I . II.. . . 0 < a 2 < 00.3. ~ 2(n _ 1)1. Let () denote the proportion of people working in a company who have a certain characteristic (e. I (b) Show that if we use the 0 1 loss function in testing.1'] . Show that the value of c that minimizes M S E(c/.. 10. (a) Show that if 6 is real and 1(6. 6' E e.2)(.X)2 has a X~I distribution. a) = (6 .'). .10) + (. being lefthanded).X)' = ([Xi .for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100.0) = E.3) that .0(X))) < E.
In Example 1.. but if 0 < <p(x) < 1. show that if c ::.3.) for all B.~ is unbiased 13. 0. For Example 1.3. (c) Same as (b) except k = 2. Ok) as a function of B. 0 < u < 1. 0.ao) ~ O. Show that the procedure O(X) au is admissible. (b) find the minimum relative risk of P.1) and let Ok be the decision rule "reject the shipment iff X > k.7 Problems and Complements 79 12. then. given 0 < a < 1." (a) Show that the risk is given by (1. find the set of I' where MSE(ji) depend on n. and k o ~ 3. If X ~ x and <p(x) = 0 we decide 90.UfO.Section 1. (B) = 0 for some event B implies that Po (B) = 0 for all B E El. a) is . Convexity ofthe risk set. consider the loss function (1. (h) If N ~ 10. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. Suppose that Po. 15. wc dccide El. Consider the following randomized test J: Observe U. Compare 02 and 03' 19.7).1.3. Furthersuppose that l(Bo. B = .4. o Suppose that U . and possible actions at and a2. 16. Show that J agrees with rp in the sense that.4.1'0 I· 17.3. consider the estimator < MSE(X). procedures. and then JT.3.Polo(X) ~ 01 = Eo(<p(X)). Your answer should Mw If n. if <p(x) = 1. If U = u.. 03) = aR(B. Show that if J 1 and J 2 are two randomized.. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. z > 0. 1) and is independent of X.' and 0 = II' .3. In Example 1. Define the nonrandomized test Ju . plot R(O. Po[o(X) ~ 11 ~ 1 . by 1 if<p(X) if <p(X) >u < u. In Problem 1. Suppose the loss function feB.a )R(B. s = r = I. Consider a decision problem with the possible states of nature 81 and 82 . The interpretation of <p is the following. 18. .1.. b. use the test J u .1.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known.wo to X. . there is a randomized procedure 03 such that R(B. 14.) + (1 . (a) find the value of Wo that minimizes M SE(Mw). Suppose that the set of decision procedures is finite.
!. 4. (b) Compute the MSPEs of the predictors in (a). and the ratio of its MSPE to that of the best and best linear predictors. 6.4 = 0. Give the minimax rule among the randomized decision rules. 0. and Performance Criteria Chapter 1 B\a 0. and the best zero intercept linear predictor. the best linear predictor. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. its MSPE.) the Bayes decision rule? Problems for Section 1.80 Statistical Models.9. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z). 0 0.1 calculate explicitly the best zero intercept linear predictor. Let X be a random variable with probability function p(x ! B) O\x 0. 2. (a) Find the best predictor of Y given Z.6 (a) Compute and plot the risk points of the nonrandomized decision rules.Is Z of any value in predicting Y? = Ur + U?. 3..7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. al a2 0 3 2 I 0. Give an example in which Z can be used to predict Y perfectly. U2 be independent standard normal random variables and set Z Y = U I.(0. In Example 1. In Problem B. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. The midpoint of the interval of such c is called the conventionally defined median or simply just the median. 7. Four balls are drawn at random without replacement. Let U I . Give the minimax rule among the nonrandomized decision rules. (c) Suppose 0 has the prior distribution defined by . Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.) = 0.4.. (b) Give and plot the risk set S. An urn contains four red and four black balls. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly.l. 5. !. What is 1.4 I 0.8 0.(0. Goals. ..2 0.1.
which is symmetric about c and which is unimodal. z > 0.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. Define Z = Z2 and Y = Zl + Z l Z2. If Y and Z are any two random variables. 0.t. Y)I < Ipl.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)]. ° 14. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor. Show that if (Z. which is symmetric about c. Z". Suppose that Z has a density p. 12. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) .2 ) distrihution. Let ZI. p( c all z. Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . 10. where (ZI . a 2. Let Y have a N(I". Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. a 2. (b) Suppose (Z. Y) has a bivariate normal distribution. that is. Y = Y' + Y".z) for 11. Y'. 9. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. (a) Show that the relation between Z and Y is weaker than that between Z' and y J .L.7 Problems and Complements 81 Hint: If c ElY . y l ).col < eo. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. Yare measurements on such a variable for a randomly selected father and son. Suppose that Z.  = ElY cl + (c  co){P[Y > co] . p( z) is nonincreasing for z > c. ICor(Z. Y" be the corresponding genetic and environmental components Z = ZJ + Z". (b) Show directly that I" minimizes E(IY . 7 2 ) variables independent of each other and of (ZI. + z) = p(c .Section 1. and otherwise. Y') have a N(p" J.el) as a function of c. Suppose that Z has a density p.cl) = oQ[lc . Show that c is a median of Z. exhibit a best predictor of Y given Z for mean absolute prediction error. Y" are N(v.YI > s. p) distribution and Z". Let Zl and Zz be independent and have exponential distributions with density Ae\z. that is. (a) Show that E(IY . Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error.
/L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). . Let /L(z) = E(Y I Z ~ z). 1905. IS./L(Z)) = max Corr'(Y.) ~ 1/>.: (0 The best linear MSPE predictor of Y based on Z = z. IS. (See Pearson. Let S be the unknown date in the past when the sUbject was infected. Show that Var(/L(Z))/Var(Y) = Corr'(Y./L£(Z)) ~ maxgEL Corr'(Y. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z.) ~ 1/>''.1] . 1). I. .(y) = >.3.4. . and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease.4./LdZ) be the linear prediction error. Consider a subject who walks into a clinic today. j3 > 0. and Var( Z. then'fJh(Z)Y =1JZY.) (a) Show that 1]~y > piy. g(Z)) 9 where g(Z) stands for any predictor. ~~y = 1 . 16.. on estimation of 1]~y. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. 1995.g.exp{ >.g(Z)) where £ is the set of 17. . in the linear model of Remark 1. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y.) = E( Z. It will be convenient to rescale the problem by introducing Z = (Zo . y> O. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. at time t. We are interested in the time Yo = t .S from infection until detection. Predicting the past from the present. > 0. Hint: Recall that E( Z. a blood cell or viral load measurement) is obtained. hat1s. >. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. and Doksurn and Samarov. and Performance Criteria Chapter 1 . Hint: See Problem 1.) = Var( Z.ElY . . Show that p~y linear predictors.I • ! ..y}. .15. mvanant un d ersuchh. that is. (c) Let '£ = Y ./L)/<J and Y ~ /3Yo/<J. Goals. = Corr'(Y. 2 'T ' . . At the same time t a diagnostic indicator Zo of the severity of the disease (e. Show that. and is diagnosed with a certain disease.4. 82 Statisflcal Models. Yo > O.4. where piy is the population multiple correlation coefficient of Remark 1. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18.
Let Y be the number of heads showing when X fair coins are tossed. see Berman. Y2) using (a). + b2 Z 2 + W.>')1 2 } Y> ° where c = 4>(z .7 and 1.4. Find Corr(Y}..>.6) when r = s.z) . Let Y be a vector and let r(Y) and s(Y) be real valued.7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . then = E{Cov[r(Y). 7r(Y I z) ~ (27r) .4. Cov[r(Y). (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . wbere Y j .4. (c) The optimal predictor of Y given X = x.14 by setting the derivatives of R(a. Hint: Use Bayes rule. 21. c. 1990.s(Y)] Covlr(Y). s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. b) equal to zero.9. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo.. s(Y» given Z = z. 20. .z). need to be estimated from cohort studies.. This density is called the truncated (at zero) normal. 19. 6. all the unknowns. x = 1..Section 1. (b) The MSPE of the optimal predictor of Y based on X. (In practice. Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. and Nonnand and Doksum. and checking convexity. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). (a) Show that ifCov[r(Y). .~ [y  (z . (b) Show that (a) is equivalent to (1. Write Cov[r(Y). Establish 1. .I exp { . where X is the number of spots showing when a fair die is rolled.. solving for (a. where Z}. density. N(z . (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>.g(Zo)l· Hint: See Problems 1.(>' . Z2 and W are independent with finite variances. including the "prior" 71". b). 2000).). (c) Show that if Z is real. Find (a) The mean and variance of Y. s(Y)] < 00. I Z]' Z}. 1).>.4.
This is thebeta.0> O. . 1.2eY + e' < 2(Y' + e').g(z)]'lw(y. z) and c is the constant that makes Po a density. 1). z) = z(1 .~(1 . Y z )? (f) In model (d). 25.3. Zz). 2. Verify that solving (1. Assume that Eow (Y.. we say that there is a 50% overlap between Y 1 and Yz. where Po(Y.14). In Example 1. < 00 for all e. This is known as the Weibull density.z) = 6w (y. Problems for Section 1. \ (b) Establish the same result using the factorization theorem. 1 X n be a sample from a Poisson. Find Po(Z) when (i) w(y.4. Oax"l exp( Ox").') and W ~ N(po. (a) p(x. 23. x  . a > O..84 Statistical Models.z) be a positive realvalued function. Zz and ~V have the same variance (T2. 0) = Ox'.5 1. p(e).15) yields (1. and = 0 otherwise. .. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. .z)lw(y. 3. > a. a > O.Xn is a sample from a population with one of the following densities. Show that EY' < 00 if and only if E(Y ..4..p~y). 1 2 y' . z "5). Find the 22. 0 < z < I. 0) = < X < 1..l . . Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z). density.6(r. 0 > 0. if b1 = b2 and Zl. g(Z)) < for some 9 and that Po is a density. are N(p. Let Xi = 1 if the ith item drawn is bad. Goals. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8.g(z») is called weighted squared prediction error.z). Then [y . and Performance Criteria Chapter 1 (e) In the preceding model (d). 0 > 0. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. .x > 0. population where e > O.density. and (ii) w(y. show that the MSPE of the optimal predictor is . 0) = Oa' Ix('H). z) = I. . 00 (b) Suppose that given Z = z. .6(O.e)' Hint: Whatever be Y and c. s). n > 2. (a) Let w(y.e)' = Y' . 0 (b) p(x. Suppose Xl. Let Xl. In this case what is Corr(Y. 24. Y ~ B(n.4. and suppose that Z has the beta. optimal predictor of Yz given (Yi. z) ~ ep(y.e' < (Y .z). . suppose that Zl and Z. (c) p(x.
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for θ (a is fixed).
1
x t and I:~
and I:~
I:~
1
log Xi, log Xi.
Xi Xi
I Xi
1
>0 >0
l(Xi X
1 (Xi 
(d)(I:~ lXi,I:~ lxnand(I:~ lXi,I:~
(e) (I:~
I Xi, 1
?)
xl)
and (I:~
1 Xi,
I:~
X)3).
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where '17o, .•.• '17k are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. (b) Find the maximum entropy densities when rj(x) = x j and (i) S ~ (0,00), k = 1, at > 0; (ii) S = R, k = 2, at E R, a, > 0; (iii) S = R, k = 3, a) E R, a, > 0, a3 E R. 29. As in Example 1.6.11, suppose that Y 1, ...• Y n are Li.d. Np(f.L. E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y = (Y ... , Yn ) is the p(p + 3)/2 canonical " exponential family generated by h = 1 and the p(p + 3)/2 statistics
n n
Tj
=
LYii>
i=l
1 <j <Pi
Tjl =
LJ'ijJ'iI.
i=l
1 <j< l<p
where Y i = (Yi!, ... , Yip). Show that <: is open and that this family is of rank pcp + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = pcp + 3)/2 statistics Tj(Y) ~ Yj, 1 < j < p, and Tj,(Y) = YjYi, 1 <j < I < p,
92
Statistical Models, Goals, and Performance Criteria
Chapter 1
generate Np(J.l, E). As E ranges over all p x p symmetric positive definite matrices, so does E 1 • Next establish that for symmetric matrices M,
J
M
exp{ _uT Mu}du
< 00 iff M
is positive definite
by using the spectral decomposition (see B.I0.1.2)
=L
j=1
p
AjejeJ for el, ... , e p orthogonal. Aj E R.
To show that the family has full rank m, use induction on p to show that if Zt, ... , Zp are i.i.d. N(O, 1) and if B pxp = (b jl ) is symmetric, then
p
P
LajZj
j"" 1
+ Lbj,ZjZ,
j,l
~c
= P(aTZ + ZTBZ = c) ~ 0
N p(l', E), then
unless a ~ 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ y = SZ for some nonsingular p x p matrix S.
I
30. Show that if Xl,'" ,Xn are d.d. N p (8,E o) given (J where ~o is known, then the Np(A, f) family is conjugate to N p(8, Eo), where A varies freely in RP and f ranges over
all p x p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(I'j, Tj) : 1 < j < k} be a given collection of pairs with I'j E R, Tj > 0. Let (tt, tT) be a random pair with Aj = P«(I', tT) = (I'j, Tj)), 0 < Aj < 1, L:~~l Aj = 1. Let 8 be a random variable whose conditional distribution given (IL, IT) = (p,j 1 Tj) is nonnal, N(p,j, rJ). Consider the model X = 8 + f, where 8 and € are independent and € rv N(O, a3), a~ known. Note that 8 has the prior density
11'(0)
=L
j=l
k
Aj'l'rj (0  I'j)
(1.7.4)
where 'I'r denotes the N(O, T 2 ) density. Also note that (X tion. (a) Find the posterior
i!
I 0) has the N(O, (75) distribu
" "
,;
k
11'(0 I x)
and write it in the fonn
= LP((tt,tT) ~
j=1
(l'j,Tj) I X)1I'(O I (l'j,Tj),X)
" • ,
k
L
j=1
Aj (x)'I'rj(x) (0 l'j(X»
Section 1.7
Problems ;3nd Complements
93
for appropriate A) (x), Tj (x) and ILJ (x). This shows that (1. 7.4) defines a conjugate prior for the N(O, (76), distribution. (b) Let Xi = + Ei, I < i < n, where is as previously and EI," ., En are ij.d. N(O, (76)' Find the posterior 7r( 0 I Xl, ... , x n ), and show that it belongs to class (1.7 A). Hint: Consider the sufficient statistic for p(x I B).
e
e
32. A Hierarchical BinomialBeta Model. Let {(rj, Sj) : 1 <j < k} be a given collection of pair.; with rj > 0, sJ > 0, let (R, S) be a random pair with P(R = cJ' S = 8j) = Aj, D < Aj < 1, E7=1 Aj = 1, and let e be a random variable whose conditional density ".(0, c, s) given R = r, S = S is beta, (3(c, s). Consider the model in which (X I 0) has the binomial, B( n, fJ), distribution. Note that e has the prior density
".(0)
Find the posterior
k
=L
j=1
k
Aj"'(O, cJ ' sJ)'
(J .7.5)
".(0 I x) = LP(R= cj,S =
j=l
8j
I x)7r(O I (rj,sj),x)
and show that it can be written in the form J (x)7r(O,rj(x),sj(x)) for appropriate Aj(X), Cj(x) and 8j(X). This shows that (1.7.5) defines a class of conjugate prior.; for the B( n, 0) distribution.
L:A
33. Let p(x,TJ) be a one parameter canonical exponential family generated by T(x) = x and h(x), X E X C R, and let 1jJ(x) be a nonconstant, nondecreasing function. Show that E,1jJ(X) is strictly increasing in ry. Hint:
Cov,(1jJ(X), X)
~E{(X 
X')i1jJ(X) 1jJ(X')]}
where X and X' are independent identically distributed as X (see A.Il.12).
34. Let (Xl, ... , X n ) be a stationary Markov chain with two states D and 1. That is.
P[Xi
where
= Ei I Xl = EI,·· .,Xi  = Eid = P[Xi = Ci I X i  l = Eid =
l
PEi_1Ei
(POO PIO
pal) is the matrix of transition probabilities. Suppose further that Pn '
= 1  p.
(i) poo
(ii)
= PII = p, so that, PlO = Pal PIX, = OJ = PIX, = IJ = !.
94
Statistical Models, Goals, and Performance Criteria
Chapter 1
(a) Show that if 0 < p < 1 is unknown this is a full rank, oneparameter exponential family with T = NOD + N ll where Nt) the number of transitions from i to j. For example, 01011 has N Ol = 2, Nil = 1, N oo = 0, N IO ~ 1.
(b) Show that E(T)
= (n 
l)p (by the method of indicators or otherwise).
35. A Conjugate Priorfor the Two~Sample Problem. Suppose that Xl, ... , X n and Y1 , ... , Yn are independent N(fLI' (12) and N(1l2' ( 2 ) samples, respectively. Consider the prior 7r for which for some r > 0, k > 0, ro 2 has a X~ distribution and given 0 2 , /11 and fL2 are independent with N(~I, <7 2/ kt} and N(6, <7 2/ k2) distributions, respectively, where ~j E R, k j > 0, j = 1,2. Show that Jr is a conjugate prior.
I
36. The inverse Gaussian density. IG(j..t, .\), is
f(x,J1.,>')
= [>./21Tjl/2 x 3/2 exp { >.(x 
J1.)2/ 2J1.2 X}, x> 0, J1. > 0, >. > O.
(a) Show thatthis is an exponentialfamily generated hy T( X) h(x) = (21T)1/2 X 3/'. (b) Show that the canonical parameters TJl, TJ2 are given by TJI that A( 'II, '12) =  [! [Og('I2) + v''I1'I2]'£ = [0,00) x (0,00).
= ! (X, XI) T and = fL 2A, 1]2 =
= '\, and
(e) Fwd the momentgenerating function ofT and show that E(X) J1. 3>., E(XI) = J1. 1 + >.1, Var(X I ) = (>'J1.)1 + 2>'2. (d) Suppose J1. pnor.
~
J1., Var(X)
J1.o is known. Show that the gamma family, qa,,6), is a conjugate
(e) Suppose that>' = >'0 is known. Show that the conjngate prior formula (1.6.20) produces a function that is not integrable with respect to fl. That is, defined in (1.6.19) is empty.
n
(I) Suppose that J1. and>. are both unknown. Show that (1.6.20) produces a function
that is not integrable; that is, f! defined in (1.6,19) is empty. 37, Let XI, ... , X n be i.i.d. as X ~ Np(O, ~o) where ~o is known. Show that the conjugate prior generated by (1.6.20) is the N p ( 1]0,761) family, where 1]0 varies freely in RP, 76 > 0 and I is the p x p identity matrix.
,
•
38. Let Xi
"
(Zi, Yi)T be jj,d. as X = (Z, Y)T, 1 < i < n, where X has the density of Example 1.6.3. Write the density of XI, ... ,Xn as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
=
,!,i
i!: "
39. Suppose that Y1 , •.. 1 Y n are independent, Yi
'"'' N(fLi, a 2 ), n > 4.
"'
(a) Write the distribution of Y1 , ..• ,Yn in canonical exponential family fonn. Identify T, h, 1), A, and E. (b) Next suppose that fLi depends on the value Zi of some covariate and consider the submodel defined by the map 1) : (0 1, O , 03)T ~ (1'7, <72 jT where 1) is detennined by 2
fLi
I i'i
I ,
I
= exp{OI
+ 02Zi},
Zl
< Z2 < .. , <
Zn;
(72 =
03


Section 1.8
Notes
95
where 8 r E R, O E R, 03 > O. This model is sometimes used when IIi is restricted to be 2 positive. Show that p(y, 0) as given by (1.6.12) is a curved exponential family model with
1=3.
40. Suppose Y 1 , • •. , y;'l are independent exponentially, E' (Ai), distributed survival times, n > 3. (a) Write the distribution of Y1 , ... 1 Yn in canonical exponential family form. Identify T, h, '1, A, and E. (b) Recall that J.1i = E (Y'i) = Ai I. Suppose lJi depends on the value Zi of a covariate. Because Iti > O. fLi is sometimes modeled as
fLi
= CXp{ 0 1 + (hZi},
i
=
1, ... , n
where not all the z's are equal. Show that p(y, fi) as given by (1.6.12) is a curved exponential family model with 1 = 2.
1.8
NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the Po are
all dominated by a derivative.
(J
finite measure It and that p(x, 8) denotes
dJ;lI, the Radon Nikodym
Notes for Section 1,3
~
(I) More natural in the sense of measuring the Euclidean distance between the estimate f} and the "truth" Squared error gives much more weight to those that are far away from f} than those close to f}.
e.
e
~
(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1,4
(I) Source; Hodges, Jr., J. L., D. Keetch, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGrawHill, 1975.
Notes for Section 1,6
(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particlessee Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in infonnation theorysee Cover and Thomas (199]). (2) The restriction that's x E Rq and that these families be discrete or continuous is artificial. In general if fL is a (J finite measure on the sample space X. p( x, e) as given by (1.6.1)
77. R. Feynman... T. DE GROOT. Theory ofGames and Statistical Decisions New York: Wiley. The Feynmtln Lectures on Physics. v." J. KENDALL. 1. 1967. 1988. S. S.733741 (1990). I. A." Ann.. D. II. R. T. R. positions. Mathematical Statistics New York: Academic Press. E. J. Leighton. Science 5.. Goals.. ley.. Iii I' i M. Testing Statistical Hypotheses.7 (I) u T Mu > 0 for all p x 1 vectors u l' o. 6. 0 . AND COVER. E. 1986.. Transformation and Weighting in Regression New York: Chapman and Hall.. J." Ann. and M. Royal Statist... CRUTCHFIELD. 14431473 (1995). BICKEL. KRETcH AND R. and Performance Criteria Chapter 1 can be taken to be the density of X with respect to fLsee Lehmann (1997). The Advanced Theory of Statistics. . 2nd ed. P. H. J. G. M. . r HorXJEs. P. B. Optimal Statistical Decisions New York: McGrawHili.. AND M. Note for Section 1. . AND M.I' L6HMANN. for instance.. P. 1969. and spheres (e. Math Statist. L. K A AND A. M. "Using Residuals Robustly I: Tests for Heteroscedasticity. Eels. BOX. L. GIRSHICK. Statist.. D. 5.. 1997. IMS Lecture NotesMonograph Series. MA: AddisonWesley. Soc.g. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. U. "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion). R.t AND D. • • ." Ann. 22.9 REFERENCES BERGER. and Later Developments. 40 Statistical Mechanics ofPhysics Reading. Vols. E. 1985. 23. Nonlinearity." Biomelika..266291 (1978). L6HMANN. BLACKWELL. CARROLL. FEYNMAN. Statist. STUART. L. This permits consideration of data such as images. BERMAN.. the Earth)...96 Statistical Models. 383430 (1979). 125. A 143. Elements of Information Theory New York: Wiley. Ii . Sands. m New York: Hafner Publishing Co. AND A.547572 (1957). L. 1975. I and II. Hayward. I: I" Ii L6HMANN.. Statlab: An Empirical Introduction to Statistics New York: McGrawHili." Statist. "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression. "Model Specification: The Views of Fisher and Neyman. 160168 (1990). GRENANDER. 1954. M. 1963. 1961. ROSENBLATT.• Statistical Decision Theory and Bayesian Analysis New York: Springer. THOMAS. New York: Springer. L. J. . SAMAROV. RUPPERT. E. "A Theory of Some Multiple Decision Problems. FERGUSON. Ch. G. 1966. 1991. DoKSUM. JR. and so on. BROWN. 1957. . . P.. "A Stochastic Model for the Distribution ofHIV Latency Time Based on T4 Counts. Statistical Analysis of Stationary Time Series New York: Wi • . A.
1964. L. Part I: Probability. AND K. Harvard University. K. Sequential Methods in Statistics New York: Chapman and Hall.. (Draper's Research Memoirs. V. Wiley & Sons. 1965. "Empirical Bayes Procedures for a Change Point Problem with Application to HIVJAIDS Data. GLAZEBROOK. Graduate School of Business Administration. J. DOKSUM. The Foundation ofStattstical Inference London: Methuen & Co. 8th Ed..) RAIFFA. Wiley & Sons. D. 1." Proc. 1962. IA: Iowa State University Press. 6779. Soc. SL. New York: Springer. New York.. The Statistical Analysis of Experimental Data New York: J. Ames. Reid. H. G. G. L. Roy.. SAVAGE... A. 1986. 1954. PEARSON. Applied Statistical Decision Theory. AND K. Ahmed and N. SNEDECOR. G.. Division of Research. . W. 1961. Part II: Inference London: Cambridge University Press. ET AL. Biometrics Series II. WETHERILL. AND W. MANDEL. Introduction to Probability and Statistics from a Bayesian Point of View." Empirical Bayes and Likelihood Inference. London 71. 303 (1905). 2000. 1989. The Foundations ofStatistics. Editors: S. E. Boston. SCHLAIFFER. Dulan & Co. Statistical Methods. D. J. NORMAND. Lecture Notes in Statistics. SAVAGE.Section 1. COCHRAN. "On the General Theory of Skew Correlation and NOnlinear Regression. J. 9 References 97 LINDLEY. AND R. B..
:i.'. I:' o.·.: '".! ~J . . I 'I i . I II 1'1 . I .
how do we select reasonable estimates for 8 itself? That is.1. how do we find a function 8(X) of the vector observation X that in some sense "is close" to the unknown 81 The fundamental heuristic is typically the following.O) where V denotes the gradient. usually parametrized as P = {PO: 8 E e}. As a function of 8. That is.8) is an estimate of D(8 0 .1. 6) ~ o. Now suppose e is Euclidean C R d . =0 (2. So it is natural to consider 8(X) minimizing p(X. we could obtain 8 0 as the minimizer. Of course. We consider a function that we shall call a contrast function  p:Xx8>R and define D(Oo. X """ PEP. and 8 Jo D( 8 0 .1.1 BASIC HEURISTICS OF ESTIMATION Minimum Contrast Estimates.Chapter 2 METHODS OF ESTIMATION 2. In order for p to be a contrast function we require that DC 8 0l 9) is uniquely minimized for 8 = (Jo.8). Then we expect  \lOD(Oo. X E X. In this parametric case. we don't know the truth so this is inoperable. This is the most general fonn of the minimum contrast estimate we shall consider in the next section.2) 99 . DC8 0l 9) measures the (population) discrepancy between 8 and the true value 8 0 of the parameter. Estimating Equations Our basic framework is as before.1. p(X.O).8) as a function of 8.1 2. (2.1) Arguing heuristically again we are led to estimates B that solve \lOp(X.8) is smooth. 8).O) EOoP(X. the true 8 0 is an interior point of e. The equations (2.2) define a special form of estimating equations. but in a very weak sense (unbiasedness). if PO o were true and we knew DC 8 0 .
g({3. z. (1). there is a substantial overlap between the two classes of estimates.4 R d. suppose we are given a function W : X and define Methods of Estimation Chapter 2 X R d .Zid)T 'j • r . Zn»T That is.16).7) i=l J . ZI). W(X. . z) is continuous and ! I lim{lg(l3. Least Squares.10).1. Here the data are X = {(Zi.z. Then {3 parametrizes the model and we can compute (see Problem 2..4 are i. . Example 2.i + L[g({3o. Wd)T V(l1 o. "'.!I [" '1 L • . N(O.).»)'. z.' . ~ Then we say 8 solving (2. {3 E R d . further. . (3) E{3.100 More generally.1.4 with I'(z) = g({3.z. (3) n n<T. A naiural(l) function p(X.. An estimate (3 that minimizes p(X.('l/Jl.1.)Y. .. {3) to consider is the squared Euclidean distance between the vector Y of observed Yi and the vector expectation of Y. I In the important linear case.2) or equivalently the system of estimating eq!1ations. d g({3.1.g({3. i=l (2. I D({3o. where the function 9 js known.I1) =0 is an estimating equation estimate. . (3) exists if g({3.)g(l3.1L1 2 = L[Yi .1. ua).4) w(X..8/3 ({3. But.3) = 0 has 8 0 as its unique solution for all 8 0 ~ E e. :i = ~8g~ ~ L.)J i=1 ~ 2 (2.1. suppose we postulate that the Ei of Example 1. z). Here is an example to be pursued later.zd = LZij!3j j=1 andzi = (Zil.p(X. Evidently. g({3.6) . we take n p(X. ~ ~8g~ L. i=l 3 I (2.g({3. .1. l'i) : 1 < i < n} where Yi.) . z. Yn are independent. (3) = !Y . (1) Suppose V (8 0 . If. for convenience.1. IL( z) = (g({3. z) is differentiable in {3.1.d. z.1. W . The estimate (3 is called the least squares estimate.5) Strictly speaking P is not fully defined here and this is a point we shall explore later..1. (2.8) ~ E I1 . Consider the parametric version of the regression model of Example 1. . J which is indeed minimized at {3 = {3o and uniquely so if and only if the parametrization is identifiable.i.z)l: 1{31 ~ oo} 'I = 00 I (Problem 2.8/3 ({3.1. then {3 satisfies the equation (2.
Suppose that 1'1 (8).2. . I'd). 1 1 < j < d.(8).Xn are i. I'd of X.. u5).2 and Chapter = 6.. The method of moments prescribes that we estimate 9 by the solution of p. say q(8) = h(I'I. once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi).i. . suppose is 1 . 1 1j ~_l". /1j converges in probability to flj(fJ). provides a first example of both minimum contrast and estimating equation methods. d > k.1.d. To apply the method of moments to the problem of estimating 9. 8 E R d and 8 is identifiable.. Thus.. 1 < i < n}. More generally.1 from R d to Rd.Section 2.. and then using h(p!. In fact.Xi t=l . lid) as the estimate of q( 8). thus.8) the normal equations.d.1 Basic Heuristics of Estimation 101 the system becomes (2. ~ .. we need to be able to express 9 as a continuous function 9 of the first d moments. = 1'. .. These equations are commonly written in matrix fonn (2.i. I'd( 8) are the first d moments of the population we are sampling from.1. we assume the existence of Define the jth sample moment f1j by. Here is another basic estimating equation example. Least squares. if we want to estimate a Rkvalued function q(8) of 9. Suppose Xl. .9) where Zv IIZijllnxd is the design matrix. . L.. we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1. . This very important example is pursued further in Section 2... . Thus. N(O. .  n n.1 <j < d if it exists. which can be judged on its merits whatever the true P governing X is. . The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po.1. Method a/Moments (MOM). as X ~ P8. . 0 Example 2. . We return to the remark that this estimating method is well defined even if the Ci are not i.
and 1" ~ E(X') = . In this case f) ~ ('" A).. . i = 1.)]xO1exp{Ax}..1. x>O.3. A = X/a' where a 2 = fl. Here is some job category data (Mosteller. 2. for instance. 3. but their respective probabilities PI. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn.10.(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm.6. . 0 Algorithmic issues We note that. A = Jl1/a 2 . the proportion of sample values equal to Vi.1 0) . '\). ..2 and 0=2 = n 1 EX1. Frequency Plugin(2) and Extension. the method of moment estimator is not unique..~ (I'l/a)'.·)  l~t. I.. A>O.(1 + ")/A 2 Solving . .1.102 Methods of Estimation Chapter 2 For instance. .. I. in general. If we let Xl. .1. There are many algorithms for optimization and root finding that can be employed. then setting (2.·) fli =D'I'(X.. express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2. . 1'1 for () gives = E(X) ~ "/ A. f(u. ~ iii ~ (X/a)'.X 2 • In this example.i. As an illustration consider a population of men whose occupations fall in one of five different job categories. .4 and in Chapter 6. two other basic heuristics particularly applicable in the i. 4. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. . Example 2. It is defined by initializing with eo. We can. in particular Problem 6.•. . then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn.d. 1968). as X and Ni = number of indices j such that X j = Vi. ..11). X n be Ltd.. 2. l Vk of the population being sampled are known.. This algorithm and others will be discussed more extensively in Section 2. Suppose we observe multinomial trials in which the values VI.Pk are completely unknown.5. Vi = i.>0. with density [A"/r(. consider a study in which the survival time X is modeled to have a gamma distribution. An algorithm for estimating equations frequently used when computationofM(X. Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category.2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles. . case. Here k = 5. or 5.
. Equivalently.Pk by the observable sample frequencies Nt/n.. together with the estimates Pi = NiJn.. .P2. use (2. . For instance. and Then q(p) (p.09. . If we use the frequency substitution principle...1.4. (2.0)2. ...41 4 217 0.12 3 289 0. Pk) of the population proportions. . we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 . Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. .0). . categories 4 and 5 correspond to bluecollar jobs.Xn .»i=1 for Danish men whose fathers were in category 3. + P3).1.1. whereas categories 2 and 3 correspond to whitecollar jobs. lPk). Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. the difference in the proportions of bluecollar and whitecollar workers. P3) given by (2. We would be interested in estimating q(P"". suppose that in the previous job category table.31 5 95 0. can be identified with a parameter v : P ~ R. let P dennte p ~ (P". If we assume the three different genotypes are identifiable. Now suppose that the proportions PI' .1. .. That is. 1 Nk/n.. Example 2. in v(P) by 0 P ).. i < k. .03 2 84 0." . HardyWeinberg Equilibrium.0 < 0 < 1. v.(P2 + P3). Next consider the marc general problem of estimating a continuous function q(Pl' .53 = 0.. Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h. 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8)...11) to estimate q(Pll··.. .44 .Section 2.1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0..12).. The frequency plugin principle simply proposes to replace the unknown population frequencies PI.pl . .0. . Suppose .Pk) with Pi ~ PIX = 'Vi]. . the estimate is which in our case is 0. P2 ~ 20(1. .12) If N i is the number of individuals of type i in the sample of size n. . N 3 ) has a multinomial distribution with parameters (n..Ps) = (P4 + Ps) .}}. P3 PI = 0 = (1.Pk) = (~l . . 1~!.Ni =708 2:::o. then (NIl N 2 .13 n2::. the multinomial empirical distribution of Xl. that is. v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI. 1 < think of this model as P = {all probability distributions Pan {VI.
by the law of large numbers. ( . we can usually express q(8) as a continuous function of PI. then v(P) is the plugin estimate of v. (2.d.(IJ).. (2. T(Xl.d..15) ~ ".E have an estimate P of PEP such that PEP and v : P 4 T is a parameter. . We shall consider in Chapters 3 (Example 3.14) are not unique.. .1.Pk are continuous functions of 8. the representation (2.).4.. .1. Note. in the Li. In particular. 1  V In general. The plugin and extension principles can be abstractly stated as follows: Plugin principle. Nk) . suppose that we want to estimate a continuous Rlvalued function q of e.104 Methods of Estimation Chapter 2 we want to estimate fJ. . Let be a submodel of P. .h(p) and v(P) = v(P) for P E Po. consider va(P) = [Fl (a) + Fu 1(a )1.. if X is real and F(x) = P(X < x) is the distribution function (dJ.1. .. we can use the principle we have introduced and estimate by J N l In.1.1. q(lJ) with h defined and continuous on = h(p. X n are Ll. however.13) defines an extension of v from Po to P via v. a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance./P3 and. that is.. P > R where v(P) .'" . the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X. the frequency of one of the alleles. thus. . . (2.Pk(IJ»). Fu'(a) = sup{x. We can think of the extension principle alternatively as follows. . case if P is the space of all distributions of X and Xl. I) and ! F'(a) " = inf{x. E A) n. F(x) > a}. . . . I: t=l (2. F(x) < a}. Po ~ R given by v(PO) = q(O). If w.13) and estimate (2.13) Given h we can apply the extension principle to estimate q( 8) as. as X p.. Then (2. ... Because f) = ~. that we can also Nsfn is also a plausible estimate of O. "·1 is.1..Pk.14) As we saw in the HardyWeinberg case. If PI.1. Now q(lJ) can be identified if IJ is identifiable by a parameter v .4) and 5 how to choose among such estimates. 0 write 0 = 1 . . where a E (0.Xn ) =h N.16) .
e e Remark 2. P E Po. II to P. context is the jth sample moment v(P) = xjdF(x) ~ 0. P 'Ie Po.3 and 2. With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP.1 I:~ 1 Xl. Here x ~ = median.d. and v are continuous.' Nl = h N () ~ ~ = v(P) and P is the empirical distribution. is a continuous map from = [0. . If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po.12) and to more general method of moment estimates (Problem 2. in the multinomial examples 2.. (P) is called the population (2.1.1. The plugin and extension principles must be calibrated with the target parameter. but only VI(P) = X is a sensible estimate of v(P). Remark 2. if X is real and P is the class of distributions with EIXlj < 00. = v(P). The plugin and extension principles are used when Pe. As stated.17) where F is the empirical dJ. For instance.1.1.i. However. case (Problem 2. (P) is the ath population quantile Xa. 1/. is called the sample median. This reasoning extends to the general i. the sample median V2(P) does not converge in probability to Ep(X). v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R.1.1. casebut see Problem 2.1 Basic Heuristics of Estimation 105 V 1. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P).1. ~ For a second example. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero. then v(P) is an extension (and plugin) estimate of viP).13). i=1 k /1j ~ = ~ LXI 1=1 I n . because when P is not symmetric. ~ ~ . ~ I ~ ~ Extension principle.4.i. Pe as given by the HardyWeinberg p(O). Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter.Section 2.) h(p) ~ L vip. let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero.tj = E(Xj) in this nonparametric .. then Va. A natural estimate is the ath sample quantile . For instance. then the plugin estimate of the jth moment v(P) = f.2.1. Here x!. these principles are general. = Lvi n i= 1 k .14.1.d. they are mainly applied in the i.
(b) If the sample size is large.6. C' X n are ij. Remark 2. these are the frequency plugin (substitution) estimates (see Problem 2.4. X. The method of moments estimates of f1 and a 2 are X and 2 0 a. Unfortunately.. O}.l [iX.. Suppose that Xl.) = ~ (0 + I). • . .LI (8) = (J the method of moments leads to the natural estimate of 8. . they are often difficult to compute. When we consider optimality principles. a special type of minimum contrast and estimating equation method.2 with assumptions (1)(4) holding.1 is a method of moments estimate of B. as we shall see in Section 2. .8) we are led by the first moment to the estimate.4. a saving grace becomes apparent in Chapters 5 and 6.106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates. Because f. Example 2. .1 because in this model B is always at least as large as X (n)' 0 As we have seen. Example 2. The method of moments can lead to either the sample mean or the sample variance. Discussion. we find I' = E. Algorithms for their computation will be introduced in Section 2. To estimate the population variance B( 1 . U {I. these estimates are likely to be close to the value estimated (consistency). This is clearly a foolish estimate if X (n) = max Xi> 2X . " . X(l . valuable as preliminary estimates in algorithms that search for more efficient estimates. . •i . = 0]. [ [ . . Suppose X I. (X. Moreover. X n is a N(f1.5. Plugin is not the optimal way to go for the Bayes. It does turn out that there are "best" frequency plugin estimates. "..1. where Po is n. See Section 2. if we are sampling from a Poisson population with parameter B. . X). because Po = P(X = 0) = exp{ O}. What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are. This minimal property is discussed in Section 5..1. there are often several method of moments estimates for the same q(9).. For instance. X n are the indicators of a set of Bernoulli trials with probability of success fJ. .5. In Example 1. We will make a selection among such procedures in Chapter 3. I . If the model fits. and there are best extensions. However. a frequency plugin estimate of 0 is Iogpo. a 2 ) sample as in Example 1. or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. Thus. )" I I • .I and 2X .1.4.. optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions.3. 2.1.3 where X" .d. the plugin principle is justified. Estimating the Size of a Population (continued). the frequency of successes. 'I I[ i['l i .Ll). estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J. therefore. minimax. 0 = 21' .X). For example. as we shall see in Chapter 3. 0 Example 2. then B is both the population mean and the population variance.1. . Because we are dealing with (unrestricted) Bernoulli trials.2. those obtained by the method of maximum likelihood. for large amounts of data.7. we may arrive at different types of estimates than those discussed in this section.
. where ZD = IIziJ·llnxd is called the design matrix. . ~ ~ ~ ~ ~ v we find a parameter v such that v(p. Let Po and P be two statistical models for X with Po c P. If P is an estimate of P with PEP. and the contrast estimating equations are V Op(X. In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph . D(P) is called the extensionplug·in estimate of v(P).. where Pj is the probability of the jth category.(3) ~ L[li . is parametric and a vector q( 8) is to be estimated. 1 < j < k... z) = ZT{3. 0).2. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P).Section 2. Method of moment estimates are empirical PIEs based on v(P) = (f. I < i < n.2 Minimum Contrast Estimates and Estimating Equations 107 Summary. P E Po. li) : I < i < n} with li independent and E(Yi ) = g((3. The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. If P = PO. . f. ~ ~ ~ ~ 2. A minimum contrast estimator is a minimizer of p(X. 6). When P is the empirical probability distribution P E defined by Pe(A) ~ n. the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3.O) = O. Zi).g((3. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6..ld)T where f. when g({3. For data {(Zi.Pk). For this contrast. 0 E e. . For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo..ll..) = q(O) and call v(P) a plugin estimator of q(6). .1 L:~ 1 I[X i E AI. The general principles are shown to be related to each other.. Suppose X .2 2.1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement..lJ) ~ E'o"p(X. P. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter. then is called the empirical PIE. a least squares estimate of {3 is a minimizer of p(X. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. It is of great importance in many areas of statistics such as the analysis of variance and regression theory.Zi)j2. where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients.lj = E( XJ\ 1 < j < d.
. If we consider this model. Zn»)T is 1.) + L[g(l3 o.:0::d=':Of. That is.6) is difficult Sometimes z can be viewed as the realization of a population variable Z.»)2.zt}. j. i=l (2.. .g(l3. 13 = I3(P) is the miniml2er of E(Y .) .C=ha:::p::. Note that the joint distribution H of (C}. that is.z). that is.::'hc.. < i < n.) = g(l3.2. . 1 < i < n. as (Z.i.1.2) Are least squares estimates still reasonable? For P in the semiparametric model P. Y) ~ PEP = {Alljnintdistributions of(Z..:. . The contrast p(X. a 2 and H unknown.En) is any distribution satisfying the GaussMarkov assumptions.g(l3..108_~ ~_ _~ ~ ~cMc. then 13 has an interpretation as a parameter on p.2.i. I Zj = Zj). I' .:1.3) continues to be valid. z. . 1_' .z. where Ci = g(l3.g(l3.z.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3. I<i<n.d. 13 E R d }. For instance. 13) = LIY.3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3.2).E(Y.2. is modeled as a sample from a joint distribution. we can compute still Dp(l3o.. .6) Var( €i) = u 2 COV(fi. as in (a) of Example 1. (Z" Y. aJ) and (3 ranges over R d or an open subset..m::a.4 with <j simply defined as Y. NCO. the model is semiparametric with {3. . This is frequently the case for studies in the social and biological sciences.) = 0. 1 < i < j < n 1 > 0\ . Yi).. = 0.) =0.Y) such thatE(Y I Z = z) = g(j3..1. j.d. This follows "I . Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ.).=. . 1 < j < n. and 13 ~ (g(l3. z.2. which satisfies (but is not fully specified by) (2.::2 : In Example 2.) = E(Y.g(l3.(z.g(l3.) + <" I < i < n n (2.2.2.E:::'::'.zn)f is 11.).5) (2.2. E«.2.::. The estimates continue to be reasonable under the GaussMarkov assumptions. " (2. I) are i. (2. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj. fj) because (2. E«.f3d and is often applied in situations in which specification of the model beyond (2.. Z)f. (Zi.) i. ..4) (2. (3)  E p p(X. The least squares method of estimation applies only to the parameters {31' . " . . .1.. Z could be educationallevel and Y income.2. z. • .13) n n i=1  L Varp«.2. .o=nC..1 we considered the nonlinear (and linear) Gaussian model Po given by Y.z. .4}{2.
Section 2.5) and (2. In that case. {3d). 2:).zo). Yi). a further modeling step is often taken and it is assumed that {Zll . See also Problem 2. .2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1.2.8) d f30 = («(3" . Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous. . which. z) = zT (3. I)/.2. j=1 (2. We continue our discussion for this important special case for which explicit fonnulae and theory have been derived. .. n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix.7). E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2.2. and Seber and Wild (1989).I to each of the n pairs (Zil Yi). (2) If as we discussed earlier. (2.7) (2.. Fan and Gijbels (1996). For nonlinear cases we can use numerical methods to solve the estimating equations (2.4). (2.5.L{3jPj.2. see Rnppert and Wand (1994). as we have seen in Section lA. where P is the empirical distribution assigning mass n. As we noted in Example 2.2.1). In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). Zd. we are in a situation in which it is plausible to assume that (Zi.2. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z .1.1.2.1. J for Zo an interior point of the domain. i = 1..1 the most commonly used 9 in these models is g({3. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials.2.41. E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj..9) . = py . is called the linear (multiple) regression modeL For the data {(Zi.6). We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!.4. and Volnme n. Sections 6. .3 and 6.4. y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il. The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth... in conjnnction with (2.
ecor and Cochran. the solution of the normal equations can be given "explicitly" by (2. we will find that the values of y will not be the same. Nine samples of soil were treated with different amounts Z of phosphorus.1 '. The nonnal equations are n i=l n ~(Yi .1.(3zz. necessarily. Example 2. ~ = (lin) L:~ 1Yi = ii. For certain chemicals and plants.) = O.12) . Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil. 139). The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d.2. g(z. {3. 1967. LZi(Yi . is independent of Z and has a N(O. given Zi = 1 < i < n.il2Zi) = 0. Zi.il. In that case.2. If we run several experiments with the same z using plants and soils that are as nearly identical as possible. we assume that for a given z. . exists and is unique and satisfies the Donnal equations (2. the least squares estimate.2. d = 1. g(z.no Furthennore. if the parametrization {3 ) ZD{3 is identifiable. We have already argued in Examp}e 2. For this reason.10) Here are some examples. I I. In the measurement model in which Yi is the detennination of a constant Ih. we get the solutions (2.25. see Problem 2.il. i I . whose solution is .2.11) When the Zi'S are not all equal. We want to estimate f31 and {32. Zi 1'. 0 Ii. ( (J2 2 ) distribution where = ayy Therefore. . p. Example 2. . Y is random with a distribution P(y I z). < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. il.I 'i. Following are the results of an experiment to which a regression model can be applied (Sned.8).2. i=l (2. il. Estimation of {3 in the linear regression model.) = 1 and the normal equation is L:~ 1(Yi . 1 < i < n. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil.ill .2. we have a Gaussian linear regression model for Yi. il.1.) ~ 0.) = il" a~.2. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2. the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. L 1 that.1.2.
Yn). Zn. The regression line for the phosphorus data is given in Figure 2. This connection to prediction explains the use of vertical distance! in regression.4. &atter plot {(Zi.. n} and sample regression line for the phosphorus data. and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl.1. For instance. 9p( z).. Here {31 = 61.9.2..1. .58 and 132 = 1. The linear regression model is considerably more general than appears at first sight.Section 2.(Z). and 131 = fi . 91. . .. The vertical distances €i = fYi . then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl).2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22.3. €6 is the residual for (Z6. Y6).. . Geometrically. if we measure the distance between a point (Zi..Yi) and a line Y ~ a + bz vertically by di = Iy. (zn..132 z ~ ~ (2. ... Yi). i = 1. i = 1.(/31 + (32Zi)] are called the residuals of the fit. suppose we select p realvalued functions of z. .(a + bzi)l. j=1 Then we are still dealing with a linear regression model because we can define WpXI . .9p.Yn on ZI. .13) where Z ~ (lin) I:~ 1 Zi.. . n. . .2. .2.. .42. 0 ~ ~ ~ ~ ~ Remark 2. The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1.1.. that is p I'(z) ~ 2:).. . P > d and postulate that 1'( z) is a linear combination of 91 (z).. .
Note that the variables . Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic).2. .2.. Zi) = (2. we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant. The weighted least squares estimate of {3 is now the value f3. . We need to find the values {3l and /32 of (3. Yi = 80 + 8 1Zi + 82 + Cisee Problem 2. .= y.wi.. we arrive at quadratic regression.= g({3.filii minimizes ~  I i i L[1Ii .. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci...wi) g((3.3.wi. and (32 that minimize ~ ~ "I. i = 1. _ Yi .2.. .. and g({3. The method of least squares may not be appropriate because (2. . Thus.5). but the Wi are known weights.2.wi  _ g((3... then = WW2/Wi = ...wi 1 < . 1 n. Weighted least squares. which for given Yi = yd. . Zi)/.'l are sufficient for the Yi and that Var(Ed. That is. For instance.2.)]2 i=l i=l ~ n (2.". I and the Y i satisfy the assumption (2. [Yi . if d = 1 and we take gj (z) = zj. However...wi . Zi) + Ei ..2..2.15) . .. iI L Vi[Yi i=l n ({3. 1 <i < n Yi n 1 . + /3zzi)J2 (2. Example 2..16) as a function of {3.8. In Example 2. we can write (2.g(.. Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. z. Zil = 1.II 112 Methods of Estimation Chapter 2 (91 (z). Zi2 = Zi. fi = <d. Weighted Linear Regression.zd + ti..2.. < n.. .2. 1<i <n Wij = 9j(Zi). if we setg«(3. We return to this in Volume II.14) where (J2 is unknown as before. 0 < j < 2. I .5) fails.2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable.24 for more on polynomial regression. • I' "0 r .z.17) .g«(3. Consider the case in which d = 2.2.Zi)]2 = L ~. Zi) = {3l + fhZi.
2.1. we may allow for correlation between the errors {Ei}. By following the steps of Exarnple 2.2. that weighted least squares estimates are also plugin estimates. i i=l = 1. using Theorem 1. More generally.2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2.13 minimizing the least squares contrast in this transformed model is given by (2. ~ .(L:~1 UlYi)(L:~l uizd Li~] Ui Zi . a__ Cov(Z"'.18) ~ ~ ~I" (31 = E(Y') .1) and (2.4.. the .4 as follows.3.2.~ UiZi· n n i"~l I" n n F"l This computation suggests. we can write Remark 2.1. 0 Next consider finding the.B that minimizes (2.2.20)..B.Y.2. (Zn. suppose Var(€) = a 2W for some invertible matrix W nxn..(Li~] Ui Z i)2 n 2 n (2.B+€ can be transformed to one satisfying (2.n. We can also use the results on prediction in Section 1.n where Ui n = vi/Lvi. .. . wn ) and ZD = IIZijllnxd is the design matrix. as we make precise in Problem 2.(32E(Z') ~ .2. we find (Problem 2.~ UiYi .Section 2.26. 1 < i < n.28) that the model Y = ZD.4H2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2. That is.2. When ZD has rank d and Wi > 0. .Y") ~(Zi.1 leading to (2..2. z) = zT. (3 and for general d. . Moreover. ..6). Let (Z*.7).(3.8). Var(Z") and  _ L:~l UiZiYi . when g(.19) and (2.2. . .27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI.B..2. .. Y*) V. Zi) = z.1 7) is equivalent to finding the best linear MSPE predictor of y . Yn) and probability distribution given by PI(Z". Thus.. Y*) denote a pair of discrete random variables with possible values (z" Yl).. Then it can be shown (Problem 2.16) for g((3. i ~ I. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .)] ~Ui.2.2..2. . .2.
.
.
.
.
OJ = (1  0)' where 0 < () < 1 (see Example 2. Similarly. x. In general. the MLE does not exist if 112 + 2n3 = O. let nJ. 1). 0) = 20'(1.) = ·j1 e'"p. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive. the maximum likelihood estimate exists and is given by 8(x) = 2n. . p(3. O)p(l.2. 0) ~ 20(1 .22) and (2. Here are two simple examples with (j real. situations with f) well defined but (2.lx(0) = 5 1 0' ..6. Example 2. Because . If we make the usual simplifying assumption that the arrivals form a Poisson process. + n.5.. 0)p(2. equivalently.2. .l. Consider a popUlation with three kinds of individuals labeled 1.1.. 2 and 3. represents the expected number of arrivals in an hour or. . p(X.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0.2.2. X2 = 2.>. . and as we have seen in Example 2. Let X denote the number of customers arriving at a service counter during n hours. A is an unknown positive constant and we wish to estimate A using X. which has the unique solution B = ~.. } with probabilities.. the rate of arrival. .29) '(j .2.O) = 0'. 1. ]n practice. there may be solutions of (2.118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables. I). 8' 80.2. If we observe a sample of three individuals and obtain Xl = 1. the dual point of view of (2. where ).28) If 2n.0).ny l. then I • Lx(O) ~ p(l. is zero. p(2. Evidently. X3 = 1.27) that are not maxima or only local maxima. Here X takes on values {O.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section. x n } equal to 1. (2.2. and 3 and occurring in the HardyWeinberg proportions p(I.0)' < ° for all B E (0. + n.x=O. then X has a Poisson distribution with parameter nA.27) doesn't make sense. respectively. ~ maximizes Lx(B).2.2. 2.. which is maximized by 0 = 0. so the MLE does not exist because = (0.4). and n3 denote the number of {Xl. Nevertheless.OJ'".2. 0 e Example 2..7. the likelihood is {1 . 2n (2.(1. . n2.
0 To apply the likelihood equation successfully we need to know when a solution is an MLE. A sufficient condition. Example 2.2.30) for all This is the condition we applied in Example 2.30)). Then p(x. 80.2...32). We assume that n > k ~ 1. 8) = 0 if any of the 8j are zero.2.kl.\ I 0.31 ) To obtain the MLE (J we consider l as a function of 8 1 . As in Example 1.2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution).lx(B) <0..ekl with kl Ok ~ 1. this estimate is the MLE of ). .2. 8' (2.8.. trials in which each trial can produce a result in one of k categories. A similar condition applies for vector parameters. 8B.. . J = I.2.Section 2.. thus. and the equation becomes (BkIB j ) n· OJ = . j=d (2. (see (2. let B = P(X. . familiar from calculus. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. .2.6.7. k. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category.LBj j=1 • (2.logB. 8Bk/8Bj (2.2. Multinomial Trials. L. ~ n . Let Xi = j if the ith trial produces a result in the jth category. consider an experiment with n Li. j = 1. If x = 0 the MLE does not exist. fJ) = TI7~1 B7'. = j) be the probability J of the jth category.. . p(x.2. 31=1 By (2.. the MLE must have all OJ > 0. Then.d. = L. . = jl.32) to find I. for an experiment in which we observe nj ~ 2:7 1 I[X. ~ ~ .. this is well known to be equivalent to ~ e. and k j=1 k Ix(fJ) = LnjlogB j .1.32) We first consider the case with all the nj positive.8B " 1=13 k k =0. the maximum is approached as . e.o.6. is that l be concave in If l is twice differentiable. fJE6= {fJ:Bj >O'LOj ~ I}. = x/no If x is positive.. However.n..
2.2 both unknown. More generally. g«(3. .1 L:~ .6. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3.2. gi..' .. 1 < j < k... . OJ > ~ 0 case.d.(Xi . Zi)J[l:} . Example 2. ...3. Then (} with OJ = njjn. i = 1.2. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n..2. these estimates viewed as an algorithm applied to the set of data X make sense much more generally.28.9. k.lV. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) . as maximum likelihood estimates for f3 when Y is distributed as lV. is still the unique MLE of fJ. The 0 < OJ < 1.34) g«(3. version of this example will be considered in the exponential family case in Section 2. Next suppose that nj = 0 for some j..). Suppose the model Po of Example 2. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • .2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood. .1.. where E(V. are known functions and {3 is a parameter to be estimated from the independent observations YI . 'ffi f.1).X)2 (Problem 2.. n. ((g((3. .Zi)]2.1 holds and X = (Y" . we can consider minimizing L:i.1. (2.2.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). zn) 03W.11(a)). least squares estimates are maximum likelihood for the particular model Po.IV.33) It follows that in this nj > 0.2 ) with J1.. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. z. In Section 2.202 L...g«(3. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2. Suppose that Xl. see Problem 2.2. See Problem 2. zi)1 .gi«(3) 1 . where Var(Yi) does not depend on i. YnJT.  seen and shall see more in Section 6. i=l g«(3. ~ g«(3.2. Then Ix «(3) log IT ~'P (V. 1 Yn . . and 0. X n are Li. j = 1.. .. As we have IIV. wi(5). . . N(lll 0. Using the concavity argument. ~ g((3. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O). Thus. Zi). Summary. 0. then <r <k I. 1 < i < n.1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV.30.) = gi«(3). we check the concavity of lx(O): let 1 1 <j < k .
See Problem 2. including all points with ±oo as a coordinate.3. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2.1. = RxR+ and ee e. Concavity also plays a crucial role in the analysis of algorithms in the next section. e e I"V e De= {(a..1 ). We start with a useful general framework and lemma. . a8 is the set of points outside of 8 that can be obtained as limits of points in e. For instance.a= ±oo.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. (a.a < b< oo} U {(a. Properties that derive solely fTOm concavity are given in Propositon 2. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI.2. where II denotes the Euclidean norm. or 8 1nk diverges with 181nk I .Section 2. (}"2) distribution. Suppose also that e t R where e c RP is open and 1 is (2.I ) all tend to De as m ~ 00.4 and Corollaries 1. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. Suppose we are given a function 1: continuous.3 and 1. Extensions to weighted least squares. though the results of Theorems 1.6. (m.. Formally. m.3. in the N(O" O ) case. Suppose 8 c RP is an open set. z=1=1 ~ 2.1 and 1. are given. 2 e Lemma 2. . m.5. In Section 2.b) .2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. which are appropriate when Var(Yj) depends on i or the Y's are correlated. That is. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d. oo]P.zid.2 and other exponential family properties also playa role. In general. as k . Proof. if X N(B I .. for a sequence {8 m } of points from open. (a. it is shown that the MLEs coincide with the LSEs. (m. (h). Let &e = be the boundary of where denotes the closure of in [00.00. b).3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly.3. b) : aER.6. 8).ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e.00.1.6.6. For instance. m).bE {a. we define 8 1n . In the case of independent response variables Yi that are modeled to have a N(9i({3). . b).3. (m.oo}}.1 only.3.1) lim{l(li) : Ii ~ ~ De} = 00.
(a) If to E R k satisfies!!) Ifc ! 'I" 0 (2.2).3. U m = Jl~::rr' Am = . '10) for some reference '10 E [ (see Problem 1.1. i .9) we know that II lx(lI) is continuous on e. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) .)) > ~ (fx(Od +lx (0. Proofof Theorem 2.3.1. I I. h) and that (i) The natural parameter space. if lx('1) logp(x.3. From (B.3. a contradiction. [ffunher {x (II) e = e e t 88. . and is a solution to the equation (2. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT.). = . thus.1.)) 0 Themem 2.3. then lx(11 m ) . We show that if {11 m } has no subsequence converging to a point in E. Without loss of generality we can suppose h(x) = pix.27).3. ~ Proof. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. which implies existence of 17 by Lemma 2... ! " .3. II) is strictly concave and 1.8 and 2. II E j. Existence and Uniqueness o/the MLE ij.3) I I' (b) Conversely. then the MLE doesn't exist and (2.3. .3) has i 1 !.3. (ii) The family is of rank k.'1 . if to doesn't satisfy (2. /fCT is the convex suppon ofthe distribution ofT (X). Write 11 m = Am U m . Applications of this theorem are given in Problems 2.3. . Suppose P is the canonical exponential/amity generated by (T. " We.  (hO! + 0. are distinct maximizers. and 0. Corollary 2.'1) with T(x) = 0. is unique. Let x be the observed data vector and set to = T(x). lI(x) =  exists. is open.1 hold. Then. open C RP.(11) ~ 00 as densities p(x. II). Furthennore. If 0..2) then the MLE Tj exists.1.to.1.1. E. ' L n .12. We give the proof for the continuous case. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. Suppose the cOlulitions o/Theorem 2. By Lemma 2.00. then Ix lx(O.3. with corresponding logp(x.3. We can now prove the following. then the MLE8(x) exists and is unique.6. no solution. " . Suppose X ~ (PII .122 Methods of Estimation Chapter 2 Proposition 2.3.
J = 00. that is. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist.3.2) and COTollary 2. By (B. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. I Xl) and 1. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets.1 follow. Case 1: Amk t 111]=11. I' E R.6. So In either case limm.k Lx( 11m. Evidently.3) by TheoTem 1.3. for every d i= 0. So. Nonexistence: if (2.1.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. Suppose the conditions ofTheorem 2. Then AU ¢ £ by assumption.d. Por n = 1.2.3. Then.3.se density is a general phenomenon. contradicting the assumption that the family is of rank: k. which is impossible. T(X) has a density and.. this is the exponential family generated by T(X) = 1 Xi. The Gaussian Model.9. In fact. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0. iff.1.3. As we observed in Example 1. 0 Example 2.3. 0 ° '* '* PrOD/a/Corollary 2. (1"2 > O. thus. So we have Case 2: Amk t A. The equivalence of (2. Suppose X"". N(p" ( 2 ). CT = R )( R+ FOT n > 2. existence of MLEs when T has a continuous ca. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2. . If ij exists then E'I T ~ 0 E'I(cTT) ~ 0.3.3. Then because for some 6 > 0. Theorem 2. CT = C~ and the MLE always exists. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows.5. . lIu m ll 00. um~.* P'IlcTT = 0] = I. fOT all 1]. t u.Xn are i.i.3.3).4.2) fails. Po[uTT(X) > 61 > O.Section 2. 0 (L:7 L:7 CJ.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it.6. Write Eo for E110 and Po for P11o . u mk t u. It is unique and satisfies x (2.
liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2.3).2 that (2.T(X) = A(.3. P > 0.1 that in this caseMLEs of"'j = 10g(AjIAk). we see that (2.2 follows.6.4) and (2.n.3.2. I < j < k. For instance. It is easy to see that ifn > 2.i.4 we will see that 83 is.4.3. We assume n > k .>.5) ~=X A where log X ~ L:~ t1ogXi. and one of the two sums is nonempty because c oF 0.3.I which generates the family is T(k_l) = (Tl>"" T.3.3... in a certain sense.2(a). Example 2.1. thus. We follow the notation of Example 1.4) (2. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO . where Tj(X) L~ I I(Xi = j).4 and 2. lit ~ . Thus. > O. I <j < k.3.3. x > 0.3.1 in the second. They are determined by Aj = Tj In. We conclude from Theorem 2.1. The likelihood equations are equivalent to (problem 2. by Problem 2. I < j < k iff 0 < Tj < n.6. Here is an example.= r(jJ) A log X (2.7.2(b» r' log . using (2. if T has a continuous case density PT(t).1).13 and the next example). The boundary of a convex set necessarily has volume 0 (Problem 2.2. How to find such nonexplicit solutions is discussed in Section 2.I..1. When method of moments and frequency substitution estimates are not unique.. To see this note that Tj > 0.(X) = r~.. Thus. >.I and verify using Theorem 2. where 0 < Aj PIX = j] < I.d.Xn are i. .2) holds. exist iff all Tj > O. A nontrivial application of Theorem 2. h(x) = XI. with density 9p.6. the best estimate of 8.3.3. Because the resulting value of t is possible if 0 < tjO < n. Multinomial Trials.3. with i'... .9).3. in the HardyWeinberg examples 2._t)T. From Theorem 1..)e>"XxPI. Suppose Xl.124 Methods of Estimation Chapter 2 Proof. I < j < k. 0 = If T is discrete MLEs need not exist. In Example 3. but only Os is a MLE. then and the result follows from Corollary 2. I <t< k .)..ln.3. LXi). = = L L i i bJ _ I .1.. Thasadensity.3..3.3 we know that E. The statistic of rank k . The TwoParameter Gamma Family. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2. Example 2.5) have a unique solution with probability 1.3. This is a rank 2 canonical exponential family generated by T   = (L log Xi.1. o Remark 2. I < j < k. the maximum likelihood principle in many cases selects the «best" estimate among them.4.
1. . when we put the multinomial in canonical exponential family form. The following result can be useful.8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0. (B). x E X. Let Q = {PII : II E e). Alternatively we can appeal to Corollary 2.A(c(ii)) ~ = O.1. Let CO denote the interior of the range of (c. .1 is useful. whereas if B = [0.3. and unicity can be losttake c not onetoone for instance. Here [ is the natural paramerer space of the exponential fantily P generated by (T. Similarly.6.3) does not have a closedform solution as in Example 1. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to .1). When P is not an exponential family both existence and unicity of MLEs become more problematic.6) on c( II) = ~~.2. mxk e. Then Theorem 2.Section 2.1.B(B) j=l .~l Aj = I}. = 0. In Example 2..A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2.exist and are unique.k. .I.8see Problem 2.3. if any TJ = 0 or n.3.. if 201 + 0. If the equations ~ have a solution B(x) E Co. lfP above satisfies the condition of Theorem 2. 0 < j < k .3. n > kl. Remark 2.3. note that in the HardyWeinberg Example 2.2) so that the MLE ij in P exists.. Ck (B)) T and let x be the observed data.3. . L. c(8) is closed in [ and T(x) = to satisfies (2. h). the MLEs ofA"j = 1.. the bivariate normal case (Problem 2.1 and Haberman (1974).3. BEe. 1)T.1. However.1 we can obtain a contradiction to (2.2.3 can be applied to determine existence in cases for which (2. then it is the unique MLE ofB. Unfortunately strict concavity of Ix is not inherited by curved exponential families. our parameter set is open.B) ~ h(x)exp LCj(B)Tj(x) .3. be a cUlVed exponential family p(x. (II) .7) Note that c(lI) E c(8) and is in general not ij. . Consider the exponential family k p(x.3. (2. e open C R=. 0 The argument of Example 2.6. the following corollary to Theorem 2. .3.. for example. 1 < i < k .2.3. m < k . In some applications.3.1 directly D (Problem 2. The remaining case T k = a gives a contradiction if c = (1.3.3.3.13). .10).2) by taking Cj = 1(i = j).3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand. the MLE does not exist if 8 ~ (0..1] it does exist ~nd is unique.3. Corollary 2.lI) ~ exp{cT (II)T(x) .
9) is a curved exponential family of the form (2.. . Gaussian with Fixed Signal to Noise.3)T. < O}.2~~2 .9. that .'.2. LocationScale Regression..ryl > O.2 . '() A 1/ Thus. '1...5 and 1. corresponding to 1]1 = //x' 1]2 = .d... 'TJn+i = 1/2a.6) with " • .. l}m.nit. Now p(y..2 and 2.3.n. As in Example 1. As a consequence of Theorems 2.6. ry.3..< O. C(O) and from Example 1..i.6.7) becomes = 0. Zl < ..) : ry.3 )(t.?' m)T " ".7) if n > 2. . l = 1. which implies i"i+ > 0. 11.6. we see that the distribution of {01 : j = 1.10. . I .ry. .3. This is a curved exponential ¥.3. .3. Suppose that }jI •.'' .6..5X' +41i2]' < 0. = ~ryf. with I. n. generated by h(Y) = 1 and "'> I T(Y) '.Zn are given constants.I LX.. = ( ~Yll'''' .0'2) with Jl/ a = AO > a known. Methods of Estimation Chapter 2 family with (It) = C2 (It) = . m m m f?. we can conclude that an MLE Ii always exists and satisfies (2..1/'1') T .. are n independent random samples./2'1' ..". simplifies to Jl2 + A6XIl. which is closed in E = {( ryt> ry. suppose Xl.: Note that 11+11_ = '6M2 solution we seek is ii+.5 = ..it. i Example 2.11. = L: xl. N'(fl.3.. . . tLi = (}1 + 82 z i .n(it' + )"5it'))T which with Ji2 = n. Equation (2. 1 n.) : ry. E R.it.3. ..3.gx±).3. t.~Y'1..!0b''. Jl > 0. Using Examples 1. . i = 1.A6iL2 = 0 Ii. L: Xi and I. where 0l N(tLj 1 0'1). af = f)3(8 1 + 82 z i )2.. < O}. j = 1.Xn are i.g( it' . Because It > 0. We find CI Example 2. .. = 1 " = 2 n( ryd'1' . the o q.ry. Ii± ~ ~[). . ).4. .6. < Zn where Zt.~Yn.5.10. as in Example 1.126 The proof is sketched in Problem 2. . .oJ). . m} is a 2nparameter canonical exponential family with 'TJi = /tila. Next suppose.\()'.. Evidently c(8) = {(lh.
If m >= 2, then the full 2n-parameter model satisfies the conditions of Theorem 2.3.1. Let E be the canonical parameter set for this full model and let Theta = {theta : theta_1 in R, theta_2 in R, theta_3 > 0}. Then c(Theta) is closed in E and we can conclude that for m >= 2 an MLE theta-hat of theta exists and satisfies (2.3.7). □

Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with E open (Theorem 2.3.1 and Corollary 2.3.1). The basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models. These results lead to a necessary condition for existence of the MLE in curved exponential families, but without a guarantee of unicity or sufficiency.

2.4 ALGORITHMIC ISSUES

As we have seen, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations, even in the context of canonical multiparameter exponential families such as the two-parameter gamma. In fact, even in the classical regression model with design matrix Z_D of full rank, the formula (2.1.10) for the least squares estimate is easy to write down symbolically but not easy to evaluate if d is at all large: forming Z_D^T Z_D requires on the order of nd^2 operations (n operations for each of the d(d+1)/2 distinct terms) and then, if implemented as usual, order d^3 operations to invert. The packages that produce least squares estimates do not in fact use formula (2.1.10).

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section we will discuss three algorithms of a type used in different statistical contexts, both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves. We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in k-parameter exponential families. Given f continuous and strictly increasing on (a, b) with f(a+) < 0 < f(b-), there exists, by the intermediate value theorem, a unique x* in (a, b) such that f(x*) = 0. Here, in pseudocode, is the bisection algorithm to find x*.

Given tolerance epsilon > 0 for |x_final - x*|: find x_0 < x_1 with f(x_0) < 0 < f(x_1) by taking |x_0|, |x_1| large enough. Initialize x_old(0) = x_0, x_old(1) = x_1.
(1) If |x_old(0) - x_old(1)| < 2 epsilon, set x_final = (x_old(0) + x_old(1))/2 and return x_final.
(2) Else, set x_new = (x_old(0) + x_old(1))/2.
(3) If f(x_new) = 0, set x_final = x_new.
(4) If f(x_new) < 0, set x_old(0) = x_new. Go to (1).
(5) If f(x_new) > 0, set x_old(1) = x_new. Go to (1).
End

If desired, one could evidently also arrange it so that, in addition, |f(x_final)| < epsilon.

Lemma 2.4.1. The bisection algorithm stops at a solution x_final such that |x_final - x*| < epsilon.

Proof. If x_m is the mth iterate of x_new, the interval bracketing x* is halved at each step, so that
(1) |x_{m+1} - x_m| <= 2^{-m} |x_1 - x_0|,
(2) x* always lies between the current x_old(0) and x_old(1), and
(3) x_m converges to x* as m grows.
Therefore, for m >= log_2(|x_1 - x_0| / epsilon), we have |x_m - x*| < epsilon. □

From this lemma we can deduce the following.

Theorem 2.4.1. Let p(x | eta) be a one-parameter canonical exponential family generated by (T, h), satisfying the conditions of Theorem 2.3.1, and let t_0 belong to C_T^0, the interior (a, b) of the convex support of P_T. Then the MLE eta-hat, which exists and is unique by Theorem 2.3.1, may be found (to tolerance epsilon) by the method of bisection applied to
f(eta) = E_eta T(X) - t_0.

Proof. f'(eta) = Var_eta T(X) > 0 for all eta, so that f is strictly increasing and continuous; because t_0 lies in the interior of the convex support of P_T, f necessarily takes both negative and positive values. The unique root of f(eta) = 0 is the MLE eta-hat, and Lemma 2.4.1 applies. □

Example 2.4.1. The Shape Parameter Gamma Family. Let X_1, ..., X_n be i.i.d. with density
g_theta(x) = x^{theta - 1} e^{-x} / Gamma(theta),  x > 0, theta > 0.
Because T(X) = sum_{i=1}^n log X_i has a density for all n, the MLE always exists. It solves the equation
Gamma'(theta) / Gamma(theta) = T(X) / n,
which by Theorem 2.4.1 can be evaluated by bisection. This example points to another hidden difficulty: the function Gamma(theta) = integral of x^{theta - 1} e^{-x} dx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method, which is slow. However, it is in fact available to high precision in standard packages such as NAG or MATLAB, and bisection itself is a defined function in some packages. □
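As a concrete illustration of Theorem 2.4.1 and Example 2.4.1, here is a minimal computational sketch, not part of the original text. It assumes Python with NumPy and SciPy, where scipy.special.digamma supplies Gamma'(theta)/Gamma(theta); the function names bisection_root and gamma_shape_mle are ours.

```python
import numpy as np
from scipy.special import digamma  # psi(theta) = Gamma'(theta) / Gamma(theta)

def bisection_root(f, lo, hi, tol=1e-8):
    """Bisection for a strictly increasing f with f(lo) < 0 < f(hi)."""
    assert f(lo) < 0 < f(hi)
    while hi - lo > 2 * tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_shape_mle(x, tol=1e-8):
    """MLE of theta in the density x^(theta-1) e^(-x) / Gamma(theta).

    By Theorem 2.4.1 the MLE solves n * psi(theta) = sum(log x_i), and psi is
    strictly increasing, so bisection applies once the root is bracketed.
    """
    x = np.asarray(x, dtype=float)
    n, t0 = len(x), np.log(x).sum()
    f = lambda theta: n * digamma(theta) - t0
    lo, hi = 1e-6, 1.0
    while f(lo) >= 0:   # expand the bracket until it straddles the root
        lo /= 10.0
    while f(hi) <= 0:
        hi *= 2.0
    return bisection_root(f, lo, hi, tol)

# Example usage on simulated data with true shape 3.
rng = np.random.default_rng(0)
print(gamma_shape_mle(rng.gamma(shape=3.0, scale=1.0, size=500)))
```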
2.4.2 Coordinate Ascent

The problem we consider is to solve numerically, for a canonical k-parameter exponential family,
E_eta(T(X)) = Ȧ(eta) = t_0,
when the MLE eta-hat = eta-hat(t_0) exists. Here is the algorithm.

The case k = 1: see Theorem 2.4.1.

The general case: Initialize eta-hat(0) = (eta-hat_1(0), ..., eta-hat_k(0)). Within a cycle the coordinates are treated one at a time: with the other coordinates held fixed at their current values, solve the one-dimensional equation
(d/d eta_j) A(eta) = t_j
for eta_j and replace the jth coordinate by this solution, which can be computed by bisection as in Theorem 2.4.1. When all k coordinates have been updated, the first cycle is complete, yielding eta-hat(1); repeating the cycles yields eta-hat(2), eta-hat(3), and so on, giving eta-hat(r) after r cycles. We stop, possibly in mid-cycle, as soon as successive iterates differ by less than a prescribed tolerance, say epsilon.
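The following is a minimal sketch of the cycle just described, under the assumption that the user supplies the map giving dA/d eta_j and a one-dimensional root finder (for instance the bisection routine above); the names coordinate_ascent, partial_A and solve_1d are ours, not from the text.

```python
import numpy as np

def coordinate_ascent(partial_A, t0, eta0, solve_1d, tol=1e-8, max_cycles=200):
    """Cyclic coordinate ascent for solving dA/d eta_j (eta) = t0_j, j = 1..k.

    partial_A(j, eta) returns dA/d eta_j at eta; solve_1d(g, x0) solves the
    one-dimensional equation g(x) = 0 (e.g. by bisection), started near x0.
    """
    eta = np.array(eta0, dtype=float)
    k = len(eta)
    for _ in range(max_cycles):
        eta_prev = eta.copy()
        for j in range(k):
            # Solve for the j-th coordinate with the others held fixed.
            def g(x, j=j):
                e = eta.copy()
                e[j] = x
                return partial_A(j, e) - t0[j]
            eta[j] = solve_1d(g, eta[j])
        if np.max(np.abs(eta - eta_prev)) < tol:  # stop at the end of a cycle
            break
    return eta
```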
Notes: (1) In practice we would again set a tolerance, say epsilon, for each of the one-dimensional solves within a cycle and stop, possibly in mid-cycle, as soon as no coordinate changes by more than epsilon. (2) Notice that the solution for the jth coordinate makes E_eta T_j(X) = t_j in the one-parameter exponential family model with all parameters save eta_j assumed known. Thus, the algorithm may be viewed as successive fitting of one-parameter families, each of whose members can be evaluated easily.

Theorem 2.4.2. Suppose the conditions of Theorem 2.3.1 hold and t_0 belongs to C_T^0. Then eta-hat(r) converges to the MLE eta-hat as r grows.

Proof (sketch). Let l(eta) = t_0^T eta - A(eta) + log h(x) denote the log likelihood and write eta-hat(i,j) for the iterate after the jth coordinate update of the ith cycle. We give a series of steps.
(1) l(eta-hat(i,j)) is nondecreasing in j for i fixed and in i, because eta-hat(i,j) and eta-hat(i,j+1) differ in only one coordinate, for which eta-hat(i,j+1) maximizes l.
(2) The sequence (eta-hat(i,1), ..., eta-hat(i,k)) has a convergent subsequence in the product of closures of E; call its limit (eta-bar(1), ..., eta-bar(k)).
(3) lim_i l(eta-hat(i,j)) = lambda (say) exists, is finite, and is the same for all j because the sequence of likelihoods is monotone; hence each eta-bar(j) lies in E.
(4) dl/d eta_j (eta-bar(j)) = 0 because dl/d eta_j (eta-hat(i,j)) = 0 for every i.
(5) Because eta-bar(j) and eta-bar(j+1) differ in at most one coordinate and l(eta-bar(1)) = ... = l(eta-bar(k)) = lambda, the strict concavity of l forces eta-bar(1) = ... = eta-bar(k) = eta-bar, say.
(6) By (4) and (5), dl/d eta_j (eta-bar) = 0 for all j, so eta-bar is the unique MLE eta-hat.
To complete the proof, notice that if eta-hat(r_m) is any subsequence of eta-hat(r) that converges to some eta-bar, then by (1), l(eta-hat(r_m)) increases to lambda = l(eta-bar), and because the MLE is unique, eta-bar = eta-hat. By a standard argument it follows that eta-hat(r) converges to eta-hat. □

It is natural to ask what happens if, in fact, the MLE eta-hat does not exist, that is, if t_0 does not belong to C_T^0. Fortunately, in these cases the algorithm, as it should, refuses to converge (in eta space!); see the problems.

Example 2.4.2. The Two-Parameter Gamma Family (continued). We use the notation of Example 2.3.2. For n >= 2 we know the MLE exists. We can initialize with the method of moments estimate from Section 2.1. The coordinate ascent cycle is particularly simple here: given p(0), the step (2.3.5) leading to lambda given p is explicit, lambda(0) = p(0)/X-bar; we then use bisection to get p(1) solving
Gamma'(p)/Gamma(p) = log lambda(0) + (1/n) sum_{i=1}^n log X_i,
and then lambda(1) = p(1)/X-bar, and so on. This two-dimensional problem is essentially no harder than the one-dimensional problem of Example 2.4.1 because the step for lambda given p can be solved in closed form and is computationally explicit and simple. □

We note some important generalizations. Suppose that we can write eta^T = (eta_1^T, ..., eta_r^T), where eta_j has dimension d_j and d_1 + ... + d_r = k, and that the problem of obtaining eta-hat_l(t_0; eta_j, j not equal to l), the maximizer over the lth block with the other blocks held fixed, can be solved in closed form. Suppose that this is true for each l. Then Theorem 2.4.2 has a generalization with cycles of length r, and each step of the iteration, both within cycles and from cycle to cycle, is quick. Whenever we can obtain such explicit steps in algorithms, they result in substantial savings of time. The case we have just discussed has d_1 = ... = d_r = 1, r = k. We pursue this discussion next.
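For the two-parameter gamma cycle just described, a minimal sketch might look as follows; it assumes SciPy's digamma for Gamma'/Gamma, initializes at the method of moments estimate, and the function name gamma_mle_coordinate_ascent is ours.

```python
import numpy as np
from scipy.special import digamma

def gamma_mle_coordinate_ascent(x, tol=1e-10, max_cycles=100):
    """Coordinate ascent for the two-parameter gamma MLE (p, lam).

    Likelihood equations (2.3.4)-(2.3.5):
        psi(p) - log(lam) = mean(log x)   and   p / lam = mean(x).
    The lam-step is closed form; the p-step is one-dimensional and is solved
    by bisection, as in Theorem 2.4.1.
    """
    x = np.asarray(x, dtype=float)
    xbar, logx_bar = x.mean(), np.log(x).mean()
    # Method-of-moments start: mean = p/lam, variance = p/lam^2.
    lam = xbar / x.var()
    p = xbar * lam
    for _ in range(max_cycles):
        p_old, lam_old = p, lam
        # p-step: solve psi(p) = logx_bar + log(lam) by bisection.
        target = logx_bar + np.log(lam)
        lo, hi = 1e-8, max(2.0 * p, 1.0)
        while digamma(hi) < target:
            hi *= 2.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if digamma(mid) < target:
                lo = mid
            else:
                hi = mid
        p = 0.5 * (lo + hi)
        # lam-step: closed form from p / lam = xbar.
        lam = p / xbar
        if abs(p - p_old) < tol and abs(lam - lam_old) < tol:
            break
    return p, lam
```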
Next consider the setting of Proposition 2.3.1 in which l_x(theta), the log likelihood for theta in an open subset Theta of R^p, is strictly concave. If theta-hat(x) exists and l_x is differentiable, the method extends straightforwardly: at each stage, with all coordinates except the jth held fixed, solve
(d l_x / d theta_j)(theta-hat_1, ..., theta-hat_{j-1}, theta_j, theta_{j+1}(0), ..., theta_p(0)) = 0
by the method of bisection in theta_j to get theta-hat_j, for j = 1, ..., p; change the other coordinates accordingly, iterate, and proceed. It is then easy to see that Theorem 2.4.2 has a generalization to this setting.

Figure 2.4.1 illustrates the process. The graph shows log likelihood contours, that is, values of (theta_1, theta_2)^T where the log likelihood is constant. At each stage, with one coordinate fixed, one finds that member of the family of contours to which the vertical (or horizontal) line is tangent.

[Figure 2.4.1. The coordinate ascent algorithm. The plot shows log likelihood contours in the (theta_1, theta_2) plane, with both axes running from 0 to 3.]

A special case of this is the famous Deming-Stephan proportional fitting of contingency tables algorithm; see Bishop, Fienberg, and Holland (1975) and the problems.

The coordinate ascent algorithm can be slow if the contours in Figure 2.4.1 are not close to spherical. It can be speeded up at the cost of further computation by Newton's method, sketched in the next subsection.
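Before turning to that, here is a minimal sketch of the Deming-Stephan proportional-fitting idea mentioned above, restricted for illustration to fitting the independence model in a two-way table; the function name proportional_fitting is ours and the routine is only a toy version of the general log-linear machinery of Bishop, Fienberg, and Holland (1975).

```python
import numpy as np

def proportional_fitting(counts, tol=1e-10, max_iter=500):
    """Deming-Stephan iterative proportional fitting for a two-way table.

    Fits the independence (no-interaction) model by alternately rescaling
    the rows and then the columns of the current fitted table so that its
    margins match the observed row and column totals.  Each rescaling is a
    closed-form "coordinate" step of the kind discussed above.
    """
    counts = np.asarray(counts, dtype=float)
    row_tot = counts.sum(axis=1)
    col_tot = counts.sum(axis=0)
    fit = np.full_like(counts, counts.sum() / counts.size)  # flat start
    for _ in range(max_iter):
        old = fit.copy()
        fit *= (row_tot / fit.sum(axis=1))[:, None]   # match row margins
        fit *= (col_tot / fit.sum(axis=0))[None, :]   # match column margins
        if np.max(np.abs(fit - old)) < tol:
            break
    return fit

table = np.array([[10., 20., 30.],
                  [20., 30., 10.]])
print(proportional_fitting(table))  # here this agrees with row_tot * col_tot / n
```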
2.4.3 The Newton-Raphson Algorithm

An algorithm that, in general, can be shown to be faster than coordinate ascent when it does converge is the Newton-Raphson method. Here is the method: if eta-hat_old is the current value of the algorithm, then
eta-hat_new = eta-hat_old - Ä^{-1}(eta-hat_old)(Ȧ(eta-hat_old) - t_0).  (2.4.2)
The rationale here is simple. If eta-hat_old is close to the root eta-hat of Ȧ(eta-hat) = t_0, then expanding Ȧ(eta-hat) around eta-hat_old gives
t_0 = Ȧ(eta-hat) approximately equal to Ȧ(eta-hat_old) + Ä(eta-hat_old)(eta-hat - eta-hat_old),
and eta-hat_new is the solution for eta-hat of this approximate equation. If eta-hat_old is close enough to eta-hat, this method is known to converge to eta-hat at a faster rate than coordinate ascent; see Dahlquist, Björk, and Anderson (1974). The method requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge, and there is a distinct possibility of nonconvergence or of convergence to a local rather than a global maximum. A hybrid of the two methods that always converges and shares the increased speed of the Newton-Raphson method is given in the problems. The Newton-Raphson algorithm also has the property that for large n, eta-hat_new after only one step behaves approximately like the MLE; we return to this property in Chapter 6.

Newton's method also extends to the framework of Proposition 2.3.1. In this case, if l(theta) denotes the log likelihood, the argument that led to (2.4.2) gives
theta-hat_new = theta-hat_old - [l''(theta-hat_old)]^{-1} l'(theta-hat_old).  (2.4.3)

Example 2.4.3. Let X_1, ..., X_n be a sample from the logistic distribution with d.f.
F(x, theta) = [1 + exp{-(x - theta)}]^{-1}.
The density is
f(x, theta) = exp{-(x - theta)} / [1 + exp{-(x - theta)}]^2.
We find
l'(theta) = n - 2 sum_{i=1}^n exp{-(X_i - theta)} F(X_i, theta),
l''(theta) = -2 sum_{i=1}^n f(X_i, theta) < 0,
so l is strictly concave. The Newton-Raphson method can be implemented by taking theta-hat_old = X-bar. □

When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton-Raphson are still employed, though there is then a distinct possibility of nonconvergence or of convergence to a local rather than a global maximum; a one-dimensional problem in which such difficulties arise is given in the problems.
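A minimal sketch of the Newton-Raphson iteration of Example 2.4.3, assuming NumPy and using the sample mean as the starting value; the function name logistic_location_mle is ours.

```python
import numpy as np

def logistic_location_mle(x, tol=1e-10, max_iter=50):
    """Newton-Raphson for the MLE of the logistic location parameter theta.

    With F(u) = 1 / (1 + exp(-u)),
        l'(theta)  = n - 2 * sum(exp(-(x_i - theta)) * F(x_i - theta)),
        l''(theta) = -2 * sum(F(x_i - theta) * (1 - F(x_i - theta))) < 0,
    so l is strictly concave and theta_new = theta_old - l'(theta)/l''(theta).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = x.mean()                      # starting value theta_old = x-bar
    for _ in range(max_iter):
        u = x - theta
        F = 1.0 / (1.0 + np.exp(-u))      # logistic d.f.
        f = F * (1.0 - F)                 # logistic density
        score = n - 2.0 * np.sum(np.exp(-u) * F)   # l'(theta)
        hess = -2.0 * np.sum(f)                    # l''(theta)
        step = score / hess
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(1)
sample = rng.logistic(loc=2.0, scale=1.0, size=400)
print(logistic_location_mle(sample))   # should be near 2
```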
. Petrie.0)2.4. The log likelihood of S now is 1". (2.n. let Xi. the rest of X is "missing" and its "reconstruction" is part of the process of estimating (} by maximum likelihood. X ~ Po with density p(x. .4. Here is another important example. where Po[X = (1. Unfortunately. Soules.x(B) is "easy" to maximize. What is observed.4) Evidently. Po[X (0. This could happen if.4. . a < 0 < I.4. A fruitful way of thinking of such problems is in terms of 8 as representing part of X. I)] (1 . 0 leads us to an MLE if it exists in both cases.4 Algorithmic Issues 133 in which such difficulties arise is given in Problem 2. Many examples and important issues and methods are discussed. 1.OJ] i=l n (2. and its main properties. and Anderson (1974).4 The EM (Expectation/Maximization) Algorithm There are many models that have the following structure. We give a few examples of situations of the foregoing type in which it is used.4. Lumped HardyWeinberg Data. for instance. the homozygotes of one type (€il = 1) could not be distinguished from the heterozygotes (€i2 = 1). however.. 0) is difficultto maximize. .2.Section 2. A prototypical example folIows. and Rubin (977). with an appropriate starting point. If we suppose (say) that observations 81. Ei3). 1 <i< m (€i1 +€i2. S = S(X) where S(X) is given by (2. 0) where I". is not X but S where = 5i 5i Xi. There are ideal observations. be a sample from a population in HardyWeinberg equilibrium for a twoallele locus. Yet the EM algorithm. but the computation is clearly not as simple as in the Original HardyWeinberg canonical exponential family example. E'2.4). in Chapter 6 of Dahlquist.0)] ~ 28(1 . for some individuals. m+ 1 <i< n.5) + ~ [(EiI+Ei2)log(1(10)')+2Ei310g(IB)] i=m+l a function that is of curved exponential family fonn. Bjork. a)] = B2. 1 8 m are not Xi but (€il.€i3).x(B) is concave in B.4.B) + 2E'310g(1 .4. Laird. €i2 + €i3).. i = 1. 0).0 E c Rd Their log likelihood Ip.. Say there is a closedfonn MLE or at least Ip. the function is not concave.6. As in Example 2. we observe 5 5(X) ~ Qo with density q(s.13. difficult to compute. (0) = log q(s. e = Example 2. Xi = (EiI.B).. 0. The algorithm was fonnalized with many examples in Dempster. 2.0.(0) })2EiI 10gB + Ei210g2B(1 . and Weiss (1970). and so on. though an earlier general form goes back to Baum. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). then explicit solution is in general not lXJssible. It does tum out that in this simplest case an explicit maximum likelihood solution is still possible. Po[X = (0.
Here is the algorithm.

The EM Algorithm. Initialize with theta_old = theta_0. The first (E) step of the algorithm is to compute
J(theta | theta_old) = E_{theta_old}( log p(X, theta) | S(X) = s )
for as many values of theta as needed. If this is difficult, the EM algorithm is probably not suitable. The second (M) step is to maximize J(theta | theta_old) as a function of theta; that is, we set theta_new = arg max J(theta | theta_old). Then we reset theta_old = theta_new and repeat the process. As we shall see, in important situations, including the examples we have given, the M step is easy and the E step doable.

The rationale behind the algorithm lies in the following formulas, which we give for theta real and which can be justified easily in the case that X is finite (see the problems):
J(theta | theta_0) = E_{theta_0}( log [p(X, theta) / p(X, theta_0)] | S(X) = s )  (2.4.8)
and, for all theta_0 (under suitable regularity conditions),
(d/d theta) log q(s, theta) at theta = theta_0  =  E_{theta_0}( (d/d theta) log p(X, theta) at theta = theta_0 | S(X) = s ).  (2.4.9)
It is easy to see that (2.4.9) follows by taking logs, differentiating, and exchanging E_{theta_0} and differentiation with respect to theta at theta_0.

Example 2.4.5. Mixture of Gaussians. Suppose S_1, ..., S_n is a sample from a population P whose density is modeled as a mixture of two Gaussian densities,
p(s, theta) = (1 - lambda) phi_{sigma_1}(s - mu_1) + lambda phi_{sigma_2}(s - mu_2),
where theta = (lambda, mu_1, mu_2, sigma_1^2, sigma_2^2), 0 < lambda < 1, mu_1, mu_2 in R, sigma_1^2, sigma_2^2 > 0, and phi_sigma(s) = sigma^{-1} phi(s / sigma). This five-parameter model is very rich, permitting up to two modes and scales. Although MLEs do not exist in these models (the log likelihood can have a number of local maxima and can tend to infinity as theta tends to the boundary of the parameter space; see the problems), a local maximum close to the true theta_0 turns out to be, in a certain sense, a good "proxy" for the nonexistent MLE, and the EM algorithm can lead to such a local maximum. It is not obvious that this model falls under our scheme, but suppose that, given Delta = (Delta_1, ..., Delta_n), the S_i are independent with
L_theta(S_i | Delta) = L_theta(S_i | Delta_i) = N(Delta_i mu_1 + (1 - Delta_i) mu_2, Delta_i sigma_1^2 + (1 - Delta_i) sigma_2^2),
and that the Delta_i are independent and identically distributed with P_theta[Delta_i = 1] = 1 - lambda, P_theta[Delta_i = 0] = lambda. Then, under theta, S has the marginal distribution given previously. Thus, we can think of S as S(X) where X = ((S_1, Delta_1), ..., (S_n, Delta_n)); Delta_i tells us whether to sample from N(mu_1, sigma_1^2) or N(mu_2, sigma_2^2). □
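For the mixture example above, a minimal sketch of the E and M steps might look as follows (assuming NumPy; the names em_two_gaussians and normal_pdf are ours). As the text warns, the iteration can converge to a local maximum, and degenerate solutions with a collapsing variance are possible.

```python
import numpy as np

def normal_pdf(s, mu, var):
    return np.exp(-0.5 * (s - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_two_gaussians(s, theta0, n_iter=200):
    """EM for the mixture (1 - lam) N(mu1, var1) + lam N(mu2, var2).

    E-step: responsibilities r_i = P(component 2 generated s_i | s_i, theta_old).
    M-step: weighted mixing proportion, means, and variances, which maximize
    J(theta | theta_old) for this model.
    """
    lam, mu1, mu2, var1, var2 = theta0
    s = np.asarray(s, dtype=float)
    for _ in range(n_iter):
        # E-step
        p1 = (1.0 - lam) * normal_pdf(s, mu1, var1)
        p2 = lam * normal_pdf(s, mu2, var2)
        r = p2 / (p1 + p2)
        # M-step
        lam = r.mean()
        mu1 = np.sum((1.0 - r) * s) / np.sum(1.0 - r)
        mu2 = np.sum(r * s) / np.sum(r)
        var1 = np.sum((1.0 - r) * (s - mu1) ** 2) / np.sum(1.0 - r)
        var2 = np.sum(r * (s - mu2) ** 2) / np.sum(r)
    return lam, mu1, mu2, var1, var2

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 0.5, 200)])
print(em_two_gaussians(data, theta0=(0.5, -1.0, 5.0, 1.0, 1.0)))
```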
4. On the other hand.4.3. 00ld are as defined earlier and S(X) (2.4." ) I S(X) ~ 8 .(X 18.14) whete r(· j ·. uold 0 r s. hence.4. Suppose {Po : () E e} is a canonical exponential family generated by (T.Section 2. I S(X) ~ s > 0 } (2.4. Fat x EX. Theorem 2.o log . then ..1.0) } ~ log q(s. Then (2.0) = O. However.00Id) by Shannon's ineqnality.3.O) { " ( X I 8.12) = s. 0) ~ q(s.13) q(8.4.h) satisfying the conditions a/Theorem 2. We give the proof in the discrete case.13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof.E.11)  D IOgq(8. Id log (X I .Onew) {r(X I 8 .0) (2. ~ E '0 (~I ogp(X . (2. (2.2. Lemma 2. J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew.0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s.4. S (x) ~ 8 p(x. DO The main reason the algorithm behaves well follows.O)r(x I 8. 0) I S(X) = 8 ) DO (2.4. (2. Because.1.16) ° q(s. Onew) > q(s. r(Xj8. O n e w ) } log ( " ) ~ J(Onew I 0old) . 0 0 = 0old' = Onew.4 Algorithmic Issues 135 to 0 at 00 .0 ) I S(X) ~ 8 . q S.17) o The most important and revealing special case of this lemma follows. uold Now. 0 ) +E.1.lfOnew. Lemma 2. DJ(O I 00 ) [)O and.10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation.15) q(s.Onew) E'Old { log r(X I 8.4. 0old)' Equality holds in (2. In the discrete case we appeal to the product rule.4. formally. Let SeX) be any statistic.4. the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion.
Proof. 1'I.20) has a unique solution.4. then it converges to a limit 0·.(x).136 (a) The EM algorithm consists of the alternation Methods of Estimation Chapter 2 A(Bnew ) = Eoold(T(X) I S(X) Bold ~ = s) (2. we see. • I E e (2Ntn + N'n I S) = 2Nt= + N. = Ee(T(X) I S(X) = s) ~ (2.18) (2.Bof Eoo(T(X) I S(X) = y) .B) 1) = log (1 = exp(1)(2Ntn (x) + N'n(x)) .18) exists it is necessarily unique.0)' 1.23) I .A(Bo)) I S(X) = s} (B .4. i i. A(1)) = 2nlog(1 + e") and N jn = L:~ t <ij(Xi).(1 . . 0 Example 2.4.4.B) 1 .+l (2.21) Part (a) follows. A'(1)) = 2nB (2. I . m + 1 < i < n) . after some simplification. I • Under the assumption that the process that causes lumping is independent of the values of the €ij.A('1)}h(x) (2.I<j<2 I r r: I I r! Pol<it 11 <it + <i' = 11 0' B' B' + 20(1 .4. i=m. 1 <j < 3.19) = Bnew · If a solution of(2. which is necessarily a local maximum J(B I Bo) I I = Eoo{(B .BO)TT(X) ..A(Bo)) (2.Pol<i.22) • ~ 0) . X is distributed according to the exponential family I where p(x.4.4.(A(O) .(A(O) . Thus.= +Eo ( t (2<" + <i') I <it + <i'.16.B).4 (continued).4. (b) lfthe sequence of iterates {Bm} so obtained is bounded and the equation A(B) o/q(s.25) I" . . 1 I • • Po I<ij = 1 I <it + <i' = OJ =  0. that.4. h(x) = 2 N i . A proof due to Wu (I 983) is sketched in Problem 2. In this case. (2. = 11 <it +<" = IJ. Now.4.24) " ! I . Part (b) is more difficult.4.
Example 2.4.6. The Bivariate Normal Distribution with Missing Data. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y) ~ N(mu_1, mu_2, sigma_1^2, sigma_2^2, rho), where theta = (mu_1, mu_2, sigma_1^2, sigma_2^2, rho). Suppose that some of the Z_i and some of the Y_i are missing as follows: for 1 <= i <= n_1 we observe both Z_i and Y_i, for n_1 + 1 <= i <= n_2 we observe only Z_i, and for n_2 + 1 <= i <= n we observe only Y_i. In this case a set of sufficient statistics for the complete data is
T_1 = Z-bar, T_2 = Y-bar, T_3 = n^{-1} sum Z_i^2, T_4 = n^{-1} sum Y_i^2, T_5 = n^{-1} sum Z_i Y_i,
and the observed data are
S = {(Z_i, Y_i) : 1 <= i <= n_1} union {Z_i : n_1 + 1 <= i <= n_2} union {Y_i : n_2 + 1 <= i <= n}.
We take theta_old = theta_MOM, where theta_MOM is the method of moments estimate of theta based on the observed data.
To compute E_theta(T | S = s), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude
E_theta(Y_i | Z_i) = mu_2 + rho sigma_2 (Z_i - mu_1)/sigma_1,
E_theta(Y_i^2 | Z_i) = [mu_2 + rho sigma_2 (Z_i - mu_1)/sigma_1]^2 + (1 - rho^2) sigma_2^2,
E_theta(Z_i Y_i | Z_i) = Z_i E_theta(Y_i | Z_i),
with the corresponding Z on Y regression equations when conditioning on Y_i (see the problems). This completes the E-step. For the M-step we use
E_theta T = (mu_1, mu_2, sigma_1^2 + mu_1^2, sigma_2^2 + mu_2^2, sigma_1 sigma_2 rho + mu_1 mu_2)
and find that the M-step produces
mu_1,new = T_1(theta_old), mu_2,new = T_2(theta_old),
sigma_1,new^2 = T_3(theta_old) - T_1^2(theta_old), sigma_2,new^2 = T_4(theta_old) - T_2^2(theta_old),
rho_new = [T_5(theta_old) - T_1(theta_old) T_2(theta_old)] / (sigma_1,new sigma_2,new),
where T_j(theta_old) denotes T_j with the missing values replaced by the conditional expected values computed in the E-step.
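Here is a minimal sketch of the E and M steps just described, under the assumption that the data are stored as two arrays with np.nan marking a missing coordinate and that every unit has at least one coordinate observed; the function name em_bivariate_normal is ours.

```python
import numpy as np

def em_bivariate_normal(z, y, theta0, n_iter=100):
    """EM for i.i.d. (Z, Y) ~ N(mu1, mu2, var1, var2, rho) with some Z's and
    some Y's missing (coded as np.nan), as in the example above.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    mu1, mu2, var1, var2, rho = theta0
    for _ in range(n_iter):
        s1, s2 = np.sqrt(var1), np.sqrt(var2)
        Ez, Ey = z.copy(), y.copy()
        Ez2, Ey2, Ezy = z ** 2, y ** 2, z * y
        miss_y = np.isnan(y)          # only Z observed
        miss_z = np.isnan(z)          # only Y observed
        # E-step: conditional moments of the missing coordinate given the observed one.
        ey = mu2 + rho * s2 * (z[miss_y] - mu1) / s1
        Ey[miss_y] = ey
        Ey2[miss_y] = ey ** 2 + (1.0 - rho ** 2) * var2
        Ezy[miss_y] = z[miss_y] * ey
        ez = mu1 + rho * s1 * (y[miss_z] - mu2) / s2
        Ez[miss_z] = ez
        Ez2[miss_z] = ez ** 2 + (1.0 - rho ** 2) * var1
        Ezy[miss_z] = y[miss_z] * ez
        # M-step: the moment formulas of the text applied to the completed statistics.
        T1, T2 = Ez.mean(), Ey.mean()
        T3, T4, T5 = Ez2.mean(), Ey2.mean(), Ezy.mean()
        mu1, mu2 = T1, T2
        var1, var2 = T3 - T1 ** 2, T4 - T2 ** 2
        rho = (T5 - T1 * T2) / np.sqrt(var1 * var2)
    return mu1, mu2, var1, var2, rho
```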
where 0 < B < 1.4. Summary. X n have a beta. . The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all oneparameter canonical exponential families with E open (when it exists). what is a frequency substitution estimate of the odds ratio B/(l.i' I' I i. Note that if S(X) = X.d. . (c) Suppose X takes the values 1. (3( 0'1. including the NewtonRaphson method. Remark 2. (a) Find the method of moments estimate of >. . Consider a population made up of three different types of individuals occurring in the HardyWeinberg proportions 02. X n assumed to be independent and identically distributed with exponential. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >.4.B)2.1. in the context of Example 2. X I. which as a function of () is maximized where the contrast logp(X.). (a) Show that T 3 = N1/n + N 2 /2n is a frequency substitution estimate of e.. Important variants of and alternatives to this algorithm. 2B(1 .0'2) distribution. j : > I) that one system 3. Find the method of moments estimates of a = (0'1. (b) Using the estimate of (a). based on the first moment. . in Section 2.P3 given by the HardyWeinberg proportions.4. involves imputing missing values. based on the second moment.1 1.2. ! I 2.4.138 Methods of Estimation Chapter 2 where T j (B) denotes T j with missing values replaced by the values computed in the Estep and T j = Tj(B o1d )' j = 1. use this algorithm as a building block for the general coordinate ascent algorithm.0'2) based on the first two moments. Finally in Section 2. By considering the first moment of X..5 PROBLEMS AND COMPLEMENTS Problems fnr Sectinn 2.B)? .3 and the problems. distributions. Suppose that Li. E"IJ(O I 00)] is the KullbackLeiblerdivergence (2. We then.0) and (I ..23). 0 Because the Estep. show that T 3 is a method of moment estimate of ()... 2. respectively. 0) is minimized. ~' ".. .2.. based On the first two moments.4. j. (d) Find the method of moments estimate of the probability P(X1 will last at least a month. Now the process is repeated with B MOM replaced by ~ ~ ~ ~ 0new.. which yields with certainty the MLEs in kparameter canonical exponential families with E open when it exists. Also note that. &(>. Consider n systems with failure times X I .B o)].. . are discussed and introduced in Section 2.0. then J(B i Bo) is log[p(X. the EM algorithm is often called multiple imputation.P2. (b) Find the method of moments estimate of>.6. in general.1 with respective probabilities PI. B)/p(X.4 we derive and discuss the important EM algorithm and its basic properties.2.
. of Xi n ~ =X. which we define by ~ F ~( s. The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants.Xn be a sample from a population with distribution function F and frequency function or density p.. < . . If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F).F(tk).12. (b) Exhibit method of moments estimates for VaroX = 8(1 .F(t. .. Let X(l) < . Yn ) be a set of independent and identically distributed random vectors with common distribution function F. Let Xl. 5.5. . YI )..8)ln first using only the first moment and then using only the second moment of the population. (d) FortI tk. Hint: Consi~r (N I .8. Y2 ). X 1l be the indicators of n Bernoulli trials with probability of success 8..t ) = Number of vectors (Zi. . (Z2.)).5 Problems and Complements 139 Hint: See Problem 8. 8.. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X). Let (ZI.2. = Xi with probability lin. . < ~ ~ Nk+1 = n(l  F(tk)). . 4. (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF.. .. See A. Yi) such that Zi < sand Yi < t n .. (See Problem B... The natural estimate of F(s.2. of Xi < xl/n... . Nk+I) where N I ~ nF(t l ). ~ Hint: Express F in terms of p and F in terms of P ~() X = No. ~ ~ ~ (a) Show that in the finite discrete case. < X(n) be the order statistics of a sample Xl... ~  7. we know the order statistics.) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that. Let X I. (Zn. t). The empirical distribution function F is defined by F(x) = [No.findthejointfrequencyfunctionofF(tl). . (a) Show that X is a method of moments estimate of 8. . . given the order statistics we may construct F and given P. Give the details of this correspondence. N 2 = n(F(t2J . .Section 2. .. .. Show that these estimates coincide. 6. empirical substitution estimates coincides with frequency substitution estimates. . X n . t) is the bivariate empirical distribution function FCs..
Y) = Z)(Yk . Hint: See Problem B.2). .. Y are the sample means of the sample correlation coefficient is given by Z11 •. and so on. z) is continuous in {3 and that Ig({3.!'r«(J)) I . . (a) Find an estimate of a 2 based on the second mOment. Zn and Y11 . 9. (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX."ZkYk . . " .. Show that the least squares estimate exists. Let X". the sample covariance.IU9) that 1 < T < L . = #.Yn . X n be LLd.. Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" .17. In Example 2. Hint: Set c = p(X. The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. 11.. k=l n  n .Xn ) where the X. ..1.' where Z. the result follows. as the corresponding characteristics of the distribution F.). Suppose X has possible values VI. .4.) Note that it follows from (A. .) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. (b) Define the sample product moment of order (i. 1". as X ~ P(J. with (J identifiable. (b) Construct an estimate of a using the estimate of part (a) and the equation a ... 0).• . I) = ". f(a.". L I.L. Show that the sample product moment of order (i. .2. are independent N"(O... {3) > c.2 with X iiI and /is..j) is given by ~ The sample covariance is given by n l n L (Zk k=l . (See Problem 2. the sampIe correlation. . >. find the method of moments estimate based on i 12. . .ZY. Suppose X = (X"". Vi).j2. (J E R d. {3) is continuous on K. suppose that g({3. There exists a compact set K such that for f3 in the complement of K. j).1. . . ~ i . I . Since p(X.140 Methods of Estimation Chapter 2 (a) Show that F(. : J • .81 tends to 00. p(X. respectively. In Example 2. 10. . z) I tends to 00 as \..
6.p:::'..p(x. . 1 < j < k. General method of moment estimates(!).. Consider the Gaussian AR(I) model of Example 1. IG(p..l Lgj(Xi ). Suppose X 1. Iii = n. (c) If J. A). Let g.36. .. r.10). In the fOllowing cases. .0) (ii) Beta. = h(PI(O). p fixed (v) Inverse Gaussian. ~ (X.1. Show that the method of moments estimate q = h(j1!.. . find the method of moments estimates (i) Beta... Hint: Use Corollary 1.d. 13.) to give a method of moments estimate of p. fir) can be written as a frequency plugin estimate.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . where U.5 Problems and Com". Vk and that q(O) for some Rkvalued function h. can you give a method of moments estimate of {3? . .6.i.l and (72 are fixed.1.Pr(O)) . When the data are not i. Let 911 . . . X n are i.. to give a method of moments estimate of (72. as X rv P(J. . .9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). Use E(U[). .x (iv) Gamma..  1/' Po) / .. (b) Suppose P ~ po and I' = b are fixed.0 > 0 14. 1'(0.i. (a) Use E(X. rep.. A).. > 0. 0 ~ (p.O) ~ (x/O')exp(x'/20'). in estimate.:::m:::. .6.1) (iii) Raleigh.5.:::nt:::' ~__ 141 for some R k valued function h.• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments. Suppose that X has possible values VI.. 0).. 1'(1. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1.Section 2. j i=l = 1. with (J E R d and (J identifiable. See Problem 1.(X) ~ Tj(X).d.
• l Yn are taken at times t 1... we define the empirical or sample moment to be j ~ . S = 1.S 1 . b....  t=1 liB = (0 1 .... _ (Z)' ' a.. and IF. . For a vector X = (Xl. 1 6 and let Pl ~ N j / n.. I . Let X = (Z..J is' _ n n. I..1 " Xir Xk J. .. Establish (2...g(l3o. Readings Y1 . = Y . .)] + [g(l3o... 1 .g(I3... Let 8" 8" and 83 denote the probabilities of S.1". and (J3....z.ZY ~ ._ . 1 I . In a large natural population of plants (Mimulus guttatus) there are three possible alleles S.4......83 8t Let N j be the number of plants of genotype j in a sample of n independent plants. Hint: [y. 16.•• Om) can be expressed as a function of the moments. Y) and (ai.... SF.. .142 Methods of Estimation Chapter 2 15. of observations. 17. 1 q. I I . and F at one locus resulting in six genotypes labeled SS. I. I • . II....2 1. 1"... 83 6 IF 28.. . let the moments be mjkrs = E(XtX~).g(I3. bI) are as in Theorem 1. and F. k> Q mjkrs L. respectively. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28. HardyWeinberg with six genotypes.3..) .. .8.. j > 0. ' l X q ).. the method of moments l by replacing mjkrs by mjkrs. P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ. FF.)]. . where (Z.1.~b. where OJ = 1. Q..  1. i = 1. n 1  Problems for Sectinn 2. 5 SF 28. ....)] = [Y. n. 11P2 + "2P4 + '2P6 .. )..z.. . t n on the position of the object. .... k > 0....z.6).! .. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z. Multivariate method a/moments.. + '2P4 + 2PS .z.bIZ. For independent identically distributed Xi = (XiI.. The reading Yi ... . Show that <j < i ! ..Y. Y) and B = (ai. . An Object of unit mass is placed in a force field of unknown constant intensity 8.. = n 1 L:Z. I . ... ()2."" X iq ). SI.q. >.
.z. _. .. .4)(2.<) ~ .. B. in fact.1..4. B) = Be'x.. .. < O..BJ . Zn and that the linear regression model holds.2.. the range {g(zr.6) under the restrictions BJ > 0. all lie on aline.Section 2.5) provided that 9 is differentiable with respect to (3" 1 < i < d.Yi).4. Hint: The quantity to be minimized is I:7J (y. " 7 5.. (J2) variables. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants. X n denote a sample from a population with one of the following densities Or frequency functions. Find the line that minimizes the sum of the squared perpendicular distance to the same points. 2 10.5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l . (a) Let Y I .. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. x > 0.B. .2. i = 1.(zn.t of the best (MSPE) predictor of Yn + I ? 4. The regression line minimizes the sum of the squared vertical distances from the points (Zl. 13). nl and Yi = 82 + ICi. x > c.13 E R d } is closed. . . Find the least squares estimates of ()l and B .. Find the LSE of 8. Suppose that observations YI .y.3. Yn have been taken at times Zl... .We suppose the oand be uncorrelated with constant variance. if we consider the distribution assigning mass l/n to each of the points (zJ. and 13 ranges over R d 8. nl + n2. . . . . Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi. 9.. c cOnstant> 0. What is the least squares estimate based on YI . ~p ~(yj}) ~ . . B > O. (Pareto density) . . Show that the fonnulae of Example 2. (a) f(x. .Yn). Show that the least squares estimate is always defined and satisfies the equations (2.)' 1+ B5 6. 13). i + 1. Hint: Write the lines in the fonn (z. B > O. . (exponential density) (b) f(x. where €nlln2 are independent N(O.. .I is to be taken at time Zn+l. ." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi. Y. . g(zn.2. . Find the least squares estimate of 0:.). 7.n. i = 1. Yl). to have mean 2. Let X I... Yn). 3. .. Find the least squares estimates for the model Yi = 8 1 (2... Suppose Yi ICl". B) = Bc'x('+JJ. . . (zn..2 may be derived from Theorem 1. Find the MLE of 8. A new observation Ynl..
Hint: You may use Problem 2. (Pareto density) (d) fix. for each wEn there is 0 E El such that w ~ q( 0). 0) = (x/0 2) exp{ _x 2/20 2 }. reparametrize the model using 1]). 0 > O. X n • n > 2. I • :: I Suppose that h is a onetoone function from e onto h(e). iJ( VB. > O. 12. 0 E e and let 0 denote the MLE of O.X)" > 0. . ry) denote the density or frequency function of X in terms of T} (Le.  e . Show that depends on X through T(X) only provided that 0 is unique. 0) = Ocx c .2.:r > 0. I i' 13. Show that the MLE of ry is h(O) (i.0"2 (a) Find maximum likelihood estimates of J. (Rayleigh density) (f) f(x... I I I (b) LetP = {PO: 0 E e}. . Pi od maximum likelihood estimates of J1 and a 2 .t and a 2 are both known to be nonnegative but otherwise unspecified. c constant> 0.. be independently and identically distributed with density f(x. q(9) is an MLE of w = q(O). 00 = I 0" exp{ (x .) ! ! !' ! 14. the MLE of w is by definition ~ W.O) where 0 ~ (1". density) (e) fix. MLEs are unaffected by reparametrization. (We write U[a. .I")/O") . . Hint: Use the factorization theorem (Theorem 1. = X and 8 2 ~ n.l exp{ OX C } . is a sample from a N(/l' ( 2 ) distribution. X n . then {e(w) : wEn} is a partition of e.1). Because q is onto n.P > I. bl to make pia) ~ p(b) = (b . (beta.. 0> O. 0) = VBXv'O'. Show that if 0 is a MLE of 0.r(c+<). . Suppose that Xl. Hint: Let e(w) = {O E e : q(O) = w}. 0 <x < I. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. they are equivariant 16. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1". (Wei bull density) 11. ncR'.0"2) 15. I < k < p.: I I \ . Let X I..  under onetoone transformations).144 Methods of Estimation Chapter 2 (c) f(".. e c Let q be a map from e onto n. Thus. . x > 0.II ".2 are unknown. n > 2. c constant> 0. Let X" . Define ry = h(8) and let f(x. J1 E R. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. ~ I in Example 2.0"2).16(b). 0 > O. x > 1".Pe. .5 show that no maximum likelihood estimate of e = (1". . be a family of models for X E X C Rd.1.. (72 Ii (a) Show that if J1 and 0. (a) Let X . I).. then the unique MLEs are (b) Suppose J.5. say e(w).a)l rather than 0.t and a 2 . 0 > O. . < I" < 00.l L:~ 1 (Xi . WEn . Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O.e. 0) = cO". x> 0. X n be a sample from a U[0 0 + I distribution. If n exists. and o belongs to only one member of this partition.
Xn be independently distributed with Xi having a N( Oi. . •• .B) = 1. k=I. and have commOn frequency function. < 00..~1 ' Li~1 Y. X 2 = the number of failures between the first and second successes.O") : 00 < fl..2. 1) distribution. A general solution of this and related problems may be found in the book by Barlow. 20. if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods.. . X n ) is a sample from a population with density f(x.X I = B(I.B) =Bk .) Let M = number of indices i such that Y i = r + 1. .. (KieferWolfowitz) Suppose (Xl.[X ~ k] ~ Bk . 17.0") E = {(I'. In the "life testing" problem 1.B).. a model that is often used for the time X to failure of an item is P.1 (I_B).. (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. Suppose that we only record the time of failure..Section 2. and so on.  .[X < r] ~ 1 LB k=l k  1 (1_ B) = B'." Y: .r f(r + I. (We denote by "r + 1" survival for at least (r + 1) periods. and Brunk (1972). . we observe YI . 1 <i < n. . find the MLE of B. Derive maximum likelihood estimates in the following models. 16(i). but e . identically distributed. . Y n IS _ B(Y) = L. Bartholomew. (a) The observations are indicators of Bernoulli trials with probability of success 8. Yn which are independent. Bremner. Show that the maximum likelihood estimate of 0 based on Y1 .. (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely. f(k. Show that maximum likelihood estimates do not exist. If time is measured in discrete periods. B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'. Let Xl. Thus.... 19.P. Censored Geometric Waiting Times. We want to estimate B and Var. We want to estimate O.1 (1 .6.. k ~ 1. 21.n . . . in a sequence of binomial trials with probability of success fJ.. .5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O). .. where 0 < 0 < 1. 0 < a 2 < oo}. (b) The observations are Xl = the number of failures before the first success. M 18.B).
" Yn be two independent samples from N(J. where [t] is the largest integer that is < t. .2 3. distribution.. N.8 14. 1985). Polynomial Regression.1.2 4.0"2) if. 1 1 1 o o .2.8 2. x)/ L(b.4 o o o o o o o o o o o o o Life 20. Tool life data Peed 1 Speed 1 1 Life 54. (1') is If = (X.x n .8 3.I' fl.jp): 0 <j.4)(2. 7 . Weisberg.. .J. respectively.a 2 ) = Sllp~. where'i satisfy (2. Xl..146 Methods of Estimation Chapter 2 that snp. . Suppose Y. and only if. Set zl ~ .t.' wherej E J andJisasubsetof {UI . J2 J2 o o o I 3. n).Y)' i=1 j=1 /(rn + n). Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer.2.2 4.tl.z!.1 2.0 0.. and assume that zt'·· . i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).8 0.a 2 ) and NU". 1 <k< p}.0 11. Let Xl. Show that the MLE of = (fl. = fl.Xm and YI . (1') populations. Y.X)' + L(Y.5 I' I I . TABLE 2.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3. the following data were obtained (from S.6). Ii equals one of the numbers 22.p(x. x) as a function of b. II I I :' 24.0 5.0 2. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise..ji.< J. ..5 86.0 I .'.(Zi) + 'i.6.up(X.Xj for i =F j and that n > 2. (1') where e n ii' = L(Xi . . Suppose X has a hypergeometric. 23... . Hint: Coosider the mtio L(b + 1. Assume that Xi =I=.5 66.9 3. .5 0. . 1i(b.
(P) = Cov(Z'. (12 unknown.(P) and p.13)/6.(P)Z of Y is defined as the minimizer of t = zT {3. (2. Derive the weighted least squares nonnal equations (2. Let (Z. Being larger. .1). .y)/c where c = I Iv(z. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix. ~.6.6)). this has to be balanced against greater variability in the estimated coefficients.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models. . Both of these models are approximations to the true mechanism generating the data.(b l + b.2. (b) Let P be the empirical probability . l/Var(Y I Z = z).Section 2.Y = Z D{3+€ satisfy the linear regression mooel (2. ZD = Wz ZD and € = WZ€.1..4)(2. y).2. (a) The parameterization f3 (b) ZD is of rank d. Y)Y') are finite. z) ing are equivalent.1. Y)Z') and E(v(Z.y) defined .Z)]'). Y')/Var Z' and pI(P) = E(Y*) 11.900)/300. z.2. Show that (3.5) for (a) and (b).6) with g({3. That is. Use these estimated coefficients to compute the values of the contrast function (2.y)!(z.. = (cutting speed .1). E{v(Z. Consider the model (2. Y) have joint probability P with joint density ! (z. 25. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W. 26. (c) Zj. in Problem 2.6) with g({3. However.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life. Y)[Y .2.4)(2..y)dzdy. weighted least squares estimates are plugin estimates. Show that p. 28.19).3.2.  27.. Show that the follow ZDf3 is identifiable. Set Y = Wly. let v(z.z. . This will be discussed in Volume II. z) ~ Z D{3.2. The best linear weighted mean squared prediction error predictor PI (P) + p.Y') have density v(z.2. y) > 0 be a weight funciton such that E(v(Z. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +.ZD is of rank d.(P) coincide with PI and 13.y)!(z. Let W. ZI ~ (feed rate ..(P)E(Z·). (2.l be a square root matrix of W. Two models are contemplated (a) Y ~ 130 + pIZI + p.8 and let v(z. of Example 2. (a) Let (Z'. . the second model provides a better approximation.1 (see (B. (a) Show tha!.2.
071 0. .716 7.2.6)T(y . _'.856 2.156 5.968 9. = nq = 0.. 4 (b) Show that Y is a multivariate method of moments estimate of p.155 4..788 2.182 0.703 4.(Y .~=l Z.) ~ ~ (I' + 1'. 0) = II j=q+l k 0.. j = 1.564 1.916 6.301 2. . 1'. suppose some of the nj are zero.860 5..968 2... i = 1. (See Problem 2.300 6.666 3. . That is.'.019 6.665 2. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.229 4.2.6) is given by (2.360 1.131 5.455 9. \ (e) Find the weighted least squares estimate of p.. different from Y? I I TABLE 2.379 2.599 0. .1.393 3.457 2. .918 2. . i = 1.ZD. Chon) shonld be read row by row.+l I Y . .091 3..582 2. 31. . <J > 0. In the multinomial Example 2. .6) T 1 (Y .. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.834 3.669 7.075 4.093 1. Let ei = (€i + {HI )/2.038 3.. I Yi is (/1 + Vi).1.908 1. /1i + a}.397 4.511 3.ZD.476 3. Yn are independent with Yi unifonnly distributed on [/1i . The data (courtesy S.6) W...8.20).689 4.€n+l are i.392 4.958 10. then the  f3 that minimizes ZD. . ..196 2. nq+1 > p(x.058 3. (a) Show that E(1'.a.. 2.870 30.17.093 5.971 0.).053 4. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay..081 2. J.i.148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d.054 1.391 0. k. where 1'. = (Y  29..039 9. = L. Show that the MLE of .611 4.249 1. .046 2.5.723 1. Consider the model Yi = 11 + ei.nk > 0.480 5. which vanishes if 8 j = 0 for any j = q + 1. with mean zero and variance a 2 • The ei are called moving average errors.. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI ..274 5.d. ' I (f) The following data give the elapsed times YI .) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e.858 3. n. .453 1.114 1.. Hint: Suppose without loss of generality that ni 0.jf3j for given covariate values {z.020 8.921 2. . Is ji. k..097 1. Suppose YI . where El" .j}. .676 5.100 3. n.ZD. Then = n2 = .064 5. ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In.
) Suppose 1'. then the MLE exists and is unique. (a) Show that the MLE of ((31.. ..32(b).5 Problems and Complements 149 (f3I' . If n is even.. These least absolute deviation estimates (LADEs). = L Ix. .. a) is obtained by finding (31.1. where Iii = ~ L:j:=1 ZtjPj· 32.. be the observations and set II = ~ (Xl . Yn ordered from smallest to largest.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. . Find the MLE of A and give its mean and variance. . . 1).(3p that minimizes the least absolute deviation contrast function L~ I IYi . . .  Illi < 1.Jitl./3p that minimizes the maximum absolute value conrrast function maxi IYi .8).7 with Y having the empirical distribution F. Yn are independent with Vi having the Laplace density 1 2"exp {[Y... . (a) Show Ihat if Ill[ < 1. and x.i.Ili I and then setting (j = n~I L~ I IYi . be i. 35. XHL i& at each point x'. Let g(x) ~ 1/". let Xl and X. (See (2.. = I' for each i.I/a}.3. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass . the sample median yis defined as ~ [Y(... where Jii = L:J=l Zij{3j.& at each Xi.O) Hint: See Problem 2.x..d..).) + Y(r+l)] where r = ~n. + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x . .l1..:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x. x E R. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). Hint: Use Problem 1. Show that the sample median ii is the minimizer of L~ 1 IYi . a)T is obtained by finding 131. 8 E R.. ..6. . F)(x) *F denotes convolution. ~ f3I.i. y)T where Y = Z + v"XW. (3p.{:i.. Z and W are independent N(O. Let x. .I'. 34.4. A > 0. . .{3p.d..17). .28Id(F. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l). with density g(x . . i < j.fin are called (b) If n is odd. ~ 33. Suppose YI .Section 2. as (Z.d./Jr ~ ~ and iii.. Let X. . Show that XHL is a plugin estimate of BHL. Y( n) denotes YI . Let B = arg max Lx (8) be "the" MLE..(1 + x'j. be i.2.J.. Give the MLE when .tLil and then setting a = max t IYi . " . . be the Cauchy density. Hint: See Example 1..
. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . . h I (ry») denote the density or frequency function of X for the ry parametrization. and 5. Let X ~ P" B E 8. 1 n + m. B) denote their joint density.j'. symmetric about 0. Show that (a) The likelihood is symmetric about ~ x. 1985).) be a random samplefrom the distribution with density f(x. Sl.B) g(x + t> . . B) and pix.. P(nA). (7') and let p(x. 9 is twice continuously differentiable everywhere except perhaps at O. . 38. = I ! I!: Ii. Let Xl and X2 be the observed values of Xl and X z and write j. a < b. . n.. Let (XI. Let Vj and W j denote the number of hits of type 1 and 2 on day j. Bo) is tn( B. j = n + I.x2)/2. > h(y) . . (a. 37.150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. then the MLE is not unique. Show that the entropy of p(x. The likelihood function is given by Lx(B) g(xI .. such that for every y E (a. Define ry = h(B) and let p' (x.. i = 1. o Ii !n 'I . N(B.0).B)g(x.. (b) Either () = x or () is not unique. and 8 2 are independent. then B is not unique. and positive everywhere. ~ Let B ~ arg max Lx (B) be "the" MLE. .ryt» denote the KullbackLeibler divergence between p( x. BI ) (p' (x. that 8 1 E. 9 is continuous. Let Xi denote the number of hits at a certain Web site on day i..B I ) (K'(ryo.Xn are i. Let 9 be a probability density on R satisfying the following three conditions: I.h(y) .f)) in the likelihood equation. ~ (d) Use (c) to show that if t> E (a.B)g(i .\2 = A. .b). where Al +. Ii I.. 3.. then h"(y) > 0 for some nonzero y.i.+~l Wj have P(rnAl) and P(mA2) distributions. . . Assume :1.B). .t> . On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making).. ryl Show that o n ». 2. ~ (XI + x. ryo) and p' (x. B) = g(xB). Suppose X I. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev.h(y . .B )'/ (7'. If we write h = logg. Find the MLEs of Al and A2 hased on 5. ry) = p(x. distribution. B) is and that the KullbackLiebler divergence between p(x. Assume that 5 = L:~ I Xi has a Poisson.)/2 and t> = (XI . 36.+~l Vi and 8 2 = E.d. B ) and p( X. b). 51. . Also assume that S. 'I i 39. Let K(Bo.. where x E Rand (} E R. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . X. Suppose h is a II function from 8 Onto = h(8).
7) for a. n.. = L X. a get.1'. 6.0). . 1'). [3. h(z.}. For the ca.. . n > 4. and g(t. /3. . . Aj) + Ei. Let Xl.I[X. 1 En are uncorrelated with mean 0 and variance estimating equations (2.) Show that statistics.j ~ x <0 1. n where A = (a.5. [3. . .. 1). .. Give the least squares where El" •.olog{1 + exp[[3(t. I' E R.5 Problems and Complements 151 40. (.  J1. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism.c n are uncorrelated with mean zero and variance (72. . = LX. An example of a neural net model is Vi = L j=l p h(Zij. where OJ > O. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. on a population of a large number of organisms. 0). . i = 1.p. j 1 1 cxp{ x/Oj}. a > 0. i = 1. < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. .)/o]}" (b) Suppose we have observations (t I .I[X. [3. 1959. . 1 X n satisfy the autoregressive model of Example 1. a. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0. T. [3 > 0. ... X n be a sample from the generalized Laplace distribution with density 1 O + 0. yd 1 ••• .. A) = g(z. j = 1..O) ~ a {l+exp[[3(tJ1. Suppose Xl. 41. (tn.~t square estimating equations (2. Yn). and 1'. 42.1. 0 > O. o + 0. .Section 2..)/o)} (72. and fj.2. 1'. x > 0. give the lea. where 0 (a) Show that a solution to this equation is of the form y (a. Seber and Wild.~e p = 1. [3. > OJ and T.1. + C.. . Variation in the population is modeled on the log scale by using the model logY. and /1. exp{x/O.7) for estimating a.1. = log a .
3. Let = {(01. fi) = a + fix. L x.l. C2' y' 1 1 for (a) Show that the density of X = (Xl. Consider the HardyWeinberg model with the six genotypes given in Problem 2. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p.x.) Then find the weighted least square estimate of f. .l. write Xi = j if the ith plant has genotype j.'" £n)T of autoregression errors.1 > _£l.152 Methods of Estimation Chapter 2 (aJ If /' is known. I < j < 6. I ... = L(CI + C.p). Suppose Y I . S.3. > 0. Hint: Let C = 1(0). . In a sample of n independent plants.(8 .. Hint: I .)I(c.. There exists a compact set K c e such that 1(8) < c for all () not in K. . Yn) is not a sequence of 1's followed by all O's or the reverse. _ C2' 2.::. Let Xl. 3. . = OJ. . find the covariance matrix W of the vector € = (€1.. Prove Lenama 2. gamma. X n be i. (b) Show that the likelihood equations are equivalent to (2...0. . (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. Give details of the proof or Corollary 2. (X. 1 Xl <i< n..3. a. • II ~.) 1 II(Z'i /1)2 (b) If j3 is known. = 1] = p(x"a. the bound is sharp and is attained only if Yi x. n i=l n i=l n i=1 n i=l Cl LYi + C..J.d.4) and (2. . .15. Under what conditions on (Xl 1 .02): 01 > 0. n > 2. .x. .1.). + 8..i. + 0.5).3.. r(A..fi) = 1log p pry. t·. 0 for Xi < _£I.3 1. Is this also the MLE of /1? Problems for Section 2. fi exists iff (Yl . show that the MLE of fi is jj =  2. < Show that the MLE of a.y. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4. . '12 = A and where r denotes the gamma function.Xi)Y' < L(C1 + c.. = If C2 > 0... This set K will have a point where the max is attained. < I} and let 03 = 1. < .l  L: ft)(Xi /. .0 .. Xn· Ip (x.. Yn are independent PlY. . + C1 > 0)..
.en. 1 <j< k1. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 ... fe(x) wherec.1 to show that in the multinomial Example 2. In the heterogenous regression Example 1. . Let XI. .. _.. if n > 2.5 Problems and Complements 153 n 6. " (a) Show that if Cl: > ~ 1. . 12. . 0 < Zl < .3.) ~ Ail = exp{a + ilz. . J.z=7 11.fl..1} 0 OJ = (b) Give an algorithm such that starting at iP = 0.0).1). C! > 0..3. .. • It) w' ( Xi It) . the MLE 8 exists but is not unique if n is even. Let Xl. Let Y I . convex set {(Ij. .0). (O. > 3. distribution where Iti = E(Y.10 with that the MLE exists and is unique.Xn E RP be i.Ld. e E RP.lkIl :Ij >0. < Zn Zn..O.' . with density. (0. the MLE 8 exists and is unique. 0:0 = 1. Zj < . . MLEs of ryj exist iffallTj > 0. . Suppose Y has an exponential. Hint: If BC has positive volume.}.n) are the vertices of the :Ij <n}. .8) . .1 <j < ac k1.. and Zi is the income of the person whose duration time is ti.d. < Zn. w( ±oo) = 00. (3) Show that. show 7. ji(i) + ji.Section 2. Yn denote the duration times of n independent visits to a Web site.6. ...ill T exists and is unique. 10.3. X n be Ll.3. Prove Theorem 2.. (b) Show that if a = 1.n..0 < ZI < .l E R.1 (a) ~ ~ c(a) exp{ Ix . 9.6.3.t). Show that the boundary of a convex C set in Rk has volume 0.'~ve a unique solution (. See also Problem 1..0.. 8. Hint: The kpoints (0.. . [(Ai). Use Corollary 2. and assume forw wll > 0 so that w is strictly convex. ~fo (X u I. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. n > 2.9.. . a(i) + a. . < Show that the MLE of (a... But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i . Then {'Ij} has a subsequence that converges to a point '10 E f.A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00. then it must contain a sphere and the center of the sphere is an interior point by (B.40.
' ! (a) In the bivariate nonnal Example 2. 0'1 Problems for Section 2. (b) If n > 5 and J11 and /12 are unknown. (i) If a strictly convex function has a minimum. > 0.J1. b = : and consider varying a. Show that if T is minimal and t: is open and the MLE doesn't exist. /12.. (I't') IfEPD 8a2 a2 D > 0 an d > 0. b successively.9). . O'?. (a) Show that the MLEs of a'f. En i. Hint: (b) Because (XI.b) . Chapter 10. (See Example 2. 3. See.'2).1)". it is unique.4. provided that n > 3.. .154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I. . complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2.3. 1985.3...4 1. pal0'2 + J11J. . (b) Reparametrize by a = .6.J1. and p = [~(Xi ... {t2. rank(ZD) = k. 2. . 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex. respectively. b) = x if either ao = 0 or 00 or bo = ±oo.4. = (lin) L:.2)/M'''2] iI I'.4.(Yi .2)2.) Hint: (a) Thefunction D( a.:.2.l and CT. p) population.b)_(ao.1.n log a is strictly convex in (a. O'~.2 are assumed to be known are = (lin) L:. I (Xi . for example. Note: You may use without proof (see Appendix B..i.d. then the coordinate ascent algorithm doesn't converge to a member of t:. b) and lim(a.. "i . ~J ' . Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <. Golnb and Van Loan. N(O. ar 1 a~. show that the estimates of J11. O'~ + J1~. and p when J1. O'~ + J1i. P coincide with the method of moments estimates of Problem 2. (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations. verify the Mstep by showing that E(Zi I Yi). Apply Corollary 2. J12. IPl < 1. EeT = (J11. Yn ) be a sample from a N(/11. . ag.:. b) ~ I w( aXi .8. . 8aob :1 13.J1.)(Yi . (Xn .. E" . il' i. EM for bivariate data. Y.J1.2). Let (X10 Yd.) has a density you may assume that > 0.1 and J1.bo) D(a.. W is strictly convex and give the likelihood equations for f. L:.) I .6. .
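For the coordinate ascent problem in the Gaussian linear model, maximizing the likelihood over the regression coefficients with the variance fixed is the same as minimizing the residual sum of squares, and cycling through the coordinates of beta is exactly the Gauss-Seidel iteration for the normal equations. A small sketch under that reading (toy data, illustrative only):

import numpy as np

def coordinate_ascent_ls(Z, y, n_sweeps=100):
    # cycle through the coordinates of beta; each update minimizes the residual
    # sum of squares in that coordinate with the other coordinates held fixed
    n, p = Z.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - Z @ beta + Z[:, j] * beta[j]        # partial residual ignoring coordinate j
            beta[j] = Z[:, j] @ r / (Z[:, j] @ Z[:, j])
    return beta

# toy check against the closed-form least squares solution
rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 3))
y = Z @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(50)
print(coordinate_ascent_ls(Z, y))
print(np.linalg.lstsq(Z, y, rcond=None)[0])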
(c) Give explicitly the maximum likelihood estimates of Jt and 5.B)n} _ nB'(I. Suppose that in a family of n members in which one has the disease (and. when they exist. Do you see any problems with >. I' >.[lt ~ OJ.Ii). Yd : 1 < i family with T afi I a? known.B)[I. (a) Justify the following crude estimates of Jt and A.B)"]'[(1 .4. Suppose the Ii in Problem 4 are not observed. Let (Ii.n. and given II = j. .I ~ + (1  B)n] . Y1 '" N(fLl a}).{(Ii. 8new . i ~ 1'. is distributed according to an exponential ryl < n} = i£(1 I) uzuy.1) x Rwhere P. 1') E (0. and that the (b) Use (2.. 2:Ji) . For families in which one member has the disease. where 8 = 80l d and 8 1 estimate of 8. the model often used for X is that it has the conditional distribution of a B( n.. B E [0... . j = 0.[lt ~ 1] = A ~ 1 . Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0". B) variable. it is desired to estimate the proportion 8 that has the genetic trait.3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I.(I.? (b) Give as explicitly as possible the E.5 Problems and Complements 155 4.(1 _ B)n]{x nB .O"~). as the first approximation to the maximum likelihood . 1]. given X>l. Hint: Use Bayes rule. be independent and identically distributed according to P6._x(1. 2 (uJ L }iIi + k 2: Y. Because it is known that X > 1. 1 < i < n. X is the number of members who have the trait. I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique. thus.[1 . 6. ~2 = log C\) + (b) Deduce that T is minimal sufficient.P.2B)x + nB'] _. 1 and (a) Show that X . B= (A.. IX > 1) = \ X ) 1 (1 6)" .Section 2. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it.x = 1. Y'i).(1 ..and Msteps of the EM algorithm for this problem. also the trait).B)n[n .
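The E- and M-steps asked for above, with unobserved Bernoulli(lambda) indicators and conditionally normal observations, are those of a two-component Gaussian mixture. The sketch below is generic, not the problem's intended solution: it updates all parameters, and if some variances are known in the problem, the corresponding update lines would simply be dropped.

import numpy as np

def normal_pdf(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def em_two_component(y, lam, mu0, mu1, sd0, sd1, n_iter=200):
    # EM for: I_i ~ Bernoulli(lam) unobserved, Y_i | I_i = j ~ N(mu_j, sd_j^2)
    for _ in range(n_iter):
        # E-step: posterior probability that I_i = 1 given y_i
        p1 = lam * normal_pdf(y, mu1, sd1)
        p0 = (1.0 - lam) * normal_pdf(y, mu0, sd0)
        w = p1 / (p0 + p1)
        # M-step: weighted maximum likelihood updates
        lam = w.mean()
        mu1, mu0 = np.average(y, weights=w), np.average(y, weights=1.0 - w)
        sd1 = np.sqrt(np.average((y - mu1) ** 2, weights=w))
        sd0 = np.sqrt(np.average((y - mu0) ** 2, weights=1.0 - w))
    return lam, mu0, mu1, sd0, sd1

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0.0, 1.0, 70), rng.normal(3.0, 1.0, 30)])
print(em_two_component(y, 0.5, -1.0, 1.0, 1.0, 1.0))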
9. b1 c and then are . ~ 0. hence. (1) 10gPabc = /lac = Pabe.. (h) Show that the family of distributions obtained by letting 1'.9) = ". Let and iJnew where ).1) + C(A + B2) = C(A + B1) . (c) Show that the MLEs exist iff 0 < N a+c . Show that the sequence defined by this algorithm converges to the MLE if it exists. Let Xl. where X = (U. the sequence (11ml ijm+l) has a convergent subse quence. . Define TjD as before. v < 00. X Methods of Estimation Chapter 2 = 2.. PIU = a. (a) Suppose for aU a.2 noting that the sequence of iterates {rymJ is bounded and.156 (c) [f n = 5. . W). * maximizes = iJ(A') t.X 2 . Hint: Apply the argument of the proof of Theorem 2.c)} and "+" indicates summation over the ind~x. N +bc given by < N ++c for all a. 1 <c < C and La.4. . n N++ c ++c  . N . 1 < b < B. W = c] 1 < a < A. + Vbc where 00 < /l.9)')1 Suppose X.v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i.1 generated by N++c. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a. V.riJ(A) .A(iJ(A)).d.c Pabc = l. X. 8. X 3 = a. c]' Show that this holds iff PIU ~ a.X3 be independent observations from the Cauchy distribution about f(x. .2.b. 7.b.i.4. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case.N+bc where N abc = #{i : Xi = (a. = 1. X n be i. v vary freely is an exponential family of rank (C . Consider the following algorithm under the conditions of Theorem 2. iff U and V are independent given W.e. find (}l of (b) above using (} =   x/n as a preliminary estimate. N++ c N a +c N+ bc Pabc = . V = b. b1 C. .Na+c. Let Xl.1 (1 + (x e.
c' obtained by fixing the "b. (a) Show that this is an exponential family of rank A + B + C . = fo(x  9) where fo(x) 3'P(x) + 3'P(x .b'. Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~. c" and "a. 12.I)(C .(x) :1:::: 1(S(x) ~ = s).=_ L ellalblp~~~'CI a'. 10. N±bc .a) 1 2 .. but now (2) logPabc = /lac + Vbc + 'Yab where J1.c. f vary freely. v. Justify formula (2.8). Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=.and Msteps of the EM algorithm in this case. Suppose X is as in Problem 9.4. Show that the algorithm converges to the MLE if it exists and di· verges otheIWise.I)(C .I) ~ AB + AC + BC . (a) Show that S in Example 2.I) + (B .5 has the specified mixtnre of Gaussian distribution.N++c/A. c" parameters. Hint: P.4.:.(A + B + C).[X = x I SeX) = s] = 13. Let f.1) + (A . (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. 11.Section 2.5 Problems and Complements 157 Hint: (b) Consider N a+ c .1)(B ..3 + (A .a>!b". (b) Give explicitly the E.N++ c / B. Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations.
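The "proportional fitting" cycle described above rescales the current fitted table so that its ab, ac and bc margins match the data margins in turn. One way to organize that iteration is sketched below; the array layout, starting table and toy counts are assumptions for illustration.

import numpy as np

def proportional_fitting(N, n_cycles=50):
    # iterative proportional fitting: rescale the current fitted table so that
    # its ab, ac and bc margins match the corresponding data margins in turn
    n = N.sum()
    p = np.full(N.shape, 1.0 / N.size)         # assumed starting table
    for _ in range(n_cycles):
        for axis in (2, 1, 0):                 # sum out c (ab margin), b (ac), a (bc)
            target = N.sum(axis=axis) / n
            margin = p.sum(axis=axis)
            ratio = np.divide(target, margin, out=np.zeros_like(target), where=margin > 0)
            p = p * np.expand_dims(ratio, axis=axis)
    return p

N = np.array([[[10., 5.], [3., 7.]],           # toy 2 x 2 x 2 table of counts
              [[6., 2.], [4., 9.]]])
p_fit = proportional_fitting(N)
print(p_fit.sum())                                               # = 1
print(np.allclose(p_fit.sum(axis=0), N.sum(axis=0) / N.sum()))   # bc margin is matched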
14. Ii . and independent of El.158 Methods of Estimation Chapter 2 and r. If /12 = 1. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j . {31 1 a~. given X . If Vi represents the seriousness of a disease..) underpredicts Y. 15. Verify the fonnula given in Example 2. al = a2 = 1 and p = 0. I Z.6. . Bm+d} has a subsequence converging to (0" JJ*) and. given Zi.4. necessarily 0* is the global maximizer. the "missingness" of Vi is independent of Yi. Limitations a/the EM Algorithm. Establish part (b) of Theorem 2.n}. this assumption may not be satisfied. R. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2. the probability that Vi is missing may depend on Zi. Show for 11 = 1 that bisection may lead to a local maximum of the likelihood. (2) The frequency plugin estimates are sometimes called Fisher consistent. Establish the last claim in part (2) of the proof of Theorem 2. = {(Zil Yi) Yi :i = 1. suppose Yi is missing iff Vi < 2. These considerations lead essentially to maximum likelihood estimates. the process determining whether X j is missing is independent of X j . Zl.4.4.{Xj }. EM and Regression." .i. 17. {32). N(ftl..~ go. . (J*) and necessarily g.. . That is. • . . Hint: Show that {( 8m . This condition is called missing at random.2. For example.. En a an 2.4.3 for the actual MLE in that example."" En.1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss.p is the N(O} '1) density. Noles for Section 2. ~ 16. if a is sufficiently large. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). .6 . A. a 2 .d.5. Bm + I)} has a subsequence converging to (B* . in Example 2. we observe only}li.3. 2 ). For instance.. Hint: Show that {(19 m.d. Hint: Use the canonical nature of the family and openness of E. Zn are ii. In Example 2. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. thus. Complete the E. That is. ! ..4. For X .5. N(O. where E) . find the probability that E(Y. suppose all subjects with Yi > 2 drop out of the study. 18.and Msteps of the EM algorithm for estimating (ftl.. . NOTES . .6. ~ . consider the model I I: ! I I = J31 + J32 Z i + Ei I are i. I I' . but not on Yi.
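For the EM-and-regression problem, with the responses for the last cases missing at random given the covariates, the E-step needs only the conditional mean and variance of each missing Y_i, and the M-step is an ordinary least squares refit plus a variance update. A hedged sketch of one such iteration (variable names are illustrative, and this is one way to organize the computation, not the text's solution):

import numpy as np

def em_regression_missing_y(z, y_obs, n_iter=100):
    # Y_i = b1 + b2 z_i + eps_i, eps_i ~ N(0, s2); the responses for the last
    # len(z) - len(y_obs) cases are missing at random given z
    m, n = len(y_obs), len(z)
    b1, b2, s2 = 0.0, 0.0, 1.0
    Zmat = np.column_stack([np.ones(n), z])
    for _ in range(n_iter):
        # E-step: fill in each missing response by its conditional mean
        y_fill = np.concatenate([y_obs, b1 + b2 * z[m:]])
        # M-step: least squares on the completed data, then a variance update
        # that carries the conditional variance s2 of the imputed responses
        b1, b2 = np.linalg.lstsq(Zmat, y_fill, rcond=None)[0]
        resid = y_fill - (b1 + b2 * z)
        s2 = (np.sum(resid ** 2) + (n - m) * s2) / n
    return b1, b2, s2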
Notes for Section 2.2

(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).

(2) For further properties of Kullback-Leibler divergence, see Cover and Thomas (1991).

Note for Section 2.3

(1) Recall that in an exponential family, for any A, P[T(X) ∈ A] = 0 either for all P ∈ P or for no P ∈ P.

Note for Section 2.5

(1) In the econometrics literature (e.g., Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrast estimates is often called a generalized method of moments estimate.

2.7 REFERENCES

BARLOW, R. E., D. J. BARTHOLOMEW, J. M. BREMNER, AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.

BAUM, L. E., T. PETRIE, G. SOULES, AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164-171 (1970).

BISHOP, Y. M., S. E. FEINBERG, AND P. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.

CAMPBELL, J., A. LO, AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DAHLQUIST, G., A. BJORK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DEMPSTER, A. P., N. M. LAIRD, AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1-38 (1977).

DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199-200 (1985).

EISENHART, C., "The Meaning of Least in Least Squares," Journal Wash. Acad. Sciences, 54, 24-33 (1964).

FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.

FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher, 1950) New York: J. Wiley and Sons, 1922.

GOLUB, G., AND C. VAN LOAN, Matrix Computations Baltimore: John Hopkins University Press, 1985.

HABERMAN, S. J., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
E. "Association and Estimation in Contingency Tables. In. WU. Exp... LIlTLE. 63. WEISBERG. Statisr. B.. "On the Convergence Properties of the EM Algorithm. "A Flexible Growth Function for Empirical Use. Theory. J." Ann.. AND D. 1986.. Statist. R. G. KR[SHNAN.95103 (1983) . The EM Algorithm and Extensions New York: Wiley. P. S. D. 102108 (1956). 11. RUPPERT.·sis with Missing Data New York: J. A.. IA: Iowa State University Press. 27. Statistical Anal. New York: Wiley. F. "Multivariate Locally Weighted Least Squares Regression.160 Methods of Estimation Chapter 2 KOLMOGOROV..623656 (1948). G. "On the Shannon Theory of Information Transmission in the Case of Continuous Signals. Applied linear Regression. 1%7. C. AND M. N. I 1 . C. MACLACHLAN. 2nd ed. SEBER. 1987.. Statistical Methods. W. 1997. Assoc. SHANNON." 1. AND W. I • I . E. A. 128 (1968)... J. 22. Wiley. 1989." J. "A Mathematical Theory of Communication. Nonlinear Regression New York: Wiley. SNEDECOR. Amer. MOSTELLER.. 290300 ( 1959).. F. • . The History ofStatistics Cambridge. MA: Harvard University Press. G. i I. A.. Ames.379243. WAND. RICHARDS. AND c. 6th ed. Journal. 10. AND T. WILD. Statist. Botany. RUBIN. STIGLER.." Bell System Tech." Ann. 13461370 (1994). J.. E."/RE Trans! Inform.1. . S. COCHRAN. J. 1985.
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented; however, actual implementation is limited. In Section 3.4, in the context of estimation, we study the relation of the two major decision theoretic principles to the non-decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. Our examples are primarily estimation of a real parameter. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model P = {P_θ : θ ∈ Θ}, an action space A, and a loss function l(θ, a), then for data X ~ P_θ and any decision procedure δ, randomized or not, we can define its risk function R(·, δ) : Θ → R⁺ by

R(θ, δ) = E_θ l(θ, δ(X)).

We think of R(·, δ) as measuring a priori the performance of δ for this model. Strict comparison of δ₁ and δ₂ on the basis of the risks alone is not well defined unless R(θ, δ₁) ≤ R(θ, δ₂) for all θ or vice versa. However, by introducing a Bayes prior density (say) π for θ, comparison becomes unambiguous by considering the scalar Bayes risk,

r(π, δ) = E R(θ, δ) = E l(θ, δ(X)),     (3.2.1)
(0)1 .) ~ inf{r(".3 we showed how in an example. This exercise is interesting and important even if we do not view 1r as reflecting an implicitly believed in prior distribution on (). considering I. x _ !oo= q(O)p(x I 0).4.2. if computable will c plays behave reasonably even if our knowledge is only roughly right. if 1[' is a density and e c R r(". we can give the Bayes estimate a more explicit form. 1(0. .2..5) This procedure is called the Bayes estimate for squared error loss. we find that either r(1I".2. as usual.162 Measures of Performance Chapter 3 where (0. 0) = (q(O) using a nonrandomized decision rule 6.(0) a special role ("equal weight") though (Problem 3.6): 6 E VJ (3.8) for the posterior density and frequency functions.. X) is given the joint distribution specified by (1. This is just the prohlem of finding the best mean squared prediction error (MSPE) predictor of q(O) given X (see Remark 1. it is plausible that 6. Recall also that we can define R(. 6) Bayes rule 6* is given by = a? 6'(X) = E[q(O) I XJ. and that in Section 1. (0.6) In the discrete case. we just need to replace the integrals by sums. After all. Clearly. 1 1  .. e "(). and instead tum to construction of Bayes procedure. oj = ".(O)dfI p(x I O).2.2. For testing problems the hypothesis is often treated as more important than the alternative. It is in fact clear that prior and loss function cannot be separated out clearly either.5. If 1r is then thought of as a weight function roughly reflecting our knowledge. We don't pursue them further except in Problem 3..2. We first consider the problem of estimating q(O) with quadratic loss.4) the parametrization plays a crucial role here. Thus. In view of formulae (1. In the continuous case with real valued and prior density 1T. for instance. Here is an example. 'f " .3) In this section we shall show systematically how to construct Bayes rules. (3.2.6(X)j2.4) !. 00 for all 6 or the Using our results on MSPE prediction. 0) and "I (0) is equivalent to considering 1 (0.3). B denotes mean height of people in meters.(O)dO (3. j I 1': II . 0) 2 and 1T2(0) = 1.2.. and 1r may express that we care more about the values of the risk in some rather than other regions of e.r= . such that (3. Our problem is to find the function 6 of X that minimizes r(". Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994).(O)dfI (3.6).2) the Bayes risk of the problem..5).6) = J R(0. We may have vague prior notions such as "IBI > 5 is physically implausible" if. (0. 6) ~ E(q(O) .2. we could identify the Bayes rules 0. Suppose () is a random variable (or vector) with (prior) frequency function or density 11"(0).
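When no closed form is available, formula (3.2.6) can be evaluated directly by replacing the integrals with sums over a grid of θ values. A minimal sketch; the normal likelihood, the prior, and the grid below are chosen only for illustration.

import numpy as np

def bayes_estimate_squared_error(x, loglik, prior, q, grid):
    # posterior mean of q(theta): formula (3.2.6) with the integrals
    # replaced by numerical integration over a grid of theta values
    w = np.exp(loglik(x, grid)) * prior(grid)      # p(x | theta) * pi(theta)
    w = w / np.trapz(w, grid)                      # normalized posterior density
    return np.trapz(q(grid) * w, grid)

# illustration: one N(theta, 1) observation, N(0, 4) prior, q(theta) = theta
grid = np.linspace(-10.0, 10.0, 2001)
loglik = lambda x, th: -0.5 * (x - th) ** 2
prior = lambda th: np.exp(-th ** 2 / 8.0)          # unnormalized N(0, 4) density
print(bayes_estimate_squared_error(1.0, loglik, prior, lambda th: th, grid))
# closed form for this case: (4 / (4 + 1)) * 1.0 = 0.8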
This quantity r(a I x) is what we expect to lose. see Section 5. Intuitively. '1]0.2 Bayes Procedures 163 Example 3. " Xn. Because the Bayes risk of X. For more on this.a)' I X ~ x) as a function of the action a. .2. that is. Thus. If we look at the proof of Theorem 1.1. To begin with we consider only nonrandomized rules. tends to a as n _ 00. a 2 / n.w)X of the estimate to be used when there are no observations.) minimizes the conditional MSPE E«(Y .E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. X is the estimate that (3. take that action a = J*(x) that makes r(a I x) as small as possible. for each x. /2) as in Example 1.r( 7r. we should. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r. This action need not exist nor be unique if it does exist. In fact.12.6.1.2.1). we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 .2. In fact. if X = :x and we use action a. Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl. J') E(e . r) . we fonn the posterior risk r(a I x) = E(l(e. we see that the key idea is to consider what we should do given X = x.Section 3. E(Y I X) is the best predictor because E(Y I X ~ . 1]0 fixed).6) yields. However. Applying the same idea in the general Bayes decision problem. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large.00 with.2. Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper. if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3. .5. J' )]/r( 7r. Fonnula (3. and X with weights inversely proportional to the Bayes risks of these two estimates. If we choose the conjugate prior N( '1]0. J') ~ 0 as n + 00.E(e I X))' = E[E«e . But X is the limit of such estimates as prior knowledge becomes "vague" (7 .7) Its Bayes risk (the MSPE of the predictor) is just r(7r. .4. a) I X = x).2.
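When Θ and A are finite, the Bayes action at X = x is found by computing the posterior, forming each action's posterior risk, and taking an action with the smallest risk. A small numerical sketch; the prior, likelihood values, and loss matrix below are made-up numbers, not those of any example in the text.

import numpy as np

def bayes_action(prior, lik_x, loss):
    # prior[i] = pi_i, lik_x[i] = p(x | theta_i), loss[i, j] = l(theta_i, a_j);
    # returns the posterior risks r(a_j | x) and the index of the Bayes action
    post = prior * lik_x
    post = post / post.sum()
    risks = post @ loss
    return risks, int(np.argmin(risks))

prior = np.array([0.8, 0.2])                       # made-up prior on two states
lik_x = np.array([0.3, 0.6])                       # made-up p(x | theta_i)
loss = np.array([[0.0, 10.0, 5.0],
                 [12.0, 1.0, 6.0]])                # made-up losses for three actions
risks, j_star = bayes_action(prior, lik_x, loss)
print(risks, "choose action", j_star)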
3.o(X)) I X = x] = r(o(x) Therefore. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}. 10) 8 10. r(al 11) = 8.o'(X)) I X]. o As a first illustration.. by (3. . A = {ao. I 0)  + 91(0" ad r(a. ?f(O. and aa are r(at 10) r(a.o) = E[I(O. if 0" is the Bayes rule.67 5. and let the loss incurred when (}i is true and action aj is taken be given by j e e ..9) " "I " • But.8) Then 0* is a Bayes rule. let Wij > 0 be given constants. r(a.! " j.) = 0. Therefore.1. More generally consider the following class of situations. 6* = 05 as we found previously.8) Thus. [I) = 3.89. a q }. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures. . . consider the oildrilling example (Example 1. az. 11) = 5. (3. ~ a. :i " " Proof.2.70 and we conclude that 0'(1) = a.74. Then the posterior distribution of o is by (1. As in the proof of Theorem 1.2.2.o(X)) I X] > E[I(O. Suppose we observe x = O. Example 3.o(X)) I X)].·. E[I(O. the posterior risks of the actions aI. Bayes Procedures Whfn and A Are Finite.8.9). Let = {Oo. !i n j. we obtain for any 0 r(?f..8).) = 0. " .5) with priOf ?f(0. and the resnlt follows from (3. I x) > r(o'(x) I x) = E[I(O.1. 0'(0) Similarly.2.2. (3. r(a.o(X))] = E[E(I(O.4.Op}. Therefore.164 Measures of Performance Chapter 3 Proposition 3. E[I(O.2.2. az has the smallest posterior risk and.· ..35.2.o'(X)) IX = x].2.
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
In the nonBayesian framework. 5) > R(0. 5') for aile. for instance. II .1." R(0. on other grounds.4. . symmetry. compute procedures (in particular estimates) that are best in the class of all procedures. such as 5(X) = q(Oo).4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3.2. for example.. This notion has intuitive appeal. D. 0'5) model. . or more generally with constant risk over the support of some prior.3. estimates that ignore the data. Von Neumann's Theorem states that if e and D are both finite. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6.1 Unbiased Estimation. then the game of S versus N has a value 11. When Do is the class of linear procedures and I is quadratic Joss. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. We introduced. it is natural to consider the computationally simple class of linear estimates. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . I X n are d.L and (72 when XI. looking for the procedure 0. When v = ii. which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. all 5 E Va.2. An estimate such that Biase (8) 0 is called unbiased..• •••. in many cases.6. then in estimating a linear function of the /3J. in Section 1. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. the solution is given in Section 3. Ii I More specifically. Bayes and minimaxity. according to these criteria.1 . v equals the Bayes risk of the Bayes rule 8* for the prior 1r*.. are minimax.I . 176 Measures of Performance Chapter 3 . Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. This approach has early on been applied to parametric families Va. Do C D. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors. The most famous unbiased estimates are the familiar estimates of f. (72) = . We show that Bayes rules with constant risk. the game is said to have a value v. I I . Moreover. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O). and so on. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['. ruling out.d. Obviously. computational ease. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2. S(Y) = L:~ 1 diYi . A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable.. we can also take this point of view with humbler aims. and then see if within the Do we can find 8* E Do that is best according to the "gold standard. An alternative approach is to specify a proper subclass of procedures. N (Il. 3. . for which it is possible to characterize and.
is to estimate by a regression estimate ~ XR Xi.3) = 0 otherwise. . (3. . As we shall see shortly for X and in Volume 2 for . (3. (~) If{aj. Unbiased estimates playa particularly important role in survey sampling..4.1. Ui is the last census income corresponding to = it E~ 1 'Ui....Section 3..4. UN' One way to do this.6) where b is a prespecified positive constant.. for instance. We let Xl. [J = ~ I:~ 1 Ui· XR is also unbiased. UN for the known last census incomes. (3. Write Xl. . UN) and (Xl.4.3. Example 3.14) and has = iv L:fl (3. ~ N L.an}C{XI. .. If .. UMVU (uniformly minimum variance unbiased).' .1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 ..4.2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0.4..(Xi 1=1 I" N ~2 x) .. .XN} 1 (3. X N for the unknown current family incomes and correspondingly UI.4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1.2. .3. these are both UMVU. . We want to estimate the parameter X = Xj. X N ) as parameter .4.4) where 2 CT. ..< 2 ~ ~ (3. reflecting the probable correlation between ('ttl. and u =X  b(U ... We ignore difficulties such as families moving.8) Jt =X . X N ).4. .. a census unit.Xn denote the incomes of a sample of n families drawn at random without replacement.···. Unbiased Estimates in Survey Sampling.il) Clearly for each b. . It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3. to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census.5) This method of sampling does not use the information contained in 'ttl. .3 and Problem 1. This leads to the model with x = (Xl. Suppose we wish to sample from a finite population.. ... . . .4.
M (3. ... A natural choice of 1rj is ~n. An alternative approach to using the Uj is to not sample all units with the same prob ability. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems.4. . The HorvitzThompson estimate then stays unbiased. outside of sampling.bey the attractive equivariance property. :I ..4).3. X M } of random size M such that E(M) = n (Problem 3. 178 Measures of Performance Chapter 3 i . 1. ".Xi XHT=LJN i=l 7fJ. (iii) Unbiased_estimates do not <:. To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S). II it >: . (ii) Bayes estimates are necessarily biasedsee Problem 3.19).18.. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later. Xl!Var(U). If B is unbiased for e.. j=1 3 N ! I: . this will be a beller cov(U.4. the unbiasedness principle has largely fallen out of favor for a number of reasons.4. (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3. ~ . This makes it more likely for big incomes to be induded and is intuitively desirable.. N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads. . X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef .j .... For each unit 1.. . The result is a sample S = {Xl.4. ' .I. . .i . for instance.4. 0 Discussion. . I j . I . Unbiasedness is also used in stratified sampling theory (see Problem 1. .'I .X)!Var(U) (Problem 3.11.1 the correlation of Ui and Xi is positive and b < 2Cov(U. . It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj. " 1 ".15).. However.3. estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5. Ifthc 'lrj are not all equal.j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows.7) II ! where Ji is defined by Xi = xJ.~'! I . I II. Specifically let 0 < 7fl. q(e) is biased for q(e) unless q is linear. . .2Gand minimax estimates often are. 111" N < 1 with 1 1fj = n. .
%0 [j T(X)P(x. unbiased estimates are still in favor when it comes to estimating residual variances. We make two regularity assumptions on the family {Pe : BEe}.4.4. good estimates in large samples ~ 1 ~ are approximately unbiased. has some decision theoretic applications.   . B)dx. as we shall see in Chapters 5 and 6.4. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n . p. For all x E A. B) for II to hold. B)dx. and p is the number of coefficients in f3. 167. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. Some classical conditiolls may be found in Apostol (1974).p) where i = (Y . Simpler assumptions can be fonnulated using Lebesgue integration theory.OJdx] = J T(xJ%oP(x.8) is finite.. which can be used to show that an estimate is UMVU. e e. and we can interchange differentiation and integration in p(x. The arguments will be based on asymptotic versions of the important inequalities in the next subsection. For instance.8) whenever the righthand side of (3.2. e. Note that in particular (3. f3 is the least squares estimate. Finally. See Problem 3. We expect that jBiase(Bn)I/Vart (B n ) .8) is assumed to hold if T(xJ = I for all x. What is needed are simple sufficient conditions on p(x. Assumption II is practically useless as written.ZDf3).( lTD < 00 J . For instance.9. Then 11 holdS provided that for all T such that E. The lower bound is interesting in its own right.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x.O) > O} docs not depend on O.4. and appears in the asymptotic optimality theory of Section 5.Section 3. That is. B) is a density. OJ exists and is finite. for integration over Rq. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large. in the linear regression model Y = ZDf3 + e of Section 2.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless. From this point on we will suppose p(x. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later. (I) The set A ~ {x : p(x. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00.4.O E 81iJO log p( x.OJdx (3. We suppose throughout that we have a regular parametric model and further that is an open subset of the line.4. 3. suppose I holds.
O)] dx " I'  are continuous functions(3) of O.1.0 Xi) 1 = nO =  0' n O· o . thus.6.B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8. 1 X n is a sample from a N(B.&0 '0 {[:oP(X. . (3.4.I [I ·1 . Then Xi . Similarly.. t.4. O)dx :Op(x. which is denoted by I(B) and given by Proposition 3.4. suppose XI.  Proof.4. (~ logp(X. the Fisher information number. For instance. 0))' = J(:Ologp(x. I 1(0) 1 = Var (~ !Ogp(X. Then (3.10) and.i J T(x) [:OP(X. 00. ..180 for all Measures of Performance Chapter 3 e. where a 2 is known. the integrals . Then (see Table 1.1) ~(O) = 0la' and 1 and 11 are satisfied.O)).O) & < 00. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed. I' . It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II.n and 1(0) = Var (~n_l . If I holds it is possible to define an important characteristic of the family {Po}. then I and II hold. Suppose that 1 and II hold and that E &Ologp(X. O)dx ~ :0 J p(x.11 ) . 1 X n is a sample from a Poisson PCB) population. . Ii ! I h(x) exp{ ~(O)T(x) ..4. o Example 3. 0))' p(x. O)dx.O)} p(x. Suppose Xl. 0) ~ 1(0) = E.O)] dxand J T(x) [:oP(X. I J J & logp(x 0 ) = ~n 1 .9) Note that 0 < 1(0) < Lemma 3.1.2. O)dx = O. (3..4.O)]j P(x. /fp(x. (J2) population.
4.(0). (Information Inequality).16) to the random variables 8/8010gp(X. Suppose the conditions of Theorem 3. 1/.'(0)]'  I(e) .I1. e)) = I(e). and that the conditions of Theorem 3. nI. (3. Using I and II we obtain. Proposition 3. 1/.. we obtain a universal lower bound given by the following..4.(T(X)) > [1/. (3.I 1.13) By (A. .'(~)I)'.Xu) is a sample from a population with density f(x.(0).(0) = E(feiogf(X"e»)'.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B).12) Proof. We get (3.1 hold and T is an unbiased estimate ofB. Then Var.4.l4) and Lemma 3.4. (3.4.15) The theorem follows because. 10gp(X. Suppose that I and II hold and 0 < 1(0) < 00. . by Lemma 3.4.Section 3. 1/.2.O). 0 (3.4. 0 E e.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section.1.O)dX= J T(x) UOIOgp(x.17) . Suppose that X = (XI. Then for all 0. Here's another important special case. T(X)) . Let [. (e) and Var. > [1/.1. then I(e) = nI. Theorem 3. Denote E.1.(T(X)) < 00 for all O.4. Var (t. ej.4.4. e) and T(X).O)dX.O))P(x.(T(X) > I(O) 1 (3.4.'(0) = J T(X):Op(x.(T(X)) by 1/. CoroUary 3.1 hold.4. 0 The lower bound given in the information inequality depends on T(X) through 1/.1.4. If we consider the class of unbiased estimates of q(B) = B. Let T(X) be all)' statistic such that Var.' (e) = Cov (:e log p(X.14) Now let ns apply the correlation (CauchySchwarz) inequality (A.(0) is differentiable and Var (T(X)) .
Proof. This is a consequence of Lemma 3.4.1 and

I(θ) = Var_θ[(∂/∂θ) log p(X, θ)] = Var_θ[Σᵢ (∂/∂θ) log f(Xᵢ, θ)] = Σᵢ Var_θ[(∂/∂θ) log f(Xᵢ, θ)] = n I₁(θ). □

I₁(θ) is often referred to as the information contained in one observation.

Next we note how we can apply the information inequality to the problem of unbiased estimation.

Example 3.4.3. Suppose X₁, ..., X_n is a sample from a normal distribution with unknown mean θ and known variance σ². Then the conditions of the information inequality are satisfied. Now Var(X̄) = σ²/n, whereas if φ denotes the N(0, 1) density, then f(x, θ) = σ⁻¹φ((x − θ)/σ) and

I₁(θ) = E[(X₁ − θ)/σ²]² = 1/σ².

By Corollary 3.4.1 we see that the conclusion that X̄ is UMVU follows if Var(X̄) = 1/(n I₁(θ)) = σ²/n, which is the case. Note that because X̄ is UMVU whatever may be σ², we have in fact proved that X̄ is UMVU even if σ² is unknown. □

We can similarly show (Problem 3.4.1) that if X₁, ..., X_n are the indicators of n Bernoulli trials with probability of success θ, then X̄ is a UMVU estimate of θ.

Example 3.4.1. (Continued). For a sample from a P(θ) distribution, the MLE is, as we previously remarked, θ̂ = X̄. Because X̄ is unbiased and Var(X̄) = θ/n = 1/I(θ), X̄ achieves the information inequality bound and is a UMVU estimate of θ. □

These are situations in which X̄ follows a one-parameter exponential family. This is no accident.
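Before stating the general result, here is a quick simulation illustrating Example 3.4.3; it is a sketch, not from the text, and assumes numpy with arbitrary illustrative values of θ, σ, n, and the seed. The empirical variance of X̄ should match the bound 1/I(θ) = σ²/n of Corollary 3.4.1.

```python
# Simulation: for N(theta, sigma^2) samples, Var(X-bar) attains the information bound sigma^2/n.
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 1.0, 2.0, 25, 100_000
xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
print(xbar.var())            # empirical variance of X-bar
print(sigma**2 / n)          # 0.16 = 1/I(theta), the Cramer-Rao bound
```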
Theorem 3.4.2. Suppose that the family {P_θ : θ ∈ Θ} satisfies assumptions I and II and there exists an unbiased estimate T* of ψ(θ) such that

Var_θ[T*(X)] = [ψ'(θ)]²/I(θ) for all θ ∈ Θ;    (3.4.18)

that is, T* achieves the lower bound of Theorem 3.4.1 for every θ. Then {P_θ} is a one-parameter exponential family with density or frequency function of the form

p(x, θ) = h(x) exp[η(θ)T*(x) − B(θ)].    (3.4.19)

Conversely, if {P_θ} is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic T(X) and η(θ) has a continuous nonvanishing derivative on Θ, then T(X) achieves the information inequality bound and is a UMVU estimate of E_θ T(X).

Proof. We start with the first assertion. Our argument is essentially that of Wijsman (1973). By (3.4.14) and the conditions for equality in the correlation inequality (A.11.16) we know that T* achieves the lower bound for all θ if, and only if, there exist functions a₁(θ) and a₂(θ) such that

(∂/∂θ) log p(X, θ) = a₁(θ)T*(X) + a₂(θ)    (3.4.20)

with P_θ probability 1 for each θ. From this equality of random variables we shall show that P_θ[X ∈ A*] = 1 for all θ, where

A* = {x : (∂/∂θ) log p(x, θ) = a₁(θ)T*(x) + a₂(θ) for all θ ∈ Θ}.    (3.4.21)

Upon integrating both sides of (3.4.20) with respect to θ we then obtain (3.4.19) for x ∈ A*. The passage from (3.4.20) to (3.4.21) is highly technical. Here is the argument. Let A_θ denote the set of x for which (3.4.20) holds; (3.4.20) guarantees P_θ(A_θ) = 1 and assumption I guarantees P_θ'(A_θ) = 1 for all θ' (Problem 3.4.6). Let θ₁, θ₂, ... be a denumerable dense subset of Θ and set A** = ∩_m A_{θ_m}, so that P_θ(A**) = 1 for all θ. Suppose without loss of generality that T*(x₁) ≠ T*(x₂) for some x₁, x₂ ∈ A**. Solving

(∂/∂θ) log p(xᵢ, θ_j) = a₁(θ_j)T*(xᵢ) + a₂(θ_j), i = 1, 2, j = 1, 2, ...,    (3.4.22)

for a₁ and a₂, we see that a₁ and a₂ are linear combinations of (∂/∂θ) log p(xᵢ, θ), i = 1, 2, and, hence, continuous in θ. But then, for each x ∈ A**, both sides of

(∂/∂θ) log p(x, θ) = a₁(θ)T*(x) + a₂(θ)    (3.4.23)

are continuous functions of θ that agree on a dense subset of Θ; thus (3.4.23) must hold for all θ. That is, A** = A* and the result follows.

Conversely, in the exponential family case (1.6.1) we assume without loss of generality (Problem 3.4.3) that we have the canonical case with η(θ) = θ and

B(θ) = A(θ) = log ∫ h(x) exp{θT(x)} dx.

Then

(∂/∂θ) log p(x, θ) = T(x) − A'(θ)    (3.4.24)

so that ψ(θ) = E_θ T(X) = A'(θ) and, by (3.4.24),

I(θ) = Var_θ(T(X) − A'(θ)) = Var_θ T(X) = A''(θ).    (3.4.25)

Thus, the information bound is [A''(θ)]²/A''(θ) = A''(θ) = Var_θ(T(X)), so that T(X) achieves the information bound as an estimate of E_θ T(X). □

Example 3.4.4. In the Hardy-Weinberg model of Examples 2.1.4 and 2.2.6,

p(x, θ) = 2^{n₂} exp{(2n₁ + n₂) log θ + (2n₃ + n₂) log(1 − θ)}
        = 2^{n₂} exp{(2n₁ + n₂)[log θ − log(1 − θ)] + 2n log(1 − θ)},

where we have used the identity (2n₁ + n₂) + (2n₃ + n₂) = 2n. Because this is an exponential family, Theorem 3.4.2 implies that T = (2N₁ + N₂)/2n is UMVU for estimating

E(T) = (2n)⁻¹[2nθ² + 2nθ(1 − θ)] = θ.

This T coincides with the MLE θ̂ of Example 2.2.6. The variance of θ̂ can be computed directly using the moments of the multinomial distribution of (N₁, N₂, N₃), or by transforming p(x, θ) to canonical form by setting η = log[θ/(1 − θ)] and then using the moment identities for canonical exponential families of Section 1.6. A third method would be to use Var(θ̂) = 1/I(θ) and formula (3.4.26). We find (Problem 3.4.7)

Var(θ̂) = θ(1 − θ)/2n. □
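A short simulation sketch of Example 3.4.4 follows; it is not from the text, assumes numpy, and the values of θ, n, and the seed are illustrative. It simulates Hardy-Weinberg counts and checks that T = (2N₁ + N₂)/2n is unbiased with variance close to θ(1 − θ)/2n.

```python
# Hardy-Weinberg check: (N1, N2, N3) ~ M(n; theta^2, 2 theta (1-theta), (1-theta)^2),
# T = (2 N1 + N2)/(2 n) should be unbiased for theta with variance theta(1-theta)/(2n).
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 50, 100_000
probs = [theta**2, 2 * theta * (1 - theta), (1 - theta)**2]
counts = rng.multinomial(n, probs, size=reps)           # rows are (N1, N2, N3)
T = (2 * counts[:, 0] + counts[:, 1]) / (2 * n)
print(T.mean(), theta)                                  # unbiasedness
print(T.var(), theta * (1 - theta) / (2 * n))           # attains the information bound
```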
e).B)(2n. 0 0 . A third metho:! would be to use Var(B) = I(I(B) and formula (3.4. . Extensions to models in which B is multidimensional are considered next.184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211.4.4..4.26) Proposition 3.2. . By (3. =e'E(~Xi) = . .B) ~ A "( B). B)..'" .25) we obtain a' 10gp(X.7) Var(B) ~ B(I . I(B) = Eo aB' It turns out that this identity also holds outside exponential families. • The multiparameter case We will extend the information lower bound to the case of several parameters. . Sharpenings of the information inequality are available but don't help in general. (3. Suppose p('. Proof We need only check that I a aB' log p(x. DiscRSsion.4. Theorem 3. as Theorem 3.6. (Continued). 0 ~ Note that by differentiating (3.2. '.25). but the variance of the best estimate is not equal to the bound [. Na). I I' f . we have iJ' aB' logp (X. .6. See Volume II.4. in the U(O.4. B) )' aB (3. Even worse. assumptions I and II are satisfied and UMVU estimates of 1/.4. for instance. We find (Problem 3.ed)' In particular. or by transforming p(x.2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 . Then (3. B) aB' p(x. Example 3. although UMVU estimates exist. B) example..B) is twice differentiable and interchange between integration and differentiation is pennitted.3. It often happens.(e) exist. that I and II fail to hold.2 suggests.4. we will find a lower bound on the variance of an estimator . The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z . in many situations. B) 2 1 a = p(x. " 'oi . . () (ell" . B) to canonical form by setting t) = 10g[B((1 . B) satisfies in addition to I and II: p(·.24).B)] ~ B. ~ ~ This T coincides with the MLE () of Example 2. For a sample from a P( B) distribution o EB(::.B)] and then using Theorem 1.IOgp(X.4.B)) which equals I(B).26) holds.4..2. B)  2 (a log p(x. Because this is an exponential family.27) and integrate both sides with respect to p(x.p' (B) I'( I (B).
We assume that Θ is an open subset of R^d and that {p(x, θ) : θ ∈ Θ} is a regular parametric model with conditions I and II satisfied when differentiation is with respect to θ_j, 1 ≤ j ≤ d. Let p(x, θ) denote the density or frequency function of X, where X ∈ 𝒳 ⊂ R^q. The (Fisher) information matrix is defined as

I(θ) = (I_{jk}(θ))_{d×d}    (3.4.28)

where

I_{jk}(θ) = E_θ[(∂/∂θ_j) log p(X, θ) · (∂/∂θ_k) log p(X, θ)], 1 ≤ j, k ≤ d.    (3.4.29)

Proposition 3.4.4. Under the conditions in the opening paragraph,

(a) E_θ[(∂/∂θ_j) log p(X, θ)] = 0, 1 ≤ j ≤ d, and

I_{jk}(θ) = Cov_θ((∂/∂θ_j) log p(X, θ), (∂/∂θ_k) log p(X, θ)).    (3.4.30)

(b) If X₁, ..., X_n are i.i.d. as X, then X = (X₁, ..., X_n)ᵀ has information matrix

I(θ) = n I₁(θ),    (3.4.31)

where I₁ is the information matrix of a single observation X.

(c) If, in addition, p(·, θ) is twice differentiable and double integration and differentiation under the integral sign can be interchanged,

I_{jk}(θ) = E_θ[−(∂²/∂θ_j∂θ_k) log p(X, θ)], 1 ≤ j, k ≤ d.    (3.4.32)

The arguments follow the d = 1 case and are left to the problems.

Example 3.4.5. Suppose X ~ N(μ, σ²), θ = (μ, σ²). Then

log p(x, θ) = −(1/2) log(2π) − (1/2) log σ² − (x − μ)²/2σ².

Thus,

I₁₁(θ) = E[−(∂²/∂μ²) log p(X, θ)] = σ⁻²,

I₁₂(θ) = E[−(∂²/∂μ∂σ²) log p(X, θ)] = E[(X − μ)/σ⁴] = 0,

I₂₂(θ) = E[−(∂²/(∂σ²)²) log p(X, θ)] = E[(X − μ)²/σ⁶ − 1/(2σ⁴)] = σ⁻⁴/2,

because E(X − μ) = 0 and E(X − μ)² = σ². If X₁, ..., X_n are i.i.d. N(μ, σ²), then by (b), in this case

I(θ) = n ( σ⁻²   0
            0    σ⁻⁴/2 ).    (3.4.33)

□
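The following is a numerical illustration of Example 3.4.5, not part of the text; it assumes numpy, and μ, σ², and the seed are arbitrary. It estimates I₁(θ) by averaging the outer product of the score in θ = (μ, σ²) and compares with diag(σ⁻², σ⁻⁴/2).

```python
# Monte Carlo estimate of the information matrix for one N(mu, s2) observation.
import numpy as np

rng = np.random.default_rng(3)
mu, s2, reps = 1.0, 4.0, 500_000
x = rng.normal(mu, np.sqrt(s2), reps)
score = np.stack([(x - mu) / s2,                             # d/dmu log p
                  -0.5 / s2 + (x - mu)**2 / (2 * s2**2)])    # d/dsigma^2 log p
print(score @ score.T / reps)                  # approx [[0.25, 0], [0, 0.03125]]
print(np.diag([1 / s2, 1 / (2 * s2**2)]))      # exact I_1(theta)
```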
186 Measures of Performance Chapter 3 Thus.O») = VOEO(T(X» and the last equality follows 0 from the argument in (3.6.2 ( 0 0 ) . II are easily checked and because VOlogp(x. [(0) = VarOT(X) = (3.. Suppose k . that is.. then [(0) = VarOT(X). Let 1/.6.(0) exists and (3. I'L(Z) = I'Y + (Z l'z)TLz~ Lzy· Now set Y (3. Then Theorem 3.35) o ~ Next suppose 8 1 = T is an estimate of (Jl with 82.6. We claim that each of Tj(X) is a UMVU .(O). . ! p(x.4. (continued).36) Proof.'(0) = V1/..37) = T(X). = . 0).4.4.. J)d assumed unknown. where I'L(Z) denotes the optimal MSPE linear predictor of Y.3.6 hold.38) where Lzy = EO(TVOlogp(X. Suppose the conditions of Example 3. . Here are some consequences of this result. in this case [(0) ! .4.4.4..13).4.4/2 . We will use the prediction inequality Var(Y) > Var(I'L(Z». Then VarO(T(X» > Lz~ [1(0) Lzy (3.!pening paragraph hold and suppose that the matrix [(0) is nonsingular Then/or all 0.4.A(O)}h(x) (3. By (3.O) = T(X) . The conditions I.33) o Example 3. A(O). 0) . Assume the conditions of the .' = explLTj(x)9j j=1 .1.. 1/.( 0) be the d x 1 vector of partial derivatives.A. (3.4. .4.4.4.30) and Corollary 1. Canonical kParameter Exponential Family. Example 3.34) () E e open. Z = V 0 logp(x. UMVU Estimates in Canonical Exponential Families.(0) = EOT(X) and let ".
. ..p(OJrI(O). i =j:. . . j=1 X = (XI.p(0) is the first row of 1(0) and.. Thus.A(O)} whereTT(x) = (T!(x).. we transformed the multinomial model M(n.) = Var(Tj(X».39) where.. then Nj/n is UMVU for >'j.A(O) is just VarOT! (X).6... . " Ok_il T. the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >.Section 3. .6 with Xl. . . by Theorem 3.i.7. We have already computed in Proposition 3. But because Nj/n is unbiased and has 0 variance >'j(1 . To see our claim note that in our case (3.0).4.j.4.O) = exp{TT(x)O .. is >..d. ..>'.4. = nE(T. Multinomial Trials. j.T(O) = ~:~ I (3.4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X). j = 1. k. .. we let j = 1.>'j)/n. AI..3. (3. But f.40) J We claim that in this case .41) because .p(0)11(0) ~ (1. a'i X and Aj = P(X = j).A. In the multinomial Example 1. Ak) to the canonical form p(x. . 0 = (0 1 .Xn)T. . hence. .Tk_I(X)).(X) = I)IXi = j]. are known..4.4 8'A rl(O)= ( 8080 t )1 kx k . . (1 + E7 / eel) = n>'j(1. n T.. This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.(1.A(0) j ne ej = (1 + E7 eel _ eOj ) . Example 3.4.4. ... kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'.. X n i.(X)) 82 80..)/1£. without loss of generality. .0.
Here is an important extension of Theorem 3.4.3 whose proof is left to Problem 3.4.21.

Theorem 3.4.4. Suppose that the conditions of Theorem 3.4.3 hold and T(X) is a d-dimensional statistic. Let ψ(θ) = E_θ(T(X))_{d×1} and ψ̇(θ) = ((∂ψ_i/∂θ_j)(θ))_{d×d}. Then

Var_θ(T(X)) ≥ ψ̇(θ) I⁻¹(θ) ψ̇ᵀ(θ),    (3.4.42)

where A ≥ B means aᵀ(A − B)a ≥ 0 for all a_{d×1}. Note that both sides of (3.4.42) are d × d matrices. Also note that

θ̂ unbiased ⟹ Var_θ(θ̂) ≥ I⁻¹(θ).

Example 3.4.8. The Normal Case. If X₁, ..., X_n are i.i.d. N(μ, σ²), then X̄ is UMVU for μ and n⁻¹ Σ Xᵢ² is UMVU for μ² + σ². But it does not follow that n⁻¹ Σ (Xᵢ − X̄)² is UMVU for σ². These and other examples and the implications of Theorem 3.4.3 are explored in the problems. □

Summary. We study the important application of the unbiasedness principle in survey sampling. We derive the information inequality in one-parameter models and show how it can be used to establish that in a canonical exponential family, T(X) is the UMVU estimate of its expectation. Using inequalities from prediction theory, we show how the information inequality can be extended to the multiparameter case. Asymptotic analogues of these inequalities are sharp and lead to the notion and construction of efficient estimates. In Chapters 5 and 6 we show that in smoothly parametrized models, reasonable estimates are asymptotically unbiased; there we establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal.

3.5 NONDECISION THEORETIC CRITERIA

In practice, features other than the risk function are also of importance in selection of a procedure, even if the loss function and model are well specified. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure, interpretability of the procedure, and robustness to model departures.

3.5.1 Computation

Speed of computation and numerical stability issues have been discussed briefly in Section 2.4. They are dealt with extensively in books on numerical analysis such as Dahlquist, Björk, and Anderson (1974).
variance and computation As we have seen in special cases in Examples 3.4.2.P) in Example 2. takes on the order of log log ~ steps (Problem 3.1.1 / 2 is wasteful. for this population of measurements has a clear interpretation.5 Nondecision Theoretic Criteria 189 Bjork.1).2 is the empirical variance (Problem 2. Then least squares estimates are given in closed fonn by equ<\tion (2.5. Closed form versus iteratively computed estimates At one level closed form is clearly preferable. with ever faster computers a difference at this level is irrelevant. and Anderson (1974). For instance.1 / 2 . Of course./ a even if the data are a sample from a distribution with . 3. The closed fonn here is deceptive because inversion of a d x d matrix takes on the order of d 3 operations when done in the usual way and can be numerically unstable.2.A(BU ~l»)). Its maximum likelihood estimate X/a continues to have the same intuitive interpretation as an estimate of /1.1.9) by.2). But it reappears when the data sets are big and the number of parameters large.4.Section 3.4. estimates of parameters based on samples of size n have standard deviations of order n.3. fI(j) = iPl) .2 Interpretability Suppose that iu the normal N(Il. (72) Example 2.3. It may be shown that. The interplay between estimated. We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory. On the other hand.10).5. say. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the COnstants involved. < 1 then J is of the order of log ~ (Problem 3. solve equation (2. the signaltonoise ratio. On the other hand. The improvement in speed may however be spurious since AI is costly to compute if d is largethough the same trick as in computing least squares estimates can be used. the NewtonRaphson method in ~ ~(J) which the jth iterate. if we seek to take enough steps J so that III III < . It follows that striving for numerical accuracy of ord~r smaller than n. a method of moments estimate of (A.5. It is in fact faster and better to Y.1.1.4.5 we are interested in the parameter III (7 • • This parameter.3 and 3. in the algOrithm we discuss in Section 2.AI (flU 1) (T(X) .11). consider the Gaussian linear model of Example 2. It is clearly easier to compute than the MLE.2 is given by where 0. Gaussian elimination for the particular z'b Faster versus slower algorithms Consider estimation of the MLE 8 in a general canonical exponential family as in Section 2. at least if started close enough to 8.
that is.\)' distribution. the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of E( X) / [Var( X)] 1/2.e. and both the mean tt and median v qualify. Alternatively. The actual observation X is X· contaminated with "gross errOfs"see the following discussion. we observe not X· but X = (XI •.I1'. But () is still the target in which we are interested. We can now use the MLE iF2. suppose that if n measurements X· (Xi. the parameter v that has half of the population prices on either side (fonnally. P(X > v) > ~).4. they may be interested in total consumption of a commodity such as coffee. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately.)(p/>.3 Robustness Finally.. On the other hand.. 1975b. amoog others. (c) We have a qualitative idea of what the parameter is. but there are several parameters that satisfy this qualitative notion. but we do not necessarily observe X*. E p. the HardyWeinberg parameter () has a clear biological interpretation and is the parameter for the experiment described in Example 2. would be an adequate approximation to the distribution of X* (i..5. we could suppose X* rv p.1. . v is any value such that P(X < v) > ~. 1976) and Doksum (1975). For instance.4) is for n large a more precise estimate than X/a if this model is correct.. we may be interested in the center of a population. Then E(X)/lVar(X) ~ (p/>.190 Measures of Performance Chapter 3 mean Jl and variance 0'2 other than the normal. However.5. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. However. To be a bit formal. X n ) where most of the Xi = X:. We consider three situations (a) The problem dictates the parameter. suppose we initially postulate a model in which the data are a sample from a gamma. but there are a few • • = .')1/2 = l. economists often work with median housing prices.5. We return to this in Section 5. 1 X~) could be taken without gross errors then p. what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). This idea has been developed by Bickel and Lehmann (l975a. See Problem 3. 3. we turn to robustness. which as we shall see later (Section 5. 9(Pl.. We will consider situations (b) and (c). if gross errors occur. Gross error models Most measurement and recording processes are subject to gross errors. However.13. anomalous values that arise because of human error (often in recording) or instrument malfunction. . E P*). For instance. Similarly. say () = N p" where N is the population size and tt is the expected consumption of a randomly drawn individuaL (b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a ''true'' parametric model with an interpretable parameter B.
Now suppose we want to estimate θ(P*) and use θ̂(X₁, ..., X_n), knowing that θ̂(X₁*, ..., X_n*) is a good estimate. Informally, θ̂(X₁, ..., X_n) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xᵢ ≠ Xᵢ*, the gross errors. Again informally, we shall call such procedures robust. Formal definitions require model specification, specification of the gross error mechanism, and definitions of insensitivity to gross errors.

Consider the one-sample symmetric location model 𝒫 defined by

Xᵢ = μ + εᵢ, i = 1, ..., n,    (3.5.1)

where the errors {εᵢ} are independent, identically distributed, and symmetric about 0 with common density f and d.f. F. If the error distribution is normal, X̄ is the best estimate in a variety of senses.

A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the εᵢ still i.i.d., but with common distribution function F and density f of the form

f(x) = (1 − λ)(1/σ)φ(x/σ) + λh(x).    (3.5.2)

Here h is the density of the gross errors and λ is the probability of making a gross error. This corresponds to Xᵢ = Xᵢ* with probability 1 − λ and Xᵢ = Yᵢ with probability λ, where the Xᵢ* obey (3.5.1) and Y₁, ..., Y_n are i.i.d. with density h(y − μ); in our new formulation it is the Xᵢ* that obey (3.5.1). Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X*. Further assumptions that are commonly made are that h has a particular form, for example, h(x) = (1/Kσ)φ(x/Kσ) where K ≫ 1, or more generally that h is an unknown density symmetric about 0. Then the gross error model is semiparametric,

𝒫 = {P_{(μ,f)} : f satisfies (3.5.2) for some h such that h(x) = h(−x) for all x}.

The advantage of this formulation is that μ remains identifiable; that is, it is the center of symmetry of P_{(μ,f)} for all such P. However, the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. But if we drop the symmetry assumption, we encounter one of the basic difficulties in formulating robustness in situation (b): without h symmetric, the quantity μ is not a parameter, so it is unclear what we are estimating. That is, it is possible to have P_{(μ₁,f₁)} = P_{(μ₂,f₂)} for μ₁ ≠ μ₂ (Problem 3.5.16); is μ₁ or μ₂ our goal? On the other hand, in situation (c), we do not need the symmetry assumption.

Formal definitions of insensitivity to gross errors and most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. However, two notions, the sensitivity curve and the breakdown point, make sense for fixed n. We next define and examine the sensitivity curve in the context of the Gaussian location model, and then more generally. The breakdown point will be discussed in Volume II. We return to these issues in Chapter 6.
The sensitivity curve

Suppose that X ~ P and that θ = θ(P) is a parameter. The empirical plug-in estimate of θ is θ̂ = θ(P̂), where P̂ is the empirical probability distribution. See Section 2.1.2. At this point we ask: suppose that an estimate T(X₁, ..., X_n) = θ(P̂), that is, has the plug-in property. How sensitive is it to the presence of gross errors among X₁, ..., X_n? An interesting way of studying this, due to Tukey (1972) and Hampel (1974), is the sensitivity curve, defined as follows for plug-in estimates (which are well defined for all sample sizes n).

Think of x₁, ..., x_{n−1} as an "ideal" sample of size n − 1 for which the estimator θ̂ gives us the right value of the parameter, and then we see what the introduction of a potentially deviant nth observation x does to the value of θ̂. The sensitivity curve of θ̂ is defined as

SC(x, θ̂) = n[θ̂(x₁, ..., x_{n−1}, x) − θ̂(x₁, ..., x_{n−1})],

where x₁, ..., x_{n−1} represents an observed sample of size n − 1 from P and x represents an observation that (potentially) comes from a distribution different from P. Often this is done by fixing x₁, ..., x_{n−1} so that θ̂(x₁, ..., x_{n−1}) equals the ideal value θ(P). We are interested in the shape of the sensitivity curve, not its location; in our examples we shall, therefore, shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas. This is equivalent to shifting the SC vertically to make its value at x = 0 equal to zero.

We return to the location problem with θ equal to the mean μ = E(X). Note that θ(P_{(μ,f)}) = μ for all P_{(μ,f)} ∈ 𝒫, so the mean is appropriate for the symmetric location model. Because the estimators we consider are location invariant, we take μ = 0 without loss of generality. Now fix x₁, ..., x_{n−1} so that their mean has the ideal value zero. Then

x̄ = (x₁ + ... + x_{n−1} + x)/n = x/n,

and

SC(x, X̄) = x.

Thus, the sample mean is arbitrarily sensitive to gross errors: a large gross error can throw the mean off entirely. Are there estimates that are less sensitive?

A classical estimate of location based on the order statistics is the sample median X̃ defined by

X̃ = X_{(k+1)} if n = 2k + 1
  = (1/2)(X_{(k)} + X_{(k+1)}) if n = 2k,

where X_{(1)}, ..., X_{(n)} are the order statistics, that is, X₁, ..., X_n ordered from smallest to largest. The sample median can be motivated as an estimate of location on various grounds.

(i) It is the empirical plug-in estimate of the population median ν (Problem 3.5.16), and it splits the sample into two equal halves.
(ii) In the symmetric location model (3.5.1), ν coincides with μ, and X̃ is an empirical plug-in estimate of μ.

(iii) The sample median is the MLE when we assume the common density f(x) of the errors {εᵢ} in (3.5.1) is the Laplace (double exponential) density

f(x) = (1/2τ) exp{−|x|/τ},

a density having substantially heavier tails than the normal. See Problems 2.2.32 and 3.5.9.

The sensitivity curve of the median is as follows. If, say, n = 2k + 1 is odd and the median of x₁, ..., x_{n−1} is (x_{(k)} + x_{(k+1)})/2 = 0, we obtain

SC(x, X̃) = n x_{(k)}   for x < x_{(k)}
          = n x         for x_{(k)} ≤ x ≤ x_{(k+1)}
          = n x_{(k+1)} for x > x_{(k+1)},

where x_{(1)} < ... < x_{(n−1)} are the ordered x₁, ..., x_{n−1}.

Figure 3.5.1. The sensitivity curves of the mean and median.

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X̄. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when x is near μ.
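Before introducing such estimates, here is a small computational sketch of the sensitivity curves just discussed; it is not from the text, assumes numpy, and the "ideal" sample is an arbitrary choice with mean and median zero. The mean's curve is linear and unbounded in x, while the median's is bounded.

```python
# SC(x, theta-hat) = n [theta-hat(x_1,...,x_{n-1}, x) - theta-hat(x_1,...,x_{n-1})]
# computed for the sample mean and sample median on a fixed ideal sample.
import numpy as np

ideal = np.array([-1.4, -0.6, 0.0, 0.6, 1.4])        # ideal sample: mean = median = 0
n = len(ideal) + 1

def sc(estimate, x):
    return n * (estimate(np.append(ideal, x)) - estimate(ideal))

xs = np.linspace(-5, 5, 11)
print([round(sc(np.mean, x), 2) for x in xs])        # linear, unbounded in x
print([round(sc(np.median, x), 2) for x in xs])      # bounded: the median resists gross errors
```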
A class of estimates providing such intermediate behavior and including both the mean and the median has been known since the eighteenth century. Let 0 ≤ α < 1/2. We define the α trimmed mean by

X̄_α = (X_{([nα]+1)} + ... + X_{(n−[nα])})/(n − 2[nα]),    (3.5.3)

where [nα] is the largest integer ≤ nα and X_{(1)} ≤ ... ≤ X_{(n)} are the ordered observations. That is, we throw out the "outer" [nα] observations on either side and take the average of the rest. Note that if α = 0, X̄_α = X̄, whereas as α ↑ 1/2, X̄_α → X̃. The estimates can be justified on plug-in grounds (see Problem 3.5.4). If [nα] = [(n − 1)α] and the trimmed mean of x₁, ..., x_{n−1} is zero, the sensitivity curve of an α trimmed mean is sketched in Figure 3.5.2. (The middle portion is the line y = x(1 − 2[nα]/n)⁻¹.)

Figure 3.5.2. The sensitivity curve of the trimmed mean.

Which α should we choose in the trimmed mean? There seems to be no simple answer. Intuitively we expect that if there are no gross errors, f = φ, the mean is better than any trimmed mean with α > 0 including the median. This can be verified in terms of asymptotic variances (MSEs); see Problem 5.4.1. However, the sensitivity curve calculation points to an equally intuitive conclusion: if f is symmetric about 0 but has "heavier tails" (see Problem 3.5.8) than the Gaussian density, for example, the Laplace density f(x) = (1/4)e^{−|x|/2}, or even more strikingly the Cauchy, f(x) = 1/π(1 + x²), then the trimmed means for α > 0 and even the median, which corresponds approximately to α = 1/2, can be much better than the mean, infinitely better in the case of the Cauchy (Problem 5.4.1). For instance, suppose we take as our data the differences in Table 3.1 again. The range 0.10 ≤ α ≤ 0.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the normal distribution. There has also been some research into procedures for which α is chosen using the observations. For a discussion of these and other forms of "adaptation," see Jaeckel (1971), Huber (1972), and Hogg (1974). For more sophisticated arguments see Huber (1981) and Andrews, Bickel, Hampel, Huber, Rogers, and Tukey (1972).
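The following is a minimal sketch of the α trimmed mean of (3.5.3), not from the text; it assumes numpy, and the small data set with one wild value is invented for illustration. The trimmed mean and median are barely affected by the gross error, while the mean is.

```python
# alpha-trimmed mean: drop the [n*alpha] smallest and largest observations, average the rest.
import numpy as np

def trimmed_mean(x, alpha):
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * alpha)                 # [n * alpha]
    return x[k:len(x) - k].mean()

x = [0.2, -0.4, 0.1, 0.7, -0.3, 0.0, 25.0]  # one gross error
print(np.mean(x), trimmed_mean(x, 0.2), np.median(x))
```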
Gross errors or outlying data points affect estimates in a variety of situations; other examples will be given in the problems. We next consider two estimates of the spread in the population, as well as estimates of quantiles.

Example 3.5.1. Spread. Let θ(P) = Var(X) = σ² denote the variance in a population and let X₁, ..., X_n denote a sample from that population. If we are interested in the spread of the values in a population, then the variance σ² or standard deviation σ is typically used. The empirical plug-in estimate of σ² is σ̂²_n = n⁻¹ Σᵢ (Xᵢ − X̄)². To simplify our expression we shift the horizontal axis so that Σ_{i=1}^{n−1} xᵢ = 0, and write x̄_n = n⁻¹ Σ_{i=1}^n xᵢ = n⁻¹x. Then

SC(x, σ̂²) ≈ x² − σ²,    (3.5.4)

where the approximation is valid for x fixed, n → ∞ (Problem 3.5.10). It is clear that σ̂²_n is very sensitive to large outlying |x| values.

A fairly common quick and simple alternative is the IQR (interquartile range), defined as τ = x_{.75} − x_{.25}, where x_α has 100α percent of the values in the population on its left (formally, x_α is an αth quantile); x_{.75} and x_{.25} are called the upper and lower quartiles. The IQR is often calibrated so that it equals σ in the N(μ, σ²) model. Because τ = 2 × (0.674)σ, the scale measure used is σ̂ = 0.742(x̂_{.75} − x̂_{.25}). □
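A brief numerical sketch of this calibrated scale estimate follows; it is not from the text, assumes numpy, and the simulated normal sample, the injected wild value, and the seed are arbitrary. With one gross error, the usual standard deviation is badly inflated while 0.742 × IQR barely moves.

```python
# Compare the IQR-based scale estimate 0.742 (x_.75 - x_.25), calibrated to the
# N(mu, sigma^2) model, with the sample standard deviation, before and after a gross error.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 2, 99)
x_bad = np.append(x, 100.0)                       # one wild observation
for data in (x, x_bad):
    q25, q75 = np.quantile(data, [0.25, 0.75])
    print(round(0.742 * (q75 - q25), 2), round(data.std(ddof=1), 2))
```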
o Remark 3. We will return to the influence function in Volume II. 0.2. We discuss briefly nondecision theoretic considerations for selecting procedures including interpretability. Summary.F) where = limIF.15) I[t < xl). i) = SC(x.5. and Stabel (1983). ~ . for 2 Measures of Performance Chapter 3 < k < n .. median. An exposition of this point of view and some of the earlier procedures proposed is in Hampel. Discussion.75 X. 1'75) . Other aspects of robustness. and other procedures. The sensitivity of the parameter B(F) to x can be measured by the influence function. although this difficulty appears to be being overcome lately. xa: is not sensitive to outlying x's. 1'. which is defined by IF(x. trimmed mean.F) dO I. ~ ~ ~ Next consider the sample lQR T=X.25· Then we can write SC(x. Ii I~ . .25) and the sample IQR is robust with respect to outlying gross errors x.) . F) and~:z. It is easy to see ~ ! '. = <1[0((1  <)F + <t. r: ~i II i . ..Xnl. discussing the difficult issues of identifiability.196 thus.: : where F n .• .(x..5.SC(x.5. Unfortunately these procedures tend to be extremely demanding computationally. ~' is the distribution function of point mass at x (. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean.6. have been studied extensively and a number of procedures proposed and implemented.. Ronchetti.O(F)] = .O. SC(x. x (t) that (Problem 3. in particular the breakdown point. and computability. Most of the section focuses on robustness.1. Rousseuw.5) 1 [xlk+11 _ xlkl] x> xlk+1) ' Clearly. It plays an important role in functional expansions of estimates. 1'0) ~[Xlkl) _ xlk)] x < XlkI) 2 '  2 2 1 Ix _ xlkl] XlkI) < x < xlk+11 '  (3.(x.1 denotes the empirical distribution based on Xl.. .O.
1 X n is a N(B. In the Bernoulli Problem 3. change the loss function to 1(0. 71') as .2 with unifonn prior on the probabilility of success O. Consider the relative risk e( J. respectively. 0) the Bayes rule is where fo(x. 0) is the quadratic loss (0 . (72) sample and 71" is the improper prior 71"(0) = 1. then the improper Bayes rule for squared error loss is 6"* (x) = X. E = R. Suppose IJ ~ 71'(0). 6...6 Problems and Complements 197 3. Check whether q(OB) = E(q(IJ) I x). density.2. x) where c(x) =.0). 71') = R(7f) /r( 71'. J).r 7f(0)p(x I O)dO. Let X I. 2.0) and give the Bayes estimate of q(O).3. In Problem 3. lf we put a (improper) uniform prior on A.2. ~ ~ 4.Section 3.4. Suppose 1(0. give the MLE of the Bernoulli variance q(0) = 0(1 . . which is called the odds ratio (for success).O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. . (X I 0 = 0) ~ p(x I 0).O)P. we found that (S + 1)/(n + 2) is the Bayes rule.2 preceeding. (c) In Example 3.O) = p(x I 0)71'(0) = c(x)7f(0 .0)/0(0).4. where R( 71') is the Bayes risk.2. where OB is the Bayes estimate of O.2 o e 1. J) ofJ(x) = X in Example 3. That is. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior. In some studies (see Section 6. Find the Bayes risk r(7f.0 E e. a(O) > 0. 0) = (0 . Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin..6 PROBLEMS AND COMPLEMENTS Problems for Section 3. Hint: See Problem 1.24.. Show that (h) Let 1(0.3). the Bayes rule does not change.a)' jBQ(1. Compute the limit of e( J. the parameter A = 01(1 .2. (a) Show that the joint density of X and 0 is f(x. under what condition on S does the Bayes rule exist and what is the Bayes rule? 5.1.0)' and that the prinf7r(O) is the beta. (3(r. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.Xn be the indicators of n Bernoulli trials with success probability O. is preferred to 0. = (0 o)'lw(O) for some weight function w(O) > 0. 3. if 71' and 1 are changed to 0(0)71'(0) and 1(0. Show that if Xl. s). .
a = (a" . Var(Oj) = aj(ao .\(B) = I(B.• N. (b) Problem 1. bioequivalent.. . do not derive them..O'5).) (b) When the loss function is I(B. 8.9j ) = aiO:'j/a5(aa + 1). i: agency specifies a number f > a such that if f) E (E.3.3. 1). . Note that ). Set a{::} Bioequivalent 1 {::} Not Bioequivalent . B = (B" .) with loss function I( B.. 00. where is known.. then E(O)) = aj/ao. Find the Bayes decision rule. 7.. (a) If I(B.=l CjUj . There are two possible actions: 0'5 a a with losses I( B. a) = Lj~. " X n of differences in the effect of generic and namebrand effects fora certain drug.19(c). Hint: If 0 ~ D(a). A regulatory " ! I i· . .. where ao = L:J=l Qj.. . to a close approximation. . and Cov{8j . For the following problems.)T..  I . E). equivalent to a namebrand drug. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x). given 0 = B are multinomial M(n. where E(X) = O.ft B be the difference in mean effect of the generic and namebrand drugs.exp {  2~' B' } . (a) Problem 1. (c) We want to estimate the vector (B" .2(d)(i) and (ii). (Use these results.3. . where CI.. (. I ~' . I) = difference in loss of acceptance and rejection of bioequivalence.. a) = [q(B)a]'. E).. Bioequivalence trials are used to test whether a generic drug is. Xl. (e) a 2 + 00. 76) distribution.198 (a) T + 00.Xu are Li. 0) and I( B.I(d). find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. by definition. Assume that given f). a) = (q(B) .(0) should be negative when 0 E (E. B. B . E) and positive when f) 1998) is 1.. c2 > 0 I. a r )T. Measures of Performance Chapter 3 (b) n . . N{O. .d. Suppose tbat N lo . compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. (Bj aj)2. • 9.2.O)  I(B. I j l .Xn ) we want to decide whether or not 0 E (e.B. Cr are given constants.\(B) = r . B). (E. Suppose we have a sample Xl. and that 9 is random with aN(rlO. One such function (Lindley. . then the generic and brandname drugs are..a)' / nj~. (c) Problem 1. defined in Problem 1.E).15.aj)/a5(ao + 1).. Let q( 0) = L:. . and that 0 has the Dirichlet distribution D( a). Let 0 = ftc . On the basis of X = (X I. . .
Hint: See Example 3.6 Problems and Complements 199 where 0 < r < 1.y = (YI.17).. {} (Xo.16) and (3. y). A point (xo. Yo) is in the interior of S x T.Xm).6.2. 1) are not constant. S T Suppose S and T are subsets of Rm. (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00")..3 1... Any two functions with difference "\(8) are possible loss functions at a = 0 and I. and 9 is twice differentiable. (Xv.Yp). . Note that .Yo) = {} (Xo. find (a) the linear Bayes estimate of ~l." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). (a) Show that a necessary condition for (xo.) + ~}" .. + = 0 and it 00"). . (b) the linear Bayes estimate of fl. representing x = (Xl. Con 10.Yo) Xi {}g {}g Yj = 0.) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). v) > ?r/(l ?r) is equivalent to T > t. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. . 0) and l((). S x T ~ R. (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3. Discuss the preceding decision rule for this "prior.2.3.2. . 0. Yo) = sup g(x.2 show that L(x.1) < (T6(n) + c'){log(rg(~)+. Suppose 9 .Section 3. .6. Yo) = inf g(xo. Yo) to be a saddle point is that... For the model defined by (3.1. yo) is a saddle point of 9 if g(xo..\(±€. In Example 3. respectively. RP. 2.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3.
=1 CijXiYj with XE 8 m .. the conclusion of von Neumann's theorem holds. I}.n).1 < i < m. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is. Let Lx (Bo. . (b) Show that limn_ooIR(O.Xn ) . B1 ) Ip(X. a) ~ (0 .d..i) =0.1(0.i.j)=Wij>O.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo. Hint: See Problem 3. I(Bi.0•• ). • " = '!....andg(x.<7') and 1(<7'. 0») equals 1 when B = ~..j=O. f. n: I>l is minimax. i.a. 1 <j. . B ) and suppose that Lx (B o.0)." 1 Xi ~ l}.(X) = 1 if Lx (B o. Bd. Suppose e = {Bo. 5.2. and o'(S) = (S + 1 2 . Thus. Yc Yd foraHI < i. L:. Let S ~ 8(n. y ESp._ I)'. Suppose i I(Bi.l. ..:.. B ) ~ p(X. A = {O. • . the test rule 1511" given by o.= 1 . Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo.y) ~ L~l 12. and show that this limit ( .2. and 1 . X n be i.o•• ) =R(B1 . 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = .d< p. I • > 1 for 0 i ~.c.)w". f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta. prior. fJ( . d) = (a) Show that if I' is known to be 0 (!..n12).b < m. (b) Suppose Sm = (x: Xi > 0. N(I'. B1 ) >  (l. and that the mndel is regnlar.. I j 3. Yo .PIB = Bo]. 4. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g.a)'. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. o(S) = X = Sin.n)/(n + . iij. Let X" .. .n12.thesimplex.Yo) > 0 8 8 8 X a 8 Xb 0. . 2 J'(X1 . o')IR(O. &* is minimax.
Section 3_6 Problems and Complements 201 (b) If I' ~ 0.tj..~~n X < Pi < < R(I'. . d = (d). Let k .29). PI.. M (n.. 1 < j < k. X is no longer unique minimax. that both the MLE and the estimate 8' = (n . Rk) where Rj is the rank of Xi.d) where qj ~ 1 . . ik) is smaller than that of (i'l"'" i~). . 00. Then X is nurnmax. . Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l . prior. Show that X has a Poisson (. 9. .2. jl~ < . Hint: Consider the gamma.. dk)T.. See Volume II... . . 1 = t j=l (di .» = Show that the minimax rule is to take L l. ik is an arbitrary unknown permutation of 1..a? 1. j 7. . . Remark: Stein (1956) has shown that if k > 3. show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. . 0 1. h were t·. . < /12 is a known set of values.).. .k. . . . 1).j.. Jlk. Write X  • "i)' 1(1'. i=l then o(X) = X is minimax. .).. X k be independent with means f.. . . respectively. .Pi)' Pjqj <j < k. 6..X)' is best among all rules of the form oc(X) = c L(Xi . b. ftk) = (Il?l'"'' J1.3. .\) distribution and 1(. b = ttl. See Problem 1. Xl· (e) Show that if I' is unknown.a < b" = 'lb.12. .. Show that if = (I'I''''.... that is. Permutations of I.. where (Ill. (c) Use (B. Let Xl.tn I(i.X)' and.\. ik).I). o(X) ~ n~l L(Xi .j. / . LetA = {(iI. ....o) for alII'.\ ... d) = L(d. . o(X 1. . . o'(X) is also minimax and R(I'. . N k ) has a multinomial. . < im..j. . .X)' are inadmissible.?. > jm).Pi. 8. (jl.) . Hint: (a) Consider a gamma prior on 0 = 1/<7'... = tj... . an d R a < R b. Xk)T.. For instance. " a.1. I' (X"". . Let Xi beindependentN(I'i.\.P. . a) = (. Show that if (N). X k ) = (R l . .. . J. f(k.k} I«i). distribution.. 1 <i< k. .I'kf.l . 0') = (1. and iI... ttl . then is minimax for the loss function r: l(p. hence.1)1 L(Xi .
. Let the loss " function be the KullbackLeibler divergence lp(B. For a given x we want to estimate the proportion F(x) of the population to the left of x. ... 10. BO l) ~ (k . LB= j 01 1...BO)T.. Let Xi(i = 1. Pk. Hint: Consider the risk function of o~ No..i. n) be i. ....i f X> .. v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) . ... X has a binomial. B).._.q)1r(B)dB. 13. 1). B(n. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q.. with unknown distribution F.=1 Show that the Bayes estimate is (Xi . ..3. . . Let K(po.. .. 1 . (b) How does the risk of these estimates compare to that of X? 12.. ~ I I .B). . Define OiflXI v'n < d < v'n d v'n d d X .. . + l)j(n + k). I: I ir'z .1'»2 of these estimates is bounded for all nand 1'. distribution.2. Suppose that given 6 = B = (B ...1r) = Show that the marginal density of X.. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B..8. See Problem 3... J K(po. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss.d. M(n.Xo)T has a multinomial.. ! p(x) = J po(x)1r(B)dB. a) is the posterior mean E(6 I X). X n be independent N(I'..d with density defined in Problem 1.I)!. Suppose that given 6 ~ B. X = (X"". 14. of Xi < X • 1 + 1 o.. B j > 0. distribution. d X+ifX 11. See also Problem 3.15.. I: • .202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. ..4.2..a) and let the prior be the uniform prior 1r(B" .. Let X" .
Section 3. .4. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations. 0) cases.. X n be the indicators of n Bernoulli trials with success probability B. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient. a) is convex and J'(X) = E( J(X) I I(X». . Show that X is an UMVU estimate of O. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. Show that in theN(O. We shall say a loss function is convex. . S. J). Let A = 3.6 Problems and Complements 203 minimizes k( q. 0) and B(n. 4.4.1 . K) . . then That is. for any ao. the Fisher information lower bound is equivariant.2) with I' . N(I".x =. Reparametrize the model by setting ry = h(O) and let q(x. p( x. Show that (a) (b) cr5 = n.). ao) + (1 . if I(O. (ry»).O)!. 0. ry). respectively. N(I"o. . a < a < 1. then R(O. 7f) and that the minimum is I(J. . Problems for Section 3. {log PB(X)}] K(O)d(J. . . Show that B q(ry) = Bp(h. Jeffrey's priors are proportional to 1. and O~ (1 . Suppose X I. ry) = p(x..X is called the mutual information between 0 and X. a..k(p. Let X ..~).4.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. Equivariance. that is. Show that if 1(0. J') < R( (J. h1(ry)) denote the model in the new parametrization. Give the Bayes rules for squared error in these three cases. p(X) IO.'? /1 as in (3. X n are Ll. R. It is often improper. Fisher infonnation is not equivariant under increasing transformations of the parameter.d. 15." A density proportional to VJp(O) is called Jeffrey's prior. . Let X I. 0) with 0 E e c R. (b) Equivariance of the Fisher Information Bound. . Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable.12) for the two parametrizations p(x.a )I( 0.4 1. then E(g(X) > g(E(X».1 E: 1 (Xi  J.alai) < al(O. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). 2. . Prove Proposition 3.aao + (1 ..f [E. 0) and q(x. Hint: k(q.. a" (J. Jeffrey's "Prior. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality.1"0 known..
9. Find lim n. a ~ (c) Suppose that Zi = log[i/(n + 1)1. Show that if (Xl. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic. . .XN }. Let X" .3. Suppose Yl . . Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1). .o.. Show that . For instance. then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· .1. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered.4. .. distribution. 0) for any set E. X n be a sample from the beta. Suppose (J is UMVU for estimating fJ. i = 1..\ = a + bB is UMVU for estimating . .1 and variance 0. • 13. . n. Hint: Consider T(X) = X in Theorem 3. 7. ..4. ( 2). 00.p) is an unbiased estimate of (72 in the linear regression model of Section 2. 204 Hint: See Problem 3.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. Does X achieve the infonnation .11(9) as n ~ give the limit of n times the lower bound on the variances of and (J..5(b).2. 14. P. . I I . Show that a density that minimizes the Fisher infonnation over F is f(x.2 (0 > 0) that satisfy the conditions of the information inequality.' .P. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0.ZDf3)T(y . . . Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji.4. Ct.{x : p(x. find the bias o f ao' 6.8. 0) = Oc"I(x > 0). {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. Show that 8 = (Y . Let F denote the class of densities with mean 0. In Example 3. . Show that assumption I implies that if A . Establish the claims of Example 3.ZDf3)/(n ..4. X n ) is a sample drawn without replacement from an unknown finite population {Xl. =f. 8. Hint: Use the integral approximation to sums... and . a ~ i 12. compute Var(O) using each ofthe three methods indicated. B(O. then = 1 forallO..Ii . . Y n in twoparameter canonical exponential form and (b) Let 0 = (a.\ = a + bOo 11. . 2 ~ ~ 10. 1).13 E R.. (a) Find the MLE of 1/0. Let a and b be constants. ..
Let 7fk = ~ and suppose 7fk = 1 < k < K. P[o(X) = 91 = 1..} and X =K  1". Define ~ {Xkl. More generally only polynomials of degree n in p are unbiasedly estimable. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. . Show that X k given by (3. : 0 E for (J such that E((J2) < 00. . 'L~ 1 h = N.~ for all k. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk..3. 16.Section 3. .4).. k=l K ~1rkXk. . that ~ is a Bayes estimate for O.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ . 20. . (b) Deduce that if p. K. 18. 8). distribution. . (See also Problem 1. B(n.4.~).Xkl. UN are as in Example 3. then E( M) ~ L 1r j=1 N J ~ n.X)/Var(U).. even for sampling without replacement in each stratum.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. B(n. 15.) Suppose the Uj can be relabeled into strata {xkd. 17. Show that if M is the expected sample size.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n..4.. Let X have a binomial. (c) Explain how it is possible if Po is binomial.. Stratified Sampling. (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl. = N(O.. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o.. G 19. X is not II Bayes estimate for any prior 1r. 1 < i < h. .p). Show that is not unbiasedly estimable. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population.". Suppose UI. XK.• .6 Problems and Complements 205 (b) The variance of X is given by (3. . :/i.4. Suppose X is distributed accordihg to {p.l.4. if and only if. = 1. k = 1. 7.
however. Show that the sample median X is an empirical plugin estimate of the population median v. B) be the uniform distribution on (0. Show that. Hint: It is equivalent to show that. 5.9)' 21. If n IQR."" F (ij) Vat (:0 logp(X. is an empirical plugin estimate of ~ Here case. If a = 0. Note that logp(x. and we can thus define moments of 8/8B log p(x..3. for all X!. Yet show eand has finite variance. . for all adx 1. give and plot the sensitivity curves of the lower quartile X.25 and (n . that is. 22. .75' and the IQR. T .4.l)a is an integer.206 Hint: Given E(ii(X) 19) ~ 9. with probability I for each B. give and plot the sensitivity curve of the median.4.. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . If a = 0. Show that the a trimmed mean XCII. Regularity Conditions are Needed for the Information Inequality. An estimate J(X) is said to be shift or translation equivariant if. (iii) 2X is unbiased for .25 and net =k 3. Var(a T 6) > aT (¢(9)II(9). Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3..5) to plot the sensitivity curve of the 1. 1 X n1 C.25. use (3. B») ~ °and the information bound is infinite.5. Let X ~ U(O. 1 2. B) is differentiable fat anB > x. Prove Theorem 3. • Problems for Section 35 I is an integer./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]. 4. B). = 2k is even.4. B). the upper quartile X. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6.
(b) Show that   . It has the advantage that there is no trimming proportion Q that needs to be subjectively specified.. xH L is translation equivariant and antisymmetric.1. F(x . X n is a sample from a population with dJ. is symmetrically distributed about O. '& is an estimate of scale. k t 0 to the median. . (7= moo 1 IX.03.67.3. .6.) ~ 8. (See Problem 3.30. In . Its properties are similar to those of the trimmed mean.  XI/0.. . X are unbiased estimates of the center of symmetry of a symmetric distribution. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj). X a arc translation equivariant and antisymmetric.(a) Show that X. trimmed mean with a = 1/4. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. X.. i < j. and the HodgesLehmann estimate. and xiflxl < k kifx > k kifx<k. J is an unbiased estimate of 11. For x > .03 (these are expected values of four N(O. plot the sensitivity curves of the mean.. .5 and for (j is. then (i.Section 3. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1.. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite. (b) Suppose Xl. median. . 7. One reasonable choice for k is k = 1. . I)order statistics). .JL) where JL is unknown and Xi .. ..30. Xu.1..5. <i< n Show that (a) k = 00 corresponds to X. Deduce that X.e.).6 Problems and Complements 207 It is antisymmetric if for all Xl.
. 1 Xnl to have sample mean zero. Laplace. . and is fixed. 00. X 12. . and Cauchy distributions. Xk) is a finite constant.O)/ao) where fo(x) for Ixl for Ixl <k > k. Suppose L:~i Xi ~ . ii. .x}2. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr..25. •• . 11." . with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O.d. Let X be a random variable with continuous distribution function F.7(a).) are two densities with medians v zero and identical scale parameters 7. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10.Xn arei. (d) Xk is translation equivariant and antisymmetric (see Problem 3. In what follows adjust f and 9 to have v = 0 and 'T = 1. then limlxj>00 SC(x. ..208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0. . = O.. This problem may be done on the computer. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00.1)1 L:~ 1(Xi . Location Parameters.5. we will use the IQR scale parameter T = X. Use a fixed known 0"0 in place of Ci. P(IXI > 3) and P(IXI > 4) for the nonnal. In the case of the Cauchy density. and "" "'" (b) the Iratio of Problem 3. Let 1'0 = 0 and choose the ideal sample Xl. The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \. 9. If f(·) and g(. [.i.) has heavier tails than f() if g(x) is above f(x) for Ixllarge. (e) If k < 00.5. Show that SC(x. Let JJo be a hypothesized mean for a certain population.X. the standard deviation does not exist. plot the sensitivity curve of (a) iTn . For the ideal sample of Problem 3.~ !~ t .5. 00. thus. The (student) tratio is defined as 1= v'n(x I'o)/s. where S2 = (n .75 .6). (b) Find thetml probabilities P(IXI > 2). .. thenXk is the MLEofB when Xl. we say that g(.• 13. with density fo((.11. n is fixed.
and sample trimmed mean are shift and scale equivariant.(F): 0 <" < 1/2}. F) lim n _ oo SC(x. X is said to be stochastically smaller than Y if = G(t) for all t E R. 8) and ~ IF(x.Section 3. . SC(a + bx. the ~ = bdSC(x.. i/(F)] is the value of some location parameter. (jx..67 and tPk is defined in Problem 3. Hint: For the second part. 8.).xn ).(F) = !(xa + XI"). (a) Show that if F is symmetric about c and () is a location parameter.xn . 0 < " < 1. (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. ~ (a) Show that the sample mean. . .a +bxn ) ~ a + b8n (XI. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. . median v.. xnd.5.].) 14. In this case we write X " Y. ~ (b) Write the Be as SC(x.. where T is the median of the distributioo of IX location parameter.8.1: (a) Show that SC(x. An estimate ()n is said to be shift and scale equivariant if for all xl. . ([v(F). x n _. and note thatH(xi/(F» < F(x) < H(xv(F». F(t) ~ P(X < t) > P(Y < t) antisymmctric. a + bx. n (b) In the following cases._. 8) = IF~ (x. and trimmed population mean JiOl (see Problem 3. if F is also strictly increasing. c + dO... a. let H(x) be the distribution function whose inverse is H'(a) = ![xax.. sample median. ._.F). c E R. · . Also note that H(x) is symmetric about zero.. then v(F) and i/( F) are location parameters.:. (b) Show that the mean Ji.. 8.(k) is a (d) For 0 < " < 1. any point in [v(F). Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F). . ~ = d> 0. compare SC(x. and ordcr preserving.t )) = 0.. If 0 is scale and shift equivariant.8.O. In Remark 3. it is called a location parameter.5) are location parameters.Xn_l) to show its dependence on Xnl (Xl.. ~ se is shift invariant and scale equivariant.  vl/O. then for a E R.5..6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :.) That is.t. ~ ~ 8n (a +bxI. ~ ~ 15. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y. .5. then ()(F) = c. i/(F)] and. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). b > 0. b > 0. Show that if () is shift and location equivariant. Let Y denote a random variable with continuous distribution function G. 8. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. let v" = v.. Fn _ I ). Show that J1.
) (a) Show by example that for suitable t/J and 1(1(0) . This is. · .' > 0.I F(x. (ii) O(F) = (iii) e(F) "7. we shall • I' . consequently. {BU)} do not converge. A. . (b) ~ ~ . in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0. also true of the method of coordinate ascent. ~ I(x  = X n . and (iii) preceding? ~ 16. Assume that F is strictly increasing. 0.4.7 NOTES Note for Section 3. ~ ~ 17. B E B. The NewtonRaphson method in this case is .field generated by SA. (b) If no assumptions are made about h. and we seek the unique solution (j of 'IjJ(8) = O.8) . (1) The result of Theorem 3. . 81" < 0. In the gross error model (3..210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J . show that (a) If h is a density that is symmetric about zero. Because priority of discovery is now given to the French mathematician M.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets). 3. IlFfdF(x). I : .1) Hint: (a) Try t/J(x) = Alogx with A> 1.t is identifiable.BI < C1(1(i. (ii). C < 00. . We define S as the . Frechet. I I Notes for Section 3.1 is commonly known as the Cramer~Rao inequality. . then J. (e) Does n j [SC(x. is not identifiable.4 I . (b) Show that there exists. • • . then fJ.rdF(l:). we in general must take on the order of log ~ steps..B ~ {F E :F : Pp(A) E B}.0 > 0 (depending on t/J) such that if lOCO) .OJ then l(I(i) .5. Let d = 1 and suppose that 'I/J is twice continuously differentiable.2). 18. Show that in the bisection method. F)! ~ 0 in the cases (i). • .1/. . where B is the class of Borel sets.Bilarge enough.
Dispersion. J.(T(X)) = 00.8 References 211 follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement. TUKEY.4. Location. 3. ANDERSON. T. 11391158 (1976). W. "Unbiased Estimation in Convex. 1994.. BICKEL. H.. J. M. Families. AND A.." Ann. "Measures of Location and AsymmeUy. MA: AddisonWesley. 10381044 (l975a). F. D. 1998. LEHMANN.. BAXTER. ROGERS.. . F. (2) Note that this inequality is true but uninteresting if f(O) = 00 (and 1/J'(0) is finite) or if Var.. >. 2nd ed. WALLACE.)dXd>." Ann. H. BICKEL. "Descriptive Statistics for Nonparametric Models. BOHLMANN. SMITH. A.. Mathematical Analysis.. HUBER. P. Jroo T(x) [~p(X' 0)] dx roo Joo 8(J 00 00 00 for all (J whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in (4) The finiteness of Var8(T(X)) and f(O) imply that 1/J'(0) is finite by the covariance interpretation given in (3. AND N.. 3. P. Ma. DE GROOT. 1969. D. BERNARDO. J.Section 3. 1972. Jroo T(x) :>. OLIVER.. 3. BICKEL. M. Math. "Descriptive Statistics for Nonparametric Models. 1974. J..10451069 (1975b).. Statistical Decision Theory and Bayesian Analysis New York: Springer. AND E. L. 1. A. R. "Descriptive Statistics for Nonparametric Models.8 REFERENCES ANDREWS. Bayesian Theory New York: Wiley. III. AND C. LEHMANN. APoSTOL. ofStatist. NJ: Princeton University Press. 1985. Statist. AND J. DOWE. J. DoKSUM. 1974. DAHLQUIST. LEHMANN. 1122 (1975)." Ann.cific Asian Conference on Knowledge Discovery and Data Mintng Melbourne: SpringerVerlag. BICKEL. 2... P. P. M. 15231535 (1969).] = Jroo.. 4. F. AND E." Scand. R. Reading. Numen'cal Analysis New York: Prentice Hall. AND E. Statist. P. Introduction. Optimal Statistical Decisions New York: McGrawHill. (3) The continuity of the first integral ensures that : (J [rO )00 . HAMPEL. LEHMANN. 40. A. 1970." Ann. Robust Estimates of Location: Sun>ey and Advances Princeton. BICKEL. AND E.. BJORK. in Proceedings of the Second Pa. M. II.8). S. G... BERGER. W. I. 0. Point Estimation Using the KullbackLeibler Loss Function and MML..thematical Methods in Risk Theory Heidelberg: Springer Verlag. K. Statist. J. P. p(x. H. Statist.
. Robust Statistics: The Approach Based on Influence Functions New York: J. "Robust Estimates of Location.. Y. j I JEFFREYS. ROUSSEUW. Wiley & Sons. 1954. 197206 (1956). Stu/ist... LEHMANN. New York: Springer. 1948. Part I: Probability. 69. Actuarial J. StatiSl. London. Wiley & Sons. P. Programming. "On the Attainment of the CramerRao Lower Bound. SAVAGE." Ann. Introduction to Probability and Statistics from a Bayesian Point of View. SHIBATA.V. "Robust Statistics: A Review. "The Influence Curve and Its Role in Robust Estimation. D. . AND G." Statistica Sinica. "Boostrap Estimate of KullbackLeibler Information for Model Selection.. Part II: Inference. 1965. 1. Assoc. D. MA: AddisonWesley.. RoyalStatist. 1998. B.212 Measures of Performance Chapter 3 HAMPEL. "Model Selection and the Principle of Mimimum Description Length.. HAMPEL. TuKEY... 49... 1959... and Probability. H. 2nd ed. The Foundations afStatistics New York: J. R." J. . MA: AddisonWesley. London: Oxford University Press. 69. Royal Statist. 7. LEHMANN.10411067 (1972).. Robust Statistics New York: Wiley.. WALLACE. STAHEL. 2nd ed. • NORBERG. 43. YU. Amer.. Math.. 49. Theory ofPoint Estimation. R. STEIN. I.. "Estimation and Inference by Compact Coding (With Discussions)." J. 240251 (1987). HANSEN. Math. 42. Assoc. 909927 (1974). J. "Adaptive Robust Procedures. 383393 (1974)." Ann. L. HUBER.222 (1986). "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution. 1986. AND B. L.. University of California Press. WUSMAN. Soc. RONCHEro. and Economics Reading. StrJtist. 1972. (2000). . M.. S. 538542(1973).. Mathematical Methods and Theory in Games. 1. J.. "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification. C." Ann. H. Stlltist. Amer. R. P. Assoc. S. W. AND P." J.." Proc. 1986. Soc. Amer. P. KARLIN. • • 1 ." Scand. Statist. L. CASELLA. . Testing Statistical Hypotheses New York: Springer. "Stochastic Complexity (With Discussions):' J. E. Statist. E. Stall'. 13. JAECKEL.n. A. RISSANEN. 10201034 (1971).375394 (1997). F. B. Exploratory Data Analysis Reading. Cambridge University Press. 1981... R. 136141 (1998). E." J. FREEMAN.. Third Berkeley Symposium on Math. LINDLEY. C. A.. E. HUBER. 223239 (1987). 2Q4.." Statistical Science. L. "Decision Analysis and BioequivaIence Trials. Math. AND W. HOGG. R. LINDLEY. Theory ofProbability.
Chapter 4

TESTING AND CONFIDENCE REGIONS: BASIC THEORY

4.1 INTRODUCTION

In Sections 1.3, 3.2, and 3.3 we defined the testing problem abstractly, treating it as a decision theory problem in which we are to decide whether P ∈ P0 or P1 or, parametrically, whether θ ∈ Θ0 or Θ1 if Pj = {P_θ : θ ∈ Θj}, where P0, P1 or Θ0, Θ1 are a partition of the model P or, respectively, of the parameter space Θ.

Usually we are trying to get a yes or no answer to important questions in science, medicine, public policy, and indeed most human activities, and we have data providing some evidence one way or the other. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial, perform a survey, or more generally construct an experiment that yields data X taking values in a sample space X ⊂ R^q, modeled by us as having distribution P_θ, θ ∈ Θ, where Θ is partitioned into {Θ0, Θ1} corresponding, respectively, to answering "no" or "yes" to the preceding questions. This framework is natural if, as in examples such as 1.1.3, the questions are simple and the type of data to be gathered is under our control. Usually, however, the situation is less simple. The design of the experiment may not be under our control, what is an appropriate stochastic model for the data may be questionable, and what Θ0 and Θ1 correspond to in terms of the stochastic model may be unclear. Here are two examples that illustrate these issues.

Example 4.1.1. Sex Bias in Graduate Admissions at Berkeley. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. They initially tabulated N_m1, N_f1, the numbers of admitted male and female applicants, and the corresponding numbers N_m0, N_f0 of denied applicants. If n is the total number of applicants, it might be tempting to model (N_m1, N_m0, N_f1, N_f0) by a multinomial, M(n, p_m1, p_m0, p_f1, p_f0), distribution. But this model is suspect because in fact we are looking at the population of all applicants here, not a sample. Accepting this model provisionally, what does the
. . If departments "use different coins. • Pml + P/I + PmO + Pio • I . ~. .. . This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. d = 1. the number of homozygous dominants.=7xlO 5 ' n 3 . .. one of which was dominant. d = 1.. • In fact. that N AA. . The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. PfOd. . That is. Hammel. N mOd .. ~). B (n. . . I I 1] Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. and O'Connell (1975). . distribution. • .than might be expected under the hypothesis that N AA has a binomial.< .214 Testing and Confidence Regions Chapter 4 hypothesis of no sex bias correspond to? Again it is natural to translate this into P[Admit I Male] = Pm! Pml +PmO = P[Admit I Female] = P fI Pil +PjO But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability Pml Pi! Pml + PmO PIl +PiO [n fact. In these tenns the hypothesis of "no bias" can now be translated into: H: Pml Pmld PmOd Pfld Pild + + PfOd . OUf multinomial assumption now becomes N ""' M(pmld' PmOd. NIld. and so on.. In a modem formulation. The hypothesis of dominant inheritance ~.. if the inheritance ratio can be arbitrary.p) distribution.1.n 3 t I ." then the data are naturally decomposed into N = (Nm1d. m P [ NAA . Pfld. i:.. • I I . I . if there were n dominant offspring (seeds). .! . 0 I • .D. It was noted by Fisher corresponds to H : p = ~ with the alternative K : p as reported in Jeffreys (1961) that in this experiment the observed fraction ':: was much closer to 3. admissions are petfonned at the departmental level and rates of admission differ significantly from department to department. the natural model is to assume. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). Mendel's Peas.: Example 4. In one of his famous experiments laying the foundation of the quantitative theory of genetics. . for d = 1. Mendel crossed peas heterozygous for a trait with two alleles. I . where N m1d is the number of male admits to department d. as is discussed in a paper by Bickel. has a binomial (n. the same data can lead to opposite conclusions regarding these hypothesesa phenomenon called Simpson's paradox.2. NjOd. ! . either N AA cannot really be thought of as stochastic or any stochastic I . . D).. D). pml +P/I !.
That is, either N_AA cannot really be thought of as stochastic, or any stochastic model needs to permit distributions other than B(n, p), for instance (1 − ε)δ_{n/3} + εB(n, p), where 1 − ε is the probability that the assistant fudged the data and δ_{n/3} is point mass at n/3. □

What the second of these examples suggests is often the case: the set of distributions corresponding to one answer, say Θ0, is better defined than the alternative answer Θ1. In science generally a theory typically closely specifies the type of distribution P of the data X as, for instance, P = P0. If the theory is false, it is not clear what P should be, as in the preceding Mendel example. That a treatment has no effect is easier to specify than what its effect is; see, for instance, our discussion of constant treatment effect in Example 1.1.3. These considerations lead to the asymmetric formulation in which saying P ∈ P0 (θ ∈ Θ0) corresponds to acceptance of the hypothesis H: P ∈ P0, and P ∈ P1 corresponds to rejection, sometimes written as K: P ∈ P1.(1) As we have stated earlier, acceptance and rejection can be thought of as actions a = 0 or 1, and we are then led to the natural 0-1 loss l(θ, a) = 0 if θ ∈ Θ_a and 1 otherwise, and to the terminology of Section 1.3.

It is convenient to distinguish between two structural possibilities for Θ0 and Θ1: if Θ0 consists of only one point, we call Θ0 and H simple; when Θ0 contains more than one point, Θ0 and H are called composite. The same conventions apply to Θ1 and K. It will turn out that in most cases the solution to testing problems with Θ0 simple also solves the composite Θ0 problem (see Remark 4.1.1). We illustrate these ideas in the following example.

Example 4.1.3. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. Suppose we know from past experience that a fixed proportion θ0 = 0.3 of patients recover from the disease with the old drug. To investigate whether the new drug improves on this, we perform a random experiment: most simply, we sample n patients, administer the new drug, and base our decision on the observed sample X = (X1, ..., Xn), where Xi is 1 if the ith patient recovers and 0 otherwise. Thus, suppose we observe S = ΣXi, the number of recoveries among the n randomly selected patients who have been administered the new drug. Then S has a B(n, θ) distribution.(2) Our hypothesis H is the null hypothesis that the new drug does not improve on the old drug, θ ≤ θ0, and the alternative K is that it has some positive effect, θ > θ0. If we allow for the possibility that the new drug is less effective than the old, then Θ0 = [0, θ0] and H is composite; if we suppose the new drug is at least as effective as the old, then Θ0 = {θ0} and H is simple. In either case Θ1 is the interval (θ0, 1] and K is composite. In situations such as this one we shall simplify notation and write H: θ = θ0, K: θ > θ0. With Θ0 = {θ0} it is reasonable to reject H if S is "much" larger than what would be expected by chance if H is true and the value of θ is θ0. Thus, we reject H if S exceeds or equals some integer, say k, and accept H otherwise.
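For a quick feel for how the two kinds of error trade off against the cutoff k in this example, one can tabulate them directly. The following sketch assumes Python with scipy.stats; the values of n and of the alternative θ are chosen only for illustration.

    from scipy.stats import binom

    n, theta0, theta1 = 10, 0.3, 0.5           # theta1 is one illustrative alternative
    for k in range(4, 9):
        type_I = binom.sf(k - 1, n, theta0)    # P_{theta0}(S >= k)
        type_II = binom.cdf(k - 1, n, theta1)  # P_{theta1}(S < k)
        print(k, round(type_I, 4), round(type_II, 4))

Raising k lowers the probability of a type I error but raises the probability of a type II error; choosing k is a compromise between the two.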
(2) Here θ is the probability that a patient to whom the new drug is administered recovers; the population of (present and future) patients is thought of as infinite.
Formally, our critical region is C = {x : S ≥ k} and the test function or rule is δ_k(X) = 1{S ≥ k}, with

P_I = probability of type I error = P_{θ0}(S ≥ k),
P_II = probability of type II error = P_θ(S < k), θ > θ0.

The constant k that determines the critical region is called the critical value. How to choose it is the subject of this section and later chapters.

In most problems it turns out that the tests that arise naturally have the kind of structure we have just described. There is a statistic T that "tends" to be small if H is true, and large if H is false. We call T a test statistic. (Other authors consider test statistics T that tend to be small if H is false; -T would then be a test statistic in our sense.) We select a number c and our test is to calculate T(x) and then reject H if T(x) ≥ c and accept H otherwise. The value c that completes our specification is referred to as the critical value of the test. Note that a test statistic generates a family of possible tests as c varies. We will discuss the fundamental issue of how to choose T in Sections 4.2 and 4.3 and in later chapters. We now turn to the prevalent point of view on how to choose c.

The Neyman-Pearson Framework

The Neyman-Pearson approach rests on the idea that, of the two errors, one can be thought of as more important. By convention this is chosen to be the type I error, and that in turn determines what we call H and what we call K. As we noted in Examples 4.1.1 and 4.1.2, this asymmetry is often also imposed because one of Θ0, Θ1 is much better defined than its complement and/or the distribution of statistics T under Θ0 is easy to compute. In the medical setting of Example 4.1.3 the asymmetry appears reasonable. It has also been argued that, generally in science, announcing that a new phenomenon has been observed when in fact nothing has happened (the so-called null hypothesis) is more serious than missing something new that has in fact occurred. We do not find this persuasive, but if this view is accepted, it again reasonably leads to a Neyman-Pearson formulation.

There is an important class of situations in which the Neyman-Pearson framework is inappropriate: no one really believes that H is true, and possible types of alternatives are vaguely known at best. This is the case, for instance, when testing techniques are used in searching for regions of the genome that resemble other regions known to have significant biological activity. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. To determine significant values of these statistics, a (more complicated) version of the following is done. Thresholds (critical values) are set so that if the matches occur at random (i.e., matches at one position are independent of matches at other positions) and the probability of a match is 1/4, then the probability of exceeding the threshold (the type I error) is smaller than α. The Neyman-Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then, as we shall see in Sections 4.2 and 4.3, suggesting what test statistics it is best to use. In that case rejecting the hypothesis at level α is interpreted as a measure of the weight of evidence we attach to the falsity of H.
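A minimal sketch of that thresholding step is given below. The segment length and level are illustrative choices, and Python with scipy.stats is assumed; the point is only that the threshold is the smallest count whose null tail probability falls below α.

    from scipy.stats import binom

    def match_threshold(n, p=0.25, alpha=0.05):
        # Smallest c with P(matches >= c) <= alpha when matches ~ B(n, p),
        # i.e., when the aligned regions match completely at random.
        for c in range(n + 2):
            if binom.sf(c - 1, n, p) <= alpha:   # sf(c - 1) = P(X >= c)
                return c
        return None

    print(match_threshold(n=100))   # threshold for a hypothetical 100-position alignment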
Here are the elements of the Neyman-Pearson story. Begin by specifying a small number α > 0 such that probabilities of type I error greater than α are undesirable. Then restrict attention to tests that in fact have probability of rejection less than or equal to α for all θ ∈ Θ0. Such tests are said to have level (of significance) α, and we speak of rejecting H at level α. The values α = 0.01 and 0.05 are commonly used in practice. Because a test of level α is also of level α' > α, it is convenient to give a name to the smallest level of significance of a test. This quantity is called the size of the test and is the maximum probability of type I error. That is, if our test statistic is T and we use critical value c, our test has size α(c) given by

α(c) = sup{ P_θ[T(X) ≥ c] : θ ∈ Θ0 }.   (4.1.1)

Now α(c) is nonincreasing in c, and typically α(c) ↑ 1 as c ↓ −∞ and α(c) ↓ 0 as c ↑ ∞. Thus, if 0 < α < 1, there exists a unique smallest c for which α(c) ≤ α. It is referred to as the level α critical value.

The power of a test against the alternative θ is the probability of rejecting H when θ is true. Thus, if θ ∈ Θ0, β(θ, δ) is just the probability of type I error, whereas if θ ∈ Θ1, β(θ, δ) is the power against θ; the power is 1 minus the probability of type II error and can be thought of as the probability that the test will "detect" that the alternative θ holds. If Θ0 is composite, the probability of type I error is also a function of θ, and once the level or critical value is fixed, the probabilities of type II error as θ ranges over Θ1 are determined. Both the power and the probability of type I error are contained in the power function, which is defined for all θ ∈ Θ by

β(θ) = β(θ, δ) = P_θ[Rejection] = P_θ[δ(X) = 1] = P_θ[T(X) ≥ c].

In the Bayesian framework with a prior distribution on the parameter, the approach of Example 3.2.2(b) is the one to take in all cases with Θ0 and Θ1 simple. Moreover, even though there are just two actions, the Neyman-Pearson formulation is too limited in any situation in which, even nominally, we can attach numbers to the two losses that are not equal and/or depend on θ, such as the quality control Example 1.1.1.

Example 4.1.3 (continued). Here δ_k(X) = 1{S ≥ k} and

β(θ, δ_k) = P_θ(S ≥ k) = Σ_{j=k}^{n} (n choose j) θ^j (1 − θ)^{n−j}.

With θ0 = 0.3 and n = 10, we find from binomial tables the level 0.05 critical value 6, and the test δ_6 has size α(6) = P_{0.3}(S ≥ 6) = 0.0473. This is the critical value we shall use. A plot of the power function for n = 10, θ0 = 0.3, k = 6 is given in Figure 4.1.1.

[Figure 4.1.1. Power function of the level 0.05 one-sided test δ_k of H: θ = 0.3 versus K: θ > 0.3 for the B(10, θ) family of distributions, k = 6. The power is plotted as a function of θ.]

Note that in this example the power at θ = θ1 > 0.3 is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to θ1. When θ1 is 0.5, a 67% improvement, this probability is only 0.3770. What is needed to improve on this situation is a larger sample size n. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. We return to this question in Section 4.3. From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). It follows that the level and size of the test are unchanged if instead of Θ0 = {θ0} we used Θ0 = [0, θ0]. □

Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that X = (X1, ..., Xn) is a sample from a N(μ, σ²) population with σ² known. This problem arises when we want to compare two treatments, or a treatment and a control (nothing), and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of n randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then, after some time, administer the drug and record sleeping time again. Let Xi be the difference between the time slept after administration of the drug and the time slept without administration of the drug by the ith patient. If we assume X1, ..., Xn are normally distributed with mean μ and variance σ², then the drug effect is measured by μ, H is the hypothesis that the drug has no effect or is detrimental, and K is the alternative that it has some positive effect; that is, we test H: μ ≤ 0 versus K: μ > 0. (The σ² unknown case is treated later in this chapter.)
.(jo) = (~ (jo)+ where y+ = Y l(y > 0).   = .~(z). T(X(lI).nX/ (J. T(X(B)) from £0. sup{(J(p) : p < OJ ~ (J(O) = ~(c).~I.p) because ~(z) (4.Section 4. .eo) = IN~A .. has level a if £0 is continuous and (B + 1)(1 . In all of these cases.o if we have H(p.1.2 and 4..1. eo).5. .• T(X(B)).(T(X)) doesn't depend on 0 for 0 E 8 0 . #. 1) distribution. That is. which generates the same family of critical regions. The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1. < T(B+l) are the ordered T(X). In Example 4.a) quantile oftheN(O. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP.1. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property. = P. But it occurs also in more interesting situations such as testing p. It is convenient to replace X by the test statistic T(X) = ..1. .1 and Example 4. ~ estimates(jandd(~. However. The task of finding a critical value is greatly simplified if £.o versus p.( c) = C\' or e = z(a) where z(a) = z(1 . in any case.a) is an integer (Problem 4. T(X(l)). (12) observations with both parameters unknown (the t tests of Example 4.P.1 Introduction 219 Because X tends to be larger under K than under H.1. Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests.y) : YES}. £0. p ~ P[AA]. where d is the Euclidean (or some equivalent) distance and d(x.~ (c . In Example 4.2) = 1. then the test that rejects iff T(X) > T«B+l)(la)). S) inf{d(x.1..1.co . The smallest c for whieh ~(c) < C\' is obtained by setting q:.I') ~ ~ (c + V. it is clear that a reasonable test statistic is d((j. where T(l) < .co =. Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.1. This minimum distance principle is essentially what underlies Examples 4.d.9).. This occurs if 8 0 is simple as in Example 4. Rejecting for large values of this statistic is equivalent to rejecting for large values of X. the common distribution of T(X) under (j E 8 0 .a) is the (1.i. N~A is the MLE of p and d(N~A. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods.5).3. Because (J(jL) a(e) ~ is increasing.3.2.. has a closed form and is tabled. . and we have available a good estimate (j of (j. it is natural to reject H for large values of X.3.V. if we generate i.
This statistic, D_n = sup_x |F̂(x) − F0(x)|, is called the Kolmogorov statistic. It has the following distribution-free property.

Proposition 4.1.1. The distribution of D_n under H is the same for all continuous F0.

Proof. Set U_i = F0(X_i), i = 1, ..., n. Because F0 is continuous, U_1, ..., U_n are i.i.d. U(0, 1) (the probability integral transformation). Also

F̂(x) = n^{-1} Σ_{i=1}^{n} 1{X_i ≤ x} = n^{-1} Σ_{i=1}^{n} 1{F0(X_i) ≤ F0(x)} = Û(F0(x)),

where Û denotes the empirical distribution function of U_1, ..., U_n. As x ranges over R, u = F0(x) ranges over (0, 1); thus,

D_n = sup_{0 < u < 1} |Û(u) − u|,

where U denotes the U(0, 1) distribution, and the result follows. □

What is remarkable is that this null distribution is independent of which F0 we consider; in particular, P_{F0}(D_n ≤ d) = P_U(D_n ≤ d). This is again a consequence of invariance properties (Lehmann, 1997). The distribution of D_n has been thoroughly studied for finite and large n. In particular, for n > 80, close approximations to the size α critical values k_α are h_n(1.628), h_n(1.358), and h_n(1.224) for α = .01, .05, and .10, respectively, where h_n(t) = t/(√n + 0.12 + 0.11/√n). For computation, it can be shown (Problem 4.1.7) that, in terms of the ordered observed sample x_(1) < ... < x_(n),

D_n = max_{1≤i≤n} max{ i/n − F0(x_(i)), F0(x_(i)) − (i − 1)/n }.   (4.1.3)

Note that the hypothesis here is simple, so that for any one of these hypotheses F = F0 the null distribution of D_n can be simulated (or exhibited in closed form). □

Example 4.1.6. Goodness of Fit to the Gaussian Family. Suppose X_1, ..., X_n are i.i.d. as X ~ F and the hypothesis is H: F(x) = Φ((x − μ)/σ) for some μ, σ; that is, F belongs to the Gaussian family. This hypothesis is evidently composite. We can proceed as in Example 4.1.5 by rewriting H as F(μ + σx) = Φ(x) for all x, where μ = E_F(X_1) and σ² = Var_F(X_1). The natural estimate of the parameter F(μ + σx) is
But. observations Zi.X) .(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size. if the experimenter's critical value corresponds to a test of size less than the p~value.4) Considered as a statistic the pvalue is <l'( y"nX /u). .X)/ii. . .) Thus. .4. whatever be 11 and (12. from N(O. if X = x. Consider..3. where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again. 1. We do this B times independently. whereas experimenter II insists on using 0' = 0. under H. H is rejected. a satisfies T(x) = . 1). Example 4. {12 and is that of ~  (Zi ..2. thereby obtaining Tn1 .a) + IJth order statistic among Tn. .Section 4. This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x. a > <l'( T(x)). Tn!. Tn has the same distribution £0 under H. 1 < i < n. If we observe X = x = (Xl. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test.i.(iix u > z(a) or upon applying <l' to both sides if. T nB .d. ..<l'(x)1 sup IG(x) . and only if. I < i < n. ~n) doesn't depend on fl.Z) / (~ 2::7 ! (Zi  if) • .1. I). Zn arc i. . Tn sup IF'(X x x + iix) . whereas experimenter II accepts H on the basis of the same outcome x of an experiment. If the two experimenters can agree on a common test statistic T. where Z I. T nB . otherwise.01. That is.05. (4. H is not rejected.. (Sec Section 8..1 Introduction 221 (12.<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi . we would reject H if. the pvalue is <l'( T(x)) = <l' (. then computing the Tn corresponding to those Zi. and the critical value may be obtained by simulating ij. N(O. for instance. Now the Monte Carlo critical value is the I(B + 1)(1 . . Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0...d. Therefore. the joint distribution of (Dq . In)..' .. . · · . It is then possible that experimenter I rejects the hypothesis H. and only if..
6). for miu{ nOo. Various melhods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). let X be a q dimensional random vector. the largest critical value c for which we would reject is c = T(x). but K is not.1). I T r to produce p .4).. a(Tr ).1. We have proved the following. n(1 . ' I values a(T. distribution (Problem 4. to quote Fisher (1958). aCT) has a uniform. a(T) is on the unit interval and when H is simple and T has a continuous distribution.6) I • to test H. . these kinds of issues are currently being discussed under the rubric of datafusion and metaanalysis (e.<I> ( [nOo(1 ~ 0 )1' 2 s~l~nOo) 0 . and the pvalue is a( s) where s is the observed value of X. Thus. Then if we use critical value c.(S > s) '" 1. 1985). when H is well defined. .1.5).. we would reject H if. It is possible to use pvalues to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. if r experimenters use continuous test statistics T 1 . In this context.). • T = ~2 I: loga(Tj) j=l ~ ~ r (4.• see f [' Hedges and Olkin.. The normal approximation is used for the pvalue also.Oo)} > 5. The pvalue is a(T(X)). then if H is simple Fisher (1958) proposed using l j :i. U(O.. I. The statistic T has a chisquare distribution with 2n degrees of freedom (Problem 4. Suppose that we observe X = x. •• 1. For example. . a(8) = p.ath quantile of the X~n distribution. 1).g. (4. Thus. T(x) > c.. Proposition 4. The pvalue can be thought of as a standardized version of our original statistic..1. that is.5) . !: The pvalue is used extensively in situations of the type we described earlier. Thus. ~ H fJ. and only if. so that type II error considerations are unclear. . the smallest a for which we would reject corresponds to the largest c for which we would reject and is just a(T(x)).222 Testing and Confidence Regions Cnapter 4 In general. Similarly in Example 4. • i~! <:1 im _ . 80). This is in agreement with (4. More generally.1. H is rejected if T > Xla where Xl_a is the 1 . But the size of a test with critical value c is just a(c) and a(c) is decreasing in c.1. Thus.2..1..1.3. ''The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p.1. We will show that we can express the pvalue simply in tenns of the function a(·) defined in (4.
00 ) = 0. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. under H. test functions. 0) 0 p(x. type II error. 00.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. Summary.01 )/(1. (null) hypothesis H and alternative (hypothesis) K. equals 0 when both numerator and denominator vanish. This is an instance of testing goodness offit.00)ln~5 [0. 00 . or equivalently. significance level. a given test statistic T. is measured in the NeymanPearson theory. L(x.00)/00 (1 . and. Such a test and the corresponding test statistic are called most poweiful (MP).)1 5 [(1. size. and pvalue. In this section we will consider the problem of finding the level a test that ha<. type I error. In particular. where eo and e 1 are disjoint subsets of the parameter space 6. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. 4.) which is large when S true.3). we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1./00)5[(1. Typically a test statistic is not given but must be chosen on the basis of its perfonnance.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b. (4.Section 4.1. (I . by convention. The statistic L takes on the value 00 when p(x. then. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true.Otl/(1 . OIl = p (x.Oo)]n. (0. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. that is. critical regions. p(x. test statistics. For instance. power. the highest possible power. we test whether the distribution of X is different from a specified Fo. 0. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01. power function. in the binomial example (4. In Sections 3. 0) is the density or frequency function of the random vector X.0. OIl where p(x. OIl > 0. o:(Td has a U(O. In the NeymanPearson framework. We introduce the basic concepts of simple and composite hypotheses. that is. I) = EXi is large.2. and S tends to be large when K : () = 01 > 00 is . subject to this restriction.2 and 3. we try to maximize the probability (power) of rejecting H when K is true. 1) distribution.
Eo'P(X) < a. . (c) If <p is an MP level 0: test. there exists k such that (4. If 0 < 'P(x) < 1 for the observation vector x. then it must be a level a likelihood ratio test. and because 0 < 'P(x) < 1. Bo) = O} = I + II (say).3 with n = 10 and Bo = 0.B.'P(X)] > kEo['PdX) .2. Proof. Note that a > 0 implies k < 00 and.4) 1 j 7 . and 'P(x) = [0.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x. Theorem 4. we consider randomized tests '1'. the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads.1) if equality occurs.2. We show that in addition to being Bayes optimal.2.['Pk(X) . where 1= EO{CPk(X)[L(X. They are only used to show that with randomization. that is. 'Pk is MP for level E'c'PdX). B . 1. we choose 'P(x) = 0 if S < 5. i = 0.Bd >k <k with 'Pdx) any value in (0. Bo) = O. B .2. Note (Section 3.3). and suppose r. where 7l' denotes the prior probability of {Bo}.'P(X)[L(X.kEo['Pk(X) . 0 < <p(x) < 1.. (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted.p is a level a test. To this end consider (4.2) forB = BoandB = B.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 . (a) Let E i denote E8. Such randomized tests are not used in practice.Bo. then <{Jk is MP in the class oflevel a tests.) For instance. likelihood ratio tests are unbeatable no matter what the size a is.3.7l'). Because we want results valid for all possible test sizes Cl' in [0.05 in Example 4.0262 if S = 5. BI )  .) . we have shown that I j I i E.1]. It follows that II > O.3.. if want size a ~ .1.'P(X)] a. i' ~ E.P(S > 5)I!P(S = 5) = .[CPk(X) . lOY some x.kJ}.1).) .'P(X)] .k is < 0 or > 0 according as 'Pk(X) is 0 or 1. o Because L(x.'P(X)] [:~~:::l. k] .3) 1 : > O. 'Pk(X) = 1 if p(x. (See also Section 1. using (4. (4.05 . B. EO'PdX) We want to show E. If L(x.1. (a) If a > 0 and I{Jk is a size a likelihood ratio test.k] + E'['Pk(X) .Bo. Finally. Bo. BIl o .('Pk(X) . 'P(x) = 1 if S > 5.'P(X)] 1{P(X.'P(X)] = Eo['Pk{X) . then I > O.'P(X)] > O. which are tests that may take values in (0.. (NeymanPearson Lemma). thns.2. then ~ .
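The randomized construction for Example 4.1.3 with n = 10, θ0 = 0.3 and target size α = .05 can be checked numerically. This is only a sketch of the arithmetic described above, with Python and scipy.stats assumed rather than part of the text.

    from scipy.stats import binom

    n, theta0, alpha = 10, 0.3, 0.05
    p6 = binom.sf(5, n, theta0)                      # P(S >= 6), size of the nonrandomized test, about .047
    gamma = (alpha - p6) / binom.pmf(5, n, theta0)   # probability of rejecting when S = 5, about .026
    print(p6, gamma, p6 + gamma * binom.pmf(5, n, theta0))   # the last value is exactly .05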
. (c) Let x E {x: p(x. 0. Remark 4.3.Po[L(X. OJ. Now 'Pk is MP size a.5) thatthis 0.) > k] Po[L(X.) = k] = 0. = 'Pk· Moreover.Oo.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x./f 'P is an MP level a test. denotes the Bayes procedure of Exarnple 3..2.OI)' Proof. 'P(X) > a with equality iffp(" 60 ) = P(·. 60 . It follows that (4. Example 4.5) If 0.) = k j.Oo. then 'Pk is MP size a. Consider Example 3. I x) = (1 1T)p(X. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level. = v.2. decides 0.pk is MP size a. See Problem 4. Next consider 0 < Q: < 1.1. where v is a known signal.) (1 .fit [ logL(X. define a.O. when 1T = k/(k + 1).2. 00 .Section 4. 'Pk(X) = 1 and !. tben. 0.I. then there exists k < 00 such that Po[L(X. 00 . 00 . OJ! = ooJ ~ 0.10).2. k = 00 makes 'Pk MP size a.2. that is. If a = 1.) + 1T' (4.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x. Therefore. OJ! > k] > If Po[L(X.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0.) > then to have equality in (4.) < k. then E9. we conc1udefrom (4..u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v.0.2(b). 00 .O. 0. 00 . 00 . 0. is 1T(O. We found nv2} V n L(X. also be easily argued from this Bayes property of 'Pk (Problem 4. OJ! > k] < a and Po[T. 0. If not. The same argument works for x E {x : p(x. Because Po[L(X.1. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. 0. .2.2. Then the posterior probability of (). 'Pk(X) =a . X n ) is a sample of n N(j. Here is an example illustrating calculation of the most powerful level a test <{Jk. 0. 00 ) (11T)L(x.2. 0. 0.1. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2.2.) (1 .L. 00 .. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it.2. for 0 < a < 1.2) holds forO = 0..) .O. 0.) + 1Tp(X..1T)L(x. .7. Let Pi denote Po" i = 0. 2 (7 T(X) ~.fit.(X. 00 . k = 0 makes E.v) + nv ] 2(72 x . OJ Corollary 4.k] on the set {x : L(x.2 where X = (XI. 00 ) > and 0 = 00 . Part (a) of the lemma can.1T)p(x.v)=exp 2LX'2 .
. It is used in a classification context in which 9 0 . .6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O. Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl .4. The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on . T>z(la) (4. > 0 and E l = Eo. if ito. The following important example illustrates. • • . 6.2. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction. In this example we bave assnmed that (Jo and (J. then. itl = ito + )"6. If this is not the case. Thus. 0. l.9). they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances.2. E j ).90 or .2. j = 0. and only if.. itl' However if. Suppose X ~ N(Pj.. a UMP (for all ). This is the smallest possible n for any size Ct test. <I>(z(a) + (v.a)[~6Eo' ~oJ! (Problem 4.2. for the two popnlations are known. large. Eo are known. however.) test exists and is given by: Reject if (4. 0 An interesting feature of the preceding example is that the test defined by (4. But T is the test statistic we proposed in Example 4. .1." The function F is known as the Fisher discriminant function. the test that rejects if.' Rejecting H for L large is equivalent to rejecting for I 1 . then this test rule is to reject H if Xl is large. E j ).2. among other things. if Eo #..95).I.2). From our discussion there we know that for any specified Ct.. 0. Example 4. Such a test is called uniformly most powerful (UMP).6) has probability of type I error 0:.Jto)E01X large. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other.1. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say. that the UMP test phenomenon is largely a feature of onedimensional parameter problems. this is no longer the case (Problem 4./iil (J)).7) where c ~ z(1 . If ~o = (1.8). I I .. We return to this in Volume II. . by (4. The power of this test is. . By the NeymanPearsoo lemma this is the largest power available with a level Ct test.2. . We will discuss the phenomenon further in the next section./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'. say. then we solve <1>( z(a) + (v.0. ). (Jj = (Pj.2. O)T and Eo = I.226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. Note that in general the test statistic L depends intrinsically on ito.
2 for deciding between 00 and Ol.. 0" .. .3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary. . the likelihood ratio L 's L~ rr(:.nk. if a match in a genetic breeding experiment can result in k types..OkO' Usually the alternative to H is composite.1.3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4.p...2 that UMP tests for onedimensional parameter problems exist.u k nn.1) test !. ... r lUI nt···· nk· ". 0) P n! nn.2) where ..'P) for all 0 E 8" for any other level 0:' (4. . Example 4. However.1. . Suppose that (N1 . and N i is the number of offspring of type i. . . 4. Two examples in which the MP test does not depend on 0 1 are given. . . k are given by Ow. Nk) .3. 0 < € < 1 and for some fixed (4. 1 (}k) distribution with frequency function. n offspring are observed. nk are integers summing to n. there is sometimes a simple alternative theory K : (}l = (}ll. ...3. which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests. here is the general definition of UMP: Definition 4. Testing for a Multinomial Vector. where nl.. . 'P') > (3(0. Such tests are said to be UMP (unifonnly most powerful). (}l. _ (nl.. .. We note the connection of the MP test to the Bayes procedure of Section 3.3. then (N" . Before we give the example. For instance. The NeymanPearson lemma.···. A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0.Section 4."" (}k = Oklo In this case.. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1.. Ok = OkO. N k ) has a multinomial M(n.)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}.M( n.3. . is established. . . With such data we often want to test a simple hypothesis H : (}I = Ow. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1.
If 1J(O) is strictly increasing in () E e.3.3.d.1 is of this (.3. is an MLRfamily in T(x). J Definition 4. type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H.3. under the alternative.3. for".228 Testmg and Confidence Regions Chapter 4 That is. = (I .o)n.1) if T(x) = t. Example 4. Because f. Oil = h(T(x)) for some increasing function h. . . . .o)n[O/(1 .OW and the model is by (4. then this family is MLR. (2) If E'o6. e c R. X ifT(x) < t ° (4. If {P.O) = 0'(1 . 00 . we have seen three models where. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81. B( n. 0 .3. N 1 < c. 02)/p(X. set s then = l:~ 1 Xi.00).1.2. Ow) under H.1) MLR in s. Bernoulli case. the power function (3(/:/) = E. in the case of a real parameter. .nu)!".i.3. .: 0 E e}. is an MLRfamily in T(x).2 (Example 4. e c R. then 6.2. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact.3) with 6. Note that because l can be any of the integers 1 .2.2) with 0 < f < I}. Theorem 4. . 6. Because dt does not depend on ()l.1. . Suppose {P. 0) ~. < 1 implies that p > E. In this i. Thus... : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x. However.(x) any value in (0. is of the form (4. This is part of a general phenomena we now describe. Then L = pnN1£N1 = pTl(E/p)N1. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ .(X) is increasing in 0.. Example 4. :. for testing H : 8 < /:/0 versus K:O>/:/I' . we get radically different best tests depending on which Oi we assume to be (ho under H. there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. we conclude that the MP lest rejects H. Oil is an increasing function ofT(x). . 0 form with T(x) . k. is UMP level". equals the likelihood ratio test 'Ph(t) and is MP. Consider the oneparameter exponential family mode! o p(x.nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . = h(x)exp{ry(O)T(x) ~ B(O)}. then L(x.3 continned). The family of models {P.. (1) For each t E (0. 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP. if and only if... p(x. Critical values for level a are easily determined because Nl . = . = P( N 1 < c)...(X) = '" > 0. : 8 E e}.. Moreover. Example 4.6. where u is known.
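The MLR property in the Bernoulli sampling case can also be checked numerically: for θ1 < θ2 the likelihood ratio is an increasing function of s = Σ x_i. A small sketch, with Python and scipy.stats assumed:

    from scipy.stats import binom

    def likelihood_ratio(n, theta1, theta2):
        # L(s) = p(s, theta2) / p(s, theta1) for s = 0, 1, ..., n
        return [binom.pmf(s, n, theta2) / binom.pmf(s, n, theta1) for s in range(n + 1)]

    r = likelihood_ratio(n=10, theta1=0.3, theta2=0.6)
    print(all(a < b for a, b in zip(r, r[1:])))   # True: the ratio increases in s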
<>. and we are interested in the precision u. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives. where b = N(J. R.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4. Eoo. then the test/hat rejects H if and only ifT(r) > t(1 .o.2. where 11 is a known standard. Then.( NOo. If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . then e}. H. . 0) = exp {~8 ~ IOg(27r0'2)} . Testing Precision. For simplicity suppose that bo > n. 0 Example 4.3. recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. If 0 < 00 .5. where xn(a) is the ath quantile of the X. distribution. X < h(a).3.(X).o. To show (2).. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo.4. (l. we now show that the test O· with reject H if. Quality Control. . If a is a value taken on by the distribution of X.1. and because dt maximizes the power over this larger class. 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately.(bl1) x. . we test H : u > Uo l versus K : u < uo. N. she formulates the hypothesis H as > (Jo. Suppose {Po: 0 E Example 4. distribution. is UMP level a. _1')2.(X) for testing H: 0 = 01 versus J( : 0 = 0. where U O represents the minimum tolerable precision.Xn .l. Corollary 4..Xn is a sample from a N(I1. Because the most serious error is to judge the precision adequate when it is not. N . Let S = l:~ I (X. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' . Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution. d t is UMP for H : (J < (Jo versus K : (J > (Jo. Thus.l) yields e L( 0 0 )=b.. the critical constant 8(0') is u5xn(a). e c p(x.U 2 ) population. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. where h(a) is the ath quantile of the hypergeometric. Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo.3. Suppose X!. if N0 1 = b1 < bo and 0 < x < b1 . n). (8 < s(a)) = a. Suppose tha~ as in Example l..1 by noting that for any (it < (J2' e5 t is MP at level Eo.l. 0.l. is an MLRfamily in T(r). the alternative K as (J < (Jo. For instance. and only if. then by (1). 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa .bo > n. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard.l of the measurements Xl. (X) < <> and b t is of level 0: for H : (J :S (Jo.. .Section 4.
This is possible for arbitrary /3 < 1 only by making the sample size n large enough. i. ~) for some small ~ > 0 because such improvements are negligible. In our example this means that in addition to the indifference region and level a. we want the probability of correctly detecting an alternative K to be large. This is not serious in practice if we have an indifference region.x) <I L(x. Therefore. we want guaranteed power as well as an upper bound on the probability of type I error. ° ° . not possible for all parameters in the alternative 8.1.. this is.. if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a. . In Example 4.L.x) (N . ~) would be our indifference region.2).1. in general.O'"O..Oo.. Off the indifference region.1. Note that a small signaltonoise ratio ~/ a will require a large sample size n. ".1 and formula (4.. this is a general phenomenon in MLR family models with p( x. The critical values for the hypergeometric distribution are available on statistical calculators and software.n + I) . For such the probability of falsely accepting H is almost 1 . On the other hand. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t.Il = (b l . forO <:1' < b1 1..OIl box (Nn+I)(blx) . By CoroUary 4. In both these cases.B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n.c:O".) is increasing.230 NotethatL(x.+.'.4 because (3(p. we would also like large power (3(0) when () E 8 1 .. .3. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r.(}o.ntI" = z({3) whose solution is il .(b o . That is.1.1. I' .ntI") for sample size n. that is. in (O .I.' . and the powers are continuous increasing functions with limOlO. I i (3(t) ~ 'I>(z(a) + .4 we might be uninterested in values of p. This continuity of the power shows that not too much significance can be attached to acceptance of H. Thus. This equation is equiValent to = {3 z(a) + .a.. the appropriate n is obtained by solving i' . However. This is a subset of the alternative on which we are willing to tolerate low power. H and K are of the fonn H : () < ()o and K : () > ()o.:(x:. 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small. (0. In our nonnal example 4. 0) continuous in 0. 1 I .. {3(0) ~ a. as seen in Figure 4. It follows that 8* is UMP level Q. Thus..
= 0. and only if. we solve (3(00 ) = Po" (S > s) for s using (4.1. (3 = . The power achievable (exactly. Formula (4.35(0.35 and n = 163 is 0.80 ) . Now let .05 binomial test of H : 8 = 0. to achieve approximate size a. Example 4. 0 ° Our discussion can be generalized. we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 . where (3(0.3 continued).1. that is.4.55)}2 = 162.3 to 0.Section 4.3(0." The reason is that n is so large that unimportant small discrepancies are picked up.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.0.282 x 0. we can have very great power for alternatives very close to O. using the SPLUS package) for the level . if Oi = . we fleed n ~ (0. > O.3.1. Such hypotheses are often rejected even though for practical purposes "the fit is good enough.1.4. (h = 00 + Dt. Bd. In Example 4.) = (3 for n and find the approximate solution no+ 1 .6 (Example 4. As a further example and precursor to Section 5.4.90. we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example. Our discussion uses the classical normal approximation to the binomial distribution. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:. Dt.86.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much.00 )] 1/2 .35.5).3.2) shows that.3 requires approximately 163 observations to have probability .4 this would mean rejecting H if. Thus. 1.90 of detecting the 17% increase in 8 from 0.Oi) [nOo(1 .35. and 0. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant. when we test the hypothesis that a very large sample comes from a particular distribution. This problem arises particularly in goodnessoffit tests (see Example 4.645 x 0.05. 00 = 0. Suppose 8 is a vector. ifn is very large and/or a is small. Again using the nonnal approximation. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo. There are various ways of dealing with this problem.. We solve For instance. far from the hypothesis.7) + 1.05)2{1. the size . First. test for = .
9. Example 4. This procedure can be applied. We have seen in Example 4.. is detennined by a onedimensional parameter ).0) > ° I(O. It may also happen that the distribution of Tn as () ranges over 9.4) j j 1 . to the F test of the linear model in Section 6. that are not 01. = (tm.Bo).£ is unknown. the critical value for testing H : 0.(0) so that 9 0 = {O : ). I}. I . In general.l' independent of IJ. we may consider l(O. when testing H : 0 < 00 versus K : B > Bo. Then the MLE of 0. . first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0.Xf as in Example 2. Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn.(Tn ) is an MLR family. for 9.1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative.1.= 00 is now composite. .(0) > O} and Co (Tn) = C'(') (Tn) for all O. Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0). for instance.l)l(O.2 is (}'2 = ~ E~ 1 (Xi .< (10 among all tests depending on <. then rejecting for large values of Tn is UMP among all tests based on Tn. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q.(0) = o} aild 9. J.4.1) 1(0. 0 E 9. and also increases to 1 for fixed () E 6 1 as n . The set {O : qo < q(O) < ql} is our indjfference region. is such that q( Oil = q. = {O : )..232 Testing and Confidence Regions Chapter 4 q. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0. Suppose that in the Gaussian model of Example 4. 0 E 9.O)<O forB < 00 forB>B o. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II. (4. Testing Precision Continued. a)... 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function. is the a percentile of X~l' It is evident from the argument of Example 4.2.3.7. To achieve level a and power at least (3.= (10 versus K : 0. . 00).+ 00. Thus.3. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T. a E A = {O. Although H : 0.3.3.. 0) = (B .2 only..< (10 and rejecting H if Tn is small. However. For instance. we illustrate what can happen with a simple example..3 that this test is UMP for H : (1 > (10 versus K : 0.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis. the distribution of Tn ni:T 2/ (15 is X. The theory we have developed demonstrates that if C.
0 Summary.12) and. 1") for all 0 E e. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H. In the following the decision procedures arc test functions.I") ~ = EO{I"(X)I(O.{1(8.2. Thus.3.4 Confidence Bounds.) . AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a .(X) be such that. = R(O.5) holds for all 8. hence. e 4. E. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function. when UMP tests do not exist.5) That is.(X) = E"I"(X) > O.(o. 0 < " < 1. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1.3.0.(X) > 1.(2) a v R(O. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. For () real.3) with Eo. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 .o. then any procedure not in the complete class can be matched or improved at all () by one in the complete class.4).I"(X) 0 for allO then O=(X) clearly satisfies (4.E. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I). we show that for MLR models.E. 1) 1(O. If E. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests).3.5). Thus. 1) + [1 1"(X)]I(O.3. (4. We also show how. is an MLR family in T( x) and suppose the loss function 1(0. hence.3.(x)) . The risk function of any test rule 'P is R(O. Intervals.3. Proof. it isn't worthwhile to look outside of complete classes. Now 0. (4. for some 00 • E"o.(X)) = 1.3. if the model is correct and loss function is appropriate. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4.I"(X) for 0 < 00 . Suppose {Po: 0 E e}.3.(I"(X))) < 0 for 8 > 8 0 . a) satisfies (4.1 and. the risk of any procedure can be matched or improved by an NP test. INTERVALS.o.O)]I"(X)} Let o.O)} E. (4.3. O))(E. then the class of tests of the form (4. 1") = (1(0. Theorem 4.E. For MLR models. 1) 1(0. Finally.R(O.Section 4.(X) = ". e c R.6) But 1 .4 CONFIDENCE BOUNDS. is complete.O) + [1(0.(Io. locally most powerful (LMP) tests in some cases can be found.o) < R(O.
member of a specified set Θ_0. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α.

As an illustration consider the case where X_1, ..., X_n are i.i.d. N(μ, σ²) with σ² known. Suppose that μ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome X = (X_1, ..., X_n) to establish a lower bound μ̲(X) for μ with a prescribed probability (1 − α) of being correct. In the non-Bayesian framework, μ is a constant, and we look for a statistic μ̲(X) that satisfies

P(μ̲(X) ≤ μ) = 1 − α

with 1 − α equal to .95 or some other desired level of confidence. In our example this is achieved by writing

P(√n(X̄ − μ)/σ ≤ z(1 − α)) = 1 − α.

By solving the inequality inside the probability for μ, we find

P(X̄ − σ z(1 − α)/√n ≤ μ) = 1 − α.

This gives μ̲(X) = X̄ − σ z(1 − α)/√n, which is a lower bound with P(μ̲(X) ≤ μ) = 1 − α.

Similarly, we may be interested in an upper bound on a parameter. In the N(μ, σ²) example this means finding a statistic μ̄(X) such that P(μ̄(X) ≥ μ) = 1 − α, and a solution is μ̄(X) = X̄ + σ z(1 − α)/√n. Here μ̄(X) is called an upper level (1 − α) confidence bound for μ.

Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. We find such an interval by noting

P(|√n(X̄ − μ)/σ| ≤ z(1 − ½α)) = 1 − α

and solving the inequality inside the probability for μ. This gives μ±(X) = X̄ ± σ z(1 − ½α)/√n, and we say that [μ−(X), μ+(X)] is a level (1 − α) confidence interval for μ.

In general, if ν = ν(P) is a parameter and X ~ P, X ∈ R^q, P ∈ P, it may not be possible for a bound or interval to achieve exactly probability (1 − α) for a prescribed (1 − α) such as .95. In this case, we settle for a probability at least (1 − α).
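The bounds just derived are easy to compute. Below is a minimal sketch in Python, assuming scipy is available for the normal quantile z(p); the summary values x̄ = 1.2, σ = 1.5, n = 10, α = .05 are hypothetical and not from the text.

```python
import numpy as np
from scipy.stats import norm

def normal_mean_bounds(xbar, sigma, n, alpha):
    """Level (1 - alpha) lower/upper bounds and two-sided interval for mu, sigma known."""
    se = sigma / np.sqrt(n)
    lower = xbar - norm.ppf(1 - alpha) * se        # P(lower <= mu) = 1 - alpha
    upper = xbar + norm.ppf(1 - alpha) * se        # P(upper >= mu) = 1 - alpha
    half = norm.ppf(1 - alpha / 2) * se            # half-width of the two-sided interval
    return lower, upper, (xbar - half, xbar + half)

# hypothetical sleep-gain summary: xbar = 1.2 hours, sigma = 1.5, n = 10
print(normal_mean_bounds(1.2, 1.5, 10, 0.05))
```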
Definition 4.4.1. A statistic ν̲(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P,

P[ν̲(X) ≤ ν] ≥ 1 − α.

Similarly, ν̄(X) is called a level (1 − α) upper confidence bound for ν if for every P ∈ P,

P[ν̄(X) ≥ ν] ≥ 1 − α.

Finally, the random interval [ν̲(X), ν̄(X)] formed by a pair of statistics ν̲(X), ν̄(X) is a level (1 − α) or a 100(1 − α)% confidence interval for ν if, for all P ∈ P,

P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.

The quantities on the left are called the probabilities of coverage and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number (1 − α') ≤ (1 − α) will be a confidence level if (1 − α) is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just inf{P[ν̲(X) ≤ ν ≤ ν̄(X)] : P ∈ P}, the minimum probability of coverage. For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let X_1, ..., X_n be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to obtain a confidence interval for μ by solving −z(1 − ½α) ≤ Z(μ) ≤ z(1 − ½α) for μ. In this process Z(μ) is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Now we turn to the σ² unknown case and propose the pivot T(μ) obtained by replacing σ in Z(μ) by its estimate s, where

s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²,

that is,

T(μ) = √n(X̄ − μ)/s.

To find the distribution of T(μ), note that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution and is, by Theorem B.3.3, independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. We conclude from the definition of the (Student) t distribution in Section B.3.1 that

T(μ) = Z(μ)/√(V/(n − 1))
~a)) = 1..d...4..nJ = 1" X ± sc/. ~.Q).l < X + st n_ 1 (1. .a.~a) < T(J.11 whatever be Il and Tj. the interval will have probability (1 .355 is ..)/ v'n].2) f.~a) = 3. . 0 .~a)/. To calculate the coefficients t n. See Figure 5. we can reasonably replace t n .Xn is a sample from a N(p" a 2 ) population. . • r.!a) and tndl ..l (I .i. Solving the inequality inside the probability for Jt. .355 and IX .n < J..a).355s/31 is the desired level 0. Up to this point. t n .12).fii. we use a calculator. . we have assumed that Xl.1. X +stn _ 1 (1 Similarly.3.a) /.n is.1 distribution and can be used as a pivot. The properties of confidence intervals such as (4.a).. Confidence Intervals and Bounds for the Vartance of a Normal Distribution. .7 (see Problem 8.. For the usual values of a.1) in nonGaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5.)/.~a)/. Thus."2)) = 1"..1 (p) by the standard normal quantile z(p) for n > 120.. ..355s/3.3.4. or very heavytailed distributions such as the Cauchy.l) is fairly close to the Tn .) /. For instance.l)s2/X(1. X . for very skew distributions such as the X2 with few degrees of freedom.005.2.1) can be much larger than 1 .99 confidence intervaL From the results of Section B.l (I . (4.) confidence interval of the type [X . i '1 !.1 distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal.l)s2/X("I)1 is a confidence interval with confidence coefficient (1 . the confidence coefficient of (4.1 (I .stn_ 1 (I . Example 4.st n_l (1 ~. N (p" a 2 ).. and if al + a2 = a. On the other hand. It turns out that the distribution of the pivot T(J.. . we enter Table II to find that the probability that a 'TnI variable exceeds 3. .3. Suppose that X 1.4.st n _ 1 (I ."2)' (n . V( <7 2 ) = (n .1.1) has confidence coefficient close to 1 . or Tables I and II.. . distribution.a) in the limit as n + 00. In this case the interval (4. X + 3.a. we find P [X .l) < t n_ 1 (1. computer software. .236 has the t distribution 7.1 )s2/<7 2 has a X. • " ~. Then (72. .01..7.fii and X + stn.Q. we see that as n + 00 the Tn_l distribution converges in law to the standard normal distribution.1. . By Theorem B . Testing and Confidence Regions Chapter 4 1 . " .. By solving the inequality inside the probability for a 2 we find that [(n .1. Hence. X n are i.1) The shortest level (I . (4. If we assume a 2 < 00. thus.fii are natural lower and upper confidence bounds with confidence coefficients (1 . if we let Xn l(p) denote the pth quantile of the X~1 distribution.4. Let tk (p) denote the pth quantile of the P (t n.1 (1 . then P(X("I) < V(<7 2) < x(l. if n = 9 and a = 0.
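As a check on the quantile look-ups in these two examples, the following sketch computes the t interval (4.4.1) and the equal-tailed χ² interval for σ² from a sample. The nine data values are hypothetical, but with n = 9 and α = .01 the routine reproduces the quantile 3.355 cited above.

```python
import numpy as np
from scipy.stats import t, chi2

def t_interval(x, alpha):
    """Level (1 - alpha) Student t confidence interval for mu."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    tq = t.ppf(1 - alpha / 2, df=n - 1)            # equals 3.355 for n = 9, alpha = 0.01
    return xbar - tq * s / np.sqrt(n), xbar + tq * s / np.sqrt(n)

def var_interval(x, alpha):
    """Equal-tailed level (1 - alpha) interval for sigma^2 based on (n-1)s^2/sigma^2 ~ chi^2."""
    n, s2 = len(x), np.var(x, ddof=1)
    return ((n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1),
            (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1))

x = np.array([0.8, 1.4, 0.2, 1.9, 1.1, 0.7, 2.3, 1.0, 1.6])   # hypothetical sample, n = 9
print(t_interval(x, 0.01), var_interval(x, 0.01))
```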
The length of this interval is random. There is a unique choice of α_1 and α_2 which uniformly minimizes expected length among all intervals of this type (Tate and Klett, 1959). It may be shown that for n large, taking α_1 = α_2 = ½α is not far from optimal. The pivot V(σ²) similarly yields the respective lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α).

In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. An interval with correct limiting coverage probability is given in the problems. □

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X_1, ..., X_n are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre–Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

P[ √n |X̄ − θ| / √(θ(1 − θ)) ≤ z(1 − ½α) ] ≈ 1 − α.

Let k_α = z(1 − ½α) and observe that this is equivalent to

P[ g(θ, X̄) ≤ 0 ] ≈ 1 − α,

where

g(θ, X̄) = (1 + k_α²/n) θ² − (2X̄ + k_α²/n) θ + X̄².

For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. In terms of S = nX̄, they are(1)

θ̲(X) = {S + ½k_α² − k_α √[S(n − S)/n + ¼k_α²]} / (n + k_α²)    (4.4.3)

θ̄(X) = {S + ½k_α² + k_α √[S(n − S)/n + ¼k_α²]} / (n + k_α²).    (4.4.4)

Because the coefficient of θ² in g(θ, X̄) is greater than zero,

{θ : g(θ, X̄) ≤ 0} = {θ : θ̲(X) ≤ θ ≤ θ̄(X)}.
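A small sketch of the bounds (4.4.3)–(4.4.4); the survey counts S = 540 successes out of n = 1000 trials are hypothetical.

```python
import math
from scipy.stats import norm

def approx_binomial_interval(S, n, alpha):
    """Approximate level (1 - alpha) interval (4.4.3)-(4.4.4) for a binomial proportion."""
    k = norm.ppf(1 - alpha / 2)                        # k_alpha = z(1 - alpha/2)
    center = S + k**2 / 2
    spread = k * math.sqrt(S * (n - S) / n + k**2 / 4)
    return (center - spread) / (n + k**2), (center + spread) / (n + k**2)

print(approx_binomial_interval(S=540, n=1000, alpha=0.05))   # hypothetical survey data
```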
These inlervals and bounds are satisfactory in practice for the usual levels.0)1 JX(1 . Another approximate pivot for this example is yIn(X . This fonnula for the sample size is very crude because (4.95. . or n = 9.02 by choosing n so that Z = n~ ( 1.4) has length 0. This leads to the simple interval 1 I (4.a) confidence bounds.: iI j . (1 . = T.600. He draws a sample of n potential customers. A discussion is given in Brown. O(X)] is an approximate level (1 .4. note that the length. = 0. n(l . to bound I above by 1 0 choose . That is. He can then detennine how many customers should be sampled so that (4.4. say I.7) See Brown.~a = 0.8) is at least 6.a) procedure developed in Section 4.975. Note that in this example we can detennine the sample size needed for desired accuracy.96. and Das Gupta (2000) for a discussion. we n .02. We can similarly show that the endpoints of the level (1 . To see this. calls willingness to buy success.~a) ko 2 kcc i 1. and we can achieve the desired . i " .96 0.601. it is better to use the exact level (1 . i o 1 [.(1. ka.6) Thus. consider the market researcher whose interest is the proportion B of a population that will buy a product. 1 .5. we choose n so that ka.n 1 tn (4.S)ln] Now use the fact that + kzJ4}(n + k~)l (S . = length 0.2a) interval are approximate upper and lower level (1 .~n)2 < S(n . For instance.4. [n this case.02 and is a confidence interval with confidence coefficient approximately 0.O > 8 1 > ~.4.16. if the smaller of nO.a) confidence interval for O. and Das Gupta (2000). Better results can be obtained if one has upper or lower bounds on 8 such as 8 < 00 < ~.4. Cai. See Problem 4. and uses the preceding model. For small n.5) (4.238 Testing and Confidence Regions Chapter 4 so that [O(X)..S)ln ~ to conclude that tn .02 ) 2 .X).4.(n + k~)~ = 10 . Cai.4. .96) 2 = 9..5) is used and it is only good when 8 is near 1/2. of the interval is I ~ 2ko { JIS(n .
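The crude sample-size rule used above, n ≈ (z(1 − ½α)/l)² for a target interval length l, is a one-line computation; in the sketch below the only inputs taken from the text are l = 0.02 and α = .05, and the rule returns an n of roughly 9,600, in line with the calculation above.

```python
import math
from scipy.stats import norm

def sample_size_for_length(length, alpha):
    """Smallest n making the worst-case (theta near 1/2) interval length at most `length`."""
    return math.ceil((norm.ppf(1 - alpha / 2) / length) ** 2)

print(sample_size_for_length(0.02, 0.05))   # roughly 9,600 potential customers
```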
.a Note that if I. By using 2nX /0 as a pivot we find the (1 .r} is said to be a level (1 . qc( 0)) is at least (I . Suppose X I. . In this case..0').a). (X) ~ [q... then the rectangle I( X) Ic(X) has level = h (X) x .~a) < 0 <2nX/x (~a) where x(j3) denotes the 13th quantile of the X~n distribution.1. £(8. Confidence Regions of Higher Dimension We can extend the notion of a confidence interval for onedimensional functions q( B) to rdimensional vectors q( 0) = (q. Let 0 and and upper boundaries of this interval.1 ).Section 44 Confidence Bounds. . Then the rdimensional random rectangle I(X) = {q(O). That is.4.a). For instance. then q(C(X)) ~ {q(O) .. Let Xl.4..F(x) = exp{ x/O}. Note that ifC(X) is a level (1 ~ 0') confidence region for 0.) confidence interval for qj (0) and if the ) pairs (T l ' 1'. I is a level (I . and suppose we want a confidence interval for the population proportion P(X ? x) of subscribers that spend at least x hours per week on the Internet. .a). j ~ I.a) confidence interval 2nX/x (I .X n denote the number of hours a sample of internet subscribers spend per week on the Internet. distribution. If q is not 1 . this technique is typically wasteful.0') confidence region for q(O). ( 2 ) T. j==1 (4.a. (0). and Regions 239 Confidence Regions for Functions of Parameters We can define confidence regions for a function q(O) as random subsets of the range of q that cover the true value of q(O) with probability at least (1 . if the probability that it covers the unknown but fixed true (ql (0). if q( B) = 01 . (T c' Tc ) are independent. . ~. qc (0)).).. X n is modeled as a sample from an exponential. we will later give confidence regions C(X) for pairs 8 = (0 1 . .4. ii. . .3. . . q(C(X)) is larger than the confidence set obtained by focusing on B1 alone. 0 E C(X)} is a level (1.. . we can find confidence regions for q( 0) entirely contained in q( C(X)) with confidence level (1 .8) .. We write this as P[q(O) E I(X)I > 1.. x IT(1a. X~n' distribution.. Intervals.). .a) confidence region.. Suppose q (X) and ii) (X) are realJ valued.4.(X).(O) < ii..(X) < q. 2nX/0 has a chi~square. then exp{ x/O} edenote the lower < q(O) < exp{ x/i)} is a confidence interval for q( 0) with confidence coefficient (1 ... By Problem B. Here q(O) = 1 . Example 4.
min{l.4. Then by solving Dn(F) < do for F. F(t) + do})· We have shown ~ ~ I o P(C(X)(t) :) F(t) for all t E R) for all P E 'P =1. We assume that F is continuous. h (X) X 12 (X) is a level (1 .Laj.a.LP[qj(O) '" Ij(X)1 > 1.1 h(X) = X ± stn_l (1. F(t) .a) confidence rectangle for (/" . j = 1.5. .15.6..7).. c c P[q(O) E I(X)] > 1. Moreover.. .a) r. That is. in which case (Proposition 4. 1 X n is a N (M. then leX) has confidence level 1 .a).4.. 0 The method of pivots can also be applied to oodimensional parameters such as F.a = set of P with P( 00. r.5). (7 2).. ~ ~'f Dn(F) = sup IF(t) . From Example 4. From Example 4. ( 2 ) sample and we want a confidence rectangle for (Ii.(1 . Suppose Xl> . j=1 j=1 Thus. Confidence Rectangle for the Parameters ofa Normal Distribution.~a)/rn !a). According to this inequality. and we are interested in the distribution function F(t) = P(X < t). we find that a simultaneous in t size 1.i.240 Testing and Confidence Regions Chapter 4 Thus. tl continuous in t. D n (F) is a pivot. if we choose a J = air. v(P) = F(·). . X n are i.. . an rdimensional confidence rectangle is in this case automaticalll obtained from the onedimensional intervals. is a confidence interval for J1 with confidence coefficient (1  I (X) _ [(nl)S2 (nl)S2] 2 Xnl (1 .do}.0 confidence region C(x)(·) is the confidence band which.2.F(t)1 tEn i " does not depend on F and is known (Example 4. that is. Example 4. Example 4.1.. See Problem 4.40) 2. An approach that works even if the I j are not independent is to use Bonferroni's inequality (A.~a). . for each t E R.d. Thus.4. It is possible to show that the exact confidence coefficient is (1 .2).. if we choose 0' j = 1 . .4. Let dO' be chosen such that PF(Dn(F) < do) = 1 .2. Suppose Xl. as X rv P. .ia)' Xnl (lo) is a reasonable confidence interval for (72 with confidence coefficient (1.4. then I(X) has confidence level (1 .1) the distribution of . consists of the interval i J C(x)(t) = (max{O.1.0:.
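The band C(x)(t) requires the constant d_α determined by the distribution of D_n(F). As a sketch, the version below substitutes the conservative Dvoretzky–Kiefer–Wolfowitz choice d_α = √(log(2/α)/(2n)) — an assumption of this illustration, not the exact quantile used in the text — and evaluates the clipped band at the order statistics of a hypothetical sample.

```python
import numpy as np

def confidence_band(x, alpha):
    """Conservative level (1 - alpha) simultaneous band for F at the sample points."""
    x = np.sort(np.asarray(x))
    n = len(x)
    d = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW constant; the exact d_alpha is smaller
    Fhat = np.arange(1, n + 1) / n             # empirical distribution function at x_(i)
    return x, np.clip(Fhat - d, 0, 1), np.clip(Fhat + d, 0, 1)

# hypothetical data: weekly internet hours for 8 subscribers
xs, lo, hi = confidence_band([2.0, 5.5, 1.0, 7.2, 3.3, 4.1, 0.5, 6.0], alpha=0.05)
for ti, l, u in zip(xs, lo, hi):
    print(f"F({ti:4.1f}) in [{l:.2f}, {u:.2f}]")
```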
We define lower and upper confidence bounds (LCBs and DCBs).18) arise in accounting practice (see Bickel.0'.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 ." for all PEP. is /1.5 The Duality Between Confidence Regions and Tests 241 We can apply the notions studied in Examples 4. as X and that X has a density f( t) = F' (t). A Lower Confidence Boundfor the Mean of a Nonnegative Random Variable. Then a (1 . o 4.4. 1WoSided Tests for the Mean of a Normal Distribution. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. and we derive an exact confidence interval for the binomial parameter.4. and more generally confidence regions. Acceptance regions of statistical tests are.4. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function F( t) and the mean of a positive variable X. Example 4. given by (4.!(F) : F E C(X)} = J. We derive the (Student) t interval for I' in the N(Il.0: when H is true. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining . if J.! ~ inf{J. . Suppose that an established theory postulates the value Po for a certain physical constant.0: for all E e.! ~ /L(F) = f o tf(t)dt = exists.7. We begin by illustrating the duality in the following example.4. . we similarly require P(C(X) :J v) > 1.1.i.6. for a given hypothesis H.0') lower confidence bound for /1..!(F) : F E C(X)) ~ J. 0 Summary.4. Example 4. then Let F(t) and F+(t) be the lower and upper simultaneous confidence boundaries of Example 4. For a nonparametric class P ~ {P} and parameter v ~ v(P).d.4. a level 1 . 1992) where such bounds are discussed and shown to be asymptotically strictly conservative. confidence intervals.5. In a parametric model {Pe : BEe}.9)   because for C(X) as in Example 4.6.0: confidence region for a parameter q(B) is a set C(x) depending only on the data x such that the probability under Pe that C(X) covers q( 8) is at least 1 .!(F)   ~ oosee Problem 4. X n are i. By integration by parts.4.!(F+) and sUP{J.4 and 4.19. subsets of the sample space with probability of accepting H at least 1 . Intervals for the case F supported on an interval (see Problem 4. J. a 2 ) model with a 2 unknown.5 to give confidence regions for scalar or vector parameters in nonparametric models.4.Section 4. Suppose Xl.. which is zero for t < 0 and nonzero for t > O.
if we start out with (say) the level (1 . (4.5. Evidently.. 00) X (0.4. (4. Xl!_ Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean {L and variance a 2 . by starting with the test (4. generated a family of level a tests {J(X.~a).1) we constructed for Jl as follows. Because the same interval (4. then it is reasonable to formulate the problem as that of testing H : fl = Jlo versus K : fl i..5. We can base a size Q' test on the level (1 .1. o These are examples of a general phenomenon.a) confidence region for v if the probability that S(X) contains v is at least (1 . /l) to equal 1 if.00). For a function space example.242 Testing and Confidence Regions Chapter 4 measurements Xl . in fact.~1 >tn_l(l~a) ootherwise.a.~a)] = 0 the test is equivalently characterized by rejecting H when ITI > tnl (1 .flO. In contrast to the tests of Example 4. then S is a (I .. . Because p"IITI = I n. Jlo) being of size a only for the hypothesis H : Jl = flo· Conversely.4.4. if and only if. X + Slnl (1  ~a) /vn]. if and only if. and in Example 4. (/" .6.5'[) If we let T = vn(X .a. generating a family of level a tests. that is P[v E S(X)] > 1 .1) is used for every flo we see that we have.00). For instance. then our test accepts H.1 (1 .a)s/vn > /l. the postulated value JLo is a member of the level (1 0' ) confidence interval [X  Slnl (1 ~a) /vn. If any value of J.1) by finding the set of /l where J(X.1 (1 . • j n .4. 1. Consider the general framework where the random vector X takes values in the sample space X C Rq and X has distribution PEP.Q) confidence interval (4.I n .!a) < T < I n.a). i:..l (1. it has power against parameter values on either side of flo. /l) ~ O.2 = . /l)} where lifvnIX.4. t n .a) LCB X .I n _ 1 (1. X .2) we obtain the confidence interval (4. Let S = S{X) be a map from X to subsets of N.5.2) takes values in N = (00 .1 (1.5. /l = /l(P) takes values in N = (00. Let v = v{P) be a parameter that takes values in the set N. as in Example 4.2) These tests correspond to different hypotheses. O{X.. all PEP..fLo)/ s. = 1.. We accept H. in Example 4. in Example 4. ..5. and only if.2(p) takes values in N = (0. where F is the distribution function of Xi' Here an example of N is the class of all continuous distribution functions..2.4.00). We achieve a similar effect. consider v(P) = F.l other than flo is a possible alternative. .~Ct). This test is called twosided because it rejects for both large and small values of the statistic T.a)s/ vn and define J'(X.
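A short sketch of the duality in Example 4.5.1: the size α two-sided t test accepts H : μ = μ_0 exactly when μ_0 lies inside the level (1 − α) t interval. The measurements and the candidate values of μ_0 below are hypothetical.

```python
import numpy as np
from scipy.stats import t

def t_test_accepts(x, mu0, alpha):
    """True iff the size-alpha two-sided t test accepts H: mu = mu0."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    T = np.sqrt(n) * (xbar - mu0) / s
    return abs(T) < t.ppf(1 - alpha / 2, df=n - 1)

def t_confidence_interval(x, alpha):
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    h = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
    return xbar - h, xbar + h

x = np.array([9.9, 10.3, 10.1, 9.8, 10.4, 10.0])   # hypothetical measurements
lo, hi = t_confidence_interval(x, 0.05)
for mu0 in (9.7, 10.0, 10.5):                      # accepted exactly when lo <= mu0 <= hi
    print(mu0, t_test_accepts(x, mu0, 0.05), lo <= mu0 <= hi)
```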
Then the acceptance regIOn A(vo) ~ (x: J(x. let Pvo = (P : v(P) ~ Vo : va E V}. Fe (t) is decreasing in (). Moreover. By Theorem 4.aj.0" of containing the true value of v(P) whatever be P. the acceptance region of the UMP size a test of H : 0 = 00 versus K : > ()a can be written e A(Oo) = (x: T(x) < too (1. By applying F o to hoth sides oft we find 8(t) = (O E e: Fo(t) < 1. Similarly. if S(X) is a level 1 .a)}.Ct.a)) where to o (1 . if 0"1 + 0"2 < 1. for other specified Va. this is a random set contained in N with probapility at least 1 . is a level 0 test for HI/o.a has a solution O.0" confidence region for v.5. acceptance regions. X E A(vo)). Let S(X) = (va EN. H may be rejected. It follows that Fe (t) < 1 .1. then ()u(T) is a lower confidence boundfor () with confidence coefficient 1 . va) with level 0". and pvalues for MLR families: Let t denote the observed value t = T(x) ofT(X) for the datum x. then [QUI' oU2] is confidence interval for () with confidence coefficient 1 . eo e Proof. if 8(t) = (O E e: t < to(1 .1. < to(la).0".Section 4.1. Suppose we have a test 15(X.3. For some specified Va. We next apply the duality theorem to MLR families: Theorem 4.(t) in e.(01 + 0"2).Ct) is the 1. 0 We next give connections between confidence bounds. Duality Theorem. We have the following.0" quantile of F oo • By the duality theorem. Suppose X ~ Po where (Po: 0 E ej is MLR in T = T(X) and suppose that the distribution function Fo(t) ofT under Po is continuous in each of the variables t and 0 when the other is fixed.5 The Duality Between Confidence Regions and Tests 243 Next consider the testing framework where we test the hypothesis H = Hvo : v = Va for some specified value va.Fo(t) for a test with critical constant t is increasing in ().(T) of Fo(T) = a with coefficient (1 .vo) = OJ is a subset of X with probability at least 1 . then the test that accepts HI/o if and only if Va is in S(X). E is an upper confidence bound for 0 with any solution e. Formally.Ct confidence region for v.G iff 0 > Iia(t) and 8(t) = Ilia. Consider the set of Va for which HI/o is accepted. Conversely. the power function Po(T > t) = 1 . let . H may be accepted.3.Ct). The proofs for the npper confid~nce hound and interval follow by the same type of argument. By Corollary 4. then 8(T) is a Ia confidence region forO. If the equation Fo(t) = 1 . then PIX E A(vo)) > 1  a for all P E P vo if and only if S (X) is a 1 . That i~. 00).
.Fo(t) is increasing in O.1. . (ii) k(B. N (IL.v) where.1 244 Testing and Confidence Regions Chapter 4 1 o:(t. '1 Example 4.1).~a)/ v'n}. let a( t.2.3. The pvalue is {t: a(t.to(l.a)ifBiBo. For a E (0. a) .Fo(t) is decreasing in t. . . Let T = X. vo) = 1 IT > cJ of H : v ~ Vo based on a statistic T = T(X) with observed value t = T(x). and A"(B) = T(A(B)) = {T(x) : x Corollary 4. v) <a} = {(t. Figure 4. 1 . • a(t.3. We call C the set of compatible (t. v) = O}. vol denote the pvalue of a test6(T. I c= {(t.d.4.a) confidence region is given by C(X t. for the given t.1.~) upper and lower confidence bounds and confidence intervals for B.v) : J(t. B) plane.. B) pnints. In general. • .v) : a(t.a)~k(Bo.I}.5.80 E (0. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. 00 ) denote the pvalue for the UMP size Q' test of H : () let = eo versus K : () > eo..a) test of H.p): It pI < <7Z (1. We illustrate these ideas using the example of testing H : fL = J. Let k(Bo. We claim that (i) k(B.a)] > a} = [~(t). we seek reasonable exact level (1 .10 when X I .oo). We shall use some of the results derived in Example 4. The corresponding level (1 . a) denote the critical constant of a level (1. . and for the given v. v will be accepted. where S = Ef I Xi_ To analyze the structure of the region we need to examine k(fJ. Then the set .1 I. To find a lower confidence bound for B our preceding discussion leads us to consider level a tests for H : 6 < 00. The result follows. Let Xl. .B) = (oo. Because D Fe{t) is a distribution function. a).1 shows the set C. . _1 X n be the indicators of n binomial trials with probability of success 6. A"(B) Set) Proof.5.1 that 1 . vertical sections of C are the confidence regions B( t) whereas borizontal sections are the acceptance regions A" (v) = {t : J( t. a confidence region S( to). B) ~ poeT > t) = 1 . I X n are U. v) =O} gives the pairs (t.2 known. E A(B)).Xn ) = {B : S < k(B. and an acceptance set A" (po) for this example.B) > a} {B: a(t. Under the conditions a/Theorem 4. t is in the acceptance region. : : !.Fo(t).5.1. cr 2 ) with 0. We have seen in the proof of Theorem 4. then C = {(t.1). In the (t...a) is nondecreasing in B.
Po. if 0 > 00 .1 (i) that PetS > j] is nondecreasing in () for fixed j. a) increases by exactly 1 at its points of discontinuity. Q). To prove (i) note that it was shown in Theorem 4. S(to) is a confidence interval for 11 for a given value to of T. it is also nonincreasing in j for fixed e. [S > j] = Q and j = k(Oo. e < e and 1 2 k(O" Q) > k(02. The shaded region is the compatibility set C for the twosided test of Hp. then C(X) ={ (O(S).5 The Duality Between Confidence Regions and Tests 245 Figure 4. If fJo is a discontinuity point 0 k(O.Q)] > Po.o : J. I] if S ifS >0 =0 .L = {to in the normal model. On the other hand. if we define a contradiction.[S > k(O"Q) I] > Q. Q). The claims (iii) and (iv) are left as exercises. (iii). Therefore. [S > j] < Q.Q) I] > Po. hence.o' (iii) k(fJ. Po. (ii). Then P. (iv) k(O.Q) = S + I}.Q) = n+ 1. Therefore. whereas A'" (110) is the acceptance region for Hp.5.[S > k(02. Q) would imply that Q > Po. and (iv) we see that. let j be the limit of k(O.1. and. The assertion (ii) is a consequence of the following remarks. From (i).I] [0.3. [S > j] > Q.[S > k(02.[S > j] < Q for all 0 < 00 O(S) = inf{O: k(O.Q) ~ I andk(I. Q) as 0 tOo.Section 4. Clearly. P.
0.4 0.4. Putting the bounds O(S). From our discussion.I. As might be expected. O(S) = I. these bounds and intervals differ little from those obtained by the first approximate method in Example 4. 4 k(8. j Figure 4.2.1 0. therefore. Then 0(S) is a level (1 . if n is large. . .2Q).16) for n = 2.3. 1 _ . we find O(S) as the unique solution of the equation. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations.Q)=Sl} where j (0. we define ~ O(S)~sup{O:j(O. O(S) is the unique solution of 1 ~( ~ s ) or(1 _ 8)nr = Q.. I • i I .1 d I J f o o 0.Q) DCB for 0 and when S < n.3 0.2 0.16) 3 I 2 11. j '~ I • I • .Q) = S and. Plot of k(8. O(S) together we get the confidence interval [8(S).5.246 Testing and Confidence Regions Chapter 4 and O(S) is the desired level (I . when S > 0. . 0. Q) is given by. then k(O(S). . . When S = n.5. O(S) I oflevel (1. i • .Q) LCB for 0(2) Figure 4. 8(S) = O. Similarly.5 I i . When S 0.2 portrays the situatiou.
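Numerically, θ̲(S) and θ̄(S) can be obtained by solving the binomial tail equations above; equivalently, the standard identity linking binomial tails to the incomplete beta function expresses them as Beta quantiles. The sketch below relies on that identity (an assumption of this illustration, not a derivation given in the text) and checks the answer against the tail equation; S = 7 and n = 20 are hypothetical.

```python
from scipy.stats import beta, binom

def exact_binomial_bounds(S, n, alpha):
    """Level (1 - alpha) exact lower and upper confidence bounds for theta."""
    lower = 0.0 if S == 0 else beta.ppf(alpha, S, n - S + 1)       # solves P_theta(S' >= S) = alpha
    upper = 1.0 if S == n else beta.ppf(1 - alpha, S + 1, n - S)   # solves P_theta(S' <= S) = alpha
    return lower, upper

S, n, alpha = 7, 20, 0.05
lo, hi = exact_binomial_bounds(S, n, alpha)
print(lo, hi)
# check: the binomial tail probabilities at the bounds equal alpha (up to rounding)
print(binom.sf(S - 1, n, lo), binom.cdf(S, n, hi))
```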
d. Make no judgment as 1O whether 0 < 80 or 8 > B if I contains B . we can control the probabilities of a wrong selection by setting the 0' of the parent test or confidence interval. we decide whether this is because () is smaller or larger than Bo. the probability of falsely claiming significance of either 8 < 80 or 0 > 80 is bounded above by ~Q.0') confidence interval!: 1. and vice versa. For this threedecision rule.!a). the twosided test can be regarded as the first step in the decision procedure where if H is not rejected.4. by using this kind of procedure in a comparison or selection problem. v = va. Example 4. we make no claims of significance. We explore the connection between tests of statistical hypotheses and confidence regions. N(/l. Because we do not know whether A or B is to be preferred. and o Decide 8 > 80 if I is entirely to the right of B . If we decide B < 0.. we test H : B = 0 versus K : 8 i. Here we consider the simple solution suggested by the level (1 .~Q) / .5 The Duality Between Confidence Regions and Tests 247 Applications of Confidence Intervals to Comparisons and Selections We have seen that confidence intervals lead naturally to twosided tests.i. suppose B is the expected difference in blood pressure when two treatments. We can use the twosided tests and confidence intervals introduced in later chapters in similar fashions. vo) is a level 0' test of H .5.B .3. Decide I' > I'c ifT > z(1 . o o Decide B < 80 if! is entirely to the left of B .Jii for !t. To see this consider first the case 8 > 00 . and 3. Therefore. it is natural to carry the comparison of A and B further by asking whether 8 < 0 or B > O. then the set S(x) of Vo where . when !t < !to.5. twosided tests seem incomplete in the sense that if H : B = B is rejected in favor of o H : () i. Decide I' < 1'0 ifT < z(l. Then the wrong decision "'1 < !to" is made when T < z(1 .Section 4.. Suppose Xl. This event has probability Similarly.2. but if H is rejected.~Q).5.13.4 we considered the level (1 . If J(x.~a). Thus.Q) confidence interval X ± uz(1 . I X n are i.8 < B • or B > 80 is an example of a threeo decision problem and is a special case of the decision problems in Section 1. then we select A as the better treatment.O. 0. Using this interval and (4. If H is rejected. o (4. .3) we obtain the following three decision rule based on T = J1i(X  1'0)/": Do not reject H : /' ~ I'c if ITI < z(1 . we usually want to know whether H : () > B or H : B < 80 . are given to high blood pressure patients. A and B.3) 3. 2.2 ) with u 2 known. In Section 4. The problem of deciding whether B = 80 .!a). However.. Summary. the probability of the wrong decision is at most ~Q. o o For instance.
If δ(x, ν_0) is a level α test of H_{ν_0} : ν = ν_0 for each ν_0, then the set S(x) of ν_0 where δ(x, ν_0) = 0 is a level (1 − α) confidence region for ν; conversely, if S(x) is a level (1 − α) confidence region for ν, then the test that accepts H_{ν_0} : ν = ν_0 if, and only if, ν_0 ∈ S(x) is a level α test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter θ equals θ_0, is less than θ_0, or is larger than θ_0, where θ_0 is a specified value.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. If θ̲ and θ̲* are two competing level (1 − α) lower confidence bounds for θ, they are both very likely to fall below the true θ. But we also want the bounds to be close to θ. Thus, we say that the bound with the smaller probability of being far below θ is more accurate. Formally,

Definition 4.6.1. A level (1 − α) LCB θ̲* of θ is said to be more accurate than a competing level (1 − α) LCB θ̲ if, for any fixed θ and all θ' < θ,

P_θ[θ̲*(X) ≤ θ'] ≤ P_θ[θ̲(X) ≤ θ'].    (4.6.1)

Similarly, a level (1 − α) UCB θ̄* is more accurate than a competitor θ̄ if, for any fixed θ and all θ' > θ,

P_θ[θ̄*(X) ≥ θ'] ≤ P_θ[θ̄(X) ≥ θ'].    (4.6.2)

Lower confidence bounds satisfying (4.6.1) for all competitors are called uniformly most accurate, as are upper confidence bounds satisfying (4.6.2) for all competitors. Note that θ̲* is a uniformly most accurate level (1 − α) LCB for θ if and only if −θ̲* is a uniformly most accurate level (1 − α) UCB for −θ.

Example 4.6.1. Suppose X = (X_1, ..., X_n) is a sample of N(μ, σ²) random variables with σ² known. A level α test of H : μ = μ_0 vs K : μ > μ_0 rejects H when √n(X̄ − μ_0)/σ ≥ z(1 − α). The dual lower confidence bound is μ̲_1(X) = X̄ − z(1 − α)σ/√n. A competing lower confidence bound is μ̲_2(X) = X_(k), where X_(1) ≤ X_(2) ≤ ... ≤ X_(n) denotes the ordered X_1, ..., X_n and k is defined by P(S ≥ k) = 1 − α for a binomial, B(n, ½), random variable S. Which lower bound is more accurate? It does turn out that μ̲_1(X) is more accurate than μ̲_2(X) and is, in fact, uniformly most accurate in the N(μ, σ²) model. This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of power functions.
0') LeB for (J.a) LCB q.Oo)) < Eo. We define q* to be a uniformly most accurate level (1 .e>. and only if. .Oo) 1 ifO'(x) > 00 ootherwise is UMP level a for H : (J = eo versus K : (J > 00 . Pe[f < q(O')1 < Pe[q < q(O')] whenever q((}') < q((}). and 0 otherwise. 00 )) or Pe. [O(X) > 001 < Pe. such that for each (Jo the associated test whose critical function o*(x. Then!l* is uniformly most accurate at level (1 .a) lower confidence boundfor O. a real parameter. for e 1 > (Jo we must have Ee.6 Uniformly Most Accurate Confidence Bounds 249 Theorem 4. For instance (see Problem 4. .2.6. Let O(X) be any other (1 . Let f)* be a level (1 . [O'(X) > 001. (O'(X. 1 X n be the times to failure of n pieces of equipment where we assume that the Xi are independent £(A) variables. We want a unifonnly most accurate level (1 . Most accurate upper confidence bounds are defined similarly.6.1 to Example 4. Identify 00 with {}/ and 01 with (J in the statement of Definition 4.6. (Jo) is given by o'(x. Proof Let 0 be a competing level (1 .•. Let XI. 00 ) is UMP level Q' for H : (J = eo versus K : 0 > 00 .a) Lea for q( 0) if. Suppose ~'(X) is UMA level (1 . However. O(x) < 00 . they have the smallest expected "distance" to 0: Corollary 4.Section 4.4. the probability of early failure of a piece of equipment.a).2).(o(X.6. Also.5 favor X{k) (see Example 3. . 00 ) by o(x. Uniformly most accurate (UMA) bounds turn out to have related nice properties.1.z(1 a)a/ JTi is uniformly most accurate.t o . if a > 0. X(k) does have the advantage that we don't have to know a or even the shape of the density f of Xi to apply it. then for all 0 where a+ = a. we find that j. Then O(X.1. We can extend the notion of accuracy to confidence bounds for realvalued functions of an arbitrary parameter. Defined 0(x. Boundsforthe Probability ofEarly Failure ofEquipment.2. Example 4. 00 ) is a level a test for H : 0 = 00 versus K : 0 > 00 .1.2 and the result follows. the robustness considerations of Section 3.Because O'(X.6. 0 If we apply the result and Example 4. 00 ) ~ 0 if.a) upper confidence bound q* for q( A) = 1 .a) lower confidence bound.5. and only if.7 for the proof).a) LeB 00. for any other level (1 .
~ q(B) ~ t] ~ Pe[T. they are less likely ilian oilier level (1 . By using the duality between onesided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level (1 .. *) is a uniformly most accurate level (1 . the confidence region corresponding to this test is (0. the confidence interval be as short as possible. By Problem 4. the lengili t .6.a) quantile of the X§n distribution. however.) as a measure of precision. To find'\* we invert the family of UMP level a tests of H : A ~ AO versus K : A < AO.8. The situation wiili confidence intervals is more complicated. pp. B'.5.a) UCB for the probability of early failure. the intervals developed in Example 4.a) confidence intervals that have uniformly minimum expected lengili among all level (1.T. in general.a).3) or equivalently if A X2n(1a) o < 2"'~ 1 X 2 L_n= where X2n(1.a) UCB for A and. * is by Theorem 4.. the situation is still unsatisfactory because.1.a) is the (1. Thus. . That is. Summary.a) unbiased confidence intervals. there does not exist a member of the class of level (1 .a) iliat has uniformly minimum length among all such intervals. Confidence intervals obtained from twosided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes.6. However. 0 Discussion We have only considered confidence bounds.a) intervals iliat has minimum expected length for all B. 374376). l .. If we turn to the expected length Ee(t . because q is strictly increasing in A.a)/ 2Ao i=1 n (4.a) intervals for which members with uniformly smallest expected length exist. subject to ilie requirement that the confidence level is (1 . Pratt (1961) showed that in many of the classical problems of estimation .250 Testing and Confidence Regions Chapter 4 We begin by finding a uniformly most accurate level (1. Of course. in the case of lower bounds. a uniformly most accurate level (1 . Therefore.a) UCB'\* for A. there exist level (1 .. is random and it can be shown that in most situations there is no confidence interval of level (1 .a) lower confidence bounds to fall below any value ()' below the true B.6. the UMP test accepts H if LXi < X2n(1. 1962. In particular.\ *) where>. some large sample results in iliis direction (see Wilks.1 have this property. we can restrict attention to certain reasonable subclasses of level (1 . Considerations of accuracy lead us to ask that. as in the estimation problem. ~ q(()') ~ t] for every ().a) by the property that Pe[T. the interval must be at least as likely to cover the true value of q( B) as any other value.a) in the sense that. There are. These topics are discussed in Lehmann (1997). Neyman defines unbiased confidence intervals of level (1 . it follows that q( >.T.
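A sketch of the bound (4.6.3) and of the induced bound for the probability of early failure, taking q(λ) = 1 − e^{−λt_0} for a specified cutoff time t_0 under the E(λ) model; the failure times and the value of t_0 below are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def exponential_rate_ucb(x, alpha):
    """UMA level (1 - alpha) upper confidence bound (4.6.3) for the rate lambda, x_i ~ E(lambda)."""
    n, total = len(x), np.sum(x)
    return chi2.ppf(1 - alpha, df=2 * n) / (2 * total)

def early_failure_ucb(x, t0, alpha):
    """Upper bound for P(failure before t0) = 1 - exp(-lambda * t0), increasing in lambda."""
    return 1 - np.exp(-exponential_rate_ucb(x, alpha) * t0)

times = np.array([120.0, 85.0, 200.0, 150.0, 95.0, 180.0, 60.0, 140.0])  # hypothetical hours
print(exponential_rate_ucb(times, 0.05), early_failure_ucb(times, t0=50.0, alpha=0.05))
```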
i. Xl. then 100(1 . Example 4.a)% confidence interval. Let 7r('lx) denote the density of agiven X = x. then Ck will be an interval of the form [fl. :s: alx) ~ 1 . Instead. from Example 1.a) lower and upper credible bounds for if they respectively satisfy II(fl.a)% confidence interval is that if we repeated an experiment indefinitely each time computing a 100(1 .3. the posterior distribution of fL given Xl.a. no probability statement can be attached to this interval. Thus.a) credible region for e if II( C k Ix) ~ 1  a . and that a has the prior probability distribution II. with fLo and 75 known.d.7 Frequentist and Bayesian Formulations 251 4. with ~ a6) a6 fLB = ~ nx+l/1 ~t""o n ~ + I ~ ~2 . N(fL.::: 1 . a E e c R..'" .. and that fL rv N(fLo. a a Turning to Bayesian credible intervals and regions. II(a:s: t9lx) .::: k} is called a level (1 . with known.7..a) by the posterior distribution of the parameter given the data.1. Definition 4.a.A)2 nro ao 1 Ji = liB + Zla Vn (1 + . Suppose that.Xn are i. then fl.Xn is N(liB.1. a Definition 4. . .a lower and upper credible bounds for fL are  fL = fLB  ao Zla Vn (1 + . (j].2 and 1.12. 75).7.7. We next give such an example.Section 4.a)% of the intervals would contain the true unknown parameter value.4. given a. Let II( 'Ix) denote the posterior probability distribution of given X = x.)2 nro 1 .a) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 .1. X has distribution P e. then Ck = {a: 7r(lx) . a~).7 FREQUENTIST AND BAYESIAN FORMULATIONS We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E X c Rq are random while the parameters are fixed but unknown. Suppose that given fL. it is natural to consider the collec tion of that is "most likely" under the distribution II(alx). and {j are level (1 . Then. what are called level (1 '. A consequence of this approach is that once a numerical interval has been computed from experimental data. the interpretation of a 100(1".aB = n ~ + 1 I ~ It follows that the level 1 .6. In the Bayesian formulation of Sections 1.2. If 7r( alx) is unimodal.
Let A = a.n(1+ ~)2 nTo 1 • Compared to the frequentist interval X ± Zl~oO/.12. N(/10. .a) lower credible bound for A and ° is a level (1 . = x a+n (a) / (t + b) is a level (1 . In addition to point prediction of Y...6. Note that as TO > 00. D Example 4. (t + b)A has a X~+n distribution. However. In the Bayesian framework we define bounds and intervals. For instance. b > are known parameters.7. where /10 is a prior guess of the value of /1.02) where /10 is known.2. X n .a) credible interval is [/1. where t = 2:(Xi . called level (1 a) credible bounds and intervals. a doctor administering a treatment with delayed effect will give patients a time interval [1::.. by Problem 1.2 . the probability of coverage is computed with X = x fixed and () random with probability distribution II (B I X = x). that determine subsets of the parameter space that are assigned probability at least (1 . however. is shifted in the direction of the reciprocal b/ a of the mean of W(A).2. Let xa+n(a) denote the ath quantile of the X~+n distribution. t] in which the treatment is likely to take effect.n. Suppose that given 0. the interpretations of the intervals are different. 01 Summary.2.. the probability of coverage is computed with the data X random and B fixed. Then. See Example 1. it is desirable to give an interval [Y./10)2.d. . .8 PREDICTION INTERVALS In Section 1. Similarly.4 we discussed situations in which we want to predict the value of a random variable Y.2 . Y] that contains the unknown value Y with prescribed probability (1 . ~b) density where a > 0. then .a) by the posterior distribution of the parameter () given the data :c. Xl.252 Testing and Confidence Regions Chapter 4 while the level (1 .Xn are Li. we may want an interval for the .2 and suppose A has the gamma f( ~a. whereas in the Bayesian credible interval.4.. 4.4 for sources of such prior guesses. given Xl. In the case of a normal prior w( B) and normal model p( X I B). Compared to the frequentist bound (n 1) 8 2 / Xnl (a) of Example 4.a). /1 +1with /1 ± = /111 ± Zl'" 2 ~ 00 .a) upper credible bound for 0.. We shall analyze Bayesian credible regions further in Chapter 5. the interpretations are different: In the frequentist confidence interval.3.a) credible interval is similar to the frequentist interval except it is pulled in the direction /10 of the prior mean and it is a little narrower.• . the level (1 . the center liB of the Bayesian interval is pulled in the direction of /10. the Bayesian interval tends to the frequentist interval.
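Both credible computations in this section are short once the posterior is known: the normal–normal interval μ_B ± z(1 − ½α)σ_B, and the bounds based on (t + b)λ having a posterior χ²_{a+n} distribution with t = Σ(X_i − μ_0)². In the sketch below the prior constants and data summaries are hypothetical, and the parametrization follows the description above.

```python
import numpy as np
from scipy.stats import norm, chi2

def normal_credible_interval(xbar, n, sigma0, mu0, tau0, alpha):
    """Level (1 - alpha) credible interval for mu; X_i | mu ~ N(mu, sigma0^2), mu ~ N(mu0, tau0^2)."""
    w = n / sigma0**2 + 1 / tau0**2
    mu_B = (n * xbar / sigma0**2 + mu0 / tau0**2) / w   # posterior mean
    sigma_B = np.sqrt(1 / w)                             # posterior standard deviation
    z = norm.ppf(1 - alpha / 2)
    return mu_B - z * sigma_B, mu_B + z * sigma_B

def precision_credible_bounds(t, n, a, b, alpha):
    """(t + b)*lambda ~ chi^2_{a+n} a posteriori: lower bound for lambda, upper for sigma^2."""
    q = chi2.ppf(alpha, df=a + n)
    return q / (t + b), (t + b) / q

print(normal_credible_interval(xbar=1.24, n=10, sigma0=1.0, mu0=0.0, tau0=2.0, alpha=0.05))
print(precision_credible_bounds(t=38.5, n=10, a=2, b=1, alpha=0.05))  # hypothetical t, a, b
```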
X n +1 '" N(O. we find the (1 . TnI.. .. Tp(Y) !a) . . Note that Y Y = X .X) is independent of X by Theorem B.4. which has a X.4.a) prediction interval Y = X± l (1 ~Ct) ::.3.) acts as a confidence interval pivot in Example 4. In Example 3. "~)..4.4. We want a prediction interval for Y = X n +1. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot: Example 4.1.1) Note that Tp(Y) acts as a prediction interval pivot in the same way that T(p. we found that in the class of unbiased estimators. it can be shown using the methods of . in this case. Y ::. ::.3 and independent of X n +l by assumption. [n.1)8 2 /cr 2. the optimal MSPE predictor is Y = X. .8.Xn . by the definition of the (Student) t distribution in Section B.d.l distribution. Thus. the optimal estimator when it exists is also the optimal predictor. The (Student) t Prediction Interval.8.1. We define a predictor Y* to be prediction unbiased for Y if E(Y* .8 Prediction Intervals 253 future GPA of a student or a future value of a portfolio. Y) ?: 1 Ct. and can c~nclu~e that in the class of prediction unbiased predictors. .Y) = 0.. AISP E(Y) = MSE(Y) + cr 2 . By solving tnl Y..3.8. where MSE denotes the estimation theory mean squared error. 8 2 = (n . let Xl.+ 18tn l (1  (4..Xn be i.i. It follows that Z (Y) p  Y vnY l+lcr has a N(O. As in Example 4. We define a level (1.. . Y] based on data X such that P(Y ::. Also note that the prediction interval is much wider than the confidence interval (4. fnl (1  ~a) for Vn.Y to construct a pivot that can be used to give a prediction interval. as X '" N(p" cr 2 ).1). which is assumed to be also N(p" cr 2 ) and independent of Xl. has the t distribution. Then Y and Y are independent and the mean squared prediction error (MSPE) of Y is Note that Y can be regarded as both a predictor of Y and as an estimate of p" and when we do so.Section 4. 1) distribution and is independent of V = (n .1)1 L~(Xi .1. ~ We next use the prediction error Y .l + 1]cr2 ).Ct) prediction interval as an interval [Y. X is the optimal estimator. It follows that. In fact. . Moreover. Let Y = Y(X) denote a predictor based on X = (Xl.
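A sketch of the interval (4.8.1); note the factor √(1 + 1/n), which makes the prediction interval wider than the confidence interval (4.4.1) for μ. The sample below is hypothetical.

```python
import numpy as np
from scipy.stats import t

def t_prediction_interval(x, alpha):
    """Level (1 - alpha) prediction interval (4.8.1) for a new observation X_{n+1}."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    h = t.ppf(1 - alpha / 2, df=n - 1) * s * np.sqrt(1 + 1 / n)
    return xbar - h, xbar + h

x = np.array([3.1, 2.7, 3.5, 2.9, 3.3, 3.0, 2.8, 3.6])   # hypothetical measurements
print(t_prediction_interval(x, 0.05))
```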
I x) of Xn+l is defined as the conditional distribution of X n + l given x = (XI. 1).v) = E(U(k») .d. whereas the level of the prediction interval (4. .8. Moreover.2. UCk ) = v)dH(u.xn )..1) is approximately correct for large n even if the sample comes from a nonnormal distribution. U(O.. Here Xl"'" Xn are observable and Xn+l is to be predicted. where X n+l is independent of the data Xl'.1) tends to zero in probability at the rate n!.0:) Bayesian prediction interval for Y = X n + l if . Set Ui = F(Xi ). 00 :s: a < b :s: 00.i.. = i/(n+ 1). We want a prediction interval for Y = X n+l rv F.Xn are i. . i = 1. p(X I e).8. Let X(1) < .8. .. E(UCi») thus.d.0:) in the limit as n ) 00 for samples from nonGaussian distributions. Un ordered. . . Now [Y B' YB ] is said to be a level (1 .254 Testing and Confidence Regions Chapter 4 Chapter 5 that the width of the confidence interval (4.' X n . (4. X(k)] with k :s: Xn+l :s: XCk») = n + l' kj = n + 1 . This interval is a distributionfree prediction interval. . the confidence level of (4.'.9.12.i. ..Un+l are i.2) P(X(j) It follows that [X(j) . by Problem B. as X rv F..d. Example 4.. < X(n) denote the order statistics of Xl.i. then P(U(j) j :s: Un+l :s: UCk ») P(u:S: Un+l:S: v I U(j) = U.1) is not (1 ... with a sum replacing the integral in the discrete case. See Problem 4.' •.v) j(v lI)dH(u. n + 1. . b). then..2j)/(n + 1) prediction interval for X n +l .2. The posterior predictive distribution Q(.~o:).8.2). where F is a continuous distribution function with positive density f on (a. that is..j is a level 0: = (n + 1 .. By ProblemB. < UCn) be Ul . uniform. 0 We next give a prediction interval that is valid from samples from any population with a continuous distribution. " X n.2. Let U(1) < . Q(.5 for a simpler proof of (4.. Suppose XI. I x) has in the continuous case density e. . Ul . .4..4. Bayesian Predictive Distributions Xl' . whereas the width of the prediction interval tends to 2(]z (1.8. " X n + l are Suppose that () is random with () rv 1'( and that given () = i.E(U(j)) where H is the joint distribution of UCj) and U(k).
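A sketch of the distribution-free interval [X_(j), X_(k)]: by (4.8.2) its level is (k − j)/(n + 1), so with k = n + 1 − j one can take j = ⌊α(n + 1)/2⌋ to obtain level at least 1 − α. The lifetimes below are simulated, hypothetical data.

```python
import math
import numpy as np

def distribution_free_pi(x, alpha):
    """[X_(j), X_(k)] with k = n + 1 - j; covers X_{n+1} with probability (k - j)/(n + 1)."""
    x = np.sort(np.asarray(x))
    n = len(x)
    j = math.floor(alpha * (n + 1) / 2)
    if j < 1:
        raise ValueError("n too small for this alpha")
    k = n + 1 - j
    level = (k - j) / (n + 1)
    return x[j - 1], x[k - 1], level        # 1-based order statistics -> 0-based indexing

lifetimes = np.random.default_rng(0).exponential(scale=100.0, size=39)  # hypothetical, n = 39
print(distribution_free_pi(lifetimes, alpha=0.10))                      # level 0.90 here
```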
where X and X n + l are independent. .9 Likelihood Ratio Procedures 255 Example 4.8. The Bayesian formulation is based on the posterior predictive distribution which is the conditional distribution of the unobservable variable given the observable variables. we construct the Student t prediction interval for the unobservable variable. 1983).c{(Xn +1 where.7.2 o + I' T2 ~ P. In the case of a normal sample of size n + 1 with only n variables observable.0:) Bayesian prediction interval for Y is [YB ' Yt] with Yf = liB ± Z (1. N(B.3) converges in probability to B±z (1 .i. A sufficient statistic based on the observables Xl.c{Xn+l I X = t} = . and 7r( B) is N( TJo . by Theorem BA. However. T2). Consider Example 3. Thus.9 4. even in .8. (J"5 + a~) ~2 = (J" B n I (. The Bayesian prediction interval is derived for the normal model with a normal prior. T2 known. (J"5).~o:) V(J"5 + a~. 4. (J"5 known. (J" B n(J" B 2) It follows that a level (1 . and it is enough to derive the marginal distribution of Y = X n +l from the joint distribution of X.8.0:). note that given X = t.Section 4. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1. from Example 4. This is the same as the probability limit of the frequentist interval (4.9. Xn is T = X = n 1 2:~=1 Xi . Note that E[(Xn+l .8.3.B)B I 0 = B} = O. the results and examples in this chapter deal mostly with oneparameter problems in which it sometimes is possible to find optimal procedures.3) To consider the frequentist properties of the Bayesian prediction interval (4..0)0] = E{E(Xn+l . B = (2 / T 2) TJo + ( 2 / (J"0 X. Thus. . we find that the interval (4. Xn+l .1 LIKELIHOOD RATIO PROCEDURES Introduction Up to this point.0 and 0 are still uncorrelated and independent.I .~o:) (J"o as n + 00. To obtain the predictive distribution. independent. .d. Xn+l .1) .2. (n(J"~ /(J"5) + 1. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distributionfree prediction interval..1.  0) + 0 I X = t} = N(liB. (J"5). and X ~ Bas n + 00. Because (J"~ + 0.8. X n + l and 0.0 and 0 are uncorrelated and.1 where (Xi I B) '" N(B .3) we compute its probability limit under the assumption that Xl " '" Xn are i. 0 The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box. (4. Summary.
/Lo)/a. In particular cases. eo). z(a ). eo) when 8 0 = {eo}. recall from Section 2. Although the calculations differ from case to case. In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. the MP level a test 6a (X) rejects H for T > z(l .. In the cases we shall consider. Note that in general A(X) = max(L(x).a )th quantile obtained from the table. . 4.1(c». e) : () sup{p(x. e) as a measure of how well e "explains" the given sample x = (Xl. 8 1 = {e l }. 2. . where T = fo(X . Calculate the MLE eo of e where e may vary only over 8 0 . there is no UMP test for testing H : /L = /Lo vs K : /L I. we specify the size a likelihood ratio test through the test statistic h(A(X) and .2. e) : e E 8 l }. e) E 8} : eE 8 0} (4. Xn).9. To see this. ed/p(x.'Pa(x). ed / p(x. Because h(A(X)) is equivalent to A(X) . We start with a generalization of the NeymanPearson statistic p(x . p(x. and for large samples.2. eo). by the uniqueness of the NP test (Theorem 4.x) = p(x./LO.1) whose computation is often simple. . the MP level a test 'Pa(X) rejects H if T ::::. if Xl. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. 1.. if /Ll < /Lo. For instance. Also note that L(x) coincides with the optimal test statistic p(x . sup{p(x. e) : e E 8l}islargecomparedtosup{p(x. note that it follows from Example 4. optimal procedures may not exist.. then the observed sample is best explained by some E 8 1 . Form A(X) = p(x. .1 that if /Ll > /Lo. . ifsup{p(x. . 1). On the other hand. Xn is a sample from a N(/L. . So. e) and we wish to test H : E 8 0 vs K : E 8 1 . Calculate the MLE e of e. B')/p(x . a 2 ) population with a 2 known. Suppose that X = (Xl . Because 6a (x) I. e) is a continuous function of e and eo is of smaller dimension than 8 = 8 0 U 8 1 so that the likelihood ratio equals the test statistic I e e 1 A(X) = sup{p(x. To see that this is a plausible statistic. .2 that we think of the likelihood function L(e.its (1 .e): E 8 0 }. The test statistic we want to consider is the likelihood ratio given by L(x) = sup{p(x. 3..~a).Xn ) has density or frequency function p(x. e) : e E 8 0 } e e e Tests that reject H for large values of L(x) are called likelihood ratio tests. the basic steps are always the same. We are going to derive likelihood ratio tests in several important testing problems. and conversely./Lo. . likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6. there can be no UMP test of H : /L = /Lo vs H : /L I.256 Testing and Confidence Regions Chapter 4 the case in which is onedimensional. Find a function h that is strictly increasing on the range of A such that h(A(X)) has a simple form and a tabled distribution under H .2.
we can invert the family of size a likelihood ratio tests of the point hypothesis H : 8 = 80 and obtain the level (1 . Here are some examples.9.Xn form a sample from a N(JL. mileage of cars with and without a certain ingredient or adjustment. [ supe p(X. and soon. We can regard twins as being matched pairs.82 ) where 81 is the parameter of interest and 82 is a nuisance parameter. This section includes situations in which 8 = (8 1 . In the ith pair one patient is picked at random (i. we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. If the treatment and placebo have the same effect. For instance. and so on. Studies in which subjects serve as their own control can also be thought of as matched pair experiments. while the second patient serves as control and receives a placebo.9 Likelihood Ratio Procedures 257 We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions.24. Response measurements are taken on the treated and control members of each pair. the difference Xi has a distribution that is .9. That is. We shall obtain likelihood ratio tests for hypotheses of the form H : 81 = 8lD . we measure the response of a subject when under treatment and when not under treatment. An example is discussed in Section 4.e. An important class of situations for which this model may be appropriate occurs in matched pair experiments. 8n e (4. sales performance before and after a course in salesmanship.a) confidence region C(x) = {8 : p(x. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo.5. The family of such level a likelihood ratio tests obtained by varying 8lD can also be inverted and yield confidence regions for 8 1 .Section 4. 8) ~ [c( 8)t l sup p(x.. which are composite because 82 can vary freely.9. In order to reduce differences due to the extraneous factors.2 Tests for the Mean of a Normal DistributionMatched Pair Experiments Suppose Xl.'" .9. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age. 8) > (8)] = eo a. diet. In that case. with probability ~) and given the treatment. After the matching.9.2) where sUPe denotes sup over 8 E e and the critical constant c( 8) satisfies p.8) . bounds. the experiment proceeds as follows.2.c 0 0 It is often approximately true (see Chapter 6) that c( 8) is independent of 8. C (x) is just the set of all 8 whose likelihood is on or above some fixed value dependent on the data. To see how the process works we refer to the specific examples in Sections 4. 4. We are interested in expected differences in responses due to the treatment effect. and other factors. Let Xi denote the difference between the treated and control responses for the ith pair. 0'2) population in which both JL and 0'2 are unknown. p (X .
n i=l is the maximum likelihood estimate of B.0 )2 .0) 2  a2 n] = 0. To this we have added the nonnality assumption.fJ. B) : B E 8 0 } boils down to finding the maximum likelihood estimate a~ of a 2 when fJ. as discussed in Section 4. 'for the purpose of referring to the duality between testing and confidence procedures.0. good or bad.258 Testing and Confidence Regions Chapter 4 symmetric about zero. We think of fJ. Form of the TwoSided Tests Let B = (fJ" a 2 ). The problem of finding the supremum of p(x.L Xi 1 ~( n i=l . = fJ. the test can be modified into a threedecision rule that decides whether there is a significant positive or negative effect. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the nonnality assumption is not satisfied.0 as an established standard for an old treatment.a)= ~ ~2 (1 ~ 1 ~ n i=l 2 ) . 8 0 = {(fJ" Under our assumptions.0 is known and then evaluating p(x. a~). where B=(x.5. = fJ. Let fJ.o.L X i ' . = E(X 1 ) denote the mean difference between the response of the treated and control subjects. . = fJ. B) was solved in Example 3.L ( X i .=1 fJ. =Ie fJ.0. We found that sup{p(x. a 2 ) : fJ.0}.B): B E 8} = p(x. which has the immediate solution ao ~2 = . The likelihood equation is oa 2 a logp(x. where we think of fJ. However. Finding sup{p(x. TwoSided Tests We begin by considering K : fJ. we test H : fJ. Our null hypothesis of no treatment effect is then H : fJ.X) . This corresponds to the alternative "The treatment has some effect. = O.6. e). B) 1[1~ ="2 a 4 L(Xi ." However. as representing the treatment effect. B) at (fJ.3.
Mo)/a.x)2 = n&2/(n .\(x) logp(x.9. Therefore. is of size a for H : M ::. suppose n = 25 and we want a = 0. 8) for 8 E 8 0. Because 8 2 function of ITn 1 where = 1 + (x . rejects H for iarge values of (&5/&2). Therefore.MO)2/&2. However.Section 4. &5 gives the maximum of p(x. the relevant question is whether the treatment creates an improvement. &0)) ~ 2 {~[(log271') + (log&2)]~} ~ log(&5/&2). Mo versus K : M > Mo (with Mo = 0) is suggested. The test statistic . (Mo. (n .1. the size a critical value is t n 1 (1 . The statistic Tn is equivalent to the likelihood ratio statistic A for this problem. the size a likelihood ratio test for H : M Z Mo versus K : M < Mo rejects H if. and only if. Similarly. Thus.8) logp(x. Then we would reject H if. A and B.9. &5/&2 (&5/&2) is monotone increasing Tn = y'n(x . where 8 = (M .1). For instance. OneSided Tests The twosided formulation is natural if two treatments.9 Likelihood Ratio Procedures 259 By Theorem 2. To simplify the rule further we use the following equation.064. the testing problem H : M ::. to find the critical value. which thus equals log .1).a). which can be established by expanding both sides. Mo . the test that rejects H for Tn Z t nl(1 . are considered to be equal before the experiment is performed.~a) and we can use calculators or software that gives quantiles of the t distribution. In Problem 4.3.  {~[(log271') + (log&5)]~} Our test rule. ITnl Z 2. Because Tn has a T distribution under H (see Example 4. 8 Therefore.4. therefore. if we are comparing a treatment and control.Mo) . the likelihood ratio tests reject for large values of ITn I.\(x).05. .\(x) is equivalent to log .1)1 I:(Xi . or Table III.11 we argue that P"[Tn Z t] is increasing in 8. and only if. A proof is sketched in Problem 4.2.
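A numerical check of the identity just used: log λ(x) = (n/2) log(σ̂_0²/σ̂²) and σ̂_0²/σ̂² = 1 + T_n²/(n − 1), so the likelihood ratio test and the two-sided t test order samples identically. The ten differences below are hypothetical.

```python
import numpy as np
from scipy.stats import t

def log_lr_and_t(x, mu0):
    """Log likelihood ratio for H: mu = mu0 and the equivalent t statistic."""
    n = len(x)
    sigma2_hat = np.mean((x - np.mean(x)) ** 2)   # MLE of sigma^2 under the full model
    sigma2_0 = np.mean((x - mu0) ** 2)            # MLE of sigma^2 under H: mu = mu0
    log_lambda = 0.5 * n * np.log(sigma2_0 / sigma2_hat)
    Tn = np.sqrt(n) * (np.mean(x) - mu0) / np.std(x, ddof=1)
    return log_lambda, Tn

x = np.array([0.7, 1.6, 0.2, 1.2, 0.1, 1.4, 0.9, 1.8, 0.3, 1.1])   # hypothetical differences
log_lam, Tn = log_lr_and_t(x, mu0=0.0)
n = len(x)
print(log_lam, 0.5 * n * np.log(1 + Tn**2 / (n - 1)))   # equal, confirming the identity
print(abs(Tn) >= t.ppf(0.975, df=n - 1))                # size-0.05 two-sided (t = LR) test
```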
Power Functions

Note that the distribution of T_n depends on θ = (μ, σ²) only through δ, where δ = √n(μ − μ_0)/σ. To derive the distribution of T_n we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ. This distribution, denoted by T_{k,δ}, is by definition the distribution of Z/√(V/k), where Z and V are independent and have N(δ, 1) and χ²_k distributions, respectively. The density of Z/√(V/k) is given in Problem 4.9.12. From Section B.3 we know that √n(X̄ − μ)/σ and (n − 1)s²/σ² are independent and that (n − 1)s²/σ² has a χ²_{n−1} distribution. Because E[√n(X̄ − μ_0)/σ] = √n(μ − μ_0)/σ and √n(X̄ − μ_0)/σ has a N(δ, 1) distribution, the ratio

T_n = [√n(X̄ − μ_0)/σ] / √{ [(n − 1)s²/σ²] / (n − 1) }

has a T_{n−1,δ} distribution, with δ = √n(μ − μ_0)/σ, and the power can be obtained from computer software or tables of the noncentral t distribution. The power functions of the one-sided tests are monotone in δ (Problem 4.9.11), just as the power functions of the corresponding tests with σ known are monotone in √n μ/σ.

We can control both probabilities of error by selecting the sample size n large, provided we consider alternatives of the form |δ| ≥ δ_1 > 0 in the two-sided case and δ ≥ δ_1 or δ ≤ −δ_1 in the one-sided cases. Computer software will compute the required n. If we consider alternatives of the form (μ − μ_0) ≥ Δ, however, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be n, by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ − μ_0)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α. We have met similar difficulties before when discussing confidence intervals. A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ − μ_0| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

C(X) = {μ_0 : |T_n(μ_0)| ≤ t_{n−1}(1 − ½α)} = [X̄ − s t_{n−1}(1 − ½α)/√n, X̄ + s t_{n−1}(1 − ½α)/√n].

We recognize C(X) as the confidence interval of Example 4.4.1. Similarly, the one-sided tests lead to the lower and upper confidence bounds of that example.
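The power calculation described above is routine once software for the noncentral t distribution is available. The following minimal sketch (helper names are ours; numpy and scipy assumed) computes the power of the one-sided size-α test as a function of (μ − μ_0)/σ and searches for the smallest n giving a prescribed power.

```python
import numpy as np
from scipy import stats

def power_one_sided(n, effect, alpha=0.05):
    """Power of the test rejecting when T_n >= t_{n-1}(1 - alpha),
    where effect = (mu - mu0)/sigma, so the noncentrality is delta = sqrt(n)*effect."""
    crit = stats.t.ppf(1 - alpha, df=n - 1)
    delta = np.sqrt(n) * effect
    return stats.nct.sf(crit, df=n - 1, nc=delta)

def smallest_n(effect, power=0.90, alpha=0.05):
    """Smallest sample size whose power at the given effect is at least `power`."""
    n = 2
    while power_one_sided(n, effect, alpha) < power:
        n += 1
    return n

print(smallest_n(0.5))   # roughly 36 observations for a half-sigma alternative
```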
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B − A in sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i    1     2     3     4     5    6    7    8    9    10
A          0.7  -1.6  -0.2  -1.2  -0.1  3.4  3.7  0.8  0.0  2.0
B          1.9   0.8   1.1   0.1  -0.1  4.4  5.5  1.6  4.6  3.4
B - A      1.2   2.4   1.3   1.3   0.0  1.0  1.8  0.8  4.6  1.4

If we denote the differences B − A as x's, then x̄ = 1.58, s² = 1.513, and |T_n| = 4.06. Because t_9(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A, because no hypothesis μ = μ′ < 0 is accepted at this level.
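The numbers in this example are easy to reproduce. A minimal sketch, assuming numpy and scipy, working from the B − A differences in the table:

```python
import numpy as np
from scipy import stats

d = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])  # B - A differences
n = len(d)
xbar, s2 = d.mean(), d.var(ddof=1)
T = np.sqrt(n) * xbar / np.sqrt(s2)
print(xbar, s2, T)                       # 1.58, 1.513, about 4.06

t_crit = stats.t.ppf(0.995, df=n - 1)    # t_9(0.995) = 3.25
half = t_crit * np.sqrt(s2 / n)
print(xbar - half, xbar + half)          # 0.99 interval, approximately (0.32, 2.84)

print(stats.ttest_1samp(d, 0.0))         # the same test done directly by scipy
```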
262 (X.9.2 (y. where 2 Testing and Confidence Regions Chapter 4 When ttl tt2 = tt.3 Tn2 1 . Y.1 . where and If we use the identities ~ fCYi ?l i=l j1)2 1 n fp'i i=l y)2 + n2 (Y _ j1)2 n obtained by writing [Xi .j112 log likelihood ratio statistic [(Xi .X) + (X  j1)]2 and expanding. Y. we find that the is equivalent to the test statistic ITI where and To complete specification of the size a likelihood ratio test we show that T has distribution when ttl = tt2.2. our model reduces to the onesample model of Section 4.2 1 I:rt2 .1 I:rtl . . (75). Thus. (7 ).Y) '1':" a a a i=l a j=l J I . By Theorem B.X) and . j1. . the maximum of p over 8 0 is obtained for f) (j1.2 X.2 (Xi .3.
By Theorem B.3.3, X̄/σ, Ȳ/σ, Σ_{i=1}^{n_1} (X_i − X̄)²/σ², and Σ_{j=1}^{n_2} (Y_j − Ȳ)²/σ² are independent and distributed as N(μ_1/σ, 1/n_1), N(μ_2/σ, 1/n_2), χ²_{n_1−1}, χ²_{n_2−1}, respectively. We conclude from this remark and the additive property of the χ² distribution that (n − 2)s²/σ² has a χ²_{n−2} distribution. Moreover, √(n_1 n_2 / n)(Ȳ − X̄)/σ has a N(0, 1) distribution under H and is independent of (n − 2)s²/σ². Therefore, by definition, T has a T_{n−2} distribution under H, and the resulting two-sided, two-sample t test rejects if, and only if,

|T| ≥ t_{n−2}(1 − ½α).

As in the one-sample case, there are two one-sided tests, with critical regions T ≥ t_{n−2}(1 − α) for H : μ_2 ≤ μ_1 and T ≤ −t_{n−2}(1 − α) for H : μ_1 ≤ μ_2. We can show that these tests are likelihood ratio tests for these hypotheses. It is also true that these procedures are of size α for their respective hypotheses; as usual, this follows from the fact that, if μ_1 ≠ μ_2, T has a noncentral t distribution with noncentrality parameter √(n_1 n_2 / n)(μ_2 − μ_1)/σ.

Confidence Intervals

To obtain confidence intervals for μ_2 − μ_1 we naturally look at likelihood ratio tests for the family of testing problems H : μ_2 − μ_1 = Δ versus K : μ_2 − μ_1 ≠ Δ. As for the special case Δ = 0, we find a simple equivalent statistic |T(Δ)|, where

T(Δ) = √(n_1 n_2 / n) (Ȳ − X̄ − Δ) / s.

If μ_2 − μ_1 = Δ, then T(Δ) has a T_{n−2} distribution, and inversion of the tests leads to the interval

Ȳ − X̄ ± s t_{n−2}(1 − ½α) √(1/n_1 + 1/n_2)     (4.9.3)

for μ_2 − μ_1. As in the one-sample case, the one-sided tests lead to the upper and lower endpoints of the interval as level 1 − ½α likelihood confidence bounds.
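For readers who want to check these formulas numerically, here is a minimal sketch (numpy and scipy assumed; the function name is ours) that computes T, the two-sided p-value, and the interval (4.9.3) for μ_2 − μ_1:

```python
import numpy as np
from scipy import stats

def pooled_two_sample(x, y, alpha=0.05, delta=0.0):
    """Pooled two-sample t test of H: mu2 - mu1 = delta and the
    level (1 - alpha) confidence interval (4.9.3) for mu2 - mu1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    s2 = (((x - x.mean())**2).sum() + ((y - y.mean())**2).sum()) / (n1 + n2 - 2)
    se = np.sqrt(s2 * (1.0/n1 + 1.0/n2))
    T = (y.mean() - x.mean() - delta) / se
    p_two = 2 * stats.t.sf(abs(T), df=n1 + n2 - 2)
    t_crit = stats.t.ppf(1 - alpha/2, df=n1 + n2 - 2)
    diff = y.mean() - x.mean()
    return T, p_two, (diff - t_crit * se, diff + t_crit * se)
```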
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results, in terms of logarithms, were (from Hald, 1952, p. 472)

x (machine 1): 1.845  1.790  2.042
y (machine 2): 1.583  1.627  1.282

We test the hypothesis H of no difference in expected log permeability. Here x̄ − ȳ = 0.395, s² = 0.0264, and |T| = 2.98. Because t_4(0.975) = 2.776 and H is rejected if |T| ≥ t_{n−2}(1 − ½α), we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is 0.395 ± 0.368. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 − α) confidence interval has probability at most ½α of making the wrong selection.
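The analysis of this example can be reproduced directly; a minimal sketch assuming numpy and scipy:

```python
import numpy as np
from scipy import stats

x = np.array([1.845, 1.790, 2.042])    # machine 1, log permeability
y = np.array([1.583, 1.627, 1.282])    # machine 2
n1, n2 = len(x), len(y)

s2 = (((x - x.mean())**2).sum() + ((y - y.mean())**2).sum()) / (n1 + n2 - 2)
se = np.sqrt(s2 * (1/n1 + 1/n2))
T = (x.mean() - y.mean()) / se
half = stats.t.ppf(0.975, df=n1 + n2 - 2) * se
print(x.mean() - y.mean(), s2, T)      # 0.395, 0.0264, about 2.98
print(half)                            # about 0.368, half-width of the 0.95 interval

print(stats.ttest_ind(x, y, equal_var=True))   # scipy's pooled-variance version
```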
4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. If normality holds, we are led to a model where X_1, ..., X_{n_1} and Y_1, ..., Y_{n_2} are two independent N(μ_1, σ_1²) and N(μ_2, σ_2²) samples, respectively. This is the Behrens-Fisher problem. As a first step we may still want to compare mean responses for the X and Y populations.

Suppose first that σ_1² and σ_2² are known. The log likelihood, except for an additive constant, is

−(1/2σ_1²) Σ_{i=1}^{n_1} (x_i − μ_1)² − (1/2σ_2²) Σ_{j=1}^{n_2} (y_j − μ_2)².

The MLEs of μ_1 and μ_2 are, thus, μ̂_1 = x̄ and μ̂_2 = ȳ for (μ_1, μ_2) ∈ R × R. When μ_1 = μ_2 = μ, setting the derivative of the log likelihood equal to zero yields the MLE

μ̂ = (n_1 x̄/σ_1² + n_2 ȳ/σ_2²) / (n_1/σ_1² + n_2/σ_2²),

so that

μ̂ − x̄ = n_2 σ_1² (ȳ − x̄) / (n_1 σ_2² + n_2 σ_1²)  and  μ̂ − ȳ = n_1 σ_2² (x̄ − ȳ) / (n_1 σ_2² + n_2 σ_1²).

Next we compute

λ(x, y) = exp{ (n_1/2σ_1²)(μ̂ − x̄)² + (n_2/2σ_2²)(μ̂ − ȳ)² } = exp{ D²/2σ_D² },

where D = Ȳ − X̄ and σ_D² is the variance of D, that is,

σ_D² = Var(X̄) + Var(Ȳ) = σ_1²/n_1 + σ_2²/n_2.

It follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D. Under H, D/σ_D has a N(0, 1) distribution, and more generally (D − Δ)/σ_D has a N(0, 1) distribution, where Δ = μ_2 − μ_1.

When σ_1² and σ_2² are unknown, σ_D² must be estimated. An unbiased estimate is

s_D² = s_1²/n_1 + s_2²/n_2.

It is natural to try and use D/s_D as a test statistic for the one-sided hypothesis H : μ_2 ≤ μ_1, |D|/s_D as a test statistic for H : μ_1 = μ_2, and more generally (D − Δ)/s_D to generate confidence procedures. Unfortunately, the distribution of (D − Δ)/s_D depends on σ_1²/σ_2² for fixed n_1, n_2. For large n_1, n_2, by Slutsky's theorem and the central limit theorem, (D − Δ)/s_D has approximately a standard normal distribution (Problem 5.3.28). For small and moderate n_1, n_2, an approximation to the distribution of (D − Δ)/s_D due to Welch (1949) works well.
Let c = (s_1²/n_1)/s_D². Then Welch's approximation to the distribution of (D − Δ)/s_D is T_k, where

k = [ c²/(n_1 − 1) + (1 − c)²/(n_2 − 1) ]^{-1}.

When k is not an integer, the critical value is obtained by linear interpolation in the t tables or by using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.01 and α = 0.05, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n_1 = n_2, can unfortunately be very misleading if σ_1² ≠ σ_2² and n_1 ≠ n_2. See Problem 5.3.28.
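Welch's approximation is simple to implement. The sketch below (helper name ours; numpy and scipy assumed) computes D/s_D and the approximate degrees of freedom k from the formula above, and compares the result with scipy's built-in Welch test; note that scipy's statistic is based on x̄ − ȳ, so its sign is flipped relative to D = Ȳ − X̄.

```python
import numpy as np
from scipy import stats

def welch_test(x, y, delta=0.0):
    """(D - delta)/s_D with D = ybar - xbar, Welch's approximate df k,
    and the one-sided p-value for H: mu2 - mu1 <= delta."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
    sD = np.sqrt(v1 + v2)
    c = v1 / (v1 + v2)
    k = 1.0 / (c**2 / (n1 - 1) + (1 - c)**2 / (n2 - 1))
    T = (y.mean() - x.mean() - delta) / sD
    return T, k, stats.t.sf(T, df=k)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=15)      # unequal variances, unequal sample sizes
y = rng.normal(0.5, 3.0, size=40)
print(welch_test(x, y))
print(stats.ttest_ind(x, y, equal_var=False))   # scipy's (two-sided) Welch test
```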
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, mice, fields, machines, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X_1, Y_1), ..., (X_n, Y_n). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ_1, μ_2, σ_1², σ_2², ρ), with σ_1² > 0, σ_2² > 0.

Testing Independence, Confidence Intervals for ρ

The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = test score on English exam, Y = test score on mathematics exam; X = average cigarette consumption per day in grams, Y = age at death; X = percentage of fat in diet, Y = cholesterol level in blood; X = weight, Y = blood pressure. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem becomes that of testing H : ρ = 0.

Two-Sided Tests

The unrestricted maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂_1², σ̂_2², ρ̂) was given in the problems of Chapter 2, where

σ̂_1² = (1/n) Σ_{i=1}^n (x_i − x̄)²,  σ̂_2² = (1/n) Σ_{i=1}^n (y_i − ȳ)²,

and

ρ̂ = [ Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) ] / (n σ̂_1 σ̂_2).

Here ρ̂ is called the sample correlation coefficient and satisfies −1 ≤ ρ̂ ≤ 1. When ρ = 0 we have two independent samples, and θ̂_0 can be obtained by separately maximizing the likelihood of X_1, ..., X_n and that of Y_1, ..., Y_n. We have θ̂_0 = (x̄, ȳ, σ̂_1², σ̂_2², 0), and the log of the likelihood ratio statistic becomes

log λ(x) = −(n/2) log(1 − ρ̂²).     (4.9.4)

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|. To obtain critical values we need the distribution of ρ̂, or of an equivalent statistic, under H. Because (U_1, V_1), ..., (U_n, V_n) is a sample from the N(0, 0, 1, 1, ρ) distribution, where U_i = (X_i − μ_1)/σ_1 and V_i = (Y_i − μ_2)/σ_2, and ρ̂ is unchanged if each (X_i, Y_i) is replaced by (U_i, V_i), the distribution of ρ̂ depends on ρ only. When ρ = 0, it can be shown (Appendix B) that

T_n = √(n − 2) ρ̂ / √(1 − ρ̂²)     (4.9.5)

has a T_{n−2} distribution. Because |T_n| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |T_n|, with critical values obtained from Table II. There is no simple form for the distribution of ρ̂ (or T_n) when ρ ≠ 0; a normal approximation is available (see Chapter 5), and for any ρ the distribution of ρ̂ is available in computer packages. Qualitatively, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
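The equivalence of ρ̂ and T_n is easy to verify numerically; a minimal sketch, assuming numpy and scipy, with simulated data standing in for an observed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 0.3 * x + rng.normal(size=20)
n = len(x)

rho_hat = np.corrcoef(x, y)[0, 1]                      # sample correlation coefficient
Tn = np.sqrt(n - 2) * rho_hat / np.sqrt(1 - rho_hat**2)
p_two = 2 * stats.t.sf(abs(Tn), df=n - 2)              # two-sided test of H: rho = 0
print(rho_hat, Tn, p_two)

print(stats.pearsonr(x, y))   # scipy reports the same rho_hat and two-sided p-value
```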
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test H : ρ = 0 versus K : ρ > 0 or H : ρ ≤ 0 versus K : ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H : ρ ≤ 0 versus K : ρ > 0, and similarly that −ρ̂ corresponds to the likelihood ratio statistic for H : ρ ≥ 0 versus K : ρ < 0. The power functions of these tests are monotone; in fact, P_ρ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (Problem 4.9.15). Therefore, we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0.

Confidence Bounds and Intervals

Usually testing independence is not enough, and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H : ρ = ρ_0 versus K : ρ > ρ_0. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ_0)," where P_{ρ_0}[ρ̂ ≤ c(ρ_0)] = 1 − α. We obtain c(ρ) either from computer software or by the approximation of Chapter 5. Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level (1 − α) lower confidence bounds. We can similarly obtain level (1 − α) upper confidence bounds and, by putting two level (1 − ½α) bounds together, we obtain a commonly used confidence interval for ρ. These intervals do not correspond to the inversion of the size α LR tests of H : ρ = ρ_0 versus K : ρ ≠ ρ_0 but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ_0) or ρ̂ ≤ c(ρ_0), where P_{ρ_0}[ρ̂ ≤ d(ρ_0)] = P_{ρ_0}[ρ̂ ≥ c(ρ_0)] = 1 − ½α. However, for large n the equal-tails and LR confidence intervals approximately coincide.

Data Example

As an illustration, consider a bivariate sample of weights x_i of young rats at a certain age and the weight increase y_i during the following week (n = 9 pairs; the data table is not reproduced here). We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis H : ρ = 0. Here ρ̂ = 0.18 and T_n = 0.48. Thus, by using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the p-value is bigger than 0.05.

Summary. The likelihood ratio test statistic λ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
(1) Matched pair experiments in which differences are modeled as N(μ, σ²) and we test the hypothesis that the mean difference μ is zero. The likelihood ratio test is equivalent to the one-sample (Student) t test.

(2) Two-sample experiments in which two independent samples are modeled as coming from N(μ_1, σ²) and N(μ_2, σ²) populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) t test.

(3) Two-sample experiments in which two independent samples are modeled as coming from N(μ_1, σ_1²) and N(μ_2, σ_2²) populations, respectively. When σ_1² and σ_2² are known, the likelihood ratio test is equivalent to the test based on |Ȳ − X̄|. When σ_1² and σ_2² are unknown, we use (Ȳ − X̄)/s_D, where s_D is an estimate of the standard deviation of D = Ȳ − X̄. Approximate critical values are obtained using Welch's t distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases. We test the hypothesis that X and Y are independent and find that the likelihood ratio test is equivalent to the test based on |ρ̂|, where ρ̂ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a t statistic with n − 2 degrees of freedom.

4.10 PROBLEMS AND COMPLEMENTS

Problems for Section 4.1

1. Suppose that X_1, ..., X_n are independently and identically distributed according to the uniform distribution U(0, θ). Let M_n = max(X_1, ..., X_n), and let δ_c = 1 if M_n ≥ c, 0 otherwise.
(a) Compute the power function of δ_c and show that it is a monotone increasing function of θ.
(b) In testing H : θ ≤ ½ versus K : θ > ½, what choice of c would make δ_c have size exactly 0.05?
(c) Draw a rough graph of the power function of the δ_c specified in (b) when n = 20.
(d) How large should n be so that the δ_c specified in (b) has power 0.98 for θ = ...?
(e) If in a sample of size n = 20, M_n = ..., what is the p-value?

2. Let X_1, ..., X_n denote the times in days to failure of n similar pieces of equipment. Assume the model where X = (X_1, ..., X_n) is an E(λ) sample. Consider the hypothesis H that the mean life 1/λ = μ satisfies μ ≤ μ_0.
. distribution. 1). j .270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8.<»1 VTi)J . Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0.) 4.. j = 1. T ~ Hint: See Problem B. Hint: See Problem B. Let o:(Tj ) denote the pvalue for 1j.3). (b) Give an expression of the power in tenns of the X~n distribution. 6. .. . . Assume that F o and F are continuous.3. under H. . . j=l . Let Xl.~llog <>(Tj ) has a X~r distribution. .. Draw a graph of the approximate power function. U(O.. Show that if H is simple and the test statistic T has a c. where x(I . X> 0..ontinuous distribution. . (d) The following are days until failure of air monitors at a nuclear plant. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model. 7...a) is the (1 . give a normal approximation to the significance probability. Hint: Use the central limit theorem for the critical value. f(x.a)th quantile of the X~n distribution.12. 1nz t Z i . .4 to show that the test with critical region IX > I"ox( 1  <» 12n]. is a size Q test..4. . 1 using a I .. (b) Show that the power function of your test is increasing in 8.Xn be a sample from a population with the Rayleigh density . (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00.3.O) = (xIO')exp{x'/20'}. . X n be a 1'(0) sample.2. . . i I (b) Check that your test statistic has greater expected value under K than under H.1. (0) Show that. Establish (4. Hint: Approximate the critical region by IX > 1"0(1 + z(1 . " i I 5. Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution. r. . .2 L:. (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 . Suppose that T 1 . (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a). Let Xl. then the pvalue <>(T) has a unifonn.r. I ..05? 3. If 110 = 25... (Use the central limit theorem.0 > 0.
p(Fo(x))IF(x) ..p(F(x)IF(x) . let . . (b) When 'Ij. j = 1. . . 1.o T¢.. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1). . if X. 9..Fo(x)IO x ~ ~ sup. = ~ I. n (c) Show that if F and Fo are continuous and FiFo..1.a)/b.12(b). . b > 0.5.. (This is the logistic distribution with mean zero and variance 1.Fo(x) 10 U¢.2.10.)(Tn ) = LN(O. (a) For each of these statistics show that the distribution under H does not depend on F o. .d.Fo(x)IOdFo(x) . . Vw. (b) Use part (a) to conclude that LN(".X(B) from F o on the computer and computing TU) = T(XU)). X n be d. with distribution function F and consider H : F = Fo. Let X I.(X') = Tn(X).Fo(x)1 foteach. .Fo• generate a U(O.T. Here. .. then T. and 1. 10. is chosen 00 so that 00 x 2 dF(x) = 1. Use the fact that T(X) is equally likely to be any particular order statistic.6 is invariant under location and scale. ~ ll.1. Fo(x)1 > kol x ~ Hint: D n > !F(x) . Define the statistics S¢. B. T(B) be B independent Monte Carlo simulated values of T. .o: is called the Cramervon Mises statistic. (In practice these can be obtained by drawing B independent samples X(I).. 1) to (0... Express the Cramervon Mises statistic as a sum.u. . T(l).Fo(x)1 > k o ) for a ~ 0.o J J x .Section 4.5. and let a > O.. 80 and x = 0.. That is. (a) Show that the statistic Tn of Example 4..p(F(x»!F(x) . Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.T(X(l)).. = (Xi . (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).00). .J(u) = 1 and 0: = 2. .5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4.1.10 Problems and Complements 271 8. In Example 4. Hint: If H is true T(X).T(B+l) denote T.5.) Evaluate the bound Pp(lF(x) . (b) Suppose Fo isN(O..o V¢. then the power of the Kolmogorov test tends to 1 as n .o sup. . Let T(l). to get X with distribution .) Next let T(I).l) (Tn). T(B) ordered. .p(Fo(x))lF(x) . 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/. . 1) variable on the computer and set X ~ F O 1 (U) as in Problem B..Fo(x)IOdF(x). 00.p(u) be a function from (0. T(X(B)) is a sample of size B + 1 from La.
05. . [2N1 + N...3.a)) = inf{t: Fo(t) > u}. 12. Consider Examples 3. Let 0 < 00 < 0 1 < 1.. and 3.10. ! . .') sample Xl. f(3. If each time you test.2. You want to buy one of two systems.0) = (1. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • .). j 1 i :J . you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0.0). > cJ = a. i • (a) Show that if the test has level a. the UMP test is of the form I {T > c).) . Whichever system you buy during the year. Consider a population with three kinds of individuals labeled 1. Problems for Section 4. Without loss of generality.L > J..1.. f(2. where" is known. 01 ) is an increasing function of 2N1 + N.O) = 0'. I X n from this population. which is independent ofT. 5 about 14% of the time. For a sample Xl. Let T = X/"o and 0 = /" .2.0)'. take 80 = O. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2.1) satisfy Pe.nO /.. 2.to on the basis of the N(p" . 3.. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 .. Suppose that T has a continuous distribution Fe. and 3 occuring in the HardyWeinberg proportions f(I./"0' Show that EPV(O) = if! ( .272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale.1. (a) Show that L(x.2 and 4.. Let To denote a random variable with distribution F o..Fo(T). then the pvalue is U =I .2 I i i 1.Fe(F"l(l.) I 1 . i i (b) Define the expected pvalue as EPV(O) = EeU. then the test that rejects H if. Expected pvalues. I 1 (b) Show that if c > 0 and a E (0. the other $105 • One second of transmission on either system COsts $103 each. = /10 versus K : J. . Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. 00 . ! . Show that EPV(O) = P(To > T). (See Problem 4. 2N1 + N. and only if. 1999. (For a recent review of expected p values see Sackrowitz and SamuelCahn. respectively. X n . A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. 2. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t). the other has v/aQ = 1. where if! denotes the standard normal distribution. One has signalMtonoise ratio v/ao = 2. I).. whereas the other a I ~ a . The first system costs $106 . let N I • N 2 • and N 3 denote the number of Xj equal to I.0) = 20(1. you intend to test the satellite 100 times. the power is (3(0) = P(U where FO 1 (u)  < a) = 1. ! J (c) Suppose that for each a E (0.
0. . the gambler asks that he first be allowed to test his hypothesis by tossing the die n times.10 Problems and Complements 273 four numbers are equally likely to occur (i.. M(n. L . find the MP test for testing 10. prove Theorem 4.Pe. and to population 0 ifT < c.4.0. derive the UMP test defined by (4. PdT < cJ is as small as possible.. Bd > cJ = I . In Examle 4. Prove Corollary 4. Hint: Use Problem 4. (c) Using the fact that if(N" . (X.. Hint: The MP test has power at least that of the test with test function J(x) 8.. 7.0. I. B" .. Y) known to be distributed either (as in population 0) according to N(O. H : (J = (Jo versus K : (J = (J.1...imately a N(np" na 2 ) distribution.2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed. Bk). then a. if . then the maximum of the two probabilities ofmisclassification porT > cJ... A newly discovered skull has cranial measurements (X.1. . find an approximation to the critical value of the MP level a test for this problem.. = a. Nk) ~ 4..6) where all parameters are known.2. Show that if randomization is pennitted. with probability ... (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel. .. (b) Find the test that is best in this sense for Example 4.0.1.N..6) or (as in population 1) according to N(I. Bo. [L(X. Y) belongs to population 1 if T > c. and only if. Find a statistic T( X. . Upon being asked to play. 1. . Bo.2.Bd > cJ then the likelihood ratio test with critical value c is best in this sense.2.7).2. +. . of and ~o '" I. Y) and a critical value c such that if we use the classification rule.2..2.<l. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po. 6. For 0 < a < I..e.2.1..17).1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. two 5's are obtained.0196 test rejects if.imum probability of error (of either type) is as small as possible. I.=  Section 4. +akNk has approx. A fonnulation of goodness of tests specifies that a test is best if the max.o = (1. [L(X. 9.2. 5. where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size.4 and recall (Proposition B.2.2.~:":2:~~=C:="::===='. In Example 4.
. hence. How many days must you observe to ensure that the UMP test of Problem 4.&(>'). show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function. Suppose that if 8 < 8 0 it is not worth keeping the counter open..• n. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. (c) Suppose 1/>'0 ~ 12. You want to ensure that if the arrival rate is < 10. a model often used (see Barlow and Proschan.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . • I i • 4. the probability of your deciding to close is also < 0. Let Xl. Show that if X I.) 3. Hint: Show that Xf . (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00. .. r p(x. .. . (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. A) = c AcxC1e>'x . . . but if the arrival rate is > 15.3.<»/2>'0 where X2n(1.01 lest to have power at least 0. .0)"'/[1. .Xn be the times in months until failure of n similar pieces of equipment.<» is the (1 .3 Testing and Confidence Regions Chapter 4 1. Consider the foregoing situation of Problem 4. 1965) is the one where Xl. In Example 4.. the probability of your deciding to stay open is < 0. the expected number of arrivals per day... .95 at the alternative value 1/>'1 = 15.01. I i .4. 274 Problems for Section 4.01.3. Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days. (a) Show that L~ K: 1/>. i' i . Find the sample size needed for a level 0. > 1/>'0. i 1 . 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! .i .1. I i . ]f the equipment is subject to wear. Use the normal approximation to the critical value and the probability of rejection. x > O.Xn is a sample from a Weibull distribution with density f(x.1 achieves this? (Use the normal approximation.(IOn x = 1•.3... • • " 5.Xn is a sample from a truncated binomial distribution with • I. Here c is a known positive constant and A > 0 is the parameter of interest.<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function.. 0) = ( : ) 0'(1. that the Xi are a sample from a Poisson distribution with parameter 8.
1. (See Problem 4.1. .d..) It follows that Fisher's method for cQmbining pvalues (see 4. .~) = 1.)... (i) Show that if we model the distribution of Y as C(max{X I . > po. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). . . ~ > O. 1 ' Y > 0.. In Problem 1. (b) NabeyaMiura Alternative. 6. 8. random variable.6. (b) Find the optimal test statistic for testing H : JL = JLo versus K ."" X n denote the incomes of n persons chosen at random from a certain population. For the purpose of modeling. against F(u)=u B. Suppose that each Xi has the Pareto density f(x. b). 7. then P(Y < y) = 1 e  . and consider the alternative with distribution function F(x. imagine a sequence X I. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . Hint: Use the results of Theorem 1.10 Problems and Complements 275 then 2.tive.8 > 80 . we derived the model G(y.2 to find the mean and variance of the optimal test statistic.B) = Fi!(x). where Xl_ a isthe (1 .. X 2. P().O) = cBBx(l+BI. . X 2 . .Fo(y)l". > 1.Fo(y)  }).Q)th quantile of the X~n distribution.Section 4. Let N be a zerotruncated Poisson. (a) Lehmann Alte~p. x> c where 8 > 1 and c > O.12.6. A > O. X N e>. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a. Show how to find critical values. Assume that Fo has ~ density fo.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ . Y n be the Ll. (a) Express mean income JL in terms of e..5.. survival times of a sample of patients receiving an experimental treatment.•• of Li.. X N }).d.6) is UMP for testing that the pvalues are uniformly distributed. A > O..6. In the goodnessoffit Example 4. (Ii) Show that if we model the distribotion of Y as C(min{X I .. 0 < B < 1. y > 0.1. survival times with distribution Fa.. Find the UMP test. which is independent of XI.O<B<1.[1. then e. .1. . suppose that Fo(x) has a nonzero density on some interval (a. 00 < a < b < 00. Let Xl.  . and let YI .
..01. x > 0. we test H : {} < 0 versus K : {} > O.. eo versus K : B = fh where I i I : 11. with distribution function F(x). . 0) e9Fo (Y) Fo(Y). every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t. Hint: A Bayes test rejects (accepts) H if J . 0.' (cf. Oo)d. (a) Show how to construct level (1 . 0. > Er • .. O)d.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi . Show that the UMP test is based on the statistic L~ I Fo(Y. 'i . Show that under the assumptions of Theorem 4. F(x) = 1 . j 1 To see whether the new treatment is beneficial.2. Using a pivot based on 1 (Xi . Show that the test is not UMP..' " 2.L and unknown variance (T2. I . = . L(x. 00 p(x. /.exp( x).(O) «) L > 1 i The lefthand side equals f6". 1 X n be i. Problem 2.3.. 10.).. Let Xl. .1)..1 and 01 loss. 1 • . Assume that F o has a density !o(Y). n announce as your level (1 . F(x) = 1 .{Od varies between 0 and 1. n. . I Hint: Consider the class of all Bayes tests of H : () = . where the €i are independent normal random variables with mean 0 and known variance .3.0) UeB for '.X)' = 16. or Weibull. 1 . x > 0. Problems for Section 4.'? = 2. Let Xl.. the denominator decreasing. .0".(O) L':x.0 e6 1 0~0.. Oo)d.. Show that under the assumptions of Theorem 4.2.exp( _x B).O)d.6t is UMP for testing H : 8 80 versus K : 8 < 80 . Show that under the assumptions of Theorem 4.. 1 X n be a sample from a normal population with unknown mean J.2 the class of all Bayes tests is complete. .3.i. The numerator is an increasing function of T(x). . J J • 9. e e at l · 6.{Oo} = 1 ... 0 = 0.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y...  1 j ..52. 12. What would you .(O)/ 16' I • 00 p(x.(O) 6 .d.' L(x. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. B> O. Let Xi (f)/2)t~ + €i.j "'" .4 1. We want to test whether F is exponential. i = 1.X)2.
n is obtained by taking at = Q2 = a/2 (assume 172 known).1. then the shortest level n (1 . what values should we use for the t! so as to make our interval as short as possible for given a? 3. Let Xl.4.1.7).. (Define the interval arbitrarily if q > q.1.) LCB and q(X) is a level (1.n .~a) /d]2 . find a fixed length level (b) ItO < t i < I..Section 4. (a) Justify the interval [ii. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji. Then take N .1)] if 8 given by (4. 6.01 [X . Stein's (1945) twostage procedure is the following. < 0. X + z(1 ..a. I")/so./N.al). Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a. Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2.10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 ..d.1 Show that. Show that if q(X) is a level (1 .4.no further observations. ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0. but we may otherwise choose the t! freely..c.q(X)] is a level (1 . then [q(X). of the form Ji. 5. Suppose that in Example 4.IN(X  = I:[' lX..(al + (2)) confidence interval for q(8). there is a shorter interval 7. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .3 we know that 8 < O. i = 1.2. Suppose we want to select a sample size N such that the interval (4. with N being the smallest integer greater than no and greater than or equal to [Sot".~a)/ . N(Ji" 17 2) and al + a2 < a.X are ij. < a.. Show that if Xl. .4. .3).. minI ii. with X ..IN.a) interval of the fonn [ X .] .4.(2) UCB for q(8). What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4.1) based on n = N observations has length at most l for some preassigned length l = 2d. X+SOt no l (1. where 8... has a 1. n. . It follows that (1 .1.IN] .z(1 .1] if 8 > 0.) Hint: Use (A.4. 0. . [0. Compare yonr result to the n needed for (4. although N is random. . .02.l. t.SOt no l (1.a..a) confidence interval for B. X!)/~i 1tl ofB.~a)!.3).(2) " . Use calculus. distribution.Xn be as in Problem 4. . 0.(X) = X .
sn. Hence.95 confidence interval for (J. 0) and X = 81n. it is necessary to take at least Z2 (1 (12/ d 2 observations.. Let 8 ~ B(n.3. By Theorem B.~a)].O).0') for Jt of length at most 2d. (a) If all parameters are unknown.jn is an approximate level (b) If n = 100 and X = 0.1') has a N(o. (b) Exhibit a level (1 .fN(X .jn(X .051 (c) Suppose that n (12 = 5. X n1 and Yll . The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. use the result in part (a) to compute an approximate level 0. 1"2. .O)]l < z (1. .) (lr!d observations are necessary to achieve the aim of part (a). (a) Use (A.4. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . ±.3. and the fundamental monograph of Wald.6. (a) Show that X interval for 8. but we are sure that (12 < (It. . Indicate what tables you wdtdd need to calculate the interval. if (J' is large.') populations. > z2 (1  is not known exactly.l4. and... .jn is an approximate level (1  • 12. (12 2 9.001. .0)/[0(1. Show that ~().~ a) upper and lower bounds. I .!a)/ 4. (c) If (12. " 1 a) confidence .: " are known. in order to have a level (1 . . .a) interval defined by (4.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi. (12) and N(1/. d = i II: .. Y n2 be two independent samples from N(IJ.. 1947. (J'2 jk) distribution. X has a N(J1. Hint: Set up an inequality for the length and solve for n. 0. (a) Show that in Problem 4. Suppose that it is known that 0 < ~. Show that these two quadruples are each sufficient. we may very likely be forced to take a prohibitively large number of observations. = 0. T 2 I I' " I'. respectively.  10.... v. I 1 j Hint: [O(X) < OJ = [. 1986. Show that the endpoints of the approximate level (1 .4.3) are indeed approximate level (1 . (The sticky point of this approach is that we have no control over N./3z (1.. Let 8 ~ B(n.a) confidence interval for (1/1'). 11.a:) confidence interval of length at most 2d wheri 0'2 is known.. i (b) What would be the minimum sample size in part (a) if (). find ML estimates of 11.1. 0") distribution and is independent of sn" 8. is _ independent of X no ' Because N depends only on sno' given N = k.18) to show that sin 1( v'X)±z (1 . Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment.) I ~i . (1  ~a) / 2. 4(). exhibit a fixed length level (1 .a) confidence interval for sin 1 (v'e). (12. Let XI.
9 confidence interval for cr.4.I).30.ia). Assuming that the sample is from a /V({L.4 and the fact that X~ = r (k. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed.3. .2. (c) Suppose Xi has a X~ distribution.5 is (1 _ ~a) 2.' Now use the central limit theorem. is replaced by its MOM estimate.J1. = 3.~Q).10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.4.4. Compute the (K.3.I) + V2( n1) t z( "1) and x(1 .1) + V2( n . 1) random variables. Xl = Xn_l na).n(X .1. find (a) A level 0. Hint: Let t = tn~l (1 .3. (n _1)8'/0' ~ X~l = r(~(nl). Show that the confidence coefficient of the rectangle of Example 4.4.3.8).1 and. Hint: By B.(n .2 is known as the kurtosis coefficient. 4) are independent.) Hint: (n . If S "' 8(64.1) t z{1 . See A. Hint: Use Problem 8.2. (c) A level 0.9 confidence interval for {L.J1. 0). and 10 000. and X2 = X n _ I (1 . .Q confidence interval for cr 2 . XI 16. See Problem 5.' = 2:. and (b) (4.4 ) .) can be approximated by x(" 1) "" (n .9 confidence region for (It. give a 95% confidence interval for the true proportion of cures 8 using (al (4.)/01' ~ (J1.n} and use this distribution to find an approximate 1 .J1. K.IUI. (d) A level 0.2. Now use the law of large numbers. In Example 4."') .' 1 (Xi .4/0. (b) A level 0.4 = E(Xi . but assume that J1.J1. (In practice.t ([en . Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.)'. 100.1)8'/0') . :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution. V(o') can be written as a sum of squares of n 1 independentN(O. 14. . cr 2 ) population.9 confidence interval for {L x= + cr. known) confidence intervals of part (b) when k = I.1 is known.027 13.). it equals O. Z.".3).Section 4. and the central limit theorem as given in Appendix A.D andn(X{L)2/ cr 2 '" = r (4.<. ~). Slutsky's theorem.)4 < 00 and that'" = Var[(X. 15. Find the limit of the distribution ofn. K.4. (a) Show that x( "d and x{1. Now the result follows from Theorem 8.)' . In the case where Xi is normal.". Compare them to the approximate interval given in part (a). 10. then BytheproofofTheoremB.7).3.
(c) For testing H o : F = Fo with F o continuous. we want a level (1 . (b)ForO<o<b< I.11 .I I 280 Testing and Confidence Regions Chapter 4 17.a) confidence interval for F(x).1.u) confidence interval for 1".3 for deriving a confidence interval for B = F(x).9) and the upper boundary I" = 00. Consider Example 4. U(O.1 (b) V1'(x)[l. Suppose Xl. (a) Show that I" • j j j = fa F(x)dx + f. . define An(F) = sup {vnl1'(X) .F(x)1 )F(x)[1 . I (b) Using Example 4.1'(x)] Show that for F continuous. 1'1(0) < X < 1'.~u) by the value U a determined by Pu(An(U) < u) = 1. b) for some 00 < a < 0 < b < 00. Show that for F continuous where U denotes the uniform.l).6.I Bn(F) = sup vnl1'(x) . I' .1. i 1 .d.4. define .F(x)]dx. distribution function.4. " • j . • i. find a level (I . verify the lower bounary I" given by (4. 18. (a) For 0 < a < b < I.F(x)1 is the approximate pivot given in Example 4. < xl has a binomial distribotion and ~ vnl1'(x) ..F(x)1 Typical choices of a and bare .F(x)l. It follows that the binomial confidence intervals for B in Example 4.3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4. . That is.r fixed. In this case nF(l') = #[X. . 19. ..4.. In Example 4. pl(O) < X <Fl(b)}. )F(x)[l.95. X n are i. as X and that X has density f(t) = F' (t).i.05 and . !i .F(x)1 .7. Assume that f(t) > 0 ifft E (a.u.4.6 with .4.
1 has size a for H : 0 (}"2 be < 00 .a) LCB corresponding to Oc of parr (a).) denotes the o.3. test given in part (a) has power 0.. M n /0.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. respectively. Yn2 be independent exponential E( 8) and £(.5. ..1"2nl. (15 = 1. then the test that accepts. and let Il. = 1 rejected at level a 0. Yf (1 . (d) Show that [Mn .a) UCB for 0.2n2 distribution.4 and 8. How small must an alternative before the size 0. 1/ n J is the shortest such confidence interval. . Give a 90% confidence interval for the ratio ~ of mean life times. (b) Derive the level (1. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il. [8(X) < 0] :J [8(X) < 00 ].1.0.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . and consider the prob2 + 8~ > 0 when (}"2 is known.12 and B.al) and (1 .a. What value of c givessizea? (b) Using Problems B. a (b) Show that the test with acceptance region [f(~a) for testing H : Il. respectively.th quantile of the . .2 = VCBs of Example 4...05. Or . N( O . show that [Y f( ~a) I X. = 0. of l. is of level a for testing H : 0 > 00 .10 Problems and Complements 281 Problems for Section 4.~a) I Xl is a confidence interval for Il. Let Xl.\..5.3. ~ 01).Xn1 and YI . (a) Find c such that Oc of Problem 4.5 1. .. Hint: If 0> 00 . (e) Similarly derive the level (1 .90? 4. . (e) Suppose that n = 16.1 O? 2.3.2).2). O ) is an increasing 2 function of + B~. (a) Deduce from Problem 4.4. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 . 3.. X 2 be independent N( 01. < XI? < f(1. Experience has shown that the exponential assumption is warranted.. Hint: Use the results of Problems B. for H : (12 > (15.2 that the tests of H : . .(2) together.) samples. X2) = 1 if and only if Xl + Xi > c.· (a) If 1(0.3. 5.. with confidence coefficient 1 . if and only if 8(X) > 00 . Show that if 8(X) is a level (1 .2 are level 0.13 show that the power (3(0 1 . Let Xl. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.Section 4. = 1 versns K : Il.
ry) = O}. O?.8) where () and fare unknown. . (b) The sign test of H versus K is given by. X. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~.a)y'n. . but I(t) = I( t) for all t.00 .a.1 when (72 is unknown. Let C(X) = (ry : o(X. 1 Ct ~r (b) Find the family aftests corresponding to the level (1. >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?. .282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). (e) Show directly that P. Let X l1 . . Show that the conclusions of (aHf) still hold. X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . Hint: (c) X. We are given for each possible value 1]0 of1] a level 0: test o(X.[XU) (f) Suppose that a = 2(n') < OJ and P. l 7.. ). N( 020g. 0. Show that PIX(k) < 0 < X(nk+l)] ~ . (c) Show that Ok(o) (X. Suppose () = ('T}. 6. (a) Shnw that C(X) is a level (l .0:) LeB for 8 whatever be f satisfying our conditions. of Example 4. 0. k . T) where 1] is a parameter of interest and T is a nuisance parameter. og are independentN (0.. 'TJo) of the composite hypothesis H : ry = ryo. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing .2). . we have a location parameter family. I . respectively.4. L:J ~ ( . O.a) confidence region for the parameter ry and con versely that any level (1 .0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses.~n + ~z(l . (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F.2).Xn be a sample from a population with density f (t . .IX(j) < 0 < X(k)] do not depend on 1 nr I.  j 1 j . and 1 is continuous and positive. Thus.0:) confidence interval for 11. . lif (tllXi > OJ) ootherwise.
Let a(S.5. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.z(1 .a). Show that the Student t interval (4. let T denote a nuisance parameter. (b) Describe the confidence region obtained by inverting the family (J(X. laifO>O <I>(z(1 .pYI < (1 1 otherwise.a) confidence interval [ry(x). 8( S)] be the exact level (1 . Establish (iii) and (iv) of Example 4.5.a) = (b) Show that 0 if X < z(1 . 1).2.O ~ J(X.a))2 if X > z(1 . 10. Note that the region is not necessarily an interval or ray.2 a) (a) Show that J(X.1. 1) and q(O) the ray (X . ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry. Let P (p. a( S.1) is unbiased.3 and let [O(S). . the quantity ~ = 8(8) . Y.[q(X) < 0 2 ] = 1.z(l. 0) ranges from a to a value no smaller than 1 . (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X .2.p) = = 0 ifiX . Let X ~ N(O.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'. Yare independent and X ~ N(v. Hint: You may use the result of Problem 4. That is. Define = vl1J.7.5. p)} as in Problem 4.a.7.10 Problems and Complements 283 8. Po) is a size a test of H : p = Po. or). and let () = (ry.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4.5. 7)).a) . Then the level (1 . ' + P2 l' z(1 1 . if 00 < O(S) (S is inconsistent with H : 0 = 00 ). p. that suP.Y. Let 1] denote a parameter of interest. alll/. 11. Show that as 0 ranges from O(S) to 8(S). 9.a).20) if 0 < 0 and.Section 4. Thus. hence.5. 00) is = 0'. Suppose X. Y ~ N(r/. 1). 12.2a) confidence interval for 0 of Example 4. Y.
.5.. lOOx. . .a = P(k < S < n I + 1) = pi(1.6 and 4.b) (a) Show that this statement is equivalent to = 1.p) versus K' : P(X > 0) > (1 . Let x p = ~ IFI (p) + Fi! I (P)]. k(a) .p)"j. i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 .) Suppose that p is specified. Let F.95 could be the 95th percentile of the salaries in a certain profession l or lOOx . 1 . Simultaneous Confidence Regions for Quantiles. (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I . I • (c) Let x· be a specified number with 0 < F(x') < 1. We can proceed as follows. 14.1» .. Then   P(P(x) < F(x) < P+(x)) for all x E (a. That is. Construct the interval using F.. .a.p) variable and choose k and I such that 1 .p). Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. P(xp < x p < xp for all p E (0. .1 and Fu 1. (g) Let F(x) denote the empirical distribution. Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. ! .4. il.a) confidence interval for x p whatever be F satisfying our conditions. it is distribution free.a.F(x p)]. Thus. vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p. Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  . " II [I . 0 < p < 1. (X(k)l X(n_I») is a level (1 . X n x*) is a level a test for testing H : x p < x* versus K : x p > x*.h(a). = ynIP(xp) . Show that Jk(X I x'. be the pth quantile of F. (e) Let S denote a B(n..7.05 could be the fifth percentile of the duration time for a certain disease. X n be a sample from a population with continuous distribution F. F(x). . (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b).. k+ ' i .284 Testing and Confidence Regions Chapter 4 13. L. In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed.a) LeB for x p whatever be f satisfying our conditions. (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}.4.. Confidence Regions for QlIalltiles. and F+(x) be as in Examples 4. That is. iI . Let Xl. Show that P(X(k) < x p < X(nl)) = 1. (See Section 3. where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl.
.) < p} and.. Suppose X denotes the difference between responses after a subject has been given treatments A and B. Hint: Let t. Give a distributionfree level (1 . the desired confidence region is the band consisting of the collection of intervals {[. .17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p ..1. that is.X n ..X I.u are the empirical distributions of U and 1 .Section 4. .. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. I).x.xp]: 0 < l' < I}. Express the band in terms of critical values for An(F) and the order statistics.r" ~ inf{x: a < x < b.r < b.d.. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R. . F(x) > pl. LetFx and F x be the empirical distributions based on the i. then L i==l n IIXi < X + t.p. where F u and F t .Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}. (c) Show how the statistic An(F) of Problem 4. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ). F f (. Suppose that X has the continuous distribution F.z. ~ ~ (a) Consider the test statistic x . X I. TbeaitemativeisthatF_x(t) IF(t) forsomet E R. .r p in terms of the critical value of the Kolmogorov statistic and the order statistics. 15. Note the similarity to the interval in Problem 4. X n and . .(x)] < Fx(x)] = nF1_u(Fx(x)).j < F_x(x)] See also Example 4.T: a <.(x) = F ~(Fx(x» . where p ~ F(x). L i==I n I IFx (Xi) .U with ~ U(O.1. where A is a placebo.. We will write Fx for F when we need to distinguish it from the distribution F_ x of X. (b) Express x p and .i.13(g) preceding. VF(p) = ![x p + xlpl.. That is.10 Problems and Complements 285 where x" ~ SUp{.4. F~x) has the same distribution Show that if F x is continuous and H holds.5. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X...
where F u and F v are independent U(O. where do:. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y .3.) < do.. .286 Testing and Confidence Regions ~ Chapter 4 '1. = 1 i 16.t::. is the nth quantile of the distribution of D(Fu l F 1 . we get a distributionfree level (1 .1) empirical distributions. and by solving D( Fx. i I f F..U ).Fy ): 0 < p < 1}.'" 1 X n be Li..) < Fx(t)] = nFu(Fx(t)). (b) Consider the parameter tJ p ( F x. vt(p)) is the band in part (b).d.) < Fx(x)) = nFv(Fx(x)). I I I 1 1 D(Fx. As in Example 1. let Xl. Let 8(.(x) ~ R. It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H.i. F y ).x p .) < Fy(x p )] = nFv (Fx (t)) under H.F x l (l . Let t (c) A Distdbution and ParameterFree Confidence Interval. .) is a location parameter} of all location parameter values at F.) = D(Fu ..) ~ < F x (")] ~ ~ = nFu(Fx(x)). Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ . .F1 _u). Fx(t)l· .1. "" = Yp .Fy ) = maxIFy(t) 'ER "" ..". Hint: nFx(t) = 2. F y ) has the same distribntion as D(Fu .x (': + A(x)).a) simultaneous confidence band for A(x) = F~l:(Fx(x)) ..F~x. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R. Also note that x = H.17. ~ ~ ~ F~x. 1 Yn be i. Hint: Define H by H.5.x. where x p and YP are the pth quan tiles of F x and Fy. To test the hypothesis H that the two treatments are equally effective. Moreover.. then D(Fx . then D(Fx. The result now follows from the properties of B(·). F y ) .:~ I1IFx(X. i .. < F y I (Fx(x))] i=I n = L: l[Fy (Y.a) simultaneous confidence band 15.Jt: 0 < p < 1) for the curve (5 p (Fx.:~ 1 1 [Fy(Y. It follows that if c. then  nFy(x + A(X)) = L: I[Y. (p)..(F(x)) . nFy(t) = 2. treatment B responses. . Give a distributionfree level (1. he a location parameter as defined in Problem 3.l (p) ~ ~[Fxl(p) .p)] = ~[Fxl(p) + F~l:(p)].:~ II[Fx(X.. <aJ Show that if H holds. Properties of this and other bands are given by Doksum.2A(x) ~ HI(F(x)) l 1 + vF(F(x)). vt = O<p<l sup vt(p) O<p<l where [V. Then H is symmetric about zero.a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(.. . where :F is the class of distribution functions with finite support..d. the probability is (1 . Fenstad and Aaberge (1977). for ~. Fx.. i=I n 1 i t .. Show that for given F E F.) : :F VF ~ inf vF(p). nFx(x) we set ~ 2..x = 2VF(F(x)). treatment A (placebo) responses and let Y1 . Hint: Let A(x) = FyI (Fx(x)) .
Fy.. Fy ) .6 1. Let6. Show that if 0(·.) . moreover X + 6 < Y' < X + 6. the probability is (1 .E(X). . Q. F x ) ~ a and YI > Y.) 12:7 I if. 6..6.) = F (p) . Let do denote a size a critical value for D(Fu .4. T n n = (22:7 I X. 3. F y ) < O(Fx .) < do: for Ll. F y ) > O(Fx " Fy).6].2.. t Hint: Set Y' = X + t..(x)) . where :F is the class of distri butions with finite support. Fx +a) = O(Fx a.2uynz(1 i=! i=l all L i=l n ti is also a level (1. . ~ D(Fu •F v ).(X). (d) Show that E(Y) . Show that n p is known and 0 O' ~ (2 2~A X.. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('. is called a shift parameter if O(Fx. Show that for given (Fx . It follows that if we set F\~•.. F y ). Show that B = (2 L X.. tt Hint: Both 0 and 8* are nonnally distributed. x' 7 R. we find a distributionfree level (1 .)I L ti . Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).T. and 8 are shift parameters. Now apply the axioms. O(Fx. Suppose X I. . (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T. then Y' = Y. 6+ maxO<p<1 6+(p). (a) Consider the model of Problem 4. ~) distribution. then by solving D(F x.0:) simultaneouS coofidence band for t. ) is a shift parameter} of the values of all shift parameters at (Fx .0: UCB for J. Fy ). 2.(P). . t.·) is a shift parameter.) ~ + t.= minO<p<l 6.. F y ) E F x F. if I' = 1I>'. 0 < p < 1.0:) confidence bound forO. Let 6 = minO<1'<16p(Fx.Fy). F y. B(.F y ' (p) = 6. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval.('. F y ) is in [6. = 22:7 I Xf/X2n( a) is .10 Problems and Complements 287 Moreover.(Fx . then B( Fx . nFx ("') = nFu(Fx(x)).2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8.6+] contains the shift parameter set {O(Fx. (b) Consider the unbiased estimate of O.). thea It a unifonnly most accurate level 1 .L.4. F y ) and b _maxo<p<l 6p(Fx .a) UCB for O.) I i=l L tt . Exhibit the UMA level (1 . where is nnknown.Xn is a sample from a r (p. Show that for the model of Problem 4.a) that the interval [8.Section 4..(x) = Fy(x thea D(Fx. F y.).    Problems for Section 4.3.(. Fv ). :F x :F O(Fx.
B).O] are two level (1 .B(n. establish that the UMP test has acceptance region (4. .3).B')+.\ is distributed as V /050. /3( T. 3. (a) Show that if 8 has a beta.2.f:s F.B)+.3.d. Suppose that B' is a uniformly most accurate level (I ..6.\.4. Problems for Section 4. ~.6 for c fixed. Hint: Use Examples 4. I'. .. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t. U = (8 . Hint: Apply Problem 4. ·'"1 1 . 5..' with "> ~.. • • • = t) is distributed as W( s.0 upper and lower confidence bounds for J1 in the model of Problem 4. Show that if F(x) < G(x) for all x and E(U).(O . then E(U) > E(V)."" X n are i. and for ().a. satisfying the conditions of Problem 8.B') < E.i.\ =. density = B. (1: dU) pes..6.6. 0). 1.7 1 . g. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for. t > e.3.6. t) is the joint density of (8. '" J. and that B has the = se' (t. j . OJ. where s = 80+ 2n and W ~ X.. Prove Corollary 4.2.2" Hint: See Sections 8. 2. s > 0. .B). . (B" 0') have joint densities.2 and B. Suppose that given B Pareto. [B.1 are well defined and strictly increasing. (b) Suppose that given B = B. p. • . Poisson. i . (0 . In Example 4. s).W < BI = 7.) and that.E(V) are finite.d. e> 0. Let U.Xn are Ll.aj LCB such that I . n = 1. Establish the following result due to Pratt (1961). s) distribution with T and s integers. Suppose that given. f3( T. respectively. P(>. distribution with r and s positive integers. . .a) confidence intervals such that Po[8' < B' < 0'] < p. s). Pa( c. . . 1I"(t) Xl.12 so that F. Construct unifoffilly most accurate level 1 .l2(b).Bj has the F distribution F 2r . J • 6. then )" = sB(r(1 . uniform. 8.l. > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I .1.[B < u < O]du. 0']. then E. V be random variables with d. G corresponding to densities j. where Show that if (B.a) upper and lower credible bounds for A.1.6 to V = (B ..( 0' Hint: E.3 and 4.288 Testing and Confidence Regions Chapter 4 4.B) = J== J<>'= p( s. .2. U(O..X.4.B). . X has a binomial.W < B' < OJ for allB' l' B. Suppose [B'. Xl. t)dsdt = J<>'= P. E(U) = .3. F1(tjdt.C. distribution and that B has beta. Hint: By Problem B.
0:) prediction interval for Xnt 1.x) is aN(x.Xn is observable and X n + 1 is to be predicted. .x)' + E(Yj y)'. .Xn ).y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2.0:) confidence interval for B.1 )) distribution. Show thatthe posterior distribution 7f( t 1 x. 'T). . y) is proportional to p(8)p(x 1/11. Hint: p( 0 I x. (a) Let So = E(x.1.. Xl. (b) Find level (1 .Xn + 1 be i. Suppose 8 has the improper priof?r(O) = l/r.1. 'P) is the N(x .a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B. (d) Compare the level (1. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. y) is aN(y.10 Problems and Complements 289 (a)LetM = Sf max{Xj. where (75 is known..d.y.i. 1f(/11 1r. r( TnI (c) Set s2 + n. and Y1.. 4. r In) density.m} and + n. Problems for Section 4. . (b) Show that given r. .s') withe' = max{c. r) where 7f(Ll I x .y.r 1x. Let Xl.. r 1x. Show fonnaUy that the posteriof?r(O 1 x. (c) Give a level (1 .0"5).8 1.y) is 1f(r 180)1f(/11 1 r. .112. In particular consider the credible bounds as n + 00. Hint: 7f(Ll 1x. y) of is (Student) t with m + n .Section 4.y. .. y) is obtained by integrating out Tin 7f(Ll. r > O.. respectively. as X...y) = 7f(r I sO)7f(LlI x ...l. 1 r.Xm. (a) Give a level (1 .2. Here Xl. . y) and that the joint density of. Show that (8 = S IM ~ Tn) ~ Pa(c'.r). Suppose that given 8 = (ttl' J.rlm) density and 7f(/12 1r. /11 and /12 are independent in the posterior distribution p(O X.. .a) upper and lower credible bounds for O.. = PI ~ /12 and'T is 7f(Ll.. T) = (111. = So I (m + n  2).2 degrees of freedom. .r)p(y 1/12. N(J. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll... y).x)1f(/1. ".
Q) distribution free lower and upper prediction bounds for X n + 1 • < 00. N(Jl. B( n.d. CJ2) with (J2 unknown. as X . .10. give level (I . Establish (4. distribution. has a B(m. • .. . . give level 3.5.. . . That is..a (P(Y < Y) > I .i. suppose Xl. distribution.9.3) by doing a frequentist computation of the probability of coverage. B(n. where Xl.i.s+n.2. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 . < u(n+l) denote Ul . . b). (3(r. . n = 100.15..Xn are observable and we want to predict Xn+l.i. " u(n+l).2n distribution.8. Suppose XI..8). X is a binomial.a) lower and upper prediction bounds for X n + 1 . 0 > 0. . and that (J has a beta. s). In Example 4.. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . 0'5)' Take 0'5 ~ 7 2 = I. Suppose that Y. Let X have a binomial.. .• .s+nx+my)/B(r+x. Present the results in a table and a graph..' . and a = . q(ylx)= ( . 1 X n are i..11. (a) If F is N(Jl.10.. (This q(y distribution. i! Problems for Section 4...nl.. Un + 1 ordered. Let Xl.a) prediction interval for X n + l .a).290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. ! " • Suppose Xl. X n such that P(Y < Y) > 1.t for M = 5.8.. X n+ l are i. :[ = .8. let U(l) < .x ) where B(·.9 1. as X where X has the exponential distribution F(x I 0) =I . ~o = 10.05.e. 0) distribution given (J = 8...'.8.d.'" .12. x > 0. .Xn + 1 be i. Then the level of the frequentist interval is 95%. . F.) Hint: First show that .Xn are observable and X n + 1 is to be predicted.d. I x) is sometimes called the Polya q(y I x) = J p(y I 0). .x . )B(r+x+y.(0 I x)dO. which is not observable. give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. (b) If F is N(p" bounds for X n + 1 . random variable.9. 5.. Find the probability that the Bayesian interval covers the true mean j.·) denotes the beta function.5. 00 < a < b (1 .2) hy using the observation that Un + l is equally likely to be any of the values U(I). Suppose that given (J = 0..0'5) with 0'5 known. 4. Give a level (1 .. 0). A level (1 .0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl. 1 2.
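For the exponential case in these prediction problems, the hint that n X_{n+1} / (X_1 + ... + X_n) has an F(2, 2n) distribution gives an explicit upper prediction bound. The sketch below is one way to use that hint; the mean θ, the sample size n, and the number of Monte Carlo replications are illustrative assumptions, not part of the problem.

```python
import numpy as np
from scipy.stats import f

def exp_upper_prediction_bound(x, alpha=0.05):
    """Level (1 - alpha) upper prediction bound for a future observation X_{n+1},
    assuming x_1, ..., x_n, X_{n+1} are i.i.d. exponential.  The hint gives
    n * X_{n+1} / sum(x) ~ F(2, 2n), so the bound is F^{-1}(1 - alpha) * mean(x)."""
    x = np.asarray(x)
    n = len(x)
    return f.ppf(1 - alpha, 2, 2 * n) * x.mean()

# quick coverage check by simulation (theta, n, B are illustrative choices)
rng = np.random.default_rng(0)
theta, n, B = 2.0, 20, 10_000
cover = 0
for _ in range(B):
    sample = rng.exponential(theta, size=n + 1)
    cover += sample[n] <= exp_upper_prediction_bound(sample[:n])
print("empirical coverage:", cover / B)   # should be close to 0.95
```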
2 2 L.Q.C2n !' 1 as n !' 00. TwoSided Tests for Scale.(x) Hint: Show that for x < !n. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B.n) and >. let X 1l .. (c) Deduce that the critical values of the commonly used equaltailed test. 0'2 < O'~ versus K : 0'2 > O'~..V2nz(1 .. (ii) CI ~ C2 = n logcl/C2.(x) = 0./(n .~Q) also approximately satisfy (i) and (ii) of part (a). XnI (1 . where F is the d..3. 0' 4. if Tn < and = (n/2) log(1 + T.~Q) approximately satisfy (i) and also (ii) in the sense that the ratio . where Tn is the t statistic. Show that (a) Likelihood ratio tests are of the form: Reject if.(x) = . and only if. ~2 .f. ° 3. In testing H : It < 110 versus K . 0'2) sample with both JL and 0'2 unknown. = aD nO' 1". 2. In Problems 24. onesample t test is the likelihood ratio test (fo~ Q <S ~).10 Problems and Complements 291 is an increasing function of (2x .(Xi .) .(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy..F(CI) ~ 1. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. Hint: log . n Cj I L. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. Thus. (b) Use the nonnal approximatioh to check that CIn C'n n . We want to test H . if ii' /175 < 1 and = .(nx).Xn be a N(/L. and only if.. (i) F(c. >. OneSided Tests for Scale.I)) for Tn > 0.Q).3.. JL > Po show that the onesided.='7"nlog c l n /C2n CIn .log (ii' / (5)1 otherwise. log . (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12. · .( x) ~ 0.Section 4. . Xn_1 (~a)..X) aD· t=l n > C. of the X~I distribution. = Xnl (1 .~Q) n + V2nz(l.. I .
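The critical values c_1, c_2 of the two-sided likelihood ratio test for scale can be found numerically from conditions (i) and (ii). The following sketch assumes F is the chi-square c.d.f. with n - 1 degrees of freedom and that (ii) reads c_2 - c_1 = n log(c_2 / c_1); the equal-tailed quantiles are printed for comparison, as in part (c).

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def lr_scale_critical_values(n, alpha=0.05):
    """Solve (i) F(c2) - F(c1) = 1 - alpha with F the chi^2_{n-1} c.d.f. and
    (ii) c2 - c1 = n log(c2/c1) for the two-sided LR test of sigma = sigma_0."""
    g = lambda c: c - n * np.log(c)                       # (ii) says g(c1) = g(c2)
    def c2_of_c1(c1):
        return brentq(lambda c: g(c) - g(c1), n, 100 * n)  # root above the minimum of g at c = n
    def size_gap(c1):
        return chi2.cdf(c2_of_c1(c1), n - 1) - chi2.cdf(c1, n - 1) - (1 - alpha)
    c1 = brentq(size_gap, 1e-6, n - 1e-6)
    return c1, c2_of_c1(c1)

n, alpha = 20, 0.05
print("LR critical values:  ", lr_scale_critical_values(n, alpha))
print("equal-tailed values: ", chi2.ppf(alpha / 2, n - 1), chi2.ppf(1 - alpha / 2, n - 1))
```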
. i = 1. respectively. can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0. . whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. (X.4.lIl (7 2) and N(Jl.. and that T is sufficient for O.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ .. 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances. . 7..01 test to have power 0. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0. x y vi I I . (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power. = 0 and €l. 100. 0 E e. The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. a 2 ) random variables. (a) Find a level 0..L.Xn1 and YI . . • Xi = (JXi where X o I 1 + €i. (b) Consider the problem aftesting H : Jl.1'2.3. Assume the onesample normal model.90 confidence interval for the mean bloC<! pressure J.1 < J.z . I Yn2 be two independent N(J. . 8. . . . 0'2). 110. 190. 9. find the sample size n needed for the level 0. The nonnally distributed random variables Xl. . Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T..L2 versus K : j.95 confidence interval for p..9.O).2' (12) samples. (7 2) is Section 4.tl > {t2.9. .95 confidence interval for equaltailed tests of Problem 4. eo.. 114. (a) Using the size 0: = 0.. . (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0.P. where a' is as defined in < ~.. I'· (c) Find a level 0. Show that A(X.05.90 confidence interval for a by using the pivot 8 2 ja z . 6. n. The control measurements (x's) correspond to 0 pounds of mulch per acre. Suppose X has density p(x. .95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~. (a) Show that the MLE of 0 ~ (1'1.. e 1 ) depends on X only throngh T.. Forage production is also measured in pounds per acre. €n are independent N(O.l. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. Y.292 Testing and Confidence Regions Chapter 4 5. Let Xl.05 onesample t test.
.n. i = 1.. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. . X n1 .10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl.. I).IITI > tl is an increasing function of 161. The power functions of one. (b) P. Show that. Fix 0 < a < and <>/[2(1 . 12. (Yl) = f PY"Y. . Let e consist of the point I and the interval [0. II. Condition on V and apply the double expectation theorem. From the joint distribution of Z and V. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. and only if.. has level a and is strictly more 11. . Hint: Let Z and V be independent and have N(o. has density !k . with all parameters assumed unknown.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if. Stein). Y2 )dY2. X powerful whatever be B. get the joint distribution of Yl = ZI vVlk and Y2 = V. for each v > 0. (a) P.'" p(X. Xo = o. 1 {= xj(kl)ellx+(tVx/k'I'ldx.IIZI > tjvlk] is increasing in [61.<»1 < e < <>.Section 4. 0) by the following table.[T > tl is an increasing function of 6.(t) = . (Yl. respectively. Then. Suppose that T has a noncentral t. (Tn. Then use py. P. distribution. samples from N(J1l.. 13. Yn2 be two independent N(P.2) In exp{ (I/Z(72) 2)Xi .2. Yl.6. . . Consider the following model. The F Test for Equality of Scale... XZ distributions respectively. = 0.. P. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl.6..O) for 00 = (Z.<» (1 . Let Xl.OXi_d 2} i= 1 < Xi < 00. (T~).[Z > tVV/k1 is increasing in 6. 7k. Define the frequency functions p(x. . (An example due to C. . / L:: i 10. 7k. Show that the noncentral t distribution..and twosided t tests.
. p) distribution. that T is an increasing function of R. . . . 14. i=2 i=2 n n S12 =L i=2 n U. (11.a/2) or F < f( n/2). (1~. Vn ) is a sample from a N(O.J J . Finally. • I " and (U" V. p) distribution.8.4. 15. can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I .). 0.. . Un).0 has the same distribution as R.0 has the same dis r j .62 284 2. show that given Uz = Uz . where l _ . .24) implies that this is also the unconditional distribution. where f( t) is the tth quantile of the F nz . i 0 in xi !:.nt 1 distribution.64 298 2.' ! .5) that 2 log '(X) V has a distribution.9.96 279 2. . . vn 1 l i 16. Yn) be a sampl~ from a bivariateN(O. L Yj'. (a) Show that the LR test of H : af = a~ versus K : ai > and only if. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2.nl1 ar is of the fonn: Reject if.9. and using the arguments of Problems B.~ I . (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . (Xn . F > /(1 .~ I i " 294 Testing and Confidence Regions Chapter 4 . 0. r> (c) Justify the twosided F test: Reject H if. where a. Consider the problem of testing H : p = 0 versus K : p i O.1. using (4. . then P[p > c] is an increasing function of p for fixed c.7 to conclude that tribution as 8 12 /81 8 2 .4. x y 254 2. 1.. 1. and only if. si = L v.1)]E(Y. . and use Probl~1ll4. p) distribution. .61 310 2. • . . (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4.7 and B. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. .71 240 2. 1 . Let R = S12/SIS" T = 2R/V1 R'.'.4) and (4. .94 Using the likelihood ratio test for the bivariate nonnal model.4. I ' LX.X)' (b) Show that (aUa~)F has an F nz obtained from the :F table.Y)'/E(X. · .68 250 2. ! . T has a noncentral Tnz distribution with noncentrality parameter p. Argue as in Proplem 4.1I(a).4. . .'.V. . the continuous versiop of (B. > C.. V.9. note that .4. Hint: Use the transfonnations and Problem B.. Sbow. Un = Un.1' Sf = L U.19 315 2. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. F ~ [(nl . .. !I " ... YI ). a?. (Un. <..10. .1)/(n. Because this conditional distribution does not depend on (U21 •. .8.12 337 1.37 384 2. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model. Let (XI. distribution and that critical values can be .1 .
and C. i].05 and 0. Tl . 4. Essentially. 00.12 1965. 00. . P. the class of Bayes procedures is complete. O'CONNELL.4 (I) If the continuity correction discussed in Section A. if the parameter space is compact and loss functions are bounded.3). HAMMEL. Rejection is more definitive. (a) Find the level" LR test for testing H : 8 E given in Problem 3. Consider the bioequivalence example in Problem 3. Data Analysis and Robustness. F.I\ E.11 Notes 295 17.10. E. P.2.3 (1) Such a class is sometimes called essentially complete. REFERENCES BARLOW. Consider the cases Ii. and r.9. (2) The theory of complete and essentially complete families is developed in Wald (1950).3.035. Leonard. T6 . Box. BICKEL.[ii . Wu.. it also has confidence level (1 . "Is there a sex bias in graduate admissions?" Science. AND J. Apology for Ecumenism in Statistics and Scientific lnjerence. Notes for Section 4.851.895 and t = 0. R. Editors New York: Academic Press.. Mathematical Theory of Reliability New York: J.3) holds for some 8 if <p 't V.01 + 0. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.o = 0. 4. S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 . . Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding..2.. Box. E. Nntes for Section 4.819 for" = 0. 187.. i] versus K : 8 't [i.01. !. G. see also Ferguson (1967). (2) In using 8(5) as a confidence hound we are using the region [8(5). (3) A good approximation (Durbin.9. Wiley & Sons.. 1983.). respectively. G.0.1.15 is used here. 1973. AND F.398404 (975)..11.~. PROSCHAN. The term complete is then reserved for the class where strict inequality in (4. W.Section 4.[ii) where t = 1. Stephens. Because the region contains C(X).1 (1) The point of view usually taken in science is that of Karl Popper [1968].(t) = tl(.0. T.11 NOTES Notes for Section 4. 1974) to the critical value is <. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0..
" J. A. Sratisr. Y.674682 (1959). GLAZEBROOK.243258 (1945). OSTERHOFF. "Further notes on Mrs. . L. SIEVERS.. 54 (2000)." Regional Conference Series in Applied Math. D. K. H.. j j New York: 1. AND A. 1968. A. K. Wiley & Sons. 326331 (1999). 1%2. F. New York: Springer. 659680 (1967). 1952.. W. G... 13th ed..." Biometrika. "Distribution theory for tests based on the sample distribution function. 549567 (1961). J. I . 63. "Interval estimation for a binomial proportion.. . HAlO... . WETHERDJ. 1997. S. 54. 1958.296 Testing and Confidence Regions Chapter 4 BROWN. 421434 (1976). ! I I I FERGUSON. JEFFREYS. WANG. Statistical Theory with Engineering Applications . Mathematical Statistics. New York: Hafner Publishing Company. T. 36. STEIN. . D.• Statistical Methods for Research Workers. "On the combination of independent test statistics. 1967." J. STEPHENs.• Conjectures and Refutations. G. the Growth ofScientific Knowledge. 1. R. Statistical Methods for MetaAnalysis Orlando. "EDF statistics for goodness of fit. Statist. VAN ZWET. c. 1950. AND R. 3rd ed. CAl.. 243246 (1949). "Plots and tests for symmetry. A. AND 1. Assoc. "Probabilities of the type I errors of the Welch tests. Statist. A Decision Theoretic Approach New York: Academic Press. AND K. FL: Academic Press. L. New York: I J 1 Harper and Row. OLlaN. Mathematical Statistics New York: J. W. 2nd ed. AND G. j 1 ." An" Math. 1986. Math. AND E. SIAM. FISHER. "P values as random variablesExpected P values. Testing Statistical Hypotheses." Biometrika. 9. j 1 . The Theory of Probability Oxford: Oxford University Press. KLETI. Sequential Methods in Statistics New York: Chapman and Hall. WALD. A.. 730737 (1974).. Philadelphia. WELCH. Statistical Decision Functions New York: Wiley. HEDGES. Pennsylvania (1973). Amer. E. H.." J. 1985. TATE. Amer. I . . 16.. 8. R. Aspin's tables:' Biometrika. FENSTAD. B. 66." The American Statistician.." Ann. 38. PRATI. 53. DURBIN. . AARBERGE. K.. Assoc.. A. M. DAS GUPTA." 1. "Optimal confidence intervals for the variance of a normal distribution. Statist. 56. 473487 (1977). Amer. "Length of confidence intervals. S.. L.. Statist. Assoc. ! . SAMUEL<:AHN. V. 1961. A. WALD. 64. Wiley & Sons. T. R. SACKRoWITZ.. . "PloUing with confidence: Graphical comparisons of two populations. Sequential Analysis New York: Wiley. LEHMANN. Amer. "A twosample test for a linear hypothesis whose pOWer is independent of the variance. DOKSUM. POPPER. 69.. Statist. DOKSUM.. R. . 605608 (1971). AND I." The AmencanStatistician. WILKS.. AND G. 1947.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X_1, ..., X_n from a distribution F, our setting for this section and most of this chapter. If we want to estimate μ(F) = E_F(X_1) and use X̄, we can write

    MSE_F(X̄) = Var_F(X̄) = Var_F(X_1)/n.    (5.1.1)

This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration. To go one step further, consider med(X_1, ..., X_n) as an estimate of the population median ν(F). If n is odd, n = 2k + 1, ν(F) = F^{-1}(1/2), and F has density f, we can write

    MSE_F(med(X_1, ..., X_n)) = ∫ (x − ν(F))² g_n(x) dx    (5.1.2)

where, from Problem (B.2.13),

    g_n(x) = n (2k choose k) F^k(x) (1 − F(x))^k f(x).    (5.1.3)

Evaluation here requires only evaluation of F and a one-dimensional integration, but a different one for each n (Problem 5.1.1). However, the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.1.2) and (5.1.3). Worse, consider evaluation of the power function of the one-sided t test of Chapter 4. If X_1, ..., X_n are i.i.d. N(μ, σ²) we have seen in Section 4.9.2 that √n X̄/s has a noncentral t distribution with parameter √n μ/σ and n − 1 degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions
I j {Tn (X" .. . . . in this context. but for the time being let's stick to this case as we have until now. . . Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or.. Rn (F). always refers to a sequence of statistics I. where X n = ~ :E~ of medians. {X 1j l ' •• . 10=1 f=l There are two complementary approaches to these difficulties. just as in numerical integration. It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X.n (Xl. . which we explore further in later chapters. X nj }. Xn)}n>l. .Xnj ) .2) and its qualitative properties are reasonably transparent. But suppose F is not Gaussian. Approximately evaluate Rn(F) by _ 1 B i (5.. 1 < j < B from F using a random number generator and an explicit fonn for F. fiB ~ R n (F). more specifically.3... VarF(X 1 ».fii(Xn  EF(Xd) + N(O. The classical examples are.i. j=l • By the law of large numbers as B j. 00. i . which occupies us for most of this chapter.3. . . Thus.fiit !Xi.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5. X n + EF(Xd or p £F( . _. We shall see later that the scope of asymptotics is much greater.4) i RB =B LI(F. based on observing n i.Xn):~Xi<n_l (~2 (EX. save for the possibility of a very unlikely event. The other. is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. where A= { ~ . • . Monte Carlo is described as follows. The first.. or it refers to the sequence of their distributions 1 Xi.1.d. or the sequence Asymptotic statements are always statements about the sequence. Draw B independent "samples" of size n. observations Xl. S) and in general this is only representable as an ndimensional integral..)2) } . X n as n + 00. for instance the sequence of means {Xn }n>l. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5. of distributions of statistics. In its simplest fonn. Asymptotics.o(X1). we can approximate Rn(F) arbitrarily closely. is to use the Monte Carlo method. .1.
In theory these limits say nothing about any particular T_n(X_1, ..., X_n), but in practice we act as if they do because the T_n(X_1, ..., X_n) we consider are closely related as functions of n, so that we expect the limit to approximate T_n(X_1, ..., X_n) or L_F(T_n(X_1, ..., X_n)) (in an appropriate sense). For instance, the weak law of large numbers tells us that, if E_F|X_1| < ∞, then

    X̄_n → μ ≡ E_F(X_1)  in probability.    (5.1.5)

That is, for all ε > 0,

    P_F[ |X̄_n − μ| ≥ ε ] → 0.    (5.1.6)

We interpret this as saying that, for n sufficiently large, X̄_n is approximately equal to its expectation. Similarly, the central limit theorem tells us that if E_F X_1² < ∞, μ is as above and σ² = Var_F(X_1), then

    P_F[ √n(X̄_n − μ)/σ ≤ z ] → Φ(z)    (5.1.7)

where Φ is the standard normal d.f. As an approximation, this reads

    P_F[ X̄_n ≤ x ] ≈ Φ(√n(x − μ)/σ).    (5.1.8)

The trouble is that for any specified degree of approximation, say .01, (5.1.6) does not tell us how large n has to be for the chance of the approximation not holding to this degree (the left-hand side of (5.1.6)) to fall below .01. Is n ≥ 100 enough or does it have to be n ≥ 100,000? Similarly, (5.1.8) does not tell us how much the right-hand side differs from the truth. What we in principle prefer are bounds, which are available in the classical situations of (5.1.6) and (5.1.7). For instance, by Chebychev's inequality, if E_F X_1² < ∞,

    P_F[ |X̄_n − μ| ≥ ε ] ≤ σ²/(nε²).    (5.1.9)

As a bound this is typically far too conservative. For instance, if |X_1| ≤ 1, the much more delicate Hoeffding bound gives

    P_F[ |X̄_n − μ| ≥ ε ] ≤ 2 exp{ −nε²/2 }.    (5.1.10)

Because |X_1| ≤ 1 implies that σ² ≤ 1 with σ² = 1 possible, (5.1.9) when σ² is unknown becomes 1/(nε²). For ε = .1 and n = 400, the right-hand side of (5.1.9) is then .25, whereas that of (5.1.10) is about .27. Further qualitative features of these bounds and relations to approximation (5.1.8) are given in the problems.

Again we are faced with the question of how good the approximation is for given n, x, and P_F. Bounds are also available for the central limit approximation: the celebrated Berry-Esseen bound states that if E_F|X_1|³ < ∞, then

    sup_x | P_F[ √n(X̄_n − μ)/σ ≤ x ] − Φ(x) | ≤ C E_F|X_1|³ / (σ³ √n)    (5.1.11)
' (0)( (1 / y'n)( v'27i') (Problem 5.8) differs from the truth. quite generally. 1 i . for any loss function of the form I(F. . We now turn to specifics. Although giving us some idea of how much (5. The estimates B will be consistent.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. Consistency will be pursued in Section 5. "(0) > 0.e(F)]) (1(e.1.2 deals with consistency of various estimates including maximum likelihood. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD. In particular. for all F in the n n .11) is again much too consctvative generally.1. As we mentioned. .12) where (T(O.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j. Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median. (5. (5.8) is typically much betler than (5. : i .3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families. • c F (y'n[Bn . although the actual djstribution depends on Pp in a complicated way. even here they are not a very reliable guide.1. If they are simple. Asymptotics has another important function beyond suggesting numerical approximations for specific nand F. as we have seen. Yet. Practically one proceeds as follows: (a) Asymptotic approximations are derived. consistency is proved for the estimates of canonical parameters in exponential families. As we shall see. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji. It suggests that qualitatively the risk of X n as an estimate of Ji. and asymptotically normal.11) suggests. The arguments apply to vectorvalued estimates of Euclidean parameters. which is reasonable.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4. The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures.2 and asymptotic normality via the delta method in Section 5. I . Section 5.1.5) and quite generally that risk increases with (1 and decreases with n.t and 0"2 in a precise way.I '1.di) where '(0) = 0. . For instance. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. • model. Section 5. i .1.1. B ~ O(F).3.1.F) > N(O 1) . The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i. (5.(l) The approximation (5. d) = '(II' . Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. behaves like . asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate.
for all (5.2. (5.1) and (B.1). .d. if. in accordance with (A.1.. Summary. The simplest example of consistency is that of the mean.i.2. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X. remains central to all asymptotic theory.. We will recall relevant definitions from that appendix as we need them. .2) is called unifonn consistency. 'in 1] q(8) for all 8. That is. Monte Carlo.q(8)1 > 'I that yield (5. ' . The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. case become increasingly valid as the sample size increases.Xn ) is that as e n 8 E ~ e.2.2. (See Problem 5.14. We also introduce Monte Carlo methods and discuss the interaction of asymptotics. and 8. In practice.Xn from Po where 0 E and want to estimate a real or vector q(O). for .4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models.. .2. > O. is a consistent estimate of p(P).) forsuP8 P8 lI'in . we talk of uniform cornistency over K. with all the caveats of Section 5. 00.14.7.1.14. . The least we can ask of our estimate Qn(X I.2. For P this large it is not unifonnly consistent.2. 5. e Example 5.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl.1). If Xl. Section 5. AlS. by quantities that can be so computed.Section 5.2) Bounds b(n.i. asymptotics are methods of approximating risks.d. P where P is unknown but EplX11 < 00 then. Means.2. If is replaced by a smaller set K.2) are preferable and we shall indicate some of qualitative interest when we can. The stronger statement (5. But. which is called consistency of qn and can be thought of as O'th order asymptotics. and B.2. 1denotes Euclidean distance.2 Consistency 301 in exponential families among other results. but we shall use results we need from A. .) However.2 5. Most aSymptotic theory we consider leads to approximations that in the i. Finally in Section 5. A.5 we examine the asymptotic behavior of Bayes procedures.1) where I . by the WLLN. 1 X n are i. distributions.lS. .7 without further discussion.7.. and other statistical quantities that are not realistically computable in closed form. A stronger requirement is (5. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00. and probability bounds.. where P is the empirical distribution.
with Xi E X 1 . consider the plugin estimate p(Iji)ln of the variance of p. and (Xl.2.1.2. 6 >0 O. Then qn q(Pn) is a unifonnly consistent estimate of q(p). .2..b] < R+bedefinedastheinver~eofw. But further.b) say. w...·) is increasing in 0 and has the range [a. . Suppose that q : S + RP is continuous. the kdimensional simplex.5) I A simple and important result for the case in which X!.LJ~IPJ = I}.. If q is continuous w(q. q(P) is consistent." is the following: I X n are LLd. w(q. Thus. in this case. for every < > 0.1 < j < k.6) = sup{lq(p)  q(p')I: Ip .ql > <] < Pp[IPn .p'l < 6}.2. where q(p) = p(1p). . by A. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln. w( q. Letw. By the weak law of large numbers for all p.Pk) : 0 < Pi < 1. . Theorem 5.. 6) It easily follows (Problem 5. 0 In fact. i . (5. there exists 6 «) > 0 such that p.14.3) that > <} (5.6. Then N = LXi has a B(n. = Proof.pi > 6«)] But.1) and the result follows. and p = X = N In is a uniformly consistent estimate of p. · . where Pi = PIXI = xi]' 1 < j < k. 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. Pp [IPn  pi > 6] ~ .I l 302 instance.2..2. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5. Other moments of Xl can be consistently estimated in the same way. then by Chebyshev's inequality.. Evidently.• X n be the indicators of binomial trials with P[X I = I] = p. it is uniformly continuous on S. w(q.6)! Oas6! O. . implies Iq(p')q(p) I < <Then Pp[li1n . which is ~ q(ji). Suppose thnt P = S = {(Pl.2.2. Suppose the modulus ofcontinuity of q. 0) is defined by . Pn = (iiI.. for all PEP.l : [a.l «) = inf{6 : w(q.p) distribution. p' E S. Ip' pi < 6«). 0 < p < 1. sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5. .4) I i (5. Binomial Variance. l iA) E S be the empirical distribution.3) Evidently. we can go further.. Let X 1. P = {P : EpX'f < M < oo}. Because q is continuous and S is compact.
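A simulation along the following lines illustrates the (uniform) consistency of the frequency plug-in estimate q(p̂) = p̂(1 − p̂) of Example 5.2.2. The grid of p values, the tolerance ε, and the Monte Carlo size B are arbitrary illustrative choices; the printed worst-case probability should shrink toward zero as n grows, uniformly over the grid.

```python
import numpy as np

rng = np.random.default_rng(1)
q = lambda p: p * (1 - p)            # the binomial variance parameter of Example 5.2.2
eps, B = 0.02, 5_000
p_grid = np.linspace(0.01, 0.99, 49)

for n in (50, 200, 800, 3200):
    # simulated sup_p P_p(|q(p_hat) - q(p)| > eps); it should tend to 0 with n
    worst = max(np.mean(np.abs(q(rng.binomial(n, p, size=B) / n) - q(p)) > eps)
                for p in p_grid)
    print(n, worst)
```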
Let g (91.p). and correlation coefficient are all consistent. ai. Jl2. Theorem 5.Jor all 0. where P is the empirical distribution.lpl < 1. We may. where h: Y ~ RP.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B. .V2. Let g(u.2. Suppose [ is open.9d) map X onto Y C R d Eolgj(X')1 < 00.a~. Suppose P is a canonical exponentialfamily of rank d generated by T.v} = (u.2. EVj2 < 00. If Xl. thus..V.i.2.2 Consistency 303 Suppose Proposition 5. variances. is consistent for v(P).1 that the empirical means. 1 are discussed in Problem 5. o Example 5. More generally if v(P) = h(E p g(X . Let TI. If we let 8 = (JLI. [ and A(·) correspond to P as in Section 1. h(D) for all continuous h.)1 < 1 } tions such that EUr < 00. (ii) ij is consistent. Vi).2. af > o.6. 1 < i < n be i. U implies !'. if h is continuous. Then.3. (i) Plj [The MLE Ti exists] ~ 1.) > 0.. We need only apply the general weak law of large numbers (for vectors) to conclude that (5. Let Xi = (Ui . ag.1 .UV) so that E~ I g(Ui . X n are a sample from PrJ E P.3. N2(JLI. let mj(O) '" E09j(X.6) if Ep[g(X. 1 < j < d..). 1 < j < d. Var(V.p).ar. conclude by Proposition 5.1 and Theorem 2. then vIP) h(g). 11. Var(U. )) and P = {P : Eplg(X . ' .1.2.2. then Ifh=m. . = Proof. 0 Here is a general consequence of Proposition 5.Section 5. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj. is a consistent estimate of q(O). Variances and Correlations.2.U 2. and let q(O) = h(m(O)).4. !:. .d..2.then which is well defined and continuous at all points of the range of m. )1 < oo}. Then.) > 0.1.7.JL2. 'Vi) is the statistic generating this 5parameter exponential family.1: D n 00.. .
.1 that i)(Xj. '1 P1)'[n LT(Xi ) E C T) ~ 1.3.E1). i i • • ! . . T(Xl)..1 that On CT the map 17 A(1]) is II and continuous on E.2 Consistency of Minimum Contrast Estimates .8) I > O.  inf{D(II. the MLE.. II) = where.lI) .2.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6. 7 I .7) I .. exists iff the event in (5.1.3.2. II) L p(X n... . I I Proof Recall from Corollary 2. .7) occurs and (i) follows.d. and the result follows from Proposition 5. Let 0 be a minimum contrast estimate that minimizes I .1 belong to the interior of the convex support hecause the equation A(1)) = to. But Tj.304 Asymptotic Approximations Chapter 5 . A more general argument is given in the following simple theorem whose conditions are hard to check.9) n and Then (} is consistent. which solves 1 n A(1)) = .3.L[p(Xi .. z=l 1 n i. the inverse AI: A(e) ~ e is continuous on 8.2. 1 i . for example. see.lI o): 111110 1 > e} > D(1I0 . = 1 n p. is a continuous function of a mean of Li. II) 0 j =EII..d. .2.. Note that. 0 Hence. . i=l 1 n (5.. (5.j. as usual. By a classical result. Po. Let Xl. is solved by 110. 5. I ! . j 1 Pn(X. {t: It .Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn. I . (T(Xl)) must hy Theorem 2.1I)11: II E 6} ~'O (5.1I 0 ) foreverye I . I J Theorem 5. Rudin (1987). vectors evidently used exponential family properties.. D( 110 .1 to Theorem 2. i I . D . T(Xdl < 6} C CT' By the law of large numbers. Suppose 1 n PII sup{l. . • ..LT(Xi ). .2.2. X n be Li. n LT(Xi ) i=I 21 En. .D(1I0 .3.= 1 . E1). By definition of the interior of the convex support there exists a ball 8. BEe c Rd. We showed in Theorem 2. The argument of the the previous subsection in which a minimum contrast estimate. if 1)0 is true. n .2.. where to = A(1)o) = E1)o T(X 1 ).3.
11) sup{l. ifIJ is the MLE.13) By Shannon's Lemma 2.IJol > E] < PIJ [inf{ . IJj))1 > <I : 1 < j < d} ~ O. 1 n PIJo[inf{.2. then.1. E n ~ [p(Xi .5.8) by (i) For ail compact K sup In { c e.2. IJ) n ~ i=l p(X"IJ o)] . .1 we need only check that (5.2.2. Ie ¥ OJ] ~ Ojoroll j.IIIJ  IJol > c] (5.8). IJj ) .8) and (5. 0 Condition (5. EIJollogp(XI.2 Consistency 305 proof Note that. IJ j ) .14) (ii) For some compact K PIJ.lJ o): IIJ IJol > oj.IJ o)  D(IJo.2."'[p(X i . .L(p(Xi. {IJ" . (5. IJ) = logp(x.2. But because e is finite.8) follows from the WLLN and PIJ.D( IJ o.IJd ). IIJ .p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1. An alternative condition that is readily seen to work more widely is the replacement of (5.2.2.10) By hypothesis. [inf c e. IJ j )) : 1 <j < d} > <] < d maxi PIJ.IJol > <} < 0] (5..2.D(IJo.IJ) p(Xi. IJ).12) which has probability tending to 0 by (5. But for ( > 0 let o ~ ~ inf{D(IJ.[IJ ¥ OJ] = PIJ. IJ) . IJ) .2. A simple and important special case is given by the following. [I ~ L:~ 1 (p(Xi . 0) .2.2.2.IJor > <} < 0] because the event in (5.Section 5.IJ)1 < 00 and the parameterization is identifiable.9) hold for p(x. Coronary 5.2. IJo)) .[max{[ ~ L:~ 1 (p(Xi .D(IJo.L[p(Xi .11) implies that the righthand side of (5.9) follows from Shannon's lemma.inf{D(IJo. IIJ . IJ)I : IJ K } PIJ !l o. and (5. _ I) 1 n PIJ o [16 .11) implies that 1 n ~ 0 (5.. is finite.8) can often failsee Problem 5.10) tends to O.IJO)) ' IIJ IJol > <} n i=l . PIJ.2. PIJ. (5.D(IJ o. /fe e =   Proof Note that for some < > 0. for all 0 > 0.2.2. o Then (5. 2 0 (5. IJ)II ' IJ n i=l E e} > .D(IJo. IJ) .2.
.1) where • .1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. VariX. in Section 5.3. consistency of the MLE may fail if the number of parameters tends to infinity. We introduce the minimal property we require of any estimate (strictly speak ing. Sufficient conditions are explored in the problems. We denote the jth derivative of h by 00 . We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. Unifonn consistency for l' requires more.) = <7 2 Eh(X) We have the following. Summary.3 FIRST. . A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. 5.3.2.8) and (5. ~ 5. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. =suPx 1k<~)(x)1 <M < . If fin is an estimate of B(P). and assume (i) (a) h is m times differentiable on R. we require that On ~ B(P) as n ~ 00. sequence of estimates) consistency. As usual let Xl.B(P)I > Ej : PEP} t 0 for all € > O.d. D . If (i) and (ii) hold.i. that sup{PIiBn .AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued. m Mi) and assume (b) IIh(~)lloo j 1 . let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn. + Rm (5.2. X valued and for the moment take X = R.33.1. Unfortunately checking conditions such as (5. Let h : R ~ R. . I ~.. then = h(ll) + L (j)( ) j=l h . J. > 2.3.. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean.1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. ~l Theorem 5.14) is in general difficult.Il E(X Il). When the observations are independent but not identically distributed. see Problem 5. . We then sketch the extension to functions of vector means.3..306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems.Xn be i.
3. We give the proof of (5. :i1 + ...3.3) for j even. then there are constants C j > 0 and D j > 0 such (5. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r.[jj2] + 1) where Cj = 1 <r. X ij ) least twice. 'Zr J . EIX I'li = E(X _1')' Proof. . .5[jJ2J max {l:: { . (n .3. In!..3. then (a) But E(Xi . Lemma 5. . Xij)1 = Elxdj il.3..Section 5. . (5.'" . _ . rl l:: . ik > 2. . .1.3 First. 21. The expression in (c) is.4) Note that for j eveo.3.3) and j odd is given in Problem 5. bounded by (d) < t.. . Moreover. t r 1 and [t] denotes the greatest integer n...3.. .) ~ 0..'1 +..ij } • sup IE(Xi . and the following lemma. . .3.4) for all j and (5.t" = t1 .3) (5.. . tr i/o>2 all k where tl . tI.. (b) a unless each integer that appears among {i I .2. . .. +'r=j J .1 < k < r}}.1) . for j < n/2. C [~]! n(n ... Let I' = E(X. If EjX Ilj < 00. + i r = j..i j appears at by Problem 5.5.and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion.. The more difficult argument needed for (5. j > 2.1'1.3.2) where IX' that 1'1 < IX . .
In general by considering Xi .6) n . if 1 1 <j (a) Ilh(J)II= < 00.1. Proof (a) Write 00.l) 6 + 2h(J.2.5) (b) if E(Xt) < G(n. I . By Problem 5.3. give approximations to the bias of h(X) as an estimate of h(J. (n li/2] + 1) < nIJf'jj and (c). If the conditions of (b) hold. 1 < j < 3.1.l)h(J. I I I Corollary 5..l)E(X .l) + [h(1)]2(J.6.1 with m = 3.3. EIX1 .2 ).3. and EXt < replaced by G(n').6) can be 1 Eh 2 (X) = h'(J..3.1'11.3. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00. • . I 1 . Because E(X .3. .5) can be replaced by Proof For (5. 00 and 11hC 41 11= < then G(n.3f2 ) in (5. 0 The two most important corollaries of Theorem 5.3.1')2 = 0 2 In. 0 1 I .1')3 = G(n 2) by (5. using Corollary 5. I . Then Rm = G(n 2) and also E(X ..3f') in (5.1') + {h(2)(J.ll j < 2j EIXd j and the lemma follows.l) + 00 2n I' + G(n3f2).1..l) and its variance and MSE. . Corollary 5.l)}E(X .JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .5) apply Theorem 5.5) follows.3.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J. (d).1 with m = 4. apply Theorem 5. .3.+ O(n. respectively. n (b) Next.llJ' + G(n3f2) (5.3.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) .l) + {h(21(J.3f2 ). i • r' Var h(X) = 02[h(11(J.3.3.3. then I • I.l) + [h(llj'(1')}':. then h(')( ) 2 0 = h(J..3. 1 iln . (5. if I' = O. and (e) applied to (a) imply (5.. then G(n.l)h(1)(J.J.4).4) for j odd and (5.l)h(J.3. < 3 and EIXd 3 < 00. (5.3. I (b) Ifllh(j11l= < 00.3) for j even.
= E"X I ~ 11 A. and.3.2.i. by Corollary 5. by Corollary 5.d.3.exp( 2') c(') = h(/1). (5.Z.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.i.3. the MLE of ' is X .3 First..  . If heX) is viewed.3(1_ ')In + G(n. when hit) = t(1 . which is G(n.3..3.3.(h(X) .Z) (5. We will compare E(h(X)) and Var h(X) with their approximations. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5. as the plugin estimate of the parameter h(Jl) then.8) because h(ZI (t) = 4(r· 3 .3.3. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).3.l / Z) unless h(l)(/1) = O. Note an important qualitative feature revealed by these approximations.11) Example 5.3. where /1.7) If h(t) ~ 1 .3. then we may be interested in the warranty failure probability (5.3.3.2 ) 2e.1) = 11. for large n. T!Jus. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5.t) and X.1.. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +.and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5. the bias of h(X) defined by Eh(X) . which is neglible compared to the standard deviation of h(X).4) E(X .2 (Problem (5.3. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5. If X [. as we nonnally would.Section 5. {h(l) (/1)h(ZI U')/13 (5. then heX) is the MLE of 1 . Thus. X n are i./1)3 = ~~. Bias and Variance of the MLE of the Binomial VanOance.1.4 ) exp( 2/t).(h(X)) E.5).3.h(/1) is G(n.t.6). .Z ".1 with bounds on the remainders.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours. (1Z 5. .10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .h(/1)) h(2~(II):: + O(n. Bias. Example 5..3.exp( 2It).l ).
.p)J n p(1 ~ p) [I .p) {(I _ 2p)2 n Thus.p).3. M2) ~ 2.2t.2p)' . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i. (5. (5.10) yields .' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. asSume that h has continuous partial derivatives of order up to m.2p)2p(1 . 1 and in this case (5.9d(Xi )f. .2p) n n +21"(1 ...p(1 . • . Next compute Varh(X)=p(I.2p)')} + R~. .' . 0 < i. Theorem 5.e x ) : i l + .3 ). .E(X') ~ p .p) + .p)'} + R~ p(l .{2(1. n n Because MII(t) = I ..p)] n I " . .P) {(1_2 P)2+ 2P(I. R'n p(1 ~ p) [(I .p) n . I .310 Asymptotic Approximations Chapter 5 B(l.[Var(X) + (E(X»2] I nI ~ p(1 .P )} (n_l)2 n nl n Because 1'3 = p(l.p) = p ( l .j .p(1 .3. the error of approximation is 1 1 + . D 1 i I .p) . and will illustrate how accurate (5.p)(I.2p(1 . Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).6p(l.!c[2p(1 n p) .p) .3.3.2p)p(l.2.2p).5) yields E(h(X))  ~ p(1 . ~ Varh(X) = (1. Let h : R d t R..p)(I. First calculate Eh(X) = E(X) .. . . 1 < j < { Xl . + id = m.10) is in a situation in which the approximation can be checked.3. II'. D " ~ I .pl. " ! . a Xd d} .2(1 . • ~ O(n.amh i.5) is exact as it should be. < m.
Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00.3. (J1.»)'var(Y'2)] +O(n') Approximations (5.3.3). under appropriate conditions (Problem 5. The most interesting application.4. Y 12 ) 3/2). by (5.". = EY 1 .C( v'n(h(X) .).~ (J1.1.)} + O(n (5. 0/ The result follows from the more generally usefullernma.) + a.. as for the case d = 1. (J1. Lemma 5.12) can be replaced by O(n.) Var(Y.8. Similarly. (7'(h)) where and (7' = VariX.~E(X. if EIY.3. E xl < 00 and h is differentiable at (5. Y12 ) (5.6).h(!. Theorem 5. and J. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.'.12) Var h(Y) ~ [(:. then Eh(Y) This is a consequence of Taylor's expansion in d variables.. is to m = 3. The results in the next subsection go much further and "explain" the fonn of the approximations we already have.. h : R ~ R. We get.) Cov(Y".and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.13).3.3. (5.14) + (g:.) )' Var(Y.2. then O(n.))) ~ N(O. 5.11.I' < 00. and the appropriate generalization of Lemma 5.15) Then .14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d .Section 5.3. The proof is outlined in Problem 5.) + ~ {~~:~ (J1.5). B. 1 < j < d where Y iJ I = 9j(Xi ).) + 2 g:'.hUll)~..3.3.2 ).3.x.) Cov(Y".3.3.3.3 / 2 ) in (5. 1. and (5.3 First.3.) Var(Y.(J1. + ~ ~:l (J1.) gx~ (J1. for d = 2..3.13) Moreover. . EI Yl13 < 00 Eh(Y) h(J1.2 The Delta Method for In law Approximations = As usual we begin with d !.3. (5.. Suppose thot X = R.1 Then.).3.
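Theorem 5.3.3 is easy to check numerically before turning to the proof. The sketch below takes h(x) = log x and exponential data purely for illustration; any smooth h with h'(μ) ≠ 0 and any F with finite variance could be substituted. With σ = μ and h'(μ) = 1/μ the limiting standard deviation σ|h'(μ)| equals 1.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu, n, B = 2.0, 100, 100_000              # illustrative choices
xbar = rng.exponential(mu, size=(B, n)).mean(axis=1)
z = np.sqrt(n) * (np.log(xbar) - np.log(mu))   # Theorem 5.3.3: approx N(0, sigma^2 h'(mu)^2)
tau = 1.0                                       # sigma * |h'(mu)| = mu * (1/mu) = 1 here
for x in (-1.5, -0.5, 0.5, 1.5):
    print(x, np.mean(z <= x), norm.cdf(x / tau))   # empirical d.f. vs. limiting normal d.f.
```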
Therefore.3. (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). . ( 2 ).ul Note that (i) (b) '* . j " But.3. .u)!:. for every (e) > O.a 2 ). • • . By definition of the derivative. from (a). Consider 1 Vn = v'n(X !") !:. by hypothesis. see Problems 5. 'j But (e) implies (f) from (b).312 Asymptotic Approximations Chapter 5 (i) an(Un . hence. V .3. for every Ii > 0 (d) j. ·i Note that (5. j .ul < Ii '* Ig(v)  g(u) . (5.N(O.15) "explains" Lemma 5.7.16) Proof. .• ' " and.32 and B. PIIUn €  ul < iii ~ 1 • "'. u = /}" j \ V . (e) Using (e). I (g) I ~: . The theorem follows from the central limit theorem letting Un X. _ F 7 .3. . Formally we expect that if Vn !:. ~: . 0 . an = n l / 2..u) !:. ~ EVj (although this need not be true.1.i. an(Un .17) . V for some constant u. V.8).3. Thus. I . we expect (5.N(O. for every € > 0 there exists a 8 > 0 such that (a) Iv . . then EV. V and the result follows.g(ll(U)(v  u)1 < <Iv . = I 0 .
O < A < 1. £ (5. Yn2 be two independent samples with 1'1 = E(X I )...= 0 versuS K : 11./2 ). A statistic for testing the hypothesis H : 11. Slutsky's theorem.l (1.N(O.l distribution. For the proof note that by the central limit theorem.). v) = u!v. 1'2 = E(Y.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~.s Tn =.X" be i.3. VarF(Xd = a 2 < 00.TiS X where S 2 = 1 nl L (X. 1). Then (5.'" 1 X n1 and Y1 . where g(u. sn!a).j odd. (1 .d. al = Var(X.28) that if nI/n _ A.. Consider testing H: /11 = j. N n (0 Aal. i=l If:F = {Gaussian distributions}. In general we claim that if F E F and H is true.> 0 .3 First.Section 5.3.a) for Tn from the Tn .18) In particular this implies not only that t n.a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian. (b) The TwoSample Case. and S2 _ n nl ( ~ X) 2) n~(X. and the foregoing arguments. then S t:. Let Xl.) and a~ = Var(Y. . else EZJ ~ = O. But if j is even.9.2. Example 5. In Example 4.1 (Ia) + Zla butthat the t n.i.17) yields O(n J.). Using the central limit theorem.2 and Slutsky's theorem. EZJ > 0.'" .3. "t" Statistics.and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O. (a) The OneSample Case.t2 versus K : 112 > 111. we find (Problem 5. FE :F where EF(X I ) = '". Let Xl. (1  A)a~ .1 (1 . j even = o(nJ/').+A)al + Aa~) . ••• . we can obtain the critical value t n.18) because Tn = Un!(sn!a) = g(Un . then Tn . Now Slutsky's theorem yields (5.X) n 2 .3.3. 1). 1 2 a P by Theorem 5.3.
5 2 2. .. X~. .. Figure 5.2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X.I = u~ and nl = n2..95) approximation is only good for n > 10 2 .3. oL.2 or of a~. . Figure 5..1.. . the asymptotic result gives a good approximation when n > 10 1.5 3 Log10 sample size j i Figure 5. and the true distribution F is X~ with d > 10.. then the critical value t. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently.5 1 1.3... where d is either 2. i 2(1 approximately correct if H is true and the X's and Y's are not normal. The X~ distribution is extremely skew. approximations based on asymptotic results should be checked by Monte Carlo simulations..3. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table.000 onesample t tests using X~ data. " .. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution.1 shows that for the onesample t test. and in this case the t n .. i I . 0) for Sn is Monte Carlo Simulation As mentioned in Section 5. 20.5 .. the For the twosample t tests. Other distributions should also be tried.'. as indicated in the plot..02 .1 (0. Chisquare data I 0. or 50.. 10. when 0: = 0. 32. .1.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. 316. Y .. One sample: 10000 Simulations... ur . The simulations are repeated for different sample sizes and the observed significance levels are plotted.05. Each plotted point represents the results of 10.c:'::c~:____:_J 0.5 .
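A minimal version of the Monte Carlo experiment behind these plots is sketched below: the observed level of the nominal 0.05 one-sided, one-sample t test when the data are chi-square with 10 degrees of freedom rather than normal. The sample sizes and the number of replications M are illustrative; under the null hypothesis the true mean equals the degrees of freedom d.

```python
import numpy as np
from scipy.stats import t as tdist

rng = np.random.default_rng(4)
alpha, M, d = 0.05, 10_000, 10
for n in (10, 32, 100, 316):
    x = rng.chisquare(d, size=(M, n))
    tn = np.sqrt(n) * (x.mean(axis=1) - d) / x.std(axis=1, ddof=1)   # H: mu = d is true
    level = np.mean(tn >= tdist.ppf(1 - alpha, n - 1))
    print(n, level)     # observed significance level; compare with the nominal 0.05
```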
D Next.3.Xi have a symmetric distribution.000 twosample t tests. Moreover.3.hoi n s[h(l)(X)1 . _ y'ii:[h(X) . By Theoretn 5. However. as we see from the limiting law of Sn and Figure 5.3. Equal Variances 0. scaled to have the same means. 1 (ri .3.n2 and at I=. even when the X's and Y's have different X~ distributions. y'ii:[h(X) .a~ show that as long as nl = n2.Xd. and Yi .and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because. ChiSquare Dala. or 50. the t n 2(1 . Other Monte Carlo runs (not shown) with I=.12 0.Section 5_3 First.cCc":'::~___:.02 d'::'.9. af = a~._' 0.h(JL)] ". when both nl I=.' 0.X = ~. in this case. and a~ = 12af.) i O. In this case Monte Carlo studies have shown that the test in Section 4.5 1 1. Each plotted point represents the results of 10. To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T. the t n _2(O.o'[h(l)(JLJf). For each simulation the two samples are the same size (the size indicated on the xaxis). }.L) where h is con tinuously differentiable at !'.a~.3. let h(X) be an estimate of h(J.(})) do not have approximate level 0.n2 and at = a').ll)(!.5 2 2.4 based on Welch's approximation works well..0' H<I o 0.. Y .5 3 log10 sample size Figure 5.2. 2:7 ar Two sample. 10000 Simulations. and the data are X~ where d is one of 2. in the onesample situation.95) approximation is good for nl > 100.0:) approximation is good when nl I=. then the twosample t tests with critical region 1 {S'n > t n . 10.2 (1 . N(O.
Combining Theorem 5.3. +2 + 25 J O'.I _ __ 0 I +__ I :::. From (5. The data in the first sample are N(O.oj:::  6K . Each plotted point represents the results of IO.{x)() twosample t tests.3. then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered. we see that here. " 0.. which are indexed by one or more parameters. called variance stabilizing.c:~c:_~___:J 0. In Appendices A and B we encounter several important families of distributions. as indicated in the plot. 0.3.~'ri i _ . and beta.. I: . The xaxis denotes the size of the smaller of the two samples. £l __ __ 0.' OIlS .". Variance Stabilizing Transfonnations Example 5. For each simulation the two samples differ in size: The second sample is two times the size of the first. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1.. N(O. Poisson.12 .6.02""~". gamma. Tn .5 Log10 (smaller sample size) l Figure 5.3 and Slutsky's theorem.. and 9. 1) so that ZI_Q is the asymptotic critical value. Unequal Variances.3.0. o. such as the binomial.10000 Simulations: Gaussian Data. . too.316 Asymptotic Approximations Chapter 5 Two Sample.. such that Var heX) is approximately independent of the parameters indexing the family we are considering. If we take a sample from a member of one of these families. .J~ f).3.5 1 1. '! 1 . It turns out to be useful to know transformations h. 2nd sample 2x bigger 0.£.. +___ 9.6) and ..4..3. if H is true . We have seen that smooth transformations heX) are also approximately normally distributed.
As an example.Ct confidence interval for J>. See Problems 5. which has as its solution h(A) = 2. . .3. Also closely related but different are socaBed normalizing transformations.. See Example 5. c) (5.i.6) we find Var(X)' ~ 1/4n and Vi'((X) . which varies freely. 1n (X I. The notion of such transformations can be extended to the following situation. Under general conditions (Bhattacharya and Rao.. Some further examples of variance stabilizing transformations are given in the problems. where d is arbitrary.3. .d.3. Such a function can usually be found if (J depends only on fJ. Suppose.and HigherOrder Asymptotics: The Delta Method with Applications 317 (5. To have Varh(X) approximately constant in A. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals.)] 2/ n ..3. 1976. 0 One application of variance stabilizing transformations. h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O.19) for all '/.13) we see that a first approximation to the variance of h( X) is a' [h(l) (/.. Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X.1 0 ) Vi' ' is an approximate 1.16.\ + d.3. Major examples are the generalized linear models of Section 6. I X n is a sample from a P(A) family. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions.3.. In this case (5. yX± r5 2z(1 . .15 and 5. X n) is an estimate of a real parameter! indexing a family of distributions from which Xl. sample. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. by their definition. 538) one can improve on . finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family. In this case a'2 = A and Var(X) = A/n..hb)) + N(o.' .5. 1/4) distribution.Xn are an i. p. this leads to h(l)(A) = VC/J>. Suppose further that Then again..3. Thus. a variance stabilizing transformation h is such that Vi'(h('Y) . A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models. If we require that h is increasing. suppose that Xl.(A)') has approximately aN(O.3 First.)C. A > 0.19) is an ordinary differential equation. Substituting in (5.. .6. Thus. in the preceding P( A) case.. Thus.Section 5.
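A quick coverage check of the variance-stabilized interval √X̄ ± z(1 − α/2)/(2√n) for √λ in the Poisson case is sketched below; the values of λ, n, and the number of replications are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
lam, n, M, alpha = 4.0, 30, 20_000, 0.05
half = norm.ppf(1 - alpha / 2) / (2 * np.sqrt(n))       # fixed half-length of the interval
xbar = rng.poisson(lam, size=(M, n)).mean(axis=1)
cover = np.mean((np.sqrt(xbar) - half <= np.sqrt(lam)) &
                (np.sqrt(lam) <= np.sqrt(xbar) + half))
print("empirical coverage:", cover)                      # should be close to 0.95
```

Note the design point of the transformation: the half-length of the interval does not depend on the data or on λ, which is exactly what variance stabilization buys.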
Edgeworth Approximations

The normal approximation to the distribution of X̄ utilizes only the first two moments of X̄. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on the normal approximation by utilizing the third and fourth moments. Let Fn denote the distribution of Tn = √n(X̄ − μ)/σ and let γ1n and γ2n denote the coefficient of skewness and the kurtosis of Tn, where Tn is a standardized random variable. Then under some conditions(1)

Fn(x) = Φ(x) − φ(x)[ (γ1n/6) H2(x) + (γ2n/24) H3(x) + (γ1n²/72) H5(x) ] + rn,    (5.3.20)

where rn tends to zero at a rate faster than 1/n and H2, H3, and H5 are Hermite polynomials defined by

H2(x) = x² − 1,  H3(x) = x³ − 3x,  H5(x) = x⁵ − 10x³ + 15x.    (5.3.21)

The expansion (5.3.20) is called the Edgeworth expansion for Fn.

Example 5.3.5. Edgeworth Approximations to the χ² Distribution. Suppose V ~ χ²_n. According to Theorem B.3.1, V has the same distribution as Σⁿᵢ₌₁ Xᵢ², where the Xᵢ are independent and Xᵢ ~ N(0, 1), i = 1, ..., n. It follows from the central limit theorem that Tn = (Σⁿᵢ₌₁ Xᵢ² − n)/√(2n) = (V − n)/√(2n) has approximately a N(0, 1) distribution. To improve on this approximation, we need only compute γ1n and γ2n. We can use Problem B.2.4 to compute

γ1n = E(V − n)³/(2n)^{3/2} = 2√2/√n,  γ2n = E(V − n)⁴/(2n)² − 3 = 12/n.

Table 5.3.1 gives this approximation together with the exact distribution and the normal approximation when n = 10.

[Table 5.3.1. Edgeworth(2) and normal approximations EA and NA to the χ²₁₀ distribution: for a range of values x, the exact probability P(Tn ≤ x), the Edgeworth approximation EA, and the normal approximation NA. The individual numerical entries are not reliably recoverable from the original.]
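A small Python sketch (ours, not from the text) reproduces the comparison behind Table 5.3.1: it evaluates the Edgeworth(2) approximation of Example 5.3.5 for the χ²₁₀ case against the exact and normal values at a few arbitrary points x.

```python
import numpy as np
from scipy.stats import chi2, norm

def edgeworth_chi2_cdf(x, n):
    # Edgeworth(2) approximation to P(T_n <= x), T_n = (V - n)/sqrt(2n), V ~ chi^2_n,
    # with gamma_1n = 2*sqrt(2/n) and gamma_2n = 12/n as in Example 5.3.5.
    g1, g2 = 2 * np.sqrt(2.0 / n), 12.0 / n
    H2, H3, H5 = x**2 - 1, x**3 - 3*x, x**5 - 10*x**3 + 15*x
    return norm.cdf(x) - norm.pdf(x) * (g1 * H2 / 6 + g2 * H3 / 24 + g1**2 * H5 / 72)

n = 10
for x in (-1.5, -0.5, 0.5, 1.5, 2.5):
    exact = chi2.cdf(n + x * np.sqrt(2 * n), df=n)   # P(T_n <= x) exactly
    print(x, round(exact, 4), round(edgeworth_chi2_cdf(x, n), 4), round(norm.cdf(x), 4))
```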
The Multivariate Case

Lemma 5.3.2 extends to the d-variate case.

Lemma 5.3.3. Suppose {Un} are d-dimensional random vectors and that for some sequence of constants {an} with an ↑ ∞ as n → ∞,

(i) an(Un − u) →L V (d × 1) for some d × 1 vector of constants u;

(ii) g : R^d → R^p has a differential g⁽¹⁾(u) (p × d) at u.

Then

an(g(Un) − g(u)) →L g⁽¹⁾(u)V.

Proof. The proof follows from the arguments of the proof of Lemma 5.3.2. □

Example 5.3.6. √n(r² − ρ²) is asymptotically normal. Recall from Section 4.9.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient ρ and that the likelihood ratio test of H : ρ = 0 is based on |r|. Let (X1, Y1), ..., (Xn, Yn) be i.i.d. as (X, Y), where 0 < EX⁴ < ∞, 0 < EY⁴ < ∞, and let ρ² = Cov²(X, Y)/(σ₁²σ₂²), where σ₁² = Var X, σ₂² = Var Y. Because of the location and scale invariance of ρ and r, we can use the transformations X̃ᵢ = (Xᵢ − μ₁)/σ₁ and Ỹᵢ = (Yᵢ − μ₂)/σ₂ to conclude that, without loss of generality, we may assume μ₁ = μ₂ = 0, σ₁² = σ₂² = 1, ρ = E(XY). Write C = n⁻¹ΣXᵢYᵢ, σ̂₁² = n⁻¹ΣXᵢ², σ̂₂² = n⁻¹ΣYᵢ², so that r² = C²/(σ̂₁²σ̂₂²) = g(C, σ̂₁², σ̂₂²), where g(u₁, u₂, u₃) = u₁²/(u₂u₃) and

g⁽¹⁾(u) = (2u₁/(u₂u₃), −u₁²/(u₂²u₃), −u₁²/(u₂u₃²)), so g⁽¹⁾(ρ, 1, 1) = (2ρ, −ρ², −ρ²).    (5.3.22)

If Un = (n⁻¹ΣXᵢYᵢ, n⁻¹ΣXᵢ², n⁻¹ΣYᵢ²) and u = (ρ, 1, 1), then by the central limit theorem √n(Un − u) →L N(0, Σ), where the entries of Σ are variances and covariances of the moments XᵏYˡ involved; in particular, √n(C − ρ), √n(σ̂₁² − 1) and √n(σ̂₂² − 1) jointly have the same asymptotic distribution as √n(Un − u). It follows from Lemma 5.3.3 and (B.5.6) that √n(r² − ρ²) is asymptotically normal.
When (X, Y) ~ N(μ₁, μ₂, σ₁², σ₂², ρ), the asymptotic variance of √n(r² − ρ²) reduces to 4ρ²(1 − ρ²)², and it follows (Problem 5.3.9) that

√n(r − ρ) →L N(0, (1 − ρ²)²).

This expression provides approximations to the critical value of tests of H : ρ = 0, and it gives approximations to the power of these tests. Referring to (5.3.19), we see (Problem 5.3.10) that in the bivariate normal case a variance stabilizing transformation h(r) with √n(h(r) − h(ρ)) →L N(0, 1) is achieved by choosing

h(ρ) = ½ log( (1 + ρ)/(1 − ρ) ).    (5.3.23)

The approximation based on this transformation, which is called Fisher's z, has been studied extensively and it has been shown (e.g., David, 1938) that L(√(n − 3)(h(r) − h(ρ))) is closely approximated by the N(0, 1) distribution. It provides the approximate 100(1 − α)% confidence interval of fixed length

ρ = tanh{ h(r) ± z(1 − ½α)/√(n − 3) },

where tanh is the hyperbolic tangent. □
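A Python sketch (ours; the correlation ρ, sample size, and seed are arbitrary) of the Fisher z interval just described:

```python
import numpy as np
from scipy.stats import norm

def fisher_z_interval(x, y, alpha=0.05):
    # h(r) = arctanh(r) is the variance stabilizing transformation (5.3.23);
    # back-transform the interval for h(rho) with tanh.
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    half = norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)

rng = np.random.default_rng(10)
rho, n = 0.6, 80
x, y = rng.multivariate_normal([0, 0], [[1.0, rho], [rho, 1.0]], size=n).T
print(fisher_z_interval(x, y))
```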
Here is an extension of Theorem 5.3.3.

Theorem 5.3.4. Suppose Y1, ..., Yn are independent identically distributed d-vectors with E|Y1|² < ∞, EY1 = m, Var Y1 = Σ, and h : O → R^p, where O is an open subset of R^d, h = (h1, ..., hp), and h has a total differential h⁽¹⁾(m) = ‖∂h_k/∂m_j(m)‖ (p × d). Then

(a) h(Ȳ) = h(m) + h⁽¹⁾(m)(Ȳ − m) + o_P(|Ȳ − m|),

(b) √n(Ȳ − m) →L N(0, Σ), so that

(c) √n(h(Ȳ) − h(m)) →L N(0, h⁽¹⁾(m) Σ [h⁽¹⁾(m)]ᵀ).    (5.3.24)

Proof. Now the weak law of large numbers and (a) imply that

√n(h(Ȳ) − h(m)) = √n h⁽¹⁾(m)(Ȳ − m) + o_P(1),

and (c) follows from (b) and Slutsky's theorem. □

Example 5.3.7. χ²_k and Normal Approximation to the Distribution of F Statistics. Suppose that X1, ..., Xn is a sample from a N(0, σ²) distribution. Then Σᵏᵢ₌₁ Xᵢ²/σ² has a χ²_k distribution and, according to Corollary B.3.1, if k + m = n, the F statistic

T_{k,m} = (k⁻¹ Σᵏᵢ₌₁ Xᵢ²) / (m⁻¹ Σⁿᵢ₌ₖ₊₁ Xᵢ²)    (5.3.25)

has an F_{k,m} distribution. When k is fixed and m (or equivalently n = k + m) is large, the F_{k,m} distribution can be approximated by the distribution of V/k, where V ~ χ²_k. To show this, we first note that (1/m) Σⁿᵢ₌ₖ₊₁ Xᵢ² is the average of m independent χ²₁ random variables. By Theorem B.3.1, the mean of a χ²₁ variable is E(Z²), where Z ~ N(0, 1); but E(Z²) = Var(Z) = 1, so the weak law of large numbers (A.14.7) implies that, as m → ∞, (1/m) Σⁿᵢ₌ₖ₊₁ Xᵢ² →P 1. Using the (b) part of Slutsky's theorem, we conclude that for fixed k, as m → ∞, T_{k,m} →L V/k.

Thus, when the number of degrees of freedom in the denominator is large, the F_{k,m} distribution can be approximated by the distribution of V/k. To get an idea of the accuracy of this approximation, check the entries of Table IV against the last row, which is labeled m = ∞; this row gives the quantiles of the distribution of V/k. For instance, if k = 5 and m = 60, then P[T_{5,60} ≤ 2.37] = P[(V/k) ≤ 2.21] = 0.95, and the respective 0.05 upper quantiles are 2.37 for the F_{5,60} distribution and 2.21 for the distribution of V/k. See also Figure B.3.1, in which the density of V/k, when k = 10, is given as the F_{10,∞} density.

Next we turn to the normal approximation to the distribution of T_{k,m}. Suppose that n ≥ 60, so that Table IV cannot be used for the distribution of T_{k,m}. We do not require the Xᵢ to be normal, only that they be i.i.d. with EX1 = 0, σ² = Var(X1) > 0 and EX1⁴ < ∞. Suppose for simplicity that k = m and k + m = n; the case m = λk for some λ > 0 is left to the problems. Writing T_k for T_{k,k}, we can write
T_k = h(Ȳ), where Ȳ is the mean of the i.i.d. vectors Yᵢ = (Y_{i1}, Y_{i2})ᵀ with Y_{i1} = Xᵢ²/σ² and Y_{i2} = X²_{k+i}/σ², i = 1, ..., k, and h(u, v) = u/v. Here E(Yᵢ) = (1, 1)ᵀ and h⁽¹⁾(1, 1) = (1, −1)ᵀ, so by Theorem 5.3.4,

√k(T_k − 1) →L N(0, 2 Var(Y₁₁)).    (5.3.27)

In particular, if Xᵢ ~ N(0, σ²), then Var(Y₁₁) = 2 and √k(T_k − 1) →L N(0, 4). In general (Problem 5.3.8), when min{k, m} → ∞,

L( √(mk/(m + k)) (T_{k,m} − 1) ) → N(0, κ),

where κ = Var[ ((X₁ − μ₁)/σ₁)² ] and σ₁² = Var(X₁). Equivalently, the upper F_{k,m} critical value f_{k,m}(1 − α) satisfies

f_{k,m}(1 − α) ≈ 1 + √( κ(m + k)/(mk) ) z_{1−α}.    (5.3.28)

An interesting and important point (noted by Box, 1953) is that, unlike the t test, the F test for equality of variances (Problem 5.3.8(a)) does not have robustness of level. Specifically, if Var(X₁²) ≠ 2σ⁴, the critical value f_{k,m}(1 − α), which by (5.3.28) behaves like 1 + √(2(m + k)/(mk)) z_{1−α} in the normal case, is asymptotically incorrect; in general (Problem 5.3.8(c)) one has to use the critical value

c_{k,m} = 1 + √( κ(m + k)/(mk) ) z_{1−α},    (5.3.29)

in which κ is unknown. When it can be estimated by the method of moments (Problem 5.3.8(d)), κ in (5.3.29) is replaced by its method of moments estimate. □
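A simulation sketch in Python (ours; the t(5) data, sample sizes, and number of replications are arbitrary assumptions) of Box's point: under heavy-tailed data with equal variances, the nominal F critical value rejects far too often, while the corrected value c_{k,m} of (5.3.29), with κ estimated by the method of moments, is much closer to level.

```python
import numpy as np
from scipy.stats import f, norm

rng = np.random.default_rng(8)
k = m = 200
alpha, reps = 0.05, 4000
f_crit = f.ppf(1 - alpha, k, m)
rej_f = rej_corrected = 0
for _ in range(reps):
    x = rng.standard_t(df=5, size=k + m)              # non-normal, both samples same variance
    t_km = np.mean(x[:k] ** 2) / np.mean(x[k:] ** 2)  # variance-ratio statistic (mean-zero data)
    z2 = (x / x.std(ddof=0)) ** 2
    kappa_hat = z2.var(ddof=0)                        # method-of-moments estimate of kappa
    c_km = 1 + np.sqrt(kappa_hat * (m + k) / (m * k)) * norm.ppf(1 - alpha)
    rej_f += t_km > f_crit
    rej_corrected += t_km > c_km
print(rej_f / reps, rej_corrected / reps)   # nominal F level drifts well above 0.05; corrected is near 0.05
```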
5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families

Our final application of the δ-method follows.

Theorem 5.3.5. Suppose P is a canonical exponential family of rank d generated by T, with E open. Then if X1, ..., Xn are a sample from P_η ∈ P and η̂ is defined as the MLE if it exists and equal to c (some fixed value) otherwise,

(i) η̂ = η + (1/n) Σⁿᵢ₌₁ Ä⁻¹(η)(T(Xᵢ) − Ȧ(η)) + o_{P_η}(n^{−1/2}),    (5.3.30)

(ii) L_η(√n(η̂ − η)) → N(0, Ä⁻¹(η)) = N(0, I⁻¹(η)).    (5.3.31)

Proof. We showed in the proof of Theorem 5.3.2 that, with T̄ = n⁻¹ Σⁿᵢ₌₁ T(Xᵢ), P_η[T̄ ∈ Ȧ(E)] → 1 and, hence, P_η[η̂ = Ȧ⁻¹(T̄)] → 1. Identify h in Theorem 5.3.4 with Ȧ⁻¹ and m with Ȧ(η); Ȧ⁻¹ has differential [Ä(η)]⁻¹ at Ȧ(η). Recall that Ä(η) = Var_η(T(X₁)) = I(η) is the Fisher information. Thus, (i) follows from part (a) of Theorem 5.3.4 applied to Ȧ⁻¹, and (ii) follows from part (c), because the asymptotic variance matrix of √n(η̂ − η) is Ä⁻¹(η) Ä(η) Ä⁻¹(η) = Ä⁻¹(η) = I⁻¹(η). □

Remark 5.3.3. The asymptotic variance matrix I⁻¹(η) of √n(η̂ − η) equals the information lower bound of Section 3.4 on the variance matrix of √n(η̃ − η) for any unbiased estimator η̃. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.2.1.

Example 5.3.8. Let X1, ..., Xn be i.i.d. as X with X ~ N(μ, σ²). Written as a canonical exponential family, the parameters are η₁ = μ/σ² and η₂ = −1/(2σ²), and T₁ = X̄ and T₂ = n⁻¹ Σⁿᵢ₌₁ Xᵢ² are sufficient statistics in the canonical model. By Theorem 5.3.5, the MLE η̂ = (η̂₁, η̂₂), with η̂₁ = X̄/σ̂² and η̂₂ = −1/(2σ̂²), where σ̂² = T₂ − T₁², satisfies L_η(√n(η̂ − η)) → N(0, I⁻¹(η)). Note that by the central limit theorem,

√n( (T₁, T₂) − (μ, μ² + σ²) ) →L N(0, Var(X, X²)).    (5.3.33)

Because X̄ = T₁ and σ̂² = T₂ − T₁², we can use (5.3.33) and Theorem 5.3.4 to find (Problem 5.3.26)

√n( (X̄, σ̂²) − (μ, σ²) ) →L N(0, Σ₀), where Σ₀ = diag(σ², 2σ⁴),

under i.i.d. N(μ, σ²) sampling. □
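A simulation sketch in Python (ours; the values of μ, σ², n, and the replication count are arbitrary) of Theorem 5.3.5 for Example 5.3.8: the empirical covariance of √n(η̂ − η) is compared with Ä⁻¹(η) = I⁻¹(η), where Ä(η) = Var_η(X, X²).

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma2, n, reps = 1.0, 2.0, 500, 4000
eta = np.array([mu / sigma2, -1 / (2 * sigma2)])

# Var(T) for T = (X, X^2), X ~ N(mu, sigma^2): this is A-double-dot(eta) = I(eta).
A2 = np.array([[sigma2, 2 * mu * sigma2],
               [2 * mu * sigma2, 2 * sigma2 ** 2 + 4 * mu ** 2 * sigma2]])

draws = []
for _ in range(reps):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    s2_hat = np.mean(x ** 2) - np.mean(x) ** 2      # sigma^2-hat = T2 - T1^2
    eta_hat = np.array([np.mean(x) / s2_hat, -1 / (2 * s2_hat)])
    draws.append(np.sqrt(n) * (eta_hat - eta))
print(np.cov(np.array(draws).T))    # empirical covariance of sqrt(n)(eta_hat - eta)
print(np.linalg.inv(A2))            # theoretical limit, the inverse Fisher information
```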
Summary. We began in Section 5.3.1 by studying approximations to moments and central moments of estimates. Fundamental asymptotic formulae are derived for the bias and variance of an estimate, first for a smooth function of a scalar mean and then of a vector mean. First-order asymptotics provides approximations to the difference between a quantity tending to a limit and the limit, for instance, the difference between a consistent estimate and the parameter it estimates; second-order asymptotics provides approximations to the difference between the error and its first-order approximation, and so on; consistency is 0th-order asymptotics. These "δ method" approximations, based on Taylor's formula and elementary results about moments of means of i.i.d. variables, are explained in terms of similar stochastic approximations to h(Ȳ) − h(μ), where Y1, ..., Yn are i.i.d. as Y, EY = μ, and h is smooth. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. The moment and in law approximations lead to the definition of variance stabilizing transformations for classical one-dimensional exponential families. Higher-order approximations to distributions (Edgeworth series) are discussed briefly. Finally, stochastic approximations in the case of vector statistics and parameters are developed, which lead to a result on the asymptotic normality of the MLE in multiparameter exponential families.

5.4 ASYMPTOTIC THEORY IN ONE DIMENSION

In this section we define and study asymptotic optimality for estimation, testing, and confidence bounds, under i.i.d. sampling, when we are dealing with one-dimensional smooth parametric models. Specifically, we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families.

5.4.1 Estimation: The Multinomial Case

Following Fisher (1958) we develop the theory first for the case that X1, ..., Xn are i.i.d. taking values in {x0, ..., x_k} only, so that the distribution of X1 is defined by

p ≡ (p0, ..., p_k), p_j = P[X1 = x_j],    (5.4.1)

which ranges over the (k + 1)-dimensional simplex S. We consider one-dimensional parametric submodels of S defined by P = {(p(x0, θ), ..., p(x_k, θ)) : θ ∈ Θ}, Θ open ⊂ R. We focus first on estimation of θ. Note that N = (N0, ..., N_k), where N_j = Σⁿᵢ₌₁ 1(Xᵢ = x_j), is sufficient. Assume

A : p(x_j, ·) is twice differentiable for 0 ≤ j ≤ k.
Note that A implies that

l(X1, θ) ≡ log p(X1, θ) = Σᵏⱼ₌₀ log p(x_j, θ) 1(X1 = x_j)

is twice differentiable, and ∂l/∂θ(X1, θ) is a well-defined, bounded random variable with

E_θ ∂l/∂θ(X1, θ) = Σᵏⱼ₌₀ (∂p/∂θ)(x_j, θ) = 0.

Furthermore (Section 3.4.2), ∂²l/∂θ²(X1, θ) is similarly bounded and well defined with

I(θ) ≡ E_θ( ∂l/∂θ(X1, θ) )² = −E_θ ∂²l/∂θ²(X1, θ).

As usual we call I(θ) the Fisher information. Next suppose we are given a plug-in estimator h(N/n) (see (2.1.11)) of θ, where h : S → R satisfies

h(p(θ)) = θ for all θ ∈ Θ,

with p(θ) = (p(x0, θ), ..., p(x_k, θ))ᵀ. Many such h exist if k ≥ 2. Assume

H : h is differentiable.

Then we have the following theorem.

Theorem 5.4.1. If A and H hold, then √n(h(N/n) − θ) is asymptotically normal with mean 0 and variance σ²(θ, h) given in the proof below. Moreover,

σ²(θ, h) ≥ I⁻¹(θ) for all θ,    (5.4.8)

with equality if and only if, for some a(θ) ≠ 0 and some b(θ),

(∂h/∂p_j)(p(θ)) = a(θ) (∂l/∂θ)(x_j, θ) + b(θ), j = 0, ..., k.
Proof. Apply Theorem 5.3.2: using the definition of N_j,

√n( h(N/n) − h(p(θ)) ) = Σᵏⱼ₌₀ (∂h/∂p_j)(p(θ)) √n( N_j/n − p(x_j, θ) ) + o_p(1),

so that √n(h(N/n) − h(p(θ))) is asymptotically normal with mean 0 and variance

σ²(θ, h) = Var_θ( Σᵏⱼ₌₀ (∂h/∂p_j)(p(θ)) [1(X1 = x_j) − p(x_j, θ)] ).

Note that by differentiating h(p(θ)) = θ we obtain

Σᵏⱼ₌₀ (∂h/∂p_j)(p(θ)) (∂p/∂θ)(x_j, θ) = 1,

or equivalently, because E_θ ∂l/∂θ(X1, θ) = 0,

Cov_θ( Σᵏⱼ₌₀ (∂h/∂p_j)(p(θ)) 1(X1 = x_j), (∂l/∂θ)(X1, θ) ) = 1.

Using the correlation inequality as in the proof of the information inequality of Section 3.4, and noting that Var_θ(∂l/∂θ(X1, θ)) = I(θ), we obtain 1 ≤ σ²(θ, h) I(θ), which implies (5.4.8). Equality holds iff, for some a(θ) ≠ 0 and some b(θ), with probability 1,

Σᵏⱼ₌₀ (∂h/∂p_j)(p(θ)) [1(X1 = x_j) − p(x_j, θ)] = a(θ) (∂l/∂θ)(X1, θ) + b(θ).

Taking expectations we get b(θ) = 0. Noting that the covariance of the right- and left-hand sides is a(θ)I(θ) = 1, their common variance is a²(θ)I(θ) = σ²(θ, h) = I⁻¹(θ). □
.0)) ~N (0. Xl.).Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:. the MLE of B where It is defined implicitly ~ by: h(p) is the value of O. Example 5. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case.5 applies to the MLE 0 and ~ e is open.8) is. E open C R and corresponding density/frequency functions p('..8) = exp{OT(x) .4.0 (5. .18) and k h(p) = [A]I LT(xj)pj )".(vn(B . 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena..3) L.4.A(O)}h(x) where h(x) = l(x E {xo.Xn are tentatively modeled to be distributed according to Po. Xk}) and rem 5.I L::i~1 T(Xi) = " k T(xJ )".19) The binomial (n. Write P = {P. if it exists and under regularity conditions.2.. Suppose p(x..2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates.. Note that because n N T = n .3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. Suppose i.d. which (i) maximizes L::7~o Nj logp(xj."j~O (5. J(~)) with the asymptotic variance achieving the infonnation bound Jl(B).3.::7 We shall see in Section 5.Oo) = E. OneParameter Discrete Exponential Families. . 0). . . 0) and (ii) solvesL::7~ONJgi(Xj. Let p: X x ~ R where e D(O. then.i. 0 E e. As in Theorem 5.17) L..xd). is a canonical oneparameter exponential family (supported on {xo.4. .4. 5.0)=0.4.3 that the information bound (5.4.o(p(X"O)  p(X"lJo )) . Then Theo(5.4..3. achieved by if = It (r. by ( 2... : E Ell.1. . In the next two subsections we consider some more general situations.
5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates

We begin with an asymptotic normality theorem for minimum contrast estimates. Suppose X1, ..., Xn are i.i.d. and tentatively modeled to be distributed according to P_θ, θ ∈ Θ, Θ open ⊂ R. Let ρ : X × Θ → R be such that

D(θ0, θ) ≡ E_{θ0}( ρ(X1, θ) − ρ(X1, θ0) )

is uniquely minimized at θ0. Let ψ = ∂ρ/∂θ and let θ̂n be the minimum contrast estimate

θ̂n = argmin (1/n) Σⁿᵢ₌₁ ρ(Xᵢ, θ).

Suppose

A0: ψ(·, θ) is well defined and (1/n) Σⁿᵢ₌₁ ψ(Xᵢ, θ̂n) = 0.    (5.4.20)

A1: The parameter θ(P) given by the solution of

∫ ψ(x, θ) dP(x) = 0    (5.4.21)

is well defined on P; that is, θ(P) is the unique solution of (5.4.21), and θ(P_θ) = θ.

A2: E_P ψ²(X1, θ(P)) < ∞ for all P ∈ P.

A3: ψ(·, θ) is differentiable, (∂ψ/∂θ)(X1, θ(P)) has a finite expectation and E_P (∂ψ/∂θ)(X1, θ(P)) ≠ 0.

A4: sup_t { | (1/n) Σⁿᵢ₌₁ ( (∂ψ/∂θ)(Xᵢ, t) − (∂ψ/∂θ)(Xᵢ, θ(P)) ) | : |t − θ(P)| ≤ εn } →P 0 if εn → 0.

A5: θ̂n →P θ(P); that is, θ̂n is consistent on P = {P_θ : θ ∈ Θ}.

In what follows we let P, rather than P_θ, denote the distribution of Xᵢ. This is because, as we saw in Section 2.1 and as pointed out later in Remark 5.4.1, under regularity conditions the properties developed in this section are valid for larger classes of distributions than the model P = {P_θ : θ ∈ Θ}; we need only that θ(P) is a parameter as defined in Section 1.1.2 and that θ(P) is the unique solution of (5.4.21).

Theorem 5.4.2. Under A0–A5,

θ̂n = θ(P) + (1/n) Σⁿᵢ₌₁ ψ̄(Xᵢ, θ(P)) + o_P(n^{−1/2}),    (5.4.22)

where ψ̄(x, θ(P)) = −ψ(x, θ(P)) / E_P (∂ψ/∂θ)(X1, θ(P)), and, hence,

√n(θ̂n − θ(P)) →L N(0, σ²(ψ, P)),    (5.4.24)

where σ²(ψ, P) = E_P ψ²(X1, θ(P)) / [ E_P (∂ψ/∂θ)(X1, θ(P)) ]².
Proof. Claim (5.4.24) follows from (5.4.22) by the central limit theorem and Slutsky's theorem, because E_P ψ̄(X1, θ(P)) = 0 and E_P ψ̄²(X1, θ(P)) = σ²(ψ, P) < ∞ by A1 and A2. Next, (5.4.22) follows by a Taylor expansion of the equations (5.4.20) around θ(P):

0 = (1/n) Σⁿᵢ₌₁ ψ(Xᵢ, θ̂n) = (1/n) Σⁿᵢ₌₁ ψ(Xᵢ, θ(P)) + [ (1/n) Σⁿᵢ₌₁ (∂ψ/∂θ)(Xᵢ, θn*) ] (θ̂n − θ(P)),    (5.4.25)

where |θn* − θ(P)| ≤ |θ̂n − θ(P)|. Apply A5 and A4 to conclude that

(1/n) Σⁿᵢ₌₁ (∂ψ/∂θ)(Xᵢ, θn*) = (1/n) Σⁿᵢ₌₁ (∂ψ/∂θ)(Xᵢ, θ(P)) + o_P(1),    (5.4.26)

and A3 and the WLLN to conclude that

(1/n) Σⁿᵢ₌₁ (∂ψ/∂θ)(Xᵢ, θ(P)) →P E_P (∂ψ/∂θ)(X1, θ(P)) ≠ 0.    (5.4.27)

But by the central limit theorem and A1, A2,

(1/n) Σⁿᵢ₌₁ ψ(Xᵢ, θ(P)) = O_P(n^{−1/2}).    (5.4.28)

Combining (5.4.25)–(5.4.28) and dividing by the second factor in (5.4.25), we finally obtain

θ̂n − θ(P) = −(1/n) Σⁿᵢ₌₁ ψ(Xᵢ, θ(P)) / E_P (∂ψ/∂θ)(X1, θ(P)) + o_P( n^{−1/2} ),    (5.4.29)

so that (5.4.22) follows from the foregoing and (5.4.29). □
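As a numerical sketch (ours, not from the text), the plug-in version of σ²(ψ, P) from Theorem 5.4.2 can be computed by replacing the two expectations with sample averages at θ̂n. The Huber ψ, the bracketing interval, and the t(3) data below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

def psi_huber(x, theta, k=1.345):
    # Huber's psi applied to residuals x - theta (bounded, monotone M-estimating function).
    return np.clip(x - theta, -k, k)

def dpsi_huber(x, theta, k=1.345):
    # Derivative of psi with respect to theta: -1 where |x - theta| <= k, 0 elsewhere.
    return -(np.abs(x - theta) <= k).astype(float)

def m_estimate_and_se(x):
    # Solve (1/n) sum psi(X_i, theta) = 0, then estimate
    # sigma^2(psi, P) = E psi^2 / (E dpsi/dtheta)^2 by sample averages (Theorem 5.4.2).
    theta_hat = brentq(lambda t: psi_huber(x, t).mean(), x.min() - 1, x.max() + 1)
    num = (psi_huber(x, theta_hat) ** 2).mean()
    den = dpsi_huber(x, theta_hat).mean() ** 2
    return theta_hat, np.sqrt(num / den / len(x))

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=200)      # heavy-tailed data, assumed for illustration
print(m_estimate_and_se(x))             # M-estimate and approximate standard error
```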
4. .4. 0 • .1. Remark 5.: 1. O(P)) ) and (5. AI. we see that A6 corresponds to (5.4.4. =0 (5.30) is formally obtained by differentiating the equation (5. O)dp.12).4. 0) = 6 (x) 0.2.Eo (XI. it is easy to see that A6 is the same as (3. 1 •• I ! .4. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5.3.28) we tinally obtain On .e.?jJ(XI'O)).. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I .O)?jJ(X I . We conclude by stating some sufficient conditions.! .O).O(P) Ik = n n ?jJ(Xi .4. 0) l' P = {Po: or more generally. B» and a suitable replacement for A3. for Mestimates. . A2.2 may hold even if ?jJ is not differentiable provided that . Remark 5.2 is valid with O(P) as in (I) or (2). • Theorem 5.=0 ?jJ(x..1. e) is as usual a density or frequency function. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. 0) is replaced by Covo(?jJ(X" 0). A4 is found.4.4. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i. Solutions to (5. for A4 and A6.20) are called M~estimates. Conditions AO.4. Our arguments apply even if Xl. I o Remark 5.i. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. This extension will be pursued in Volume2.4. O(P)) + op (I k n n ?jJ(Xi .8).4. I 'iiIf 1 I • I I . (2) O(P) solves Ep?jJ(XI.30) Note that (5.d.'. en j• . 0) = logp(x.4.4.O) = O.2.29). A6: Suppose P = Po so that O(P) = 0.22) follows from the foregoing and (5. that 1/J = ~ for some p). and we define h(p) = 6(xj )Pj. X n are i. Identity (5.(x) (5. . Z~ (Xl.. and that the model P is regular and letl(x. B) where p('. This is in fact truesee Problem 5. {xo 1 ••• 1 Xk}. If further Xl takes on a finite set of values. P but P E a}.4. written as J i J 2::.31) for all O. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I . O)p(x.4.4.21). 0) Covo (:~(Xt. n I _ . Our arguments apply to Mestimates. essentially due to Cramer (1946).30) suggests that ifP is regular the conclusion of Theorem 5. and A3 are readily checkable whereas we have given conditions for AS in Section 5.
B).8) is a continuous function of8 for all x.4. .4.1). In this case is the MLE and we obtain an identity of Fisher's. w. Pe) > 1(0) 1 (5... We can now state the basic result on asymptotic normality and e~ciency of the MLE. Theorem 5. 0) obeys AOA6.01 < J( 0) and J:+: JI'Ii:' (x.33) so that (5. 5.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI. "dP(x) = p(x)diL(x). lO.:d in Section 3. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.Section 5.4.4. 0') .4.3. 0) and ." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.O) 10gp(x.4) but A4' and A6' are not necessary (Problem 5. where EeM(Xl.p(x.4. then the MLE On (5..O) and P ~ ~ Pe.p = a(0) g~ for some a 01 o. AOA6.O) = l(x. < M(Xl.4. 10' .0'1 < J(O)} 00.O).35) with equality iff. We also indicate by example in the problems that some conditions are needed (Problem 5.32) where 1(8) is t~e Fisher information introduq.O) = l(x. 0) < A6': ~t (x. That is. s) diL(X )ds < 00 for some J ~ J(O) > 0.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5.20) occurs when p(x. 0) .34) Furthermore. where iL(X) is the dominating measure for P(x) defined in (A. = en en = &21 Ee e02 (Xl. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl.p. satisfies If AOA6 apply to p(x.4. 0') is defined for all x.eo (Xl. 0) g~ (x. 0) (5.4.4.
B) is an MLR family in T(X). Therefore. For this estimate superefficiency implies poor behavior of at values close to 0.4.4.2.4. claim (5.35). cross multiplication shows that (5. .2.3 generalizes Example 5. Hodges's Example.4 Testing ! I i .. If B = 0. Then . We next compule the limiting distribution of .Xn be i. By (5.4.4.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo). B) with J T(x) .nB) . . We discuss this further in Volume II. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. 0 e en e. 0 because nIl' .4. Consider the following competitor to X: B n I 0 if IXI < n.33) and (5.3 is not valid without some conditions on the esti· mates being considered. {X 1. PoliXI < n 1/4 ] .. I I Example 5. I. and Polen = OJ . Let Z ~ N(O.30) and (5.'(0) = 0 < l(~)' The phenomenon (5. and.37) . 442..1/4 X if!XI > n.4.4.4.4.4.nB).36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. . N(B. we know that all likelihood ratio tests for simple BI e e eo.. (5..".d.35) is equivalent to (5.. 0 1 . for some Bo E is known as superefficiency.4.. PIIZ + .n(Bn .. Then X is the MLE of B and it is trivial to calculate [( B) _ 1..36) Because Eo 'lj.• •.'(B) = I = 1(~I' B I' 0. 1. . .i.i . (5.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5..4. 1). j 1 " Note that Theorem 5.4.1 once we identify 'IjJ(x. . 1998.. PolBn = Xl . PoliXI < n1/'1 .(X B). .1/4 I (5.ne .34) follow directly by Theorem 5. The optimality part of Theorem 5.nBI < nl/'J <I>(n l/ ' ...l'1 Let X" . .<1>( _n '/4 .A'(B). . We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n..1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise.439) where . p. . thus... However. B) = 0. see Lehmann and Casella. . 00. 5.B). if B I' 0.. . The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(. 1. . 1 . 1 . 1 I I.1)..4.38) Therefore.4. }.
5.4.4 Testing

The major testing problem if θ is one-dimensional is H : θ ≤ θ0 versus K : θ > θ0. If p(·, θ) is an MLR family in T(X), we know that all likelihood ratio tests for simple θ0 versus simple θ1 > θ0, as well as the likelihood ratio test for H versus K, are of the form "Reject H for T(X) large" with the critical value specified by making the probability of type I error α at θ0. If p(·, θ) is a one-parameter exponential family in θ generated by T(X), this test can also be interpreted as a test of H : λ ≤ λ0 versus K : λ > λ0, where λ = A′(θ), because A is strictly increasing.

It seems natural in general to study the behavior of the test "Reject H if θ̂n ≥ cn(α, θ0)," where P_{θ0}[θ̂n ≥ cn(α, θ0)] = α and θ̂n is the MLE. We will use asymptotic theory to study the behavior of this test when we observe i.i.d. X1, ..., Xn distributed according to P_θ, θ ∈ (a, b), a < θ0 < b, derive an optimality property, and then directly and through problems exhibit other tests with the same behavior.

Theorem 5.4.4. Suppose the model P = {P_θ : θ ∈ Θ} is such that the conditions of Theorem 5.4.2 apply to ψ = ∂l/∂θ and θ̂n, the MLE. Suppose (A4') holds as well as (A6) and I(θ) < ∞ for all θ. Then, if θ > θ0,

P_θ[θ̂n ≥ cn(α, θ0)] → 1;    (5.4.42)

if θ < θ0, P_θ[θ̂n ≥ cn(α, θ0)] → 0; and

cn(α, θ0) = θ0 + z_{1−α}/√(nI(θ0)) + o(n^{−1/2}),    (5.4.41)

where z_{1−α} is the 1 − α quantile of the N(0, 1) distribution.

Proof. By Theorem 5.4.3,

L_θ(√n(θ̂n − θ)) → N(0, I⁻¹(θ)) for all θ.    (5.4.40)

Thus,

P_{θ0}[θ̂n ≥ θ0 + z_{1−α}/√(nI(θ0))] = P_{θ0}[√(nI(θ0))(θ̂n − θ0) ≥ z_{1−α}] → α.

On the other hand, Polya's theorem (A.14.22) guarantees that

sup_z | P_{θ0}[√(nI(θ0))(θ̂n − θ0) ≤ z] − Φ(z) | → 0,

which implies that √(nI(θ0))(cn(α, θ0) − θ0) → z_{1−α}, and (5.4.41) follows. If θ > θ0, then

P_θ[θ̂n ≥ cn(α, θ0)] = P_θ[√(nI(θ))(θ̂n − θ) ≥ √(nI(θ))(cn(α, θ0) − θ)] → 1

because √(nI(θ))(cn(α, θ0) − θ) → −∞; similarly the probability tends to 0 if θ < θ0. □

Property (5.4.42) is sometimes called consistency of the test against a fixed alternative.
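A Python sketch (ours; the Poisson model, θ0, n, and replication count are arbitrary assumptions) of the test of Theorem 5.4.4, using the approximate critical value cn(α, θ0) ≈ θ0 + z_{1−α}/√(nI(θ0)) with I(θ) = 1/θ for the Poisson family:

```python
import numpy as np
from scipy.stats import norm

def wald_test_poisson(x, theta0, alpha=0.05):
    # Reject H: theta <= theta0 when the MLE (the sample mean) exceeds
    # c_n(alpha, theta0) ~ theta0 + z_{1-alpha}/sqrt(n I(theta0)), I(theta) = 1/theta.
    n = len(x)
    c_n = theta0 + norm.ppf(1 - alpha) * np.sqrt(theta0 / n)
    return x.mean() >= c_n

rng = np.random.default_rng(3)
theta0, n = 1.0, 100
for theta in (1.0, 1.1, 1.3):
    rejections = np.mean([wald_test_poisson(rng.poisson(theta, n), theta0)
                          for _ in range(5000)])
    print(theta, rejections)   # level near alpha at theta0, power increasing in theta
```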
j (5.4.80 ) tends to zero.51) I 1 I I . 00 if8 > 80 0 and .4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 . In fact.jnI(8)(80 .5. f. .8) > . j ..jI(80)) ill' < O.(80) > O.4. 80)J 1 P.jnI(80) +0(n. I.4.50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5..4.4. That is.017".2 and (5. If "m(8 .   j Proof.k'Pn(X1.jnI(8)(Bn . In either case.jnI(8)(80 . then limnE'o+.80 1 < «80 )) I J . ~ " uniformly in 'Y. .1/ 2 ))]..4.4.4.jnI(8)(Cn(0'. Write Po[Bn > cn(O'. (5. then by (5.8) < z] .80) tends to infinity.8 + zl_a/. (3) = Theorem 5.4. 00 if 8 < 80 . (5. 80 )  8) .42) and (5. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B . iflpn(X1 .jnI(8)(Bn .41).48). ~! i • Note that (5.50). Furthennore. .4. On the other hand.48) and (5.···. LetQ) = Po. .jnI(8)(80 . assume sup{IP.80) ~ 8)J Po[.J 334 Asymptotic Approximations Chapter 5 By (5.jnI(O)(Bn . Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 . . Suppose the conditions afTheorem 5. . the power of the test based on 8n tends to I by (5.4.4.48) . 0.4. .49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5. o if "m(8 . the power of tests with asymptotic level 0' tend to 0'. I .jnI(80) + 0(n.49) ! i . Theorem 5.(1 lI(z))1 : 18 .43) follow. the test based on 8n is still asymptotically MP. < 1 lI(Zl_a ")'.4. X n ) i.47) lor.[.[..4.1 / 2 )) .8) + 0(1) . (5.jI(80)) ill' > 0 > llI(zl_a ")'.")' ~ "m(8  80 ).4. Claims (5.80 ).50) i.8 + zl_o/. 1 ~. X n ) is any sequence of{possibly randomized) critical (test) functions such that .4.jnI(8)(Cn(0'.40) hold uniformly for (J in a neighborhood of (Jo.  i.50)) and has asymptotically smallest probability of type I error for B < Bo.8) > . then (5.4.
is asymptotically most powerful as well. X n . 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5.7).80 ) < n p (Xi.4.Xn ) = 5Wn (X" .[logPn( Xl. € > 0 rejects for large values of z1.53) tends to the righthand side of (5.4.OO)J . " Xn. Llog·· (X 0) =dn (Q..54) Assertion (5. .4 and 5.48) follows. hence. There are two other types of test that have the same asymptotic behavior.. . Thus. . It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i . note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t. if I + 0(1)) (5. Q).52) > 0. ... for Q < ~.4. X. for all 'Y. 0) denotes the density of Xi and dn .4.. 00 + E) logPn(X l E I " .O+.50) and.OO+7n) +EnP. is O.4.8) 1. hand side of (5. 1.4. 0 (5.4.4 Asymptotic Theory in One Dimension 335 If.4.4. P.. The details are in Problem 5.?.. ...j1i(8  8 0 ) is fixed.54) establishes that the test 0Ln yields equality in (5..o + 7..Section 5." + 0(1) and that It may be shown (Problem 5. ~ .Xn) is the critical function of the LR test then. > dn (Q. . Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5.50) for all. To prove (5. . 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n.5. .jI(Oo)) + 0(1)) and (5.)] ~ L (5. t n are uniquely chosen so that the right..4.4.<P(Zl_a . i=1 P 1.. Finally. k n (80 ' Q) ~ if OWn (Xl. ..4.jn(I(8o) + 0(1))(80 . The test li8n > en (Q. These are the likelihood ratio test and the score or Rao test.4.53) where p(x.<P(Zla(1 + 0(1)) + . 0 n p(Xi .4.53) is Q if. 00)]1(8n > 80 ) > kn(Oo.' L10g i=1 p (X 8) 1. [5In (X" .5 in the future will be referred to as a Wald test.50) note that by the NeymanPearson lemma. 8 + fi) 0 P'o+:. 80)] of Theorems 5.4.8) that.. .8 0 ) .Xn) is the critical function of the Wald test and oLn (Xl.
where pn(X1, ..., Xn, θ) is the joint density of X1, ..., Xn. For ε small and n fixed, this is approximately the same as rejecting for large values of (∂/∂θ0) log pn(X1, ..., Xn, θ0).

The preceding argument doesn't depend on the fact that X1, ..., Xn are i.i.d. with common density or frequency function p(x, θ), and the test that rejects H for large values of (∂/∂θ0) log pn(X1, ..., Xn, θ0) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
t
i=I
iJ~
: ,
logp(X i ,90) > Tn (a,9 0)."
0
I I
(5.4.55)
It is easy to see (Problem 5.4.15) that

τn(α, θ0) = z_{1−α} √(nI(θ0)) + o(n^{1/2}),

and that again, if δ_{Rn}(X1, ..., Xn) is the critical function of the Rao test, then

P_{θ0+γ/√n}[ δ_{Rn}(X1, ..., Xn) = δ_{Wn}(X1, ..., Xn) ] → 1    (5.4.56)
(Problem 5.4.8), so the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(θ0), which may require numerical integration, can be replaced by −n⁻¹(d²/dθ²) ln(θ̂n) (Problem 5.4.10).
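A sketch (ours; the Poisson setting and all settings are illustrative) of the Rao test (5.4.55): for the Poisson family the score is Σ(Xᵢ/θ0 − 1), which is compared with τn(α, θ0) ≈ z_{1−α}√(nI(θ0)), I(θ0) = 1/θ0.

```python
import numpy as np
from scipy.stats import norm

def rao_test_poisson(x, theta0, alpha=0.05):
    # Score for Poisson(theta): d/dtheta log p(x, theta) = x/theta - 1.
    score = np.sum(x / theta0 - 1.0)
    tau_n = norm.ppf(1 - alpha) * np.sqrt(len(x) / theta0)   # z_{1-alpha} * sqrt(n I(theta0))
    return score >= tau_n

rng = np.random.default_rng(4)
x = rng.poisson(1.2, size=100)
print(rao_test_poisson(x, theta0=1.0))
```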
5.4.5 Confidence Bounds
We define an asymptotic level 1 − α lower confidence bound (LCB) θ̲n by the requirement that

P_θ[ θ̲n ≤ θ ] ≥ 1 − α + o(1)    (5.4.57)
for all θ, and similarly define asymptotic level 1 − α UCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:

(i) By using a natural pivot.

(ii) By inverting the testing regions derived in Section 5.4.4.
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (A0)–(A6), (A4'), and I(θ) finite for all θ, it follows (Problem 5.4.9) that

L_θ( √(nI(θ̂n)) (θ̂n − θ) ) → N(0, 1)    (5.4.58)

for all θ and, hence, an asymptotic level 1 − α lower confidence bound is given by

θ̲n* = θ̂n − z_{1−α} / √(nI(θ̂n)).    (5.4.59)
Turning to method (ii), inversion of δ_{Wn} gives formally

θ̲n1* = inf{ θ : cn(α, θ) ≥ θ̂n }    (5.4.60)

or, if we use the approximation cn(α, θ) ≈ θ + z_{1−α}/√(nI(θ)) of (5.4.41),

θ̲n2* = inf{ θ : θ + z_{1−α}/√(nI(θ)) ≥ θ̂n }.    (5.4.61)
In fact, neither θ̲n1* nor θ̲n2* properly inverts the tests unless cn(α, θ) and its approximation are increasing in θ. The three bounds are different, as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, θ̲n1* is preferable because this bound is not only approximately but genuinely level 1 − α. But computationally it is often hard to implement because cn(α, θ) needs, in general, to be computed by simulation for a grid of θ values. Typically, (5.4.59) or some equivalent alternatives (Problem 5.4.10) are preferred but can be quite inadequate (Problem 5.4.11). These bounds, θ̲n*, θ̲n1*, θ̲n2*, are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5.4.12 and 5.4.13).
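A sketch (ours; the Poisson family, θ, n, and the simulation size are arbitrary) of the bound θ̲n* of (5.4.59), using I(θ̂n) = 1/θ̂n, with a Monte Carlo coverage check:

```python
import numpy as np
from scipy.stats import norm

def wald_lcb_poisson(x, alpha=0.05):
    # theta_hat - z_{1-alpha} / sqrt(n I(theta_hat)), with I(theta) = 1/theta for the Poisson.
    n, theta_hat = len(x), np.mean(x)
    return theta_hat - norm.ppf(1 - alpha) * np.sqrt(theta_hat / n)

rng = np.random.default_rng(5)
theta, n = 2.0, 50
cover = np.mean([wald_lcb_poisson(rng.poisson(theta, n)) <= theta for _ in range(5000)])
print(cover)   # should be close to 0.95
```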
Summary. We have defined asymptotic optimality for estimates in one-parameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of θ in a one-dimensional subfamily of the multinomial distributions, showed that the MLE formally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M-estimates, generalizations of the MLE, along the lines of Huber (1967). The asymptotic formulae we derived are applied to the MLE both under the model that led to it and under an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980).
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n → ∞ in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4: i.i.d. observations from a regular model in which Θ is open ⊂ R or Θ = {θ1, ..., θk} is finite, and θ is identifiable. Most of the questions we address and answer are under the assumption that the parameter ϑ, viewed as random, equals θ, an arbitrary specified value, or, in frequentist terms, that θ is true.

Consistency

The first natural question is whether the Bayes posterior distribution, as n → ∞, concentrates all mass more and more tightly around θ. Intuitively this means that the data coming from P_θ eventually wipe out any prior belief that parameter values not close to θ are likely. Formalizing this statement about the posterior distribution, Π(· | X1, ..., Xn), which is a function-valued statistic, is somewhat subtle in general. But for Θ = {θ1, ..., θk} it is straightforward. Let

π(θ | X1, ..., Xn) = P[ϑ = θ | X1, ..., Xn].    (5.5.1)
Then we say that Π(· | X1, ..., Xn) is consistent iff for all θ ∈ Θ,

P_θ[ |π(θ | X1, ..., Xn) − 1| ≥ ε ] → 0 for all ε > 0.    (5.5.2)

There is a slightly stronger definition: Π(· | X1, ..., Xn) is a.s. consistent iff for all θ ∈ Θ,

π(θ | X1, ..., Xn) → 1 a.s. P_θ.    (5.5.3)

General a.s. consistency is not hard to formulate:

Π(· | X1, ..., Xn) ⇒ δ_{θ} a.s. P_θ,    (5.5.4)

where ⇒ denotes convergence in law and δ_{θ} is point mass at θ. There is a completely satisfactory result for Θ finite.
Theorem 5.5.1. Let πj = P[ϑ = θj], j = 1, ..., k, denote the prior distribution of ϑ. Then Π(· | X1, ..., Xn) is consistent (a.s. consistent) iff πj > 0 for j = 1, ..., k.
Proof. Let p(·, θ) denote the frequency function of X1. The necessity of the condition is immediate because πj = 0 for some j implies that π(θj | X1, ..., Xn) = 0 for all X1, ..., Xn because, by (1.2.8),

π(θj | X1, ..., Xn) = P[ϑ = θj | X1, ..., Xn] = πj ∏ⁿᵢ₌₁ p(Xᵢ, θj) / Σᵏₐ₌₁ πₐ ∏ⁿᵢ₌₁ p(Xᵢ, θₐ).    (5.5.5)
Intuitively, no amount of data can convince a Bayesian who has decided a priori that θj is impossible. On the other hand, suppose all πj are positive. If the true θ is θj, or equivalently ϑ = θj, then

(1/n) log [ π(θₐ | X1, ..., Xn) / π(θj | X1, ..., Xn) ] = (1/n) log(πₐ/πj) + (1/n) Σⁿᵢ₌₁ log [ p(Xᵢ, θₐ)/p(Xᵢ, θj) ].
By the weak (respectively strong) LLN, under P_{θj},

(1/n) Σⁿᵢ₌₁ log [ p(Xᵢ, θₐ)/p(Xᵢ, θj) ] → E_{θj}( log [ p(X1, θₐ)/p(X1, θj) ] )

in probability (respectively a.s.). But E_{θj}( log [ p(X1, θₐ)/p(X1, θj) ] ) < 0, by Shannon's inequality, if θₐ ≠ θj. Therefore,

log [ π(θₐ | X1, ..., Xn) / π(θj | X1, ..., Xn) ] → −∞

in the appropriate sense, and the theorem follows. □
Remark 5.5.1. We have proved more than is stated. Namely, that for each θ ∈ Θ, P_θ[ϑ ≠ θ | X1, ..., Xn] → 0 exponentially.
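A quick numerical check (ours; the two-point parameter space, the prior, and the sample sizes are arbitrary assumptions) of Theorem 5.5.1: with both prior weights positive, the posterior probability of the true value tends to 1.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
thetas = np.array([0.0, 0.5])          # finite parameter space {theta_1, theta_2}
prior = np.array([0.5, 0.5])           # both prior probabilities positive
true_theta = 0.5

for n in (10, 100, 1000):
    x = rng.normal(true_theta, 1.0, size=n)
    loglik = np.array([norm.logpdf(x, t, 1.0).sum() for t in thetas])
    post = prior * np.exp(loglik - loglik.max())
    post /= post.sum()
    print(n, post[1])                   # posterior mass on the true value -> 1
```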
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions A0–A6 for ρ(x, θ) = −l(x, θ) = −log p(x, θ), we showed in Section 5.4 that if θ̂ is the MLE,

L_θ( √n(θ̂ − θ) ) → N(0, I⁻¹(θ)).    (5.5.6)
Consider L(√n(ϑ − θ̂) | X1, ..., Xn), the posterior probability distribution of √n(ϑ − θ̂(X1, ..., Xn)), where we emphasize that θ̂ depends only on the data and is a constant given X1, ..., Xn. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in P_θ probability by convergence a.s. P_θ. We also add,



A7: For all θ and all δ > 0 there exists ε(δ, θ) > 0 such that

P_θ[ sup{ (1/n) Σⁿᵢ₌₁ [ l(Xᵢ, θ′) − l(Xᵢ, θ) ] : |θ′ − θ| ≥ δ } ≤ −ε(δ, θ) ] → 1.

A8: The prior distribution has a density π(·) on Θ such that π(·) is continuous and positive at all θ.

Remarkably,
Theorem 5.5.2 ("Bernstein/von Mises"). If conditions A0–A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then

L( √n(ϑ − θ̂) | X1, ..., Xn ) → N(0, I⁻¹(θ))    (5.5.7)

a.s. under P_θ for all θ.

We can rewrite (5.5.7) more usefully as

sup_x | P[√n(ϑ − θ̂) ≤ x | X1, ..., Xn] − Φ(x√I(θ)) | → 0    (5.5.8)
for all θ, a.s. P_θ, and, of course, the statement holds for our usual and weaker convergence in P_θ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,

sup_x | P[√n(ϑ − θ̂) ≤ x | X1, ..., Xn] − Φ(x√I(θ̂)) | → 0    (5.5.9)

a.s. P_θ for all θ.
Remarks
(1) Statements (5.5.4) and (5.5.7)–(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P_θ probability if A4 and A5 are used rather than their strong forms—see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of θ̂ in a regular model.

Proof. We compute the posterior density of √n(ϑ − θ̂) as

qn(t) = dn⁻¹ π(θ̂ + t/√n) ∏ⁿᵢ₌₁ p(Xᵢ, θ̂ + t/√n),    (5.5.10)

where dn = dn(X1, ..., Xn) is the normalizing constant. Divide top and bottom of (5.5.10) by ∏ⁿᵢ₌₁ p(Xᵢ, θ̂) to obtain

dn qn(t) = π(θ̂ + t/√n) exp{ Σⁿᵢ₌₁ [ l(Xᵢ, θ̂ + t/√n) − l(Xᵢ, θ̂) ] },    (5.5.11)

where l(x, θ) = log p(x, θ) and dn is redefined accordingly. We claim that

dn qn(t) → π(θ) exp{ −t²I(θ)/2 } a.s. P_θ, for all t,    (5.5.12)
for all θ. To establish this note that

(a) sup{ |π(θ̂ + t/√n) − π(θ)| : |t| ≤ M } → 0 a.s. for all M, because θ̂ is a.s. consistent and π is continuous.

(b) Expanding,

Σⁿᵢ₌₁ [ l(Xᵢ, θ̂ + t/√n) − l(Xᵢ, θ̂) ] = (t²/2n) Σⁿᵢ₌₁ (∂²l/∂θ²)(Xᵢ, θ*(t)),    (5.5.13)
where |θ*(t) − θ̂| ≤ |t|/√n. We use Σⁿᵢ₌₁ (∂l/∂θ)(Xᵢ, θ̂) = 0 here. By A4(a.s.) and A5(a.s.),

sup{ | (1/n) Σⁿᵢ₌₁ (∂²l/∂θ²)(Xᵢ, θ*(t)) − (1/n) Σⁿᵢ₌₁ (∂²l/∂θ²)(Xᵢ, θ̂) | : |t| ≤ M } → 0,

for all M, a.s. P_θ. Using (5.5.13), the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),

P_θ[ dn qn(t) → π(θ) exp{ E_θ( (∂²l/∂θ²)(X1, θ) ) t²/2 } for all t ] = 1.    (5.5.14)

Using A6 we obtain (5.5.12).
Now consider

dn = ∫_{−∞}^{∞} dn qn(s) ds = ∫_{|s|<δ√n} dn qn(s) ds + √n ∫ π(t) exp{ Σⁿᵢ₌₁ ( l(Xᵢ, t) − l(Xᵢ, θ̂) ) } 1(|t − θ̂| ≥ δ) dt.    (5.5.15)
By A5 and A7,

P_θ[ sup{ exp{ Σⁿᵢ₌₁ ( l(Xᵢ, t) − l(Xᵢ, θ̂) ) } : |t − θ̂| ≥ δ } ≤ e^{−nε(δ,θ)} ] → 1    (5.5.16)

for all δ, so that the second term in (5.5.15) is bounded by √n e^{−nε(δ,θ)} → 0 a.s. P_θ for all δ > 0. Finally, note that (Problem 5.5.4), by arguing as for (5.5.14), there exists δ(θ) > 0 such that

P_θ[ dn qn(t) ≤ 2π(θ) exp{ E_θ( (∂²l/∂θ²)(X1, θ) ) t²/4 } for all |t| ≤ δ(θ)√n ] → 1.    (5.5.17)

By (5.5.15) and (5.5.16), for all δ > 0,

P_θ[ dn − ∫_{|s|<δ√n} dn qn(s) ds → 0 ] = 1.    (5.5.18)
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dn qn(s) 1(|s| ≤ δ(θ)√n), using (5.5.14) and (5.5.17), to conclude that, a.s. P_θ,

dn → π(θ) ∫_{−∞}^{∞} exp{ −s²I(θ)/2 } ds = π(θ)√(2π)/√I(θ).    (5.5.19)

Hence, a.s. P_θ,

qn(t) → √I(θ) φ(t√I(θ)),
where φ is the standard Gaussian density, and the theorem follows from Scheffé's Theorem B.7.6 and Proposition B.7.2. □

Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose, as in Example 3.2.1, we have observations from a N(θ, σ²) distribution with σ² known and we put a N(η, τ²) prior on ϑ. Then the posterior distribution of ϑ is N( ω1n η + ω2n X̄, (1/τ² + n/σ²)⁻¹ ), where

ω1n = σ²/(nτ² + σ²), ω2n = 1 − ω1n.    (5.5.20)
Evidently, as n → ∞, ω1n → 0, X̄ → θ a.s. if ϑ = θ, and (1/τ² + n/σ²)⁻¹ → 0. That is, the posterior distribution has mean approximately θ̂ and variance approximately 0 for n large, or equivalently the posterior is close to point mass at θ̂, as we expect from Theorem 5.5.1. Because θ̂ = X̄, √n(ϑ − θ̂) has posterior distribution

N( √n ω1n(η − X̄), n(1/τ² + n/σ²)⁻¹ ).

Now, √n ω1n = O(n^{−1/2}) = o(1) and n(1/τ² + n/σ²)⁻¹ → σ² = I⁻¹(θ), and we have directly established the conclusion of Theorem 5.5.2. □
Example 5.5.2. Posterior Behavior in the Binomial–Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, θ), distribution, or equivalently we observe X1, ..., Xn i.i.d. Bernoulli(1, θ), and put a beta, β(r, s), prior on ϑ, then, as in Example 3.2.3, ϑ has posterior β(Sn + r, n + s − Sn). We have shown in Problem 5.3.20 that if U_{a,b} has a β(a, b) distribution, then as a → ∞, b → ∞,

[ (a + b)³ / ab ]^{1/2} ( U_{a,b} − a/(a + b) ) →L N(0, 1).    (5.5.21)

If 0 < θ < 1 is true, Sn/n → θ a.s., so that Sn + r → ∞ and n + s − Sn → ∞ a.s. P_θ. By identifying a with Sn + r and b with n + s − Sn, we conclude after some algebra that, because θ̂ = X̄,

√n(ϑ − X̄) →L N(0, θ(1 − θ))

a.s. P_θ, as claimed by Theorem 5.5.2. □
Bayesian optimality of optimal frequentist procedures and frequentist optimality of Bayesian procedures

Theorem 5.5.2 has two surprising consequences.

(a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.

(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions.

As an example of this phenomenon consider the following.

Theorem 5.5.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let θ̂ be the MLE of θ and let θ* be the median of the posterior distribution of ϑ. Then

(i)

√n(θ* − θ̂) → 0    (5.5.22)

a.s. P_θ for all θ. Consequently,

θ* = θ̂ + (1/n) Σⁿᵢ₌₁ I⁻¹(θ) (∂l/∂θ)(Xᵢ, θ) + o_{P_θ}(n^{−1/2})    (5.5.23)

and L_θ( √n(θ* − θ) ) → N(0, I⁻¹(θ)).
(ii)

E( √n( |ϑ − θ̂| − |ϑ − θ*| ) | X1, ..., Xn ) → 0    (5.5.24)

and

E( √n( |ϑ − θ̂| − |ϑ| ) | X1, ..., Xn ) = min_d E( √n( |ϑ − d| − |ϑ| ) | X1, ..., Xn ) + o_p(1).    (5.5.25)

Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions ln(θ, d) = √n(|θ − d| − |θ|). But the Bayes estimates for ln and for l(θ, d) = |θ − d| must agree whenever E(|ϑ| | X1, ..., Xn) < ∞. (Note that if E(|ϑ| | X1, ..., Xn) = ∞, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.14.22),

sup_x | P[√n(ϑ − θ̂) ≤ x | X1, ..., Xn] − Φ(x√I(θ)) | → 0 a.s. P_θ.    (5.5.26)

But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.11). Thus, any median of the posterior distribution of √n(ϑ − θ̂) tends to 0, the median of N(0, I⁻¹(θ)), a.s. P_θ. But the median of the posterior of √n(ϑ − θ̂) is √n(θ* − θ̂), and (5.5.22) follows. To prove (5.5.24) note that

E( √n( |ϑ − θ̂| − |ϑ − θ*| ) | X1, ..., Xn ) ≤ √n |θ̂ − θ*| → 0    (5.5.27)

a.s. P_θ, for all θ. Because a.s. convergence P_θ for all θ implies a.s. convergence P (B.7), claim (5.5.24) follows and, hence,

E( √n( |ϑ − θ̂| − |ϑ| ) | X1, ..., Xn ) = E( √n( |ϑ − θ*| − |ϑ| ) | X1, ..., Xn ) + o_p(1).    (5.5.28)
Because, by Problem 1.4.7 and Proposition 3.2.1, θ* is the Bayes estimate for ln(θ, d), (5.5.25) and the theorem follow. □

Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions

There is another result illustrating that the frequentist inferential procedures based on θ̂ agree with Bayesian procedures to first order.

Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let

Cn = { θ : π(θ | X1, ..., Xn) ≥ cn },

where cn is chosen so that Π(Cn | X1, ..., Xn) = 1 − α, be the Bayes credible region defined in Section 4.7. Let In(γ) be the asymptotically level 1 − γ optimal interval based on θ̂,

In(γ) = [ θ̂ − dn(γ), θ̂ + dn(γ) ],

where dn(γ) = z(1 − ½γ)/√(nI(θ̂)). Then, for every ε > 0,

P_θ[ In(α + ε) ⊂ Cn(X1, ..., Xn) ⊂ In(α − ε) ] → 1.    (5.5.29)
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of √n(ϑ − θ̂) converges to the N(0, I⁻¹(θ)) density uniformly over compact neighborhoods of 0 for each fixed θ, is sketched in Problem 5.5.6. The message of the theorem should be clear: Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis, both in this case and in estimation, reveals that any approximations to Bayes procedures on a scale finer than n^{−1/2} do involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n^{−1} order (see Schervish, 1995).
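A numerical illustration (ours; it uses an equal-tailed posterior interval rather than the highest-posterior-density region of Theorem 5.5.4, and the prior parameters, θ, and n are arbitrary) of the agreement between Bayesian credible intervals and the interval based on θ̂ in the beta–binomial model of Example 5.5.2:

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(7)
theta, n, r, s, alpha = 0.3, 400, 2.0, 2.0, 0.05
S = rng.binomial(n, theta)
theta_hat = S / n

# Equal-tailed 1 - alpha credible interval from the beta(S + r, n + s - S) posterior.
cred = beta.ppf([alpha / 2, 1 - alpha / 2], S + r, n + s - S)
# Interval based on the MLE, with I(theta_hat) = 1/(theta_hat(1 - theta_hat)).
half = norm.ppf(1 - alpha / 2) * np.sqrt(theta_hat * (1 - theta_hat) / n)
print(cred, (theta_hat - half, theta_hat + half))   # nearly identical for large n
```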
Testing

Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of θ0 given X1, ..., Xn if H is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervish (1995). However, if instead of considering hypotheses specifying one point θ0 we consider indifference regions where H specifies [θ0, θ0 + Δ) or (θ0 − Δ, θ0 + Δ), then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.

Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established
the so-called Bernstein–von Mises theorem, actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the Bernstein–von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose X1, ..., Xn are i.i.d. as X ~ F, where F has median F⁻¹(½) and a continuous case density.

(a) Show that, if n = 2k + 1,

E_F med(X1, ..., Xn) = n (2k choose k) ∫₀¹ F⁻¹(t) tᵏ(1 − t)ᵏ dt,

E_F med²(X1, ..., Xn) = n (2k choose k) ∫₀¹ [F⁻¹(t)]² tᵏ(1 − t)ᵏ dt.

(b) Suppose F is uniform, U(0, 1). Find the MSE of the sample median for n = 1, 3, and 5.
2. Suppose Z ~ N(μ, 1) and V is independent of Z with distribution χ²_m. Then T = Z/(V/m)^{1/2} is said to have a noncentral t distribution with noncentrality μ and m degrees of freedom. See Section 4.9.2.

(a) Show that

P[T ≤ t] = ∫₀^∞ Φ( t√(w/m) − μ ) f_m(w) dw,

where f_m(w) is the χ²_m density and Φ is the normal distribution function.

(b) If X1, ..., Xn are i.i.d. N(μ, σ²), show that √n X̄ / ( (n − 1)⁻¹ Σ(Xᵢ − X̄)² )^{1/2} has a noncentral t distribution with noncentrality parameter √n μ/σ and n − 1 degrees of freedom.

(c) Show that T² in (a) has a noncentral F_{1,m} distribution with noncentrality parameter μ². Deduce that the density of T is

p(t) = 2 Σ_{i=0}^∞ P[R = i] h_{2i+1}(t²) [ φ(t − μ)1(t > 0) + φ(t + μ)1(t < 0) ],

where R is given in Problem B.3.12.

Hint: Condition on |T|.
3. Show that if P[|X| ≤ 1] = 1, then Var(X) ≤ 1 with equality iff X = ±1 with probability ½.

Hint: Var(X) ≤ EX².
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and ε through √n ε.

(a) Show that the ratio of the Hoeffding function h(√n ε) to the Chebychev function c(√n ε) tends to 0 as √n ε → ∞, so that h(·) is arbitrarily better than c(·) in the tails.

(b) Show that the normal approximation 2Φ(√n ε/σ) − 1 gives lower results than h in the tails if P[|X| ≤ 1] = 1 because, if σ² ≤ 1, 1 − Φ(t) ∼ φ(t)/t as t → ∞.

Note: Hoeffding (1963) exhibits better bounds for known σ².
5. Suppose λ : R → R has λ(0) = 0, is bounded, and has a bounded second derivative λ″. Show that if X1, ..., Xn are i.i.d., EX1 = μ and Var X1 = σ² < ∞, then

E λ(|X̄ − μ|) = λ′(0) σ √(2/π) / √n + O(1/n) as n → ∞.

Hint: √n E( λ(|X̄ − μ|) − λ(0) ) = λ′(0) E(√n |X̄ − μ|) + √n E( ½ λ″(W)(X̄ − μ)² ), where |W| ≤ |X̄ − μ|. The last term is at most ½ sup_x |λ″(x)| σ²/√n, and the first tends to λ′(0) σ ∫ |z| φ(z) dz by Remark B.7.1(2).
Problems for Section 5.2

1. Using the notation of Theorem 5.2.1, show that

2. Let X1, ..., Xn be i.i.d. N(μ, σ²). Show that for all n ≥ 1, all ε > 0,

sup_{(μ,σ)} P_{(μ,σ)}[ |X̄ − μ| ≥ ε ] = 1.

Hint: Let σ → ∞.

3. Establish (5.2.5). Hint: |q̂n − q(p)| ≥ ε ⇒ |p̂n − p| ≥ ω(ε).
4. Let (Uᵢ, Vᵢ), 1 ≤ i ≤ n, be i.i.d. ~ P ∈ P.

(a) Let γ(P) = P[U1 > 0, V1 > 0]. Show that if P = N(0, 0, 1, 1, ρ), then

ρ = sin 2π( γ(P) − ¼ ).
N (/J.(ii) holds.•.e'l < . e) . sup{lp(X. Hint: sup 1. eol > A} > 0 for some A < 00. e') . inf{p(X. i (e))} > <. (i).d. 7.sup{lp(X.\} . ".2. .e')p(X. (Ii) Show that condition (5. {p(X" 0) . 5. Hint: K can be taken as [A. e E Rand (i) For some «eo) >0 Eo. for each e eo. eo)} : e E K n {I : 1 IB . VarpUVarpV > O}. Suppose Xl.2 rII.p(X. n Lt= ". . inf{p(X.~ .. e) is continuous.2.e)1 : e' E 5'(e. e') . where A is an arbitrary positive and finite constant.2.i. eo) : e' E S(O. t e. by the basic property of maximum contrast estimates. e)1 (ii) Eo. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent. (c) Suppose p{P) is defined generally as Covp (U. (Wald) Suppose e ~ p( X.eol > . Eo.p(Xi .n 1 (XiJ.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution. eo) : Ie  : Ie .Xn are i. Or of sphere centers such that K Now inf n {e: Ie .l)2 _ + tTl = o:J.14)(i).. By compactness there is a finite number 8 1 .(eo)} < 00.lO)2)) .eol > .\} c U s(ejJ(e j=l T J) {~ t n . . 6. and the dominated convergence theorem. 05) where ao is known. .o)} =0 where S( 0) is the 0 ball about Therefore. V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00.14)(i) add (ii) suffice for consistency.p(X. Hint: From continuity of p. .Section 5. and < > 0 there is o{0) > 0 such that e. then is a consistent estimate of p. (1 (J. Show that the maximum contrast estimate 8 is consistent. (a) Show that condition (5.. 5_0  lim Eo.lJ.8) fails even in this simplest case in which X ~ /J is clear. Prove that (5. A].p(X.
d.X'i j .) 2] .3. (ii) If ti are ij.1... Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. Compact sets J( can be taken of the form {II'I < A.i.. (72).d. Show that the log likelihood tends to 00 as a + 0 and the condition fails. Extend the result of Problem 7 to the case () E RP..I'lj < EIX .3..i.Xn but independent of them. .X. Let X~ be i. Establish (5.O'l n i=l p(X"Oo)}: B' E S(Bj.2. P > 1.."d' k=l d < mm+1 L d ElY.3) for j odd as follows: . 1/(J.. . 4. • • (i) Suppose Xf.O(BJll}. 9. with the same distribution as Xl. 8. i . + id = m E II (Y. . < Mjn~ E (. i .3. . Hint: See part (a) of the proof of Lemma 5. . .1.348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi. I 10. < < 17 < 1/<. . Establish (5. .. = 1.. . Establish (5.k / 2 . .. li . .. .d.)2]' < t Mjn~ E [. n. < > OJ. L:~ 1(Xi . 1 + . and take the values ±1 with probability ~. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. The condition of Problem 7(ii) can also fail. L:!Xi . then by Jensen's inequality. II I .1'.l m < Cm k=I n. Then EIX .11). .L. and let X' = n1EX.X.3. Establish Theorem 5.)I' < MjE [L:~ 1 (Xi .3. Hint: Taylor expand and note that if i .3. J (iii) Condition on IXi  X. Problems for Section 5.X~ are i. For r fixed apply the law of large numbers. en are constants. for some constants M j . and if CI..3 I.[. 2..xW) < I'lj· 3. .9) in the exponential model of Example 5.. ~ .
.m distribution with k = nl ./LI = E(Xt}. . respectively.andsupposetheX'sandY's are independent.m 00. LetX1"".d. i. . then (sVaf)/(s~/a~) has an .ad)]m < m L aj j=1 m <mmILaj.6 Problems and Complements 349 Suppose ad > 0.aD. 1)'2:7' . Show that if m = .m) Then. (d) Let Ck. Show that under the assumptions of part (c). j = 1. . )=1 5..i.r'j". (a) Show that if F and G are N(iL"af) and N(iL2. PH(st! s~ < Ck. . theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~. ~ iLli I IX. j > 2.\ > 0 and = 1 + JI«k+m) ZIa.i.. if a < EXr < 00.d. L~=1 i j = m then m a~l.i. ~ xl". replaced by its method of moments estimate. s~ ~ (n2 . Establish 5. under H : Var(Xtl ~ Var(Y. . km K. 8. 6.. FandYi""'Yn2 bei. . (b) Show that when F and G are nonnal as in part (a).I) Hint: By the iterated expectation theorem EIX.1)'2:7' .1) +E{IX. I < liLl}P(IXd < liLl)· 7. 1 < j < m. R valued with EX 1 = O..m with K.d.). Show that if EIXlii < 00.a as k t 00.m be Ck.(X.. then EIX.. ~ iLli < 2 i EIX.(l'i .. 0'1 = Var(Xd· Xl /LI Ck.) I : i" .Section 5. ."" X n be i. G.c> . . X. n} EIX. Show that ~ sup{ IE( X.I.Tn) + 1 .. Let XI.Fk.\k for some . ~ iLl' I IXli > liLl}P(lXd > 11.Xn1 bei.pli ~ E{IX.28.1 and 'In = n2 . where s1 = (n.3. 2 = Var[ ( ) /0'1 ]2 .. P(sll s~ < ~ 1~ a as k ~ 00.a~d < [max(al"" .l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.
! . i = 1. (iJl.2< 7'. from {t I. () E such that T i = ti where ti = (ui. Consider as estimates e = ') (I X = x.m) + 1 .x) ~ N(O." •. jn(r .. .A») where 7 2 = Var(XIl. . p)..p')') and. . suppose in the context of Example 3. Without loss of generality.iL) • . "1 = "2 = I. t N }. (a) If 1'1 = 1'2 ~ 0. ~ ] . = = 1. then P(sUs~ < qk. then jn(r . . . Without loss of generality.. In particular. H En. .XN} or {(uI. as N 2 t Show that.i.a as k .(I_A)a2). that is.350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g.I. .. then jn(r' . • 10. . . and if EX..Ii > 0.p).6. (UN..+. . jn((C .1JT. 1).p. suppose i j = j. Show that jn(XRx) ~N(0.00..~) (X . if ~ then .1 · /. Under the assumptions of part (c). (b) Use the delta method for the multivariate case and note (b opt  b)(U .l EY? . Instead assume that 0 < Var(Y?) < x. = (1.t .. (b) If (X.. n. 0 < .i._ J . • . • op(n').. err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.. use a normal approximation to tind an approximate critical value qk. iik. . JLl = JL2 = 0.p') ~ N(O.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters.1.l EXiYi . i (b) Suppose Po is such that X I = bUi+t:i.m with KI and K2 replaced by their method of moment estimates. Tin' which we have sampled at random ~ E~ 1 Xi. there exists T I . .1 that we use Til.) =a 2 < 00 and Var(U) Hint: (a) X .1 EX.XC) wHere XC = N~n Et n+l Xi. . + I+p" ' ) I I I II. 7 (1 . be qk.p) ~ N(O.lIl) + 1 .d.1'2)1"2)' such that PH(SI!S~ < Qk.4. ."...xd. i = I. 4p' (I . Show that under the assumptions of part (e). TN i. Et:i = 0. Var(€. . (a) jn(X .. to estimate Ii 1 < j < n. (I _ P')').I' ill "rI' and "2 = Var[( X. (iJi . if 0 < EX~ < 00 and 0 < Eyl8 < 00. Hint: Use the centra] limit theorem and Slutsky's theorem." .3.a: as k + 00.I))T has the same asymptotic distribution as n~ [n.. ' .xd.Il2. if p i' 0.+xn n when T. . (cj Show that if p = 0. .6. In survey sampling the modelbased approach postulates that the population {Xl. .m (0 Let 9.2" ( 11. . show that i' ~ .p) ~ N(O. n.". . .\ < 1.I). = ("t . "i. N where the t:i are i.\ 00.d..3. In Example 5. Wnte (l 1pP . Po.4. Y) ~ N(I'I.N. . .1. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3..m (depending on ~'l ~ Var[ (X I . < 00 (in the supennodel).
. X n is a sample from a population with mean third central moment j1. E(WV) < 00.i'b)(Yc . and Hint: Use (5. Normalizing Transformation for the Poisson Distribution. Here x q denotes the qth quantile of the distribution.Section 5..13(c).i'c) I < Mn'. (a) Show that if n is large. .14 to explain the numerical results of Problem 5..3. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl)..6) to explain why. 1931) is found to be excellent Use (5.. (h) Deduce fonnula (5. (a) Suppose that Ely. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. (a) Show that the only transformations h that make E[h(X) . Show that IE(Y. n = 5...3.. 14..99. 16.y'ri has approximately aN (0.3 < 00. Suppose X I I • • • l X n is a sample from a peA) distrihution. each with HardyWeinberg frequency function f given by . W). Suppose XI. then E(UVW) = O.. X = XO. !) distribution. 15.i'a)(Yb .E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. . (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x .10.s. (a) Use this fact and Problem 5. (b) Let Sn . Let Sn have a X~ distribution.90. (h) Use (a) to justify the approximation 17. EU 13.3' Justify fonnally j1" variance (72.25. Hint: If U is independent of (V..12).6 Problems and Complements 351 12. Suppose X 1l . .3. X.14). . X~.... = 0.3.: . is known as Fisher's approximation. X n are independent.3.n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO. The following approximation to the distribution of Sn (due to Wilson and Hilferty.
y). where I' = (a) Find an approximation to P[X < E(X. which are integers. Var(Y) = C7~. h(l) = I.. . (a) (b) Var(h(X.)? 18. .n P . • j b _ . Variance Stabilizing Transfo111U1tion for the Binomial Distribution. (1'1. Yn) is a sample from a bivariate population with E(X) ~ 1'1.n  m/(m + n») < x] 1 > 'li(x). 1'2)]2C7n + O(n 2) i . and h'(t) > for all t. then I m a(l. 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t. (1'1. is given by h(t) = (2/1r) sin' (Yt). Var Bmnm + n +Rmn = 'm+n ' ' where Rm.. V)) '" ..a tends to zero at the rate I/(m + n)2.• • • ~{[hl (1". 0< a < I. where I I h. Let X I. Bm. [vm +  n (Bm.2. 19.y) = a axh(x. 1'2)pC7.y) = a ayh(x. Show directly using Problem B. Show that the only variance stabilizing transformation h such that h(O) = 0..Xm .a) E(Bmn ) = . .1'2)]2 177 n +2h. 1'2)h2(1'1. 172 + [h 2(1'l.. v'a(l.n tends to zero at the rate I/(m + n)2. 1'2)(Y  + O( n '). Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. (Xn .a) Hmt: Use Bm.(x.y). Justify fonnally the following expressions for the moments of h(X.h(l'l.. 1'2)(X . i 21. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1. 1'2) = h.5 that under the conditions of the previous problem.. h 2 (x... E(Y) = 1'2. . . + (mX/nY)] where Xl. Y) ~ p!7IC72. Cov(X. Var(X) = 171. .I'll + h2(1'1. Y) .1') + X 2 . if m/(m + n) .n = (mX/nY)i1 independent standard eXJX>nentials. X n be the indicators of n binomial trials with probability of success B. Let then have a beta distribution with parameters m and n. (e) What is the approximate distribution of Vr'( X . 1'2) Hint: h(X. Y) where (X" Yi). ° . 20. . Yn are < < . (b) Find an approximation to P[JX' < t] in terms of 0 and t. Y" .
Compare your answer to the answer in part (a). n[X(I .)2.). (a) Show that y'n[h(X) . J1. k > I. Let Xl.(1.) + ": + Rn where IRnl < M 1(J1.. xi 25.X) in tenns of the distribution function when J. Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. Show that Eh(X) 3 h(J1.l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ.J1. 24. < t) = p(Vi < (b) When J1.. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 .X2 . Let Sn "' X~.6 Problems and Complements 353 22.. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k).4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00. _.J1.2V where V ~ 00.)] ~ as ~h(2)(J1.l4 is finite. (e) Fiud the limiting laws of y'n(X . 1 X n be a sample from a population with mean J. Usc Stirling's approximation and Problem 8.l = ~. .J1.2V with V "' Give an approximation to the distribution of X (1 .)2 and n(X . Suppose that Xl.) + ~h(2)(J1.4 + 3o..h(J1.)1J1. ° while n[h(X h(J1.2)/24n2 Hint: Therefore. = ~.)] !:. = E(X) "f 0.Section 5.. .J1. (It may be shown but is not required that [y'nR" I is bounded.2) using the delta method.l) = o.2. find the asymptotic distribution of y'n(T. and that h(1)(J. . XI. ()'2 = Var(X) < 00.)] is asymptotically distributed (b) Use part (a) to show that when J1. (a) When J1.. = 0. find the asymptotic distrintion of nT nsing P(nT y'nX < Vi). xi.3 1/6n2 + M(J1. .J1.) 23.
\. so that ntln ~ .. I i t I .) < (b) Deduce that if p tl ~ <P (t [1. .9. Hint: (a). What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.)/SD where D.\a'4)I').JTi times the length of the onesample t interval based on the differences. .a. 1 X n1 and Y1 .. the interval (4. Y1 ..t. .4. cyi . .\ > 1 .\ < 1. Assume that the observations have a common N(jtl.3 independent samples with III = E(XIl. and a~ = Var(Y1 ). (c) Apply Slutsky's theorem. .') < 00.(j ' a ) N(O.9.9. . n2 + 00.•.9. Hint: Use (5.) of Example 4.3).3).... .9. < tJ ~ <P(t[(. Yd. E In] > 1 . Suppose that instead of using the onesample t intervals based on the differences Vi . . b l .\.\ul + (1 ~ or al . N(Pl (}2).33) and Theorem 5.3) and the intervals based on the pivot ID . Suppose nl + 00 . p) distribution. 0 < .t2.354 26.. 'I :I ! t. and SD are as defined in Section 4. .Jli = ~. We want to obtain confidence intervals on J1. al = Var(XIl. .d. Suppose (Xl.2a 4 ). . Let T = (D . .9. (a) Show that P[T(t. /"' = E(YIl..4. n2 + 00.) (b) Deduce that if..}. Yn2 are as in Section 4. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . 27.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .1/ S D where D and S D are as in Section 4.2 . . a~.Xi we treat Xl.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . = = az.a. Eo) where Eo = diag(a'. 29. 28. then limn PIt. . (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment. Viillnl ~ 2vul + a~z(l .\)al + . .9. . XII are ij. that E(Xt) < 00 and E(Y. We want to study the behavior of the twosample pivot T(Li. if n}. Show that if Xl" .\)a'4)/(1 . (d) Make a comparison of the asymptotic length of (4.3) have correct asymptotic (c) Show that if a~ > al and .3. . " Yn as separate samples and use the twosample t intervals (4. Suppose Xl. (c) Show that if IInl is the length of the interval In.9. 2pawz)z(1  > 0 and In is given by (4.(~':~fJ'). whereas the situation is reversed if the sample size inequalities and variance inequalities agree.4.o.9.t. . 1 Xn. the intervals (4. Let n _ 00 and (a) Show that P[T(t./".3) has asymptotic probability of coverage < 1 .3.3. I .\ probability of coverage.
.3. . but Var(Tn ) + 00.Section 5. vectors and EIY 1 1k < 00.3. where tk(1. Hint: If Ixll ~ L. X. Generalize Lemma 5. Show that as n t 00.1.d. (). It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. Let .3. x = (XI.Z). .:~l IXjl. . Tn . 32.. k) be independent with Xi} ~ N(l'i.3 using the Welch test based on T rather than the twosample t test based on Sn.X)" Then by Theorem B. 30. .4. Y nERd are Li. Let XI. (a) SupposeE(X 4 ) < 00..p..i.I k and k only. X n be i.£. (a) Show that the MLEs of I'i and ().£. ()._I' Find approximations to P(Yn and P{Vn < XnI (I .~ t (Xi . as X ~ F and let I' = E(X). then lim infn Var(Tn ) > Var(T). Vn = (n . then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. Hint: See Problems B.I)sz/(). I< = Var[(X 1')/()'jZ. T and where X is unifonn.9. . < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. 33.4. .a)) and evaluate the approximations when F is 7.1). Carry out a Monte Carlo study such as the one that led to Figure 5. . (d) Find or write a computer program that carries out the Welch test. has asymptotic level a.16..I)z(a)." . I is the Euclidean norm.xdf and Ixl is Euclidean distance. Show that k ~ ex:: as 111 ~ 00 and 112 t 00.3. then P(Vn < va<) t a as 11 t 00. EIY. Plot your results. Show that if 0 < EX 8 < 00.. j = I. (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l . U( 1. (c) Let R be the method of moment estimaie of K.3 by showing that if Y 1.I) + y'i«n . and let Va ~ (n .z = Var(X). .a).. where I.1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" ..6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4. Let X ij (i = I.8Z = (n .a) is the critical value using the Welch approximation..z has a X~I distribution when F is theN(fJl (72) distribution.z are Iii = k.9 and 4.1)1 L.d. . then for all integers k: where C depends on d.
. (c) Assume the conditions in (a) and (b).f.p(X. Use the bounded convergence lbeorem applied to .n for t E R. I i . \ . That is the MLE jl' is not consistent (Neyman and Scott. Show lbat if I(x) ~ (d) Assume part (c) and A6. N(O.i.. . . Assmne lbat '\'(0) < 0 exists and lbat . N(O) = Cov(. • 1 = sgn(x). 1948). O(P) is lbe unique solution of Ep. N(O.n(On . . Let i !' X denote the sample median. .. . aliI/ > O(P) is finite. 00 (iii) I!/!I(x) < M < for all x..0))  7'(0)) 0.p(Xi . I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 . . . (a) Show that (i).. . then 8' !.p(X.356 Asymptotic Approximations Chapter 5 . is continuous and lbat 1(0) = F'(O) exists.p(X. _ .0) ~ N Hint: P( .4 1. F(x) of X. and (iii) imply that O(P) defined (not uniquely) by Ep. I ! 1 i . I .... .0) = O. . random variables distributed according to PEP. I(X. .0) !:. Let On = B(P). . Show that _ £ ( .n(X .On) . 1/). Ep... Show that On is consistent for B(P) over P.p(x) (b) Suppose lbat for all PEP..p(X.. .L~. Show that. (ii).n7(0) ~[. under the conditions of (c).. (b) Show that if k is fixed and p ~ 00.1)cr'/k. ... 1/4/'(0)). ..nCOn . Set (.. 1 X n.O(P)) > 0 > Ep!/!(X.Xn be i.)). .0) !.0).) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. where P is the empirical distribution of Xl.p(X. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique.....0) and 7'(0) .p(X..d. ! Problems for Section 5.(0) Varp!/!(X. .On) < 0). n 1 F' (x) exists.0). (k . Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) . . (Use . ... . (e) Suppose lbat the d.p( 00) as 0 ~ 00... (c) Give a consistent estimate of (]'2... . 1) for every sequence {On} wilb On 1 n = 0 + t/. Let Xl. [N(O))' . then < t) = P(O < On) = P (.\(On)] !:.
jii(Oj .d...(x). X) = 1r/2. .' .    .I / 2 .O)p(x. 1979) which you may assume.7.(1:e )n ~ 1.0. 4. Hint: teJ1J. . (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2).  = a?/o.(x.5  and '1'.y. denotes the N(O. Hint: Apply A4 and the dominated convergence theorem B.(x. 2.c)'P. 3.B)) ~ [(I/O). Find the efficiency ep(X. If (] = I. 0 «< 0.(x) + <'P."' c .05.10.0) (see Section 3. X n be i. evaluate the efficiency for € = .6 Problems and Complements 357 (0 For two estImates 0. oj). Let XI.Xn ) is the MLE. 0) = .X) =0. 0 > 0..9)dl"(x) = J te(1J.exp(x/O).(x. Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley. then 'ce(n(B .4. Show that A6' implies A6. Conclude that (b) Show that if 0 = max(XI.'" .(x. .0) ~ N(O.(x.b)dl"(x) J 1J.0.. (a) Show that g~ (x.(x) = (I . J = 1.~ for 0 > x and is undefined for 0 g~(X. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 . U(O. 0» density. not O! Hint: Peln(B . Show that assumption A4' of this section coupled with AGA3 implies assumption A4.15 and 0. then ep(X.5) where f.a)p(x..O))dl"(x) ifforalloo < a < b< 00. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n.2.O)p(x.i..0)p(x.(x .O)dl"(x)) = J 1J. lJ :o(1J.O). 82 ) Pis N(I". the asymptotic 2 .. and O WIth . Show that if (g) Suppose Xl has the gross error density f..20 and note that X is more efficient than X for these gross error cases. x E R. 'T = 4. Thus.O) is defined with Pe probability I but < x. This is compatible with I(B) = 00. . ~"'. Show that   ep(X.B) < xl = 1.b)p(x. ( 2 ). X) as defined in (f).a)dl"(x).Section 5.
0 for any sequence {en} by using (b) and Polyii's theorem (A... O.in.4) to g.22).i.5 continue to hold if ."('1(8) .) P"o (X . Show that 0 ~ 1(0) is continuous."log .in) 1 is replaced by the likelihood ratio statistic 7. 8) is continuous and I if En sup {:.in) ~ .O') ..(X. &l/&O so that E.. and A6 hold for.i'''i~l p(Xi. Apply the dominated convergence theorem (B.g. 8)p(x.in) . i. j.i(X. (8) Show that in Theorem 5..0'1 < En}'!' 0 . I I j 6. 8 ) i . A2.·_t1og vn n j = p( x"'o+.In ~ "( n & &8 10gp(Xi.358 5. 8 ) 0 . 0). I .7.18 . under F oo ' and conclude that . Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".80 + p(Xi . I (c) Prove (5. ~log (b) Show that n p(Xi .~ (X. 0 ~N ~ (. I 1 . Suppose A4'. 80 +.[ L:. .14. I . 1(80 ). ) n en] .p ~ 1(0) < 00.Oo) p(Xi . ~ L.4.O) 1(0) and Hint: () _ g.i (x.In) + .80 p(X" 80) +.50). ) ! .. ..00) . .4.80 + p(X .5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.< ~log n p(X. Show that the conclusions of PrQple~ 5. .4. i I £"+7.
59).3).58). (Ii) Compare the bounds in Ca) with the bound (5.5 hold.4.4.4.4. (Ii) Deduce that g.2. ~ 1 L i=I n 8'[ ~ 8B' (Xi.4.Section 5. Compare Theorem 4.5 and 5.61) for X ~ 0 and L 12.4.Q" is asymptotically at least as good as B if. and give the behavior of (5.14.59) coincide. f is a is an asymptotic lower confidence bound for fJ.4.4.56). (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5.:vell . 10. which is just (4.3. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 . Hint.59).54) and (5.4. B' nj for j = 1.6 Problems and Complements 359 8.5.6 and Slutsky's theorem. (Ii) Suppose the conditions of Theorem 5. setting a lower confidence bound for binomial fJ. (a) Show that under assumptions (AO)(A6) for all Band (A4').4.61). 11.4.a. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).2. Consider Example 4.4. Establish (5. 9. if X = 0 or 1. Then (5.7.57) can be strengthened to: For each B E e. Let B be two asymptotic l. A. . the bound (5.4. Hint: Use Problem 5. B). hence.4.B lower confidence bounds.7). (a) Establish (5. at all e and (A4'). gives B Compare with the exact bound of Example 4. Let [~11. (a) Show that.6. = X.2.4.4.4. all the 8~i are at least as good as any competitors. ~. B' _n + op (n 1/') 13. Let B~ be as in (5. We say that B >0 nI Show that 8~I and. for all f n2 nI n2 . which agrees with (4. Hint: Use Problems 5.
5. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation.01 ..1.2. where . T') distribution.. N(Jt. Consider the Bayes test when J1. =f:. I' has aN(O.•.i. I . . J1.riX) = T(l + nr 2)1/2rp (I+~~. = O. 1 . Jl > OJ . Show that (J l'.2. the pvalue fJ . 0 given Xl.'" XI! are ij. Consider testing H : j. (c) Suppose that I' = 8 > O.. Establish (5.1.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. .] versus K . (b) Suppose that I' ). .4.11).4.4. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . By Problem 4. ! I • = '\0 versus K : .d..1 and 3.i. inverse Gaussian with parameters J. . if H is false.1.t = 0 versus K : J.. N(I'. (d) Find the Wald test for testing H : A = AO versus K : A < AO. the evidence I .d. variables with mean zero and variance 1(0). LJ.5 ! 1.)) and that when I' = LJ.2.LJ.\ < Ao. phas a U(O. That is. 1. ! (a) Show that the posterior probability of (OJ is where m n (. 7f(1' 7" 0) = 1 . A exp . Let X 1> . X n i.\ > o. Now use the central limit theorem. Hint: By (3.L and A. Problems for Section 5. is a given number. > ~.10) and (3. (e) Find the Rao score test for testing H : . Suppose that Xl.I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox"). 1)..55)..) has pvalue p = <1>(v'n(X . the test statistic is a sum of i. . each Xi has density ( 27rx3 A ) 1/2 {AX + J..A 7" 0. 1) distribution. X > 0. 1 15..i. X n be i.21J2 2x A} .. Show that (J/(J l'.360 1 Asymptotic Approximations Chapter 5 14. where Jl is known. That is. " " I I .\ l .I). Consider the problem of testing H : I' E [0. 1) distribution..LJ.d.)l/2 Hint: Use Examples 3. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0.=2[1  <1>( v'nIXI)] has a I U(O. 1.d. (a) Show that the test that rejects H for large values of v'n(X . 2.6. 00.
Apply the argnment used for Theorem 5.s..) (d) Compute plimn~oo p/pfor fJ.5.5.01 < 0 } continuThen 00.052 100 . Hint: 1 n [PI . Suppose that in addition to the conditions of Theorem 5..(Xi.( Xt.) sufficiently large.Oo»)+log7f(O+ J.13) and the SLLN. yn(anX   ~)/ va.S.050 logdnqn(t) = ~ {I(O).054 50 ..17).029 .8) ~ 0 a. 1) prior.1 is not in effect..} dt < . L M M tqn(t)dt ~ 0 a.)}.s.0'): iO' .O') : 100'1 < o} 5. By (5.5. Hint: In view of Theorem 5.i Itlqn(t)dt < J:J'). Show that the posterior probability of H is where an = n/(n + I). Fe.O'(t)).5. 1) and p ~ L U(O.6 Problems and Complements 361 (b) Suppose that J..for M(. Hint: By (5..Section 5. oyn. for all M < 00.1. J:J'). i ~. ifltl < ApplytheSLLN and 0 ous at 0 = O... (e) Show that when Jl ~ ~.I(Xi. ~l when ynX = n 1'>=0.046 . 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .5.645 and p = 0.17).2 and the continuity of 7f( 0).05.i It Iexp {iI(O)'. Establish (5.034 .' " .2.4.058 20 .14). 4.5. P.I). ~ E.0 3.~~(:.. (Lindley's "paradox" of Problem 5.:. 10 . ~ L N(O.2 it is equivalent to show that J 02 7f (0)dO < a. [0. all .2.sup{ :. (e) Verify the following table giving posterior probabilities of 1. By Theorem 5.l has a N(O.(Xi.5. yn(E(9 [ X) . Extablish (5..042 .1 I'> = 1.
) by A4 and A5.2 we replace the assumptions A4(a. . (1) This famous result appears in Laplace's work and was rediscovered by S. 362 f Asymptotic Approximations Chapter 5 > O. i=l }()+O(fJ) I Apply (5. Suppose that in Theorem 5.. Fn(x) is taken to be O.d) for some c(d).4 (1) This result was first stated by R. . Bernstein and R.!. See Bhattacharya and Ranga Rao (1976) for further discussion. 7.s. (a) Show that sup{lqn (t) . dn Finally..1.7 NOTES Notes for Section 5. The sets en (c) {t . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d.f.5. 'i roc J. • Notes for Section 5. convergence replaced by convergence in Po probability. Fisher (1925).5.s. Notes for Section 5. all d and c(d) / in d.Pn where Pn a or 1. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5. For n large these do not correspond to distributions one typically faces. . A. .8) and (5.1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 .16) noting that vlnen.5. Show tbat (5. for all 6.".3 ). . jo I I . I : ItI < M} O. ~ II• '.I Notes for Section 5. Finally. ~ 0 a.5. I i. . (O)<p(tI.s. L . " .5 " I i . (I) If the rightband side is negative for some x.5. . 5.1 .) and A5(a. qn (t) > c} are monotone increasing in c.I(Xi}»} 7r(t)dt.9) hold with a. A proof was given by Cramer (1946). I: .s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) .29). .) ~ 0 and I jtl7r(t)dt < co. von Misessee Stigler (1986) and Le Cam and Yang (1990).5. • . .S. ! ' (2) Computed by Winston Cbow. . . J I . (0» (b) Deduce (5. . .
5.8 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag, 1985.
BHATTACHARYA, R. N., AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions, New York: Wiley, 1976.
BILLINGSLEY, P., Probability and Measure, New York: Wiley, 1979.
BOX, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 318-324 (1953).
CRAMÉR, H., Mathematical Methods of Statistics, Princeton, NJ: Princeton University Press, 1946.
DAVID, F. N., Tables of the Correlation Coefficient, Cambridge: Cambridge University Press, 1938; reprinted in Biometrika Tables for Statisticians (1966), H. Hartley and E. Pearson, Editors.
FERGUSON, T. S., A Course in Large Sample Theory, New York: Chapman and Hall, 1996.
FISHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).
FISHER, R. A., Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd, 1958.
HAMMERSLEY, J. M., AND D. C. HANDSCOMB, Monte Carlo Methods, London: Methuen & Co., 1964.
HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-30 (1963).
HUBER, P., "The behavior of the maximum likelihood estimator under non-standard conditions," Proc. Vth Berkeley Symp. Math. Statist. Prob., Vol. I, Berkeley, CA: University of California Press, 1967.
LE CAM, L., AND G. L. YANG, Asymptotics in Statistics: Some Basic Concepts, New York: Springer, 1990.
LEHMANN, E. L., Elements of Large-Sample Theory, New York: Springer-Verlag, 1999.
LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation, 2nd ed., New York: Springer-Verlag, 1998.
NEYMAN, J., AND E. L. SCOTT, "Consistent estimates based on partially consistent observations," Econometrica, 16, 1-32 (1948).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed., New York: J. Wiley & Sons, 1973.
RUDIN, W., Mathematical Analysis, 3rd ed., New York: McGraw-Hill, 1987.
SCHERVISH, M. J., Theory of Statistics, New York: Springer, 1995.
SERFLING, R. J., Approximation Theorems of Mathematical Statistics, New York: J. Wiley & Sons, 1980.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900, Cambridge, MA: Harvard University Press, 1986.
WILSON, E. B., AND M. M. HILFERTY, "The distribution of chi square," Proc. Nat. Acad. Sci. U.S.A., 17, 684 (1931).
Chapter 6

INFERENCE IN THE MULTIPARAMETER CASE

Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for d-dimensional models {P_θ : θ ∈ Θ}, Θ ⊂ R^d. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3), multiple regression models (Examples 1.1.4, 2.1.1, 2.2.2), and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.5, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations.

We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular d-dimensional parametric models, and shall illustrate our results with a number of important examples. There is, however, an important aspect of practical situations that is not touched by the approximation, the fact that d, the number of parameters, and n, the number of observations, are often both large and commensurate or nearly so.

This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models. The inequalities of Vapnik-Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.

6.1 INFERENCE FOR GAUSSIAN LINEAR MODELS
.. .1. .1.. '.1.d.d. i = 1. In vector and matrix notation.a2). These are among the most commonly used statistical techniques. 0"2).1. i . I The regression framewor~ of Examples 1. n (6.'" . " n (6. . The model is Yi j = /31 + €i.Herep= 1andZnxl = (l.3).. we write (6.N(O. i = 1.l p. 1 en = LZij{3j j=1 +Ei. We consider experiments in which n cases are sampled from a population.1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil..1. 1 Yn from a population with mean /31 = E(Y).. . .1.1. let expressions such as (J refer to both column and row vectors. .3) whereZi = (Zil.1.1 covariate measurements denoted by Zi2 • .j .. I I. " 'I and Y = Z(3 + e.2. .. and for each case. N(O.5) .. we have a response Yi and a set of p .zipf.2(4) in this framework.4) 0 where€I..1. It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6.. Here is Example 1. In Section 6.1.3): l I • Example 6. .n (6.1. .1. 366 Inference in the Multiparameter Case Chapter 6 ! . • i . i = l. Example 6. ." .l)T.. .4 and 2. Notational Convention: In this chapter we will. The OneSample Location Problem.1) j .Zip' We are interested in relating the mean of the response to the covariate values. Z = (Zij)nxp. We have n independent measurements Y1 . • .€nareij. e ~N(O. .1 is also of the fonn (6.2J) (6. In this section we will derive exact statistical procedures under the assumptions of the model (6. .1. andJ is then x nidentity matrix. 6. .1. are U. when there is no ambiguity. In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI. say the ith case.. the Zij are called the design values. ...6 we will investigate the sensitivity of these procedures to the assumptions of the model.Zip. Regression.3). and Z is called the design matrix. .. The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i.2) Ii . Here Yi is called the response variable. ..1.
The model (6.1.5) is called the fixed design normal linear regression model. We treat the covariate values z_ij as fixed (nonrandom); the random design Gaussian linear regression model is given in Example 1.4.3. We can think of the fixed design model as a conditional version of the random design model, with the inference developed for the conditional distribution of Y given a set of observed covariate values. □

Example 6.1.3. The p-Sample Problem or One-Way Layout. In Example 1.1.3 and Section 4.9.3 we considered experiments involving the comparison of two population means when we had available two independent samples, one from each population. If the control and treatment responses are independent and normally distributed with the same variance σ², then the notation (6.1.2) and (6.1.3) applies. Two-sample models apply when the design values represent a qualitative factor taking on only two values. Frequently, we are interested in qualitative factors taking on several values. If we are comparing pollution levels, we want to do so for a variety of locations; in a study to investigate whether a drug affects the mean of a response, we often have more than two competing drugs to compare. To fix ideas, suppose we are interested in comparing the performance of p ≥ 2 treatments on a population and that we administer only one treatment to each subject, a sample of n_k subjects getting treatment k, 1 ≤ k ≤ p, n_1 + ... + n_p = n. If the responses are independent and normally distributed with the same variance σ², we arrive at the one-way layout or p-sample model

    Y_kl = β_k + ε_kl,   1 ≤ l ≤ n_k,  1 ≤ k ≤ p                       (6.1.6)

where Y_kl is the response of the lth subject in the group obtaining the kth treatment, β_k is the mean response to the kth treatment, and the ε_kl are independent N(0, σ²) random variables. The model (6.1.6) is an example of what are often called analysis of variance models. Generally, this terminology is used when the design values are qualitative.

To see that this is a linear model we relabel the observations as Y_1, ..., Y_n, where Y_1, ..., Y_{n_1} correspond to the group receiving the first treatment, Y_{n_1+1}, ..., Y_{n_1+n_2} to that getting the second, and so on. Then for 1 ≤ j ≤ p, if n_0 = 0, the design matrix has elements

    z_ij = 1 if Σ_{k=1}^{j−1} n_k + 1 ≤ i ≤ Σ_{k=1}^{j} n_k,  and z_ij = 0 otherwise,

so that

        | 1_1  0    ...  0   |
    Z = | 0    1_2  ...  0   |
        | ...                |
        | 0    0    ...  1_p |

where 1_j is a column vector of n_j ones and the 0 in the "row" whose jth member is 1_j is a column vector of n_j zeros.

The model (6.1.6) is often reparametrized by introducing α = p^{-1} Σ_{k=1}^p β_k and δ_k = β_k − α, because then δ_k represents the difference between the kth and average treatment
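As a complement to the display above, here is a small sketch (not part of the original text; numpy assumed) that constructs the one-way layout design matrix: column k of Z is the indicator of membership in the kth treatment group, so Z^T Z = diag(n_1, ..., n_p).

```python
import numpy as np

def one_way_design(group_sizes):
    """Design matrix of the one-way layout: Z[i, k] = 1 iff case i received treatment k."""
    n, p = sum(group_sizes), len(group_sizes)
    Z = np.zeros((n, p))
    start = 0
    for k, nk in enumerate(group_sizes):
        Z[start:start + nk, k] = 1.0
        start += nk
    return Z

Z = one_way_design([5, 10, 6])                    # group sizes of the cholesterol example below
assert np.allclose(Z.T @ Z, np.diag([5, 10, 6]))  # Z^T Z = diag(n_1, ..., n_p)
```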
2..4. T r + 1 = L~ I and ()r+l = 1/20. (iv) By the invariance of the MLE (Section 2.. (iv) The conclusions a/Theorem 6. Theorem 6. we need to maximize n (log 27f(J2) 2 II "..11) has fJi 1 2 = Ui . and . . .. 1lr only through L~' 1 CUi . and give a geometric interpretation of ii.(v) are still valid.. To this end. By Problem 3. 1 U1" are the MLEs of 1}1..3. apply Theorem 3.(J2J) by Theorem B. The maximizer is easily seen to be n 1 L~ r+l (Problem 6. . 0.1. i > r+ 1. n.4. i = 1.1 L~ r+l U.1. .• EUl U? Ul Projections We next express j1 in terms of Y. In the canonical Gaussian linear model with a 2 unknown. then W .) (iii) By Theorem 3. . = 1. To show (iv). there exists an orthonormal basis VI) . and is UMVU for its expectation E(W..) = a. wecan assume without loss of generality that 2:::'=1 c.. Let Wi = vTu. r. The distribution of W is an exponential family. .4.4.ti· 1 . (i) By observation. n " . (U I .N(~. " as a function of 0. (6. (ii) The MLE ofa 2 is n. 2 2 ()j = 1'0/0.. . By (6...2.11) by setting T j = Uj . (v) Follows from (iv)..1. . WI = Q is sufficient for 6 = Ct.11). the MLE of q(9) = I:~ I c. 1 Ur . .1. define the norm It I of a vector tERn by  ! It!' = I:~ . 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. ( 2 )T. l Cr . (iii) 8 2 =(n . I .. "7r because. r.1. . .4. {3. To show (ii).1. obtain the MLE fJ of fl. But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul.'7.. . . . recall that the maximum of (6.. 2:7 r+l un Tis sufficientfor (1]1.. Q is UMVU. That is.. = 0.) = '7" i = 1. .2 ..r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2.Ui · If all thee's are zero.2).6 to the canonical exponential family obtained from (6.10..3 and Example (iii) is clear because 3. ~i = vi'TJ.. j = 1.. (We could also apply Theorem 2. I i > . (6..4 and Example 3.. .1). . by observation. 2(J i=r+l ~ U.2.2 . .'  . I.. " V n of R n with VI = C = (c" . Ui is UMVU for E(U.2 (ii). Proof.11) is an exponential family with sufficient statistic T.3.1. Proof. . I i I .1. 370 Inference in the Multiparameter Case Chapter 6 I iI .1]r. is q(8) = L~ 1 <."7i)2 and is minimized by setting '(Ii = Ui. where J is the n x n identity matrix.3.O)T E R n .1. By GramSchmidt orthogonalization. ... (i) T = (Ul1" . this statistic is equivalent to T and (i) follows.6.Ur'L~ 1 Ul)T is sufficient. .11) is a function of 1}1. Assume that at least one c is different from zero. (ii) U1.
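Numerically, an orthonormal basis v_1, ..., v_r of ω of the kind produced by the Gram-Schmidt process, and with it the canonical variables U_i = v_i^T Y, can be obtained from the QR decomposition of the design matrix. This is only an illustrative sketch (numpy assumed, simulated data, full-rank Z), not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 20, 3
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
Y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 1.5, n)

Q, R = np.linalg.qr(Z)     # columns of Q: an orthonormal basis v_1, ..., v_r of omega
U = Q.T @ Y                # canonical variables U_i = v_i^T Y, i = 1, ..., r
mu_hat = Q @ U             # projection of Y on omega: sum_i (v_i^T Y) v_i
resid = Y - mu_hat         # component of Y in the orthocomplement of omega
# Pythagorean identity |Y|^2 = |mu_hat|^2 + |Y - mu_hat|^2, up to rounding error:
assert np.isclose(Y @ Y, mu_hat @ mu_hat + resid @ resid)
```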
fj = arg min{Iy .1.9). because Z has full rank. Proof. _.1. note that 6. any linear combination of U's is a UMVU estimate of its expectation.12) ii. ..L of vectors s orthogonal to w can be written as ~ 2:7 w. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2.1. .3.2(iv) and 6. any linear combination ofY's is also a linear combination of U's.13) (iv) lfp = r. .1. i=l. We have Theorem 6.Z/3I' : /3 E W}..14) (v) f3j is the UMVU estimate of (3j. <T) = .Z/3 I' . The projection Yo = 7l"(Y I L.2<T' IY . by (6. the MLE = LSE of /3 is unique and given by (6. ~ The maximum likelihood estimate .1.. .14) follows. ZTZ is nonsingular.3 maximizes 1 log p(y.4.n.1.1 and Section 2.' and by Theorems 6. = Z/3 and (6. To show f3 = (ZTZ)lZTy.J) of a point y Yo =argmin{lyW: tEw}.3 because Ii = l:~ 1 ViUi and Y .1.. ZT (Y .1. /3.3 E RP. and  Jii is the UMVU estimate of J.1. then /3 is identifiable.L = {s ERn: ST(Z/3) = Oforall/3 E RP}.. (i) is clear because Z..1.Section 6.ti. which implies ZT s = 0 for all s E w.. fj = (ZTZ)lZTji. IY .jil' /(n r) (6..Ii = .L. To show (iv)..p.1. spans w... 0 ~ .ji) = 0 and the second equality in (6.1.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6. equivalently. ) 2 or.3 of. (ii) and (iii) are also clear from Theorem r+l VjUj .2. Thus.1.' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6.1. It follows that /3T (ZT s) = 0 for all/3 E RP. = Z T Z/3 and ZTji = ZTZfj and.12) implies ZT.n log(21f<T .. and /3 = (ZTZll ZT 1".3(iv).. That is. j = 1. /3 = (ZTZ)l ZT.. f3j and Jii are UMVU because. note that the space w. In the Gaussian linear model (i) jl is the unique projection oiY on L<.
3) is to be taken. . 1 < i < n. whenp = r. (12 (ZTz) 1 ).' € = Y . and I. n}. the residuals €. and the residual € are independent.1.1.1. =H .H)). The residuals are the projection of Y on the orthocomplement of wand ·i . The goodness of the fit is measured by the residual sum of squares (RSS) IY ..1. That is.ii is called the residual from this fit.14) and the normal equations (Z T Z)f3 = Z~Y. ~ ~ ·.12) and (6. H T 2 =H. The estimate fi = Z{3 of It is called thefitted value and € = y . : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w.3) that if J = J nxn is the identity matrix. Suppose we are given a value of the covariate z at which a value Y following the linear model (6. .1 we give an alternative derivation of (6.1. (ii) (iii) y ~ N(J. . = [y. 1 ~ ~ ! I ~ ~ ~ i . In statistics it is also called the hat matrix because it "puts the hat on Y.1.16) J I I CoroUary 6. • .15) Next note that the residuals can be written as ~ I .({3t + {3. In this method of "prediction" of Yi. . . Var(€) = (12(J .4.)I are the vertical distances from the points to the fitted line. € ~ N(o.5. I. We can now conclude the following.1.1. . Y = il. By Theorem 1. i = 1. • ~ ~ Y=HY where i• 1 . see also Section RIO.H). (12(J . 1 < i < n. the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. (6. There the points Pi = {31 + fhzi.H)Y. the ith component of the fitted value iJ. i = 1.. then (6.2 with P = 2. w~ obtain Pi = Zi(3.372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2.I" (12H).2.2 illustrates this tenninology in the context of Example 6.'.. moreover. n lie on the regression line fitted to the data {('" y. i . L7 1 ." As a projection matrix H is necessarily symmetric and idempotent. Note that by (6. it is commOn to write Yi for 'j1i. 1 · ..1.). I ..14).1. (iv) ifp = r.ill 2 = 1 q. (j ~ N(f3. Taking z = Zi. . In the Gaussian linear model (i) the fitted values Y = iJ..1. Example 2.Y = (J . . H I I It follows from this and (B.
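The formulas of this subsection, the least squares estimate β̂ = (Z^T Z)^{-1} Z^T Y, the hat matrix H = Z(Z^T Z)^{-1} Z^T, the fitted values, the residuals, and s², translate directly into code. A minimal sketch under the assumption of a full-rank design (numpy assumed; data simulated as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
Y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 1.5, n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ Y          # least squares = maximum likelihood estimate
H = Z @ ZtZ_inv @ Z.T                 # hat matrix: symmetric, idempotent, projects onto omega
mu_hat = H @ Y                        # fitted values
resid = Y - mu_hat                    # residuals, orthogonal to the columns of Z
s2 = resid @ resid / (n - p)          # unbiased estimate of sigma^2 (here r = p)
se = np.sqrt(s2 * np.diag(ZtZ_inv))   # standard errors of the beta_hat_j

assert np.allclose(Z.T @ resid, 0.0, atol=1e-8)   # normal equations: Z^T (Y - Z beta_hat) = 0
```

In practice one would use np.linalg.lstsq or a QR-based solver rather than forming (Z^T Z)^{-1} explicitly, but the explicit inverse makes the correspondence with the formulas above transparent.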
An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El.1. ! j ! I I .al. n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6.p}.. in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider. Thus. Next consider the psample model of Example 1. 1 and the estimates of f3 and J. and the matrix IIziillnx3 with Zil = 1 has rank 3. . Now we would test H : 132 = 0 versus K : 132 =I O. Zi3 is the age of the ith patient. • .1.. ." Now. {JL : Jti = 131 + f33zi3. However. the mean vector is an element of the space {JL : J. in the context of Example 6.JL): JL E wo} ~ ! . j 1 • 1 . The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w. q < r.1.. a regression equation of the form mean response ~ .374 Inference in the Multiparameter Case Chapter 6 Remark 6.zT . " . which together with a 2 specifies the model.1. which is a onedimensional subspace of Rn. We first consider the a 2 known case and consider the likelihood ratio statistic . For instance. i = 1. .\(y) = sup{p(Y.En in (6..3 Tests and Confidence Intervals . 1 < . .4.17) I. see Koenker and D'Orey (1987) and Portnoy and Koenker (1997). i = 1.17). under H. .2.1. whereas for the full model JL is in a pdimensional subspace of Rn. the LADEs are obtained fairly quickly by modem computing methods.JL): JL E w} sup{p(Y. see Problems 1..6.1) have the Laplace distribution with density . j where Zi2 is the dose level of the drug given the ith patient. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H.1..31. The LSEs are preferred because of ease of computation and their geometric properties.. = {3p = 13 for some 13 E R versus K: "the f3's are not all equal.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi . .i 1 6. The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = . In general..3 with 13k representing the mean resJXlnse for the kth population. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986). . For more on LADEs. under H. . (6.1.2.Li = 13 E R.7 and 2. . • = 131 + f32Zi2 + f33Zi3. . j. I. .
\(Y) Proof We only need to establish the second equality in (6. ..I8) then.'" (}r)T (see Problem B.\(Y) has a X.q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l.3. .20) i=q+I 210g.1. .iio 2 1 } where i1 and flo are the projections of Yon wand wo.21 ) where fLo is the projection of fL on woo In particular.. 2 log . . We have shown the following.iii' .q( (}2) distribution with 0 2 =a 2 ' \ '1]i L.21)... In the Gaussian linear model with 17 2 known. v~' such that VI. But if we let A nx n be an orthogonal matrix with rows vi. when H holds.\(Y) = L i=q+l r (ud a )'. span Wo and VI. then r. . respectively. Proposition 6..l2). (}r. o . by Theorem 6. V r span wand set ..1. .1.\(Y) ~ exp {  2t21Y .\(Y) It follows that = exp 1 2172 L '\' Ui' (6.l9) then.2(v).Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6.1. V q (6. Write 'Ii is as defined in (6.. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .. We write X. i=q+l r = a 21 fL  Ito [2 (6.1. 2 log .1.1. = IJL i=q+l r JLol'.q {(}2) for this distribution.19)._q' = AJL where A L 'I.IY ..1. .J X.1. Note that (uda) has a N(Oi.1. by Theorem 6.1.. 1) distribution with OJ = 'Ida.
22) Because T = {n .2. Thus.1.1 that 021it .1. T is an increasing function of A{Y) and the two test statistics are equivalent. .m{O') for this distrihution where k = r . Remark 6.1. as measured by the residual sum of squares under the model specified by H._q(02) distribution with 0' = .q and n . T has the (central) Frq. when H I lwlds. J • T = (noncentral X:. T = n . In the Gaussian linear model the F statistic defined by (6. Substituting j). By the canonicaL representation (6. In Proposition 6.1./Lol'.1../L.1.') denotes the righthand side of (6.E Wo for K : Jt E W .2 is known" is replaced by '.q)'{[A{Y)J'/n .L a =  n I' and ..14)..18). E In this case.r IY .r.376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown. We have seen in PmIXlsition 6.• j . which is equivalent to the likelihood ratio statistic for H : Ii. I .liol' (n _ r) 'IY _ iii' .r degrees affreedam (see Prohlem B.JtoI 2 ..max{p{y.'IY _ iii' = L~ r+l (Ui /0)2. T has the representation .1. IIY . we can write .iT'): /L E w} .iT') : /L E wo} (6.r){r .2. 8 2 . /L. {io. In particular.1./to I' ' n J respectively. which has a X~r distribution and is independent of u.'I/L. ]t consists of rejecting H when the fit.1..nr distribution. T is called the F statistic for the general linear hypothesis. .23) .JLo. .L n where p{y.0.'}' YJ.wo.22). We have shown the following..2 1JL .~I. and 86 into the likelihood ratio statistic.1.1 that the MLEs of a 2 for It E wand It E wQ are . For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.19).itol 2 have a X.(Y). it can be shown (Problem 6.liol' IY r _q IY _ iii' iii' (r . .a = ~(Y'E. (6. statistic :\(y) _ max{p{y. has the noncentral F distribution F r _q. ..2 is the same under Hand K and estimated by the MLE 0:2 for /.I. The resulting test is intuitive.q)'Ili .5) that if we introduce the variance equal likelihood ratio w'" . is poor compared to the fit under the general model.q and m = n .q variable)/df (central X~T variahle)/df with the numerator and denominator independent. We know from Problem 6. Proposition 6. We write :h.21ii.3.itol 2 = L~ q+ 1 (Ui /0) 2.n_r(02) where 0 2 = u. we obtain o A(Y) = P Y..I}./L. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r .1 suppose the assumption "0.l) {Ir .1. = aD IIY .J.a:.
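Because |µ̂ − µ̂_0|² = |Y − µ̂_0|² − |Y − µ̂|² by the Pythagorean identity, the F statistic (6.1.22) can be computed from the residual sums of squares of the two nested least squares fits. A sketch of this computation (not from the text; numpy and scipy assumed, with ω and ω_0 given as column spaces of design matrices):

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(Z, Y):
    """Residual sum of squares |Y - mu_hat|^2 after projecting Y on the column space of Z."""
    beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    e = Y - Z @ beta_hat
    return e @ e

def f_test(Z_full, Z_null, Y):
    """F statistic (6.1.22) and p-value for H: mu in omega_0 against the full model omega."""
    n = len(Y)
    r = np.linalg.matrix_rank(Z_full)      # dim(omega)
    q = np.linalg.matrix_rank(Z_null)      # dim(omega_0)
    rss_full, rss_null = rss(Z_full, Y), rss(Z_null, Y)
    T = ((rss_null - rss_full) / (r - q)) / (rss_full / (n - r))
    return T, f_dist.sf(T, r - q, n - r)   # reject H for large T
```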
r)ln central X~_cln where T is the F statistic (6. q = 0. Example 6.\(Y) = .24) Remark 6. 0 .Y)2' which we recognize as t 2 / n.3.1. Y3 y Iy .Section 6. The projections il and ilo of Yon w and Wo. (6. See Pigure 6. where t is the onesample Student t statistic of Section 4.1. One Sample (continued). r = 1 and T = 1'0 versus K : (3 'f 1'0. The canonical representation (6. Yl = Y2 Figure 6.q noncentral X? q 2Iog.iLol'.1 and Section B.IO. We next return to our examples.1'0 I' y. and the Pythagorean identity.9.1 Inference for Gaussian Linear Models ~ 377 then >.iLol' = IY .(Y) equals the likelihood ratio statistic for the 0'2.1. In this = (Y 1'0)2 (nl) lE(Y.iLl' + liL . We test H : (31 case wo = {I'o}. It follows that ~ (J2 known case with rT 2 replaced by r.1.1.22).1.19) made it possible to recognize the identity IY .1.iLl' ~++Yl I' I' 1 .2.1. This is the Pythagorean identity.1.25) which we exploited in the preceding derivations.~T = (n . (6.1.
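For the one-sample problem of this example, the general F statistic reduces to the square of the one-sample Student t statistic; a small numerical check of this fact (numpy assumed, simulated data, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(1.0, 2.0, size=12)
mu0, n = 0.0, len(Y)

# F statistic for H: mu = mu0 * 1 within omega = span{(1,...,1)}: r - q = 1, n - r = n - 1.
rss_full = np.sum((Y - Y.mean()) ** 2)   # |Y - mu_hat|^2
rss_null = np.sum((Y - mu0) ** 2)        # |Y - mu_0|^2; the difference is n * (Ybar - mu0)^2
T = (rss_null - rss_full) / (rss_full / (n - 1))

t = np.sqrt(n) * (Y.mean() - mu0) / Y.std(ddof=1)   # one-sample Student t statistic
assert np.isclose(T, t ** 2)
```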
{3p are Y1. To formulate this question. age... _y.27) 1 .  I 1.1. we partition the design matrix Z by writing it as Z = (ZI. P . Now the linear model can be written as i i I I ! We test H : f32 (6.1. .ito 1 are the residual sums of squares under the full model and H.dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY .ZfZl(ZiZl)'ziz. .q covariates in multiple regression have an effect after fitting the first q. In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2.np distribution.}f32.1.)2= Lnk(Yk.q covariates does not affect the mean response.Yk.1.1.1. we want to test H : {31 = .3.q) x 1 vector of main (e. • 1 F = (RSSlf .1. We consider the possibility that a subset of p .22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i . j ! j • litPol2 = LL(Yk _y. In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6.26) 0 versus K : f3 2 i' O. .. . 1 02 = (J2(p ..n_p(fP) distribution with noncentrality parameter (Problem 6.q).g. I • I b . and we partition {3 as f3T = (f3f.=11=1 k=1 •• Substituting in (6.)2' .P . Regression (continued). treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e.1.Q)f3f(ZfZ2)f32.y)2 k . L~"(Yk' ... economic status) coefficients.q)1f3nZfZ2 . (6.RSSF)/(dflf . 02 simplifies to (J2(p . Using (6. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62. . .2.1 L~=. ..22) we obtain the F statistic for the hypothesis H in the oneway layout . respectivelY. Under the alternative F has a noncentral Fp_q... The One. Z2) where Z1 is n x q and Z2 is 11 X (p .vy.' • ito = Thus.: I :I T _ n ~ p L~l nk(Y . However..26) and H. Recall that the least squares estimates of {31. '1 Yp. respectively. Under H all the observations have the same mean so that. = {3p.... 7) iW l.n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6.. Without loss of generality we ask whether the last p .)2.Way Layout (continued). which only depends on the second set of variables and coefficients. anddfF = np and dfH = nq are the corresponding degrees of freedom. •.. I3I) where {32 is a (p . As we indicated earlier.g. The F test rejects H if F is large when compared to the ath quantile of the Fpq.1. 0 I j Example 6. I • ! . P nk (Y.
1) degrees offreedom as "coming" from SS8/a' and the remaining (n . the between groups (or treatment) sum of squares and SSw. sum of squares in the denominator.np distribution with noncentrality parameter (6. Ypl .fLo 12 for the vector Jt (rh. Because 88B/0" and SSw / a' are independent X' variables with (p .1. If the fJ. is a measure of variation between the p samples YII . consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I.3.1) degrees of freedom and noncentrality parameter 0'. n. which is their ratio. and the F statistic.)'.1. (31. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'. . . This information as well as S8B/(P . k=l 1=1 measures variation within the samples. are not all equal.p).. . The sum of squares in the numerator. identifying 0' and (p . compute a. . k=1 1=1 which measures the variability of the pooled samples. .Section 6. As an illustration.28) where j3 = n I :2:=~"" 1 nifJi. (6.)' .np distribution. the within groups (or residual) sum of squares.. Note that this implies the possibly unrealistic .p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'..2 1M .. . 888 =L k=I P nk(Yk.1.. respectively. Yi nl . We assume the oneway layout is valid.1.25) 88T = 888 + 88w . and III with I being the "high" end. SSB.Y.29) Thus. [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y. (3" .. (3p. are often summarized in what is known as an analysis a/variance (ANOVA) table. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic. . . . T has a noncentral Fp1.1 and 6. SST. . To derive IF. .p) degrees of freedom. (3p)T and its projection 1'0 = (ij. ..1) and (n . we have a decomposition of the variability of the whole set of data. ...1. then by the Pythagorean identity (6.fJ.1 Inference for Gaussian Linear Models 379 When H holds. we see that the decomposition (6... the unbiased estimates of 02 and a 2 . into two constituent components.. T has a Fpl.···.1.1) and 8Sw /(n .. the total sum of squares.. SST/a 2 is a (noncentral) X2 variable with (n .30) can also be viewed stochastically.. See Tables 6.
TABLE 6.1.1. ANOVA table for the one-way layout

  Source            Sum of squares                                     d.f.    Mean squares           F-value
  Between samples   SS_B = Σ_{k=1}^p n_k (Ȳ_k· − Ȳ··)²                 p − 1   MS_B = SS_B/(p − 1)    MS_B / MS_W
  Within samples    SS_W = Σ_{k=1}^p Σ_{l=1}^{n_k} (Y_kl − Ȳ_k·)²      n − p   MS_W = SS_W/(n − p)
  Total             SS_T = Σ_{k=1}^p Σ_{l=1}^{n_k} (Y_kl − Ȳ··)²       n − 1
TABLE 6.1.2. Blood cholesterol levels

  I     403  311  269  336  259
  II    312  222  302  420  420  386  353  210  286  290
  III   403  244  353  235  319  260
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol levels of the three groups. Here p = 3, n_1 = 5, n_2 = 10, n_3 = 6, n = 21, and we compute
TABLE 6.1.3. ANOVA table for the cholesterol data

                    SS         d.f.   MS        F-value
  Between groups    1,202.5    2      601.2     0.126
  Within groups     85,750.5   18     4,763.9
  Total             86,953.0   20
From F tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. □
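The entries of Table 6.1.3 can be reproduced directly from the definitions of SS_B and SS_W. The sketch below (numpy assumed) uses the group assignment of Table 6.1.2 as reconstructed above; the computed sums of squares agree with the table up to rounding, and the F-value comes out 0.126.

```python
import numpy as np

groups = [
    np.array([403, 311, 269, 336, 259], float),                           # group I
    np.array([312, 222, 302, 420, 420, 386, 353, 210, 286, 290], float),  # group II
    np.array([403, 244, 353, 235, 319, 260], float),                      # group III
]
n, p = sum(len(g) for g in groups), len(groups)
grand_mean = np.concatenate(groups).mean()

ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group sum of squares
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group sum of squares
F = (ss_b / (p - 1)) / (ss_w / (n - p))
print(ss_b, ss_w, round(F, 3))   # approximately 1202, 85751, 0.126
```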
Confidence Intervals and Regions

We next use our distributional results and the method of pivots to find confidence intervals for µ_i, 1 ≤ i ≤ n, β_j, 1 ≤ j ≤ p, and, in general, any linear combination

    ψ = ψ(µ) = Σ_{i=1}^n a_i µ_i = a^T µ

of the µ's. If we set ψ̂ = Σ_{i=1}^n a_i µ̂_i = a^T µ̂ and

    σ²(ψ̂) = Var(ψ̂) = σ² a^T H a,

where H is the hat matrix, then (ψ̂ − ψ)/σ(ψ̂) has a N(0, 1) distribution. Moreover,

    (n − r)s²/σ² = |Y − µ̂|²/σ² = Σ_{i=r+1}^n (U_i/σ)²

has a χ²_{n−r} distribution and is independent of ψ̂. Let

    σ̂(ψ̂) = s (a^T H a)^{1/2}

be an estimate of the standard deviation σ(ψ̂) of ψ̂. This estimated standard deviation is called the standard error of ψ̂. By referring to the definition of the t distribution, we find that the pivot

    T(ψ) = (ψ̂ − ψ)/σ̂(ψ̂)

has a T_{n−r} distribution. Let t_{n−r}(1 − ½α) denote the 1 − ½α quantile of the T_{n−r} distribution; then by solving |T(ψ)| ≤ t_{n−r}(1 − ½α) for ψ, we find that

    ψ = ψ̂ ± t_{n−r}(1 − ½α) σ̂(ψ̂)

is, in the Gaussian linear model, a 100(1 − α)% confidence interval for ψ.

Example 6.1.1. One Sample (continued). Consider ψ = µ. We obtain the interval

    µ = Ȳ ± t_{n−1}(1 − ½α) s/√n,

which is the same as the interval of Example 4.4.1 and Section 4.9.2.

Example 6.1.2. Regression (continued). Assume that p = r. First consider ψ = β_j for some specified regression coefficient β_j. The 100(1 − α)% confidence interval for β_j is

    β_j = β̂_j ± t_{n−p}(1 − ½α) s {[(Z^T Z)^{-1}]_jj}^{1/2}
where [(Z^T Z)^{-1}]_{jj} is the jth diagonal element of (Z^T Z)^{-1}. Computer software computes (Z^T Z)^{-1} and labels s{[(Z^T Z)^{-1}]_{jj}}^{1/2} as the standard error of the (estimated) jth regression coefficient.

Next consider ψ = μ_i = mean response for the ith case, 1 ≤ i ≤ n. The level (1 − α) confidence interval is

μ_i = μ̂_i ± t_{n−p}(1 − ½α) s√h_ii

where h_ii is the ith diagonal element of the hat matrix H. Here s√h_ii is called the standard error of the (estimated) mean of the ith case.

Next consider the special case in which p = 2 and

Y_i = β_1 + β_2 z_i2 + ε_i,  i = 1, ..., n.

If we use the identity

Σ_{i=1}^n (z_i2 − z̄_·2)(Y_i − Ȳ) = Σ_{i=1}^n (z_i2 − z̄_·2)Y_i,

we obtain from Example 2.2.2 that

β̂_2 = Σ_{i=1}^n (z_i2 − z̄_·2)Y_i / Σ_{i=1}^n (z_i2 − z̄_·2)².     (6.1.30)

Because Var(Y_i) = σ², we obtain

Var(β̂_2) = σ² / Σ_{i=1}^n (z_i2 − z̄_·2)²,

and the 100(1 − α)% confidence interval for β_2 has the form

β_2 = β̂_2 ± t_{n−p}(1 − ½α) s / [Σ_{i=1}^n (z_i2 − z̄_·2)²]^{1/2}.

The confidence interval for β_1 is given in Problem 6.1.10.
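A minimal computational sketch of these intervals (Python with numpy/scipy; the function name and the full-rank assumption on Z are ours, not the text's): given the design matrix Z and response vector y, it returns the intervals β̂_j ± t_{n−p}(1 − ½α) s{[(Z^T Z)^{-1}]_{jj}}^{1/2}.

```python
import numpy as np
from scipy.stats import t

def coef_confidence_intervals(Z, y, alpha=0.05):
    """100(1 - alpha)% intervals beta_j +- t_{n-p}(1 - alpha/2) * SE(beta_j)."""
    n, p = Z.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ y                  # least squares / MLE
    resid = y - Z @ beta_hat
    s2 = resid @ resid / (n - p)                  # s^2 = |Y - mu_hat|^2 / (n - p)
    se = np.sqrt(s2 * np.diag(ZtZ_inv))           # standard errors of the beta_hat_j
    q = t.ppf(1 - alpha / 2, n - p)
    return np.column_stack([beta_hat - q * se, beta_hat + q * se])
```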
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute

h_ii = 1/n + (z_i2 − z̄_·2)² / Σ_{i=1}^n (z_i2 − z̄_·2)²,

and the confidence interval for the mean response μ_i of the ith case has a simple explicit form.
Example 6.1.3. One-Way Layout (continued). We consider ψ = β_k, 1 ≤ k ≤ p. Because β̂_k = Ȳ_k· ~ N(β_k, σ²/n_k), we find the 100(1 − α)% confidence interval

β_k = Ȳ_k· ± t_{n−p}(1 − ½α) s/√n_k

where s² = SSW/(n − p). The intervals for μ = β̄· and the incremental effect δ_k = β_k − μ are given in Problem 6.1.11. □
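A sketch of the corresponding computation (Python; the list-of-samples input format is an assumption of ours): given the p samples, it returns the intervals β_k = Ȳ_k· ± t_{n−p}(1 − ½α) s/√n_k with s² = SSW/(n − p).

```python
import numpy as np
from scipy.stats import t

def group_mean_intervals(groups, alpha=0.05):
    """One-way layout: 100(1-alpha)% intervals for each group mean beta_k."""
    p = len(groups)
    n = sum(len(g) for g in groups)
    ssw = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    s = np.sqrt(ssw / (n - p))                     # pooled estimate of sigma
    q = t.ppf(1 - alpha / 2, n - p)
    return [(np.mean(g) - q * s / np.sqrt(len(g)),
             np.mean(g) + q * s / np.sqrt(len(g))) for g in groups]
```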
Joint Confidence Regions
We have seen how to find confidence intervals for each individual β_j, 1 ≤ j ≤ p. We next consider the problem of finding a confidence region C in R^p that covers the vector β with prescribed probability (1 − α). This can be done by inverting the likelihood ratio test or, equivalently, the F test. That is, we let C be the collection of β_0 that is accepted when the level (1 − α) F test is used to test H : β = β_0. Under H, μ = μ_0 = Zβ_0, and the numerator of the F statistic (6.1.22) is based on

|μ̂ − μ_0|² = |Zβ̂ − Zβ_0|² = (β̂ − β_0)^T (Z^T Z)(β̂ − β_0).

Thus, using (6.1.22), the simultaneous confidence region for β is the ellipse

C = {β_0 : (β̂ − β_0)^T (Z^T Z)(β̂ − β_0) / rs² ≤ f_{r,n−r}(1 − α)}     (6.1.31)

where f_{r,n−r}(1 − α) is the 1 − α quantile of the F_{r,n−r} distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) write Z = (Z_1, Z_2) and β^T = (β_1^T, β_2^T), where β_2 is a vector of main effect coefficients and β_1 is a vector of "nuisance" coefficients, β_1 being q × 1 and β_2 being (p − q) × 1. Similarly, we partition β̂ as β̂ = (β̂_1^T, β̂_2^T)^T. By Corollary 6.1.1, σ²(Z^T Z)^{-1} is the variance-covariance matrix of β̂. It follows that if we let S denote the lower right (p − q) × (p − q) corner of (Z^T Z)^{-1}, then σ²S is the variance-covariance matrix of β̂_2. Thus, a joint 100(1 − α)% confidence region for β_2 is the (p − q)-dimensional ellipse

C = {β_02 : (β̂_2 − β_02)^T S^{-1} (β̂_2 − β_02) / (p − q)s² ≤ f_{p−q,n−p}(1 − α)}.  □
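To illustrate how the ellipse (6.1.31) is used in practice, here is a small sketch (Python; illustrative only) that checks whether a candidate β_0 is covered by the joint region, taking r = p (Z of full rank).

```python
import numpy as np
from scipy.stats import f

def in_joint_confidence_region(Z, y, beta0, alpha=0.05):
    """Check whether beta0 lies in the ellipse (6.1.31), assuming Z has full rank p."""
    n, p = Z.shape
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    resid = y - Z @ beta_hat
    s2 = resid @ resid / (n - p)
    diff = beta_hat - beta0
    statistic = diff @ (Z.T @ Z) @ diff / (p * s2)
    return statistic <= f.ppf(1 - alpha, p, n - p)
```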
Summary. We consider the classical Gaussian linear model in which the response Y_i for the ith case in an experiment is expressed as a linear combination μ_i = Σ_{j=1}^p β_j z_ij of covariates plus an error ε_i, where ε_1, ..., ε_n are i.i.d. N(0, σ²). By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {β_j}, the response means {μ_i}, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: X_1, ..., X_n are i.i.d. P where P ∈ Q, a model containing P = {P_θ : θ ∈ Θ} such that

(i) Θ is an open subset of R^p;

(ii) the densities of P_θ are p(·, θ), θ ∈ Θ.
The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. Ψ = (ψ_1, ..., ψ_p)^T, where ψ_j = ∂ρ/∂θ_j is well defined, and

(1/n) Σ_{i=1}^n Ψ(X_i, θ̂_n) = 0.     (6.2.1)

A solution to (6.2.1) is called an estimating equation estimate or an M-estimate.
A1. The parameter θ(P) given by the solution of (the nonlinear system of p equations in p unknowns)

∫ Ψ(x, θ) dP(x) = 0     (6.2.2)

is well defined on Q, so that θ(P) is the unique solution of (6.2.2). Necessarily θ(P_θ) = θ because Q ⊃ P.

A2. E_P |Ψ(X_1, θ(P))|² < ∞, where |·| is the Euclidean norm.
A3. ψ_i(·, θ), 1 ≤ i ≤ p, have first-order partials with respect to all coordinates and, using the notation of Section B.8, the matrix

E_P DΨ(X_1, θ(P)) ≡ ||E_P (∂ψ_i/∂θ_j)(X_1, θ(P))||

is nonsingular.
A4. sup{ |(1/n) Σ_{i=1}^n [DΨ(X_i, t) − DΨ(X_i, θ(P))]| : |t − θ(P)| ≤ ε_n } → 0 in probability whenever ε_n → 0.

A5. θ̂_n → θ(P) in probability for all P ∈ Q.
Theorem 6.2.1. Under A0–A5 of this section,

θ̂_n = θ(P) + (1/n) Σ_{i=1}^n Ψ̃(X_i, θ(P)) + o_P(n^{-1/2})     (6.2.3)

where

Ψ̃(x, θ(P)) = −[E_P DΨ(X_1, θ(P))]^{-1} Ψ(x, θ(P)).     (6.2.4)
Hence,

√n(θ̂_n − θ(P)) → N(0, Σ(Ψ, P)) in law     (6.2.5)

where

Σ(Ψ, P) = J(θ, P) E_P[ΨΨ^T(X_1, θ(P))] J^T(θ, P)     (6.2.6)

and

J^{-1}(θ, P) = E_P DΨ(X_1, θ(P)) = ||E_P (∂ψ_i/∂θ_j)(X_1, θ(P))||.
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
−(1/n) Σ_{i=1}^n Ψ(X_i, θ(P)) = [(1/n) Σ_{i=1}^n DΨ(X_i, θ_n*)] (θ̂_n − θ(P)).     (6.2.7)
Note that the left-hand side of (6.2.7) is a p × 1 vector, the right is the product of a p × p matrix and a p × 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p × p matrices, when viewed as vectors, is an open subset of R^{p²}, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that, with probability tending to 1, (1/n) Σ_{i=1}^n DΨ(X_i, θ_n*) is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of θ̂_n is motivated by P, the behavior in (6.2.3) is guaranteed for P ∈ Q, which can include P ∉ P. In fact, typically Q is essentially the set of P's for which θ(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to:

A6. If l(·, θ) is differentiable,
E_θ DΨ(X_1, θ) = −E_θ[Ψ(X_1, θ) D^T l(X_1, θ)] = −Cov_θ(Ψ(X_1, θ), Dl(X_1, θ))     (6.2.8)

defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4′ and A6′ extend to the multivariate case readily.

Note that consistency of θ̂_n is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate θ_n* will find a solution θ̂_n of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
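The practical content of Theorem 6.2.1 and (6.2.6) is that Σ(Ψ, P) can be estimated by a "sandwich" plug-in, replacing the expectations by sample averages evaluated at θ̂_n. The sketch below (Python with scipy; the interface and the use of scipy.optimize.root started from a preliminary consistent estimate are our illustrative choices, not part of the text) implements this recipe.

```python
import numpy as np
from scipy.optimize import root

def m_estimate_and_sandwich(x, psi, dpsi, theta_start):
    """Solve (1/n) sum_i psi(x_i, theta) = 0 and estimate Sigma(Psi, P) as in (6.2.6).

    psi(x_i, theta)  -> p-vector; dpsi(x_i, theta) -> p x p matrix of partials.
    theta_start should be a preliminary (consistent) estimate.
    """
    n = len(x)
    eqn = lambda th: np.mean([psi(xi, th) for xi in x], axis=0)
    theta_hat = root(eqn, theta_start).x
    D = np.mean([dpsi(xi, theta_hat) for xi in x], axis=0)    # estimates E_P D Psi
    K = np.mean([np.outer(psi(xi, theta_hat), psi(xi, theta_hat)) for xi in x], axis=0)
    J = np.linalg.inv(D)                                      # estimates J(theta, P)
    Sigma_hat = J @ K @ J.T                                   # sandwich estimate of Sigma(Psi, P)
    return theta_hat, Sigma_hat / n                           # Var(theta_hat) is roughly Sigma/n
```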
6.2.2 Asymptotic Normality and Efficiency of the MLE

If we take ρ(x, θ) = −l(x, θ) = −log p(x, θ), so that Ψ(x, θ) = −Dl(x, θ), and ρ obeys A0–A6, then (6.2.8) becomes

−E_θ D²l(X_1, θ) = E_θ[Dl(X_1, θ) D^T l(X_1, θ)] = Var_θ Dl(X_1, θ)     (6.2.9)

where the common value is the Fisher information matrix I(θ) introduced in Section 3.4. If ρ : Θ → R, Θ ⊂ R^d, is a scalar function, the matrix ||∂²ρ/∂θ_i ∂θ_j (θ)|| is known as the Hessian or curvature matrix of the surface ρ. Thus, (6.2.9) states that the expected value of the Hessian of l is the negative of the Fisher information.

We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If A0–A6 hold for ρ(x, θ) = −log p(x, θ), then the MLE θ̂_n satisfies

θ̂_n = θ + (1/n) Σ_{i=1}^n I^{-1}(θ) Dl(X_i, θ) + o_P(n^{-1/2})     (6.2.10)

so that

√n(θ̂_n − θ) → N(0, I^{-1}(θ)) in law.     (6.2.11)

If θ̃_n is a minimum contrast estimate with ρ and Ψ satisfying A0–A6 and corresponding asymptotic variance matrix Σ(Ψ, P_θ), then

Σ(Ψ, P_θ) ≥ I^{-1}(θ)     (6.2.12)

in the sense of Theorem 3.4.4, with equality in (6.2.12) for θ = θ_0 iff, under θ_0,

θ̃_n = θ̂_n + o_P(n^{-1/2}).     (6.2.13)
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8)

Σ(Ψ, P_θ) = Cov_θ^{-1}(U, V) Var_θ(U) Cov_θ^{-1}(V, U)     (6.2.14)

where U ≡ Ψ(X_1, θ), V ≡ Dl(X_1, θ). But by (B.10.8), for any U, V with Var((U^T, V^T)^T) nonsingular,

Var(V) ≥ Cov(U, V) Var^{-1}(U) Cov(V, U).     (6.2.15)

Taking inverses of both sides yields

I^{-1}(θ) = Var_θ^{-1}(V) ≤ Σ(Ψ, θ).     (6.2.16)
Equality holds in (6.2.15) by (B.10.2.3) iff, for some b = b(θ),

U = b + Cov(U, V) Var^{-1}(V) V

with probability 1. This means, in view of E_θ Ψ = E_θ Dl = 0, that

Ψ(X_1, θ) = b(θ) Dl(X_1, θ).     (6.2.17)

In the case of identity in (6.2.16) we must have

−[E_θ DΨ(X_1, θ)]^{-1} Ψ(X_1, θ) = I^{-1}(θ) Dl(X_1, θ).     (6.2.18)

Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. □
We see that, by the theorem, the MLE is efficient in the sense that for any p × 1 vector a, a^T θ̂_n has asymptotic bias o(n^{-1/2}) and asymptotic variance n^{-1} a^T I^{-1}(θ) a, which is no larger than that of any competing minimum contrast estimate. Further, any competitor θ̃_n such that a^T θ̃_n has the same asymptotic behavior as a^T θ̂_n for all a in fact agrees with θ̂_n to order n^{-1/2}.

A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let X_i = (Z_i^T, Y_i)^T, 1 ≤ i ≤ n, be i.i.d. as X = (Z^T, Y)^T where Z is a p × 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:

(i) Y = α + Z^T β + ε     (6.2.19)

where ε is distributed as N(0, σ²) independent of Z and E(Z) = 0. That is, given Z, Y has a N(α + Z^T β, σ²) distribution.

(ii) The distribution H_0 of Z is known with density h_0 and E(ZZ^T) is nonsingular.

The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of β is given by (with probability 1)

β̂ = [Z_(n)^T Z_(n)]^{-1} Z_(n)^T Y.     (6.2.20)
Here Z_(n) is the n × p matrix ||Z_ij − Z̄_·j||, where Z̄_·j = (1/n) Σ_{i=1}^n Z_ij. We use the subscript (n) to distinguish the use of Z as a vector in this section from its use as a matrix in Section 6.1. In the present context, Z_(n) is referred to as the random design matrix, and this example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of α and σ² are

α̂ = Ȳ − Σ_{j=1}^p Z̄_·j β̂_j,   σ̂² = (1/n) |Y − (α̂ + Z_(n)β̂)|².     (6.2.21)
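A compact sketch of (6.2.20) and (6.2.21) (Python; the function and argument names are ours), with the rows of Z_rows holding the observed covariate vectors Z_i:

```python
import numpy as np

def stochastic_covariate_mles(Z_rows, y):
    """MLEs (6.2.20)-(6.2.21): beta_hat from the centered design, then alpha_hat, sigma2_hat."""
    Zbar = Z_rows.mean(axis=0)
    Zc = Z_rows - Zbar                                   # centered random design matrix Z_(n)
    beta_hat = np.linalg.solve(Zc.T @ Zc, Zc.T @ y)      # (6.2.20)
    alpha_hat = y.mean() - Zbar @ beta_hat
    resid = y - alpha_hat - Z_rows @ beta_hat            # Y_i - (alpha_hat + Z_i^T beta_hat)
    sigma2_hat = np.mean(resid ** 2)                     # (6.2.21)
    return alpha_hat, beta_hat, sigma2_hat
```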
Note that although given Z_1, ..., Z_n, β̂ is Gaussian, this is not true of the marginal distribution of β̂. It is not hard to show that A0–A6 hold in this case because if H_0 has density h_0 and if θ denotes (α, β^T, σ²)^T, then, writing ε = Y − (α + Z^T β),

l(X, θ) = −(1/2σ²) ε² − (1/2)(log σ² + log 2π) + log h_0(Z),

Dl(X, θ) = (ε/σ², ε Z^T/σ², (ε² − σ²)/2σ⁴)^T,     (6.2.22)

and

I(θ) = ( 1/σ²    0              0
          0       E(ZZ^T)/σ²     0
          0       0              1/2σ⁴ )     (6.2.23)

so that by Theorem 6.2.2

L(√n(α̂ − α, β̂ − β, σ̂² − σ²)) → N(0, diag(σ², σ²[E(ZZ^T)]^{-1}, 2σ⁴)).     (6.2.24)
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction that H_0 is known plays no role in the limiting result for α̂, β̂, σ̂². Of course, these will only be the MLEs if H_0 depends only on parameters other than (α, β, σ²). In this case we can estimate E(ZZ^T) by (1/n) Σ_{i=1}^n Z_i Z_i^T and give approximate confidence intervals for β_j, j = 1, ..., p.

An interesting feature of (6.2.23) is that because I(θ) is a block diagonal matrix so is I^{-1}(θ) and, consequently, β̂ and σ̂² are asymptotically independent. In the classical linear model of Section 6.1, where we perform inference conditionally given Z_i = z_i, 1 ≤ i ≤ n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew σ², the MLE of β would still be β̂ and its asymptotic variance optimal for this model. If we knew α and β, σ̂² would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, σ̂² would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = {P_(θ,η) : θ ∈ Θ, η ∈ E}
we say we can estimate θ adaptively at η_0 if the asymptotic variance of the MLE θ̂ (or, more generally, an efficient estimate of θ) in the pair (θ̂, η̂) is the same as that of θ̂(η_0), the efficient estimate for P_{η_0} = {P_(θ,η_0) : θ ∈ Θ}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating β_1 in the presence of α, (β_2, ..., β_p) with

(i) α, β_2, ..., β_p known;

(ii) β arbitrary.

In case (i), we take, without loss of generality, α = β_2 = ··· = β_p = 0. Let Z_i = (Z_i1, ..., Z_ip)^T; then the efficient estimate in case (i) is

β̃_1 = Σ_{i=1}^n Z_i1 Y_i / Σ_{i=1}^n Z_i1²     (6.2.25)
with asymptotic variance σ²[EZ_1²]^{-1}. On the other hand, β̂_1 is the first coordinate of β̂ given by (6.2.20). Its asymptotic variance is the (1,1) element of σ²[EZZ^T]^{-1}, which is strictly bigger than σ²[EZ_1²]^{-1} unless [EZZ^T]^{-1} is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate β_1 adaptively if β_2, ..., β_p are regarded as nuisance parameters.

What is happening can be seen by a representation of [Z_(n)^T Z_(n)]^{-1} Z_(n)^T Y and [I^{-1}(θ)]_11, where I^{-1}(θ) = ||I^{jk}(θ)||. We claim that

β̂_1 = Σ_{i=1}^n (Z_i1 − Ẑ_i1^(1)) Y_i / Σ_{i=1}^n (Z_i1 − Ẑ_i1^(1))²     (6.2.26)

where Ẑ^(1) is the regression of (Z_11, ..., Z_n1)^T on the linear space spanned by (Z_1j, ..., Z_nj)^T, 2 ≤ j ≤ p. Similarly,

I^{11}(θ) = σ² / E(Z_11 − Π(Z_11 | Z_21, ..., Z_p1))²     (6.2.27)

where Π(Z_11 | Z_21, ..., Z_p1) is the projection of Z_11 on the linear span of Z_21, ..., Z_p1 (Problem 6.2.11). Thus, Π(Z_11 | Z_21, ..., Z_p1) = Σ_{j=2}^p a_j* Z_j1 where (a_2*, ..., a_p*) minimizes E(Z_11 − Σ_{j=2}^p a_j Z_j1)² over (a_2, ..., a_p) ∈ R^{p−1} (see Sections 1.4 and B.10).

What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing β_2, ..., β_p when the variables Z_2, ..., Z_p are in any way correlated with Z_1, and the price is measured by

[E(Z_11 − Π(Z_11 | Z_21, ..., Z_p1))²]^{-1} = (1 − E(Π(Z_11 | Z_21, ..., Z_p1))²/E(Z_11²))^{-1} [E(Z_11²)]^{-1}.     (6.2.28)

In the extreme case of perfect collinearity the price is ∞, as it should be, because β_1 then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z_2, ..., Z_p) have no value in predicting Z_1 linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the Z_i, i = 1, ..., n, β̂_1 is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Z_11 − Π(Z_11 | Z_21, ..., Z_p1))² = 0. □
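A small numerical illustration of the "price" in (6.2.28) (Python; the particular covariance matrix is a hypothetical choice of ours): with p = 2, unit variances, and correlation ρ between Z_1 and Z_2, the asymptotic variance of β̂_1 is inflated by the factor 1/(1 − ρ²) relative to the case where β_2 is known.

```python
import numpy as np

rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])                  # assumed E(Z Z^T); sigma^2 = 1

var_known = 1.0 / Sigma[0, 0]                   # sigma^2 [E Z_1^2]^{-1}, case (i)
var_unknown = np.linalg.inv(Sigma)[0, 0]        # (1,1) entry of sigma^2 [E Z Z^T]^{-1}, case (ii)

# The ratio equals (1 - E Pi^2 / E Z_1^2)^{-1} = 1/(1 - rho^2); here about 2.78
print(var_known, var_unknown, var_unknown / var_known)
```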
Example 6.2.2. M-Estimates Generated by Linear Models with General Error Structure. Suppose that the ε_i in (6.2.19) are i.i.d. but not necessarily Gaussian with density (1/σ) f_0(·/σ), for instance,

f_0(x) = e^x / (1 + e^x)²,

the logistic density. Such error densities have the often more realistic, heavier tails^(1) than the Gaussian density. The estimates β̂_0, σ̂_0 now solve

Σ_{i=1}^n ψ((Y_i − Z_i^T β̂_0)/σ̂_0) Z_i = 0

and
Σ_{i=1}^n χ((Y_i − Z_i^T β̂_0)/σ̂_0) = 0,

where ψ = −f_0′/f_0, χ(y) = −[y f_0′(y)/f_0(y) + 1], and β_0 = (β_10, ..., β_p0)^T. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) log f_0 is strictly concave, i.e., f_0′/f_0 is strictly decreasing;

(ii) (log f_0)″ exists and is bounded.

Then, if further f_0 is symmetric about 0,

I(θ) = σ^{-2} I((β^T, 1)^T) = σ^{-2} diag(c_1 E(ZZ^T), c_2)     (6.2.29)

where

c_1 = ∫ (f_0′(x)/f_0(x))² f_0(x) dx,   c_2 = ∫ (x f_0′(x)/f_0(x) + 1)² f_0(x) dx.

Thus, β̂_0, σ̂_0 are optimal estimates of β and σ in the sense of Theorem 6.2.2 if f_0 is true.

Now suppose the f_0 generating the estimates β̂_0 and σ̂_0 is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from f_0. Under suitable conditions we can apply Theorem 6.2.1 with
Ψ = (ψ_1, ..., ψ_p, ψ_{p+1})^T, where

ψ_j(z, y, β, σ) = ψ((y − Σ_{k=1}^p z_k β_k)/σ) z_j,  1 ≤ j ≤ p,     (6.2.30)

ψ_{p+1}(z, y, β, σ) = χ((y − Σ_{k=1}^p z_k β_k)/σ),

to conclude that

L(√n(β̂_0 − β_0)) → N(0, Σ(Ψ, P)),   L(√n(σ̂_0 − σ_0)) → N(0, σ²(P)),

where β_0, σ_0 solve

∫ Ψ(z, y, β_0, σ_0) dP(z, y) = 0     (6.2.31)
and Σ(Ψ, P) is as in (6.2.6). What is the relation between β_0, σ_0 and β, σ given in the Gaussian model (6.2.19)? If f_0 is symmetric about 0 and the solution of (6.2.31) is unique, then β_0 = β. But σ_0 = c(f_0)σ for some c(f_0) typically different from one. Thus, β̂_0 can be used for estimating β although, if the true distribution of the ε_i is N(0, σ²), it should perform less well than β̂. On the other hand, σ̂_0 is an estimate of σ only if normalized by a constant depending on f_0. (See Problem 6.2.5.) These are issues of robustness; that
is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded Ψ = (ψ_1, ..., ψ_p)^T to estimate β even though it is suboptimal when ε ~ N(0, σ²), and to use a suitably normalized version of σ̂_0 for the same purpose. One effective choice of ψ_j is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. □
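As a concrete illustration of using a bounded ψ, here is a minimal sketch of a Huber-type M-estimate of β computed by iteratively reweighted least squares (Python; fixing the scale at a preliminary MAD-based value is a common simplification of ours, not the joint (β, σ) system described above).

```python
import numpy as np

def huber_regression(Z, y, c=1.345, n_iter=50):
    """M-estimate of beta solving sum_i psi((y_i - z_i' beta)/sigma) z_i = 0 with the Huber psi;
    sigma is held fixed at a preliminary MAD-based scale estimate."""
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)                 # least squares start
    resid = y - Z @ beta
    sigma = np.median(np.abs(resid - np.median(resid))) / 0.6745
    for _ in range(n_iter):
        u = (y - Z @ beta) / sigma
        au = np.abs(u)
        w = np.where(au <= c, 1.0, c / np.maximum(au, 1e-12))  # psi(u)/u weights
        ZW = Z * w[:, None]
        beta = np.linalg.solve(ZW.T @ Z, ZW.T @ y)             # weighted least squares step
    return beta, sigma
```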
s. Again the two approaches differ at the second order when the prior begins to make a difference. The three approaches coincide asymptotically but differ substantially.s. typically there is an attempt at "exact" calculation. Confidence regions that parallel the tests will also be developed in Section 6. . because we then need to integrate out (03 . . See Schervish (1995) for some of the relevant calculations.8 we obtain Theorem 6. fJ p vary freely. . This approach is refined in Kass. ~ (6. O ). and interpret I . for fixed n.s. t)dt. ife denotes the MLE.3..5.2. The consequences of Theorem 6.2 Asymptotic Estimation Theory in p Dimensions 391 Testing and Confidence Bounds There are three principal approaches to testing hypotheses in multiparameter models. in perfonnance and computationally. the latter can pose a fonnidable problem if p > 2. we are interested in the posterior distribution of some of the parameters. 2 The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5.5. All of these will be developed in Section 6. These methods are beyond the scope of this volume but will be discussed briefly in Volume II. The problem arises also when. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : fJ 1 < 0 versus K : fh > 0 where fJ 2 . say (fJ 1.I as the Euclidean nonn in conditions A7 and A8. under PeforallB. Op).2. However. We have implicitly done this in the calculations leading up to (5.3 The Posterior Distribution in the Multiparameter Case The asymptotic theory of the posterior distribution parallels that in the onedimensional case exactly. Kadane. 6.3.).) and A6A8 hold then.19) (Laplace's method). Using multivariate expansions as in B. We simply make 8 a vector. A5(a. 1T(8) rr~ 1 P(Xi1 8).(t) n~ 1 p(Xi .. Summary. . When the estimating equations defining the M estimates coincide with the likelihood .2. up to the proportionality constant f e . A new major issue that arises is computation. and Rao's tests. say.Section 6. A4(a. Wald tests (a generalization of pivots). We defined minimum contrast (Me) and M estimates in the case of pdimensional parameters and established their convergence in law to a nonnal distribution. Although it is easy to write down the posterior density of 8. as is usually the case. the likelihood ratio principle.. Ifthe multivariate versions of AOA3.3 are the same as those of Theorem 5. .5. and Tierney (1989).2.. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems.3.32) a.2. the equivalence of Bayesian and frequentist optimality asymptotically.19).
"' ~ 6. this result gives the asymptotic distribution of the MLE.4. and showed that in several statistical models involving normal distributions. I H I! ~. We present three procedures that are used frequently: likelihood ratio. . 9 E 8" 8 1 = 8 . In this section we will use the results of Section 6.1 we developed exact tests and confidence regions that are appropriate in re~ gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. in many experimental situations in which the likelihood ratio test can be used to address important questions. Finally we show that in the Bayesian framework where given 0.3.f/ : () E 8. ry E £} if the asymptotic distribution of In(() . to the N(O. under P.2 to extend some of the results of Section 5. In such .I . These were treated for f) real in Section 5. . However.B} has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {PO.d.•• . 9 E 8 0 } for testing H : 9 E 8 0 versus K .9).9 and 6. as in the linear model.1 . adaptive estimation of 131 is possible iff Zl is uncorre1ated with every linear function of Z2) . Wald and Rao large sample tests.4.392 Inference in the Multiparameter Case Chapter 6 . X II . However. equations. LARGE SAMPLE TESTS AND CONFIDENCE REGIONS ~ 0) converges a.9) .i. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AIestimate if we know the correct model P = {Po : BEe} and use the MLE for this model. A(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions. 6.s. In linear regression. covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations. We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {PO. and confidence procedures.. 1 Zp.4 to vectorvalued parameters.denotes the MLE for PO' then the posterior distribution of yn(O if 0 r' (9)) distribution. X n are i. 9 E 8} sup{p(x.8 0 . 1 . 1 '~ J . we need methods for situations in which. ! I In Sections 4.4. I . . We shall show (see Section 6. Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic i. and other methods of inference.I ..6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. In Section 6.110 : () E 8}. confidence regions. '(x) = sup{p(x. the exact critical value is not available analytically. 170 specified. and we tum to asymptotic approximations to construct tests. In these cases exact methods are typically not available. I .". . I j I I . Another example deals with M estimates based on estimating equations generated by linear models with nonGaussian error distribution. . PO'  .1 we considered the likelihood ratio test statistic.3 • .
1). ~ In Example 2. i = r+ 1...3. iJ). .2. such that VI.l.£ X~" for degrees of freedom d to be specified later.tl. Suppose XL X 2 . .5) to be iJo = l/x and p(x.v'[. SetBi = TU/uo. ~ X.3. E W . when testing whether a parameter vector is restricted to an open subset of Rq or R r .. This is not available analytically.\(X) based on asymptotic theory.q' i=q+1 r Wilks's theorem states that. as X where X has the gamma. the hypothesis H is equivalent to H: 6 q + I = . The Gaussian Linear Model with Known Variance. . .. where A nxn is an orthogonal matrix with rows vI.\(Y) = L X. 0) ~ iJ"x·. 0).3. The MLE of f3 under H is readily seen from (2. r and Xi rv N(O. .... . We next give an example that can be viewed as the limiting situation for which the approximation is exact: independent with Yi rv N(Pi.d.3. Suppose we want to test H : Q = 1 (exponential distribution) versus K : a i. Wilks's approximation is exact.1. . . Y n be nn.. .. 1.1. . 0 We iHustrate the remarkable fact that X~q holds as an approximation to the null distribution of 2 log A quite generally when the hypothesis is a nice qdimensional submanifold of an rdimensional parameter space with the following. .\(Y)). and we transform to canonical form by setting Example 6. distribution with density p(X.1. . .1. iJ > O.. Q > 0.randXi = Uduo. Moreover. As in Section 6. . =6r = O. 0) = IT~ 1 p(Xi. Using Section 6. iJ). 0 = (iX.2 we showed how to find B as a nonexplicit solution of likelihood equations. 0 It remains to find the critical value. exists and in Example 2. ."" V r spanw.\(x) is available as p(x. qQ. which is usually referred to as Wilks's theorem or approximation.3 we test whether J. the numerator of . iJo) is the denominator of the likelihood ratio statistic.Wo where w is an rdimensional linear subspace of Rn and W ::> Wo. = (J. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations..tn)T is a member of a qdimensional linear subspace of woo versus the alternative that J.n. x ~ > 0..l..1 exp{ iJx )/qQ). In this u 2 known example. the X~_q distribution is an approximation to £(21og . V q span Wo and VI. ~ ~ ~ ~ ~ The approximation we shall give is based on the result "2 log '\(X) . 2 log . q < r.2 we showed that the MLE. ..i = 1. we conclude that under H. <1~) where {To is known.'··' J. i = 1.i..Section 6.n.3 Large Sample Tests and Confidence Regions 393 cases we can turn to an approximation to the distribution of .1).. Thus.4. Let YI . •. Here is an example in which Wilks's approximation to £('\(X)) is useful: Example 6..Xn are i.3. under regularity conditions. Then Xi rvN(B i .i = 1.
i=q+l 'C'" I=r+l Xi' ) X.2 unknown case.Xn a sample from p(x. . . I I i Example 6. we can conclude arguing from A. ". . Then. Write the log likelihood as e In«(J) ~ L i=I n logp(X i .2 are satisfied._q as well. !.. I( I In((J) ~ n 1 n L i=l & & &" &".6. The Gaussian Linear Model with Unknown Variance. case with Xl.3.. ! .3.. . The result follows because.3. and conclude that Vn S X.d. (72) ranges over an r + Idimensional Example 6. an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn .(Jol.. X.3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V.2.2) V ~ N(o. I . 0). ranges over a q + Idimensional manifold. 0 Consider the general i.(Y) defined in Remark ° 6. i\ I . (J = (Jo. and (J E c W. 2 log >'(Y) = Vn ".3.i.. where x E Xc R'. 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6.(Jol.3. In Section 6.394 Inference in the Multiparameter Case Chapter 6 1 .2 with 9(1) = log(l + I). I1«(Jo)). c = and conclude that 210g .. we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L.(Jo) for some (J~ with [(J~ .. I(J~ . . Eln«(Jo) ~ I«(Jo).\(Y) £ X.. y'ri(On .· .1 ((Jo)). "'" (6.3.q' Finally apply Lemma 5.q also in the 0. Hence. B)... : . N(o. V T I«(Jo)V ~ X. . Because . Apply Example 5. hy Corollary B..~ L H .1. an = n.7 to Vn = I:~_q+lXl/nlI:~ r+IX.3..3. . . (6. Note that for J.(X) ~ 1 .2.1) " I." . Proof Because On solves the likelihood equation DOln(O) = 0.(Jol < I(J~ .1..logp(X.3.1. j . Suppose the assumptions of Theorem 6. "'" T. under H . = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X. .2.2. where DO is the derivative with respect to e.(Jo) In«(Jn)«(Jn .2.(In[ < l(Jn . I. . Here ~ ~ j.2 but (T2 is unknown then manifold whereas under H.(J) Uk uJ rxr c'· • 1 . o . ·'I1 . If Yi are as in = (IL.Onl + IOn  (Jol < 210n . 21og>.. where Irxr«(J) is the Fisher information matrix.(Jo) ". By Theorem 6..
2[1. 0). Make a change of parameter.0:'. dropping the dependence on 6 0 • M ~ PI 1 / 2 (6..Section 6.0(1) = (8 1" ".10) and (6.)T.2. 0 E e.3. OT ~ (0(1).. Next we tum to the more general hypothesis H : 8 E 8 0 .(9 0 .35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 .)T.6) . Then under H : 0 E 8 0.". 0b2) = (80"+1".q.3. Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1.0(2».n) /n(Oo)l.)1'. Examples 6.3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem.3. where x r (1.3.n and (6.a. T (1) (2) Proof. . Theorem 6.n = (80 . for given true 0 0 in 8 0 • '7=M(OOo) where.3) is a confidence region for 0 with approximat~ coverage probability 1 . and {8 0 .1 and 6.0 0 ).3.8q )..(] .2 hold for p(x.0) is the 1 and C!' quantile of the X.' ".81(00).T.4). where e is open and 8 0 is the set of 0 E e with 8j = 80 . Let 00.3. the test that rejects H : () 2Iog>.j.2.2 illustrate such 8 0 . Suppose that the assumptions of Theorem 6. ~(11 ~ (6. Furthennore.2. j ~ q + 1. ~ {Oo : 2[/ n (On) /n(OO)] < x. has approximately level 1..4) It is easy to see that AOA6 for 10 imply AOA5 for Po. We set d ~ r . 10 (0 0 ) = VarO.+l..(Ia).. 8 2)T where 8 1 is the first q coordinates of8. . SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po..8.80. 0(2) = (8.1) applied to On and the corresponding argument applied to 8 0 .(X) 8 0 when > x. Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] . distribution.8.3.3. (JO.a)) (6..j} are specified values. (6. By (6.
••..a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.3.• I . . Me : 1]q+l = TJr = a}.1'1 '1 E eo} and from (B.7). Note that. 0 Note that this argument is simply an asymptotic version of the one given in Example 6.. J) distribution by (6.3. by definition. then by 2 log "(X) TT(O)T(O) . .all (6. His {fJ E applying (6.(X) = y{X) where ! 1 1 ... : • Var T(O) = pT r ' /2 II.8. ). • .Tf(O)T .l.2...3. IIr ) : 2[ln(lI n ) .I '1): 11 0 + M. = 0.l. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O..1ITD III(x.II).In( 00 ..10) LTl(o) . '1 E Me}. because in tenus of 7].+1 ~ . with ell _ .1> . Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ. • 1 y{X) = sup{P{x.2.. is invariant under reparametrization . I I eo~ ~ ~ {(O. Such P exists by the argument given in Example 6..1 '1) = [M. We deduce from (6. ~ 'I.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l).00 . (O) r • + op(l) (6.00 : e E 8 0} MAD = {'1: '/.+ 1 .1 '1).3.3. (6.1I0 + M.13) D'1I{x. 1e ).3. Moreover.8) that if T('1) then = 1 2 n.3..(X) > x r .0: confidence region is i . rejecting if .3./ L i=I n D'1I{X" liD + M. if A o _ {(} .11) .1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 .3. = . .1 '1ll/sup{p(x.1/ 2 p = J. under the conditions of Theorem 6.6) and (6...0.396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that.(1 .8 0 : () E 8}.9).. . Tbe result follows from (6. Or)J < xr_..9) .2. a piece of 8.5) to "(X) we obtain. . Thus.3. This asymptotic level 1 . 18q acting as r nuisance parameters.+I> .1I0 + M.(l .1I 0 + M.
e:::S:::.3. which require the following general theorem. The esseutial idea is that.13) then the set {(8" .o:::"'' 397 CC where 00.3. .) be i. B2 .. The formulation of Theorem 6..8 0 E Wo where Wo is a linear space of dimension q are also covered. q + 1 <j < r}. where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}.3.. If 81 + B2 = 1.'':::':::"::d_C:::o:::.3)..p:::'e:.'! are the MLEs.3.d...3.:::m.3.2 and the previously conditions on g.t..(0) ~ 8j .13) Evideutly.8r are known.::Se::..2 is still inadequate for most applications.3. of f)l" .2 to this situation is easy and given in Problem 6.. . 210g A(X) ".. let (XiI.. R.3. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6. ... Theorem 6.g. Suppose the MLE hold (JO. . A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O .14) a hypothesis of the fonn (6.. 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X.... . More sophisticated examples are given in Problems 6. will appear in the next section.. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}. X. (6. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I).. v r span " R T then. Suppose the assumptions of Theorem 6.6.. X~q under H. q + 1 < j < r written as a vector g. As an example ofwhatcau go wrong. such that Dg(O) exists and is of rauk r . .3.2)(6.q at all 0 E 8.2.( 0 ) ~ o} (6. We only need note that if WQ is a linear space spanned by an orthogonal basis v\. if 0 0 is true.3.. More complicated linear hypotheses such as H : 6 . Suppose H is specified by: There exist d functions. N(B 1 . Theorem 6.3. J). Then.3.12) The extension of Theorem 6.3.:::f....n under H is consistent for all () E 8 0.3.3. It can be extended as follows. q + 1 < j < r..... Examples such as testing for independence in contingency tables. 8q o assuming that 8q + 1 .3. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6."de"'::'"e:::R::e"g.2 falls under this schema with 9. 8q)T : 0 E 8} is opeu iu Rq.80.5 and 6. l B. themselves depending on 8 q + 1 . Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension.13).::..3.3. . "'0 ~ {O : OT Vj = 0.cTe:::.. t. 8..l:::. 9j . The proof is sketched iu Problems (6. .13). (6.o''C6::. We need both properties because we need to analyze both the numerator and denominator of A(X).i. Vl' are orthogonal towo and v" . " ' J v q and Vq+l " .
rl(IJ» asn ~ p ~ L 00.2. I.2 hold. it follows from Proposition B.15) Because I( IJ) is continuous in IJ (Problem 6. If H is true. The last Hessian choice is ~ ~ ..X. Then .31). j Wn(IJ~2)) !:. I22(lJo)) if 1J 0 E eo holds..lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" .Nr(O.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:. . I .Or) and define the Wald (6.3.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B. 0 ""11 F .10).2. I 22 (9 n ) is replaceable by any consistent estimate of J22( 8). the lower diagonal block of the inverse of .q' ~(2) (6.3.17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d. respectively.) and IJ n ~ ~ ~(2) ~ (Oq+" .3.fii(1J IJ) ~N(O.3. But by Theorem 6. . Under the conditions afTheorem 6. (6. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ). (6. [22 is continuous.18) i 1 Proof. according to Corollary B.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6.a)} is an ellipsoid in W easily interpretable and computablesee (6.2.: b.2. the Hessian (problem 6.16). It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when . i .. Y . r 1 (1J)) where. . has asymptotic level Q.fii(lJn (2) L 1J 0 ) ~ Nd(O.~ D 2 1n ( IJ n). in particular . .4.7. i favored because it is usually computed automatically with the MLE. y T I(IJ)Y.398 Inference in the Multiparameter Case Chapter 6 6..3. ~ Theorem 6.~ D 2ln (IJ n ).15) and (6. . For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n . y T I(IJ)Y . Slutsky's it I' I' .6. 1(8) continuous implies that [1(8) is continuous and. It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l . (6.1.3. theorem completes the proof.2.7.0. X. ..3..3. ~ ~ 00.3.3. for instance..2.2.~ D 2l n (1J 0 ) or I (IJ n ) or . More generally. hence.9).
The Wald test leads to the Wald confidence regions for (B q +] ..Q).ln(8 0 . as (6.. The test that rejects H when R n ( (}o) > x r (1.9.3 Large Sample Tests and Confidence Regions 399 The Wold test. This test has the advantage that it can be carried out without computing the MLE.3.. un.19) : (J c: 80. the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n.3. The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo.n)[D.Section 6.22) . in practice they can be very different. by the central limit theorem. asymptotically level Q..(00 )  121 (0 0 )/.. a consistent estimate of . Rao's SCOre test is based on the observation (6.:El ((}o) is ~ n 1 [D. The extension of the Rao test to H : (J E runs as follows. is. as n _ CXJ.. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !. and so on..3. (6. (J = 8 0 . requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.3. What is not as evident is that.3. under H.n W (80 .nl]  2  1 .n] D"ln(Oo. It can be shown that (Problem 6.8) E(Oo) = 1.n) 2  + D21 ln (00.. It follows from this and Corollary B. which rejects iff HIn (9b ») > x 1_ q (1 .21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ). X. and the convergence Rn((}o) ~ X. ~1 '1'n(OO.nlE ~ (2) _ T  .20) vnt/Jn(OO) !. The argument is sketched in Problem 6. therefore. (Problem 6.3. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although.3. N(O.0:) is called the Rao score test.19) indicates.n) under n H. 112 is the upper right q x d block.ln(Oo.6.\(X) is the LR statistic for H Thus. = 2 log A(X) + op(l) (6. the two tests are equivalent asymptotically. Furthermore.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d. 2 Wn(Oo ) ~(2) where . the asymptotic variance of . The Rao Score Test For the simple hypothesis H that. .2 that under H.. 1(0 0 )) where 1/J n = n .3..I Dln ((Jo) is the likelihood score vector.n) ~  where :E is a consistent estimate of E( (}o).. Let eo '1'n(O) = n.' (0 0 )112 (00) (6.n under H..9) under AOA6 and consistency of (JO.
On) + t. b . and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 . and so on. which measures the distance between the hypothesized value of 8(2) and its MLE. where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6. X. it shares the disadvantage of the Wald test that matrices need to be computed and inverted. 2 log A(X) has an asymptotic X.. we introduced the Rao score test. We also considered a quadratic fonn.2 but with Rn(lJb 2 1 1 » !:. In particular we shall discuss problems of . .3. then. X~' 2 j > xd(1 . . Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this. Summary. D 21 the d x d Theorem 6.8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified.q distribution under H. We established Wilks's theorem. For instance.3. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H.0 0 ) I(On)(On .19) holds under On and that the power behavior is unaffected and applies to all three tests. .a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l.400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). Finally. . Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6.)I(On)( vn(On  On) + A) !:. Rao. The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6.) . I I . .. under regularity conditions. . . for the Wald test neOn . which is based on a quadratic fonn in the gradient of the log likelihood. called the Wald statistic. Power Behavior of the LR. I . 0(2).2. The analysis for 8 0 = {Oo} is relatively easy.0 0 ) = T'" "" (vn(On . and showed that this quadratic fonn has limiting distribution q . .. which stales that if A(X) is the LR statistic. We considered the problem of testing H : 8 E 80 versus K : 8 E 8 .4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. On the other hand.5._q(A T I(Oo)t. e(l).a)}.
Ok if i if i = j.) /OOk = n(Bk . = L:~ 1 1{Xi = j}.8. we consider the parameter 9 = (()l. . # j. Thus.. .. and 2.7." . For i. Let ()j = P(Xi = j) be the probability of the jth category.)/OOk. j = 1.1 GoodnessofFit in a Multinomial Model. with ()Ok =  1 1 E7 : kl j=l ()OJ. Ok Thus.]2100.3. In. .32) thaI IIIij II.00k)2lOOk.33) and (3. .+ ] 1 1 0.1.4. k .lnOo.2.. ~ N. . log(N. In Example 2. or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory.5.4.. trials in which Xi = j if the ith trial prcx:luces a result in the jth category.4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM). Pearson's X2 Test As in Examples 1. j = 1. treated in more detail in Section 6. k. . . the Wald statistic is klkl Wn(Oo) = n LfB. It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN.00.OOi)(B.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj.d. where N. ()j.2.0).)2InOo.E~..i. 6. k1. L(9.. 2. consider i. j=1 Do.3.) > Xkl(1. j=l i=1 The second term on the right is kl 2 n Thus.Section 6.OO.8 we found the MLE 0. j=l . we need the information matrix I = we find using (2. I ij [ . j = 1.6. + n LL(Bi . we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution.2. . k Wn(Oo) = L(N.. Because ()k = 1 . j=1 To find the Wald test..nOo. .
L .2).. ~ .1S. note that from Example 2. j=1 .4. '" . we write ~ ~  8j 80j .11).2) . . we could invert] or note that by (6. . 80j 80k 1 and expand the square keeping the square brackets intact.4. with To find ]1. Then.8.Nk_d T and. by A.4. ]1(9) = ~ = Var(N)..~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k .Expected)2 Expected (6.. n ( L kl ( j=1 ~ ~) !.2..80k). the Rao statistic is .L .2..4. The second term on the right is n [ j=1 (!.1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ). . The general form (6.~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6...1) of Pearson's X2 will reappear in other multinomial applications in this section. To derive the Rao test. .402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem. ~ = Ilaijll(kl)X(kl) with Thus.13. 80i 80j 80k (6. It is easily remembered as 2 = SUM (Observed . 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } .. where N = (N1 . because kl L(Oj ..80j ) = (Ok .
n020 = n030 = 104.9 when referred to a X~ table. distribution. If we assume the seeds are produced independently. which has a pvalue of 0.4. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E . 8 0 .75. n3 = 108.25)2 104. 4 as above and associated probabilities of occurrence fh. Contingency Tables Suppose N = (NI .k. 7. We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:. . 2.1) dimensional parameter space k 8={8:0i~0.25)2 312. However. Testing a Genetic Theory.4). we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1. There is insufficient evidence to reject Mendel's hypothesis. which is a onedimensional curve in the twodimensional parameter space 8. 8). Nk)T has a multinomial.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6.1. and (4) wrinkled green. n4 = 32. Mendel observed nl = 315. in the HardyWeinberg model (Example 2.: . nOlO = 312. Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Then. Possible types of progeny were: (1) round yellow. j. Mendel's theory predicted that 01 = 9/16. where 8 0 is a composite "smooth" subset of the (k . i=1 For example.25 + (2.48 in this case. this value may be too small! See Note 1. k = 4 X 2 = (2.i::.75.2 GoodnessofFit to Composite Multinomial Models.4.75)2 104. n040 = 34.4. M(n. (3) round green. .2) becomes It follows that the Rao statistic equals Pearson 's X2. • 6. For comparison 210g A = 0. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. 04 = 1/16. fh = 03 = 3/16.25 + (3. 3.75 . In experiments on pea breeding. LOi=l}. 02.75)2 = 04 34. n2 = 101. Example 6.Section 6.75 + (3.1.25.. l::. (2) wrinkled yellow.fh. 04..
However. . .3). If £ is not open sometimes the closure of £ will contain a solution of (6. by the algebra of Section 6.q) exists.X approximately has a X. approximately X.r for specified eOj . Other examples.. nk. . j = q + 1. the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij)... the log likelihood ratio is given by log 'x(nb . ... nk. j = 1.l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni ..(t)~ei(11)=O.2~(1 . under H. thus. £ is open.3) Le.4.. . we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6. and the map 11 ~ (e 1(11). which will be pursued further later in this section. Then we can conclude from Theorem 6. 2 log .4. ij satisfies l a a'r/j logp(nl..1. .4. . . 8(11)) for 11 E £. We suppose that we can describe 8 0 parametrically as where 11 = ('r/1. Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:..X approximately has a xi distribution under H.. i=l t n. Maximizing p(n1..404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 . then it must solve the likelihood equation for the model. also equal to Pearson's X2 • . .2) by ej(ij)..q' Moreover. To apply the results of Section 6.3 and conclude that. 8) denote the frequency function of N. ... j Oj is. That is. . 8(11)) = 0. To avoid trivialities we assume q < k . .q distribution for large n." For instance.4.. .8 0 .. If a maximizing value. and ij exists. .. we define ej = 9j (8). a 11 'r/J . l~j~q. e~ = e2 .("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6. The Rao statistic is also invariant under reparametrization and. involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories. i=l k If 11 ~ e( 11) is differentiable in each coordinate. .r.~) and test H : e~ = O. ij = (iiI. r. 1 ~ j ~ q or k (6.4. ek(11))T takes £ into 8 0 . . where 9j is chosen so that H becomes equivalent to "( e~ .3.4.3 that 2 log .1. nk."" 'r/q) T. . The Wald statistic is only asymptotically invariant under reparametrization.nk.. is a subset of qdimensional space.ne j (ij)]2 = 2 ne. to test the HardyWeinberg model we set e~ = e1 . The algebra showing Rn(8 0 ) = X2 in Section 6. nk) = L ndlog(ni/n) log ei(ij)]..1). {p(. 8 0 • Let p( n1. . 8) for 8 E 8 0 is the same as maximizing p( n1.. e~) T ranges over an open subset of Rq and ej = eOj . ... 8(11)) : 11 E £}.
do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B..4. HardyWeinberg. We often want to know whether such characteristics are linked or are independent. 1] is the desired estimate (see Problem 6. (4) starchygreen. is male or female. 301). . (3) starchywhite. To study the relation between the two characteristics we take a random sample of size n from the population. The likelihood equation (6.1). p. 1958. A linkage model (Fisher. The only root of this equation in [0.4) which reduces to a quadratic equation in if. we obtain critical values from the X~ tables. then (Nb . AB. (}4) distribution. . AB. . k 4.4 Methods for Discrete Data 405 Example 6. .4.TJ). 1} a "onedimensional curve" of the threedimensional parameter space 8. (2) sugarygreen. 9(ij) ((2nl + n2) 2n 2n 2. (h. and so on.5. nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0. {}12. N 4 ) has a M(n. specifies that where TJ is an unknown number between 0 and 1. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B .a) with if (2nl + n2)/2n. An individual either is or is not inoculated against a disease. The results are assembled in what is called a 2 x 2 contingency table such as the one shown. i(1 .3) becomes i(1 °:. Then a randomly selected individual from the population can be one of four types AB. Because q = 1. (6. H is rejected if X2 2:: Xl (1 . {}22. green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite. AB. {}21.2.6 that Thus.. is or is not a smoker. ij {}ij = (Bil + Bi2 ) (Blj + B2j ). The Fisher Linkage Model. We found in Example 2..4. For instance. TJ). 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories.• .4. ( 2n3 + n2) 2) T 2n 2n o Example 6. T() test the validity of the linkage model we would take 8 0 {G(2 + TJ). Denote the probabilities of these types by {}n. (2nl + n2)(2 3 + n2) .Section 6. A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. If Ni is the number of offspring of type i among a total of n offspring.4.4. respectively. iTJ) : TJ :.
I}. 'r/2 to indicate that these are parameters. 0 ::.011 + 021 as 'r/1.5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n.4.3) become .6) the proportions of individuals of type A and type B. Here we have relabeled 0 11 + 0 12 . 8 0 .'r/1). ( 22 ). 'r/1 ::. the likelihood equations (6.~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 . respectively. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2.4. N 22 )T. because k = 4. This suggests that X2 may be written as the square of a single (approximately) standard normal variable. if N = (Nll' N 12 .4.Ti2) (6. 'r/2 ::. 0 12 .4. Then.7) where Ri = Nil + Ni2 is the ith row sum. 'r/2 (1 .'r/1) (1 . In fact (Problem 6. 1. By our theory if H is true. 'r/1 (1 . q = 2.RiCj/n) are all the same in absolute value and. Pearson's statistic is then easily seen to be (6.4.'r/2)) : 0 ::. which vary freely.Tid (n12 + n22) (1 .406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. C j = N 1j + N 2j is the jth column sum. We test the hypothesis H : () E 8 0 versus K : 0 f{. These solutions are the maximum likelihood estimates.'r/2). the (Nij . (6. 0 11 . where z tt [R~jl1 2=1 J=l . Thus.2). For () E 8 0 . N 21 . X2 has approximately a X~ distribution. 021 . for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic. (1 . we have N rv M(n.
b NIb Rl a Nal C1 C2 ... Positive values of Z indicate that A and B are positively associated (i... TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. Z = v'n[P(A I B) .. peA I B)) versus K : peA I B) > peA I B). a. 1 :::. Nab Cb Ra n . b ~ 2 (e. peA I B) = peA I B). 1). .. 1 Nu 2 N12 .4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6. B.. The N ij can be arranged in a a x b contingency table. i = 1. The X2 test is equivalent to rejecting (twosidedly) if. Thus.g. a. .. .. that A is more likely to occur in the presence of B than it would in the presence of B)... b where the TJil.. that is. . j :::. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2.Section 6. Z ~ z(l.4. B to denote the event that a randomly selected individual has characteristic A. j :::. (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::... 1 :::.peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij . (Jij : 1 :::. B.3) that if A and B are independent.. . . and only if.4.4. b} "J M(n.. respectively.4. . a.e. . If (Jij = P[A randomly selected individual is of type i for 1 and j for 2]. Therefore. B.. j :::. eye color. 1 :::. . it is reasonable to use the test that rejects.a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::. . i :::. i :::. b). Z indicates what directions these deviations take. a. hair color).8) Thus. if X2 measures deviations from independence. then Z is approximately distributed as N(O. i :::. Next we consider contingency tables for two nonnumerical characteristics having a and b states. then {Nij : 1 :::. if and only if. j = 1. a. It may be shown (Problem 6.
' XI. Instead we turn to the logistic transform g( 7r). Thus.4." We assume that the distribution of the response Y depends on the known covariate vector ZT. whicll we introduced in Example 1. Because 7r(z) varies between 0 and 1.{3p. Other transforms.1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are. we call Y = 1 a "success" and Y = 0 a "failure.Li = L. In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1. The log likelihood of 7r = (7rl.3 Logistic Regression for Binary Responses In Section 6. As is typical.~ =1 Zij {3j = unknown parameters {31.7r)]. i :::. we obtain what is called the logistic linear regression model where . or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0).4. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses. [Xi C :i"J log + m.."" 7rk)T based on X = (Xl"'" Xk)T is t. a simple linear representation zT ~ for 7r(. we observe the number of successes Xi = L. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0).11) When we use the logit transfonn g(7r). B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi. In this section we assume that the data are grouped or replicated so that for each fixed i.4.f. with Xi binomial. (6.6. 6..8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 . k. such as the probit gl(7r) = <I>1(7r) where <I> is the N(O.10.. log ( ~: ) . 1) d."F~1 Yij where Yij is the response on the jth of the mi trials in block i. . perhaps after a transfonnation. The argument is left to the problems as are some numerical applications. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J..9) which has approximately a X(al)(bl) distribution under H.7r)] are also used in practice. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0). . (6. and the loglog transform g2(7r) = 10g[log(1 . usually called the log it. 1 :::. log(l "il] + t.) over the whole range of z is impossible.408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated. we observe independent Xl' .4.
(6. ... it follows by Theorem 5. j Thus. . in (6. We let J. Theorem 2.~ Xi 2 1 +"2 ' (6. that if m 1fi > 0 for 1 ::. k ( ) (6. where Z = IIZij Ilrnxp is the design matrix. if N = 2:7=1 mi.L) = O. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00.4 the Fisher information matrix is I(f3) = ZTWZ (6.4.1 and by Proposition 3.3.log(l:'. Alternatively with a good initial value the NewtonRaphson algorithm can be employed. ( 1 . As the initial estimate use (6. Tp) T.4. Zi)T is the logistic regression model of Problem 2.1fi)*}' + 00. .l..! ) ( mi .2 can be used to compute the MLE of f3.13) By Theorem 2. k m} (V.+ . The coordinate ascent iterative procedure of Section 2. W is estimated using Wo = diag{mi1f.. Then E(Tj ) = 2:7=1 Zij J.14). It follows that the NILE of f3 solves E f3 (Tj ) = T .Li.Li = E(Xi ) = mi1fi.3.15) Vi = log X+. The condition is sufficient but not necessary for existencesee Problem 2.. .[ i(l7r iW')· . I N(f3) = f.1 applies and we can conclude that if 0 < Xi < mi and Z has rank p.16) 1fi) 1]. IN(f3) of f3 = (f31.3.3. Zi The log likelihood l( 7r(f3)) = == k (1. we can guarantee convergence with probability tending to 1 as N + 00 as follows.(1.+ .4.4.1.f3p )T is.. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '. mi 2mi Xi 1 Xi 1 = . i ::.. the solution to this equation exists and gives the unique MLE 73 of f3.4 Large Sample Methods for Discrete Data 409 The special case p = 2.J. 7r Using the 8method.p. the likelihood equations are just ZT(X .4. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f.. the empirical logistic transform. .1fi )* = 1 .4.J) SN(O. j = 1.14) where W = diag{ mi1fi(l1fi)}kxk..12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit.3.4..4.3. Similarly. Although unlike coordinate ascent NewtonRaphson need not converge. or Ef3(Z T X) = ZTX.Section 6. Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' .
Testing

In analogy with Section 6.1, linear subhypotheses are important. As in Section 6.1, we let ω = {η : η_i = z_i^T β, β ∈ R^p} and let r be the dimension of ω. We want to contrast ω with the case in which there are no restrictions on η; that is, we set Ω = R^k and consider η ∈ Ω. In this case the likelihood is a product of independent binomial densities, and the MLEs of π_i and μ_i are X_i/m_i and X_i, i = 1, ..., k. The LR statistic 2 log λ for testing H : η ∈ ω versus K : η ∈ Ω − ω is denoted by D(X, μ̂), where μ̂ is the MLE of μ for η ∈ ω. In the present case, from (6.4.11) and (6.4.12),

    D(X, μ̂) = 2 Σ_{i=1}^k [X_i log(X_i/μ̂_i) + X_i' log(X_i'/μ̂_i')],    (6.4.17)

where X_i' = m_i − X_i and μ̂_i' = m_i − μ̂_i. D(X, μ̂) measures the distance between the fit μ̂ based on the model ω and the data X. It follows (Problem 6.4.13) that if m_i π_i > 0 for 1 ≤ i ≤ k, k < ∞, then D(X, μ̂) has asymptotically a χ²_{k−r} distribution for η ∈ ω as m_i → ∞, i = 1, ..., k.

If ω_0 is a q-dimensional linear subspace of ω with q < r, then we can form the LR statistic for H : η ∈ ω_0 versus K : η ∈ ω − ω_0,

    2 log λ = 2 Σ_{i=1}^k [X_i log(μ̂_i/μ̂_{0i}) + X_i' log(μ̂_i'/μ̂_{0i}')] = D(X, μ̂_0) − D(X, μ̂),    (6.4.18)

where μ̂_0 is the MLE of μ under H and μ̂_{0i}' = m_i − μ̂_{0i}. By Problem 6.4.13, 2 log λ has asymptotically a χ²_{r−q} distribution as m_i → ∞.

To get expressions for the MLEs of π and μ, recall from Example 1.6.8 that the inverse of the logit transform g is the logistic distribution function

    g^{-1}(y) = [1 + exp(−y)]^{-1}.

Thus, the MLE of π_i is π̂_i = g^{-1}(Σ_{j=1}^p z_{ij} β̂_j), i = 1, ..., k.
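The deviance (6.4.17) and the difference of deviances (6.4.18) are straightforward to compute. The sketch below (an added illustration, not the text's example) fits the logistic model with and without a slope on invented grouped counts and refers 2 log λ to a χ² distribution with r − q = 1 degree of freedom.

```python
import numpy as np
from scipy.stats import chi2

z = np.array([0.0, 0.5, 1.0, 1.5, 2.0])     # made-up covariates
m = np.array([10, 12, 15, 10, 8])
X = np.array([2, 4, 9, 8, 7])

def fit(Z):
    """Newton-Raphson MLE for the grouped logistic model on design Z;
    returns the fitted means mu_i = m_i * pi_i."""
    beta = np.zeros(Z.shape[1])
    for _ in range(50):
        pi = 1 / (1 + np.exp(-(Z @ beta)))
        beta += np.linalg.solve(Z.T @ np.diag(m * pi * (1 - pi)) @ Z,
                                Z.T @ (X - m * pi))
    return m / (1 + np.exp(-(Z @ beta)))

def deviance(X, mu):
    """D(X, mu_hat) of (6.4.17); a term with X_i = 0 or X_i = m_i contributes
    only through the complementary count (0 * log 0 is taken as 0)."""
    Xc, muc = m - X, m - mu
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(X > 0, X * np.log(X / mu), 0.0)
        t2 = np.where(Xc > 0, Xc * np.log(Xc / muc), 0.0)
    return 2 * np.sum(t1 + t2)

mu_omega = fit(np.column_stack([np.ones_like(z), z]))   # eta_i = b1 + b2 z_i
mu_omega0 = fit(np.ones((len(z), 1)))                    # eta_i = b1 (the hypothesis)
lr = deviance(X, mu_omega0) - deviance(X, mu_omega)      # (6.4.18)
print("2 log lambda =", lr, " approx p-value:", 1 - chi2.cdf(lr, df=1))
```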
Here is a special case.

The Binomial One-Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of m_i patients and recording the number X_i of patients that recover. For a second example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number X_i among m_i that has the given attribute. The samples are collected independently and we observe X_1, ..., X_k independent with X_i ~ B(m_i, π_i), i = 1, ..., k.

This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test

    H : π_1 = π_2 = ... = π_k = π, π ∈ (0, 1),

versus the alternative that the π's are not all equal. Under H the log likelihood in canonical exponential form is

    βT − N log(1 + exp{β}) + Σ_{i=1}^k log (m_i choose X_i),

where T = Σ_{i=1}^k X_i, N = Σ_{i=1}^k m_i, and β = log[π/(1 − π)]. It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of the likelihood equation. Using Z as given for the one-way layout in Section 6.1, we find that the MLE of π under H is π̂ = T/N. The LR statistic is given by (6.4.18) with μ̂_{0i} = m_i π̂. The Pearson statistic

    χ² = Σ_{i=1}^k (X_i − m_i π̂)² / [m_i π̂(1 − π̂)]

is a Wald statistic, and the χ² test is equivalent asymptotically to the LR test (Problem 6.4.13).

Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's χ²," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the (k − 1)-dimensional parameter space Θ, the Rao statistic is again of the Pearson χ² form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. We also considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of k binomial parameters, we give explicitly the MLEs and χ² test.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean μ_i = E(Y_i) of a response Y_i is expressed as a function of a linear combination

    ξ_i = Σ_{j=1}^p z_{ij} β_j

of covariate values. In particular, in the case of a Gaussian response, μ_i = ξ_i, whereas for the binary responses of Section 6.4.3, π_i = g^{-1}(ξ_i), where g^{-1}(y) is the logistic distribution function.
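For the homogeneity hypothesis just described, both 2 log λ and Pearson's χ² are one-liners to compute. The following sketch (an added illustration with invented counts, not data from the text) compares them.

```python
import numpy as np
from scipy.stats import chi2

# Made-up counts of "successes" X_i out of m_i at k = 4 locations.
m = np.array([50, 60, 45, 55])
X = np.array([20, 33, 15, 24])

T, N = X.sum(), m.sum()
pi_hat = T / N                      # MLE of the common pi under H
mu0 = m * pi_hat                    # mu_hat_{0i} = m_i * pi_hat

# LR statistic: (6.4.18) with mu_hat_i = X_i (no restrictions on eta).
Xc, mu0c = m - X, m - mu0
lr = 2 * np.sum(X * np.log(X / mu0) + Xc * np.log(Xc / mu0c))

# Pearson's chi-square (Wald form) for the same hypothesis.
pearson = np.sum((X - mu0) ** 2 / (mu0 * (1 - pi_hat)))

df = len(m) - 1
print("2 log lambda =", lr, " X^2 =", pearson,
      " chi2_{k-1}(0.95) =", chi2.ppf(0.95, df))
```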
.r Case 6 function. Typically. in which case 9 is also called the link function. (ii) Log linear models. g = or 11 L J'j Z(j) = Z{3. 11) given by (6.. the mean fL of Y is related to 11 via fL = A(11). The generalized linear model with dispersion depending only on the mean The data consist of an observation (Z. See Haberman (1974). but in a subset of E obtained by restricting TJi to be of the form where h is a known function. Yn)T is n x 1 and ZJxn = (ZI. . These Ii are independent Gaussian with known variance 1 and means JLi The model is GLM with canonical g(fL) fL. fL determines 11 and thereby Var(Y) A( 11).6. Y) where Y = (Yb . Typically.5.1) where 11 is not in E. More generally.. McCullagh and NeIder (1983. such that g(fL) = L J'jZ(j) j=1 p Z{3 where Z(j) (Zlj.1). = L:~=1 Zij{3. j=1 p In this case.412 Inference in the 1\III1IITin'~r:>rn"'T. which is p x 1. Special cases are: (i) The linear model with known variance.. in which case JLi A~ (TJi). 1989) synthesized a number of previous generalizations of the linear model. that is. g(JLn)) T. We assume that there is a onetoone transform g(fL) of fL. called the link junction. the GLM is the canonical subfamily of the original exponential family generated by ZTy.. the natural parameter space of the nparameter canonical exponential family (6. . g(fL) is of the form (g(JL1) .. Canonical links The most important case corresponds to the link being canonical. . A (11) = Ao (TJi) for some A o."" zn) with Zi = (Zib"" Zip)T nonrandom and Y has density p(y. the identity. ••• . As we know from Corollary 1. . Znj)T is the jth column vector of Z. Note that if A is oneone.5..1. most importantly the log linear model developed by Goodman and Haberman.
5. . where f3i. b. j::. the link is canonical. NewtonRaphson coincides with Fisher's method of scoring described in Problem 6. 1 ::. If we take g(/1) = (log J.5. .80 ~ f3 0 .Section 6.t1. Algorithms If the link is canonical. that Y = IIYij 111:Si:Sa.. p. With a good starting point f3 0 one can achieve faster convergence with the NewtonRaphson algorithm of Section 2. Suppose. is that of independence Bij = Bi+B+j where Bi+ = 2:~=1 Bij .5 Generalized linear Models 413 Suppose (Y1.Bp). log J. by Theorem 2. j ::. /1(. and the log linear model corresponding to log Bij = f3i + f3j. 1 ::. "Ij = 10gBj. B+ j = 2:~=1 Bij . . 1f..2) (or ascertain that no solution exists). 0 < (}i < 1 withcanonicallinkg(B) = 10g[B(1B)].5. p are canonical parameters. 1 ::. say. that procedure is just (6.B 1.1.3.1. . the . The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ((}i).. for example./1(. i ::. 1 ::. The coordinate ascent algorithm can be used to solve (6 . See Haberman (1974) for a further discussion. f3j are free (unidentifiable parameters). b. a.4. in general. if maximum likelihood estimates they necessarily uniquely satisfy the equation f3 exist.3.8) is not a member of that space. classification i on characteristic 1 and j on characteristic 2. But. Then. .5. so that Yij is the indicator of. j ::.Yp)T is M(n.7.8) is orthogonal to the column space of Z.6.. This isjust the logistic linear model of Section 6. ZTy or = ZT E13 y = ZT A(Z. Bj > 0. In this case..5. j ::.8) (6. as seen in Example 1.1::.2) can be interpreted geometrically in somewhat the same way as in the Gausian linear modelthe "residual" vector Y .3) where In this situation and more generally even for noncanonical links. Then () = IIBij II.4. The models we obtain are called log linearsee Haberman (1974) for an extensive treatment.tp) T .2) It's interesting to note that (6. 2:~=1 B = 1.. .
the deviance of Y to iLl and tl(iL o.1](Y)) l(Y . Unfortunately tl =f.20) when the data are the residuals from the fit at stage m. For the Gaussian linear model with known variance 0'5 (Problem 6. D generally except in the Gaussian case. iLo) . In that case the MLE of J1 is iL M = (Y1 .D(Y. . The LR statistic for H : IL E Wo is just D(Y.6) a decomposition of the deviance between Y and iLo as the sum of two nonnegative components.4). iLl) where iLl is the MLE under WI.D(Y .5. iLl) == D(Y.1](lLa)] for the hypothesis that IL = lLa within M as a "measure" of (squared) distance between Y and lLo.Wo with WI ~ Wo is D(Y. Let ~m+l == 13m+1 .2.5. We can think of the test statistic .13m' which satisfies the equation (6.1) for which p = n. If Wo C WI we can write (6. (6. In this context. iLo) == inf{D(Y. iLo) . the correction Am+ l is given by the weighted least squares formula (2. as n ~ 00. As in the linear model we can define the biggest possible GLM M of the form (6. then with probability tending to 1 the algorithm converges to the MLE if it exists.5. This quantity.5.5. the variance covariance matrix is W m and the regression is on the columns ofW mZProblem 6. Yn ) T (assume that Y is in the interior of the convex support of {y : p(y. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6. 210gA = 2[l(Y. the algorithm is also called iterated weighted least squares. This name stems from the following interpretation. iLl)' each of which can be thought of as a squared distance between their arguments. IL) : IL E wo} where iLo is the MLE of IL in Woo The LR statistic for H : IL E Wo versus K : IL E WI .5) is always ~ O.1.414 Inference in the Multiparameter Case Chapter 6 true value of {3. called the deviance between Y and lLo..4) That is.2. Write 1](') for AI.5. Testing in GLM Testing hypotheses in GLM is done via the LR statistic. 1]) > o}). ••• .
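The iterated weighted least squares interpretation described here can be made concrete. The sketch below (an added illustration, not the text's algorithm) carries out the weighted least squares updates for a Poisson log-linear model, a canonical-link GLM, on simulated data; the sample size, seed, and true coefficients are arbitrary choices for the example.

```python
import numpy as np

# Simulated Poisson counts with canonical (log) link: log mu_i = z_i^T beta.
rng = np.random.default_rng(0)
n = 100
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
Y = rng.poisson(np.exp(Z @ beta_true))

# Iteratively reweighted least squares: at each stage regress the working
# response on the columns of Z with weights W = diag{mu_i} (Poisson variance
# function evaluated at the current fit).
beta = np.zeros(2)
for _ in range(25):
    eta = Z @ beta
    mu = np.exp(eta)                     # mu = A'(eta) for the Poisson family
    work = eta + (Y - mu) / mu           # working response
    WZ = Z * mu[:, None]
    beta_new = np.linalg.solve(Z.T @ WZ, WZ.T @ work)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print("IRLS estimate:", beta, " truth:", beta_true)
```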
However.. .3. f3p ).7) More details are discussed in what follows.. Y n ) from the family with density (6.. . and (6.5. in order to obtain approximate confidence procedures. This can be made precise for stochastic GLMs obtained by conditioning on ZI. . d < p. if we take Zi as having marginal density qo.. For instance. (3) term. then (ZI' Y1 ). fil) is thought of as being asymptotically X. (3))Z. .3 applies straightforwardly in view of the general smoothness properties of canonical exponential families.2. can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates.5.5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links.Section 6 . (Zn' Y n ) can be viewed as a sample from a population and the link is canonical. . so that the MLE {3 is unique.5. we obtain If we wish to test hypotheses such as H : 131 = .9) What is ]1 ((3)? The efficient score function a~i logp(z .2.8) This is not unconditionally an exponential family in view of the Ao (z.0. (Zn. if we assume the covariates in logistic regression with canonical link to be stochastic. which..2 and 6.. y .2. the theory of Sections 6. Asymptotic theory for estimates and tests If (ZI' Y 1 ). . .. (Zn' Y n ) has density (6.1 ....3. asymptotically exists. . ••• . is consistent with probability 1.3 hold (Problem 6.q.3). then ll(fio. ..10) is asymptotically X~ under H. Similar conclusions r '''''''. Thus. Zn in the sample (ZI' Y 1 ). there are easy conditions under which conditions of Theorems 6. = f3d = 0.5. . we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0. 6. . and 6. f3d+l. Ao(Z. 1.5. (3) is (Yi and so. and can conclude that the statistic of (6. which we temporarily assume known.
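As one concrete version of these large-sample tests (an added illustration, not the text's derivation), the sketch below fits a logistic regression with simulated stochastic covariates and computes the Wald statistic for H : β_1 = β_2 = 0 using the corresponding block of the inverse estimated information; under H it is referred to a χ² distribution with d = 2 degrees of freedom. The simulated design and coefficients are assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2

# Simulated Bernoulli (ungrouped) logistic data with an intercept and 3 covariates;
# we test H: beta_1 = beta_2 = 0 (the first d = 2 slopes) by a Wald statistic.
rng = np.random.default_rng(1)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([-0.3, 0.0, 0.0, 0.7])        # H holds for the first two slopes
Y = rng.binomial(1, 1 / (1 + np.exp(-(Z @ beta_true))))

# Newton-Raphson MLE for the full model.
beta = np.zeros(4)
for _ in range(50):
    pi = 1 / (1 + np.exp(-(Z @ beta)))
    info = Z.T @ (Z * (pi * (1 - pi))[:, None])     # estimated information
    beta += np.linalg.solve(info, Z.T @ (Y - pi))

pi = 1 / (1 + np.exp(-(Z @ beta)))                  # recompute at the MLE
info = Z.T @ (Z * (pi * (1 - pi))[:, None])

# Wald statistic for the sub-vector (beta_1, beta_2): quadratic form in the
# corresponding block of the inverse information.
idx = [1, 2]
cov = np.linalg.inv(info)
w = beta[idx] @ np.linalg.solve(cov[np.ix_(idx, idx)], beta[idx])
print("Wald statistic:", w, " approx p-value:", 1 - chi2.cdf(w, df=2))
```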
which we postpone to Volume II. JL ::. then A(11)/c(r) = log! exp{c1 (r)11T y}h(y. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred. However. 11.10) is of product form A( 11) [1/ c( r)] whereas the righthand side cannot always be put in this form. if in the binary data regression model of Section 6.4.5. then it is easy to see that E(Y) = A(11) Var(Y) = c(r)A(11) (6. These conclusions remain valid for the usual situation in which the Zi are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables.1) depend on an additional scalar parameter r. (J2) and gamma (p. .13) (6. r). From the point of analysis for fixed n. Note that these tests can be carried out without knowing the density qo of Z1. we take g(JL) = <1>1 (JL) so that 7ri = <1> (zT!3) we obtain the socalled probit model.9 are rather similar.11) Jp(y. r) = exp{c.5.5. 11. A) families. Because (6.14) so that the variance can be written as the product of a function of the mean and a general dispersion parameter.3.1 (r)(11T y .A(11))}h(y. For instance.12) The lefthand side of (6. when it can. r)dy = 1.r)dy. the results of analyses with these various transformations over the range . As he points out. 1989). Important special cases are the N (JL. For further discussion of this generalization see McCullagp.416 Inference in the Multiparameter Case Chapter 6 follow for the Wald and Rao statistics. noncanonicallinks can cause numerical problems because the models are now curved rather than canonical exponential families. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.5. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean.5.1 ::. General link functions Links other than the canonical one can be of interest. and NeIder (1983. (6. It is customary to write the model as. p(y. for c( r) > 0. . Cox (1970) considers the variance stabilizing transformation which makes asymptotic analysis equivalent to that in the standard Gaussian linear model.5.
i." Of course. and Models 417 Summary. Y) has a joint distribution and if we are interested in estimating the best linear predictor /lL(Z) of Y given Z.Section 6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS As we most recently indicated in Example 6.2. for estimating /lL(Z) in a submodel of P 3 with /3 (6.2. with density f for some f symmetric about O}. then the LSE of (3 is. PI {P : (zT. still a consistent asymptotically normal estimate of (3 and. so is any estimate solving the equations based on (6.2.8 but otherwise also postponed to Volume II.2. have a fixed covariate exact counterpart. we use the asymptotic results of the previous sections to develop large sample estimation results.2. a 2 ) or. the GaussMarkov theorem.d. the LSE is not the best estimate of (3. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function. In Example 6. roughly speaking.19) even if the true errors are N(O.5).1) where E is independent of Z but ao(Z) is not constant and ao is assumed known. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. the resulting MLEs for (3 optimal under fa continue to estimate (3 as defined by (6.d. of a linear predictor of the form Ef3j Z(j). We discussed algorithms for computing MLEs of In the random design case.r+i". Ii 6. the distributional and implicit structural assumptions of parametric models are often suspect. Furthermore.31) with fa symmetric about O. had any distribution symmetric around 0 (Problem 6. we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian."n". in fact.6 Robustness Pr. are discussed below.2. /3.1. We found that if we assume the error distribution fa. ____ . Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem 6.19) with Ei i.6. That is.. which is symmetric about 0. y)T rv P that satisfies Ep(Y I Z) = ZT(3. if P3 is the nonparametric model where we assume only that (Z. in fact.. confidence procedures.J .i. Y) rv P given by (6. the right thing to do if one assumes the (Zi' Yi) are i. For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. E p y2 < oo}. called the link function.. ".2. where the Z(j) are observable covariate vectors and (3 is a vector of regression coefficients. is "act as ifthe model were the one given in Example 6. There is another semiparametric model P 2 = {P : (ZT. if we consider the semiparametric model. EpZTZ nonsingular. and tests.2. under further mild conditions on P. These issues.6.2. whose further discussion we postpone to Volume II..
6. is UMVU in the class of linear estimates for all models with EY/ < 00. 418 Inference in the Multiparameter Case Chapter 6 Robustness in Estimation We drop the assumption that the errors E1.. . 13i of Section 6. Theorem 6. our current Ji coincides with the empirical plugin estimate of the optimal linear predictor (1..1. n 2 VarcM(a) = La. . In fact. € are n x 1.En in the linear model (6.H). for .1..2(iv) and 6. i=l i<j i=l where VarCM refers to the variance computed under the GaussMarkov assumptions.1. E(a) = E~l aiJLi Cov(Yi.'.2) where Z is an n x p matrix of constants. . In Example 6. for any parameter of the form 0: = E~l aiJLi for some constants a1.1. Instead we assume the GaussMarkov linear model where (6.1.3. By Theorems 6.En are i. Ej).Y . and a is unbiased. Var(Ei) + 2 Laiaj COV(Ei. JL = f31..1 and the LSE in general still holds when they are compared to other linear estimates. Example 6. and Y.3(iv). {3 is p x 1. One Sample (continued). The GaussMarkov theorem shows that Y. . jj and il are still unbiased and Var(Ji) = a 2 H. 0 Note that the preceding result and proof are similar to Theorem 1.. . and Ji = 131 = Y. . However. The preceding computation shows that VarcM(Ci) = Varc(Ci). Var(jj) = a 2 (ZTZ)1. Moreover. and by (B.1.p .1.1.6. . the result follows. moreover.4 are still valid. Then. n = 0:. ( 2 ).Ej) = a La..3) are normal.2) holds.1. . the conclusions (1) and (2) of Theorem 6.i.. . in addition to being UMVU in the normal case.14). Because Varc = VarCM for all linear estimators. The optimality of the estimates Jii.1. Because E( Ei) = 0.6. Varc(a) ::. .2.6.6..Yn . the estimate = E~l aiJii has uniformly minimum variance among all unbiased estimates linear in Y 1 .d. . Let Ci stand for any estimate linear in Y1 . Many of the properties stated for the Gaussian case carry over to the GaussMarkov case: Proposition 6.' En with the GaussMarkov assumptions.. See Problem 6. Suppose the GaussMarkov linear model (6.S. Varc(Ci) for all unbiased Ci. where Varc(Ci) stands for the variance computed under the Gaussian assumption that E1. Ifwe replace the Gaussian assumptions on the errors E1.4.4.6). Yj) = Cov( Ei.4 where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. and ifp = r.1.. in Example 6.an.1. n a Proof. N(O. Var(e) = a 2 (I .
but Var( Ei) depends on i. the asymptotic behavior of the LR.ii(8 . Now we will use asymptotic methods to investigate the robustness of levels more generally. we can use the GaussMarkov theorem to conclude that the weighted least squares are unknown. 00.12). Suppose we have a heteroscedastic 0.. Wald and Rao tests depends critically on the asymptotic behavior of the underlying MLEs 8 and 80 .d.6. and .6.4. aT Robustness of Tests In Section 5. Y is unbiased (Problem 3.6.2 we know that if '11(" 8) = Die. For this density and all symmetric densities. Il E R. Because ~ ('lI .3) .6. 0.4) and the Wald test statis8 0 evidently we have i= Tw But if 8(P) = 8 0 . From the theory developed in Section 6. for instance.6.and twosample problems using asymptotic and Monte Carlo methods. consider H : 8 tics Tw = n(e ( 0 )T I(8 0 )(e ( 0 ).6. This observation holds for the LR and Rao tests as wellbut see Problem 6. ~('11... then Tw ~ VT I(8 0 )V where V '" N(O. Thus. a major weakness of the GaussMarkov theorem is that it only applies to linear estimates.. A > O. Y) where Z is a (p x 1) vector of random covariates and we model the relationship between Z and Y as (6.1. which does not belong to the model.5) than the nonlinear estimate Y median when the density of Y is the Laplace density 1 2A exp{ Aly ."n". 8)dP(x) ° = 80 (6.2 are UMVU. y E R. P)).2... . Another weakness is that it only applies to the homoscedastic case where Var(1'i) is the same for all i. the asymptotic distribution of Tw is not X. our estimates are estimates of Section 2. as (Z.. The 6method implies that if the observations Xi come from a distribution P.3 we investigated the robustness of the significance levels of t tests for the one. 0.1.:lrarnetlric Models 419 sample n large.. Suppose (ZI' Yd. .5) . ~(w.6.:.ennip. As seen in Example 6.2. when the not optimal even in the class of linear estimates. Example 6.Ill}. There is an important special case where all is well: the linear model we have discussed in Example 6. Remark 6. However. If the are version of the linear model where E( €) known.ti". The Linear Model with Stochastic Covariates.2.6 Robustness "" . Y has a larger variance (Problem 6.8(P)) ~ N(O. which is the unique solution of ! and w(x.6.Section 6. If 8(P) (6.. 1 (Zn. 8) and Pis true we expect that 8n 8(P).. P)) with :E given by (6.6).2. Y n ) are d.4.P) i= II (8 0 ) in general.
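The comparison of the sample mean with the sample median under the Laplace (double-exponential) density mentioned in this example is easy to check by simulation; the following sketch (added here for illustration, with an arbitrary sample size and scale) does so.

```python
import numpy as np

# Monte Carlo check: under Laplace errors the sample median beats the sample
# mean, even though the mean is the best *linear* unbiased estimate under the
# Gauss-Markov assumptions.
rng = np.random.default_rng(2)
n, reps = 100, 20000
Y = rng.laplace(loc=0.0, scale=1.0, size=(reps, n))   # mu = 0, Var = 2*scale^2

var_mean = np.var(Y.mean(axis=1))
var_median = np.var(np.median(Y, axis=1))
print("n * Var(mean)   ~", n * var_mean,   "(asymptotic value 2)")
print("n * Var(median) ~", n * var_median, "(asymptotic value 1/(4 f(0)^2) = 1)")
```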
Then.2.7. X. under H : (3 = 0.6.5) and (12(p) = Varp(E)..6.6. .2) the LR.6.6. Thus.8) Moreover. say..30) then (3(P) specified by (6...2. (12).. !ltin:::.q+ 1.1 and 6. E p E2 < 00. EpE 0. the test (6. Y) such that Eand Z are independent. n lZT Z (n) (n) so that the confidence ~ ~ ZiZ'f n ~ 1 (6. and we consider. It follows by Slutsky's theorem that.7) and (6.a) where _ ""T T  2 (6. procedures based on approximating Var(. (Jp (Jo.Zn)T and  2 1 ~ " 2 1 ~(Yi .2. it is still true by Theorem 6.np(l a) . it is still true that if qi is given by (6.Yi) = IY n .np(1 .t. hence. .3. Wald.6. even if E is not Gaussian.  2 Now.. For instance. .2. (see Examples 6.r::nn"..5) has asymptotic level a even if the errors are not Gaussian.t xp(l a) by Example 5.r Case 6 with the distribution P of (Z.420 Inference in the M .P i=l n p  Z(n)(31 . when E is N(O. the limiting distribution of is Because !p.9) i=l n1Zfn)Z(n)S2 / Vii are still asymptotically of correct level.1 that (6. by the law of large numbers.. Zen) S (Zb . (i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of (3 is the same as under the Gaussian model.6. and Rao tests all are equivalent to the F test: Reject if Tn = (3 Z(n)Z(n)(3/ S 2 !p. H : (3 = O. This kind of robustness holds for H : (Jq+ 1 (Jo. It is intimately linked to the fact that even though the parametric model on which the test was based is false.B) by where W pxp is Gaussian with mean 0 and.3) equals (3 in (6..6) (3 is the LSE.p or more generally (3 E £0 + (30' a qdimensional affine subspace of RP. .
6 Robustness Properties and Semiparametric Models 421 If the first of these conditions fails.6.4. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. note that. If it holds but the second fails..3. X(2») where X(I).10) does not have asymptotic level a in general.1 Vn(~ where (3) + N(O. X(2) are independent E of paired pieces of equipment.6..3. But the test (6.11) i=1 &= v 2n ! t(xP) + X?»)..6.6.6. (6. Suppose without loss of generality that Var( (') = L Then (6.6. To see this.2.2 clearly hold. 2  (th). (X(I). under H.7) fails and in fact by Theorem 6.13) but & V2Ep(X(I») ::f. i=1 (6. Example 6...Section 6. However suppose now that X(I) 101 and X(2) 102 are identically distributed but not exponential.6.6. .5) are still meaningful but (and Z are dependent. (6. Q) (6.10) nL and ~ ~ X(j) 2 (6. the lifetimes A2 the standard Wald test (6.6. Simply replace (12 by (J ~2 = ~ ~ { ( ~1) n~ 2=1 Xz _ (X(I) + X(2»))2 2 + _ (X(1) + X(2»))2} 2 . The Linear Model with Stochastic Covariates with E and Z Dependent. V2 VarpX{I) (~2) Xz in general. Suppose E(E I Z) = 0 so that the parameters f3 of (6. XU) and X(2) are identically distributed.) respectively..12) The conditions of Theorem 6.6. E (i. we assume variances are heteroscedastic. That is. The TwoSample Scale Problem. For simplicity. If we take H : Al .a)" (6.3.15) . then it's not clear what H : (J = (Jo means anymore.17) is Suppose our model is that X "Reject iff n (0 1 where &2 2 O) > x 1 (1  .14) o Example 6. We illustrate with two final examples. let the distribution of ( given Z = z be that of a(z)(' where (' is independent of Z.. the theory goes wrong. Then H is still meaningful.6..
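In examples like these, where the usual variance formula for the estimate is no longer valid, one remedy discussed in the summary that follows is Huber's (1967) "sandwich" estimate. The sketch below (an added illustration, not from the text) contrasts the naive and sandwich standard errors for the least squares slope in a heteroscedastic linear model with simulated data.

```python
import numpy as np

# Heteroscedastic linear model: the usual s^2 (Z^T Z)^{-1} formula for the
# variance of the LSE is wrong, but the sandwich estimate remains valid.
rng = np.random.default_rng(3)
n = 2000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 0.5 + 1.5 * np.abs(Z[:, 1])                  # error scale depends on the covariate
Y = Z @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)

beta = np.linalg.solve(Z.T @ Z, Z.T @ Y)             # LSE
e = Y - Z @ beta                                     # residuals

ZtZ_inv = np.linalg.inv(Z.T @ Z)
naive = (e @ e / (n - 2)) * ZtZ_inv                  # homoscedastic formula
meat = Z.T @ (Z * (e ** 2)[:, None])
sandwich = ZtZ_inv @ meat @ ZtZ_inv                  # Huber-style estimate

print("naive SE of slope    :", np.sqrt(naive[1, 1]))
print("sandwich SE of slope :", np.sqrt(sandwich[1, 1]))
```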
provided we restrict the class of estimates to linear functions of Y1 . . 1 ::.6) does not have the correct level.7 PROBLEMS AND COMPLEMENTS Problems for Section 6. = Ad = 1/ d (Problem 6. j ::. . and Rao tests.6. . N(O. TJr.16) Summary. ""2 _ ""12 (TJ1. 1967). identical variances. . then the MLE (iil. (6. Specialize to the case Z = (1. (5 2)T·IS (U1.i.d. we showed that the MLEs and tests generated by the model where the errors are U. For d = 2 above this is just the asymptotic solution of the BehrensFisher problem discussed in Section 4. LR. (52) assumption on the errors is replaced by the assumption that the errors have mean zero. In partlCU Iar.. .6. use Theorem 3. in general. Wald. the test (6.. the methods need adjustment.9. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model.1 1.11 .4. The GaussMarkov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.. A solution is to replace Z(n)Z(n)/ 8 2 in (6...1. This is the stochastic version of the dsample model of Example 6. . N(O.. Ur.n l1Y .. In the linear model with a random design matrix. 6.6. The simplest solution at least for Wald tests is to use as an estimate of [Var y'n8]1 not 1(8) or ~D2ln(8) but the socalled "sandwich estimate" (Huber. It is easy to see that our tests of H : {32 = . (52) are still reasonable when the true error distribution is not Gaussian. then the MLE and LR procedures for a specific model will fail asymptotically in the wider model... . (i) the MLE does not exist if n = r.3 to compute the information lower bound on the variance of an unbiased estimator of (52. We considered the behavior of estimates and tests when the model that generated them does not hold. where Qis a consistent estimate of Q. . the confidence procedures derived from the normal model of Section 6.. Compare this bound to the variance of 8 2 • . and are uncorrelated. In this case.11) with both 11 and (52 unknown. the twosample problem with unequal sample sizes and variances.JL • ""n 2. and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model. . Show that in the canonical exponential model (6. in general.Ad) distribution.6. = {3d = 0 fail to have correct asymptotic levels in general unless (52 (Z) is constant or A1 = . 0 < Aj < 1. For the canonical linear Gaussian model with (52 unknown. .6) by Q1.A1.6.d.4. 0 To summarize: If hypotheses remain meaningful when the model is false then.1 are still approximately valid as are the LR. n 1 L. d."i=r+ I i .422 Inference in the Multiparameter Case Chapter 6 (Problem 6. ••• . .. and Rao tests need to be modified to continue to be valid asymptotically. .Id .Id) has a multinomial (A1. (ii) if n 2: r + 1. . (5 .fiT) o:2)T of U 2)T . Wald.. .6).3.Ad)T where (I1. In particular... Yn ..6) and.1.
0. then B the MLE of (). where IJ. (d) Show that Var(O) ~ Var(Y). {Ly = /31. fO 0.. 0. .Li = {Ly+(zi {Lz)f3.13. is (c) Show that Y and Bare unbiased.4..1] is a known constant and are i. Var(Uf) = 20.29). and (3 and {Lz are as defined in (1.d.2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. zi = (Zi21"" Zip)T..2. 1 {LLn)T. .14)... Yn ) where Z..Section 6.n.1. Derive the formula (6. = (Zi2. i = 1... .29) for the noncentrality parameter 82 in the oneway layout.28) for the noncentrality parameter ()2 in the regression example. N(O. fn where ei = ceil + fi.'''' Zip)T.2 ). . i = 1. Consider the model (see Example 1.1. i = 1. Let Yi denote the response of a subject at time i. of 7. and the 1. . .d. i = 1.. where a..2 . 9.2 known case with 0.2 ). (e) Show that Var(B) < Var(Y) unless c = O. t'V (b) Show that if ei N(O. Suppose that Yi satisfies the following model Yi ei = () + fi. the 100(1 . Show that >:(Y) defined in Remark 6. Derive the formula (6..n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. Let 1. .. 1 n f1. Show that in the regression example with p = r = 2. see Problem 2. Find the MLE B (). Here the empirical plugin estimate is based on U..2 coincides with the likelihood ratio statistic A(Y) for the 0. .22. i eo = 0 (the €i are called moving average errors.. 4.i. .4 • 3. Show that Ii of Example 6.5) Yi () + ei."" (Z~.(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of (). c E [0.1. . 8.1. Yl). J =~ i=O L)) c i (1.7 Problems and Complements 423 Hint: By A.1. (Zi. n.l+c (_c)j+l)/~ (1. . .a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . 5. . n.2 replaced by 0: 2 • 6.
We want to predict the value of a 14.1.~ L~=1 nk = (p~:)2 + Lk#i ..a)% confidence intervals for a and 6k • 12.a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z.. . . Yn . The following model is useful in such situations.Z.e . (a) Find a level (1 . n2 = . Note that Y is independent of Yl. Y ::.. ::. Yn be a sample from a population with mean f.p)S2 /x n .k)' C (b) If n is fixed and divisible by p. Often a treatment that is beneficial in small doses is harmful in large doses.. ~(~ + t33) t31' r = 2. (c) If n is fixed and divisible by 2(p ..":=::. a 2 ::. (n . Yn )..a).2) L(Zi2 .2. where n is even.{3i are given by = ~(f.I). Let Y 1 . Consider a covariate x.. Show that for the estimates (a) Var(a) = +==_=:. then the hat matrix H = (h ij ) is given by 1 n 11.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::. = 2(Y . . which is the amount . Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6. (Zi2 Z.. 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. . . (b) Find confidence intervals for 'lj. Assume the linear regression model with p future observation Y to be taken at the pont z. Consider the three estimates T1 = and T3 Y.p (~a) (n .p)S2 /x n .2)(Zj2 . then Var(8k ) is minimized by choosing ni = n/2. . ••. (d) Give the 100(1 . l(Yh .• 424 10. (a) Show that level (1 . ~ 1 .1 and variance a 2 .:1 Yi + (3/ 2n) L~= ~ n+ 1 }i.p (1 . Yn ) such that P[t. T2 1 (1/ 2n) L. 15.2)..Z. 'lj. (b) Find a level (1 a) prediction interval for Y (i.a) confidence intervals for linear functions of the form {3j . In the oneway layout.• statistics HYl. then Var( a) is minimized by choosing ni = n / p...2)2 a and 8k in the oneway layout Var(8k ) .. = np n/2(p 1).
i = 1.or dose of a treatment. xC' = 0 => x = 0 for any rvector x.• .r2 )l 1 2 7 > O. Yi) and (Xi. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. compute confidence intervals for (31. and Q = P. and a response variable Y. then the r x r matrix C'C is of rank r and.7 Probtems and Lornol. (33. 17. Check AO. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. (b) Plot (Xi. P.0'2) is the N(Il'. . a 2 <7 2 . ( 2 ) density...emEmts 425 .1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL.77 167. Assume the model = (31 + (32 X i + (33 log Xi + Ei. (b) Show that the MLE of ({L.95 Yi = e131 e132xi Xf3.0'2)(x ) + E<t'(JL. 8). Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. n. Problems for Section 6. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . 653). 1 2 where <t'CJL. n are independent N(O.I::llogxi' You may use X = 0. ( 2 ). . Let 8. In the Gaussian linear model show that the parametrization ({3. 1952. hence. fli) where fi1.1) + 2<t'CJL.. p. E :::. fi3 and level 0. Hint: Because C' is of rank r. .5 Zi2 * + (33zi2 where Zil = Xi  X. ( 2 ).A6 when 8 = ({L. P = {N({L. 8) = 2. and p be as in Problem 6. . 7 2 ) does not exist so that A6 doesn't hold.0289. ( 2 )T is identifiable if and only if r = p. En Xi. r :::. : {L E R. .. ( 2 ) logp(x. (a) For the following data (from Hald.Section 6. Y (yield) 3. •. p(x. (32. nonsingular.2. Show that if C is an n x r matrix of rank r. 16. which is yield or production.r2) (x). fi2. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O.2 1. a 2 > O}.
The MLE of based on (X.2. I .i > 1.1 show thatMLEs of (3. (6. 6. ..24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. that (d) (fj  (3)TZ'f.)Z(n)(fj ..426 Inference in the Multiparameter Case Chapter 6 I . .2. X . hence.I3).2 hold if (i) and (ii) hold. t) is then called a conditional MLE. Y). . H E H}. Show that if we identify X = (z(n). (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4.1) EZ .". (c) Apply Slutsky's theorem to conclude that and.Pe"H). I I .2.2. 3. i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric.IY . Fill in the details of the proof of Theorem 6. 5.20). show that the assumptions of Theorem 6. In Example 6.1. show that ([EZZ T jl )(1. n[z(~)~en)jl)' X.. Z i ~O. and . (P("H) : e E e.10 that the estimate Raphson estimates from On is efficient._p distribution. (a) In Example 6. and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ).  I . T 2 ) based on the first two I (d) Deduce from Problem 6. b .2. In Example 6. In Example 6. (I) Combine (aHe) to establish (6. (e) Construct a method of moment estimate moments which are ~ consistent. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic.2.(3) = op(n.(b) The MLE minimizes . Hint: !x(x) = !YIZ(Y)!z(z). H abstract.2. e Euclidean.2. T(X) = Zen).2.Zen) (31 2 e ~ 7.2 are as given in (6. '".. iT 2 ) are conditional MLEs. /1. (fj multivariate nannal distribution with mean 0 and variance.  (3)T)T has a (".2. Establish (6.1/ 2 ). 8. .24).2.1...21). given Z(n). then ({3. . In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H. (e) Show that 0:2 is unconditionally independent of (ji.2.'.fii(ji . ell of (} ~ = (Ji. Hint: (a)..2.
.2. (a) Show that if solves (Y ao is assumed known a unique MLE for L.2. the <ball about 0 0 .. p is strictly convex.0 0 1 < <} ~ 0. R d are such that: (i) sup{jDgn(O) . a' .p(Xi'O~) + op(1) n i=l n ) (O~ . <).. . Let Y1 •. = J. .0 0 ). Suppose AGA4 hold and 8~ is vn consistent.'2:7 1 'l'(Xi.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi.p(Xi'O~) n. Y.(XiI') _O. (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6. p i=l J. t=l 1 tt Show that On satisfies (6.LD.1' _ On = O~  [ . Hint: n .L'l'(Xi'O~) n. Show that if BB a unique MLE for B1 exists and 10. that is. (iv) Dg(O) is continuous at 0 0 .•. ao (}2 = (b) Write 8 1 uniquely solves 1 8.3). (iii) Dg(0 0 ) is nonsingular. ] t=l 1 n .Oo) i=cl (1 .1/ 2 ). 8~ = 1 80 + Op(n.l exists and uniquely ~ .Dg(O)j . hence. and f(x) = e. Him: You may use a uniform version of the inverse function theorem: If gn : Rd j.Section 6. E R. (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1. (ii) gn(Oo) ~ g(Oo).l + aCi where fl.l real be independent identically distributed Y.1) starting at 0. a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and. .LD.7 Problems and Complements 427 9. 10 . Examples are f Gaussian.' ..O) has auniqueOin S(Oo.x (1 + C X ) . (logistic).
Zi.2. .3.12) and the assumptions of Theorem (6. then it converges to that solution../3)' = L i=1 1 CjZiU) n p . Zip)) + L j=2 p "YjZij + €i J . Inference in the Multiparameter Case Chapter 6 > O. P I .. iteration of the NewtonRaphson algorithm starting at 0. f.3). • (6. .(Z"  ~(') z. i .. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution.ip range freely and Ei are i. Thus. ... LJ j=2 Differentiate with respect to {31. ~. Show that if 3 0 = {'I E 3 : '1j = 0.(j) l..26) and (6. N(O. t• where {31. . Y. .d. •.. I '" LJ i=l ~(1) Y. Suppose that Wo is given by (6.2. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}. converges to the unique root On described in (b) and that On satisfies 1 1 • j. Establish (6.~_. Wald. Problems for Section 6. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. and ~ . . Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj. 0 < ZI < . Hint: Write n J • • • i I • DY.1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1.6).. {Vj} are orthonormal. i=l • Z. and Rao tests for testing H : Ih. . . .. 2 ° 2. 'I) : 'I E 3 0 } and. hence.O E e.(Z" . •. log Ai = ElI + (hZi.Zi )13. a 2 ).. < Zn 1 for given covariate values ZI. 1 Yn are independent Poisson variables with Yi . there exists a j and their image contains a ball S(g(Oo).' " 13. . = 131 (Zil .428 then. II.O). = versus K : O oF 0.12).i.II(Zil I Zi2. Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . . for "II sufficiently large.27). minimizing I:~ 1 (Ii 2 i.. 1 Zw Find the asymptotic likelihood ratio. that Theorem 6.2..l hold for p(x.3 i 1.3.3. CjIJtlZ.2 holds for eo as given in (6. i2. Suppose responses YI . )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3.. > 0 such that gil are 1 . P(Ai).. 'I) = p(. Similarly compute the infonnation matrix when the model is written as Y.3. q(.
(b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ . 1). aTB.. OJ}. Testing Simple versus Simple.3 hold.2. {IJ E S(lJo): ~j(lJ) = 0.) where k(Bo. with density p(. IJ. q + 1 <j< rl.) 1(Xi .. (Adjoin to 9q+l.o log (X 0) PI. q + 1 <j < r S(lJo) . 1 5.Xn ) be the likelihood ratio statistic. =  p(·.n:c"' 4c:2=9 3. Xi. Assume that Pel of Pea' and that for some b > 0.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('. respectively.. Consider testing H : 8 1 = 82 = 0 versus K : 0. N(IJ" 1)... . Show that under H. .. Deduce that Theorem 6. 'I) S(lJo)} then.. X n Li.'" .n) satisfy the conditions of Problem 6. Suppose that Bo E (:)0 and the conditions of Theorem 6. Oil and p(X"Oo) = E. .. 0 ») is nK(Oo. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo).. p .(I(X i . N(Oz. IJ.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio.d. Vi).aq are orthogonal to the linear span of ~(1J0) . independent. be ii. ..3.. (ii) E" (a) Let ).2. Let (Xi. . 1 < i < n. B). . < 00. Consider testing H : (j = 00 versus K : B = B1 . There exists an open ball about 1J0.d.Section 6. > 0 or IJz > O.gr. Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo.'1 8 0 = and.'1) and Tin '1(lJ n) and.(X" .(X1 .. Suppose 8)" > 0. .X. = ~ 4. \arO whereal.l. hence..7 Problems and C:com"p"":c'm"'.IJ) where q(.3.3. even if ~ b = 0. pXd . Let e = {B o .3 is valid. 210g. Oil + Jria(00 . q('.n 7J(Bo. with Xi and Y. .. . (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0). j = 1.) O.
(h) : 0 < 0..\(Xi. Hint: By sufficiency reduce to n = 1. ' · H In!: Consl'defa1O ..) have an N. =0.1.6. (0.d. which is a mixture of point mass at O. Show that (6. In the model of Problem 5(a) compute the MLE (0.) if 0. XI and X~ but with probabilities ~ .li : 1 <i< n) has a null distribution.0). Po) distribution and (Xi. Let Bi l 82 > 0 and H be as above.0. 1988). < cIJ"O. 0. < .. Show that 2log. be i. (b) Suppose Xil Yi e 4 (e) Let (X" Y. > 0.3. are as above with the same hypothesis but = {(£It. li).) under the model and show that (a) If 0. 2 e( y'n(ii. for instance. we test the efficacy of a treatment on the basis of two correlated responses per individual.\(X).1.0. = (T20 = 1 andZ 1= X 1.  0. .=0 where U ~ N(O. .0. > OJ. ii. Exhibit the null distribution of 2 log . = 0. 1) with probability ~ and U with the same distribution.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n.0"10' O"~o.~. ±' < i < n) is distributed as a respectively. Yi : 1 and X~ with probabilities ~. • i .)) ~ N(O.6.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6.\( Xi. (b) If 0.. 0. (iii) Reparametrize as in Theorem 6. O > 0..19) holds.2 for 210g. Yi : l<i<n). = a + BB where . 2 log >'( Xi..  OJ. Hint: ~ (i) Show that liOn) can be replaced by 1(0). = v'1~c2' 0 <.3. /:1.i. Po . Then xi t. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. > 0. I .0. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0. 7. under H. Sucb restrictions are natural if. Z 2 = poX1Y v' . and ~ where sin. j ~ ~ 6. Wright. and Dykstra. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular. 1 < i < n. mixture of point mass at 0.3. (d) Relate the result of (b) to the result of Problem 4(a)..
4.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5. 3.4.4 ~ 1(11) is continuous. A3.21).2 to lin . (a) Show that the correlation of Xl and YI is p = peA n B) .P(A))P(B)(1 .7 Problems and Complements ~(1 ) 431 8. . Hint: Write and apply Theorem 6. 9. 2.3. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6.4. (e) Conclude that if A and B are independent.4) explicitly and find the one that corresponds to the maximizer of the likelihood. 10.8) for Z. Show that under A2. Hint: Argue as in Problem 5..22) is a consistent estimate of ~l(lIo). (b) (6. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born.nf. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.3.1(0 0 ). Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.8) by Z ~ .8) (e) Derive the alternative form (6. hence.3.0 < PCB) < 1. then Z has a limitingN(011) distribution. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero. 1.3.2.P(A)P(B) JP(A)(1 . Exhibit the two solutions of (6.Section 6.4. 0 < P( A) < 1.6 is related to Z of (6.10. A611 Problems for Section 6.
8(r2' 82 I/ (8" + 8. 432 Inference in the Multiparameter Case Chapter 6 i 4. CI... n R. TJj2. .b1only... 6.6 that H is true. N 22 ) rv M (u. Ti) (the hypergeometric distribution). C j = L' N'j. in principle. Suppose in Problem 6. n a2 ) . . n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o.. S.. (b) Deduce that Pearson's X2 is given by (6. TJj2 ~ = Cj n where R. . N ll and N 21 are independent 8( r. B!c1b!. . 1 : X 2 (b) How would you. = Lj N'j.. j. Let N ij be the entries of an let 1Jil = E~=l (}ij. ( rl. TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. . Consider the hypothesis H : Oij = TJil TJj2 for all i. TJj2 are given by TJil ~ = .~~. i = 1.ra) nab . Fisher's Exact Test From the result of Problem 6. (a) Show that the maximum likelihood estimates of TJil.4. . .9) and has approximately a X1al)(bl) distribution under H..4. 8 21 ... n. . 811 \ 12 . use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 ..TI.~~.. . . I (c) Show that under independence the conditional distribution of N ii given R. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI..4 deduce that jf j(o:) (depending on chosen so that Tl. Hint: (a) Consider the likelihood as a function of TJil..2 is 1t(Ci. (a) Let (NIl.1 II . R 2 = T2 = n .D.1..1 .4.)). C i = Ci.2. . 8l! / (8 1l + 812 )). nal ) n12. N 12 • N 21 . (22 ) as in the contingency table..C.~l. . nab) A ) = B. Cj = Cj] : ( nll. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent. This is known as Fisher's exact test. b I Ri ( = Ti. j = 1. It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. 7.. j = 1.54). (a) Show that then P[N'j niji i = 1. ( where ( . i = 1. . a . are the multinomial coefficients. =T 1..u..~.
11. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n .05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. = 0 in the logistic model. (a) If A. Give pvalues for the three cases. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'.ZiNi > k] = a. Then combine the two tables into one. n. .. if and only if. where Pp~ [2:f . ~i = {3.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a).(rl + cd or N 22 < ql + n . 10. The following table gives the number of applicants to the graduate program of a small department of the University of California.(rl + cI). + IhZi.5? Admit Deny Men Women 1 19 . which rejects. and that we wish to test H : Ih < f3E versus K : Ih > f3E. B. (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A. 2:f .14). B INDEPENDENT) (iii) PeA (C is the complement of C. 9. for suitable a. if A and C are independent or B and C are independent.1""". (b). C2). Test in both cases whether the events [being a man] and [being admitted] are independent.12 I . there is a UMP level a test.4. classified by sex and admission status. Show that.7 Problems and Complements 433 8. Establish (6.. and that under H.0.ziNi > k. and petfonn the same test on the resulting table.Section 6. Would you accept Or reject the hypothesis of independence at the 0.. consider the assertions.) Show that (i) and (ii) imply (iii).BINDEPENDENTGIVENC) n B) = P(A)P(B) (A. Zi not all equal. but (iii) does not. C are three events. N 22 is conditionally distributed 1t(r2. (h) Construct an experiment and three events for which (i) and (ii) hold. Suppose that we know that {3. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex.4.
l 1 . .. Xi) are i.. . asymptotically N(O.) ~ B(m. in the logistic regression model. ~ 8 m + [1(8 m )Dl(8 m ).. (a) Compute the Rao test for H : (32 Problem 6.. for example.Oio)("Cooked data").00 or have Eo(Nd : ..4.4. (Zn. mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6. .Ld..2. I). .: nOiO. k and a 2 is unknown.. .20) for the regression described after (6.d. I. 15. .'Ir(. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 .4.4) is as claimed formula (2. with (Xi I Z. 434 • Inference in the Multiparameter Case Chapter 6 f • 12. i 16. . • I 1 • • (a)P[Z. Given an initial value ()o define iterates  Om+l .OiO)2 > k 2 or < k}. Y n ) have density as in (6. I < i < k are independent. then 130 as defined by (6. Use this to imitate the argument of Theorem 6. case. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6.d. 3.3) l:::~ I (Xi .(Ni ) < nOiO(1 . " Ok = 8kO. Suppose that (Z" Yj). a 2 ) where either a 2 = (known) and 01 . · . 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2.18) tends 2 to Xrq' Hint: (Xi . ..3. 13.1 14..5 construct an exact test (level independent of (31).f.LJ2 Z i )). a under H. but under K may be either multinomial with 0 #.z(kJl) ~I 1 .4. . . 2. This is an approximation (for large k... Zi and so that (Zi..i. " lOk vary freely. or Oi = Bio (known) i = 1. . X k be independent Xi '" N (Oi.8) and. if the design matrix has rank p..4.11. .. .4. Let Xl.5 1.5. ..3.. .5. .. but Var.4).".. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI. . E {z(lJ. Show that the likelihO<Xt ratio test of H : O} = 010 . Verify that (6. Nk) ~ M(n. a 2 = 0"5 is of the form: Reject if (1/0". Suppose the ::i in Problem 6. . 010. which is valid for the i. i Problems for Section 6. Show that. n) and simplification of a model under which (N1. 1 ...5. .15) is consistent. Compute the Rao test statistic for H : (32 case. a5 j J j 1 J I .. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973)..i. = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown.OkO) under H.)llmi~i(1 'lri). .11 are obtained as realization of i.. Tn..
g("i) = zT {3. Wi = w(". (y... .9). b(B).F.7 Problems and Complements 435 (b) The linear span of {ZII). as (Z. 00 d" ~ a(3j (b) Show that the Fisher information is Z]... . .1 = 0 V fJ~ . ...5). T).. give the asymptotic distribution of y'n({3 .. . the resuit of (c) coincides with (6. k.d.ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j. T. h(y. ..)z'j dC. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.(. Y follow the model p(y. . Suppose Y. (a) Show that the likelihood equations are ~ i=I L. .) . P(I"). j = 1. d". 05. T.y . g(p.T)exp { C(T) where T is known. Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1.(3). contains an open L. k. (Zn. ball about "k=l A" zlil in RP .9)... Find the canonical link function and show that when g is the canonical link. . then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) ..) ~ 1/11("i)( d~i/ d"..5. Set ~ = g(. b(9). Show that for the Gaussian linear model with known variance D(y..5.. <. J . and b' and 9 are monotone. T).b(Oi)} p(y. .) = ~ Var(Y)/c(T) b"(O).= 1 .. Wn). J JrJ 4. Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ .). (e) Suppose that Y.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. Y) and that given Z = z. Show that. b(9)..O(z)) where O(z) solves 1/(0) = gI(zT {3). . ~ .Section 6. Let YI. and v(.p. Show that when 9 is the canonical link. . C(T). C(T). the deviance is 5..).5. (c) Suppose (Z" Y I ). Yn ) are i. your result coincides with (6.. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known). (d) Gaussian GLM.) and v(. Give 0. . has the Poisson.) and C(T) such that the model for Yi can be written as O. 9 = (b')l. 1'0) = jy 1'01 2 /<7. under appropriate conditions.i. Give 0. Assume that there exist functions h(y. ~ N("" <7. distribution. h(y.T). L... and v(.Oi)~h(y. In the random design case.
I . P. Consider the linear model of Example 6.3 and. W n and R n are computed under the assumption of the Gaussian linear model with a 2 known. hence.3 is as given in (6. Wald.6 1.6. i 6.Q+1>'" .1) ~ . /10) is estimated by 1(80 ). . but if f~(x) = ~ exp Ix 1'1.i. ! I I ! I . Show that the standard Wald test forthe problem of Example 6. l 1 . then it is. Wn Wn + op(l) + op(I). and Rao tests are still asymptotically equivalent in the sense that if 2 log An. then O' 2(p) > 0'2 = Varp(X. Show that 0: 2 given in (6.7. j. .6. In the hinary data regression model of Section 6. set) = 1. I "j 5. (6.Xn are i. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6.3. (a) Show that if f is symmetric about 1'.d. " . f}o) is used.6.3..10) creates a valid level u test.2. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation.3) then f}(P) = f}o. then the Rao test does not in general have the correct asymptotic level. 7. and R n are the corresponding test statistics. then under H.6. . (3) where s(t) is the continuous distribution function of a random variable symmetric about 0. then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0. let 1r = s(z. 0'2). but that if the estimate ~ L:~ dDIllDljT (Xi.j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6. = 1" (b) Show that if f is N(I'. Suppose ADA6 are valid.).1 under this model and verifying the fonnula given.4. O' 2(p)/0'2 = 2/1r.p under the sole assumption that E€ = 0.6. then 0'2(P) < 0'2.14) is a consistent estimate of2 VarpX(l) in Example 6. Establish (6. W n . if P has a positive density v(P). the infonnation bound and asymptotic variance of Vri(X 1'). Apply Theorem 6. 2. .s(t).1.2. Show that the LR. t E R.1. By Problem 5. .15) by verifying the condition of Theorem 6.2 and the hypothesis (3q+l = (3o. if VarpDl(X.(3p ~ (3o. in fact. . Show that.10).O' 2 (p») = 1/4f(v(p». Note: 2 log An.. Suppose Xl. 0 < Vae f < 00. replacing &2 by 172 in (6. .6. then v( P) . • I 3. the unique median of p. ..6. I 4.6. that is. I: . n . Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold.
for instance). But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1...Zn and evaluate the performance of Yjep) . Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows. then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular.2. L:~ 1 (P.I EL(Y. Show that if the correct model has Jri given by s as above and {3 = {3o.. _.7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models. 8. A natural goal to entertain is to obtain new values Yi". .{3lp) p and {3(P) ~ (/31.p.9. where Xln) ~ {(Y. .1) Yi = J1i + ti. (Model Selection) Consider the classical Gaussian linear model (6.(b) continue to hold if we assume the GaussMarkov model. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n.Section 6. O)T and deduce that (c) EPE(p) = RSS(p) + ~. i . i = 1. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution. 0.lpI)2 be the residual sum of squares. Zi.2 is an unbiased estimate of EPE(P). Gaussian with mean zero J1i = Z T {3 and variance 0'2. .. Zi are ddimensional vectors for covariate (factor) values. 1 V.1.i . .. hence..... y~p) and. (b) Suppose that Zi are realizations of U. i = 1. Suppose that . Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d . f3p. (a) Show that EPE(p) ~ ..' i=l .jP)2 where ILl ) = z. at Zll'" . Let f3(p) be the LSE under this assumption and Y.p don't matter. ~ ~ (d) Show that (a). . . . Let RSS(p) = 2JY. be the MLE for the logit model. where ti are i...d.lp)2 Here Yi" l ' •• 1 Y. 1973. . are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i. Hint: Apply Theorem 6..2 is known. the model with 13d+l = .(p) the corresponding fitted value.. /3P+I = '" = /3d ~ O..1.d. Zi) : 1 <i< n}... 1 n.9.2 (b) Show that (1 + ~D + .. 1 n. that Zj is bounded with probability 1 and let ih(X ln )). . .
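A minimal numerical sketch of the selection rule in the model-selection problem above (added for illustration, and assuming the criterion takes the Mallows-type form RSS(p)/n + 2σ²p/n with σ² known): for simulated data in which only the first three covariates matter, the criterion is computed for each nested model.

```python
import numpy as np

# With sigma^2 known, RSS(p)/n + 2*sigma^2*p/n estimates the expected
# prediction error of the model using only the first p covariates; we pick
# the nested model minimizing it.
rng = np.random.default_rng(4)
n, d, sigma2 = 100, 6, 1.0
Z = rng.normal(size=(n, d))
beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0])     # only the first 3 matter
Y = Z @ beta + np.sqrt(sigma2) * rng.normal(size=n)

for p in range(1, d + 1):
    Zp = Z[:, :p]
    beta_p = np.linalg.lstsq(Zp, Y, rcond=None)[0]
    rss = np.sum((Y - Zp @ beta_p) ** 2)
    crit = rss / n + 2 * sigma2 * p / n
    print(f"p = {p}: estimated EPE = {crit:.3f}")
```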
Heart Study after Dixon and Massey (1969). ti. Hint: (a) Note that ". Y/" j1~. Introduction to Statistical Analysis. EPE(p) n R SS( p) = . which are not multinomial. t12 and {Z/I. . 1 n n L (I'. 1970.) . (b) The result depends only on the mean and covariance structure of the i=l"".6. 1969.16). 1 I . . but it is reasonable. ! .' .1 (1) From the L A. AND F.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa.". For instance. this makes no sense for the model we discussed in this section. Note for Section 6.5. )I'i). To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. A.~I'.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii.L .n.z. W. Give values of /31. Use 0'2 = 1 and n = 10. The Analysis of Binary Data London: Methuen. .4. 3rd 00. we might envision the possibility that an overzealous assistant of Mendel "cooked" the data.2 (1) See Problem 3.. DIXON. .9 REFERENCES Cox. D. i=1 Derive the result for the canonical model. R.8 NOTES Note for Section 6. ~(p) n . + Evaluate EPE for (i) ~i ~ "IZ.9 for a discussion of densities with heavy tails. and (ii) 'T}i = /31 Zil + f32zi2.. i 6. New York: McGrawHill. The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6. . LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6. . MASSEY. 6.. Y.  J . if we consider alternatives to H. Note for Section 6. 1 "'( i=1 .4 (1) R. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.
GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I New York: McGraw-Hill, 1961.
HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.
HUBER, P., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., 221-233 (1967).
KASS, R., J. KADANE, AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).
KOENKER, R., AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc., Ser. C, 36, 383-393 (1987).
LAPLACE, P. S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (1789). (Reprinted in Oeuvres Complètes, Vol. II, 475-558, Paris: Gauthier-Villars.)
MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).
McCULLAGH, P., AND J. NELDER, Generalized Linear Models London: Chapman and Hall, 1983.
PORTNOY, S., AND R. KOENKER, "The Gaussian Hare and the Laplacian Tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
ROBERTSON, T., F. WRIGHT, AND R. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.
SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
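The expected prediction error criterion used in the model selection problem above can be checked numerically when σ² is known. The following sketch (Python with numpy; the design matrix, coefficient vector, sample size, and number of replications are all arbitrary choices made for this illustration, not values taken from the text) simulates the Gaussian linear model, averages the unbiased estimate RSS(p)/n + 2pσ²/n for each candidate p, and compares it with a Monte Carlo approximation of EPE(p).

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 50, 6, 1.0
    Z = rng.normal(size=(n, d))                        # hypothetical covariate matrix
    beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0])   # last coordinates "don't matter"
    mu = Z @ beta

    def fitted(y, p):
        # least squares fit using only the first p covariates
        Zp = Z[:, :p]
        bhat, *_ = np.linalg.lstsq(Zp, y, rcond=None)
        return Zp @ bhat

    reps = 2000
    for p in range(1, d + 1):
        cp_avg, epe_avg = 0.0, 0.0
        for _ in range(reps):
            y = mu + sigma * rng.normal(size=n)        # training responses
            ystar = mu + sigma * rng.normal(size=n)    # independent copies at the same z's
            muhat = fitted(y, p)
            rss = np.sum((y - muhat) ** 2)
            cp_avg += rss / n + 2 * p * sigma**2 / n   # unbiased estimate of EPE(p)
            epe_avg += np.mean((ystar - muhat) ** 2)   # Monte Carlo version of EPE(p)
        print(p, round(cp_avg / reps, 3), round(epe_avg / reps, 3))

Up to simulation error the two columns agree, and with this particular choice of β both are smallest at p = 3, the number of nonzero coefficients.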
IS contain some results that the student may not know. We add to this notion the requirement that to be called an experiment such an action must be repeatable. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The Kolmogorov model and the modem theory of probability based on it are what we need. an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. The adjective random is used only to indicate that we do not. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level. although we do not exclude this case. The situations we are going to model can all be thought of as random experiments. we include some commentary. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. The intensity of solar flares in the same month of two different years can vary sharply. The reader is expected to have had a basic course in probability theory. A. which are relevant to our study of statistics. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts. we include some proofs as well in these sections. require that every repetition yield the same outcome. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties.Appendix A A REVIEW OF BASIC PROBABILITY THEORY In statistics we study techniques for obtaining and using information in the presence of uncertainty.I THE BASIC MODEl Classical mechanics is built around the principle that like causes produce like effects.14 and A. is tossed can land heads or tails. Viewed naively. This 441 . Therefore. A coin that. in addition. Sections A. at least conceptually.
is to many statistician::. the operational interpretation of the mathematical concept of probability. B.1. A. " are pairwise disjoint sets in I • I Recall that Ui I Ai is just the collection of points that are in anyone of the sets Ai and that two sets are disjoint if they have no points in common.1.. For a discussion of this approach and further references the reader may wish to consult Savage (1954). where 11. A probabiliry distribution or measure is a nonnegative function P on A having the following properties: (i) P(Q) n 1 n . A. By interpreting probability as a subjective measure. . then (ii) If AI. We denote events by A... complementation. ." Another school of statisticians finds this formulation too restrictive. de Groot (1970). Ii I I I j .1/ /1. Grimmett and Stirzaker. which by definition is a nonempty class of events closed under countable unions.\ is the number of times the possible outcome A occurs in n repetitions. . We now turn to the mathematical abstraction of a random experiment. A random experiment is described mathematically in tenns of the following quantities. i ! A.4 We will let A denote a class of subsets of to which we an assign probabilities. If A contains more than one point. induding the authors. 1974. A2 . intersections. Chung.I I 442 A Review of Basic Probability Theory Appendix A 1 longtefm relative frequency 11 . . we preSUnle the reader to be familiar with elementary set theory and its notation at the level of Chapter I of Feller (1968) or Chapter 1 of Parzen (1960). In this sense. and Berger (1985). Savage (1962). For example. Lindley (1965). is denoted by A. . from horse races to genetic experiments. Its complement.~ n and is typically denoted by w. they are willing to assign probabilities in any situation involving uncertainty. 1 1  .l The sample space is the set of all possible outcomes of a random experiment.l.3 Subsets of are called events. n. whether it is conceptually repeatable or not. C for union. I l 1. set theoretic difference." The set operations we have mentioned have interpretations also. We denote it by n. For technical mathematical reasons it may not be possible to assign a probability P to every subset of n.2 A sample point is any member of 0.l. and complementation (cf. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A. We shall use the symbols U. However. almost any kind of activity involving uncertainty. A. and so on or by a description of their members. = 1. If wEn. 1992. and inclusion as is usual in elementary set theory. the relation A C B between sets considered as events means that the occurrence of A implies the occurrence of B. {w} is called an elementary event. the probability model. Raiffa and Schlaiffer (\96\). . 1977). the null set or impossible event. A is always taken to be a sigma field. as we shall see subsequently. • . and Loeve. intersection. c. falls under the vague heading of "random experiment. In this section and throughout the book. it is called a composite event.
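To make the definition of a probability distribution concrete, here is a minimal sketch in Python (the two-dice sample space and the particular events are assumptions of this illustration, not taken from the text) that builds P on a finite Ω by counting equally likely sample points and then checks property (i) and the additivity property (ii) for a pair of disjoint events.

    from fractions import Fraction
    from itertools import product

    omega = list(product(range(1, 7), repeat=2))   # sample space: two tosses of a die

    def P(A):
        # probability of the event A (a subset of omega) under equal likelihood
        return Fraction(len(A), len(omega))

    A = {w for w in omega if w[0] + w[1] == 7}     # the sum is seven
    B = {w for w in omega if w[0] == w[1]}         # the two tosses agree
    assert P(set(omega)) == 1                      # property (i)
    assert A.isdisjoint(B)
    assert P(A | B) == P(A) + P(B)                 # additivity for disjoint events
    print(P(A), P(B), P(A | B))                    # 1/6, 1/6, 1/3

Since P here simply counts sample points, this is the equally likely model discussed in Section A.3.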
Sections 45 Pitman (1993) Section 1. A.S The three objects n. then P (U::' A. . 1. then PCB .. . In this case.3 Panen (1960) Chapter I.) (Bonferroni's inequality).. References Gnedenko.. J. P) either as a probability model or identify the model with what it represents as a (random) experiment.3 Hoe!. C An . A.40 < P(A) < 1.' 1 peA. We shall refer to the triple (0.3 If A C B. Sections 13. we can write f! = {WI.2 and 1.2 Elementary Properties of Probability Models 443 A.4). W2.P(A).2 PeN) 1 . and P together describe a random experiment mathematically.2.2. Sections 15 Pitman (1993) Sections 1.2. (A. we have for any event A.L~~l peA.2 Parzen (1960) Chapter 1.. Port..2.3 Hoel..2) n . > PeA).. } and A is the collection of subsets of n.3 A.l1. References Gnedenko (1967) Chapter I.1.PeA). P(0) ~ O.Section A.3. (U. For convenience.' 1 An) < L. by axiom (ii) of (A. That is. ~ A.3 A...3 DISCRETE PROBABILITY MODELS A. (1967) Chapter l. A..A) = PCB) .1 If A c B. when we refer to events we shall automatically exclude those that are not members of A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS The following are consequences of the definition of P. c A.S P A. P(B) A.2.). A. A.6 If A .2. Port.7 P 1 An) = limn~= P(A n ).t A probability model is called discrete if is finite or countably infinite and every subset of f! is assigned a probability.l. Section 8 Grimmett and Stirzaker (1992) Section 1.2.3. c . (n~~l Ai) > 1.68 Grimmett and Stirzaker (1992) Sections 1. and Stone (1992) Section 1. and Stone (1971) Sections 1.
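The elementary properties above are easy to sanity-check numerically. The sketch below (Python; the ten-point sample space, the randomly generated probabilities, and the random events are all assumptions of the illustration) verifies the complement rule, monotonicity, the union bound, and Bonferroni's inequality on a thousand randomly chosen pairs of events.

    import random

    random.seed(1)
    omega = list(range(10))
    weights = [random.random() for _ in omega]
    total = sum(weights)
    prob = [w / total for w in weights]            # a probability distribution on omega

    def P(A):
        return sum(prob[w] for w in A)

    for _ in range(1000):
        A = {w for w in omega if random.random() < 0.4}
        B = {w for w in omega if random.random() < 0.4}
        assert abs(P(set(omega) - A) - (1 - P(A))) < 1e-12   # complement rule
        if A <= B:
            assert P(A) <= P(B) + 1e-12                      # monotonicity
        assert P(A | B) <= P(A) + P(B) + 1e-12               # union bound
        assert P(A & B) >= P(A) + P(B) - 1 - 1e-12           # Bonferroni's inequality
    print("all checks passed")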
is an experiment leading to the model of (A. Sections 45 Parzen (1960) Chapter I.1 • 1 . (A. Then P( {w}) = 1/ N for every wEn. (A. From a heuristic point of view P(A I B) is the chance we would assign to the event A if we were told that B has occurred. (A. If A" A 2 . For large N.=:.:.' .3) yield U.4.4)(ii) and (A. 1 1 PiA n B) = P(B)P(A I B). Transposition of the denominator in (AA.l n' " I"I ".2) In fact.• 1 B n are (pairwise) disjoint events of positive probability whose union is fl.4.I B) is a probability measure on (fl. .~ 1 1 ~• 1 . I B). then . and drawing.4. .3).• P (Q A.4.3.:c:. (A. a random number table or computer can be used. (A. which we write PtA I B). i • i: " i:.. '< = ~=~:.1) If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment...l) gives the multiplication rule. WN are the members of some population (humans. by l P(A I B) ~ PtA n B) P(B) . machines. Such selection can be carried out if N is small by putting the "names" of the Wi in a hopper. ~f 1 . . then P( A I B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur.:c Number of elements in A N . Sections 67 Pitman (1993) Section l. .4 CONDITIONAL PROBABILITY AND INDEPENDENCE I j 1 1 ! .444 A Review of Basic Probability Theory Appendix A An important special case arises when n has a finite number of elements..3) 1 I A. . the identity A = . Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another. j=l (A..4) . flowers. . guinea pigs.=. A.3. selecting at random. say N. shaking well. the function P(.3) If B l l B 2 .(A n B j ).. i References Gnedenko (1967) Chapter I.3. A) which is referred to as the conditional probability measure given B. for fixed B as before. etc. and P( A ) ~ j .4. I B) = ~ PiA.. all of which are equally likely. n PiA) = LP(A I Bj)P(Bj ).1.).. Given an event B such that P( B) > 0 and any other event A.4 Suppose that WI. we define the conditional probability of A given B. are (pairwise) disjoint events and P(B) > 0.
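The conditional probability formula, the multiplication rule, and the formula for total probability can all be verified on a small urn model. In the sketch below (Python; the hat with two green chips and one red chip and the two ordered draws without replacement are an assumed example), every ordered outcome is equally likely, and the identity P(A) = Σ P(A | Bj)P(Bj) is checked exactly with rational arithmetic.

    from fractions import Fraction
    from itertools import permutations

    chips = ["G1", "G2", "R"]                      # two green chips, one red chip
    outcomes = list(permutations(chips, 2))        # ordered draws without replacement

    def P(event):
        # event is a predicate on an ordered outcome (first draw, second draw)
        return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

    A = lambda w: w[1] == "R"                      # red on the second draw
    B = lambda w: w[0] == "R"                      # red on the first draw
    Bc = lambda w: w[0] != "R"                     # green on the first draw

    P_A_given_B = P(lambda w: A(w) and B(w)) / P(B)        # conditional probability of A given B
    P_A_given_Bc = P(lambda w: A(w) and Bc(w)) / P(Bc)
    assert P_A_given_B * P(B) + P_A_given_Bc * P(Bc) == P(A)   # total probability formula
    print(P(A), P_A_given_B, P_A_given_Bc)                     # 1/3, 0, 1/2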
. ..II) for any j and {i" . we can combine (A..) j=l k (AA. Chapter 3.. A and B are independent if knowledge of B does not affect the probability of A.i.." .. and Stone (1971) Sections lA. . .. n Bnd > O. Section 4..4) and obtain Bayes rule .4.En such that P(B I n .···. P(il.8) > 0.)P(B. .• .. the relation (AA..n}..) = II P(A. ..Section AA Conditional Probability and Independence 445 If P( A) is positive....) A) ~ ""_ PIA I B )p(BT L.i. . Port.I_1 ..1 ). I. .A. P(B" I Bl.. B ll is written P(A B I •. relation (A.4. ....} such thatj ct {i" . References Gnedenko (1967) Chapter I..i k } of the integers {l. (AA. Sections IA Pittnan (1993) Section lA . ..8) may be written P(A I B) ~ P(A) (AA.S parzen (1960) Chapter 2. . Ifall theP(A i ) are positive..4.lO) is equivalent to requiring that P(A J I A. . n B n ) > O. P(B 1 n·· n B n ) ~ P(B 1 )P(B2 I BJlP(B3 I ill. .3). (AA. An are said to be independent if P(A i .}. n··· nA i . B I . Sections 9 Grimmett and Stirzaker (1992) Section IA Hoel. Simple algebra leads to the multiplication rule. The events All .7) whenever P(B I n . B 2 ) . . BnJl (AA..S) The conditional probability of A given B I defined by ..) ~ P(A J ) (AA..I J (AA.... and (A.lO) for any subset {iI. B n ) and for any events A. I PeA I B.9) In other words. Two events A and B are said to be independent if P(A n B) If P( B) ~ P(A)P(B).
.w n ) E o : Wi = wf}. This makes sense in the compound experiment. .3).0 if and only if WI is the outcome of £1.on. (A5A) i. the Cartesian product Al x . that is. the subsets A ofn to which we can assign probability(1). . it can be uniquely extended to the sigma field A specified in note (l) at the end of this appendix..1 Recall that if AI.o i . ) = P(A. the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second.. I ! i w? 1 .. x fl 2 x '" x fln)P(fl.5 COMPOUND EXPERIMENTS There is an intuitive notion of independent experiments. Loeve.. . The reader not interested in the formalities may skip to Section A. P(fl.. . a compound experiment is one made up of two or more component experiments. . An are events..6 where examples of compound experiments are given. .. .5..wn )}) I: " = PI ({Wi}) . 1974. On the other hand. If we want to make the £i independent..1 . we introduce the notion of a compound experiment... I <i< n. To say that £t has had outcome pound event (in . Pn(A n ). x'" . . . Pn({W n }) foral! Wi E fli. W2 is the outcome of £2 and E Oi corresponds to the Occurrence of the comso on. . . then intuitively we should have alI classes of events AI."" .on. To be able to talk about independence and dependence of experiments.. x flnl n Ifl) x A 2 x ' . . If we are given n experiments (probability models) t\. P. then the sample space . (A5. For example. 446 A Review of Basic Probability Theory Appendix A A.o n = {(WI. the sigma field corresponding to £i. wn ) is a sample point in . .. There are certain natural ways of defining sigma fields and probabilities for these experiments.) .0 1 x··· XOi_I x {wn XOi+1 x··· x. InformaUy.0 ofthe n stage compound experiment is by definition 0 1 x·· .5. . x An) ~ P(A.0 1 X '" x .. and if we do not replace the first chip drawn before the second draw.' . 1 1 . If P is the probability measure defined on the sigma field A of the compound experiment. X A 2 X '" x fl n ). An is by definition {(WI. .An with Ai E Ai. 1977) that if P is defined by (A. P([A 1 x fl2 X .1 X Ai x 0i+ I x . A. .. we should have . More generally.3) for events Al x . x . r. X fl n 1 X An). then the probability of a given chip in the second draw will depend on the outcome of the first dmw.. A2). x An by 1 j P(A.53) holds provided that P({(w"". 1995. ..'" £n and recording all n outcomes.2) 1 i " 1 If we are given probabilities Pi on (fl" AI). x " . We shall speak of independent experiments £1. W n ) : W~ E Ai. . I n  I .. .0) given by . it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip.' . \ I I .. . x flnl n ' . £n with respective sample spaces OJ." X An) = P. The interpretation of the sample space 0 is that (WI. x An of AI.. (A. . The (n stage) compound experiment consists in performing component experiments £1. .. then Ai corresponds to . Chung... on (fl2. independent.. In the discrete case (A. I P n on (fln> An). 1 £n if the n stage compound experiment has its probability structure specified by (A5. if we toss a coin twice. x On in the compound experiment.. . x An.. then (A52) defines P for A) x ".. These will be discussed in this section. (A53) It may be shown (Billingsley. . X . if Ai E Ai. 1 < i < n}.
wq and P( {Wi}) = Pi. which we shall denote by 5 (success) and F (failure). the compound experiment is called n multinomial trials with probabilities PI.. we refer to such an experiment as a multinomial trial with probabilities PI. ill the discrete case.: n )}) for each (WI.6.Section A 6 Bernoulli and Multinomial Trials. .6 Hoel.£"_1 has outcome wnd. SAMPLING WITH AND WITHOUT REPLACEMENT A.6. if an experiment has q possible outcomes WI.6. ... .1 Suppose that we have an experiment with only two possible outcomes.: n ) with Wi E 0 1. Ifweassign P( {5}) = p. By the multiplication rule (A.6 BERNOULLI AND MULTINOMIAL TRIALS.S) .. If we repeat such an experiment n times independently.6.SP({(w] . then (A. i = 1" . then (A. References Grimmett and Stirzaker ( (992) Sections (. J. If the experiment is perfonned n times independently.7) we have. . /I.k)!' The fonnula (A..2) where k(w) is the number of S's appearing in w.3) where n ) ( k = n! kl(n .. and Stone (1971) Section 1. we shall refer to such an experiment as a Bernoulli trial with probability of success p. If Ak is the event [exactly k S's occur]. the following. (A. we say we have performed n Bernoulli trials with success probability p. Port. In the discrete case we know P once we have specified P( {(wI .Pq' If fl is the sample space of this experiment and W E fl. any point wEn is an ndimensional vector of S's and F's and.5. A. A.6. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). . .5. Sampling With and Without Replacement 447 Specifying P when the £1 are dependent is more complicated.u. If o is the sample space of the compound experiment..w n )}) = P(£l has outcome wd P(£2 hasoutcomew21 £1 has outcome WI)'" P(£T' has outcome W I £1 has outcome WI. ..6.·. .. Other examples will appear naturally in what follows... 1.4.'" .4 More generally.. n The probability structure is determined by these conditional probabilities and conversely....5 Parzen (1960) Chapter 3 A. .3) is known as the binomial probability.· IPq. .
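The binomial formula for the number of successes in n Bernoulli trials can be confirmed by direct simulation. The sketch below (Python; the particular choices n = 10, p = 0.3 and the number of repetitions are arbitrary values for this illustration) compares the exact probabilities (n choose k) p^k (1 - p)^(n-k) with the relative frequencies observed over many independent repetitions of the compound experiment.

    import random
    from math import comb

    random.seed(2)
    n, p, reps = 10, 0.3, 100_000
    counts = [0] * (n + 1)
    for _ in range(reps):
        k = sum(random.random() < p for _ in range(n))   # successes in n Bernoulli trials
        counts[k] += 1

    for k in range(n + 1):
        exact = comb(n, k) * p**k * (1 - p) ** (n - k)   # the binomial probability
        print(k, round(exact, 4), round(counts[k] / reps, 4))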
A. A. References Gnedeuko (1967) Chapter 2. = N! (N . When n is finite the tenn. Np). i .0. I A.. . n .7 PROBABILITIES ON EUCLIDEAN SPACE Random experiments whose outcomes are real numbers playa central role in theory and practice. we are sampling with replacement. .1 . Sectiou 11 Hoel.. we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (n.. for any outcome a = (Wil"" 1Wi. Sections 14 Pitman (1993) Section 2.S If we have a finite population of cases = {WI"" WN} and we select cases Wi successively at random n times without replacement. .448 A Review of Basic Probability Theory Appendix A where k~(w) = number of times Wi appears in the sequence w. Port.7 If we perform an experiment given by (. __________________J i . I' . and Stone (1971) Section 2. 10) is known as the hypergeometric probability. . If AkJ.1 .) A = n! k k k !.6.(N(I. The fonnula (A. . with replacement is added to distinguish this situation from that described in (A.6. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space.. then ~ .p)) < k < min( n. . · · • A.. and P(Ak) = 0 otherwise.6. P) independently n times.N (1 . the component experiments are not independent and.6) where the k i are natural numbers adding up to n.1 j .6. . . . 1 .A l P).1O) J for max(O.9) . If Np of the members of n have a "special" characteristic S and N (1 ~ p) have the opposite characteristic F and A k = (exactly k "special" individuals are obtained in the sample). . exactly kqwq's are observed). . PtA ) k =( n ) (Np). exactly k 2 wz's are observed.P))nk k (N)n = (A6..) of the compound experiment. kq!Pt' . ·Pq t (A. i . .. ..6..6.S) as follows. then P( k" .4 Parzen (1960) Chapter 3. . . and the component experiments are independent and P( {a}) = liNn.n)!' If the case drawn is replaced before the next drawing. n P({a}) ~ (N)n where 1 I (A.k.k q is the event (exactly k l WI 's are observed. (N)n .
A. x A.. Any subset of R k we might conceivably be interested in turns out to be a member of f3k. A. Geometrically.bd x '" (ak.I) because the study of this model and that of the model that has = {Xl.7 A continuous probability distn'bution on Rk is a probability P that is defined by the relation P(A) = L p(x)d x =1 (A7. only an Xi can occur as an outcome of the experiment. P defined by A.7..7. It may be shown that a function P so defined satisfies (AlA). Thefrequency function p of a discrete distribution is defined on Rk by n pix) = P({x»). .5) A.).7. dx n • is called a density function.7. ..9) . .1If (al. we shall call the set (aJ. ..S) for some density function P and all events A.bd..(ak.7.Xk) :ai <Xi <bi. Riemann integrals are adequate.3 A discrete (probability) distribution on R k is a probability measure P such that L:~ I P( {Xi}) = 1 for some sequence of points {xd in R k . } of vectors and that satisfies L:~ I P(Xi) = 1 defines a unique discrete probability distribution by the relation P(A) = L x. . JR' where dx denotes dX1 ..3.7. for practical purposes..Section A. . . . An important special case of (A. . (A7A) Conversely. } are equivalent.S) is by definition r JR' 1A(X)P(x)dx where 1A(x) ~ 1 if x E A.Xn .bk ) are k open intervals.. and 0 otherwise. . This definition is consistent with (A. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely.b k ) = {(XI"". any nonnegative function pon R k vanishing except on a sequence {Xl.6 A nonnegative function p on R k • which is integrable and which has r p(x)dx = 1.. Integrals should be interpreted in the sense of Lebesgue.. Recall that the integral on the right of (A.7.EA pix.. However.2 The Borelfield in R k .S) is given by (A.8 are usually called absolutely continuous. 1 <i <k}anopenkrectallgle. A. P(A) is the volume of the "cylinder" with base A and height p(x) at x.. which we denote by f3k. We will write R for R1 and f3 for f31.. X n .7. (A7. where ( )' denotes transpose.7 Probabilities on Euclidean Space 449 We shall use the notation R k of k~dimensional Euclidean space and denote members of Rk by symbols such as x or (Xl. xd'.7.. That is. is defined to be the smallest sigma field having all open k rectangles as members.
450 A Review of Basic Probability Theory Appendix A It turns out that a continuous probability distribution determines the density that generates it "uniquely. For instance. . be thought of as measuring approximately how much more Or less likely we are to obtain an outcome in a neighborhood of XQ then one in a neighborhood of Xl_ A. 5. x. . F is continuous at x if and only if P( {x}) (A.7. then P = Q.f."(l) Although in a continuous model P( {x}) = 0 for every x.:0 and Xl are in R.7. Thus..7.7.xo + h]) '" 2hp(xo) and P([ h Xl 1 + h]) + Xl p(xo) hi) '" ( ).7.5 i • . x n j X =? F(x n ) ~ (A.. 5.7.13HA. . x.7. and h is close to 0.h.1 and 4.2.. We always have F(x)F(xO)(2) =P({x}).4.14) .15) I.12) The dJ. F is a function of a real variable characterized by the following properties: > I .7.) = P( ( 00.. .7 Pitman (1993) Sections 3.11 The distribution function (dJ.7. limx~oo F(x) limx~_oo =1 F(x) = O.16) I It may be shown that any function F satisfying (A.16) defines a unique P on the real line. Xo P([xo . (A.7. (A. 3.13) x <y =? F(x) < F(y) (Monotone) F(x) (Continuous from the right) (A. defines P in the sense that if P and Q are two probabilities with the same d. x (00.h.1. Sections 21. When k = 1. • J .J x . 22 Hoel..1'. POlt.]).lO) The ratio p(xo)jp(xl) can. (A. thus. Sections 14. ! I .x. References Gnedenko (1967) Chapter 4. the density function has an operational interpretation close to thal of the frequency function. 4.) F is defined by F(Xl' .1. r .2 parzen (1960) Chapter 4. if p is a continuous density on R. P Xl (A. then by the mean value theorem paxo .17) = O. and Stone (1971) Sections 3.
the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. The subscript X or X will be used for densities.8. . X(w) E: B} ~ X.8. and so on.or vectorvalued functions of a random vectOr X is central in the theory of probability and of statistics.a RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS Although sample spaces can be very diverse. m > 1. density. we measure the weight of pigs drawn at random from a population.2 A random vector X = (Xl •. the yield per acre of a field of wheat in a given year. Px ) given by n Px(B) = PIX E BI· (A. The study of real.S. Here is the formal definition of such transformations. we will refer to the frequency Junction..(1) = A.1 A random variable X is a function from Oto Rsuch that the set {w: X(w) E B} X. . k. 13k . dj. In the probability model.1 (H) is in A for every B E BkJI) For k = 1 random vectors are just random variables. Forexample. When we are interested in particular random variables or vectors.Section A. such that(2) gl(B) = {y E: Rk : g(y) E: . referring to those features of its probability distribution.8 Random VariOlbles and Vectors: Transformations 451 A.5) and (A. by definition. The probability distribution of a random vector X is. In the discrete case this means we need only know the frequency function and in the continuous case the density.5) p(x)dx . if X is continuous. the concentration of a certain pollutant in the atmosphere. A. the time to breakdown and length of repair time for a randomly chosen machine. and so on to indicate which vector or variable they correspond to unless the reference is clear from the context in which case they will be omitted.Xk)T is ktuple of random variables.8. Similarly. and so on of a random vectOr when we are.8.7.7. The event Xl( B) will usually be written [X E B] and P([X E BJ) will be written PIX E: H]. these quantities will correspond to random variables and vectors.8) PIX E: A] LP(X). or equivalently a function from to Rk such that the set {w . Letg be any function from Rk to Rm. ifXisdiscrete xEA L (A.3) A. Thus.1 (B) is in 0 fnr every BE B. in fact. from (A. The probability of any event that is expressible purely in tenns of X can be calculated if we know only the probability distribution of X.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. dJ.'s. we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. the probability measure Px in the model (R k .
8.7) and (A.7. . Yf is a discrete random vector with frequency function p(X.8)) that X is a marginal density function given by px(x) ~ 1: P(X.jPX a (A.12) by summing or integrating out over yin P(X. . then 1 (I .y)dy.Y) (x. Then g(X) is continuous with density given by . y).(5) (A. Pg(X) (I) = j. Then the random tran~form(lti(m g( X) is defined by g(X)(w) = g(X(w)).y)(x.)'. a random vector obtained by putting two random vectors together.11) Similarly. V).452 A Review of BClsic ProbClbility Theory Appendix A [J} E BI. and X is continuous. and 0 otherwise. .S.Y). if (X. This is called the change of variable formula.8.12) 1 .1 E~' l(X i .92/ with 91 (X) = k. Another common example is g(X) = (min{X. a # 0. y (A. j . then the frequency function of X.y)(x.). max{X.8) it follows that if (X. If g(X) ~ aX + 1'.Y).1 1 Xi = X and 92(X) = k. (A. Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa. .X)2.6) An example of a transformation often used in statistics is g (91.: for every B E Bill.7) If X is discrete with frequency function Px.1') . (A. These notions generalize to the case Z = (X. The (marginal) frequency or density of X is found as in (A.y).8. known as the marginal frequency function. it may be shown (as a consequence of (A.8. assume that the derivative l of 9 exists and does not vanish on S.S.8. .11) and (A.1O) From (A. .8. • II ! I • .8.8. i . The probability distribution of g(X) is completely detennined by that of X through L:: P[g(X) E BI = PIX E gI(B)].8.8) Suppose that X is continuous with density PX and 9 is realvalued and onetoone(3) on an open set S such that P[X E 5] = 1. Furthennore. is given by(4) i I ( PX(X) = LP(X. then g(X) is discrete and has frequency function Pg(X)(t) = L {x:g(x)=t} Px(x). ! for Pg(x)(I) ~ PX(gl(t)) Ig'(g 1(1))1 (A.9) t E g(S). (A. y)T is continuous with density p(X. .
. X 2 ) and (Yt> Y2 ) are independent.9 Independence of Random Variables and Vectors 453 In practice. .S.9..9. (X"XIX.9.7 The preceding equivalences are valid for random vectors XlJ . . so are Xl + X2 and YI Y2 ..2 The random variables X I.1. all random variables are discrete because there is no instrument that can measure with perfect accuracy. .4 Parzen (1960) Chapter 7. . 9 Pitman (1993) Section 4.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS A.Section A. For example.6.. Nevertheless. S... . ..9. X n with X (XI>"" Xn)' . the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A. References Gnedenko (1967) Chapter 4. and Stone (1971) Sections 3. E BI are independent. the events [X I E All. the events [XI E AI and IX. " X n ) is either a discrete or continuous random vector.[Xn E An] are independent.An in S. A.9..3. To generalize these definitions to random vectors Xl. 5. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else. Sections 2124 Grimmett and Stirzaker (1992) Section 4. Theorem A. either ofthe (A.4) (A.5) = following two conditions hold: A. then X = (Xl.7. 6. if (Xl.Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ail where Ai is a set in the range of Xi' A.1.7). 1 X n are independent if. The justification for this may be theoretical or pragmatic.) and Y" and so on. Sections 15. Port.7 Hoel.. .9. ..S.6IftheXi are all continuous and independent.. . A. and only if. Then the random variables Xl.13 A convention: We shan write X = Y if the probability of the event IX fe YI is O.. ' . Suppose X (Xl. A.. which may be easier to deal with.1 Two random variables X I and X 2 are said to be independent if and only if for sets A and B in E.3 By (A. whatever be g and h. " X n ) is continuous.9.9. if X and Yare independent.Xn are said to be (mutually) independent if and only if for any sets AI. it is common in statistics to work with continuous distributions. .2.. so are g(X) and h(Y).4 A..l5 and B.
I l I . In line with these ideas we develop the general concept of expectation as follows.. then the Xi form a sample from a distribution that assigns probability p to 1 and (1.I0.4 Par""n (1960) Chapter 7. px(i) = i(i+1)' i= 1.454 A Review of Basic Probability Theory Appendix A A.9. . 4... . . by 00 1 i .EA XiPX (Xi) < . Take • ..S If Xl . we define the expectation or mean of X. . If XI . Port. X l l is called a random sample (l size n from a population with dJ... 24 Grimmett and Stirzaker (1992) Sections 3. X n are independent identically distributed kdimensional random vectors with dJ. i .. by 1 ifw E A o otherwise.p) to 0.. The same quantity arises (approximately) if we use the longrun frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question.. the indicator of the event A.3 . Fx or density (frequency function) Px.. . it follows that this average is given by 'L. . (A9. then Xl .~ ( . Sections 6. I i I I . } into two sets A and B where A consists of aU nonnegative Xi and B of all negative Xi· If either 'L.j j .2. 1 Xi =i. Sections 23.9) If we perfonn n Bernoulli trials with probability of success p and we let Xi be the indicator of the event (success on the ith trial).X.~ 1 i E(X) = LXiPX(Xi).2..I) . . X2. written E(X).• }. such a random sample is often obtained by selecting n members at random in the sense of (A.I0 THE EXPECTATION OF A RANDOM VARIABLE r: r I Let X be the height of an individual sampled at random from a finite population.2 More generally.4) from a population and measuring k characteristics on each member.Ll XiP[X = Xi] where P[X = Xi] is just the proportion of individuals of height Xi in the population. if X is discrete.. If X is a nonnegative. F x or density (frequency function) PX.3. . . "" 1 X q are the only heights present in the population. 5.2 Hoel. . If A is any event. and Stone (1971) Section 3. decompose {Xl l X2. discrete random variable with possible values {Xl. 7 Pitman (1993) Sections 2.' (Infinity is a possible value of E(X).5. we define the random variable 1 A.W..I . Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population. A. Such samples win be referred to as the indicators of n Bernoulli tn"als with probability ofsuccess p" References Gnedenko (1967) Chapter 4. i==l (A. I..) A. . In statistics..
)Px (.lO.V.. if 9 is a realvalued function on Rn.7) if Qt. If X is a continuous random variable.p.c.9)). ."_"_b_. X(t.ES( x.tO A random variable X is said to be integrable if E(IXI) < 00. ."5'. . (A. we have 00 E(IXI) ~ L i=l IXiIPx(Xi). E(Y) are defined.. E(X) is left undefined. If X is a constant. lOA) If X is an ndimensional random vector."_0_"_A".J:.3) = IA (cf (A. I 0. 10.IO. i = 1. 10. and if E(lg(X) I) < 00.'"":. Otherwise.) < 00. Jo= xpx(x)dx or A. Here are some properties of the expectation that hold when X is discrete. . we leave E(X) undefined.6) Taking g(x) = L:~ 1 O:'iXi we obtain the fundamental relationship (A. (A.IO. then E(X) < E(Y). 10.9) as the definition of the expectation or mean of X f~oo xpx(x)dx is finite. an are constants and E(!Xil) < 00.'_""o"_of_'""R_'_"d"o:.. 10.9. Otherwise. Those familiar with Lebesgue integration will realize that this leads to E(X) = 1: xpx(x)dx whenever (A. then it may be shown that 00 E(g(X)) = Lg(Xi)PX(Xi) i=l (A.. If X (A. we define E(X) unambiguously by (A. then E(X) = c. it is natural to attempt a definition of the expectation via approximation from the discrete case.rn".l"O_T_h''CEx. .v') = c for all w. 00 455 or L.. ... It may be shown that if X is a continuous k~dimensional random vector and g(X) is any random variable such that JR' { Ig(x)IPx(x)dx < 00.I). A.5) As a consequence of this result. then E(X) = P(A). .7) it follows that if X < Y and E(X).8From (A. n.. 10.
B) g(x)p(x)dx.5) and (A.456 then E(g(X» exists and E(g(X» A Review of BaSIC Probability Theory Appendix A = 1. By (A. A... Chapter S. 3A. II. The fonnulae (A 10.IRk g(x)dP(x) (A.7).. 'I I' I..1 I 1 . • g(x)dP(x) = L g(Xi)p(Xi)..3 Hoel. I0.6) hold.5) and (A. j A. 7. the kth moment of X is defined to be the expectation of X k. In the discrete case J1 assigns weight one to each of the points in {x : p(x) > O} and it is called counting measure... 10. g(X)Px(x)dx.5).! If k is any natural number and X is a random variable.. .5) and (A 10. j i . j . 10.1 parzen (1960) Chapter 5..3.!I) In the continuous case expectation properties (A.IO. 1 .. (AIO. A convenient notation is dP(x) = p(x)dl'(x). 4.2) xkpx(x)dx if X is continuous. We assume that all moments written here exist..12) where F denotes the distribution function of X and P is the probability function of X defined by (AS. POrl.H.. Section 26 Grimmett and Stirzaker (1992) Sections 3.ll MOMENTS 1 . continuous case.. We will often refer to p(x) as the density of X in the discrete case as well as the continuous case. lOA). . It is possible to define the expectation of a random variable in general using discrete approximations. In the continuous case dJ1(x) = dx and M(X) is called Lebesgue measure.. I . and (AIO. (A.11). 10. Sections 14 Pitman (1993) Sections 3.1.3. discrete case f: f References i~l (AIO.. (A.' • 1: x I (A I 1.. I I) are both sometimes written as E(g(X)) ~ r ink = g(x)dF(x) or r . I: We refer to J1 = IlP as the dominating measure for P.S) as well as continuous analogues of (A. . 4.. and Stone (1971) Sections 4. The interested reader may consult an advanced text such as Chung (1974).3).lO. E(X k ) j = Lxkpx(x) if X is discrete . which means J . I In general. I0. Chapter 3. It I! II . the moments depend on the distribution of X only.. Chung (1974) Chapter 3 Gnedenko (1967) Chapter 5.
9 If E(X 2 ) = 0.ll Moments 457 A. 0 2).n. See also Section A. then the coefficient of skewness and the kurtosis of Y = 12 = O. If XI and X 2 are random variables and i. The variance of X is finite if and only if the second moment of X is finite (cf. then the product moment of order (i.4 The kth central moment of X (X . (AIl.j) of X.ll. The central product moment of order (i.1l. then X = O.7) and (All.1 and the kunosis ')'2.15. the kth moment of A. if the random variable possesses a moment generating function (cf. the stan E(Z) = 0 and Var Z = 1.n. then 11 A.7 If X is any random variable with welldefined (finite) mean and variance. and is denoted by I'k. is by definition E[(X . These results follow. A.Section A. This is the case. The central product moment of order (1.15». from (A. then by (A. are the same as those of X.2).7) Var(aX + b) = a2 Var X. It is also called a measure of scale. by definition.6) it follows then that A.S The second central moment is called the variance of X and will be written Var X. . (A.l and.E(X)J). and X 2 is again by definition E[(X.n.ll. which is often referred to as the mean deviation..E(X))/vVar x.IO The third and fourth central moment. E(XiX~). These descriptive measures are useful in comparing the shapes of various frequently used densities.E(XIl)i(X.12 It is possible to generalize the notion of moments to random vectors. Another measure of the same type is E(jX . A.E(X)). If X ~ N(/l. If a and b are constants.H.U. . for instance. j are natural numbers.) dardized version or Zscore of X is the random variable Z ~ (X .2 are expressed in tenns of cumulants.12 where. 10. (AI2. for example. X = E(X) (a constant). By (A 10.6) (One side of the equation exists if and only if the other does.1». 1) is . (All.E(X 2 ))i]. are used in the coefficient of skewness .E(X))k].II. The nonnegative square root of Var X is called the standard deviation of X. For sim plicity we consider the case k = 2.j) of Xl and X 2 is. The standard deviation measures the spread of the distribution of X about its expectation.S) A. A. If Var X = 0.n If Y = a + bX with b > 0.3 The distribution of a random variable is typically uniquely specified by its moments. which are defined by where 0'2 = Var X.
\:2 and is written Cov(.E(X I ). 0 # 0) of XI' l il . Cov(aX I + oX" eX. This is the correlation inequality. Z.).) + od Cov(X"X. + dX.E( X. such that E(Zf) < 00.IIl4) If Xi and Xf are distributed as Xl and X 2 and are independent of Xl and X 2.I 1.X. then Cov(XI.\ 1. E(Zil < 00. denoted by Corr(XI1 X2). The correlation of Xl and X 2. X.E(X J!)( X. .14). J I • for any two random variables Z" Z.. .E(X.lI. (2) (XI .)(X. we obtain the relations. •  II .. By expanding the product (XI . . 7).4.). .E(X.E(X I ») = CO~(X~X.1.[E(X)j2 (A.458 A Review of Basic Prob<lbility Theory Appendix A called the cOl'Orionce of X 1 and. X. is defined whenever Xl and X 2 are not constant and the variances of Xl and X 2 are finite by (A.3) and (AID.) (X. ar '2 . (AI 117) j r .) (A. X 3 ) + be Cov(X" X 3 ) and + ad Cov(X I . I 1 1 j I I .X. Equality holds if and only if X. It may be obtained from the CauchySchwartz inequality. . A proof of the CauchySchwartz inequality is given in Remark 1.) = lfwe put XI ~ ~E(XI .). with equality holding if and only if (1) Xl or X 2 is a constant Or ! • • .X 2).. = a + oXI . =X in (A. we get the formula Var X = E(X') .X. = X.I6) . • i . I liS) The covariance is defined whenever Xl and X 2 have finite variances and in that case (AII.) = ac Cov(X I . is linear function (X.I US) The correlation of Xl and X 2 is the covariance of the standardized versions of Xl and X 2· The correlation inequality is equivalent to the statement (A.11l3) (A.19) 1 • t . .)) and using (AI 0. i . o. The correlation inequality correspcnds to the special case ZI = Xl . Equality holds if and only if one of Zll Z2 equals 0 or 2 1 = aZ2 for some constant a.
CoV(X1. we obtain as a consequence of (A. + X n ) = L '1=1 n Var X. and Stone (1971) Sections 4. XII have finite variances.X2 ) = Corr(X1.l1. (A. lsi < So Mx(s) LeSX'PX(Xi) if X is discrete 1: Mx(s) i~1 (A.II).22) This may be checked directly. Sections 14 Pitman (1993) Section 6. then Var(X 1 References Gnedenko (1967) Chapter 5. are uncorrelated) need be independent.ID.II.. It is not trUe in general that X I and X 2 that satisfy (AIl.. 12.12 Moment and Cumulant Generating Functions 459 If Xl. The correlation coefficient roughly measures the amount and sign of linear relationship between Xl and X 2. 11. respectively). + X n ) = L Var Xi· i=I n (A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS A.23) A. =L = E(X k ) lsi < so· (A.e.2) eSXpx(x)dx if X is continuous.22) (i.. Mx(s) ~ E(e'X) is well defined for and is called the moment generating function of X. i ~ 1.3 Parzen (1960) Chapter 5.22) and (AI 1..X.bX 1.12.llf E(c'nIXI) < 00 for some So > 0. By (A. all moments of X k! sk.Xn are independent with finite variances.5) and (A.21 ) or in view of(All. It is 1 Or 1 in the case of perfect relationship (X2 = a l. Sections 27.II..l2. As a consequence of (AI 1.20) If Xl and X 2 are independent and X I and X 2 are integrable. are If M x is well defined in a neighhorhood finite and {s . b < orb> 0.J4).3) k=O .. I 1.Section A.5.20). 7. 10.24. lsi < so} of zero.2. .. 28.. Xl)· (A.4 ° + . Port.4.13) the relation Var(X 1 + .) > 0. Chapter 8. 30 Hoel.. See also Section 1. we see that if Xl. +2L '<7 COV(X.'" . then (A.) ~ o when Var(X.
3. Chapter 8. see Section B. A/x determines the distribution of X uniquely and is itself uniquely determined by the distribution of X. j A.Kx(s)I.(X) + cJ(Y).l1.12. I I i I A.~o sJ dj (A. Port. If lv/x is well defined in some neighborhood of zero. 12.7) is called the cumulant generating function of X. then Cj(X + Y) = c.6) This follows by induction from the definition and (A.t4 .12. + XI! has moment generating function given by M(x. Section 8.9) l 1 " I J . j > L For j > 2 and any constant a.12. In this section we introduce certain families of b •. and Stone (1971) Chapter 8.t of X. Sections 23 Rao (1973) Section 2bA j i i I . . If X and Yare independent. c.12.8) J.'(8) ds~ = E(X').3J1~. If X is nonnally distributed.. .460 A Review of Basic: Probability Theory Appendix A A. • 111.+. 11. . SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS I f • .1 parzen (1960) Chapter 5.10» can be written as 1'1 = Problem 8. Ii .'(".13 .1x I ' . The function Kx(s) = log M x (s) (A..2l). the probability distribution of a random variable or vector is just a probability measure on a suitable Euclidean space. For a generalization of the notion afmoment generating function to random vectors. • By definition. Section 3. ..8. The coefficients of skewness and kurtosis (see (A. .5. References Hoel. ! . and C4 = J.5 If defined. The first cumulant Cl is the mean J. ~ Cj(X) ~ d .12.. C2 and C3 equal the second and third central moments f12 and f13 of X.Us hils derivatives of all orders at :. XII are independent random variables with moment generating functions 11. = 0 and dk :\1. Cj = 0 for j > 3.(s). c3/ci and 1'2 = C4/C~.. is called the jth cumulant of X. Cj(X + a) = Cj(X). See • 1 ~ .+x"I(S) = II Mx. i=l n (A. ..4 The moment generuting function . . K x can be represented by the convergent Taylor expansion Kx(s) = where L j=O 00 Cisj (A. then Xl + . If XI .
.. k=O.D nondefectives. X has a B(n. n) distribution (see (A6.D» < k < min(n. + X 2 + '" + X.3) Higherorder moments may be computed from the moment generating function Mx(t) = [Oe' + (1 .(N . A.. This result may be derived by using (A. respectively. 0). p(k) = (A. N. Discrete Distributions The binomial distribution with parameters nand B : B( n. then X has a B( n. The parameters D and n may be any natural nwnbers that are less than or equal to the natural number N. If the sample is taken with replacement. If anywhere below p is not specified explicitly for some value of x it shall be assumed that p vanishes at that point.13. + n"O) distribution. A. .OW. Following the name of each distribution we give a shorthand notation that will sometimes be used as will obvious abbreviations such as "binomial (n. . + . DIN) distribution. The symbol p as usual stands for a frequency or density function. (A I 3.5 If Xl.Section A.6) for k a natural number with max(O.1 0». B(n. 0). then X has an H( D. p(k)~ (~)0'(10)"k.2 If X is the total number of successes obtained in n Bernoulli trials with probability of success 0.12. .n . ..13.13. X 2 . Similarly.12.l3. .Xk are independent random variables distributed as B(n l1 8). Var X = nO(1  0)... Or. B(n2.D). B)" for "the binomial distribution with pmameter (n . . (Al3. I. 1]. it is assumed to be zero to the "left" ofthe set and one to the "right" ofthe set. if the value of the distribution function F is not specified outside some set.6.4) .1) > 0 whereas 0 may be any number in [0.5) and (A. N.4). 0) distribution..3». N. The parameter n can be any integer (A. B). and n : 1t(D.n. which arise frequently in probability and statistics. .l3. then X.l. If X has a B( n. and list some of their properties.71f X is the number of defectives (special objects) in a sample of size n taken without replacement from a population with D defectives and N .n).13. The hypergeometric distribution with parameters D. has a B(n. 0) distribution (see (A. then E(X) ~ nO.6) in conjunction with (A.13 Some Classical Discrete and Continuous Distributions 461 distributions. A.
~ • (AI 1. respectively. i of:. (A.7) by writing X = 'L.'l). Bq : M(n. e = {(O" . . Oil ..'" Oq) is any vector in n! k k (A.j l i.Bq) distribu ! i : E(Xi ) Cov(Xi.20). .i3. The parameter>.O. then . can be any positive number..13.462 A Review of Basic Probability Theory Appendix A If X has an H(D. + An) distribution. .•• . tion (see (A. 0 ' . n) distribution..l3. Oq' whenever k i are nonnegative integers such that natural number while (Ol.. Oq) distribution.. . 13.6)).. + X 2 + +Xn has the P(AI + A2 + . This result may be derived in th same manner as the corresponding fact for the binomial distribution.l3) The par~meter n is any 'L.7). • • I • E(X) = Var X ~ A.N N 1' (A. k ( k) ]0 = e'A k! (A 13.Oq): 0. q.kq) = k.1O) f The moment generating function of X is given by Mx(t) ~ e.. t=l = (Xl. j = 1..l . If X has a peA) distribution.8) Formulae (AI3.14 If X 0.6.0. then X.8) may be obtained directly from the definition (A 13. (A. 01 "" ..IS) . then D D ( D) Nn E(X) = n N' Var X = n N 1.1 < < q. Oq).. > A.. " X n are independent random variables with P(A').1I) A. and then applying formulae (A. The multinomial distribution with parameters n. An easier way is to use the interpretation (A.) n(JiOj. . where Xi multinomial trials with probabilities (0 1 ."" P(A n ) distributions. .. .(.9) for k = 0. P(A2). . The Poisson distribution with parameter A: peA). If X has a M( n... to..6).71Ij where I j = 1 if the jth object sampled is defective and 0 otherwise. ]O(k" . " f" is the number of times outcome Wi occurs in n . . and .. . . ~ I}. (A. then X has a M(n. 2.1