Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Bickel, Peter J.
Mathematical statistics : basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.
QA276.B47 2001
519.5 dc21    00-031377
Acquisition Editor: Kathleen Boothby Sestak Editor in Chief: Sally Yagan Assistant Vice President of Production and Manufacturing: David W. Riccardi Executive Managing Editor: Kathleen Schiaparelli Senior Managing Editor: Linda Mihatov Behrens Production Editor: Bob Walters Manufacturing Buyer: Alan Fischer Manufacturing Manager: Trudy Pisciotti Marketing Manager: Angela Battle Marketing Assistant: Vince Jansen Director of Marketing: John Tweeddale Editorial Assistant: Joanne Wendelken Art Director: Jayne Conte Cover Design: Jayne Conte
© 2001, 1977 by Prentice-Hall, Inc.
Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
." i
~I
"I
~ !
,'
,
I,
,.~~
_.

..
CONTENTS
PREFACE TO THE SECOND EDITION: VOLUME I
PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution - Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The χ², F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: O_p and o_p Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I The Standard Normal Distribution
  Table I' Auxiliary Table of the Standard Normal Distribution
  Table II t Distribution Critical Values
  Table III χ² Distribution Critical Values
  Table IV F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable. The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters.
(2) The construction of models for time series, temporal-spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.
(3) The use of methods of inference involving simulation as a key element, such as the bootstrap and Markov Chain Monte Carlo.
which we present in 2000. been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. Others do not and some though theoretically attractive cannot be implemented in a human lifetime. Our one long book has grown to two volumes. and functionvalued statistics. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field. such as the empirical distribution function. Volume [ covers the malerial of Chapters 16 and Chapter 10 of the first edition with pieces of Chapters 710 and includes Appendix A on basic probability theory. weak convergence in Euclidean spaces. instead of beginning with parametrized models we include from the start non. As in the first edition.and semipararnetric models. However. then go to parameters and parametric models stressing the role of identifiability. covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing concluSIons. we assume either a discrete probability whose support does not depend On the parameter set.xiv Preface to the Second Edition: Volume I (4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious. and probability inequalities as well as more advanced topics in matrix theory and analysis. or the absolutely continuous case with a density. There have. These will not be dealt with in OUr work. problems. (6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories. Volume I. Hilbert space theory is not needed. each to be only a little shorter than the first edition. which includes more advanced topics from probability theory such as the multivariate Gaussian distribution. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on Rd as well as an elementary introduction to Hilbert space theory. Appendix B is as selfcontained as possible with proofs of mOst statements. Specifically. As a consequence our second edition. such as the density. We . our focus and order of presentation have changed. we do not require measure theory but assume from the start that our models are what we call "regular. Despite advances in computing speed. and references to the literature for proofs of the deepest results such as the spectral theorem. reflecting what we now teach our graduate students. of course. From the beginning we stress functionvalued parameters. is much changed from the first. (5) The study of the interplay between numerical and statistical considerations. Chapter 1 now has become part of a larger Appendix B. some methods run quickly in real time. In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However." That is. but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.
such as regression experiments. Save for these changes of emphasis the other major new elements of Chapter 1. Almost all the previous ones have been kept with an approximately equal number of new ones addedto correspond to our new topics and point of view. Finaliy. The conventions established on footnotes and notation in the first edition remain. including some optimality theory for estimation as well and elementary robustness considerations. Chapter 2 of this edition parallels Chapter 3 of the first artd deals with estimation. Nevertheless. the Wald and Rao statistics and associated confidence regions. including a complete study of MLEs in canonical kparameter exponential families. As in the first edition problems playa critical role by elucidating and often substantially expanding the text. Included are asymptotic normality of maximum likelihood estimates. which parallels Chapter 2 of the first edition. Chapters 14 develop the basic principles and examples of statistics. There is more material on Bayesian models and analysis. Other novel features of this chapter include a detailed analysis including proofs of convergence of a standard but slow algorithm for computing MLEs in muitiparameter exponential families and ail introduction to the EM algorithm. Although we believe the material of Chapters 5 and 6 has now become fundamental. These objects that are the building blocks of most modem models require concepts involving moments of random vectors and convexity that are given in Appendix B. Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions. One of the main ingredients of most modem algorithms for inference. include examples that are important in applications. there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred .Preface to the Second Edition: Volume I xv also. Wilks theorem on the asymptotic distribution of the likelihood ratio test. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. inference in the general linear model. and some parailels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Robustness from an asymptotic theory point of view appears also. we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. if somewhat augmented. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the mOre advanced topics of Volume II. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage. Chapter 6 is devoted to inference in multivariate (multiparameter) models. are an extended discussion of prediction and an expanded introduction to kparameter exponential families. Chapter 5 of the new edition is devoted to asymptotic approximations. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs). from the start. Generalized linear models are introduced as examples. Also new is a section relating Bayesian and frequentist inference via the Bemsteinvon Mises theorem.
Michael Ostland and Simon Cawley for producing the graphs. With the tools and concepts developed in this second volume students will be ready for advanced research in modem statistics. elementary empirical process theory.3 ~ 6. encouragement. will be studied in the context of nonparametric function estimation. Nancy Kramer Bickel and Joan H. convergence for random processes. and our families for support. and the functional delta method. Ying Qing Chen. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance.berkeley. other transformation models. Jianhna Hnang. particnlarly Jianging Fan. greatly extending the material in Chapter 8 of the first edition. are weak.4 ~ 6.6 Volume II is expected to be forthcoming in 2003. Michael Jordan. in part in the text and. Examples of application such as the Cox model in survival analysis. Yoram Gat for proofreading that found not only typos but serious errors. The topic presently in Chapter 8. taken on a new life. and the classical nonparametric k sample and independence problems will be included. The basic asymptotic tools that will be developed or presented. We also expect to discuss classification and model selection using the elementary theory of empirical processes. and active participation in an enterprise that at times seemed endless. and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley.5 I. density estimation.XVI • Pref3ce to the Second Edition: Volume I sections that follow. Semiparametric estimation and testing will be considered more generally. 6. • with the field. Fujimura. j j Peter J. and Prentice Hall for generous production support.2 ~ 6. 5.3 ~ 6. in part in appendices. appeared gratifyingly ended in 1976 but has.4.edn . We also thank Faye Yeager for typing.berkeley.2 ~ 5. Last and most important we would like to thank our wives. Bickel bickel@stat. For the first volume of the second edition we would like to add thanks to new colleagnes. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo.4.edn Kjell Doksnm doksnm@stat.
These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior. (2) Give careful proofs of the major "elementary" results such as the NeymanPearson lemma. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig. Hoel. and the structure of both Bayes and admissible solutions in decision theory. Finally. we feel that none has quite the mix of coverage and depth desirable at this level. the physical sciences. Although there are several good books available for tbis purpose. Introduction to Mathematical Statistics. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). we select topics from xvii . for instance. we need probability theory and expect readers to have had a course at the level of. We feel such an introduction should at least do the following: (1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.. 3rd 00. In the twoquarter courses for graduate students in mathematics. PREFACE TO THE FIRST EDITION This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. The work of Rao. and engineering that we have taught we cover the core Chapters 2 to 7. and the GaussMarkoff theorem. the information inequality. and nonparametric models. 2nd ed. which go from modeling through estimation and testing to linear models. Port. Be cause the book is an introduction to statistics.. Our book contains more material than can be covered in tw~ qp. (3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates. the Lehmann5cheffe theorem. Our appendix does give all the probability that is needed.arters. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated. covers most of the material we do and much more but at a more abstract level employing measure theory. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. multinomial models. and Stone's Introduction to Probability Theory. the treatment is abridged with few proofs and no examples or problems. Linear Statistical Inference and Its Applications. (4) Show how the ideas aod results apply in a variety of important subfields such as Gaussian linear mOdels. statistics. However.
They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. S. Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each Chapter preceding the problem section. R. They need to be read only as the reader's curiosity is piqued. I for the first. Gray. We would like to acknowledge our indebtedness to colleagues.5 was discovered by F. preliminary edition. G. Scholz. . Drew. Without Winston Chow's lovely plots Section 9. enthusiastic. U. Chen. C. in proofreading the final version. J. final draft) through which this book passed. Chou. and stimUlating lectures of Joe Hodges and Chuck Bell. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers. Lehmann's wise advice has played a decisive role at many points. Quang.xviii Preface to the First Edition Chapter 8 on discrete data and Chapter 9 on nonpararnetric models. and moments is established in the appendix. (iii) Basic notation for probabilistic objects such as random variables and vectors. (i) Various notational conventions and abbreviations are used in the text. The comments contain digressions. and so On. or it may be included at the end of an introductory probability course that precedes the statistics course.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day. Later we were both very much influenced by Erich Lehmann whose ideas are strongly rellected in this hook. i . P. We would also like tn thank tlle colleagues and friends who Inspired and helped us to enter the field of statistics. I . Minassian who sent us an exhaustive and helpful listing. densities. Among many others who helped in the same way we would like to mention C. Bickel Kjell Doksum Berkeley /976 : . respectively. Chapter 1 covers probability theory rather than statistics. 2 for the second. These comments are ordered by the section to which they pertain. It may be integrated with the material of Chapters 27 as the course proceeds rather than being given at the start.2. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text. reservations. X. L. The foundation of oUr statistical knowledge was obtained in the lucid. and friends who helped us during the various stageS (notes. and A Samulon. caught mOre mistakes than both authors together. students. A special feature of the book is its many problems. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. E. A serious error in Problem 2. Cannichael. Pyke's careful reading of a nexttofinal version caught a number of infelicities of style and content Many careless mistakes and typographical errors in an earlier version were caught by D. distribution functions. Peler J. Gupta. W. and additional references.
Mathematical Statistics Basic Ideas and Selected Topics Volume I Second Edition .
(2) Matrices of scalars and/or characters. (4) All of the above and more. digitized pictures or more routinely measurements of covariates and response on a set of n individualssee Example 1.1 and 6. and/or characters. A detailed discussion of the appropriateness of the models we shall discuss in particUlar situations is beyond the scope of this book. as usual. Subject matter specialists usually have to be principal guides in model formulation. Chapter 1. In any case all our models are generic and. The goals of science and society.1.1 1. and so on. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically. for example.5.3). AND PERFORMANCE CRITERIA 1.1. (3) Arrays of scalars and/or characters as in contingency tablessee Chapter 6or more generally multifactor IDultiresponse data on a number of individuals. in particUlar. scientific or industrial. are to draw useful information from data using everything that we know. trees as in evolutionary phylogenies. Moreover. a single time series of measurements. functions as in signal processing. we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. GOALS. A 1 .1. MODELS. Data can consist of: (1) Vectors of scalars. which statisticians share.Chapter 1 STATISTICAL MODELS. measurements. A generic source of trouble often called grf?ss errors is discussed in greater detail in the section on robustness (Section 3.4 and Sections 2. ''The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. produce data whose analysis is the ultimate object of the endeavor. for example.2.1 DATA. but we will introduce general model diagnostic tools in Volume 2. PARAMETERS AND STATISTICS Data and Models Most studies and experiments. large scale or small.
So to get infonnation about a sample of n is drawn without replacement and inspected. e.3 and continue with optimality principles in Chapters 3 and 4. in particular.. and B to n. "Models of course. plus some random errors. Goals. randomly selected patients and then measure temperature and blood pressure. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. For instance. The data gathered are the number of defectives found in the sample. learning a maze. (b) We want to study how a physical or economic feature. give methods that assess the generalizability of experimental results. are never true but fortunately it is only necessary that they be usefuL" In this book we will study how. (4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. testing. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population. (c) An experimenter makes n independent detenninations of the value of a physical constant p. for example. Random variability " : . This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. for modeling purposes. We begin this discussion with decision theory in Section 1. (3) We can assess the effectiveness of the methods we propose. and so on. Hierarchies of models are discussed throughout. and Performance Criteria Chapter 1 priori. (d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee. A to m. (2) We can derive methods of extracting useful information from data and. and more general procedures will be discussed in Chapters 24. treating a disease. height or income. have the patients rated qualitatively for improvement by physicians. for instance. The population is so large that. producing energy. robustness.21. His or her measurements are subject to random fluctuations (error) and the data can be thought of as p. It is too expensive to examine all of the items. (5) We can be guided to alternative or more general descriptions that might fit better.. we approximate the actual process of sampling without replacement by sampling with replacement. and diagnostics are discussed in Volume 2. Goodness of fit tests. we can assign two drugs. to what extent can we expect the same effect more generally? Estimation. Here are some examples: (a) We are faced with a population of N elements. Chapter I. We begin this in the simple examples that follow and continue in Sections 1. a shipment of manufactured items. An unknown number Ne of these elements are defective.5 and throughout the book. starting with tentative models: (I) We can conceptualize the data structure and our goals more precisely. reducing pollution. confidence regions. . if we observe an effect in our data.! I 2 Statistical Models. In this manner. is distributed in a large population. For instance. and so on. in the words of George Box (1979). we obtain one or more quantitative or qualitative measures of efficacy from each experiment.
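To make example (a) concrete, here is a minimal simulation sketch in Python (our illustration; it assumes NumPy is available and uses arbitrary values N = 1000, a 10% defective rate, and sample size n = 50). It draws a sample of items without replacement and records the number of defectives found, the quantity the text goes on to model with the hypergeometric distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    N = 1000            # shipment (population) size -- illustrative value
    n_defective = 100   # number of defective items, i.e. N*theta with theta = 0.1
    n = 50              # sample size drawn without replacement

    shipment = np.array([1] * n_defective + [0] * (N - n_defective))  # 1 = defective

    # One inspection: draw n items without replacement and count defectives.
    x = rng.choice(shipment, size=n, replace=False).sum()
    print("defectives found in the sample:", x)

    # Repeating the inspection shows the hypergeometric variability of X.
    repeats = np.array([rng.choice(shipment, size=n, replace=False).sum()
                        for _ in range(10000)])
    print("simulated mean of X:", repeats.mean(), " n*theta =", n * n_defective / N)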
. which we refer to as: Example 1."" €n are independent..t + (i.".. X has an hypergeometric. That is. in principle.2. The main difference that our model exhibits from the usual probability model is that NO is unknown and. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. We often refer to such X I. identically distributed (i.l. .1. can take on any value between and N. N. 1 <i < n (1. .. 1 €n) T is the vector of random errors... . which are modeled as realizations of X I.1. we observe XI.. X n are ij. OneSample Models.8).. If N8 is the number of defective items in the population sampled.. N. that depends on how the experiment is carried out. completely specifies the joint distribution of Xl. .. k ~ 0..Section 1.l) if max(n .d. A random experiment has been perfonned..) random variables with common unknown distribution function F." The model is fully described by the set :F of distributions that we specify. .xn . ..0) < k < min(N8. n). X n ? Of course. we postulate (l) The value of the error committed on one determination does not affect the value of the error at other times. Sampling Inspection. X n as a random sample from F. Fonnally. so that sampling with replacement replaces sampling without. Thus.. . First consider situation (a). Given the description in (c). Sample from a Population. . So.Xn independent.1. if the measurements are scalar. We shall use these examples to arrive at out formulation of statistical models and to indicate some of the difficulties of constructing such models. 0 ° Example 1. On this space we can define a random variable X given by X(k) ~ k. n corresponding to the number of defective items found.. we cannot specify the probability structure completely but rather only give a family {H(N8. . 1. n. . . n)} of probability distributions for X. as X with X . although the sample space is well defined..i. where .. Models.. which together with J1. . H(N8. .2) where € = (€ I . then by (AI3. The same model also arises naturally in situation (c). and also write that Xl.6) (J. anyone of which could have generated the data actually observed../ ' stands for "is distributed as.. The mathematical model suggested by the description is well defined. I.1 Data. and Statistics 3 here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs. .. n) distribution.N(I . What should we assume about the distribution of €. F.. Here we can write the n determinations of p. as Xi = I.. .1. €I.. It can also be thought of as a limiting case in which N = 00.. . Parameters.d. The sample space consists of the numbers 0.
(72) population or equivalently F = {tP (':Ji:) : Jl E R. if we let G be the distribution function of f 1 and F that of Xl. There are absolute bounds on most quantitiesIOO ft high men are impossible. . 0'2). if drug A is a standard or placebo. . or by {(I". Natural initial assumptions here are: (1) The x's and y's are realizations of Xl. A placebo is a substance such as water tpat is expected to have no effect on the disease and is used to correct for the welldocumented placebo effect.1. . then F(x) = G(x  1") (1. £n are identically distributed." Now consider situation (d). Equivalently Xl. 0 This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations. Example 1.1. Goals. and Performance Criteria Chapter 1 (2) The distribution of the error at one determination is the same as that at another. '. be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. . The classical def~ult model is: (4) The common distribution of the errors is N(o.3) and the model is alternatively specified by F.t. .3. then G(·) F(' . 1 X Tn . so that the model is specified by the set of possible (F. 1 Yn a sample from G. where 0'2 is unknown. (3) The distribution of f is independent of J1. and Y1. respectively. Let Xl... . The Gaussian distribution. .4 Statistical Models. tbe set of F's we postulate. will have none of this.. YI. (3) The control responses are normally distributed. We call this the shift model with parameter ~. or alternatively all distributions with expectation O.. we refer to the x's as control observations. 0 ! I = . By convention.t + ~. 1 X Tn a sample from F. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo..Yn. cr 2 ) distribution. cr > O} where tP is the standard normal distribution. All actual measurements are discrete rather than continuous. G) pairs.t and cr. that is.~). •.Xn are a random sample and. . Then if treatment B had been administered to the same subject instead of treatment A. Often the final simplification is made. Commonly considered 9's are all distributions with center of symmetry 0. patients improve even if they only think they are being treated. To specify this set more closely the critical constant treatment effect assumption is often made. Heights are always nonnegative. Thus.. E}.. for instance.2) distribution and G is the N(J. response y = X + ~ would be obtained where ~ does not depend on x. we have specified the Gaussian two sample model with equal variances. . (2) Suppose that if treatment A had been administered to a subject response X would have been obtained. TwoSample Models. G) : J1 E R. the Xi are a sample from a N(J... It is important to remember that these are assumptions at best only approximately valid. That is. This implies that if F is the distribution of a control.. Then if F is the N(I". We call the y's treatment observations. whatever be J.... heights of individuals or log incomes. G E Q} where 9 is the set of all allowable error distributions that we postulate.
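The Gaussian one-sample measurement model and the constant-treatment-effect (shift) model are easy to simulate. A minimal sketch, assuming NumPy and purely illustrative values mu = 2, sigma = 1, and treatment effect Delta = 0.5, generates control responses X_1, ..., X_m and treatment responses whose distribution is the control distribution shifted by Delta.

    import numpy as np

    rng = np.random.default_rng(1)

    mu, sigma = 2.0, 1.0   # control-population mean and standard deviation (illustrative)
    delta = 0.5            # constant treatment effect (shift parameter)
    m, n = 40, 60          # numbers of control and treatment subjects

    x = rng.normal(mu, sigma, size=m)           # controls:   X_i ~ N(mu, sigma^2), i.i.d.
    y = rng.normal(mu + delta, sigma, size=n)   # treatments: Y_j ~ N(mu + delta, sigma^2)

    # Under the shift model G(y) = F(y - delta), so the difference of the
    # sample means estimates the treatment effect delta.
    print("ybar - xbar =", y.mean() - x.mean(), " true delta =", delta)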
That is. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. For instance. Models. Since it is only X that we observe. P is the . may be quite irrelevant to the experiment that was actually performed. in Example 1. ~(w) is referred to as the observations or data. but not others.1.2.2. the data X(w).2 is that. we can ensure independence and identical distribution of the observations by using different.1. the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.5. It is often convenient to identify the random vector X with its realization. The advantage of piling on assumptions such as (I )(4) of Example 1.Section 1. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise.1. On this sample space we have defined a random vector X = (Xl. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. we need only consider its probability distribution.1 Data.Xn ). This distribution is assumed to be a member of a family P of probability distributions on Rn. In others. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. if they are false. G are assumed arbitrary. In this situation (and generally) it is important to randomize.1. Parameters. A review of necessary concepts and notation from probability theory are given in the appendices.3 when F. This will be done in Sections 3. The danger is that. When w is the outcome of the experiment. for instance.6. the number of a particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. our analyses.1. In some applications we often have a tested theoretical model and the danger is small.4.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. However.1. Using our first three examples for illustrative purposes. We are given a random experiment with sample space O. in Example 1. P is referred to as the model. Statistical methods for models of this kind are given in Volume 2. As our examples suggest. For instance. we can be reasonably secure about some aspects. In Example 1.1). we now define the elements of a statistical model.1. we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. All the severely ill patients might. Experiments in medicine and the social sciences often pose particular difficulties. The number of defectives in the first example clearly has a hypergeometric distribution. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4. equally trained observers with no knowledge of each other's findings. in comparative experiments such as those of Example 1. there is tremendous variation in the degree of knowledge and control we have concerning experiments.'" . have been assigned to B. For instance. if (1)(4) hold. 
we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N. Fortunately.3 and 6. though correct for the model written down. if they are true.
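The hypergeometric family for the sampling inspection example can be examined numerically. The sketch below (ours; it assumes SciPy and uses illustrative values N = 100, n = 10) evaluates the frequency function of the number of defectives X in the sample for several values of the parameter theta, that is, for several possible numbers N*theta of defective items in the shipment.

    from scipy.stats import hypergeom

    N, n = 100, 10                    # population and sample sizes (illustrative)
    for n_defective in (5, 20, 50):   # candidate values of N*theta
        # SciPy's convention: hypergeom(M, K, n) with M = population size,
        # K = number of defectives, n = number of draws without replacement.
        dist = hypergeom(N, n_defective, n)
        pmf = [dist.pmf(k) for k in range(n + 1)]
        print("theta = %.2f:" % (n_defective / N),
              " ".join("%.3f" % p for p in pmf))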
(J is a parameter if and only if the parametrization is identifiable. ) evidently is and so is I' + C>. from P to another space N. A parameter is a feature v(P) of the distribution of X. For instance.1. v. which can be thought of as the difference in the means of the two populations of responses. in Example 1. as long as P is the set of all Gaussian distributions. is that of a parameter. Implicit in this description is the assumption that () is a parameter in the sense we have just defined.2 where fl denotes the mean income and. implies q(BJl = q(B2 ) and then v(Po) q(B). which correspond to other unknown features of the distribution of X.flx. say with replacement.2 again in which we assume the error € to be Gaussian but with arbitrary mean~. In addition to the parameters of interest. Similarly. a map from some e to P. there are also usually nuisance parameters. . that is.1 Data. we can start by making the difference of the means.. As we have seen this parametrization is unidentifiable and neither f1 nor ~ arc parameters in the sense we've defined.1. from to its range P iff the latter map is 11.1. or the midpoint of the interquantile range of P. a function q : + N can be identified with a parameter v( P) iff Po. For instance." = f1Y . or the median of P. .2. Formally. Sometimes the choice of P starts by the consideration of a particular parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter (). Models. For instance. G) parametrization of Example 1.1. formally a map. which indexes the family P.. More generally. the fraction of defectives () can be thought of as the mean of X/no In Example 1.2 is now well defined and identifiable by (1.Section 1.2 with assumptions (1)(4) the parameter of interest fl. ~ Po. Then P is parametrized by 8 = (fl1~' 0'2). But = Var(X. in Example 1.M(P) can be characterized as the mean of P. and observe Xl. For instance. The (fl. make 8 + Po into a parametrization of P.3. Here are two points to note: (1) A parameter can have many representations. where 0'2 is the variance of €.3) and 9 = (G : xdG(x) = J O}. For instance. implies 8 1 = 82 . thus. E(€i) = O.1.1. and Statistics 7 Dual to the notion of a parametrization. our interest in studying a population of incomes may precisely be in the mean income. consider Example 1. (2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). ..1. if POl = Pe'). But given a parametrization (J + Pe. or more generally as the center of symmetry of P. then 0'2 is a nuisance parameter. if the errors are normally distributed with unknown variance 0'2.1. that is. it is natural to write e e e . in Example 1. Parameters. Then "is identifiable whenever flx and flY exist. the focus of the study.3 with assumptions (1H2) we are interested in~.1. instead of postulating a constant treatment effect ~. we can define (J : P + as the inverse of the map 8 + Pe. in Example 1. X n independent with common distribution. When we sample.
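The point that one parameter nu(P) can have several representations can be checked numerically. In the sketch below (ours, using SciPy and an arbitrary N(3, 4) example), the mean, the median, and the midpoint of the interquartile range agree for a symmetric P, but differ for an asymmetric P such as the exponential, where they are genuinely different parameters.

    from scipy.stats import norm, expon

    # For a symmetric P the mean, the median, and the midpoint of the
    # interquartile range are three representations of the same parameter,
    # the center of symmetry.
    P = norm(loc=3.0, scale=2.0)      # illustrative N(3, 4) distribution
    print(P.mean(), P.median(), 0.5 * (P.ppf(0.25) + P.ppf(0.75)))

    # For an asymmetric P they are genuinely different parameters.
    Q = expon(scale=1.0)
    print(Q.mean(), Q.median(), 0.5 * (Q.ppf(0.25) + Q.ppf(0.75)))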
The link for us are things we can compute. These issues will be discussed further in Volume 2. This statistic takes values in the set of all distribution functions on R.2 a cOmmon estimate of J1.3 Statistics as Functions on the Sample Space I j • . F(P)(x) = PIX. . can be related to model formulation as we saw earlier. and Performance Criteria Chapter 1 1. Our aim is to use the data inductively. Models and parametrizations are creations of the statistician. this difference depends on the patient in a complex manner (the effect of each drug is complex). Databased model selection can make it difficult to ascenain or even assign a meaning to the accuracy of estimates or the probability of reaChing correct conclusions. usually a Euclidean space. the fraction defective in the sample. T( x) is what we can compute if we observe X = x.1. Nevertheless. statistics. a common estimate of 0. in Example 1. is used to decide what estimate of the measure of difference should be employed (ct.• 1 X n ) = X ~ L~ I Xi. Goals. however. For instance. Xn)(x) = n L I(X n i < x) i=l where (X" .1. then our attention naturally focuses on estimating this constant If. For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. for example. to narrow down in useful ways our ideas of what the "true" P is. Mandel. which now depends on the data. Next this model.1. is the statistic T(X 11' . Formally.. Informally. T(x) = x/no In Example 1.. < xJ.X)" L i=l I n How we use statistics in estimation and other decision procedures is the subject of the next section. x E Ris F(X " ~ I . X n ) are a sample from a probability P on R and I(A) is the indicator of the event A.8 Statistical Models. In this volume we assume that the model has . Deciding which statistics are important is closely connected to deciding which parameters are important and. consider situation (d) listed at the beginning of this section. we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure. but the true values of parameters are secrets of nature. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient. we can draw guidelines from our numbers and cautiously proceed. It estimates the function valued parameter F defined by its evaluation at x E R. . Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. called the empirical distribution function. For instance. a statistic we shall study extensively in Chapter 2 is the function valued statistic F. Thus.1. a statistic T is a map from the sample space X to some space of values T.2 is the statistic I i i = i i 8 2 = n1 "(Xi ... hence. which evaluated at ~ X and 8 2 are called the sample mean and sample variance. 1964).
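The statistics discussed here, the sample mean, the sample variance, and the empirical distribution function, are straightforward to compute. A minimal sketch in Python on simulated data follows (our illustration; NumPy's ddof argument selects the divisor n or n - 1 for the variance).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=25)           # a sample X_1, ..., X_n (illustrative data)

    xbar = x.mean()                   # sample mean
    s2 = x.var(ddof=1)                # sample variance with divisor n - 1

    def empirical_cdf(sample, t):
        """F_hat(t) = (1/n) #{i : X_i <= t}, the empirical distribution function."""
        return np.mean(sample <= t)

    print("xbar =", xbar, " s^2 =", s2)
    print("F_hat(0) =", empirical_cdf(x, 0.0))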
lO.1. and Statistics 9 been selected prior to the current experiment. patients may be considered one at a time. are continuous with densities p(x. we shall denote the distribution corresponding to any particular parameter value 8 by Po. drugs A and B are given at several . Thus. 1990). For instance. See A. 8). The distribution of the response Vi for the ith subject or case in the study is postulated to depend on certain characteristics Zi of the ith subject. (B. 0). Example 1. Models. Regular models.1. Regression Models. For instance.. (A. X m ).. . density and frequency functions by p(. . and so On of the ith subject in a study. age. Parameters. 1. Yn ). Regression Models We end this section with two further important examples indicating the wide scope of the notions we have introduced. When dependence on 8 has to be observed. after a while.3.3 we could take z to be the treatment label and write onr observations as (A. Lehmann. 0). in Example 1.1. We observe (ZI' Y.. Moreover. This selection is based on experience with previous similar experiments (cf. Y. There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. Problems such as these lie in the fields of sequential analysis and experimental deSign. The experimenter may. assign the drugs alternatively to every other patient in the beginning and then.1. This is obviously overkill but suppose that. Yn are independent. in the study. Thus. Distribution functions will be denoted by F(·. ). (2) All of the P e are discrete with frequency functions p(x.). the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. 0).4 Examples. Expectations calculated under the assumption that X rv P e will be written Eo. 0).. in situation (d) again.. and there exists a set {XI.X" . and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients.. . In the discrete case we will use both the tennsfrequency jimction and density for p(x.Section 1. assign the drug that seems to be working better to a higher proportion of patients. sequentially. They are not covered under our general model and will not be treated in this book. In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1. 'L':' Such models will be called regular parametric models.. Zi is a d dimensional vector that gives characteristics such as sex. (zn. This is the Stage for the following. weight. . We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more infonnation.4. height.1 Data. for example. the number of patients in the study (the sample size) is random. these and other subscripts and arguments will be omitted where no confusion can arise. ) that is independent of 0 such that 1 P(Xi' 0) = 1 for all O. Xl). Y n ) where Y11 ••• . It will be convenient to assume(1) from now on that in any parametric model we consider either: (I) All of the P. (B. Notation. However.
1. (c) whereZ nxa = (zf. Then we have the classical Gaussian linear model. in Example 1. (3) and (4) above.1. and nonparametric if we drop (1) and simply . . [n general.. d = 2 and can denote the pair (Treatment Label.Yn) ~ II f(Yi I Zi)..~II3Jzj = zT (3 so that (b) becomes (b') This is the linear model. Often the following final assumption is made: (4) The distributiou F of (I) is N(O. .treat the Zi as a label of the completely unknown distributions of Yi. (b) where Ei = Yi . . Treatment Dose Level) for patient i.1. I3d) T of unknowns.. then we can write... (3) g((3. Clearly. That is.E(Yi ).. Example 1.1.2 uuknown.Z~)T and Jisthen x n identity.2) with . A eommou (but often violated assumption) is (I) The ti are identically distributed with distribution F.. . Goals. So is Example 1. We usually ueed to postulate more. On the basis of subject matter knowledge and/or convenience it is usually postulated that (2) 1'(z) = 9((3. If we let f(Yi I zd denote the density of Yi for a subject with covariate vector Zi.1. Here J. Zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on Zi. See Problem 1.3(3) is a special case of this model. 0 .L(z) denote the expected value of a response with given covariate vector z. By varying the assumptions we obtain parametric models as with (I). which we can write in vector matrix form. semiparametric as with (I) and (2) with F arbitrary. i=l n If we let J.3 with the Gaussian twosample model I'(A) ~ 1'. For instance. then the model is zT (a) P(YI. z) where 9 is known except for a vector (3 = ((31. ..(z) only.. I'(B) = I' + fl. the effect of z on Y is through f..z) = L. . n. i = 1. Then.8.L(z) is an unknown function from R d to R that we are interested in.10 Statistical Models. and Performance Criteria Chapter 1 dose levels. In the two sample models this is implied by the constant treatment effect assumption. The most common choice of 9 is the linear fOlltl.2 with assumptions (1)(4). . Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems. .. In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. .
t+ei.. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence. To find the density p(Xt.(1 .. en' Using conditional probability theory and ei = {3ei1 + €i.. we give an example in which the responses are dependent..5.. .1. Example 1. However. Xl = I' + '1. ' . Measurement Model with Autoregressive Errors.cn _d p(edp(e2 I edp(e3 I e2) . C n where €i are independent identically distributed with density are dependent as are the X's. . .1 Data...(3cd··· f(e n Because €i =  (3e n _1). x n ). . ... Xi ~ 1. ... X n is P(X1.t.(3) + (3X'l + 'i. Of course.{3Xj1 j=2 n (1. and Statistics 11 Finally. the conceptual issues of stationarity. A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors. Parameters..t = E(Xi ) be the average time for an infinite series of records.. An example would be.p(e" I e1. . eo = a Here the errors el. is that N(O. the model for X I. and the associated probability theory models and inference for dependent data are beyond the scope of this book.xn ) ~ f(x1 I') II f(xj .n. X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. The default assumption.Il. Let XI. . Consider the model where Xi and assume = J.... . ergodicity. . In fact we can write f. i ~ 2. the elapsed times X I.. 0'2) density. It is plausible that ei depends on Cil because long waves tend to be followed by long waves. Let j. . . i = l. . . p(e n I end f(edf(c2 .(3)I').. at best an approximation for the wave example. Models. . X n be the n determinations of a physical constant J. 0 . i = 1. .Section 1.n ei = {3ei1 + €i. Xi . model (a) assumes much more but it may be a reasonable first approximation in these situations. .. . . ... n.. . . save for a brief discussion in Volume 2.. we start by finding the density of CI. say. we have p(edp(C21 e1)p(e31 e"e2)".
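The measurement model with autoregressive errors (Example 1.1.5) can be simulated directly from its defining recursion e_i = beta * e_{i-1} + eps_i, X_i = mu + e_i, with e_0 = 0. The sketch below is our own illustration with arbitrary values mu = 10, beta = 0.6, sigma = 1; the positive lag-one correlation it reports is the dependence between successive errors described in the example.

    import numpy as np

    rng = np.random.default_rng(3)

    mu, beta, sigma, n = 10.0, 0.6, 1.0, 200   # illustrative values

    eps = rng.normal(0.0, sigma, size=n)       # i.i.d. N(0, sigma^2) innovations
    e = np.zeros(n)
    e[0] = eps[0]                              # e_1 = eps_1 since e_0 = 0
    for i in range(1, n):
        e[i] = beta * e[i - 1] + eps[i]        # e_i = beta * e_{i-1} + eps_i
    x = mu + e                                 # X_i = mu + e_i

    # Successive errors are dependent: adjacent observations are positively
    # correlated when beta > 0.
    print("sample mean:", x.mean())
    print("lag-1 correlation:", np.corrcoef(x[:-1], x[1:])[0, 1])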
PIX = k. The general definition of parameters and statistics is given and the connection between parameters and pararnetrizations elucidated. we have had many shipments of size N that have subsequently been distributed. If the customers have provided accurate records of the number of defective items that they have found. They are useful in understanding how the outcomes can he used to draw inferences that go beyond the particular experiment. Goals. the most important of which is the workhorse of statistics. given 9 = i/N.1) Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable 9. 1l"i is the frequency of shipments with i defective items. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences. This is done in the context of a number of classical examples. We view statistical models as useful tools for learning from the outcomes of experiments and studies... In this section we introduced the first basic notions and formalism of mathematical statistics. Now it is reasonable to suppose that the value of (J in the present shipment is the realization of a random variable 9 with distribution given by P[O = N] I = IT" i = 0. and Performance Criteria Chapter 1 Summary.I 12 Statistical Models. This distribution does not always corresp:md to an experiment that is physically realizable but rather is thought of as measure of the beliefs of the experimenter concerning the true value of (J before he or she takes any data. in the past. n). ([. There is a substantial number of statisticians who feel that it is always reasonable.2. .1. i = 0. There are situations in which most statisticians would agree that more can be said For instance. we can construct a freq uency distribution {1l"O. to think of the true value of the parameter (J as being the realization of a random variable (} with a known distribution.. X has the hypergeometric distribution 'H( i. vector observations X with unknown probability distributions P ranging over models P. . N. Thus.2. . . in the inspection Example 1. N.2 BAYESIAN MODELS Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data."" 1l"N} for the proportion (J of defectives in past shipments. That is. 0 = N I I ([.. N. The notions of parametrization and identifiability are introduced. We know that. the regression model.1. 1.2) This is an example of a Bayesian model. it is possible that. and indeed necessary. Models are approximations to the mechanisms generating the observations. a .
1. (J) as a conditional density or frequency function given 8 = we will denote it by p(x I 0) for the remainder of this section. . Suppose that we have a regular parametric model {Pe : (J E 8}. ?T. To get a Bayesian model we introduce a random vector (J. An interesting discussion of a variety of points of view on these questions may be found in Savage et al. For a concrete illustration. I(O. X) is appropriately continuous or discrete with density Or frequency function.4) for i = 0.2) is an example of (1. (1962). whose range is contained in 8. (1. In this section we shall define and discuss the basic clements of Bayesian models.l. x) = ?T(O)p(x. We shall return to the Bayesian framework repeatedly in our discussion.O). The most important feature of a Bayesian model is the conditional distribution of 8 given X = x. and Berger (1985). However..2. . After the value x has been obtained for X. For instance. by giving (J a distribution purely as a theoretical tool to which no subjective significance is attached. The theory of this school is expounded by L. This would lead to the prior distribution e.9)'00'. In the "mixed" cases such as (J continuous X discrete. which is called the posterior distribution of 8. We now think of Pe as the conditional distribution of X given (J = (J. the information about (J is described by the posterior distribution.3). De Groot (1969).3). (1. Before the experiment is performed. There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of (J has an objective interpretation in tenus of frequencies.1. If both X and (J are continuous or both are discrete. 1. Savage (1954).3) Because we now think of p(x. Before sampling any items the chance that a given shipment contains . (1) Our own point of view is that SUbjective elements including the views of subject matter experts arc an essential element in all model building. The function 7r represents our belief or information about the parameter (J before the experiment and is called the prior density or frequency function. the joint distribution is neither continuous nor discrete.2.1)'(0. suppose that N = 100 and that from past experience we believe that each item has probability . then by (B.1 of being defective independently of the other members of the shipment. Eqnation (1.2 Bayesian Models 13 Thus. the resulting statistical inference becomes subjective.2. we can obtain important and useful results and insights. Lindley (1965). given (J = (J. with density Or frequency function 7r. (8.. X) is that of the outcome of a random experiment in which we first select (J = (J according to 7r and then. select X according to Pe.2. Raiffa and Schlaiffer (1961). 100. However.1. the information or belief about the true value of the parameter is described by the prior distribution. let us turn again to Example 1. The joint distribution of (8. insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). = ( I~O ) (0.Section 1.
π_i = P[θ = i/100] = (100 choose i)(0.1)^i (0.9)^{100−i}, for i = 0, 1, ..., 100; that is, 100θ has a B(100, 0.1) distribution. Before sampling any items the chance that a given shipment contains 20 or more bad items is, by the normal approximation with continuity correction,

P[100θ ≥ 20] ≈ 1 − Φ((19.5 − 10)/√(100(0.1)(0.9))) ≈ 0.001. (1.2.5)

Now suppose that a sample of 19 has been drawn in which 10 defective items are found. This leads to

P[100θ ≥ 20 | X = 10] ≈ 0.30. (1.2.6)

To calculate the posterior probability given in (1.2.6) we argue loosely as follows: If before the drawing each item was defective with probability .1 and good with probability .9 independently of the other items, this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Thus, 100θ − X, the number of defectives left after the drawing, is independent of X and has a B(81, 0.1) distribution. Therefore,

P[100θ ≥ 20 | X = 10] = P[100θ − X ≥ 10 | X = 10]
= P[(100θ − X − 8.1)/√(81(0.1)(0.9)) ≥ (9.5 − 8.1)/√(81(0.1)(0.9))]
≈ 1 − Φ(0.52) ≈ 0.30. (1.2.7)

In general, to calculate the posterior, some variant of Bayes' rule (B.1.4) can be used. Specifically,

(i) The posterior distribution is discrete or continuous according as the prior distribution is discrete or continuous.

(ii) If we denote the corresponding (posterior) frequency function or density by π(θ | x), then

π(θ | x) = π(θ)p(x | θ) / Σ_t π(t)p(x | t) if θ is discrete,
π(θ | x) = π(θ)p(x | θ) / ∫ π(t)p(x | t) dt if θ is continuous. (1.2.8)

In the cases where θ and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (θ, X) given by (1.2.3).
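The loose argument above can also be checked directly: given X = 10, the tail probability is an exact binomial sum. The sketch below (standard library only) computes it together with the normal approximations used in the text.

```python
from math import comb, erf, sqrt

def binom_tail(n, p, k):
    """P[B(n, p) >= k], computed exactly."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Posterior: given X = 10, 100*theta - X ~ B(81, 0.1); want P[100*theta >= 20 | X = 10].
exact = binom_tail(81, 0.1, 10)
approx = 1 - Phi((9.5 - 81 * 0.1) / sqrt(81 * 0.1 * 0.9))   # continuity correction
print("exact posterior tail probability :", round(exact, 3))
print("normal approximation             :", round(approx, 3))   # close to the 0.30 above

# Prior: 100*theta ~ B(100, 0.1); normal approximation to P[100*theta >= 20].
prior_approx = 1 - Phi((19.5 - 10) / sqrt(100 * 0.1 * 0.9))
print("prior tail probability (approx.) :", round(prior_approx, 4))  # about 0.001
```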
Here is an example.

Example 1.2.1. Bernoulli Trials. Suppose that X_1, ..., X_n are indicators of n Bernoulli trials with probability of success θ where 0 < θ < 1. If we assume that θ has a priori distribution with density π, we obtain by (1.2.8) as posterior density of θ,

π(θ | x_1, ..., x_n) = θ^k (1 − θ)^{n−k} π(θ) / ∫_0^1 t^k (1 − t)^{n−k} π(t) dt (1.2.9)

for 0 < θ < 1, where k = Σ_{i=1}^n x_i. Note that the posterior density depends on the data only through the total number of successes, Σ_{i=1}^n X_i. We also obtain the same posterior density if θ has prior density π and we only observe Σ_{i=1}^n X_i, which has a B(n, θ) distribution given θ = θ (Problem 1.2.9). We can thus write π(θ | k) for the posterior.

To choose a prior π, we need a class of distributions that concentrate on the interval (0, 1). One such class is the two-parameter beta family. As Figure B.2.2 indicates, the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions, though by no means all; for instance, non-U-shaped bimodal distributions are not permitted.

Suppose, for instance, we are interested in the proportion θ of "geniuses" (IQ ≥ 160) in a particular city. To get information we take a sample of n individuals from the city, X_i = 0 or 1, i = 1, ..., n. If n is small compared to the size of the city, (A.15.13) leads us to assume that the number X of geniuses observed has approximately a B(n, θ) distribution. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the form of a prior distribution on θ. We may want to assume that θ has a density with maximum value at 0 such as that drawn with a dotted line in Figure B.2.2. Or else we may think that π(θ) concentrates its mass near a small number, say 0.05. Then we can choose r and s in the β(r, s) density so that the mean is r/(r + s) = 0.05 and its variance is very small. If we were interested in some proportion about which we have no information or belief, we might take θ to be uniformly distributed on (0, 1), which corresponds to using the beta distribution with r = s = 1. The result might be a density such as the one marked with a solid line in Figure B.2.2.

This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions. Specifically, upon substituting the β(r, s) density (B.2.11) in (1.2.9) we obtain

(1/c) θ^{k+r−1} (1 − θ)^{n−k+s−1}. (1.2.10)

The proportionality constant c, which depends on k, r, and s only, must (see (B.2.11)) be B(k + r, n − k + s), where B(·,·) is the beta function, and the posterior distribution of θ given ΣX_i = k is β(k + r, n − k + s). □

A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. Such families are called conjugate. Evidently the beta family is conjugate to the binomial. Another bigger conjugate family is that of finite mixtures of beta distributions — see Problem 1.2.16. We return to conjugate families in Section 1.6.

Summary. We present an elementary discussion of Bayesian models, introduce the notions of prior and posterior distributions and give Bayes rule. We also by example introduce the notion of a conjugate family of distributions.
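Before moving on, here is a small numerical sketch of the beta-binomial conjugacy of Example 1.2.1: a β(r, s) prior and k successes in n trials give a β(k + r, n − k + s) posterior. The prior parameters and data below are illustrative choices (prior mean r/(r + s) = 0.05, in the spirit of the "geniuses" discussion).

```python
# Conjugate beta-binomial updating: prior beta(r, s), data k successes in n trials.
def beta_posterior(r, s, n, k):
    """Posterior parameters of theta given sum(X_i) = k, per (1.2.10)."""
    return k + r, n - k + s

# Illustrative prior with mean r/(r + s) = 0.05 (e.g., r = 1, s = 19).
r, s = 1.0, 19.0
n, k = 100, 3          # hypothetical sample: 3 "geniuses" among 100 individuals

r_post, s_post = beta_posterior(r, s, n, k)
prior_mean = r / (r + s)
post_mean = r_post / (r_post + s_post)
print("posterior is beta(%.0f, %.0f)" % (r_post, s_post))
print("prior mean %.3f  ->  posterior mean %.3f" % (prior_mean, post_mean))
```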
2.1. say. "hypothesis" or "nonspecialness" (or alternative). in Example 1.l < Jio or those corresponding to J. Thus. We may wish to produce "best guesses" of the values of important parameters.1. at a first cut.3. shipments. In Example 1. the fraction defective B in Example 1. if there are k different brands. (age.1. with (J < (Jo. Ranking. These are estimation problems. i. placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good.e.3. Po or Pg is special and the general testing problem is really one of discriminating between Po and Po.1.1 do we use the observed fraction of defectives . one of which will be announced as more consistent with the data than others.4. for instance. For instance. For instance. say." In testing problems we. 1 < i < n. and as we shall see fonnally later.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments. sex. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted). whereas the receiver does not wish to keep "bad. such as. Intuitively. i we can try to estimate the function 1'0. Example 1. as in Example 1. Goals. Making detenninations of "specialness" corresponds to testing significance. drug dose) T that can be used for prediction of a variable of interest Y. there are many problems of this type in which it's unclear which oftwo disjoint sets of P's. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment.L in Example 1. There are many possible choices of estimates.16 Statistical Models.3. We may have other goals as illustrated by the next two examples. For instance." (} > Bo. Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. P's that correspond to no treatment effect (i. state which is supported by the data: "specialness" or. a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z).lo as special.1. z) we can estimate (3 from our observations Y. A consumer organization preparing (say) a report on air conditioners tests samples of several brands.2. If J. As the second example suggests.l > J. in Example 1. Zi) and then plug our estimate of (3 into g. if we believe I'(z) = g((3. the information we want to draw from data can be put in various forms depending on the purposes of our analysis. Given a statistical model.1. 0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. 0 Example 1. and Performance Criteria Chapter 1 1. then depending on one's philosophy one could take either P's corresponding to J. Prediction. of g((3. However. we have a vector z. Unfortunately {t(z) is unknown.. Thus.1. say a 50yearold male patient's response to the level of a drug.lo means the universe is expanding forever and J.l > JkO correspond to an eternal alternation of Big Bangs and expansions.lo is the critical matter density in the universe so that J. A very important class of situations arises when.l < J. if we have observations (Zil Yi). In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not. as it's usually called.3 THE DECISION THEORETIC FRAMEWORK I . 
that is, the expected value of Y given z. If there are k different brands, there are k! possible rankings or actions. (The parameters referred to above are, for instance, the fraction defective θ in Example 1.1.1 or the physical constant μ in Example 1.1.2.)
(2. Thus. in ranking what mistakes we've made. whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. in testing whether we are right or wrong. Here are action spaces for our examples. A = {a. I} with 1 corresponding to rejection of H. 3). it is natural to take A = R though smaller spaces may serve equally well. Intuitively. 2). (3. A = {( 1.1 Components of the Decision Theory Framework As in Section 1. For instance. We usnally take P to be parametrized. taking action 1 would mean deciding that D. A new component is an action space A of actions or decisions or claims that we can contemplate making. 2. in estimation we care how far off we are. ~ 1 • • • l I} in Example 1. In any case. or the median. ik) of {I. P = {P. I)}. (4) provide guidance in the choice of procedures for analyzing outcomes of experiments. is large. and reliability of statistical procedures.1. 1). If we are estimating a real parameter such as the fraction () of defectives. The answer will depend on the model and. in Example 1. By convention.1. Estimation.1.3. Action space. defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples.. 1. . (1. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. 3. we would want a posteriori estimates of perfomlance. accuracy. and so on. A = {O. even if. Here quite naturally A = {Permntations (i I . : 0 E 8). a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there. there are 3! = 6 possible rankings. On the other hand.1. (2.Section 1. most significantly. . Thus. on what criteria of performance we use.1. # 0. 3).3.. 1.2. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is.3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments. k}}. . In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter.. in Example 1. or combine them in some way? In Example 1.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that. in Example 1.1. if we have three air conditioners. in Example 1. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. for instance. X = ~ L~l Xi.. (2) point to what the different possible actions are. 1. (3) provide assessments of risk. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's). . 2). Ranking. 2. . 3.1. Testing.2 lO estimate J1 do we use the mean of the measurements. (3. or p. we need a priori estimates of how well even the best procedure can do. That is.1.6.1. Thus.
Goals. r " I' 2:)a. they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. ' ." that is. a) = 0. If. = max{laj  vjl. a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z. is P. Closely related to the latter is what we shall call confidence interval loss. Other choices that are.. if Y = 0 or 1 corresponds to. The interpretation of I(P. a) 1(0. a) = 1 otherwise. in the prediction example 1. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P. a) if P is parametrized.1). a) = min {(v( P) .a) ~ (q(O) . . [v(P) . I(P. Although estimation loss functions are typically symmetric in v and a. If v ~ (V1.V. For instance.18 Statistical Models. a) = J (I'(z) . a) = [v( P) . . and Performance Criteria Chapter 1 Prediction. 1 d} = supremum distance. For instance. Q is the empirical distribution of the Zj in . Loss function.2." respectively. say.ai.a)2). In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. although loss functions. Sex)T. "does not respond" and "responds. the expected squared error if a is used. the probability distribution producing the data.3. l(P. . as we shall see (Section 5. . a) = (v(P) .: la.3. or 1(0.a)2 (or I(O. a) I d . A = {a . Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·).. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider. a) 1(0. l( P..a)2.al < d. M) would be our prediction of response or no response for a male given treatment B.Vd) = (q1(0). [f Y is real. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+. and z = (Treatment. asymmetric loss functions can also be of importance. and z E Z.)2 = Vj [ squared Euclidean distance/d absolute distance/d 1..3. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1. ~ 2. Here A is much larger. and truncated quadraticloss: I(P. . is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature.i = We can also consider function valued parameters. a). sometimes can genuinely be quantified in economic terms.a(z))2dQ(z). r I I(P. then a(B.. 1'() is the parameter of interest. examples of loss functions are 1(0. Quadratic Loss: I(P.. Estimation. For instance.qd(lJ)) and a ~ (a""" ad) are vectors. d'}. a) = l(v < a). As we shall see. as the name suggests. say.
is a partition of (or equivalently if P E Po or P E P.1) Decision procedures. . In Section 4. respectively..I loss: /(8. I) /(8. this leads to the commonly considered I(P. a) = 1 otherwise (The decision is wrong). .3.. the statistician takes action o(x)..y is close to zero. a) ~ 0 if 8 E e a (The decision is correct) l(0.a) = I n n. I) 1(8.).Section 1.. and to decide Ll '# a if our estimate is not close to zero.3.).3 The Decision Theoretic Framework 19 the training set (Z 1. 0. Of course.0) sif8<80 Oif8 > 80 rN8. This 0 . (Zn. The data is a point X = x in the outcome or sample space X. We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e.))2. a Estimation.2) and N(I'..2) I if Ix ~ 111 (J >c . . in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost. Testing.. other economic loss functions may be appropriate.1 suppose returning a shipment with °< 1(8. We define a decision rule or procedure to be any function from the sample space taking its values in A.3 we will show how to obtain an estimate (j of a from the data. in the measurement model.. Y). . e ea. a(zn) jT and the vector parameter (I'( z.3 with X and Y distributed as N(I' + ~.1 times the squared Euclidean distance between the prediction vector (a(z. Using 0 means that if X = x is observed. ]=1 which is just n. where {So.. If we take action a when the parameter is in we have made the correct decision and the loss is zero. For instance.1. that is.9. the decision is wrong and the loss is taken to equal one. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision. Then the appropriate loss function is 1. relative to the standard deviation a. we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median. y) = o if Ix::: 111 (J <c (1. For the problem of estimating the constant IJ.1 loss function can be written as ed. Here we mean close to zero relative to the variability in the experiment.).. Y n). . if we are asking whether the treatment effect p'!1'ameter 6. then a reasonable rule is to decide Ll = 0 if our estimate x . . Otherwise.. The decision rule can now be written o(x.1.2).. L(I'(Zj) _0(Z.1'(zn))T Testing. (1. In Example 1. is a or not.
. R maps P or to R+. MSE(fi) I. Proof. lfwe use the mean X as our estimate of IJ. measurements of IJ. we regard I(P." Var(X. Thus.3) where for simplicity dependence on P is suppressed in MSE. Thus.. (If one side is infinite. Suppose v _ v(P) is the real parameter we wish to estimate and fJ(X) is our estimator (our decision rule).) 0 We next illustrate the computation and the a priori and a posteriori use of the risk function. withN(O. we typically want procedures to have good properties not at just one particular x.. () is the true value of the parameter.d..v(P))' (1. the risk or riskfunction: The risk function.1. fi) = Ep(fi(X) . (fi ..i. then Bias(X) 1 n Var(X) .3. and Performance Criteria Chapter 1 where c is a positive constant called the critical value.3.. but for a range of plausible x·s. our risk function is called the mean squared error (MSE) of.) = 2L n ~ n . Moreover. Goals. then the loss is l(P. R('ld) is our a priori measure of the performance of d. we turn to the average or mean loss over the sample space.3. the cross term will he zero because E[fi . We illustrate computation of R and its a priori use in some examples. 1 is the loss function.v can be thought of as the "longrun average error" of v. If we expand the square of the righthand side keeping the brackets intact and take the expected value. i=l . so is the other and the result is trivially true. If we use quadratic loss. A useful result is Bias(fi) Proposition 1. How do we choose c? We need the next concept of the decision theoretic framework.20 Statistical Models. for each 8. 6 (x)) as a random variable and introduce the riskfunction R(P. and assume quadratic loss. The other two terms are (Bias fi)' and Var(v). and X = x is the outcome of the experiment. Example 1. The MSE depends on the variance of fJ and on what is called the bias ofv where = E{fl) . e Estimation. Write the error as 1 = (Bias fi)' + E(fi)] Var(v).v) = [V  + [E(v) .. We do not know the value of the loss because P is unknown. . (]"2) errors.1']. If d is the procedure used.6) = Ep[I(P. 6(X)] as the measure of the perfonnance of the decision rule o(x). fJ(x)). (Continued).3. X n are i.E(fi)] = O. Estimation of IJ. That is. Suppose Xl. and is given by v= MSE{fl) = R(P.
Section 1.a 2 then by (A 13. say.2. an}). 00 (1.. for instance by 8 2 = ~ L~ l(X i .t.4) which doesn't depend on J. .3. areN(O.1.. (.3 The Decision Theoretic Framework 21 and.. or na 21(n .i. planning is not possible but having taken n measurements we can then estimate 0.3.4) can be used for an a priori estimate of the risk of X.2)1"0 + (O.3. If we have no idea of the value of 0. then for quadratic loss. Then (1. . write for a median of {aI. for instance.2 and 0.3. as we assumed. of course. Next suppose we are interested in the mean fl of the same measurement for a certain area of the United States. itself subject to random error.3. but for absolute value loss only approximate. = X.23). if X rnedian(X 1 .X) = ".5) This harder calculation already suggests why quadratic loss is really favored. . The choice of the weights 0. . ( N(o. whereas if we have a random sample of measurements X" X 2. R( P. 0 ~ = a We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1. Example 1. If we have no data for area A.fii/a)[ ~ a . Suppose that the precision of the measuring instrument cr2 is known and equal to crJ or where realistically it is known to be < crJ. Let flo denote the mean of a certain measurement included in the U.1")' ~ E(~) can only be evaluated numerically (see Problem 1. 1) and R(I".a 2 . E(i . We shall derive them in Section 1.3. X) = EIX .1"1 ~ Eill 2 ) where £.1". If we want to be guaranteed MSE(X) < £2 we can do it by taking at least no = <:TolE measurements.3.2.Xn ) (and we.. X) = a' (P) / n still. a natural guess for fl would be flo.2 . or approximated asymptotically. as we discussed in Example 1.fii V.X)2.d. Ii ~ (0.1). computational difficulties arise even with quadratic loss as soon as we think of estimates other than X. . In fact.8 can only be made on the basis of additional knowledge about demography or the economy.. an estimate we can justify later. or numerical and/or Monte Carlo computation.1 2 MSE(X) = R(I". Suppose that instead of quadratic loss we used the more natural(l) absolute value loss.X) =. . age or income. the £.1 is useful for evaluating the performance of competing estimators. n (1. The a posteriori estimate of risk (j2/n is.8)X.6 through a . is possible. Then R(I". a'.4.S. If. we may wantto combine tLo and X = n 1 L~ 1 Xi into an estimator.6). If we only assume. by Proposition 1. For instance. census. analytic. in general.. that the E:i are i.fii 1 af2 00 Itl<p(t)dt = . with mean 0 and variance a 2 (P). 1 X n from area A..
formal Bayesian analysis using a normal prior to illustrate a way of bringing in additional knowledge. Here we compare the performances of μ̃ and X̄ as estimators of μ using MSE. We easily find

Bias(μ̃) = 0.2(μ_0 − μ),
Var(μ̃) = (0.8)² Var(X̄) = (0.64)σ²/n,
MSE(μ̃) = 0.04(μ_0 − μ)² + (0.64)σ²/n.

Figure 1.3.1 gives the graphs of MSE(μ̃) and MSE(X̄) as functions of μ.

[Figure 1.3.1. The mean squared errors of X̄ and μ̃.]

If μ is close to μ_0, the risk R(μ, μ̃) of μ̃ is smaller than the risk R(μ, X̄) = σ²/n of X̄, with the minimum relative risk inf{MSE(μ̃)/MSE(X̄) : μ ∈ R} being 0.64 when μ = μ_0. The two MSE curves cross at μ = μ_0 ± 3σ/√n. Because we do not know the value of μ, neither estimator can be proclaimed as being better than the other. However, if we use as our criterion the maximum (over μ) of the MSE (the minimax criterion), then X̄ is optimal (Example 3.3.4). □

Testing. The test rule (1.3.2) for deciding between Δ = 0 and Δ ≠ 0 can only take on the two values 0 and 1, and the risk is

R(Δ, δ) = l(Δ, 0)P[δ(X, Y) = 0] + l(Δ, 1)P[δ(X, Y) = 1],

which in the case of 0-1 loss is

R(Δ, δ) = P[δ(X, Y) = 1] if Δ = 0
R(Δ, δ) = P[δ(X, Y) = 0] if Δ ≠ 0.

In the general case X and Θ denote the outcome and parameter space, respectively, and we are to decide whether θ ∈ Θ_0 or θ ∈ Θ_1, where Θ = Θ_0 ∪ Θ_1, Θ_0 ∩ Θ_1 = ∅. A test
05. Finding good test functions corresponds to finding critical regions with smaIl probabilities of error.8) sP.3. R(8. in the treatments A and B example. 8 < 80 rN8P.1 lead to (Prohlem 1. For instance. where 1 denotes the indicator function.Section 1.a (1.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1.01 or less.3. "Reject the shipment if and only if X > k. and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0). usually .05 or .1. If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80.o) 0. For instance. If (say) X represents the amount owed in the sample and v is the unknown total amount owed.[X < k]. Here a is small. the focus is on first providing a small bound. (1. 6(X) = IIX E C]. confidence bounds and intervals (and more generally regions). the risk of <5(X) is e R(8. that is. we call the error committed a Type I error. and then trying to minimize the probability of a Type II error. say . Such a v is called a (1 .1) and tests 15k of the fonn. the loss function (1. This is not the only approach to testing." in Example 1.3.6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error. [n the NeymanPearson framework of statistical hypothesis testing.3.3. I(P. Thus.7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation. on the probability of Type I error.18). 8 > 80 . it is natural to seek v(X) such that P[v(X) > vi > 1.8) for all possible distrihutions P of X. an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed.3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C. whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error. a < v(P) .[X < k[.6) Probability of Type [error R(8. This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function. we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0). Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v. For instance. a > v(P) I.IX > k] + rN8P.a) upper coofidence houod on v.
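As a simple illustration of this definition (such bounds are treated in Chapter 4), the sketch below checks by simulation that a standard (1 − α) upper confidence bound for a normal mean with known variance has the required coverage; all numerical values are illustrative assumptions, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Upper confidence bound for nu = mu in the measurement model
# X_1, ..., X_n i.i.d. N(mu, sigma^2) with sigma known:
#   nu_bar = X_bar + z_{1-alpha} * sigma / sqrt(n).
mu, sigma, n, alpha = 5.0, 2.0, 25, 0.05
z = 1.645                      # z_{0.95}, the 0.95 quantile of N(0, 1)

reps = 100_000
x_bar = rng.normal(mu, sigma / np.sqrt(n), size=reps)    # distribution of X_bar
nu_bar = x_bar + z * sigma / np.sqrt(n)
coverage = np.mean(nu_bar >= mu)
print("estimated P[nu_bar >= nu] =", coverage)            # should be about 1 - alpha = 0.95
```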
a2.5. are available. general criteria for selecting "optimal" procedures. in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use. We shall go into this further in Chapter 4. In the context of the foregoing examples. e Example 1. aJ. Ii) = E(Ii(X) v(P))+. and so on. which we represent by Bl and B .: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures. The loss function I(B.24 Statistical Models. Suppose that three possible actions. and Performance Criteria Chapter 1 1 . rather than this Lagrangian form. I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ .a<v(P).3.a>v(P) . sell the location.8) and then see what one can do to control (say) R( P. a) (Drill) aj (Sell) a2 (Partial rights) a3 . though upper bounding is the primary goal. Suppose we have two possible states of nature. It is clear.3. we could operate. though. a patient either has a certain disease or does not. or sell partial rights. an asymmetric estimation type loss function. where x+ = xl(x > 0).a)  a~v(P) . For instance = I(P. and the risk of all possible decision procedures can be computed and plotted.2 . What is missing is the fact that. it is customary to first fix a in (1. and a3. . for some constant c > O. We next tum to the final topic of this section. For instance.3. . A has three points. We shall illustrate some of the relationships between these ideas using the following simple example in which has two members.1 nature makes it resemble a testing loss function and. The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 .1. thai this fonnulation is inadequate because by taking D 00 we can achieve risk O.a ·1 for all PEP. the connection is close. we could leave the component in.3. we could drill for oil. or wait and see.. a 2 certain location either contains oil or does not. a component in a piece of equipment either works or does not work. c ~1 . replace it. as we shall see in Chapter 4. Typically. The decision theoretic framework accommodates by adding a component reflecting this. The 0. administer drugs. 1. or repair it. Goals. Suppose the following loss function is decided on TABLE 1. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model.
Thus, if there is oil and we drill, the loss is zero, whereas if there is no oil and we drill, the loss is 12, and so on. Next, an experiment is conducted to obtain information about θ resulting in the random variable X with possible values coded as 0, 1, and frequency function p(x, θ) given by the following table.

TABLE 1.3.2. The frequency function p(x, θ)

                     Rock formation x
                       0      1
θ_1 (Oil)             0.3    0.7
θ_2 (No oil)          0.6    0.4

For instance, X may represent a certain geological formation; when there is oil, it is known that formation 0 occurs with frequency 0.3 and formation 1 with frequency 0.7, whereas if there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4. We list all possible decision rules in the following table.

TABLE 1.3.3. Possible decision rules δ_i(x)

        i     1    2    3    4    5    6    7    8    9
x = 0        a_1  a_1  a_1  a_2  a_2  a_2  a_3  a_3  a_3
x = 1        a_1  a_2  a_3  a_1  a_2  a_3  a_1  a_2  a_3

Here, δ_1 represents "Take action a_1 regardless of the value of X," δ_2 corresponds to "Take action a_1 if X = 0; take action a_2 if X = 1," and so on. The risk of δ at θ is

R(θ, δ) = E[l(θ, δ(X))] = l(θ, a_1)P[δ(X) = a_1] + l(θ, a_2)P[δ(X) = a_2] + l(θ, a_3)P[δ(X) = a_3].

For instance,

R(θ_1, δ_2) = 0(0.3) + 10(0.7) = 7,
R(θ_2, δ_2) = 12(0.6) + 1(0.4) = 7.6.

If Θ is finite and has k members, we can represent the whole risk function of a procedure δ by a point in k-dimensional Euclidean space, (R(θ_1, δ), ..., R(θ_k, δ)), and if k = 2 we can plot the set of all such points obtained by varying δ. The risk points (R(θ_1, δ_i), R(θ_2, δ_i)) are given in Table 1.3.4 for i = 1, ..., 9 and graphed in Figure 1.3.2.

TABLE 1.3.4. Risk points (R(θ_1, δ_i), R(θ_2, δ_i))

        i        1    2    3    4    5    6    7    8    9
R(θ_1, δ_i)      0    7    3.5  3    10   6.5  1.5  8.5  5
R(θ_2, δ_i)      12   7.6  9.6  5.4  1    3    8.4  4    6

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection.
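The entries of Table 1.3.4 can be reproduced mechanically from the loss table and p(x, θ). The sketch below does so, and also computes the Bayes risks for the prior probability 0.2 of oil adopted in the next subsection, together with the maximum risks used there.

```python
from itertools import product

# Loss l(theta, a): theta_1 = oil, theta_2 = no oil; actions a_1 (drill), a_2 (sell), a_3 (partial rights).
loss = {("oil", "a1"): 0,     ("oil", "a2"): 10,    ("oil", "a3"): 5,
        ("no oil", "a1"): 12, ("no oil", "a2"): 1,  ("no oil", "a3"): 6}
# Frequency function p(x | theta) for the rock formation X in {0, 1}.
p = {("oil", 0): 0.3, ("oil", 1): 0.7, ("no oil", 0): 0.6, ("no oil", 1): 0.4}

def risk(theta, rule):
    # rule = (action if X = 0, action if X = 1)
    return sum(p[(theta, x)] * loss[(theta, rule[x])] for x in (0, 1))

rules = list(product(["a1", "a2", "a3"], repeat=2))   # delta_1, ..., delta_9 as in Table 1.3.3
prior = {"oil": 0.2, "no oil": 0.8}                   # the expert's prior (next subsection)

for i, rule in enumerate(rules, start=1):
    r1, r2 = risk("oil", rule), risk("no oil", rule)
    bayes = prior["oil"] * r1 + prior["no oil"] * r2
    print(f"delta_{i}: R(theta_1) = {r1:4.1f}  R(theta_2) = {r2:4.1f}  "
          f"Bayes risk = {bayes:5.2f}  max risk = {max(r1, r2):4.1f}")
```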
9.9 .7 . Goals. neither improves the other. We shall pursue this approach further in Chapter 3. S. (2) A second major approach has been to compare risk functions by global crite .Si) Figure 1.3. i = 1. if we ignore the data and use the estimate () = 0. Si).5 0 0 5 10 R(8 j .4 .) > R( 8" S6)' The problem of selecting good decision " procedures has been attacked in a variety of ways. Usually. . It is easy to see that there is typically no rule e5 that improves all others. Researchers have then sought procedures that improve all others within the class. For instance. if J and 8' are two rules.S') for all () with strict inequality for some (). in estimating () E R when X '"'' N((). . R( 8" Si)).5).) I • 3 10 .S) < R(8.   . and Performance Criteria Chapter 1 R(8 2 .26 Statistical Models. and only if. or level of significance (for tests). Consider. we obtain M SE(O) = (}2.2 . Extensions of unbiasedness ideas may be found in Lehmann (1997.6 . i (1) Narrow classes of procedures have been proposed using criteria such as con siderations of symmetry.. unbiasedness (for estimates and tests).2. a5).) < R( 8" S6) but R( 8" S. The risk points (R(8 j . and S6 in our example.3 Bayes and Minimax Criteria The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing. 1. Symmetry (or invariance) restrictions are discussed in Ferguson (1967). Here R(8 S.. for instance. R(8.S. The absurd rule "S'(X) = 0" cannot be improved on at the value 8 ~ 0 because Eo(S'(X)) = 0 ifand only if O(X) = O. Section 1.8 5 . We say that a procedure <5 improves a procedure Of if.3.
8 10 6 3.10) r( 0) ~ 0. we need not stop at this point.2. Then we treat the parameter as a random variable () with possible values ()1. and only if.0).3 The Decision Theoretic Framework 27 ria rather than on a pointwise basis. We postpone the consideration of posterior analysis.48 7. Bayes: The Bayesian point of view leads to a natural global criterion.) ~ 0. If there is a rule <5*. but can proceed to calculate what we expect to lose on the average as () varies.5 gives r( 0.3. to Section 3. TABLE 1.5 9 5. 0) isjust E[I(O.9) The second preceding identity is a consequence of the double expectation theorem (B. (1.o(x))]. 11(0. This quantity which we shall call the Bayes risk of <5 and denote r( <5) is then.. which attains the minimum Bayes risk l that is. and reo) = J R(O.3.. O(X)) I () = ()]. it has smaller Bayes risk. the expected loss.0) Table 1. R(0" Oi)) I 9.4 5 2. (1. ..2. In this framework R(O.8.92 5. the only reasonable computational method. From Table 1. given by reo) = E[R(O. We shall discuss the Bayes and minimax criteria. .o)1I(O)dO.6 3 8.o)] = E[I(O. r( 09) specified by (1.02 8.2.3. . . + 0.9 8.3.). r(o") = minr(o) reo) = EoR(O. such that • see that rule 05 is the unique Bayes rule then it is called a Bayes rule.6 12 2 7.38 9.5 7 7.7 6.5 we for our prior. The method of computing Bayes procedures by listing all available <5 and their Bayes risk is impracticable in general.3. Note that the Bayes approach leads us to compare procedures on the basis of. if we use <5 and () = ().3. To illustrate.8R(0. Bayes and maximum risks oftbe procedures of Table 1.5.I. 0)11(0). suppose that in the oil drilling example an expert thinks the chance of finding oil is .9).20) in Appendix B.Section 1.8 6 In the Bayesian framework 0 is preferable to <5' if.4 8 4.2R(0. Recall that in the Bayesian model () is the realization of a random variable or vector () and that Pe is the conditional distribution of X given 0 ~ O. .) maxi R(0" Oil.6 4 4. r(o. If we adopt the Bayesian point of view. therefore.3. if () is discrete with frequency function 'Tr(e). ()2 and frequency function 1I(OIl The Bayes risk of <5 is.3. = 0.
For simplicity we shall discuss only randomized . J)) we see that J 4 is minimax with a maximum risk of 5. This is.5. . if V is the class of all decision procedures (nonrandomized). Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. From the listing ofmax(R(O" J). It aims to give maximum protection against the worst that can happen. But this is just Bayes comparison where 1f places equal probability on f}r and ()2. 2 The maximum risk 4.3.75 is strictly less than that of 04. which makes the risk as large as possible. Our expected risk would be. J) A procedure 0*. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. 8)]. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. sup R(O. Of course. R(O.J') . we prefer 0" to 0"'. Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. in Example 1. Nature's choosing a 8. For instance. J). a weight function for averaging the values of the function R( B. For instance. suppose that.28 Statistical Models. I Randomized decision rules: In general. < sup R(O.5 we might feel that both values ofthe risk were equally important. The maximum risk of 0* is the upper pure value of the game. Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. This criterion of optimality is very conservative. 6). .75 if 0 = 0. in Example 1. supR(O. The criterion comes from the general theory of twoperson zero sum games of von Neumann.3. R(02. It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. is called minimax (minimizes the maximum risk). the sel of all decision procedures. if and only if.3.(2) We briefly indicate "the game of decision theory. Nevertheless l in many cases the principle can lead to very reasonable procedures. which has . The principle would be compelling.4. and Performance Criteria Chapter 1 if (J is continuous with density rr(8). e 4.4." Nature (Player I) picks a independently of the statistician (Player II). but only ac.J). we toss a fair coin and use 04 if the coin lands heads and 06 otherwise. . 4. To illustrate computation of the minimax rule we tum to Table 1. Player II then pays Player I. J'). = infsupR(O.20 if 0 = O . Goals. who picks a decision procedure point 8 E J from V.
.A)R(02. That is. For instance.14) .1). If the randomized procedure l5 selects l5i with probability Ai.3. including randomiZed ones. Ji ) + (1 . We will then indicate how much of what we learn carries over to the general case. the Bayes risk of l5 (1.3. given a prior 7r on q e. Ai >0.).d) anwng all randomized procedures.R(OI. l5q of nOn randomized procedures.. minimizes r(d) anwng all randomized procedures. If .3.R(02. . . (2) The tangent is the line connecting two unonrandomized" risk points Ji . . J). 9 (Figure 2 1. By (1. This is thai line with slope . ~Ai=1}.(0 d = i = 1 . Jj ).3).. this point is (10. A point (Tl.. Ei==l Al = 1.).J) ~ L.3.(02 ). i=l (1. A randomized minimax procedure minimizes maxa R(8. All points of S that are on the tangent are Bayes.13) defines a family of parallel lines with slope i/(1 .3.12) r(J) = LAiEIR(IJ. 0 < i < 1.13) As c varies.J. i = 1 ..10).Section 1.3. R(O . We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1. .r2):r. = ~A.Ji )] i=l A randomized Bayes procedure d.J. S is the convex hull of the risk points (R(0" Ji ). we represent the risk of any procedure J by the vector (R( 01 . T2) on this line can be written AR(O"Ji ) + (1.r). (1. S= {(rI.3. (1.2.11) Similarly we can define.3. R(O .3). Jj . then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 ..5.i)r2 = c.3. Finding the Bayes rule corresponds to finding the smallest c for which the line (1.3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • . which is the risk point of the Bayes rule Js (see Figure 1.\. when ~ = 0. TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. AR(02..3. 1 q. (1. J)) and consider the risk sel 2 S = {(RrO"J)..'Y) that is tangent to S al the lower boundary of S.R(O. we then define q R(O. i = 1.Ji)' r2 = tAiR(02.3.J): J E V'} where V* is the set of all procedures. As in Example 1.A)R(O"Jj ).i/ (1 .13) intersects S.5. Ji )).
as >. and.3.13). < 1..11) corresponds to the values Oi with probability>.• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes). 0 < >.16) whose diagonal is the line r. Then Q( c*) n S is either a point or a horizontal or vertical line segment. (1../(1 . To locate the risk point of the nilhimax rule consider the family of squares.3.>'). oil.e. is Bayes against 7I". < 1. the first point of contact between the squares and S is the . The convex hull S of the risk poiots (R(B!.16) touches S is the risk point of the minimax rule.3. See Figure 1.3. Let c' be the srr!~llest c for which Q(c) n S i 0 (i. . In our example. where 0 < >. OJ with probability (1 . the first square that touches S). 2 The point where the square Q( c') defined by (1. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le.30 Statistical Models. = 0). (1. namely Oi (take'\ = 1) and OJ (take>. R(B . by (1.3. thus. Because changing the prior 1r corresponds to changing the slope .9. and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1. i = 1. ranges from 0 to 1.15) Each one of these rules. We can choose two nonrandomized Bayes rules from this class.3. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*. (\)). Goals.3.3. = r2.)') of the line given by (1.. .3.
4 < 7. (e) If a Bayes prior has 1f(Oi) to 1f is admissible.\ the solution of From Table 1. that <5 2 is inadmissible because 64 improves it (i. then any Bayes procedure corresponding If e is not finite there are typically admissible procedures that are not Bayes.\ + 6. There is another important concept that we want to discuss in the context of the risk set. > a for all i. thus.3. A decision rule <5 is said to be inadmissible if there exists another rule <5' such that <5' improves <5. 1967. A rule <5 with risk point (TIl T2) is admissible.Section 1. {(x. . In fact. To gain some insight into the class of all admissible procedures (randomized and nOnrandomized) we again use the risk set. (d) All admissible procedures are Bayes procedures.4'\ + 3(1 .3. However.e. j = 6 and . If e is finite. if and only if. agrees with the set of risk points of Bayes procedures. Thus.3..5(1 . R(8" 0)) : 0 E V'} where V" is the set of all randomized decision procedures.3.) and R( 8" 04) = 5. there is no (x. if there is a randomized one. the minimax rule is given by (1. (3) Randomized Bayes procedures are mixtures of nonrandomized ones in the Sense of (1. R( 81 . Using Table 1.14). The following features exhibited by the risk set by Example 1.3. 1 . or equivalently. this equation becomes 3.\) = 5.\). y) in S such that x < Tl and y < T2. Naturally. they are Bayes procedures. l Ok}. e = {fh.5 can be shown to hold generally (see Ferguson. From the figure it is clear that such points must be on the lower left boundary.V) : x < TI..6 ~ R( 8" J. under some conditions..0). . for instance.)). 0.14) with i = 4. Y < T2} has only (TI 1 T2) in common with S..3 The Decision Theoretic Framework 31 intersection between TI = T2 and the line connecting the two points corresponding to <54 and <56 . if and only if. which yields . (a) For any prior there is always a nonrandomized Bayes procedure. all admissible procedures are either Bayes procedures or limits of .59.4 we can see.. for instance). (b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant.4. the set of all lower left boundary points of S corresponds to the class of admissible rules and. 04) = 3 < 7 = R( 81 . all rules that are not inadmissible are called admissible. (c) If e is finite(4) and minimax procedures exist.\ ~ 0. . we can define the risk set in general as s ~ {( R(8 .
A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. ]967. and prediction. are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility.4). Bayes procedures (in various senses). Using this information. in the college admissions situation. For example. . An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average. Since Y is not known. I The prediction Example 1. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. loss function. he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. and Performance Criteria Chapter 1 '. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. Statistical Models. A government expert wants to predict the amount of heating oil needed next winter. One reasonable measure of "distance" is (g(Z) . Goals. ranking. We stress that looking at randomized procedures is essential for these conclusions. Other theorems are available characterizing larger but more manageable classes of procedures.g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) .y)2. Here are some further examples of the kind of situation that prompts our study in this section. although it usually turns out that all admissible procedures of interest are indeed nonrandomized. testing. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. In terms of our preceding discussion. which include the admissible rules. For more information on these topics. A meteorologist wants to estimate the amount of rainfall in the coming spring. The basic biasvariance decomposition of mean square error is presented. We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y.3.2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y. confidence bounds. we tum to the mean squared prediction error (MSPE) t>2(Y. Z is the information that we have and Y the quantity to be predicted. • 1. I II .I 'jlI .Yj'.4 PREDICTION .. decision rule. at least in their original fonn. which is the squared prediction error when g(Z) is used to predict Y. The MSPE is the measure traditionally used in the . and risk through various examples including estimation. Similar problems abound in every field. These remarkable results. Next we must specify what close means. Section 3. We introduce the decision theoretic foundation of statistics inclUding the notions of action space.32 . A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years. Summary. The frame we shaH fit them into is the following. we referto Blackwell and Girshick (1954) and Ferguson (1967). at least when procedures with the same risk function are identified.
mathematical theory of prediction whose deeper results (see, for example, Grenander and Rosenblatt, 1957) presuppose it. The method that we employ to prove our elementary theorems does generalize to other measures of distance Δ(Y, g(Z)) such as the mean absolute error E(|g(Z) − Y|) (see the problems). Just how widely applicable the notions of this section are will become apparent in Remark 1.4.5 and Section 3.2, where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.

The class 𝒢 of possible predictors g may be the nonparametric class 𝒢_NP of all g : R^d → R, or it may be restricted to some subset of this class. In this section we consider 𝒢_NP and the class 𝒢_L of linear predictors of the form a + Σ_{j=1}^d b_j Z_j.

We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information, or equivalently, in which Z is a constant. In this situation all predictors are constant and the best one is that number c_0 that minimizes E(Y − c)² as a function of c.

Lemma 1.4.1. E(Y − c)² is either ∞ for all c or is minimized uniquely by c = μ = E(Y). In fact, E(Y − c)² < ∞ for all c if and only if EY² < ∞, and then

E(Y − c)² = Var Y + (c − μ)². (1.4.1)

Proof. Write Y − c = (Y − μ) + (μ − c). Then (1.4.1) follows because E(Y − μ) = 0 makes the cross product term vanish when we expand the square. We see that E(Y − c)², as a function of c, has a unique minimum at c = μ, and the lemma follows. □

Now we can solve the problem of finding the best MSPE predictor of Y given a vector Z. Let

μ(z) = E(Y | Z = z). (1.4.2)

By the substitution theorem for conditional expectations (B.1.16),

E[(Y − g(Z))² | Z = z] = E[(Y − g(z))² | Z = z].

Because g(z) is a constant, Lemma 1.4.1 assures us that

E[(Y − g(z))² | Z = z] = E[(Y − μ(z))² | Z = z] + [g(z) − μ(z)]². (1.4.3)

If we now take expectations of both sides and employ the double expectation theorem (B.1.20), we can conclude that

Theorem 1.4.1. If Z is any random vector and Y any random variable, then either E(Y − g(Z))² = ∞ for every function g or

E(Y − μ(Z))² ≤ E(Y − g(Z))² (1.4.4)
4.4. I'(Z) is the unique E(Y . then we can write Proposition 1.20). then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <. and recall (B. = O. 0 Note that Proposition 1.4.5) is obtained by taking g(z) ~ E(Y) for all z.I'(Z))' + E(g(Z) 1'(Z»2 ( 1. then by the iterated expectation theorem.4.4.8) or equivalently unless Y is a function oJZ.6).4. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O.1(c) is equivalent to (1. 1.4. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z.4.6) which is generally valid because if one side is infinite. Infact. so is the other. which will prove of importance in estimation theory. Theorem 1.1. (1. Proof.2.4.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes. As a consequence of (1. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV . and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor. Suppose that Var Y < 00.4.4.5) An important special case of (1. then Var(E(Y I Z)) < Var Y. Goals. = I'(Z).4. To show (a).E(V)J Let € = Y . we can derive the following theorem. Property (1.E(v)IIU . then (1.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1.6) and that (1. If E(IYI) < 00 but Z and Y are otherwise arbitrary.E(U)] = O. Var(Y I z) = E([Y . let h(Z) be any function of Z. when E(Y') < 00. Properties (b) and (c) follow from (a).6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV .E(Y I z)]' I z). . that is.5) becomes.34 Statistical Models.g(Z)' = E(Y .Il(Z) denote the random prediction error.4. Vat Y = E(Var(Y I Z» + Var(E(Y I Z). (1. That is.
(1.E(Y I Z))' ~ 0 By (A.1.6).025 0. .30 py(y) I 0.15 3 0.25 0.7) follows immediately from (1. o Example 1.45. E (Y I Z = ±) = 1.10 0.Section 1. The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state. An assembly line operates either at full. and only if. E(Var(Y I Z)) ~ E(Y . half.10 0.885.25 0. the best predictor is E (30. and only if. also.4.4.y) pz(z) 0. 2.E(Y I Z))' =L x 3 ~)y y=o . or 3 shutdowns due to mechanical failure.10 0. if we are trying to guess.8) is true.45 ~ 1 The MSPE of the best predictor can be calculated in two ways. Each day there can be 0.4 Prediction 35 Proof.4.20. as we reasonably might.:.10.025 0.50 z\y 1 0 0. These fractional figures are not too meaningful as predictors of the natural number values of Y.1 2. 1.05 0. whereas the column sums py (y) yield the frequency of 0.9) this can hold if. The first is direct. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day. or quarter capacity.30 I 0.15 I 0.lI. 2. Within any given month the capacity status does not change.'" 1 Y. But this predictor is also the right one.25 2 0. Equality in (1. p(z.E(Y I Z = z))2 p (z. y) = 0. We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2. We want to predict the number of failures for a given day knowing the state of the assembly line for the month. the average number of failures per day in a given month.7) can hold if.4. or 3 failures among all days.25 1 0.4. E(Y . i=l 3 E (Y I Z = ~) ~ 2.05 0. The following table gives the frequency function p( z.05 0. The assertion (1. I z) = E(Y I Z).10 I 0. In this case if Yi represents the number of failures on day i and Z the state of the assembly line. I.
the MSPE of our predictor is given by

E(Y − E(Y | Z))² = Σ_z Σ_{y=0}^3 (y − E(Y | Z = z))² p(z, y) = 0.885.

The second way is to use (1.4.6) and write

E(Y − E(Y | Z))² = Var Y − Var(E(Y | Z)) = E(Y²) − E[(E(Y | Z))²]
  = Σ_y y² p_Y(y) − Σ_z [E(Y | Z = z)]² p_Z(z) = 0.885

as before. □

One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y, which is the MSPE of the best constant predictor, can reasonably be thought of as a measure of dependence. The larger this quantity, the more dependent Z and Y are.

Example 1.4.2. The Bivariate Normal Distribution. If (Z, Y) has a N(μ_Z, μ_Y, σ²_Z, σ²_Y, ρ) distribution, Theorem B.4.2 tells us that the conditional distribution of Y given Z = z is N(μ_Y + ρ(σ_Y/σ_Z)(z − μ_Z), σ²_Y(1 − ρ²)). Therefore, the best predictor of Y using Z is the linear function

μ₀(Z) = μ_Y + ρ(σ_Y/σ_Z)(Z − μ_Z)  (1.4.10)

with MSPE

E((Y − E(Y | Z))² | Z = z) = σ²_Y(1 − ρ²),

which is independent of z. Thus, E(Y − E(Y | Z))² = σ²_Y(1 − ρ²). Because of (1.4.6), we can also write

Var μ₀(Z) = Var Y − E(Y − E(Y | Z))² = σ²_Y ρ².  (1.4.11)

The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. If ρ > 0, the predictor is a monotone increasing function of Z, indicating that large (small) values of Y tend to be associated with large (small) values of Z. Similarly, ρ < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence. If ρ = 0, the best predictor is just the constant μ_Y, as we would expect in the case of independence. Thus, for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y, whereas its magnitude measures the degree of such dependence. In the bivariate normal case, one minus the ratio of the MSPE of the best predictor to Var Y is just ρ².

The line y = μ_Y + ρ(σ_Y/σ_Z)(z − μ_Z), which corresponds to the best predictor of Y given Z in the bivariate normal model, is usually called the regression (line) of Y on Z.
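A short simulation can confirm the two formulas just derived, namely that the regression line is the best predictor and that its MSPE is σ²_Y(1 − ρ²). The Python sketch below is not from the text and uses illustrative parameter values.

import numpy as np

rng = np.random.default_rng(1)
# Illustrative bivariate normal parameters (assumed, not from the text).
mu_Z, mu_Y, sd_Z, sd_Y, rho = 1.0, 2.0, 1.5, 2.0, 0.6
n = 200_000

Z = mu_Z + sd_Z * rng.standard_normal(n)
# Generate Y from its conditional distribution given Z.
Y = (mu_Y + rho * (sd_Y / sd_Z) * (Z - mu_Z)
     + sd_Y * np.sqrt(1 - rho**2) * rng.standard_normal(n))

best = mu_Y + rho * (sd_Y / sd_Z) * (Z - mu_Z)            # regression line mu_0(Z)
print(np.mean((Y - best) ** 2), sd_Y**2 * (1 - rho**2))   # MSPE vs sigma_Y^2 (1 - rho^2)
print(np.mean((Y - mu_Y) ** 2), sd_Y**2)                  # constant predictor MSPE vs Var Y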
Regression toward the mean. The term regression was coined by Francis Galton and is based on the following observation. Suppose Y and Z are bivariate normal random variables with the same mean μ, variance σ², and positive correlation ρ. In Galton's case, these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. Then the predicted height of the son, or the average height of sons whose fathers have the height Z, is

μ₀(Z) = (1 − ρ)μ + ρZ,

which is closer to the population mean of heights μ than is the height of the father. The variability of the predicted value about μ should, consequently, be less than that of the actual heights, and indeed Var((1 − ρ)μ + ρZ) = ρ²σ² < σ². Thus, tall fathers tend to have shorter sons; there is "regression toward the mean." This is compensated for by "progression" toward the mean among the sons of shorter fathers, and there is no paradox. □

Example 1.4.3. The Multivariate Normal Distribution. Let Z = (Z₁, ..., Z_d)ᵀ be a d × 1 covariate vector with mean μ_Z = (μ₁, ..., μ_d)ᵀ and suppose that (Zᵀ, Y)ᵀ has a (d + 1)-dimensional multivariate normal, N_{d+1}(μ, Σ), distribution (Section B.6) in which μ = (μ_Zᵀ, μ_Y)ᵀ, μ_Y = E(Y), Σ_ZZ is the d × d variance-covariance matrix Var(Z), Σ_ZY = (Cov(Z₁, Y), ..., Cov(Z_d, Y))ᵀ = Σ_YZᵀ, and σ_YY = Var(Y). Theorem B.6.5 states that the conditional distribution of Y given Z = z is N(μ_Y + (z − μ_Z)ᵀβ, σ_YY·z), where β = Σ_ZZ⁻¹Σ_ZY and σ_YY·z = σ_YY − Σ_YZ Σ_ZZ⁻¹Σ_ZY. Thus, the best predictor E(Y | Z) of Y is the linear function

μ₀(Z) = μ_Y + (Z − μ_Z)ᵀβ  (1.4.12)

with MSPE

E[Y − μ₀(Z)]² = E{E[Y − μ₀(Z)]² | Z} = E(σ_YY·z) = σ_YY − Σ_YZ Σ_ZZ⁻¹Σ_ZY.

The quadratic form Σ_YZ Σ_ZZ⁻¹Σ_ZY is positive except when the joint normal distribution is degenerate, so the MSPE of μ₀(Z) is smaller than the MSPE of the constant predictor μ_Y. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. This quantity is called the multiple correlation coefficient (MCC), coefficient of determination, or population R-squared. We write

MCC = ρ²_{ZY} = 1 − E[Y − μ₀(Z)]² / Var Y = Var μ₀(Z) / Var Y,

where the last identity follows from (1.4.6). Thus, the MCC equals the square of the usual correlation coefficient ρ = σ_ZY / (σ_ZZ σ_YY)^{1/2} when d = 1.

Note that in practice the distribution of (Z, Y) is unavailable and the regression line is estimated on the basis of a sample (Z₁, Y₁), ..., (Z_n, Y_n) from the population. We shall see how to do this in Chapter 2.
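The quantities β, the MSPE, and the MCC of the multivariate normal case are simple matrix computations. The Python sketch below carries them out for hypothetical parameter values; it is not the text's height example, and the function name mu0 is introduced only for illustration.

import numpy as np

# Illustrative parameters for (Z1, Z2, Y): hypothetical values, not from the text.
mu_Z = np.array([64.0, 70.0])
mu_Y = 55.0
Sigma_ZZ = np.array([[6.0, 2.0],
                     [2.0, 7.0]])
Sigma_ZY = np.array([4.0, 3.0])
sigma_YY = 6.5

beta = np.linalg.solve(Sigma_ZZ, Sigma_ZY)        # beta = Sigma_ZZ^{-1} Sigma_ZY
mspe = sigma_YY - Sigma_ZY @ beta                 # sigma_YY - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY
mcc = 1 - mspe / sigma_YY                         # multiple correlation coefficient

def mu0(z):
    """Best predictor E(Y | Z = z) = mu_Y + (z - mu_Z)' beta."""
    return mu_Y + (np.asarray(z) - mu_Z) @ beta

print(beta, mspe, mcc, mu0([66.0, 72.0]))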
are P~. respectively.3%.~ ~:~~). respectively. The problem of finding the best MSPE predictor is solved by Theorem 1. Y. Y) a I VarZ. .zz ~ (. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.393.4.4.)T.13) . 29W· Then the strength of association between a girl's height and those of her mother and father.ba) + Zba]}' to get ba )' .bZ)' is uniquely minintized by b = ba. The best linear predictor.1. y)T is unknown. and Performance Criteria Chapter 1 For example.bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b .335. when the distribution of (ZT. Z2 = father's height).l Proof. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z..bZ)' = E(Y') + E(Z')(b  Therefore. E(Y . We expand {Y . and parents.39 l. we can avoid both objections by looking for a predictor that is best within a class of simple predictors. We first do the onedimensional case. .y = . The natural class to begin with is that of linear combinations of components of Z. The percentage reductions knowing the father's and both parent's heights are 20. Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6. Suppose that E(Z') and E(Y') are finite and Z and Y are not constant. l.38 Statistical Models.9% and 39. (Z~.209. let Y and Z = (Zl.zy ~ (407. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height.5%.2. If we are willing to sacrifice absolute excellence. Goals. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') .. p~y = . In words.4. P~..:.y = .3.. yn)T See Sections 2. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z. Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor. In practice. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1.E(Z')b~.1 and 2. E(Y .
14) Yl T Ep[Y . E(Y . . 0 Note that if E(Y I Z) is of the form a + bZ. Best Multivariate Linear Predictor.t and covariance E of X. Therefore. b) X ~ (ZT.bZ) + (E(Y) .2. On the other hand.E(Z))]'. whatever be b.bE(Z). In that example nothing is lost by using linear prediction.bZ)2 we see that the b we seek minimizes E[(Y .4. This is in accordance with our evaluation of E(Y I Z) in Example 1.l7) in the appendix.Section 1.a .E(ZWW ' E([Z . by (1.E(Y) to conclude that b.16) directly by calculating E(Y .b. (1.Z)'.4. I 1.I'd Z )]' = 1 05 ElY I'(Z)j' .1 the best linear predictor and best predictor differ (see Figure 1. E(Y .13) we obtain tlte proof of the CauchySchwarz inequality (A.I1.E(Z) and Y .bE(Z) . A loss of about 5% is incurred by using the best linear predictor. Substituting this value of a in E(Y . and b = b1 • because.4.1).4.E(Y)]) = EziEzy. This is because E(Y .I'l(Z)1' depends on the joint distribution P of only through the expectation J.bZ)' ~ Var(Y .a)'. = I'Y + (Z  J. If EY' and (E([Z .4. From (1.bZ)2 is uniquely minimized by taking a ~ E(Y) .E(Z)J[Y .a .E(Z)J))1 exist.a . and only if..E(Y)) . OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z . We can now apply the result on zero intercept linear predictors to the variables Z . then the Unique best MSPE predictor is I'dZ) Proof.boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.4.tzf [3. E(Y . Theorem 1.1.a.boZ)' = 0.4 Prediction 39 To prove the second assertion of the theorem note that by (lA.b(Z .E(Z)IIZ . if the best predictor is linear. whiclt corresponds to Y = boZo We could similarly obtain (A. Note that R(a.l). it must coincide with the best linear predictor.5). then a = a. That is.E(Z)f[Z . Let Po denote = . 0 Remark 1.4. is the unique minimizing value. E[Y . in Example 1.4.4.
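The best linear predictor just described depends on the joint distribution of (Z, Y) only through its first and second moments. When these moments are unknown, a natural idea, developed in Chapter 2, is to replace them by sample moments. The Python sketch below only illustrates the formula μ_L(z) = μ_Y + (z − μ_Z)ᵀβ with β = Var(Z)⁻¹Cov(Z, Y); the function name and the simulated data are hypothetical.

import numpy as np

def best_linear_predictor(Z, Y):
    """Sample plug-in version of mu_L(z) = mu_Y + (z - mu_Z)' beta,
    with beta = Var(Z)^{-1} Cov(Z, Y).  Z is n x d, Y has length n."""
    Z, Y = np.asarray(Z, float), np.asarray(Y, float)
    mu_Z, mu_Y = Z.mean(axis=0), Y.mean()
    Zc = Z - mu_Z
    beta = np.linalg.solve(Zc.T @ Zc / len(Y), Zc.T @ (Y - mu_Y) / len(Y))
    a = mu_Y - mu_Z @ beta
    return a, beta                      # predictor: a + z' beta

# Quick check on simulated data where E(Y | Z) is genuinely linear.
rng = np.random.default_rng(2)
Z = rng.standard_normal((5000, 2))
Y = 1.0 + Z @ np.array([2.0, -0.5]) + rng.standard_normal(5000)
print(best_linear_predictor(Z, Y))      # roughly (1.0, [2.0, -0.5])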
Figure 1.4.1. The line represents the best linear predictor y = 1.05 + 1.45z. The three dots give the best predictor.

Let P₀ denote the N(μ, Σ) distribution and let R₀(a, b) = E_{P₀}[Y − μ_L(Z)]². By Example 1.4.3, R₀(a, b) is minimized by (1.4.14). Because P and P₀ have the same μ and Σ, R(a, b) = R₀(a, b), and thus R(a, b) is also minimized by (1.4.14). Our new proof shows how second-moment results sometimes can be established by "connecting" them to the normal distribution. □

Remark 1.4.2. We could also have established Theorem 1.4.4 by extending the proof of Theorem 1.4.3 to d > 1. A third approach, using calculus, is given in Problem 1.4.19. □

Remark 1.4.3. Suppose the model for μ(Z) is linear, that is, μ(Z) = E(Y | Z) = a + Zᵀβ for unknown a ∈ R and β ∈ R^d. We want to express a and β in terms of moments of (Z, Y). Set Z₀ = 1. By Proposition 1.4.1, ε = Y − μ(Z) and each of Z₀, Z₁, ..., Z_d are uncorrelated; that is,

E(Zⱼ[Y − (a + Zᵀβ)]) = 0, j = 0, 1, ..., d.  (1.4.15)

Solving (1.4.15) for a and β gives (1.4.14); thus, this gives a new derivation of (1.4.14). □

Remark 1.4.4. In the general, not necessarily normal, case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y, that is, ρ²_ZY = Corr²(Y, μ_L(Z)). Thus, the MCC gives the strength of the linear relationship between Z and Y. See Problem 1.4.17 for an overall measure of the strength of this relationship. □
Remark 1.4.5. When the class G of possible predictors g with Eg²(Z) < ∞ forms a Hilbert space as defined in Section B.10, and there is a g₀ ∈ G such that

g₀ = arg inf{Δ(Y, g(Z)) : g ∈ G},

then g₀(Z) is called the projection of Y on the space G of functions of Z, and we write g₀(Z) = π(Y | G). Here g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] = 0. Using the distance Δ, where Δ²(Y, g(Z)) = E[Y − g(Z)]², and the projection π notation, we can conclude that Y − μ(Z) is orthogonal to μ(Z) and to μ_L(Z), that μ(Z) = π(Y | G) and μ_L(Z) = π(Y | G_L), and that

Δ²(Y, μ_L(Z)) = Δ²(Y, μ(Z)) + Δ²(μ(Z), μ_L(Z))  (1.4.16)

π(Y | G_L) = π(μ(Z) | G_L).  (1.4.17)

Note that (1.4.16) is the Pythagorean identity. With these concepts the results of this section are linked to the general Hilbert space results of Section B.10. □

Remark 1.4.6. Consider the Bayesian model of Section 1.2 and the Bayes risk (1.3.8) defined by r(δ) = E[l(θ, δ(X))]. If we identify θ with Y and X with Z, we see that r(δ) = MSPE for squared error loss l(θ, δ) = (θ − δ)². Thus, the optimal MSPE predictor E(θ | X) is the Bayes procedure for squared error loss. We return to this in Section 3.2. □

Summary. We consider situations in which the goal is to predict the (perhaps future) value of a random variable Y. The notion of mean squared prediction error (MSPE) is introduced, and it is shown that if we want to predict Y on the basis of information contained in a random vector Z, the optimal MSPE predictor is the conditional expected value of Y given Z. The optimal MSPE predictor in the multivariate normal distribution is presented. Because the multivariate normal model is a linear model, it is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear. Moreover, the optimal MSPE predictor E(θ | X) is the Bayes procedure for squared error loss.

1.5 SUFFICIENCY

Once we have postulated a statistical model, we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation. We begin by formalizing what we mean by "a reduction of the data" X ∈ X. Recall that a statistic is any function of the observations, generically denoted by T(X) or T. The range of T is any space of objects T, usually R or R^k, but, as we have seen in Section 1.1.3, it can also be a set of functions. If T assigns the same value to different sample points, then by recording or taking into account only the value of T(X) we have a reduction of the data.
. ..Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. XI/(X l +X. Y) where X is uniform on (0. Then X = (Xl.2. the sample X = (Xl. T = L~~l Xi. Begin by noting that according to TheoremB. the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O. is sufficient. = tis U(O.1. and Performance Criteria Chapter 1 Even T(X J.8)n' (1.Xn ) is the record of n Bernoulli trials with probability 8. .1) where Xi is 0 or 1 and t = L:~ I Xi. By (A. Thus. the conditional distribution of Xl = [XI/(X l + X. P[X I = XI. T is a sufficient statistic for O.l.. is a statistic that maps many different values of (Xl. Xl has aU(O. • X n ) (X(I) . By (A. . Xl and X 2 are independent and identically distributed exponential random variables with parameter O.5. X(n))' loses information about the labels of the Xi. in the context of a model P = {Pe : (J E e}.2.4). Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise.l.5. 1) whatever be t.' • ..l we see that given Xl + X2 = t.) and Xl +X. are independent and the first of these statistics has a uniform distribution on (0.l. X. t) distribution. Although the sufficient statistics we have obtained are "natural. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T." it is important to notice that there are many others e. we need only record one. Using our discussion in Section B.) and that of Xlt/(X l + X. where 0 is unknown.9.42 Statistical Models.'" . The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X.)](X I + X. (Xl.'" . We give a decision theory interpretation that follows. . given that P is valid.X.5).'" . t) and Y = t . Thus. Each item produced is good with probability 0 and defective with probability 1. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O. X. recording at each stage whether the examined item was defective or not. = t. when Xl + X. X n ) into the same number. We could then represent the data by a vector X = (Xl. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer.5. Goals. once the value of a sufficient statistic T is known.Xn ) does not contain any further infonnation about 0 or equivalently P.) is conditionally distributed as (X.) given Xl + X.Xn = xnl = 8'(1. . whatever be 8. Therefore.3.1 we had sampled the manufactured items in order. 16. We prove that T = X I + X 2 is sufficient for O. One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following. " . ~ t. Example 1.) are the same and we can conclude that given Xl + X. the conditional distribution of XI/(X l + X. + X. For instance. However.. Thus.0. . . A machine produces n items in succession. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information. . 1). It follows that. By Example B. . Instead of keeping track of several numbers... 0 In both of the foregoing examples considerable reduction has been achieved.. whatever be 8. X 2 the time between the arrival of the first and second customers.1. o Example 1. suppose that in Example 1. The total number of defective items observed.
.5 Sufficiency 43 that will do the same job. Let (Xl.} h(x). Now.2. forO E Si. if (152) holds. Section 2. We shall give the proof in the discrete case. e porT = til = I: {x:T(x)=t.. In general.j is independent ofe on cach of the sets Si = {O: porT ~ til> OJ.e) = g(ti.O) I: {x:T(x)=t. X2. 0) porT ~ Ii] ~~CC+ g(li.In a regular model.4) if T(xj) ~ Ii 41 o if T(xj) oF t i . Being told that the numbers of successeS in five trials is three is the same as knowing that the difference between the numbers of successes and the number of failures is one. we need only show that Pl/[X = xjlT = til is independent of for every i and j.} p(x.O) Theorem I. Such statistics are called equivalent. if Tl and T2 are any two statistics such that 7 1 (x) = T 1 (y) if and only if T2(x) = T 2(y).]/porT = Ii] p(x. By our definition of conditional probability in the discrete case. O)h(x) for all X E X. This result was proved in various forms by Fisher..5) .Ll) and (152).] o if T(xj) oF ti if T(xj) = Ii.I. Applying (153) we arrive at. then T 1 and T2 provide the same information and achieve the same reduction of the data. . More generally. e) definedfor tin T and e in on X such that p(X. e)h(Xj) Po[T (15. The complete result is established for instance by Lehmann (1997.6).] Po[X = Xj. To prove the sufficiency of (152).. it is enough to show that Po[X = XjlT ~ l. Proof.5. a statistic T(X) with range T is sufficient for e if. Po [X ~ XjlT = t. It is often referred to as the factorization theorem for sufficient statistics.S. and e and a/unction h defined (152) = g(T(x).") be the set of possible realizations of X and let t i = T(Xi)' Then T is discrete and 2::~ 1 porT = Ii] = 1 for every e. and Halmos and Savage. checking sufficiency directly is difficult because we need to compute the conditional distribution.Section 1. i = 1. there exists afunction g(t. a simple necessary and sufficient criterion for a statistic to be sufficient is available. • only if. Po[X = XjlT = t. Neyman. 0 E 8. T = t. h(xj) (1. (153) By (B. Fortunately..
.5. 1=1 (!. A whole class of distributions.Xn ) is given by I. We may apply Theorem 1.'t n /'[exp{ . 1. l X n are the interarrival times for n customers. . T is sufficient.'" ..7) o Example 1. we need only keeep track of X(n) = max(X . . then the joint density of (X" .O) where x(n) = onl{x(n) < O}. .2 (continued). . .5. if T is sufficient. ... If X 1. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2. 0 Example 1.3.. n P(Xl.5. 10 fact.9) if every Xi is an integer between 1 and B and p( X 11 • (1. [27".9) can be rewritten as " 1 Xn . ..Il) . Takeg(t. Goals. and P(XI' . are introduced in the next section. The probability distribution of X "is given by (1. •. T = T(x)] = g(T(x). and both functions = 0 otherwise..3).6) p(x. }][exp{ ..'). = max(xl. X n )... By Theorem 1. which admits simple sufficient statistics and to which this example belongs.Xn are recorded.5.n /' exp{ .5. ~ Po[X = x. . O} = 0 otherwise. O)h(x) (1.16..5. . and h(xl •.44 Statistical Models. Let 0 = (JL.O)=onexP[OLXil i=l (1.. and Performance Criteria Chapter 1 Therefore.2.1.' (L x~ 1=1 1 n n 2JL LXi))].5... Estimating the Size of a Population.~.. The population is sampled with replacement and n members of the population are observed and their labels Xl. X(n) is a sufficient statistic for O.4)).8) = eneStift > 0... Expression (1. 0) by (B.5.10) P(Xl.8) if all the Xi are > 0. 0 o Example 1.2.Xn ) is given by (see (A. .xn }.. Let Xl. Consider a population with () members labeled consecutively from I to B. let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1. I x n ) = 1 if all the Xi are > 0.[27f. Conversely.4.1 to conclude that T(X 1 . _ _ _1 .5.Xn.. > 0. Common sense indicates that to get information about B.JL)'} ~=l I n I! 2 . . Then the density of (X" . ' .X n JI) = 0 otherwise..' L(Xi . both of which are unknown.Xn ) = L~ IXi is sufficient.5.X.S. . we can show that X(n) is sufficient. ..'I.5.
X)'].2a I3.) i=l i=l is sufficient for B.1 we can conclude that n n Xi.5. Let o(X) = Xl' Using only X. Specifically. a 2 ).~ I: xn By the factorization theorem. o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting. Suppose X" .5. An equivalent sufficient statistic in this situation that is frequently used IS S(X" .Zi)'} exp {Er. we can.l) distributed. 2:>. i= 1 = (lin) L~ 1 Xi. {32. 0') for allO.1. Then 6 = {{31. By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B. X n ) = (I: Xi.' + 213 2a + 213. R(O.• Yn are independent.. in Example 1. that is. a 2)T is identifiable (Problem 1. ... I)) = exp{nI)(x . Then pix. T = (EYi .1.5. The first and second components of this vector are called the sample mean and the sample variance. (1. with JLi following the linear regresssion model ("'J . I)) . for any decision procedure o(x). Here is an example.o) = R(O.~0)}(2"r~n exp{ . 0 X Example 1.. if T(X) is sufficient. L~l x~) and () only and T(X 1 . X is sufficient.Section 1. . . respectively.5 Sufficiency 45 I Evidently P{Xl 1 • •• .5.} + EY. . where we assume diat the given constants {Zi} are not all identical. [1/(n i=l n n 1)1 I:(Xi .EZiY. Suppose.( 2?Ta ')~ exp p {E(fJ. Yi N(Jli. a<. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function. X n ) = [(lin) where I: Xi. . X n are independent identically N(I).5.4 with d = 2.··· . EYi 2 . we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally.xnJ)) is itself a function of (L~ upon applying Theorem 1.12) t ofT(X) and a Example 1.6.9) and _ (y. that Y I . Ezi 1'i) is sufficient for 6. Thus. 1 2 2 .
Theorem }.6') ~ E{E[R(O.14.O) = g(S(x). . Goals. T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x. where k In this situation we call T Bayes sufficient. then T(X) = (L~ I Xi. ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X . we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X . if Xl. Using Section B. the distribution of 6(X) does not depend on O. the same as the posterior distribution given T(X) = L~l Xi = k. Let SeX) be any other sufficient statistic. Minimal sufficiency For any model there are many sufficient statistics: Thus. in that. Equivalently.2. Then by the factorization theorem we can write p(x. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. IfT(X) is sufficient for 0.4.1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. it is Bayes sufficient for every 11. R(O. Now draw 6' randomly from this conditional distribution. T = I Xi was shown to be sufficient. In this Bernoulli trials case. n~l) distribution.5..46 Statistical Models. ). 0 and X are independent given T(X).5. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X).6'(T))IT]} ~ E{E[R(O. by the double expectation theorem.5.Xn is a N(p. = 2:7 Definition. X n ) are hoth sufficient. . (Kolmogorov).6).1 (continued). In Example 1. choose T" = 15"' (X) from the normal N(t.5.12) follows along the lines of the preceding example: Given T(X) = t. 0) as p(x. Example 1. O)h(x) E:  . (T2) sample n > 2. This 6' (T(X)) will have the same risk as 6' (X) because.6). Thus. L~ I xl) and S(X) = (X" . 6'(X) and 6(X) have the same mean squared error. we can find a transformation r such that T(X) = r(S(X)). But T(X) provides a greater reduction of the data.6(X»)IT]} = R(O.2. 0 The proof of (1. and Performance Criteria Chapter 1 given X = t. This result and a partial converse is the subject of Problem 1...l and (1. ...
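The construction just described, in which δ(X) = X₁ is replaced by a randomized rule δ′(T) that, given X̄ = t, is drawn from the conditional distribution N(t, (n − 1)/n), can be checked by Monte Carlo. The Python sketch below assumes, as in the example, that the Xi are N(μ, 1); the risks (mean squared errors) of the two rules agree.

import numpy as np

# Monte Carlo check: delta(X) = X_1 and the randomized rule delta'(T), drawn
# from N(Xbar, (n-1)/n), have the same mean squared error for estimating mu.
rng = np.random.default_rng(3)
mu, n, reps = 2.0, 5, 200_000

X = mu + rng.standard_normal((reps, n))
xbar = X.mean(axis=1)

delta = X[:, 0]                                                         # uses the full data
delta_prime = xbar + np.sqrt((n - 1) / n) * rng.standard_normal(reps)   # uses only T = Xbar

print(np.mean((delta - mu) ** 2), np.mean((delta_prime - mu) ** 2))     # both approx 1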
5.. if X = x. L Xf) i=l i=l n n = (0 10 0. Thus. for a given 8. take the log of both sides of this equation and solve for T. L x (()) gives the probability of observing the point x.O E e. (T'). T is minimally sufficient.5. ~ 2/3 and 0.11) is determined by the twodimensional sufficient statistic T Set 0 = (TI.1). However. the "likelihood" or "plausibility" of various 8.5. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. Now.o)nT ~g(S(x). In this N(/". the statistic L takes on the value Lx. when we think of Lx(B) as a function of (). the ratio of both sides of the foregoing gives In particular. 20' 20 I t.4 (continued). Example 1. 20. if we set 0.)}. ()) x E X}. It is a statistic whose values are functions. Lx is a map from the sample space X to the class T of functions {() t p(x. 1/3)]}/21og 2. ~ 1/3. the likelihood function (1. for example.8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. t. we find T = r(S(x)) ~ (log[2ng(s(x). it gives. 2/3)/g(S(x). In the discrete case. L x (') determines (iI. For any two fixed (h and fh.5 Sufficiency 47 Combining this with (1. as a function of (). t2) because. The likelihood function o The preceding example shows how we can use p(x.) ~ (L Xi. (T') example. Thus. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient. for a given observed x. The formula (1. = 2 log Lx(O. then Lx(O) = (27rO.T. }exp ( I .Section 1.) n/' exp {nO. we find OT(1 .5.) = (/".(t.O).O)h(x) forall O.1) nlog27r .
5. hence.12 for a proof of this theorem of Dynkin. By arguing as in Example 1. .5. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X). 0) and heX) such that p(X. 0) = g(T(X).X Cn )' the order statistics. a 2 is assumed unknown.1 (continued) we can show that T and. Thus. Then Ax is minimal sufficient. if the conditional distribution of X given T(X) = t does not involve O. but if in fact a 2 I 1 all information about a 2 is contained in the residuals. We say that a statistic T (X) is sufficient for PEP. Let Ax OJ c (x: p(x. the ranks. SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x). . If T(X) is sufficient for 0.. X is sufficient. and Scheffe. 1. 1 X n are a random sample. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O.X).5.. Consider an experiment with observation vector X = (Xl •. hence. l:~ I (Xi . ifT(X) = X we can take SeX) ~ (Xl . The "irrelevant" part of the data We can always rewrite the original X as (T(X).. _' . For instance.Oo) > OJ = Lx~6o)' Thus. or if T(X) ~ (X~l)' .13. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. Goals. B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1.. we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. • X( n» is sufficient.. it is Bayes sufficient for O..5. 1) and Lx (1. Let p(X. OJ denote the frequency function or density of X. If. . O)h(X).O) > for all B. Ax is the function valued statistic that at (J takes on the value p x. Suppose there exists 00 such that (x: p(x.5. .17).. The likelihood function is defined for a given data vector of . itself sufficient.1. . where R i ~ Lj~t I(X. < X. We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X.5.).48 Statistical Models. if a 2 = 1 is postulated. a statistic closely related to L solves the minimal sufficiency problem in general. L is a statistic that is equivalent to ((1' i 2) and.Rn ). The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t. 1 X n ).5. . We show the following result: If T(X) is sufficient for 0. . (X( I)' . in Example 1. S(X) = (R" .X.X)2) is sufficient. Lehmann.4. 0 the likelihood ratio of () to Bo. Summary.5. 1) (Problem 1. the residuals. L is minimal sufficient.. If P specifies that X 11 •. 0 In fact. as in the Example 1. Suppose that X has distribution in the class P ~ {Po : 0 E e}. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1. then for any decision procedure J(X). or for the parameter O.. (X. Thus. See Problem ((x:).Xn .
observations X to be the function of θ defined by Lx(θ) = p(X, θ), θ ∈ Θ. If T(X) is sufficient for θ, then, by the factorization theorem, the likelihood ratio

Λx(θ) = Lx(θ) / Lx(θ₀)

depends on X through T(X) only; if there exists θ₀ such that {x : p(x, θ) > 0} ⊂ {x : p(x, θ₀) > 0} for all θ, then Λx is a minimally sufficient statistic. These concepts will reappear in several connections in this book.

1.6 EXPONENTIAL FAMILIES

Several of the models considered in the previous sections, for example the binomial, Poisson, and normal models, have a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property(1). Subsequently, many other common features of these families were discovered and they have become important in much of the modern theory of statistics. Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable Y to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2.

1.6.1 The One-Parameter Case

The family of distributions of a model {P_θ : θ ∈ Θ} is said to be a one-parameter exponential family if there exist real-valued functions η(θ) and B(θ) on Θ and real-valued functions T and h on R^q such that the density (frequency) functions p(x, θ) of the P_θ may be written

p(x, θ) = h(x) exp{η(θ)T(x) − B(θ)},  (1.6.1)

where x ∈ X ⊂ R^q. Note that the functions η, B, and T are not unique. In a one-parameter exponential family the random variable T(X) is sufficient for θ. This is clear because we need only identify exp{η(θ)T(x) − B(θ)} with g(T(x), θ) and h(x) with itself in the factorization theorem. We shall refer to T as a natural sufficient statistic of the family. Here are some examples.

Example 1.6.1. The Poisson Distribution. Let P_θ be the Poisson distribution with unknown mean θ. Then, for x ∈ {0, 1, 2, ...},

p(x, θ) = θ^x e^{−θ} / x! = (1/x!) exp{x log θ − θ},  θ > 0.  (1.6.2)

Therefore, the family of distributions of X is a one-parameter exponential family with q = 1, T(x) = x, h(x) = 1/x!, η(θ) = log θ, B(θ) = θ.
.) exp[~(O)T(x.T(x)=x. Statistical Models. Specifically. y)T where Y = Z independent N(O.' x.4) Therefore. q = 1. Then > 0.1](O)=log(I~0).Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { .. B) are the corresponding density (frequency) functions. fonn a oneparameter exponential family as in (1. the family of distributions of X is a oneparameter exponential family with q=I. p(x. Suppose X has a B(n. h(x) = (2rr)1 exp { . .h(x) = .6. the Pe form a oneparameter exponential family with ~~I .50 .y.. and Performance Criteria Chapter 1 Therefore. Example 1.~O'. 0 Then. .6) .O) f(z.3.O) IIh(x. The Binomial Family.. .T(x) = (y  z)'.2.~z.6.. .1](0) = logO. This is a oneparameter exponential family distribution with q = 2. Suppose X = (Z. 1). for x E {O.5) o = 2.6.6. + OW. where the p. . o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families.Xm are independent and identically distributed with common distribution Pe. 0) distribution.B(O)=nlog(I0).6. 1 X m ) considered as a random vector in Rmq and p(x. is the family of distributions of X = (Xl' .1](0) = .6.3) I' . B(O) = 0.6. .O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l. n) < 0 < 1.) i=1 m B(8)] ( 1. Here is an example where q (1. 0 E e. 1 (1. Goals. 0 Example 1. .~O'(y  z)' logO}.0)]. I I! .. T(x) ~ x. suppose Xl. we have p(x. B(O) = logO. .h(X)=( : ) .(y I z) = <p(zW1<p((y .} . Z and W are f(x.1). 1. If {pJ=». (1.O) = f(z)f.
1. . . 1 X m ) is a vector of independent and identically distributed P(O) random variables and m ) is the family of distributions of x.6.1 Family of distributions . the m) fonn a oneparameter exponential family. X m ) corresponding to the oneparameter exponential fam1 T(Xd.Il)" . if X = (Xl. 1 X m).. h(m)(x) ~ i=l II h(xi). ily of distributions of a sample from any of the foregoin~ is just In our first Example 1. Therefore. •. In the discrete case we can establish the following general result. This family of Poisson distIibutions is oneparameter exponential whatever be m. · . I:::n I::n Theorem 1. . B(m)(o) ~ mB(O).· .6..7) Note that the natural sufficient statistic T(m) is onedimensional whatever be m. . .B(O)} for suitable h * .' fixed I' fixed p fixed ..2) r(p.1 the sufficient statistic T :m)(x1. pi pi I:::n TABLE 1.Section 1...\ 1"/'" (p (s 1) 1) 1) (r T(x) x (x . and pJ LT(Xi). oneparameter exponential family with natural sufficient statistic T(m)(x) = Some other important examples are summarized in the following table. . 1)(0) N(I'.\ fixed r fixed s fixed 1/2. .\) (3(r.6. B.. We leave the proof of these assertions to the reader.6 Exponential Families 51 where x = (x 1 . If we use the superscript m to denote the corresponding T. then the family of distributions of the statistic T(X) is a oneparameter exponential family of discrete distributions whose frequency functions may be written h'Wexp{1)(O)t . B. i=l (1.6. then the m ) fonn a I Xi. Let {Pe } be a oneparameter exponential family of discrete distributions with corresponding functions T.. fl.. then qCm) = mq. and h. and h. 1]. .8) .x log x 10g(l x) logx The statistic T(m) (X I.. For example.Xm ) = I Xi is distributed as P(mO).
Proof. By definition,

P_θ[T(X) = t] = Σ_{x: T(x) = t} p(x, θ) = exp[η(θ)t − B(θ)] Σ_{x: T(x) = t} h(x).

If we let h*(t) = Σ_{x: T(x) = t} h(x), the result follows. □

A similar theorem holds in the continuous case if the distributions of T(X) are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by η rather than θ. The exponential family then has the form

q(x, η) = h(x) exp[ηT(x) − A(η)],  x ∈ X ⊂ R^q,  (1.6.9)

where A(η) = log ∫ h(x) exp[ηT(x)] dx in the continuous case, and the integral is replaced by a sum in the discrete case. If θ ∈ Θ and η = η(θ), then, if q is definable, A(η) must be finite. Let E be the collection of all η such that A(η) is finite. E is called the natural parameter space and T is called the natural sufficient statistic. The model given by (1.6.9) with η ranging over E is called the canonical one-parameter exponential family generated by T and h. E is either an interval or all of R, and the class of models (1.6.9) with η ∈ E contains the class of models with θ ∈ Θ.

Example 1.6.1 (continued). The Poisson family in canonical form is

q(x, η) = (1/x!) exp{ηx − exp[η]},  x ∈ {0, 1, 2, ...},

where η = log θ,

exp{A(η)} = Σ_{x=0}^∞ e^{ηx}/x! = Σ_{x=0}^∞ (e^η)^x/x! = exp(e^η),

and E = R. □

Here is a useful result.

Theorem 1.6.2. If X is distributed according to (1.6.9) and η is an interior point of E, the moment-generating function of T(X) exists and is given by

M(s) = exp[A(s + η) − A(η)]

for s in some neighborhood of 0. Moreover, E(T(X)) = A′(η) and Var(T(X)) = A″(η).
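For the Poisson family in canonical form, A(η) = e^η, so the theorem gives E T = Var T = e^η = θ and M(s) = exp{θ(e^s − 1)}. The Python sketch below, which is only an illustrative numerical check and not part of the text, compares numerical derivatives of A with simulated moments.

import numpy as np

# Numerical check of Theorem 1.6.2 for the Poisson family in canonical form,
# where T(x) = x and A(eta) = exp(eta): E T = A'(eta) and Var T = A''(eta).
A = np.exp                      # A(eta) = e^eta for the Poisson family
eta = np.log(3.5)               # corresponds to theta = 3.5
h = 1e-4
A1 = (A(eta + h) - A(eta - h)) / (2 * h)            # numerical A'(eta)
A2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # numerical A''(eta)

rng = np.random.default_rng(4)
T = rng.poisson(np.exp(eta), size=500_000)
print(A1, T.mean())             # both approx 3.5
print(A2, T.var())              # both approx 3.5
# MGF check: M(s) = exp(A(eta + s) - A(eta))
s = 0.3
print(np.exp(A(eta + s) - A(eta)), np.mean(np.exp(s * T)))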
and Dannois were led in their investigations to the following family of distributions. Proof. nlog0 J. if there exist realvalued functions 171. E(T(X)) ~ A'(1]). B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 .<·or . c R k . exp[A(s + 1]) . More generally.8) = (il(x. h on Rq such that the density (frequency) functions of the Po may be written as. (1. It is used to model the density of "time until failure" for certain types of equipment.17k and B of 6.12). _ .6. We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) . which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic.2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x).Section 1. 1 Tk. x> 0. Example 1. .6 Exponential Families 53 Moreover. This is known as the Rayleigh distribution. 1. and realvalued functions T 1. 2 i=1 i=l n 1 n Here 1] = 1/20 2. Pitman. . 02 ~ 1/21]. being the integral of a density.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.6. Var(T(X)) ~ A"(1]). 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]). Therefore. x E X j=1 c Rq. Direct computation of these moments is more complicated.. is one. The rest of the theorem follows from the momentenerating property of M(s) (see Section A. A family of distrib~tioos {PO: 0 E 8}. 0 Here is a typical application of this result.. is said to be a kparameter exponential family.A(1])] because the last factor. Now p(x.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = .10) .8(0)].. X n is a sample from a population with density p(x.6.8 > O. We give the proof in the continuous case.O) = h(x) exp[L 1]j(O)Tj (x) . Koopman.4 Suppose X 1> ••• . ' e k p(x.8) ~ (x/8 2)exp(_x 2/28 2).
8 2 = and I" 1 2' Tl(x) = x.11) (]"2. . i=l . Suppose lhat P.Ot = fl..5..') : 00 < f1 x2 1 J12 2 p(x. . suppose X = (Xl. q(x. . The density of Po may be written as e ~ {(I". Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi). (1. .. + log(27r"'». X m ) from a N(I".fJk)T rather than family generated by T and h is e. the vector T(X) = (T...A('l)}. . .2.I' 54 Statistical Models. It will be referred to as a natural sufficient statistic of the family. Goals. then the preceding discussion leads us to the natural sufficient statistic m m (LXi.6.Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1.') population..x . . we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}. . which corresponds to a twOparameter exponential family with q = I. i=l i=1 which we obtained in the previous section (Example 1. 2 (".6. Again. ..'" .' .. letting the model be Thus. . A(71) is defined in the same way except integrals over Rq are replaced by sums. 0 Again it will be convenient to consider the "biggest" families. = N(I".10).LTk(Xi )) t=1 Example 1.2(". I ! In the discrete case. and Performance Criteria Chapter 1 By Theorem 1.x E X c Rq where T(x) = (T1 (x).. .'). +log(27r" »)].(X) •. J1 < 00. The Normal Family. ..3.1.O) =exp[".6.').. ..Tk(x)f and..4). In either case. If we observe a sample X = (X" . 1)2(0) = 2 " B(O) " 1"' 1 " T.TdX))T is sufficient.'l) = h(x)exp{TT(x)'l. in the conlinuous case.LX.5. (]"2 > O}. the canonical kparameter exponential indexed by 71 = (fJI. h(x) = I.(x) = x'..
5.ti = f31 + (32Zi. Now we can write the likelihood as k qo(x.6.~2)].j j=l = 1.k. and rewriting kl q(x./Ak) = ".... . 0 + el) ~ qo(x.4 and 1..~l= (3.~3 ~ 1/2(7'. Example 1.5.5. TT(x) = T(Y) ~ (EY"EY. ~..~3 < OJ. Linear Regression.6. . A E A. In this example. < OJ. (N(/" (7') continued}. It will often be more convenient to work with unrestricted parameters.. where A is the simplex {A E R k : 0 < A. A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)]. remedied by considering IV ~j = log(A. Yi rv N(Jli..j = 1. Let T. = 1/2(7'. and t: = Rk . . 1/) =exp{T'f.... I yn)T can be put in canonical form with k = 3.. Suppose a<.kj. . = n'Ezr Example 1.5 that Y I ./(7'. I <j < k 1.(x). E R k . is not and all c. h(x) = I andt: ~ R x R.k.) + log(rr/. and AJ ~ P(X i ~ j). .~..6. i = 1. .Section 1. < l.. x') = (T. This can be identifiable because qo(x. n.)T.'I2. .T.'12 ~ (3.6.7.id. A) = I1:~1 AJ'(X). We observe the outcomes of n independent trials where each trial can end up in one of k possible categories.= {(~1. .~.1.... j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I. 2.(x»). 0) ~ exp{L . However 0. Example 1. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj .. ... .6 Exponential Families 55 (x.~3): ~l E R. L71 Aj = I}. where m.(x) = L:~ I l[X i = j].'. . k}] with canonical parameter 0. 0. In this example.. .EziY. . 0) for 1 = (1.n log j=l L exp(. From Exam~ pIe 1."k./(7'. Multinomial Trials.~2): ~l E R. Then p(x. A(1/) ~ ~[(~U2~. as X and the sample space of each Xi is the k categories {I. E R. k ~ 2.Xn)'r where the Xi are i.. . . We write the outcome vector as X = (Xl. ~3 and t: = {(~1.)j.5./(7'._l)(x)1/nlog(l where + Le"')) j=1 . . the density of Y = (Yb . a 2 ). with J. .~·(x) . . in Examples 1. Y n are independent. ~l ~ /.
. Here is an example of affine transformations of 6 and T.. 17). let Xt =integers from 0 to ni generated Vi < n. " Example 1. this. is an npararneter canonical exponential family with Yi by T(Yt . as X. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' . and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. are identifiable. the parameters 'Ii = Note that the model for X Statistical Models. More10g(P'I[X = jI!P'IIX ~ k]). 0 < Ai < 1. . B(n. " .6. Let Y.6. 1 < i < n. Similarly. 1/(0») (1. if c Re and 1/( 0) = BkxeO C R k .Y n ) ~ Y.d. 1 <j < k . from Example 1.i. If the Ai are unrestricted. .' A(1/) = L:7 1 ni 10g(l + e'·). 1 < i < n. 1]) is a k and h(x) = rr~ 1 l[xi E over. I < k. .8. X n ) T where the Xi are i. < X n be specified levels and (1.. .. then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. Here 'Ii = log 1\. Ai).2. k}] with canonical parameter TJ and £ = Rk~ 1.6.6.6. 0 1. and 1] is a map from e to a subset of R k • Thus.12) taking on k values as in Example 1. Goals. < . 0) ~ q(x.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO.6.17 e for details.7 and X = (X t.56 Note that q(x.1.). ) 1(0 < However. then all models for X are exponential families because they are submodels of the multinomial trials model. where (J E e c R1.6. be independent binomial...13) . See Problem 1. 11 E £ C R k } is an exponential family defined by p(x.·. h(y) = 07 I ( ~. is unchanged. . Logistic Regression.
T 2 = L:~ 1 X. .x = (Xl. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization.Section 1.. i=I This model is sometimes applied in experiments to determine the toxicity of a substance.14) 8.'V T(Y) = (Y" . ~ (1. i = 1.6.6.. log(P[X < xl!(1 .a 2 ). This is a 0 In Example 1. Yn. ranges over (0. SetM = B T . with B = p" we can write where T 1 = ~~ I Xi.. '.8.12) with the range of 1J( 8) restricted to a subset of dimension l with I < k . = '\501 and 'fJ2(0) = ~'\5B2.5 a 2nparamctercanonical exponential family model with fJi = P.. }Ii N(J1. + 8. PIX < xl ~ [1 + exp{ (8. . which is called the coefficient of variation or signaltonoise ratio.i.. Then (and only then).. E R. . suppose the ratio IJlI/a. that is. + 8. Y n are independent. Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic. Example 1. .) = L ni log(1 + exp(8. . I ii.. The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance.6. . then this is the twoparameter canonical exponential family generated by lYIY = (L~l.6.00).. p(x.x and (1. LocationScale Regression. where 1 is (1.. n. ..d.. the 8 parametrization has dimension 2.8.PIX < x])) 8. x).. . + 8.'. so it is not a curved family.9.6.Xi)).i/a. 8) in the 8 parametrization is a canonical exponential family.10. Gaussian with Fixed SignaltoNoise Ratio..!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied.8. ..1.. .6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1.).6.xn)T. However.13) holds. this is by Example 1.6. generated by . L:~ 1 Xl yi)T and h with A(8. o Curved exponential families Exponential families (1. y. In the nonnal case with Xl.. > O. and l1n+i = 1/2a. N(Jl. which is less than k = n when n > 3. is a known constant '\0 > O.1)T..)T . Example 1.i. .Xn i. Y. 'fJ1(B) curved exponential family with l = 1. Suppose that YI ~ . Then.x)W'. If each Iii ranges over R and each a. a.
Ii. then p(y.d. 83 > 0 (e. .12) is not an exponential family model. 1989. In Example 1.. 1. Goals.8. say.i.2. Let Yj ._I11.6. Carroll and Ruppert.10). E R.6. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic.13) exhibits Y.) and 1 h'(YJ)' with pararneter1/(O). and = Ee sTT .1/(O)) as defined in (6. we define I I . a. M(s) as the momentgenerating function.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before.8. 8.1.12.5. sampling. for unknown parameters 8 1 E R. n.. respectively. ~.) = (Yj. Next suppose that (JLi.6. 0) ~ q(y. Even more is true. Thus. be independent.6..58 Statistical Models.6.10 and 1. and B(O) = ~. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~. 1 Tj(Y.8 note that (1. 1 Bj(O). For 8 = (8 1 . and Snedecor and Cochran. . Section 15. 1 < j < n. TJ'(Y).(0).3.4 Properties of Exponential Families Theorem 1. Sections 2.) depend on the value Zi of some covariate.(O)Tj'(Y) for some 11.6 are heteroscedastic and homoscedastic models. and Performance Criteria Chapter 1 and h(Y) = 1. 1978.). We return to curved exponential family models in Section 2.6. 1988. with an exponential family density Then Y = (Y1.E Yj c Rq. Bickel. but a curved exponential family model with 0 1=3.6.5 that for any random vector Tkxl. yn)T is modeled by the exponential family generated by T(Y) ~. as being distributed according to a twoparameter family generated by Tj(Y.1 generalizes directly to kparameter families as does its continuous analogue. Examples 1. Recall from Section B.. We extend the statement of Theorem 1. Supermode1s We have already noted that the exponential family structure is preserved under i.g.
('70)) .9.1 give a classical result in Example 1.6.a)A('72) Because (1. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E.A('7o)} vaLidfor aLL s such that 110 + 5 E E.3.6. ('70) II· The corollary follows immediately from Theorem B.6.r(T) = IICov(7~. .5..4). ~ = 1 .6. Under the conditions ofTheorem 1.A('70) = II 8".3 V.6 Expbnential Families 59 V.r'7o T(X) .6. A(U'71 Which is (b).6.a)'72) < aA('7l) + (1 . S > 0 with ~ + ~ = 1. Fiually (c) 0 The formulae of Corollary 1. for any u(x). A where A( '70) = (8".3(c). T. 8A 8A T" = A('7o) f/J. h(x) > 0.6. ('70). then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) . t: and (a) follows. Substitute ~ ~ a. (with 00 pennitted on either side).8".1. Since 110 is an interior point this set of5 includes a baLL about O.15) is finite. By the Holder inequality (B.1 and Theorem 1. vex) = exp«1 . 1O)llkxk.15) t: the righthand side of (1. u(x) ~ exp(a'7.3.a)'72 E is proved in exactly the same way as Theorem 1. We prove (b) first.15) that a'7. .T(x)).6.6.. 8".6.2. Corollary 1. . v(x).Section 1. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~. Let P be a canonicaL kparameter exponential famiLy generated by (T.6. Proof of Theorem 1. Suppose '70' '71 E t: and 0 < a < 1. If '7" '72 E + (1 . Theorem 1. h) with corresponding natural parameter space E and function A(11).6. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.a)'7rT(x)) and take logs of both sides to obtain. + (1 .a.
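The identities E_η T(X) = Ȧ(η) and Var_η T(X) = Ä(η) stated above are easy to verify numerically. The Python sketch below does so for the multinomial trials family of Example 1.6.7 with one trial and k = 3 categories, where A(η) = log(1 + e^{η₁} + e^{η₂}); the parameter values are illustrative.

import numpy as np

# For one multinomial trial with k = 3 and canonical parameters (eta1, eta2),
# the gradient of A(eta) = log(1 + e^eta1 + e^eta2) equals E[T] and its
# Hessian equals Var(T), where T_j indicates category j.
def A(eta):
    return np.log1p(np.exp(eta).sum())

eta = np.array([0.4, -0.3])
lam = np.exp(eta) / (1 + np.exp(eta).sum())       # category probabilities

h = 1e-4
grad = np.array([(A(eta + h*e) - A(eta - h*e)) / (2*h) for e in np.eye(2)])
hess = np.array([[(A(eta + h*(ei + ej)) - A(eta + h*(ei - ej))
                   - A(eta - h*(ei - ej)) + A(eta - h*(ei + ej))) / (4*h*h)
                  for ej in np.eye(2)] for ei in np.eye(2)])

print(grad, lam)                                   # E[T] = lambda
print(hess, np.diag(lam) - np.outer(lam, lam))     # Var(T)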
1. o The rank of an exponential family I .7). .4 that follows. Suppose P = (q(x. . (iii) Var7)(T) is positive definite."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k.4. T. II ! I Ii .lx ~ j] = AJ ~ e"'ILe a . Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. (continued). ..7.6. Then the following are equivalent. there is a minimal dimension.6. in Example 1.60 Statistical Models. ajTj(X) = ak+d < 1 unless all aj are O.8. Xl Y1 ).(Tj(X» = P>.x. k A(a) ~ nlog(Le"') j=l and k E>.~. However. using the a: parametrization. . f=l . 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open. (ii) 7) is a parameter (identifiable).7 we can see that the multinomial family is of rank at most k . ifn ~ 1. and Performance Criteria Chapter 1 Example 1. and ry. Formally. We establish the connection and other fundamental relationships in Theorem 1. (i) P is a/rank k. Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i. However. Sintilarly.6. p:x.6. for all 9 because 0 < p«X. such that h(x) > O.~~) < 00 for all x.6. 9" 9.6. It is intuitively clear that k . Goads. 2' Going back to Example 1. + 9. Theorem 1.4. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 . (X)". But the rank of the family is 1 and 8 1 and 82 are not identifiable. if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2.(9) = 9.1 is in fact its rank and this is seen in Theorem 1. Tk(X) are linearly independent with positive probability. Here. An exponential family is of rank k iff the generating statistic T is kdimensional and 1. P7)[L.
. F!~ . ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0. III. This is just a restatement of (iv) and (v) of the theorem. (iii) = (iv) and the same discussion shows that (iii) .A(TJ.2.6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. Suppose tlult the conditions of Theorem 1. 1)) is a strictly concavefunction of1) On 1'. (b) logq(x. Let ~ () denote "(. Now (iii) =} A"(TJ) > 0 by Theorem 1. 0 Corollary 1. = Proof. A'(TJ) is strictly monotone increasing and 11.1)o)TT.T(x) . hence. Proof. have (i) . ~ P1)o some 1). thus. because E is open. hence. by our remarks in the discussion of rank.Section 1.'t·· .. (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1).. all 1) = (~i) II. .4 hold and P is of rank k. 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 . Conversely.A(TJ2))h(x). Apply the case k = 1 to Q to get ~ (ii) =~ (i).TJ2)T(X) = A(TJ2) .6. The proof for k details left to a problem.(ii) = (iii).6. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0.2 and.' .. ranges over A(I'). A is defined on all ofE. A' is constant.~> . We give a detailed proof for k = 1.T ~ a2] = 1 for al of O.1)o) : 1)0 + c(1).6. ".A(TJI)}h(x) ~ exp{TJ2T(x) . with probability 1.. .(v). 0 . of 1)0· Let Q = (P1)o+o(1). '" '. which that T implies that A"(TJ) = 0 for all TJ and. for all T}.. Taking logs we obtain (TJI . Note that.. This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/." Then ~(i) > 1 is then sketched with '* P"[a. by Theorem 1. = Proof ofthe general case sketched I. We.3. A"(1]O) = 0 for some 1]0 implies c. ~ (ii) ~ _~ (i) (ii) = P1). :. .) is false. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where".6.) with probability 1 =~(i). Thus..
6.E). families to which the posterior after sampling also belongs.. . (1. By our supermodel discussion..2 (Y .~p).29) that I) generate this family and that the rank of the family is indeed p(p + 3)/2. Suppose Xl. (1.Jl.(9) LT. E) = Idet(E) 1 1 2 / .1 (Y . where X is the Bernoulli trial.Yp. E.._.Jl) .3. then X .Xn is a sample from the kparameter exponential family (1. and that (.(Xi) i=l j=l i=l n • n nB(9)}. Then p(xI9) = III h(x.Jl)}. Thus. It may be shown (Problem 1. The corollary will prove very important in estimation theory. .. the relation in (a) may be far from obvious (see Problem 1.I. is open.6. ifY 1. Example 1.X') = (JL..5.Yn)T follows the k = p(p + 3)/2 parameter exponential family with T = (EiYi . the 8(11) 0) family is parametrized by E(X). .a 2 + J12). See Section 2. E). E) _~yTEIy + (E1JllY 2 I 2 (log Idet(E)1 + JlT I P log"..4 applies..6. and Performance Criteria Chapter 1 The relation in (a) is sometimes evident and the j. This is a special case of conjugate families of priors. E).1 '" 11"'.t parametrization is close to the initial parametrization of classical P.6. revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Yi.6. with its distinct p(p + I) /2 entries. .ljh<i<.. . which is a p x p symmetric matrix.18) _. . An important exponential family is based on the multivariate Gaussian distributions of Section 8..p/2 I exp{ .6.62 Statistical Models..6.tpx 1 and positive definite variance covariance matrix L::pxp' iff its density is f(Y. N p (1J.. T (and h generalizing Example 1. The p Van'ate Gaussian Family.. 0).16) Rewriting the exponent we obtain 10gf(Y. . which is obviously a 11 function of (J1.6. However.. . 0 ! = 1."5) family by E(X).5 Conjugate Families of Prior Distributions In Section 1. B(9) = (log Idet(E) I+JlTEl Jl). h(Y) .2 p p (1.6. Y n are iid Np(Jl.2 we considered beta prior distributions for the probability of success in n Bernoulli trials. E(X. and..17) The first two terms on the right in (1. . where we identify the second element ofT. Recall that Y px 1 has a p variate Gaussian distribution.(Y). the N(/l.11.10).6. Jl.6. For {N(/" "')}.a 2 ). Goals.6.{Y..21)....11. ~iYi Vi). write p(x 19) for p(x._..)] eXP{L 1/... We close the present discussion of exponential families with the following example.17) can be rewritten ( 2: 1 <i<j~p aijYiYj + ~ 2: aii r:2 ) + L(2: aij /lj)Yi i=l i=l j=l p where E. as we always do in the Bayesian context. with mean J.Jl)TE. 9 ~ (Jl. so that Theorem 1.
and tⱼ = Σ_{i=1}^n Tⱼ(xᵢ). To choose a prior distribution for θ, we consider the conjugate family of the model defined by (1.6.18). A conjugate exponential family is obtained from (1.6.18) by letting n and t₁, ..., t_k be "parameters" and treating θ as the variable of interest. That is, let

ω(t) = ∫ ... ∫ exp{Σ_{j=1}^k ηⱼ(θ)tⱼ − t_{k+1}B(θ)} dθ₁ ... dθ_k,
Ω = {(t₁, ..., t_{k+1})ᵀ : 0 < ω(t₁, ..., t_{k+1}) < ∞},  (1.6.19)

with integrals replaced by sums in the discrete case. We assume that Ω is nonempty (see Problem 1.6.36).

Proposition 1.6.1. The (k + 1)-parameter exponential family given by

π_t(θ) = exp{Σ_{j=1}^k ηⱼ(θ)tⱼ − t_{k+1}B(θ) − log ω(t)},  (1.6.20)

where t = (t₁, ..., t_{k+1})ᵀ ∈ Ω, is a conjugate prior to p(x|θ) given by (1.6.18).

Proof. If p(x|θ) is given by (1.6.18), then

π(θ|x) ∝ p(x|θ)π_t(θ) ∝ exp{Σ_{j=1}^k ηⱼ(θ)(Σ_{i=1}^n Tⱼ(xᵢ) + tⱼ) − (t_{k+1} + n)B(θ)} = π_s(θ),  (1.6.21)

where s = (s₁, ..., s_{k+1})ᵀ = (t₁ + Σᵢ T₁(xᵢ), ..., t_k + Σᵢ T_k(xᵢ), t_{k+1} + n)ᵀ. Because two probability densities that are proportional must be equal, π(θ|x) is the member of the exponential family (1.6.20) with parameter s, and our assertion follows. □

Note that (1.6.21) is an updating formula in the sense that, as data x₁, ..., xₙ become available, the parameter t of the prior distribution is updated to s = t + a, where a = (Σᵢ T₁(xᵢ), ..., Σᵢ T_k(xᵢ), n)ᵀ. It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.

Example 1.6.12. Suppose X₁, ..., Xₙ is a N(θ, σ₀²) sample, where σ₀² is known and θ is unknown. For n = 1,

p(x|θ) ∝ exp{(θ/σ₀²)x − θ²/2σ₀²}.  (1.6.22)
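Anticipating the completion of Example 1.6.12 below, the normal-normal updating can be packaged as a short routine: for a N(θ, σ₀²) sample with σ₀² known and a N(η₀, τ₀²) prior, the posterior is normal with variance (1/τ₀² + n/σ₀²)⁻¹ and mean equal to the corresponding precision-weighted average. The Python sketch uses these standard conjugate formulas with illustrative numbers; the function name is hypothetical.

import numpy as np

def normal_posterior(x, sigma0_sq, eta0, tau0_sq):
    """Posterior of theta for a N(theta, sigma0_sq) sample under a N(eta0, tau0_sq)
    prior: returns (posterior mean, posterior variance), using the standard
    conjugate updating formulas."""
    x = np.asarray(x, float)
    n = x.size
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma0_sq)
    post_mean = post_var * (eta0 / tau0_sq + x.sum() / sigma0_sq)
    return post_mean, post_var

# Illustrative numbers (not from the text).
rng = np.random.default_rng(5)
x = rng.normal(loc=1.8, scale=2.0, size=20)        # sigma0^2 = 4 assumed known
print(normal_posterior(x, sigma0_sq=4.0, eta0=0.0, tau0_sq=1.0))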
Example 1.6.12. Suppose X_1, ..., X_n is a N(θ, σ₀²) sample, where σ₀² is known and θ is unknown. To choose a prior distribution for θ, we consider the conjugate family of the model defined by (1.6.20). For n = 1,

p(x|θ) ∝ exp{ (θ/σ₀²) x - θ²/(2σ₀²) }.   (1.6.22)

This is a one-parameter exponential family with T(x) = x, η(θ) = θ/σ₀², and B(θ) = θ²/(2σ₀²). The conjugate two-parameter exponential family given by (1.6.20) has density

π_{t₁,t₂}(θ) ∝ exp{ (t₁ θ)/σ₀² - (t₂ θ²)/(2σ₀²) }.   (1.6.23)

Upon completing the square, we find that

π_{t₁,t₂}(θ) is defined only for t₂ > 0 and all t₁, and is the N(t₁/t₂, σ₀²/t₂) density.   (1.6.24)

Our conjugate family, therefore, consists of all N(η₀, τ₀²) distributions where η₀ varies freely and τ₀² is positive. If we start with a N(η₀, τ₀²) prior density, we must have in the (t₁, t₂) parametrization

t₁ = η₀ σ₀²/τ₀²,  t₂ = σ₀²/τ₀².   (1.6.25)

By (1.6.21), if we observe Σ X_i = s, the posterior has a density (1.6.23) with

t₁(s) = η₀ σ₀²/τ₀² + s,  t₂(n) = σ₀²/τ₀² + n.   (1.6.26)

Using (1.6.24), we find that π(θ|x) is a normal density with mean

μ(s, n) = t₁(s)/t₂(n) = ( σ₀²/τ₀² + n )⁻¹ [ s + η₀ σ₀²/τ₀² ]   (1.6.27)

and variance

τ²(n) = σ₀²/t₂(n) = ( 1/τ₀² + n/σ₀² )⁻¹.

Note that we can rewrite (1.6.27) intuitively as

μ(s, n) = w₁ η₀ + w₂ x̄,   (1.6.28)

where w₁ = τ²(n)/τ₀² and w₂ = n τ²(n)/σ₀², so that w₂ = 1 - w₁. □

These formulae can be generalized to the case X_i i.i.d. N_p(θ, Σ₀), 1 ≤ i ≤ n, Σ₀ known, θ ~ N_p(η₀, τ₀² I), where η₀ varies over R^p, τ₀² is scalar with τ₀² > 0, and I is the p×p identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the N_p(λ, Γ) family, with λ ∈ R^p and Γ symmetric positive definite, is a conjugate family to N_p(θ, Σ₀), but a richer one than we've defined in (1.6.20) except for p = 1, because the N_p(λ, Γ) family is a p(p+3)/2 rather than a p + 1 parameter family.
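The weighted-average form (1.6.28) is easy to verify numerically. Here is a minimal Python sketch of Example 1.6.12; the prior mean, variances, and sample size are illustrative numbers only.

import numpy as np

# Minimal sketch of Example 1.6.12: with a N(eta0, tau0^2) prior and N(theta, sigma0^2)
# observations, the posterior mean (1.6.27) is the weighted average (1.6.28) of the
# prior mean and the sample mean, and the weights sum to one.
rng = np.random.default_rng(2)
sigma0, tau0, eta0 = 1.0, 2.0, 0.0
n = 25
x = rng.normal(1.5, sigma0, size=n)
s, xbar = x.sum(), x.mean()

t1 = eta0 * sigma0**2 / tau0**2 + s            # (1.6.26)
t2 = sigma0**2 / tau0**2 + n
post_mean = t1 / t2                            # (1.6.27)
post_var = sigma0**2 / t2                      # tau^2(n)

w1 = post_var / tau0**2                        # weight on the prior mean eta0
w2 = n * post_var / sigma0**2                  # weight on xbar
print(np.isclose(post_mean, w1 * eta0 + w2 * xbar), np.isclose(w1 + w2, 1.0))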
In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet has as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32.

Discussion. Note that the uniform U(0, θ) model of Example 1.5.3 is not covered by this theory. The natural sufficient statistic max(X_1, ..., X_n), which is one-dimensional whatever be the sample size, is not of the form Σ_{i=1}^n T(X_i). In fact, the family of distributions in this example and the family U(θ, θ + 1) are not exponential. Despite the existence of classes of examples such as these, a theory has been built up, starting with Koopman, Pitman, and Darmois, that indicates that, under suitable regularity conditions, families of distributions which admit k-dimensional sufficient statistics for all sample sizes must be k-parameter exponential families. Problem 1.6.10 is a special result of this type. Some interesting results and a survey of the literature may be found in Brown (1986).

Summary. {P_θ : θ ∈ Θ}, Θ ⊂ R^k, is a k-parameter exponential family of distributions if there are real-valued functions η_1, ..., η_k and B on Θ, and real-valued functions T_1, ..., T_k, h on R^q, such that the density (frequency) function of P_θ can be written as

p(x, θ) = h(x) exp{ Σ_{j=1}^k η_j(θ) T_j(x) - B(θ) },   x ∈ X ⊂ R^q.   (1.6.29)

(T_1(X), ..., T_k(X)) is called the natural sufficient statistic of the family. The canonical k-parameter exponential family generated by T and h is

q(x, η) = h(x) exp{ Tᵀ(x) η - A(η) },

where A(η) = log ∫ ... ∫ h(x) exp{ Tᵀ(x) η } dx in the continuous case, with integrals replaced by sums in the discrete case. The set E = { η ∈ R^k : A(η) < ∞ } is called the natural parameter space. The set E is convex, and the map A : E → R is convex. If E has a nonempty interior in R^k and η₀ is an interior point of E, then T(X) has, for X ~ P_{η₀}, the moment-generating function

M(s) = exp{ A(η₀ + s) - A(η₀) }

for all s such that η₀ + s is in E. Moreover, E_{η₀}[T(X)] = Ȧ(η₀) and Var_{η₀}[T(X)] = Ä(η₀), where Ȧ and Ä denote the gradient and Hessian of A.

An exponential family is said to be of rank k if T is k-dimensional and 1, T_1, ..., T_k are linearly independent with positive P_θ probability for some θ ∈ Θ. If P is a canonical exponential family with E open, then the following are equivalent:

(i) P is of rank k;
(ii) η is identifiable;
(iii) Var_η(T) is positive definite;
(iv) the map η → Ȧ(η) is 1-1 on E;
(v) A is strictly convex on E.

A family F of prior distributions for a parameter vector θ is called a conjugate family of priors to p(x|θ) if the posterior distribution of θ given x is a member of F. The (k+1)-parameter exponential family

π_t(θ) = exp{ Σ_{j=1}^k η_j(θ) t_j - B(θ) t_{k+1} - log w },

where w = ∫ ... ∫ exp{ Σ_{j=1}^k η_j(θ) t_j - B(θ) t_{k+1} } dθ and Ω = { (t_1, ..., t_{k+1}) ∈ R^{k+1} : 0 < w < ∞ }, is conjugate to the exponential family p(x|θ) defined in (1.6.29).
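The moment identities E[T(X)] = Ȧ(η) and Var[T(X)] = Ä(η) in the summary can be checked numerically for any concrete family. Here is a minimal Python sketch for the Poisson family in canonical form (T(x) = x, η = log λ, A(η) = e^η); the value of η and the finite-difference step are illustrative choices.

import numpy as np

# Minimal sketch: for the Poisson family, A(eta) = exp(eta), so the gradient and
# Hessian of A both equal lambda = exp(eta), which are exactly the mean and variance
# of a Poisson(lambda) variable.  Derivatives are approximated by finite differences.
eta0, h = 0.7, 1e-4
A = np.exp
A_dot = (A(eta0 + h) - A(eta0 - h)) / (2 * h)
A_ddot = (A(eta0 + h) - 2 * A(eta0) + A(eta0 - h)) / h**2

lam = np.exp(eta0)     # Poisson(lambda) has mean lam and variance lam
print(np.allclose([A_dot, A_ddot], [lam, lam], rtol=1e-4))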
1.7 PROBLEMS AND COMPLEMENTS

Problems for Section 1.1

1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric.

(a) A geologist measures the diameters of a large number n of pebbles in an old stream bed. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean μ and variance σ². He wishes to use his observations to obtain some information about μ and σ² but has in advance no knowledge of the magnitudes of the two parameters.

(b) A measuring instrument is being used to obtain n independent determinations of a physical constant μ. Suppose that the measuring instrument is known to be biased to the positive side by 0.1 units. Assume that the errors are otherwise identically distributed normal random variables with known variance.
. Each .p. Two groups of nl and n2 individuals. (e) The parametrization of Problem l. At. Q p ) {(al.2) where fLij = v + ai + Aj.. Show that Fu+v(t) < Fu(t) for every t. . = (al..1(d). each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others.Xp are independent with X~ ruN(ai + v. 0. and Po is the distribution of X (b) Same as (a) with Q = (X ll ... j = 1. 1'2) and we observe Y X. v. are sampled at random from a very large population. Which of the following parametrizations are identifiable? (Prove or disprove... .2 )..1. (d) Xi.....1(c).Section 1. .l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.. .. . 4. . ... (b) The parametrization of Problem 1.. bare independeht with Xi. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest."" a p . . Q p ) and (A I . B = (1'1. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. = o.1. restricted to p = (all' .. .) (a) The parametrization of Problem 1. (a) Let U be any random variable and V be any other nonnegative random variable. Once laid. e (e) Same as (d) with (Qt.t.) (a) Xl. then X is (b) As in Problem 1.··. .ap ) : Lai = O}. ~ N(l'ij....1 describe formaHy the foHowing model. .for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A. Can you perceive any difficulties in making statements about f1.Xpb . . . .7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown. 0.. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A.2) and N (1'2. respectively.2 ) and Po is the distribution of Xll. . Are the following parametrizations identifiable? (Prove or disprove. i = 1. . 3.Xp ).1.. 2. i=l (c) X and Y are independent N (I' I . Ab. .) < Fy(t) for every t.2).
e = = 1 if X < 1 and Y {J. and Performance Criteria Chapter 1 ":. Let Y and Po is the distribution of Y. is the distribution of X when X is unifonn on (0. Consider the two sample models of Examples 1. then P(Y < t + c) = P( Y < t . .Lk VI n". . M k = mk] I! n. + nk + ml + .4(1).· .p(0. i = 1..Nk = nk.0. nk·mI ... .LI.. 2.. . k.. Vk) is unknown. Which of the following models are regular? (Prove or disprove.0).. .:". mk·r. I. ..Li < + . P8[N I . 0 < J.. What assumptions underlie the simplification? 6. VI.I nl . Let Po be the distribution of a treatment response.c has the same distribution as Y + c. if and only if. + fl.0.t) = 1 . ... (b)P.1.. 8. I ! fl.9).P(Y < c . 0 < Vi < 1. Suppose the effect of a treatment is to increase the control response by a fixed amount (J. o < J1 < 1.1.. .0'2) (d) Suppose the possible control responses in an experiment are 0. ..2). . 68 Statistical Models.. Show that Y . the density or frequency function p of Y satisfies p( c + t) ~ p( c .Li = 71"(1 M)iIJ.9 and they oecur with frequencies p(0.. The number n of graduate students entering a certain department is recorded....k + VI + . ...3(2) and 1.) (a) p. + Vk + P = 1.1.. }.. J.... is the distribution of X e = (0. but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown.t) fj>r all t.c bas the same distribution as Y + c. ml . .LI ni + .p(0..2. V k mk P r where J.1). • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. Both Y and p are said to be symmetric about c..' I . whenXis unifonnon {O.. Let N i be the number dropping out and M i the number graduating during year i.. Mk. = n11MI = mI. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. . 0 = (1'. I nl. In each of k subsequent years the number of students graduating and of students dropping out is recorded.. (0).t). 0'2). (e) Suppose X ~ N(I'.  . =X if X > 1. The following model is proposed. + mk + r = n 1. Goals.L..2. 1 < i < k and (J = (J. It is known that the drug either has no effect or lowers blood pressure. 0 < V < 1 are unknown. 7.k is proposed where 0 < 71" < I. The simplification J. .O}..c) = PrY > c . Hint: IfY . 5. I .··· .. .. • .. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large. Vi = (1 11")(1 ~ v)iI V for i = 1.
N = 0.. e(j).. The Scale Model. . log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. = 9. . Let X < 1'. {r(j) : j ~ 0. then satisfy a scale model with parameter e. That is. .1. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. N}... Y. N} are identifiable. 0 < j < N.. C(·) = F(· . i . o(x) ~ 21' + tl ..d. if the number of = (min(T. Let c > 0 be a constant. . . I(T < C)) = j] PIC = j] PIT where T. o (a) Show that in this case.. . .Xn are observed i. .'" .zp. suppose X has a distribution F that is not necessarily normal. r(j) = = PlY = j. ci '" N(O.tl). .3 let Xl.. . Sbow that {p(j) : j = 0. .. = (Zlj. e) : p(j) > 0. .. 12. The Lelunann 1\voSample Model. 11. p(j). C are independent.. are not collinear (linearly independent).. (b) Deduce that (th.{3p) are identifiable iff Zl. eY and (c) Suppose a scale model holds for X."'). In Example 1. log y' satisfy a shift model? = Xc. . Sx(t) = . 1 < i < n. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. that is.. the two cases o(x) .. Show that if we assume that o(x) + x is strictly increasing. C).i."') and o(x) is continuous. C(t} = F(tjo). . Hint: Consider "hazard rates" for Y min(T.Xn ).. . Y = T I Y > j].. 0 'Lf 'Lf op(j) = I. . I} and N is known."" Yn ). . Yn denote the survival times of two groups of patients receiving treatments A and B.2x yield the same distribution for the data (Xl. X m and Yi. Does X' Y' = yc satisfy a scale model? Does log X'. then C(·) = F(. and (I'. fJp) are not identifiable if n pardffieters is larger than the number of observations.7 Problems and Complements 69 (a) Show that if Y ~ X + o(X). .tl and o(x) ~ 21' + tl. (Yl. (]"2) independent. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci..6. . . .Section 1. then C(·) ~ F(. j = 0. t > O. C).. eX o.. Suppose X I. 10.tl) implies that o(x) tl. N.2x and X ~ N(J1..tl) does not imply the constant treatment effect assumption. according to the distribution of X. e) vary freely over F ~ {(I'. or equivalently. (b) In part (a). > 0.Znj?' (a) Show that ({31. Therefore. e(j) > 0.
So (T) has a U(O. I' 70 StCitistical Models. 7. (c) Suppose that T and Y have densities fort) andg(t). For treatments A and E. Show that if So is continuous. thus. Sy(t) = Sg. t > 0. Hint: By Problem 8.12.1.7.. and Performance CriteriCl Chapter 1 t Ii .00).G(t). respectively.2) is equivalent to Sy (t I z) = sf" (t). 12.). Hint: See Problem 1. hy(t) = c. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. 13.1) Equivalently.) Show that Sy(t) = S'. I (b) Assume (1. Show that hy(t) = c.l3.I ~I ... then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. Set C. Survival beyond time t is modeled to occur if the events T} > t.(t). then Yi* = g({3. (b) By extending (bjn) from the rationals to 5 E (0. Then h.(t).h.7. . as T with survival function So.  . z)) (1.. Specify the distribution of €i.. k = a and b.7. ~ a5. A proportional hazard model. = exp{g(..2) and that Fo(t) = P(T < t) is known and strictly increasing..{a(t). t > 0. The most common choice of 9 is the linear form g({3. = = (. we have the Lehmann model Sy(t) = si.(t) with C. Zj) + cj for some appropriate €j. Also note that P(X > t) ~ Sort). Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t).ho(t) is called the Cox proportional hazard model..(t) if and only if Sy(t) = Sg. h(t I Zi) = f(t I Zi)jSy(t I Zi). .d. . (1.ll) with scale parameter 5.z)}. I) distribution. Tk > t all occur. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(. show that there is an increasing function Q'(t) such that ifYt = Q'(Y. . .) Show that (1. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t). Goals.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 . log So(T) has an exponential distribution.l3. " T k are unobservable and i. I (3p)T of unknowns. F(t) and Sy(t) ~ P(Y > t) = 1 . where TI".(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y.i.l3.. t > 0...2. . are called the survival functions. 1 P(X > t) =1 (. (c) Under the assumptions of (b) above. Moreover. z) zT.
. F. Merging Opinions. Observe that it depends only on l:~ I Xi· I 0). 1f(O2 ) = ~.. .p and C. (b) Suppose Zl and Z. .1. .11 and 1. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x.i.1. Find the .2 1.(x/c)e. X n are independent with frequency fnnction p(x posterior1r(() I Xl. ) = ~. . an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0.X n ).6 Let 1f be the prior frequency function of 0 defined by 1f(I1. Yl .G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. and for this reason.O) 1 . (a) Suppose Zl' Z...12. Show that both . When F is not symmetric.. have a N(O.4 1 0. 1) distribution. . J1 and v are regarded as centers of the distribution F.v arbitrarily large..d.l (0.L . have a N(O. still identifiable? If not. Show how to choose () to make J.8 0. G. J1 may be very much pulled in the direction of the longer tail of the density.2 with assumptions (t){4).2 0. Are.000 is the minimum monthly salary for state workers. In Example 1.Section 1. Generally.1. wbere the model {(F. Examples are the distribution of income and the distribution of wealth. Ca) Find the posterior frequency function 1f(0 I x).2 unknown. the parameter of interest can be characterl ized as the median v = F..p and ~ are identifi· able.5) or mean I' = xdF(x) = Fl(u)du. .. '1/1' > 0. Problems for Section 1. Consider a parameter space consisting of two points fh and ()2.. 14. (b) Suppose X" . x> c x<c 0.7 Problems and Complements 71 Hint: See Problems 1." . 15.d. Here is an example in which the mean is extreme and the median is not.Xm be i. Yn be i.p(t)). Find the median v and the mean J1 for the values of () where the mean exists. and Zl and Z{ are independent random variables. what parameters are? Hint: Ca) PIXI < tl = if>(. and suppose that for given ().2) distribution with . Let Xl. where () > 0 and c = 2. the median is preferred in this case.i. 'I/1(±oo) = ±oo.
0 < 0 < 1. (b) Find the posterior density of 8 when 71"( OJ = 302 .2.. 71"1 (0 2 ) = . and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 .. does it matter which prior. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O. " J = [2:. . IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree.8). . (3) Suppose 8 has prior frequency. (J.. 0 (c) Find E(81 x) for the two priors in (a) and (b). . .72 Statistical Models.O)kO. l3(r.. This is called the geometric distribution (9(0)).25. 4'2'4 (b) Relative to (a). (d) Suppose Xl.2. 3. Goals. . .. ':. the outcome X has density p(x OJ ~ (2x/O'). 'IT . 0 < x < O.Xn are natural numbers between 1 and Band e= {I. Show that the probability of this set tends to zero as n t 00. . For this convergence..[X ~ k] ~ (1. < 0 < 1. Find the posteriordensityof8 given Xl = Xl. where a> 1 and c(a) . is used in the fonnula for p(x)? 2. X n are independent with the same distribution as X. . Let 71" denote a prior density for 8. 3.. Assume X I'V p(x) = l:.'" 1rajI Show that J + a. Then P. Compare these B's for n = 2 and 100. what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta.75. k = 0. for given (J = B. " ••. c(a). }. "X n ) = c(n 'n+a m) ' J=m.. 7r (d) Give the values of P(B n = 2 and 100. Unllormon {'13} .. Let X I. Suppose that for given 8 = 0. .5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. I XI . 1r(J)= 'u .0 < B < 1. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . 7l" or 11"1. 4..'" ..Xn = X n when 1r(B) = 1. X n be distributed as where XI. . 1. 2.. .)=1. ~ = (h I L~l Xi ~ = .=l1r(B i )p(x I 8i ). Consider an experiment in which. . ) ~ ..m+I. .. (3) Find the posterior density of 8 when 71"( 0) ~ 1.
. . X n+ Il .1= n+r+s _ ..0 E Let 9 have prior density 1r. Show in Example 1. where VI. s).and Complements 73 wherern = max(xl. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N .2. .. W. ..(1. Xn) = Xl = m for all 1 as n + 00 whatever be a. Show that a conjugate family of distributions for the Poisson family is the gamma family. 9. (a) Show that the family of priors E. In Example 1. . . Suppose Xl."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl. Va' W t . Show that the conditional distribution of (6.1. 11'0) distribution. . Next use the central limit theorem and Slutsky's theorem.··.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.2..c(b. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is . where {i E A and N E {l. is the standard normal distribution function and n _ T 2 x + n+r+s' a J. XI = XI..J:n). X n = Xn is that of (Y. where E~ 1 Xi = k. b) denote the posterior distribution.... Let (XI.n..) = n+r+s Hint: Let!3( a.1."'tF·t'. a regular model and integrable as a function of O.. Show that ?T( m I Xl.1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •. f3(r. Interpret this result. If a and b are integers. . I Xn+k) given . .• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. 11'0) distribution. b) is the distribution of (aV loW)[1 + (aV IbW)]t.8) that if in Example 1..• X n = Xn... .2.b> 1. X n+ k ) be a sample from a population with density f(x I 0).. 10. are independent standard exponential. (b) Suppose that max(xl.Xn is a sample with Xi '"'' p(x I (})..Section 1.'" . S.. Show rigorously using (1. then !3( a.2. n. . . f(x I f). Justify the following approximation to the posterior distribution where q..7 Problems ... Xn) + 5. ZI.. . D = NO has a B(N. 6. Assume that A = {x : p( x I 8) > O} does not involve O." . 7.f) = [~..
X n are i..~X) '" tnI. (. 13. Here HinJ: Given I" and ". j=1 '\' • = 1. . < 1. .i. 0 > O. has density r("" a) IT • fa(u) ~ n' r(. •• • . .Xn. Q'j > 0. . . Goals... (N. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8. the predictive distribution is the marginal distribution of X n + l .. 82 I fL. 0<0. Show that if Xl. .d. ./If (Ito. Let N = (N l .) u/' 0 < cx) L". <0<x and let7t(O) = 2exp{20}. then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I. LO. The posterior predictive distribution is the conditional distribution of X n +l given Xl.. . . D(a). 14. L. .d.i. 1 X n . The Dirichlet distribution. (b) Discuss the behavior of the two predictive distributions as 15. ..~vO}. In a Bayesian model whereX l .J t • I X. a = (a1. (12).6 ""' 1r.Xn + l arei..... Q'r)T.. . 01 )=1 j=l Uj < 1. 8 2 ) is such that vn(p. X and 5' are independent with X ~ N(I". j=1 • . compute the predictive and n + 00. O(t + v) has a X~+n distribution.J u. .. (c) Find the posterior distribution of 0". 'V .. Find the posterior distribution 7f(() I x) and show that if>' is an integer. Xl .d. "5) and N(Oo. I(x I (J).. 0> 0 a otherwise.. x n ). 0 > O. unconditionally. x> 0.2 is (called) the precision of the distribution of Xi. (a) If f and 7t are the N(O. .1 < j < T. Letp(x I OJ =exp{(xO)}. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . i). T6) densities. . . X n . X 1.)T... given x. N. 11.i."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. 9= (OJ. . . 52) of J1. vB has a XX distribution.~l .") ~ ... . The Dirichlet distribution is a conjugate prior for the multinomial..2) and we formally put 7t(I". given (x.\ > 0.. .. . where Xi known.. and () = a. Next use Bayes rule. N(I".9). v > 0..) pix I 0) ~ ({" . Note that. = 1.74 where N' ~ N Statistical Models. Find I 12. and Performance Criteria Chapter 1 + nand ({. posterior predictive distribution.O the posterior density 7t( 0 Ix). Suppose p(x I 0) is the density of i.O.) be multinomial M(n.
3 IN ~ n) is V( a + n). = 0. the possible actions are al. 8 = 0 or 8 > 0 for some parameter 8. 1957) °< °= a 0. 1C(O. 0.3.1. Find the Bayes rule for case 2. I b"9 }.5. (b)p=lq=.5.p) (I .. ..7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a). (c) Find the minimax rule among the randomized rules. I.1. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x.159 be the decision rules of Table 1. (b) Find the minimax rule among {b 1 ..) = 0.3. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St.. a) is given by 0. .. (J2. Compute and plot the risk points (a)p=q= . or 0 > a be penoted by I. (J) given by O\x a (I . a3.3.. . 1.3. Suppose the possible states of nature are (Jl. n r ). Problems for Section 1.5.3. 159 of Table 1. . (c) Find the minimax rule among bt.5 and (ii) l' = 0.3. (d) Suppose 0 has prior 1C(01) ~ 1'. 1C(02) l' = 0. = 11'. Find the Bayes rule when (i) 3. a2. Suppose that in Example 1. Let the actions corresponding to deciding whether the loss function is given by (from Lehmann. and the loss function l((J. 159 for the preceding case (a).q) and let when at.Section 1.1. . .1. then the posteriof7r(0 where n = (nt l · · · . (d) Suppose that 0 has prior 1C(OIl (a)... . The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. See Example 1. respectively and suppose .
g.. n = 1 and . Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0..d. (b) Plot the risk function when b = c = 1. .3) ..s. 0<0 0=0 0 >0 R(O. variances.0)). 11 . c<I>(y'n(sO))+b<I>(y'n(rO)).Xnjj. are known (estimates will be used in a latet chapter). We want to estimate the mean J1. Show that the strata sample sizes that minimize MSE(!i2) are given by (1. Within the jth stratum we have a sample of i.j = I. .. Stratified sampling. (. where <f> = 1 <I>.) c<f>( y'n(r . and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J". 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. < J' < s.. 1 < j < S. 1) distribution function.=l iii = N. J". 1 (l)r=s=l. and <I> is the N(O. geographic locations or age groups). (a) Compute the biases.j = 1. < 00.8. How should nj. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj. I .andastratumsamplemeanXj. 1 < j < S. random variables Xlj"". i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4. are known. ..76 StatistiCCII Models.7. 0 be chosen to make iiI unbiased? I (b) Neyman allocation. E.(X) 0 b b+c 0 c b+c b 0 where b and c are positive. and MSEs of iiI and 'j1z. Weassurne that the s samples from different strata are independent.0)) +b<f>( y'n(s . . I I I.. Assume that 0 < a.i. Suppose X is aN(B.) r=2s=1.I L LXij. Goals. b<f>(y'ns) + b<I>(y'nr).
(d) Find EIX .bl when n = compare it to EIX . set () = 0 without loss of generality.13.4. Next note that the distribution of X involves Bernoulli and multinomial trials. (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier.. X n are i. . We want to estimate "the" median lJ of F.5. 5. Let X and X denote the sample mean and median. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c. . ..2p. and = 1. U(O.1.7. Each value has probability . .. Hint: See Problem B.25. 75.b. Let Xb and X b denote the sample mean and the sample median of the sample XI b. . Also find EIX . ali X '"'' F.2. p = .) with nk given by (1.=1 Pj(O'j where a. ~ ~ ~ .=1 pp7J" aV. > 0. where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~.= 2:..p).3.. Use a numerical integration package.. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = . respectively. (c) Same as (b) except when n ~ 1.. and that n is odd. . (b) Evaluate RR when n = 1. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).I 2:. 1).2.d. l X n . I).. n ~ 1. (iii) F is normal. Suppose that X I.5(0 + I) and S ~ B(n. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0.5. .0.40. that X is the median of the sample.2.15. ° = 15..5. plot RR for p = . and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii)..20.9.. < ° P < 1.0 ~ + 2'.3.. Suppose that n is odd. [f the parameters of interest are the population mean and median of Xi .... Hint: Use Problem 1. ~ ~  ~ 6. .Section 1.2'. Hint: By Problem 1.5.45. a = .3. . Let XI.'.3. N(o.5.b. . P(X ~ b) = 1 . Hint: See Problem B.bl.0 .0 + '. b = '. > k) (ii) F is uniform.i. X n be a sample from a population with values 0.2.3) is N.bl for the situation in (i). ~ a) = P(X = c) = P. ~ 7. . (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).
In Problem 1.. ' . Find MSE(6) and MSE(P). for all 6' E 8 1.3(a) with b = c = 1 and n = I.0) = E..2 ~~ I (Xi .[X . the powerfunction. A person in charge of ordering equipment needs to estimate and uses I . I . 6' E e. .1'])2. Goals.(o(X)). then this definition coincides with the definition of an unbiased estimate of e.(1(6. (i) Show that MSE(S2) (ii) Let 0'6 c~ ..0(X))) for all 6. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. = c L~ 1 (Xi .. then expand (Xi . II. You may use the fact that E(X i . . (b) Suppose Xi _ N(I'. 9. . 8. .Xn be a sample from a population with variance a 2 .(1(6'. (a)r = 8 =1 (b)r= ~s =1.J.1'] .0(X))) < E.3) that .2)(. then a test function is unbiased in this sense if. I (b) Show that if we use the 0 1 loss function in testing.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100. I I 78 Statistical Models. 0 < a 2 < 00.4. and only if.3.10) + (. satisfies > sup{{J(6.'). and Performance Criteria Chapter 1 :! . (a) Show that s:? = (n . It is known that in the state where the company is located. I .. Let () denote the proportion of people working in a company who have a certain characteristic (e.1)1 E~ 1 (Xi . (a) Show that if 6 is real and 1(6.L)4 = 3(12.X)2 is an unbiased estimator of u 2 • Hint: Write (Xi .X)2 has a X~I distribution.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. . 10.3.X)2 keeping the square brackets intact.  ..X)' = ([Xi . ~ 2(n _ 1)1. If the true 6 is 60 .g. Show that the value of c that minimizes M S E(c/. a) = (6 . 0) (J(6'.X)2. e 8 ~ (. Let Xl.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. defined by (J(6. .a)2.. 10% have the characteristic.0): 6 E eo}. being lefthanded). Which one of the rules is the better one from the Bayes point of view? 11. A decision rule 15 is said to be unbiased if E. . i1 .
Furthersuppose that l(Bo. . 0 < u < 1. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. Define the nonrandomized test Ju . .. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise.3. o Suppose that U . consider the estimator < MSE(X). 16.1'0 I· 17.1. In Problem 1. 03) = aR(B. (h) If N ~ 10. Suppose that Po.1.ao) ~ O.Polo(X) ~ 01 = Eo(<p(X)). Consider a decision problem with the possible states of nature 81 and 82 . In Example 1. consider the loss function (1. 0. 0. If X ~ x and <p(x) = 0 we decide 90. Suppose the loss function feB.wo to X. wc dccide El. 18. Ok) as a function of B. but if 0 < <p(x) < 1. Your answer should Mw If n. use the test J u . there is a randomized procedure 03 such that R(B. if <p(x) = 1. Suppose that the set of decision procedures is finite.~ is unbiased 13. Consider the following randomized test J: Observe U. The interpretation of <p is the following.7 Problems and Complements 79 12. B = . s = r = I. then.1. In Example 1. given 0 < a < 1. show that if c ::.. Convexity ofthe risk set. z > 0.1) and let Ok be the decision rule "reject the shipment iff X > k. and then JT.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known.Section 1.' and 0 = II' . Po[o(X) ~ 11 ~ 1 . b. Show that the procedure O(X) au is admissible.7).) + (1 .3. find the set of I' where MSE(ji) depend on n. Show that if J 1 and J 2 are two randomized. and possible actions at and a2. plot R(O. If U = u.3. (a) find the value of Wo that minimizes M SE(Mw).) for all B. by 1 if<p(X) if <p(X) >u < u. procedures." (a) Show that the risk is given by (1.. and k o ~ 3.a )R(B. 15..4. Compare 02 and 03' 19.4.3. 14. a) is .UfO. Show that J agrees with rp in the sense that. (B) = 0 for some event B implies that Po (B) = 0 for all B E El.3.3. (b) find the minimum relative risk of P. (c) Same as (b) except k = 2. 1) and is independent of X. For Example 1.
Let U I . Give the minimax rule among the nonrandomized decision rules.(0. Give an example in which Z can be used to predict Y perfectly. Goals.Is Z of any value in predicting Y? = Ur + U?.4 = 0.1 calculate explicitly the best zero intercept linear predictor.8 0. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.2 0.6 (a) Compute and plot the risk points of the nonrandomized decision rules. 0 0. !. 7. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.. 4. 2. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. Give the minimax rule among the randomized decision rules. (a) Find the best predictor of Y given Z.4.(0.) the Bayes decision rule? Problems for Section 1. the best linear predictor.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. and Performance Criteria Chapter 1 B\a 0. and the best zero intercept linear predictor. In Example 1. !.9. (b) Give and plot the risk set S. al a2 0 3 2 I 0. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. What is 1. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z).4 I 0. (b) Compute the MSPEs of the predictors in (a).) = 0. Let X be a random variable with probability function p(x ! B) O\x 0. U2 be independent standard normal random variables and set Z Y = U I. In Problem B. .. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. its MSPE.1. 5.80 Statistical Models. (c) Suppose 0 has the prior distribution defined by . 3.. 6. and the ratio of its MSPE to that of the best and best linear predictors. 0. Four balls are drawn at random without replacement. An urn contains four red and four black balls.l.
Y" be the corresponding genetic and environmental components Z = ZJ + Z". (a) Show that E(IY . + z) = p(c . Suppose that Z has a density p.YI > s. 9. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . a 2. p( z) is nonincreasing for z > c. which is symmetric about c and which is unimodal. Y') have a N(p" J. If Y and Z are any two random variables.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. Let Zl and Zz be independent and have exponential distributions with density Ae\z. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . p( c all z.t.  = ElY cl + (c  co){P[Y > co] . Y)I < Ipl. 7 2 ) variables independent of each other and of (ZI. Let ZI. which is symmetric about c. exhibit a best predictor of Y given Z for mean absolute prediction error.Section 1. Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Y) has a bivariate normal distribution. where (ZI . z > 0. Let Y have a N(I". Z". that is. Yare measurements on such a variable for a randomly selected father and son. a 2. Show that c is a median of Z. 12. ICor(Z. 10. (b) Show directly that I" minimizes E(IY . Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor.L.2 ) distrihution.cl) = oQ[lc . Y = Y' + Y". Suppose that Z has a density p.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)]. Y'.el) as a function of c.z) for 11. Define Z = Z2 and Y = Zl + Z l Z2. (b) Suppose (Z. Y" are N(v. that is. 0. ° 14. Suppose that Z. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. and otherwise.7 Problems and Complements 81 Hint: If c ElY . p) distribution and Z". y l ). Show that if (Z. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c.col < eo. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables.
4.) (a) Show that 1]~y > piy. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease.1] .3. Show that.S from infection until detection. Hint: Recall that E( Z./L£(Z)) ~ maxgEL Corr'(Y.4. 1995. . IS.ElY .4. 82 Statisflcal Models. . Let /L(z) = E(Y I Z ~ z).y}. and is diagnosed with a certain disease. where piy is the population multiple correlation coefficient of Remark 1. . Consider a subject who walks into a clinic today.) ~ 1/>''. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. Yo > O. Show that Var(/L(Z))/Var(Y) = Corr'(Y. mvanant un d ersuchh./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). .15. = Corr'(Y../L)/<J and Y ~ /3Yo/<J. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18.g. Let S be the unknown date in the past when the sUbject was infected. j3 > 0.4. We are interested in the time Yo = t . (See Pearson. 16. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. Hint: See Problem 1. IS. 2 'T ' . 1). 1905.exp{ >./L(Z)) = max Corr'(Y.) = Var( Z.(y) = >. Predicting the past from the present. hat1s. a blood cell or viral load measurement) is obtained.) = E( Z.. then'fJh(Z)Y =1JZY. and Doksurn and Samarov./LdZ) be the linear prediction error. ~~y = 1 . .: (0 The best linear MSPE predictor of Y based on Z = z. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y. g(Z)) 9 where g(Z) stands for any predictor. on estimation of 1]~y.I • ! . Goals. and Var( Z. Show that p~y linear predictors. > 0. in the linear model of Remark 1.g(Z)) where £ is the set of 17.) ~ 1/>. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. (c) Let '£ = Y . . and Performance Criteria Chapter 1 . at time t. >. . y> O. I. It will be convenient to rescale the problem by introducing Z = (Zo . that is. At the same time t a diagnostic indicator Zo of the severity of the disease (e.
Hint: Use Bayes rule. and checking convexity. Cov[r(Y).Section 1.z). wbere Y j .7 and 1. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>.. 6.4. solving for (a. density. Write Cov[r(Y). where Z}.). . 19. (a) Show that ifCov[r(Y).>. + b2 Z 2 + W. (b) Show that (a) is equivalent to (1. N(z . Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. Let Y be a vector and let r(Y) and s(Y) be real valued. Find (a) The mean and variance of Y. see Berman.7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . 2000).>')1 2 } Y> ° where c = 4>(z . b) equal to zero. need to be estimated from cohort studies. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y).9. c. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. . (c) The optimal predictor of Y given X = x. x = 1.14 by setting the derivatives of R(a.~ [y  (z . 1990.>.4. 21. Find Corr(Y}. Let Y be the number of heads showing when X fair coins are tossed. s(Y)] < 00. and Nonnand and Doksum.g(Zo)l· Hint: See Problems 1.(>' . 7r(Y I z) ~ (27r) . This density is called the truncated (at zero) normal. (In practice. Y2) using (a).s(Y)] Covlr(Y). Establish 1. 1). where X is the number of spots showing when a fair die is rolled. all the unknowns. Z2 and W are independent with finite variances. (b) The MSPE of the optimal predictor of Y based on X. then = E{Cov[r(Y).z) .. (c) Show that if Z is real.4.4. 20.6) when r = s. s(Y» given Z = z. b).. I Z]' Z}. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . ..I exp { . including the "prior" 71". (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo..
Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. and suppose that Z has the beta.. 0 > 0. optimal predictor of Yz given (Yi. . density. 1. .5 1. Y z )? (f) In model (d). population where e > O. Then [y . Suppose Xl. 1 2 y' .z) = 6w (y. 0 > 0. Find the 22.2eY + e' < 2(Y' + e'). > a.6(O. 1). (a) Let w(y. 00 (b) Suppose that given Z = z. 24. show that the MSPE of the optimal predictor is .84 Statistical Models. Find Po(Z) when (i) w(y. 0) = Ox'. 3. x  .p~y). a > O. we say that there is a 50% overlap between Y 1 and Yz. Let Xl.x > 0.z) be a positive realvalued function. suppose that Zl and Z. 0) = < X < 1. .density. 1 X n be a sample from a Poisson. s). .z). .4. . z) = z(1 .3. In Example 1..14). \ (b) Establish the same result using the factorization theorem. g(Z)) < for some 9 and that Po is a density. a > O. 0) = Oa' Ix('H).g(z») is called weighted squared prediction error. z) = I. . Zz and ~V have the same variance (T2. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. Goals. 25.. Problems for Section 1. (a) p(x. .. 23.6(r. Y ~ B(n. Let Xi = 1 if the ith item drawn is bad. Zz).0> O. p(e).e)' Hint: Whatever be Y and c.z).4. n > 2. < 00 for all e. and (ii) w(y.g(z)]'lw(y. 2. z) and c is the constant that makes Po a density. z "5).. Assume that Eow (Y.e)' = Y' . 0 < z < I. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z).e' < (Y . (a) Show directly that 1 z:::: 1 Xi is sufficient for 8.. and Performance Criteria Chapter 1 (e) In the preceding model (d). z) ~ ep(y. This is known as the Weibull density. where Po(Y. This is thebeta.l .4.') and W ~ N(po. Show that EY' < 00 if and only if E(Y .~(1 . and = 0 otherwise. 0 (b) p(x. if b1 = b2 and Zl.15) yields (1.z)lw(y. Verify that solving (1. (c) p(x.Xn is a sample from a population with one of the following densities. In this case what is Corr(Y. Oax"l exp( Ox"). are N(p.
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for θ, a fixed.

4. (a) Show that T_1 and T_2 are equivalent statistics if, and only if, we can write T_2 = H(T_1) for some 1-1 transformation H of the range of T_1 into the range of T_2. Which of the following statistics are equivalent? (Prove or disprove.)

(b) Π_{i=1}^n X_i and Σ_{i=1}^n log X_i, X_i > 0

(c) Σ_{i=1}^n X_i and Σ_{i=1}^n log X_i, X_i > 0

(d) (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) and (Σ_{i=1}^n X_i, Σ_{i=1}^n (X_i - X̄)²)

(e) (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i³) and (Σ_{i=1}^n X_i, Σ_{i=1}^n (X_i - X̄)³)
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where '17o, .•.• '17k are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. (b) Find the maximum entropy densities when rj(x) = x j and (i) S ~ (0,00), k = 1, at > 0; (ii) S = R, k = 2, at E R, a, > 0; (iii) S = R, k = 3, a) E R, a, > 0, a3 E R. 29. As in Example 1.6.11, suppose that Y 1, ...• Y n are Li.d. Np(f.L. E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y = (Y ... , Yn ) is the p(p + 3)/2 canonical " exponential family generated by h = 1 and the p(p + 3)/2 statistics
Tj = Σ_{i=1}^n Yij,  1 ≤ j ≤ p;   Tjl = Σ_{i=1}^n Yij Yil,  1 ≤ j ≤ l ≤ p,
where Yi = (Yi1, ..., Yip). Show that ℰ is open and that this family is of rank p(p + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3)/2 statistics Tj(Y) = Yj, 1 ≤ j ≤ p, and Tjl(Y) = Yj Yl, 1 ≤ j ≤ l ≤ p,
generate Np(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ^{−1}. Next establish that for symmetric matrices M,
∫ exp{−u^T M u} du < ∞  iff  M is positive definite
by using the spectral decomposition (see B.10.1.2)
M = Σ_{j=1}^p λj ej ej^T  for e1, ..., ep orthogonal, λj ∈ R.
To show that the family has full rank m, use induction on p to show that if Z1, ..., Zp are i.i.d. N(0, 1) and if B_{p×p} = (bjl) is symmetric, then
P[Σ_{j=1}^p aj Zj + Σ_{j,l} bjl Zj Zl = c] = P(a^T Z + Z^T B Z = c) = 0
unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ Np(μ, Σ), then Y = SZ for some nonsingular p × p matrix S.
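The full-rank claim can also be checked numerically for a small case. The sketch below (not part of the original problem) takes p = 2 with an arbitrary illustrative choice of μ and Σ, simulates Y ~ N2(μ, Σ), and verifies that the sample covariance matrix of the p(p + 3)/2 = 5 statistics is nonsingular, which is what rules out a nontrivial affine relation a^T Z + Z^T B Z = c holding with probability one.

    import numpy as np

    rng = np.random.default_rng(5)
    # illustrative mean and covariance; any nondegenerate choice works
    Y = rng.multivariate_normal(mean=[0.0, 1.0],
                                cov=[[2.0, 0.5], [0.5, 1.0]], size=20000)

    # the p(p + 3)/2 = 5 statistics T_j = Y_j and T_jl = Y_j Y_l, 1 <= j <= l <= p
    T = np.column_stack([Y[:, 0], Y[:, 1],
                         Y[:, 0] ** 2, Y[:, 0] * Y[:, 1], Y[:, 1] ** 2])

    # nonsingular covariance of T  <=>  no exact affine relation among the statistics
    eigvals = np.linalg.eigvalsh(np.cov(T, rowvar=False))
    print(eigvals.min() > 0, eigvals)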
30. Show that if X1, ..., Xn are i.i.d. Np(θ, Σ0) given θ where Σ0 is known, then the Np(λ, Γ) family is conjugate to Np(θ, Σ0), where λ varies freely in R^p and Γ ranges over all p × p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(μj, τj) : 1 ≤ j ≤ k} be a given collection of pairs with μj ∈ R, τj > 0. Let (μ, τ) be a random pair with λj = P((μ, τ) = (μj, τj)), 0 < λj < 1, Σ_{j=1}^k λj = 1. Let θ be a random variable whose conditional distribution given (μ, τ) = (μj, τj) is normal, N(μj, τj²). Consider the model X = θ + ε, where θ and ε are independent and ε ~ N(0, σ0²), σ0² known. Note that θ has the prior density
π(θ) = Σ_{j=1}^k λj φ_{τj}(θ − μj)    (1.7.4)
where φτ denotes the N(0, τ²) density. Also note that (X | θ) has the N(θ, σ0²) distribution.
(a) Find the posterior
π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μj, τj) | x) π(θ | (μj, τj), x)
and write it in the form
Σ_{j=1}^k λj(x) φ_{τj(x)}(θ − μj(x))
for appropriate λj(x), τj(x) and μj(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ0²) distribution. (b) Let Xi = θ + εi, 1 ≤ i ≤ n, where θ is as previously and ε1, ..., εn are i.i.d. N(0, σ0²). Find the posterior π(θ | x1, ..., xn), and show that it belongs to the class (1.7.4). Hint: Consider the sufficient statistic for p(x | θ).
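The algebra requested in (a) is standard normal-normal conjugacy applied componentwise; the following sketch (the values of λj, μj, τj, σ0, and x are illustrative, not from the text) computes the posterior mixture weights, means, and standard deviations.

    import numpy as np
    from scipy.stats import norm

    # illustrative prior of the form (1.7.4): a mixture of N(mu_j, tau_j^2) components
    lam = np.array([0.3, 0.7])
    mu = np.array([-1.0, 2.0])
    tau = np.array([1.0, 0.5])
    sigma0 = 1.0          # sd of the error in X = theta + eps
    x = 0.8               # a single observation

    # under component j the marginal of X is N(mu_j, tau_j^2 + sigma0^2),
    # which gives the posterior mixture weights lambda_j(x);
    # each component posterior for theta is again normal
    marg = norm.pdf(x, loc=mu, scale=np.sqrt(tau ** 2 + sigma0 ** 2))
    lam_post = lam * marg / np.sum(lam * marg)
    mu_post = (tau ** 2 * x + sigma0 ** 2 * mu) / (tau ** 2 + sigma0 ** 2)
    tau_post = np.sqrt(tau ** 2 * sigma0 ** 2 / (tau ** 2 + sigma0 ** 2))
    print(lam_post, mu_post, tau_post)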
32. A Hierarchical Binomial-Beta Model. Let {(rj, sj) : 1 ≤ j ≤ k} be a given collection of pairs with rj > 0, sj > 0, let (R, S) be a random pair with P(R = rj, S = sj) = λj, 0 < λj < 1, Σ_{j=1}^k λj = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density
π(θ) = Σ_{j=1}^k λj π(θ, rj, sj).    (1.7.5)
Find the posterior
π(θ | x) = Σ_{j=1}^k P(R = rj, S = sj | x) π(θ | (rj, sj), x)
and show that it can be written in the form Σ_{j=1}^k λj(x) π(θ, rj(x), sj(x)) for appropriate λj(x), rj(x) and sj(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ X ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η. Hint:
Cov_η(ψ(X), X) = (1/2) E{(X − X')[ψ(X) − ψ(X')]}
where X and X' are independent and identically distributed (see A.11.12).
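The identity in the hint is easy to check by Monte Carlo; in the sketch below the choice ψ = tanh and the exponential sample are illustrative assumptions, not from the text.

    import numpy as np

    rng = np.random.default_rng(6)
    psi = np.tanh                          # a nonconstant, nondecreasing psi
    X = rng.exponential(size=200000)
    Xp = rng.exponential(size=200000)      # an independent copy X'

    lhs = np.mean(psi(X) * X) - np.mean(psi(X)) * np.mean(X)      # Cov(psi(X), X)
    rhs = 0.5 * np.mean((X - Xp) * (psi(X) - psi(Xp)))
    print(lhs, rhs)                        # both positive and approximately equal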
34. Let (X1, ..., Xn) be a stationary Markov chain with two states 0 and 1. That is,
P[Xi = εi | X1 = ε1, ..., X_{i−1} = ε_{i−1}] = P[Xi = εi | X_{i−1} = ε_{i−1}] = p_{ε_{i−1} εi}
where
( p00  p01 )
( p10  p11 )
is the matrix of transition probabilities. Suppose further that
(i) p00 = p11 = p, so that p10 = p01 = 1 − p,
(ii) P[X1 = 0] = P[X1 = 1] = 1/2.
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N00 + N11, where Nij is the number of transitions from i to j. For example, 01011 has N01 = 2, N11 = 1, N00 = 0, N10 = 1.
(b) Show that E(T) = (n − 1)p (by the method of indicators or otherwise).
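Part (b) can also be checked by simulation; the chain length n = 20, the value p = 0.7, and the number of replications below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(7)
    p, n, reps = 0.7, 20, 5000
    totals = np.zeros(reps)
    for r in range(reps):
        x = np.empty(n, dtype=int)
        x[0] = rng.integers(0, 2)              # P[X_1 = 0] = P[X_1 = 1] = 1/2
        for i in range(1, n):
            stay = rng.random() < p            # p00 = p11 = p
            x[i] = x[i - 1] if stay else 1 - x[i - 1]
        totals[r] = np.sum(x[1:] == x[:-1])    # T = N_00 + N_11
    print(totals.mean(), (n - 1) * p)          # the two values are close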
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X1, ..., Xn and Y1, ..., Yn are independent N(μ1, σ²) and N(μ2, σ²) samples, respectively. Consider the prior π for which, for some r > 0, k > 0, rσ^{−2} has a χ²_k distribution and, given σ², μ1 and μ2 are independent with N(ξ1, σ²/k1) and N(ξ2, σ²/k2) distributions, respectively, where ξj ∈ R, kj > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is
f(x, μ, λ) = [λ/2π]^{1/2} x^{−3/2} exp{−λ(x − μ)²/(2μ²x)},  x > 0, μ > 0, λ > 0.
(a) Show that this is an exponential family generated by T(X) = −(1/2)(X, X^{−1})^T and h(x) = (2π)^{−1/2} x^{−3/2}.
(b) Show that the canonical parameters η1, η2 are given by η1 = μ^{−2}λ, η2 = λ, and that A(η1, η2) = −[(1/2) log(η2) + √(η1η2)], ℰ = [0, ∞) × (0, ∞).
(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ³λ^{−1}, E(X^{−1}) = μ^{−1} + λ^{−1}, Var(X^{−1}) = (λμ)^{−1} + 2λ^{−2}.
(d) Suppose μ = μ0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.
(e) Suppose that λ = λ0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.
(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.
37. Let X1, ..., Xn be i.i.d. as X ~ Np(θ, Σ0) where Σ0 is known. Show that the conjugate prior generated by (1.6.20) is the Np(η0, τ0² I) family, where η0 varies freely in R^p, τ0² > 0 and I is the p × p identity matrix.
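The moment formulas in part (c) of Problem 36 can be verified numerically from the density alone; in the sketch below the values μ = 2 and λ = 3 are arbitrary illustrative choices.

    import numpy as np
    from scipy.integrate import quad

    mu, lam = 2.0, 3.0

    def f(x):
        # the IG(mu, lambda) density of Problem 36
        return np.sqrt(lam / (2 * np.pi)) * x ** (-1.5) * \
            np.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x))

    # lower limit slightly above 0 to avoid evaluating the density at x = 0
    mass, _ = quad(f, 1e-9, np.inf)
    EX, _ = quad(lambda x: x * f(x), 1e-9, np.inf)
    EX2, _ = quad(lambda x: x ** 2 * f(x), 1e-9, np.inf)
    EXinv, _ = quad(lambda x: f(x) / x, 1e-9, np.inf)
    print(mass)                          # about 1
    print(EX, mu)                        # E(X) = mu
    print(EX2 - EX ** 2, mu ** 3 / lam)  # Var(X) = mu^3 / lambda
    print(EXinv, 1 / mu + 1 / lam)       # E(1/X) = 1/mu + 1/lambda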
38. Let Xi = (Zi, Yi)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X1, ..., Xn as a canonical exponential family and identify T, h, A, and ℰ. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y1, ..., Yn are independent, Yi ~ N(μi, σ²), n ≥ 4.
(a) Write the distribution of Y1, ..., Yn in canonical exponential family form. Identify T, h, η, A, and ℰ. (b) Next suppose that μi depends on the value zi of some covariate and consider the submodel defined by the map η : (θ1, θ2, θ3)^T → (μ^T, σ²)^T where η is determined by
μi = exp{θ1 + θ2 zi},  z1 < z2 < ... < zn;   σ² = θ3
where θ1 ∈ R, θ2 ∈ R, θ3 > 0. This model is sometimes used when μi is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
40. Suppose Y1, ..., Yn are independent exponentially, ℰ(λi), distributed survival times, n ≥ 3. (a) Write the distribution of Y1, ..., Yn in canonical exponential family form. Identify T, h, η, A, and ℰ. (b) Recall that μi = E(Yi) = λi^{−1}. Suppose μi depends on the value zi of a covariate. Because μi > 0, μi is sometimes modeled as
μi = exp{θ1 + θ2 zi},  i = 1, ..., n,
where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2.
1.8 NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the Pθ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dPθ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3
(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than those close to θ.
(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4
(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield, Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6
(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).
(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general, if μ is a σ-finite measure on the sample space X, p(x, θ) as given by (1.6.1)
can be taken to be the density of X with respect to μ; see Lehmann (1997), for instance. This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.

Note for Section 1.7
(1) u^T M u > 0 for all p × 1 vectors u ≠ 0.

1.9 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BERMAN, S. M., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77 (1990).
BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist., 6, 266-291 (1978).
BLACKWELL, D., AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.
BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc. A, 143, 383-430 (1979).
BROWN, L. D., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory Hayward, CA: IMS Lecture Notes-Monograph Series, 1986.
CARROLL, R. J., AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.
COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.
DOKSUM, K. A., AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist., 23, 1443-1473 (1995).
FERGUSON, T. S., Mathematical Statistics New York: Academic Press, 1967.
FEYNMAN, R. P., R. B. LEIGHTON, AND M. SANDS, The Feynman Lectures on Physics, Ch. 40, Statistical Mechanics of Physics Reading, MA: Addison-Wesley, 1963.
GRENANDER, U., AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.
HODGES, J. L., JR., D. KRECH, AND R. S. CRUTCHFIELD, Statlab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.
KENDALL, M. G., AND A. STUART, The Advanced Theory of Statistics, Vols. I, II, and III New York: Hafner Publishing Co., 1961.
LEHMANN, E. L., "A Theory of Some Multiple Decision Problems, I and II," Ann. Math. Statist., 547-572 (1957).
LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science, 5, 160-168 (1990).
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.
NORMAND, S-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. Ahmed and N. Reid. Lecture Notes in Statistics, New York: Springer, 2000.
PEARSON, K., "On the General Theory of Skew Correlation and Nonlinear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Biometrics Series II, London: Dulau & Co.)
RAIFFA, H., AND R. SCHLAIFER, Applied Statistical Decision Theory Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961.
SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.
SAVAGE, L. J., ET AL., The Foundation of Statistical Inference London: Methuen & Co., 1962.
SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.
WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
X """ PEP. (2. DC8 0l 9) measures the (population) discrepancy between 8 and the true value 8 0 of the parameter. we don't know the truth so this is inoperable. how do we select reasonable estimates for 8 itself? That is. the true 8 0 is an interior point of e.1) Arguing heuristically again we are led to estimates B that solve \lOp(X. Of course. but in a very weak sense (unbiasedness). X E X.8) as a function of 8. =0 (2.Chapter 2 METHODS OF ESTIMATION 2. Now suppose e is Euclidean C R d .O) EOoP(X. how do we find a function 8(X) of the vector observation X that in some sense "is close" to the unknown 81 The fundamental heuristic is typically the following. That is. In this parametric case. The equations (2. p(X. Then we expect  \lOD(Oo. Estimating Equations Our basic framework is as before.1 BASIC HEURISTICS OF ESTIMATION Minimum Contrast Estimates. As a function of 8.2) 99 . 8).1 2. So it is natural to consider 8(X) minimizing p(X.1.1.O) where V denotes the gradient. usually parametrized as P = {PO: 8 E e}.1. 6) ~ o.8) is an estimate of D(8 0 . we could obtain 8 0 as the minimizer. In order for p to be a contrast function we require that DC 8 0l 9) is uniquely minimized for 8 = (Jo. and 8 Jo D( 8 0 . if PO o were true and we knew DC 8 0 .2) define a special form of estimating equations.1.8).O). This is the most general fonn of the minimum contrast estimate we shall consider in the next section. We consider a function that we shall call a contrast function  p:Xx8>R and define D(Oo.8) is smooth.
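The following is a minimal numerical sketch of the heuristic just described (the quadratic contrast, the simulated sample, and the true value θ0 = 2 are illustrative choices, not from the text): the contrast ρ(X, θ) = Σ(Xi − θ)² is minimized numerically, and the minimizer coincides with the solution of the estimating equation obtained by setting the gradient of ρ to zero, here the sample mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=200)   # sample with true theta_0 = 2

    # quadratic contrast rho(X, theta); its population version D(theta_0, theta)
    # is uniquely minimized at theta = theta_0
    rho = lambda theta: np.sum((x - theta) ** 2)

    theta_hat = minimize_scalar(rho, bounds=(-10, 10), method="bounded").x
    # the estimating equation  d/dtheta rho = -2 * sum(x - theta) = 0  gives the same answer
    print(theta_hat, x.mean())                     # the two values agree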
.!I [" '1 L • . Evidently. ~ Then we say 8 solving (2. for convenience.2) or equivalently the system of estimating eq!1ations.z. z.1. I D({3o. An estimate (3 that minimizes p(X. I In the important linear case.p(X. ZI). we take n p(X.zd = LZij!3j j=1 andzi = (Zil. .1.4 with I'(z) = g({3.1.1..1. Wd)T V(l1 o. Consider the parametric version of the regression model of Example 1. IL( z) = (g({3. . suppose we postulate that the Ei of Example 1.z)l: 1{31 ~ oo} 'I = 00 I (Problem 2.4) w(X. If. J which is indeed minimized at {3 = {3o and uniquely so if and only if the parametrization is identifiable. Here is an example to be pursued later. z).4 R d.1. g({3. Then {3 parametrizes the model and we can compute (see Problem 2. i=l 3 I (2. A naiural(l) function p(X. suppose we are given a function W : X and define Methods of Estimation Chapter 2 X R d . .7) i=l J . there is a substantial overlap between the two classes of estimates. :i = ~8g~ ~ L.)g(l3. then {3 satisfies the equation (2. d g({3. "'.). {3 E R d .d.1.10). ~ ~8g~ L.)J i=1 ~ 2 (2.1.1.16). further. (1). {3) to consider is the squared Euclidean distance between the vector Y of observed Yi and the vector expectation of Y. Zn»T That is.3) = 0 has 8 0 as its unique solution for all 8 0 ~ E e.Zid)T 'j • r . Here the data are X = {(Zi.i + L[g({3o.g({3..)Y. ua).8/3 ({3. z) is continuous and ! I lim{lg(l3. z.('l/Jl. . (3) = !Y . . But. The estimate (3 is called the least squares estimate. z.8) ~ E I1 .1. W(X.1. z) is differentiable in {3. (3) exists if g({3. . where the function 9 js known. (2.) .' .. (3) E{3. (3) n n<T.5) Strictly speaking P is not fully defined here and this is a point we shall explore later.1L1 2 = L[Yi . W . Yn are independent.4 are i. (1) Suppose V (8 0 .g({3.6) .. g({3. N(O.»)'.i. Least Squares.100 More generally. i=l (2.z.1. z. l'i) : 1 < i < n} where Yi.I1) =0 is an estimating equation estimate.8/3 ({3. Example 2..
1 1j ~_l".(8). 1 1 < j < d. .d. we need to be able to express 9 as a continuous function 9 of the first d moments. These equations are commonly written in matrix fonn (2.  n n. provides a first example of both minimum contrast and estimating equation methods.2. . L. . ~ . N(O. and then using h(p!.1 <j < d if it exists. we assume the existence of Define the jth sample moment f1j by. We return to the remark that this estimating method is well defined even if the Ci are not i.. More generally. if we want to estimate a Rkvalued function q(8) of 9.Xn are i. To apply the method of moments to the problem of estimating 9.1 from R d to Rd. . d > k. . .8) the normal equations.i... /1j converges in probability to flj(fJ). we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1.. . This very important example is pursued further in Section 2.d. I'd of X.Section 2. Suppose Xl.1. In fact. = 1'. ... Method a/Moments (MOM). once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi). thus. 0 Example 2. say q(8) = h(I'I. u5). The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po. 1 < i < n}... Thus. 8 E R d and 8 is identifiable. I'd( 8) are the first d moments of the population we are sampling from.Xi t=l . Suppose that 1'1 (8). which can be judged on its merits whatever the true P governing X is..1. lid) as the estimate of q( 8). Thus.1.9) where Zv IIZijllnxd is the design matrix. Here is another basic estimating equation example. . I'd). Least squares.1 Basic Heuristics of Estimation 101 the system becomes (2. . . as X ~ P8.i.. The method of moments prescribes that we estimate 9 by the solution of p. suppose is 1 .2 and Chapter = 6.
.·)  l~t. This algorithm and others will be discussed more extensively in Section 2.3. We can. Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category. A>O. As an illustration consider a population of men whose occupations fall in one of five different job categories. Suppose we observe multinomial trials in which the values VI. with density [A"/r(. Here k = 5. 1968).2 and 0=2 = n 1 EX1.2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles. for instance. '\). the method of moment estimator is not unique..•. . in particular Problem 6. l Vk of the population being sampled are known. 0 Algorithmic issues We note that. ~ iii ~ (X/a)'.Pk are completely unknown. . 2. and 1" ~ E(X') = .1 0) ...102 Methods of Estimation Chapter 2 For instance. I.d.1. . f(u.11). . or 5. .. A = Jl1/a 2 .)]xO1exp{Ax}.i. There are many algorithms for optimization and root finding that can be employed.10. .. .4 and in Chapter 6.1. the proportion of sample values equal to Vi.>0.1. x>O. Example 2. case.(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm. consider a study in which the survival time X is modeled to have a gamma distribution. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments.. .·) fli =D'I'(X. but their respective probabilities PI.(1 + ")/A 2 Solving . An algorithm for estimating equations frequently used when computationofM(X. two other basic heuristics particularly applicable in the i. in general. Here is some job category data (Mosteller. A = X/a' where a 2 = fl. as X and Ni = number of indices j such that X j = Vi. It is defined by initializing with eo. .X 2 • In this example. In this case f) ~ ('" A). .5. . then setting (2. If we let Xl.. 4. Frequency Plugin(2) and Extension. 2.. 3. I. then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn.6. 1'1 for () gives = E(X) ~ "/ A.. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn. i = 1. .~ (I'l/a)'. X n be Ltd. express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2. Vi = i.
can be identified with a parameter v : P ~ R. . 1 < think of this model as P = {all probability distributions Pan {VI. . .P2. Equivalently.31 5 95 0.12) If N i is the number of individuals of type i in the sample of size n..53 = 0. + P3). lPk).. If we assume the three different genotypes are identifiable. 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8).. .}}. then (NIl N 2 .Pk) = (~l .0 < 0 < 1. .. The frequency plugin principle simply proposes to replace the unknown population frequencies PI. (2..12). 1~!.44 .03 2 84 0. Pk) of the population proportions. .pl . Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles.12 3 289 0. suppose that in the previous job category table. Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h. and Then q(p) (p..13 n2::. P3 PI = 0 = (1. Now suppose that the proportions PI' . HardyWeinberg Equilibrium. If we use the frequency substitution principle.(P2 + P3).0).1.»i=1 for Danish men whose fathers were in category 3..Ps) = (P4 + Ps) ... whereas categories 2 and 3 correspond to whitecollar jobs. .. P3) given by (2. the estimate is which in our case is 0. together with the estimates Pi = NiJn.Pk) with Pi ~ PIX = 'Vi]. .0. .. . .1. 1 Nk/n.09. P2 ~ 20(1. . .Ni =708 2:::o.. i < k.1. v. . Suppose . We would be interested in estimating q(P"". For instance. the multinomial empirical distribution of Xl." .Xn .. Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. let P dennte p ~ (P". That is.1. the difference in the proportions of bluecollar and whitecollar workers. use (2.1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0. . .41 4 217 0.Pk by the observable sample frequencies Nt/n..4. that is. N 3 ) has a multinomial distribution with parameters (n..11) to estimate q(Pll··.. we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 . Next consider the marc general problem of estimating a continuous function q(Pl' . v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI.. categories 4 and 5 correspond to bluecollar jobs.Section 2.0)2. in v(P) by 0 P ). Example 2.
suppose that we want to estimate a continuous Rlvalued function q of e... We shall consider in Chapters 3 (Example 3. X n are Ll. If w.1. that is.14) As we saw in the HardyWeinberg case.1. . .15) ~ ".. 0 write 0 = 1 .d. by the law of large numbers. Then (2. thus. Fu'(a) = sup{x. case if P is the space of all distributions of X and Xl.E have an estimate P of PEP such that PEP and v : P 4 T is a parameter. E A) n. ( . .1.1.. . ... the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X. we can usually express q(8) as a continuous function of PI. I) and ! F'(a) " = inf{x.13) defines an extension of v from Po to P via v. I: t=l (2. if X is real and F(x) = P(X < x) is the distribution function (dJ.13) Given h we can apply the extension principle to estimate q( 8) as...(IJ). in the Li.14) are not unique.). . . Note.Pk.Xn ) =h N. . consider va(P) = [Fl (a) + Fu 1(a )1. "·1 is.1.d.13) and estimate (2. . a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance. .Pk(IJ»). we can use the principle we have introduced and estimate by J N l In. q(lJ) with h defined and continuous on = h(p. Nk) . however. Po ~ R given by v(PO) = q(O).. that we can also Nsfn is also a plausible estimate of O. . then v(P) is the plugin estimate of v.104 Methods of Estimation Chapter 2 we want to estimate fJ. . The plugin and extension principles can be abstractly stated as follows: Plugin principle. Because f) = ~.. where a E (0.4) and 5 how to choose among such estimates. P > R where v(P) . Let be a submodel of P. (2.Pk are continuous functions of 8. F(x) < a}.16) . If PI. ..4.1. the frequency of one of the alleles. .1. Now q(lJ) can be identified if IJ is identifiable by a parameter v .'" . as X p. T(Xl./P3 and. We can think of the extension principle alternatively as follows. the representation (2. F(x) > a}. (2. (2. 1  V In general. In particular.h(p) and v(P) = v(P) for P E Po.
but only VI(P) = X is a sensible estimate of v(P).1. Here x ~ = median. because when P is not symmetric. if X is real and P is the class of distributions with EIXlj < 00.1.. let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero. e e Remark 2.i.1. in the multinomial examples 2.' Nl = h N () ~ ~ = v(P) and P is the empirical distribution.2. these principles are general. Pe as given by the HardyWeinberg p(O). This reasoning extends to the general i.1 I:~ 1 Xl. A natural estimate is the ath sample quantile . ~ I ~ ~ Extension principle. v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R. is a continuous map from = [0. is called the sample median.13). P 'Ie Po. .i. II to P.1.12) and to more general method of moment estimates (Problem 2. However. (P) is called the population (2.tj = E(Xj) in this nonparametric . As stated. Remark 2. i=1 k /1j ~ = ~ LXI 1=1 I n .14. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P). For instance. With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP..3 and 2. then Va. 1/.1 Basic Heuristics of Estimation 105 V 1. they are mainly applied in the i. (P) is the ath population quantile Xa.) h(p) ~ L vip.d.1.4. case (Problem 2. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero.d. If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po. Here x!. The plugin and extension principles must be calibrated with the target parameter. P E Po. and v are continuous.1. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter.1. context is the jth sample moment v(P) = xjdF(x) ~ 0. casebut see Problem 2. = Lvi n i= 1 k . The plugin and extension principles are used when Pe. For instance. ~ ~ . then v(P) is an extension (and plugin) estimate of viP).Section 2. then the plugin estimate of the jth moment v(P) = f.1.1.17) where F is the empirical dJ. = v(P). ~ For a second example. the sample median V2(P) does not converge in probability to Ep(X).
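A short sketch of the plug-in principle applied to the empirical distribution (the exponential sample and sample size are illustrative assumptions, not from the text): evaluating a parameter at the empirical distribution gives the sample quantile, the sample mean, and the sample median as plug-in estimates; for an asymmetric P the mean and the median target different population quantities, as noted above.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=1.0, size=500)       # illustrative sample, P not symmetric

    alpha = 0.5
    x_sorted = np.sort(x)
    # one standard convention for the empirical alpha-quantile F-hat^{-1}(alpha)
    nu_hat_quantile = x_sorted[int(np.ceil(alpha * len(x))) - 1]

    nu1_hat = x.mean()      # plug-in estimate of the mean of P
    nu2_hat = np.median(x)  # plug-in estimate of the median of P
    # for this exponential example the population mean is 1 and the median is log 2,
    # so the two plug-in estimates estimate different quantities
    print(nu_hat_quantile, nu1_hat, nu2_hat)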
X(l . What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are. 'I I[ i['l i . See Section 2. because Po = P(X = 0) = exp{ O}. If the model fits. . Example 2.2 with assumptions (1)(4) holding.4. they are often difficult to compute. To estimate the population variance B( 1 . a special type of minimum contrast and estimating equation method.8) we are led by the first moment to the estimate. 2. Suppose that Xl. . valuable as preliminary estimates in algorithms that search for more efficient estimates. X n is a N(f1. .1.1.1. I . Thus. we may arrive at different types of estimates than those discussed in this section. " .2.1 is a method of moments estimate of B.d.1. there are often several method of moments estimates for the same q(9). Unfortunately. Because f.4. However.1 because in this model B is always at least as large as X (n)' 0 As we have seen.7. X. as we shall see in Section 2. The method of moments estimates of f1 and a 2 are X and 2 0 a. Plugin is not the optimal way to go for the Bayes. X). where Po is n. . For instance. these are the frequency plugin (substitution) estimates (see Problem 2..3 where X" . a saving grace becomes apparent in Chapters 5 and 6. Because we are dealing with (unrestricted) Bernoulli trials. estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J. a frequency plugin estimate of 0 is Iogpo.4. U {I.) = ~ (0 + I). if we are sampling from a Poisson population with parameter B.. or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. [ [ . . and there are best extensions. ". minimax.3. For example. . we find I' = E. Suppose X I. 0 = 21' . • .5.1. as we shall see in Chapter 3.. •i .6. Discussion. for large amounts of data. We will make a selection among such procedures in Chapter 3. (b) If the sample size is large. 0 Example 2. Example 2. Moreover. Algorithms for their computation will be introduced in Section 2. (X. the frequency of successes. When we consider optimality principles.Ll). the plugin principle is justified. those obtained by the method of maximum likelihood. . Remark 2.X). This is clearly a foolish estimate if X (n) = max Xi> 2X . C' X n are ij.LI (8) = (J the method of moments leads to the natural estimate of 8.l [iX. a 2 ) sample as in Example 1.5. It does turn out that there are "best" frequency plugin estimates..106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates. The method of moments can lead to either the sample mean or the sample variance. O}.. Estimating the Size of a Population (continued). then B is both the population mean and the population variance. X n are the indicators of a set of Bernoulli trials with probability of success fJ. )" I I • . = 0]. therefore.I and 2X . In Example 1. . . optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions. This minimal property is discussed in Section 5. these estimates are likely to be close to the value estimated (consistency)..
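A small simulation, assuming X uniform on {1, ..., θ} as in the population-size example above, comparing the method of moments estimate 2X̄ − 1 with the observed maximum (the sample size and θ = 100 below are arbitrary illustrative choices).

    import numpy as np

    rng = np.random.default_rng(2)
    theta_true = 100
    x = rng.integers(1, theta_true + 1, size=50)   # sample from the uniform {1, ..., theta}

    theta_mom = 2 * x.mean() - 1                   # method of moments: mu_1 = (theta + 1) / 2
    theta_max = x.max()                            # theta must be at least X_(n)
    print(theta_mom, theta_max)                    # theta_mom can fall below the observed maximum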
. ~ ~ ~ ~ ~ v we find a parameter v such that v(p.2 Minimum Contrast Estimates and Estimating Equations 107 Summary. 0 E e. Let Po and P be two statistical models for X with Po c P. I < i < n. The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. For data {(Zi. z) = ZT{3. For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. It is of great importance in many areas of statistics such as the analysis of variance and regression theory.Section 2. f. 0).2. .Zi)j2. then is called the empirical PIE. Zi). 1 < j < k. is parametric and a vector q( 8) is to be estimated..O) = O.) = q(O) and call v(P) a plugin estimator of q(6). a least squares estimate of {3 is a minimizer of p(X. where Pj is the probability of the jth category. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter..2 2. If P is an estimate of P with PEP.lJ) ~ E'o"p(X. The general principles are shown to be related to each other.. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters... Method of moment estimates are empirical PIEs based on v(P) = (f. P E Po. For this contrast.ll. where ZD = IIziJ·llnxd is called the design matrix. When P is the empirical probability distribution P E defined by Pe(A) ~ n.1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement.lj = E( XJ\ 1 < j < d. li) : I < i < n} with li independent and E(Yi ) = g((3. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P). A minimum contrast estimator is a minimizer of p(X. Suppose X . In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6..Pk).g((3.. In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph ..1 L:~ 1 I[X i E AI. ..(3) ~ L[li . the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. P. when g({3. ~ ~ ~ ~ 2. and the contrast estimating equations are V Op(X. If P = PO. .ld)T where f. D(P) is called the extensionplug·in estimate of v(P). 6). where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients.
That is.::'hc. < i < n.3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3. .d. I' .g(l3.2. The contrast p(X. I Zj = Zj). 1 < j < n.. 1 < i < n.Y) such thatE(Y I Z = z) = g(j3. .C=ha:::p::.4}{2.::2 : In Example 2.E:::'::'. which satisfies (but is not fully specified by) (2.) = 0. • .2).z..) i.z).::.. i=l (2. (Z" Y. (Zi.=..i.2. is modeled as a sample from a joint distribution. then 13 has an interpretation as a parameter on p. This follows "I . Note that the joint distribution H of (C}.1 we considered the nonlinear (and linear) Gaussian model Po given by Y. .3) continues to be valid.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3. 1_' . E«.2. as (Z.) + L[g(l3 o.En) is any distribution satisfying the GaussMarkov assumptions.g(l3.2. Z could be educationallevel and Y income..108_~ ~_ _~ ~ ~cMc. " (2. .2.. (3)  E p p(X. the model is semiparametric with {3. a 2 and H unknown. that is. we can compute still Dp(l3o. aJ) and (3 ranges over R d or an open subset. z. .g(l3.) = E(Y.) + <" I < i < n n (2.E(Y..g(l3.).. This is frequently the case for studies in the social and biological sciences.13) n n i=1  L Varp«. .:1. I) are i. . Y) ~ PEP = {Alljnintdistributions of(Z.) =0.. fj) because (2.1. 13) = LIY. 13 E R d }.zn)f is 11. 1 < i < j < n 1 > 0\ . . and 13 ~ (g(l3.2.z. Z)f.5) (2. . NCO. z.. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj. .d.2. Yi). Zn»)T is 1.z.4 with <j simply defined as Y.i.zt}. (2. The least squares method of estimation applies only to the parameters {31' . I<i<n.2.2..2. where Ci = g(l3.2) Are least squares estimates still reasonable? For P in the semiparametric model P. j.»)2.6) is difficult Sometimes z can be viewed as the realization of a population variable Z.m::a. " .1. For instance. . as in (a) of Example 1. If we consider this model.o=nC. The estimates continue to be reasonable under the GaussMarkov assumptions.(z. z. = 0. .:.) . E«.4) (2. 13 = I3(P) is the miniml2er of E(Y .6) Var( €i) = u 2 COV(fi. that is.) = g(l3.).g(l3. .1. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ. j.f3d and is often applied in situations in which specification of the model beyond (2.:0::d=':Of..
1).1. For nonlinear cases we can use numerical methods to solve the estimating equations (2. We continue our discussion for this important special case for which explicit fonnulae and theory have been derived. a further modeling step is often taken and it is assumed that {Zll .1.2.Section 2. and Volnme n. J for Zo an interior point of the domain.2.4).zo). Zd. In that case.2.. Yi). where P is the empirical distribution assigning mass n.6). {3d). E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj. is called the linear (multiple) regression modeL For the data {(Zi. (2. .. = py . As we noted in Example 2...3 and 6.L{3jPj. E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2. 2:). Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous. i = 1. .2. See also Problem 2. and Seber and Wild (1989).5) and (2.2.2.4.41. as we have seen in Section lA.5. I)/. Sections 6. . in conjnnction with (2.2.9) . In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). Fan and Gijbels (1996). z) = zT (3.1 the most commonly used 9 in these models is g({3.2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. see Rnppert and Wand (1994). y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z .7) (2.4. n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix. . (2) If as we discussed earlier. we are in a situation in which it is plausible to assume that (Zi..2..I to each of the n pairs (Zil Yi). The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth.1. (2.8) d f30 = («(3" . j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials. which. j=1 (2.7).
The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d. . il. i=l (2.2. exists and is unique and satisfies the Donnal equations (2.2. p.1. il. the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. il. Example 2.il. I I.8). Example 2. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2.2. ~ = (lin) L:~ 1Yi = ii.) ~ 0.) = il" a~.10) Here are some examples. Nine samples of soil were treated with different amounts Z of phosphorus. Estimation of {3 in the linear regression model. If we run several experiments with the same z using plants and soils that are as nearly identical as possible. .no Furthennore. we have a Gaussian linear regression model for Yi.) = O. if the parametrization {3 ) ZD{3 is identifiable. whose solution is .) = 1 and the normal equation is L:~ 1(Yi .1 '. < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil. g(z. LZi(Yi .2. 1 < i < n. L 1 that.2.I 'i. . The nonnal equations are n i=l n ~(Yi .1. given Zi = 1 < i < n.2. necessarily.ill .2. Zi.(3zz. Following are the results of an experiment to which a regression model can be applied (Sned. we assume that for a given z. 0 Ii.11) When the Zi'S are not all equal.25.2. we will find that the values of y will not be the same. the least squares estimate. We want to estimate f31 and {32. For certain chemicals and plants. is independent of Z and has a N(O.ecor and Cochran. g(z. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. Y is random with a distribution P(y I z). i I . For this reason. We have already argued in Examp}e 2. Zi 1'. 139).12) .1. see Problem 2.il.il2Zi) = 0. ( (J2 2 ) distribution where = ayy Therefore. {3. In that case. we get the solutions (2. In the measurement model in which Yi is the detennination of a constant Ih. the solution of the normal equations can be given "explicitly" by (2. 1967. d = 1.
.1. i = 1. 9p( z). The regression line for the phosphorus data is given in Figure 2. .. P > d and postulate that 1'( z) is a linear combination of 91 (z). j=1 Then we are still dealing with a linear regression model because we can define WpXI . The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1. .. Zn. .(/31 + (32Zi)] are called the residuals of the fit.58 and 132 = 1. n} and sample regression line for the phosphorus data.(Z). if we measure the distance between a point (Zi..Section 2. . .(a + bzi)l. This connection to prediction explains the use of vertical distance! in regression. The vertical distances €i = fYi ..42.9p. The linear regression model is considerably more general than appears at first sight.. . .1.Yn on ZI.. and 131 = fi .. €6 is the residual for (Z6. Y6). Geometrically. n..2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22.2.132 z ~ ~ (2.2. For instance. (zn. . i = 1. . .Yi) and a line Y ~ a + bz vertically by di = Iy. and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl. that is p I'(z) ~ 2:).9..2. suppose we select p realvalued functions of z.4. . . then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl). .1. 0 ~ ~ ~ ~ ~ Remark 2. Yi).. . Here {31 = 61.. Yn). . &atter plot {(Zi.3...13) where Z ~ (lin) I:~ 1 Zi. 91.
. and g({3. + /3zzi)J2 (2. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic).. However.5) fails..2.wi . < n.15) ..zd + ti. . i = 1..2.. Zi) + Ei .. Zi) = {3l + fhZi. [Yi . Note that the variables .2.2.2.3.5)..g«(3.14) where (J2 is unknown as before.8. That is.16) as a function of {3. I and the Y i satisfy the assumption (2..= g({3. . We need to find the values {3l and /32 of (3.wi) g((3.2. Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. we arrive at quadratic regression. .. Zi)/..2. 1 <i < n Yi n 1 .= y. In Example 2.. Example 2. Thus. Weighted least squares.17) .z.'l are sufficient for the Yi and that Var(Ed.Zi)]2 = L ~..)]2 i=l i=l ~ n (2.wi 1 < . . which for given Yi = yd. if we setg«(3. we can write (2. _ Yi . 1<i <n Wij = 9j(Zi). The method of least squares may not be appropriate because (2.. 1 n. Zil = 1. 0 < j < 2. if d = 1 and we take gj (z) = zj.2...2.wi. fi = <d.. Zi) = (2.wi. We return to this in Volume II.. we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant.wi  _ g((3... For instance.24 for more on polynomial regression. Consider the case in which d = 2. .g(.filii minimizes ~  I i i L[1Ii .".. .II 112 Methods of Estimation Chapter 2 (91 (z). Weighted Linear Regression. and (32 that minimize ~ ~ "I. I .2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable. • I' "0 r .2. Zi2 = Zi. iI L Vi[Yi i=l n ({3. but the Wi are known weights. then = WW2/Wi = . Yi = 80 + 8 1Zi + 82 + Cisee Problem 2. The weighted least squares estimate of {3 is now the value f3. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci. . . z.
(3 and for general d.2.26. Thus.1. we can write Remark 2. when g(. i ~ I.2.1. i i=l = 1. By following the steps of Exarnple 2.(32E(Z') ~ . Then it can be shown (Problem 2. that weighted least squares estimates are also plugin estimates.20).7). Yn) and probability distribution given by PI(Z"..Section 2.2.4H2.. we may allow for correlation between the errors {Ei}.6).8). Let (Z*...4 as follows.2. the .B that minimizes (2.. (Zn.2.28) that the model Y = ZD. we find (Problem 2.n. When ZD has rank d and Wi > 0. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .19) and (2.1) and (2. . 1 < i < n.n where Ui n = vi/Lvi. wn ) and ZD = IIZijllnxd is the design matrix.3. .4..2. Zi) = z.~ UiYi . Y*) V.2. We can also use the results on prediction in Section 1. .B. .1 leading to (2. Var(Z") and  _ L:~l UiZiYi . as we make precise in Problem 2.2. . .Y.)] ~Ui. ~ . using Theorem 1.2.~ UiZi· n n i"~l I" n n F"l This computation suggests. . . z) = zT.13 minimizing the least squares contrast in this transformed model is given by (2. a__ Cov(Z"'.(Li~] Ui Z i)2 n 2 n (2..(3.Y") ~(Zi. suppose Var(€) = a 2W for some invertible matrix W nxn.B+€ can be transformed to one satisfying (2. That is.27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI.B. .16) for g((3....2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2. Moreover.18) ~ ~ ~I" (31 = E(Y') .2. 0 Next consider finding the.2.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi . More generally. Y*) denote a pair of discrete random variables with possible values (z" Yl).1 7) is equivalent to finding the best linear MSPE predictor of y .2.2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2. ..2.
.
.
.
.
p(X.ny l. n2.. Let X denote the number of customers arriving at a service counter during n hours. Nevertheless.. 0 e Example 2. .2.6. 0) ~ 20(1 . p(2. the dual point of view of (2.(1.. ]n practice.2. so the MLE does not exist because = (0. 2 and 3.. and as we have seen in Example 2. Consider a popUlation with three kinds of individuals labeled 1.l. + n.27) doesn't make sense.0).27) that are not maxima or only local maxima. the MLE does not exist if 112 + 2n3 = O. X2 = 2. Example 2. p(3.0)' < ° for all B E (0.29) '(j . 2. Because .118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables..) = ·j1 e'"p. 8' 80. represents the expected number of arrivals in an hour or. which has the unique solution B = ~. . Similarly. (2. .2.2.OJ'". 1). there may be solutions of (2. + n.2.x=O.5. equivalently. In general. } with probabilities. let nJ.lx(0) = 5 1 0' . 1. A is an unknown positive constant and we wish to estimate A using X. and n3 denote the number of {Xl.2. 0)p(2. ..2. the likelihood is {1 ..2. x.2.22) and (2. the maximum likelihood estimate exists and is given by 8(x) = 2n.7. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive. then X has a Poisson distribution with parameter nA. Evidently. ~ maximizes Lx(B).2. If we make the usual simplifying assumption that the arrivals form a Poisson process.O) = 0'. 0) = 20'(1. the rate of arrival. .>. X3 = 1.28) If 2n. Here X takes on values {O. situations with f) well defined but (2. OJ = (1  0)' where 0 < () < 1 (see Example 2. then I • Lx(O) ~ p(l. 2n (2. I). which is maximized by 0 = 0.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0.4). is zero. Here are two simple examples with (j real. .1. x n } equal to 1. respectively. and 3 and occurring in the HardyWeinberg proportions p(I. If we observe a sample of three individuals and obtain Xl = 1.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section. O)p(l. where ).
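A small sketch for the Hardy-Weinberg likelihood above: for the sample x1 = 1, x2 = 2, x3 = 1 the numerical maximizer of the log likelihood agrees with the closed form (2n1 + n2)/2n given in the example (scipy's bounded scalar minimizer is an illustrative choice of optimizer, not part of the text).

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Hardy-Weinberg sample x = (1, 2, 1): n1 = 2, n2 = 1, n3 = 0, n = 3
    n1, n2, n3 = 2, 1, 0
    n = n1 + n2 + n3

    def neg_loglik(theta):
        # p(1) = theta^2, p(2) = 2 theta (1 - theta), p(3) = (1 - theta)^2
        return -(n1 * np.log(theta ** 2)
                 + n2 * np.log(2 * theta * (1 - theta))
                 + n3 * np.log((1 - theta) ** 2))

    theta_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
    theta_closed = (2 * n1 + n2) / (2 * n)      # closed form from the likelihood equation
    print(theta_numeric, theta_closed)          # both equal 5/6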
and k j=1 k Ix(fJ) = LnjlogB j . A sufficient condition.ekl with kl Ok ~ 1.2.d.kl.30)). familiar from calculus.. = L. Multinomial Trials.2.LBj j=1 • (2. .6. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category. j=d (2. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. and the equation becomes (BkIB j ) n· OJ = .logB.31 ) To obtain the MLE (J we consider l as a function of 8 1 . 8' (2. . p(x...2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution).2. 31=1 By (2. Example 2. fJE6= {fJ:Bj >O'LOj ~ I}. for an experiment in which we observe nj ~ 2:7 1 I[X.. 80. ~ n . 8Bk/8Bj (2. this estimate is the MLE of ).lx(B) <0.2. .. We assume that n > k ~ 1. thus. If x = 0 the MLE does not exist.2.. e. . trials in which each trial can produce a result in one of k categories. Then. Then p(x. consider an experiment with n Li.2.8. = jl.Section 2. J = I. Let Xi = j if the ith trial produces a result in the jth category. is that l be concave in If l is twice differentiable. L. fJ) = TI7~1 B7'.. j = 1. 8B. (see (2.8B " 1=13 k k =0.32) We first consider the case with all the nj positive. k.32) to find I.o. this is well known to be equivalent to ~ e. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE..2. As in Example 1..32). the MLE must have all OJ > 0. However. ~ ~ . = j) be the probability J of the jth category.7.1.6.n.2.\ I 0. let B = P(X.. . the maximum is approached as . 8) = 0 if any of the 8j are zero.. = x/no If x is positive. . A similar condition applies for vector parameters.30) for all This is the condition we applied in Example 2.
. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood.2.. then <r <k I. YnJT.1.) = gi«(3).1.' .. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. See Problem 2. .202 L.34) g«(3.. Then (} with OJ = njjn. . we check the concavity of lx(O): let 1 1 <j < k . Zi)) 00 ao n log 2 I"" 2 "2 (21Too) .. zi)1 .gi«(3) 1 .. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3.2. Then Ix «(3) log IT ~'P (V.3. Zi).. The 0 < OJ < 1. Example 2. z.2.2..9.X)2 (Problem 2. ~ g«(3. g«(3.1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV.11(a)).. 1 Yn . As we have IIV. least squares estimates are maximum likelihood for the particular model Po.. Summary. zn) 03W. see Problem 2.IV. as maximum likelihood estimates for f3 when Y is distributed as lV. wi(5). i=l g«(3. N(lll 0. X n are Li. Next suppose that nj = 0 for some j.2 ) with J1. and 0. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • . ~ g((3. ..1 holds and X = (Y" . o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2. where E(V. 1 < i < n.. are known functions and {3 is a parameter to be estimated from the independent observations YI . Suppose the model Po of Example 2. OJ > ~ 0 case.2. 'ffi f. version of this example will be considered in the exponential family case in Section 2. .2 both unknown. . n. gi.2. 0. 1 < j < k. . these estimates viewed as an algorithm applied to the set of data X make sense much more generally. Suppose that Xl. In Section 2. . . Using the concavity argument.33) It follows that in this nj > 0. ((g((3.6.g«(3.1 L:~ . Thus.2. where Var(Yi) does not depend on i. .(Xi .  seen and shall see more in Section 6.). Zi)J[l:} . More generally..lV.. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O). (2..1). is still the unique MLE of fJ.30.28. k.d.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8).Zi)]2. we can consider minimizing L:i. j = 1. i = 1.
Suppose we are given a function 1: continuous. Suppose 8 c RP is an open set.I ) all tend to De as m ~ 00.2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. Formally. (h). 8). In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d. That is. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2.2.1) lim{l(li) : Ii ~ ~ De} = 00. = RxR+ and ee e. In the case of independent response variables Yi that are modeled to have a N(9i({3). (m. m). m.4 and Corollaries 1. We start with a useful general framework and lemma. including all points with ±oo as a coordinate.1.3.ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e. (m.1 and 1. (m. e e I"V e De= {(a. (}"2) distribution. in the N(O" O ) case. Suppose also that e t R where e c RP is open and 1 is (2.oo}}.. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI. 2 e Lemma 2. b). See Problem 2.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly.6.Section 2.1 only.2 and other exponential family properties also playa role. or 8 1nk diverges with 181nk I . Proof.00.zid. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence.1.. In Section 2. (a. Properties that derive solely fTOm concavity are given in Propositon 2. are given. Extensions to weighted least squares.a < b< oo} U {(a. as k .6. a8 is the set of points outside of 8 that can be obtained as limits of points in e. b) : aER.3. if X N(B I .3.3 and 1. . . z=1=1 ~ 2. where II denotes the Euclidean norm. which are appropriate when Var(Yj) depends on i or the Y's are correlated.bE {a. For instance. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}. . In general. b).3. Let &e = be the boundary of where denotes the closure of in [00. oo]P.5. for a sequence {8 m } of points from open.00.1 ).6.. though the results of Theorems 1.a= ±oo. it is shown that the MLEs coincide with the LSEs. Concavity also plays a crucial role in the analysis of algorithms in the next section.b) . For instance.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. (a. m.6. we define 8 1n .3.
Suppose the cOlulitions o/Theorem 2.3.to.. [ffunher {x (II) e = e e t 88. open C RP.8 and 2. II E j. Let x be the observed data vector and set to = T(x). II) is strictly concave and 1. Suppose P is the canonical exponential/amity generated by (T. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2. (ii) The family is of rank k. From (B. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. Applications of this theorem are given in Problems 2. = . By Lemma 2. Then.. Write 11 m = Am U m . then lx(11 m ) . ! " .3.1.3.3.3. are distinct maximizers.3.3. Furthennore.1.3) I I' (b) Conversely.3.1. thus. i . Proofof Theorem 2.2).  (hO! + 0. We give the proof for the continuous case. is unique. U m = Jl~::rr' Am = . " We. . E. /fCT is the convex suppon ofthe distribution ofT (X). Without loss of generality we can suppose h(x) = pix. then the MLE8(x) exists and is unique.1.'1 .3. I I. Suppose X ~ (PII .(11) ~ 00 as densities p(x.'1) with T(x) = 0. lI(x) =  exists. '10) for some reference '10 E [ (see Problem 1.)) 0 Themem 2. then Ix lx(O. and is a solution to the equation (2. " . we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) . is open.3.)) > ~ (fx(Od +lx (0. with corresponding logp(x. if lx('1) logp(x. if to doesn't satisfy (2.12. ' L n . We show that if {11 m } has no subsequence converging to a point in E.122 Methods of Estimation Chapter 2 Proposition 2.. no solution.3. .9) we know that II lx(lI) is continuous on e. ~ Proof. and 0.3. II). Corollary 2.2) then the MLE Tj exists. a contradiction. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data.1.6. If 0.27).00.3) has i 1 !.1 hold.3. h) and that (i) The natural parameter space. then the MLE doesn't exist and (2. We can now prove the following. Existence and Uniqueness o/the MLE ij. .).1. which implies existence of 17 by Lemma 2.
2) and COTollary 2.1 follow. Write Eo for E110 and Po for P11o . for every d i= 0.2. that is. iff.i.3) by TheoTem 1.J = 00.3. So we have Case 2: Amk t A.2) fails.3). Case 1: Amk t 111]=11. The Gaussian Model. Suppose X"". . t u.5. um~. As we observed in Example 1. fOT all 1].9.6. So In either case limm. thus. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0. 0 (L:7 L:7 CJ. Theorem 2. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows.se density is a general phenomenon.3.3. lIu m ll 00.4. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist. Por n = 1.k Lx( 11m. Evidently. CT = C~ and the MLE always exists. this is the exponential family generated by T(X) = 1 Xi. . Suppose the conditions ofTheorem 2. Then AU ¢ £ by assumption.6. Then because for some 6 > 0.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. 0 Example 2. It is unique and satisfies x (2. existence of MLEs when T has a continuous ca.3. So. which is impossible. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0. Then. Nonexistence: if (2. u mk t u. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists.. I Xl) and 1.1. I' E R.* P'IlcTT = 0] = I.3.3.3. Po[uTT(X) > 61 > O.Xn are i.3.d. In fact. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0. By (B. T(X) has a density and. 0 ° '* '* PrOD/a/Corollary 2.Section 2. The equivalence of (2.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2. N(p" ( 2 ). contradicting the assumption that the family is of rank: k. (1"2 > O.1.3.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. CT = R )( R+ FOT n > 2.
and one of the two sums is nonempty because c oF 0.9). The TwoParameter Gamma Family.i. the maximum likelihood principle in many cases selects the «best" estimate among them. I < j < k. How to find such nonexplicit solutions is discussed in Section 2.Xn are i. in a certain sense.1. The boundary of a convex set necessarily has volume 0 (Problem 2.3. Multinomial Trials.)e>"XxPI. Example 2.3.d.. To see this note that Tj > 0.. where 0 < Aj PIX = j] < I. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO ..3.5) have a unique solution with probability 1.6.1.. Here is an example. where Tj(X) L~ I I(Xi = j).3.. We conclude from Theorem 2. In Example 3.2 follows.3. then and the result follows from Corollary 2. For instance.1.ln.I which generates the family is T(k_l) = (Tl>"" T. The likelihood equations are equivalent to (problem 2.1 that in this caseMLEs of"'j = 10g(AjIAk).3.124 Methods of Estimation Chapter 2 Proof.6. but only Os is a MLE.3).).4 we will see that 83 is.n.1)...I._t)T. I < j < k.13 and the next example). the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.3. We assume n > k .1. using (2. the best estimate of 8.T(X) = A(.2 that (2. Thasadensity.3.. exist iff all Tj > O.3. Because the resulting value of t is possible if 0 < tjO < n.3.3.3.= r(jJ) A log X (2. > O. Suppose Xl. I <t< k .2(a). I < j < k. lit ~ .4) and (2. h(x) = XI.. . with density 9p.2.4 and 2. if T has a continuous case density PT(t).3.2(b» r' log .2. I < j < k iff 0 < Tj < n. The statistic of rank k .. Thus.1.. A nontrivial application of Theorem 2. in the HardyWeinberg examples 2. They are determined by Aj = Tj In. >.I and verify using Theorem 2.>. This is a rank 2 canonical exponential family generated by T   = (L log Xi. LXi).3. From Theorem 1. 0 = If T is discrete MLEs need not exist. we see that (2.4) (2. x > 0.5) ~=X A where log X ~ L:~ t1ogXi. When method of moments and frequency substitution estimates are not unique. It is easy to see that ifn > 2. We follow the notation of Example 1.3 we know that E. Example 2. thus.7. I <j < k.3.2) holds. = = L L i i bJ _ I .4. Thus.4.(X) = r~. with i'.6. o Remark 2. P > 0. .3.3.1 in the second. by Problem 2. liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2.
and one of the two sums is nonempty because c ≠ 0. Because the resulting value of t is possible if 0 < t_{j0} < n, 1 ≤ j ≤ k − 1, we see that (2.3.2) holds and the MLE exists. On the other hand, if any T_j = 0 or n, 1 ≤ j ≤ k − 1, we can obtain a contradiction to (2.3.2) by taking c_i = 1(i = j), 1 ≤ i ≤ k − 1; the remaining case T_k = 0 gives a contradiction if c = (1, ..., 1)^T. Thus, the MLE does not exist if any T_j = 0 or n. Alternatively we can appeal to Corollary 2.3.1 directly (see the problems). □

In Example 2.2.8 we saw that in the multinomial case with the closed parameter set {λ_j : λ_j ≥ 0, Σ_{j=1}^k λ_j = 1}, the MLEs λ̂_j = T_j/n, 1 ≤ j ≤ k, exist and are unique. Similarly, note that in the Hardy–Weinberg example, if 2θ_1 + θ_2 = 0, the MLE does not exist when the parameter set is (0, 1), whereas if Θ = [0, 1] it does exist and is unique. The argument of Example 2.3.3 and Theorem 2.3.2 can be applied to determine existence in cases for which (2.3.3) does not have a closed-form solution, as in Example 1.6.8 (see the problems and Haberman, 1974) and the bivariate normal case (see the problems).

When P is not an exponential family, both existence and unicity of MLEs become more problematic. However, for curved exponential families the following results can be useful. Consider the exponential family

    p(x, θ) = h(x) exp{ Σ_{j=1}^k c_j(θ) T_j(x) − B(θ) },  x ∈ X,  θ ∈ Θ,  Θ open ⊂ R^m,  m ≤ k.   (2.3.6)

Let C^0 denote the interior of the range of (c_1(θ), ..., c_k(θ))^T and let x be the observed data. If the likelihood equations have a solution θ̂(x) with c(θ̂(x)) ∈ C^0, then it is the unique MLE of θ.

Suppose c : Θ → Ɛ ⊂ R^k has a differential ċ(θ) = ||∂c_i/∂θ_j(θ)||_{k×m}, and let Q = {P_θ : θ ∈ Θ} be the curved exponential family p(x, θ) = exp{c^T(θ)T(x) − A(c(θ))} h(x). Here Ɛ is the natural parameter space of the exponential family P generated by (T, h).

Theorem 2.3.3. If P above satisfies the condition of Theorem 2.3.1, c(Θ) is closed in Ɛ and T(x) = t_0 satisfies (2.3.2), so that the MLE η̂ in P exists, then so does the MLE θ̂ in Q and it satisfies the likelihood equation

    ċ^T(θ̂)(t_0 − Ȧ(c(θ̂))) = 0.   (2.3.7)

Note that c(θ̂) ∈ c(Θ) and is in general not η̂.

Remark 2.3.2. Existence and unicity can be lost—take c not one-to-one, for instance. Unfortunately strict concavity of l_x is not inherited by curved exponential families.
The proof of Theorem 2.3.3 is sketched in the problems. Here are two examples.

Example 2.3.5. Gaussian with Fixed Signal to Noise. Suppose X_1, ..., X_n are i.i.d. N(μ, σ²) with μ/σ = λ_0 > 0 known, so μ > 0. Using Examples 1.6.5 and 1.6.10, this is a curved exponential family with

    c_1(μ) = λ_0²/μ,  c_2(μ) = −λ_0²/2μ²,

corresponding to η_1 = μ/σ², η_2 = −1/2σ². Because μ > 0, η_1 > 0, and evidently c(Θ) = {(η_1, η_2) : η_1 > 0, η_2 = −η_1²/2λ_0²}, which is closed in Ɛ = {(η_1, η_2) : η_2 < 0}. As a consequence of Theorems 2.3.2 and 2.3.3, we can conclude that an MLE μ̂ always exists and satisfies (2.3.7) if n ≥ 2. With μ̂_2 = n^{−1} Σ X_i², equation (2.3.7) simplifies to

    μ² + λ_0² X̄ μ − λ_0² μ̂_2 = 0,

whose roots are μ̂_± = (1/2)[−λ_0² X̄ ± λ_0 (λ_0² X̄² + 4 μ̂_2)^{1/2}]. Note that μ̂_+ μ̂_− = −λ_0² μ̂_2 < 0, which implies μ̂_+ > 0; the solution we seek is μ̂_+. □

Example 2.3.6. Location-Scale Regression. Suppose that Y_{j1}, ..., Y_{jm}, j = 1, ..., n, are n independent random samples, where Y_{jl} ~ N(μ_j, σ_j²) and, as in Example 1.6.10,

    μ_j = θ_1 + θ_2 z_j,  σ_j² = θ_3 (θ_1 + θ_2 z_j)²,  j = 1, ..., n,

with z_1 < ... < z_n given constants and θ_3 > 0. The distribution of {Y_{jl} : j = 1, ..., n, l = 1, ..., m} is a 2n-parameter canonical exponential family with η_i = μ_i/σ_i², η_{n+i} = −1/2σ_i², i = 1, ..., n, generated by h(Y) ≡ 1 and

    T(Y) = ( Σ_l Y_{1l}, ..., Σ_l Y_{nl}, Σ_l Y_{1l}², ..., Σ_l Y_{nl}² )^T.

Thus, p(y, θ) is a curved exponential family of the form (2.3.6).
Let Ɛ be the canonical parameter set for this full model and let Θ = {θ : θ_1 ∈ R, θ_2 ∈ R, θ_3 > 0}. Then c(Θ) is closed in Ɛ and we can conclude that for m ≥ 2, an MLE θ̂ of θ exists and θ̂ satisfies (2.3.7). □

Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with Ɛ open (Theorem 2.3.1 and Corollary 2.3.1). These results lead to a necessary condition for existence of the MLE in curved exponential families, but without a guarantee of unicity or sufficiency. Finally, the basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models.

2.4 ALGORITHMIC ISSUES

As we have seen, even in the context of canonical multiparameter exponential families such as the two-parameter gamma, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations. In fact, even in the classical regression model with design matrix Z_D of full rank, the formula (2.1.10) for β̂ is easy to write down symbolically but not easy to evaluate if d is at all large, because inversion of Z_D^T Z_D requires on the order of nd² operations to evaluate each of the d(d + 1)/2 terms, with n operations to get Z_D^T Z_D and then, if implemented as usual, order d³ operations to invert. The packages that produce least squares estimates do not in fact use formula (2.1.10).

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section, we will discuss three algorithms of a type used in different statistical contexts, both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves. We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in k-parameter exponential families. Given f continuous on (a, b), f strictly increasing, and f(a+) < 0 < f(b−), then, by the intermediate value theorem, there exists a unique x* ∈ (a, b) such that f(x*) = 0. Here, in pseudocode, is the bisection algorithm to find x*.

Given tolerance ε > 0 for |x_final − x*|:

Find x_0 < x_1 with f(x_0) < 0 < f(x_1) by taking |x_0|, |x_1| large enough. Initialize x_old^− = x_0, x_old^+ = x_1.
(1) If |x_old^+ − x_old^−| < 2ε, set x_final = (x_old^− + x_old^+)/2 and return x_final.

(2) Else, x_new = (x_old^− + x_old^+)/2.

(3) If f(x_new) = 0, x_final = x_new.

(4) If f(x_new) < 0, set x_old^− = x_new. Go to (1).

(5) If f(x_new) > 0, set x_old^+ = x_new. Go to (1).

End

Lemma 2.4.1. The bisection algorithm stops at a solution x_final such that |x_final − x*| < ε.

Proof. If x_m is the mth iterate of x_new, then each step halves the length of the bracketing interval, which always contains x*. Therefore |x_m − x*| ≤ 2^{−m}|x_1 − x_0|, and x_m → x* as m → ∞. Hence, for m ≥ log_2(|x_1 − x_0|/ε), |x_m − x*| < ε, and the algorithm, which stops no later than this, returns a point within ε of x*. □

From this lemma we can deduce the following.

Theorem 2.4.1. Let p(x | η) be a one-parameter canonical exponential family generated by (T, h), satisfying the conditions of Theorem 2.3.1, and suppose T = t_0 ∈ C_T^0, the interior (a, b) of the convex support of P_T. Then the MLE η̂, which exists and is unique by Theorem 2.3.1, may be found (to tolerance ε) by the method of bisection applied to

    f(η) = E_η T(X) − t_0.

Proof. By Theorem 1.6.4, f'(η) = Var_η T(X) > 0 for all η, so that f is strictly increasing and continuous. Because t_0 ∈ C_T^0 and the root η̂ exists in Ɛ, f necessarily takes both negative and positive values on Ɛ, so the bisection method applies. □

If desired one could evidently also arrange it so that, in addition, |f(x_final)| < ε.
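A minimal Python transcription of the pseudocode above, assuming only that f is strictly increasing and continuous with f(x0) < 0 < f(x1); the function name and arguments are illustrative:

    def bisection(f, x0, x1, tol):
        """Return x_final with |x_final - x*| < tol, where f(x*) = 0."""
        x_neg, x_pos = x0, x1                  # points where f < 0 and f > 0
        while abs(x_pos - x_neg) >= 2 * tol:   # step (1)
            x_new = 0.5 * (x_neg + x_pos)      # step (2)
            fx = f(x_new)
            if fx == 0:                        # step (3)
                return x_new
            elif fx < 0:                       # step (4)
                x_neg = x_new
            else:                              # step (5)
                x_pos = x_new
        return 0.5 * (x_neg + x_pos)

By Theorem 2.4.1, applying such a routine to f(η) = E_η T(X) − t_0 computes the MLE in a one-parameter canonical exponential family whenever t_0 lies in the interior of the convex support of P_T.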
Example 2.4.1. The Shape Parameter Gamma Family. Let X_1, ..., X_n be i.i.d. with the gamma density Γ^{−1}(θ) x^{θ−1} e^{−x}, x > 0, θ > 0 (the scale is taken known and equal to 1). Because T(X) = Σ_{i=1}^n log X_i has a density for all n, the MLE always exists. It solves the equation

    Γ'(θ)/Γ(θ) = T(X)/n,

which by Theorem 2.4.1 can be solved by bisection. This example points to another hidden difficulty: the function Γ(θ) = ∫_0^∞ x^{θ−1} e^{−x} dx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However, it is in fact available to high precision in standard packages such as NAG or MATLAB; bisection itself is a defined function in some packages. □

2.4.2 Coordinate Ascent

The problem we consider is to solve numerically, for a canonical k-parameter exponential family,

    E_η(T(X)) = Ȧ(η) = t_0,

when the MLE η̂ = η̂(t_0) exists. Here is the algorithm.

The case k = 1: see Theorem 2.4.1.

The general case: Initialize η̂^0 = (η̂_1^0, ..., η̂_k^0).

Solve (∂/∂η_1) A(η_1, η̂_2^0, ..., η̂_k^0) = t_1 for η_1, getting η̂_1^1; set η̂^{0,1} = (η̂_1^1, η̂_2^0, ..., η̂_k^0). Then solve (∂/∂η_2) A(η̂_1^1, η_2, η̂_3^0, ..., η̂_k^0) = t_2 for η_2, getting η̂_2^1; set η̂^{0,2} = (η̂_1^1, η̂_2^1, η̂_3^0, ..., η̂_k^0), and so on, finally obtaining η̂^{0,k} = η̂^{(1)} = (η̂_1^1, ..., η̂_k^1). Repeat, getting η̂^{(r)}, r ≥ 1.

Notes: (1) In practice, we would again set a tolerance, say ε, for each of the one-dimensional solves in cycle j, and stop, possibly in midcycle, as soon as no coordinate changes by more than ε. (2) Each one-dimensional solve is itself a bisection of the type covered by Theorem 2.4.1, because with all coordinates but one held fixed we are fitting a one-parameter exponential family. As we show next, when the MLE exists this procedure always converges to η̂.
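As an illustration of the cycle structure just described, here is a minimal Python sketch of coordinate ascent for a smooth, strictly concave log likelihood. It uses SciPy's one-dimensional optimizer in place of the bisection solves of ∂A/∂η_j = t_j; the function name, arguments, and iteration controls are illustrative assumptions, not part of the text:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def coordinate_ascent(log_lik, eta0, n_cycles=100, tol=1e-8):
        """Cycle through the coordinates of eta, maximizing the (concave)
        log likelihood in one coordinate at a time."""
        eta = np.asarray(eta0, dtype=float).copy()
        for _ in range(n_cycles):
            eta_prev = eta.copy()
            for j in range(len(eta)):
                def neg_profile(v, j=j):
                    e = eta.copy(); e[j] = v
                    return -log_lik(e)
                eta[j] = minimize_scalar(neg_profile).x   # one-dimensional maximization
            if np.max(np.abs(eta - eta_prev)) < tol:      # stop when a full cycle changes nothing
                break
        return eta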
Theorem 2.4.2. Suppose the conditions of Theorem 2.3.1 hold and t_0 ∈ C_T^0. Then the coordinate ascent iterates converge: η̂^{(r)} → η̂ as r → ∞.

Proof. We give a series of steps. Let l(η) = t_0^T η − A(η) + log h(x), the log likelihood.

(1) l(η̂^{ij}) increases in j for i fixed and in i. Here we use the strict concavity of l: η̂^{ij} and η̂^{i(j+1)} differ in only one coordinate, for which η̂^{i(j+1)} maximizes l. Therefore lim_{i,j} l(η̂^{ij}) = λ (say) exists and is > −∞.

(2) The sequence (η̂^{i1}, ..., η̂^{ik}) has a convergent subsequence in Ɛ × ... × Ɛ, with limit (η^1, ..., η^k) say; each η^j ∈ Ɛ because otherwise l(η̂^{ij}) → −∞ for some j.

(3) l(η^j) = λ for all j because the sequence of likelihoods is monotone.

(4) (∂l/∂η_1)(η^1) = 0 because (∂l/∂η_1)(η̂^{i1}) = 0 for all i; that is, the first coordinate of η^1 makes the expectation of T_1(X), in the one-parameter exponential family model with all parameters save η_1 assumed known, equal to t_1.

(5) Because η^1 and η^2 differ only in the second coordinate and l(η^1) = l(η^2) = λ, strict concavity gives η^1 = η^2. Continuing, η^1 = ... = η^k.

(6) By (4) and (5), (∂l/∂η_j)(η^1) = 0 for all j, so Ȧ(η^1) = t_0 and η^1 is the unique MLE η̂.

To complete the proof notice that if η̂^{(r_k)} is any subsequence of η̂^{(r)} that converges to η' (say), then, by (1), l(η') = λ. Because l(η̂) = λ and the MLE is unique, η' = η̂. By a standard argument it follows that η̂^{(r)} → η̂ as r → ∞. □

It is natural to ask what happens if the MLE η̂ doesn't exist, that is, if t_0 ∉ C_T^0 (for instance, in the multinomial case when some T_j = 0 or n). Fortunately, in these cases the algorithm, as it should, refuses to converge (in Ɛ space!)—see the problems.

Example 2.4.2. The Two-Parameter Gamma Family (continued). We use the notation of Example 2.3.2. For n ≥ 2 we know the MLE exists. We can initialize with the method of moments estimate. This two-dimensional problem is essentially no harder than the one-dimensional problem of Example 2.4.1 because the equation leading to λ_new given p_old, (2.3.5), can be solved in closed form: λ̂^{(1)} = p̂^{(0)}/X̄. We then use bisection to get p̂^{(1)} solving Γ'(p)/Γ(p) = (log X)‾ + log λ̂^{(1)}, and so on. Thus each step of the iteration, both within cycles and from cycle to cycle, is quick; a code sketch of this alternation follows below. □

We note some important generalizations. Suppose that we can write η^T = (η_1^T, ..., η_r^T), where η_j has dimension d_j, Σ_{j=1}^r d_j = k, and the problem of obtaining η̂_l(t_0; η_j, j ≠ l) can be solved in closed form. Then the algorithm may be viewed as successive fitting of lower-dimensional families, each of whose members can be evaluated easily. Whenever we can obtain such steps in algorithms, they result in substantial savings of time. We pursue this discussion next.
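Returning to Example 2.4.2, the alternation just described—closed-form step (2.3.5), then a bisection step for (2.3.4)—can be sketched as follows. SciPy's digamma is used for Γ'/Γ, the starting values are the method of moments estimates, and the argument names and tolerances are illustrative:

    import numpy as np
    from scipy.special import digamma

    def gamma_mle_coord_ascent(x, tol=1e-8, max_cycles=200):
        """Coordinate ascent for the MLE (p, lambda) of the two-parameter gamma family."""
        xbar, logxbar = np.mean(x), np.mean(np.log(x))
        p = xbar ** 2 / np.var(x)              # method of moments start
        for _ in range(max_cycles):
            lam = p / xbar                     # closed-form step, (2.3.5)
            # bisection in p for digamma(p) - (logxbar + log(lam)) = 0, cf. (2.3.4)
            f = lambda q: digamma(q) - (logxbar + np.log(lam))
            lo, hi = 1e-8, max(p, 1.0)
            while f(lo) > 0: lo /= 2.0
            while f(hi) < 0: hi *= 2.0
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
            p_new = 0.5 * (lo + hi)
            if abs(p_new - p) < tol:
                p = p_new
                break
            p = p_new
        return p, p / xbar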
The case we have just discussed has d_1 = ... = d_r = 1, r = k. It is easy to see that Theorem 2.4.2 has a generalization with cycles of length r, in which each block η_j is updated in turn with the other blocks held fixed. A special case of this is the famous Deming–Stephan proportional fitting of contingency tables algorithm—see Bishop, Feinberg, and Holland (1975) and the problems.

Next consider the setting in which l_x(θ), the log likelihood for θ ∈ Θ open ⊂ R^p, is strictly concave. If θ̂(x) exists and l_x is differentiable, the method extends straightforwardly: solve

    (∂l_x/∂θ_j)(θ̂_1, ..., θ̂_{j−1}, θ_j, θ_{j+1}^0, ..., θ_p^0) = 0

by the method of bisection in θ_j to get θ̂_j for j = 1, ..., p; iterate and proceed. At each stage with one coordinate fixed, find that member of the family of contours to which the vertical (or horizontal) line is tangent, and change the other coordinate accordingly. Figure 2.4.1 illustrates the process. See also the problems.

[Figure 2.4.1. The coordinate ascent algorithm. The graph shows log likelihood contours, that is, values of (θ_1, θ_2)^T where the log likelihood is constant, with the iterates moving alternately along vertical and horizontal lines to the point of tangency.]

The coordinate ascent algorithm can be slow if the contours in Figure 2.4.1 are not close to spherical. It can be speeded up at the cost of further computation by Newton's method, which we now sketch.
2.4.3 The Newton–Raphson Algorithm

An algorithm that, when it converges, can in general be shown to be faster than coordinate ascent is the Newton–Raphson method. Here is the method: if η̂_old is the current value of the algorithm, then

    η̂_new = η̂_old − Ä^{−1}(η̂_old)(Ȧ(η̂_old) − t_0).   (2.4.2)

The rationale here is simple. If η̂_old is close to the root η̂ of Ȧ(η) = t_0, then, expanding Ȧ(η̂) around η̂_old, we obtain

    0 = Ȧ(η̂) − t_0 ≈ (Ȧ(η̂_old) − t_0) + Ä(η̂_old)(η̂ − η̂_old),

so that η̂_new is the solution for η of the approximation equation given by the right- and left-hand sides. If η̂_old is close enough to η̂, this method is known to converge to η̂ at a faster rate than coordinate ascent—see Dahlquist, Björk, and Anderson (1974). A hybrid of the two methods that always converges and shares the increased speed of the Newton–Raphson method is given in the problems. Newton's method requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge.

Newton's method also extends to the framework of strictly concave log likelihoods: if l(θ) denotes the log likelihood, the argument that led to (2.4.2) gives

    θ̂_new = θ̂_old − [l̈(θ̂_old)]^{−1} l̇(θ̂_old).   (2.4.3)

Example 2.4.3. Let X_1, ..., X_n be a sample from the logistic distribution with d.f. F(x, θ) = [1 + exp{−(x − θ)}]^{−1}. The density is

    f(x, θ) = exp{−(x − θ)}[1 + exp{−(x − θ)}]^{−2}.

We find

    l̇(θ) = n − 2 Σ_{i=1}^n exp{−(X_i − θ)} F(X_i, θ),
    l̈(θ) = −2 Σ_{i=1}^n exp{−(X_i − θ)} F²(X_i, θ) < 0.

In this case l_x(θ) is concave in θ, and the Newton–Raphson method can be implemented by taking θ̂_old = X̄. □

The Newton–Raphson algorithm has the property that, for large n, θ̂_new after only one step behaves approximately like the MLE. We return to this property in Chapter 6.

When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton–Raphson are still employed, though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum. A one-dimensional problem in which such difficulties arise is given in the problems.
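A sketch of the Newton–Raphson iteration (2.4.3) for Example 2.4.3, using the expressions for l̇ and l̈ just derived (starting point and iteration controls are illustrative choices):

    import numpy as np

    def logistic_location_mle(x, max_iter=50, tol=1e-10):
        """Newton-Raphson for the location parameter of the logistic distribution."""
        theta = np.mean(x)                          # starting value, as in the text
        for _ in range(max_iter):
            F = 1.0 / (1.0 + np.exp(-(x - theta)))
            score = np.sum(2.0 * F - 1.0)           # l'(theta), an equivalent form
            hess = -2.0 * np.sum(F * (1.0 - F))     # l''(theta) < 0
            step = score / hess
            theta -= step
            if abs(step) < tol:
                break
        return theta

Because l is strictly concave here, the iteration is well behaved for reasonable starting values; with a nonconcave likelihood the same code could converge to a local maximum or diverge.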
Many examples and important issues and methods are discussed, for instance, in Chapter 6 of Dahlquist, Björk, and Anderson (1974).

2.4.4 The EM (Expectation/Maximization) Algorithm

There are many models that have the following structure. There are ideal observations, X ~ P_θ with density p(x, θ), θ ∈ Θ ⊂ R^d. Their log likelihood l_{p,x}(θ) is "easy" to maximize; say there is a closed-form MLE, or at least l_{p,x}(θ) is concave in θ. What is observed, however, is not X but S = S(X) ~ Q_θ with density q(s, θ), and the log likelihood l_{q,s}(θ) = log q(s, θ) is difficult to maximize; the function is not concave, difficult to compute, and so on. A fruitful way of thinking of such problems is in terms of S as representing part of X; for some individuals, the rest of X is "missing," and its "reconstruction" is part of the process of estimating θ by maximum likelihood. The algorithm was formalized with many examples in Dempster, Laird, and Rubin (1977), though an earlier general form goes back to Baum, Petrie, Soules, and Weiss (1970). For detailed discussion we refer to Little and Rubin (1987) and McLachlan and Krishnan (1997). We give a few examples of situations of the foregoing type in which it is used, and its main properties. A prototypical example follows.

Example 2.4.4. Lumped Hardy–Weinberg Data. Let X_i = (ε_{i1}, ε_{i2}, ε_{i3}), 1 ≤ i ≤ n, be a sample from a population in Hardy–Weinberg equilibrium for a two-allele locus, where

    P_θ[X = (1, 0, 0)] = θ²,  P_θ[X = (0, 1, 0)] = 2θ(1 − θ),  P_θ[X = (0, 0, 1)] = (1 − θ)²,  0 < θ < 1.

What is observed is not X but S = S(X), where

    S_i = X_i, 1 ≤ i ≤ m;  S_i = (ε_{i1} + ε_{i2}, ε_{i3}), m + 1 ≤ i ≤ n.   (2.4.4)

This could happen if, for some individuals, the homozygotes of one type (ε_{i1} = 1) could not be distinguished from the heterozygotes (ε_{i2} = 1). The log likelihood of S now is

    l_{q,s}(θ) = Σ_{i=1}^m [2ε_{i1} log θ + ε_{i2} log 2θ(1 − θ) + 2ε_{i3} log(1 − θ)]
               + Σ_{i=m+1}^n [(ε_{i1} + ε_{i2}) log(1 − (1 − θ)²) + 2ε_{i3} log(1 − θ)],   (2.4.5)

a function that is of curved exponential family form. It does turn out that in this simplest case an explicit maximum likelihood solution is still possible, but the computation is clearly not as simple as in the original Hardy–Weinberg canonical exponential family example. If we suppose (say) that the lumping is different for different individuals, for instance that for some of them we observe (ε_{i1}, ε_{i2} + ε_{i3}) instead, then explicit solution is in general not possible. Yet the EM algorithm, with an appropriate starting point, leads us to an MLE if it exists in both cases. □

Here is another important example.
Example 2.4.5. Mixture of Gaussians. Suppose S_1, ..., S_n is a sample from a population P whose density is modeled as a mixture of two Gaussian densities,

    p(s, θ) = (1 − λ) φ_{σ1}(s − μ_1) + λ φ_{σ2}(s − μ_2),

where θ = (λ, μ_1, μ_2, σ_1², σ_2²), 0 < λ < 1, μ_1, μ_2 ∈ R, σ_1², σ_2² > 0, and φ_σ(s) = (1/σ) φ(s/σ). This five-parameter model is very rich, permitting up to two modes and scales. Although MLEs do not exist in these models—the log likelihood tends to ∞ as θ tends to the boundary of the parameter space (see the problems)—a local maximum close to the true θ_0 turns out to be a good "proxy" for the nonexistent MLE. The log likelihood similarly can have a number of local maxima, and the EM algorithm can lead to such a local maximum (a code sketch for this example follows the general description of the algorithm below).

It is not obvious that this falls under our scheme, but suppose that, given Δ = (Δ_1, ..., Δ_n), the S_i are independent with

    L_θ(S_i | Δ) = L_θ(S_i | Δ_i) = N(Δ_i μ_1 + (1 − Δ_i) μ_2, Δ_i σ_1² + (1 − Δ_i) σ_2²),   (2.4.6)

where Δ_1, ..., Δ_n are i.i.d. with P_θ[Δ_i = 1] = 1 − λ = 1 − P_θ[Δ_i = 0]. Then S has the marginal distribution given previously; Δ_i tells us whether to sample from N(μ_1, σ_1²) or N(μ_2, σ_2²), and we can think of S as S(X) where X = ((S_1, Δ_1), ..., (S_n, Δ_n)). □

Here is the algorithm.

The EM Algorithm. Initialize with θ_old = θ_0. Define

    J(θ | θ_0) = E_{θ_0}( log [p(X, θ)/p(X, θ_0)] | S(X) = s ),   (2.4.7)

where we suppress dependence on s. The first (E) step of the algorithm is to compute J(θ | θ_old) for as many values of θ as needed; if this is difficult, the EM algorithm is probably not suitable. The second (M) step is to maximize J(θ | θ_old) as a function of θ; again, if this step is difficult, EM is not particularly appropriate. Then we set θ_new = arg max J(θ | θ_old), reset θ_old = θ_new, and repeat the process. As we shall see in important situations, including the examples we have given, the M step is easy and the E step doable.

The rationale behind the algorithm lies in the following formulas, which we give for θ real and which can be justified easily in the case that X is finite (see the problems):

    q(s, θ)/q(s, θ_0) = E_{θ_0}( p(X, θ)/p(X, θ_0) | S(X) = s )   (2.4.8)

and, for all θ_0 (under suitable regularity conditions),

    (∂/∂θ) log q(s, θ) |_{θ=θ_0} = E_{θ_0}( (∂/∂θ) log p(X, θ) |_{θ=θ_0} | S(X) = s ).   (2.4.9)

Note that (2.4.9) follows from (2.4.8) by taking logs, differentiating, and exchanging E_{θ_0} and differentiation with respect to θ at θ_0.
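For the Gaussian mixture of Example 2.4.5, the E- and M-steps take a familiar closed form. The sketch below is an illustration only: the starting values and fixed iteration count are arbitrary choices, and no safeguard against degenerate variances is included.

    import numpy as np
    from scipy.stats import norm

    def em_gaussian_mixture(s, n_iter=200):
        """EM for p(s) = (1 - lam) N(mu1, v1) + lam N(mu2, v2)."""
        s = np.asarray(s, dtype=float)
        lam, mu1, mu2 = 0.5, np.quantile(s, 0.25), np.quantile(s, 0.75)
        v1 = v2 = np.var(s)
        for _ in range(n_iter):
            d1 = (1 - lam) * norm.pdf(s, mu1, np.sqrt(v1))
            d2 = lam * norm.pdf(s, mu2, np.sqrt(v2))
            g = d1 / (d1 + d2)               # E-step: P(component 1 | s_i)
            lam = 1 - g.mean()               # M-step: mixing weight
            mu1 = np.sum(g * s) / np.sum(g)
            mu2 = np.sum((1 - g) * s) / np.sum(1 - g)
            v1 = np.sum(g * (s - mu1) ** 2) / np.sum(g)
            v2 = np.sum((1 - g) * (s - mu2) ** 2) / np.sum(1 - g)
        return lam, mu1, mu2, v1, v2

As noted in the text, this iteration converges to a local maximum of the observed-data likelihood, which need not be a global maximizer.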
(2. DO The main reason the algorithm behaves well follows. S (x) ~ 8 p(x.1. (2.Onew) {r(X I 8 .10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation.h) satisfying the conditions a/Theorem 2. Let SeX) be any statistic.Onew) E'Old { log r(X I 8. 0 ) +E. DJ(O I 00 ) [)O and.3. J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew.o log .4. Id log (X I . We give the proof in the discrete case.Section 2.3.4. formally. uold 0 r s.O)r(x I 8.0 ) I S(X) ~ 8 .0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s. In the discrete case we appeal to the product rule. ~ E '0 (~I ogp(X . 0) I S(X) = 8 ) DO (2. I S(X) ~ s > 0 } (2.1.13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof. Fat x EX.4.4.0) (2.4.0) = O. uold Now. 0) ~ q(s.2.(X 18.17) o The most important and revealing special case of this lemma follows.1.4. the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion. q S. However. Then (2.4. Lemma 2.4. hence.4. then .E.4 Algorithmic Issues 135 to 0 at 00 .13) q(8.4.O) { " ( X I 8.14) whete r(· j ·. r(Xj8. On the other hand. (2. 0 0 = 0old' = Onew.11)  D IOgq(8.. Lemma 2.12) = s. Because. Onew) > q(s. O n e w ) } log ( " ) ~ J(Onew I 0old) . 0old)' Equality holds in (2." ) I S(X) ~ 8 .16) ° q(s.15) q(s. Suppose {Po : () E e} is a canonical exponential family generated by (T.0) } ~ log q(s. Theorem 2. 00ld are as defined earlier and S(X) (2.4.00Id) by Shannon's ineqnality.lfOnew.
(a) The EM algorithm consists of the alternation

    Ȧ(θ_new) = E_{θ_old}(T(X) | S(X) = s),   (2.4.18)
    θ_old = θ_new.   (2.4.19)

If a solution of (2.4.18) exists it is necessarily unique.

(b) If the sequence of iterates {θ̂_m} so obtained is bounded and the equation

    Ȧ(θ) = E_θ(T(X) | S(X) = s)   (2.4.20)

has a unique solution, then it converges to a limit θ̂*, which is necessarily a local maximum of q(s, θ).

Proof. After some simplification,

    J(θ | θ_0) = E_{θ_0}{ (θ − θ_0)^T T(X) − (A(θ) − A(θ_0)) | S(X) = s }
               = (θ − θ_0)^T E_{θ_0}(T(X) | S(X) = s) − (A(θ) − A(θ_0)).   (2.4.21)

Part (a) follows. Part (b) is more difficult; a proof due to Wu (1983) is sketched in the problems. □

Example 2.4.4 (continued). X is distributed according to the exponential family

    p(x, θ) = exp{ η (2N_{1n}(x) + N_{2n}(x)) − A(η) } h(x),   (2.4.22)

where

    η = log(θ/(1 − θ)),  A(η) = 2n log(1 + e^η),  h(x) = 2^{N_{2n}(x)},   (2.4.23)

and N_{jn}(x) = Σ_{i=1}^n ε_{ij}(x_i), 1 ≤ j ≤ 3. Note that Ȧ(η) = 2nθ. Under the assumption that the process that causes lumping is independent of the values of the ε_{ij}, we see that

    E_θ(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + E_θ( Σ_{i=m+1}^n (2ε_{i1} + ε_{i2}) | ε_{i1} + ε_{i2}, ε_{i3}, m + 1 ≤ i ≤ n ),   (2.4.24)

where N_{jm} = Σ_{i=1}^m ε_{ij}. Now,

    P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ² / (θ² + 2θ(1 − θ)),
    P_θ[ε_{i2} = 1 | ε_{i1} + ε_{i2} = 1] = 2θ(1 − θ) / (θ² + 2θ(1 − θ)),
    P_θ[ε_{ij} = 1 | ε_{i1} + ε_{i2} = 0] = 0,  1 ≤ j ≤ 2.   (2.4.25)

Thus, the EM iteration is

    θ̂_new = [ 2N_{1m} + N_{2m} + 2M_n / (2 − θ̂_old) ] / 2n,  where M_n = Σ_{i=m+1}^n (ε_{i1} + ε_{i2}).   (2.4.26)
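Assembling the E-step (2.4.24)–(2.4.25) with the M-step Ȧ(η_new) = 2nθ_new gives iteration (2.4.26) directly; here is a sketch in Python with hypothetical argument names for the observed counts:

    def em_lumped_hardy_weinberg(n1m, n2m, n3m, m_carriers, n3_lumped, n_iter=100):
        """EM iteration (2.4.26) for the lumped Hardy-Weinberg data of Example 2.4.4.
        n1m, n2m, n3m: genotype counts among the m fully observed individuals;
        m_carriers: M_n, lumped individuals with eps_i1 + eps_i2 = 1;
        n3_lumped: lumped individuals with eps_i3 = 1."""
        n = n1m + n2m + n3m + m_carriers + n3_lumped       # total sample size
        theta = 0.5                                        # starting value
        for _ in range(n_iter):
            expected_t = 2 * n1m + n2m + 2 * m_carriers / (2 - theta)  # E-step
            theta = expected_t / (2 * n)                               # M-step
        return theta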
It may be shown directly (Problem 2.4.12) that if 2N_{1m} + N_{2m} > 0 and M_n > 0, then θ̂_m converges to the unique root in (0, 1) of the fixed-point equation obtained by setting θ̂_new = θ̂_old = θ in (2.4.26),

    2nθ(2 − θ) = (2N_{1m} + N_{2m})(2 − θ) + 2M_n,

which is indeed the MLE when S is observed. □

Example 2.4.6. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y) ~ N(μ_1, μ_2, σ_1², σ_2², ρ), where θ = (μ_1, μ_2, σ_1², σ_2², ρ), σ_1², σ_2² > 0, |ρ| < 1. Suppose that some of the Z_i and some of the Y_i are missing as follows: for 1 ≤ i ≤ n_1 we observe both Z_i and Y_i; for n_1 + 1 ≤ i ≤ n_2 we observe only Z_i; and for n_2 + 1 ≤ i ≤ n we observe only Y_i. The observed data are

    S = {(Z_i, Y_i) : 1 ≤ i ≤ n_1} ∪ {Z_i : n_1 + 1 ≤ i ≤ n_2} ∪ {Y_i : n_2 + 1 ≤ i ≤ n}.

In this case a set of sufficient statistics for the complete data is

    T_1 = Z̄,  T_2 = Ȳ,  T_3 = n^{−1} Σ Z_i²,  T_4 = n^{−1} Σ Y_i²,  T_5 = n^{−1} Σ Z_i Y_i,

and E_θ T = (μ_1, μ_2, σ_1² + μ_1², σ_2² + μ_2², ρσ_1σ_2 + μ_1μ_2). We take θ̂_old = θ̂_MOM, where θ̂_MOM is the vector of method of moments estimates (μ̂_1, μ̂_2, σ̂_1², σ̂_2², r) of θ based on the observed data. To compute E_θ(T | S = s), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude

    E_θ(Y_i | Z_i) = μ_2 + ρσ_2(Z_i − μ_1)/σ_1,
    E_θ(Y_i² | Z_i) = [μ_2 + ρσ_2(Z_i − μ_1)/σ_1]² + (1 − ρ²)σ_2²,
    E_θ(Z_i Y_i | Z_i) = Z_i[μ_2 + ρσ_2(Z_i − μ_1)/σ_1],

with the corresponding Z on Y regression equations when conditioning on Y_i. This completes the E-step. For the M-step, compute

    μ̂_{1,new} = T_1(θ_old),  μ̂_{2,new} = T_2(θ_old),
    σ̂²_{1,new} = T_3(θ_old) − T_1²(θ_old),  σ̂²_{2,new} = T_4(θ_old) − T_2²(θ_old),
    ρ̂_new = [T_5(θ_old) − T_1(θ_old) T_2(θ_old)] / [σ̂_{1,new} σ̂_{2,new}],   (2.4.27)

where T_j(θ) denotes T_j with missing values replaced by the values computed in the E-step, j = 1, ..., 5. Now the process is repeated with θ̂_MOM replaced by θ̂_new. □
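A compact sketch of the E- and M-steps of Example 2.4.6 follows. The argument names are illustrative: the complete pairs and the two singly observed groups are passed separately, and the missingness is assumed independent of the values, as in the text.

    import numpy as np

    def em_bivariate_normal(z_full, y_full, z_only, y_only, n_iter=100):
        """EM for the bivariate normal with some Z's and some Y's missing."""
        z_full, y_full = np.asarray(z_full, float), np.asarray(y_full, float)
        z_only, y_only = np.asarray(z_only, float), np.asarray(y_only, float)
        n = len(z_full) + len(z_only) + len(y_only)
        zs, ys = np.concatenate([z_full, z_only]), np.concatenate([y_full, y_only])
        m1, m2, v1, v2 = zs.mean(), ys.mean(), zs.var(), ys.var()   # moment start
        rho = np.corrcoef(z_full, y_full)[0, 1]
        for _ in range(n_iter):
            s1, s2 = np.sqrt(v1), np.sqrt(v2)
            # E-step: conditional moments of the missing coordinate
            ey_z = m2 + rho * s2 * (z_only - m1) / s1
            ey2_z = ey_z ** 2 + (1 - rho ** 2) * v2
            ez_y = m1 + rho * s1 * (y_only - m2) / s2
            ez2_y = ez_y ** 2 + (1 - rho ** 2) * v1
            T1 = (z_full.sum() + z_only.sum() + ez_y.sum()) / n
            T2 = (y_full.sum() + ey_z.sum() + y_only.sum()) / n
            T3 = (np.sum(z_full ** 2) + np.sum(z_only ** 2) + ez2_y.sum()) / n
            T4 = (np.sum(y_full ** 2) + ey2_z.sum() + np.sum(y_only ** 2)) / n
            T5 = (np.sum(z_full * y_full) + np.sum(z_only * ey_z)
                  + np.sum(y_only * ez_y)) / n
            # M-step: complete-data MLEs at the imputed statistics, cf. (2.4.27)
            m1, m2 = T1, T2
            v1, v2 = T3 - T1 ** 2, T4 - T2 ** 2
            rho = (T5 - T1 * T2) / np.sqrt(v1 * v2)
        return m1, m2, v1, v2, rho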
Because the E-step involves imputing missing values, the EM algorithm is often called multiple imputation. Note also that if S(X) = X, then J(θ | θ_0) is log[p(X, θ)/p(X, θ_0)], which as a function of θ is maximized where the contrast −log p(X, θ) is minimized, and −E_{θ_0}[J(θ | θ_0)] is the Kullback–Leibler divergence (2.2.23).

Summary. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with Ɛ open (when it exists). We then use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in k-parameter canonical exponential families with Ɛ open when they exist. Important variants of and alternatives to this algorithm, including the Newton–Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally, in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.

2.5 PROBLEMS AND COMPLEMENTS

Problems for Section 2.1

1. Consider n systems with failure times X_1, ..., X_n assumed to be independent and identically distributed with exponential, Ɛ(λ), distributions.

(a) Find the method of moments estimate of λ based on the first moment.

(b) Find the method of moments estimate of λ based on the second moment.

(c) Combine your answers to (a) and (b) to get a method of moments estimate of λ based on the first two moments.

(d) Find the method of moments estimate of the probability P(X_1 ≥ 1) that one system will last at least a month.

2. Consider a population made up of three different types of individuals occurring in the Hardy–Weinberg proportions θ², 2θ(1 − θ) and (1 − θ)², where 0 < θ < 1.

(a) Show that T_3 = N_1/n + N_2/2n is a frequency substitution estimate of θ.

(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio θ/(1 − θ)?

(c) Suppose X takes the values −1, 0, 1 with respective probabilities p_1, p_2, p_3 given by the Hardy–Weinberg proportions. By considering the first moment of X, show that T_3 is a method of moments estimate of θ.

3. Suppose X_1, ..., X_n are i.i.d. with a beta, β(α_1, α_2), distribution. Find the method of moments estimates of α = (α_1, α_2) based on the first two moments.
(b) Exhibit method of moments estimates for VaroX = 8(1 . N 2 = n(F(t2J . (d) FortI tk.. See A.... If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F). ..Section 2.. (Zn. (Z2.5. .12.. . (a) Show that X is a method of moments estimate of 8. . 5. of Xi < xl/n. Yi) such that Zi < sand Yi < t n . The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants.8. ~ Hint: Express F in terms of p and F in terms of P ~() X = No. Y2 ).findthejointfrequencyfunctionofF(tl).. we know the order statistics. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X). Let (ZI.F(tk). X 1l be the indicators of n Bernoulli trials with probability of success 8.. . Give the details of this correspondence.. YI ). Show that these estimates coincide. ~  7. . . ..) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that. . The empirical distribution function F is defined by F(x) = [No. empirical substitution estimates coincides with frequency substitution estimates. (See Problem B. Let X I.. Hint: Consi~r (N I .. X n . given the order statistics we may construct F and given P.Xn be a sample from a population with distribution function F and frequency function or density p. < X(n) be the order statistics of a sample Xl. 6.5 Problems and Complements 139 Hint: See Problem 8. . which we define by ~ F ~( s. ~ ~ ~ (a) Show that in the finite discrete case.8)ln first using only the first moment and then using only the second moment of the population. . Nk+I) where N I ~ nF(t l ). Let X(l) < . The natural estimate of F(s. 4... of Xi n ~ =X.. t) is the bivariate empirical distribution function FCs.2. (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. Yn ) be a set of independent and identically distributed random vectors with common distribution function F. < ~ ~ Nk+1 = n(l  F(tk)).F(t. 8.. t).)). .2. < . . . = Xi with probability lin.t ) = Number of vectors (Zi. . Let Xl..
f(a.Yn . X n be LLd.". respectively..2). .L. .!'r«(J)) I .Y) = Z)(Yk ..) Note that it follows from (A. There exists a compact set K such that for f3 in the complement of K. Suppose X = (X"". Zn and Y11 .1.' where Z. . Vi). (b) Define the sample product moment of order (i.. Hint: Set c = p(X. . L I.Xn ) where the X.. ~ i ."ZkYk . as the corresponding characteristics of the distribution F.. I) = ". . .ZY. with (J identifiable. the result follows. (b) Construct an estimate of a using the estimate of part (a) and the equation a . suppose that g({3. In Example 2. z) is continuous in {3 and that Ig({3. .j2.). Y are the sample means of the sample correlation coefficient is given by Z11 •. >. Show that the least squares estimate exists. and so on.j) is given by ~ The sample covariance is given by n l n L (Zk k=l . the sampIe correlation. Since p(X. p(X. (J E R d. In Example 2. Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" . .) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. (See Problem 2.81 tends to 00. = #.. 1".IU9) that 1 < T < L .. . as X ~ P(J.. find the method of moments estimate based on i 12. . Let X".2.17.140 Methods of Estimation Chapter 2 (a) Show that F(. . (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX.. I . (a) Find an estimate of a 2 based on the second mOment. : J • .• . . " .1. 10. the sample covariance. 11. . {3) > c. j). {3) is continuous on K. are independent N"(O. k=l n  n . Hint: See Problem B. The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates.2 with X iiI and /is. Suppose X has possible values VI. z) I tends to 00 as \. 9.. Show that the sample product moment of order (i..4. 0).
Section 2..1) (iii) Raleigh.i. (c) If J.6. ~ (X. rep. ..9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). . Consider the Gaussian AR(I) model of Example 1... 0 ~ (p.. with (J E R d and (J identifiable. > 0.x (iv) Gamma.Pr(O)) . .6.36. can you give a method of moments estimate of {3? . A). Show that the method of moments estimate q = h(j1!. General method of moment estimates(!). X n are i.d. Hint: Use Corollary 1..  1/' Po) / .ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . .:::nt:::' ~__ 141 for some R k valued function h.1.l Lgj(Xi ). Iii = n. (b) Suppose P ~ po and I' = b are fixed. j i=l = 1..0) (ii) Beta.O) ~ (x/O')exp(x'/20'). as X rv P(J. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1.p(x. = h(PI(O). .10)..p:::'. 13. 0). in estimate..5.l and (72 are fixed. 1'(1. Let g. where U. A).(X) ~ Tj(X).0 > 0 14. fir) can be written as a frequency plugin estimate. .6.5 Problems and Com". p fixed (v) Inverse Gaussian. . When the data are not i. r.. 1'(0.. find the method of moments estimates (i) Beta.1. to give a method of moments estimate of (72.. In the fOllowing cases. Vk and that q(O) for some Rkvalued function h.) to give a method of moments estimate of p. .i. Suppose X 1. . See Problem 1. 1 < j < k.d.:::m:::. Use E(U[). .. . IG(p. Suppose that X has possible values VI. (a) Use E(X..• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments. Let 911 .
8... bI) are as in Theorem 1. . SF. + '2P4 + 2PS . respectively.3. = Y .. .) ..S 1 .z.._ . t n on the position of the object.•• Om) can be expressed as a function of the moments. n... .. I.. FF. 11P2 + "2P4 + '2P6 .• l Yn are taken at times t 1. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S..6). Let X = (Z. n 1  Problems for Sectinn 2.J is' _ n n..z.4.  1.)]. 1 6 and let Pl ~ N j / n..)] = [Y.1. let the moments be mjkrs = E(XtX~).. SI.. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28. HardyWeinberg with six genotypes....ZY ~ ... and (J3..Y....g(l3o.... Let 8" 8" and 83 denote the probabilities of S.. where (Z. I • ... For a vector X = (Xl. I I . we define the empirical or sample moment to be j ~ . Q. _ (Z)' ' a. ..! . 1". Hint: [y..1".. I . k> Q mjkrs L... 1 I .83 8t Let N j be the number of plants of genotype j in a sample of n independent plants. 1 q. of observations...z.... >. ' l X q ). j > 0. II. S = 1. Show that <j < i ! .)] + [g(l3o.1 " Xir Xk J.142 Methods of Estimation Chapter 2 15.2 1.. An Object of unit mass is placed in a force field of unknown constant intensity 8... .g(I3. 17.. I. where OJ = 1. i = 1.~b. .."" X iq ). .g(I3. = n 1 L:Z..bIZ. ... Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z.. ).  t=1 liB = (0 1 .. b. .. 5 SF 28.. ()2. The reading Yi . 1 . Readings Y1 . Establish (2... .... 83 6 IF 28.. . P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ.. the method of moments l by replacing mjkrs by mjkrs. and IF. and F.z..q. I . Multivariate method a/moments. For independent identically distributed Xi = (XiI. . . k > 0.. Y) and (ai. Y) and B = (ai. 16... and F at one locus resulting in six genotypes labeled SS..
Find the least squares estimates of ()l and B . . x > c.t of the best (MSPE) predictor of Yn + I ? 4. (a) f(x.. 13)..y. Suppose that observations YI . to have mean 2. . ~p ~(yj}) ~ . if we consider the distribution assigning mass l/n to each of the points (zJ.Yi). 7. . . B. i = 1." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi. The regression line minimizes the sum of the squared vertical distances from the points (Zl. Show that the least squares estimate is always defined and satisfies the equations (2. A new observation Ynl. . What is the least squares estimate based on YI . (exponential density) (b) f(x. Yl). .. Find the least squares estimate of 0:. .4)(2. 3. . .2 may be derived from Theorem 1. Suppose Yi ICl".Section 2. Hint: Write the lines in the fonn (z. . Zn and that the linear regression model holds. x > 0. the range {g(zr. . and 13 ranges over R d 8. nl + n2.5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l . Find the least squares estimates for the model Yi = 8 1 (2. X n denote a sample from a population with one of the following densities Or frequency functions. Find the line that minimizes the sum of the squared perpendicular distance to the same points. 2 10.)..<) ~ .I is to be taken at time Zn+l.. Let X I.. B) = Be'x. . Yn).z.Yn).3.2..We suppose the oand be uncorrelated with constant variance. . where €nlln2 are independent N(O. all lie on aline. B) = Bc'x('+JJ...(zn. i = 1.4.n.2. 9. . Y. (J2) variables..5) provided that 9 is differentiable with respect to (3" 1 < i < d. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1.. in fact. B > O.B. . (a) Let Y I . " 7 5. . 13).. Find the LSE of 8. . . . Show that the fonnulae of Example 2. (Pareto density) .. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants.13 E R d } is closed. nl and Yi = 82 + ICi.)' 1+ B5 6.4.. Yn have been taken at times Zl. < O... . (zn. B > O. . g(zn.BJ . . Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi... i + 1. _. c cOnstant> 0. .1....2. Find the MLE of 8. Hint: The quantity to be minimized is I:7J (y.6) under the restrictions BJ > 0.
X n be a sample from a U[0 0 + I distribution. c constant> 0.144 Methods of Estimation Chapter 2 (c) f(". 0 E e and let 0 denote the MLE of O. Suppose that Xl.e. 0 > O.l exp{ OX C } .1). .2 are unknown.5. (beta. 0) = (x/0 2) exp{ _x 2/20 2 }. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. < I" < 00.X)" > 0. . be a family of models for X E X C Rd. 0) = cO". x > 1". x> 0. density) (e) fix. e c Let q be a map from e onto n. 0 <x < I. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".t and a 2 . then the unique MLEs are (b) Suppose J. . WEn .. (Pareto density) (d) fix. MLEs are unaffected by reparametrization. Show that the MLE of ry is h(O) (i. Hint: Let e(w) = {O E e : q(O) = w}.t and a 2 are both known to be nonnegative but otherwise unspecified..5 show that no maximum likelihood estimate of e = (1". . X n • n > 2.) ! ! !' ! 14. Thus. I). Show that if 0 is a MLE of 0. Define ry = h(8) and let f(x.0"2) 15. x > 0.I")/O") . (a) Let X . Because q is onto n.O) where 0 ~ (1".1. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}.. ry) denote the density or frequency function of X in terms of T} (Le. Let X I. and o belongs to only one member of this partition.. 0> O. J1 E R. . then {e(w) : wEn} is a partition of e.Pe. 0) = VBXv'O'.a)l rather than 0. 12.: I I \ .16(b). 0 > O.0"2). say e(w). Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. be independently and identically distributed with density f(x. .  e .r(c+<). X n . reparametrize the model using 1]).0"2 (a) Find maximum likelihood estimates of J. iJ( VB. .  under onetoone transformations).l L:~ 1 (Xi .P > I. ncR'. > O. bl to make pia) ~ p(b) = (b .. the MLE of w is by definition ~ W.. q(9) is an MLE of w = q(O). I < k < p. Show that depends on X through T(X) only provided that 0 is unique. I i' 13. Hint: Use the factorization theorem (Theorem 1. .:r > 0. (Rayleigh density) (f) f(x. Let X" .II ". I I I (b) LetP = {PO: 0 E e}. Pi od maximum likelihood estimates of J1 and a 2 . (We write U[a.. (Wei bull density) 11. = X and 8 2 ~ n. 0) = Ocx c . 0 > O. If n exists. for each wEn there is 0 E El such that w ~ q( 0). n > 2. I • :: I Suppose that h is a onetoone function from e onto h(e). Hint: You may use Problem 2.. c constant> 0. (72 Ii (a) Show that if J1 and 0. ~ I in Example 2. they are equivariant 16. is a sample from a N(/l' ( 2 ) distribution. 00 = I 0" exp{ (x .2. .
. . Bremner. . and have commOn frequency function. . Yn which are independent..[X < r] ~ 1 LB k=l k  1 (1_ B) = B'. (a) The observations are indicators of Bernoulli trials with probability of success 8. In the "life testing" problem 1. 0 < a 2 < oo}. Show that the maximum likelihood estimate of 0 based on Y1 . k=I.1 (1 . We want to estimate B and Var. 1) distribution.5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O). Derive maximum likelihood estimates in the following models..Xn be independently distributed with Xi having a N( Oi. . 19. A general solution of this and related problems may be found in the book by Barlow. (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. find the MLE of B. (b) The observations are Xl = the number of failures before the first success.. .1 (I_B). < 00.n . Show that maximum likelihood estimates do not exist. identically distributed.X I = B(I. and so on. k ~ 1. 21.2. (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely." Y: . Thus..P...0") E = {(I'. where 0 < 0 < 1. if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods.) Let M = number of indices i such that Y i = r + 1. 17. but e . (KieferWolfowitz) Suppose (Xl. . . We want to estimate O.. Bartholomew.  .B) = 1..B) =Bk . B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'. 1 <i < n.B).O") : 00 < fl. Let Xl. .r f(r + I. . 20.B).Section 2. in a sequence of binomial trials with probability of success fJ. and Brunk (1972). X 2 = the number of failures between the first and second successes... M 18. Suppose that we only record the time of failure. a model that is often used for the time X to failure of an item is P. Censored Geometric Waiting Times. we observe YI . .6. (We denote by "r + 1" survival for at least (r + 1) periods. If time is measured in discrete periods. •• . X n ) is a sample from a population with density f(x. 16(i).[X ~ k] ~ Bk . f(k.. Y n IS _ B(Y) = L..~1 ' Li~1 Y..
. n).5 86.0"2) if. Tool life data Peed 1 Speed 1 1 Life 54..Y)' i=1 j=1 /(rn + n).146 Methods of Estimation Chapter 2 that snp. J2 J2 o o o I 3. Suppose X has a hypergeometric. (1') where e n ii' = L(Xi .. Set zl ~ . x) as a function of b.z!.0 5.tl. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise..8 3..2 4. . and assume that zt'·· .x n .5 66. distribution.0 11.6. x)/ L(b. Suppose Y. respectively. = fl.< J. Show that the MLE of = (fl. Ii equals one of the numbers 22.8 2.1 2.8 0. . Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. 23.4)(2.0 I . 1 1 1 o o . .2.jp): 0 <j. N.X)' + L(Y. Assume that Xi =I=. 1985). Xl.t. 1 <k< p}.8 14.Xj for i =F j and that n > 2. 1i(b. where'i satisfy (2.2. .4 o o o o o o o o o o o o o Life 20.. the following data were obtained (from S.. Polynomial Regression. Y.2 4.. (1') is If = (X.1.5 I' I I .a 2 ) = Sllp~.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3.J. TABLE 2. and only if. .up(X.a 2 ) and NU".' wherej E J andJisasubsetof {UI . Let Xl.'.9 3.. .2 3.ji.I' fl. . " Yn be two independent samples from N(J.5 0. Weisberg. II I I :' 24.0 2.6).(Zi) + 'i. Hint: Coosider the mtio L(b + 1.0 0. where [t] is the largest integer that is < t.Xm and YI .p(x. i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution). (1') populations. 7 .
2.13)/6. ZI ~ (feed rate .1.. Set Y = Wly.y)!(z. of Example 2. Y) have joint probability P with joint density ! (z. Use these estimated coefficients to compute the values of the contrast function (2.4)(2. Y')/Var Z' and pI(P) = E(Y*) 11.2.6) with g({3.4)(2. Show that the follow ZDf3 is identifiable.2. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +. Y)[Y . 25.. (2.y)dzdy. l/Var(Y I Z = z). in Problem 2. (a) The parameterization f3 (b) ZD is of rank d. E{v(Z.2.2.Y = Z D{3+€ satisfy the linear regression mooel (2.y)/c where c = I Iv(z. ZD = Wz ZD and € = WZ€..2. (12 unknown.900)/300.(P) coincide with PI and 13. Show that (3.6.1).ZD is of rank d. (c) Zj.Section 2. Y)Z') and E(v(Z. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W. y) > 0 be a weight funciton such that E(v(Z. Let W.(P) = Cov(Z'. (2. (a) Let (Z'.19). . The best linear weighted mean squared prediction error predictor PI (P) + p. Consider the model (2. Show that p. z. Let (Z. (a) Show tha!. y). . Y)Y') are finite. let v(z. However. Being larger. Derive the weighted least squares nonnal equations (2.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models.1 (see (B. Two models are contemplated (a) Y ~ 130 + pIZI + p.3.Z)]').  27.y) defined .y)!(z. That is.(P)Z of Y is defined as the minimizer of t = zT {3. .l be a square root matrix of W.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life.z. (b) Let P be the empirical probability . This will be discussed in Volume II.1). the second model provides a better approximation. = (cutting speed . . z) ~ Z D{3. 28..(P) and p. z) ing are equivalent. this has to be balanced against greater variability in the estimated coefficients. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.1.6) with g({3.2.2.8 and let v(z. Both of these models are approximations to the true mechanism generating the data.6)). 26. weighted least squares estimates are plugin estimates.(P)E(Z·).Y') have density v(z. ~.(b l + b.5) for (a) and (b). ..
155 4. Is ji.+l I Y .097 1.393 3.).196 2.081 2..300 6.379 2.916 6.392 4. 2.100 3. then the  f3 that minimizes ZD.17. = L.723 1.€n+l are i.958 10.6) W.093 1. .229 4.6) T 1 (Y . Show that the MLE of . .665 2. _'..2.5. \ (e) Find the weighted least squares estimate of p.ZD.. . Hint: Suppose without loss of generality that ni 0. = nq = 0.. .091 3. Then = n2 = . i = 1.053 4. . ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In. .599 0..nk > 0..703 4..669 7.856 2.908 1.020 8.918 2. where El" .788 2.476 3.038 3.093 5.a. with mean zero and variance a 2 • The ei are called moving average errors. That is. j = 1.971 0.. . Suppose YI .131 5. suppose some of the nj are zero..) ~ ~ (I' + 1'..6) is given by (2. In the multinomial Example 2.046 2. I Yi is (/1 + Vi). in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI . where 1'.j}. 0) = II j=q+l k 0. ' I (f) The following data give the elapsed times YI .ZD.064 5.611 4.968 2.716 7.20). n.) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e.453 1.274 5. Let ei = (€i + {HI )/2.1.054 1.058 3. <J > 0..071 0.391 0.039 9.860 5.858 3.6)T(y ..564 1. Consider the model Yi = 11 + ei.249 1. The data (courtesy S. (See Problem 2.511 3.676 5..'. n. (a) Show that E(1'.. J.968 9. . Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.457 2. k.019 6. different from Y? I I TABLE 2. .jf3j for given covariate values {z.480 5. 31.397 4. Chon) shonld be read row by row.360 1.834 3.. .148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d.666 3.075 4. nq+1 > p(x. 1'.114 1. . 4 (b) Show that Y is a multivariate method of moments estimate of p. k. Yn are independent with Yi unifonnly distributed on [/1i .i.2.d.301 2. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore..870 30.ZD.8.~=l Z.156 5.. . i = 1. . /1i + a}. which vanishes if 8 j = 0 for any j = q + 1. .. = (Y  29..455 9.921 2. .1. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay..(Y .689 4.582 2.182 0.
F)(x) *F denotes convolution. as (Z. Let g(x) ~ 1/". 34. be i. be the Cauchy density.{:i... . (a) Show that the MLE of ((31. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass . XHL i& at each point x'.(3p that minimizes the least absolute deviation contrast function L~ I IYi .. and x. . . A > 0..Section 2.).d. x E R.. = I' for each i.. . (See (2. . .  Illi < 1.5 Problems and Complements 149 (f3I' . Yn are independent with Vi having the Laplace density 1 2"exp {[Y.i.Jitl. . (3p. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l). . Give the MLE when . where Iii = ~ L:j:=1 ZtjPj· 32. . where Jii = L:J=l Zij{3j. let Xl and X.32(b). y)T where Y = Z + v"XW.O) Hint: See Problem 2.:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x./3p that minimizes the maximum absolute value conrrast function maxi IYi ./Jr ~ ~ and iii.d. .) Suppose 1'.. (a) Show Ihat if Ill[ < 1.l1. i < j.7 with Y having the empirical distribution F.. ~ f3I.17). The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). Show that the sample median ii is the minimizer of L~ 1 IYi . Let x. .Ili I and then setting (j = n~I L~ I IYi .. Yn ordered from smallest to largest... Let X.4.3.) + Y(r+l)] where r = ~n. with density g(x . Find the MLE of A and give its mean and variance. Z and W are independent N(O.. 1).J. . be the observations and set II = ~ (Xl .. the sample median yis defined as ~ [Y(.(1 + x'j. Let B = arg max Lx (8) be "the" MLE... . . a) is obtained by finding (31. .{3p.i. 35.1. Hint: See Example 1. .I'. Y( n) denotes YI .I/a}..fin are called (b) If n is odd. . + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x . These least absolute deviation estimates (LADEs). 8 E R.6.x. ~ 33.. " . Hint: Use Problem 1.8). be i.2. a)T is obtained by finding 131. Show that XHL is a plugin estimate of BHL. If n is even....d. then the MLE exists and is unique.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}..& at each Xi.tLil and then setting a = max t IYi . = L Ix. Suppose YI .28Id(F. .
> h(y) . Show that (a) The likelihood is symmetric about ~ x. . X. ~ Let B ~ arg max Lx (B) be "the" MLE.B) g(x + t> .t> . 38. Let Vj and W j denote the number of hits of type 1 and 2 on day j.. BI ) (p' (x.B I ) (K'(ryo. B) is and that the KullbackLiebler divergence between p(x. b). .f)) in the likelihood equation. (7') and let p(x.. Also assume that S. . B) denote their joint density. Ii I. . If we write h = logg. Sl. ~ (d) Use (c) to show that if t> E (a. . B) = g(xB). and 8 2 are independent. 1 n + m.. . and 5.. 51.B)g(x. B ) and p( X. then B is not unique. where x E Rand (} E R. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). that 8 1 E. ry) = p(x. where Al +.B)g(i . ~ (XI + x.. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . .ryt» denote the KullbackLeibler divergence between p( x.b).+~l Vi and 8 2 = E. . Let Xl and X2 be the observed values of Xl and X z and write j. . Define ry = h(B) and let p' (x. and positive everywhere.+~l Wj have P(rnAl) and P(mA2) distributions. j = n + I.. Let K(Bo. . .)/2 and t> = (XI .\2 = A. 9 is continuous. P(nA). B) and pix. then the MLE is not unique. a < b..d. Bo) is tn( B. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x .) be a random samplefrom the distribution with density f(x. n. Let (XI. h I (ry») denote the density or frequency function of X for the ry parametrization. o Ii !n 'I . Assume that 5 = L:~ I Xi has a Poisson.B). Assume :1. 37. 2.j'. such that for every y E (a.. Show that the entropy of p(x. .h(y . distribution..x2)/2. Find the MLEs of Al and A2 hased on 5. Let 9 be a probability density on R satisfying the following three conditions: I.0). 36. Suppose X I. (b) Either () = x or () is not unique. The likelihood function is given by Lx(B) g(xI . 1985).150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. symmetric about 0. N(B.B )'/ (7'. (a. ryl Show that o n ». 'I i 39. 9 is twice continuously differentiable everywhere except perhaps at O.Xn are i.i. Let X ~ P" B E 8. i = 1. Let Xi denote the number of hits at a certain Web site on day i. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. then h"(y) > 0 for some nonzero y. = I ! I!: Ii.. 3.h(y) . Suppose h is a II function from 8 Onto = h(8). ryo) and p' (x..
1'.. . I' E R. and 1'. . The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards..2. 0). Yn).j ~ x <0 1. n > 4..) Show that statistics. and g(t. /3.7) for estimating a.. 1'. = L X. X n be a sample from the generalized Laplace distribution with density 1 O + 0.. j = 1.Section 2. [3.)/o]}" (b) Suppose we have observations (t I .I[X. n where A = (a.~t square estimating equations (2. (tn. 41. where 0 (a) Show that a solution to this equation is of the form y (a. on a population of a large number of organisms. j 1 1 cxp{ x/Oj}.O) ~ a {l+exp[[3(tJ1.. [3. 42.)/o)} (72. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0. Let Xl. where OJ > O. h(z. < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. An example of a neural net model is Vi = L j=l p h(Zij.5 Problems and Complements 151 40. a > 0. 1 En are uncorrelated with mean 0 and variance estimating equations (2.. [3.. 1959. n. . a.1. .I[X.. 6.. Seber and Wild. yd 1 ••• . i = 1. [3. . . + C. Variation in the population is modeled on the log scale by using the model logY. . .0). .7) for a. For the ca. . T. exp{x/O.olog{1 + exp[[3(t. > OJ and T. Aj) + Ei. give the lea. Suppose Xl. 1). . o + 0. . .}. = LX. 1'). (. and fj.  J1. Give the least squares where El" •. = log a .5.~e p = 1. i = 1.1. 1 X n satisfy the autoregressive model of Example 1. a get.1.c n are uncorrelated with mean zero and variance (72. . (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism. x > 0.p. 0 > O. and /1. [3 > 0. A) = g(z.
fi exists iff (Yl . Hint: I . S..3. < . .y.3. In a sample of n independent plants..d.. fi) = a + fix. C2' y' 1 1 for (a) Show that the density of X = (Xl.(8 . Hint: Let C = 1(0). .152 Methods of Estimation Chapter 2 (aJ If /' is known.0.l  L: ft)(Xi /. I . n i=l n i=l n i=1 n i=l Cl LYi + C. Yn) is not a sequence of 1's followed by all O's or the reverse. This set K will have a point where the max is attained. .1 > _£l. = OJ.x. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p. (X. Suppose Y I . the bound is sharp and is attained only if Yi x. .).J. 0 for Xi < _£I.. > 0.. + 0... Consider the HardyWeinberg model with the six genotypes given in Problem 2. Let Xl.i. n > 2. . . = If C2 > 0.'" £n)T of autoregression errors.02): 01 > 0. 3.15. (b) Show that the likelihood equations are equivalent to (2.. X n be i.. • II ~. Xn· Ip (x. Under what conditions on (Xl 1 . show that the MLE of fi is jj =  2. Yn are independent PlY. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. .. .x... .0 .l.. I < j < 6. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4.p). L x.. .5). There exists a compact set K c e such that 1(8) < c for all () not in K.) 1 II(Z'i /1)2 (b) If j3 is known. find the covariance matrix W of the vector € = (€1.::. + 8. a. = 1] = p(x"a. .1. ..4) and (2. t·.. Is this also the MLE of /1? Problems for Section 2.) Then find the weighted least square estimate of f.fi) = 1log p pry.3.3 1.3. write Xi = j if the ith plant has genotype j. + C1 > 0).Xi)Y' < L(C1 + c... '12 = A and where r denotes the gamma function. Give details of the proof or Corollary 2. gamma. < Show that the MLE of a. < I} and let 03 = 1.)I(c. = L(CI + C. r(A. Prove Lenama 2. Let = {(01. _ C2' 2. 1 Xl <i< n.l. .
C! > 0.t). . distribution where Iti = E(Y.. 0:0 = 1..3.' . . the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i .0 < ZI < .8) ... show 7.. Suppose Y has an exponential. convex set {(Ij. Let Xl. 9. Then {'Ij} has a subsequence that converges to a point '10 E f.O. then it must contain a sphere and the center of the sphere is an interior point by (B.0). .A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00.3.l E R. See also Problem 1.n) are the vertices of the :Ij <n}. Use Corollary 2. n > 2.1 <j < ac k1. . and assume forw wll > 0 so that w is strictly convex. . 0 < Zl < .0..d. < Zn Zn. Zj < . ji(i) + ji. (0.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0.Section 2. J. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 .}..1)..1 to show that in the multinomial Example 2.6. Let Y I . In the heterogenous regression Example 1.40.. with density. a(i) + a.. if n > 2. < Zn. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations..5 Problems and Complements 153 n 6. w( ±oo) = 00. (3) Show that. X n be Ll. Show that the boundary of a convex C set in Rk has volume 0. Hint: If BC has positive volume..9. [(Ai). " (a) Show that if Cl: > ~ 1.lkIl :Ij >0..Xn E RP be i. Let XI. .3. > 3. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm.fl. (O. the MLE 8 exists and is unique.. . . .3. Prove Theorem 2. 10.. .en....z=7 11.Ld.n. _. • It) w' ( Xi It) .3. 8. ~fo (X u I. 12..ill T exists and is unique. MLEs of ryj exist iffallTj > 0. Hint: The kpoints (0. . 1 <j< k1.10 with that the MLE exists and is unique. (b) Show that if a = 1. .0).. Yn denote the duration times of n independent visits to a Web site. .) ~ Ail = exp{a + ilz. .'~ve a unique solution (. fe(x) wherec. e E RP. and Zi is the income of the person whose duration time is ti.1 (a) ~ ~ c(a) exp{ Ix .6. < Show that the MLE of (a. . the MLE 8 exists but is not unique if n is even.
W is strictly convex and give the likelihood equations for f.. b) ~ I w( aXi . verify the Mstep by showing that E(Zi I Yi).. it is unique. show that the estimates of J11. for example.:. Y. and p when J1.d.9). Yn ) be a sample from a N(/11. il' i. . > 0. Golnb and Van Loan.)(Yi . Apply Corollary 2. O'~. complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2. 8aob :1 13. Note: You may use without proof (see Appendix B. pal0'2 + J11J.4. (b) Reparametrize by a = .2.6. P coincide with the method of moments estimates of Problem 2.l and CT.1. provided that n > 3. L:. b successively.. IPl < 1.i. See. .. Show that if T is minimal and t: is open and the MLE doesn't exist.bo) D(a. 0'1 Problems for Section 2.4. Let (X10 Yd.6.2 are assumed to be known are = (lin) L:.2)/M'''2] iI I'. (b) If n > 5 and J11 and /12 are unknown. (Xn . (See Example 2.. . /12.4.2)2.'2). = (lin) L:.b) . E" . 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex.4 1.J1. {t2.(Yi . O'~ + J1~. ~J ' . .1)". rank(ZD) = k. J12. I (Xi . EM for bivariate data. b) = x if either ao = 0 or 00 or bo = ±oo. ag. p) population. 3.) Hint: (a) Thefunction D( a.:. . (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations. (i) If a strictly convex function has a minimum.) has a density you may assume that > 0. b = : and consider varying a. and p = [~(Xi . . Hint: (b) Because (XI. "i .' ! (a) In the bivariate nonnal Example 2..) I . then the coordinate ascent algorithm doesn't converge to a member of t:. N(O. En i. respectively.154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I.. (I't') IfEPD 8a2 a2 D > 0 an d > 0.. O'?. O'~ + J1i.n log a is strictly convex in (a. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <.b)_(ao. Chapter 10. ..3.J1. ar 1 a~. 2.8. b) and lim(a. (a) Show that the MLEs of a'f.1 and J1.J1.3. 1985.2). .J1. EeT = (J11..
P.B)n[n . given X>l.. 1 and (a) Show that X .B)"]'[(1 . it is desired to estimate the proportion 8 that has the genetic trait. For families in which one member has the disease. Y'i).B)[I.(1 _ B)n]{x nB . 1 < i < n. when they exist. Do you see any problems with >.2B)x + nB'] _. Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0".. B= (A.? (b) Give as explicitly as possible the E. I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique.(1 . 6. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it.(I. . Suppose the Ii in Problem 4 are not observed.and Msteps of the EM algorithm for this problem.. i ~ 1'.I ~ + (1  B)n] .[lt ~ 1] = A ~ 1 .B)n} _ nB'(I. .3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I. 1]. 2:Ji) .n.Section 2. be independent and identically distributed according to P6.. and that the (b) Use (2.. Suppose that in a family of n members in which one has the disease (and. as the first approximation to the maximum likelihood .Ii). is distributed according to an exponential ryl < n} = i£(1 I) uzuy._x(1.x = 1. Yd : 1 < i family with T afi I a? known. 2 (uJ L }iIi + k 2: Y.1) x Rwhere P. j = 0. Because it is known that X > 1. IX > 1) = \ X ) 1 (1 6)" .O"~). Let (Ii.{(Ii. where 8 = 80l d and 8 1 estimate of 8. 1') E (0. and given II = j.5 Problems and Complements 155 4. (a) Justify the following crude estimates of Jt and A. Hint: Use Bayes rule.. Y1 '" N(fLl a}). also the trait). B) variable. (c) Give explicitly the maximum likelihood estimates of Jt and 5. I' >. ~2 = log C\) + (b) Deduce that T is minimal sufficient. B E [0.4. X is the number of members who have the trait. thus. the model often used for X is that it has the conditional distribution of a B( n. 8new .[lt ~ OJ.[1 .
W = c] 1 < a < A. . + Vbc where 00 < /l. X n be i. ~ 0. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case. Let and iJnew where ).156 (c) [f n = 5. 7.. N +bc given by < N ++c for all a. Hint: Apply the argument of the proof of Theorem 2.4.2.Na+c. (a) Suppose for aU a. . Let Xl. b1 C. b1 c and then are . Let Xl. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a. X 3 = a. PIU = a. . Define TjD as before. iff U and V are independent given W. find (}l of (b) above using (} =   x/n as a preliminary estimate.A(iJ(A)).d. 8. v < 00. N . hence.X 2 .e.9)')1 Suppose X.b. W).1 generated by N++c. n N++ c ++c  . the sequence (11ml ijm+l) has a convergent subse quence.v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i. N++ c N a +c N+ bc Pabc = . V = b. (1) 10gPabc = /lac = Pabe. 9. .1 (1 + (x e. X.N+bc where N abc = #{i : Xi = (a.2 noting that the sequence of iterates {rymJ is bounded and. = 1. v vary freely is an exponential family of rank (C . * maximizes = iJ(A') t.9) = ".i.c Pabc = l.c)} and "+" indicates summation over the ind~x. c]' Show that this holds iff PIU ~ a..1) + C(A + B2) = C(A + B1) .riJ(A) . Consider the following algorithm under the conditions of Theorem 2. (h) Show that the family of distributions obtained by letting 1'.X3 be independent observations from the Cauchy distribution about f(x. 1 <c < C and La. . where X = (U. Show that the sequence defined by this algorithm converges to the MLE if it exists. X Methods of Estimation Chapter 2 = 2. 1 < b < B.b. V. (c) Show that the MLEs exist iff 0 < N a+c .4.
. N±bc . but now (2) logPabc = /lac + Vbc + 'Yab where J1.a) 1 2 .c. Show that the algorithm converges to the MLE if it exists and di· verges otheIWise.5 Problems and Complements 157 Hint: (b) Consider N a+ c .and Msteps of the EM algorithm in this case.4. = fo(x  9) where fo(x) 3'P(x) + 3'P(x . (b) Give explicitly the E. (a) Show that this is an exponential family of rank A + B + C .1) + (A . 10.I) + (B .(A + B + C).[X = x I SeX) = s] = 13. c" and "a. Hint: P.N++ c / B.I)(C . Justify formula (2.(x) :1:::: 1(S(x) ~ = s). c" parameters. 12.5 has the specified mixtnre of Gaussian distribution.N++c/A. Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations.b'.c' obtained by fixing the "b.:. (a) Show that S in Example 2.1)(B .Section 2.a>!b". f vary freely. (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. v. Suppose X is as in Problem 9.4.I)(C . Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=. Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~. Let f.3 + (A .=_ L ellalblp~~~'CI a'.8). 11..I) ~ AB + AC + BC .
Show for 11 = 1 that bisection may lead to a local maximum of the likelihood. Hint: Show that {( 8m . ~ 16.. find the probability that E(Y. For X . 14. 17. (2) The frequency plugin estimates are sometimes called Fisher consistent. ~ . . we observe only}li.2. • . 18. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986).i. 2 ). I I' . Complete the E. In Example 2. if a is sufficiently large..d. Zn are ii. Hint: Use the canonical nature of the family and openness of E. Establish the last claim in part (2) of the proof of Theorem 2.4. Ii . If Vi represents the seriousness of a disease. That is..{Xj }. a 2 . .4. Verify the fonnula given in Example 2. (J*) and necessarily g. {31 1 a~.5.3.n}. 15.and Msteps of the EM algorithm for estimating (ftl.) underpredicts Y. . . A.d.p is the N(O} '1) density. suppose Yi is missing iff Vi < 2. {32).6." . That is.158 Methods of Estimation Chapter 2 and r.4. I Z.. For instance. Zl. Bm+d} has a subsequence converging to (0" JJ*) and. this assumption may not be satisfied. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j . EM and Regression. Bm + I)} has a subsequence converging to (B* . Hint: Show that {(19 m.. These considerations lead essentially to maximum likelihood estimates. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n.4. . En a an 2. NOTES . . For example.6 . given X . If /12 = 1. in Example 2. the process determining whether X j is missing is independent of X j .4. Limitations a/the EM Algorithm. R. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. consider the model I I: ! I I = J31 + J32 Z i + Ei I are i.1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2. given Zi. . al = a2 = 1 and p = 0. N(ftl. the probability that Vi is missing may depend on Zi. = {(Zil Yi) Yi :i = 1. N(O. the "missingness" of Vi is independent of Yi. Establish part (b) of Theorem 2. Noles for Section 2. This condition is called missing at random. suppose all subjects with Yi > 2 drop out of the study.3 for the actual MLE in that example.5."" En.~ go.6. thus.. ! .. and independent of El. necessarily 0* is the global maximizer. where E) . but not on Yi.
Notes for Section 2.2

(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).

(2) For further properties of Kullback–Leibler divergence, see Cover and Thomas (1991).

Note for Section 2.3

(1) Recall that in an exponential family, for any A, P[T(X) ∈ A] = 0 for all P ∈ P or for no P ∈ P.

Note for Section 2.5

(1) In the econometrics literature (e.g., Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrast estimates is often called a generalized method of moments estimate.

2.7 REFERENCES

BARLOW, R. E., BARTHOLOMEW, D. J., BREMNER, J. M., AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.

BAUM, L. E., PETRIE, T., SOULES, G., AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164–171 (1970).

BISHOP, Y. M. M., FIENBERG, S. E., AND P. W. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.

CAMPBELL, J. Y., LO, A. W., AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DAHLQUIST, G., BJÖRK, A., AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DEMPSTER, A. P., LAIRD, N. M., AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1–38 (1977).

DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199–200 (1985).

EISENHART, C., "The Meaning of 'Least' in Least Squares," Journal Wash. Acad. Sciences, 54, 24–33 (1964).

FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.

FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher, 1950) New York: J. Wiley and Sons, 1922.

GOLUB, G. H., AND C. F. VAN LOAN, Matrix Computations Baltimore: Johns Hopkins University Press, 1985.

HABERMAN, S. J., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
KOLMOGOROV, A. N., "On the Shannon Theory of Information Transmission in the Case of Continuous Signals," IRE Trans. Inform. Theory, 102–108 (1956).

LITTLE, R. J. A., AND D. B. RUBIN, Statistical Analysis with Missing Data New York: J. Wiley, 1987.

MACLACHLAN, G. J., AND T. KRISHNAN, The EM Algorithm and Extensions New York: Wiley, 1997.

MOSTELLER, F., "Association and Estimation in Contingency Tables," J. Amer. Statist. Assoc., 63, 1–28 (1968).

RICHARDS, F. J., "A Flexible Growth Function for Empirical Use," J. Exp. Botany, 10, 290–300 (1959).

RUPPERT, D., AND M. P. WAND, "Multivariate Locally Weighted Least Squares Regression," Ann. Statist., 22, 1346–1370 (1994).

SEBER, G. A. F., AND C. J. WILD, Nonlinear Regression New York: Wiley, 1989.

SHANNON, C. E., "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379–423, 623–656 (1948).

SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 6th ed. Ames, IA: Iowa State University Press, 1967.

STIGLER, S., The History of Statistics Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.

WU, C. F. J., "On the Convergence Properties of the EM Algorithm," Ann. Statist., 11, 95–103 (1983).
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented. However, actual implementation is limited. In Section 3.4 we study, in the context of estimation, the relation of the two major decision theoretic principles to the non-decision theoretic principle of maximum likelihood and to the somewhat out of favor principle of unbiasedness. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. Our examples are primarily estimation of a real parameter. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model P = {P_θ : θ ∈ Θ}, an action space A, and a loss function l(θ, a), then for data X ~ P_θ and any decision procedure δ, randomized or not, we can define its risk function R(·, δ) : Θ → R⁺ by R(θ, δ) = E_θ l(θ, δ(X)). We think of R(·, δ) as measuring a priori the performance of δ for this model. Strict comparison of δ₁ and δ₂ on the basis of the risks alone is not well defined unless R(θ, δ₁) ≤ R(θ, δ₂) for all θ or vice versa. However, by introducing a Bayes prior density (say) π for θ, comparison becomes unambiguous by considering the scalar Bayes risk

r(π, δ) = E R(θ, δ) = E l(θ, δ(X)),    (3.2.1)

where (θ, X) is given the joint distribution specified in Section 1.2.
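Although the text treats (3.2.1) analytically, risk functions and Bayes risks are easy to evaluate numerically in simple models, and doing so can make the comparison of procedures concrete. The following minimal sketch is an illustration only, not part of the original development; it is written in Python with NumPy and SciPy (an assumption), and the sample size n = 10, the two estimates, and the Beta(2, 2) weight function are all illustrative choices. It compares two estimates of a Bernoulli success probability θ under squared error loss, the sample mean S/n and the shrunken estimate (S + 1)/(n + 2), by computing R(θ, δ) = E_θ(δ − θ)² exactly over a grid of θ values and then integrating against the prior to obtain r(π, δ).

    import numpy as np
    from scipy.stats import binom, beta

    def risk(theta, delta, n):
        # Exact risk of an estimate delta(S), S ~ Binomial(n, theta), under squared error loss.
        s = np.arange(n + 1)
        pmf = binom.pmf(s, n, theta)
        return float(np.sum(pmf * (delta - theta) ** 2))

    n = 10
    s = np.arange(n + 1)
    delta_mean = s / n                       # the sample mean S/n
    delta_shrunk = (s + 1.0) / (n + 2.0)     # shrunken estimate (S + 1)/(n + 2)

    thetas = np.linspace(0.001, 0.999, 999)
    prior = beta.pdf(thetas, 2, 2)           # illustrative Beta(2, 2) weight function

    for name, d in [("mean", delta_mean), ("shrunk", delta_shrunk)]:
        R = np.array([risk(t, d, n) for t in thetas])
        bayes_risk = np.trapz(R * prior, thetas)   # r(pi, delta) = integral of R(theta, delta) pi(theta) dtheta
        print(name, "max risk:", round(R.max(), 4), "Bayes risk:", round(bayes_risk, 4))

With these illustrative choices the shrunken estimate turns out to have both the smaller maximum risk and the smaller Beta(2, 2) Bayes risk, anticipating the Bayes and minimax comparisons developed in this section and the next.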
Thus, if π is a density and Θ ⊂ R,

r(π, δ) = ∫ R(θ, δ) π(θ) dθ,    (3.2.2)

with the integral replaced by a sum in the discrete case. Recall also that we can define

r(π) = inf{r(π, δ) : δ ∈ D},    (3.2.3)

the Bayes risk of the problem, and that in Section 1.3 we showed how, in an example, we could identify the Bayes rules δ*, that is, the rules with r(π, δ*) = r(π).

If π is then thought of as a weight function roughly reflecting our knowledge, it is plausible that δ*, if computable, will behave reasonably even if our knowledge is only roughly right. We may have vague prior notions such as "|θ| > 5 is physically implausible" if, for instance, θ denotes mean height of people in meters. This exercise is interesting and important even if we do not view π as reflecting an implicitly believed in prior distribution on θ; π may simply express that we care more about the values of the risk in some regions of Θ than in others. It is in fact clear that prior and loss function cannot be separated out clearly either: considering l₁(θ, a) and π₁(θ) is equivalent to considering l₂(θ, a) = l₁(θ, a)π₁(θ) and π₂(θ) = 1, which gives the "equal weight" prior a special role, though (Problem 3.2.4) the parametrization plays a crucial role here. For testing problems the hypothesis is often treated as more important than the alternative. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further except in the problems, and instead turn to the construction of Bayes procedures.

In this section we shall show systematically how to construct Bayes rules. Suppose θ is a random variable (or vector) with (prior) frequency function or density π(θ). We first consider the problem of estimating q(θ) with quadratic loss, l(θ, a) = (q(θ) − a)², using a nonrandomized decision rule δ. Our problem is to find the function δ of X that minimizes

r(π, δ) = E(q(θ) − δ(X))².    (3.2.4)

This is just the problem of finding the best mean squared prediction error (MSPE) predictor of q(θ) given X (see Remark 1.4.5). Using our results on MSPE prediction, we find that either r(π, δ) = ∞ for all δ or the Bayes rule δ* is given by

δ*(X) = E[q(θ) | X].    (3.2.5)

This procedure is called the Bayes estimate for squared error loss. In view of the formulae of Section 1.2 for the posterior density and frequency functions, we can give the Bayes estimate a more explicit form. In the continuous case with θ real valued and prior density π,

δ*(x) = ∫ q(θ) p(x | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ.    (3.2.6)

In the discrete case, we just need to replace the integrals by sums. Recall that E(Y | X) is the best predictor because E(Y | X =
2.4. we fonn the posterior risk r(a I x) = E(l(e.r( 7r. However. Thus. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior.E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. r) .6) yields. '1]0. a) I X = x). tends to a as n _ 00. .) minimizes the conditional MSPE E«(Y . To begin with we consider only nonrandomized rules.1. take that action a = J*(x) that makes r(a I x) as small as possible. If we look at the proof of Theorem 1. 1]0 fixed).w)X of the estimate to be used when there are no observations. Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl. a 2 / n.2.12.00 with.2. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r. In fact. Fonnula (3. " Xn. In fact. This action need not exist nor be unique if it does exist. .7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 .2 Bayes Procedures 163 Example 3. Applying the same idea in the general Bayes decision problem.1). if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3.5.2. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large. we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. E(Y I X) is the best predictor because E(Y I X ~ . J' )]/r( 7r. If we choose the conjugate prior N( '1]0. Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper.2. we see that the key idea is to consider what we should do given X = x. see Section 5. Because the Bayes risk of X. For more on this.1. Intuitively. we should.6. /2) as in Example 1.a)' I X ~ x) as a function of the action a. But X is the limit of such estimates as prior knowledge becomes "vague" (7 . 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l.7) Its Bayes risk (the MSPE of the predictor) is just r(7r. This quantity r(a I x) is what we expect to lose. . that is. if X = :x and we use action a. for each x. X is the estimate that (3. and X with weights inversely proportional to the Bayes risks of these two estimates. J') ~ 0 as n + 00.E(e I X))' = E[E«e .Section 3. J') E(e .
. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}. let Wij > 0 be given constants.o'(X)) I X].) = 0.2.2.· . the posterior risks of the actions aI. and let the loss incurred when (}i is true and action aj is taken be given by j e e .o) = E[I(O.o(X)) I X = x] = r(o(x) Therefore..o'(X)) IX = x]. Then the posterior distribution of o is by (1.! " j. !i n j.2. (3.9) " "I " • But. A = {ao.1. [I) = 3.5) with priOf ?f(0. 11) = 5. r(al 11) = 8..3.·.2. As in the proof of Theorem 1.Op}.4. Bayes Procedures Whfn and A Are Finite.o(X))] = E[E(I(O. az has the smallest posterior risk and. ?f(O.164 Measures of Performance Chapter 3 Proposition 3.. 0'(0) Similarly.o(X)) I X] > E[I(O. a q }. . ~ a.70 and we conclude that 0'(1) = a. E[I(O.8) Thus.74. and the resnlt follows from (3. we obtain for any 0 r(?f.8) Then 0* is a Bayes rule.2.2.2. by (3. Therefore. . 6* = 05 as we found previously. Let = {Oo. (3. " .67 5.) = 0. I 0)  + 91(0" ad r(a.35.89. o As a first illustration. Suppose we observe x = O. if 0" is the Bayes rule.. Example 3. and aa are r(at 10) r(a. More generally consider the following class of situations. az.2.8. Therefore. :i " " Proof. I x) > r(o'(x) I x) = E[I(O.o(X)) I X)].1.8). E[I(O. consider the oildrilling example (Example 1. r(a.2. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures. r(a. 10) 8 10.9).
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
estimates that ignore the data.. S(Y) = L:~ 1 diYi . Bayes and minimaxity. In the nonBayesian framework. we can also take this point of view with humbler aims. When v = ii. 5') for aile. We show that Bayes rules with constant risk.• •••. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2. looking for the procedure 0. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors. according to these criteria. I X n are d. the game is said to have a value v. Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. This notion has intuitive appeal. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O).I . D. symmetry.4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3.1 Unbiased Estimation. for which it is possible to characterize and. . it is natural to consider the computationally simple class of linear estimates.L and (72 when XI.4.3. (72) = . This approach has early on been applied to parametric families Va. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . 5) > R(0.1. and so on. in many cases. . for instance.6. Ii I More specifically. all 5 E Va. or more generally with constant risk over the support of some prior. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.d. N (Il.. then the game of S versus N has a value 11.. Von Neumann's Theorem states that if e and D are both finite. The most famous unbiased estimates are the familiar estimates of f. computational ease. I I .1 ." R(0. for example. II . An alternative approach is to specify a proper subclass of procedures. in Section 1. the solution is given in Section 3. and then see if within the Do we can find 8* E Do that is best according to the "gold standard. When Do is the class of linear procedures and I is quadratic Joss. 3. Do C D. then in estimating a linear function of the /3J. Obviously. v equals the Bayes risk of the Bayes rule 8* for the prior 1r*. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable. . An estimate such that Biase (8) 0 is called unbiased. which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. are minimax. on other grounds. We introduced. ruling out. 0'5) model. compute procedures (in particular estimates) that are best in the class of all procedures.2.2.. Moreover. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. such as 5(X) = q(Oo). 176 Measures of Performance Chapter 3 . The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk.
.XN} 1 (3. for instance. . . . Unbiased Estimates in Survey Sampling.(Xi 1=1 I" N ~2 x) ..2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0.4.4.1.. UN' One way to do this. ~ N L.4. X N ).3. .2..3. (~) If{aj.an}C{XI.4. . UMVU (uniformly minimum variance unbiased).Section 3. Unbiased estimates playa particularly important role in survey sampling.4.4..4.8) Jt =X . (3. [J = ~ I:~ 1 Ui· XR is also unbiased. We want to estimate the parameter X = Xj.' . It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3.. We ignore difficulties such as families moving.. and u =X  b(U .. . UN) and (Xl....< 2 ~ ~ (3.. .. . (3. Ui is the last census income corresponding to = it E~ 1 'Ui. Suppose we wish to sample from a finite population.14) and has = iv L:fl (3.3) = 0 otherwise. .4.il) Clearly for each b. .5) This method of sampling does not use the information contained in 'ttl. reflecting the probable correlation between ('ttl. . Example 3. UN for the known last census incomes. X N ) as parameter ..4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1.Xn denote the incomes of a sample of n families drawn at random without replacement. is to estimate by a regression estimate ~ XR Xi. If . a census unit. (3. these are both UMVU. .. to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census. As we shall see shortly for X and in Volume 2 for . This leads to the model with x = (Xl. X N for the unknown current family incomes and correspondingly UI.3 and Problem 1...4) where 2 CT.1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 .6) where b is a prespecified positive constant.. . We let Xl. . .···. Write Xl.
. .1 the correlation of Ui and Xi is positive and b < 2Cov(U.bey the attractive equivariance property.4). X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef . . . However. This makes it more likely for big incomes to be induded and is intuitively desirable.7) II ! where Ji is defined by Xi = xJ.Xi XHT=LJN i=l 7fJ.. 0 Discussion. N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads. II it >: ..X)!Var(U) (Problem 3. (iii) Unbiased_estimates do not <:.~'! I . I . A natural choice of 1rj is ~n.'I .11. For each unit 1. outside of sampling. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later. .. the unbiasedness principle has largely fallen out of favor for a number of reasons. .18. (ii) Bayes estimates are necessarily biasedsee Problem 3. estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5. ' . ".4. " 1 ".j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows. It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj. To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S).4. for instance. M (3.. The result is a sample S = {Xl. q(e) is biased for q(e) unless q is linear. ~ . j=1 3 N ! I: ..15)..3. . I II. . The HorvitzThompson estimate then stays unbiased. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems.I..4. 111" N < 1 with 1 1fj = n. X M } of random size M such that E(M) = n (Problem 3. this will be a beller cov(U.. 178 Measures of Performance Chapter 3 i . Specifically let 0 < 7fl. . I j . Ifthc 'lrj are not all equal.j .19).4..3. . . Unbiasedness is also used in stratified sampling theory (see Problem 1.. . An alternative approach to using the Uj is to not sample all units with the same prob ability. If B is unbiased for e.i .2Gand minimax estimates often are.4. :I . Xl!Var(U). 1. (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3..
See Problem 3. f3 is the least squares estimate. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates.2. What is needed are simple sufficient conditions on p(x.OJdx (3.4. for integration over Rq. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n . B)dx. has some decision theoretic applications. e e. We make two regularity assumptions on the family {Pe : BEe}.ZDf3). e.O) > O} docs not depend on O.4. We suppose throughout that we have a regular parametric model and further that is an open subset of the line. which can be used to show that an estimate is UMVU. and we can interchange differentiation and integration in p(x. For instance. and p is the number of coefficients in f3. Finally. B) is a density. suppose I holds.4. B)dx. That is. B) for II to hold.8) whenever the righthand side of (3.OJdx] = J T(xJ%oP(x. good estimates in large samples ~ 1 ~ are approximately unbiased. 3. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00. 167.O E 81iJO log p( x. OJ exists and is finite. (I) The set A ~ {x : p(x.4. The arguments will be based on asymptotic versions of the important inequalities in the next subsection.   . For instance.( lTD < 00 J .8) is finite.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless. From this point on we will suppose p(x.8) is assumed to hold if T(xJ = I for all x. Note that in particular (3. We expect that jBiase(Bn)I/Vart (B n ) . as we shall see in Chapters 5 and 6. Simpler assumptions can be fonnulated using Lebesgue integration theory.4..Section 3. Some classical conditiolls may be found in Apostol (1974). The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later. %0 [j T(X)P(x. Then 11 holdS provided that for all T such that E. p. The lower bound is interesting in its own right. Assumption II is practically useless as written. and appears in the asymptotic optimality theory of Section 5.9.4.p) where i = (Y . For all x E A. unbiased estimates are still in favor when it comes to estimating residual variances. in the linear regression model Y = ZDf3 + e of Section 2. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x.
2.4.B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8. (3. I J J & logp(x 0 ) = ~n 1 .. then I and II hold. Then Xi . suppose XI. .O) & < 00. /fp(x. (~ logp(X.O)] dxand J T(x) [:oP(X. It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II. where a 2 is known.i J T(x) [:OP(X. 1 X n is a sample from a N(B. o Example 3. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed. Then (see Table 1. I 1(0) 1 = Var (~ !Ogp(X.O)} p(x. I' ..10) and. For instance. (3.&0 '0 {[:oP(X.1. Similarly. 0))' = J(:Ologp(x.O)).1) ~(O) = 0la' and 1 and 11 are satisfied.4.4.6.4.. If I holds it is possible to define an important characteristic of the family {Po}.O)]j P(x.. Suppose that 1 and II hold and that E &Ologp(X. . 00. 1 X n is a sample from a Poisson PCB) population. O)dx ~ :0 J p(x. t. (J2) population. . which is denoted by I(B) and given by Proposition 3. Suppose Xl. the Fisher information number.9) Note that 0 < 1(0) < Lemma 3. the integrals . 0) ~ 1(0) = E.180 for all Measures of Performance Chapter 3 e.n and 1(0) = Var (~n_l . O)dx :Op(x. 0))' p(x. Then (3. thus. O)dx.11 ) .4. Ii ! I h(x) exp{ ~(O)T(x) .0 Xi) 1 = nO =  0' n O· o . O)dx = O.O)] dx " I'  are continuous functions(3) of O.I [I ·1 .4.  Proof.1.
4. Suppose the conditions of Theorem 3.' (e) = Cov (:e log p(X.(0) = E(feiogf(X"e»)'.O).l4) and Lemma 3.Section 3. 0 E e. (Information Inequality). and that the conditions of Theorem 3.(0) is differentiable and Var (T(X)) . e) and T(X). 1/.. We get (3. 1/. T(X)) .O)dX= J T(x) UOIOgp(x.1 hold.I 1. Using I and II we obtain.4.4. (e) and Var. we obtain a universal lower bound given by the following.4. 0 (3. 10gp(X.4. then I(e) = nI. Suppose that I and II hold and 0 < 1(0) < 00.1.4.I1. Suppose that X = (XI.4.17) . ej.O))P(x.16) to the random variables 8/8010gp(X. Denote E.13) By (A. Let T(X) be all)' statistic such that Var.(0).14) Now let ns apply the correlation (CauchySchwarz) inequality (A. Let [. 0 The lower bound given in the information inequality depends on T(X) through 1/. Proposition 3.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B). > [1/.1 hold and T is an unbiased estimate ofB.'(0) = J T(X):Op(x. Theorem 3. Then Var.4. Then for all 0.4. Var (t. .1. e)) = I(e). nI.12) Proof. Here's another important special case.'(~)I)'.4.'(0)]'  I(e) . (3.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section. (3. by Lemma 3.4. .(T(X)) by 1/.. If we consider the class of unbiased estimates of q(B) = B. 1/.4.1.(T(X)) > [1/.15) The theorem follows because. (3.O)dX.1.(T(X)) < 00 for all O.(0).Xu) is a sample from a population with density f(x.2.(T(X) > I(O) 1 (3.4. CoroUary 3.
[T'(X)] = [.B)] Var " i) ~ 8B iogj(Xi. then T(X) achieves the infonnation inequality bound and is a UMVU estimate of E.4. i g . Example 3. ..1 we see that the conclusion that X is UMVU follows if ~  Var(X) ~ nI. X n is a sample from a nonnal distribution with unknown mean B and known variance a 2 . Conversely. Next we note how we can apply the information inequality to the problem of unbiased estimation. whereas if 'P denotes the N(O. Theorem 3. We have just shown that the information [(B) in a sample of size n is nh (8). For a sample from a P(B) distribution..B(B)J. As we previously remarked. Note that because X is UMVU whatever may be a 2 .4.j. if {P9} is a oneparameter exponentialfamily of the form (1.(B) such that Var.3. I I "I and (3..B) ] [ ~var [%0 ]Ogj(Xi.2.4. Suppose Xl. Because X is unbiased and Var( X) ~ Bin. then X is a UMVU estimate of B. the conditions of the information inequality are satisfied.182 Measures of Performance Chapter 3 Proof. (Br Now Var(X) ~ a'ln. 0 We can similarly show (Problem 3. o II (B) is often referred to as the information contained in one observation.18) 1 a ' .j.. the MLE is B ~ X. By Corollary 3.18) follows. (Continued). This is no accident. .19) Ii i .Xn are the indicators of n Bernoulli trials with probability of success B. This is a consequence of Lemma 3.2.4.p'(B)]'ll(B) for all BEe.4.. .B) = h(x)exp[ry(B)T'(x) .4.1 for every B. 1) density. we have in fact proved that X is UMVU even if a 2 is unknown.4.4..4. (3. If the family {Pe} satisfies I and II and if there exists an unbiased estimate T* of 1/.(8) has a continuous nonvanishing derivative on e.4.1) that if Xl. . Then {P9} is a oneparameter exponential family with density or frequency function af the fonn II p(x. which achieves the lower bound ofTheorem 3. : BEe} satisfies assumptions I and II and there exists an unbiased estimate T* of1.(B). Example 3.6.B)] = nh(B). These are situations in which X follows a oneparameter exponential family.. then T' is UMVU as an estimate of 1. then  1 (3. then X is UMVU.1) with natural sufficient statistic T( X) and .1 and I(B) = Var [:B !ogp(X. . Suppose that the family {p.(T(X».
there exist functions al ((J) and a2 (B) such that :B 10gp(X.4. Note that if AU = nmAem .4.20) with Pg probability 1 for each B. be a denumerable dense subset of 8.22) for j = 1. then (3.B)) + nz)[log B . However.A'(B» = VareT(X) = A"(B).. B .4.24) [(B) = Vare(T(X) .B) = a.10g(1 .19).4.16) we know that T'" achieves the lower bound for all (J if. (B)T'(X) + a2(B) (3. thus. B) = 2"' exp{(2nj 2"' exp{(2n.4. Conversely in the exponential family case (1.1.23) must hold for all B.4. then (3. continuous in B. and only if.4. the information bound is [A"(B))2/A"(B) A"(B) Vare(T(X») so that T(X) achieves the information bound as an estimate of EeT(X).6. But now if x is such that = 1. B . (3. it is necessary. From this equality of random variables we shall show that Pe[X E A'I = 1 for all B where A' = {x: :Blogp(x.4.2.B) I + 2nlog(1  BJ} .. B) = aj(B)T'(x) + a2(B) (3. " and both sides are continuous in B.2. Suppose without loss of generality that T(xI) of T(X2) for Xl. :e Iogp(x.2 and. Our argument is essentially that of Wijsman (1973).4.4. If A. (3.(A") = 1 for all B'. J 2 Pe. Let B .4.20) guarantees Pe(Ae) ~ 1 and assumption 1 guarantees P...1) we assume without loss of generality (Problem 3.21 ) Upon integrating both sides of (3.14) and the conditions for equality in the correlation inequality (A.4. In the HardyWeinberg model of Examples 2.ll.6). Thus. + n2) 10gB + (2n3 + n3) Iog(1 . The passage from (3. B) IdB. p(x. j hence. we see that all a2 are linear combinations of {} log p( Xj. X2 E A"''''.4.4 and 2.4.20) hold. By solving for a" a2 in (3. . 0 Example 3.4.B) so that = T(X) . (3. By (3.(Ae) = 1 for all B' (Problem 3.25) But .3) that we have the canonical case with ~(B) = Band B( Bj = A(B) = logf h(x)exp{BT(x))dx.(B)T' (x) +a2(B) for all BEe}. 2 A ** = A'" and the result follows.4. B) = a.20) with respect to B we get (3.4.4 Unbiased Estimation and Risk Inequalities 183 Proof We start with the first assertion. denotes the set of x for which (3.p(B) = A'(B) and.19) is highly technical.6.20) to (3.4. Here is the argument.23) for all B1 .A (B) . Then ~ ( 80 logp X.Section 3.
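Theorem 3.4.2 can be checked directly for Bernoulli trials: X̄ is unbiased for ψ(θ) = θ and Var_θ(X̄) = θ(1 − θ)/n coincides with [ψ'(θ)]²/I(θ). A simulation sketch, assuming NumPy, with illustrative values of θ and n:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, n_rep = 0.3, 25, 200_000

xbar = rng.binomial(1, theta, size=(n_rep, n)).mean(axis=1)

bound = theta * (1 - theta) / n        # [psi'(theta)]^2 / I(theta) with psi(theta) = theta
print(xbar.mean(), theta)              # unbiasedness
print(xbar.var(), bound)               # the information bound is attained
```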
.6.6.26) Proposition 3. =e'E(~Xi) = .4. that I and II fail to hold. See Volume II.2. ~ ~ This T coincides with the MLE () of Example 2.ed)' In particular.2. as Theorem 3. For a sample from a P( B) distribution o EB(::.4.4.B)(2n. e).4. . The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z .25). '. we have iJ' aB' logp (X.B)) which equals I(B). . By (3. B) to canonical form by setting t) = 10g[B((1 . Suppose p('. B).B) is twice differentiable and interchange between integration and differentiation is pennitted.4.. B)  2 (a log p(x. B) satisfies in addition to I and II: p(·. • The multiparameter case We will extend the information lower bound to the case of several parameters. Theorem 3.26) holds.3. in the U(O.B)] and then using Theorem 1. It often happens.IOgp(X.2 suggests. DiscRSsion. (3. we will find a lower bound on the variance of an estimator . (Continued).. Even worse.2. Because this is an exponential family.4.24). B) 2 1 a = p(x.. B) example. 0 0 .2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 .27) and integrate both sides with respect to p(x. Proof We need only check that I a aB' log p(x.B) ~ A "( B).(e) exist.4.'" . in many situations. () (ell" . I I' f .4. or by transforming p(x. but the variance of the best estimate is not equal to the bound [. Sharpenings of the information inequality are available but don't help in general. Then (3. A third metho:! would be to use Var(B) = I(I(B) and formula (3. " 'oi .4.4. assumptions I and II are satisfied and UMVU estimates of 1/. .184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211.4.B)] ~ B. B) aB' p(x. 0 ~ Note that by differentiating (3. B) )' aB (3. Example 3. . for instance. although UMVU estimates exist.7) Var(B) ~ B(I .25) we obtain a' 10gp(X. . I(B) = Eo aB' It turns out that this identity also holds outside exponential families. Extensions to models in which B is multidimensional are considered next. We find (Problem 3. Na). .p' (B) I'( I (B).
. in addition. . er 2). . . Let p( x. The (Fisher) information matrix is defined as (3..Xn are U. 1 <j< d. ..4. 0).4. III (0) = E [~:2 logp(x. then X = (Xl.4.28) where (3. 0). nh (O) where h is the information matrix ofX.4 Unbiased Estimation and Risk Inequalities 185 of fh when the parameters 82 ..Section 3.. (3. The arguments follow the d Example 3. (a) B=T 1 (3. . . We assume that e is an open subset of Rd and that {p(x.Bd are unknown. = Var('.O) ] iJ2 ] l22(0) = E [ (iJer2)21ogp(x.4.4 /2.30) iJ Ijd O) = Covo ( eO logp(X.d.4. Suppose X = 1 case and are left to the problems.( x 1") 2 2a 2 logp(x 0) ..Ok IOgp(X..32) Proof.29) Proposition 3.2 = er.4.O») . 0)] = E[er. (c) If. iJO logp(X. 1 <k< d. .31) and 1(0) (b) If Xl. (0) iJiJ lt2(0) ~ E [aer 2 iJl" logp(x.0) = er. Under the conditions in the opening paragraph. j = 1.5.. B) : 0 E 8} is a regular parametric model with conditions I and II satisfied when differentiation is with respect OJ. pC B) is twice differentiable and double integration and differentiation under the integral sign can be interchanged.. 0 = (I". 6) denote the density or frequency function of X where X E X C Rq. a ). d.2 ] ~ er.4. 0) j k That is.Xnf' has information matrix 1(0) = EO (aO:. .? ologp(X. as X. (3.4 E(x 1") = 0 ~ I. . Then 1 = log(211') 2 1 2 1 2 loger . ~ N(I".. er 2 ).4.
1/. Assume the conditions of the .1.4.186 Measures of Performance Chapter 3 Thus. UMVU Estimates in Canonical Exponential Families. I'L(Z) = I'Y + (Z l'z)TLz~ Lzy· Now set Y (3. where I'L(Z) denotes the optimal MSPE linear predictor of Y.4.3. A(O).(0) exists and (3. II are easily checked and because VOlogp(x.6.38) where Lzy = EO(TVOlogp(X.4. Example 3. that is.' = explLTj(x)9j j=1 .4. Let 1/.(O). We will use the prediction inequality Var(Y) > Var(I'L(Z». then [(0) = VarOT(X).6 hold. (3. 0) . = . .. Then Theorem 3. [(0) = VarOT(X) = (3.4.13).O») = VOEO(T(X» and the last equality follows 0 from the argument in (3.4.. J)d assumed unknown.34) () E e open. By (3. The conditions I. Canonical kParameter Exponential Family. ! p(x.35) o ~ Next suppose 8 1 = T is an estimate of (Jl with 82.4.4..!pening paragraph hold and suppose that the matrix [(0) is nonsingular Then/or all 0.4. Then VarO(T(X» > Lz~ [1(0) Lzy (3. (continued).4/2 .4.33) o Example 3. Z = V 0 logp(x.36) Proof. Suppose k .37) = T(X).A(O)}h(x) (3.4. in this case [(0) ! . Suppose the conditions of Example 3.. .(0) = EOT(X) and let ".O) = T(X) .'(0) = V1/.A. .( 0) be the d x 1 vector of partial derivatives.4. 0).. Here are some consequences of this result.6.30) and Corollary 1.. We claim that each of Tj(X) is a UMVU .6.2 ( 0 0 ) .
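The information matrix of the N(μ, σ²) model computed above, diag(1/σ², 1/(2σ⁴)) per observation, can be recovered numerically as the covariance matrix of the score vector. A sketch assuming NumPy (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n_rep = 1.0, 4.0, 500_000

x = rng.normal(mu, np.sqrt(sigma2), size=n_rep)

# Score of one observation with respect to theta = (mu, sigma^2)
s_mu = (x - mu) / sigma2
s_s2 = -0.5 / sigma2 + (x - mu) ** 2 / (2 * sigma2 ** 2)

emp_info = np.cov(np.vstack([s_mu, s_s2]))                   # Var_theta of the score vector
exact = np.array([[1 / sigma2, 0.0], [0.0, 1 / (2 * sigma2 ** 2)]])
print(np.round(emp_info, 4))
print(exact)                                                  # diag(0.25, 0.03125)
```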
. . To see our claim note that in our case (3. = nE(T.40) J We claim that in this case .7. Multinomial Trials.6 with Xl. by Theorem 3.0). a'i X and Aj = P(X = j).d.T(O) = ~:~ I (3. without loss of generality...p(0)11(0) ~ (1.41) because .A(0) j ne ej = (1 + E7 eel _ eOj ) . is >..4. . Ak) to the canonical form p(x.3. .O) = exp{TT(x)O . We have already computed in Proposition 3. j=1 X = (XI.. .(X) = I)IXi = j]. hence. Example 3.>'. . .4. .... " Ok_il T. But because Nj/n is unbiased and has 0 variance >'j(1 .4. . .. the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. . . .(X)) 82 80.Xn)T.4 8'A rl(O)= ( 8080 t )1 kx k .A(O)} whereTT(x) = (T!(x). 0 = (0 1 ... j = 1. (1 + E7 / eel) = n>'j(1.Tk_I(X)).A(O) is just VarOT! (X).0. Thus..(1. we let j = 1.6.39) where.4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X).4..>'j)/n.4..i. X n i. . then Nj/n is UMVU for >'j. we transformed the multinomial model M(n. n T. . .4. kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'. In the multinomial Example 1.p(OJrI(O). AI.A. This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.) = Var(Tj(X». j.Section 3.. k. are known... (3.j. i =j:.)/1£.. But f.p(0) is the first row of 1(0) and.
. features other than the risk function are also of importance in selection of a procedure... N(IL. But it does not follow that n~1 L:(Xi . Using inequalities from prediction theory.d. . . X n are i. interpretability of the procedure.4. we show how the infomlation inequality can be extended to the multiparameter case. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure.4."xl' Note that both sides of (3.21.5 NON DECISION THEORETIC CRITERIA In practice. ~ In Chapters 5 and 6 we show that in smoothly parametrized models.4. Summary. reasonable estimates are asymptotically unbiased. LX.4.8.3 whose proof is left to Problem 3.1 Computation Speed of computation and numerical stability issues have been discussed briefly in Section 2." . and robustness to model departures..4. These and other examples and the implications of Theorem 3. 188 MeasureS of Performance Chapter 3 j Example 3. The Normal Case. Asymptotic analogues oftbese inequalities are sharp and lead to the notion and construction of efficient estimates.. 1 j Here is an important extension of Theorem 3. T(X) is the UMVU estimate of its expectation. .4. even if the loss function and model are well specified.5.3 hold and • is a ddimensional statistic. (72) then X is UMVU for J1 and ~ is UMVU for p. Then J (3. We study the important application of the unbiasedness principle in survey sampling. They are dealt with extensively 'in books on numerical analysis such as Dahlquist. Let 1/J(O) ~EO(T(X))dXl and "'(0) = (1/Jl(O).B)a > Ofor all . We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal.pd(OW = (~(O)) dxd . Also note that ~ o unbiased '* VarOO > r'(O).42) are d x d matrices.2 + (J2. 3. We derive the information inequa!ity in oneparameter models and show how it can be used to establish that in a canonical exponential family. • Theorem 3. " il 'I.4.3 are 0 explored in the problems.42) where A > B means aT(A . .4. 3.4. . .4. If Xl. Suppose that the conditions afTheorem 3.X)2 is UMVU for 0 2 ..i.
if we seek to take enough steps J so that III III < . Its maximum likelihood estimate X/a continues to have the same intuitive interpretation as an estimate of /1.2./ a even if the data are a sample from a distribution with .Section 3.5.4.P) in Example 2.A(BU ~l»)).2 is the empirical variance (Problem 2.11). The interplay between estimated. But it reappears when the data sets are big and the number of parameters large. solve equation (2. (72) Example 2.5 we are interested in the parameter III (7 • • This parameter. It may be shown that.2).1. fI(j) = iPl) . Of course.5. It is in fact faster and better to Y. estimates of parameters based on samples of size n have standard deviations of order n. and Anderson (1974).5 Nondecision Theoretic Criteria 189 Bjork. It follows that striving for numerical accuracy of ord~r smaller than n. < 1 then J is of the order of log ~ (Problem 3.2.4.3.2 is given by where 0.4.1 / 2 is wasteful.1. Gaussian elimination for the particular z'b Faster versus slower algorithms Consider estimation of the MLE 8 in a general canonical exponential family as in Section 2. On the other hand. the NewtonRaphson method in ~ ~(J) which the jth iterate. It is clearly easier to compute than the MLE. variance and computation As we have seen in special cases in Examples 3.10).1 / 2 . 3.1. On the other hand.AI (flU 1) (T(X) .1. at least if started close enough to 8. for this population of measurements has a clear interpretation. say. takes on the order of log log ~ steps (Problem 3.1). Closed form versus iteratively computed estimates At one level closed form is clearly preferable. in the algOrithm we discuss in Section 2. The closed fonn here is deceptive because inversion of a d x d matrix takes on the order of d 3 operations when done in the usual way and can be numerically unstable. The improvement in speed may however be spurious since AI is costly to compute if d is largethough the same trick as in computing least squares estimates can be used. We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.2 Interpretability Suppose that iu the normal N(Il.3 and 3. with ever faster computers a difference at this level is irrelevant. consider the Gaussian linear model of Example 2.5. a method of moments estimate of (A. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the COnstants involved. the signaltonoise ratio. For instance.4. Then least squares estimates are given in closed fonn by equ<\tion (2.9) by.3.
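As a concrete sketch of the Newton–Raphson step for solving Ȧ(θ) = T(x) in a one-parameter canonical exponential family, take the Poisson model in canonical form, where A(η) = n e^η, T(x) = Σxᵢ, and the exact answer is η̂ = log x̄. The code below (assuming NumPy; the data are simulated for illustration) shows how few iterations are needed:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=50)
n, T = x.size, x.sum()

eta = 0.0                                    # starting value for the canonical parameter
for j in range(20):
    A_dot = n * np.exp(eta)                  # A'(eta); here A''(eta) = A'(eta) as well
    step = (A_dot - T) / A_dot               # Newton step for A'(eta) - T = 0
    eta -= step
    if abs(step) < 1e-12:
        break

print(j + 1, eta, np.log(x.mean()))          # a handful of iterations; eta equals log(xbar)
```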
.5. 1975b. and both the mean tt and median v qualify.5. the HardyWeinberg parameter () has a clear biological interpretation and is the parameter for the experiment described in Example 2. we could suppose X* rv p. Gross error models Most measurement and recording processes are subject to gross errors. Then E(X)/lVar(X) ~ (p/>. if gross errors occur.190 Measures of Performance Chapter 3 mean Jl and variance 0'2 other than the normal.e. 3. We can now use the MLE iF2. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately.13.. but there are a few • • = . v is any value such that P(X < v) > ~. But () is still the target in which we are interested. For instance. X n ) where most of the Xi = X:. would be an adequate approximation to the distribution of X* (i. what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). amoog others. E P*). . 9(Pl. However.')1/2 = l. . but we do not necessarily observe X*.4) is for n large a more precise estimate than X/a if this model is correct. the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of E( X) / [Var( X)] 1/2. we may be interested in the center of a population. P(X > v) > ~). E p.)(p/>. suppose that if n measurements X· (Xi. Alternatively. On the other hand. The actual observation X is X· contaminated with "gross errOfs"see the following discussion. For instance. which as we shall see later (Section 5.. 1976) and Doksum (1975). We will consider situations (b) and (c). To be a bit formal. Similarly. say () = N p" where N is the population size and tt is the expected consumption of a randomly drawn individuaL (b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a ''true'' parametric model with an interpretable parameter B..I1'. the parameter v that has half of the population prices on either side (fonnally. We return to this in Section 5.. suppose we initially postulate a model in which the data are a sample from a gamma.3 Robustness Finally. we turn to robustness.4. 1 X~) could be taken without gross errors then p. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. See Problem 3. they may be interested in total consumption of a commodity such as coffee. However. We consider three situations (a) The problem dictates the parameter. that is. economists often work with median housing prices. we observe not X· but X = (XI •. This idea has been developed by Bickel and Lehmann (l975a. (c) We have a qualitative idea of what the parameter is. but there are several parameters that satisfy this qualitative notion. anomalous values that arise because of human error (often in recording) or instrument malfunction.\)' distribution.1.5. However.
if we drop the symmetry assumption. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X· .1').remains identifiable.. .i. .. That is. = {f (. it is possible to have PC""f.5.18). Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. it is the center of symmetry of p(Po. p(". (3...i.l.j = PC""f. X n ) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xi I Xt.2) Here h is the density of the gross errors and . However. specification of the gross error mechanism. and definitions of insensitivity to gross errors.d.d. We return to these issues in Chapter 6. X n ) knowing that B(Xi. This corresponds to. Unfortunately.l.5 Nondecision Theoretic Criteria ~ 191 wild values.J) for all such P. Without h symmetric the quantity jJ. •. Consider the onesample symmetric location model P defined by ~ ~ i=l""l n . Informally B(X1 . .)~ 'P C) + .5. The advantage of this formulation is that jJ. . F. the gross errors. .A Y. with common density f(x . so it is unclear what we are estimating. ISi'1 or 1'2 our goal? On the other hand.2. The breakdown point will be discussed in Volume II. P. . X~) is a good estimate.\ is the probability of making a gross error.5.f. Further assumptions that are commonly made are that h has a particular fonn. Example 1..1') and (Xt.5. where Y. Again informally we shall call such procedures robust. .l. and then more generally.) are ij.5. That is.. the sensitivity curve and the breakdown point. X is the best estimate in a variety of senses.is not a parameter.d. We next define and examine the sensitivity curve in the context of the Gaussian location model. In our new formulation it is the xt that obey (3.1. Y. we encounter one of the basic difficulties in formulating robustness in situation (b).5. If the error distribution is normal. h = ~(7 <p (:(7) where K ::» 1 or more generally that h is an unknown density symmetric about O. Formal definitions require model specification. Then the gross error model issemiparametric. and symmetric about 0 with common density f and d.j) E P. Now suppose we want to estimate B(P*) and use B(X 1 .2).1') : f satisfies (3. make sense for fixed n. the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. Xi Xt with probability 1 . has density h(y . However. two notions. .) for 1'1 of 1'2 (Problem 3. iff X" .. in situation (c).2) for some h such that h(x) = h(x) for all x}. where f satisfies (3. (3.Xn are i. with probability . A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the Ci still i. for example.1 ) where the errors are independent. but with common distribution function F and density f of the form f(x) ~ (1 . identically distributed.1).Section 3.h(x). we do not need the symmetry assumption.
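A simulation sketch of the gross error model (3.5.1)–(3.5.2), assuming NumPy (the contamination fraction λ and the gross-error scale are arbitrary illustrations): with a modest proportion of gross errors the sample median tracks μ far more reliably than the sample mean.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_rep, mu, lam = 40, 20_000, 0.0, 0.10             # lam = probability of a gross error

good = rng.normal(mu, 1.0, size=(n_rep, n))
gross = rng.normal(mu, 10.0, size=(n_rep, n))          # h taken symmetric about 0, scale 10
x = np.where(rng.random((n_rep, n)) < lam, gross, good)

print(np.mean((x.mean(axis=1) - mu) ** 2))             # MSE of the sample mean
print(np.mean((np.median(x, axis=1) - mu) ** 2))       # MSE of the sample median (much smaller)
```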
Xnl as an "ideal" sample of size n .17). The sensitivity curoe of () is defined as ~ ~ ~ ~ ~ ~ ~ ~ ~ .od Problem 2. The sample median can be motivated as an estimate of location on various grounds.Xnl so that their mean has the ideal value zero.. In our examples we shall. • . that is.5.1. not its location.. (2.16).O(Xl' .192 The sensitivity curve Measures of Performance Chapter 3 . we take I' ~ 0 without loss of generality. Then ~ ~ O. . where Xl. (}(X 1 . SC(x. .J. How sensitive is it to the presence of gross errors among XI.od because E(X. .4).. . Often this is done by fixing Xl.. has the plugin property. Xl. Thus.I.··..1.f) E P. .l. X(n) are the order statistics. The empirical plugin estimate of 0 is 0 = O(P) where P is the empirical probability distribution. See Problem 3. We are interested in the shape of the sensitivity curve.2... x) . 1') = 0..j)) = J. . (}(PU.L = E(X).1 for all p(/l. +X"_I+X) n = x. See (2. therefore. . the sample mean is arbitrarily sensitive to gross errOfa large gross error can throw the mean off entirely.. . . in particular.X. 1 Xnl represents an observed sample of size nl from P and X represents an observation that (potentially) comes from a distribution different from P. ." .1 for which the estimator () gives us the right value of the parameter and then we see what the introduction of a potentially deviant nth observation X does to the value of ~ We return to the location problem with e equal to the mean J. Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X defined by ~ ~ X X(k+l) !(X(k) ifn~2k+1 + X(k+l)) ifn = 2k where X (I)' .Xn ? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plugin estimates (which are well defined for all sample sizes n). (i) It is the empirical plugin estimate of the population median v (Problem 3. is appropriate for the symmetric location model. This is equivalent to shifting the Be vertically to make its value at x = 0 equal to zero.. . . X =n (Xl+ . . ..2. that is.1. Now fix Xl!'" . . X"_l )J. X n ordered from smallest to largest. where F is the empirical d.L = (}(X l 1'. X n ) = B(F). ~ ) SC( X.L. shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas. .. 0) = n[O(xl.32. Because the estimators we consider are location invariant. .. At this point we ask: Suppose that an estimate T(X 1 . . . See Section 2. .. . X" 1'). Suppose that X ~ P and that 0 ~ O(P) is a parameter. X n ) . We start by defining the sensitivity curve for general plugin estimates. and it splits the sample into two equal halves.14..5.... I .. . P.
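The sensitivity curve SC(x, θ̂) = n[θ̂(x₁, …, x_{n−1}, x) − θ̂(x₁, …, x_{n−1})] is easy to compute directly. The following sketch, assuming NumPy and an illustrative "ideal" sample symmetric about 0, reproduces the unbounded line for the mean and the bounded step for the median:

```python
import numpy as np

def sensitivity_curve(estimate, ideal, xs):
    n = len(ideal) + 1
    base = estimate(np.asarray(ideal))
    return np.array([n * (estimate(np.append(ideal, x)) - base) for x in xs])

ideal = [-1.03, -0.30, 0.30, 1.03]        # an "ideal" sample of size n - 1 = 4 with mean 0
xs = np.linspace(-10, 10, 201)

sc_mean = sensitivity_curve(np.mean, ideal, xs)       # equals xs: unbounded in x
sc_median = sensitivity_curve(np.median, ideal, xs)   # bounded step function
print(sc_mean[[0, -1]], sc_median[[0, -1]])
```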
1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when X is near Ji. we obtain . say.Xn_l = (X(k) + X(ktl)/2 = 0.2. SC(x. x is an empirical (iii) The sample median is the MLE when we assume the common density f(x) of the errors {cd in (3. The sensitivity curve of the median is as follows: If. < X(nl) are the ordered XI. n = 2k + 1 is odd and the median of Xl. 27 a density having substantially heavier tails than the normaL See Problems 2.5.5.1).5 Nondecision Theoretic Criteria 193 (ii) In the symmetric location model (3.. .. XnI.5. x) nx(k) nx = _nx(k+l) for for < x(k) for x(k) < x x <x(k+I) nx(k+l) x> x(k+l) where xCI) < .. Although the median behaves well when gross errors are expected.Section 3.1. . A class of estimates providing such intermediate behavior and including both the mean and ..5.5. The sensitivity curve in Figure 3.. .32 and 3. SC(x) SC(x) x x Figure 3.9. v coincides with fL and plugin estimate of fL..1) is the Laplace (double exponential) density f(x) = 1 exp{l x l/7}. The sensitivity curves of the mean and median. its perfonnance at the nonnal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X.
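Both claims above — that the median's variance is roughly 57% larger than that of X̄ under the normal model, and that the ordering reverses under heavier-tailed errors such as the Laplace — are easy to check by simulation (a sketch assuming NumPy; sample size and replication count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_rep = 101, 20_000

for name, sample in [("normal", rng.normal(0.0, 1.0, (n_rep, n))),
                     ("laplace", rng.laplace(0.0, 1.0, (n_rep, n)))]:
    ratio = np.median(sample, axis=1).var() / sample.mean(axis=1).var()
    print(name, ratio)    # about 1.57 under the normal, well below 1 under the Laplace
```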
The estimates can be justified on plugin grounds (see Problem 3.. . Haber. ..2. Hampel. See Andrews. The sensitivity curve of the trimmed mean. i I . for example. 4.2[nnl where [na] is the largest integer < nO' and X(1) < .5.. If [na] = [(n . For a discussion of these and other forms of "adaptation. l Xnl is zero.1.) SC(x) X x(n[naJ) .. Xa  X. infinitely better in the case of the Cauchysee Problem 5. f(x) = or even more strikingly the Cauchy. . Xu = X.5). + X(n[nuj) n .2. we throw out the "outer" [na] observations on either side and take the average of the rest. Bickel. Which a should we choose in the trimmed mean? There seems to be no simple answer. and Tukey (1972). .194 Measures of Performance Chapter 3 the median has ~en known since the eighteenth century.5.5. suppose we take as our data the differences in Table 3. and Hogg (1974). For more sophisticated arguments see Huber (1981). Intuitively we expect that if there are no gross errors. < X (n) are the ordered observations.1. Let 0 trimmed mean. We define the a (3. That is.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the nonnal distribution.10 < a < 0. Note that if Q = 0.4.l)Q] and the trimmed mean of Xl.5. the sensitivity curve calculation points to an equally intuitive conclusion.2[nn]/n)I. f(x) = 1/11"(1 + x 2 ). whereas as Q i ~. The range 0. For instance..1 again. which corresponds approximately to a = This can be verified in tenns of asymptotic variances (MSEs}see Problem 5. Huber (1972). Xa. then the trimmed means for a > 0 and even the median can be much better than the mean. the Laplace density. However. the mean is better than any trimmed mean with a > 0 including the median. (The middle portion is the line y = x(1 . Figure 3. the sensitivity Jo ~ CUIve of an Q trimmed mean is sketched in Figure 3. . that is. Rogers..4. 4e'x'. by <Q < 4." see Jaeckel (1971). There has also been some research into procedures for which a is chosen using the observations.5. I I ! .8) than the Gaussian density.5. f = <p. . If f is symmetric about 0 but has "heavier tails" (see Problem 3.3) Xu = X([no]+l) + ..
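A direct implementation of the α-trimmed mean (3.5.3) sorts the sample, drops the [nα] smallest and [nα] largest observations, and averages the rest. A sketch assuming NumPy (the contaminated sample is illustrative; scipy.stats.trim_mean implements a closely related convention):

```python
import numpy as np

def trimmed_mean(x, alpha):
    # alpha-trimmed mean as in (3.5.3): average of X_([n*alpha]+1), ..., X_(n-[n*alpha])
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(len(x) * alpha))
    return x[k:len(x) - k].mean()

rng = np.random.default_rng(8)
x = np.append(rng.normal(0.0, 1.0, 99), 50.0)     # 99 "good" observations plus one gross error
print(x.mean(), trimmed_mean(x, 0.10), np.median(x))
```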
X n denote a sample from that population.16). n + 00 (Problem 3. . then SC(X. A fairly common quick and simple alternative is the IQR (interquartile range) defined as T = X. To simplify our expression we shift the horizontal axis so that L~/ Xi = O.X.Section 3. the scale measure used is 0.742(x. and let Xo. where x(t) < .. We next consider two estimates of the spread in the population as well as estimates of quantiles.1x.5.1. Example 3. Xo: = x(k).5.1 L~ 1 Xi = 11. = ~ [X(k) + X(k+l)]' and at sample size n  1.. Because T = 2 x (.25 are called the upper and lower quartiles..X.4) where the approximation is valid for X fixed.5 Nondecision Theoretic Criteria 195 Gross errors or outlying data points affect estimates in a variety of situations.2 ..25.X)2 is the empirical plugin estimate of 0. other examples will be given in the problems. SC(X.1. . Then a~ = 11. then the variance 0"2 or standard deviation 0. &) (3.. say k. Quantiles and the lQR. X n is any value such that P( X < x n ) > Ct. < x(nI) are the ordered Xl. P( X > xO') > 1 . where Xu has 100Ct percent of the values in the population on its left (fonnally. If no: is an integer.2.5.75 and X. 0 Example 3. 0"2) model. 0 < 0: < 1.XnI.75 . Let B(P) = X o deaote a ath quantile of the distribution of X. Similarly.25). denote the o:th sample quantile (see 2. &2) It is clear that a:~ is very sensitive to large outlying Ixi values.674)0". the nth sample quantile is Xo.is typically used.Ct). Write xn = 11.. The IQR is often calibrated so that it equals 0" in the N(/l. Spread. If we are interested in the spread of the values in a population. . Xu is called a Ctth quantile and X.5. Let B(P) = Var(X) = 0"2 denote the variance in a population and let XI. .75 . .10). .I L:~ I (Xi .
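Because the population IQR of a N(μ, σ²) distribution is 2(0.674)σ ≈ 1.349σ, the calibrated estimate 0.742(x̂_.75 − x̂_.25) estimates σ and, unlike the sample standard deviation, is barely affected by a few gross errors. A sketch assuming NumPy (the contamination is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
clean = rng.normal(0.0, 2.0, size=1000)                   # true sigma = 2
dirty = np.append(clean, [40.0, -35.0, 60.0])              # a few gross errors added

for x in (clean, dirty):
    q75, q25 = np.percentile(x, [75, 25])
    print(x.std(ddof=1), 0.742 * (q75 - q25))              # the IQR-based estimate barely moves
```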
1'0) ~[Xlkl) _ xlk)] x < XlkI) 2 '  2 2 1 Ix _ xlkl] XlkI) < x < xlk+11 '  (3. ~ ~ ~ Next consider the sample lQR T=X. for 2 Measures of Performance Chapter 3 < k < n .: : where F n .196 thus. x (t) that (Problem 3. SC(x..2. discussing the difficult issues of identifiability.O(F)] = .F) dO I.15) I[t < xl). An exposition of this point of view and some of the earlier procedures proposed is in Hampel. have been studied extensively and a number of procedures proposed and implemented.• .1. and other procedures. .5) 1 [xlk+11 _ xlkl] x> xlk+1) ' Clearly. = <1[0((1  <)F + <t. 1'. Other aspects of robustness.O. The sensitivity of the parameter B(F) to x can be measured by the influence function.1 denotes the empirical distribution based on Xl.) .. Rousseuw. i) = SC(x.5. trimmed mean.5. It is easy to see ~ ! '. which is defined by IF(x. xa: is not sensitive to outlying x's. Most of the section focuses on robustness.(x. Ronchetti. Summary.SC(x. We will return to the influence function in Volume II. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean. . Ii I~ . ~' is the distribution function of point mass at x (. r: ~i II i . in particular the breakdown point. although this difficulty appears to be being overcome lately. F) and~:z. median. ~ .25· Then we can write SC(x.Xnl. We discuss briefly nondecision theoretic considerations for selecting procedures including interpretability. and Stabel (1983).75 X.F) where = limIF. o Remark 3.. 0. 1'75) .. . and computability.6.25) and the sample IQR is robust with respect to outlying gross errors x.. Discussion.O. Unfortunately these procedures tend to be extremely demanding computationally. It plays an important role in functional expansions of estimates.(x.5.
.3. 0) the Bayes rule is where fo(x.24. a(O) > 0. the parameter A = 01(1 .2 o e 1. In Problem 3. then the improper Bayes rule for squared error loss is 6"* (x) = X.2.. E = R. 71') = R(7f) /r( 71'. x) where c(x) =.Xn be the indicators of n Bernoulli trials with success probability O.4. under what condition on S does the Bayes rule exist and what is the Bayes rule? 5..0)' and that the prinf7r(O) is the beta. 3.Section 3.2 with unifonn prior on the probabilility of success O. if 71' and 1 are changed to 0(0)71'(0) and 1(0. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin. density.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite.6 PROBLEMS AND COMPLEMENTS Problems for Section 3. respectively. 0) = (0 . 2. J) ofJ(x) = X in Example 3.0)/0(0). Suppose IJ ~ 71'(0). which is called the odds ratio (for success). give the MLE of the Bernoulli variance q(0) = 0(1 . is preferred to 0. 1 X n is a N(B.3).0).2. Check whether q(OB) = E(q(IJ) I x). In some studies (see Section 6. s).a)' jBQ(1. J).0 E e. change the loss function to 1(0. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior. Hint: See Problem 1. . Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.r 7f(0)p(x I O)dO.. the Bayes rule does not change. Find the Bayes risk r(7f. Compute the limit of e( J.2 preceeding. (c) In Example 3. 6. . we found that (S + 1)/(n + 2) is the Bayes rule. (72) sample and 71" is the improper prior 71"(0) = 1. where OB is the Bayes estimate of O. (X I 0 = 0) ~ p(x I 0).4.2. Consider the relative risk e( J. (a) Show that the joint density of X and 0 is f(x.1. where R( 71') is the Bayes risk. = (0 o)'lw(O) for some weight function w(O) > 0. Let X I. That is.O)P. .6 Problems and Complements 197 3. 71') as .0) and give the Bayes estimate of q(O).2. ~ ~ 4. In the Bernoulli Problem 3. Show that if Xl. Suppose 1(0. 0) is the quadratic loss (0 . (3(r. Show that (h) Let 1(0. lf we put a (improper) uniform prior on A.O) = p(x I 0)71'(0) = c(x)7f(0 .
. 00..a)' / nj~.2(d)(i) and (ii).. E).Xu are Li.. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x).ft B be the difference in mean effect of the generic and namebrand drugs. where E(X) = O. Find the Bayes decision rule.• N.. For the following problems. 0) and I( B. (e) a 2 + 00. and that 9 is random with aN(rlO. E). Suppose tbat N lo .I(d). B). a r )T.2.. 7. (Use these results.3. .) with loss function I( B. A regulatory " ! I i· . . find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule.15. . I j l . .exp {  2~' B' } . B.(0) should be negative when 0 E (E. a) = [q(B)a]'. a) = Lj~.E). . Var(Oj) = aj(ao . . Note that ). I ~' .3. and Cov{8j . . (c) Problem 1. Xl. . (Bj aj)2. Assume that given f). B = (B" . There are two possible actions: 0'5 a a with losses I( B.) (b) When the loss function is I(B. One such function (Lindley. • 9.. Measures of Performance Chapter 3 (b) n .. 8.  I . do not derive them. Let 0 = ftc . 1). to a close approximation. N{O.\(B) = r .d. Hint: If 0 ~ D(a). bioequivalent. where CI. On the basis of X = (X I. given 0 = B are multinomial M(n. (c) We want to estimate the vector (B" . defined in Problem 1. a = (a" ..O'5). by definition.aj)/a5(ao + 1). Suppose we have a sample Xl. B . c2 > 0 I. i: agency specifies a number f > a such that if f) E (E. (E. where is known.. Let q( 0) = L:. (a) Problem 1. (. E) and positive when f) 1998) is 1. I) = difference in loss of acceptance and rejection of bioequivalence. then E(O)) = aj/ao. then the generic and brandname drugs are.. and that 0 has the Dirichlet distribution D( a). " X n of differences in the effect of generic and namebrand effects fora certain drug. equivalent to a namebrand drug. compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. . where ao = L:J=l Qj..=l CjUj . (a) If I(B.19(c). .)T.198 (a) T + 00. Set a{::} Bioequivalent 1 {::} Not Bioequivalent .9j ) = aiO:'j/a5(aa + 1).\(B) = I(B. a) = (q(B) . (b) Problem 1.3... Bioequivalence trials are used to test whether a generic drug is. .Xn ) we want to decide whether or not 0 E (e.B. . Cr are given constants. 76) distribution.O)  I(B.
0) and l(().2 show that L(x.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. Suppose 9 . For the model defined by (3.. Con 10. 0.6 Problems and Complements 199 where 0 < r < 1. A point (xo. (b) the linear Bayes estimate of fl. Yo) = inf g(xo. .. In Example 3. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. 1) are not constant.6.. Any two functions with difference "\(8) are possible loss functions at a = 0 and I.Yo) = {} (Xo... and 9 is twice differentiable. .. find (a) the linear Bayes estimate of ~l. RP.) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). {} (Xo.. Note that .2.17). Discuss the preceding decision rule for this "prior." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3.Section 3. 2. Yo) is in the interior of S x T. (a) Show that a necessary condition for (xo. Yo) = sup g(x. .6..1) < (T6(n) + c'){log(rg(~)+. Yo) to be a saddle point is that.2. yo) is a saddle point of 9 if g(xo. + = 0 and it 00").Xm).) + ~}" . y).3. (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00").Yo) Xi {}g {}g Yj = 0. respectively. Hint: See Example 3. v) > ?r/(l ?r) is equivalent to T > t.\(±€. . . (Xv. representing x = (Xl. S x T ~ R.16) and (3.1. S T Suppose S and T are subsets of Rm.Yp).3 1.2.y = (YI.
prior. Hint: See Problem 3.2. B1 ) >  (l.thesimplex. I • > 1 for 0 i ~. and show that this limit ( .. L:.. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. a) ~ (0 .o•• ) =R(B1 .PIB = Bo]. iij. . A = {O. the test rule 1511" given by o.d.n12). f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta. Suppose e = {Bo. Yo . 5.d< p.0).Yo) > 0 8 8 8 X a 8 Xb 0.Xn ) . (b) Show that limn_ooIR(O. Let S ~ 8(n.i) =0. B1 ) Ip(X.<7') and 1(<7'.andg(x. y ESp. 1 <j. and 1 . Let X" ._ I)'. the conclusion of von Neumann's theorem holds.a.. (b) Suppose Sm = (x: Xi > 0. . Bd..=1 CijXiYj with XE 8 m ." 1 Xi ~ l}..0•• ). n: I>l is minimax.1(0.b < m.j=O. I(Bi.n). fJ( . N(I'.. . d) = (a) Show that if I' is known to be 0 (!. .200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo.y) ~ L~l 12. • . Suppose i I(Bi.. o(S) = X = Sin. i. I}.l.)w". Let Lx (Bo. f. 4.i..(X) = 1 if Lx (B o.2. and that the mndel is regnlar. B ) ~ p(X. • " = '!. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. Thus.n12.c.. I j 3. X n be i.. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is. Yc Yd foraHI < i.1 < i < m.n)/(n + . Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g.:.. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = . o')IR(O. &* is minimax.= 1 . . 0») equals 1 when B = ~.a)'. B ) and suppose that Lx (B o. and o'(S) = (S + 1 2 .j)=Wij>O. 2 J'(X1 .
dk)T. 6. .3. ttl . hence. See Volume II.. (jl. f(k. . > jm).I)..2. o(X 1.. . . Permutations of I.l . . distribution. ..k} I«i). jl~ < .. ..I'kf. ftk) = (Il?l'"'' J1.. Remark: Stein (1956) has shown that if k > 3.tn I(i. Xl· (e) Show that if I' is unknown. Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l .. ... Show that if (N). . . Let Xi beindependentN(I'i. 8. 9.. J.Pi)' Pjqj <j < k..1)1 L(Xi . .. . Jlk.. ik) is smaller than that of (i'l"'" i~). LetA = {(iI. I' (X"".j. 1).\ ...d) where qj ~ 1 .. prior..P..tj.. = tj. show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. 1 = t j=l (di .\. j 7.. . X is no longer unique minimax. (c) Use (B.). . b. Let Xl. . X k ) = (R l . . h were t·. .\) distribution and 1(. 00. . 0') = (1.. PI. i=l then o(X) = X is minimax.o) for alII'. / . then is minimax for the loss function r: l(p.. 0 1. d) = L(d.29). . See Problem 1. < /12 is a known set of values. .~~n X < Pi < < R(I'. o(X) ~ n~l L(Xi . 1 <i< k. Write X  • "i)' 1(1'. . Then X is nurnmax. .1. that is.X)' is best among all rules of the form oc(X) = c L(Xi . 1 < j < k. .. . N k ) has a multinomial.. ik is an arbitrary unknown permutation of 1.\. Hint: Consider the gamma. .» = Show that the minimax rule is to take L l.). . Rk) where Rj is the rank of Xi. .a? 1.) . Let k .?. ik).Section 3_6 Problems and Complements 201 (b) If I' ~ 0. b = ttl. " a. < im. where (Ill. respectively. Show that if = (I'I''''.a < b" = 'lb. . Xk)T.. For instance. . ..Pi.12..j. . Show that X has a Poisson (.j... that both the MLE and the estimate 8' = (n . Hint: (a) Consider a gamma prior on 0 = 1/<7'. X k be independent with means f.X)' are inadmissible. and iI. d = (d).k. a) = (.X)' and. .. . M (n. o'(X) is also minimax and R(I'.. an d R a < R b. . ..
2. B). I: I ir'z . + l)j(n + k). 13.d with density defined in Problem 1. distribution...BO)T. Suppose that given 6 ~ B._. 14....q)1r(B)dB. n) be i. .. See Problem 3. . B j > 0. ..8.. with unknown distribution F. . d X+ifX 11. distribution.i.1'»2 of these estimates is bounded for all nand 1'..1r) = Show that the marginal density of X.. .. X n be independent N(I'. a) is the posterior mean E(6 I X). 10. BO l) ~ (k . of Xi < X • 1 + 1 o.. For a given x we want to estimate the proportion F(x) of the population to the left of x.... See also Problem 3.3.Xo)T has a multinomial. 1).i f X> .. Let X" . v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) . . ~ I I . X has a binomial.. Hint: Consider the risk function of o~ No. Pk. I: • . 1 . . (b) How does the risk of these estimates compare to that of X? 12.I)!. J K(po.15.. .. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B. Suppose that given 6 = B = (B .202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI..d. Define OiflXI v'n < d < v'n d v'n d d X .. ...B)...a) and let the prior be the uniform prior 1r(B" ... LB= j 01 1.=1 Show that the Bayes estimate is (Xi . Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss. . Let Xi(i = 1.. B(n.4. X = (X"". ! p(x) = J po(x)1r(B)dB. . Let K(po. M(n... Let the loss " function be the KullbackLeibler divergence lp(B..2..
Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable. Jeffrey's "Prior. respectively. and O~ (1 . ry) = p(x. Show that (a) (b) cr5 = n. 0. R. Show that B q(ry) = Bp(h. Hint: k(q.4 1. then E(g(X) > g(E(X». Fisher infonnation is not equivariant under increasing transformations of the parameter. 15. J') < R( (J. h1(ry)) denote the model in the new parametrization. (b) Equivariance of the Fisher Information Bound. . 0) and q(x. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations. 4.x =." A density proportional to VJp(O) is called Jeffrey's prior. ..Section 3.. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality.. It is often improper. Reparametrize the model by setting ry = h(O) and let q(x. 2.. . then That is. Problems for Section 3. ao) + (1 . . Suppose X I. Show that in theN(O.a )I( 0. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). .1"0 known.. Prove Proposition 3. . 7f) and that the minimum is I(J. Show that if 1(0. Jeffrey's priors are proportional to 1.4. .1 . p( x.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. . Give the Bayes rules for squared error in these three cases.aao + (1 . if I(O.f [E.2) with I' . S. ry). . a" (J.d. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient.. . 0) cases.4.alai) < al(O. Show that X is an UMVU estimate of O. J)..). 0) with 0 E e c R.~).O)!.1 E: 1 (Xi  J. Let X I. K) . X n be the indicators of n Bernoulli trials with success probability B. Let X . {log PB(X)}] K(O)d(J. 0) and B(n. that is. the Fisher information lower bound is equivariant. N(I"o. (ry»). a < a < 1. p(X) IO. for any ao. N(I".6 Problems and Complements 203 minimizes k( q. a) is convex and J'(X) = E( J(X) I I(X».k(p. . Equivariance. then R(O.12) for the two parametrizations p(x. X n are Ll.4. a. Let A = 3.X is called the mutual information between 0 and X.'? /1 as in (3. We shall say a loss function is convex. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij.
5(b).o. 2 ~ ~ 10. Show that if (Xl. • 13. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0. compute Var(O) using each ofthe three methods indicated. Y n in twoparameter canonical exponential form and (b) Let 0 = (a. P. X n ) is a sample drawn without replacement from an unknown finite population {Xl. Find lim n. ..1 and variance 0.4. 14. .p) is an unbiased estimate of (72 in the linear regression model of Section 2. {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. n.13 E R. Hint: Consider T(X) = X in Theorem 3.1. distribution.2. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered. Suppose (J is UMVU for estimating fJ.4. .XN }. . i = 1. Let a and b be constants. 1).2 (0 > 0) that satisfy the conditions of the information inequality.\ = a + bB is UMVU for estimating . 0) = Oc"I(x > 0).8. 8.. X n be a sample from the beta. I I . .11(9) as n ~ give the limit of n times the lower bound on the variances of and (J..3.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. . . Ct. Show that 8 = (Y .ZDf3)T(y . . a ~ i 12.. Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji. Does X achieve the infonnation .{x : p(x. B(O. 7.. .\ = a + bOo 11. 9.ZDf3)/(n . . then = 1 forallO..P. =f. . For instance. Show that . 204 Hint: See Problem 3. Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1)..Ii .... In Example 3.4. . Hint: Use the integral approximation to sums.' .4. . Let F denote the class of densities with mean 0. Show that a density that minimizes the Fisher infonnation over F is f(x. 00. Let X" .. a ~ (c) Suppose that Zi = log[i/(n + 1)1. Show that assumption I implies that if A . then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . Suppose Yl . and . Establish the claims of Example 3. find the bias o f ao' 6. (a) Find the MLE of 1/0. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic. 0) for any set E. ( 2). . .
1 < i < h.4. 'L~ 1 h = N. . even for sampling without replacement in each stratum. Show that is not unbiasedly estimable. B(n.. 15. G 19. 20.3. P[o(X) = 91 = 1. .. distribution. (c) Explain how it is possible if Po is binomial. if and only if.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. Let 7fk = ~ and suppose 7fk = 1 < k < K.X)/Var(U).. then E( M) ~ L 1r j=1 N J ~ n. Show that X k given by (3. Suppose UI. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. B(n.} and X =K  1". (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl.~). Show that if M is the expected sample size. Let X have a binomial.• .4).Section 3. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population.l. .". 18. K. More generally only polynomials of degree n in p are unbiasedly estimable.4. 17. (b) Deduce that if p. Suppose X is distributed accordihg to {p... _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk. UN are as in Example 3. . . (See also Problem 1. = 1.~ for all k. Define ~ {Xkl. XK. . . 7. k = 1.) Suppose the Uj can be relabeled into strata {xkd. k=l K ~1rkXk. 16.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ .4. 8). : 0 E for (J such that E((J2) < 00..1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o. Stratified Sampling.p). that ~ is a Bayes estimate for O.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. = N(O. .. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss.Xkl.. X is not II Bayes estimate for any prior 1r.6 Problems and Complements 205 (b) The variance of X is given by (3. :/i..4.
4. and we can thus define moments of 8/8B log p(x. T .3. If a = 0. Let X ~ U(O.l)a is an integer.25 and net =k 3. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3.206 Hint: Given E(ii(X) 19) ~ 9.4. the upper quartile X. give and plot the sensitivity curve of the median. (iii) 2X is unbiased for . 4. • Problems for Section 35 I is an integer. Yet show eand has finite variance."" F (ij) Vat (:0 logp(X. Var(a T 6) > aT (¢(9)II(9). Show that. 22.25. use (3. B») ~ °and the information bound is infinite. Regularity Conditions are Needed for the Information Inequality.25 and (n . Prove Theorem 3. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6. 1 X n1 C. An estimate J(X) is said to be shift or translation equivariant if. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . Hint: It is equivalent to show that. B) be the uniform distribution on (0. Note that logp(x. for all X!. 5./(9)a [¢ T (9)a]T II (9)[¢ T (9)a].. If a = 0. for all adx 1. with probability I for each B.. Show that the sample median X is an empirical plugin estimate of the population median v.4. 1 2.9)' 21. If n IQR.75' and the IQR. that is. B). Show that the a trimmed mean XCII. however. . give and plot the sensitivity curves of the lower quartile X. . is an empirical plugin estimate of ~ Here case. B) is differentiable fat anB > x..5. = 2k is even.5) to plot the sensitivity curve of the 1. B).
1. plot the sensitivity curves of the mean. X n is a sample from a population with dJ. Xu.6. One reasonable choice for k is k = 1. I)order statistics). For x > .5 and for (j is.. <i< n Show that (a) k = 00 corresponds to X. xH L is translation equivariant and antisymmetric. . and xiflxl < k kifx > k kifx<k. X.3.Section 3. J is an unbiased estimate of 11. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj)..  XI/0.67. k t 0 to the median. X a arc translation equivariant and antisymmetric.e. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite. i < j..03. (b) Show that   ..JL) where JL is unknown and Xi .) ~ 8. . X are unbiased estimates of the center of symmetry of a symmetric distribution. (See Problem 3.1. .30. . The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. F(x . trimmed mean with a = 1/4.. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified. (b) Suppose Xl.(a) Show that X. then (i. In . Its properties are similar to those of the trimmed mean. '& is an estimate of scale. is symmetrically distributed about O. .. .30. . Deduce that X.5. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1.).. and the HodgesLehmann estimate. 7.03 (these are expected values of four N(O.6 Problems and Complements 207 It is antisymmetric if for all Xl. median.. (7= moo 1 IX. .
Location Parameters..x}2. we say that g(. Let JJo be a hypothesized mean for a certain population. In the case of the Cauchy density. .1)1 L:~ 1(Xi . (e) If k < 00.6).X. the standard deviation does not exist. 9. (b) Find thetml probabilities P(IXI > 2).) has heavier tails than f() if g(x) is above f(x) for Ixllarge. This problem may be done on the computer. = O. Show that SC(x. and is fixed. .5. (d) Xk is translation equivariant and antisymmetric (see Problem 3. Xk) is a finite constant.7(a). If f(·) and g(." . where S2 = (n .i.5. then limlxj>00 SC(x. Use a fixed known 0"0 in place of Ci. and "" "'" (b) the Iratio of Problem 3. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. . The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \.Xn arei.11. Laplace. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10. thenXk is the MLEofB when Xl.d. The (student) tratio is defined as 1= v'n(x I'o)/s. . Suppose L:~i Xi ~ . 00. 1 Xnl to have sample mean zero. P(IXI > 3) and P(IXI > 4) for the nonnal.208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0. For the ideal sample of Problem 3. thus.) are two densities with medians v zero and identical scale parameters 7.~ !~ t .5. . ii. n is fixed. Let X be a random variable with continuous distribution function F..75 . •• . plot the sensitivity curve of (a) iTn . with density fo((. [. 11. we will use the IQR scale parameter T = X. Let 1'0 = 0 and choose the ideal sample Xl. .• 13. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00..25.O)/ao) where fo(x) for Ixl for Ixl <k > k. In what follows adjust f and 9 to have v = 0 and 'T = 1. and Cauchy distributions. 00. X 12...
. it is called a location parameter.1: (a) Show that SC(x.Section 3.]. 8) and ~ IF(x.t.. 8) = IF~ (x.. 8. · . ~ ~ 15. ~ ~ 8n (a +bxI. then v(F) and i/( F) are location parameters. (a) Show that if F is symmetric about c and () is a location parameter. and sample trimmed mean are shift and scale equivariant.Xn_l) to show its dependence on Xnl (Xl. any point in [v(F). 8._. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y.5.t )) = 0.F).O. c E R..6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :. 0 < " < 1. a + bx. . (jx.8. Show that J1. Also note that H(x) is symmetric about zero.5) are location parameters. i/(F)] is the value of some location parameter.. the ~ = bdSC(x._. ~ (a) Show that the sample mean. . compare SC(x. .5. Fn _ I ). ~ (b) Write the Be as SC(x.. x n _. (b) Show that the mean Ji. and ordcr preserving. a.) 14. 8.  vl/O. .5.a +bxn ) ~ a + b8n (XI. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. . c + dO. b > 0. .(F) = !(xa + XI").(F): 0 <" < 1/2}.xn ). median v. then ()(F) = c. ([v(F).. Let Y denote a random variable with continuous distribution function G..) That is. let v" = v.. Show that if () is shift and location equivariant. i/(F)] and. Hint: For the second part. F(t) ~ P(X < t) > P(Y < t) antisymmctric. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F). if F is also strictly increasing. (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval.). An estimate ()n is said to be shift and scale equivariant if for all xl. b > 0.8.. xnd. let H(x) be the distribution function whose inverse is H'(a) = ![xax. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). . In Remark 3. and note thatH(xi/(F» < F(x) < H(xv(F». n (b) In the following cases.67 and tPk is defined in Problem 3. If 0 is scale and shift equivariant. ~ = d> 0. then for a E R. SC(a + bx. F) lim n _ oo SC(x. .xn . ~ se is shift invariant and scale equivariant.(k) is a (d) For 0 < " < 1. In this case we write X " Y.. X is said to be stochastically smaller than Y if = G(t) for all t E R. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :..:. where T is the median of the distributioo of IX location parameter. sample median.. and trimmed population mean JiOl (see Problem 3.
B ~ {F E :F : Pp(A) E B}. Frechet. 18. . This is. The NewtonRaphson method in this case is . and (iii) preceding? ~ 16. where B is the class of Borel sets. then J. then fJ. (ii).1/. .4. • • . {BU)} do not converge. in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0.t is identifiable.OJ then l(I(i) . • . (ii) O(F) = (iii) e(F) "7. We define S as the .I F(x. show that (a) If h is a density that is symmetric about zero.0 > 0 (depending on t/J) such that if lOCO) . A. B E B.8) .. (b) If no assumptions are made about h. · . (e) Does n j [SC(x. (1) The result of Theorem 3. . ) (a) Show by example that for suitable t/J and 1(1(0) . consequently. 81" < 0. I : .field generated by SA. ~ ~ 17. 0..4 I . (b) ~ ~ . Let d = 1 and suppose that 'I/J is twice continuously differentiable.5. Assume that F is strictly increasing. Because priority of discovery is now given to the French mathematician M.1) Hint: (a) Try t/J(x) = Alogx with A> 1. also true of the method of coordinate ascent.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets). we shall • I' . and we seek the unique solution (j of 'IjJ(8) = O. C < 00. is not identifiable. IlFfdF(x). (b) Show that there exists.2).Bilarge enough.7 NOTES Note for Section 3. 3. I I Notes for Section 3. . Show that in the bisection method.rdF(l:). ~ I(x  = X n . In the gross error model (3. F)! ~ 0 in the cases (i).BI < C1(1(i.' > 0. we in general must take on the order of log ~ steps.210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J .1 is commonly known as the Cramer~Rao inequality. .
K.. OLIVER. 3. AND E. 0. "Descriptive Statistics for Nonparametric Models.(T(X)) = 00. ofStatist." Ann. T." Ann. DOWE. "Descriptive Statistics for Nonparametric Models. MA: AddisonWesley. AND C. DE GROOT. S. (2) Note that this inequality is true but uninteresting if f(O) = 00 (and 1/J'(0) is finite) or if Var. P. J. Introduction. F. 15231535 (1969). "Descriptive Statistics for Nonparametric Models. >. 1972. Optimal Statistical Decisions New York: McGrawHill. F. P. LEHMANN. BOHLMANN. 3. 2. Math. A. BICKEL. BAXTER.4." Ann.. P.. P. J.. DoKSUM. L.thematical Methods in Risk Theory Heidelberg: Springer Verlag. J. Statist. Reading. BICKEL. 1970. Dispersion. Mathematical Analysis. 1122 (1975).. J.Section 3." Ann. AND A. BICKEL. Jroo T(x) [~p(X' 0)] dx roo Joo 8(J 00 00 00 for all (J whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in (4) The finiteness of Var8(T(X)) and f(O) imply that 1/J'(0) is finite by the covariance interpretation given in (3. 1. Statist.. M. ANDERSON.10451069 (1975b). AND E. (3) The continuity of the first integral ensures that : (J [rO )00 . H. "Unbiased Estimation in Convex. Statist. Location. 1994..8 References 211 follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement. BERNARDO...)dXd>. "Measures of Location and AsymmeUy. in Proceedings of the Second Pa. DAHLQUIST. 4. Ma.] = Jroo. W.. LEHMANN. 10381044 (l975a). Robust Estimates of Location: Sun>ey and Advances Princeton. 40. P. AND E. W.. Statistical Decision Theory and Bayesian Analysis New York: Springer. D." Scand. H. M. Point Estimation Using the KullbackLeibler Loss Function and MML. P. M.cific Asian Conference on Knowledge Discovery and Data Mintng Melbourne: SpringerVerlag. I. R. BJORK. NJ: Princeton University Press.. 1998. 3. AND N. LEHMANN. 11391158 (1976). 2nd ed. 1985..8 REFERENCES ANDREWS.. A. Jroo T(x) :>. F. Statist. WALLACE. D. AND E. A. G. III. ROGERS. BERGER. SMITH. H. Families.. II.. TUKEY. HUBER. J. AND J.. 1969. 1974.. J. Numen'cal Analysis New York: Prentice Hall. BICKEL. APoSTOL. BICKEL. R.8). . 1974. M. Bayesian Theory New York: Wiley. p(x.. LEHMANN. HAMPEL.
Math. SHIBATA. ROUSSEUW. 13. 909927 (1974). "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification. Statist.. Testing Statistical Hypotheses New York: Springer. JAECKEL.. Assoc. Statist. 1948.. 1998. Robust Statistics New York: Wiley. H. LEHMANN. W. S. Math. KARLIN. Theory ofPoint Estimation. 1. A. P. "Model Selection and the Principle of Mimimum Description Length.V. "Robust Estimates of Location. • • 1 . LINDLEY. HUBER. Theory ofProbability. Stu/ist." J.... Programming. D. RoyalStatist. StrJtist.. R.. 1. LINDLEY. "Adaptive Robust Procedures. j I JEFFREYS. 2nd ed. . Math.. Amer. RISSANEN. 10201034 (1971). AND B. AND W. . 49. Y. SAVAGE. Assoc. Exploratory Data Analysis Reading. "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution. 1972. AND G. B. • NORBERG.. YU. P." Proc.. 197206 (1956). 1986. Assoc.. I. R. L. D. (2000). R.n.. and Economics Reading. HANSEN. 136141 (1998)." Statistical Science. CASELLA.. 2nd ed. 43. H.. Part II: Inference. WUSMAN." Statistica Sinica. R. 7. Soc.. Amer." Ann. TuKEY." Ann. "Boostrap Estimate of KullbackLeibler Information for Model Selection. London: Oxford University Press. E. Wiley & Sons. 383393 (1974). Wiley & Sons. Amer.. 2Q4. Third Berkeley Symposium on Math. Soc. RONCHEro. The Foundations afStatistics New York: J. 1986. AND P. HUBER... 1954. Stlltist. Introduction to Probability and Statistics from a Bayesian Point of View. MA: AddisonWesley. F." J... HAMPEL. 223239 (1987). 1959. B. 538542(1973). Cambridge University Press. Part I: Probability. C. Robust Statistics: The Approach Based on Influence Functions New York: J. New York: Springer. E. L. 240251 (1987)." Scand. London." Ann. . C. E. P. Actuarial J. "Robust Statistics: A Review.375394 (1997). STAHEL. University of California Press. J. FREEMAN. "The Influence Curve and Its Role in Robust Estimation.. 49. Stall'. "Estimation and Inference by Compact Coding (With Discussions). L. 69.10411067 (1972). M.222 (1986). S. StatiSl. "On the Attainment of the CramerRao Lower Bound. LEHMANN. "Decision Analysis and BioequivaIence Trials.. Royal Statist. 69. J. 1965.. . and Probability. L. "Stochastic Complexity (With Discussions):' J. STEIN. 42. A.. 1981.212 Measures of Performance Chapter 3 HAMPEL. WALLACE.. MA: AddisonWesley.." J. E." J. R. Mathematical Methods and Theory in Games. HOGG..
3. and indeed most human activities. the parameter space e. This framework is natural if. Sex Bias in Graduate Admissions at Berkeley. peIfonn a survey. where is partitioned into {80 . pUblic policy. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial. what is an appropriate stochastic model for the data may be questionable. conesponding.3 the questions are sometimes simple and the type of data to be gathered under our control. o E 8. and the corresponding numbers N mo .PfO).PmllPmO.1. it might be tempting to model (Nm1 .2.1. 3. Nfo) by a multinomial. PI or 8 0 . and what 8 0 and 8 1 correspond to in tenns of the stochastic model may be unclear.Chapter 4 TESTING AND CONFIDENCE REGIONS: BASIC THEORY 4.Pfl. As we have seen. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. Nmo. to answering "no" or "yes" to the preceding questions.1. not a sample. distribution. But this model is suspect because in fact we are looking at the population of all applicants here. medicine. The design of the experiment may not be under our control.Nf1 . treating it as a decision theory problem in which we are to decide whether P E Po or P l or. 8 1 are a partition of the model P or. parametrically. and we have data providing some evidence one way or the other. If n is the total number of applicants. as is often the case. where Po. in examples such as 1. or more generally construct an experiment that yields data X in X C Rq. whether (j E 8 0 or 8 1 jf P j = {Pe : () E 8 j }. modeled by us as having distribution P(j. Nfo of denied applicants. Usually. we are trying to get a yes or no answer to important questions in science. the situation is less simple. e Example 4. respectively.1 INTRODUCTION In Sections 1. M(n. what does the 213 .Nfb the numbers of admitted male and female applicants. 8d with 8 0 and 8. Accepting this model provisionally. and 3. respectively.3 we defined the testing problem abstractly. They initially tabulated Nm1. Here are two examples that illustrate these issues.
The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. Mendel's Peas. In these tenns the hypothesis of "no bias" can now be translated into: H: Pml Pmld PmOd Pfld Pild + + PfOd . for d = 1. That is. if there were n dominant offspring (seeds). If departments "use different coins.1. I I 1] Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. • ..214 Testing and Confidence Regions Chapter 4 hypothesis of no sex bias correspond to? Again it is natural to translate this into P[Admit I Male] = Pm! Pml +PmO = P[Admit I Female] = P fI Pil +PjO But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability Pml Pi! Pml + PmO PIl +PiO [n fact.than might be expected under the hypothesis that N AA has a binomial. if the inheritance ratio can be arbitrary. as is discussed in a paper by Bickel. B (n. In a modem formulation. . ..: Example 4. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). I . D). The hypothesis of dominant inheritance ~. . distribution. admissions are petfonned at the departmental level and rates of admission differ significantly from department to department. d = 1. Mendel crossed peas heterozygous for a trait with two alleles. Pfld. NjOd. ~). D). i:. the natural model is to assume.p) distribution. either N AA cannot really be thought of as stochastic or any stochastic I . • In fact. • I I . . PfOd...=7xlO 5 ' n 3 . ~. ! . that N AA. the number of homozygous dominants.D. and so on. m P [ NAA . . NIld. ... • Pml + P/I + PmO + Pio • I .. pml +P/I !. It was noted by Fisher corresponds to H : p = ~ with the alternative K : p as reported in Jeffreys (1961) that in this experiment the observed fraction ':: was much closer to 3. This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. and O'Connell (1975). . one of which was dominant. Hammel. N mOd . 0 I • . the same data can lead to opposite conclusions regarding these hypothesesa phenomenon called Simpson's paradox.! . where N m1d is the number of male admits to department d. ." then the data are naturally decomposed into N = (Nm1d. In one of his famous experiments laying the foundation of the quantitative theory of genetics.2.. . OUf multinomial assumption now becomes N ""' M(pmld' PmOd..n 3 t I .< . . d = 1. . I . has a binomial (n. .
. The set of distributions corresponding to one answer. and we are then led to the natural 0 . = Example 4. 8 0 and H are called compOSite. (}o] and eo is composite. These considerations lead to the asymmetric formulation that saying P E Po (e E 8 0 ) corresponds to acceptance of the hypothesis H : P E Po and P E PI corresponds to rejection sometimes written as K : P E PJ . Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. we call 8 0 and H simple. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied.€)51 + cB( nlP). we shall simplify notation and write H : () = eo. When 8 0 contains more than one point. If we suppose the new drug is at least as effective as the old. see. If the theory is false. then S has a B(n. 11 and K is composite.Xn ). Thus. Most simply we would sample n patients. we reject H if S exceeds or equals some integer. n 3' 0 What the second of these examples suggests is often the case. To investigate this question we would have to perform a random experiment.1. I} or critical region C {x: Ii(x) = I}. and then base our decision on the observed sample X = (X J. Thus. Suppose that we know from past experience that a fixed proportion Bo = 0. in the e . where 00 is the probability of recovery usiog the old drug. (2) If we let () be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite. suppose we observe S = EXi . 0 E 8 0 . We illustrate these ideas in the following example. then 8 0 = [0 . the set of points for which we reject.1 Introduction 215 model needs to pennit distributions other than B( n.1. say 8 0 . The same conventions apply to 8] and K.(1) As we have stated earlier. and accept H otherwise.3 recover from the disease with the old drug.3. It is convenient to distinguish between two structural possibilities for 8 0 and 8 1 : If 8 0 consists of only one point. where Xi is 1 if the ith patient recovers and 0 otherwise.3. it's not clear what P should be as in the preceding Mendel example. In this example with 80 = {()o} it is reasonable to reject IJ if S is "much" larger than what would be expected by chance if H is true and the value of B is eo. . say k. then = [00 . administer the new drug. In situations such as this one . That is. Now 8 0 = {Oo} and H is simple. . p). the number of recoveries among the n randomly selected patients who have been administered the new drug. Moreover.Section 4.1. recall that a decision procedure in the case of a test is described by a test function Ii: x ~ {D. for instance. P = Po. (1 . B) distribution. See Remark 4. is point mass at . say. our discussion of constant treatment effect in Example 1.11. K : () > Bo Ifwe allow for the possibility that the new drug is less effective than the old.1 loss l(B. That a treatment has no effect is easier to specify than what its effect is.!. It will turn out that in most cases the solution to testing problem~ with 80 simple also solves the composite 8 0 problem. 8 1 is the interval ((}o. is better defined than the alternative answer 8 1 . a) = 0 if BE 8 a and 1 otherwise. where 1 ~ E is the probability that the assistant fudged the data and 6!. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug.. 
Acceptance and rejection can be thought of as actions a = 0 or 1. In science generally a theory typically closely specifies the type of distribution P of the data X as, for instance, P = P0.
but computation under H is easy.2. our critical region C is {X : S rule is Ok(X) = I{S > k} with > k} and the test function or PI ~ probability of type I error = Pe.3. In that case rejecting the hypothesis at level a is interpreted as a measure of the weight of evidence we attach to the falsity of H. 0 > 00 . matches at one position are independent of matches at other positions) and the probability of a match is ~. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. then the probability of exceeding the threshold (type I) error is smaller than Q". . No one really believes that H is true and possible types of alternatives are vaguely known at best.1. Given this position.1.. and later chapters. We will discuss the fundamental issue of how to choose T in Sections 4.. : I I j ! 1 I . ~ : I ! j . The value c that completes our specification is referred to as the critical value of the test. e 1 . is much better defined than its complement and/or the distribution of statistics T under eo is easy to compute. (Other authors consider test statistics T that tend to be small. asymmetry is often also imposed because one of eo. generally in science. To detennine significant values of these statistics a (more complicated) version of the following is done. how reasonable is this point of view? In the medical setting of Example 4.e.3 this asymmetry appears reasonable. (5 The constant k that determines the critical region is called the critical value. if H is false. As we noted in Examples 4. .2 and 4. 1 i . announcing that a new phenomenon has been observed when in fact nothing has happened (the socalled null hypothesis) is more serious than missing something new. The Neyman Pearson Framework The Neyman Pearson approach rests On the idea that. (5 > k) < k).3. It has also been argued that. it again reason~bly leads to a Neyman Pearson fonnulation. if H is true..216 Testing and Confidence Regions Chapter 4 tenninology of Section 1. . In most problems it turns out that the tests that arise naturally have the kind of structure we have just described. 0 PH = probability of type II error ~ P. We call T a test statistic. 4.2.T would then be a test statistic in our sense.1.. as we shall see in Sections 4. of the two errors. suggesting what test statistics it is best to use. We now tum to the prevalent point of view on how to choose c.3. one can be thought of as more important.. .that has in fact occurred. when H is false. but if this view is accepted.1 . ! : .1 and4. • . . 1.) We select a number c and our test is to calculate T(x) and then reject H ifT(x) > c and accept H otherwise. Note that a test statistic generates a family of possible tests as c varies. There is a statistic T that "tends" to be small. By convention this is chosen to be the type I error and that in tum detennines what we call H and what we call K. and large. We do not find this persuasive. For instance.. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then. testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. Thresholds (critical values) are set so that if the matches occur at random (i.
it is too limited in any situation in which. The power of a test against the alternative 0 is the probability of rejecting H when () is true. Finally.05 critical value 6 and the test has size . k = 6 is given in Figure 4. Once the level or critical value is fixed.1. Here > cJ. if our test statistic is T and we want level a. This quantity is called the size of the test and is the maximum probability of type I error. . This is the critical value we shall use.2(b) is the one to take in all cases with 8 0 and 8 1 simple. 8 0 = 0.9. the approach of Example 3. Ll) Nowa(e) is nonincreasing in c and typically a(c) r 1 as c 1 00 and a(e) 1 0 as c r 00. That is.(6) = Pe. e t .j J = 10. we find from binomial tables the level 0.1. In Example 4.3.1. 80 = 0. even nominally. the probabilities of type II error as 8 ranges over 8 1 are determined.Ok) = P(S > k) A plot of this function for n ~ t( j=k n ) 8i (1 . there exists a unique smallest c for which a(c) < a. such as the quality control Example 1. Here are the elements of the Neyman Pearson story. Such tests are said to have level (o/significance) a. Both the power and the probability of type I error are contained in the power function. It can be thought of as the probability that the test will "detect" that the alternative 8 holds. which is defined/or all 8 E 8 by e {3(8) = {3(8. Specifically. In that case.1. If 8 0 is composite as well.Section 4.(S > 6) = 0. even though there are just two actions.2. 0) is just the probability of type! errot. By convention 1 .01 and 0.05 are commonly used in practice. numbers to the two losses that are not equal and/or depend on 0.1 Introduction 217 There is an important class of situations in which the Neyman Pearson framework is inappropriate. Thus. if we have a test statistic T and use critical value c. It is referred to as the level a critical value.3 with O(X) = I{S > k}. then the probability of type I error is also a function of 8.0473.3 and n = 10.2. See Problem 3.P [type 11 error] is usually considered. whereas if 8 E power against ().1. Begin by specifying a small number a > 0 such that probabilities of type I error greater than a arc undesirable.8)n. Indeed. The power is a function of 8 on I.1. The values a = 0. Because a test of level a is also of level a' > a. the power is 1 minus the probability of type II error.1. 0) is the {3(8.1. Then restrict attention to tests that in fact have the probability of rejection less than or equal to a for all () E 8 0 . Example 4. Definition 4. and we speak of rejecting H at level a. we can attach. in the Bayesian framework with a prior distribution on the parameter.. it is convenient to give a name to the smallest level of significance of a test. our test has size a(c) given by a(e) ~ sup{Pe[T(X) > cJ: 8 E eo}· (4.3 (continued). {3(8. if 0 < a < 1.0) = Pe[Rejection] = Pe[o(X) = 1] = Pe[T(X) If 8 E 8 0 • {3(8.
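The numbers in this example are easy to check numerically. The following sketch (Python with scipy is assumed; it is illustrative and not part of the original text) finds the smallest critical value k whose size is at most α = 0.05 for n = 10, θ0 = 0.3, and then evaluates the power function β(θ, δk) = Pθ(S ≥ k).

```python
import numpy as np
from scipy.stats import binom

n, theta0, alpha = 10, 0.3, 0.05

def size(k):
    # probability of type I error of "reject iff S >= k" at theta0
    return binom.sf(k - 1, n, theta0)          # P_{theta0}(S >= k)

# smallest k with size <= alpha: the level-0.05 critical value (k = 6 here)
k = next(k for k in range(n + 2) if size(k) <= alpha)

def power(theta):
    # power function beta(theta, delta_k) = P_theta(S >= k)
    return binom.sf(k - 1, n, theta)

print(k, size(k), power(0.5))   # 6, about 0.0473, about 0.377
```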
α(k) = sup{Pθ[T(X) ≥ k] : θ ∈ Θ0} = Pθ0[T(X) ≥ k].

It follows that the level and size of the test are unchanged if instead of Θ0 = {θ0} we used Θ0 = [0, θ0].

[Figure 4.1.1. Power function of the level 0.05 one-sided test δk of H : θ = 0.3 versus K : θ > 0.3 for the B(10, θ) family of distributions. The power β(θ, δk) is plotted as a function of θ.]

From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). Note that in this example the power at θ = θ1 > 0.3 is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to θ1 > 0.3. When θ1 is 0.5, a 67% improvement, this probability is only 0.3770. What is needed to improve on this situation is a larger sample size n. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. We return to this question in Section 4.3.

Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that X = (X1, ..., Xn) is a sample from a N(μ, σ²) population with σ² known. We want to test H : μ ≤ 0 versus K : μ > 0. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of n randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then, after some time, administer the drug and record sleeping time again. Let Xi be the difference between the time slept after administration of the drug and the time slept without administration of the drug by the ith patient. If we assume X1, ..., Xn are normally distributed with mean μ and variance σ², then the drug effect is measured by μ; H is the hypothesis that the drug has no effect or is detrimental, whereas K is the alternative that it has some positive effect. (The σ² unknown case is treated in Section 4.5.)
. . < T(B+l) are the ordered T(X). The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1.1.1.. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods.1. has a closed form and is tabled.. Because (J(jL) a(e) ~ is increasing. In all of these cases.( c) = C\' or e = z(a) where z(a) = z(1 . has level a if £0 is continuous and (B + 1)(1 . .• T(X(B)).3. Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property.d..o if we have H(p.~I. #.i. p ~ P[AA]. T(X(lI). then the test that rejects iff T(X) > T«B+l)(la)).2) = 1.1.a) is the (1.5. where T(l) < . it is clear that a reasonable test statistic is d((j.(T(X)) doesn't depend on 0 for 0 E 8 0 .I') ~ ~ (c + V.o versus p.y) : YES}.a) is an integer (Problem 4.9).V. But it occurs also in more interesting situations such as testing p. (12) observations with both parameters unknown (the t tests of Example 4.. = P.a) quantile oftheN(O. In Example 4. in any case.5). That is.p) because ~(z) (4. ~ estimates(jandd(~. it is natural to reject H for large values of X.. sup{(J(p) : p < OJ ~ (J(O) = ~(c). Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward. The task of finding a critical value is greatly simplified if £.1.1 and Example 4. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP. However. This minimum distance principle is essentially what underlies Examples 4.~(z).3.eo) = IN~A . . It is convenient to replace X by the test statistic T(X) = .1 Introduction 219 Because X tends to be larger under K than under H.   = . T(X(l)).1. In Example 4. The smallest c for whieh ~(c) < C\' is obtained by setting q:. where d is the Euclidean (or some equivalent) distance and d(x. T(X(B)) from £0.2 and 4.co . This occurs if 8 0 is simple as in Example 4.1.Section 4. 1) distribution. if we generate i. Rejecting for large values of this statistic is equivalent to rejecting for large values of X. £0. and we have available a good estimate (j of (j.1.2.P..(jo) = (~ (jo)+ where y+ = Y l(y > 0).co =.nX/ (J. the common distribution of T(X) under (j E 8 0 . S) inf{d(x..3.~ (c . eo). N~A is the MLE of p and d(N~A. which generates the same family of critical regions.
. ..1. 1) distribution.i. Let X I. as X . and the result follows.. In particular.U. then by Problem B. where F is continuous. . • Example 4.1 El{Fo(Xi ) < Fo(x)} n. The natural estimate of the parameter F(!. Consider the problem of testing H : F = Fo versus K : F i. F..12 + OJI/Vri) close approximations to the size" critical values ka are h n (L628). the empirical distribution function F.n {~ n Fo(x(i))' FO(X(i)) _ .. that is. where U denotes the U(O. Un' where U denotes the empirical distribution function of Ul u = Fo(x) ranges over (0. + (Tx) is .1 El{Xi ~ U(O. .   Dn = sup IF(x) . Set Ui ~ j .)} n (4. h n (L358).1 J) where x(l) < . x It can be shown (Problem 4. :i' D n = sup [U(u) . U.1. (T2 = VarF(Xtl.u[ O<u<l . . This is again a consequence of invariance properties (Lehmann.Fa.(i_l...5.1. . can be wriHen as Dn =~ax max tI. and . < x(n) is the ordered observed sample. Suppose Xl. thus. Goodness of Fit Tests.d.05..1. This statistic has the following distributionjree property: 1 Proposition 4. F o (Xi).. n. and hn(t) = t/( Vri + 0.. P Fo (D n < d) = Pu (D n < d). and h n (L224) for" = . Also F(x) < x} = n.220 Testing and Confidence Regions Chapter 4 1 . . The distribution of D n under H is the same for all continuous Fo. The distribution of D n has been thoroughly smdied for finite and large n..Fo(x)[.4. Proof.6. As x ranges over R. In particular. Goodness of Fit to the Gaussian Family. as a tcst statistic . We can proceed as in Example 4.L5 rewriting H : F(!' + (Tx) = <p(x) for all x where !' = EF(X .. which is called the Kolmogorov statistic.7) that Do.d. < Fo(x)} = U(Fo(x)) .3. 1).. . • . X n are ij.. ). .01.10 respectively.. 1997).1.. which is evidently composite. . o Note that the hypothesis here is simple so that for anyone of these hypotheses F = F o• the distribution can be simulated (or exhibited in closed fonn). What is remarkable is that it is independ~nt of which F o we consider.. the order statistics. F and the hypothesis is H : F = ~ (' (7'/1:) for some M.. for n > 80.Xn be i. 0 Example 4.1 El{U. 1). Let F denote the empirical distribution and consider the sup distance between the hypothesis Fo and the plugin estimate of F. .
05. and the critical value may be obtained by simulating ij. . . a > <l'( T(x)). thereby obtaining Tn1 .. This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x. H is rejected..Section 4.. N(O. Now the Monte Carlo critical value is the I(B + 1)(1 . We do this B times independently. where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again. Therefore. if X = x.. {12 and is that of ~  (Zi .01. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test. Tn sup IF'(X x x + iix) .d. whereas experimenter II insists on using 0' = 0..d.X) .. Tn!. Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0. 1.<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi . · · .. Example 4.1 Introduction 221 (12.' . 1 < i < n. Tn has the same distribution £0 under H.. Zn arc i. (4. I < i < n. otherwise.4) Considered as a statistic the pvalue is <l'( y"nX /u). T nB .i. .. .3. It is then possible that experimenter I rejects the hypothesis H. ~n) doesn't depend on fl.4. a satisfies T(x) = . If the two experimenters can agree on a common test statistic T. we would reject H if. Consider.X)/ii. whereas experimenter II accepts H on the basis of the same outcome x of an experiment. I). and only if. and only if. T nB . if the experimenter's critical value corresponds to a test of size less than the p~value. the joint distribution of (Dq . If we observe X = x = (Xl. . from N(O.a) + IJth order statistic among Tn.) Thus.<l'(x)1 sup IG(x) . observations Zi.(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size. the pvalue is <l'( T(x)) = <l' (. . 1).1. then computing the Tn corresponding to those Zi. .(iix u > z(a) or upon applying <l' to both sides if. That is. In).Z) / (~ 2::7 ! (Zi  if) • . where Z I. . (Sec Section 8. for instance. But. under H. H is not rejected.2. .. whatever be 11 and (12.
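A sketch of the Monte Carlo critical value described for Example 4.1.6 (Python with numpy and scipy assumed; the sample size n and number of replications B below are illustrative choices, not values from the text):

```python
import numpy as np
from scipy.stats import norm

def T_n(x):
    """sup_x |G_hat(x) - Phi(x)|, with G_hat the empirical cdf of (X_i - Xbar)/sigma_hat."""
    d = np.sort((x - x.mean()) / x.std())        # x.std() uses divisor n, i.e. the MLE of sigma
    n = len(d)
    i = np.arange(1, n + 1)
    u = norm.cdf(d)
    return np.maximum(i / n - u, u - (i - 1) / n).max()

# The null distribution of T_n is the same for all (mu, sigma^2), so simulate from N(0, 1)
# and take the [(B + 1)(1 - alpha)]-th order statistic as the critical value.
rng = np.random.default_rng(0)
n, B, alpha = 50, 999, 0.05
sims = np.sort([T_n(rng.normal(size=n)) for _ in range(B)])
critical_value = sims[int(round((B + 1) * (1 - alpha))) - 1]   # (B + 1)(1 - alpha) = 950
print(critical_value)
```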
In general, we can express the p-value simply in terms of the function α(·) defined in (4.1.1). Suppose that we observe X = x. Then if we use critical value c, we would reject H if, and only if, T(x) ≥ c. Thus, the largest critical value c for which we would reject is c = T(x). But the size of a test with critical value c is just α(c), and α(c) is decreasing in c. Thus, the smallest α for which we would reject corresponds to the largest c for which we would reject and is just α(T(x)). We have proved the following.

Proposition 4.1.2. The p-value is α(T(X)).

For example, in Example 4.1.3 the p-value is α(s), where s is the observed value of S and α(s) = Pθ0(S ≥ s). The normal approximation is used for the p-value also:

α(s) = Pθ0(S ≥ s) ≈ 1 − Φ([s − 1/2 − nθ0] / [nθ0(1 − θ0)]^(1/2))   (4.1.5)

for min{nθ0, n(1 − θ0)} ≥ 5.

The p-value can be thought of as a standardized version of our original statistic: α(T) is on the unit interval, and when H is simple and T has a continuous distribution, α(T) has a uniform, U(0, 1), distribution. The p-value is used extensively in situations of the type we described earlier, when H is well defined but K is not, so that type II error considerations are unclear. In this context, to quote Fisher (1958), "The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. 80).

It is possible to use p-values to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. For example, if r experimenters use continuous test statistics T1, ..., Tr to produce p-values α(T1), ..., α(Tr), then if H is simple, Fisher (1958) proposed using

T = −2 Σ_{j=1}^{r} log α(Tj)   (4.1.6)

to test H. Under H the statistic T has a chi-square distribution with 2r degrees of freedom; thus, H is rejected if T ≥ x_{1−α}, where x_{1−α} is the (1 − α)th quantile of the χ²_{2r} distribution. Various methods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). These kinds of issues are currently being discussed under the rubric of data fusion and meta-analysis (e.g., see Hedges and Olkin, 1985).
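A sketch of the combination rule (4.1.6) (Python with scipy assumed; the three p-values below are made-up numbers used only for illustration):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combination(pvalues):
    """T = -2 * sum(log p_j); for a simple H and continuous statistics, T ~ chi-square(2r)."""
    p = np.asarray(pvalues, dtype=float)
    T = -2.0 * np.log(p).sum()
    r = len(p)
    return T, chi2.sf(T, df=2 * r)       # combined p-value P(chi2_{2r} >= T)

T, combined_p = fisher_combination([0.08, 0.11, 0.20])   # hypothetical p-values from 3 experiments
print(T, combined_p)
```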
test functions. subject to this restriction. Typically a test statistic is not given but must be chosen on the basis of its perfonnance. by convention. we try to maximize the probability (power) of rejecting H when K is true.2.Oo)]n. Such a test and the corresponding test statistic are called most poweiful (MP). is measured in the NeymanPearson theory. In particular.Otl/(1 ./00)5[(1. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true.00)/00 (1 . This is an instance of testing goodness offit. 0) is the density or frequency function of the random vector X. test statistics.) which is large when S true. and S tends to be large when K : () = 01 > 00 is .1. In Sections 3.Section 4. 00. the highest possible power. 1) distribution. and.3). where eo and e 1 are disjoint subsets of the parameter space 6. 00 ) = 0. OIl = p (x.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. In this section we will consider the problem of finding the level a test that ha<. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1. or equivalently. under H. power.0. power function. type I error. (0. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01.2 and 3. (4.00)ln~5 [0. 0) 0 p(x. We introduce the basic concepts of simple and composite hypotheses. L(x. The statistic L takes on the value 00 when p(x. (I . in the binomial example (4. critical regions. 0. then. we test whether the distribution of X is different from a specified Fo. and pvalue. size.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. that is. significance level. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x. 00 . type II error. For instance. that is. equals 0 when both numerator and denominator vanish. 4. (null) hypothesis H and alternative (hypothesis) K.)1 5 [(1. Summary. I) = EXi is large.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. OIl where p(x. a given test statistic T. p(x. In the NeymanPearson framework.01 )/(1. o:(Td has a U(O. OIl > 0.
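For the binomial case above, a quick numerical check (Python; n, θ0, θ1 are illustrative values with θ1 > θ0) that the likelihood ratio L(x, θ0, θ1) is an increasing function of S = Σ Xi, so that rejecting for large L is the same as rejecting for large S:

```python
import numpy as np

n, theta0, theta1 = 10, 0.3, 0.6                  # any theta1 > theta0 gives the same conclusion
s = np.arange(n + 1)                              # possible values of S
L = (theta1 / theta0) ** s * ((1 - theta1) / (1 - theta0)) ** (n - s)
assert np.all(np.diff(L) > 0)                     # L is strictly increasing in s
print(L)
```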
Note (Section 3.3) 1 : > O. then it must be a level a likelihood ratio test.'P(X)] > O. Such randomized tests are not used in practice.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x. Theorem 4. .'P(X)] a. and because 0 < 'P(x) < 1.) . we have shown that I j I i E. BI )  . then <{Jk is MP in the class oflevel a tests. thns. we choose 'P(x) = 0 if S < 5. i = 0. B .2. o Because L(x. which are tests that may take values in (0. We show that in addition to being Bayes optimal. To this end consider (4. the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads. Eo'P(X) < a. Note that a > 0 implies k < 00 and.k] + E'['Pk(X) . 'Pk is MP for level E'c'PdX).B. 1. using (4.P(S > 5)I!P(S = 5) = .0262 if S = 5.k is < 0 or > 0 according as 'Pk(X) is 0 or 1. BIl o .. where 7l' denotes the prior probability of {Bo}. Finally. 'P(x) = 1 if S > 5.'P(X)] .) For instance. Bo) = O. EO'PdX) We want to show E. Bo) = O} = I + II (say). They are only used to show that with randomization. then ~ . and 'P(x) = [0.Bo. B. i' ~ E. If 0 < 'P(x) < 1 for the observation vector x. we consider randomized tests '1'.3). 0 < <p(x) < 1.2) forB = BoandB = B.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .'P(X)] = Eo['Pk{X) . likelihood ratio tests are unbeatable no matter what the size a is..7l'). (c) If <p is an MP level 0: test.05 in Example 4. where 1= EO{CPk(X)[L(X. that is.Bo. k] .3.1). then I > O.2. Because we want results valid for all possible test sizes Cl' in [0.2. 'Pk(X) = 1 if p(x.4) 1 j 7 .'P(X)] [:~~:::l.3.3 with n = 10 and Bo = 0.05 .['Pk(X) .Bd >k <k with 'Pdx) any value in (0.p is a level a test. lOY some x. there exists k such that (4.2.'P(X)] 1{P(X.kJ}. It follows that II > O. (NeymanPearson Lemma). B .'P(X)] > kEo['PdX) . and suppose r.1) if equality occurs.. (a) If a > 0 and I{Jk is a size a likelihood ratio test.2. (a) Let E i denote E8.('Pk(X) . (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted.1].kEo['Pk(X) .1. (See also Section 1. (4. if want size a ~ . If L(x. Bo.[CPk(X) .'P(X)[L(X.) . Proof.1.
then there exists k < 00 such that Po[L(X.2.OI)' Proof. 0. 0. 'Pk(X) =a .) > then to have equality in (4. then 'Pk is MP size a.Po[L(X..O. Now 'Pk is MP size a.v)=exp 2LX'2 .) > k] Po[L(X.Oo. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level.pk is MP size a. Then the posterior probability of (). Next consider 0 < Q: < 1.2) holds forO = 0.(X. 00 . Because Po[L(X. 2 (7 T(X) ~.. 0.O. OJ. 0.2.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x..u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v. Therefore.. OJ! > k] > If Po[L(X. I x) = (1 1T)p(X. that is.1.3. tben. define a. If not. We found nv2} V n L(X.2. See Problem 4. 00 ) > and 0 = 00 . OJ! > k] < a and Po[T. The same argument works for x E {x : p(x. = 'Pk· Moreover. (c) Let x E {x: p(x. 60 . It follows that (4. X n ) is a sample of n N(j. 00 ) (11T)L(x.Section 4.) + 1Tp(X. is 1T(O. 00 .1.2. k = 0 makes E.L.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. = v. Here is an example illustrating calculation of the most powerful level a test <{Jk.2(b). OJ Corollary 4. for 0 < a < 1. 0.0.fit [ logL(X. If a = 1./f 'P is an MP level a test.) (1 . OJ! = ooJ ~ 0.v) + nv ] 2(72 x .) . 00 . Consider Example 3. Let Pi denote Po" i = 0.2.) = k] = 0.fit. 'P(X) > a with equality iffp(" 60 ) = P(·.2. 00 .. then E9. 0. .I. 0.) (1 .Oo. denotes the Bayes procedure of Exarnple 3.2.. Remark 4. . we conc1udefrom (4.1T)L(x.2. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2. 00 . 00 . 00 . Example 4.7. also be easily argued from this Bayes property of 'Pk (Problem 4.) + 1T' (4.) = k j. 'Pk(X) = 1 and !.2 where X = (XI.1T)p(x.5) thatthis 0.5) If 0.2. 0. 0. Part (a) of the lemma can.) < k.1.k] on the set {x : L(x.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0. 0.2. k = 00 makes 'Pk MP size a.O. where v is a known signal. decides 0. when 1T = k/(k + 1).10). 00 .
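A numerical sketch of the test that rejects when T(X) = √n X̄/σ ≥ z(1 − α) in this Gaussian example (Python with scipy assumed; σ, v, α, and the target power β below are illustrative choices): it computes the critical value, the power at signal v, and the sample size suggested by inverting the power formula, as is done in the discussion that follows.

```python
import numpy as np
from scipy.stats import norm

alpha, beta = 0.05, 0.95          # size and desired power (illustrative)
sigma, v = 1.0, 0.5               # noise level and signal to be detected (illustrative)

z = norm.ppf                      # z(p) = p-th quantile of N(0, 1)
c = z(1 - alpha)                  # critical value of the level-alpha test

def power(n):
    # P_v( sqrt(n) * Xbar / sigma >= c ) = Phi(v * sqrt(n) / sigma - c)
    return norm.cdf(v * np.sqrt(n) / sigma - c)

# smallest n with power >= beta: (sigma / v)^2 * [z(1 - alpha) + z(beta)]^2, rounded up
n = int(np.ceil((sigma / v) ** 2 * (z(1 - alpha) + z(beta)) ** 2))
print(c, n, power(n))             # power(n) is at least beta
```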
this is no longer the case (Problem 4. Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl . We return to this in Volume II.2). I I .90 or . We will discuss the phenomenon further in the next section. itl' However if..2.0. From our discussion there we know that for any specified Ct.. Note that in general the test statistic L depends intrinsically on ito. 0. O)T and Eo = I.1. j = 0. say. T>z(la) (4. Eo are known.9).Jto)E01X large. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say. large. In this example we bave assnmed that (Jo and (J.4./iil (J)). then we solve <1>( z(a) + (v.) test exists and is given by: Reject if (4. By the NeymanPearsoo lemma this is the largest power available with a level Ct test. Thus. . .8). If ~o = (1. if ito.6) has probability of type I error 0:.. The power of this test is.2. (Jj = (Pj.I. • • . however. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other.2. > 0 and E l = Eo. . they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances. then.. for the two popnlations are known." The function F is known as the Fisher discriminant function.2. This is the smallest possible n for any size Ct test. . E j ). 0. ). Example 4. 6. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction. If this is not the case. The following important example illustrates. by (4. But T is the test statistic we proposed in Example 4./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'. that the UMP test phenomenon is largely a feature of onedimensional parameter problems..7) where c ~ z(1 . . E j ).6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O.2.95). itl = ito + )"6. among other things. <I>(z(a) + (v.2.. 0 An interesting feature of the preceding example is that the test defined by (4. Such a test is called uniformly most powerful (UMP). l. the test that rejects if. if Eo #.2.1. Suppose X ~ N(Pj.a)[~6Eo' ~oJ! (Problem 4.226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. It is used in a classification context in which 9 0 .' Rejecting H for L large is equivalent to rejecting for I 1 . The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on . then this test rule is to reject H if Xl is large. and only if. a UMP (for all ).
OkO' Usually the alternative to H is composite. 4.3.'P) for all 0 E 8" for any other level 0:' (4. 1 (}k) distribution with frequency function. then (N" . . N k ) has a multinomial M(n..1. However...2) where . _ (nl."" (}k = Oklo In this case. 0" . . there is sometimes a simple alternative theory K : (}l = (}ll. Example 4.. This phenomenon is not restricted to the Gaussian case as the next example illustrates.2 that UMP tests for onedimensional parameter problems exist. (}l. . . .1. We note the connection of the MP test to the Bayes procedure of Section 3. . . A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0. Testing for a Multinomial Vector.1) test !.p.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary. 'P') > (3(0. the likelihood ratio L 's L~ rr(:. . Two examples in which the MP test does not depend on 0 1 are given. if a match in a genetic breeding experiment can result in k types.. 0 < € < 1 and for some fixed (4. Suppose that (N1 . nk are integers summing to n. n offspring are observed.. 0) P n! nn. where nl. With such data we often want to test a simple hypothesis H : (}I = Ow. Nk) .M( n.. Such tests are said to be UMP (unifonnly most powerful). k are given by Ow. Before we give the example. which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests.2 for deciding between 00 and Ol. ...3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4. Ok = OkO. ..3.. r lUI nt···· nk· ".)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}.u k nn. The NeymanPearson lemma. is established.. .nk.3. .. . Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1. . here is the general definition of UMP: Definition 4.3.Section 4.. For instance. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1. . and N i is the number of offspring of type i.···..
(1) For each t E (0. J Definition 4.(x) any value in (0. < 1 implies that p > E.1) if T(x) = t. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ . = (I . under the alternative. for testing H : 8 < /:/0 versus K:O>/:/I' . In this i. : 8 E e}. : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x..d. Critical values for level a are easily determined because Nl .228 Testmg and Confidence Regions Chapter 4 That is.3) with 6. The family of models {P.. Example 4. This is part of a general phenomena we now describe. = . set s then = l:~ 1 Xi. we have seen three models where. equals the likelihood ratio test 'Ph(t) and is MP. then this family is MLR. e c R.3. X ifT(x) < t ° (4. . B( n. (2) If E'o6.3.3.. Note that because l can be any of the integers 1 .nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . Oil = h(T(x)) for some increasing function h.3.6. for". 6..00). ...2. 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP. Thus. type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H. 0 . Oil is an increasing function ofT(x). there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. e c R. 02)/p(X. then 6.2. . Consider the oneparameter exponential family mode! o p(x.1.3. If 1J(O) is strictly increasing in () E e. Ow) under H. . . we get radically different best tests depending on which Oi we assume to be (ho under H. . Bernoulli case. = P( N 1 < c). = h(x)exp{ry(O)T(x) ~ B(O)}.(X) is increasing in 0. we conclude that the MP lest rejects H. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact. is an MLRfamily in T(x). in the case of a real parameter. 0) ~.2 (Example 4. If {P. Moreover. the power function (3(/:/) = E. Suppose {P.O) = 0'(1 . is an MLRfamily in T(x). then L(x. .1. 0 form with T(x) .2. N 1 < c. Because f.nu)!". .3.o)n. k.o)n[O/(1 .1) MLR in s. Because dt does not depend on ()l. However.(X) = '" > 0.: 0 E e}. Then L = pnN1£N1 = pTl(E/p)N1.OW and the model is by (4. is of the form (4.1 is of this (.2) with 0 < f < I}. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81..3. is UMP level". 00 . where u is known.. Example 4. Example 4. Theorem 4. . :.3 continned). p(x.i.. if and only if.
where b = N(J. where h(a) is the ath quantile of the hypergeometric.o. then the test/hat rejects H if and only ifT(r) > t(1 . Eoo. d t is UMP for H : (J < (Jo versus K : (J > (Jo. 0.3. 0 Example 4.(X). where xn(a) is the ath quantile of the X. we test H : u > Uo l versus K : u < uo.l. if N0 1 = b1 < bo and 0 < x < b1 . For simplicity suppose that bo > n. Corollary 4.bo > n. then by (1).1 by noting that for any (it < (J2' e5 t is MP at level Eo. Because the most serious error is to judge the precision adequate when it is not. . and because dt maximizes the power over this larger class. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa .l.4. . and only if.Section 4.5.Xn . Let S = l:~ I (X. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives. distribution. and we are interested in the precision u. If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. .l of the measurements Xl. N . To show (2). Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4.U 2 ) population.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo. If 0 < 00 .(X) for testing H: 0 = 01 versus J( : 0 = 0. Suppose X!.o. where U O represents the minimum tolerable precision.. 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately.2. . If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. If a is a value taken on by the distribution of X.(bl1) x. N.3. recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. then e}. is an MLRfamily in T(r).l) yields e L( 0 0 )=b. Suppose tha~ as in Example l. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard.Xn is a sample from a N(I1.1. distribution.3. R. where 11 is a known standard. she formulates the hypothesis H as > (Jo. Testing Precision.l. the critical constant 8(0') is u5xn(a). Suppose {Po: 0 E Example 4. is UMP level a.( NOo. For instance.. the alternative K as (J < (Jo. Quality Control. _1')2. Thus. H. 0) = exp {~8 ~ IOg(27r0'2)} .<>... we now show that the test O· with reject H if. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' . Then. n). e c p(x. (l.. (X) < <> and b t is of level 0: for H : (J :S (Jo. X < h(a). (8 < s(a)) = a.
° ° .. In both these cases. The critical values for the hypergeometric distribution are available on statistical calculators and software.2). Thus. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r. In our nonnal example 4.' .:(x:.. this is a general phenomenon in MLR family models with p( x.ntI") for sample size n. For such the probability of falsely accepting H is almost 1 . forO <:1' < b1 1.I..1.(}o. By CoroUary 4.3.1 and formula (4. This continuity of the power shows that not too much significance can be attached to acceptance of H.ntI" = z({3) whose solution is il . .4 because (3(p. H and K are of the fonn H : () < ()o and K : () > ()o. in (O .(b o . Off the indifference region.. 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small. Thus. In our example this means that in addition to the indifference region and level a.O'"O... we would also like large power (3(0) when () E 8 1 .a. we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t.) is increasing. Note that a small signaltonoise ratio ~/ a will require a large sample size n. 1 I . (0.4 we might be uninterested in values of p. ~) for some small ~ > 0 because such improvements are negligible.230 NotethatL(x.Il = (b l . Therefore. this is.. I' . In Example 4. that is. I i (3(t) ~ 'I>(z(a) + .1.1. as seen in Figure 4. That is. the appropriate n is obtained by solving i' . in general.. and the powers are continuous increasing functions with limOlO.. This is a subset of the alternative on which we are willing to tolerate low power. i. However. On the other hand. we want the probability of correctly detecting an alternative K to be large.1.+.x) (N .B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n. This equation is equiValent to = {3 z(a) + . .OIl box (Nn+I)(blx) .1.Oo. {3(0) ~ a.x) <I L(x. not possible for all parameters in the alternative 8. ~) would be our indifference region.n + I) . It follows that 8* is UMP level Q. 0) continuous in 0. ".'. This is not serious in practice if we have an indifference region. if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a. This is possible for arbitrary /3 < 1 only by making the sample size n large enough. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small.. we want guaranteed power as well as an upper bound on the probability of type I error.L.c:O".
They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:. Our discussion uses the classical normal approximation to the binomial distribution. = 0. and 0. to achieve approximate size a. Now let . where (3(0.35.90. As a further example and precursor to Section 5.4.Section 4. and only if. we fleed n ~ (0. if Oi = . Thus. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo. 0 ° Our discussion can be generalized.3 to 0. we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example. when we test the hypothesis that a very large sample comes from a particular distribution.3 continued).90 of detecting the 17% increase in 8 from 0. Again using the nonnal approximation. Such hypotheses are often rejected even though for practical purposes "the fit is good enough.05 binomial test of H : 8 = 0.55)}2 = 162.35. the size . First.80 ) . that is.3(0.7) + 1. (3 = .) = (3 for n and find the approximate solution no+ 1 .4.Oi) [nOo(1 .00 )] 1/2 . Bd. We solve For instance.645 x 0. The power achievable (exactly.5). using the SPLUS package) for the level .282 x 0.1. ifn is very large and/or a is small.35 and n = 163 is 0. In Example 4.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.4.3 requires approximately 163 observations to have probability . This problem arises particularly in goodnessoffit tests (see Example 4." The reason is that n is so large that unimportant small discrepancies are picked up. we can have very great power for alternatives very close to O.4 this would mean rejecting H if.1. There are various ways of dealing with this problem. 00 = 0.86.35(0.05)2{1. 1. test for = .6 (Example 4. we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 . Suppose 8 is a vector. Example 4..2) shows that. far from the hypothesis.05.1.3. Dt. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant.0. > O.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much. we solve (3(00 ) = Po" (S > s) for s using (4.3. (h = 00 + Dt. Formula (4.1.
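A sketch of this kind of sample-size calculation for the binomial test (Python with scipy assumed; θ0, θ1, α, β below are illustrative choices standing in for the hypothesis, the boundary of the indifference region, the size, and the desired power). It combines a standard normal-approximation formula with an exact binomial check of the resulting size and power.

```python
import numpy as np
from scipy.stats import norm, binom

theta0, theta1 = 0.30, 0.35       # hypothesis and boundary of the indifference region (illustrative)
alpha, beta = 0.05, 0.90          # size and desired power at theta1 (illustrative)

z = norm.ppf
# normal-approximation sample size for the one-sided test of H: theta = theta0 vs K: theta > theta0
num = z(1 - alpha) * np.sqrt(theta0 * (1 - theta0)) + z(beta) * np.sqrt(theta1 * (1 - theta1))
n = int(np.ceil((num / (theta1 - theta0)) ** 2))

# exact check: critical value, size, and power of "reject iff S >= s0" at this n
s0 = next(s for s in range(n + 2) if binom.sf(s - 1, n, theta0) <= alpha)
print(n, s0, binom.sf(s0 - 1, n, theta0), binom.sf(s0 - 1, n, theta1))
```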
when testing H : 0 < 00 versus K : B > Bo.Bo). a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0.l' independent of IJ. a E A = {O. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T. is detennined by a onedimensional parameter ).O)<O forB < 00 forB>B o. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0. we illustrate what can happen with a simple example. the critical value for testing H : 0. for 9.(0) > O} and Co (Tn) = C'(') (Tn) for all O. To achieve level a and power at least (3.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis.. 0 E 9. and also increases to 1 for fixed () E 6 1 as n . .= (10 versus K : 0. Thus. . 0) = (B .1) 1(0. I . = {O : ).= 00 is now composite.232 Testing and Confidence Regions Chapter 4 q. The theory we have developed demonstrates that if C. In general. For instance.7. 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function.1. a).. then rejecting for large values of Tn is UMP among all tests based on Tn. J.4) j j 1 .(Tn ) is an MLR family. Then the MLE of 0. We have seen in Example 4.< (10 among all tests depending on <. Example 4.2.l)l(O.2 is (}'2 = ~ E~ 1 (Xi . the distribution of Tn ni:T 2/ (15 is X. = (tm.£ is unknown. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O. However.2 only. 00). The set {O : qo < q(O) < ql} is our indjfference region.. 0 E 9. we may consider l(O.4. Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0).3.9.3. It may also happen that the distribution of Tn as () ranges over 9.+ 00.3.Xf as in Example 2.. Suppose that in the Gaussian model of Example 4. Although H : 0.3. I}. to the F test of the linear model in Section 6.0) > ° I(O. that are not 01.1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q.. . is the a percentile of X~l' It is evident from the argument of Example 4..3 that this test is UMP for H : (1 > (10 versus K : 0. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II. Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn.. This procedure can be applied.(0) = o} aild 9. for instance.(0) so that 9 0 = {O : ). (4. is such that q( Oil = q. Testing Precision Continued.< (10 and rejecting H if Tn is small.
then the class of tests of the form (4.o) < R(O.3.3.3. Proof. Theorem 4. when UMP tests do not exist.(X) = ".1 and. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function. Thus. = R(O.(X)) = 1.I") ~ = EO{I"(X)I(O. locally most powerful (LMP) tests in some cases can be found. then any procedure not in the complete class can be matched or improved at all () by one in the complete class. 0 Summary. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a . In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H.0. Finally.2. Thus. we show that for MLR models. e c R. Suppose {Po: 0 E e}. 1) 1(0.4). (4. (4.o. the risk of any procedure can be matched or improved by an NP test. 0 < " < 1.3. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1.E. We also show how.(x)) .4 CONFIDENCE BOUNDS. The risk function of any test rule 'P is R(O.O)]I"(X)} Let o. E.3. Now 0.o. for some 00 • E"o. 1) + [1 1"(X)]I(O.3. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests).I"(X) 0 for allO then O=(X) clearly satisfies (4. INTERVALS. (4.(Io. For MLR models.O) + [1(0. For () real. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 . 1") for all 0 E e.Section 4.12) and.3.4 Confidence Bounds. 1) 1(O.R(O.5) That is.E.(X) = E"I"(X) > O.5). and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I). if the model is correct and loss function is appropriate. is an MLR family in T( x) and suppose the loss function 1(0. 1") = (1(0.3) with Eo.3. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4.) . O))(E.(I"(X))) < 0 for 8 > 8 0 . it isn't worthwhile to look outside of complete classes.(X) be such that.I"(X) for 0 < 00 .3.5) holds for all 8.E. e 4. hence.(X) > 1. In the following the decision procedures arc test functions.(2) a v R(O. is complete. If E.6) But 1 . Intervals. a) satisfies (4.{1(8.(o. hence.O)} E.
a) such as . in many situations where we want an indication of the accuracy of an estimator. 11 I • J . we settle for a probability at least (1 .(Xj < .fri < ...fri < .a. We say that [j. Here ii(X) is called an upper level (1 ..(X).(X) that satisfies P(.d. we want to find a such that the probability that the interval [X ..95 Or some other desired level of confidence. X + a] contains pis 1 ..a)l..(X) is a lower confidence bound with confidence level 1 . 0.±(X) ~ X ± oz (1. In the N(".8).3.a.i. we may be interested in an upper bound on a parameter.2 ) example this means finding a statistic ii(X) snch that P(ii(X) > . .. is a parameter. and a solution is ii(X) = X + O"z(1 .. In the nonBayesian framework. N (/1. In our example this is achieved by writing By solving the inequality inside the probability for tt.(X) .fri.oz(1 .(X) for" with a prescribed probability (1 .oz(1 . That is. it may not be possible for a bound or interval to achieve exactly probability (1 . We say that . That is.. This gives .a) confidence bound for ".234 Testing and Confidence Regions Chapter 4 member of a specified set 8 0 ..a)l.a).) = 1 . In this case.. we find P(X . and we look for a statistic . if v ~ v(P)....fri. . intervals. X E Rq. Finally.a. is a lower bound with P(.95. . In general.. .a.) = 1 .+(X)] is a level (1 .) = Ia. Similarly. .4 where X I. .! = X .. PEP. we want both lower and upper bounds.. X n ) to establish a lower bound .(X) and solving the inequality inside the probability for p..!a)/.Q with 1 . ( 2 ) with (72 known.1.. • j I where .a) confidence interval for ".a) of being correct. .) and =1 a .a) for a prescribed (1. is a constant.a)I. 1 X n are i. Now we consider the problem of giving confidence bounds. As an illustration consider Example 4. Then we can use the experimental outcome X = (Xl.. We find such an interval by noting .. Suppose that JL represents the mean increase in sleep among patients administered a drug... and X ~ P.Q equal to .a. as in (1. or sets that constrain the parameter with prescribed probability 1 ..
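A minimal computational sketch of the preceding bounds and interval for a normal mean with known sigma (the simulated data below are purely illustrative) might look as follows.

```python
import numpy as np
from scipy.stats import norm

def z_bounds(x, sigma, alpha=0.05):
    """Level (1 - alpha) lower bound, upper bound, and two-sided interval for mu
    when the X_i are N(mu, sigma^2) with sigma known."""
    xbar, n = np.mean(x), len(x)
    se = sigma / np.sqrt(n)
    lower = xbar - norm.ppf(1 - alpha) * se       # lower confidence bound
    upper = xbar + norm.ppf(1 - alpha) * se       # upper confidence bound
    half = norm.ppf(1 - alpha / 2) * se
    return lower, upper, (xbar - half, xbar + half)

rng = np.random.default_rng(0)
print(z_bounds(rng.normal(loc=1.0, scale=2.0, size=50), sigma=2.0))
```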
D(X) is called a level (1 . Moreover.1) = T(I') . has a N(o. P[v(X) = v] > 10. where S 2 = 1 nI L (X. ii( Xl] formed by a pair of statistics v(X).0') < (J .~o) for 1'. 1) distribution and is.4.4. independent of V = (n . (72) population. the confidence level is clearly not unique because any number (I . A statistic v(X) is called a level (1 . P[v(X) < vi > I .1')1".Section 4... for all PEP.1.3. Let X t . In the preceding discussion we used the fact tbat Z (1') = Jii(X .a) is called a confidence level.L) obtained by replacing (7 in Z(J. and assume initially that (72 is known.e.0) will be a confidence level if (I .X) n i=l 2 . Note that in the case of intervals this is just inf{ PI!c(X) < v < v(X).3. Now we tum to the (72 unknown case and propose the pivot T(J. P E PI} (i. by Theorem B.o.. The (Student) t Interval and Bounds.L) by its estimate s.1. For a given bound or interval.O!) lower confidence bound for v if for every PEP. Similarly.o.L.0) or a 100(1 . the random interval [v( X).0) is. lhat is. We conclude from the definition of the (Student) t distribution in Section B. and Regions 235 Definition 4. has aN(O.3. which has a X~l distribution. P[v(X) < v < v(X)] > I . the minimum probability of coverage). v(X) is a level (I . In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. " X n be a sample from a N(J.0)% confidence intervolfor v if.4 Confidence Bounds. For the normal measurement problem we have just discussed the probability of coverage is independent of P and equals the confidence coefficient. we will need the distribution of Now Z(I') = Jii(X 1')/".!o) < Z(I') < z (1. In general. The quantities on the left are called the probabilities of coverage and (1 . finding confidence intervals (or bounds) often involves finding appropriate pivots. In this process Z(I') is called a pivot.l)s2 lIT'. Example 4.! that Z(iL)1 "jVI(n .a) upper confidence bound for v if for every PEP. Intervals. . I) distribu~on to obtain a confidence interval for I' by solving z (1 .
st n_l (1 ~...1 (I . Suppose that X 1.)/ v'n].355s/31 is the desired level 0. .4.fii are natural lower and upper confidence bounds with confidence coefficients (1 . we find P [X .~a)/. X + 3.a.1.4. ... X n are i.7 (see Problem 8. Let tk (p) denote the pth quantile of the P (t n. we use a calculator. (4.."2)) = 1". . N (p" a 2 ).) confidence interval of the type [X .fii and X + stn.355 is . " . By Theorem B .~a)) = 1.nJ = 1" X ± sc/.a)..d.. On the other hand.1) has confidence coefficient close to 1 . if n = 9 and a = 0.236 has the t distribution 7.1 distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. For the usual values of a. t n . By solving the inequality inside the probability for a 2 we find that [(n .) /.3.n < J. See Figure 5. .l) < t n_ 1 (1.st n _ 1 (I .1 )s2/<7 2 has a X. computer software. . for very skew distributions such as the X2 with few degrees of freedom..1 distribution and can be used as a pivot.Xn is a sample from a N(p" a 2 ) population. or Tables I and II.Q). Hence.1) The shortest level (I .005.. then P(X("I) < V(<7 2) < x(l.. . Solving the inequality inside the probability for Jt.l (I .2) f."2)' (n .l)s2/X("I)1 is a confidence interval with confidence coefficient (1 . or very heavytailed distributions such as the Cauchy.~a) < T(J..1.1 (1 . we have assumed that Xl. Up to this point.l < X + st n_ 1 (1.fii.2. (4. the interval will have probability (1 . Confidence Intervals and Bounds for the Vartance of a Normal Distribution. For instance..355 and IX .l)s2/X(1. . If we assume a 2 < 00..n is.7. V( <7 2 ) = (n . .1. i '1 !. • r.l) is fairly close to the Tn .99 confidence intervaL From the results of Section B.. In this case the interval (4. Thus.Q.4.a).a. To calculate the coefficients t n. 0 .~a)/. Then (72. . if we let Xn l(p) denote the pth quantile of the X~1 distribution.!a) and tndl .355s/3... • " ~.4. we enter Table II to find that the probability that a 'TnI variable exceeds 3.1. .1) can be much larger than 1 .. X . Example 4. It turns out that the distribution of the pivot T(J.12).)/.11 whatever be Il and Tj. Testing and Confidence Regions Chapter 4 1 .1 (p) by the standard normal quantile z(p) for n > 120. distribution.l (I .1) in nonGaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. .3. the confidence coefficient of (4.~a) = 3.01. we see that as n + 00 the Tn_l distribution converges in law to the standard normal distribution.3.stn_ 1 (I . and if al + a2 = a.i. we can reasonably replace t n . thus. . The properties of confidence intervals such as (4. X +stn _ 1 (1 Similarly.a) in the limit as n + 00.a) /. ~. .
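The (Student) t interval for mu and the chi-square interval for sigma^2 derived above are easy to compute; the following sketch (function names are ours, not the text's) reproduces, for instance, the multiplier t_8(0.995) = 3.355 used when n = 9 and alpha = 0.01.

```python
import numpy as np
from scipy.stats import t, chi2

def t_interval(x, alpha=0.01):
    """Level (1 - alpha) t interval (4.4.1) for mu."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    half = s * t.ppf(1 - alpha / 2, df=n - 1) / np.sqrt(n)
    return xbar - half, xbar + half

def var_interval(x, alpha=0.05):
    """Level (1 - alpha) interval (4.4.2) for sigma^2, with alpha/2 in each tail."""
    n, s2 = len(x), np.var(x, ddof=1)
    return ((n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1),
            (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1))

print(t.ppf(0.995, df=8))   # approximately 3.355, as in the text
```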
S)jn] {S + k2~ + k~j4} / (n + k~) (4.4.0) . the scope of the method becomes much broader.16 we give an interval with correct limiting coverage probability.l)sjx(Q). = Z (1 . 0 The method of pivots works primarily in problems related to sampling from nonnal populations. In contrast to Example 4. However.Section 44 Confidence Bounds. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution.1. which unifonnly minimizes expected length among all intervals of this type. If we use this function as an "approximate" pivot and let:=:::::: denote "approximate equality. There is no natural "exact" pivot based on X and O. then X is the MLE of (j.0(1. 1) distribution.k' < . g( 0.0)  1Ql] = 12 Q.l)sjx(1 .3) + ka)[S(n .0) ] = Plg(O. . In tenns of S = nX.X) < OJ'" 1where g(O. which typically is unknown.4.4. In Problem 1.Q) and (n . 1 X n are the indicators of n Bernoulli trials with probability of success (j. The pivot V( (T2) similarly yields the respective lower and upper confidence bounds (n ..4) . and Regions 237 The length of this interval is random.3.." we can write P [ Let ka vIn(X .ka)[S(n .1.O)j )0(1 . X) is a quadratic polynomial with two real roots.0) < z (1)0(1 . taking at = (Y2 = ~a is not far from optimal (Tate and Klett. they are(1) O(X) O(X) {S + k2~ . the confidence interval and bounds for (T2 do not have confidence coefficient 1. by the De MoivreLaplace theorem. If we consider "approximate" pivots. 1959). X) < OJ (4.. If X I.a even in the limit as n + 00.S)jn] + k~j4} / (n + k~). Intervals.0) has approximately a N(O.(2X + ~) e+X 2 For fixed 0 < X < I. [0: g(O.X) = (1+ k!) 0' . Example 4.~ a) and observe that this is equivalent to Q P [(X . Approximate Confidence Bounds and Intervalsfor the Probability of Success in n Bernoulli Trials. vIn(X . It may be shown that for n large. There is a unique choice of cq and a2. if we drop the nonnality assumption. We illustrate by an example.4. = ~(X) < 0 < B(X)]' Because the coefficient of (j2 in g(() 1 X) is greater than zero.
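The two roots displayed above can be coded directly; the following sketch computes the resulting approximate level (1 - alpha) interval for theta from S successes in n trials (the values S = 40, n = 100 below are illustrative).

```python
import numpy as np
from scipy.stats import norm

def approx_binomial_interval(S, n, alpha=0.05):
    """Approximate level (1 - alpha) interval for theta, obtained by solving the
    quadratic g(theta, Xbar) <= 0 above, with k = z(1 - alpha/2)."""
    k = norm.ppf(1 - alpha / 2)
    center = S + k ** 2 / 2
    half = k * np.sqrt(S * (n - S) / n + k ** 2 / 4)
    return (center - half) / (n + k ** 2), (center + half) / (n + k ** 2)

print(approx_binomial_interval(S=40, n=100))
```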
This leads to the simple interval 1 I (4.: iI j .0)1 JX(1 .4.~n)2 < S(n . and we can achieve the desired . note that the length.5) is used and it is only good when 8 is near 1/2. .02 ) 2 . For small n. He can then detennine how many customers should be sampled so that (4.4. .6) Thus.4.600. if the smaller of nO.X). ka. Another approximate pivot for this example is yIn(X . [n this case.975. See Problem 4. A discussion is given in Brown. calls willingness to buy success.5. Cai.a) confidence bounds. i o 1 [.4. That is. consider the market researcher whose interest is the proportion B of a population that will buy a product. it is better to use the exact level (1 .238 Testing and Confidence Regions Chapter 4 so that [O(X). 1 .8) is at least 6.a) confidence interval for O.O > 8 1 > ~. = length 0.96) 2 = 9. we choose n so that ka.5) (4.S)ln] Now use the fact that + kzJ4}(n + k~)l (S . O(X)] is an approximate level (1 . = 0. we n .n 1 tn (4.. = T.2a) interval are approximate upper and lower level (1 .. To see this.96.4. He draws a sample of n potential customers.7) See Brown.4.~a) ko 2 kcc i 1.(1.S)ln ~ to conclude that tn . i " . and Das Gupta (2000).4) has length 0. n(l . and Das Gupta (2000) for a discussion. We can similarly show that the endpoints of the level (1 . or n = 9. of the interval is I ~ 2ko { JIS(n .~a = 0. Better results can be obtained if one has upper or lower bounds on 8 such as 8 < 00 < ~.02 and is a confidence interval with confidence coefficient approximately 0.a) procedure developed in Section 4. and uses the preceding model.96 0.16.4. This fonnula for the sample size is very crude because (4. to bound I above by 1 0 choose . (1 .02 by choosing n so that Z = n~ ( 1.95. These inlervals and bounds are satisfactory in practice for the usual levels. Note that in this example we can detennine the sample size needed for desired accuracy. For instance.601.02.(n + k~)~ = 10 . Cai. say I.
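The crude sample-size formula used in the market-research illustration can be checked numerically; the sketch below (with the same illustrative choices alpha = 0.05 and desired length 0.02) recovers n = 9604.

```python
from math import ceil
from scipy.stats import norm

def sample_size_for_length(length, alpha=0.05):
    """Crude n making the approximate binomial interval no longer than `length`,
    using the worst case theta = 1/2."""
    return ceil((norm.ppf(1 - alpha / 2) / length) ** 2)

print(sample_size_for_length(0.02))   # 9604
```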
By using 2nX /0 as a pivot we find the (1 . then q(C(X)) ~ {q(O) .1..a.. By Problem B. (T c' Tc ) are independent. this technique is typically wasteful.F(x) = exp{ x/O}. Then the rdimensional random rectangle I(X) = {q(O). X~n' distribution. .(X).). qc (0)). we will later give confidence regions C(X) for pairs 8 = (0 1 . . qc( 0)) is at least (I .1 ).. £(8. Let 0 and and upper boundaries of this interval. if q( B) = 01 .(O) < ii. . and Regions 239 Confidence Regions for Functions of Parameters We can define confidence regions for a function q(O) as random subsets of the range of q that cover the true value of q(O) with probability at least (1 .. I is a level (I .0').a) confidence interval 2nX/x (I .8) . Intervals.(X) < q.X n denote the number of hours a sample of internet subscribers spend per week on the Internet.. and suppose we want a confidence interval for the population proportion P(X ? x) of subscribers that spend at least x hours per week on the Internet. then the rectangle I( X) Ic(X) has level = h (X) x . For instance. . we can find confidence regions for q( 0) entirely contained in q( C(X)) with confidence level (1 ..0') confidence region for q(O).a Note that if I. That is.~a) < 0 <2nX/x (~a) where x(j3) denotes the 13th quantile of the X~n distribution. Here q(O) = 1 . j==1 (4. Suppose X I.. 0 E C(X)} is a level (1. x IT(1a. q(C(X)) is larger than the confidence set obtained by focusing on B1 alone. . X n is modeled as a sample from an exponential... j ~ I.a). Let Xl. Suppose q (X) and ii) (X) are realJ valued.3.a).. . Note that ifC(X) is a level (1 ~ 0') confidence region for 0. In this case. ~. . if the probability that it covers the unknown but fixed true (ql (0).4. Example 4. . 2nX/0 has a chi~square.4. ii.r} is said to be a level (1 .) confidence interval for qj (0) and if the ) pairs (T l ' 1'..4. .). (X) ~ [q... then exp{ x/O} edenote the lower < q(O) < exp{ x/i)} is a confidence interval for q( 0) with confidence coefficient (1 . ( 2 ) T.4.. . If q is not 1 . distribution.a).Section 44 Confidence Bounds. (0). We write this as P[q(O) E I(X)I > 1. . Confidence Regions of Higher Dimension We can extend the notion of a confidence interval for onedimensional functions q( B) to rdimensional vectors q( 0) = (q.a) confidence region. .
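A short computational sketch of the exponential example above, using the pivot 2nX-bar/theta ~ chi-square(2n); the simulated "hours per week" data are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

def exp_mean_interval(x, alpha=0.05):
    """Level (1 - alpha) interval for the exponential mean theta."""
    n, total = len(x), np.sum(x)
    return (2 * total / chi2.ppf(1 - alpha / 2, df=2 * n),
            2 * total / chi2.ppf(alpha / 2, df=2 * n))

def tail_prob_interval(x, x0, alpha=0.05):
    """Interval for q(theta) = P(X >= x0) = exp(-x0/theta), obtained by
    transforming the endpoints (q is increasing in theta)."""
    lo, hi = exp_mean_interval(x, alpha)
    return np.exp(-x0 / lo), np.exp(-x0 / hi)

rng = np.random.default_rng(1)
sample = rng.exponential(scale=10.0, size=20)
print(exp_mean_interval(sample), tail_prob_interval(sample, x0=15.0))
```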
an rdimensional confidence rectangle is in this case automaticalll obtained from the onedimensional intervals. if we choose 0' j = 1 . Suppose Xl> .1) the distribution of .7).1 h(X) = X ± stn_l (1..~a).. then I(X) has confidence level (1 . ~ ~'f Dn(F) = sup IF(t) . j=1 j=1 Thus.40) 2.0:. if we choose a J = air.a.2). An approach that works even if the I j are not independent is to use Bonferroni's inequality (A. c c P[q(O) E I(X)] > 1.i.a).4. Confidence Rectangle for the Parameters ofa Normal Distribution. consists of the interval i J C(x)(t) = (max{O.4.4.6.5). Suppose Xl. We assume that F is continuous. From Example 4.a) r. . Then by solving Dn(F) < do for F. From Example 4. then leX) has confidence level 1 . r. 1 X n is a N (M. j = 1. for each t E R.Laj.(1 .a) confidence rectangle for (/" .2.. . That is.2... . Example 4. tl continuous in t. in which case (Proposition 4. According to this inequality.d. F(t) + do})· We have shown ~ ~ I o P(C(X)(t) :) F(t) for all t E R) for all P E 'P =1.. as X rv P.~a)/rn !a).0 confidence region C(x)(·) is the confidence band which. See Problem 4. we find that a simultaneous in t size 1. Thus. X n are i.4.F(t)1 tEn i " does not depend on F and is known (Example 4.1. F(t) .. 0 The method of pivots can also be applied to oodimensional parameters such as F. D n (F) is a pivot.5. Let dO' be chosen such that PF(Dn(F) < do) = 1 . Moreover. min{l. ( 2 ) sample and we want a confidence rectangle for (Ii. .1. v(P) = F(·).LP[qj(O) '" Ij(X)1 > 1.. and we are interested in the distribution function F(t) = P(X < t).do}. (7 2).15. that is. .4. .a = set of P with P( 00. h (X) X 12 (X) is a level (1 .ia)' Xnl (lo) is a reasonable confidence interval for (72 with confidence coefficient (1. Example 4. .240 Testing and Confidence Regions Chapter 4 Thus. It is possible to show that the exact confidence coefficient is (1 . is a confidence interval for J1 with confidence coefficient (1  I (X) _ [(nl)S2 (nl)S2] 2 Xnl (1 .
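As a computational stand-in for the exact critical value d_alpha of the pivot D_n(F), one may use the Dvoretzky-Kiefer-Wolfowitz bound d_alpha = sqrt(log(2/alpha)/(2n)), which gives a conservative simultaneous band of the same form as C(x)(t) above. The sketch below uses this substitute rather than the exact quantile.

```python
import numpy as np

def dkw_band(x, alpha=0.05):
    """Conservative simultaneous level (1 - alpha) band for F, evaluated at the
    order statistics, using the DKW inequality in place of the exact d_alpha."""
    x = np.sort(np.asarray(x))
    n = len(x)
    d = np.sqrt(np.log(2 / alpha) / (2 * n))
    Fhat = np.arange(1, n + 1) / n                 # empirical cdf at the order statistics
    return x, np.clip(Fhat - d, 0.0, 1.0), np.clip(Fhat + d, 0.0, 1.0)

rng = np.random.default_rng(2)
pts, lower, upper = dkw_band(rng.normal(size=100))
print(lower[:3], upper[:3])
```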
4.!(F)   ~ oosee Problem 4. By integration by parts.! ~ /L(F) = f o tf(t)dt = exists. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining . We derive the (Student) t interval for I' in the N(Il. 0 Summary.!(F) : F E C(X)) ~ J. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses.4.0: for all E e. a level 1 . X n are i. We begin by illustrating the duality in the following example.4.4. 1992) where such bounds are discussed and shown to be asymptotically strictly conservative. and more generally confidence regions.! ~ inf{J. .Section 4. Acceptance regions of statistical tests are." for all PEP. 1WoSided Tests for the Mean of a Normal Distribution.0: confidence region for a parameter q(B) is a set C(x) depending only on the data x such that the probability under Pe that C(X) covers q( 8) is at least 1 .5 to give confidence regions for scalar or vector parameters in nonparametric models.5.!(F+) and sUP{J.18) arise in accounting practice (see Bickel.9)   because for C(X) as in Example 4. In a parametric model {Pe : BEe}.4. Suppose Xl. Example 4.6. Then a (1 . a 2 ) model with a 2 unknown.0'.0: when H is true. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function F( t) and the mean of a positive variable X.6.!(F) : F E C(X)} = J. For a nonparametric class P ~ {P} and parameter v ~ v(P). given by (4.1. we similarly require P(C(X) :J v) > 1. subsets of the sample space with probability of accepting H at least 1 . as X and that X has a density f( t) = F' (t).i. confidence intervals. then Let F(t) and F+(t) be the lower and upper simultaneous confidence boundaries of Example 4.4. and we derive an exact confidence interval for the binomial parameter.0') lower confidence bound for /1. . A Lower Confidence Boundfor the Mean of a Nonnegative Random Variable.4.4 and 4. We define lower and upper confidence bounds (LCBs and DCBs).19. Example 4..4. for a given hypothesis H. J. if J..d.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 . o 4. is /1. Suppose that an established theory postulates the value Po for a certain physical constant. which is zero for t < 0 and nonzero for t > O.5 The Duality Between Confidence Regions and Tests 241 We can apply the notions studied in Examples 4.7. Intervals for the case F supported on an interval (see Problem 4.
flO. Consider the general framework where the random vector X takes values in the sample space X C Rq and X has distribution PEP.2) takes values in N = (00 .a)s/vn > /l.a). then it is reasonable to formulate the problem as that of testing H : fl = Jlo versus K : fl i. t n .Q) confidence interval (4. Xl!_ Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean {L and variance a 2 . consider v(P) = F.. O{X.00). /l) ~ O. For a function space example.4.fLo)/ s.. /l)} where lifvnIX. • j n . This test is called twosided because it rejects for both large and small values of the statistic T.l (1.. all PEP.4. in fact.I n _ 1 (1. i:.5'[) If we let T = vn(X . 1. o These are examples of a general phenomenon.~1 >tn_l(l~a) ootherwise. Evidently.. by starting with the test (4. where F is the distribution function of Xi' Here an example of N is the class of all continuous distribution functions.4.5.2) we obtain the confidence interval (4.. generated a family of level a tests {J(X.l other than flo is a possible alternative.2.1) is used for every flo we see that we have. . (4.4. it has power against parameter values on either side of flo. then our test accepts H.2 = .5.4. For instance.~Ct).00). = 1.2(p) takes values in N = (0. as in Example 4.. . Because the same interval (4.~a).2) These tests correspond to different hypotheses.1 (1. Let v = v{P) be a parameter that takes values in the set N. We achieve a similar effect. /l = /l(P) takes values in N = (00.a)s/ vn and define J'(X. /l) to equal 1 if. .~a)] = 0 the test is equivalently characterized by rejecting H when ITI > tnl (1 . if we start out with (say) the level (1 . if and only if. if and only if.1 (1 .1 (1 .1) by finding the set of /l where J(X. We accept H. in Example 4. (4. Let S = S{X) be a map from X to subsets of N. (/" .242 Testing and Confidence Regions Chapter 4 measurements Xl .4. Because p"IITI = I n.5..1) we constructed for Jl as follows. and in Example 4.5. the postulated value JLo is a member of the level (1 0' ) confidence interval [X  Slnl (1 ~a) /vn.a.a) LCB X . in Example 4. and only if. generating a family of level a tests.00). X + Slnl (1  ~a) /vn]. X .5. that is P[v E S(X)] > 1 .1. 00) X (0. Jlo) being of size a only for the hypothesis H : Jl = flo· Conversely.a.I n .6. then S is a (I . If any value of J... In contrast to the tests of Example 4. We can base a size Q' test on the level (1 .a) confidence region for v if the probability that S(X) contains v is at least (1 .!a) < T < I n.
H may be rejected. then [QUI' oU2] is confidence interval for () with confidence coefficient 1 .0" quantile of F oo • By the duality theorem.Ct confidence region for v. For some specified Va. < to(la).aj. is a level 0 test for HI/o.Ct) is the 1. acceptance regions. va) with level 0". then ()u(T) is a lower confidence boundfor () with confidence coefficient 1 .1. Suppose X ~ Po where (Po: 0 E ej is MLR in T = T(X) and suppose that the distribution function Fo(t) ofT under Po is continuous in each of the variables t and 0 when the other is fixed.Ct. 0 We next give connections between confidence bounds. By Corollary 4. and pvalues for MLR families: Let t denote the observed value t = T(x) ofT(X) for the datum x. E is an upper confidence bound for 0 with any solution e. if 0"1 + 0"2 < 1. 00). this is a random set contained in N with probapility at least 1 .(01 + 0"2).(t) in e.0". then PIX E A(vo)) > 1  a for all P E P vo if and only if S (X) is a 1 .a)) where to o (1 . for other specified Va.1. We next apply the duality theorem to MLR families: Theorem 4. By Theorem 4.vo) = OJ is a subset of X with probability at least 1 . If the equation Fo(t) = 1 .G iff 0 > Iia(t) and 8(t) = Ilia. Duality Theorem. It follows that Fe (t) < 1 . The proofs for the npper confid~nce hound and interval follow by the same type of argument. the power function Po(T > t) = 1 . Suppose we have a test 15(X.Ct). We have the following. then the test that accepts HI/o if and only if Va is in S(X). Moreover. H may be accepted. eo e Proof.(T) of Fo(T) = a with coefficient (1 . Consider the set of Va for which HI/o is accepted. let .a has a solution O.5. Formally.0" of containing the true value of v(P) whatever be P.1.0" confidence region for v. X E A(vo)). the acceptance region of the UMP size a test of H : 0 = 00 versus K : > ()a can be written e A(Oo) = (x: T(x) < too (1.3. Let S(X) = (va EN. Similarly.Section 4. if S(X) is a level 1 .5 The Duality Between Confidence Regions and Tests 243 Next consider the testing framework where we test the hypothesis H = Hvo : v = Va for some specified value va.3. Then the acceptance regIOn A(vo) ~ (x: J(x. if 8(t) = (O E e: t < to(1 . Conversely.Fo(t) for a test with critical constant t is increasing in (). That i~.a)}. let Pvo = (P : v(P) ~ Vo : va E V}. By applying F o to hoth sides oft we find 8(t) = (O E e: Fo(t) < 1. then 8(T) is a Ia confidence region forO. Fe (t) is decreasing in ().
We illustrate these ideas using the example of testing H : fL = J. then C = {(t. .v) : a(t. where S = Ef I Xi_ To analyze the structure of the region we need to examine k(fJ.I}.B) = (oo.1 244 Testing and Confidence Regions Chapter 4 1 o:(t.a) test of H.a) is nondecreasing in B. and an acceptance set A" (po) for this example. (ii) k(B. a) denote the critical constant of a level (1. The pvalue is {t: a(t.2.Fo(t) is increasing in O.3. v will be accepted.1. . The corresponding level (1 . B) plane. • . Then the set .. vol denote the pvalue of a test6(T.1 shows the set C. 1 . a). • a(t.Fo(t) is decreasing in t.a)] > a} = [~(t). Let T = X. I c= {(t. _1 X n be the indicators of n binomial trials with probability of success 6. I X n are U. Figure 4. Let k(Bo.1 that 1 . vo) = 1 IT > cJ of H : v ~ Vo based on a statistic T = T(X) with observed value t = T(x). B) ~ poeT > t) = 1 . 00 ) denote the pvalue for the UMP size Q' test of H : () let = eo versus K : () > eo. a confidence region S( to). Under the conditions a/Theorem 4.1. . The result follows. v) =O} gives the pairs (t. N (IL. We have seen in the proof of Theorem 4.a)~k(Bo. We call C the set of compatible (t.p): It pI < <7Z (1.4. .10 when X I . E A(B)). v) <a} = {(t.5.B) > a} {B: a(t.d.oo).to(l. : : !.a)ifBiBo.3. and A"(B) = T(A(B)) = {T(x) : x Corollary 4.~) upper and lower confidence bounds and confidence intervals for B. A"(B) Set) Proof. .5. In the (t.2 known.a) confidence region is given by C(X t. . In general. Let Xl.Fo(t).. We shall use some of the results derived in Example 4.v) : J(t.1). vertical sections of C are the confidence regions B( t) whereas borizontal sections are the acceptance regions A" (v) = {t : J( t.1). '1 Example 4. v) = O}. we seek reasonable exact level (1 . for the given t. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. cr 2 ) with 0. let a( t. B) pnints.~a)/ v'n}.. a) .Xn ) = {B : S < k(B.1 I. For a E (0. Because D Fe{t) is a distribution function.80 E (0.5.v) where. To find a lower confidence bound for B our preceding discussion leads us to consider level a tests for H : 6 < 00.1. . We claim that (i) k(B. and for the given v. t is in the acceptance region..
1 (i) that PetS > j] is nondecreasing in () for fixed j. if 0 > 00 .5. Therefore. From (i).[S > k(02. Clearly. Po. whereas A'" (110) is the acceptance region for Hp. S(to) is a confidence interval for 11 for a given value to of T. To prove (i) note that it was shown in Theorem 4. Therefore.L = {to in the normal model. Q) as 0 tOo. if we define a contradiction. On the other hand.[S > j] < Q for all 0 < 00 O(S) = inf{O: k(O.[S > k(02. I] if S ifS >0 =0 . Po. a) increases by exactly 1 at its points of discontinuity. [S > j] < Q. The claims (iii) and (iv) are left as exercises. P. The shaded region is the compatibility set C for the twosided test of Hp.o : J. If fJo is a discontinuity point 0 k(O. (ii).I] [0.5 The Duality Between Confidence Regions and Tests 245 Figure 4. (iv) k(O.1. [S > j] > Q.Q) ~ I andk(I. and.Q) = n+ 1. [S > j] = Q and j = k(Oo. then C(X) ={ (O(S).Section 4. Q) would imply that Q > Po. let j be the limit of k(O. Q). e < e and 1 2 k(O" Q) > k(02.Q) I] > Po. The assertion (ii) is a consequence of the following remarks. it is also nonincreasing in j for fixed e. (iii).o' (iii) k(fJ.[S > k(O"Q) I] > Q. and (iv) we see that. hence.Q) = S + I}.3.Q)] > Po. Then P. Q).
and $\underline{\theta}(S)$ is the desired level $(1-\alpha)$ LCB for $\theta$.(2) When $S > 0$, $\underline{\theta}(S)$ is the unique solution of
$$\sum_{r=S}^{n} \binom{n}{r} \theta^{r} (1-\theta)^{n-r} = \alpha;$$
when $S = 0$, $\underline{\theta}(S) = 0$. Similarly, starting from the level $\alpha$ tests of $H : \theta \ge \theta_0$, which reject for small values of $S$, we define
$$\bar{\theta}(S) = \sup\{\theta : j(\theta, \alpha) = S - 1\},$$
where $j(\theta, \alpha)$ is the critical constant of the corresponding lower-tailed test. Then $\bar{\theta}(S)$ is a level $(1-\alpha)$ UCB for $\theta$, and when $S < n$ it is the unique solution of
$$\sum_{r=0}^{S} \binom{n}{r} \theta^{r} (1-\theta)^{n-r} = \alpha;$$
when $S = n$, $\bar{\theta}(S) = 1$. Putting the bounds $\underline{\theta}(S)$, $\bar{\theta}(S)$ together we get the confidence interval $[\underline{\theta}(S), \bar{\theta}(S)]$ of level $(1-2\alpha)$. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if $n$ is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3. Figure 4.5.2 portrays the situation.

[Figure 4.5.2. Plot of $k(\theta, 0.16)$ for $n = 2$.]
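One convenient way to solve the two binomial tail equations above is through the standard beta-quantile representation of their solutions; the sketch below (function names are ours) returns the exact bounds and the level (1 - 2alpha)-type interval with alpha/2 in each tail.

```python
from scipy.stats import beta

def exact_binomial_bounds(S, n, alpha=0.05):
    """Level (1 - alpha) lower and upper confidence bounds for theta from S
    successes in n trials, via the beta-quantile form of the tail equations."""
    lower = 0.0 if S == 0 else beta.ppf(alpha, S, n - S + 1)
    upper = 1.0 if S == n else beta.ppf(1 - alpha, S + 1, n - S)
    return lower, upper

def exact_binomial_interval(S, n, alpha=0.05):
    """Level (1 - alpha) interval obtained by combining two level (1 - alpha/2) bounds."""
    return exact_binomial_bounds(S, n, alpha / 2)

print(exact_binomial_interval(S=3, n=10))
```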
For this threedecision rule.4 we considered the level (1 .3) 3. Decide I' < 1'0 ifT < z(l. Because we do not know whether A or B is to be preferred. it is natural to carry the comparison of A and B further by asking whether 8 < 0 or B > O. we usually want to know whether H : () > B or H : B < 80 .3) we obtain the following three decision rule based on T = J1i(X  1'0)/": Do not reject H : /' ~ I'c if ITI < z(1 .B . we decide whether this is because () is smaller or larger than Bo. the twosided test can be regarded as the first step in the decision procedure where if H is not rejected. . Therefore.4.2. by using this kind of procedure in a comparison or selection problem. Decide I' > I'c ifT > z(1 . N(/l.~a).. If J(x. the probability of the wrong decision is at most ~Q. 2. We can use the twosided tests and confidence intervals introduced in later chapters in similar fashions.!a).~Q). Example 4. Using this interval and (4. 0. If H is rejected. twosided tests seem incomplete in the sense that if H : B = B is rejected in favor of o H : () i.Section 4. when !t < !to. Suppose Xl.Q) confidence interval X ± uz(1 . o o Decide B < 80 if! is entirely to the left of B . o o For instance.O. vo) is a level 0' test of H .. o (4. v = va. Then the wrong decision "'1 < !to" is made when T < z(1 . then the set S(x) of Vo where . we test H : B = 0 versus K : 8 i. we can control the probabilities of a wrong selection by setting the 0' of the parent test or confidence interval. then we select A as the better treatment.5.~Q) / . However.5. I X n are i.8 < B • or B > 80 is an example of a threeo decision problem and is a special case of the decision problems in Section 1.5.i. and vice versa.0') confidence interval!: 1. We explore the connection between tests of statistical hypotheses and confidence regions. To see this consider first the case 8 > 00 . In Section 4.13. and o Decide 8 > 80 if I is entirely to the right of B .. Make no judgment as 1O whether 0 < 80 or 8 > B if I contains B . This event has probability Similarly. Summary. A and B. The problem of deciding whether B = 80 . suppose B is the expected difference in blood pressure when two treatments. the probability of falsely claiming significance of either 8 < 80 or 0 > 80 is bounded above by ~Q. Thus. If we decide B < 0. we make no claims of significance. Here we consider the simple solution suggested by the level (1 .2 ) with u 2 known. but if H is rejected.d.Jii for !t. and 3.3.!a). are given to high blood pressure patients.5 The Duality Between Confidence Regions and Tests 247 Applications of Confidence Intervals to Comparisons and Selections We have seen that confidence intervals lead naturally to twosided tests.
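A minimal sketch of the three-decision rule above for the normal model with known sigma (the numerical inputs are illustrative); each wrong directional claim has probability at most alpha/2.

```python
from scipy.stats import norm

def three_decision(xbar, n, mu0, sigma, alpha=0.05):
    """Three-decision rule based on T = sqrt(n)(xbar - mu0)/sigma and z(1 - alpha/2)."""
    T = n ** 0.5 * (xbar - mu0) / sigma
    c = norm.ppf(1 - alpha / 2)
    if T > c:
        return "decide mu > mu0"
    if T < -c:
        return "decide mu < mu0"
    return "make no directional claim"

print(three_decision(xbar=0.9, n=25, mu0=0.0, sigma=2.0))
```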
for any fixed B and all B' < B.1) for all competitors are called uniformly most accurate as are upper confidence bounds satisfying (4. I' i' . Note that 0* is a unifonnly most accurate level (1 . (X) = X . which reveals that (4. The dual lower confidence bound is 1'. . where X(1) < X(2) < .2 and 4. P. 0"2) random variables with (72 known. (4. then the lest that accepts H .248 Testing and Confidence Regions Chapter 4 0("'. (4.1) Similarly.Q).1.1) is nothing more than a comparison of (X)wer functions.6. Formally. where 80 is a specified value. B(n.IB' (X) I • • < B'] < P.Xn ) is a sample of a N (/L.2) Lower confidence bounds e* satisfying (4. in fact..0) confidence region for va. We also give a connection between confidence intervals.1 continued).6.Xn and k is defined by P(S > k) = 1 . and only if.2. and the threedecision problem of deciding whether a parameter 1 l e is eo. random variable S. Which lower bound is more accurate? It does tum out that . Suppose X = (X" . If (J and (J* are two competing level (1 .Q)er/. 4..z(1 . unifonnly most accurate in the N(/L.. we find that a competing lower confidence bound is 1'2(X) = X(k). they are both very likely to fall below the true (J.[B(X) < B']. I' > 1'0 rejects H when .6. Ii n II r . If S(x) is a level (1 . va) ~ 0 is a level (1 .if. . 0 I i. twosided tests.n(X ..2) for all competitors. or larger than eo.6. Using Problem 4.~'(X) < B'] < P. v = Va when lIa E 8(:1') is a level 0: test.Ilo)/er > z(1 .. optimality of the tests translates into accuracy of the bounds.Q) LCB B if..6.Q for a binomial.1 (Examples 3. . we say that the bound with the smaller probability of being far below () is more accurate. A level (1. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution.Q) con· fidence region for v.5... for X E X C Rq. P. This is a consequence of the following theorem.a) LCB 0* of (J is said to be more accurate than a competing level (1 .[B(X) < B']. But we also want the bounds to be close to (J. (72) model. A level a test of H : /L = /La vs K . j i I . Thus. the following is true. e.u2(X) and is.I (X) is /L more accurate than .a) LCB for if and only if 0* is a unifonnly most accurate level (1 .0') lower confidence bounds for (J. for 0') UCB e is more accurate than a competitor e . .6. less than 80 . We next show that for a certain notion of accuracy of confidence bounds.Q) UCB for B.6.1 t! Example 4. .' .n. which is connected to the power of the associated onesided tests.6. and only if. Definition 4. ~). < X(n) denotes the ordered Xl.3.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account.. a level (1 any fixed B and all B' > B.
Suppose ~'(X) is UMA level (1 . if a > 0. Also. [O(X) > 001 < Pe.2.Because O'(X. for any other level (1 .•.6.z(1 a)a/ JTi is uniformly most accurate. However. 0 If we apply the result and Example 4. a real parameter. Pe[f < q(O')1 < Pe[q < q(O')] whenever q((}') < q((}). Let XI. Proof Let 0 be a competing level (1 .t o .a). they have the smallest expected "distance" to 0: Corollary 4.Oo)) < Eo. We can extend the notion of accuracy to confidence bounds for realvalued functions of an arbitrary parameter.1 to Example 4. We define q* to be a uniformly most accurate level (1 .(o(X.7 for the proof).6.0') LeB for (J. and only if. Uniformly most accurate (UMA) bounds turn out to have related nice properties. and 0 otherwise. and only if. such that for each (Jo the associated test whose critical function o*(x.a) lower confidence boundfor O. Let f)* be a level (1 .5.5 favor X{k) (see Example 3. [O'(X) > 001.1.Section 4. . (O'(X.2 and the result follows.e>. (Jo) is given by o'(x.4. X(k) does have the advantage that we don't have to know a or even the shape of the density f of Xi to apply it.Oo) 1 ifO'(x) > 00 ootherwise is UMP level a for H : (J = eo versus K : (J > 00 . then for all 0 where a+ = a. Example 4.1. 1 X n be the times to failure of n pieces of equipment where we assume that the Xi are independent £(A) variables. Most accurate upper confidence bounds are defined similarly.a) LeB 00. 00 ) is a level a test for H : 0 = 00 versus K : 0 > 00 .a) upper confidence bound q* for q( A) = 1 .a) lower confidence bound. . the probability of early failure of a piece of equipment. We want a unifonnly most accurate level (1 . 00 ) ~ 0 if.6 Uniformly Most Accurate Confidence Bounds 249 Theorem 4.a) Lea for q( 0) if. Boundsforthe Probability ofEarly Failure ofEquipment.2).6. .a) LCB q.2. 00 )) or Pe. for e 1 > (Jo we must have Ee. O(x) < 00 . we find that j. Let O(X) be any other (1 .6.1. Defined 0(x. 00 ) is UMP level Q' for H : (J = eo versus K : 0 > 00 . Then O(X. Then!l* is uniformly most accurate at level (1 . For instance (see Problem 4.6. 00 ) by o(x. the robustness considerations of Section 3. Identify 00 with {}/ and 01 with (J in the statement of Definition 4.
a) UCB'\* for A.a). 1962. the interval must be at least as likely to cover the true value of q( B) as any other value. in general. * is by Theorem 4.1. The situation wiili confidence intervals is more complicated.a) intervals iliat has minimum expected length for all B. *) is a uniformly most accurate level (1 .a) UCB for A and. Summary. Confidence intervals obtained from twosided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. we can restrict attention to certain reasonable subclasses of level (1 ..a) intervals for which members with uniformly smallest expected length exist.1 have this property. the lengili t . If we turn to the expected length Ee(t . the confidence interval be as short as possible.T. it follows that q( >.a) is the (1. because q is strictly increasing in A. l . some large sample results in iliis direction (see Wilks. however. ~ q(B) ~ t] ~ Pe[T.a) unbiased confidence intervals. Considerations of accuracy lead us to ask that.a) lower confidence bounds to fall below any value ()' below the true B. ~ q(()') ~ t] for every (). Thus.a) confidence intervals that have uniformly minimum expected lengili among all level (1. 374376). Neyman defines unbiased confidence intervals of level (1 . there does not exist a member of the class of level (1 . is random and it can be shown that in most situations there is no confidence interval of level (1 . Therefore.6. in the case of lower bounds. as in the estimation problem. By using the duality between onesided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level (1 .) as a measure of precision.a) quantile of the X§n distribution. B'.T.. In particular. there exist level (1 .250 Testing and Confidence Regions Chapter 4 We begin by finding a uniformly most accurate level (1... However.a) UCB for the probability of early failure.a) in the sense that. subject to ilie requirement that the confidence level is (1 . To find'\* we invert the family of UMP level a tests of H : A ~ AO versus K : A < AO.a) iliat has uniformly minimum length among all such intervals. a uniformly most accurate level (1 .6. pp. There are.3) or equivalently if A X2n(1a) o < 2"'~ 1 X 2 L_n= where X2n(1. Of course.8.a)/ 2Ao i=1 n (4. These topics are discussed in Lehmann (1997). the UMP test accepts H if LXi < X2n(1. they are less likely ilian oilier level (1 .\ *) where>. the intervals developed in Example 4.6.5. . By Problem 4. the situation is still unsatisfactory because. Pratt (1961) showed that in many of the classical problems of estimation . 0 Discussion We have only considered confidence bounds. That is.a) by the property that Pe[T. the confidence region corresponding to this test is (0.
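The inversion above gives the uniformly most accurate upper bound lambda* = chi-square_{2n}(1 - alpha)/(2 * sum x_i), and hence a bound for the early-failure probability. A short sketch (the failure times and the cutoff t0 below are hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def exp_rate_ucb(x, alpha=0.05):
    """UMA level (1 - alpha) upper confidence bound for lambda from an E(lambda) sample."""
    n = len(x)
    return chi2.ppf(1 - alpha, df=2 * n) / (2 * np.sum(x))

def early_failure_ucb(x, t0, alpha=0.05):
    """UCB for q(lambda) = 1 - exp(-lambda * t0); because q is increasing in lambda,
    q(lambda*) inherits the (1 - alpha) coverage."""
    return 1 - np.exp(-exp_rate_ucb(x, alpha) * t0)

failure_times = np.array([12.0, 30.5, 7.1, 22.3, 15.8])   # hypothetical lifetimes
print(exp_rate_ucb(failure_times), early_failure_ucb(failure_times, t0=5.0))
```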
We next give such an example.12.a)% confidence interval is that if we repeated an experiment indefinitely each time computing a 100(1 .a)% of the intervals would contain the true unknown parameter value. (j].7. Instead. from Example 1. Suppose that.a) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 .Section 4. a~). it is natural to consider the collec tion of that is "most likely" under the distribution II(alx). :s: alx) ~ 1 . no probability statement can be attached to this interval.1. and {j are level (1 . Suppose that given fL. with known..a lower and upper credible bounds for fL are  fL = fLB  ao Zla Vn (1 + .a)% confidence interval.2 and 1.7.a) by the posterior distribution of the parameter given the data. then 100(1 . with fLo and 75 known. Xl. with ~ a6) a6 fLB = ~ nx+l/1 ~t""o n ~ + I ~ ~2 . the posterior distribution of fL given Xl. the interpretation of a 100(1".Xn are i. . then Ck = {a: 7r(lx) . a Definition 4. II(a:s: t9lx) .a. 75). what are called level (1 '. Let 7r('lx) denote the density of agiven X = x. a E e c R.i.aB = n ~ + 1 I ~ It follows that the level 1 .1. and that fL rv N(fLo.a) credible region for e if II( C k Ix) ~ 1  a . .7 Frequentist and Bayesian Formulations 251 4.7 FREQUENTIST AND BAYESIAN FORMULATIONS We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E X c Rq are random while the parameters are fixed but unknown. then fl.A)2 nro ao 1 Ji = liB + Zla Vn (1 + .3. Thus.a. Example 4.4. a a Turning to Bayesian credible intervals and regions. X has distribution P e.7.'" . Definition 4.a) lower and upper credible bounds for if they respectively satisfy II(fl. A consequence of this approach is that once a numerical interval has been computed from experimental data.Xn is N(liB. and that a has the prior probability distribution II.::: k} is called a level (1 .::: 1 . In the Bayesian formulation of Sections 1..1. N(fL.)2 nro 1 .. given a. If 7r( alx) is unimodal. then Ck will be an interval of the form [fl.6.2.d. Let II( 'Ix) denote the posterior probability distribution of given X = x. Then.
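The normal-normal credible interval above is straightforward to compute from the posterior mean and variance; the following sketch uses simulated data and hypothetical prior values.

```python
import numpy as np
from scipy.stats import norm

def normal_credible_interval(x, sigma0, mu_prior, tau0, alpha=0.05):
    """Level (1 - alpha) credible interval for mu when X_i | mu ~ N(mu, sigma0^2)
    with sigma0 known and mu ~ N(mu_prior, tau0^2)."""
    n, xbar = len(x), np.mean(x)
    precision = n / sigma0 ** 2 + 1 / tau0 ** 2
    mu_B = (n * xbar / sigma0 ** 2 + mu_prior / tau0 ** 2) / precision
    sigma_B = precision ** -0.5
    z = norm.ppf(1 - alpha / 2)
    return mu_B - z * sigma_B, mu_B + z * sigma_B

rng = np.random.default_rng(3)
print(normal_credible_interval(rng.normal(1.0, 2.0, size=30),
                               sigma0=2.0, mu_prior=0.0, tau0=1.0))
```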
(t + b)A has a X~+n distribution.6. Then.Xn are Li.3. that determine subsets of the parameter space that are assigned probability at least (1 . Suppose that given 0. In addition to point prediction of Y. However. . Y] that contains the unknown value Y with prescribed probability (1 . called level (1 a) credible bounds and intervals.4 for sources of such prior guesses.a) upper credible bound for 0. we may want an interval for the .2.n(1+ ~)2 nTo 1 • Compared to the frequentist interval X ± Zl~oO/.a) credible interval is similar to the frequentist interval except it is pulled in the direction /10 of the prior mean and it is a little narrower.a) by the posterior distribution of the parameter () given the data :c. the center liB of the Bayesian interval is pulled in the direction of /10. Xl. the interpretations are different: In the frequentist confidence interval. is shifted in the direction of the reciprocal b/ a of the mean of W(A)..d. Similarly. b > are known parameters.a). the probability of coverage is computed with X = x fixed and () random with probability distribution II (B I X = x).a) lower credible bound for A and ° is a level (1 . the interpretations of the intervals are different. it is desirable to give an interval [Y. D Example 4. given Xl.4. For instance. In the case of a normal prior w( B) and normal model p( X I B)..02) where /10 is known.12. /1 +1with /1 ± = /111 ± Zl'" 2 ~ 00 . X n . ~b) density where a > 0.. t] in which the treatment is likely to take effect. Compared to the frequentist bound (n 1) 8 2 / Xnl (a) of Example 4.8 PREDICTION INTERVALS In Section 1.a) credible interval is [/1./10)2.2. a doctor administering a treatment with delayed effect will give patients a time interval [1::.2... by Problem 1.. 01 Summary.2 . .2 . the probability of coverage is computed with the data X random and B fixed. where /10 is a prior guess of the value of /1. In the Bayesian framework we define bounds and intervals.2 and suppose A has the gamma f( ~a. Note that as TO > 00.7. N(/10. the Bayesian interval tends to the frequentist interval. See Example 1. whereas in the Bayesian credible interval. then . . 4. We shall analyze Bayesian credible regions further in Chapter 5.252 Testing and Confidence Regions Chapter 4 while the level (1 . however. = x a+n (a) / (t + b) is a level (1 . where t = 2:(Xi .n.• . Let A = a.4 we discussed situations in which we want to predict the value of a random variable Y. the level (1 . Let xa+n(a) denote the ath quantile of the X~+n distribution.
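A brief sketch of the credible bound for the precision lambda = sigma^{-2} just described, using the fact quoted above that (t + b)lambda given the data has a chi-square(a + n) distribution; the prior constants a, b and the simulated data are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def precision_lower_credible_bound(x, mu0, a, b, alpha=0.05):
    """Level (1 - alpha) lower credible bound for lambda = 1/sigma^2 when
    (t + b) * lambda | x ~ chi-square(a + n), with t = sum (x_i - mu0)^2."""
    n = len(x)
    t = np.sum((np.asarray(x) - mu0) ** 2)
    lam_lower = chi2.ppf(alpha, df=a + n) / (t + b)
    return lam_lower, 1.0 / lam_lower   # reciprocal is an upper credible bound for sigma^2

rng = np.random.default_rng(4)
print(precision_lower_credible_bound(rng.normal(0.0, 2.0, size=20), mu0=0.0, a=2.0, b=2.0))
```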
Ct) prediction interval as an interval [Y. Y ::.1)1 L~(Xi .4. It follows that.3.8 Prediction Intervals 253 future GPA of a student or a future value of a portfolio. We define a level (1.l distribution.Xn be i.+ 18tn l (1  (4.1) Note that Tp(Y) acts as a prediction interval pivot in the same way that T(p.X) is independent of X by Theorem B. 8 2 = (n .8. Let Y = Y(X) denote a predictor based on X = (Xl.) acts as a confidence interval pivot in Example 4. the optimal estimator when it exists is also the optimal predictor. Thus. . As in Example 4. where MSE denotes the estimation theory mean squared error. the optimal MSPE predictor is Y = X. we found that in the class of unbiased estimators. X is the optimal estimator.3 and independent of X n +l by assumption. .4. Also note that the prediction interval is much wider than the confidence interval (4. 1) distribution and is independent of V = (n .a) prediction interval Y = X± l (1 ~Ct) ::.1.4. in this case.Y to construct a pivot that can be used to give a prediction interval..1.Xn . By solving tnl Y.. let Xl. "~).. ~ We next use the prediction error Y . The (Student) t Prediction Interval.1. In Example 3. TnI. Tp(Y) !a) .. which has a X.Y) = 0. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot: Example 4. AISP E(Y) = MSE(Y) + cr 2 . We want a prediction interval for Y = X n +1. and can c~nclu~e that in the class of prediction unbiased predictors. which is assumed to be also N(p" cr 2 ) and independent of Xl.. . Then Y and Y are independent and the mean squared prediction error (MSPE) of Y is Note that Y can be regarded as both a predictor of Y and as an estimate of p" and when we do so.Section 4. we find the (1 .8. fnl (1  ~a) for Vn. Y) ?: 1 Ct. has the t distribution.X n +1 '" N(O. . Moreover. it can be shown using the methods of . Note that Y Y = X . We define a predictor Y* to be prediction unbiased for Y if E(Y* . by the definition of the (Student) t distribution in Section B.4.8.3..d. ::.1)8 2 /cr 2.i. In fact. . Y] based on data X such that P(Y ::. [n. ..l + 1]cr2 ). as X '" N(p" cr 2 ). It follows that Z (Y) p  Y vnY l+lcr has a N(O.1).
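The (Student) t prediction interval (4.8.1) can be computed in a few lines; the simulated sample below is illustrative only.

```python
import numpy as np
from scipy.stats import t

def t_prediction_interval(x, alpha=0.05):
    """Level (1 - alpha) prediction interval for a new observation X_{n+1}
    from the same normal population as x: Xbar +/- s*sqrt(1 + 1/n)*t_{n-1}(1 - alpha/2)."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    half = s * np.sqrt(1 + 1 / n) * t.ppf(1 - alpha / 2, df=n - 1)
    return xbar - half, xbar + half

rng = np.random.default_rng(5)
print(t_prediction_interval(rng.normal(10.0, 3.0, size=25)))
```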
i.. By ProblemB.1) is approximately correct for large n even if the sample comes from a nonnormal distribution. (4.8. whereas the width of the prediction interval tends to 2(]z (1.. the confidence level of (4.xn ). Suppose XI. Moreover. We want a prediction interval for Y = X n+l rv F.v) j(v lI)dH(u. I x) of Xn+l is defined as the conditional distribution of X n + l given x = (XI. Here Xl"'" Xn are observable and Xn+l is to be predicted. then.12. Un ordered. uniform. where X n+l is independent of the data Xl'. 00 :s: a < b :s: 00.E(U(j)) where H is the joint distribution of UCj) and U(k). 0 We next give a prediction interval that is valid from samples from any population with a continuous distribution.' X n .4.8. p(X I e). E(UCi») thus.1) tends to zero in probability at the rate n!. The posterior predictive distribution Q(.d. whereas the level of the prediction interval (4. Q(. ..4.8.2) P(X(j) It follows that [X(j) . UCk ) = v)dH(u.. < UCn) be Ul . See Problem 4. then P(U(j) j :s: Un+l :s: UCk ») P(u:S: Un+l:S: v I U(j) = U.j is a level 0: = (n + 1 . . This interval is a distributionfree prediction interval. by Problem B.5 for a simpler proof of (4. n + 1. Set Ui = F(Xi ). with a sum replacing the integral in the discrete case.' •. Ul . < X(n) denote the order statistics of Xl.2.. X(k)] with k :s: Xn+l :s: XCk») = n + l' kj = n + 1 .. Bayesian Predictive Distributions Xl' . Example 4.8. .'..1) is not (1 . I x) has in the continuous case density e.2.9. as X rv F.i.. .2).i. U(O. where F is a continuous distribution function with positive density f on (a.d...Xn are i..8. Let X(1) < . " X n. . . " X n + l are Suppose that () is random with () rv 1'( and that given () = i. Now [Y B' YB ] is said to be a level (1 .254 Testing and Confidence Regions Chapter 4 Chapter 5 that the width of the confidence interval (4.2. . = i/(n+ 1).. Let U(1) < . ..d.0:) Bayesian prediction interval for Y = X n + l if .0:) in the limit as n ) 00 for samples from nonGaussian distributions.~o:).v) = E(U(k») . . b). that is. 1).2j)/(n + 1) prediction interval for X n +l .Un+l are i. .. i = 1.
8.B)B I 0 = B} = O. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distributionfree prediction interval. and X ~ Bas n + 00. (J"5).8. note that given X = t.I .~o:) (J"o as n + 00..Section 4. (J" B n(J" B 2) It follows that a level (1 . To obtain the predictive distribution.1.0 and 0 are uncorrelated and.0:). 0 The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box. (J"5 known. 1983). N(B.0 and 0 are still uncorrelated and independent. the results and examples in this chapter deal mostly with oneparameter problems in which it sometimes is possible to find optimal procedures. Thus.9 4.d. independent. .8. 4.2 o + I' T2 ~ P. where X and X n + l are independent. Xn+l .8.i. from Example 4. The Bayesian formulation is based on the posterior predictive distribution which is the conditional distribution of the unobservable variable given the observable variables. T2 known.3) To consider the frequentist properties of the Bayesian prediction interval (4. (J"5 + a~) ~2 = (J" B n I (.1) . we find that the interval (4.1 where (Xi I B) '" N(B .9 Likelihood Ratio Procedures 255 Example 4.~o:) V(J"5 + a~.. The Bayesian prediction interval is derived for the normal model with a normal prior. However. Because (J"~ + 0. (n(J"~ /(J"5) + 1.1 LIKELIHOOD RATIO PROCEDURES Introduction Up to this point. This is the same as the probability limit of the frequentist interval (4. (J"5). (4.0)0] = E{E(Xn+l . T2). Thus. and it is enough to derive the marginal distribution of Y = X n +l from the joint distribution of X. even in .3. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1. X n + l and 0. Note that E[(Xn+l .9. we construct the Student t prediction interval for the unobservable variable.c{Xn+l I X = t} = . A sufficient statistic based on the observables Xl.8. Consider Example 3. Xn+l . .3) converges in probability to B±z (1 .3) we compute its probability limit under the assumption that Xl " '" Xn are i. Xn is T = X = n 1 2:~=1 Xi . and 7r( B) is N( TJo . Summary. .2. B = (2 / T 2) TJo + ( 2 / (J"0 X. In the case of a normal sample of size n + 1 with only n variables observable.0:) Bayesian prediction interval for Y is [YB ' Yt] with Yf = liB ± Z (1. by Theorem BA.  0) + 0 I X = t} = N(liB.c{(Xn +1 where.7.
The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. 2. e) : () sup{p(x. Because 6a (x) I. and for large samples. Xn is a sample from a N(/L. likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6. Suppose that X = (Xl . B')/p(x . .2 that we think of the likelihood function L(e. Also note that L(x) coincides with the optimal test statistic p(x . We start with a generalization of the NeymanPearson statistic p(x . 8 1 = {e l }. eo) when 8 0 = {eo}. . To see that this is a plausible statistic. and conversely. . there can be no UMP test of H : /L = /Lo vs H : /L I.1(c». Calculate the MLE eo of e where e may vary only over 8 0 . ed/p(x. . On the other hand.2. Find a function h that is strictly increasing on the range of A such that h(A(X)) has a simple form and a tabled distribution under H .1 that if /Ll > /Lo.1) whose computation is often simple. . note that it follows from Example 4. Note that in general A(X) = max(L(x). z(a )./LO..its (1 . e) : e E 8 0 } e e e Tests that reject H for large values of L(x) are called likelihood ratio tests. if /Ll < /Lo.9. We are going to derive likelihood ratio tests in several important testing problems.'Pa(x). sup{p(x. p(x. For instance. eo). The test statistic we want to consider is the likelihood ratio given by L(x) = sup{p(x. recall from Section 2. . So. 1). Form A(X) = p(x. ed / p(x. if Xl./Lo)/a. .~a). 1..Xn ) has density or frequency function p(x. In particular cases. Xn)..a )th quantile obtained from the table. the basic steps are always the same.256 Testing and Confidence Regions Chapter 4 the case in which is onedimensional.x) = p(x. e) as a measure of how well e "explains" the given sample x = (Xl. there is no UMP test for testing H : /L = /Lo vs K : /L I. where T = fo(X . . Although the calculations differ from case to case. In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. then the observed sample is best explained by some E 8 1 . eo). In the cases we shall consider. e) and we wish to test H : E 8 0 vs K : E 8 1 . . 4.2. by the uniqueness of the NP test (Theorem 4. e) : e E 8l}islargecomparedtosup{p(x. the MP level a test 6a (X) rejects H for T > z(l ./Lo. optimal procedures may not exist. Calculate the MLE e of e. To see this. e) E 8} : eE 8 0} (4. we specify the size a likelihood ratio test through the test statistic h(A(X) and . 3. a 2 ) population with a 2 known.2. ifsup{p(x. e) is a continuous function of e and eo is of smaller dimension than 8 = 8 0 U 8 1 so that the likelihood ratio equals the test statistic I e e 1 A(X) = sup{p(x. Because h(A(X)) is equivalent to A(X) .e): E 8 0 }. the MP level a test 'Pa(X) rejects H if T ::::.. e) : e E 8 l }.
a) confidence region C(x) = {8 : p(x. 0'2) population in which both JL and 0'2 are unknown. while the second patient serves as control and receives a placebo. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo. An important class of situations for which this model may be appropriate occurs in matched pair experiments. This section includes situations in which 8 = (8 1 . and soon. For instance.8) . and so on.82 ) where 81 is the parameter of interest and 82 is a nuisance parameter. In the ith pair one patient is picked at random (i. p (X . we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. we measure the response of a subject when under treatment and when not under treatment.Xn form a sample from a N(JL. In order to reduce differences due to the extraneous factors. the difference Xi has a distribution that is .5.9. After the matching.2) where sUPe denotes sup over 8 E e and the critical constant c( 8) satisfies p.. diet.9 Likelihood Ratio Procedures 257 We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions. The family of such level a likelihood ratio tests obtained by varying 8lD can also be inverted and yield confidence regions for 8 1 .2. Let Xi denote the difference between the treated and control responses for the ith pair. which are composite because 82 can vary freely. 4.2 Tests for the Mean of a Normal DistributionMatched Pair Experiments Suppose Xl. We shall obtain likelihood ratio tests for hypotheses of the form H : 81 = 8lD . To see how the process works we refer to the specific examples in Sections 4. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age. Studies in which subjects serve as their own control can also be thought of as matched pair experiments.9. In that case. 8) > (8)] = eo a.9. We can regard twins as being matched pairs. An example is discussed in Section 4. sales performance before and after a course in salesmanship. Response measurements are taken on the treated and control members of each pair. C (x) is just the set of all 8 whose likelihood is on or above some fixed value dependent on the data.24. If the treatment and placebo have the same effect. That is. Here are some examples.9.'" . We are interested in expected differences in responses due to the treatment effect. bounds.9. 8n e (4.e. 8) ~ [c( 8)t l sup p(x. mileage of cars with and without a certain ingredient or adjustment. we can invert the family of size a likelihood ratio tests of the point hypothesis H : 8 = 80 and obtain the level (1 . and other factors.Section 4.c 0 0 It is often approximately true (see Chapter 6) that c( 8) is independent of 8. the experiment proceeds as follows. with probability ~) and given the treatment. [ supe p(X.
B) at (fJ. We found that sup{p(x. we test H : fJ. Our null hypothesis of no treatment effect is then H : fJ. where we think of fJ.0}. = fJ.0) 2  a2 n] = 0.6.=1 fJ.fJ.258 Testing and Confidence Regions Chapter 4 symmetric about zero. = E(X 1 ) denote the mean difference between the response of the treated and control subjects.X) .3. B) : B E 8 0 } boils down to finding the maximum likelihood estimate a~ of a 2 when fJ. B) was solved in Example 3.L ( X i .0 )2 . Form of the TwoSided Tests Let B = (fJ" a 2 ). = fJ. = O. n i=l is the maximum likelihood estimate of B.B): B E 8} = p(x. e). = fJ. a~).0 as an established standard for an old treatment. . Finding sup{p(x. However.o. Let fJ. This corresponds to the alternative "The treatment has some effect. as discussed in Section 4. TwoSided Tests We begin by considering K : fJ.5. which has the immediate solution ao ~2 = . The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the nonnality assumption is not satisfied. =Ie fJ.0 is known and then evaluating p(x. a 2 ) : fJ. The likelihood equation is oa 2 a logp(x. The problem of finding the supremum of p(x.L Xi 1 ~( n i=l . good or bad.L X i ' ." However.0.0. 8 0 = {(fJ" Under our assumptions. as representing the treatment effect. B) 1[1~ ="2 a 4 L(Xi . the test can be modified into a threedecision rule that decides whether there is a significant positive or negative effect. We think of fJ.a)= ~ ~2 (1 ~ 1 ~ n i=l 2 ) . where B=(x. To this we have added the nonnality assumption. 'for the purpose of referring to the duality between testing and confidence procedures.
OneSided Tests The twosided formulation is natural if two treatments. where 8 = (M . Therefore. and only if. which thus equals log .064. To simplify the rule further we use the following equation.9. 8 Therefore. to find the critical value.\(x) logp(x. are considered to be equal before the experiment is performed. (Mo. Similarly.05.1).a). The statistic Tn is equivalent to the likelihood ratio statistic A for this problem. Mo . Because 8 2 function of ITn 1 where = 1 + (x . Then we would reject H if. and only if.2.11 we argue that P"[Tn Z t] is increasing in 8.9. the relevant question is whether the treatment creates an improvement. therefore. The test statistic . suppose n = 25 and we want a = 0. the likelihood ratio tests reject for large values of ITn I. However.Section 4. the testing problem H : M ::.3. or Table III.MO)2/&2. the test that rejects H for Tn Z t nl(1 . A and B.\(x). In Problem 4. A proof is sketched in Problem 4.Mo) .8) logp(x. rejects H for iarge values of (&5/&2). &5/&2 (&5/&2) is monotone increasing Tn = y'n(x . if we are comparing a treatment and control. the size a critical value is t n 1 (1 . . which can be established by expanding both sides.1.Mo)/a. 8) for 8 E 8 0. Therefore. &0)) ~ 2 {~[(log271') + (log&2)]~} ~ log(&5/&2).4. is of size a for H : M ::.9 Likelihood Ratio Procedures 259 By Theorem 2. (n . Because Tn has a T distribution under H (see Example 4. &5 gives the maximum of p(x. Thus. ITnl Z 2.x)2 = n&2/(n .~a) and we can use calculators or software that gives quantiles of the t distribution. Mo versus K : M > Mo (with Mo = 0) is suggested.1).  {~[(log271') + (log&5)]~} Our test rule. the size a likelihood ratio test for H : M Z Mo versus K : M < Mo rejects H if.\(x) is equivalent to log .1)1 I:(Xi . For instance.
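The two-sided and one-sided size-alpha t tests just derived are easily implemented; the sketch below (our function names) also reproduces the two-sided critical value t_24(0.975) = 2.064 quoted for n = 25 and alpha = 0.05.

```python
import numpy as np
from scipy.stats import t

def one_sample_t_test(x, mu0, alpha=0.05, alternative="two-sided"):
    """Size-alpha tests based on T_n = sqrt(n)(Xbar - mu0)/s."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    Tn = np.sqrt(n) * (xbar - mu0) / s
    if alternative == "two-sided":
        crit, p = t.ppf(1 - alpha / 2, n - 1), 2 * (1 - t.cdf(abs(Tn), n - 1))
        reject = abs(Tn) > crit
    elif alternative == "greater":        # H: mu <= mu0 versus K: mu > mu0
        crit, p = t.ppf(1 - alpha, n - 1), 1 - t.cdf(Tn, n - 1)
        reject = Tn > crit
    else:                                 # H: mu >= mu0 versus K: mu < mu0
        crit, p = -t.ppf(1 - alpha, n - 1), t.cdf(Tn, n - 1)
        reject = Tn < crit
    return Tn, crit, p, reject

print(t.ppf(0.975, 24))   # approximately 2.064
```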
Power Functions and Confidence Regions

Note that the distribution of Tn depends on θ = (μ, σ²) only through δ = √n(μ − μ0)/σ. To discuss the power of these tests, we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ. This distribution, denoted by T_{k,δ}, is by definition the distribution of Z/√(V/k), where Z and V are independent and have N(δ, 1) and χ²_k distributions, respectively. The density of Z/√(V/k) is given in the problems. To derive the distribution of Tn, note that from Section B.3 we know that √n(X̄ − μ)/σ and (n − 1)s²/σ² are independent and that (n − 1)s²/σ² has a χ²_{n−1} distribution. Because E[√n(X̄ − μ0)/σ] = √n(μ − μ0)/σ, the ratio √n(X̄ − μ0)/σ has a N(δ, 1) distribution with δ = √n(μ − μ0)/σ. Thus,

Tn = [√n(X̄ − μ0)/σ] / √{ [(n − 1)s²/σ²] / (n − 1) }

has a T_{n−1,δ} distribution. The power functions of the one-sided tests are monotone in δ (see the problems), just as the power functions of the corresponding tests of Example 4.4.1 are monotone in √n μ/σ, and the power can be obtained from computer software or tables of the noncentral t distribution. Computer software will also compute the sample size n needed to achieve a prescribed power.

We can control both probabilities of error by selecting the sample size n large, provided we consider alternatives of the form |δ| ≥ δ1 > 0 in the two-sided case and δ ≥ δ1 or δ ≤ −δ1 in the one-sided cases. If, however, we consider alternatives of the form (μ − μ0) ≥ Δ, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be n, by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ − μ0)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α. We have met similar difficulties when discussing confidence intervals (see the problems). A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ − μ0| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17).

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

C(X) = {μ : |√n(X̄ − μ)/s| ≤ t_{n−1}(1 − ½α)} = [X̄ − s t_{n−1}(1 − ½α)/√n, X̄ + s t_{n−1}(1 − ½α)/√n].

We recognize C(X) as the confidence interval of Example 4.4.1. Similarly, the one-sided tests lead to the lower and upper confidence bounds of Example 4.4.2.

Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B − A in sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i    1     2     3     4     5     6     7     8     9    10
A           0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B           1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A       1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the differences as x's, then x̄ = 1.58, s² = 1.513, and |Tn| = 4.06. Because t9(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A, because no hypothesis μ = μ' < 0 is accepted at this level. (See also (4.9.3).)
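As a numerical check on this example, and on the power calculation via the noncentral t distribution described above, here is a short sketch (Python; numpy/scipy are assumptions, not part of the text). The differences are read from the B − A row of the table; the last line evaluates the power of the one-sided level 0.05 test at a hypothetical alternative δ = 1.

```python
import numpy as np
from scipy import stats

# B - A differences from the Cushny-Peebles sleep data quoted above
d = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])
n = len(d)
xbar, s = d.mean(), d.std(ddof=1)
Tn = np.sqrt(n) * xbar / s                 # test of H: mu = 0
print(xbar, s**2, Tn)                      # 1.58, about 1.513, about 4.06

# 0.99 confidence interval for mu (matches [0.32, 2.84] in the text)
t99 = stats.t.ppf(0.995, n - 1)            # t_9(0.995) = 3.25
print(xbar - t99 * s / np.sqrt(n), xbar + t99 * s / np.sqrt(n))

# Power of the one-sided size-0.05 test at a hypothetical delta = 1:
# P(T_{n-1,delta} >= t_{n-1}(0.95)), via the noncentral t distribution.
crit = stats.t.ppf(0.95, n - 1)
print(stats.nct.sf(crit, n - 1, 1.0))
```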
4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X1, ..., X_{n1} and Y1, ..., Y_{n2}, one from each population. Such two-sample problems arise in comparing the precision of two instruments and in determining the effect of a treatment. For instance, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X1, ..., X_{n1} could be blood pressure measurements on a sample of patients given a placebo, while Y1, ..., Y_{n2} are the measurements on a sample given the drug. For quantitative measurements such as blood pressure, temperature, height, weight, length, volume, and so forth, it is usually assumed that X1, ..., X_{n1} and Y1, ..., Y_{n2} are independent samples from N(μ1, σ²) and N(μ2, σ²) populations, respectively. The preceding assumptions were discussed in Chapter 1. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.

Tests

We first consider the problem of testing H : μ1 = μ2 versus K : μ1 ≠ μ2. In the control versus treatment example, this is the problem of determining whether the treatment has any effect. Let θ = (μ1, μ2, σ²). Then Θ0 = {θ : μ1 = μ2} and Θ1 = {θ : μ1 ≠ μ2}. The log of the likelihood of (X, Y) = (X1, ..., X_{n1}, Y1, ..., Y_{n2}) is

−(n/2) log(2πσ²) − (1/2σ²)[ Σi (xi − μ1)² + Σj (yj − μ2)² ],

where n = n1 + n2. In the problems it is shown that the likelihood function and its log are maximized over Θ by the maximum likelihood estimate θ̂ = (X̄, Ȳ, σ̂²), where

σ̂² = (1/n)[ Σi (Xi − X̄)² + Σj (Yj − Ȳ)² ].

When μ1 = μ2 = μ, our model reduces to the one-sample model of Section 4.9.2. Thus, the maximum of p over Θ0 is obtained for θ = (μ̂, μ̂, σ̂0²), where

μ̂ = (n1 X̄ + n2 Ȳ)/n  and  σ̂0² = (1/n)[ Σi (Xi − μ̂)² + Σj (Yj − μ̂)² ].

If we use the identities

Σi (Xi − μ̂)² = Σi (Xi − X̄)² + n1(X̄ − μ̂)²,
Σj (Yj − μ̂)² = Σj (Yj − Ȳ)² + n2(Ȳ − μ̂)²,

obtained by writing [Xi − μ̂] = [(Xi − X̄) + (X̄ − μ̂)] and expanding, we find that the log likelihood ratio statistic is equivalent to the test statistic |T| where

T = √(n1 n2 / n) (Ȳ − X̄)/s  and  s² = (1/(n − 2))[ Σi (Xi − X̄)² + Σj (Yj − Ȳ)² ].

To complete specification of the size α likelihood ratio test we show that T has a T_{n−2} distribution when μ1 = μ2. By the results of Section B.3, the variables

X̄/σ,  Ȳ/σ,  Σi (Xi − X̄)²/σ²,  Σj (Yj − Ȳ)²/σ²

are independent and distributed as N(μ1/σ, 1/n1), N(μ2/σ, 1/n2), χ²_{n1−1}, χ²_{n2−1}, respectively. We conclude from this remark and the additive property of the χ² distribution that (n − 2)s²/σ² has a χ²_{n−2} distribution. Moreover, under H, √(n1 n2 / n)(Ȳ − X̄)/σ has a N(0, 1) distribution and is independent of (n − 2)s²/σ². Therefore, by definition, T has a T_{n−2} distribution under H, and the resulting two-sided, two-sample t test rejects if, and only if,

|T| ≥ t_{n−2}(1 − ½α).

As usual, there are two one-sided tests, with critical regions T ≥ t_{n−2}(1 − α) for H : μ2 ≤ μ1 and T ≤ −t_{n−2}(1 − α) for H : μ2 ≥ μ1. We can show that these tests are likelihood ratio tests for these hypotheses. It is also true that these procedures are of size α for their respective hypotheses; this follows from the fact that, if μ1 ≠ μ2, T has a noncentral t distribution with noncentrality parameter √(n1 n2 / n)(μ2 − μ1)/σ.

Confidence Intervals

To obtain confidence intervals for μ2 − μ1 we naturally look at likelihood ratio tests for the family of testing problems H : μ2 − μ1 = Δ versus K : μ2 − μ1 ≠ Δ. As in the one-sample case, we find a simple equivalent statistic |T(Δ)| where

T(Δ) = √(n1 n2 / n) (Ȳ − X̄ − Δ)/s.

If μ2 − μ1 = Δ, then T(Δ) has a T_{n−2} distribution, and inversion of the tests leads to the interval

Ȳ − X̄ ± t_{n−2}(1 − ½α) s √(1/n1 + 1/n2)        (4.9.3)

for μ2 − μ1. As usual, corresponding to the two-sided test, the one-sided tests lead to the upper and lower endpoints of the interval as level 1 − ½α likelihood confidence bounds.
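The pooled two-sample t statistic and the interval (4.9.3) can be computed as in the following sketch (Python, with numpy/scipy assumed); the data example below provides a numerical check.

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y, delta=0.0, alpha=0.05):
    """Pooled two-sample t test of H: mu2 - mu1 = delta and the
    level (1 - alpha) confidence interval (4.9.3) for mu2 - mu1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    s2 = (((x - x.mean())**2).sum() + ((y - y.mean())**2).sum()) / (n - 2)
    s = np.sqrt(s2)                                   # pooled standard deviation
    T = np.sqrt(n1 * n2 / n) * (y.mean() - x.mean() - delta) / s
    tq = stats.t.ppf(1 - alpha / 2, n - 2)
    half = tq * s * np.sqrt(1 / n1 + 1 / n2)
    ci = (y.mean() - x.mean() - half, y.mean() - x.mean() + half)
    return T, tq, ci
```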
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1):  1.845  1.790  2.042
y (machine 2):  1.583  1.627  1.282

We test the hypothesis H of no difference in expected log permeability. If normality holds, H is rejected if |T| ≥ t_{n−2}(1 − ½α). Here ȳ − x̄ = −0.395, s² = 0.0264, and |T| = 2.977. Because t4(0.975) = 2.776, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability μ2 − μ1 is −0.395 ± 0.368. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 − α) confidence interval has probability at most ½α of making the wrong selection.

4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. As a first step we may still want to compare mean responses for the X and Y populations. If normality holds, we are led to a model where X1, ..., X_{n1}, Y1, ..., Y_{n2} are two independent N(μ1, σ1²) and N(μ2, σ2²) samples, respectively. This is the Behrens-Fisher problem.

Suppose first that σ1² and σ2² are known. The log likelihood, except for an additive constant, is

−(1/2σ1²) Σi (xi − μ1)² − (1/2σ2²) Σj (yj − μ2)²

for (μ1, μ2) ∈ R × R. The MLEs of μ1 and μ2 are, thus, μ̂1 = x̄ and μ̂2 = ȳ. When μ1 = μ2 = μ, setting the derivative of the log likelihood equal to zero yields the MLE

μ̂ = (n1 σ2² x̄ + n2 σ1² ȳ)/(n1 σ2² + n2 σ1²),

so that

μ̂ − x̄ = n2 σ1²(ȳ − x̄)/(n1 σ2² + n2 σ1²)  and  μ̂ − ȳ = n1 σ2²(x̄ − ȳ)/(n1 σ2² + n2 σ1²).

Next we compute

λ(x, y) = exp{ (n1/2σ1²)(μ̂ − x̄)² + (n2/2σ2²)(μ̂ − ȳ)² }.

It follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D, where D = Ȳ − X̄ and σ_D² is the variance of D, that is,

σ_D² = Var(X̄) + Var(Ȳ) = σ1²/n1 + σ2²/n2.

Under H, D/σ_D has a N(0, 1) distribution, so the test is easily carried out when the variances are known.

When σ1² and σ2² are unknown, σ_D² must be estimated. An unbiased estimate is

s_D² = s1²/n1 + s2²/n2,

where s1² and s2² are the sample variances of the X's and Y's. It is natural to try and use D/s_D as a test statistic for the one-sided hypothesis H : μ2 ≤ μ1, |D|/s_D as a test statistic for H : μ1 = μ2, and, more generally, (D − Δ)/s_D to generate confidence procedures for Δ = μ2 − μ1. Unfortunately, the distribution of (D − Δ)/s_D depends on σ1²/σ2² for fixed n1, n2. For large n1, n2, by Slutsky's theorem and the central limit theorem, (D − Δ)/s_D has approximately a standard normal distribution (Problem 5.3.28). For small and moderate n1, n2, an approximation to the distribution of (D − Δ)/s_D due to Welch (1949) works well. Let c = s1²/(n1 s_D²). Then Welch's approximation is T_k, where

k = [ c²/(n1 − 1) + (1 − c)²/(n2 − 1) ]⁻¹.

When k is not an integer, the critical value is obtained by linear interpolation in the t tables or using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.05 and α = 0.01, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n1 = n2, can unfortunately be very misleading if σ1² ≠ σ2² and n1 ≠ n2. See Figure 5.3.3 and Problem 5.3.28.
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X1, Y1), ..., (Xn, Yn). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ1, μ2, σ1², σ2², ρ), with σ1² > 0, σ2² > 0.

Testing Independence, Confidence Intervals for ρ

The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = test score on an English exam, Y = test score on a mathematics exam; X = average cigarette consumption per day in grams, Y = age at death; X = percentage of fat in diet, Y = cholesterol level in blood; X = weight, Y = blood pressure; and so on. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem becomes that of testing H : ρ = 0.

Two-Sided Tests

The unrestricted maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂1², σ̂2², ρ̂), where

σ̂1² = (1/n) Σ (Xi − x̄)²,  σ̂2² = (1/n) Σ (Yi − ȳ)²,  ρ̂ = [ Σ (Xi − x̄)(Yi − ȳ) ] / (n σ̂1 σ̂2),

was given in the problems of Chapter 2. ρ̂ is called the sample correlation coefficient and satisfies −1 ≤ ρ̂ ≤ 1. When ρ = 0 we have two independent samples, and θ̂0 can be obtained by separately maximizing the likelihood of X1, ..., Xn and that of Y1, ..., Yn. We have θ̂0 = (x̄, ȳ, σ̂1², σ̂2², 0), and the log of the likelihood ratio statistic becomes

log λ(x) = −(n/2) log(1 − ρ̂²).        (4.9.4)

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|.

To obtain critical values we need the distribution of ρ̂ or an equivalent statistic under H. Now, if Ui = (Xi − μ1)/σ1 and Vi = (Yi − μ2)/σ2, then (U1, V1), ..., (Un, Vn) is a sample from the N(0, 0, 1, 1, ρ) distribution, and ρ̂ is unchanged if each (Xi, Yi) is replaced by (Ui, Vi). Therefore, the distribution of ρ̂ depends on ρ only. There is no simple form for the distribution of ρ̂ (or Tn) when ρ ≠ 0; however, the distribution of ρ̂ is available in computer packages, and a normal approximation is available (see Chapter 5). When ρ = 0, it follows (see Appendix B.4) that

Tn = √(n − 2) ρ̂ / √(1 − ρ̂²)        (4.9.5)

has a T_{n−2} distribution. Because |Tn| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |Tn| and the critical values obtained from Table II. Qualitatively, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
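A sketch of the two-sided test of H : ρ = 0 based on (4.9.5), together with an approximate confidence interval for ρ, follows (Python; numpy/scipy assumed). The interval uses the arctanh (variance-stabilizing) transformation, a common large-sample shortcut anticipating the Chapter 5 approximation mentioned below; it is offered as an illustration, not as the exact inversion of c(ρ) described in the text.

```python
import numpy as np
from scipy import stats

def correlation_test(x, y, alpha=0.05):
    """Two-sided likelihood ratio test of independence (H: rho = 0),
    based on T_n of (4.9.5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rho_hat = np.sum((x - x.mean()) * (y - y.mean())) / (
        n * x.std() * y.std())               # np.std divides by n, matching sigma-hat
    Tn = np.sqrt(n - 2) * rho_hat / np.sqrt(1 - rho_hat**2)
    pval = 2 * stats.t.sf(abs(Tn), n - 2)
    return rho_hat, Tn, pval, pval <= alpha

def rho_confidence_interval(rho_hat, n, alpha=0.05):
    """Approximate level (1 - alpha) interval for rho:
    arctanh(rho_hat) is approximately N(arctanh(rho), 1/(n - 3))."""
    z = np.arctanh(rho_hat)
    half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)
```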
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test H : ρ = 0 versus K : ρ > 0 or H : ρ ≤ 0 versus K : ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H : ρ ≤ 0 versus K : ρ > 0, and similarly that −ρ̂ corresponds to the likelihood ratio statistic for H : ρ ≥ 0 versus K : ρ < 0. The power functions of these tests are monotone. Therefore, we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0.

Confidence Bounds and Intervals

Usually testing independence is not enough and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H : ρ = ρ0 versus K : ρ > ρ0. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ0)," where P_{ρ0}[ρ̂ ≤ c(ρ0)] = 1 − α. We can show that P_ρ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (see the problems). Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level 1 − α lower confidence bounds. We obtain c(ρ) either from computer software or by the approximation of Chapter 5. We can similarly obtain level 1 − α upper confidence bounds and, by putting two level 1 − ½α bounds together, we obtain a commonly used confidence interval for ρ. These intervals do not correspond to the inversion of the size α LR tests of H : ρ = ρ0 versus K : ρ ≠ ρ0, but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ0) or ρ̂ ≤ c(ρ0), where P_{ρ0}[ρ̂ ≥ d(ρ0)] = P_{ρ0}[ρ̂ ≤ c(ρ0)] = ½α. However, for large n the equal-tails and LR confidence intervals approximately coincide with each other.

Data Example

As an illustration, consider a bivariate sample of weights xi of young rats at a certain age and the weight increases yi during the following week. We want to know whether there is a correlation between the initial weights and the weight increases and formulate the hypothesis H : ρ = 0. Here ρ̂ = 0.18 and Tn = 0.75; by using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the p-value is bigger than 0.48.

Summary. The likelihood ratio test statistic λ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
( 2) and we test the hypothesis that the mean difference J. Consider the hypothesis H that the mean life II A = J.. We test the hypothesis that X and Yare indepen dent and find that the likelihood ratio test is equivalent to the test based on IPl. 4.Ll' ( 2) and N(J.2 degrees of freedom. . Mn e = ~? = 0.98 for (e) If in a sample of size n = 20.Lo.. where SD is an estimate of the standard deviation of D = Y .48..L :::.L2' a~) populations. Let Xl. what choice of c would make 6c have size (c) Draw a rough graph of the power function of 6c specified in (b) when n = 20. When a? and a~ are unknown. . Approximate critical values are obtained using Welch's t distribution approximation. J. The likelihood ratio test is equivalent to the onesample (Student) t test. respectively. Suppose that Xl.. we use (Y . 0 otherwise. . . . We also find that the likelihood ratio statistic is equivalent to a t statistic with n . the likelihood ratio test is equivalent to the test based on IY . (3) Twosample experiments in which two independent samples are modeled as coming from N(J. . Let Mn = max(X 1 . When a? and a~ are known.Section 4. (2) Twosample experiments in which two independent samples are modeled as coming from N(J. ~ versus K : e> ~.L is zero. X n ) and let 1 if Mn 2:: c = of e.L2' ( 2) populations.10 Problems and Complements 269 (1) Matched pair experiments in which differences are modeled as N(J.Xn denote the times in days to failure of n similar pieces of equipment. (a) Compute the power function of 6c and show that it is a monotone increasing function (b) In testing H : exactly 0. . what is the pvalue? 2.X)I SD. Assume the model where X = (Xl. e).Xn ) is an £(A) sample.X.1 1. (d) How large should n be so that the 6c specified in (b) has power 0..XI. where p is the sample correlation coefficient. respectively. . X n are independently and identically distributed according to the uniform distribution U(O.Ll' an and N(J.10 PROBLEMS AND COMPLEMENTS Problems for Section 4. .05? e :::. (4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases.. . . We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the twosample (Student) t test.L.
Xn be a sample from a population with the Rayleigh density . Hint: See Problem B.ontinuous distribution. . Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0. 7..O) = (xIO')exp{x'/20'}.... (0) Show that.. Show that if H is simple and the test statistic T has a c. j = 1. i I (b) Check that your test statistic has greater expected value under K than under H. (b) Show that the power function of your test is increasing in 8.~llog <>(Tj ) has a X~r distribution.. where x(I . 6. . . .a) is the (1 . Suppose that T 1 . (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 .0 > 0. Hint: Use the central limit theorem for the critical value.05? 3. . Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution..4. 1nz t Z i .3. . (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a). under H. Let o:(Tj ) denote the pvalue for 1j. Let Xl.a)th quantile of the X~n distribution.. is a size Q test. distribution. X n be a 1'(0) sample. Draw a graph of the approximate power function. .2 L:. give a normal approximation to the significance probability..2. then the pvalue <>(T) has a unifonn.12. .1. Hint: Approximate the critical region by IX > 1"0(1 + z(1 . . (b) Give an expression of the power in tenns of the X~n distribution. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model. Let Xl.. If 110 = 25. I . " i I 5.r. .4 to show that the test with critical region IX > I"ox( 1  <» 12n].. . j . U(O. .3). Assume that F o and F are continuous. f(x. Establish (4.. 1 using a I .) 4.270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8. r. T ~ Hint: See Problem B. 1). X> 0.<»1 VTi)J .3. .. . (d) The following are days until failure of air monitors at a nuclear plant. j=l . (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00. (Use the central limit theorem.
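One of the problems above concerns combining independent test statistics through their p-values (Fisher's method): under H, −2 Σ log α(Tj) has a χ² distribution with 2r degrees of freedom. The following sketch (Python, scipy assumed) computes the combination statistic and the resulting overall p-value; the example p-values are placeholders.

```python
import numpy as np
from scipy import stats

def fisher_combine(pvalues, alpha=0.05):
    """Combine independent p-values: under H, -2 * sum(log p_j) ~ chi-square(2r)."""
    p = np.asarray(pvalues, float)
    stat = -2.0 * np.sum(np.log(p))
    crit = stats.chi2.ppf(1 - alpha, 2 * len(p))
    overall_p = stats.chi2.sf(stat, 2 * len(p))
    return stat, crit, overall_p

# e.g. three independent tests with p-values 0.09, 0.20, 0.03 (hypothetical values)
print(fisher_combine([0.09, 0.20, 0.03]))
```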
. . . is chosen 00 so that 00 x 2 dF(x) = 1..) Evaluate the bound Pp(lF(x) .o V¢.. B. (In practice these can be obtained by drawing B independent samples X(I).T. . Fo(x)1 > kol x ~ Hint: D n > !F(x) . Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1).. 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.Fo(x)IO x ~ ~ sup.T(X(l)). = (Xi . b > 0.o: is called the Cramervon Mises statistic. j = 1..Fo(x)1 > k o ) for a ~ 0. 00.Fo(x) 10 U¢.5.6 is invariant under location and scale. Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.. with distribution function F and consider H : F = Fo. . T(B) be B independent Monte Carlo simulated values of T. . n (c) Show that if F and Fo are continuous and FiFo. 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/. . . Define the statistics S¢. .5.. Here.5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4..10.. and let a > O.o J J x . 9. if X. Use the fact that T(X) is equally likely to be any particular order statistic. .)(Tn ) = LN(O.p(Fo(x))IF(x) .. 10. Hint: If H is true T(X).1..a)/b. .X(B) from F o on the computer and computing TU) = T(XU)).2.1. (b) Suppose Fo isN(O. (b) When 'Ij. to get X with distribution . 80 and x = 0.(X') = Tn(X). (This is the logistic distribution with mean zero and variance 1. (b) Use part (a) to conclude that LN(".Fo(x)IOdF(x).p(u) be a function from (0. T(X(B)) is a sample of size B + 1 from La. = ~ I. In Example 4. Vw. then T. Express the Cramervon Mises statistic as a sum. T(l).o sup. . That is..Fo(x)1 foteach. . . .p(F(x»!F(x) . (a) For each of these statistics show that the distribution under H does not depend on F o.Fo(x)IOdFo(x) . (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).l) (Tn). X n be d.Fo• generate a U(O.J(u) = 1 and 0: = 2.T(B+l) denote T. 1) to (0.o T¢. ~ ll.00). Let X I..) Next let T(I).d... T(B) ordered.12(b). then the power of the Kolmogorov test tends to 1 as n .10 Problems and Complements 271 8.p(F(x)IF(x) .Section 4. 1. (a) Show that the statistic Tn of Example 4. let .u. Let T(l).5.1. and 1. ..p(Fo(x))lF(x) .
).. If each time you test. ! .272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale. Consider a population with three kinds of individuals labeled 1. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t).2. then the pvalue is U =I . the power is (3(0) = P(U where FO 1 (u)  < a) = 1..10.nO /. Let 0 < 00 < 0 1 < 1. i • (a) Show that if the test has level a. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. the UMP test is of the form I {T > c).. You want to buy one of two systems.. and 3 occuring in the HardyWeinberg proportions f(I. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . respectively.. I X n from this population. Let T = X/"o and 0 = /" .2 and 4. (See Problem 4. Suppose that T has a continuous distribution Fe. 2N1 + N. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • . Expected pvalues. X n . 2. the other has v/aQ = 1. . Let To denote a random variable with distribution F o. I)..0)'..1.. and only if.2 I i i 1.0). I 1 (b) Show that if c > 0 and a E (0.a)) = inf{t: Fo(t) > u}. 3. ! . (For a recent review of expected p values see Sackrowitz and SamuelCahn.2. . and 3. > cJ = a. let N I • N 2 • and N 3 denote the number of Xj equal to I. the other $105 • One second of transmission on either system COsts $103 each. Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test.') sample Xl. = /10 versus K : J./"0' Show that EPV(O) = if! ( . 00 .3. 2. (a) Show that L(x. f(3.L > J.) I 1 .1.O) = 0'. take 80 = O. 1999. . Whichever system you buy during the year. which is independent ofT.0) = 20(1. [2N1 + N. whereas the other a I ~ a . Show that EPV(O) = P(To > T). Problems for Section 4.0) = (1. j 1 i :J . you intend to test the satellite 100 times... 01 ) is an increasing function of 2N1 + N. Consider Examples 3.Fo(T).Fe(F"l(l. then the test that rejects H if. f(2. where if! denotes the standard normal distribution. One has signalMtonoise ratio v/ao = 2. Without loss of generality. The first system costs $106 . Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. ! J (c) Suppose that for each a E (0.05. For a sample Xl. i i (b) Define the expected pvalue as EPV(O) = EeU. 5 about 14% of the time.1) satisfy Pe.) . 12.to on the basis of the N(p" . you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. where" is known..
. For 0 < a < I. Bd > cJ = I .. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size. .. 9.2. I.0..1. L . 1..1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. and only if.~:":2:~~=C:="::===='..Bd > cJ then the likelihood ratio test with critical value c is best in this sense. H : (J = (Jo versus K : (J = (J..2.4. Nk) ~ 4.0.imum probability of error (of either type) is as small as possible. 0. (c) Using the fact that if(N" . Show that if randomization is pennitted. M(n.10 Problems and Complements 273 four numbers are equally likely to occur (i.0.17).2. find an approximation to the critical value of the MP level a test for this problem..4 and recall (Proposition B. 5. +akNk has approx. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po.. = a.. . Hint: Use Problem 4..N.6) where all parameters are known. then a.2. I...2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed.o = (1.2.. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel. [L(X. find the MP test for testing 10.2. Y) belongs to population 1 if T > c.<l.. .1. PdT < cJ is as small as possible. 6...Pe.1. Hint: The MP test has power at least that of the test with test function J(x) 8.=  Section 4. and to population 0 ifT < c. Upon being asked to play.2. 7. +. . Bo. with probability . then the maximum of the two probabilities ofmisclassification porT > cJ. derive the UMP test defined by (4.0196 test rejects if.6) or (as in population 1) according to N(I. In Examle 4. two 5's are obtained.2.1. Prove Corollary 4. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times. of and ~o '" I.e. Bk). prove Theorem 4. where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2. Y) known to be distributed either (as in population 0) according to N(O. if . [L(X.imately a N(np" na 2 ) distribution. In Example 4..7). B" . (b) Find the test that is best in this sense for Example 4. Bo. A fonnulation of goodness of tests specifies that a test is best if the max.2. . Find a statistic T( X. . A newly discovered skull has cranial measurements (X. (X.2. Y) and a critical value c such that if we use the classification rule.
. . 274 Problems for Section 4.i . .01. the expected number of arrivals per day. Suppose that if 8 < 8 0 it is not worth keeping the counter open.. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. x > O.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . > 1/>'0.) 3. ... but if the arrival rate is > 15.. Show that if X I. In Example 4. .01 lest to have power at least 0.Xn is a sample from a Weibull distribution with density f(x. . You want to ensure that if the arrival rate is < 10. (c) Suppose 1/>'0 ~ 12. .4. . (a) Show that L~ K: 1/>. • • " 5. i 1 .3. I i . show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function.0)"'/[1.<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function. r p(x. Here c is a known positive constant and A > 0 is the parameter of interest... a model often used (see Barlow and Proschan.Xn be the times in months until failure of n similar pieces of equipment. 1965) is the one where Xl.. that the Xi are a sample from a Poisson distribution with parameter 8. Hint: Show that Xf . Find the sample size needed for a level 0. the probability of your deciding to stay open is < 0.&(>'). Use the normal approximation to the critical value and the probability of rejection.3.<» is the (1 . A) = c AcxC1e>'x .Xn is a sample from a truncated binomial distribution with • I. • I i • 4. Let Xl. hence.01.95 at the alternative value 1/>'1 = 15. Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days. Consider the foregoing situation of Problem 4. I i .1 achieves this? (Use the normal approximation.3 Testing and Confidence Regions Chapter 4 1.<»/2>'0 where X2n(1. . A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. . the probability of your deciding to close is also < 0..(IOn x = 1•. 0) = ( : ) 0'(1. i' i . ]f the equipment is subject to wear. How many days must you observe to ensure that the UMP test of Problem 4.• n. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00.1. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! .3.
then e. imagine a sequence X I.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ .1. . .d.5. 8.1. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a. .. . Find the UMP test.6) is UMP for testing that the pvalues are uniformly distributed.. (a) Express mean income JL in terms of e.O) = cBBx(l+BI. (i) Show that if we model the distribution of Y as C(max{X I ..8 > 80 .. Show how to find critical values. A > O. . (See Problem 4. then P(Y < y) = 1 e  . Let Xl. X 2.B) = Fi!(x). > 1. 0 < B < 1.O<B<1. x> c where 8 > 1 and c > O. For the purpose of modeling. > po. survival times of a sample of patients receiving an experimental treatment.. In Problem 1. 6. (b) NabeyaMiura Alternative. Y n be the Ll.1. Hint: Use the results of Theorem 1.~) = 1. (b) Find the optimal test statistic for testing H : JL = JLo versus K . 00 < a < b < 00.6.[1.6.•• of Li. A > O. we derived the model G(y. and consider the alternative with distribution function F(x.. In the goodnessoffit Example 4. Suppose that each Xi has the Pareto density f(x. . and let YI . (a) Lehmann Alte~p. X N }).Fo(y)  }).).Fo(y)l".2 to find the mean and variance of the optimal test statistic. P(). (Ii) Show that if we model the distribotion of Y as C(min{X I .12. X 2 .1. suppose that Fo(x) has a nonzero density on some interval (a. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b).. 1 ' Y > 0.. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo.) It follows that Fisher's method for cQmbining pvalues (see 4. Let N be a zerotruncated Poisson.10 Problems and Complements 275 then 2."" X n denote the incomes of n persons chosen at random from a certain population. b).XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0.. Assume that Fo has ~ density fo. which is independent of XI. X N e>..  . ~ > O. random variable. where Xl_ a isthe (1 .tive.6. survival times with distribution Fa. against F(u)=u B. . 7.. y > 0. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . .Q)th quantile of the X~n distribution.d.Section 4.
every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t. Show that under the assumptions of Theorem 4. Hint: A Bayes test rejects (accepts) H if J .0) UeB for '. 0) e9Fo (Y) Fo(Y). Oo)d.X)2..1).' L(x.. we test H : {} < 0 versus K : {} > O.2 the class of all Bayes tests is complete.(O) L':x. F(x) = 1 . n announce as your level (1 . 0.1 and 01 loss. the denominator decreasing.  1 j . Show that under the assumptions of Theorem 4. .52.. eo versus K : B = fh where I i I : 11. x > 0..(O)/ 16' I • 00 p(x.. 1 • ..O)d. Oo)d. Show that the test is not UMP. 00 p(x.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi . > Er • .. Problems for Section 4. Let Xi (f)/2)t~ + €i. (a) Show how to construct level (1 . B> O.3.0".. Let Xl. .. F(x) = 1 .X)' = 16. e e at l · 6.exp( _x B). Let Xl. J J • 9.0 e6 1 0~0.' " 2.3. or Weibull. 12.(O) «) L > 1 i The lefthand side equals f6".). where the €i are independent normal random variables with mean 0 and known variance .3. Show that the UMP test is based on the statistic L~ I Fo(Y... I Hint: Consider the class of all Bayes tests of H : () = . n. j 1 To see whether the new treatment is beneficial. 1 . Assume that F o has a density !o(Y).4 1. We want to test whether F is exponential. with distribution function F(x). i = 1.{Oo} = 1 . I .. .2. The numerator is an increasing function of T(x). . /. x > 0. 1 X n be i.j "'" . Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. 1 X n be a sample from a normal population with unknown mean J. What would you ..d.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.i.' (cf. L(x..6t is UMP for testing H : 8 80 versus K : 8 < 80 . 'i .L and unknown variance (T2.. = . . Using a pivot based on 1 (Xi .01.'? = 2. Show that under the assumptions of Theorem 4.{Od varies between 0 and 1. 0. .exp( x). Problem 2. O)d. .2. 10..(O) 6 . 0 = 0.
.Xn be as in Problem 4. Show that if q(X) is a level (1 . with X . Show that if Xl. Use calculus..(2) UCB for q(8). . with N being the smallest integer greater than no and greater than or equal to [Sot".. find a fixed length level (b) ItO < t i < I.4.c.a.no further observations. . < a..02.3 we know that 8 < O. although N is random. n.a) confidence interval for B.) Hint: Use (A. then the shortest level n (1 .Section 4.. .IN(X  = I:[' lX. 5. of the form Ji. then [q(X).al). Suppose that in Example 4.1..a.1. 0.. Suppose we want to select a sample size N such that the interval (4. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .IN] . .~a)/ .3).(2) " .. but we may otherwise choose the t! freely. Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2. < 0.10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 . (a) Justify the interval [ii. ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0.4.X are ij.4.n is obtained by taking at = Q2 = a/2 (assume 172 known). I")/so. Stein's (1945) twostage procedure is the following. t. 0. 6. X+SOt no l (1.1] if 8 > 0.] .1)] if 8 given by (4. It follows that (1 ..d. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4. N(Ji" 17 2) and al + a2 < a.7). .(X) = X .4./N.3).4.a) interval of the fonn [ X . has a 1. minI ii. X + z(1 . there is a shorter interval 7.1) based on n = N observations has length at most l for some preassigned length l = 2d. i = 1.~a) /d]2 .. what values should we use for the t! so as to make our interval as short as possible for given a? 3.n . Then take N .) LCB and q(X) is a level (1.2.~a)!.IN. . Compare yonr result to the n needed for (4.1 Show that. .z(1 . (Define the interval arbitrarily if q > q..01 [X .l. Let Xl. [0.SOt no l (1. X!)/~i 1tl ofB.(al + (2)) confidence interval for q(8). where 8. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji. Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a.1.. distribution.q(X)] is a level (1 .1.
') populations.O)]l < z (1. and.1') has a N(o. " 1 a) confidence . I 1 j Hint: [O(X) < OJ = [. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment.6..  10.95 confidence interval for (J.a) confidence interval for (1/1')./3z (1.: " are known. . 0") distribution and is independent of sn" 8.O).18) to show that sin 1( v'X)±z (1 . 1"2. X n1 and Yll .4. T 2 I I' " I'.a) interval defined by (4.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a). .jn is an approximate level (b) If n = 100 and X = 0. .jn(X . Let XI. find ML estimates of 11. we may very likely be forced to take a prohibitively large number of observations. 1986.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi.~a)].1.. (c) If (12. . (12 2 9. respectively. Hint: Set up an inequality for the length and solve for n...a:) confidence interval of length at most 2d wheri 0'2 is known. (The sticky point of this approach is that we have no control over N. d = i II: . (12. I . (b) Exhibit a level (1 .!a)/ 4. 4(). (J'2 jk) distribution.l4. and the fundamental monograph of Wald..fN(X . X has a N(J1.jn is an approximate level (1  • 12. v. (12) and N(1/.. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. if (J' is large. Show that ~(). in order to have a level (1 . 11.0)/[0(1. 0) and X = 81n. ±. Let 8 ~ B(n.) (lr!d observations are necessary to achieve the aim of part (a).3. exhibit a fixed length level (1 . Let 8 ~ B(n. use the result in part (a) to compute an approximate level 0.a) confidence interval for sin 1 (v'e). (a) Show that in Problem 4.4. sn.0') for Jt of length at most 2d. (1  ~a) / 2. (a) If all parameters are unknown. but we are sure that (12 < (It.051 (c) Suppose that n (12 = 5.001. Show that these two quadruples are each sufficient. 1947. Suppose that it is known that 0 < ~. By Theorem B.~ a) upper and lower bounds.) I ~i .3) are indeed approximate level (1 . (a) Use (A.3.. . Y n2 be two independent samples from N(IJ. . = 0. Show that the endpoints of the approximate level (1 . > z2 (1  is not known exactly.. 0. is _ independent of X no ' Because N depends only on sno' given N = k. Indicate what tables you wdtdd need to calculate the interval.. i (b) What would be the minimum sample size in part (a) if ().. it is necessary to take at least Z2 (1 (12/ d 2 observations. (a) Show that X interval for 8. . Hence. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 .
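Stein's two-stage procedure described in the problems above translates directly into the following sketch (Python, scipy assumed): the first stage fixes s0 and the required total sample size N, and the final interval, which uses the first-stage variance estimate with n0 − 1 degrees of freedom, has length at most 2d.

```python
import math
import numpy as np
from scipy import stats

def stein_first_stage(x0, d, alpha=0.05):
    """First stage: from n0 > 2 observations compute s0 and the total sample size N."""
    x0 = np.asarray(x0, float)
    n0 = len(x0)
    s0 = x0.std(ddof=1)
    t = stats.t.ppf(1 - alpha / 2, n0 - 1)
    # smallest integer greater than n0 and >= (s0 * t / d)^2
    N = max(n0 + 1, math.ceil((s0 * t / d) ** 2))
    return s0, t, N

def stein_interval(x_all, s0, t, N):
    """Final interval Xbar_N +- t_{n0-1}(1 - alpha/2) * s0 / sqrt(N); length <= 2d."""
    xbar = np.mean(np.asarray(x_all, float)[:N])
    half = t * s0 / math.sqrt(N)
    return xbar - half, xbar + half
```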
9 confidence interval for {L. and 10 000. See Problem 5. and X2 = X n _ I (1 . K.5 is (1 _ ~a) 2. 4) are independent.(n .) Hint: (n . (c) Suppose Xi has a X~ distribution. is replaced by its MOM estimate.1 is known.4 = E(Xi .1) t z{1 .2. 15.1.9 confidence interval for cr. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed.' = 2:.) can be approximated by x(" 1) "" (n . Hint: Let t = tn~l (1 . Hint: By B. ~).ia). Compute the (K.~Q).027 13. :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution. and the central limit theorem as given in Appendix A.4.Section 4.J1. give a 95% confidence interval for the true proportion of cures 8 using (al (4.IUI. 100. .30.3.7).<. but assume that J1. Z. cr 2 ) population. Slutsky's theorem.J1.I).)' . Find the limit of the distribution ofn. Xl = Xn_l na). 0).8).3). it equals O. . and (b) (4.4 and the fact that X~ = r (k. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.9 confidence interval for {L x= + cr.3.2 is known as the kurtosis coefficient. In the case where Xi is normal.4. (d) A level 0. See A. then BytheproofofTheoremB. Hint: Use Problem 8.I) + V2( n1) t z( "1) and x(1 . 14.4/0.n(X .9 confidence region for (It. (a) Show that x( "d and x{1. 10.t ([en . find (a) A level 0.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.n} and use this distribution to find an approximate 1 . (In practice.). Show that the confidence coefficient of the rectangle of Example 4.2. Compare them to the approximate interval given in part (a).J1.1) + V2( n .' Now use the central limit theorem.D andn(X{L)2/ cr 2 '" = r (4. .)/01' ~ (J1.3.".4 ) .4.Q confidence interval for cr 2 . Now the result follows from Theorem 8. known) confidence intervals of part (b) when k = I. V(o') can be written as a sum of squares of n 1 independentN(O.)'.1 and.4.1)8'/0') . XI 16.". Assuming that the sample is from a /V({L. Now use the law of large numbers. K. In Example 4. (n _1)8'/0' ~ X~l = r(~(nl). = 3."') . If S "' 8(64. (b) A level 0.' 1 (Xi . 1) random variables. (c) A level 0.2.J1.3.4.)4 < 00 and that'" = Var[(X.3.
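One of the problems above asks for approximate 95% confidence intervals for the cure probability when S = 25 cures are observed among n = 64 patients, using two displays from Section 4.4 that are not reproduced here. As a hedged sketch, the following computes the two standard large-sample candidates, the simple (Wald) interval and the interval obtained by inverting the normal-approximation test (the quadratic, or Wilson, interval); one or both may correspond to the cited displays.

```python
import math
from scipy import stats

def binomial_intervals(S, n, alpha=0.05):
    """Simple (Wald) and quadratic (Wilson) approximate intervals for theta."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = S / n
    # Wald: p_hat +- z * sqrt(p_hat (1 - p_hat) / n)
    half = z * math.sqrt(p * (1 - p) / n)
    wald = (p - half, p + half)
    # Wilson: solve |p_hat - theta| <= z * sqrt(theta (1 - theta) / n) for theta
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    wilson = (center - half, center + half)
    return wald, wilson

print(binomial_intervals(25, 64))   # roughly (0.27, 0.51) and (0.28, 0.51)
```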
F(x)1 Typical choices of a and bare .I Bn(F) = sup vnl1'(x) . Suppose Xl. " • j .1. 1'1(0) < X < 1'.7. (b)ForO<o<b< I. define An(F) = sup {vnl1'(X) .11 . . find a level (I ..6 with . Show that for F continuous where U denotes the uniform. b) for some 00 < a < 0 < b < 00. verify the lower bounary I" given by (4.05 and . 18.F(x)1 is the approximate pivot given in Example 4.F(x)l.9) and the upper boundary I" = 00. That is.F(x)1 )F(x)[1 . Consider Example 4. indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.i.4.4..d.1'(x)] Show that for F continuous. )F(x)[l.~u) by the value U a determined by Pu(An(U) < u) = 1. . . as X and that X has density f(t) = F' (t). . define .F(x)]dx. !i ..3 for deriving a confidence interval for B = F(x).u) confidence interval for 1". X n are i.4. (a) For 0 < a < b < I. I (b) Using Example 4.4. I' .a) confidence interval for F(x). (c) For testing H o : F = Fo with F o continuous.1 (b) V1'(x)[l. pl(O) < X <Fl(b)}. (a) Show that I" • j j j = fa F(x)dx + f. It follows that the binomial confidence intervals for B in Example 4. In this case nF(l') = #[X. In Example 4.4. Assume that f(t) > 0 ifft E (a. • i. 19.6. distribution function.r fixed.1.95.u. U(O. i 1 .I I 280 Testing and Confidence Regions Chapter 4 17.F(x)1 .3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . < xl has a binomial distribotion and ~ vnl1'(x) .l). we want a level (1 .
a. X2) = 1 if and only if Xl + Xi > c. Or . Show that if 8(X) is a level (1 .3.12 and B. 5.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant.2 = VCBs of Example 4. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 .\.4 and 8.Section 4. O ) is an increasing 2 function of + B~.10 Problems and Complements 281 Problems for Section 4. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.2).2n2 distribution. if and only if 8(X) > 00 . How small must an alternative before the size 0.. is of level a for testing H : 0 > 00 .5 1.0. (a) Find c such that Oc of Problem 4. Hint: If 0> 00 . = 0. . ..al) and (1 . Yn2 be independent exponential E( 8) and £(.. [8(X) < 0] :J [8(X) < 00 ].1 has size a for H : 0 (}"2 be < 00 .3. respectively. then the test that accepts.. (a) Deduce from Problem 4. (b) Derive the level (1.1 O? 2.1"2nl. a (b) Show that the test with acceptance region [f(~a) for testing H : Il. . respectively. (15 = 1. N( O . and consider the prob2 + 8~ > 0 when (}"2 is known.(2) together.) samples.a) UCB for 0. (e) Similarly derive the level (1 . for H : (12 > (15. (e) Suppose that n = 16.2).. (d) Show that [Mn .2 that the tests of H : . 3. Let Xl.05. M n /0. test given in part (a) has power 0. = 1 versns K : Il.2 are level 0.4.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . What value of c givessizea? (b) Using Problems B. Let Xl. of l.a) LCB corresponding to Oc of parr (a).13 show that the power (3(0 1 .. < XI? < f(1.5.3.. show that [Y f( ~a) I X. .. Give a 90% confidence interval for the ratio ~ of mean life times.· (a) If 1(0. and let Il. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il.. Experience has shown that the exponential assumption is warranted.th quantile of the . X 2 be independent N( 01. .5. . Hint: Use the results of Problems B. 1/ n J is the shortest such confidence interval.90? 4. ~ 01).) denotes the o.3. Yf (1 . with confidence coefficient 1 .~a) I Xl is a confidence interval for Il.1. = 1 rejected at level a 0.Xn1 and YI .
a)y'n. . k .[XU) (f) Suppose that a = 2(n') < OJ and P. (b) The sign test of H versus K is given by. (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?. Let X l1 .0:) LeB for 8 whatever be f satisfying our conditions. (a) Shnw that C(X) is a level (l .a) confidence region for the parameter ry and con versely that any level (1 .282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). T) where 1] is a parameter of interest and T is a nuisance parameter. respectively. O?.2).Xn be a sample from a population with density f (t . 0.IX(j) < 0 < X(k)] do not depend on 1 nr I.a. 0.0:) confidence interval for 11. . l 7.4. lif (tllXi > OJ) ootherwise.1 when (72 is unknown.8) where () and fare unknown. and 1 is continuous and positive. We are given for each possible value 1]0 of1] a level 0: test o(X. Let C(X) = (ry : o(X. (e) Show directly that P. Thus. O. 'TJo) of the composite hypothesis H : ry = ryo. . . 1 Ct ~r (b) Find the family aftests corresponding to the level (1.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. og are independentN (0. I .  j 1 j . Show that PIX(k) < 0 < X(nk+l)] ~ .. X.2). I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. but I(t) = I( t) for all t. Hint: (c) X. . . X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . of Example 4.. ). Show that the conclusions of (aHf) still hold. L:J ~ ( . we have a location parameter family. ry) = O}. . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . (c) Show that Ok(o) (X.~n + ~z(l . Suppose () = ('T}. (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F.00 . N( 020g. 6.
9. (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X .Y. the quantity ~ = 8(8) . Note that the region is not necessarily an interval or ray. 0) ranges from a to a value no smaller than 1 . 1).z(1 . alll/.a). if 00 < O(S) (S is inconsistent with H : 0 = 00 ). 1) and q(O) the ray (X . laifO>O <I>(z(1 .Section 4.O ~ J(X.5. p. 7)).5. Yare independent and X ~ N(v.pYI < (1 1 otherwise. 12.a) confidence interval [ry(x). Let a(S. 11.a) . (b) Describe the confidence region obtained by inverting the family (J(X.20) if 0 < 0 and.10 Problems and Complements 283 8. Then the level (1 .7.5.2. Hint: You may use the result of Problem 4. p)} as in Problem 4. 1).a))2 if X > z(1 . ' + P2 l' z(1 1 . or). a( S. hence.a. and let () = (ry. 00) is = 0'. .1.5. Thus. let T denote a nuisance parameter.a). Define = vl1J. Y. Suppose X.p) = = 0 ifiX .a) = (b) Show that 0 if X < z(1 .7. Y ~ N(r/. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.3 and let [O(S).2 a) (a) Show that J(X. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'.5.z(l. Let P (p.2. Show that as 0 ranges from O(S) to 8(S). Establish (iii) and (iv) of Example 4.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. 10. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry.1) is unbiased. that suP.2a) confidence interval for 0 of Example 4. 8( S)] be the exact level (1 . Po) is a size a test of H : p = Po. That is. Show that the Student t interval (4. Let X ~ N(O. Y. Let 1] denote a parameter of interest.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4.[q(X) < 0 2 ] = 1.
p)"j. iI . P(xp < x p < xp for all p E (0. .a) LeB for x p whatever be f satisfying our conditions..a. (See Section 3. Confidence Regions for QlIalltiles. X n be a sample from a population with continuous distribution F. = ynIP(xp) . where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl. We can proceed as follows.. Then   P(P(x) < F(x) < P+(x)) for all x E (a. lOOx. X n x*) is a level a test for testing H : x p < x* versus K : x p > x*. Simultaneous Confidence Regions for Quantiles.5.. it is distribution free. k(a) .b) (a) Show that this statement is equivalent to = 1. .. (X(k)l X(n_I») is a level (1 . Show that Jk(X I x'. That is.p).7.1 and Fu 1. . vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p. Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  ..p) versus K' : P(X > 0) > (1 .h(a). (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}.a = P(k < S < n I + 1) = pi(1. F(x). (e) Let S denote a B(n. (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b). Let x p = ~ IFI (p) + Fi! I (P)]. . Let F. That is. 14. L. 1 . Let Xl. Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large.F(x p)].) Suppose that p is specified. Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. I • (c) Let x· be a specified number with 0 < F(x') < 1.4.284 Testing and Confidence Regions Chapter 4 13. .05 could be the fifth percentile of the duration time for a certain disease.6 and 4.4. k+ ' i . i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 . and F+(x) be as in Examples 4.. (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I .95 could be the 95th percentile of the salaries in a certain profession l or lOOx . Show that P(X(k) < x p < X(nl)) = 1. (g) Let F(x) denote the empirical distribution. ! .a. il. be the pth quantile of F. 0 < p < 1. " II [I . Construct the interval using F.p) variable and choose k and I such that 1 . In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. Thus. .a) confidence interval for x p whatever be F satisfying our conditions.1» .
Hint: Let t.) < p} and.(x) = F ~(Fx(x» . ~ ~ (a) Consider the test statistic x ... ..d. L i==I n I IFx (Xi) . (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X. LetFx and F x be the empirical distributions based on the i. .Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}. (c) Show how the statistic An(F) of Problem 4.. VF(p) = ![x p + xlpl. X n and . then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ).17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p .. TbeaitemativeisthatF_x(t) IF(t) forsomet E R. F~x) has the same distribution Show that if F x is continuous and H holds.4.X I.(x)] < Fx(x)] = nF1_u(Fx(x)).r < b.i. Express the band in terms of critical values for An(F) and the order statistics. . (b) Express x p and . X I. that is. Suppose that X has the continuous distribution F.5.u are the empirical distributions of U and 1 . ..1. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. then L i==l n IIXi < X + t.z. where F u and F t .j < F_x(x)] See also Example 4.T: a <.Section 4.U with ~ U(O.1.x.. We will write Fx for F when we need to distinguish it from the distribution F_ x of X. Note the similarity to the interval in Problem 4. 15. F(x) > pl. F f (.13(g) preceding.r p in terms of the critical value of the Kolmogorov statistic and the order statistics.p. where p ~ F(x).. Give a distributionfree level (1 . where A is a placebo. I).xp]: 0 < l' < I}. That is. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R.. the desired confidence region is the band consisting of the collection of intervals {[.r" ~ inf{x: a < x < b. Suppose X denotes the difference between responses after a subject has been given treatments A and B.10 Problems and Complements 285 where x" ~ SUp{. .X n . .
and by solving D( Fx.U ). nFx(x) we set ~ 2. The result now follows from the properties of B(·). where F u and F v are independent U(O.) < Fx(t)] = nFu(Fx(t))..) < do. Fx(t)l· ... for ~. then D(Fx.a) simultaneous confidence band 15. As in Example 1.3... . <aJ Show that if H holds.. 1 Yn be i. = 1 i 16. Show that for given F E F.t::..286 Testing and Confidence Regions ~ Chapter 4 '1.Fy ) = maxIFy(t) 'ER "" .x p .Fy ): 0 < p < 1}. vt(p)) is the band in part (b). where :F is the class of distribution functions with finite support. Properties of this and other bands are given by Doksum. let Xl. (b) Consider the parameter tJ p ( F x. Hint: Define H by H. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ . then  nFy(x + A(X)) = L: I[Y.F x l (l . To test the hypothesis H that the two treatments are equally effective. i I f F.) = D(Fu . Fenstad and Aaberge (1977). Hint: Let A(x) = FyI (Fx(x)) ..l (p) ~ ~[Fxl(p) .Jt: 0 < p < 1) for the curve (5 p (Fx. where do:.x = 2VF(F(x)).. .1) empirical distributions..x.(F(x)) .. is the nth quantile of the distribution of D(Fu l F 1 .".:~ I1IFx(X. .F1 _u).) is a location parameter} of all location parameter values at F. F y ) . Also note that x = H. Let t (c) A Distdbution and ParameterFree Confidence Interval. Give a distributionfree level (1.(x) ~ R. Let 8(. i . F y ). treatment B responses. Fx. i=I n 1 i t . vt = O<p<l sup vt(p) O<p<l where [V.F~x.1..) : :F VF ~ inf vF(p).) < Fx(x)) = nFv(Fx(x)). he a location parameter as defined in Problem 3. then D(Fx .x (': + A(x)). . Then H is symmetric about zero.d. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R. ~ ~ ~ F~x.) < Fy(x p )] = nFv (Fx (t)) under H.:~ 1 1 [Fy(Y. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y . nFy(t) = 2.a) simultaneous confidence band for A(x) = F~l:(Fx(x)) .5. .:~ II[Fx(X.) ~ < F x (")] ~ ~ = nFu(Fx(x)). Hint: nFx(t) = 2.d. It follows that if c. the probability is (1 . "" = Yp . F y ) has the same distribntion as D(Fu . treatment A (placebo) responses and let Y1 . where x p and YP are the pth quan tiles of F x and Fy.i.2A(x) ~ HI(F(x)) l 1 + vF(F(x)). It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H. I I I 1 1 D(Fx. < F y I (Fx(x))] i=I n = L: l[Fy (Y.'" 1 X n be Li..17..a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(.. (p).p)] = ~[Fxl(p) + F~l:(p)]. Moreover. we get a distributionfree level (1 .
Let do denote a size a critical value for D(Fu ..6.. Show that B = (2 L X. moreover X + 6 < Y' < X + 6.·) is a shift parameter. Fy ) .10 Problems and Complements 287 Moreover. ~) distribution.6 1. Show that if 0(·..F y ' (p) = 6.Section 4. Let 6 = minO<1'<16p(Fx. Now apply the axioms. F y ) < O(Fx .4.) I i=l L tt . the probability is (1 . if I' = 1I>'.. It follows that if we set F\~•.2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. then B( Fx . 6+ maxO<p<1 6+(p). and 8 are shift parameters. F y ) is in [6. where :F is the class of distri butions with finite support.(x)) . nFx ("') = nFu(Fx(x)).4. (b) Consider the unbiased estimate of O. t.)I L ti . t Hint: Set Y' = X + t. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval. ~ D(Fu •F v ). Exhibit the UMA level (1 ..6]..).T. Suppose X I. F y.(Fx .0: UCB for J. we find a distributionfree level (1 . Show that for the model of Problem 4. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).) .L.0:) simultaneouS coofidence band for t. (a) Consider the model of Problem 4. . Let6. Show that for given (Fx . T n n = (22:7 I X.E(X). Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('. tt Hint: Both 0 and 8* are nonnally distributed. 6. is called a shift parameter if O(Fx. .= minO<p<l 6. Fx +a) = O(Fx a.. x' 7 R.6+] contains the shift parameter set {O(Fx. 2.2uynz(1 i=! i=l all L i=l n ti is also a level (1.). 3.('...) < do: for Ll. Show that n p is known and 0 O' ~ (2 2~A X. F y ) E F x F.(x) = Fy(x thea D(Fx.a) UCB for O. = 22:7 I Xf/X2n( a) is .0:) confidence bound forO. O(Fx. F y ) > O(Fx " Fy).Xn is a sample from a r (p.(P). thea It a unifonnly most accurate level 1 .3. then Y' = Y. F y ) and b _maxo<p<l 6p(Fx . Fy ). 0 < p < 1.) = F (p) .) ~ + t. F x ) ~ a and YI > Y.) 12:7 I if. ) is a shift parameter} of the values of all shift parameters at (Fx .(. Q. Fv ).(X). B(. (d) Show that E(Y) . F y ). F y. Fy.2.Fy). .a) that the interval [8. then by solving D(F x. where is nnknown. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T. :F x :F O(Fx.    Problems for Section 4. .
) and that. 3.2" Hint: See Sections 8. satisfying the conditions of Problem 8.B')+. J • 6.B). (B" 0') have joint densities. e> 0. E(U) = . .. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t.3 and 4.E(V) are finite. Hint: Apply Problem 4. Suppose [B'. then )" = sB(r(1 . n = 1. Suppose that B' is a uniformly most accurate level (I . uniform. t > e. .2. P(>. Suppose that given. g.2.\. s) distribution with T and s integers. 5. X has a binomial.( 0' Hint: E."" X n are i. . Hint: By Problem B.. . (1: dU) pes. Let U. 1.4.12 so that F. 0). Hint: Use Examples 4.1. Pa( c.6 to V = (B .1 are well defined and strictly increasing. density = B. B). Prove Corollary 4.3. p.. where Show that if (B. 2. and that B has the = se' (t. respectively..B)+. j .f:s F.6.7 1 .2 and B.d. and for (). Construct unifoffilly most accurate level 1 . 0']. s). • . /3( T. U = (8 . . establish that the UMP test has acceptance region (4. . Xl.l2(b).B') < E. In Example 4.B) = J== J<>'= p( s. s > 0. 1I"(t) Xl. distribution with r and s positive integers.' with "> ~. '" J.6.3). Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for. I'.6. Show that if F(x) < G(x) for all x and E(U).aj LCB such that I . ~. (a) Show that if 8 has a beta. then E(U) > E(V). t)dsdt = J<>'= P.1. s). • • • = t) is distributed as W( s.a) confidence intervals such that Po[8' < B' < 0'] < p.B).3.288 Testing and Confidence Regions Chapter 4 4. V be random variables with d.Bj has the F distribution F 2r . U(O. G corresponding to densities j. f3( T.W < B' < OJ for allB' l' B. then E. t) is the joint density of (8.d. .2. Poisson.i. (b) Suppose that given B = B.. 8.l.. (0 . > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I .\ =. .a.6..B(n. Problems for Section 4. where s = 80+ 2n and W ~ X. Suppose that given B Pareto.\ is distributed as V /050. .4.Xn are Ll. distribution and that B has beta.(O . .[B < u < O]du. ·'"1 1 . F1(tjdt.X. i .O] are two level (1 .6 for c fixed. OJ.0 upper and lower confidence bounds for J1 in the model of Problem 4.3.a) upper and lower credible bounds for A. Establish the following result due to Pratt (1961).C. [B..W < BI = 7.
y) is aN(y. r) where 7f(Ll I x . = PI ~ /12 and'T is 7f(Ll.x) is aN(x.x)1f(/1.r 1x.y.d. (a) Give a level (1 . (b) Find level (1 . ".y. ...Section 4.1 )) distribution.0:) prediction interval for Xnt 1..y) is 1f(r 180)1f(/11 1 r.y.s') withe' = max{c. r 1x.2 degrees of freedom..a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.m} and + n.0:) confidence interval for B. N(J. .x)' + E(Yj y)'. (d) Compare the level (1.Xn is observable and X n + 1 is to be predicted. and Y1..a) upper and lower credible bounds for O. as X.. r > O.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2.. Problems for Section 4.r)p(y 1/12. Show that (8 = S IM ~ Tn) ~ Pa(c'. 'P) is the N(x . respectively. In particular consider the credible bounds as n + 00.1. . Show fonnaUy that the posteriof?r(O 1 x.1. ..8 1. y). = So I (m + n  2). Show thatthe posterior distribution 7f( t 1 x.112.Xm. (b) Show that given r..l. y) and that the joint density of. 1 r. Hint: 7f(Ll 1x..Xn ).. . . y) of is (Student) t with m + n ... Hint: p( 0 I x. (c) Give a level (1 . 1f(/11 1r. /11 and /12 are independent in the posterior distribution p(O X.. y) is proportional to p(8)p(x 1/11.. . r( TnI (c) Set s2 + n.Xn + 1 be i. 4.rlm) density and 7f(/12 1r. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll. Xl. .10 Problems and Complements 289 (a)LetM = Sf max{Xj.2. Here Xl. Suppose 8 has the improper priof?r(O) = l/r. Let Xl. where (75 is known.y) = 7f(r I sO)7f(LlI x .0"5). . y) is obtained by integrating out Tin 7f(Ll.r). 'T). T) = (111. . Suppose that given 8 = (ttl' J. r In) density.i. (a) Let So = E(x. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples.
'" .a). )B(r+x+y. Establish (4. Suppose that given (J = 0.8). x > 0. which is not observable. I x) is sometimes called the Polya q(y I x) = J p(y I 0)..3) by doing a frequentist computation of the probability of coverage.. as X where X has the exponential distribution F(x I 0) =I . • ..t for M = 5. 0 > 0. Present the results in a table and a graph.a) prediction interval for X n + l . That is. Un + 1 ordered.·) denotes the beta function.9.10.Q) distribution free lower and upper prediction bounds for X n + 1 • < 00. " u(n+l). (b) If F is N(p" bounds for X n + 1 .• . . has a B(m. Suppose that Y. random variable. i! Problems for Section 4. .9.9 1.a) lower and upper prediction bounds for X n + 1 .d.s+n.8. .Xn are observable and X n + 1 is to be predicted.8. 0'5)' Take 0'5 ~ 7 2 = I. Find the probability that the Bayesian interval covers the true mean j.e. as X .x ) where B(·.8. (a) If F is N(Jl. < u(n+l) denote Ul . suppose Xl. distribution..i.. s). N(Jl.) Hint: First show that . let U(l) < ..5. 5. give level (I . Let X have a binomial. .8..' . X n such that P(Y < Y) > 1. X n+ l are i. and that (J has a beta. . give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. A level (1 . .. . give level 3. In Example 4.5.290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4.'. where Xl. Give a level (1 .05. 0). F. B(n.x .10.Xn are observable and we want to predict Xn+l. . . 00 < a < b (1 .(0 I x)dO.0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl.s+nx+my)/B(r+x... CJ2) with (J2 unknown..i. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . 0) distribution given (J = 8..2) hy using the observation that Un + l is equally likely to be any of the values U(I). 1 2. . ~o = 10. b). n = 100. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I .a (P(Y < Y) > I . Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 . .2n distribution. (3(r.0'5) with 0'5 known.Xn + 1 be i.11.. X is a binomial..nl. distribution.12. :[ = .i.2. ! " • Suppose Xl. B( n.. Let Xl.. 1 X n are i.d. 4..15. . q(ylx)= ( .d. and a = . Suppose XI. Then the level of the frequentist interval is 95%. (This q(y distribution...
. onesample t test is the likelihood ratio test (fo~ Q <S ~). ° 3.X) aD· t=l n > C. In testing H : It < 110 versus K . log . We want to test H . XnI (1 . (ii) CI ~ C2 = n logcl/C2. TwoSided Tests for Scale. of the X~I distribution.3.) . >. 0'2 < O'~ versus K : 0'2 > O'~.Q).3. 2 2 L.. JL > Po show that the onesided. 2. Show that (a) Likelihood ratio tests are of the form: Reject if.(x) Hint: Show that for x < !n.. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. 0'2) sample with both JL and 0'2 unknown. = Xnl (1 .. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B. let X 1l .~Q) also approximately satisfy (i) and (ii) of part (a).Section 4. where Tn is the t statistic. In Problems 24.f. Thus.~Q) approximately satisfy (i) and also (ii) in the sense that the ratio . (i) F(c. OneSided Tests for Scale./(n .10 Problems and Complements 291 is an increasing function of (2x . I .F(CI) ~ 1.(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy.Xn be a N(/L.(Xi . n Cj I L. (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12.. if ii' /175 < 1 and = .. Xn_1 (~a).='7"nlog c l n /C2n CIn . ~2 .. and only if.. = aD nO' 1". (b) Use the nonnal approximatioh to check that CIn C'n n .. · .Q. where F is the d. .log (ii' / (5)1 otherwise. 0' 4.(x) = 0.V2nz(1 .( x) ~ 0.n) and >. Hint: log .~Q) n + V2nz(l.C2n !' 1 as n !' 00. and only if.I)) for Tn > 0. if Tn < and = (n/2) log(1 + T.(nx).(x) = . (c) Deduce that the critical values of the commonly used equaltailed test.
= 0 and €l.lIl (7 2) and N(Jl. n. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124.l..9.3. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0. 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0..P. find the sample size n needed for the level 0.L.2' (12) samples. . Forage production is also measured in pounds per acre.95 confidence interval for equaltailed tests of Problem 4. (a) Find a level 0. 0'2). 100. . • Xi = (JXi where X o I 1 + €i. . where a' is as defined in < ~. respectively. and that T is sufficient for O. . .292 Testing and Confidence Regions Chapter 4 5.4. .. Show that A(X. (a) Show that the MLE of 0 ~ (1'1.05 onesample t test.Xn1 and YI . (b) Consider the problem aftesting H : Jl. Y.. The nonnally distributed random variables Xl. (X.95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~.95 confidence interval for p. .tl > {t2. eo.O)..05. 6. . I Yn2 be two independent N(J. 190. (a) Using the size 0: = 0. €n are independent N(O.90 confidence interval for a by using the pivot 8 2 ja z ..9... a 2 ) random variables. 114. Let Xl.L2 versus K : j. i = 1.. I'· (c) Find a level 0. Suppose X has density p(x. . .. whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre.90 confidence interval for the mean bloC<! pressure J.z . (7 2) is Section 4. 9.1'2.01 test to have power 0. The control measurements (x's) correspond to 0 pounds of mulch per acre. 7.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ . Assume the onesample normal model..1 < J. 110. can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0. e 1 ) depends on X only throngh T. (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power. . 8.. 0 E e. Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T. x y vi I I .
Then. P. Y2 )dY2. (T~). 0) by the following table. has level a and is strictly more 11. X n1 . samples from N(J1l. Then use py.. Suppose that T has a noncentral t. 12. 1 {= xj(kl)ellx+(tVx/k'I'ldx.. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. The F Test for Equality of Scale. with all parameters assumed unknown.(t) = .O) for 00 = (Z..[T > tl is an increasing function of 6. The power functions of one.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if. Fix 0 < a < and <>/[2(1 . ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. From the joint distribution of Z and V. (An example due to C. get the joint distribution of Yl = ZI vVlk and Y2 = V.<» (1 . has density !k . Let Xl. Consider the following model. . and only if. distribution. (b) P. XZ distributions respectively. X powerful whatever be B.. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint.6. 7k. Yl.. for each v > 0... Xo = o. I). (Yl) = f PY"Y. = 0..IITI > tl is an increasing function of 161. . .6. . Show that the noncentral t distribution. Show that.. / L:: i 10. Define the frequency functions p(x. Condition on V and apply the double expectation theorem. 7k. (a) P.<»1 < e < <>. . Let e consist of the point I and the interval [0. . Stein). 13.'" p(X. respectively. Hint: Let Z and V be independent and have N(o.IIZI > tjvlk] is increasing in [61. (Yl. i = 1. II.and twosided t tests.[Z > tVV/k1 is increasing in 6.n.OXi_d 2} i= 1 < Xi < 00.2) In exp{ (I/Z(72) 2)Xi . P.. (Tn.2.10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. Yn2 be two independent N(P.Section 4. .
.. x y 254 2..8. using (4. (Xn . V.4. .19 315 2. note that .nl1 ar is of the fonn: Reject if.~ I .4.V. the continuous versiop of (B. Yn) be a sampl~ from a bivariateN(O.' ! . > C. (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4.0 has the same distribution as R. where l _ . . Let R = S12/SIS" T = 2R/V1 R'.68 250 2. L Yj'. Argue as in Proplem 4. . where a. I ' LX.a/2) or F < f( n/2).Y)'/E(X. Un = Un. can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I . a?. 14.J J . F > /(1 .9.64 298 2. and use Probl~1ll4.. T has a noncentral Tnz distribution with noncentrality parameter p.1 . Sbow.0 has the same dis r j . Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model.4. (11. where f( t) is the tth quantile of the F nz . Let (XI. . 0.1I(a).7 to conclude that tribution as 8 12 /81 8 2 .1)/(n. p) distribution. i 0 in xi !:. Hint: Use the transfonnations and Problem B.4. Because this conditional distribution does not depend on (U21 •. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. . that T is an increasing function of R. Consider the problem of testing H : p = 0 versus K : p i O. . and using the arguments of Problems B. .nt 1 distribution.37 384 2. and only if. . r> (c) Justify the twosided F test: Reject H if. si = L v. F ~ [(nl . !I " . <. 1. .8. . 1 .94 Using the likelihood ratio test for the bivariate nonnal model. Vn ) is a sample from a N(O. .~ I i " 294 Testing and Confidence Regions Chapter 4 . .10.9.61 310 2. Un). ! . as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. p) distribution. 1. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . . .). show that given Uz = Uz ..'. • . 0. i=2 i=2 n n S12 =L i=2 n U. · .1' Sf = L U.'.. . (a) Show that the LR test of H : af = a~ versus K : ai > and only if. .X)' (b) Show that (aUa~)F has an F nz obtained from the :F table.5) that 2 log '(X) V has a distribution. . (1~.4.. (Un. . vn 1 l i 16.12 337 1.7 and B.24) implies that this is also the unconditional distribution.4) and (4. . then P[p > c] is an increasing function of p for fixed c. p) distribution.9. distribution and that critical values can be . 15. Finally.1..62 284 2.1)]E(Y. • I " and (U" V. .96 279 2.71 240 2. YI ).
17. Consider the bioequivalence example in Problem 3.2.9.

(a) Find the level α LR test for testing H : θ ∈ [−η, η] versus K : θ ∉ [−η, η].

(b) Compare your solution to the Bayesian solution based on a continuous loss function given in Problem 3.2.9.

4.11 NOTES

Notes for Section 4.1

(1) The point of view usually taken in science is that of Karl Popper [1968]. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. Rejection is more definitive.

(2) We ignore at this time some real-life inadequacies of this experiment such as the placebo effect (see Example 1.1.3).

Notes for Section 4.3

(1) Such a class is sometimes called essentially complete. The term complete is then reserved for the class where strict inequality in (4.3.3) holds for some θ if φ ∉ D.

(2) The theory of complete and essentially complete families is developed in Wald (1950); see also Ferguson (1967). Essentially, if the parameter space is compact and loss functions are bounded, the class of Bayes procedures is complete. More generally, the closure of the class of Bayes procedures (in a suitable metric) is complete.

Notes for Section 4.4

(1) If the continuity correction discussed in Section A.15 is used here, S in θ(X) would be replaced by S + 1/2 and S in θ̄(X) by S − 1/2.

(2) In using θ(S) as a confidence bound we are using the region [θ(S), ∞). Because the region contains C(X), it also has confidence level (1 − α).

(3) A good approximation (Durbin, 1973; Stephens, 1974) to the critical value is t/(√n − 0.01 + 0.85/√n), where t = 0.819, 0.895, and 1.035 for α = 0.10, 0.05, and 0.01, respectively.

4.12 REFERENCES

BARLOW, R. E., AND F. PROSCHAN, Mathematical Theory of Reliability New York: J. Wiley & Sons, 1965.

BICKEL, P. J., E. HAMMEL, AND J. W. O'CONNELL, "Is there a sex bias in graduate admissions?" Science, 187, 398-404 (1975).

BOX, G. E. P., "Apology for ecumenism in statistics," in Scientific Inference, Data Analysis, and Robustness, G. E. P. Box, T. Leonard, and C. F. Wu, Editors New York: Academic Press, 1983.
BROWN, L., T. CAI, AND A. DASGUPTA, "Interval estimation for a binomial proportion" (2000).

DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).

DOKSUM, K. A., AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).

DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).

FERGUSON, T. S., Mathematical Statistics: A Decision Theoretic Approach New York: Academic Press, 1967.

FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.

HALD, A., Statistical Theory with Engineering Applications New York: J. Wiley & Sons, 1952.

HEDGES, L. V., AND I. OLKIN, Statistical Methods for Meta-Analysis Orlando, FL: Academic Press, 1985.

JEFFREYS, H., The Theory of Probability Oxford: Oxford University Press, 1961.

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

POPPER, K., Conjectures and Refutations; the Growth of Scientific Knowledge New York: Harper and Row, 1968.

PRATT, J., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).

SACKROWITZ, H., AND E. SAMUEL-CAHN, "P values as random variables—Expected P values," The American Statistician, 53, 326-331 (1999).

STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).

STEPHENS, M. A., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).

TATE, R. F., AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).

VAN ZWET, W. R., AND J. OOSTERHOFF, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).

WALD, A., Sequential Analysis New York: Wiley, 1947.

WALD, A., Statistical Decision Functions New York: Wiley, 1950.

WANG, Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).

WELCH, B., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).

WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.

WILKS, S. S., Mathematical Statistics New York: J. Wiley & Sons, 1962.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X₁, ..., Xₙ from a distribution F, our setting for this section and most of this chapter. If we want to estimate μ(F) = E_F X₁ and use X̄ we can write,

MSE_F(X̄) = σ²(F)/n   (5.1.1)

where σ²(F) = Var_F(X₁). This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration. However, consider med(X₁, ..., Xₙ) as an estimate of the population median ν(F) = F⁻¹(1/2). If n is odd and F has density f, we can write

MSE_F(med(X₁, ..., Xₙ)) = ∫ (x − ν(F))² gₙ(x) dx   (5.1.2)

where, if n = 2k + 1,

gₙ(x) = n [(2k)!/(k!)²] Fᵏ(x)(1 − F(x))ᵏ f(x).   (5.1.3)

Evaluation here requires only evaluation of F and a one-dimensional integration, but a different one for each n (Problem 5.1.1). Worse, the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.1.2) and (5.1.3). To go one step further, consider evaluation of the power function of the one-sided t test of Chapter 4. If X₁, ..., Xₙ are i.i.d. N(μ, σ²), we have seen in Section 4.9.2 that √n X̄/S has a noncentral t distribution with parameter μ/σ and n − 1 degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions (Problem 5.1.2) and its qualitative properties are reasonably transparent.
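The noncentral t distribution just mentioned is available in standard numerical libraries, so in the Gaussian model the power function can be computed directly. The following sketch is not part of the text; it assumes Python with scipy is available, and the sample sizes, level, and value of μ/σ are purely illustrative.

    # Sketch (not from the text): power of the one-sided level-alpha t test of
    # H: mu = 0 for i.i.d. N(mu, sigma^2) observations, computed from the
    # noncentral t distribution with n - 1 degrees of freedom and
    # noncentrality sqrt(n) * mu / sigma.
    from scipy.stats import t, nct

    def t_test_power(n, mu_over_sigma, alpha=0.05):
        crit = t.ppf(1 - alpha, df=n - 1)          # t_{n-1}(1 - alpha)
        return nct.sf(crit, df=n - 1, nc=n ** 0.5 * mu_over_sigma)

    for n in (10, 25, 100):
        print(n, t_test_power(n, mu_over_sigma=0.3))

Under the Gaussian model this computation is exact; the point of the asymptotic and Monte Carlo methods developed in this chapter is to obtain comparably simple answers when no such closed form or special function is available.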
But suppose F is not Gaussian. It seems impossible to determine explicitly what happens to the power function because the distribution of √n X̄/S requires the joint distribution of (X̄, S), and in general this is only representable as an n-dimensional integral.

There are two complementary approaches to these difficulties. The first, which occupies us for most of this chapter, is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. The other, which we explore further in later chapters, is to use the Monte Carlo method. In its simplest form, Monte Carlo is described as follows. Draw B independent "samples" of size n, {X₁ⱼ, ..., Xₙⱼ}, 1 ≤ j ≤ B, from F using a random number generator and an explicit form for F. Approximately evaluate Rₙ(F) by

R̄_B = (1/B) Σⱼ₌₁ᴮ l(F, Tₙ(X₁ⱼ, ..., Xₙⱼ)).   (5.1.4)

By the law of large numbers, as B → ∞, R̄_B → Rₙ(F) in probability. Thus, save for the possibility of a very unlikely event, just as in numerical integration, we can approximate Rₙ(F) arbitrarily closely. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asymptotics briefly in Example 5.3.3.

Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or, more specifically, of distributions of statistics, based on observing n i.i.d. observations X₁, ..., Xₙ as n → ∞. We shall see later that the scope of asymptotics is much greater, but for the time being let's stick to this case as we have until now. Asymptotics, in this context, always refers to a sequence of statistics {Tₙ(X₁, ..., Xₙ)}, n ≥ 1, for instance the sequence of means {X̄ₙ}, n ≥ 1, where X̄ₙ = n⁻¹ Σᵢ₌₁ⁿ Xᵢ, or the sequence of medians, or it refers to the sequence of their distributions {L_F(Tₙ(X₁, ..., Xₙ))}, n ≥ 1. Asymptotic statements are always statements about the sequence. The classical examples are,

X̄ₙ → E_F(X₁) in probability, or L_F(√n(X̄ₙ − E_F(X₁))) → N(0, Var_F(X₁)).
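For concreteness, here is a minimal sketch (not part of the text) of the Monte Carlo recipe (5.1.4) applied to the t test discussed above: the rejection probability under a non-Gaussian F is approximated by the proportion of B simulated samples whose t statistic exceeds the Gaussian critical value. The choice of F (a centered χ²₂ distribution), of B, and of the sample sizes are illustrative assumptions only.

    # Sketch (not from the text) of the Monte Carlo recipe (5.1.4): approximate
    # the rejection probability of the one-sided t test when F is not Gaussian,
    # here a centered chi-square_2 distribution, by simulating B samples of
    # size n and recording how often the t statistic exceeds t_{n-1}(0.95).
    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(0)

    def mc_rejection_prob(n, B=100_000, df=2):
        X = rng.chisquare(df, size=(B, n)) - df        # E X_ij = 0 under this F
        Tn = np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)
        return np.mean(Tn > t.ppf(0.95, df=n - 1))     # proportion of rejections

    for n in (10, 30, 100):
        print(n, mc_rejection_prob(n))

By the law of large numbers the Monte Carlo error in such a proportion is of order B^(−1/2), so B = 10⁵ determines the level to within a few thousandths; simulations of exactly this kind are displayed in Section 5.3.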
In theory these limits say nothing about any particular Tₙ(X₁, ..., Xₙ) or L_F(Tₙ(X₁, ..., Xₙ)), but in practice we act as if they do because the Tₙ(X₁, ..., Xₙ) we consider are closely related as functions of n, so that we expect the limit to approximate Tₙ(X₁, ..., Xₙ) or L_F(Tₙ(X₁, ..., Xₙ)) (in an appropriate sense). For instance, the weak law of large numbers tells us that, if E_F|X₁| < ∞, then

X̄ₙ → μ ≡ E_F(X₁) in probability.   (5.1.5)

That is,

P_F[|X̄ₙ − μ| ≥ ε] → 0   (5.1.6)

for all ε > 0. We interpret this as saying that, for n sufficiently large, X̄ₙ is approximately equal to its expectation. The trouble is that for any specified degree of approximation, say ε = .01, (5.1.6) does not tell us how large n has to be for the chance of the approximation not holding to this degree (the left-hand side of (5.1.6)) to fall, say, below .01. Is n ≥ 100 enough or does it have to be n ≥ 100,000? Similarly, the central limit theorem tells us that if E_F X₁² < ∞, then

P_F[√n(X̄ₙ − μ)/σ ≤ z] → Φ(z)   (5.1.7)

where Φ is the standard normal d.f. and σ² ≡ Var_F(X₁). As an approximation this reads

P_F[√n(X̄ₙ − μ)/σ ≤ z] ≈ Φ(z).   (5.1.8)

Again we are faced with the questions of how good the approximation is for given n, z, and P_F. What we in principle prefer are bounds, which are available in the classical situations of (5.1.6) and (5.1.7). For instance, if E_F X₁² < ∞, then by Chebychev's inequality,

P_F[|X̄ₙ − μ| ≥ ε] ≤ σ²/(nε²).   (5.1.9)

As a bound this is typically far too conservative. For instance, if |X₁| ≤ 1, σ² is unknown but σ² = 1 is possible (Problem 5.1.3); for ε = .1 and n = 400 the right-hand side of (5.1.9) is then .25. Because |X₁| ≤ 1, the much more delicate Hoeffding bound (B.9.6) gives

P_F[|X̄ₙ − μ| ≥ ε] ≤ 2 exp{−nε²/2},   (5.1.10)

which for ε = .1 falls below .01 once n exceeds about 1,100, whereas (5.1.9) requires n ≥ 10,000 to reach the same level. Further qualitative features of these bounds and relations to approximation (5.1.8) are given in the problems. Similarly, the celebrated Berry-Esseen bound (A.15.11) states that if E_F|X₁|³ < ∞, then

sup_x |P_F[√n(X̄ₙ − μ)/σ ≤ x] − Φ(x)| ≤ C E_F|X₁|³ / (σ³ √n)   (5.1.11)
asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. In particular.11) is again much too consctvative generally. The estimates B will be consistent.5) and quite generally that risk increases with (1 and decreases with n. . As we shall see. Yet. Although giving us some idea of how much (5.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4.1. behaves like . F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD. Practically one proceeds as follows: (a) Asymptotic approximations are derived.F) > N(O 1) .1. • model.8) is typically much betler than (5.3. even here they are not a very reliable guide. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. Consistency will be pursued in Section 5. and asymptotically normal. If they are simple. 1 i . We now turn to specifics.2 and asymptotic normality via the delta method in Section 5. I . d) = '(II' .1. . The arguments apply to vectorvalued estimates of Euclidean parameters. Asymptotics has another important function beyond suggesting numerical approximations for specific nand F.I '1. i . although the actual djstribution depends on Pp in a complicated way. for any loss function of the form I(F. Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. : i .di) where '(0) = 0. quite generally. (5.1. .(l) The approximation (5.1. (5. "(0) > 0. • c F (y'n[Bn .e(F)]) (1(e.' (0)( (1 / y'n)( v'27i') (Problem 5.1. The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures. As we mentioned.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. It suggests that qualitatively the risk of X n as an estimate of Ji. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i.1.2 deals with consistency of various estimates including maximum likelihood.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j. (5.t and 0"2 in a precise way.12) where (T(O. for all F in the n n .11) suggests. consistency is proved for the estimates of canonical parameters in exponential families. Section 5.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families. Section 5. as we have seen. For instance. which is reasonable. B ~ O(F).8) differs from the truth.
.. e Example 5.2. 1 X n are i. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A.) However.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl.q(8)1 > 'I that yield (5.) forsuP8 P8 lI'in . That is. for all (5.14. and other statistical quantities that are not realistically computable in closed form.2) Bounds b(n. For P this large it is not unifonnly consistent.2 Consistency 301 in exponential families among other results. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X. (See Problem 5.1) and (B. 'in 1] q(8) for all 8. and B. Summary..2.2. .2.7. 5.1. . The stronger statement (5.2. .2. distributions.14. asymptotics are methods of approximating risks. we talk of uniform cornistency over K. case become increasingly valid as the sample size increases.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models.1). Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00. and 8. with all the caveats of Section 5. by the WLLN. where P is the empirical distribution. Section 5. which is called consistency of qn and can be thought of as O'th order asymptotics. A stronger requirement is (5.Section 5. P where P is unknown but EplX11 < 00 then.Xn from Po where 0 E and want to estimate a real or vector q(O). AlS. .d. for .2.Xn ) is that as e n 8 E ~ e. ' . A.lS. and probability bounds. Finally in Section 5.2. We also introduce Monte Carlo methods and discuss the interaction of asymptotics. (5. 00. . Monte Carlo.1. if.2 5. If is replaced by a smaller set K.7. In practice. We will recall relevant definitions from that appendix as we need them. If Xl. remains central to all asymptotic theory.1) where I . The least we can ask of our estimate Qn(X I.2. but we shall use results we need from A. But. > O.5 we examine the asymptotic behavior of Bayes procedures.1).2) is called unifonn consistency. is a consistent estimate of p(P). The simplest example of consistency is that of the mean.14. 1denotes Euclidean distance.i.d.2) are preferable and we shall indicate some of qualitative interest when we can. .7 without further discussion.. in accordance with (A. Most aSymptotic theory we consider leads to approximations that in the i. by quantities that can be so computed. Means.i..
in this case. then by Chebyshev's inequality. Suppose that q : S + RP is continuous.2. P = {P : EpX'f < M < oo}. .2. Other moments of Xl can be consistently estimated in the same way. w. · . l iA) E S be the empirical distribution.. implies Iq(p')q(p) I < <Then Pp[li1n .l : [a. . Evidently. = Proof. i .4) I i (5.b] < R+bedefinedastheinver~eofw.ql > <] < Pp[IPn . sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5..l «) = inf{6 : w(q. the kdimensional simplex. 6) It easily follows (Problem 5.. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5.1) and the result follows. Ip' pi < 6«). by A.• X n be the indicators of binomial trials with P[X I = I] = p. 0) is defined by .2. there exists 6 «) > 0 such that p.b) say. 0 In fact. where q(p) = p(1p). it is uniformly continuous on S.5) I A simple and important result for the case in which X!.2. . .I l 302 instance. Theorem 5. and (Xl.2. If q is continuous w(q. Thus. q(P) is consistent.14. p' E S. w(q.pi > 6«)] But. 6 >0 O.. which is ~ q(ji).3) that > <} (5..1 < j < k..6. w( q.2.6)! Oas6! O.. Because q is continuous and S is compact. w(q.2. . Pn = (iiI. with Xi E X 1 . (5. Letw. Binomial Variance. Suppose thnt P = S = {(Pl. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln. By the weak law of large numbers for all p." is the following: I X n are LLd.1. Pp [IPn  pi > 6] ~ . for all PEP.2.LJ~IPJ = I}.. for every < > 0.p'l < 6}.3) Evidently.p) distribution.6) = sup{lq(p)  q(p')I: Ip . Then qn q(Pn) is a unifonnly consistent estimate of q(p). 0 < p < 1.Pk) : 0 < Pi < 1. consider the plugin estimate p(Iji)ln of the variance of p. 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. Suppose the modulus ofcontinuity of q. where Pi = PIXI = xi]' 1 < j < k.·) is increasing in 0 and has the range [a. and p = X = N In is a uniformly consistent estimate of p. Let X 1. Then N = LXi has a B(n. But further. .. we can go further.
1 . More generally if v(P) = h(E p g(X .2.1. 1 < j < d.2.. EVj2 < 00.U 2.9d) map X onto Y C R d Eolgj(X')1 < 00. 1 < j < d.v} = (u.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B. If Xl. ai.1: D n 00.2.2. Jl2.p).Section 5. (ii) ij is consistent..i. Let TI. let mj(O) '" E09j(X. is consistent for v(P).then which is well defined and continuous at all points of the range of m.2. variances. Vi). Theorem 5. If we let 8 = (JLI. Var(V.) > 0. Suppose P is a canonical exponentialfamily of rank d generated by T.) > 0. Let Xi = (Ui . X n are a sample from PrJ E P. 11.1 and Theorem 2.1. Then. N2(JLI. 0 Here is a general consequence of Proposition 5.2. thus. . ag. and correlation coefficient are all consistent. )1 < oo}. 1 are discussed in Problem 5. and let q(O) = h(m(O)). Let g (91. Var(U.4..JL2.lpl < 1.V2. 'Vi) is the statistic generating this 5parameter exponential family.6. . (i) Plj [The MLE Ti exists] ~ 1.1 that the empirical means. Then.p).6) if Ep[g(X.7. h(D) for all continuous h. where h: Y ~ RP.V.ar. is a consistent estimate of q(O).2. then vIP) h(g). 1 < i < n be i.a~.Jor all 0.. o Example 5.UV) so that E~ I g(Ui . Questions of uniform consistency and consistency when P = { DistribuCorr(Uj. !:.3. . Let g(u. U implies !'. then Ifh=m. ' .)1 < 1 } tions such that EUr < 00. conclude by Proposition 5.. [ and A(·) correspond to P as in Section 1. Suppose [ is open. if h is continuous.d. We may.3. af > o.2.). We need only apply the general weak law of large numbers (for vectors) to conclude that (5. where P is the empirical distribution. = Proof. Variances and Correlations. .2 Consistency 303 Suppose Proposition 5. )) and P = {P : Eplg(X .
{t: It . and the result follows from Proposition 5. I .. II) L p(X n. 5.lI) . Rudin (1987). 7 I . . D( 110 . n . vectors evidently used exponential family properties.2. . = 1 n p.E1). D ..1I 0 ) foreverye I . Let 0 be a minimum contrast estimate that minimizes I . '1 P1)'[n LT(Xi ) E C T) ~ 1. But Tj. I I Proof Recall from Corollary 2.L[p(Xi .2.8) I > O.7) occurs and (i) follows. • .1 that On CT the map 17 A(1]) is II and continuous on E. . i I . see.7) I . II) = where. 0 Hence.2. is a continuous function of a mean of Li.. By definition of the interior of the convex support there exists a ball 8. . . exists iff the event in (5.d.= 1 .. (T(Xl)) must hy Theorem 2.9) n and Then (} is consistent.1 belong to the interior of the convex support hecause the equation A(1)) = to. Note that.lI o): 111110 1 > e} > D(1I0 .3. I J Theorem 5. X n be Li.1 to Theorem 2. The argument of the the previous subsection in which a minimum contrast estimate.3. .  inf{D(II. By a classical result.2. z=l 1 n i.2..D(1I0 . I ! . Let Xl.2.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6.d..1 that i)(Xj. T(Xdl < 6} C CT' By the law of large numbers. . i=l 1 n (5.Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn. i i • • ! . j 1 Pn(X. We showed in Theorem 2. II) 0 j =EII.. for example. Suppose 1 n PII sup{l..3. as usual.1I)11: II E 6} ~'O (5. the inverse AI: A(e) ~ e is continuous on 8.304 Asymptotic Approximations Chapter 5 . where to = A(1)o) = E1)o T(X 1 ). T(Xl).2. E1)..LT(Xi ). A more general argument is given in the following simple theorem whose conditions are hard to check. which solves 1 n A(1)) = . the MLE. .j.3.1.. ..2 Consistency of Minimum Contrast Estimates .3. if 1)0 is true. Po.. BEe c Rd. (5. 1 i . n LT(Xi ) i=I 21 En. is solved by 110.
2 Consistency 305 proof Note that. IIJ . An alternative condition that is readily seen to work more widely is the replacement of (5. /fe e =   Proof Note that for some < > 0.2. and (5. ifIJ is the MLE.8) and (5.L(p(Xi.2. [I ~ L:~ 1 (p(Xi . But for ( > 0 let o ~ ~ inf{D(IJ.. IJo)) . IIJ .2. Ie ¥ OJ] ~ Ojoroll j. EIJollogp(XI. for all 0 > 0.IJ) p(Xi.IJO)) ' IIJ IJol > <} n i=l .11) sup{l.2. PIJ. IJ) n ~ i=l p(X"IJ o)] .8).8) by (i) For ail compact K sup In { c e. IJ) = logp(x. 2 0 (5. o Then (5.8) can often failsee Problem 5. IJ)I : IJ K } PIJ !l o.1 we need only check that (5. (5. IJ)II ' IJ n i=l E e} > . is finite.5.2. IJ). IJ) . But because e is finite.2.2.Section 5.2. . (5. _ I) 1 n PIJ o [16 . A simple and important special case is given by the following. E n ~ [p(Xi .10) By hypothesis. IJ j ) ..2. 1 n PIJo[inf{.IJ)1 < 00 and the parameterization is identifiable.IJd ).2.D( IJ o.IJol > E] < PIJ [inf{ .2. IJ j )) : 1 <j < d} > <] < d maxi PIJ.2. IJ) .D(IJo. Coronary 5.2.1.9) follows from Shannon's lemma. then.11) implies that 1 n ~ 0 (5."'[p(X i .13) By Shannon's Lemma 2.9) hold for p(x.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1.IIIJ  IJol > c] (5. IJj ) .IJol > <} < 0] (5.L[p(Xi . . PIJ.D(IJo.8) follows from the WLLN and PIJ.2.D(IJo. IJj))1 > <I : 1 < j < d} ~ O.[max{[ ~ L:~ 1 (p(Xi .11) implies that the righthand side of (5.D(IJ o. IJ) .IJor > <} < 0] because the event in (5.2. 0) .IJ o)  D(IJo. {IJ" .14) (ii) For some compact K PIJ. [inf c e.10) tends to O.2.lJ o): IIJ IJol > oj.inf{D(IJo.2. 0 Condition (5.[IJ ¥ OJ] = PIJ.2.12) which has probability tending to 0 by (5.
We denote the jth derivative of h by 00 . If fin is an estimate of B(P).1. Unifonn consistency for l' requires more. We introduce the minimal property we require of any estimate (strictly speak ing.8) and (5.33.1) where • . that sup{PIiBn .. see Problem 5. + Rm (5.Xn be i. . As usual let Xl. m Mi) and assume (b) IIh(~)lloo j 1 .306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems. ~ 5. . =suPx 1k<~)(x)1 <M < . in Section 5. then = h(ll) + L (j)( ) j=l h .3 FIRST.AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued. and assume (i) (a) h is m times differentiable on R. Summary. > 2.) = <7 2 Eh(X) We have the following.1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders.d. VariX. If (i) and (ii) hold. X valued and for the moment take X = R. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. J. . let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn.i. 5.3. When the observations are independent but not identically distributed.. D . consistency of the MLE may fail if the number of parameters tends to infinity. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean. Unfortunately checking conditions such as (5. sequence of estimates) consistency. .14) is in general difficult.Il E(X Il).3. Sufficient conditions are explored in the problems. Let h : R ~ R. we require that On ~ B(P) as n ~ 00.2.B(P)I > Ej : PEP} t 0 for all € > O..3. I ~.2. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families.3.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. We then sketch the extension to functions of vector means. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. ~l Theorem 5.
4) Note that for j eveo.3. for j < n/2. . then (a) But E(Xi . Lemma 5. . tr i/o>2 all k where tl .3 First. The expression in (c) is. . 'Zr J . Let I' = E(X. .and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion.3. . .3) and j odd is given in Problem 5.3. In!.3) (5. rl l:: . We give the proof of (5. .3. .... (n . then there are constants C j > 0 and D j > 0 such (5. bounded by (d) < t.. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r. and the following lemma. .3) for j even. .3. If EjX Ilj < 00..1) .3. ....'1 +..2) where IX' that 1'1 < IX .3.) ~ 0.. :i1 + . The more difficult argument needed for (5. +'r=j J .Section 5.3. .[jj2] + 1) where Cj = 1 <r.. X ij ) least twice. . j > 2. + i r = j. C [~]! n(n .. EIX I'li = E(X _1')' Proof. t r 1 and [t] denotes the greatest integer n.ij } • sup IE(Xi .5[jJ2J max {l:: { ... Moreover.4) for all j and (5.2. (b) a unless each integer that appears among {i I . .3.5.'" .i j appears at by Problem 5. _ . (5. tI.1'1.. Xij)1 = Elxdj il.. ik > 2. 21.1..t" = t1 .1 < k < r}}.
3.3. (n li/2] + 1) < nIJf'jj and (c). give approximations to the bias of h(X) as an estimate of h(J.1 with m = 3.l) + [h(1)]2(J.1.l)h(1)(J.l) + {h(21(J.6.. • .1')2 = 0 2 In.3.3. n (b) Next.l) + 00 2n I' + G(n3f2).3. (5.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) .l) 6 + 2h(J.. if I' = O. Proof (a) Write 00.1')3 = G(n 2) by (5. 0 1 I . Then Rm = G(n 2) and also E(X . (d). I (b) Ifllh(j11l= < 00.3.1. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00. Because E(X .1') + {h(2)(J.4).1..1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.3.3. I 1 . 00 and 11hC 41 11= < then G(n.3f2 ).3. I . EIX1 .1'11.5) (b) if E(Xt) < G(n.llJ' + G(n3f2) (5. I .3f') in (5. .2 ).l) + [h(llj'(1')}':. then h(')( ) 2 0 = h(J.ll j < 2j EIXd j and the lemma follows. I I I Corollary 5. 1 iln . then G(n. respectively.3. < 3 and EIXd 3 < 00.3.3. i • r' Var h(X) = 02[h(11(J. then I • I.3f2 ) in (5.3. Corollary 5..3) for j even.5) apply Theorem 5.5) can be replaced by Proof For (5.2.3.l)E(X . apply Theorem 5. If the conditions of (b) hold.l)}E(X . using Corollary 5. .l)h(J.6) can be 1 Eh 2 (X) = h'(J.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .. .4) for j odd and (5.3.. 0 The two most important corollaries of Theorem 5.J. (5.3. In general by considering Xi . if 1 1 <j (a) Ilh(J)II= < 00. and (e) applied to (a) imply (5. By Problem 5.6) n .5) follows.l) and its variance and MSE. and EXt < replaced by G(n').l)h(J.+ O(n.1 with m = 4. 1 < j < 3.
l ). as the plugin estimate of the parameter h(Jl) then. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.3.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.exp( 2It). If heX) is viewed.1. {h(l) (/1)h(ZI U')/13 (5.3.and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5.3.3.exp( 2') c(') = h(/1).h(/1)) h(2~(II):: + O(n. (1Z 5.3.3. Bias and Variance of the MLE of the Binomial VanOance. We will compare E(h(X)) and Var h(X) with their approximations.i.3..2 ) 2e.3. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +.t.1 with bounds on the remainders. . the bias of h(X) defined by Eh(X) .l / Z) unless h(l)(/1) = O. (5. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).3. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5.2.2 (Problem (5. then we may be interested in the warranty failure probability (5.3.h(/1) is G(n.i.10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 . 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5. where /1. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5.3 First.6)./1)3 = ~~.1.Z) (5... which is G(n.5). Note an important qualitative feature revealed by these approximations. X n are i. as we nonnally would.8) because h(ZI (t) = 4(r· 3 . T!Jus.= E"X I ~ 11 A. .11) Example 5.4 ) exp( 2/t). when hit) = t(1 . then heX) is the MLE of 1 .t) and X. and. Example 5. which is neglible compared to the standard deviation of h(X).Section 5.3.3. Thus.(h(X) .3.4) E(X .7) If h(t) ~ 1 .. Bias.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n. the MLE of ' is X . If X [.3(1_ ')In + G(n.d.3.Z ".(h(X)) E.Z. by Corollary 5.3. for large n.3.  . by Corollary 5.1) = 11.
E(X') ~ p . the error of approximation is 1 1 + .3. Theorem 5. D 1 i I .[Var(X) + (E(X»2] I nI ~ p(1 . 1 < j < { Xl .2p)2p(1 ..p(1 .p)'} + R~ p(l .' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. • ~ O(n.10) is in a situation in which the approximation can be checked..P )} (n_l)2 n nl n Because 1'3 = p(l.p) = p ( l .10) yields .2p) n n +21"(1 . < m.2p)')} + R~. . + id = m. (5.3.p)] n I " .2.3. • .2p(1 .2t..5) yields E(h(X))  ~ p(1 .310 Asymptotic Approximations Chapter 5 B(l.{2(1.p) .pl.3. .2(1 .p) + .9d(Xi )f.j .' .p) n .p) .p) {(I _ 2p)2 n Thus... First calculate Eh(X) = E(X) . n n Because MII(t) = I . and will illustrate how accurate (5. a Xd d} .p)J n p(1 ~ p) [I . . .p).p)(I. . Let h : R d t R.5) is exact as it should be.!c[2p(1 n p) .3 ).P) {(1_2 P)2+ 2P(I. ~ Varh(X) = (1. R'n p(1 ~ p) [(I . . (5. asSume that h has continuous partial derivatives of order up to m. " ! .e x ) : i l + .amh i.p)(I.6p(l.2p)p(l.. 1 and in this case (5. D " ~ I . 0 < i. II'. and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i.2p). M2) ~ 2. I . Next compute Varh(X)=p(I.2p)' .p(1 . Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).3.
3.hUll)~. The results in the next subsection go much further and "explain" the fonn of the approximations we already have. E xl < 00 and h is differentiable at (5. Theorem 5.3. Y 12 ) 3/2). 1.3..4.Section 5. (J1.13) Moreover.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . 5. The most interesting application. We get. B.) Var(Y. 0/ The result follows from the more generally usefullernma.))) ~ N(O. if EIY.) + 2 g:'.3.3.1.13).).3. then Eh(Y) This is a consequence of Taylor's expansion in d variables. + ~ ~:l (J1.12) can be replaced by O(n.)} + O(n (5.I' < 00.h(!.8. Similarly.) + ~ {~~:~ (J1.".2. (5. The proof is outlined in Problem 5. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00. (7'(h)) where and (7' = VariX. .»)'var(Y'2)] +O(n') Approximations (5. and (5.C( v'n(h(X) .3).).3. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi. (J1.12) Var h(Y) ~ [(:.5).3..14) + (g:.3.2 The Delta Method for In law Approximations = As usual we begin with d !. is to m = 3. by (5.'. Suppose thot X = R.) )' Var(Y.x. then O(n.) Cov(Y".) Var(Y. and J. 1 < j < d where Y iJ I = 9j(Xi ).. (5. as for the case d = 1. for d = 2.3. under appropriate conditions (Problem 5.3.6).1 Then.3.2 ).(J1.11.3.. and the appropriate generalization of Lemma 5.~ (J1. h : R ~ R.3.3 / 2 ) in (5.) + a..3 First. EI Yl13 < 00 Eh(Y) h(J1. Y12 ) (5.) gx~ (J1. = EY 1 .3.) Cov(Y". Lemma 5.15) Then ..3.~E(X.and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.
by hypothesis.3. Formally we expect that if Vn !:.• ' " and.3. j .ul Note that (i) (b) '* . V and the result follows. u = /}" j \ V .312 Asymptotic Approximations Chapter 5 (i) an(Un .15) "explains" Lemma 5.7. ·i Note that (5. for every Ii > 0 (d) j.i. The theorem follows from the central limit theorem letting Un X.u) !:. an(Un .3. By definition of the derivative.17) .g(ll(U)(v  u)1 < <Iv .1.ul < Ii '* Ig(v)  g(u) . • • . .16) Proof. from (a). 'j But (e) implies (f) from (b).N(O.3. I (g) I ~: . (5. ~: . hence. I . V . . . Thus.N(O. .32 and B. = I 0 . for every € > 0 there exists a 8 > 0 such that (a) Iv . ~ EVj (although this need not be true. V for some constant u. Therefore. (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). (e) Using (e). V.3. Consider 1 Vn = v'n(X !") !:. we expect (5.a 2 ). 0 .u)!:. see Problems 5.8).. _ F 7 . an = n l / 2. ( 2 ). for every (e) > O. PIIUn €  ul < iii ~ 1 • "'. then EV. . j " But.
Section 5. and the foregoing arguments.X) n 2 .18) In particular this implies not only that t n.9. 1 2 a P by Theorem 5..+A)al + Aa~) . But if j is even. ••• .. N n (0 Aal.) and a~ = Var(Y. FE :F where EF(X I ) = '".l distribution. v) = u!v./2 ). In general we claim that if F E F and H is true.TiS X where S 2 = 1 nl L (X. (a) The OneSample Case. 1'2 = E(Y. Yn2 be two independent samples with 1'1 = E(X I ). and S2 _ n nl ( ~ X) 2) n~(X.a) for Tn from the Tn .i.t2 versus K : 112 > 111. VarF(Xd = a 2 < 00. then S t:.'" .3 First.'" 1 X n1 and Y1 .17) yields O(n J. we find (Problem 5. Then (5.28) that if nI/n _ A. Let Xl.d. A statistic for testing the hypothesis H : 11. .18) because Tn = Un!(sn!a) = g(Un .X" be i.3. 1). Using the central limit theorem.3. where g(u. Let Xl. (1  A)a~ . al = Var(X.N(O.a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian. (1 .1 (Ia) + Zla butthat the t n.O < A < 1. else EZJ ~ = O. Slutsky's theorem. For the proof note that by the central limit theorem. 1).2 and Slutsky's theorem.3.1 (1 .3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~.s Tn =. then Tn . i=l If:F = {Gaussian distributions}. Now Slutsky's theorem yields (5.j odd.).. sn!a).= 0 versuS K : 11. EZJ > 0.2. In Example 4. (b) The TwoSample Case. "t" Statistics.3.> 0 . Consider testing H: /11 = j. £ (5. Example 5. we can obtain the critical value t n.l (1.3.3.). j even = o(nJ/').and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.
10.5 . One sample: 10000 Simulations. oL. 0) for Sn is Monte Carlo Simulation As mentioned in Section 5.. Figure 5.'.. i 2(1 approximately correct if H is true and the X's and Y's are not normal. and in this case the t n .. ur .000 onesample t tests using X~ data. .1 (0.2 or of a~. Other distributions should also be tried. 32.5 2 2.5 .3...c:'::c~:____:_J 0.95) approximation is only good for n > 10 2 . ..1 shows that for the onesample t test. . or 50.. and the true distribution F is X~ with d > 10. The simulations are repeated for different sample sizes and the observed significance levels are plotted. Each plotted point represents the results of 10.. i I . then the critical value t.5 1 1..3.I = u~ and nl = n2.... 316. " .5 3 Log10 sample size j i Figure 5.3. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table. the asymptotic result gives a good approximation when n > 10 1. where d is either 2.05. as indicated in the plot. Figure 5... approximations based on asymptotic results should be checked by Monte Carlo simulations. The X~ distribution is extremely skew. X~.02 .. when 0: = 0. . 20.1. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11.1.. Y .2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X. the For the twosample t tests.. . Chisquare data I 0.
) i O.hoi n s[h(l)(X)1 . Y .02 d'::'. }.. ChiSquare Dala.2.' 0.3.5 3 log10 sample size Figure 5.5 1 1. Other Monte Carlo runs (not shown) with I=.3. even when the X's and Y's have different X~ distributions..h(JL)] ". 10000 Simulations.L) where h is con tinuously differentiable at !'.3. _ y'ii:[h(X) .a~. in the onesample situation. For each simulation the two samples are the same size (the size indicated on the xaxis).and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because. y'ii:[h(X) . Equal Variances 0.0:) approximation is good when nl I=. the t n 2(1 .3.0' H<I o 0.a~ show that as long as nl = n2. and Yi .000 twosample t tests._' 0.12 0. D Next. 1 (ri .3.X = ~. Each plotted point represents the results of 10. and a~ = 12af.o'[h(l)(JLJf).n2 and at I=.95) approximation is good for nl > 100.9. af = a~. To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T. By Theoretn 5.(})) do not have approximate level 0.Xi have a symmetric distribution.Xd. when both nl I=.n2 and at = a').5 2 2. In this case Monte Carlo studies have shown that the test in Section 4. as we see from the limiting law of Sn and Figure 5. 10. let h(X) be an estimate of h(J.4 based on Welch's approximation works well. N(O. However.cCc":'::~___:. and the data are X~ where d is one of 2.Section 5_3 First. in this case. scaled to have the same means. or 50. the t n _2(O. then the twosample t tests with critical region 1 {S'n > t n . Moreover.ll)(!.2 (1 . 2:7 ar Two sample.
£l __ __ 0. +2 + 25 J O'..3. For each simulation the two samples differ in size: The second sample is two times the size of the first. From (5.{x)() twosample t tests.02""~". .3. Poisson. 1) so that ZI_Q is the asymptotic critical value.12 .6) and . If we take a sample from a member of one of these families. we see that here.oj:::  6K . 2nd sample 2x bigger 0. such as the binomial. Variance Stabilizing Transfonnations Example 5. as indicated in the plot. Each plotted point represents the results of IO. o. In Appendices A and B we encounter several important families of distributions. +___ 9. " 0.316 Asymptotic Approximations Chapter 5 Two Sample. The data in the first sample are N(O. which are indexed by one or more parameters. Combining Theorem 5.. We have seen that smooth transformations heX) are also approximately normally distributed.10000 Simulations: Gaussian Data. Tn . and 9.c:~c:_~___:J 0.6. called variance stabilizing. gamma.. too.0.I _ __ 0 I +__ I :::.. if H is true ..3. I: . then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered.3. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1.3. and beta. N(O.3..4.J~ f). such that Var heX) is approximately independent of the parameters indexing the family we are considering. The xaxis denotes the size of the smaller of the two samples. '! 1 . It turns out to be useful to know transformations h.5 1 1.5 Log10 (smaller sample size) l Figure 5. Unequal Variances.".~'ri i _ . 0.' OIlS ..3 and Slutsky's theorem. ..£.
Major examples are the generalized linear models of Section 6. . As an example.3 First. 538) one can improve on .. Substituting in (5.i. I X n is a sample from a P(A) family. The notion of such transformations can be extended to the following situation.16. a variance stabilizing transformation h is such that Vi'(h('Y) .19) is an ordinary differential equation. If we require that h is increasing.3. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models.)C. 1n (X I.' .3. this leads to h(l)(A) = VC/J>.15 and 5. See Problems 5. .1 0 ) Vi' ' is an approximate 1.. c) (5. 0 One application of variance stabilizing transformations. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II.. p. .3... See Example 5. In this case a'2 = A and Var(X) = A/n. A > 0. finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family.6. Suppose further that Then again.hb)) + N(o. h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O.\ + d.. In this case (5. .. by their definition.. 1976.3. Some further examples of variance stabilizing transformations are given in the problems. in the preceding P( A) case. .Ct confidence interval for J>. yX± r5 2z(1 . Also closely related but different are socaBed normalizing transformations. Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X.3.d. Under general conditions (Bhattacharya and Rao.5.Xn are an i. Such a function can usually be found if (J depends only on fJ.6) we find Var(X)' ~ 1/4n and Vi'((X) .and HigherOrder Asymptotics: The Delta Method with Applications 317 (5. X n) is an estimate of a real parameter! indexing a family of distributions from which Xl.Section 5. which has as its solution h(A) = 2. Suppose.13) we see that a first approximation to the variance of h( X) is a' [h(l) (/. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Thus. 1/4) distribution. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions. To have Varh(X) approximately constant in A. Thus. where d is arbitrary. which varies freely..(A)') has approximately aN(O.3.19) for all '/. Thus. sample.)] 2/ n . suppose that Xl.3.
3.0032 ·1.38 0. It follows from the central limit theorem that Tn = (2::7 I Xi n)/V2ri = (V .n. where Tn is a standardized random variable.0005 0 0.77 0. Edgeworth(2) and nonna! approximations EA and NA to the X~o distribution.7792 5.ססOO I .n)' _ 3 (2n)2 = 12 n 1 I .3000 0.9750 0. V has the same distribution as E~ 1 where the Xi are independent and Xi N(O. Exact to' EA NA {.1964 1. Suppose V rv X~.2024 EA NA x 0.5421 0.95 3.5.0254 0.4999 0.3. 0.9950 0. H 3 (x) = x3  3x.75 0.0655 O.9984 0.9996 1.1 gives this approximation together with the exact distribution and the nonnal approximation when n = 10./2 2 .ססoo 0.9995 0.2000 0.79 0.3.8000 0.9029 0.9724 0.6548 4.JL) / a and let lIn and 1'211 denote the coefficient of skewness and kurtosis of Tn.4000 0. • " .1.20) is called the Edgeworth expansion for Fn .9943 0.3513 0.n)1 V2ri has approximately aN(O. Then under some conditionsY) where Tn tends to zero at a rate faster than lin and H 2 • H 3 • and H s are Hermite polynomials defined by H 2 (x) ~ x 2  I.0105 0.1000 0.15 0.2706 0.8008 0.7000 0.0010 ~: EA NA x Exact .3x) + _(xS .9506 0.9999 1. ~ 2.3.II 0.9500 'll .35 0.3.0100 0.91 0. xro I • I I .40 0.0877 0.318 Asymptotic Approximations Chapter 5 the normal approximation by utilizing the third and fourth moments. I' 0..!.0500 ! ! . ii.66 0.ססoo 0. Fn(x) ~ <!>(x) .86 0.I . .0287 0. . 0.2. 1) distribution. TABLE 5.. Edgeworth Approximations to the X 2 Distribution.6999 0.3006 0.9999 1.4 to compute Xl.' •• .<p(x) ' .3. " I.9905 0.9900 0.0553 0.95 :' 1..72 .0397 0. (5. i = 1.9684 0.9876 0..5999 0.21) The expansion (5.9097 o." • [3 n .04 1. Example 5..0050 0. Let F'1 denote the distribution of Tn = vn( X .34 2.0284 1.51 1.1) + 2 (x 3 . 1).1051 0. To improve on this approximation.6000 0.1254 0. we need only compute lIn and 1'2n. Hs(x) = xS  IOx 3 + 15x.61 0.1.IOx 3 + 15x)] I I y'n(X n 9n ! I i !. 4 0. Table 5.85 0 0.38 0. P(Tn < x).9990 0.5000 0.9997 4.15 0. "J 1 '~ • 'YIn = E(Vn? (2n)1 E(V .0250 0.ססoo 1. .4415 0. Therefore. We can use Problem B. 0 x Exacl 2..40 0.0001 0 0. According to Theorem 8.0481 1.0208 ~1.4000 0.
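The comparison in Table 5.3.1 can be reproduced directly from the expansion (5.3.20) applied to T_n = (V - n)/sqrt(2n) with gamma_{1n} = sqrt(8/n) and gamma_{2n} = 12/n. A minimal sketch assuming numpy and scipy (the grid of x values is an arbitrary choice) comparing the exact, Edgeworth, and normal approximations for n = 10:

```python
# Two-term Edgeworth vs. normal approximation for T_n = (V - n)/sqrt(2n), V ~ chi^2_n.
import numpy as np
from scipy import stats

def edgeworth_cdf(x, n):
    g1 = np.sqrt(8.0 / n)          # skewness of T_n
    g2 = 12.0 / n                  # kurtosis of T_n
    corr = (g1 * (x**2 - 1) / 6
            + g2 * (x**3 - 3 * x) / 24
            + g1**2 * (x**5 - 10 * x**3 + 15 * x) / 72)
    return stats.norm.cdf(x) - stats.norm.pdf(x) * corr

n = 10
for x in (-1.0, 0.0, 1.0, 2.0):
    exact = stats.chi2.cdf(n + x * np.sqrt(2 * n), df=n)   # P(T_n <= x) exactly
    print(x, exact, edgeworth_cdf(x, n), stats.norm.cdf(x))
```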
.3 and (B. Let p2 = Cov 2(X. with  r?) is asymptotically nonnal. 1.6) that vn(r 2 N(O.l = Cov(XkyJ.a~) : n 3 !' R.o}.p)..n1EY/) ° ~ ai ~ and u = (p.2 extends to the dvariate case. Example 5.J11)/a.0.u) ~ N(O.J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0.Section 5. It follows from Lemma 5.0 .2.2 >'2.2rJ >".).1.2.2 + p'A2.3 First.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl. Y) where 0 < EX 4 < 00. Ui/U~U3. where g(UI > U2.1./U2U3.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al.0.n lEX. (X n .1) jointly have the same asymptotic distribution as vn(U n . a~ = Var Y.0.1) and vn(i7~ . U3) = Ui/U2U3. as (X.0 2 (5.2. The proof follows from the arguments of the proof of Lemma 5.1.2. = Var(Xkyj) and Ak.u) where Un = (n1EXiYi.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5.3. = ai = 1. >'2.3.m.6.u) ~~ V dx1 forsome d xl vector ofconstants u._p2).= (Xi . Yn ) be i. xmyl).3.0. we can show (Problem 5.. 4f.3. Let (X" Y.0 TO .ar. We can write r 2 = g(C. we can . . Using the central limit and Slutsky's theorems.2. UillL2U~) = (2p. E). (ii) g.0..2.I. 0< Ey4 < 00. Because of the location and scale invariance of p and r. aiD.3. Let central limit theorem T'f. R d !' RP has a differential g~~d(U) at u.3.i. (i) un(U n ..5. Y)/(T~ai where = Var X. and let r 2 = C2 /(j3<.0 >'1.3. P = E(XY).22) 2 g(l)(u) = (2U.2 T20 .9) that vn(C .1. 1). Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl. then by the 2 vn(U . vn(i7f . _p2. Then Proof.d.3.0 >'1.2. >'1. and Yj = (Y .0. Lemma 5.j. .ij~ where ar Recall from Section 4.9. E Next we compute = T11 .2 >'1.1. .
p) !:.Y) ~ N(/li.9) Refening to (5.3.3[h(c) . per < c) '" <I>(vn . j f· ~. (5.p). ': I = Ilgki(rn)11 pxd. . Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd. that is. then u5 .8.10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with . and (Prohlem 5. (5..u 2.1). o Here is an extension of Theorem 5.UI..3. . N(O. which is called Fisher's z. E) = . Argue as before using B. . it gives approximations to the power of these tests.a)% confidence interval of fixed length. Theorem 5.fii(Y . N(O.~a)/vn3} where tanh is the hyperbolic tangent. . 1938) that £(vn .3. .g. (1 _ p2)2).3. c E (1.mil £ ~ hey) = h(rn) and (b) so that .320 When (X.4.fii(r . EY 1 = rn. p=tanh{h(r)±z (1.3(h(r) . Then ~. I Y n are independent identically distributed d vectors with ElY Ii 2 < 00. .l) distribution. .1'2.P The approximation based on this transfonnation. ! f This expression provides approximations to the critical value of tests of H : p = O.: ! + h(i)(rn)(y .24) I Proof.3.h= (h i . . .19).23) • . David. Suppose Y 1.3. we see (Problem 5.h(p)) is closely approximated by the NCO.rn) + o(ly . 2 1.5 (a) ! .1) is achieved hy choosing h(P)=!log(1+ P ). .3. and it provides the approximate 100(1 .fii[h(r) . has been studied extensively and it has been shown (e.3.rn) N(o.hp )andhhasatotaldifferentialh(1)(rn) f • . Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'.. .h(p)]).h(p)J !:. .
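Fisher's z interval is simple to compute. The sketch below (assuming numpy and scipy; rho, n, the seed, and the replication count are arbitrary illustrative choices) estimates the coverage of rho = tanh(h(r) +/- z_{1-alpha/2}/sqrt(n - 3)) for bivariate normal samples.

```python
# Coverage of the Fisher z confidence interval for the correlation coefficient.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, rho, alpha, n_rep = 30, 0.6, 0.05, 5000
z = stats.norm.ppf(1 - alpha / 2)
cov = [[1.0, rho], [rho, 1.0]]
covered = 0
for _ in range(n_rep):
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    lo = np.tanh(np.arctanh(r) - z / np.sqrt(n - 3))   # h(r) = arctanh(r) is Fisher's z
    hi = np.tanh(np.arctanh(r) + z / np.sqrt(n - 3))
    covered += lo <= rho <= hi
print(covered / n_rep)    # should be close to 0.95
```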
7.1.~1. L::+.25) has an :Fk.k+m L::l Xl 2 (5. Now the weak law of large numbers (A.m distribution can be approximated by the distribution of Vlk.:: .15. the mean ofaxi variable is E(Z2).. if k = 5 and m = 60.26) l.37] = P[(vlk) < 2. We write T k for Tk.:l ~ m k+m L X.211 = 0.1. when k = 10.l) distribution.k. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk. We do not require the Xi to be normal.3 First.05 and the respective 0.m. gives the quantiles of the distribution of V/k. as m t 00..1 in which the density of Vlk.i=k+l Xi ".m  (11k) (11m) L.m. where V . then P[T5 . the:F statistic T _ k.h(m)) = y'nh(l)(m)(Y  m) + op(I).0 < 2.. we conclude that for fixed k. is given as the :FIO .Xn is a sample from a N(O. To get an idea of the accuracy of this approximation. But E(Z') = Var(Z) = 1. EX. > 0 is left to the problems.m distribution.05 quantiles are 2. where k + m = n.3. xi and Normal Approximation to the Distribution of F Statistics.3.7) implies that as m t 00. .). See also Figure B. For instance.0 distribution and 2. only that they be i. which is labeled m = 00.x%. when the number of degrees of freedom in the denominator is large. I). 00.i.. 14. This row.3. 1 Xl has a x% distribution.3.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) . When k is fixed and m (or equivalently n = k + m) is large. .21 for the distribution of Vlk.Section 5. we first note that (11m) Xl is the average ofm independent xi random variables.d. The case m = )'k for some ). > 0 and EXt < 00. Suppose for simplicity that k = m and k . i=k+1 Using the (b) part of Slutsky's theorem. Thus. with EX l = 0.= density.9) to find an approximation to the distribution of Tk. we can write. By Theorem B..37 for the :F5 .1.' = Var(X.3. (5. if . Then.3. check the entries of Table IV against the last row. the :Fk. where Z ~ N(O.. Then according to Corollary 8. Next we turn to the normal approximation to the distribution of Tk. we can use Slutsky's theorem (A. Suppose that Xl. D Example 5.m' To show this.3. By Theorem B.
as k ~ 00. By Theorem 5. I' Theorem 5.). i = 1.3. In general. a'). ~ Ck.8(a» does not have robustness of level.a) . . j m"': k (/k.4.m = (1+ jK(:: z. Equivalently Tk = hey) where Y i ~ (li" li.7). 1953) is that unlike the t test.m(l. where J is the 2 x 2 v'n(Tk 1) ""N(0. Then if Xl.. when rnin{k.k (t 1 (5. 4). = E(X. 1)T. a'). v'n(Tk ..3. v) identity.'. P[Tk. = (~.k(T>. Specifically.)T.5.8(c» one has to use the critical value .3. the distribution of'._o l or ~. 2) distribution. ~ N (0. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2. In general (Problem 53. • • 1)/12) . if it exists and equal to c (some fixed value) otherwise. i. h(i)(u. Thus (Problem 5.3. the F test for equality of variances (Problem 5..3. :.)/ad 2 . 0 ! j i · 1 where K = Var[(X.28) . We conclude that xl jer}. !1 . . E(Y i ) ~ (1. v) ~ ~.27) where 1 = (1. .3. 1'.m(1.1) "" N(O.I.3. .m 1) can be ap proximated by a N(O.o) k) K (5. :f.k(Tk. _. 1'.k(tl)] '" 'fI() :'.m 1) <) :'. ifVar(Xf) of 2a'.1)/12 j 2(m + k) mk Z. k.m critical value fk.a) '" 1 + is asymptotically incorrect.m} t 00.).m(1 .3.28) satisfies Zto '" 1 I I:. When it can be eSlimated by the method of moments (Problem 5. when Xi ~ N(o. 1 5. (5.a).3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows.' !k .8(d))..m • < tl P[) :'.»)T and ~ = Var(Yll)J. Suppose P is a canonical exponential family of rank d generated by T with [open.2Var(Yll))' In particular if X. l (: An interesting and important point (noted by Box. i.3. X n are a sample from PTJ E P and ij is defined as the MLE 1 . which by (5.29) is unknown. and a~ = Var(X. I)T and h(u.3. ! i l _ . the upper h. .
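The two approximations to the F critical value can be compared numerically: chi^2_k(1 - alpha)/k for m large with k fixed, and the normal approximation f_{k,m}(1 - alpha) = 1 + z_{1-alpha} sqrt(kappa(k + m)/(km)) with kappa = 2 for Gaussian data. A minimal sketch assuming scipy (the (k, m) pairs are arbitrary illustrative choices):

```python
# Exact F_{k,m} critical value vs. its chi-square and normal approximations.
import numpy as np
from scipy import stats

alpha = 0.05
z = stats.norm.ppf(1 - alpha)
for k, m in [(5, 60), (10, 60), (40, 80), (100, 200)]:
    exact = stats.f.ppf(1 - alpha, k, m)
    chi2_approx = stats.chi2.ppf(1 - alpha, k) / k          # m large, k fixed
    normal_approx = 1 + z * np.sqrt(2 * (k + m) / (k * m))  # both large, normal data (kappa = 2)
    print(k, m, exact, chi2_approx, normal_approx)
```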
Then T 1 = X and 1 T2 = n. I'. where. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information. Let Xl>' .4 with AI and m with A(T/).11) eqnals the lower bound (3. Example 5.1.3. the asymptotic variance matrix /1(11) of . We showed in the proof ofTheorem 5.3 First.3.3. and 1).3. in our case.3.1. . = 1/2.2 where 0" = T. Thus.l4. therefore.71) for any nnbiased estimator ij. = 1/20'.". thus. PT/[ii = AI(T)I ~ L Identify h in Theorem 5.30) But D A .32) Hence.A 1(71))· Proof.8. PT/[T E A(E)] ~ 1 and.Section 5.8. Now Vii11. Note that by B.2 that.2 and 5.23). T/2 By Theorem 5.24).5..In(ij .33) 71' 711 Here 711 = 1'/0'. are sufficient statistics in the canonical model. iiI = X/O". (5.1.3.3.3. For (ii) simply note that.  (TIl' . if T ~ I:7 I T(X.2.d. (I" +0')] . of Theorems 5.4. (i) follows from (5.4.I(T/) (5.4. 0'). (ii) follows from (5.'l 1 (ii) LT/(Vii(iiT/)). by Example 2. hence.2.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X. by Corollary 1.3. and. = (5. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.i.).6.) .2.T. The result is a consequence. Thus.O.38) on the variance matrix of y'n(ij . X n be i.. = A by definition and. Ih our case.N(o.EX. o Remark 5.. (5.Nd(O.A(T/)) + oPT/ (n .3.3.ift = A(T/). as X witb X ~ Nil'.3.31) .
d. .d. . 0). testing.1. . Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. Higherorder approximations to distributions (Edgeworth series) are discussed briefly. we can use (5. . These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll. Consistency is Othorder asymptotics.L. The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families. 5.3.15). variables are explained in tenns of similar stochastic approximations to h(Y) .4. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families. and il' = T.') N(O.1) i . Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i. We focus first on estimation of O. Finally. Nk) where N j . under Li. . Thus. taking values {xo. . and PES. 5.i. EY = J.. .s where Eo = diag(a 2 )2a 4 ). 8 open C R (e. Eo) . and h is smooth... 0.".: CPo.3... the (k+ I)dimensional simplex (see Example 1.4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! .1 Estimation: The Multinomial Case . .)'.. see Example 2. and so on. .Pk) where (5. I I I· < P.. . I i I . sampling. . I .P(Xk' 0)) : 0 E 8}...4. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families.1. as Y. N = (No.3. Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean. Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation.1 by studying approximations to moments and central moments of estimates.4 to find (Problem 5.33) and Theorem 5. J J . i 7 •• . i In this section we define and study asymptotic optimality for estimation.324 Asymptotic Approximations Chapter 5 Because X = T. .4 and Problem 2.3.d.7). Xk} only so that P is defined by p . We consider onedimensional parametric submodels of S defined by P = {(p(xo. for instance. .26) vn(X /".g. 0). These stochastic approximations lead to Gaussian approximations to the laws of important statistics. stochastic approximations in the case of vector statistics and parameters are developed. Summary.L~ I l(Xi = Xj) is sufficient. Y n are i.(T.i.d. and confidence bounds. . the difference between a consistent estimate and the parameter it estimates. < 1. when we are dealing with onedimensional smooth parametric models..h(JL) where Y 1. Assume A : 0 ~ pix. is twice differentiable for 0 < j < k. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. We begin in Section 5. . . 0 .6.. il' . .
fJl E.8) . Then we have the following theorem. Many such h exist if k 2.8) logp(X I .5) As usual we call 1(8) the Fisher infonnation. Assume H : h is differentiable. pC') =I 1 m .. Moreover.2 (8.4.. (5. 8) is similarly bounded and well defined with (5. bounded random variable (5.p(Xk.0)..8)1(X I =Xj) )=00 (5.4.1.4.8).4. . > rl(O) M a(p(8)) P. Consider Example where p(8) ~ (p(xo.Section 5. if A also holds.1. < k. (0) 80(Xj. for instance.8) ~ I)ogp(Xj. h) with eqUality if and only if. 0 <J (5. 88 (Xl> 8) and =0 (5.4.Jor all 8.4.4.4) g.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5.4. Under H.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I .11). Theorem 5. h) is given by (5.3) Furthermore (See'ion 3.2(0.4.7) where .6) 1.. (5.1.l (Xl.:) (see (2. Next suppose we are given a plugin estimator h (r. 8))T.4.4.9) . .2) is twice differentiable and g~ (X I) 8) is a well~defined.2).4.
by noting ~(Xj.2 noting that N) vn (h (.6).4. we s~~ iliat equality in (5.4. using the definition of N j . I (x j.h)Var. (5.10).15) i • . : h • &h &Pj (p(8)) (N . ( = 0 '(8. (5.11) • (&h (p(O)) ) ' p(xj. which implies (5./: .4. Apply Theorem 5.4.p(xj..8) ) ~ .II. h) = I .. (5.8) with equality iff.8) gives a'(8)I'(8) = 1.l6) as in the proof of the information inequality (3.4.8) ) +op(I).4. Noting that the covariance of the right' anp lefthand sides is a(8). (8.16) o .4.h)I8) (5.326 Asymptotic Approximations Chapter 5 Proof. h(p(8)) ) ~ vn Note that.3.4.4. ~ir common variance is a'(8)I(8) = 0' (0. Taking expectations we get b(8) = O.10) Thus. using the correlation inequality (A.p(x).I n. whil. 0) = • &h fJp (5. By (5. by (5.9).8)). &8(X.14) .8)) = a(8) &8 (X" 8) +b(O) &h Pl (5. 2: • j=O &l a:(p(8))(I(XI = Xj) .13) I I · . • • "'i.. kh &h &Pj (p(8))(I(Xi = Xj) . 8). we obtain &l 1 <0 .. h). we obtain 2: a:(p(8)) &0(xj.12)..12) [t.4. for some a(8) i' 0 and some b(8) with prob~bility 1.8) = 1 j=o PJ or equivalently.4. h k &h &pj(p(8)) (N p(xj.8)&Pj 2: • &h a:(p(0))p(x.4.4.')  h(p(8)) } asymptotically normal with mean 0.8) j=O 'PJ Note that by differentiating (5. not only is vn{h (.13). I h 2 (5.p(xj. but also its asymptotic variance is 0'(8. 0)] p(Xj.
A(O)}h(x) where h(x) = l(x E {xo.0)=0... J(~)) with the asymptotic variance achieving the infonnation bound Jl(B).8) is.. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case. 5.::7 We shall see in Section 5. Xl. . . is a canonical oneparameter exponential family (supported on {xo..3.0 (5.4.Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:. achieved by if = It (r.0)) ~N (0. 0 E e.3 that the information bound (5. : E Ell.4.2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates. which (i) maximizes L::7~o Nj logp(xj... 0) and (ii) solvesL::7~ONJgi(Xj.18) and k h(p) = [A]I LT(xj)pj )". .Xn are tentatively modeled to be distributed according to Po.2. Suppose p(x. . E open C R and corresponding density/frequency functions p('.8) = exp{OT(x) .. Write P = {P.4. Suppose i.3.o(p(X"O)  p(X"lJo )) . if it exists and under regularity conditions..4. Note that because n N T = n . Xk}) and rem 5.I L::i~1 T(Xi) = " k T(xJ )".3) L. Example 5.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check.i.19) The binomial (n. 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena..4.(vn(B . Then Theo(5.4. . As in Theorem 5.). . the MLE of B where It is defined implicitly ~ by: h(p) is the value of O.17) L.1. then. OneParameter Discrete Exponential Families. 0). by ( 2."j~O (5.Oo) = E..d. Let p: X x ~ R where e D(O.4.xd). In the next two subsections we consider some more general situations.5 applies to the MLE 0 and ~ e is open.
O)ldP(x) < co. As we saw in Section 2.O(p)))I: It .. f 1 .2. ~ (X" 0) has a finite expectation and i ! .' '.4. That is.23) I. O(Pe ) = O. .22) J i I .1.4.p2(X. That is. i' On where = O(P) +. n i=l _ 1 n Suppose AO: .4.=1 n ! (5.!:. 8.1. rather than Pe. On is consistent on P = {Pe : 0 E e}. O(P» / ( Ep ~~ (XI. .p(x.328 Asymptotic Approximations Chapter 5 . parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for.1/ 2) n '. hence. j. • is uniquely minimized at Bo.~(Xi. 0) is differentiable. Let On be the minimum contrast estimate On ~ argmin .p(Xi. < co for all PEP.p(X" On) n = O.t) . .p(. O(P). A4: sup.4.. 1 n i=l _ .20) In what follows we let p.0(P)) l. O(P))) . Under AOA5.. (5.21) and.p(x. O)dP(x) = 0 (5. We need only that O(P) is a parameter as defined in Section 1. .O(P)) +op(n.21) i J A2: Ep.3. o if En ~ O. PEP and O(P) is the unique solution of(5.pix.O).O(P)I < En} £. # o.4. (5. Theorem 5. Suppose AI: The parameter O(P) given hy the solution of . L. ~I AS: On £. P) = .. .p = Then Uis well defined.p E p 80 (X" O(P») A3: .4.L . ~i  i . 0 E e. J is well defined on P. as pointed out later in Remark 5. denote the distribution of Xi_ This is because.LP(Xi.4. under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}.p(x. { ~ L~ I (~(Xi.
n.Ti(en .l   2::7 1 iJ1jJ.p) < 00 by AI.On)(8n .24) proof Claim (5.24) follows from the central limit theorem and Slutsky's theorem.25) where 18~  8(P)1 < IiJn . A2.4.4.4.O(P)I· Apply AS and A4 to conclude that (5.1j2 ). (5.. O(P)).Section 5.4.4. where (T2(1jJ.21).O(P)) 2' (E '!W(X.22) follows by a Taylor expansion of the equations (5. applied to (5. Next we show that (5. (iJ1jJ ) In (On ..4.4..22) because .O(P))) p (5. en) around 8(P).4.O(P)) Ep iJO (Xl.4.4.28) . Let On = O(P) where P denotes the empirical probability. By expanding n.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ.4.26) and A3 and the WLLN to conclude that (5.p) = E p 1jJ2(Xl .4.' " 1jJ(Xi .20) and (5.4.20). .8(P)) I n n n~ t=1 n~ t=1 e (5.lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1.O(P)) ~ .O(P)) = n. But by the central limit theorem and AI.27) we get. we obtain._ .' " iJ (Xi .425)(5.29) .L 1jJ(X" 8(P)) = Op(n. and A3. 8(P)) + op(l) = n ~ 1jJ(Xi . t=1 1 n (5. using (5. 1 1jJ(Xi .27) Combining (5.4 Asymptotic Theory in One Dimension 329 Hence.
Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i.20) are called M~estimates.30) suggests that ifP is regular the conclusion of Theorem 5.d.4.2. and that the model P is regular and letl(x. I o Remark 5. n I _ . 0) = 6 (x) 0. 0) = logp(x.2 may hold even if ?jJ is not differentiable provided that .O(P) Ik = n n ?jJ(Xi . • Theorem 5. I 'iiIf 1 I • I I . Solutions to (5. .. we see that A6 corresponds to (5. for A4 and A6..4.4. Z~ (Xl.2. {xo 1 ••• 1 Xk}.4. and we define h(p) = 6(xj )Pj.29).22) follows from the foregoing and (5. X n are i. P but P E a}.4.2 is valid with O(P) as in (I) or (2).4. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. e) is as usual a density or frequency function.30) is formally obtained by differentiating the equation (5. A2. Conditions AO.1.12). AI.31) for all O. for Mestimates. This extension will be pursued in Volume2.4.4. 0 • . .4. If further Xl takes on a finite set of values.4. 0) Covo (:~(Xt. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. B) where p('. Identity (5. Our arguments apply even if Xl. and A3 are readily checkable whereas we have given conditions for AS in Section 5.1. This is in fact truesee Problem 5. 1 •• I ! . that 1/J = ~ for some p).4.O) = O.?jJ(XI'O)).4.(x) (5.: 1.4.i.30) Note that (5. =0 (5.'. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. A4 is found.=0 ?jJ(x. O)dp.21).8). O(P)) ) and (5.e.O)?jJ(X I .4. en j• . (2) O(P) solves Ep?jJ(XI. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I .4. written as J i J 2::. Remark 5. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I .O). it is easy to see that A6 is the same as (3.28) we tinally obtain On . Our arguments apply to Mestimates. O)p(x.Eo (XI. A6: Suppose P = Po so that O(P) = 0. B» and a suitable replacement for A3. 0) l' P = {Po: or more generally.! . O(P)) + op (I k n n ?jJ(Xi . .3.. essentially due to Cramer (1946). 0) is replaced by Covo(?jJ(X" 0). We conclude by stating some sufficient conditions.4. Remark 5.4.
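The conclusion of Theorem 5.4.2 can be checked by simulation for a specific psi. The sketch below, a minimal illustration assuming numpy and scipy, uses Huber's psi_c(x, theta) = max(-c, min(c, x - theta)) with c = 1.345 (a conventional but arbitrary choice) and P = N(0, 1), so that theta(P) = 0, E_P (d psi/d theta)(X_1, 0) = -P(|X_1| <= c), and sigma^2(psi, P) = E_P psi_c^2(X_1, 0) / [P(|X_1| <= c)]^2; the Monte Carlo value of n Var(theta_hat_n) should be close to this sigma^2.

```python
# Asymptotic variance of a Huber M-estimate: Monte Carlo vs. the formula of Theorem 5.4.2.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(3)
c, n, n_rep = 1.345, 200, 2000

def psi_bar(theta, x):
    # average of psi_c(X_i, theta); the M-estimate solves psi_bar = 0
    return np.mean(np.clip(x - theta, -c, c))

estimates = []
for _ in range(n_rep):
    x = rng.normal(0.0, 1.0, n)
    estimates.append(optimize.brentq(psi_bar, -5.0, 5.0, args=(x,)))
print(n * np.var(estimates))                                    # Monte Carlo n Var(theta_hat)

num = stats.norm.expect(lambda t: np.clip(t, -c, c) ** 2)       # E psi_c^2(X, 0)
den = (stats.norm.cdf(c) - stats.norm.cdf(-c)) ** 2             # [P(|X| <= c)]^2
print(num / den)                                                # sigma^2(psi, P)
```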
O) = l(x.35) with equality iff. In this case is the MLE and we obtain an identity of Fisher's.. s) diL(X )ds < 00 for some J ~ J(O) > 0. = en en = &21 Ee e02 (Xl. .0'1 < J(O)} 00. We can now state the basic result on asymptotic normality and e~ciency of the MLE.01 < J( 0) and J:+: JI'Ii:' (x.32) where 1(8) is t~e Fisher information introduq. B). We also indicate by example in the problems that some conditions are needed (Problem 5.4.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.1).4.p = a(0) g~ for some a 01 o.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI.34) Furthermore. AOA6. < M(Xl. That is.4. 0) and .4.3. 0) g~ (x.p(x.O) 10gp(x. then the MLE On (5. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.p. "dP(x) = p(x)diL(x).4. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl. lO.4. 0) . Theorem 5. w.4.Section 5. Pe) > 1(0) 1 (5. where EeM(Xl. 0) < A6': ~t (x.O) = l(x.4) but A4' and A6' are not necessary (Problem 5.4. 0') .. 5. 0) obeys AOA6. 0') is defined for all x.20) occurs when p(x.33) so that (5.:d in Section 3. 10' .4. 0) (5.8) is a continuous function of8 for all x.O).4.O) and P ~ ~ Pe.. where iL(X) is the dominating measure for P(x) defined in (A.eo (Xl. satisfies If AOA6 apply to p(x.
.4.(X B).1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise. B) with J T(x) ... }.30) and (5.4.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo)..2.".4. 1 . PoliXI < n1/'1 . p...nB). 5.1 once we identify 'IjJ(x. . cross multiplication shows that (5.4.3 generalizes Example 5.'(0) = 0 < l(~)' The phenomenon (5.4.1/4 X if!XI > n.. If B = 0.A'(B).4. claim (5.<1>( _n '/4 .. PoliXI < n 1/4 ] .. The optimality part of Theorem 5.4..Xn be i. Hodges's Example.2. 00.4. Consider the following competitor to X: B n I 0 if IXI < n. 0 e en e. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(. .38) Therefore.439) where . We next compule the limiting distribution of .'(B) = I = 1(~I' B I' 0.4. for some Bo E is known as superefficiency.36) Because Eo 'lj. 1).4.4. B) is an MLR family in T(X). see Lehmann and Casella.4.• •. I. .33) and (5. 0 1 . . . and Polen = OJ . Let Z ~ N(O..3 is not valid without some conditions on the esti· mates being considered.ne . .. and. (5. 1 .. PolBn = Xl .l'1 Let X" .B). 1.. thus.n(Bn .d. .. PIIZ + . N(B. We discuss this further in Volume II. .nB) . 0 because nIl' .4. 1.36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!.35).4 Testing ! I i . we know that all likelihood ratio tests for simple BI e e eo. {X 1.nBI < nl/'J <I>(n l/ ' ..4. B) = 0. 1 I I. .r 332 Asymptotic Approximations Chapter 5 Proof Claims (5. For this estimate superefficiency implies poor behavior of at values close to 0. .1). 442.. j 1 " Note that Theorem 5.i . However. 1998. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. I I Example 5. By (5. Then X is the MLE of B and it is trivial to calculate [( B) _ 1. Therefore.i..35) is equivalent to (5. (5. We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n.34) follow directly by Theorem 5.37) .1/4 I (5.4. . Then . .4. if B I' 0. .
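The superefficiency in Hodges's example, and the poor risk behavior near theta = 0 that comes with it, can be seen in a small simulation (a sketch assuming numpy; n, the theta values, and the replication count are arbitrary illustrative choices). Since Xbar ~ N(theta, 1/n), Xbar is simulated directly.

```python
# Hodges's estimator: n * MSE is near 0 at theta = 0 but exceeds 1/I(theta) = 1 near theta = n^(-1/4).
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 400, 50_000
for theta in (0.0, 0.05, n ** -0.25, 0.5):
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), n_rep)           # Xbar ~ N(theta, 1/n)
    hodges = np.where(np.abs(xbar) > n ** -0.25, xbar, 0.0)     # Hodges's estimator
    print(theta,
          n * np.mean((xbar - theta) ** 2),      # about 1 for the MLE Xbar
          n * np.mean((hodges - theta) ** 2))    # about 0 at theta = 0, large near theta = n^(-1/4)
```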
::. Suppose the model P = {Po . The test is then precisely. 00) .r'(O)) where 1(0) (5.4. (5.4 Asymptotic Theo:c'Y:c.40) > Ofor all O.0:c":c'=D. The proof is straightforward: PoolvnI(Oo)(Bn .41) where Zlct is the 1 .a quantile o/the N(O. PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a. Thus.4. this test can also be interpreted as a test of H : >.(a. derive an optimality property.h the same behavior. It seems natural in general to study the behavior of the test.. Proof. .4.:cO"=~ ~ 3=3:. a < eo < b.Oo) = 00 + ZIa/VnI(Oo) +0(n.4.4. the MLE. That is. .41) follows. and then directly and through problems exhibit other tests wi!. b)..42) is sometimes called consistency of the test against a fixed alternative..00) > z] .(a.4. 00 )]  ~ 1. (5. Let en (0:'. • • where A = A(e) because A is strictly increasing.00) other hand. Then ljO > 00 .22) guarantees that sup IPoolvn(Bn .43) Property (5.4. .:c". If pC e) is a oneparameter exponential family in B generated by T(X). On the (5. Xl.. 1) distribution. as well as the likelihood ratio test for H versus K. X n distributed according to Po..:.l4.. < >"0 versus K : >.Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij.. 00) .I / 2 ) (5..(a. PolO n > c. eo) denote the critical value of the test using the MLE en based On n observations.2 apply to '0 = g~ and On. ~ 0.. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a. (5.4. Zla (5. 0 E e} is such thot the conditions of Theorem 5. e e e e. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O. are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo.45) which implies that vnI(Oo)(c.. ljO < 00 .4.Section 5.[Bn > c(a. "Reject H if B > c(a. B E (a.4. Then c. and (5.46) PolBn > cn(a. "Reject H for large values of the MLE T(X) of >. 00 )" where P8 .00) > z] ~ 11>(z) by (5. 1 < z .m='=":c'.3 versus simple 2 . Theorem 5.0)].40).42)  ~ o. [o(vn(Bn 0)) ~N(O.44) But Polya's theorem (A.4.Oo)] PolOn > cn(a.d.(1 1>(z))1 ~ 0.4.4. > AO.
..k'Pn(X1.  i..4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 .48) .jnI(8)(80 . .jnI(80) +0(n.50)) and has asymptotically smallest probability of type I error for B < Bo. (5. 00 if8 > 80 0 and . < 1 lI(Zl_a ")'. X n ) i. the power of tests with asymptotic level 0' tend to 0'. In either case. I .47) lor.jI(80)) ill' > 0 > llI(zl_a ")'. .50).43) follow. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B . 00 if 8 < 80 .4.8) < z] .jnI(8)(Cn(0'.4.jnI(80) + 0(n. (3) = Theorem 5.5. I.jnI(8)(Bn . X n ) is any sequence of{possibly randomized) critical (test) functions such that .4.80 1 < «80 )) I J .80 ) tends to zero. assume sup{IP. Suppose the conditions afTheorem 5.80 ). LetQ) = Po.4..(80) > O.jnI(O)(Bn .jnI(8)(80 . Write Po[Bn > cn(O'. 1 ~..4. iflpn(X1 . j (5. (5. ~! i • Note that (5. then limnE'o+.4. j . f.1 / 2 )) .4.[. the test based on 8n is still asymptotically MP. Theorem 5. (5.1/ 2 ))].50) i. In fact.8 + zl_o/.(1 lI(z))1 : 18 .4. Claims (5. .jI(80)) ill' < O.")' ~ "m(8  80 ).4.jnI(8)(Cn(0'. On the other hand.4.8) > .50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5. If "m(8 .4. Furthennore.4. That is. .4.4.49) ! i .017". .···.4.jnI(8)(Bn . Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 .8) > .8) + 0(1) . the power of the test based on 8n tends to I by (5.4.8 + zl_a/.42) and (5. o if "m(8 . 80 )  8) .2 and (5.jnI(8)(80 .48).41).51) I 1 I I .4.[.80) ~ 8)J Po[. 0. then by (5.40) hold uniformly for (J in a neighborhood of (Jo.49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5.48) and (5. . then (5. ~ " uniformly in 'Y.J 334 Asymptotic Approximations Chapter 5 By (5.80) tends to infinity.4.   j Proof. 80)J 1 P.
4. 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5.Xn) is the critical function of the Wald test and oLn (Xl.. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i . Llog·· (X 0) =dn (Q. ~ .54) establishes that the test 0Ln yields equality in (5. > dn (Q.52) > 0.4.o + 7. To prove (5.. 1.Xn) is the critical function of the LR test then. hand side of (5.50) and.4. .O+. Finally. is asymptotically most powerful as well. X. if I + 0(1)) (5.4.50) for all.4 and 5. The test li8n > en (Q..[logPn( Xl. for Q < ~. 00 + E) logPn(X l E I " ..4. 00)]1(8n > 80 ) > kn(Oo.54) Assertion (5.)] ~ L (5. [5In (X" .<P(Zla(1 + 0(1)) + . 80)] of Theorems 5. . 0 n p(Xi .4 Asymptotic Theory in One Dimension 335 If. is O.8 0 ) .53) tends to the righthand side of (5. .8) that.4..Xn ) = 5Wn (X" ..4..?.50) note that by the NeymanPearson lemma..4.7).jn(I(8o) + 0(1))(80 . . Thus. These are the likelihood ratio test and the score or Rao test.4. . X n .jI(Oo)) + 0(1)) and (5. 8 + fi) 0 P'o+:. note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t. .<P(Zl_a .. 0 (5. The details are in Problem 5..80 ) < n p (Xi. hence.5.48) follows. ." + 0(1) and that It may be shown (Problem 5.' L10g i=1 p (X 8) 1. € > 0 rejects for large values of z1.4.4.53) where p(x. . There are two other types of test that have the same asymptotic behavior. 0) denotes the density of Xi and dn .8) 1.OO+7n) +EnP..53) is Q if.4.. k n (80 ' Q) ~ if OWn (Xl. Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5. t n are uniquely chosen so that the right. " Xn.4.4. . 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n.5 in the future will be referred to as a Wald test. . Q).4.Section 5.. i=1 P 1. P.. for all 'Y..j1i(8  8 0 ) is fixed.OO)J .
where p_n(X_1, ..., X_n, θ) is the joint density of X_1, ..., X_n. For ε small, n fixed, this is approximately the same as rejecting for large values of (∂/∂θ_0) log p_n(X_1, ..., X_n, θ_0).

The preceding argument does not depend on the fact that X_1, ..., X_n are i.i.d. with common density or frequency function p(x, θ), and the test that rejects H for large values of (∂/∂θ_0) log p_n(X_1, ..., X_n, θ_0) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
    Σ_{i=1}^n (∂/∂θ) log p(X_i, θ_0) > T_n(α, θ_0)."   (5.4.55)
It is easy to see (Problem 5.4.15) that
    T_n(α, θ_0) = z_{1-α} √(nI(θ_0)) + o(n^{1/2})

and that again if δ_{Rn}(X_1, ..., X_n) is the critical function of the Rao test then

    P_{θ_0 + γ/√n}[δ_{Rn}(X_1, ..., X_n) = δ_{Wn}(X_1, ..., X_n)] → 1   (5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(θ_0), which may require numerical integration, can be replaced by -n^{-1} (d²/dθ²) l_n(θ̂_n) (Problem 5.4.10).
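For a concrete one-parameter case, the Wald, Rao, and likelihood ratio tests can be compared by simulation. The sketch below (assuming numpy and scipy; θ_0, θ, n, and the replication count are arbitrary illustrative choices) uses i.i.d. Poisson(θ) observations, for which θ̂_n = X̄ and I(θ) = 1/θ, and replaces I(θ_0) by I(θ̂_n) in the Wald statistic as the preceding remark allows; the three Monte Carlo powers should nearly agree.

```python
# Wald, Rao (score), and likelihood ratio tests of H: theta <= theta0 for Poisson data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
theta0, theta, n, alpha, n_rep = 1.0, 1.3, 100, 0.05, 5000
z = stats.norm.ppf(1 - alpha)
wald = rao = lr = 0
for _ in range(n_rep):
    x = rng.poisson(theta, n)
    xbar = x.mean()                                              # MLE; I(theta) = 1/theta
    wald += np.sqrt(n / xbar) * (xbar - theta0) > z              # sqrt(n I(theta_hat)) (theta_hat - theta0)
    rao += (x / theta0 - 1.0).sum() > z * np.sqrt(n / theta0)    # score at theta0 vs z sqrt(n I(theta0))
    if xbar > theta0:
        loglr = n * (xbar * np.log(xbar / theta0) - (xbar - theta0))
        lr += np.sqrt(2 * loglr) > z                             # signed root of the LR statistic
print(wald / n_rep, rao / n_rep, lr / n_rep)                     # three Monte Carlo powers, nearly equal
```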
5.4.5 Confidence Bounds

We define an asymptotic level 1 - α lower confidence bound (LCB) θ*_n by the requirement that

    lim inf_n P_θ[θ*_n ≤ θ] ≥ 1 - α   (5.4.57)

for all θ and similarly define asymptotic level 1 - α UCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.
(ii) By inverting the testing regions derived in Section 5.4.4.
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (A0)-(A6), (A4'), and I(θ) finite for all θ, it follows (Problem 5.4.9) that

    L_θ(√(nI(θ̂_n)) (θ̂_n - θ)) → N(0, 1)   (5.4.58)

for all θ and, hence, an asymptotic level 1 - α lower confidence bound is given by

    θ*_n = θ̂_n - z_{1-α} / √(nI(θ̂_n)).   (5.4.59)

Turning to method (ii), inversion of δ_{Wn} gives formally

    θ*_{n1} = inf{θ : c_n(α, θ) > θ̂_n}   (5.4.60)
or, if we use the approximation c̄_n(α, θ) = θ + z_{1-α}/√(nI(θ)) of (5.4.41),

    θ*_{n2} = inf{θ : c̄_n(α, θ) > θ̂_n}.   (5.4.61)
In fact neither θ*_{n1} nor θ*_{n2} properly inverts the tests unless c_n(α, θ) and c̄_n(α, θ) are increasing in θ. The three bounds are different, as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, θ*_{n1} is preferable because this bound is not only approximately but genuinely of level 1 - α. But it is often hard to implement computationally because c_n(α, θ) needs, in general, to be computed by simulation for a grid of θ values. Typically, (5.4.59) or some equivalent alternative (Problem 5.4.10) is preferred but can be quite inadequate (Problem 5.4.11). These bounds, θ*_n, θ*_{n1}, θ*_{n2}, are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5.4.12 and 5.4.13).
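The practical quality of the bound (5.4.59) is easy to probe by simulation. The sketch below (assuming numpy and scipy; the θ values, sample sizes, and replication count are arbitrary illustrative choices) estimates the coverage of θ*_n = θ̂_n - z_{1-α}/√(nI(θ̂_n)) for Bernoulli(θ) samples, where θ̂_n = X̄ and I(θ) = 1/[θ(1 - θ)]; coverage approaches 1 - α but can be poor when nθ is small, in line with the remarks above.

```python
# Coverage of the asymptotic lower confidence bound (5.4.59) for Bernoulli data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
alpha, n_rep = 0.05, 20_000
z = stats.norm.ppf(1 - alpha)
for theta in (0.5, 0.2, 0.05):
    for n in (25, 100, 400):
        xbar = rng.binomial(n, theta, n_rep) / n
        lower = xbar - z * np.sqrt(xbar * (1 - xbar) / n)    # theta_hat - z / sqrt(n I(theta_hat))
        print(theta, n, np.mean(lower <= theta))             # approaches 0.95; poor when n*theta is small
```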
Summary. We have defined asymptotic optimality for estimates in one-parameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of θ in a one-dimensional subfamily of the multinomial distributions, showed that the MLE formally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M-estimates, generalizations of the MLE, along the lines of Huber (1967). The asymptotic formulae we derived are applied to the MLE both under the model that led to it and under an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980).
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n → ∞ in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4: i.i.d. observations from a regular model in which Θ is open ⊂ R or Θ = {θ_1, ..., θ_k} is finite, and θ is identifiable. Most of the questions we address and answer are under the assumption that 𝜽 = θ, an arbitrary specified value, or, in frequentist terms, that θ is true.
Consistency

The first natural question is whether the Bayes posterior distribution, as n → ∞, concentrates all mass more and more tightly around θ. Intuitively this means that the data coming from P_θ eventually wipe out any prior belief that parameter values not close to θ are likely. Formalizing this statement about the posterior distribution, Π(· | X_1, ..., X_n), which is a function-valued statistic, is somewhat subtle in general. But for Θ = {θ_1, ..., θ_k} it is
straightforward. Let

    π(θ | X_1, ..., X_n) = P[𝜽 = θ | X_1, ..., X_n].   (5.5.1)

Then we say that Π(· | X_1, ..., X_n) is consistent iff for all θ ∈ Θ and all ε > 0,

    P_θ[|π(θ | X_1, ..., X_n) - 1| > ε] → 0.   (5.5.2)

There is a slightly stronger definition: Π(· | X_1, ..., X_n) is a.s. consistent iff for all θ ∈ Θ,

    π(θ | X_1, ..., X_n) → 1 a.s. P_θ.   (5.5.3)

General a.s. consistency is not hard to formulate:

    Π(· | X_1, ..., X_n) ⇒ δ_{θ} a.s. P_θ   (5.5.4)

where ⇒ denotes convergence in law and δ_{θ} is point mass at θ. There is a completely satisfactory result for Θ finite.
,
Theorem 5.5.1. Let 1rj  p[e = Bj ], j = 1, ... 1 k denote the prior distribution 0/8. Then II(· I Xl, ... ,Xn ) is consistent (a.s. consistent) iff 7fj > afor j = I, ... , k.
Proof. Let p(., B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1["] = 0 for some j implies that 1f(Bj I Xl, ... ,Xn ) = 0 for all Xl, .. . , X n because, by (1.2.8),
.,
,J,
11(8j
I Xl, ... ,Xn)
PIO = 8j I Xl, ... ,Xn ] 11j Ir~l p(Xi , 8il
L:.~l 11. ni~l p(Xi , 8.)
k
n .
(5.5.5)
,
,
, ,,
Intuitively, no amount of data can convince a Bayesian who has decided a priori that OJ is impossible. On the other hand, suppose all 71" j are positive. If the true is (J j or equivalently 8 = (J j, then
e
log
11(8.IXl, ... ,Xn) =n 11(8 j I X), ... ,Xn)
(11og+ L.. og P(Xi,8.)) . 11. 1{f.,1 n
7fj
n i~l
p(Xi ,8j)
By the weak (respectively strong) LLN, under P_{θ_j},

    (1/n) Σ_{i=1}^n log [p(X_i, θ_a) / p(X_i, θ_j)] → E_{θ_j} ( log [p(X_1, θ_a) / p(X_1, θ_j)] )

in probability (respectively a.s.). But E_{θ_j}(log [p(X_1, θ_a) / p(X_1, θ_j)]) < 0, by Shannon's inequality, if θ_a ≠ θ_j. Therefore,

    log [π(θ_a | X_1, ..., X_n) / π(θ_j | X_1, ..., X_n)] → -∞

in the appropriate sense, and the theorem follows. □
h
_
e
Remark 5.5.1. We have proved more than is stated. Namely. that for each I XI, . .. ,Xn ]  a exponentially.
e E e. Po[O =l0
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions AOA6 for p(x, B) that if B is the MLE,

= lex, Bj
=logp(x, B), we showed in Section 5.4
(5.5.6)
Ca(y'n(e  B» ~ N(O,rl(B)).
Consider C( ..;ii((}  B) I Xl, ... , X n ), the posterior probability distribution of y'n((} B( Xl, ... , X n )), where we emphasize that (j depends only on the data and is a constant given XI, ... , X n . For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in Po probability by convergence a.s. p•. We also add,



A7: For all (), and all 0> o there exists t(o,(})
p. [sup
> 0 such that
{~ t[I(Xi,B') /(Xi,B)]: 18'  BI > /j} < '(0, B)] ~ I.
e such that 1r(') is continuous and positive
AS: The prior distribution has a density 1f(') On at all B. Remarkably,
Theorem 55.2 (UBernsteinlvon Mises"). If conditions ADA3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold. then
C(y'n((}(})
I X1, ... ,Xn )
~N(O,l
1
(B»)
(5.5.7)
a.s. under P%ralle.
We can rewrite (5.5.7) more usefully as
sup IP[y'n((}  e) < x I Xl, ... , X n]  of>(xVI(B»)j ~ 0
x
(5.5.8)
for all a.s. Po and, of course, the statement holds for our usual and weaker convergence in Po probability also. From this restatement we obtain the important corollary. Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
e
sup IP[y'n(O  e) < x j Xl, ... , XnJ  of>(xVl(e)1
x
~0
(5.5.9)
a.s. P%r all B.
1 , ,
Remarks
(I) Statements (5,5.4) and (5,5,7)(5,5,9) are, in fact, frequentist statements about the
asymptotic behavior of certain functionvalued statistics.
(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P, probability if A4 and
A5 are used rather than their strong formssee Problem 5.5.7.
(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and
identifiability guarantees consistency of Bin a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
341
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
,
I
342
Hence, a.S. Po,
Asymptot'lc Approximations Chapter 5
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n + 00, WIn + 0, X + 8, a.s., if () = e, and (~I T\) 1 + O. That is, the posterior distribution has mean approximately (j and variance approximately 0, for n large, or equivalently the posterior is close to point mass at as we vn(O  9) has posterior distribution expect from Theorem 5.5.1. Because 9 =
;
N ( .,!nw1n(ry  X), n (~+
;'»
1).
x,
e
Now, vnW1n
=
O(n 1/ 2) ~ 0(1) and
0
n (;i + ~ ) 1
rem 5.5.2.
+ (12 =
II (8) and we have directly established the conclusion of Theo
I ,
I
I
Example 5.5.2. Posterior Behavior in the Binomial~Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, 8), distribution, or equivalently we observe X" ... , X n Li.d. Bernoulli (I, e) and put a beta, (3(r, s) prior on e, then, as in Example 3.2.3, (J has posterior (3(8n +r, n+s  8 n ). We have shown in Problem 5.3.20 that if Ua,b has a f3(a, b) distribution, then as a + 00, b + 00,
I
If 0
a) £ (a+b)3]1( Ua,b a+b ~N(o,I). [ ab
< B<
(5.5.21)
j I ,
i j
1 is true, Sn/n ~. () so that Sn + r + 00, n + s  Sn + 00 a.s. Po. By identifying a with Sn + r and b with n + s  Sn we conclude after some algebra that because 9 = X,
vn((J  X)!:' N(O,e(l e))
a.s. Po, as claimed by Theorem 5.5.2.
o
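The agreement claimed by Theorem 5.5.2 in Example 5.5.2 can be checked numerically by comparing the beta(S_n + r, n + s - S_n) posterior with its normal approximation N(X̄, X̄(1 - X̄)/n). A minimal sketch assuming numpy and scipy (θ, n, r, s, and the seed are arbitrary illustrative choices):

```python
# Bernstein-von Mises check in the binomial-beta model: posterior vs. normal approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
theta, n, r, s = 0.3, 400, 2.0, 3.0
sn = rng.binomial(n, theta)                      # S_n for one simulated data set
xbar = sn / n                                    # MLE theta_hat
posterior = stats.beta(sn + r, n + s - sn)       # exact posterior under the beta(r, s) prior
approx = stats.norm(xbar, np.sqrt(xbar * (1 - xbar) / n))
grid = np.linspace(xbar - 4 * approx.std(), xbar + 4 * approx.std(), 9)
print(np.max(np.abs(posterior.cdf(grid) - approx.cdf(grid))))   # small when n is large
```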
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
,
1 ,
I
I
t
hz _
I
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
343
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
344
~
Asymptotic Approximations
Chapter 5
Because by Problem 1.4.7 and Proposition 3.2.1, B* is the Bayes estimate for In(e,d), (5.5.25) and the theorem follows. 0 Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions afTheorem 5.5.2 are satisfied. Let
where en is chosen so that 1l"(Cn I Xl, ... ,Xn) = 1  0', be the Bayes credible region defined in Section 4.7. Let Inh) be the asymptotically level 1  'Y optimal interval based on B, given by
~
where dn(y)
i •
= z (! !)
JI~). ThenJorevery€ >
0, 0,
~ 1.
P.lIn(a + €) C Cn(X1 , .•. ,Xn ) C In(a  €)J
(5.5.29)
I
I
I
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of Jii( IJ  0) converges to the N(O,Il (0)) density nnifonnly over compact neighborhoods of 0 for each fixed 0, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than n 1j2 do involve the prior. A particular choice, the Jeffrey's prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n 1 order (see Schervisch, 1995).
Thsting
! ,
•
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of 00 given X I, ... ,Xn if H is false is of a different magnitude than the pvalue for the same data. For more on this socalled Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypothesis specifying one points 00 we consider indifference regions where H specifies [00 + D.) or (00  D., 00 + D.), then Bayes and freqnentist testing procedures agree in the limit. See Problem 5.5.2. Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a prior possible. Second. we established
i
! I
I b
II
!
_
j
TI
Section 5.6
Problems and Complements
345
the socalled Bernsteinvon Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the socalled Bernstein~von Mises theorem and frequentist contjdence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
346
Hint: Condition on
Asymptotic Approximations
Chapter 5
ITI.
1, then Var(X) < 1 with equality iff X
3. Show that if P[lXI < 11
±1 with
probability! .
Hint: Var(X) < EX 2
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5,2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
7.\} . e E Rand (i) For some «eo) >0 Eo. for each e eo. 5.2.n 1 (XiJ. Hint: From continuity of p. N (/J.sup{lp(X. then is a consistent estimate of p. (Wald) Suppose e ~ p( X.lJ. eo)} : e E K n {I : 1 IB . Show that the maximum contrast estimate 8 is consistent. n Lt= ".p(Xi . V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00. e') . eo) : e' E S(O. 5_0  lim Eo. Prove that (5. . and < > 0 there is o{0) > 0 such that e.14)(i) add (ii) suffice for consistency. (Ii) Show that condition (5.e'l < .lO)2)) . (a) Show that condition (5. e)1 (ii) Eo.Xn are i.(eo)} < 00.\} c U s(ejJ(e j=l T J) {~ t n . 05) where ao is known.2.p(X. {p(X" 0) .(ii) holds. (c) Suppose p{P) is defined generally as Covp (U. e) is continuous.. i (e))} > <.•.e')p(X. (i). VarpUVarpV > O}. Eo.d. .i.e)1 : e' E 5'(e. .6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution.2 rII.l)2 _ + tTl = o:J..p(X. Hint: K can be taken as [A. . By compactness there is a finite number 8 1 . e') . Hint: sup 1. e) . Suppose Xl. . eol > A} > 0 for some A < 00.o)} =0 where S( 0) is the 0 ball about Therefore. by the basic property of maximum contrast estimates. A]. inf{p(X. ".14)(i). sup{lp(X. Or of sphere centers such that K Now inf n {e: Ie .eol > . (1 (J.. inf{p(X. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent.eol > . where A is an arbitrary positive and finite constant. t e. and the dominated convergence theorem.Section 5. 6.~ . eo) : Ie  : Ie .2.8) fails even in this simplest case in which X ~ /J is clear.p(X.
. .X. L:~ 1(Xi ."d' k=l d < mm+1 L d ElY.. < Mjn~ E (. Problems for Section 5. • • (i) Suppose Xf. li . Establish (5.3.d. for some constants M j . . 2. . Establish Theorem 5. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. n. then by Jensen's inequality.3. < < 17 < 1/<.i. Hint: Taylor expand and note that if i . Compact sets J( can be taken of the form {II'I < A.d.. .X.)I' < MjE [L:~ 1 (Xi . ..) 2] . Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. and take the values ±1 with probability ~.. P > 1. .3.. Let X~ be i.k / 2 .i. + id = m E II (Y. II I . Establish (5.[. . . . (72).. (ii) If ti are ij.L. 1 + . = 1.1..Xn but independent of them. en are constants. with the same distribution as Xl.2.3.. L:!Xi . For r fixed apply the law of large numbers.)2]' < t Mjn~ E [.. Establish (5. J (iii) Condition on IXi  X. .O'l n i=l p(X"Oo)}: B' E S(Bj. < > OJ.3 I.. and let X' = n1EX.X'i j .I'lj < EIX . i .X~ are i. Hint: See part (a) of the proof of Lemma 5.O(BJll}.11). 8.1'. I 10. .3..3.1.. 4. 9. . Show that the log likelihood tends to 00 as a + 0 and the condition fails.348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi. and if CI. Extend the result of Problem 7 to the case () E RP. Then EIX .l m < Cm k=I n.d. ~ . 1/(J.xW) < I'lj· 3.. i .3) for j odd as follows: .9) in the exponential model of Example 5.. The condition of Problem 7(ii) can also fail. .
theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~. i./LI = E(Xt}....(X.Xn1 bei. .aD.28. ~ xl"..) I : i" . then (sVaf)/(s~/a~) has an . where s1 = (n.Fk. LetX1"".1)'2:7' . X. ~ iLl' I IXli > liLl}P(lXd > 11.m with K. 8. under H : Var(Xtl ~ Var(Y. if a < EXr < 00.m distribution with k = nl .m 00. I < liLl}P(IXd < liLl)· 7. km K. j = 1. P(sll s~ < ~ 1~ a as k ~ 00. . .d. L~=1 i j = m then m a~l.Tn) + 1 .\k for some .a~d < [max(al"" . Let XI.m) Then. Show that if m = . . j > 2. Show that ~ sup{ IE( X. 0'1 = Var(Xd· Xl /LI Ck.(l'i . (b) Show that when F and G are nonnal as in part (a).i. G. ..6 Problems and Complements 349 Suppose ad > 0. 1)'2:7' . (d) Let Ck.c> .I.1) +E{IX. .. .). n} EIX.1 and 'In = n2 . .i.3. ~ iLli < 2 i EIX. 6.andsupposetheX'sandY's are independent. respectively..pli ~ E{IX.r'j". ~ iLli I IX.."" X n be i. 2 = Var[ ( ) /0'1 ]2 .. s~ ~ (n2 . 1 < j < m.a as k t 00. PH(st! s~ < Ck. )=1 5. R valued with EX 1 = O.d.I) Hint: By the iterated expectation theorem EIX. Establish 5. then EIX. FandYi""'Yn2 bei. Show that if EIXlii < 00.l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.\ > 0 and = 1 + JI«k+m) ZIa.ad)]m < m L aj j=1 m <mmILaj.i. replaced by its method of moments estimate. (a) Show that if F and G are N(iL"af) and N(iL2.d.. Show that under the assumptions of part (c).m be Ck.Section 5.
t . .. . . "i. i = 1. (cj Show that if p = 0.XC) wHere XC = N~n Et n+l Xi. . t N }. • op(n'). 7 (1 . JLl = JL2 = 0. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3. that is. from {t I. then jn(r' . show that i' ~ . .XN} or {(uI.p') ~ N(O. Y) ~ N(I'I. p).+xn n when T. Instead assume that 0 < Var(Y?) < x. Tin' which we have sampled at random ~ E~ 1 Xi. if p i' 0. . .xd.. .m (0 Let 9.m) + 1 . . Without loss of generality.p). then P(sUs~ < qk.. N where the t:i are i. (b) Use the delta method for the multivariate case and note (b opt  b)(U . 4p' (I .. if ~ then .4. iik. . use a normal approximation to tind an approximate critical value qk. < 00 (in the supennodel).A») where 7 2 = Var(XIl.\ < 1.". In Example 5. 1). . ' . i = I. = (1. there exists T I .d. • .. (I _ P')').~) (X . ~ ] .3. Show that under the assumptions of part (e).xd.iL) • .1. (iJi . and if EX. .. (iJl.I))T has the same asymptotic distribution as n~ [n.) =a 2 < 00 and Var(U) Hint: (a) X ._ J . Et:i = 0. Hint: Use the centra] limit theorem and Slutsky's theorem.\ 00. Wnte (l 1pP ..Il2...".4.. .2< 7'.." . . Without loss of generality." •.. (b) If (X.. to estimate Ii 1 < j < n. (UN.1 EX.2" ( 11. err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.Ii > 0. . 0 < .I' ill "rI' and "2 = Var[( X. = ("t .p) ~ N(O.1'2)1"2)' such that PH(SI!S~ < Qk.i.1JT.l EXiYi .p')') and. n. "1 = "2 = I. () E such that T i = ti where ti = (ui. In survey sampling the modelbased approach postulates that the population {Xl. . i (b) Suppose Po is such that X I = bUi+t:i. . be qk. (a) If 1'1 = 1'2 ~ 0.lIl) + 1 .1 · /. In particular. suppose in the context of Example 3.6.i. Consider as estimates e = ') (I X = x.1 that we use Til.. . n.1. Var(€.I. ! . .(I_A)a2). .m with KI and K2 replaced by their method of moment estimates.p.N. • 10.a as k . . Show that jn(XRx) ~N(0. suppose i j = j. then jn(r .+. jn(r . . as N 2 t Show that.6. = = 1.d.I).XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters.m (depending on ~'l ~ Var[ (X I . + I+p" ' ) I I I II. if 0 < EX~ < 00 and 0 < Eyl8 < 00.3. H En.p) ~ N(O.350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g. (a) jn(X .a: as k + 00. Po. TN i. ..x) ~ N(O. jn((C ... Under the assumptions of part (c).00.l EY? .
3.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. 1931) is found to be excellent Use (5.3. . 14.. E(WV) < 00. X~.14 to explain the numerical results of Problem 5.. and Hint: Use (5.99. .. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). Normalizing Transformation for the Poisson Distribution. X n are independent.s.3. .i'b)(Yc . Let Sn have a X~ distribution.3.y'ri has approximately aN (0...3. 16.3 < 00. then E(UVW) = O.12). The following approximation to the distribution of Sn (due to Wilson and Hilferty. (h) Use (a) to justify the approximation 17. Show that IE(Y. X... W).. each with HardyWeinberg frequency function f given by .10..Section 5.14).. X = XO. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. Suppose XI. EU 13. is known as Fisher's approximation. X n is a sample from a population with mean third central moment j1.90. Suppose X 1l .25.. Here x q denotes the qth quantile of the distribution.6 Problems and Complements 351 12.i'c) I < Mn'. (a) Use this fact and Problem 5.i'a)(Yb . (a) Show that the only transformations h that make E[h(X) . 15.: .. (a) Suppose that Ely. . . !) distribution.6) to explain why. Hint: If U is independent of (V. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x .n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.. (h) Deduce fonnula (5. (b) Let Sn .3' Justify fonnally j1" variance (72.13(c). = 0. (a) Show that if n is large. n = 5. Suppose X I I • • • l X n is a sample from a peA) distrihution.
Variance Stabilizing Transfo111U1tion for the Binomial Distribution. 20.. [vm +  n (Bm. .1'2)]2 177 n +2h. Justify fonnally the following expressions for the moments of h(X. 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t. ..n P . Y) ~ p!7IC72.a) E(Bmn ) = . i 21. Y) where (X" Yi). 1'2)h2(1'1. Var Bmnm + n +Rmn = 'm+n ' ' where Rm. Let then have a beta distribution with parameters m and n. Var(Y) = C7~..n = (mX/nY)i1 independent standard eXJX>nentials. if m/(m + n) .h(l'l. 172 + [h 2(1'l.(x. • j b _ . . 1'2)(Y  + O( n '). .2.. and h'(t) > for all t. Y" . Bm.Xm .. is given by h(t) = (2/1r) sin' (Yt).I'll + h2(1'1. 1'2) = h. (Xn . . Var(X) = 171. 19. (a) (b) Var(h(X.a) Hmt: Use Bm. Yn are < < . where I I h. Yn) is a sample from a bivariate population with E(X) ~ 1'1.y) = a axh(x..• • • ~{[hl (1". Cov(X. (b) Find an approximation to P[JX' < t] in terms of 0 and t. Y) .a tends to zero at the rate I/(m + n)2.y).. .1') + X 2 . 1'2)pC7. where I' = (a) Find an approximation to P[X < E(X.. + (mX/nY)] where Xl.y).n tends to zero at the rate I/(m + n)2. .)? 18. Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. v'a(l. 0< a < I. 1'2)]2C7n + O(n 2) i . (1'1. (1'1. Show that the only variance stabilizing transformation h such that h(O) = 0.5 that under the conditions of the previous problem.. h 2 (x. ° . 1'2)(X . E(Y) = 1'2. h(l) = I.n  m/(m + n») < x] 1 > 'li(x). I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1. then I m a(l. V)) '" . (e) What is the approximate distribution of Vr'( X .y) = a ayh(x. 1'2) Hint: h(X. which are integers. Show directly using Problem B. X n be the indicators of n binomial trials with probability of success B. Let X I.
J1. n[X(I .2)/24n2 Hint: Therefore.. find the asymptotic distrintion of nT nsing P(nT y'nX < Vi).J1.)] ~ as ~h(2)(J1. .2V with V "' Give an approximation to the distribution of X (1 .X) in tenns of the distribution function when J.)] !:. (It may be shown but is not required that [y'nR" I is bounded. Suppose that Xl.2V where V ~ 00.3 1/6n2 + M(J1. Let Sn "' X~. . . Let Xl.J1.l4 is finite.. XI. Compare your answer to the answer in part (a).Section 5.4 + 3o.)1J1. ° while n[h(X h(J1. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 . 1 X n be a sample from a population with mean J. Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. (a) When J1. k > I.). = 0. Show that Eh(X) 3 h(J1.)2. .l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ.2. find the asymptotic distribution of y'n(T. (a) Show that y'n[h(X) .J1.6 Problems and Complements 353 22. Usc Stirling's approximation and Problem 8.h(J1. xi.. < t) = p(Vi < (b) When J1.)2 and n(X .) + ": + Rn where IRnl < M 1(J1..X2 . (e) Fiud the limiting laws of y'n(X . and that h(1)(J.l) = o.(1. ()'2 = Var(X) < 00. = E(X) "f 0. J1. _.) 23. 24...4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00.2) using the delta method.)] is asymptotically distributed (b) Use part (a) to show that when J1.J1.) + ~h(2)(J1. = ~.l = ~. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k). xi 25.
Xi we treat Xl.9. 29. Show that if Xl" . whereas the situation is reversed if the sample size inequalities and variance inequalities agree.t2. .9.) of Example 4.. " Yn as separate samples and use the twosample t intervals (4.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of . 27. n2 + 00. We want to study the behavior of the twosample pivot T(Li.(~':~fJ'). (c) Apply Slutsky's theorem.354 26..3) has asymptotic probability of coverage < 1 .9.3. . . ./". then limn PIt. cyi . (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment. . . Let T = (D .2a 4 ). . Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . (a) Show that P[T(t. so that ntln ~ .)/SD where D. a~.9. . 1 Xn.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . Viillnl ~ 2vul + a~z(l .3). Assume that the observations have a common N(jtl.JTi times the length of the onesample t interval based on the differences.. n2 + 00.\ < 1.}.3.. < tJ ~ <P(t[(. p) distribution.. Suppose (Xl.Jli = ~. Yn2 are as in Section 4. if n}. and a~ = Var(Y1 ).3) and the intervals based on the pivot ID .\a'4)I').d. 28. (d) Make a comparison of the asymptotic length of (4. .) < (b) Deduce that if p tl ~ <P (t [1.9. /"' = E(YIl. Hint: (a).3). 1 X n1 and Y1 . .\)al + .9. Suppose nl + 00 .4. I . Yd.t. 2pawz)z(1  > 0 and In is given by (4.9.3) have correct asymptotic (c) Show that if a~ > al and . Hint: Use (5. . = = az.') < 00.\ probability of coverage. al = Var(XIl.a.9.\ul + (1 ~ or al .(j ' a ) N(O. that E(Xt) < 00 and E(Y.\. (c) Show that if IInl is the length of the interval In. E In] > 1 .) (b) Deduce that if. Eo) where Eo = diag(a'.4.1/ S D where D and S D are as in Section 4. . Suppose Xl.2 . .\)a'4)/(1 . XII are ij. the intervals (4.3 independent samples with III = E(XIl. .•. . Let n _ 00 and (a) Show that P[T(t.o.t. I i t I . 0 < .\. Y1 .\ > 1 . and SD are as defined in Section 4.33) and Theorem 5. We want to obtain confidence intervals on J1. Suppose that instead of using the onesample t intervals based on the differences Vi .. N(Pl (}2).. . ..4. What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.9.. the interval (4.a. b l . .3. 'I :I ! t.
X)" Then by Theorem B.z = Var(X). .. X. (d) Find or write a computer program that carries out the Welch test. (a) Show that the MLEs of I'i and ().1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . I< = Var[(X 1')/()'jZ.9 and 4. .a)) and evaluate the approximations when F is 7.Section 5.9.6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4.3.d. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. ... Tn .. T and where X is unifonn.p.. Hint: If Ixll ~ L. Let X ij (i = I. (a) SupposeE(X 4 ) < 00. and let Va ~ (n .1).£.xdf and Ixl is Euclidean distance. Plot your results. Show that k ~ ex:: as 111 ~ 00 and 112 t 00. I is the Euclidean norm.3 by showing that if Y 1. X n be i.a). where tk(1. x = (XI. .1)1 L. .4.:~l IXjl. then P(Vn < va<) t a as 11 t 00. . k) be independent with Xi} ~ N(l'i.1. vectors and EIY 1 1k < 00. 33. Generalize Lemma 5. then lim infn Var(Tn ) > Var(T). (c) Let R be the method of moment estimaie of K. Show that if 0 < EX 8 < 00.3. Hint: See Problems B.3. (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l .I) + y'i«n .I k and k only. .~ t (Xi .z has a X~I distribution when F is theN(fJl (72) distribution.d.4. Let XI._I' Find approximations to P(Yn and P{Vn < XnI (I .8Z = (n ... Y nERd are Li.. has asymptotic level a.i. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. but Var(Tn ) + 00." .I)z(a). U( 1. 32. Vn = (n . .3.3 using the Welch test based on T rather than the twosample t test based on Sn. then for all integers k: where C depends on d. Let .I)sz/(). 30.£.. (). as X ~ F and let I' = E(X). It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. . where I. ().Z). Show that as n t 00.a) is the critical value using the Welch approximation. EIY. j = I.16.z are Iii = k. Carry out a Monte Carlo study such as the one that led to Figure 5.
1) for every sequence {On} wilb On 1 n = 0 + t/. • 1 = sgn(x). n 1 F' (x) exists. ! Problems for Section 5.p(X.0) and 7'(0) .. I ! 1 i . . . Let i !' X denote the sample median.1)cr'/k.Xn be i.On) .n7(0) ~[. (Use . Show that _ £ ( .nCOn .0) !:. I .. N(O) = Cov(. 1948). . I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 . (b) Show that if k is fixed and p ~ 00. F(x) of X.p(X.n for t E R..p(X.d. Ep.0) !..L~. O(P) is lbe unique solution of Ep. Show lbat if I(x) ~ (d) Assume part (c) and A6.0) = O.On) < 0). (a) Show that (i)... random variables distributed according to PEP. . N(O. then 8' !. (c) Assume the conditions in (a) and (b). Let On = B(P).p(x) (b) Suppose lbat for all PEP. where P is the empirical distribution of Xl.. . 00 (iii) I!/!I(x) < M < for all x.p( 00) as 0 ~ 00.n(On ... _ .p(Xi ..n(X . then < t) = P(O < On) = P (..0). (c) Give a consistent estimate of (]'2..4 1.. Show that On is consistent for B(P) over P..0))  7'(0)) 0. under the conditions of (c). . Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique. .p(X.f.\(On)] !:.i... aliI/ > O(P) is finite. . .. . I(X.. . .0) ~ N Hint: P( ... . N(O. 1/4/'(0)).. .0). 1/). Show that.. [N(O))' . I i . ..(0) Varp!/!(X.. is continuous and lbat 1(0) = F'(O) exists. Assmne lbat '\'(0) < 0 exists and lbat . . .) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. 1 X n. .. . . Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) .p(X.)). (ii). (k . Set (.. That is the MLE jl' is not consistent (Neyman and Scott. and (iii) imply that O(P) defined (not uniquely) by Ep. Let Xl... Use the bounded convergence lbeorem applied to .O(P)) > 0 > Ep!/!(X. . \ . (e) Suppose lbat the d.p(X.356 Asymptotic Approximations Chapter 5 .
(x."' c .. 0 «< 0.. Show that assumption A4' of this section coupled with AGA3 implies assumption A4. U(O.15 and 0. 82 ) Pis N(I".(x.. X) as defined in (f). Hint: Apply A4 and the dominated convergence theorem B. Thus. 'T = 4. Conclude that (b) Show that if 0 = max(XI.B) < xl = 1.(x. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2).6 Problems and Complements 357 (0 For two estImates 0. oj). not O! Hint: Peln(B .5  and '1'. 1979) which you may assume. Show that if (g) Suppose Xl has the gross error density f. the asymptotic 2 .a)dl"(x). Hint: teJ1J.' .7.a)p(x.I / 2 .O)p(x.O)dl"(x)) = J 1J.(x. 0 > 0. 0» density. ~"'.(x) = (I .c)'P. If (] = I.i.9)dl"(x) = J te(1J. J = 1. .X) =0. then 'ce(n(B . then ep(X.0) (see Section 3.5) where f.20 and note that X is more efficient than X for these gross error cases.b)dl"(x) J 1J.O) is defined with Pe probability I but < x. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n. X n be i. Find the efficiency ep(X. This is compatible with I(B) = 00.y. 0) = . 2.4.~ for 0 > x and is undefined for 0 g~(X..Section 5.O)p(x. evaluate the efficiency for € = . Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.2.0)p(x.10. 3.'" ..(x).05.. 4. . Show that A6' implies A6.0) ~ N(O.O))dl"(x) ifforalloo < a < b< 00. Show that   ep(X..  = a?/o.(x.exp(x/O). X) = 1r/2. Let XI. denotes the N(O.d.    .(x .(1:e )n ~ 1.. . ( 2 ). and O WIth .0.Xn ) is the MLE. lJ :o(1J.jii(Oj .(x) + <'P. x E R. .O). " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 . (a) Show that g~ (x.b)p(x.B)) ~ [(I/O).0.
i I £"+7.In ~ "( n & &8 10gp(Xi..22).4. 8)p(x. (8) Show that in Theorem 5. Apply the dominated convergence theorem (B. 0 ~N ~ (.14.80 + p(X . A2. j.50)..(X. &l/&O so that E.·_t1og vn n j = p( x"'o+.O) 1(0) and Hint: () _ g.in. 8 ) 0 .~ (X.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.[ L:. ) n en] .00) . I 1 .p ~ 1(0) < 00..4.. ) ! .4. . Show that the conclusions of PrQple~ 5.Oo) p(Xi . 80 +. I . i.5 continue to hold if . 1(80 ).) P"o (X ."log .in) . 0).O') . . 8 ) i . I (c) Prove (5. and A6 hold for."('1(8) . under F oo ' and conclude that . ~log (b) Show that n p(Xi .i.4) to g.0'1 < En}'!' 0 .18 .in) 1 is replaced by the likelihood ratio statistic 7.7. .in) ~ . Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".80 + p(Xi .i (x.In) + . Suppose A4'.80 p(X" 80) +.< ~log n p(X... 8) is continuous and I if En sup {:. ~ L.g. 0 for any sequence {en} by using (b) and Polyii's theorem (A.i'''i~l p(Xi. Show that 0 ~ 1(0) is continuous. O. ..358 5. I I j 6.i(X. I .
61) for X ~ 0 and L 12.a.7). the bound (5.B lower confidence bounds.6 and Slutsky's theorem.2.4.5 hold.4.54) and (5. if X = 0 or 1.2.4.4. gives B Compare with the exact bound of Example 4.4.Section 5. setting a lower confidence bound for binomial fJ.14. at all e and (A4'). f is a is an asymptotic lower confidence bound for fJ.4. Hint. (Ii) Compare the bounds in Ca) with the bound (5. ~ 1 L i=I n 8'[ ~ 8B' (Xi.6 Problems and Complements 359 8.4. hence.4. Hint: Use Problems 5. all the 8~i are at least as good as any competitors. 9. Establish (5. B' nj for j = 1. .4. Compare Theorem 4. (a) Establish (5. Consider Example 4.61).Q" is asymptotically at least as good as B if.3. (Ii) Suppose the conditions of Theorem 5. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. A. Hint: Use Problem 5. (a) Show that under assumptions (AO)(A6) for all Band (A4').59).5 and 5.4.:vell .58). (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ). 11.4.4. Let [~11.4. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .57) can be strengthened to: For each B E e. B). 10. Let B~ be as in (5. Let B be two asymptotic l.5.3). which agrees with (4. We say that B >0 nI Show that 8~I and. and give the behavior of (5.4. (a) Show that.59) coincide.4. which is just (4.6.7.2. ~. B' _n + op (n 1/') 13.4.59). for all f n2 nI n2 .56).4. Then (5. = X. (Ii) Deduce that g.
X n be i. A exp .11)..2. " " I I .5 ! 1. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : .. ! I • = '\0 versus K : . 1) distribution..360 1 Asymptotic Approximations Chapter 5 14. the test statistic is a sum of i. . 7f(1' 7" 0) = 1 .] versus K . That is. Consider the Bayes test when J1. 1). Consider the problem of testing H : I' E [0. By Problem 4. 1.\ l . LJ. if H is false..\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. (b) Suppose that I' ).i. Problems for Section 5. the pvalue fJ . = O.i.=2[1  <1>( v'nIXI)] has a I U(O.•. phas a U(O.. 2.21J2 2x A} . Consider testing H : j.I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox").1.5.1. (a) Show that the test that rejects H for large values of v'n(X . Hint: By (3. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0.t = 0 versus K : J.A 7" 0.d. 1) distribution. I' has aN(O. Establish (5.) has pvalue p = <1>(v'n(X .I). N(I'.4.4. 1. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. I .d. (c) Suppose that I' = 8 > O. variables with mean zero and variance 1(0). ! (a) Show that the posterior probability of (OJ is where m n (... is a given number. X n i. X > 0. J1. the evidence I .. Jl > OJ .4. 1 .LJ. That is. Show that (J/(J l'. Show that (J l'. .d. .2. 1 15.\ < Ao.riX) = T(l + nr 2)1/2rp (I+~~. =f:.'" XI! are ij. (d) Find the Wald test for testing H : A = AO versus K : A < AO. 0 given Xl.6. where Jl is known. .1 and 3. where . (e) Find the Rao score test for testing H : . N(Jt. Let X 1> .2.)) and that when I' = LJ.10) and (3..LJ. inverse Gaussian with parameters J.i.d.)l/2 Hint: Use Examples 3.. 00. Suppose that Xl.55). Now use the central limit theorem.. .L and A. T') distribution.\ > o. > ~.1. each Xi has density ( 27rx3 A ) 1/2 {AX + J.01 .
5..5. all . Extablish (5.5. for all M < 00.. Fe. ~ L N(O.O'(t)). 4.5. 10 .1 is not in effect.645 and p = 0.034 .2. By Theorem 5.' " .4.I).17).(Xi.052 100 .. Suppose that in addition to the conditions of Theorem 5.13) and the SLLN.17).058 20 . P. Apply the argnment used for Theorem 5. J:J').5. yn(E(9 [ X) . (e) Verify the following table giving posterior probabilities of 1.046 .2.i Itlqn(t)dt < J:J').2 it is equivalent to show that J 02 7f (0)dO < a.~~(:.14).1 I'> = 1.sup{ :.i It Iexp {iI(O)'.. Show that the posterior probability of H is where an = n/(n + I).01 < 0 } continuThen 00.O') : 100'1 < o} 5. ifltl < ApplytheSLLN and 0 ous at 0 = O. Hint: In view of Theorem 5.s..)}.Section 5. L M M tqn(t)dt ~ 0 a.Oo»)+log7f(O+ J.1. yn(anX   ~)/ va.050 logdnqn(t) = ~ {I(O). [0.042 .for M(.} dt < .. 1) and p ~ L U(O.029 . (Lindley's "paradox" of Problem 5.0 3.6 Problems and Complements 361 (b) Suppose that J.2 and the continuity of 7f( 0).8) ~ 0 a.. Establish (5. Hint: By (5.:.l has a N(O.05. ~ E..) sufficiently large. oyn. 1) prior.I(Xi. ~l when ynX = n 1'>=0.s.5. Hint: 1 n [PI . (e) Show that when Jl ~ ~.) (d) Compute plimn~oo p/pfor fJ.S.5..0'): iO' .054 50 .(Xi. i ~. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .( Xt. By (5..
A.5. Show tbat (5.) ~ 0 and I jtl7r(t)dt < co. Fisher (1925).I(Xi}»} 7r(t)dt. • .f. Bernstein and R. . 7. I: .".) and A5(a. J I . . dn Finally... L . (a) Show that sup{lqn (t) . all d and c(d) / in d. I i. von Misessee Stigler (1986) and Le Cam and Yang (1990). qn (t) > c} are monotone increasing in c.!. . i=l }()+O(fJ) I Apply (5. . . for all 6.I Notes for Section 5. Fn(x) is taken to be O. . (I) If the rightband side is negative for some x. . ~ II• '. ~ 0 a.5.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) .s.8) and (5. 5. Finally.4 (1) This result was first stated by R.) by A4 and A5. (0» (b) Deduce (5.s. The sets en (c) {t . " . . For n large these do not correspond to distributions one typically faces.5. .1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 .s.1. 362 f Asymptotic Approximations Chapter 5 > O. Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d.7 NOTES Notes for Section 5. (O)<p(tI. 'i roc J.2 we replace the assumptions A4(a.d) for some c(d).Pn where Pn a or 1. ! ' (2) Computed by Winston Cbow. A proof was given by Cramer (1946). • Notes for Section 5.29). Suppose that in Theorem 5.5. . Notes for Section 5. .9) hold with a.1 . See Bhattacharya and Ranga Rao (1976) for further discussion. . convergence replaced by convergence in Po probability.5. (1) This famous result appears in Laplace's work and was rediscovered by S.5 " I i .16) noting that vlnen. .3 ).5.S. jo I I . I : ItI < M} O. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.
5.8 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer-Verlag, 1985.
BHATTACHARYA, R. N., AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New York: Wiley, 1976.
BILLINGSLEY, P., Probability and Measure New York: Wiley, 1979.
BOX, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 318-324 (1953).
CRAMER, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.
DAVID, F. N., Tables of the Correlation Coefficient, reprinted in Biometrika Tables for Statisticians (1966), Vol. I, H. O. Hartley and E. S. Pearson, Editors Cambridge: Cambridge University Press, 1938.
FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996.
FISHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).
FISHER, R. A., Statistical Inference and Scientific Method, 2nd ed., 1958.
HAMMERSLEY, J. M., AND D. C. HANSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.
HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-30 (1963).
HUBER, P. J., The Behavior of the Maximum Likelihood Estimator Under Non-Standard Conditions, Proc. Vth Berkeley Symp. Math. Statist. Prob., Vol. I Berkeley, CA: University of California Press, 1967.
LE CAM, L., AND G. YANG, Asymptotics in Statistics, Some Basic Concepts New York: Springer, 1990.
LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.
LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation New York: Springer-Verlag, 1998.
NEYMAN, J., AND E. L. SCOTT, "Consistent estimates based on partially consistent observations," Econometrica, 16, 1-32 (1948).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw Hill, 1987.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons, 1980.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WILSON, E. B., AND M. M. HILFERTY, "The distribution of chi square," Proc. Nat. Acad. Sci. U.S.A., 17, 684 (1931).
However. the multinomial (Examples 1. There is.2. however.3). 365 . with the exception of Theorems 5. in which we looked at asymptotic theory for the MLE in multiparameter exponential families. and efficiency in semiparametric models. we have not considered asymptotic inference. the number of observations.6. the fact that d. confidence regions.5.3. curve estimates. multiple regression models (Examples 1.or nonpararnetric models.4. and prediction in such situations.3).Chapter 6 INFERENCE IN THE MUlTIPARAMETER CASE 6. for instance.1) and more generally have studied the theory of multiparameter exponential families (Sections 1.2. are often both large and commensurate or nearly so. and n. the number of parameters.and semiparametric models. Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.8 C R d We have presented several such models already.2. 2.4. 2.1.6. an important aspect of practical situations that is not touched by the approximation. [n this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates. tests. The inequalities ofVapnikChervonenkis. testing. the bootstrap. 1. often many.2 and 5.1. This chapter is a leadin to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible.3. 2. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for functionvalued statistics. and confidence regions in regular onedimensional parametric models for ddirnensional models {PO: 0 E 8}. 2.7. real parameters and frequently even more semi. the properties of nonparametric MLEs. the modeling of whose stochastic structure involves complex models governed by several.1 INFERENCE FOR GAUSSIAN LINEAR MODElS • Most modern statistical questions iovol ve large data sets. We shall show how the exact behavior of likelihood procedures in this model correspond to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular ddimensional parametric models and shall illustrate our results with a number of important examples.3.
.2(4) in this framework. The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i.. Regression.. • . " 'I and Y = Z(3 + e.1.l)T. i = l. the Zij are called the design values.. are U.Zip.. Here is Example 1. .1 is also of the fonn (6.j .. • i .. say the ith case.€nareij. I I.1. .. In vector and matrix notation. These are among the most commonly used statistical techniques. The OneSample Location Problem. We have n independent measurements Y1 . The model is Yi j = /31 + €i.1. when there is no ambiguity. We consider experiments in which n cases are sampled from a population. we have a response Yi and a set of p .. Example 6.n (6.Herep= 1andZnxl = (l.3): l I • Example 6. " n (6. 366 Inference in the Multiparameter Case Chapter 6 ! ." . In this section we will derive exact statistical procedures under the assumptions of the model (6.1.d.3).. Z = (Zij)nxp. ...1. 1 en = LZij{3j j=1 +Ei. and Z is called the design matrix.1.2J) (6.d. '. and for each case. . In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI. .4) 0 where€I. .6 we will investigate the sensitivity of these procedures to the assumptions of the model.1.. n (6.1. let expressions such as (J refer to both column and row vectors.N(O. .2) Ii .'" .1.1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil. . i = 1. .1. .. Notational Convention: In this chapter we will.1.4 and 2. Here Yi is called the response variable.1.1 covariate measurements denoted by Zi2 • .zipf.1) j .. 0"2). .5) .1.. we write (6. In Section 6.1.. N(O. i .3).1. i = 1. . . It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6..l p. andJ is then x nidentity matrix.a2). .Zip' We are interested in relating the mean of the response to the covariate values.2. .3) whereZi = (Zil. . I The regression framewor~ of Examples 1. e ~N(O. 6. 1 Yn from a population with mean /31 = E(Y).
5) is called the fixed design norrnallinear regression model. (6. .1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32.. The model (6. Yn1 + 1 .9.6) where Y kl is the response of the lth subject in the group obtaining the kth treatment. (6. To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k..Yn . 0 Example 6.~) random variables.1fwe set Zil = 1.. · · . In this case.4. .. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values. TWosample models apply when the design values represent a qualitative factor taking on only two values./..3 we considered experiments involving the comparisons of two population means when we had available two independent samples. If we are comparing pollution levels.1. We treat the covariate values Zij as fixed (nonrandom). The pSample Problem Or One.0: because then Ok represents the difference between the kth and average treatment . 1 < k < p.Way Layout.1. nl + . where YI.3) applies.3. To see that this is a linear model we relabel the observations as Y1 .3. + n p = n. Generally. ."i = 1. Yn1 correspond to the group receiving the first treatment.. this terminology is commonly used when the design values are qualitative. .1. we often have more than two competing drugs to compare. . 13k is the mean response to the kth treatment. and the €kl are independent N(O. .•. one from each population.1.6) is an example of what is often called analysis ojvariance models.Section 6. .2) and (6. Then for 1 < j < p. The random design Gaussian linear regression model is given in Example 1.1..6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k . the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros. In Example I.3 and Section 4.1. . The model (6. and so on.1.. If the control and treatment responses are independent and nonnally distributed with the same variance a 2 . and so on. /3p are called the regression coefficients.. then the notation (6. Ynt +n2 to that getting the second. n. we arrive at the one~way layout or psample model. if no = 0. we want to do so for a variety of locations. we are interested in qualitative factors taking on several values. Frequently.
j = I •. . . (3 E RP}. . by the GramSchmidt process) (see Section B. 1 JLn)T I' where the Cj = Z(3 = L j=l P . b I _ .po In tenns of the new parameter {3* = (0'. an orthononnal basis VI. is common in analysis of variance models. However.• znj)T.P..368 effects. . i = T + 1. i = 1. the linear Y = Z'(3' + E. the vector of means p of Y always is. .. vT n i I t and that T = L(vTt)v. " . It follows that the parametrization ({3. Z) and Inx I is the vector with n ones. Ui = v..6jCj are the columns of the design matrix. n... and with the parameters identifiable only once d ..7) !. • I t Ew ¢:> t = L:(v[t)Vi i=l ¢:> vTt = 0. The parameter set for f3 is RP and the parameter set for It is W = {I' = Z(3... •. .1. . . We now introduce the canonical variables and means . E ~N(O. When VrVj = O. It is given by 0 = (itl. k model is Inference in the Multiparameter Case Chapter 6 1. • . We assume that n > r. of the design matrix.. . Cj = (Zlj.. j = I. . . Note that Z· is of rank p and that {3* is not identifiable for {3* E RP+l.17). V r span w. . Because dimw = r..a 2 ) is identifiable if and only ifr = p(Problem6.Y. i=I (6. r i' . . there exists (e. Let T denote the number of linearly independent Cj. .. •• Note that w is the linear space spanned by the columns Cj.. {3* is identifiable in the pdimensional linear subspace {(3' E RP+l : L:~~l dk = O} of RP+ 1 obtained by adding the linear restriction E~=l 15k = 0 forced by the definition of the 15k '5.p. Recall that orthononnal means Vj = 0 fori f j andvTvi = 1. i5p )T... The Canonical Fonn of the Gaussian Linear Model The linear model can be analyzed easily using some geometry. This type oflinear model with the number of columns d of the design matrix larger than its rank r. then r is the rank of Z and w has dimension r. ..r additional linear restrictions have been specified.. .. V n for Rn such that VI.g..(T2J) where Z.2). ... .. we call Vi and Vj orthogonal.. Note that any t E Jl:l can be written .1. TJi = E(Ui) = V'[ J1. I5 h .3. j = 1. Even if {3 is not a parameter (is unidentifiable). n.! x (p+l) = (1. n. i .
. (3.. . .. . which are sufficient for (IJ. (6.. . is Ui = making vTit.11) _ n log(21r"'). (T2)T. . . whereas (6. In the canonical/orm ofthe Gaussian linear model with 0.' based on Y using (6.• . Theorem 6.2 ) using the parametrization (11.2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6. and then translate them to procedures for . n because p.3.Section 6. also UMVU for CY. E w. .1... = L.1..1. Moreover. = 1.Ur istheMLEof1]I.. Let A nxn be the orthogonal matrix with rows vi. 0. (iii) U i is the UMVU estimate of1]i. • .8) So.9) Var(Y) ~ Var(U) = . 1]r)T varies/reely over Rr.).1.... n. and by Theorem B..1... where = r + 1. ais (v) The MLE of.Cr are constants. i it and U equivalent.'Ii) . .10). U. . 1 If Cl.9 N(rli. 2 0.. •..•• (ii) U 1 ..1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::.'J nxn .u) 1 ~ 2 n 2 . observing U and Y is the same thing. n.. i i = 1. . .. r.(Ui . and .2.1.2<7' L. IJ. v~. Proof..and 7J are equivalently related. ... Theorem 6. .1]r. The U1 are independent and Ui 11i = 0.)T is sufJicientfor'l.. .2 We first consider the known case. U1 . . Note that Y~AIU. which is the guide to asymptotic inference in general. 7J = AIJ..". then the MLE of 0: Ci1]i is a = E~ I CiUi.1. . .1.1. 20. it = E. '" ~ A 1'1.2 known (i) T = (UI •.2"log(21r<7 ) t=1 n 1 n _ _ ' " u. We start by considering the log likelihood £('1.2 Estimation 0. i (iv) = 1.2 and E(Ui ) = vi J. while (1]1. .10) It will be convenient to obtain our statistical procedures for the canonical variables U. (6. u) based on U £('I. .L = 0 for i = r + 1. Un are independent nonnal with 0 variance 0.2.. . n. 1 ViUi and Mi is UMVU for Mi.. Then we can write U = AY. 2 ' " 'Ii i=l ~2(T2 6.8)(6. .1.
1.1.N(~.. define the norm It I of a vector tERn by  ! It!' = I:~ . That is. (iv) By the invariance of the MLE (Section 2. (ii) U1. is q(8) = L~ 1 <.2 (ii).4 and Example 3.1.O)T E R n . .10. apply Theorem 3.. . . 1 U1" are the MLEs of 1}1. i > r+ 1.4. 1 Ur .6. .3.. r. then W . . {3. . .Ur'L~ 1 Ul)T is sufficient.. = 1. (U I .1. By Problem 3. where J is the n x n identity matrix. recall that the maximum of (6. = 0.. Ui is UMVU for E(U. . .. (iii) 8 2 =(n . 2:7 r+l un Tis sufficientfor (1]1... and is UMVU for its expectation E(W. we need to maximize n (log 27f(J2) 2 II ". . . l Cr . Proof.'  .. (i) T = (Ul1" . But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul..1. . (iv) The conclusions a/Theorem 6. wecan assume without loss of generality that 2:::'=1 c.4. .. 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1..2).. and give a geometric interpretation of ii.1. To this end. .1. By (6. 2(J i=r+l ~ U. j = 1. i = 1.2.11) by setting T j = Uj .3 and Example (iii) is clear because 3. there exists an orthonormal basis VI) .11). this statistic is equivalent to T and (i) follows.1)..) (iii) By Theorem 3. Q is UMVU.. the MLE of q(9) = I:~ I c.. " V n of R n with VI = C = (c" .1 L~ r+l U. Theorem 6. .6 to the canonical exponential family obtained from (6. . The distribution of W is an exponential family. Proof.2 .2 . By GramSchmidt orthogonalization. In the canonical Gaussian linear model with a 2 unknown. 370 Inference in the Multiparameter Case Chapter 6 I iI . n " . The maximizer is easily seen to be n 1 L~ r+l (Problem 6.Ui · If all thee's are zero. I. I . and ..1]r..• EUl U? Ul Projections We next express j1 in terms of Y."7i)2 and is minimized by setting '(Ii = Ui. Assume that at least one c is different from zero.1.3.ti· 1 . . (6. (ii) The MLE ofa 2 is n. ~i = vi'TJ. T r + 1 = L~ I and ()r+l = 1/20.11) is a function of 1}1. 1lr only through L~' 1 CUi . " as a function of 0. 2 2 ()j = 1'0/0... .. n. (i) By observation.1.2.11) is an exponential family with sufficient statistic T. obtain the MLE fJ of fl.4. (We could also apply Theorem 2.. .3. .11) has fJi 1 2 = Ui .. WI = Q is sufficient for 6 = Ct.(J2J) by Theorem B.. To show (ii). (6. ( 2 )T.4. Let Wi = vTu. by observation.r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2.'7.(v) are still valid. .2. I i I . r. To show (iv). I i > .) = '7" i = 1. "7r because... 0. (v) Follows from (iv).4.) = a.
1. any linear combination of U's is a UMVU estimate of its expectation. ZT (Y .Z/3 I' .L.1.n.1.ti. f3j and Jii are UMVU because. ZTZ is nonsingular. In the Gaussian linear model (i) jl is the unique projection oiY on L<.1..' and by Theorems 6. (i) is clear because Z.3 because Ii = l:~ 1 ViUi and Y . _.3 of. It follows that /3T (ZT s) = 0 for all/3 E RP. . . = Z T Z/3 and ZTji = ZTZfj and.14) follows.12) ii. . To show f3 = (ZTZ)lZTy..1.1.. fj = arg min{Iy .. (ii) and (iii) are also clear from Theorem r+l VjUj . /3 = (ZTZ)l ZT..' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6. .L = {s ERn: ST(Z/3) = Oforall/3 E RP}.1.. note that 6.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6.p.12) implies ZT. 0 ~ .. To show (iv).Section 6.3(iv).Z/3I' : /3 E W}.1.. That is. <T) = . equivalently. = Z/3 and (6. ) 2 or.1. j = 1.. because Z has full rank.9). The projection Yo = 7l"(Y I L.14) (v) f3j is the UMVU estimate of (3j.1 and Section 2.3 maximizes 1 log p(y. fj = (ZTZ)lZTji.Ii = .1.1..ji) = 0 and the second equality in (6.4.J) of a point y Yo =argmin{lyW: tEw}. which implies ZT s = 0 for all s E w. note that the space w.1. by (6. IY . the MLE = LSE of /3 is unique and given by (6. and  Jii is the UMVU estimate of J. any linear combination ofY's is also a linear combination of U's.n log(21f<T . /3.3 E RP.1.2(iv) and 6. then /3 is identifiable.1. i=l. Proof. We have Theorem 6.3. ~ The maximum likelihood estimate .L of vectors s orthogonal to w can be written as ~ 2:7 w.. and /3 = (ZTZll ZT 1".2<T' IY .. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2. spans w.2.13) (iv) lfp = r.jil' /(n r) (6.. Thus..
I. In this method of "prediction" of Yi. In the Gaussian linear model (i) the fitted values Y = iJ.. w~ obtain Pi = Zi(3. Taking z = Zi. Suppose we are given a value of the covariate z at which a value Y following the linear model (6. the ith component of the fitted value iJ.4..16) J I I CoroUary 6. H I I It follows from this and (B. In statistics it is also called the hat matrix because it "puts the hat on Y. 1 · .). • ~ ~ Y=HY where i• 1 . n}. .2 illustrates this tenninology in the context of Example 6. Note that by (6.14) and the normal equations (Z T Z)f3 = Z~Y. (j ~ N(f3. H T 2 =H.1.. (iv) ifp = r. .2. € ~ N(o. (12 (ZTz) 1 ).1. That is. the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. The residuals are the projection of Y on the orthocomplement of wand ·i . =H . (6. the residuals €.1.. • .1 we give an alternative derivation of (6. i = 1.. . Var(€) = (12(J .2 with P = 2. 1 < i < n. it is commOn to write Yi for 'j1i.H).15) Next note that the residuals can be written as ~ I .1. (ii) (iii) y ~ N(J.({3t + {3.Y = (J ." As a projection matrix H is necessarily symmetric and idempotent.1. i . whenp = r.)I are the vertical distances from the points to the fitted line. I .3) is to be taken.' € = Y . .ii is called the residual from this fit.372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2.14).H)). see also Section RIO. moreover. n lie on the regression line fitted to the data {('" y. The estimate fi = Z{3 of It is called thefitted value and € = y . 1 < i < n.1.1. (12(J .3) that if J = J nxn is the identity matrix. The goodness of the fit is measured by the residual sum of squares (RSS) IY .1. Y = il. then (6. and I.1.I" (12H). There the points Pi = {31 + fhzi.'. .12) and (6. . and the residual € are independent. . By Theorem 1.1. ~ ~ ·.5. 1 ~ ~ ! I ~ ~ ~ i . .ill 2 = 1 q. : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w. i = 1. = [y. L7 1 . Example 2.. We can now conclude the following.H)Y.1.
0 ~ We now return to our examples. in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k .ii = Y.4.i respectively. The OneWay Layout (continued).no The variances of . o . By Theorem 6.. . One Sample (continued).p. and € are nonnally distributed with Y and € independent. is th~ UMVU estimate of the average effect of all a 1 p = Yk . Moreover. k=I..8 and that {3j and f1i are UMVU for {3j and J1. .8) = (J2(Z T Z)1 follows from (B. In the Gaussian case.. .a of the kth treatment is ~ Pk = Yk.3. 'E) is a linear transformation of U and.3. in the Gaussian model. } is a multipleindexed sequence of numbers or variables. k = 1.Y are given in Corollary 6.8 and (3.. .2. i = 1. Here Ji = .3). In this example the nonnal equations (Z T Z)(3 = ZY become n.1.p. then replacement of a subscript by a dot indicates that we are considering the average over that subscript. then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2.2).1.. Thus. 0 Example 6. If the design matrix Z has rank P. If {Cijk _.ii...1. (n . Var(.. . . ..1. the treatments. a = {3.p. Regression (continued).8. .1 ~ Inference for Gaussian Linear Models 373 Proof (Y. Example 6..1.1. .j = 1.Ill'. and€"= Y . .. We now see that the MLE of It is il = Z.Section 6. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem.2. which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1. + n p and we can write the least squares estimates as {3k=Yk .p. ... nk(3k = LYkl..3. where n = nl + .1. k = 1.5.1 and Section 2. (not Y. 0 ~ ~ ~ ~ ~ ~ ~ Example 6. Y. The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY .81 and i1 = {31 Y. 1=1 At this point we introduce an important notational convention in statistics.1.1).8. hence.. joint Gaussian.1.
. .4. We first consider the a 2 known case and consider the likelihood ratio statistic . j. a regression equation of the form mean response ~ .L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi . .JL): JL E wo} ~ ! .6.\(y) = sup{p(Y. the LADEs are obtained fairly quickly by modem computing methods. j 1 • 1 . . in the context of Example 6..JL): JL E w} sup{p(Y. For instance. . An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El. ! j ! I I .En in (6.17). and the matrix IIziillnx3 with Zil = 1 has rank 3. .1.i 1 6.." Now.. j where Zi2 is the dose level of the drug given the ith patient.1. under H. i = 1. In general. The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w.. However. Next consider the psample model of Example 1. which together with a 2 specifies the model.1. . Thus. The LSEs are preferred because of ease of computation and their geometric properties. see Problems 1. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986).3 Tests and Confidence Intervals . Zi3 is the age of the ith patient. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H.31.3 with 13k representing the mean resJXlnse for the kth population.. n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6.1..374 Inference in the Multiparameter Case Chapter 6 Remark 6.17) I.2. . the mean vector is an element of the space {JL : J. {JL : Jti = 131 + f33zi3. which is a onedimensional subspace of Rn. I.Li = 13 E R. For more on LADEs. q < r. . . in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider. whereas for the full model JL is in a pdimensional subspace of Rn. i = 1. 1 < .1..7 and 2. = {3p = 13 for some 13 E R versus K: "the f3's are not all equal. • .1.1.2.zT . 1 and the estimates of f3 and J. • = 131 + f32Zi2 + f33Zi3. see Koenker and D'Orey (1987) and Portnoy and Koenker (1997).1) have the Laplace distribution with density . The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = ..p}. under H.al. Now we would test H : 132 = 0 versus K : 132 =I O. . (6. " .
21). 2 log . We write X..l2).l9) then.1.1.1.\(Y) It follows that = exp 1 2172 L '\' Ui' (6. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .I8) then. In the Gaussian linear model with 17 2 known.1. o . 1) distribution with OJ = 'Ida.2(v)... respectively. Note that (uda) has a N(Oi. = IJL i=q+l r JLol'. i=q+l r = a 21 fL  Ito [2 (6. V r span wand set . when H holds.IY .iio 2 1 } where i1 and flo are the projections of Yon wand wo. ..1. . Write 'Ii is as defined in (6. (}r.iii' .1.Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6..3.19). .q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l. v~' such that VI. by Theorem 6. V q (6..1.1.21 ) where fLo is the projection of fL on woo In particular.1. then r.\(Y) ~ exp {  2t21Y ..q {(}2) for this distribution.J X. . by Theorem 6._q' = AJL where A L 'I.1. Proposition 6.q( (}2) distribution with 0 2 =a 2 ' \ '1]i L. .. span Wo and VI.\(Y) Proof We only need to establish the second equality in (6. . 2 log . .\(Y) = L i=q+l r (ud a )'. But if we let A nx n be an orthogonal matrix with rows vi.'" (}r)T (see Problem B.1... We have shown the following.20) i=q+I 210g.\(Y) has a X.
JLo.23) .18).21ii..J.3. We have seen in PmIXlsition 6. Proposition 6.E Wo for K : Jt E W .1.22) Because T = {n . it can be shown (Problem 6. In particular. T = n .2. we can write . .nr distribution.2 is known" is replaced by '.a = ~(Y'E..22). /L.2 is the same under Hand K and estimated by the MLE 0:2 for /. Remark 6..r.r IY .max{p{y. In Proposition 6./L.2 1JL . Substituting j). has the noncentral F distribution F r _q.1. 8 2 . is poor compared to the fit under the general model.I}.1.5) that if we introduce the variance equal likelihood ratio w'" .1.L a =  n I' and .L n where p{y. and 86 into the likelihood ratio statistic./to I' ' n J respectively. Thus./Lol'.JtoI 2 .1 suppose the assumption "0. . The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r .q variable)/df (central X~T variahle)/df with the numerator and denominator independent. {io.a:.'I/L. The resulting test is intuitive. I . statistic :\(y) _ max{p{y.r degrees affreedam (see Prohlem B. In the Gaussian linear model the F statistic defined by (6.iT') : /L E wo} (6.14). T has the representation . as measured by the residual sum of squares under the model specified by H. T is called the F statistic for the general linear hypothesis.l) {Ir .376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown.1.~I./L.iT'): /L E w} .liol' IY r _q IY _ iii' iii' (r .I.r){r .q and n .19).1 that 021it .0. ]t consists of rejecting H when the fit.1. .liol' (n _ r) 'IY _ iii' .q and m = n . = aD IIY . (6.1. IIY . We write :h. .. T is an increasing function of A{Y) and the two test statistics are equivalent.m{O') for this distrihution where k = r . . We have shown the following.• j ..1.') denotes the righthand side of (6.'}' YJ..1.1. T has the (central) Frq.1 that the MLEs of a 2 for It E wand It E wQ are . J • T = (noncentral X:.wo. we obtain o A(Y) = P Y. E In this case.n_r(02) where 0 2 = u... We know from Problem 6.'IY _ iii' = L~ r+l (Ui /0)2. which has a X~r distribution and is independent of u.itol 2 = L~ q+ 1 (Ui /0) 2. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.1.q)'{[A{Y)J'/n .q)'Ili . By the canonicaL representation (6.2.itol 2 have a X.(Y). when H I lwlds. which is equivalent to the likelihood ratio statistic for H : Ii._q(02) distribution with 0' = .
Example 6.1 Inference for Gaussian Linear Models ~ 377 then >. The projections il and ilo of Yon w and Wo.(Y) equals the likelihood ratio statistic for the 0'2.1.1. In this = (Y 1'0)2 (nl) lE(Y.iLol' = IY .2.22).1. We test H : (31 case wo = {I'o}.r)ln central X~_cln where T is the F statistic (6.1. 0 .25) which we exploited in the preceding derivations.Y)2' which we recognize as t 2 / n. This is the Pythagorean identity.~T = (n .iLol'.9. See Pigure 6.iLl' ~++Yl I' I' 1 .1. (6.1 and Section B. (6.IO. It follows that ~ (J2 known case with rT 2 replaced by r. Y3 y Iy .3.1.19) made it possible to recognize the identity IY .\(Y) = . The canonical representation (6. where t is the onesample Student t statistic of Section 4.1. We next return to our examples.1.24) Remark 6.1'0 I' y.Section 6. and the Pythagorean identity.iLl' + liL . q = 0.q noncentral X? q 2Iog. One Sample (continued).1.1. r = 1 and T = 1'0 versus K : (3 'f 1'0. Yl = Y2 Figure 6.
. P .ito 1 are the residual sums of squares under the full model and H. 02 simplifies to (J2(p . .. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62. anddfF = np and dfH = nq are the corresponding degrees of freedom. respectivelY. . •.26) 0 versus K : f3 2 i' O. which only depends on the second set of variables and coefficients. 7) iW l.  I 1. Regression (continued).1..1. The One. 1 02 = (J2(p .1. Using (6.q covariates does not affect the mean response.vy.q) x 1 vector of main (e..: I :I T _ n ~ p L~l nk(Y .27) 1 ..n_p(fP) distribution with noncentrality parameter (Problem 6. Under the alternative F has a noncentral Fp_q... Now the linear model can be written as i i I I ! We test H : f32 (6. L~"(Yk' . Z2) where Z1 is n x q and Z2 is 11 X (p . j ! j • litPol2 = LL(Yk _y. treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e. (6. In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6.. '1 Yp.Q)f3f(ZfZ2)f32.np distribution. and we partition {3 as f3T = (f3f.P .1 L~=..1.{3p are Y1.g.. . respectively. Under H all the observations have the same mean so that. 0 I j Example 6..' • ito = Thus.1. .. However. To formulate this question.g.q covariates in multiple regression have an effect after fitting the first q. The F test rejects H if F is large when compared to the ath quantile of the Fpq.q)1f3nZfZ2 .1.RSSF)/(dflf . I3I) where {32 is a (p . We consider the possibility that a subset of p ..1. In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2.2. we want to test H : {31 = .}f32.)2= Lnk(Yk.3. • 1 F = (RSSlf .1.y)2 k . = {3p.q). Recall that the least squares estimates of {31. age. P nk (Y.1.22) we obtain the F statistic for the hypothesis H in the oneway layout .26) and H. .22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i . Without loss of generality we ask whether the last p .Way Layout (continued).)2. . I • ! . _y. we partition the design matrix Z by writing it as Z = (ZI.ZfZl(ZiZl)'ziz. economic status) coefficients..)2' ..dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY .=11=1 k=1 •• Substituting in (6. As we indicated earlier. I • I b . .Yk.n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6.
.Y. .. are not all equal. See Tables 6.1 Inference for Gaussian Linear Models 379 When H holds.1. SST/a 2 is a (noncentral) X2 variable with (n ..1) degrees of freedom and noncentrality parameter 0'..1) degrees offreedom as "coming" from SS8/a' and the remaining (n .28) where j3 = n I :2:=~"" 1 nifJi. This information as well as S8B/(P . is a measure of variation between the p samples YII . T has a Fpl.)'. and III with I being the "high" end. then by the Pythagorean identity (6. .1..p) degrees of freedom. We assume the oneway layout is valid. the within groups (or residual) sum of squares. .. respectively. compute a.1. If the fJ.1... identifying 0' and (p .Section 6. . Because 88B/0" and SSw / a' are independent X' variables with (p . sum of squares in the denominator. .. (3p.p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'. .fJ.. the unbiased estimates of 02 and a 2 .. are often summarized in what is known as an analysis a/variance (ANOVA) table. T has a noncentral Fp1.3.. SSB. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic. and the F statistic.fLo 12 for the vector Jt (rh.25) 88T = 888 + 88w . SST.np distribution.1 and 6. consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I. which is their ratio..1) and (n . As an illustration.1.)' .np distribution with noncentrality parameter (6. k=l 1=1 measures variation within the samples..···.2 1M . (3p)T and its projection 1'0 = (ij. The sum of squares in the numerator.30) can also be viewed stochastically.1) and 8Sw /(n . Note that this implies the possibly unrealistic . . . into two constituent components. we have a decomposition of the variability of the whole set of data. Ypl . . k=1 1=1 which measures the variability of the pooled samples. (3" .29) Thus.1. 888 =L k=I P nk(Yk.. the total sum of squares. the between groups (or treatment) sum of squares and SSw. (31. . . . Ypnp' The 88w ~ L L(Y Y p "' k'  k )'. To derive IF. n. Yi nl . [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y.p). (6. we see that the decomposition (6..
j:
I ,
380
Inference in the Multiparameter Case Chapter 6
,
:
TABLE 6.1.1. ANOYA table for the oneway layout
Sum of squares
d.f.
,
Between samples
Within samples
SSe
I:
r I lld\'k_
Mean squares
F value
1\1 S '
i'dS B
Lf!
P
1
MSB
~
, ,
Total
58W  ""k '1'1n I "(I'k l  I')'  I " k 55T  ,. P 1 L "I" 1 (I'kl _ I' )' ~ k I
np
Tl  1
" '''
A1S w = SSw
1
1
j
TABLE 6.1.2. Blood cholesterol levels
I
J
286 290
II III
403 312 403
311 222 244
269 302 353
336 420 235
259 420 319
386 260
353
210
l 1
I
I
,
i,
.!
, , I
I , ,
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here p = 3, nl = 5, n2 = 10, n3 = 6. n = 21, and we compute
TABLE 6.1.3. ANOYA table for the cholesterol data
;
,
88
Between groups Within groups Total
I
,
,I
1202.5 85,750,5 86,953.0
dJ, 2 18 20
M8
601.2 4763,9
F~value
0.126
'I
From :F tables, we find that the pvalue corresponding to the Fvalue 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. 0 Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares SST into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis oj variance. They can be fonnulated in any linear model including regression models. See Scheff" (1959, pp. 4245) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of infonnation we call ANOVA tables.
,
,I "
,
,
,
Section 6.1
Inference for Gaussian linear Models
~
381
Confidence Intervals and Regions We next use our distributional results and the method of pivots to find confidence intervals for J.li, 1 < i < n, !3j, 1 <j < p, and in general, any linear combination
n
,p
= ,p(/l) = Lai/Li
i::: 1
~ aT /l
of the J.l's. If we set;j;
= 1:7
1 ai/Ii
= aT fl and
~
~
where H is the hat matrix, then (,p  ,p)ja(,p) has a N(O, I) distribution. Moreover,
(n  r)8 2ja 2 ~
IY  iii 2 ja 2 =
L
i=r+l
n
(Uda 2 )'
has a X~r distribution and is independent of ;;;. Let
~
~
be an estimate of the standard deviation a('IjJ) of
~
'0. This estimated standard deviation is
called the standard error of 'IjJ. By referring to the definition of the t distribution, we find that the pivot
has a TnT. distribution. Let t n  r (1  40) denote the 1 ~o: quantile of the bution, then by solving IT(,p)1 < t n _, (I  ~",) for,p, we find that
Tn r
distri
is, in the Gaussian linear model, a 100(1  a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider'IjJ = p. We obtain the interval
i' ~ Y ± t n 1
(1
~q) sj.,fii,
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = T. First consider 1/J = f3j for some specified ~gression coefficient (3j. The 100(1  a)% confidence interval for (3j is
(3j
}" = (3j ± t n p (','" s{ [ (ZT Z) 11 j j ' 1 )
382
Inference in the Multiparameter Case
Chapter 6
where $[(Z^T Z)^{-1}]_{jj}$ is the $j$th diagonal element of $(Z^T Z)^{-1}$. Computer software computes $(Z^T Z)^{-1}$ and labels $s\{[(Z^T Z)^{-1}]_{jj}\}^{1/2}$ as the standard error of the (estimated) $j$th regression coefficient.

Next consider $\psi = \mu_i$ = mean response for the $i$th case, $1 \le i \le n$. The level $(1-\alpha)$ confidence interval is

$\mu_i = \widehat{\mu}_i \pm t_{n-p}(1 - \tfrac{1}{2}\alpha)\, s\sqrt{h_{ii}},$

where $h_{ii}$ is the $i$th diagonal element of the hat matrix $H$. Here $s\sqrt{h_{ii}}$ is called the standard error of the (estimated) mean of the $i$th case.

Next consider the special case in which $p = 2$ and

$Y_i = \beta_1 + \beta_2 z_{i2} + \epsilon_i, \quad i = 1, \ldots, n.$

If we use the identity

$\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})(Y_i - \bar{Y}) = \sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2}) Y_i,$

we obtain from Example 2.2.2 that

$\widehat{\beta}_2 = \frac{\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2}) Y_i}{\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})^2}. \qquad (6.1.30)$

Because $\mathrm{Var}(Y_i) = \sigma^2$, we obtain

$\mathrm{Var}(\widehat{\beta}_2) = \sigma^2 \Big/ \sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})^2,$

and the $100(1-\alpha)\%$ confidence interval for $\beta_2$ has the form

$\beta_2 = \widehat{\beta}_2 \pm t_{n-p}(1 - \tfrac{1}{2}\alpha)\, s \Big/ \big[\textstyle\sum (z_{i2} - \bar{z}_{\cdot 2})^2\big]^{1/2}.$

The confidence interval for $\beta_1$ is given in Problem 6.1.10.
Similarly, in the $p = 2$ case, it is straightforward (Problem 6.1.10) to compute

$h_{ii} = \frac{1}{n} + \frac{(z_{i2} - \bar{z}_{\cdot 2})^2}{\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})^2},$

and the confidence interval for the mean response $\mu_i$ of the $i$th case has a simple explicit form.
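As an aside (not in the text), these interval formulas translate directly into a short computation. The following minimal sketch, assuming numpy and scipy and a full rank design with $p = r$, returns the coefficient and mean-response intervals just described.

```python
import numpy as np
from scipy import stats

def gaussian_lm_intervals(Z, Y, alpha=0.05):
    """Confidence intervals for beta_j and the mean responses mu_i in the
    Gaussian linear model Y = Z beta + eps, eps ~ N(0, sigma^2 I)."""
    n, p = Z.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ Y
    mu_hat = Z @ beta_hat
    s2 = np.sum((Y - mu_hat) ** 2) / (n - p)       # s^2 = |Y - mu_hat|^2 / (n - p)
    t = stats.t.ppf(1 - alpha / 2, df=n - p)

    se_beta = np.sqrt(s2 * np.diag(ZtZ_inv))       # s {[(Z'Z)^{-1}]_{jj}}^{1/2}
    beta_ci = np.stack([beta_hat - t * se_beta, beta_hat + t * se_beta], axis=1)

    h = np.einsum('ij,jk,ik->i', Z, ZtZ_inv, Z)    # diagonal h_ii of the hat matrix
    se_mu = np.sqrt(s2 * h)                        # s * sqrt(h_ii)
    mu_ci = np.stack([mu_hat - t * se_mu, mu_hat + t * se_mu], axis=1)
    return beta_hat, beta_ci, mu_ci
```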
Example 6.1.3. One-Way Layout (continued). We consider $\psi = \beta_k$, $1 \le k \le p$. Because $\widehat{\beta}_k = \bar{Y}_{k\cdot} \sim N(\beta_k, \sigma^2/n_k)$, we find the $100(1-\alpha)\%$ confidence interval

$\beta_k = \bar{Y}_{k\cdot} \pm t_{n-p}(1 - \tfrac{1}{2}\alpha)\, s/\sqrt{n_k},$

where $s^2 = SSW/(n-p)$. The intervals for $\mu = \bar{\beta}_{\cdot}$ and the incremental effect $\delta_k = \beta_k - \mu$ are given in Problem 6.1.11. □
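As a quick illustration (not from the text; numpy and scipy assumed), these one-way layout intervals can be computed directly from the group samples:

```python
import numpy as np
from scipy import stats

def one_way_cis(groups, alpha=0.05):
    """Intervals Ybar_k +/- t_{n-p}(1 - alpha/2) s / sqrt(n_k), with s^2 = SSW/(n-p)."""
    n = sum(len(g) for g in groups)
    p = len(groups)
    ssw = sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups)
    s = np.sqrt(ssw / (n - p))
    t = stats.t.ppf(1 - alpha / 2, n - p)
    return [(np.mean(g) - t * s / np.sqrt(len(g)),
             np.mean(g) + t * s / np.sqrt(len(g))) for g in groups]
```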
Joint Confidence Regions

We have seen how to find confidence intervals for each individual $\beta_j$, $1 \le j \le p$. We next consider the problem of finding a confidence region $C$ in $R^p$ that covers the vector $\beta$ with prescribed probability $(1-\alpha)$. This can be done by inverting the likelihood ratio test or, equivalently, the F test. That is, we let $C$ be the collection of $\beta_0$ that is accepted when the level $(1-\alpha)$ F test is used to test $H : \beta = \beta_0$. Under $H$, $\mu = \mu_0 = Z\beta_0$, and the numerator of the F statistic (6.1.22) is based on

$|\widehat{\mu} - \mu_0|^2 = |Z\widehat{\beta} - Z\beta_0|^2 = (\widehat{\beta} - \beta_0)^T (Z^T Z)(\widehat{\beta} - \beta_0).$

Thus, using (6.1.22), the simultaneous confidence region for $\beta$ is the ellipse

$C = \Big\{ \beta_0 : \frac{(\widehat{\beta} - \beta_0)^T (Z^T Z)(\widehat{\beta} - \beta_0)}{r s^2} \le f_{r,n-r}(1 - \alpha) \Big\} \qquad (6.1.31)$

where $f_{r,n-r}(1-\alpha)$ is the $1-\alpha$ quantile of the $\mathcal{F}_{r,n-r}$ distribution.
Example 6.1.2. Regression (continued). We consider the case $p = r$ and as in (6.1.26) write $Z = (Z_1, Z_2)$ and $\beta^T = (\beta_1^T, \beta_2^T)$, where $\beta_2$ is a vector of main effect coefficients and $\beta_1$ is a vector of "nuisance" coefficients. Similarly, we partition $\widehat{\beta}$ as $\widehat{\beta} = (\widehat{\beta}_1^T, \widehat{\beta}_2^T)^T$ where $\widehat{\beta}_1$ is $q \times 1$ and $\widehat{\beta}_2$ is $(p-q) \times 1$. By Corollary 6.1.1, $\sigma^2 (Z^T Z)^{-1}$ is the variance-covariance matrix of $\widehat{\beta}$. It follows that if we let $S$ denote the lower right $(p-q) \times (p-q)$ corner of $(Z^T Z)^{-1}$, then $\sigma^2 S$ is the variance-covariance matrix of $\widehat{\beta}_2$. Thus, a joint $100(1-\alpha)\%$ confidence region for $\beta_2$ is the $(p-q)$-dimensional ellipse

$C = \Big\{ \beta_{02} : \frac{(\widehat{\beta}_2 - \beta_{02})^T S^{-1} (\widehat{\beta}_2 - \beta_{02})}{(p-q)\, s^2} \le f_{p-q,\,n-p}(1 - \alpha) \Big\}. \qquad \square$
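The following minimal sketch (not from the text; numpy and scipy assumed, full rank design with $p = r$) checks whether a candidate value $\beta_{02}$ falls inside the confidence ellipse just displayed.

```python
import numpy as np
from scipy import stats

def beta2_in_region(Z, Y, q, beta02, alpha=0.05):
    """True if beta02 lies in the joint confidence ellipse for the last p - q
    coordinates of beta in the Gaussian linear model."""
    n, p = Z.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ Y
    s2 = np.sum((Y - Z @ beta_hat) ** 2) / (n - p)
    S = ZtZ_inv[q:, q:]                       # lower right (p - q) x (p - q) corner
    diff = beta_hat[q:] - beta02
    stat = diff @ np.linalg.solve(S, diff) / ((p - q) * s2)
    return stat <= stats.f.ppf(1 - alpha, p - q, n - p)
```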
Summary. We consider the classical Gaussian linear model in which the response $Y_i$ of the $i$th case in an experiment is expressed as a linear combination $\mu_i = \sum_{j=1}^p \beta_j z_{ij}$ of covariates plus an error $\epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients $\{\beta_j\}$, the response means $\{\mu_i\}$, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: $X_1, \ldots, X_n$ are i.i.d. $P$ where $P \in \mathcal{Q}$, a model containing $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ such that

(i) $\Theta$ open $\subset R^p$.

(ii) Densities of $P_\theta$ are $p(\cdot, \theta)$, $\theta \in \Theta$.
The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. $\Psi = (\psi_1, \ldots, \psi_p)^T$, where $\psi_j = \partial \rho / \partial \theta_j$, is well defined and

$\frac{1}{n} \sum_{i=1}^n \Psi(X_i, \widehat{\theta}_n) = 0. \qquad (6.2.1)$
A solution to (6.2.1) is called an estimating equation estimate or an M estimate.
A1. The parameter $\theta(P)$ given by the solution of (the nonlinear system of $p$ equations in $p$ unknowns)

$\int \Psi(x, \theta)\, dP(x) = 0 \qquad (6.2.2)$

is well defined on $\mathcal{Q}$ so that $\theta(P)$ is the unique solution of (6.2.2). Necessarily $\theta(P_\theta) = \theta$ because $\mathcal{Q} \supset \mathcal{P}$.

A2. $E_P |\Psi(X_1, \theta(P))|^2 < \infty$ where $|\cdot|$ is the Euclidean norm.
A3. $\psi_i(\cdot, \theta)$, $1 \le i \le p$, have first-order partials with respect to all coordinates and, using the notation of Section B.8,

$D\Psi(x, \theta) \equiv \Big\| \frac{\partial \psi_i}{\partial \theta_j}(x, \theta) \Big\|_{p \times p}$ exists,

where $E_P D\Psi(X_1, \theta(P))$ is nonsingular.
A4. $\sup \Big\{ \Big| \frac{1}{n} \sum_{i=1}^n \big( D\Psi(X_i, t) - D\Psi(X_i, \theta(P)) \big) \Big| : |t - \theta(P)| \le \epsilon_n \Big\} \xrightarrow{P} 0$ if $\epsilon_n \to 0$.
A5. $\widehat{\theta}_n \xrightarrow{P} \theta(P)$ for all $P \in \mathcal{Q}$.
Theorem 6.2.1. Under A0-A5 of this section,

$\widehat{\theta}_n = \theta(P) + \frac{1}{n} \sum_{i=1}^n \overline{\Psi}(X_i, \theta(P)) + o_P(n^{-1/2}) \qquad (6.2.3)$

where

$\overline{\Psi}(x, \theta(P)) = -[E_P D\Psi(X_1, \theta(P))]^{-1} \Psi(x, \theta(P)). \qquad (6.2.4)$
Hence,

$\sqrt{n}(\widehat{\theta}_n - \theta(P)) \xrightarrow{\mathcal{L}} N(0, \Sigma(\Psi, P)) \qquad (6.2.5)$

where

$\Sigma(\Psi, P) = J(\theta(P), P)\, E_P \Psi \Psi^T(X_1, \theta(P))\, J^T(\theta(P), P) \qquad (6.2.6)$

and

$J^{-1}(\theta, P) = E_P D\Psi(X_1, \theta(P)) = \Big\| E_P \frac{\partial \psi_i}{\partial \theta_j}(X_1, \theta(P)) \Big\|.$
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
$-\frac{1}{n} \sum_{i=1}^n \Psi(X_i, \theta(P)) = \Big[ \frac{1}{n} \sum_{i=1}^n D\Psi(X_i, \theta_n^*) \Big] (\widehat{\theta}_n - \theta(P)). \qquad (6.2.7)$
Note that the left-hand side of (6.2.7) is a $p \times 1$ vector; the right is the product of a $p \times p$ matrix and a $p \times 1$ vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular $p \times p$ matrices, when viewed as vectors, is an open subset of $R^{p^2}$, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, $\frac{1}{n} \sum_{i=1}^n D\Psi(X_i, \theta_n^*)$ is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of $\widehat{\theta}_n$ is motivated by $\mathcal{P}$, the behavior in (6.2.3) is guaranteed for $P \in \mathcal{Q}$, which can include $P \notin \mathcal{P}$. In fact, typically $\mathcal{Q}$ is essentially the set of $P$'s for which $\theta(P)$ can be defined uniquely by (6.2.2).

We can again extend the assumptions of Section 5.4.2 to:

A6. If $l(\cdot, \theta)$ is differentiable,

$E_\theta D\Psi(X_1, \theta) = -E_\theta \Psi(X_1, \theta) Dl^T(X_1, \theta) = -\mathrm{Cov}_\theta(\Psi(X_1, \theta), Dl(X_1, \theta)) \qquad (6.2.8)$

defined as in B.5.2.

The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4$'$ and A6$'$ extend to the multivariate case readily. Note that consistency of $\widehat{\theta}_n$ is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate $\widetilde{\theta}_n$ will find a solution $\widehat{\theta}_n$ of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
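As a concrete illustration of (6.2.1)-(6.2.6) (not from the text), the following minimal sketch computes a coordinate-wise Huber location M-estimate and the plug-in "sandwich" estimate of its asymptotic variance; numpy/scipy, the Huber $\psi$, and the tuning constant $k = 1.345$ are choices assumed here purely for illustration.

```python
import numpy as np
from scipy import optimize

def huber_psi(u, k=1.345):
    return np.clip(u, -k, k)

def m_estimate_location(X, k=1.345):
    """Coordinate-wise Huber location M-estimate for i.i.d. p-vectors X_1,...,X_n,
    with a sandwich estimate of the asymptotic variance Sigma(Psi, P) of (6.2.6)."""
    n, p = X.shape
    Psi = lambda t: huber_psi(X - t, k)                  # n x p array of psi values
    theta_hat = optimize.root(lambda t: Psi(t).mean(axis=0),
                              x0=np.median(X, axis=0)).x

    psi_vals = Psi(theta_hat)                            # Psi(X_i, theta_hat)
    # D Psi is diagonal here: d/dt psi(x - t) = -1{|x - t| <= k} coordinate-wise.
    J_inv = np.diag(-(np.abs(X - theta_hat) <= k).mean(axis=0))
    J = np.linalg.inv(J_inv)
    Sigma_hat = J @ (psi_vals.T @ psi_vals / n) @ J.T    # J E[Psi Psi^T] J^T
    return theta_hat, Sigma_hat / n                      # approximate variance of theta_hat
```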
6.2.2 Asymptotic Normality and Efficiency of the MLE

If we take $\rho(x, \theta) = -l(x, \theta) = -\log p(x, \theta)$, and $\Psi(x, \theta) = Dl(x, \theta)$ obeys A0-A6, then (6.2.8) becomes

$-E_\theta D^2 l(X_1, \theta) = E_\theta Dl(X_1, \theta) D^T l(X_1, \theta) = \mathrm{Var}_\theta\, Dl(X_1, \theta) \qquad (6.2.9)$

where the common value is the Fisher information matrix $I(\theta)$ introduced in Section 3.4. If $\rho : \Theta \to R$, $\Theta \subset R^d$, is a scalar function, the matrix $\| \frac{\partial^2 \rho}{\partial \theta_i \partial \theta_j}(\theta) \|$ is known as the Hessian or curvature matrix of the surface $\rho$. Thus, (6.2.9) states that the expected value of the Hessian of $l$ is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If A0-A6 hold for $\rho(x, \theta) = -\log p(x, \theta)$, then the MLE $\widehat{\theta}_n$ satisfies

$\widehat{\theta}_n = \theta + \frac{1}{n} \sum_{i=1}^n I^{-1}(\theta) Dl(X_i, \theta) + o_P(n^{-1/2}) \qquad (6.2.10)$

so that

$\sqrt{n}(\widehat{\theta}_n - \theta) \xrightarrow{\mathcal{L}} N(0, I^{-1}(\theta)). \qquad (6.2.11)$

If $\widetilde{\theta}_n$ is a minimum contrast estimate with $\rho$ and $\psi$ satisfying A0-A6 and corresponding asymptotic variance matrix $\Sigma(\Psi, P_\theta)$, then

$\Sigma(\Psi, P_\theta) \ge I^{-1}(\theta) \qquad (6.2.12)$

in the sense of Theorem 3.4.4, with equality in (6.2.12) for $\theta = \theta_0$ iff, under $\theta_0$,

$\widetilde{\theta}_n = \widehat{\theta}_n + o_P(n^{-1/2}). \qquad (6.2.13)$
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8),

$\Sigma(\Psi, P_\theta) = \mathrm{Cov}_\theta^{-1}(U, V)\, \mathrm{Var}_\theta(U)\, \mathrm{Cov}_\theta^{-1}(V, U) \qquad (6.2.14)$

where $U = \Psi(X_1, \theta)$, $V = Dl(X_1, \theta)$. But by (B.10.8), for any $U, V$ with $\mathrm{Var}(U^T, V^T)^T$ nonsingular,

$\mathrm{Var}(V) \ge \mathrm{Cov}(V, U)\, \mathrm{Var}^{-1}(U)\, \mathrm{Cov}(U, V). \qquad (6.2.15)$

Taking inverses of both sides yields

$I^{-1}(\theta) = \mathrm{Var}_\theta^{-1}(V) \le \Sigma(\Psi, \theta). \qquad (6.2.16)$
Equality holds in (6.2.15) by (B.10.2.3) iff for some $b(\theta)$,

$U = b(\theta) + \mathrm{Cov}(U, V)\, \mathrm{Var}^{-1}(V)\, V \qquad (6.2.17)$

with probability 1. This means, in view of $E_\theta \Psi = E_\theta Dl = 0$, that

$\Psi(X_1, \theta) = \mathrm{Cov}_\theta(U, V)\, \mathrm{Var}_\theta^{-1}(V)\, Dl(X_1, \theta).$

In the case of identity in (6.2.16) we must then have

$[E_\theta D\Psi(X_1, \theta)]^{-1} \Psi(X_1, \theta) = -I^{-1}(\theta) Dl(X_1, \theta). \qquad (6.2.18)$

Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. $\square$
We see that, by the theorem, the MLE is efficient in the sense that for any $a_{p \times 1}$, $a^T \widehat{\theta}_n$ has asymptotic bias $o(n^{-1/2})$ and asymptotic variance $n^{-1} a^T I^{-1}(\theta) a$, which is no larger than that of any competing minimum contrast estimate. Further, any competitor $\widetilde{\theta}_n$ such that $a^T \widetilde{\theta}_n$ has the same asymptotic behavior as $a^T \widehat{\theta}_n$ for all $a$ in fact agrees with $\widehat{\theta}_n$ to order $n^{-1/2}$.

A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6 on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let $X_i = (Z_i^T, Y_i)^T$, $1 \le i \le n$, be i.i.d. as $X = (Z^T, Y)^T$ where $Z$ is a $p \times 1$ vector of explanatory variables and $Y$ is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(i) $Y = \alpha + Z^T \beta + \epsilon \qquad (6.2.19)$

where $\epsilon$ is distributed as $N(0, \sigma^2)$ independent of $Z$ and $E(Z) = 0$. That is, given $Z$, $Y$ has a $N(\alpha + Z^T \beta, \sigma^2)$ distribution.

(ii) The distribution $H_0$ of $Z$ is known with density $h_0$ and $E(ZZ^T)$ is nonsingular.

The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of $\beta$ is given by (with probability 1)

$\widehat{\beta} = [Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y. \qquad (6.2.20)$
Here $Z_{(n)}$ is the $n \times p$ matrix $\|Z_{ij} - \bar{Z}_{\cdot j}\|$ where $\bar{Z}_{\cdot j} = \frac{1}{n} \sum_{i=1}^n Z_{ij}$. We use subscripts $(n)$ to distinguish the use of $Z$ as a vector in this section and as a matrix in Section 6.1. In the present context, $Z_{(n)} = (Z_1, \ldots, Z_n)^T$ is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of $\alpha$ and $\sigma^2$ are

$\widehat{\alpha} = \bar{Y} - \sum_{j=1}^p \bar{Z}_{\cdot j} \widehat{\beta}_j, \qquad \widehat{\sigma}^2 = \frac{1}{n} |Y - (\widehat{\alpha} + Z_{(n)} \widehat{\beta})|^2. \qquad (6.2.21)$
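For readers who want to see (6.2.20)-(6.2.21) in code, here is a minimal sketch (not from the text; numpy assumed) of the random design MLEs.

```python
import numpy as np

def random_design_mles(Z, Y):
    """MLEs of beta, alpha, sigma^2 in the random design linear model: center the
    covariates, regress Y on the centered design matrix, then recover the intercept."""
    Zc = Z - Z.mean(axis=0)                               # Z_(n) = ||Z_ij - Zbar_.j||
    beta_hat = np.linalg.solve(Zc.T @ Zc, Zc.T @ Y)       # (6.2.20)
    alpha_hat = Y.mean() - Z.mean(axis=0) @ beta_hat
    resid = Y - alpha_hat - Z @ beta_hat                  # residuals from the fit
    sigma2_hat = np.mean(resid ** 2)                      # (6.2.21)
    return alpha_hat, beta_hat, sigma2_hat
```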
Note that although given $Z_1, \ldots, Z_n$, $\widehat{\beta}$ is Gaussian, this is not true of the marginal distribution of $\widehat{\beta}$. It is not hard to show that A0-A6 hold in this case because if $H_0$ has density $h_0$ and if $\theta$ denotes $(\alpha, \beta^T, \sigma^2)^T$, then

$l(X, \theta) = -\frac{1}{2\sigma^2}[Y - (\alpha + Z^T \beta)]^2 - \frac{1}{2}(\log \sigma^2 + \log 2\pi) + \log h_0(Z), \qquad (6.2.22)$

$Dl(X, \theta) = \Big( \frac{\epsilon}{\sigma^2},\ \frac{\epsilon}{\sigma^2} Z^T,\ \frac{\epsilon^2}{2\sigma^4} - \frac{1}{2\sigma^2} \Big)^T$ with $\epsilon = Y - (\alpha + Z^T \beta)$,

and

$I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 & 0 \\ 0 & \frac{1}{\sigma^2} E(ZZ^T) & 0 \\ 0 & 0 & \frac{1}{2\sigma^4} \end{pmatrix} \qquad (6.2.23)$

so that by Theorem 6.2.2

$\mathcal{L}(\sqrt{n}(\widehat{\alpha} - \alpha, \widehat{\beta} - \beta, \widehat{\sigma}^2 - \sigma^2)) \to N(0, \mathrm{diag}(\sigma^2, \sigma^2 [E(ZZ^T)]^{-1}, 2\sigma^4)). \qquad (6.2.24)$
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of $H_0$ known plays no role in the limiting result for $\widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}^2$. Of course, these will only be the MLEs if $H_0$ depends only on parameters other than $(\alpha, \beta, \sigma^2)$. In this case we can estimate $E(ZZ^T)$ by $\frac{1}{n} \sum_{i=1}^n Z_i Z_i^T$ and give approximate confidence intervals for $\beta_j$, $j = 1, \ldots, p$.

An interesting feature of (6.2.23) is that because $I(\theta)$ is a block diagonal matrix so is $I^{-1}(\theta)$ and, consequently, $\widehat{\beta}$ and $\widehat{\sigma}^2$ are asymptotically independent. In the classical linear model of Section 6.1, where we perform inference conditionally given $Z_i = z_i$, $1 \le i \le n$, we have noted this is exactly true.

This is an example of the phenomenon of adaptation. If we knew $\sigma^2$, the MLE would still be $\widehat{\beta}$ and its asymptotic variance optimal for this model. If we knew $\alpha$ and $\beta$, $\widehat{\sigma}^2$ would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, $\widehat{\sigma}^2$ would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model $\mathcal{P} = \{P_{(\theta, \eta)} : \theta \in \Theta, \eta \in \mathcal{E}\}$ we say we can estimate $\theta$ adaptively at $\eta_0$ if the asymptotic variance of the MLE $\widehat{\theta}$ (or, more generally, an efficient estimate of $\theta$) in the pair $(\widehat{\theta}, \widehat{\eta})$ is the same as that of $\widehat{\theta}(\eta_0)$, the efficient estimate for $\mathcal{P}_{\eta_0} = \{P_{(\theta, \eta_0)} : \theta \in \Theta\}$.

The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating $\beta_1$ in the presence of $\alpha, (\beta_2, \ldots, \beta_p)$ with

(i) $\alpha, \beta_2, \ldots, \beta_p$ known.

(ii) $\beta$ arbitrary.

In case (i), we take, without loss of generality, $\alpha = \beta_2 = \cdots = \beta_p = 0$. Let $Z_i = (Z_{i1}, \ldots, Z_{ip})^T$; then the efficient estimate in case (i) is

$\frac{\sum_{i=1}^n Z_{i1} Y_i}{\sum_{i=1}^n Z_{i1}^2} \qquad (6.2.25)$
with asymptotic variance $\sigma^2 [E Z_{11}^2]^{-1}$. On the other hand, $\widehat{\beta}_1$ is the first coordinate of $\widehat{\beta}$ given by (6.2.20). Its asymptotic variance is the $(1,1)$ element of $\sigma^2 [E ZZ^T]^{-1}$, which is strictly bigger than $\sigma^2 [E Z_{11}^2]^{-1}$ unless $[E ZZ^T]^{-1}$ is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate $\beta_1$ adaptively if $\beta_2, \ldots, \beta_p$ are regarded as nuisance parameters.

What is happening can be seen by a representation of $[Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y$ and $[I^{-1}(\theta)]_{11}$ where $I^{-1}(\theta) = \|I^{jk}(\theta)\|$. We claim that

$\widehat{\beta}_1 = \frac{\sum_{i=1}^n (Z_{i1} - \widehat{Z}_{i1}^{(1)}) Y_i}{\sum_{i=1}^n (Z_{i1} - \widehat{Z}_{i1}^{(1)})^2} \qquad (6.2.26)$

where $\widehat{Z}^{(1)} = (\widehat{Z}_{11}^{(1)}, \ldots, \widehat{Z}_{n1}^{(1)})^T$ is the regression of $(Z_{11}, \ldots, Z_{n1})^T$ on the linear space spanned by $(Z_{1j}, \ldots, Z_{nj})^T$, $2 \le j \le p$. Similarly,

$I^{11}(\theta) = \sigma^2 \big/ E\big(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1})\big)^2 \qquad (6.2.27)$

where $\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1})$ is the projection of $Z_{11}$ on the linear span of $Z_{21}, \ldots, Z_{p1}$ (Problem 6.2.11). Thus, $\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}) = \sum_{j=2}^p a_j^* Z_{j1}$ where $(a_2^*, \ldots, a_p^*)$ minimizes $E(Z_{11} - \sum_{j=2}^p a_j Z_{j1})^2$ over $(a_2, \ldots, a_p) \in R^{p-1}$ (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing $\beta_2, \ldots, \beta_p$ when the variables $Z_2, \ldots, Z_p$ are in any way correlated with $Z_1$, and the price is measured by

$\frac{[E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2]^{-1}}{[E(Z_{11}^2)]^{-1}} = \Big( 1 - \frac{E(\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2}{E(Z_{11}^2)} \Big)^{-1}. \qquad (6.2.28)$

In the extreme case of perfect collinearity the price is $\infty$, as it should be, because $\beta_1$ then becomes unidentifiable. Thus, adaptation corresponds to the case where $(Z_2, \ldots, Z_p)$ have no value in predicting $Z_1$ linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the $Z_i$, $i = 1, \ldots, n$, $\widehat{\beta}_1$ is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if $E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2 = 0$. $\square$
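To make the price in (6.2.28) concrete, here is a small numerical check (not from the text; numpy assumed, with an equicorrelated $E(ZZ^T)$ chosen purely for illustration) comparing the case (i) variance $\sigma^2[EZ_{11}^2]^{-1}$ with the $(1,1)$ element of $\sigma^2[EZZ^T]^{-1}$, taking $\sigma^2 = 1$.

```python
import numpy as np

p, rho = 3, 0.7
EZZt = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)   # hypothetical E(ZZ^T), equicorrelated

var_case_i = 1.0 / EZZt[0, 0]                # sigma^2 [E Z_11^2]^{-1}, sigma^2 = 1
var_case_ii = np.linalg.inv(EZZt)[0, 0]      # (1,1) element of sigma^2 [E ZZ^T]^{-1}
price = var_case_ii / var_case_i             # exceeds 1 unless rho = 0
print(var_case_i, var_case_ii, price)
```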

Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the $\epsilon_i$ in (6.2.19) are i.i.d. but not necessarily Gaussian with density $\frac{1}{\sigma} f_0(\frac{\cdot}{\sigma})$, for instance,

$f_0(x) = \frac{e^x}{(1 + e^x)^2},$

the logistic density. Such error densities have the often more realistic, heavier tails$^{(1)}$ than the Gaussian density. The estimates $\widehat{\beta}_0, \widehat{\sigma}_0$ now solve

$\sum_{i=1}^n \psi\Big(\frac{Y_i - Z_i^T \widehat{\beta}_0}{\widehat{\sigma}_0}\Big) Z_i = 0 \quad \text{and} \quad \sum_{i=1}^n \chi\Big(\frac{Y_i - Z_i^T \widehat{\beta}_0}{\widehat{\sigma}_0}\Big) = 0,$
where $\psi = -f_0'/f_0$, $\chi(y) = -\frac{f_0'(y)}{f_0(y)}\, y - 1$, and $\widehat{\beta}_0 = (\widehat{\beta}_{10}, \ldots, \widehat{\beta}_{p0})^T$. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) $\log f_0$ is strictly concave, i.e., $f_0'/f_0$ is strictly decreasing.

(ii) $(\log f_0)''$ exists and is bounded.

Then, if further $f_0$ is symmetric about 0,

$I(\theta) = \sigma^{-2} I((\beta^T, 1)^T) = \sigma^{-2} \begin{pmatrix} c_1 E(ZZ^T) & 0 \\ 0 & c_2 \end{pmatrix} \qquad (6.2.29)$

where

$c_1 = \int \Big( \frac{f_0'(x)}{f_0(x)} \Big)^2 f_0(x)\, dx, \qquad c_2 = \int \Big( x \frac{f_0'(x)}{f_0(x)} + 1 \Big)^2 f_0(x)\, dx.$

Thus, $\widehat{\beta}_0, \widehat{\sigma}_0$ are optimal estimates of $\beta$ and $\sigma$ in the sense of Theorem 6.2.2 if $f_0$ is true.

Now suppose the $f_0$ generating the estimates $\widehat{\beta}_0$ and $\widehat{\sigma}_0$ is symmetric and satisfies (i) and (ii) but the true error distribution has density $f$ possibly different from $f_0$. Under suitable conditions we can apply Theorem 6.2.1 with $\Psi = (\psi_1, \ldots, \psi_{p+1})^T$ where

$\psi_j(z, y, \beta, \sigma) = \psi\Big( \frac{y - \sum_{k=1}^p z_k \beta_k}{\sigma} \Big) z_j, \quad 1 \le j \le p, \qquad \psi_{p+1}(z, y, \beta, \sigma) = \chi\Big( \frac{y - \sum_{k=1}^p z_k \beta_k}{\sigma} \Big) \qquad (6.2.30)$

to conclude that

$\mathcal{L}(\sqrt{n}(\widehat{\beta}_0 - \beta_0)) \to N(0, \Sigma(\Psi, P)), \qquad \mathcal{L}(\sqrt{n}(\widehat{\sigma}_0 - \sigma_0)) \to N(0, \sigma^2(\Psi, P))$

where $\beta_0, \sigma_0$ solve

$\int \Psi(z, y, \beta_0, \sigma_0)\, dP(z, y) = 0 \qquad (6.2.31)$

and $\Sigma(\Psi, P)$ is as in (6.2.6).

What is the relation between $\beta_0, \sigma_0$ and $\beta, \sigma$ given in the Gaussian model (6.2.19)? If $f_0$ is symmetric about 0 and the solution of (6.2.31) is unique, then $\beta_0 = \beta$. But $\sigma_0 = c(f_0)\sigma$ for some $c(f_0)$ typically different from one. Thus, $\widehat{\beta}_0$ can be used for estimating $\beta$ although if the true distribution of the $\epsilon_i$ is $N(0, \sigma^2)$ it should perform less well than $\widehat{\beta}$. On the other hand, $\widehat{\sigma}_0$ is an estimate of $\sigma$ only if normalized by a constant depending on $f_0$ (see Problem 6.2.5). These are issues of robustness; that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded $\Psi = (\psi_1, \ldots, \psi_p)^T$ to estimate $\beta$ even though it is suboptimal when $\epsilon \sim N(0, \sigma^2)$, and to use a suitably normalized version of $\widehat{\sigma}_0$ for the same purpose. One effective choice of $\psi_j$ is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. $\square$
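As a rough computational companion to this example (not part of the text), the following sketch computes a regression M-estimate with the bounded Huber $\psi$ mentioned above; the preliminary MAD-based scale, the constant 0.6745, and $k = 1.345$ are conventional choices assumed here, not taken from the book.

```python
import numpy as np
from scipy import optimize

def huber_regression(Z, Y, k=1.345, scale=None):
    """Sketch of an M-estimate of beta in Y = Z beta + eps using the bounded Huber psi.
    The scale is a preliminary robust estimate (normalized MAD of least squares
    residuals) rather than being estimated jointly as in (6.2.30)."""
    b0 = np.linalg.lstsq(Z, Y, rcond=None)[0]              # least squares start
    if scale is None:
        resid0 = Y - Z @ b0
        scale = np.median(np.abs(resid0 - np.median(resid0))) / 0.6745
    psi = lambda u: np.clip(u, -k, k)
    score = lambda b: Z.T @ psi((Y - Z @ b) / scale)       # estimating equations for beta
    return optimize.root(score, x0=b0).x
```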
2. We have implicitly done this in the calculations leading up to (5. ife denotes the MLE. the latter can pose a fonnidable problem if p > 2. . See Schervish (1995) for some of the relevant calculations. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : fJ 1 < 0 versus K : fh > 0 where fJ 2 . Although it is easy to write down the posterior density of 8.s. The problem arises also when. All of these will be developed in Section 6.2.s.3 The Posterior Distribution in the Multiparameter Case The asymptotic theory of the posterior distribution parallels that in the onedimensional case exactly. . because we then need to integrate out (03 . Kadane. Wald tests (a generalization of pivots).I as the Euclidean nonn in conditions A7 and A8. under PeforallB. Ifthe multivariate versions of AOA3.).5.. and Rao's tests. the likelihood ratio principle. and interpret I . 2 The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5.(t) n~ 1 p(Xi . fJ p vary freely.2. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. A4(a. We simply make 8 a vector. Confidence regions that parallel the tests will also be developed in Section 6. Op). However. We defined minimum contrast (Me) and M estimates in the case of pdimensional parameters and established their convergence in law to a nonnal distribution. O ).. t)dt. This approach is refined in Kass.3.5. A new major issue that arises is computation. The consequences of Theorem 6. When the estimating equations defining the M estimates coincide with the likelihood .) and A6A8 hold then. A5(a. typically there is an attempt at "exact" calculation. as is usually the case. for fixed n. the equivalence of Bayesian and frequentist optimality asymptotically. in perfonnance and computationally.2. .5. 1T(8) rr~ 1 P(Xi1 8). Summary.3 are the same as those of Theorem 5.32) a. 6. . Again the two approaches differ at the second order when the prior begins to make a difference.s.Section 6.3. The three approaches coincide asymptotically but differ substantially. say. Using multivariate expansions as in B.3. up to the proportionality constant f e . .2 Asymptotic Estimation Theory in p Dimensions 391 Testing and Confidence Bounds There are three principal approaches to testing hypotheses in multiparameter models. ~ (6. and Tierney (1989).19) (Laplace's method).8 we obtain Theorem 6. we are interested in the posterior distribution of some of the parameters.2...19). These methods are beyond the scope of this volume but will be discussed briefly in Volume II. say (fJ 1.
PO'  . These were treated for f) real in Section 5.9) . 9 E 8} sup{p(x.B} has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {PO. .4.I .6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal.denotes the MLE for PO' then the posterior distribution of yn(O if 0 r' (9)) distribution. However. ! I In Sections 4. the exact critical value is not available analytically. In this section we will use the results of Section 6. equations. X II . this result gives the asymptotic distribution of the MLE.i.. 9 E 8" 8 1 = 8 . ry E £} if the asymptotic distribution of In(() .4. 1 '~ J . confidence regions. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AIestimate if we know the correct model P = {Po : BEe} and use the MLE for this model. covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations.110 : () E 8}. I . 6.1 . and confidence procedures. we need methods for situations in which. In Section 6. In these cases exact methods are typically not available. 1 . In such . We shall show (see Section 6. Wald and Rao large sample tests. I . Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic i.3 • .s.. I H I! ~. 170 specified. I j I I . However. under P. as in the linear model. . . LARGE SAMPLE TESTS AND CONFIDENCE REGIONS ~ 0) converges a. "' ~ 6.392 Inference in the Multiparameter Case Chapter 6 .4 to vectorvalued parameters. X n are i. . 1 Zp.8 0 .9). A(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions.3.1 we considered the likelihood ratio test statistic. and we tum to asymptotic approximations to construct tests. . and showed that in several statistical models involving normal distributions. We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {PO. Another example deals with M estimates based on estimating equations generated by linear models with nonGaussian error distribution.•• . in many experimental situations in which the likelihood ratio test can be used to address important questions.f/ : () E 8. adaptive estimation of 131 is possible iff Zl is uncorre1ated with every linear function of Z2) . 9 E 8 0 } for testing H : 9 E 8 0 versus K .d.9 and 6.4. to the N(O.I . We present three procedures that are used frequently: likelihood ratio.2 to extend some of the results of Section 5. and other methods of inference. Finally we show that in the Bayesian framework where given 0. In linear regression.1 we developed exact tests and confidence regions that are appropriate in re~ gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. '(x) = sup{p(x.".
1 exp{ iJx )/qQ). ~ In Example 2. such that VI. .2 we showed how to find B as a nonexplicit solution of likelihood equations.. . Thus.i.v'[.\(Y)). Here is an example in which Wilks's approximation to £('\(X)) is useful: Example 6.\(X) based on asymptotic theory. Then Xi rvN(B i .1.1). E W ..2 we showed that the MLE.randXi = Uduo. . 0 It remains to find the critical value..\(Y) = L X. Wilks's approximation is exact. The Gaussian Linear Model with Known Variance."" V r spanw. . The MLE of f3 under H is readily seen from (2. Suppose XL X 2 .. This is not available analytically.1. as X where X has the gamma.. distribution with density p(X. Y n be nn..2. Moreover. 0) = IT~ 1 p(Xi. .tl... r and Xi rv N(O..tn)T is a member of a qdimensional linear subspace of woo versus the alternative that J. We next give an example that can be viewed as the limiting situation for which the approximation is exact: independent with Yi rv N(Pi.£ X~" for degrees of freedom d to be specified later. under regularity conditions. 0 = (iX.n. . when testing whether a parameter vector is restricted to an open subset of Rq or R r . . Using Section 6.3 we test whether J.1. we conclude that under H. .3. i = 1.5) to be iJo = l/x and p(x. .. q < r.3 Large Sample Tests and Confidence Regions 393 cases we can turn to an approximation to the distribution of . As in Section 6. x ~ > 0. In this u 2 known example. 0) ~ iJ"x·. 1). = (J. Q > 0.l.. . where A nxn is an orthogonal matrix with rows vI. V q span Wo and VI.'··' J. <1~) where {To is known. . the numerator of .3.q' i=q+1 r Wilks's theorem states that. Suppose we want to test H : Q = 1 (exponential distribution) versus K : a i.Wo where w is an rdimensional linear subspace of Rn and W ::> Wo. . iJo) is the denominator of the likelihood ratio statistic.i = 1. ~ ~ ~ ~ ~ The approximation we shall give is based on the result "2 log '\(X) . the X~_q distribution is an approximation to £(21og .Section 6. . •. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations.Xn are i. the hypothesis H is equivalent to H: 6 q + I = . . Let YI .n.3.i = 1..4.. 0). ~ X. 1.. exists and in Example 2. . qQ. which is usually referred to as Wilks's theorem or approximation.... i = r+ 1.\(x) is available as p(x. 0 We iHustrate the remarkable fact that X~q holds as an approximation to the null distribution of 2 log A quite generally when the hypothesis is a nice qdimensional submanifold of an rdimensional parameter space with the following. 2 log .1.3. and we transform to canonical form by setting Example 6.d. iJ).l. . iJ). iJ > O.3. =6r = O. SetBi = TU/uo.
and conclude that Vn S X.. 21og>. (6. .3.i=q+l 'C'" I=r+l Xi' ) X. V T I«(Jo)V ~ X. In Section 6. The result follows because. I(J~ . under H .2) V ~ N(o. 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6. hy Corollary B. 0). .." .2. . . we can conclude arguing from A.(Jol. The Gaussian Linear Model with Unknown Variance.3. N(o. Apply Example 5.6. we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L.394 Inference in the Multiparameter Case Chapter 6 1 . . .3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V. Suppose the assumptions of Theorem 6.2 but (T2 is unknown then manifold whereas under H. . where DO is the derivative with respect to e.(Jo) for some (J~ with [(J~ . Because .1.1) " I. where x E Xc R'.d. (72) ranges over an r + Idimensional Example 6.2. B).3. ·'I1 . Eln«(Jo) ~ I«(Jo). Here ~ ~ j.3.3. .(Jol.. Hence. . j . X. !. Note that for J.(J) Uk uJ rxr c'· • 1 .Xn a sample from p(x.\(Y) £ X._q as well..2. an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn . I( I In((J) ~ n 1 n L i=l & & &" &".. Then. (J = (Jo. = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X. "'" T.3. Write the log likelihood as e In«(J) ~ L i=I n logp(X i . i\ I . ! . Proof Because On solves the likelihood equation DOln(O) = 0.2 with 9(1) = log(l + I).. I .....Onl + IOn  (Jol < 210n .. . "'" (6. y'ri(On . I. and (J E c W.7 to Vn = I:~_q+lXl/nlI:~ r+IX.(Y) defined in Remark ° 6.(Jo) ".3.. By Theorem 6.3.logp(X.q also in the 0. I I i Example 6.. If Yi are as in = (IL.1 ((Jo)).q' Finally apply Lemma 5..~ L H . o . 0 Consider the general i. 2 log >'(Y) = Vn "..2.i.(Jol < I(J~ .(In[ < l(Jn . : .1. ranges over a q + Idimensional manifold. I1«(Jo)).(Jo) In«(Jn)«(Jn . case with Xl. where Irxr«(J) is the Fisher information matrix.1.2.· . ". an = n.2 unknown case.(X) ~ 1 . c = and conclude that 210g .2 are satisfied.3.
2. Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1. Then under H : 0 E 8 0.3.80.3.3. Suppose that the assumptions of Theorem 6. 8 2)T where 8 1 is the first q coordinates of8.. for given true 0 0 in 8 0 • '7=M(OOo) where. Let 00.(9 0 . OT ~ (0(1). Next we tum to the more general hypothesis H : 8 E 8 0 ..j. 0 E e.4).1) applied to On and the corresponding argument applied to 8 0 . T (1) (2) Proof. 0). dropping the dependence on 6 0 • M ~ PI 1 / 2 (6. where x r (1. 0(2) = (8. j ~ q + 1. and {8 0 . We set d ~ r .10) and (6.n and (6.2.3.Section 6. the test that rejects H : () 2Iog>.)T.n = (80 . ~ {Oo : 2[/ n (On) /n(OO)] < x...6) .8q ). has approximately level 1.81(00).0:'.3.2 hold for p(x. 0b2) = (80"+1".' ". SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po.(] . ~(11 ~ (6.)T. .q.T.. 10 (0 0 ) = VarO. . Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] .3) is a confidence region for 0 with approximat~ coverage probability 1 .2 illustrate such 8 0 .3.2[1.0) is the 1 and C!' quantile of the X. Make a change of parameter.+l.2.a.3.n) /n(Oo)l.0(2». Examples 6.j} are specified values.8.0(1) = (8 1" ".)1'.a)) (6.. Furthennore. (6..3.1 and 6. distribution.(X) 8 0 when > x.". where e is open and 8 0 is the set of 0 E e with 8j = 80 . (JO.(Ia).3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem.4) It is easy to see that AOA6 for 10 imply AOA5 for Po. Theorem 6. By (6.35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 ..8.0 0 ).
00 : e E 8 0} MAD = {'1: '/.3. ).3.II).I '1): 11 0 + M.8.8) that if T('1) then = 1 2 n. Me : 1]q+l = TJr = a}.2. Such P exists by the argument given in Example 6.1ITD III(x. His {fJ E applying (6.1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 . 0 Note that this argument is simply an asymptotic version of the one given in Example 6. : • Var T(O) = pT r ' /2 II. 18q acting as r nuisance parameters.0: confidence region is i .5) to "(X) we obtain.1/ 2 p = J.11) . J) distribution by (6.3.a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.. Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ. a piece of 8.1'1 '1 E eo} and from (B. 1e ).+1 ~ .3. I I eo~ ~ ~ {(O. = 0. (6.6) and (6. = . (O) r • + op(l) (6.0. • . .396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that.. .l..+ 1 ..3.(l .all (6.3. • 1 y{X) = sup{P{x.1I0 + M.. ••.3. Note that.1> . .(X) = y{X) where ! 1 1 .3. under the conditions of Theorem 6. if A o _ {(} .+I> .• I .. IIr ) : 2[ln(lI n ) .. because in tenus of 7].2. We deduce from (6.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l). Thus.(1 .. ..9).00 .9) .1I0 + M./ L i=I n D'1I{X" liD + M.1I 0 + M.13) D'1I{x. by definition.(X) > x r .2.3. with ell _ . ~ 'I.1 '1) = [M.In( 00 . This asymptotic level 1 .10) LTl(o) . which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O. then by 2 log "(X) TT(O)T(O) .8 0 : () E 8}.. Moreover.. '1 E Me}.1 '1ll/sup{p(x. Or)J < xr_.1 '1).7).Tf(O)T ..... is invariant under reparametrization . rejecting if . .l.3. Tbe result follows from (6.
of f)l" .3. B2 .2 and the previously conditions on g. will appear in the next section.8r are known.2 to this situation is easy and given in Problem 6. More complicated linear hypotheses such as H : 6 . Suppose the MLE hold (JO. We only need note that if WQ is a linear space spanned by an orthogonal basis v\.p:::'e:. Theorem 6. if 0 0 is true.3.3. More sophisticated examples are given in Problems 6.3.5 and 6.'':::':::"::d_C:::o:::. X. Vl' are orthogonal towo and v" . 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X. As an example ofwhatcau go wrong..3. where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}. (6.. .3.o''C6::. (6...3.:::f. If 81 + B2 = 1.. The esseutial idea is that.14) a hypothesis of the fonn (6....3. q + 1 < j < r written as a vector g.) be i.e:::S:::.2)(6. .13) Evideutly..2 is still inadequate for most applications..'! are the MLEs. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension... The proof is sketched iu Problems (6. t. J).. R.3.. such that Dg(O) exists and is of rauk r ..::. Examples such as testing for independence in contingency tables. which require the following general theorem. " ' J v q and Vq+l " . q + 1 < j < r. l B.o:::"'' 397 CC where 00. It can be extended as follows. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6. .::Se::.. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}. Theorem 6..6. Suppose H is specified by: There exist d functions.. N(B 1 .3.3.3.3. Suppose the assumptions of Theorem 6. 9j .3. .. Then.n under H is consistent for all () E 8 0.i. .t. "'0 ~ {O : OT Vj = 0.3...13)..q at all 0 E 8.l:::. . 210g A(X) ". . let (XiI.d.. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I).cTe:::..g..13) then the set {(8" . themselves depending on 8 q + 1 . The formulation of Theorem 6.3.( 0 ) ~ o} (6. q + 1 <j < r}.13).3.80.(0) ~ 8j .. 8q o assuming that 8q + 1 .3).2. ..3. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6. A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O . X~q under H. We need both properties because we need to analyze both the numerator and denominator of A(X)..."de"'::'"e:::R::e"g. 8q)T : 0 E 8} is opeu iu Rq... 8.:::m. v r span " R T then.2 falls under this schema with 9.12) The extension of Theorem 6.8 0 E Wo where Wo is a linear space of dimension q are also covered.
y T I(IJ)Y .3.~ D 2 1n ( IJ n).Or) and define the Wald (6.7.2..9).X. for instance.. The last Hessian choice is ~ ~ . I .15) and (6. 1(8) continuous implies that [1(8) is continuous and.a)} is an ellipsoid in W easily interpretable and computablesee (6. in particular .0. It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when . Under the conditions afTheorem 6. . ~ ~ 00.6. 0 ""11 F . Then .4.lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" .3.3. y T I(IJ)Y. according to Corollary B.15) Because I( IJ) is continuous in IJ (Problem 6.2.. (6. (6.31). . Slutsky's it I' I' . hence. theorem completes the proof.18) i 1 Proof. If H is true.q' ~(2) (6.10). i favored because it is usually computed automatically with the MLE. the lower diagonal block of the inverse of .3.. r 1 (1J)) where.3. it follows from Proposition B.~ D 2ln (IJ n ). respectively.fii(lJn (2) L 1J 0 ) ~ Nd(O.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6.2. I.Nr(O.: b. (6.rl(IJ» asn ~ p ~ L 00.2 hold. the Hessian (problem 6. .3.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B.7. X. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ). . i . Y . . has asymptotic level Q.. I22(lJo)) if 1J 0 E eo holds. It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l . But by Theorem 6. [22 is continuous.) and IJ n ~ ~ ~(2) ~ (Oq+" . .2.3. More generally.3. For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n .2.2.3.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:.16). ~ Theorem 6.398 Inference in the Multiparameter Case Chapter 6 6.~ D 2l n (1J 0 ) or I (IJ n ) or .17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d. I 22 (9 n ) is replaceable by any consistent estimate of J22( 8).fii(1J IJ) ~N(O..1.2.3.. j Wn(IJ~2)) !:.
The argument is sketched in Problem 6.. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !. The test that rejects H when R n ( (}o) > x r (1.n)[D. as (6... (6. is.n) ~  where :E is a consistent estimate of E( (}o). (J = 8 0 . the asymptotic variance of . N(O.Section 6. This test has the advantage that it can be carried out without computing the MLE. ~1 '1'n(OO. (Problem 6.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d.n under H.3.3. and the convergence Rn((}o) ~ X. in practice they can be very different.3.9) under AOA6 and consistency of (JO.. ..3.3.8) E(Oo) = 1.19) : (J c: 80.nl]  2  1 . The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo. The Rao Score Test For the simple hypothesis H that.. 1(0 0 )) where 1/J n = n .19) indicates... Let eo '1'n(O) = n.0:) is called the Rao score test. as n _ CXJ. What is not as evident is that.Q). the two tests are equivalent asymptotically.' (0 0 )112 (00) (6.n] D"ln(Oo. and so on.9. by the central limit theorem.n W (80 . It follows from this and Corollary B.ln(8 0 ....ln(Oo.n) under n H.3 Large Sample Tests and Confidence Regions 399 The Wold test.(00 )  121 (0 0 )/.3. = 2 log A(X) + op(l) (6. a consistent estimate of . under H. 112 is the upper right q x d block.:El ((}o) is ~ n 1 [D. asymptotically level Q. X.6. It can be shown that (Problem 6..21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ).3.n) 2  + D21 ln (00.3.nlE ~ (2) _ T  .\(X) is the LR statistic for H Thus. The extension of the Rao test to H : (J E runs as follows. requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although. which rejects iff HIn (9b ») > x 1_ q (1 .I Dln ((Jo) is the likelihood score vector.20) vnt/Jn(OO) !. therefore. Furthermore.2 that under H. un.22) . The Wald test leads to the Wald confidence regions for (B q +] . 2 Wn(Oo ) ~(2) where . the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n. Rao's SCOre test is based on the observation (6.
. under regularity conditions. .0 0 ) = T'" "" (vn(On . . . We established Wilks's theorem. then.400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). I I .5.2 but with Rn(lJb 2 1 1 » !:.. it shares the disadvantage of the Wald test that matrices need to be computed and inverted. The analysis for 8 0 = {Oo} is relatively easy. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. For instance. Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this.)I(On)( vn(On  On) + A) !:.8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified. We also considered a quadratic fonn.0 0 ) I(On)(On . for the Wald test neOn . X.2.q distribution under H. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 . which stales that if A(X) is the LR statistic. In particular we shall discuss problems of . which is based on a quadratic fonn in the gradient of the log likelihood. Finally.3. Summary. b . 0(2).. called the Wald statistic. X~' 2 j > xd(1 . The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6. We considered the problem of testing H : 8 E 80 versus K : 8 E 8 . e(l)._q(A T I(Oo)t. . where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6.a)}. . and so on. which measures the distance between the hypothesized value of 8(2) and its MLE. 2 log A(X) has an asymptotic X. On the other hand. . and showed that this quadratic fonn has limiting distribution q .3..19) holds under On and that the power behavior is unaffected and applies to all three tests. Rao.a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l.4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6.) .On) + t. we introduced the Rao score test. Power Behavior of the LR. I . D 21 the d x d Theorem 6. .
k Wn(Oo) = L(N.4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM). treated in more detail in Section 6. j=l . k .OO.3.32) thaI IIIij II. . Pearson's X2 Test As in Examples 1. . ~ N.5.2. j = 1.3. . we need the information matrix I = we find using (2. + n LL(Bi . log(N. L(9..00k)2lOOk... . we consider the parameter 9 = (()l.4.OOi)(B.+ ] 1 1 0.d. It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN. . j = 1.)2InOo. .nOo. k.E~. ()j. Let ()j = P(Xi = j) be the probability of the jth category. Ok Thus. 2.)/OOk.Section 6.0). For i. and 2. with ()Ok =  1 1 E7 : kl j=l ()OJ.i.8.. . j=1 To find the Wald test..4.]2100.) /OOk = n(Bk . In. Because ()k = 1 .1.8 we found the MLE 0.00. Ok if i if i = j. ..2. Thus. where N. = L:~ 1 1{Xi = j}. or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. j=l i=1 The second term on the right is kl 2 n Thus. 6.. In Example 2.33) and (3. the Wald statistic is klkl Wn(Oo) = n LfB.1 GoodnessofFit in a Multinomial Model.lnOo. j=1 Do. consider i. we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution.7. I ij [ . j = 1..2." .) > Xkl(1.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj. k1. trials in which Xi = j if the ith trial prcx:luces a result in the jth category. # j. .6.
13. because kl L(Oj .L .Expected)2 Expected (6. To derive the Rao test. .1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ). note that from Example 2.Nk_d T and.8. n ( L kl ( j=1 ~ ~) !..~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k . ~ . the Rao statistic is ..4. 80j 80k 1 and expand the square keeping the square brackets intact..2.. 80i 80j 80k (6. we write ~ ~  8j 80j .402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem. where N = (N1 .~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6.. ]1(9) = ~ = Var(N). with To find ]1. It is easily remembered as 2 = SUM (Observed .80k).2) .4. j=1 .11).1S. ~ = Ilaijll(kl)X(kl) with Thus.. '" . The general form (6.1) of Pearson's X2 will reappear in other multinomial applications in this section.4. we could invert] or note that by (6. . The second term on the right is n [ j=1 (!. by A..4.2. 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } . Then.L . .2). ...80j ) = (Ok .
4. Possible types of progeny were: (1) round yellow. Testing a Genetic Theory. in the HardyWeinberg model (Example 2.9 when referred to a X~ table.25)2 312. However.2 GoodnessofFit to Composite Multinomial Models. 3.. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. nOlO = 312. which has a pvalue of 0.. Contingency Tables Suppose N = (NI . k = 4 X 2 = (2. l::. and (4) wrinkled green. We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:. Then. 04 = 1/16. . which is a onedimensional curve in the twodimensional parameter space 8. Nk)T has a multinomial.75)2 104. In experiments on pea breeding. n020 = n030 = 104.25)2 104.2) becomes It follows that the Rao statistic equals Pearson 's X2. distribution.25 + (3. Mendel's theory predicted that 01 = 9/16. 2.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6.i::.75 + (3. this value may be too small! See Note 1. For comparison 210g A = 0. .75 . 4 as above and associated probabilities of occurrence fh. Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. 7.25. LOi=l}.48 in this case.25 + (2.fh. There is insufficient evidence to reject Mendel's hypothesis.4). n3 = 108.75. n2 = 101.4. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E . n040 = 34. If we assume the seeds are produced independently. 04.1) dimensional parameter space k 8={8:0i~0. 8).: .1.75. j. (3) round green. 02.Section 6. M(n. Example 6. n4 = 32. 8 0 . i=1 For example.k. (2) wrinkled yellow.75)2 = 04 34. we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1. fh = 03 = 3/16.4. where 8 0 is a composite "smooth" subset of the (k .1. • 6. Mendel observed nl = 315.
. approximately X. Other examples. we define ej = 9j (8).l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni . ..4.404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 . 8) denote the frequency function of N. nk.4. then it must solve the likelihood equation for the model.2) by ej(ij). i=l k If 11 ~ e( 11) is differentiable in each coordinate. e~) T ranges over an open subset of Rq and ej = eOj ... 2 log . . .8 0 . Maximizing p(n1. nk. Then we can conclude from Theorem 6.2~(1 ..1. and the map 11 ~ (e 1(11)...("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6.3..X approximately has a X.~) and test H : e~ = O. . The algebra showing Rn(8 0 ) = X2 in Section 6. . we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6. . ek(11))T takes £ into 8 0 .nk. Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:. . to test the HardyWeinberg model we set e~ = e1 ..4.1). .. and ij exists. which will be pursued further later in this section. 8(11)) for 11 E £. To apply the results of Section 6. If £ is not open sometimes the closure of £ will contain a solution of (6. 8) for 8 E 8 0 is the same as maximizing p( n1.r for specified eOj ." For instance.4. ij satisfies l a a'r/j logp(nl. However. {p(. . nk. l~j~q. r. i=l t n. thus. £ is open.3). The Rao statistic is also invariant under reparametrization and.."" 'r/q) T. e~ = e2 . 8(11)) = 0.. ij = (iiI..3 that 2 log . j Oj is. the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij).. also equal to Pearson's X2 • . 1 ~ j ~ q or k (6. a 11 'r/J . the log likelihood ratio is given by log 'x(nb . j = 1. 8 0 • Let p( n1. . nk) = L ndlog(ni/n) log ei(ij)].3 and conclude that. . We suppose that we can describe 8 0 parametrically as where 11 = ('r/1.(t)~ei(11)=O.. .4.q' Moreover.r. That is. . . 8(11)) : 11 E £}.4. by the algebra of Section 6..X approximately has a xi distribution under H.3) Le. . If a maximizing value..ne j (ij)]2 = 2 ne.q distribution for large n. where 9j is chosen so that H becomes equivalent to "( e~ ... is a subset of qdimensional space.. under H.1. .. j = q + 1. To avoid trivialities we assume q < k .. . .q) exists. involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories.. The Wald statistic is only asymptotically invariant under reparametrization.
(2nl + n2)(2 3 + n2) . i(1 . 1] is the desired estimate (see Problem 6. is male or female. 9(ij) ((2nl + n2) 2n 2n 2. and so on. {}12. AB. then (Nb . If Ni is the number of offspring of type i among a total of n offspring.3) becomes i(1 °:. HardyWeinberg. green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite. .TJ). (3) starchywhite.4. specifies that where TJ is an unknown number between 0 and 1. . N 4 ) has a M(n.. The Fisher Linkage Model. The results are assembled in what is called a 2 x 2 contingency table such as the one shown.4. For instance. 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories. p. {}22. The likelihood equation (6.1).6 that Thus. 1958. do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B. To study the relation between the two characteristics we take a random sample of size n from the population. . nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0. We often want to know whether such characteristics are linked or are independent.5. (}4) distribution. (4) starchygreen.a) with if (2nl + n2)/2n. we obtain critical values from the X~ tables.Section 6. respectively. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B . 301).2.4. The only root of this equation in [0..4 Methods for Discrete Data 405 Example 6. Denote the probabilities of these types by {}n. TJ). Because q = 1. k 4. T() test the validity of the linkage model we would take 8 0 {G(2 + TJ).4. (h. (6. {}21. ( 2n3 + n2) 2) T 2n 2n o Example 6.4) which reduces to a quadratic equation in if.4. H is rejected if X2 2:: Xl (1 . (2) sugarygreen. ij {}ij = (Bil + Bi2 ) (Blj + B2j ).• . AB.. We found in Example 2.4. An individual either is or is not inoculated against a disease. iTJ) : TJ :. A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. is or is not a smoker. . AB. 1} a "onedimensional curve" of the threedimensional parameter space 8. A linkage model (Fisher. Then a randomly selected individual from the population can be one of four types AB.
5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n.RiCj/n) are all the same in absolute value and. 'r/2 ::. 'r/1 ::. I}. (6. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2. In fact (Problem 6.4.6) the proportions of individuals of type A and type B. We test the hypothesis H : () E 8 0 versus K : 0 f{. Thus. Here we have relabeled 0 11 + 0 12 .406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. These solutions are the maximum likelihood estimates.~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 .'r/2).4. 'r/1 (1 . 021 . 0 12 .Tid (n12 + n22) (1 .2).Ti2) (6.4. N 21 .011 + 021 as 'r/1. where z tt [R~jl1 2=1 J=l . For () E 8 0 . ( 22 ). 8 0 .'r/2)) : 0 ::. C j = N 1j + N 2j is the jth column sum. for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic. respectively. q = 2. 0 ::. 1.3) become . the likelihood equations (6. Pearson's statistic is then easily seen to be (6.'r/1). This suggests that X2 may be written as the square of a single (approximately) standard normal variable.4. which vary freely. because k = 4. the (Nij .7) where Ri = Nil + Ni2 is the ith row sum.'r/1) (1 . By our theory if H is true. N 22 )T. 0 11 . 'r/2 to indicate that these are parameters. if N = (Nll' N 12 . X2 has approximately a X~ distribution. we have N rv M(n. 'r/2 (1 . (1 . Then.4.
1 :::. i :::. B to denote the event that a randomly selected individual has characteristic A. The N ij can be arranged in a a x b contingency table.. Therefore.4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6.. It may be shown (Problem 6. . j :::. Thus. . that is. peA I B) = peA I B). B..4. i :::. b ~ 2 (e. b). then Z is approximately distributed as N(O. Next we consider contingency tables for two nonnumerical characteristics having a and b states. . peA I B)) versus K : peA I B) > peA I B). j = 1. Nab Cb Ra n . B. and only if.a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::.. If (Jij = P[A randomly selected individual is of type i for 1 and j for 2]. .. j :::. b NIb Rl a Nal C1 C2 . 1)..4..4... .. Z indicates what directions these deviations take. a.4. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. 1 :::.e. .. b where the TJil. 1 :::. respectively. (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::.g. b} "J M(n. a. then {Nij : 1 :::... If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij .Section 6. hair color).3) that if A and B are independent. (Jij : 1 :::. a. j :::.. it is reasonable to use the test that rejects. 1 Nu 2 N12 .. The X2 test is equivalent to rejecting (twosidedly) if. if X2 measures deviations from independence. if and only if. i = 1. a.8) Thus. . that A is more likely to occur in the presence of B than it would in the presence of B). TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. Positive values of Z indicate that A and B are positively associated (i. i :::. B.. . Z ~ z(l. . a. Z = v'n[P(A I B) .peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. .. eye color.
Because 7r(z) varies between 0 and 1. Instead we turn to the logistic transform g( 7r). 1) d. . . whicll we introduced in Example 1.9) which has approximately a X(al)(bl) distribution under H. log(l "il] + t..7r)]. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). ' XI. (6. we observe independent Xl' . k.4. we observe the number of successes Xi = L. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0). we obtain what is called the logistic linear regression model where ." We assume that the distribution of the response Y depends on the known covariate vector ZT. In this section we assume that the data are grouped or replicated so that for each fixed i. usually called the log it. B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi.4.408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated.8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 .10. perhaps after a transfonnation. Other transforms. As is typical..4. i :::. In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6.{3p. The argument is left to the problems as are some numerical applications.1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are. 6. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0).f. The log likelihood of 7r = (7rl.11) When we use the logit transfonn g(7r)."" 7rk)T based on X = (Xl"'" Xk)T is t.3 Logistic Regression for Binary Responses In Section 6. a simple linear representation zT ~ for 7r(. log ( ~: ) . Thus. (6. we call Y = 1 a "success" and Y = 0 a "failure. with Xi binomial..4. [Xi C :i"J log + m.7r)] are also used in practice. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J. and the loglog transform g2(7r) = 10g[log(1 .~ =1 Zij {3j = unknown parameters {31.) over the whole range of z is impossible..Li = L."F~1 Yij where Yij is the response on the jth of the mi trials in block i.6. 1 :::. such as the probit gl(7r) = <I>1(7r) where <I> is the N(O.
. Alternatively with a good initial value the NewtonRaphson algorithm can be employed.13) By Theorem 2.4 Large Sample Methods for Discrete Data 409 The special case p = 2.Li = E(Xi ) = mi1fi. Zi)T is the logistic regression model of Problem 2.14). j = 1. The condition is sufficient but not necessary for existencesee Problem 2.4. the solution to this equation exists and gives the unique MLE 73 of f3..! ) ( mi .4. It follows that the NILE of f3 solves E f3 (Tj ) = T .f3p )T is.~ Xi 2 1 +"2 ' (6.. the likelihood equations are just ZT(X . . k m} (V.Li.4. Theorem 2.Section 6..log(l:'.3.4. 7r Using the 8method.1fi)*}' + 00.16) 1fi) 1].12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit. we can guarantee convergence with probability tending to 1 as N + 00 as follows. .3.3. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '. ( 1 . W is estimated using Wo = diag{mi1f. The coordinate ascent iterative procedure of Section 2. Although unlike coordinate ascent NewtonRaphson need not converge.p. Zi The log likelihood l( 7r(f3)) = == k (1...2 can be used to compute the MLE of f3. where Z = IIZij Ilrnxp is the design matrix.+ .L) = O.. IN(f3) of f3 = (f31. Tp) T.4 the Fisher information matrix is I(f3) = ZTWZ (6.15) Vi = log X+. j Thus.1fi )* = 1 .4.4. it follows by Theorem 5. .J. i ::. that if m 1fi > 0 for 1 ::.+ ..4. Similarly. k ( ) (6.14) where W = diag{ mi1fi(l1fi)}kxk..1 and by Proposition 3. or Ef3(Z T X) = ZTX.3.(1.3. Then E(Tj ) = 2:7=1 Zij J.. I N(f3) = f. mi 2mi Xi 1 Xi 1 = .4. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00.[ i(l7r iW')· ..1 applies and we can conclude that if 0 < Xi < mi and Z has rank p. in (6.J) SN(O.l. (6. Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' . We let J. if N = 2:7=1 mi.1. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f.3. the empirical logistic transform. . As the initial estimate use (6.
Because $Z$ has rank $p$, it follows that $\hat{\boldsymbol{\beta}}_0$ is consistent.

Testing

In analogy with Section 6.1, linear subhypotheses are important. As in Section 6.1, we let $\omega = \{\boldsymbol{\eta} : \eta_i = \sum_{j=1}^{p} z_{ij}\beta_j,\ \boldsymbol{\beta} \in R^p\}$ and let $r$ be the dimension of $\omega$. We want to contrast $\omega$ with the case in which there are no restrictions on $\boldsymbol{\eta}$; that is, we set $\Omega = R^k$ and consider $\boldsymbol{\eta} \in \Omega$. In this case the likelihood is a product of independent binomial densities, and the MLEs of $\pi_i$ and $\mu_i$ are $X_i/m_i$ and $X_i$, $i = 1,\dots,k$. To get expressions for the MLEs of $\boldsymbol{\pi}$ and $\boldsymbol{\mu}$ under $\omega$, recall from Example 1.6.8 that the inverse of the logit transform $g$ is the logistic distribution function $g^{-1}(t) = [1 + e^{-t}]^{-1}$. Thus, the MLE of $\pi_i$ is $\hat\pi_i = g^{-1}(\sum_{j=1}^{p} z_{ij}\hat\beta_j)$.

The LR statistic $2\log\lambda$ for testing $H : \boldsymbol{\eta} \in \omega$ versus $K : \boldsymbol{\eta} \in \Omega - \omega$ is denoted by $D(\mathbf{X}, \hat{\boldsymbol{\mu}})$, where $\hat{\boldsymbol{\mu}}$ is the MLE of $\boldsymbol{\mu}$ for $\boldsymbol{\eta} \in \omega$. By (6.4.11) and (6.4.12),
$$D(\mathbf{X}, \hat{\boldsymbol{\mu}}) = 2\sum_{i=1}^{k}\big[X_i\log(X_i/\hat\mu_i) + X_i'\log(X_i'/\hat\mu_i')\big] \qquad (6.4.17)$$
where $X_i' = m_i - X_i$ and $\hat\mu_i' = m_i - \hat\mu_i$. $D(\mathbf{X}, \hat{\boldsymbol{\mu}})$ measures the distance between the fit $\hat{\boldsymbol{\mu}}$ based on the model $\omega$ and the data $\mathbf{X}$. By the multivariate delta method it follows (see the problems) that $D(\mathbf{X}, \hat{\boldsymbol{\mu}})$ has asymptotically a $\chi^2_{k-r}$ distribution for $\boldsymbol{\eta} \in \omega$ as $m_i \to \infty$, $i = 1,\dots,k$, $k < \infty$.

If $\omega_0$ is a $q$-dimensional linear subspace of $\omega$ with $q < r$, then we can form the LR statistic for $H : \boldsymbol{\eta} \in \omega_0$ versus $K : \boldsymbol{\eta} \in \omega - \omega_0$,
$$2\log\lambda = D(\mathbf{X}, \hat{\boldsymbol{\mu}}_0) - D(\mathbf{X}, \hat{\boldsymbol{\mu}}) = 2\sum_{i=1}^{k}\Big[X_i\log\Big(\frac{\hat\mu_i}{\hat\mu_{0i}}\Big) + X_i'\log\Big(\frac{\hat\mu_i'}{\hat\mu_{0i}'}\Big)\Big] \qquad (6.4.18)$$
where $\hat{\boldsymbol{\mu}}_0$ is the MLE of $\boldsymbol{\mu}$ under $H$ and $\hat\mu_{0i}' = m_i - \hat\mu_{0i}$. The statistic $2\log\lambda$ has an asymptotic $\chi^2_{r-q}$ distribution as $m_i \to \infty$, $i = 1,\dots,k$ (Problem 6.4.13). Here is a special case.

Example. The Binomial One-Way Layout. Suppose that $k$ treatments are to be tested for their effectiveness by assigning the $i$th treatment to a sample of $m_i$ patients and recording the number $X_i$ of patients that recover. For a second example, suppose we want to compare $k$ different locations with respect to the percentage that have a certain attribute, such as the intention to vote for or against a certain proposition. We obtain $k$ independent samples, one from each location, and for the $i$th location count the number $X_i$ among $m_i$ that have the given attribute. In either case the samples are collected independently and we observe $X_1,\dots,X_k$ independent with $X_i \sim B(m_i, \pi_i)$, $i = 1,\dots,k$.

This model corresponds to the one-way layout of Section 6.1, and, as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test $H : \pi_1 = \pi_2 = \dots = \pi_k = \pi$, $\pi \in (0,1)$, versus the alternative that the $\pi$'s are not all equal. Under $H$ the log likelihood in canonical exponential form is
$$\beta T - N\log(1 + \exp\{\beta\}) + \sum_{i=1}^{k}\log\binom{m_i}{X_i}$$
where $T = \sum_{i=1}^{k} X_i$, $N = \sum_{i=1}^{k} m_i$, and $\beta = \log[\pi/(1-\pi)]$. It follows from Theorem 2.3.1 that if $0 < T < N$ the MLE exists and is the solution of (6.4.13). Using $Z$ as given in the one-way layout example of Section 6.1, we find that the MLE of $\pi$ under $H$ is $\hat\pi = T/N$. The LR statistic is given by (6.4.18) with $\hat\mu_{0i} = m_i\hat\pi$, that is, $\hat{\boldsymbol{\mu}}_0 = (m_1\hat\pi,\dots,m_k\hat\pi)^T$. The Pearson statistic
$$\chi^2 = \sum_{i=1}^{k}\frac{(X_i - m_i\hat\pi)^2}{m_i\hat\pi(1-\hat\pi)}$$
is a Wald statistic, and the $\chi^2$ test is asymptotically equivalent to the LR test (Problem 6.4.15). $\square$

Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's $\chi^2$," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under $H$. When the hypothesis is that the multinomial parameter is in a $q$-dimensional subset of the $(k-1)$-dimensional parameter space $\Theta$, the Rao statistic is again of the Pearson $\chi^2$ form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. We also considered logistic regression for binary responses, in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. Finally, in the special case of testing equality of $k$ binomial parameters, we give explicitly the MLEs and $\chi^2$ test.
6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean $\mu_i$ of a response $Y_i$ is expressed as a function of a linear combination
$$\xi_i = \mathbf{z}_i^T\boldsymbol{\beta} = \sum_{j=1}^{p} z_{ij}\beta_j$$
of covariate values. In particular, in the case of a Gaussian response, $\mu_i = \xi_i$, while in Section 6.4.3 the logit transformation of the probability of success is linear in the covariates, so that $\pi_i = g^{-1}(\xi_i)$, where $g^{-1}(y)$ is the logistic distribution function. McCullagh and Nelder (1983, 1989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model developed by Goodman and Haberman. See Haberman (1974).

The generalized linear model with dispersion depending only on the mean

The data consist of an observation $(\mathbf{Z}, \mathbf{Y})$, where $\mathbf{Y} = (Y_1,\dots,Y_n)^T$ is $n\times 1$ and $\mathbf{Z}^T_{p\times n} = (\mathbf{z}_1,\dots,\mathbf{z}_n)$ with $\mathbf{z}_i = (z_{i1},\dots,z_{ip})^T$ nonrandom, and $\mathbf{Y}$ has density $p(\mathbf{y}, \boldsymbol{\eta})$ given by (6.5.1). Here $\boldsymbol{\eta}$ does not range freely over $\Xi$, the natural parameter space of the $n$-parameter canonical exponential family (6.5.1), but over a subset of $\Xi$ obtained by restricting $\eta_i$ to be of the form $\eta_i = h(\mathbf{z}_i, \boldsymbol{\beta})$, where $h$ is a known function. As we know from Corollary 1.6.1, the mean $\boldsymbol{\mu}$ of $\mathbf{Y}$ is related to $\boldsymbol{\eta}$ via $\boldsymbol{\mu} = \dot A(\boldsymbol{\eta})$. Note that if $\dot A$ is one-to-one, $\boldsymbol{\mu}$ determines $\boldsymbol{\eta}$ and thereby $\mathrm{Var}(\mathbf{Y}) = \ddot A(\boldsymbol{\eta})$. Typically, $A(\boldsymbol{\eta}) = \sum_i A_0(\eta_i)$ for some $A_0$, in which case $\mu_i = A_0'(\eta_i)$.

We assume that there is a one-to-one transform $g(\boldsymbol{\mu})$ of $\boldsymbol{\mu}$, called the link function, such that
$$g(\boldsymbol{\mu}) = \sum_{j=1}^{p}\beta_j \mathbf{Z}^{(j)} = \mathbf{Z}\boldsymbol{\beta},$$
where $\mathbf{Z}^{(j)} = (z_{1j},\dots,z_{nj})^T$ is the $j$th column vector of the $n\times p$ design matrix $\mathbf{Z}$. Typically, $g(\boldsymbol{\mu})$ is of the form $(g(\mu_1),\dots,g(\mu_n))^T$, in which case $g$ is also called the link function.

Canonical links

The most important case corresponds to the link being canonical, that is, $g = \dot A^{-1}$, or
$$\boldsymbol{\eta} = \sum_{j=1}^{p}\beta_j \mathbf{Z}^{(j)} = \mathbf{Z}\boldsymbol{\beta}.$$
In this case, the GLM is the canonical subfamily of the original exponential family generated by $\mathbf{Z}^T\mathbf{Y}$, which is $p\times 1$.
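To make the canonical link concrete, the following small sketch (ours, not from the text) lists the mean map $A_0'(\eta)$ and the canonical link $g = (A_0')^{-1}$ for three standard one-parameter families and checks that the link recovers the canonical parameter from the mean.

```python
import numpy as np

# Mean function A0'(eta) and canonical link g = (A0')^{-1}; the pairing of
# family names with functions below is our illustration, not text notation.
FAMILIES = {
    "gaussian":  (lambda e: e,                        lambda m: m),
    "poisson":   (np.exp,                             np.log),
    "bernoulli": (lambda e: 1.0 / (1.0 + np.exp(-e)),
                  lambda m: np.log(m / (1.0 - m))),
}

eta = 0.3
for name, (mean, link) in FAMILIES.items():
    mu = mean(eta)
    print(f"{name:9s} mu = {mu:.4f}  g(mu) == eta: {np.isclose(link(mu), eta)}")
```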
Special cases are:

(i) The linear model with known variance. The $Y_i$ are independent Gaussian with known variance 1 and means $\mu_i = \sum_{j=1}^{p} z_{ij}\beta_j$. The model is a GLM with canonical link $g(\mu) = \mu$, the identity.

(ii) Log linear models. Suppose $(Y_1,\dots,Y_p)^T$ is $M(n, \theta_1,\dots,\theta_p)$, $\theta_j > 0$, $\sum_{j=1}^{p}\theta_j = 1$. If we take $g(\boldsymbol{\mu}) = (\log\mu_1,\dots,\log\mu_p)^T$, then, as seen in Section 1.6, $\eta_j = \log\theta_j$, $1 \le j \le p$, are canonical parameters and the link is canonical. The models we obtain are called log linear—see Haberman (1974) for an extensive treatment. Suppose, for example, that $\mathbf{Y} = \|Y_{ij}\|$, $1 \le i \le a$, $1 \le j \le b$, so that $Y_{ij}$ is the indicator of, say, classification $i$ on characteristic 1 and $j$ on characteristic 2. Then $\boldsymbol{\theta} = \|\theta_{ij}\|$, and an important hypothesis is that of independence, $\theta_{ij} = \theta_{i+}\theta_{+j}$, where $\theta_{i+} = \sum_{j=1}^{b}\theta_{ij}$ and $\theta_{+j} = \sum_{i=1}^{a}\theta_{ij}$; the corresponding log linear model is $\log\theta_{ij} = \beta_i + \beta_j$, $1 \le i \le a$, $1 \le j \le b$, where the $\beta_i$, $\beta_j$ are free (unidentifiable) parameters. See Haberman (1974) for a further discussion.

The log linear label is also attached to models obtained by taking the $Y_i$ independent Bernoulli $(\theta_i)$, $0 < \theta_i < 1$, with canonical link $g(\theta) = \log[\theta/(1-\theta)]$. This is just the logistic linear model of Section 6.4.3.

Algorithms

If the link is canonical, then by Theorem 2.3.1, if maximum likelihood estimates $\hat{\boldsymbol{\beta}}$ exist, they necessarily and uniquely satisfy the equation
$$\mathbf{Z}^T\mathbf{Y} = \mathbf{Z}^T E_{\boldsymbol{\beta}}\mathbf{Y} = \mathbf{Z}^T\dot A(\mathbf{Z}\boldsymbol{\beta}). \qquad (6.5.2)$$
It is interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector $\mathbf{Y} - \boldsymbol{\mu}(\hat{\boldsymbol{\beta}})$ is orthogonal to the column space of $\mathbf{Z}$. But, in general, $\boldsymbol{\mu}(\hat{\boldsymbol{\beta}})$ is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point $\hat{\boldsymbol{\beta}}_0$ one can achieve faster convergence with the Newton–Raphson algorithm of Section 2.4.3; in this case, Newton–Raphson coincides with Fisher's method of scoring described in the problems.
Let $\hat{\Delta}_{m+1} \equiv \hat{\boldsymbol{\beta}}_{m+1} - \hat{\boldsymbol{\beta}}_m$ denote the Newton–Raphson correction at stage $m$; it satisfies equation (6.5.4). That is, the correction $\hat{\Delta}_{m+1}$ is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage $m$, the variance–covariance matrix is $W_m$, and the regression is on the columns of $W_m\mathbf{Z}$ (see the problems). In this context, the algorithm is also called iterated weighted least squares. In this situation, and more generally even for noncanonical links, if $\hat{\boldsymbol{\beta}}_0 \stackrel{P}{\to} \boldsymbol{\beta}_0$, the true value of $\boldsymbol{\beta}$, then with probability tending to 1 the algorithm converges to the MLE if it exists, as $n \to \infty$.

Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model we can define the biggest possible GLM $\mathcal{M}$ of the form (6.5.1), for which $p = n$. In that case the MLE of $\boldsymbol{\mu}$ is $\hat{\boldsymbol{\mu}}_{\mathcal{M}} = (Y_1,\dots,Y_n)^T$ (assume that $\mathbf{Y}$ is in the interior of the convex support of $\{\mathbf{y} : p(\mathbf{y}, \boldsymbol{\eta}) > 0\}$). Write $\boldsymbol{\eta}(\cdot)$ for $\dot A^{-1}$. The LR statistic for the hypothesis that $\boldsymbol{\mu} = \boldsymbol{\mu}_0$ within $\mathcal{M}$ is called the deviance between $\mathbf{Y}$ and $\boldsymbol{\mu}_0$, denoted $D(\mathbf{Y}, \boldsymbol{\mu}_0)$. This name stems from the following interpretation: we can think of the test statistic
$$2\log\lambda = 2[l(\mathbf{Y}, \boldsymbol{\eta}(\mathbf{Y})) - l(\mathbf{Y}, \boldsymbol{\eta}(\boldsymbol{\mu}_0))], \qquad (6.5.5)$$
which is always $\ge 0$, as a "measure" of (squared) distance between $\mathbf{Y}$ and $\boldsymbol{\mu}_0$. For the Gaussian linear model with known variance $\sigma_0^2$, $D(\mathbf{Y}, \boldsymbol{\mu}_0) = |\mathbf{Y} - \boldsymbol{\mu}_0|^2/\sigma_0^2$ (see the problems).

The LR statistic for $H : \boldsymbol{\mu} \in \omega_0$ is just $D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_0) \equiv \inf\{D(\mathbf{Y}, \boldsymbol{\mu}) : \boldsymbol{\mu} \in \omega_0\}$, where $\hat{\boldsymbol{\mu}}_0$ is the MLE of $\boldsymbol{\mu}$ in $\omega_0$. The LR statistic for $H : \boldsymbol{\mu} \in \omega_0$ versus $K : \boldsymbol{\mu} \in \omega_1 - \omega_0$, with $\omega_1 \supset \omega_0$, is $D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_0) - D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_1)$, where $\hat{\boldsymbol{\mu}}_1$ is the MLE under $\omega_1$. If $\omega_0 \subset \omega_1$ we can write
$$D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_0) = [D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_0) - D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_1)] + D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_1), \qquad (6.5.6)$$
a decomposition of the deviance between $\mathbf{Y}$ and $\hat{\boldsymbol{\mu}}_0$ as the sum of two nonnegative components, the deviance of $\mathbf{Y}$ to $\hat{\boldsymbol{\mu}}_1$ and $\Delta(\hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{\mu}}_1) \equiv D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_0) - D(\mathbf{Y}, \hat{\boldsymbol{\mu}}_1)$, each of which can be thought of as a squared distance between its arguments. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.1. Unfortunately, $\Delta \ne D$ in general, except in the Gaussian case.
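The iterated weighted least squares (Newton–Raphson) iteration for a canonical GLM is short to state in code. The sketch below is ours and assumes $A(\boldsymbol{\eta}) = \sum_i A_0(\eta_i)$, with `adot` and `addot` standing for $A_0'$ and $A_0''$ applied coordinatewise; the Poisson log-linear usage at the end is a hypothetical example.

```python
import numpy as np

def irls_canonical_glm(Z, Y, adot, addot, beta0, n_iter=25):
    """Iterated weighted least squares for the likelihood equation (6.5.2)."""
    beta = np.asarray(beta0, float)
    for _ in range(n_iter):
        eta = Z @ beta
        mu, w = adot(eta), addot(eta)                  # mean and weights W_m
        delta = np.linalg.solve(Z.T @ (w[:, None] * Z),
                                Z.T @ (Y - mu))        # weighted LS correction
        beta = beta + delta
    return beta

# Example (Poisson log-linear model, canonical link = log, A0(eta) = exp(eta)):
# beta_hat = irls_canonical_glm(Z, Y, np.exp, np.exp, np.zeros(Z.shape[1]))
```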
Formally, if $\omega_0$ is a GLM of dimension $p$ and $\omega_1$ of dimension $q$ with canonical links, then $\Delta(\hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{\mu}}_1)$ is thought of as being asymptotically $\chi^2_{q-p}$ (6.5.7). This can be made precise for stochastic GLMs obtained by conditioning on $\mathbf{Z}_1,\dots,\mathbf{Z}_n$ in the sample $(\mathbf{Z}_1, Y_1),\dots,(\mathbf{Z}_n, Y_n)$. More details are discussed in what follows.

Asymptotic theory for estimates and tests

If $(\mathbf{Z}_1, Y_1),\dots,(\mathbf{Z}_n, Y_n)$ can be viewed as a sample from a population and the link is canonical, then, if we take $\mathbf{Z}_i$ as having marginal density $q_0$, $(\mathbf{Z}_1, Y_1),\dots,(\mathbf{Z}_n, Y_n)$ has the density given in (6.5.8). This is not unconditionally an exponential family in view of the $A_0(\mathbf{z}_i^T\boldsymbol{\beta})$ term. However, there are easy conditions under which the conditions of Theorems 6.2.2 and 6.3.3 hold, so that the MLE $\hat{\boldsymbol{\beta}}$ asymptotically exists, is unique and consistent, and satisfies
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \stackrel{\mathcal{L}}{\to} N(\mathbf{0}, I_1^{-1}(\boldsymbol{\beta})). \qquad (6.5.9)$$
What is $I_1(\boldsymbol{\beta})$? The efficient score function $\frac{\partial}{\partial\boldsymbol{\beta}}\log p(\mathbf{z}, y, \boldsymbol{\beta})$ is $(y - \dot A_0(\mathbf{z}^T\boldsymbol{\beta}))\mathbf{z}$, and so, if we assume the covariates in logistic regression with canonical link to be stochastic,
$$I_1(\boldsymbol{\beta}) = E\big[\ddot A_0(\mathbf{Z}_1^T\boldsymbol{\beta})\,\mathbf{Z}_1\mathbf{Z}_1^T\big],$$
which can be estimated by substituting $\hat{\boldsymbol{\beta}}$ and the sample distribution of the covariates. Thus, asymptotically correct confidence procedures for $\boldsymbol{\beta}$ and functions of it are available. If we wish to test hypotheses such as $H : \beta_1 = \dots = \beta_d = 0$, $d < p$, we can calculate the LR statistic
$$2\log\lambda = 2\sum_{i=1}^{n}\big[(\mathbf{Z}_i^T\hat{\boldsymbol{\beta}} - \mathbf{Z}_i^T\hat{\boldsymbol{\beta}}_H)Y_i - A_0(\mathbf{Z}_i^T\hat{\boldsymbol{\beta}}) + A_0(\mathbf{Z}_i^T\hat{\boldsymbol{\beta}}_H)\big] \qquad (6.5.10)$$
where $\hat{\boldsymbol{\beta}}_H$ is the $(p\times 1)$ MLE for the GLM with $\boldsymbol{\beta}_{p\times 1} = (0,\dots,0,\beta_{d+1},\dots,\beta_p)^T$, and can conclude that the statistic of (6.5.10) is asymptotically $\chi^2_d$ under $H$. Similar conclusions
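A Wald test of the same hypothesis can be based on the estimated information $\mathbf{Z}^TW\mathbf{Z}$; the sketch below is ours (it assumes a canonical link and that `addot` is $A_0''$) and is intended only to illustrate the approximation $\hat{\boldsymbol{\beta}} \approx N(\boldsymbol{\beta}, \hat I^{-1})$.

```python
import numpy as np
from scipy.stats import chi2

def wald_test_glm(Z, beta_hat, addot, d):
    """Wald test of H: beta_1 = ... = beta_d = 0 in a canonical GLM."""
    w = addot(Z @ beta_hat)
    info = Z.T @ (w[:, None] * Z)                  # estimated Fisher information
    cov = np.linalg.inv(info)                      # approximate Var(beta_hat)
    b = beta_hat[:d]
    stat = b @ np.linalg.solve(cov[:d, :d], b)     # Wald statistic
    return stat, chi2.sf(stat, df=d)               # approx chi^2_d under H
```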
follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density $q_0$ of $\mathbf{Z}_1$. These conclusions remain valid for the usual situation in which the $\mathbf{z}_i$ are not random, but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II.

General link functions

Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3 we take $g(\mu) = \Phi^{-1}(\mu)$, so that $\pi_i = \Phi(\mathbf{z}_i^T\boldsymbol{\beta})$, we obtain the so-called probit model. Cox (1970) considers the variance stabilizing transformation, which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range $.1 \le \mu \le .9$ are rather similar. However, noncanonical links can cause numerical problems because the models are now curved rather than canonical exponential families. Existence of MLEs and convergence of algorithm questions all become more difficult, and so canonical links tend to be preferred.

The generalized linear model

The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function $h$ in (6.5.1) depend on an additional scalar parameter $\tau$. It is customary to write the model as, for $c(\tau) > 0$,
$$p(\mathbf{y}, \boldsymbol{\eta}, \tau) = \exp\{c^{-1}(\tau)(\boldsymbol{\eta}^T\mathbf{y} - A(\boldsymbol{\eta}))\}h(\mathbf{y}, \tau). \qquad (6.5.11)$$
Because $\int p(\mathbf{y}, \boldsymbol{\eta}, \tau)\,d\mathbf{y} = 1$,
$$A(\boldsymbol{\eta})/c(\tau) = \log\int \exp\{c^{-1}(\tau)\boldsymbol{\eta}^T\mathbf{y}\}h(\mathbf{y},\tau)\,d\mathbf{y}.$$
The left-hand side is of product form $A(\boldsymbol{\eta})[1/c(\tau)]$, whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that
$$E(\mathbf{Y}) = \dot A(\boldsymbol{\eta}), \qquad \mathrm{Var}(\mathbf{Y}) = c(\tau)\ddot A(\boldsymbol{\eta}),$$
so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the $N(\mu, \sigma^2)$ and gamma $(p, \lambda)$ families. For further discussion of this generalization see McCullagh and Nelder (1983, 1989).
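The mean–variance relation $E(Y) = \dot A(\eta)$, $\mathrm{Var}(Y) = c(\tau)\ddot A(\eta)$ is easy to check by simulation. The small sketch below is ours and uses the gamma family with the parametrization $\eta = -1/\mu$, $A(\eta) = -\log(-\eta)$, $c(\tau) = 1/p$ (shape $p$, rate $\lambda$), which is one standard way to cast the gamma density in this form.

```python
import numpy as np

rng = np.random.default_rng(2)
p_shape, lam = 3.0, 1.5                  # gamma(p, lambda): mean p/lam, var p/lam^2
mu = p_shape / lam
eta, c_tau = -1.0 / mu, 1.0 / p_shape    # canonical eta = -1/mu, dispersion c = 1/p
A_dot, A_ddot = -1.0 / eta, 1.0 / eta**2
y = rng.gamma(p_shape, 1.0 / lam, size=200_000)
print("E(Y):", y.mean(), " vs A'(eta) =", A_dot)
print("Var(Y):", y.var(), " vs c(tau) A''(eta) =", c_tau * A_ddot)
```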
Summary. We considered generalized linear models, defined as canonical exponential models in which the mean vector of the vector $\mathbf{Y}$ of responses can be written, roughly speaking, as a function, called the link function, of a linear predictor of the form $\sum_j \beta_j\mathbf{z}^{(j)}$, where the $\mathbf{z}^{(j)}$ are observable covariate vectors and $\boldsymbol{\beta}$ is a vector of regression coefficients. We considered the canonical link function, which corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of $\boldsymbol{\beta}$. In the random design case, we used the asymptotic results of the previous sections to develop large sample estimation results, confidence procedures, and tests.

6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS

As we most recently indicated in Example 6.2.1, the distributional and implicit structural assumptions of parametric models are often suspect. In Example 6.2.2 we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We found that if we assume a fixed error density $f_0$, symmetric about 0, the resulting MLEs for $\boldsymbol{\beta}$, which are optimal under $f_0$, continue to estimate $\boldsymbol{\beta}$ as defined by (6.2.19) even if the true errors are $N(0, \sigma^2)$ or, in fact, have any distribution symmetric around 0 (Problem 6.2.5); so does any estimate solving the equations based on (6.2.31) with $f_0$ symmetric about 0. That is, these procedures remain reasonable in the semiparametric model
$$\mathcal{P}_1 = \{P : (\mathbf{Z}^T, Y)^T \sim P \text{ given by (6.2.19) with } \epsilon_i \text{ i.i.d. with density } f \text{ for some } f \text{ symmetric about } 0\},$$
whose further discussion we postpone to Volume II.

There is another semiparametric model,
$$\mathcal{P}_2 = \{P : (\mathbf{Z}^T, Y)^T \sim P \text{ satisfying } E_P(Y \mid \mathbf{Z}) = \mathbf{Z}^T\boldsymbol{\beta},\ E_PY^2 < \infty,\ E_P\mathbf{Z}\mathbf{Z}^T \text{ nonsingular}\},$$
in which the LSE is, in fact, still a consistent, asymptotically normal estimate of $\boldsymbol{\beta}$. Furthermore, if $\mathcal{P}_3$ is the nonparametric model, where we assume only that $(\mathbf{Z}, Y)$ has a joint distribution with $E_PY^2 < \infty$, and we are interested in estimating the best linear predictor $\mu_L(\mathbf{Z})$ of $Y$ given $\mathbf{Z}$, then, if one assumes the $(\mathbf{Z}_i, Y_i)$ are i.i.d., the right thing to do, roughly speaking, is "act as if the model were the one given in Example 6.2.1," that is, use the LSE. For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. Of course, for estimating $\mu_L(\mathbf{Z})$ in a submodel of $\mathcal{P}_3$ with
$$Y = \mathbf{Z}^T\boldsymbol{\beta} + a_0(\mathbf{Z})\epsilon, \qquad (6.6.1)$$
where $\epsilon$ is independent of $\mathbf{Z}$ but $a_0(\mathbf{Z})$ is not constant and $a_0$ is assumed known, the LSE is not the best estimate of $\boldsymbol{\beta}$. These robustness properties of the LSE have a fixed covariate exact counterpart, the Gauss–Markov theorem, discussed below. Another, even more important, set of questions, having to do with selection between nested models of different dimension, is touched on in the problems but otherwise also postponed to Volume II.
Robustness in Estimation

We drop the assumption that the errors $\epsilon_1,\dots,\epsilon_n$ in the linear model of Section 6.1 are normal. Instead we assume the Gauss–Markov linear model, where
$$\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad E(\boldsymbol{\epsilon}) = \mathbf{0}, \quad \mathrm{Var}(\epsilon_i) = \sigma^2, \quad \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0,\ i \ne j, \qquad (6.6.2)$$
where $\mathbf{Z}$ is an $n\times p$ matrix of constants, $\boldsymbol{\beta}$ is $p\times 1$, and $\mathbf{Y}$, $\boldsymbol{\epsilon}$ are $n\times 1$. Many of the properties stated for the Gaussian case carry over to the Gauss–Markov case.

Proposition 6.6.1. If we replace the Gaussian assumptions on the errors $\epsilon_1,\dots,\epsilon_n$ in the linear model with the Gauss–Markov assumptions, then $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\mu}}$ are still unbiased, $\mathrm{Var}(\hat{\boldsymbol{\mu}}) = \sigma^2H$, $\mathrm{Var}(\hat{\boldsymbol{\epsilon}}) = \sigma^2(I - H)$, and, if $p = r$, $\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{Z}^T\mathbf{Z})^{-1}$. Moreover, the conclusions (1) and (2) of Theorem 6.1.4 are still valid. These follow as in Theorems 6.1.2(iv) and 6.1.3(iv); see the problems.

The optimality of the estimates $\hat\mu_i$, $\hat\beta_i$ of Section 6.1, and of the LSE in general, still holds when they are compared to other linear estimates. In fact, for any parameter of the form $\alpha = \sum_{i=1}^{n}a_i\mu_i$ for some constants $a_1,\dots,a_n$, the estimate $\hat\alpha = \sum_{i=1}^{n}a_i\hat\mu_i$ has uniformly minimum variance among all unbiased estimates linear in $Y_1,\dots,Y_n$.

Theorem 6.6.1. Suppose the Gauss–Markov linear model (6.6.2) holds. Then, for any parameter of the form $\alpha = \sum_{i=1}^{n}a_i\mu_i$, the estimate $\hat\alpha = \sum_{i=1}^{n}a_i\hat\mu_i$ has uniformly minimum variance among all estimates that are unbiased and linear in $Y_1,\dots,Y_n$.

Proof. Let $\tilde\alpha = \sum_{i=1}^{n}d_iY_i$ stand for any estimate linear in $Y_1,\dots,Y_n$ that is unbiased for $\alpha$. Because $E(\epsilon_i) = 0$, $E(\tilde\alpha) = \sum_{i=1}^{n}d_i\mu_i$, and
$$\mathrm{Var}_{GM}(\tilde\alpha) = \sum_{i=1}^{n}d_i^2\mathrm{Var}(\epsilon_i) + 2\sum_{i<j}d_id_j\mathrm{Cov}(\epsilon_i, \epsilon_j) = \sigma^2\sum_{i=1}^{n}d_i^2,$$
where $\mathrm{Var}_{GM}$ refers to the variance computed under the Gauss–Markov assumptions and $\mathrm{Var}_G$ to the variance computed under the Gaussian assumption that $\epsilon_1,\dots,\epsilon_n$ are i.i.d. $N(0, \sigma^2)$. The preceding computation shows that $\mathrm{Var}_{GM}(\tilde\alpha) = \mathrm{Var}_G(\tilde\alpha)$. Because $\mathrm{Var}_G = \mathrm{Var}_{GM}$ for all linear estimators, $\hat\alpha$ is unbiased, and $\hat\alpha$ is UMVU among unbiased estimates in the normal case, $\mathrm{Var}_{GM}(\hat\alpha) = \mathrm{Var}_G(\hat\alpha) \le \mathrm{Var}_G(\tilde\alpha) = \mathrm{Var}_{GM}(\tilde\alpha)$ for all unbiased linear $\tilde\alpha$, and the result follows. $\square$

Note that the preceding result and proof are similar to those of Section 1.4, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. Moreover, our current $\hat{\boldsymbol{\mu}}$ coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14); see the problems.
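A small Monte Carlo illustration (ours, with made-up settings) of the two sides of the Gauss–Markov theorem: the variance of a linear estimate such as $\bar Y$ depends on the error law only through $\sigma^2$, while a nonlinear estimate (the median) can beat $\bar Y$ once the errors are not Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 20_000
# Gaussian and rescaled Laplace errors, both with variance sigma^2 = 1
gauss = rng.normal(0.0, 1.0, size=(reps, n))
lapl = rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=(reps, n))

for name, eps in [("Gaussian", gauss), ("Laplace", lapl)]:
    ybar = eps.mean(axis=1)            # linear estimate of mu (= 0 here)
    med = np.median(eps, axis=1)       # nonlinear competitor
    print(name, "Var(Ybar) ~", round(ybar.var(), 5),
          "Var(median) ~", round(med.var(), 5))
# Var(Ybar) ~ 1/n under both laws; the median beats Ybar under the Laplace law.
```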
Example 6.6.1. One Sample (continued). Here $\mu = \beta_1$ and $\hat\mu = \hat\beta_1 = \bar Y$. The Gauss–Markov theorem shows that $\bar Y$, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with $EY_1^2 < \infty$. However, $\bar Y$ is not optimal among all estimates when the model is not Gaussian, even for sample size $n$ large. For instance, when the density of $Y$ is the Laplace density $\frac{1}{2\lambda}\exp\{-\lambda|y - \mu|\}$, $\lambda > 0$, $\mu \in R$, $\bar Y$ has a larger variance than the nonlinear estimate $\hat Y = \mathrm{median}(Y_1,\dots,Y_n)$ (see the problems). For this density, and for all symmetric densities, the median is also an unbiased estimate of $\mu$. $\square$

A major weakness of the Gauss–Markov theorem, as the example shows, is that it only applies to linear estimates. Another weakness is that it only applies to the homoscedastic case where $\mathrm{Var}(Y_i)$ is the same for all $i$. Suppose we have a heteroscedastic version of the linear model where $E(\boldsymbol{\epsilon}) = \mathbf{0}$ but $\mathrm{Var}(\epsilon_i)$ depends on $i$. If the variances are known, we can use the Gauss–Markov theorem to conclude that the weighted least squares estimates of Section 2.2 are UMVU among linear estimates; when the variances are unknown, our estimates are, in general, not optimal even in the class of linear estimates.

Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of $t$ tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The asymptotic behavior of the LR, Wald, and Rao tests depends critically on the asymptotic behavior of the underlying MLEs $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}_0$. From the theory developed in Section 6.2 we know that if $\Psi(\cdot, \boldsymbol{\theta}) = Dl(\cdot, \boldsymbol{\theta})$ and $P$ is true, we expect that $\hat{\boldsymbol{\theta}}_n \to \boldsymbol{\theta}(P)$, which is the unique solution of
$$\int \Psi(x, \boldsymbol{\theta})\,dP(x) = 0. \qquad (6.6.3)$$
The $\delta$-method implies that if the observations $X_i$ come from a distribution $P$ that does not belong to the model, then $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}(P)) \to N(\mathbf{0}, \Sigma(\Psi, P))$, with $\Sigma(\Psi, P)$ given by (6.6.4). Because $\Sigma(\Psi, P) \ne I^{-1}(\boldsymbol{\theta}_0)$ in general, the limiting behavior of the test statistics can change. Consider, for instance, $H : \boldsymbol{\theta} = \boldsymbol{\theta}_0$ and the Wald test statistic $T_W = n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^TI(\boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$. If $\boldsymbol{\theta}(P) \ne \boldsymbol{\theta}_0$, evidently $T_W \to \infty$. But if $\boldsymbol{\theta}(P) = \boldsymbol{\theta}_0$, then $T_W \to V^TI(\boldsymbol{\theta}_0)V$, where $V \sim N(\mathbf{0}, \Sigma(\Psi, P))$. Thus, the asymptotic distribution of $T_W$ is not $\chi^2$ in general, and the Wald test need not have asymptotic level $\alpha$. This observation holds for the LR and Rao tests as well—but see the problems.

There is an important special case where all is well: the linear model we have discussed in Example 6.2.1.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose $(\mathbf{Z}_1, Y_1),\dots,(\mathbf{Z}_n, Y_n)$ are i.i.d. as $(\mathbf{Z}, Y)$, where $\mathbf{Z}$ is a $(p\times 1)$ vector of random covariates and we model the relationship between $\mathbf{Z}$ and $Y$ as
$$Y = \mathbf{Z}^T\boldsymbol{\beta} + \epsilon \qquad (6.6.5)$$
with the distribution $P$ of $(\mathbf{Z}, Y)$ such that $\epsilon$ and $\mathbf{Z}$ are independent, $E_P\epsilon = 0$, $E_P\epsilon^2 < \infty$, and $\sigma^2(P) = \mathrm{Var}_P(\epsilon)$. Even though $P$ need not belong to the Gaussian model, if $\Psi$ is given by (6.2.30), then $\boldsymbol{\beta}(P)$ specified by (6.6.3) equals $\boldsymbol{\beta}$ in (6.6.5).

Suppose we consider, say, $H : \boldsymbol{\beta} = \mathbf{0}$. When $\epsilon$ is $N(0, \sigma^2)$ (see Examples 6.2.1 and 6.3.2), the LR, Wald, and Rao tests are all equivalent to the F test: Reject if
$$T_n = \hat{\boldsymbol{\beta}}^T\mathbf{Z}_{(n)}^T\mathbf{Z}_{(n)}\hat{\boldsymbol{\beta}}/(ps^2) \ge f_{p,n-p}(1-\alpha), \qquad (6.6.6)$$
where $\hat{\boldsymbol{\beta}}$ is the LSE, $\mathbf{Z}_{(n)} = (\mathbf{Z}_1,\dots,\mathbf{Z}_n)^T$, and
$$s^2 = \frac{1}{n-p}\sum_{i=1}^{n}(Y_i - \hat Y_i)^2 = \frac{1}{n-p}|\mathbf{Y} - \mathbf{Z}_{(n)}\hat{\boldsymbol{\beta}}|^2. \qquad (6.6.7)$$
Now, by the law of large numbers,
$$n^{-1}\mathbf{Z}_{(n)}^T\mathbf{Z}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{Z}_i\mathbf{Z}_i^T \to E(\mathbf{Z}\mathbf{Z}^T), \qquad (6.6.8)$$
and, even if $\epsilon$ is not Gaussian, it is still true by Theorem 6.2.1 that $\sqrt{n}\hat{\boldsymbol{\beta}}$ has a limiting Gaussian distribution under $H$. It follows by Slutsky's theorem that, under $H : \boldsymbol{\beta} = \mathbf{0}$, the limiting distribution of $pT_n = n\hat{\boldsymbol{\beta}}^T[n^{-1}\mathbf{Z}_{(n)}^T\mathbf{Z}_{(n)}]\hat{\boldsymbol{\beta}}/s^2$ is $\chi^2_p$. Because $f_{p,n-p}(1-\alpha) \to x_p(1-\alpha)/p$, where $x_p(1-\alpha)$ is the $\chi^2_p$ quantile, the test (6.6.6) has asymptotic level $\alpha$ even if the errors are not Gaussian.

This kind of robustness holds for $H : \theta_{q+1} = \theta^0_{q+1},\dots,\theta_p = \theta^0_p$, or more generally for $\boldsymbol{\beta} \in \mathcal{L}_0 + \boldsymbol{\beta}_0$, a $q$-dimensional affine subspace of $R^p$. It is intimately linked to the fact that even though the parametric model on which the test was based is false, (i) the set of $P$ satisfying the hypothesis remains the same, and (ii) the (asymptotic) variance of $\hat{\boldsymbol{\beta}}$ is the same as under the Gaussian model. Moreover, confidence procedures based on approximating $\mathrm{Var}(\hat{\boldsymbol{\beta}})$ by $[\mathbf{Z}_{(n)}^T\mathbf{Z}_{(n)}]^{-1}s^2$ are still asymptotically of correct level. (6.6.9)
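A quick Monte Carlo check of this level robustness (our illustration; the data-generating choices are made up): with stochastic covariates, $\epsilon$ independent of $\mathbf{Z}$, and markedly non-Gaussian errors, the F test of $H : \boldsymbol{\beta} = \mathbf{0}$ still rejects at close to the nominal rate.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
n, p, reps, alpha = 200, 3, 5000, 0.05
crit = f_dist.ppf(1 - alpha, p, n - p)
rejections = 0
for _ in range(reps):
    Z = rng.normal(size=(n, p))                  # stochastic covariates
    eps = rng.exponential(1.0, size=n) - 1.0     # centered, skewed errors
    Y = eps                                      # H: beta = 0 is true
    beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    s2 = np.sum((Y - Z @ beta_hat) ** 2) / (n - p)
    Tn = beta_hat @ (Z.T @ Z) @ beta_hat / (p * s2)
    rejections += Tn >= crit
print("Monte Carlo level ~", rejections / reps)  # close to alpha = 0.05
```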
If the first of these conditions fails, then it is not clear what $H : \boldsymbol{\theta} = \boldsymbol{\theta}_0$ means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Two-Sample Scale Problem. Suppose our model is that $X = (X^{(1)}, X^{(2)})$, where $X^{(1)}, X^{(2)}$ are independent $\mathcal{E}(\lambda_1)$, $\mathcal{E}(\lambda_2)$, respectively, the lifetimes of paired pieces of equipment. If we take $H : \lambda_1 = \lambda_2$, the standard Wald test is
$$\text{"Reject iff } n(\hat\theta_1 - \hat\theta_2)^2/\hat\sigma^2 > x_1(1-\alpha)\text{"} \qquad (6.6.10)$$
where $\hat\theta_1 = \bar X^{(1)}$ and $\hat\theta_2 = \bar X^{(2)}$ estimate the mean lifetimes, and
$$\hat\sigma^2 = 2\hat\theta^2, \quad \hat\theta = \frac{1}{2n}\sum_{i=1}^{n}\big(X_i^{(1)} + X_i^{(2)}\big).$$
The conditions of Theorem 6.2.2 clearly hold under the model. However, suppose now that $X^{(1)}$ and $X^{(2)}$ are identically distributed but not exponential. Then $H$ is still meaningful: $X^{(1)}$ and $X^{(2)}$ are identically distributed. But now
$$\hat\sigma \to \sqrt{2}\,E_P(X^{(1)}) \ne \sqrt{2\,\mathrm{Var}_P(X^{(1)})} \text{ in general,}$$
so the test (6.6.10) does not have asymptotic level $\alpha$ in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Simply replace $\hat\sigma^2$ by
$$\tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\Big\{\Big(X_i^{(1)} - \frac{\bar X^{(1)} + \bar X^{(2)}}{2}\Big)^2 + \Big(X_i^{(2)} - \frac{\bar X^{(1)} + \bar X^{(2)}}{2}\Big)^2\Big\}. \qquad (6.6.14)$$
$\square$

Example 6.6.4. The Linear Model with Stochastic Covariates with $\epsilon$ and $\mathbf{Z}$ Dependent. Suppose $E(\epsilon \mid \mathbf{Z}) = 0$, so that the parameters $\boldsymbol{\beta}$ of (6.6.5) are still meaningful but $\epsilon$ and $\mathbf{Z}$ are dependent. Then $H$ is still meaningful. For simplicity, let the distribution of $\epsilon$ given $\mathbf{Z} = \mathbf{z}$ be that of $a(\mathbf{z})\epsilon'$, where $\epsilon'$ is independent of $\mathbf{Z}$; that is, we assume the variances are heteroscedastic. Suppose without loss of generality that $\mathrm{Var}(\epsilon') = 1$.
n 1 L. Specialize to the case Z = (1.. N(O.. d.1.d.n l1Y .Ad)T where (I1. LR.. j ::. In partlCU Iar..Id . 0 To summarize: If hypotheses remain meaningful when the model is false then.1. ••• ...1 1. (5 2)T·IS (U1. The GaussMarkov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i. = {3d = 0 fail to have correct asymptotic levels in general unless (52 (Z) is constant or A1 = . 0 < Aj < 1. A solution is to replace Z(n)Z(n)/ 8 2 in (6.6.. (52) are still reasonable when the true error distribution is not Gaussian.11 . and are uncorrelated.6) by Q1. (6.6). where Qis a consistent estimate of Q. in general.11) with both 11 and (52 unknown.Id) has a multinomial (A1..6) does not have the correct level. 1 ::. the methods need adjustment. (ii) if n 2: r + 1. provided we restrict the class of estimates to linear functions of Y1 . and Rao tests. = Ad = 1/ d (Problem 6. the test (6. . N(O.d.3. (5 .6.6) and. It is easy to see that our tests of H : {32 = .A1.4. In this case. then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. Wald.16) Summary. . . .fiT) o:2)T of U 2)T .6. . . . Yn ."i=r+ I i . (52) assumption on the errors is replaced by the assumption that the errors have mean zero. For the canonical linear Gaussian model with (52 unknown. We considered the behavior of estimates and tests when the model that generated them does not hold. Ur. identical variances. Show that in the canonical exponential model (6... . .. ""2 _ ""12 (TJ1.Ad) distribution. This is the stochastic version of the dsample model of Example 6.3 to compute the information lower bound on the variance of an unbiased estimator of (52. the confidence procedures derived from the normal model of Section 6.4.. The simplest solution at least for Wald tests is to use as an estimate of [Var y'n8]1 not 1(8) or ~D2ln(8) but the socalled "sandwich estimate" (Huber. In the linear model with a random design matrix. . 1967). In particular. we showed that the MLEs and tests generated by the model where the errors are U.9.422 Inference in the Multiparameter Case Chapter 6 (Problem 6. use Theorem 3. in general.6.i. and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.1 are still approximately valid as are the LR. Wald. TJr.7 PROBLEMS AND COMPLEMENTS Problems for Section 6.. then the MLE (iil. (i) the MLE does not exist if n = r. . We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model. .. . and Rao tests need to be modified to continue to be valid asymptotically.. the twosample problem with unequal sample sizes and variances. Compare this bound to the variance of 8 2 • . For d = 2 above this is just the asymptotic solution of the BehrensFisher problem discussed in Section 4. .JL • ""n 2. 6.6.
Here the empirical plugin estimate is based on U.28) for the noncentrality parameter ()2 in the regression example. c E [0. Find the MLE B (). = (Zi2. Var(Uf) = 20.22. n.. then B the MLE of (). n.. zi = (Zi21"" Zip)T.1] is a known constant and are i..n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. N(O.. . .2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1.1.14).1. i = 1.. 1 n f1..1. . Yn ) where Z. . . 0. Yl).2 ).13.. ."" (Z~. 9.l+c (_c)j+l)/~ (1.d. i = 1. where IJ.a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . (d) Show that Var(O) ~ Var(Y). Suppose that Yi satisfies the following model Yi ei = () + fi. .. Derive the formula (6. ..Section 6.'''' Zip)T.29). i = 1.Li = {Ly+(zi {Lz)f3. 8.1.n. fO 0. . Derive the formula (6.i.. and (3 and {Lz are as defined in (1. and the 1. of 7.2 . .. . .(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of ().5) Yi () + ei.2 replaced by 0: 2 • 6.4 • 3.2 coincides with the likelihood ratio statistic A(Y) for the 0.2.2 known case with 0. 0.1. where a. Show that in the regression example with p = r = 2. see Problem 2. the 100(1 .. 5. Let 1. {Ly = /31. J =~ i=O L)) c i (1. Consider the model (see Example 1.29) for the noncentrality parameter 82 in the oneway layout. (e) Show that Var(B) < Var(Y) unless c = O.2 ).7 Problems and Complements 423 Hint: By A. i eo = 0 (the €i are called moving average errors. t'V (b) Show that if ei N(O. Show that >:(Y) defined in Remark 6.d. . 1 {LLn)T. fn where ei = ceil + fi.4. is (c) Show that Y and Bare unbiased. (Zi.. i = 1. Let Yi denote the response of a subject at time i. Show that Ii of Example 6.. 4.
. Yn . (b) Find confidence intervals for 'lj. ~ 1 .p)S2 /x n . Consider a covariate x.. Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6. We want to predict the value of a 14.Z. Often a treatment that is beneficial in small doses is harmful in large doses. .. then the hat matrix H = (h ij ) is given by 1 n 11.2) L(Zi2 . = 2(Y . which is the amount .a)% confidence intervals for a and 6k • 12. Note that Y is independent of Yl. 15...":=::. = np n/2(p 1).p)S2 /x n .. Yn be a sample from a population with mean f.Z. The following model is useful in such situations. ••. a 2 ::.k)' C (b) If n is fixed and divisible by p. (a) Show that level (1 . where n is even.1. (Zi2 Z. n2 = ..a) confidence intervals for linear functions of the form {3j .e .2)2 a and 8k in the oneway layout Var(8k ) ..2)(Zj2 .:1 Yi + (3/ 2n) L~= ~ n+ 1 }i.a).~ L~=1 nk = (p~:)2 + Lk#i .p (~a) (n .p (1 .. In the oneway layout. Assume the linear regression model with p future observation Y to be taken at the pont z.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::. Y ::. . . 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. (a) Find a level (1 .2). .a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z.. (n .• 424 10. ~(~ + t33) t31' r = 2. (d) Give the 100(1 . . then Var(8k ) is minimized by choosing ni = n/2. then Var( a) is minimized by choosing ni = n / p. T2 1 (1/ 2n) L. Show that for the estimates (a) Var(a) = +==_=:. 'lj. Yn ).• statistics HYl. ::. Let Y 1 .2. Yn ) such that P[t.. Consider the three estimates T1 = and T3 Y. l(Yh .{3i are given by = ~(f.1 and variance a 2 . (b) Find a level (1 a) prediction interval for Y (i.. (c) If n is fixed and divisible by 2(p .I). .
Y (yield) 3. Problems for Section 6. : {L E R. Let 8. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O.5 Zi2 * + (33zi2 where Zil = Xi  X. r :::. compute confidence intervals for (31. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1.95 Yi = e131 e132xi Xf3. fli) where fi1.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL. n are independent N(O..A6 when 8 = ({L. Hint: Because C' is of rank r. (32.1) + 2<t'CJL. 1952. a 2 <7 2 .or dose of a treatment. 1 2 where <t'CJL. E :::. and a response variable Y. .r2) (x). •. 17. ( 2 ) logp(x. ( 2 ) density.I::llogxi' You may use X = 0. n.0289. 8). Show that if C is an n x r matrix of rank r. Check AO. 7 2 ) does not exist so that A6 doesn't hold. . nonsingular. (b) Show that the MLE of ({L. hence. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . fi3 and level 0. Yi) and (Xi. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL.7 Probtems and Lornol. p. 653).0'2)(x ) + E<t'(JL.• . P = {N({L. xC' = 0 => x = 0 for any rvector x. In the Gaussian linear model show that the parametrization ({3. ( 2 ). (33. fi2. and Q = P. (b) Plot (Xi. . (a) For the following data (from Hald. P. ( 2 )..0'2) is the N(Il'.r2 )l 1 2 7 > O. then the r x r matrix C'C is of rank r and.. and p be as in Problem 6.2 1. Assume the model = (31 + (32 X i + (33 log Xi + Ei. i = 1. . . ( 2 )T is identifiable if and only if r = p.Section 6.77 167. En Xi. a 2 > O}. 16. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. p(x.emEmts 425 .. 8) = 2. which is yield or production.2.
21).2.2. show that the assumptions of Theorem 6.i > 1. T(X) = Zen).1 show thatMLEs of (3. '". (a) In Example 6.fii(ji .426 Inference in the Multiparameter Case Chapter 6 I . (e) Construct a method of moment estimate moments which are ~ consistent..IY .'. (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4.  (3)T)T has a (".2. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic. Y).2.2.Zen) (31 2 e ~ 7..2. (P("H) : e E e. 3. . 5.1) EZ . H abstract.(b) The MLE minimizes . ell of (} ~ = (Ji. (fj multivariate nannal distribution with mean 0 and variance. e Euclidean. Z i ~O. (I) Combine (aHe) to establish (6.  I . (6. In Example 6. /1.I3). In Example 6.20).1.2 are as given in (6.2.2. 6. and . t) is then called a conditional MLE..2. The MLE of based on (X.1/ 2 ).. then ({3.2.2.". given Z(n). Hint: !x(x) = !YIZ(Y)!z(z). X . In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H._p distribution. show that ([EZZ T jl )(1. I .)Z(n)(fj . and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ). n[z(~)~en)jl)' X. . b .(3) = op(n. 8.24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric. I I ..10 that the estimate Raphson estimates from On is efficient. T 2 ) based on the first two I (d) Deduce from Problem 6. . hence. (c) Apply Slutsky's theorem to conclude that and.1.2. . In Example 6. (e) Show that 0:2 is unconditionally independent of (ji. Hint: (a)...Pe"H).2. Fill in the details of the proof of Theorem 6.2 hold if (i) and (ii) hold. iT 2 ) are conditional MLEs. . Show that if we identify X = (z(n). H E H}. that (d) (fj  (3)TZ'f.24). Establish (6.
0 0 1 < <} ~ 0.L'l'(Xi'O~) n.'2:7 1 'l'(Xi.. Show that if BB a unique MLE for B1 exists and 10. a' . and f(x) = e.1) starting at 0.2. R d are such that: (i) sup{jDgn(O) .Oo) i=cl (1 . Suppose AGA4 hold and 8~ is vn consistent.O) has auniqueOin S(Oo.' .p(Xi'O~) + op(1) n i=l n ) (O~ .. 8~ = 1 80 + Op(n. t=l 1 tt Show that On satisfies (6. . .2.. hence. ao (}2 = (b) Write 8 1 uniquely solves 1 8. that is. (iii) Dg(0 0 ) is nonsingular. Him: You may use a uniform version of the inverse function theorem: If gn : Rd j.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi.LD.l real be independent identically distributed Y.1/ 2 ). (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1.3).x (1 + C X ) . E R. (logistic). (ii) gn(Oo) ~ g(Oo).LD. Hint: n . p is strictly convex.l exists and uniquely ~ . the <ball about 0 0 .l + aCi where fl.•.1' _ On = O~  [ . (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6. p i=l J.0 0 ). a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and. 10 . Examples are f Gaussian.. Let Y1 •. (iv) Dg(O) is continuous at 0 0 . ] t=l 1 n .Section 6. = J.p(Xi'O~) n. Y.(XiI') _O. . <).7 Problems and Complements 427 9. (a) Show that if solves (Y ao is assumed known a unique MLE for L.Dg(O)j .
< Zn 1 for given covariate values ZI. converges to the unique root On described in (b) and that On satisfies 1 1 • j. Hint: Write n J • • • i I • DY. 'I) : 'I E 3 0 } and.2 holds for eo as given in (6. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}.l hold for p(x..3. i2.. = versus K : O oF 0. = 131 (Zil .27). P I . Thus.3 i 1. Similarly compute the infonnation matrix when the model is written as Y.3). P(Ai).. Zip)) + L j=2 p "YjZij + €i J . Wald. i=l • Z.d. CjIJtlZ. 0 < ZI < . a 2 ). LJ j=2 Differentiate with respect to {31. 1 Zw Find the asymptotic likelihood ratio. . q(. for "II sufficiently large.. then it converges to that solution.428 then.12). minimizing I:~ 1 (Ii 2 i. that Theorem 6. . .. II. •. 2 ° 2. Suppose responses YI . Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution.6). {Vj} are orthonormal. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3.2. Inference in the Multiparameter Case Chapter 6 > O. 1 Yn are independent Poisson variables with Yi . N(O. > 0 such that gil are 1 .Zi )13.. and Rao tests for testing H : Ih.3.2. . Establish (6.2... . . Show that if 3 0 = {'I E 3 : '1j = 0..12) and the assumptions of Theorem (6.3. . hence.1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1. ~.. Problems for Section 6. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. and ~ .O E e. Suppose that Wo is given by (6.i. i .(Z"  ~(') z.3. I '" LJ i=l ~(1) Y. •. 'I) = p(.. .. log Ai = ElI + (hZi.' " 13.(Z" ./3)' = L i=1 1 CjZiU) n p . • (6. iteration of the NewtonRaphson algorithm starting at 0.. . Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . there exists a j and their image contains a ball S(g(Oo).O).ip range freely and Ei are i.II(Zil I Zi2. Y. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj.26) and (6. f. .. . t• where {31.~_. Zi.(j) l.
gr. Testing Simple versus Simple.) 1(Xi . respectively. B). (ii) E" (a) Let ).l. q + 1 <j < r S(lJo) . independent.d. 210g..... with Xi and Y.. (Adjoin to 9q+l. .3 is valid.. 1).. p .. . Suppose that Bo E (:)0 and the conditions of Theorem 6.. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0). . S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo).(X1 . . IJ. hence. N(IJ" 1).Xn ) be the likelihood ratio statistic. Deduce that Theorem 6.aq are orthogonal to the linear span of ~(1J0) .n:c"' 4c:2=9 3.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('. There exists an open ball about 1J0. q('. Xi. .3 hold. Oil and p(X"Oo) = E. =  p(·. N(Oz.. Show that under H.n) satisfy the conditions of Problem 6. Let e = {B o . Vi). 1 5. 0 ») is nK(Oo.(I(X i .o log (X 0) PI. Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio.'1 8 0 = and.'1) and Tin '1(lJ n) and. Suppose 8)" > 0.(X" . .) where k(Bo. pXd .3. q + 1 <j< rl. Let (Xi.n 7J(Bo. with density p(. OJ}.IJ) where q(.3. Consider testing H : 8 1 = 82 = 0 versus K : 0. X n Li.X. (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ .. {IJ E S(lJo): ~j(lJ) = 0. 'I) S(lJo)} then. j = 1. Consider testing H : (j = 00 versus K : B = B1 . aTB. = ~ 4. even if ~ b = 0.. . .d. > 0 or IJz > O. Oil + Jria(00 .) O. 1 < i < n. IJ.. Assume that Pel of Pea' and that for some b > 0. .3.7 Problems and C:com"p"":c'm"'.'" .. \arO whereal.Section 6. be ii. < 00.2.2.
li). . = 0. /:1.19) holds. 0. Po . ii. Exhibit the null distribution of 2 log .6.=0 where U ~ N(O. which is a mixture of point mass at O. = (T20 = 1 andZ 1= X 1. 1) with probability ~ and U with the same distribution.. Then xi t. Sucb restrictions are natural if.. 0. < . (iii) Reparametrize as in Theorem 6.6.) if 0. 2 log >'( Xi. Show that 2log.\(X). < cIJ"O.\( Xi.3. Wright. (h) : 0 < 0.i.) under the model and show that (a) If 0.. (b) If 0. .\(Xi. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular. and Dykstra. > 0.. • i . Z 2 = poX1Y v' . (b) Suppose Xil Yi e 4 (e) Let (X" Y.. XI and X~ but with probabilities ~ . Hint: ~ (i) Show that liOn) can be replaced by 1(0). Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. for instance.3..0).0.2 for 210g. Yi : 1 and X~ with probabilities ~.3. In the model of Problem 5(a) compute the MLE (0. ' · H In!: Consl'defa1O . I . mixture of point mass at 0. ±' < i < n) is distributed as a respectively. (d) Relate the result of (b) to the result of Problem 4(a). 2 e( y'n(ii.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6.1.. Hint: By sufficiency reduce to n = 1.li : 1 <i< n) has a null distribution. O > 0.d. = v'1~c2' 0 <.0.0.) have an N.0"10' O"~o. = a + BB where . Yi : l<i<n). > OJ.  0. we test the efficacy of a treatment on the basis of two correlated responses per individual. =0. be i. j ~ ~ 6. > 0. 7.0.)) ~ N(O. and ~ where sin. (0.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. 1 < i < n. under H. are as above with the same hypothesis but = {(£It. Show that (6.~.  OJ. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0. 1988).1. Let Bi l 82 > 0 and H be as above. Po) distribution and (Xi.
A611 Problems for Section 6. A3. Show that under A2.7 Problems and Complements ~(1 ) 431 8. (a) Show that the correlation of Xl and YI is p = peA n B) .6 is related to Z of (6.4. 10.2 to lin .22) is a consistent estimate of ~l(lIo).P(A))P(B)(1 . Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1. .3.8) by Z ~ . 3.4.8) (e) Derive the alternative form (6. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.4.4 ~ 1(11) is continuous. (e) Conclude that if A and B are independent. Hint: Argue as in Problem 5. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.8) for Z. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born.2.21).3. hence..3. 9.nf. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6. then Z has a limitingN(011) distribution.0 < PCB) < 1.3. 2.Section 6. 1. Hint: Write and apply Theorem 6. (b) (6.1(0 0 ). Exhibit the two solutions of (6.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5.4) explicitly and find the one that corresponds to the maximizer of the likelihood.P(A)P(B) JP(A)(1 .4. 0 < P( A) < 1.10.
( where ( .. B!c1b!. . N 12 • N 21 .)). Consider the hypothesis H : Oij = TJil TJj2 for all i. N 22 ) rv M (u. S..D. Hint: (a) Consider the likelihood as a function of TJil. . 6.C... are the multinomial coefficients. . N ll and N 21 are independent 8( r. (a) Let (NIl. i = 1. .~~.~l. . It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. . Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI. R 2 = T2 = n . C j = L' N'j. TJj2. n a2 ) . . (a) Show that the maximum likelihood estimates of TJil. = Lj N'j.4. TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. Cj = Cj] : ( nll.54).. Let N ij be the entries of an let 1Jil = E~=l (}ij. 811 \ 12 . 8(r2' 82 I/ (8" + 8.4 deduce that jf j(o:) (depending on chosen so that Tl.1 . n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o. j.TI. b I Ri ( = Ti. n R.. 8l! / (8 1l + 812 )).6 that H is true. Fisher's Exact Test From the result of Problem 6. CI.~~.b1only. j = 1.ra) nab . Ti) (the hypergeometric distribution).9) and has approximately a X1al)(bl) distribution under H.u.4.~....2. i = 1. (a) Show that then P[N'j niji i = 1.. This is known as Fisher's exact test. j = 1. n. . I (c) Show that under independence the conditional distribution of N ii given R.1.1 II . ( rl. C i = Ci.. 8 21 . nab) A ) = B. in principle. (b) Deduce that Pearson's X2 is given by (6... 7. (22 ) as in the contingency table. 1 : X 2 (b) How would you.. a . . . . ... Suppose in Problem 6. .4. 432 Inference in the Multiparameter Case Chapter 6 i 4. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent.. nal ) n12.. TJj2 are given by TJil ~ = .2 is 1t(Ci. =T 1. TJj2 ~ = Cj n where R.. use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 .
Zi not all equal. Establish (6. (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. and that we wish to test H : Ih < f3E versus K : Ih > f3E. n. C are three events. 9. Give pvalues for the three cases. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n . which rejects. N 22 is conditionally distributed 1t(r2. = 0 in the logistic model.12 I .. (h) Construct an experiment and three events for which (i) and (ii) hold.14). and petfonn the same test on the resulting table.(rl + cI). + IhZi.. if A and C are independent or B and C are independent. . Then combine the two tables into one.0.7 Problems and Complements 433 8. and that under H. where Pp~ [2:f . 10. Suppose that we know that {3.BINDEPENDENTGIVENC) n B) = P(A)P(B) (A.1""".4. 11.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A.. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. The following table gives the number of applicants to the graduate program of a small department of the University of California. for suitable a. 2:f . ~i = {3. Would you accept Or reject the hypothesis of independence at the 0. (b).) Show that (i) and (ii) imply (iii).5? Admit Deny Men Women 1 19 . there is a UMP level a test. classified by sex and admission status. B INDEPENDENT) (iii) PeA (C is the complement of C.05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. consider the assertions. C2).ZiNi > k] = a. Test in both cases whether the events [being a man] and [being admitted] are independent.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). Show that. if and only if.4.Section 6.(rl + cd or N 22 < ql + n . B. but (iii) does not.ziNi > k. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'. (a) If A.
k and a 2 is unknown.5.. .4.20) for the regression described after (6.d. . but under K may be either multinomial with 0 #. for example. Y n ) have density as in (6.f. Use this to imitate the argument of Theorem 6. 1 .OiO)2 > k 2 or < k}.OkO) under H. which is valid for the i. (Zn.00 or have Eo(Nd : . if the design matrix has rank p.z(kJl) ~I 1 . .15) is consistent. a under H.5. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 .. or Oi = Bio (known) i = 1. = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown. .'Ir(. " Ok = 8kO. 2.. Show that.2. Suppose that (Z" Yj).8) and. a5 j J j 1 J I . .5. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973).LJ2 Z i )).Ld... a 2 ) where either a 2 = (known) and 01 . (a) Compute the Rao test for H : (32 Problem 6. . Suppose the ::i in Problem 6. Given an initial value ()o define iterates  Om+l .. .4. 15. 13. . . but Var.: nOiO.. i 16.4. asymptotically N(O.5 construct an exact test (level independent of (31). .4. i Problems for Section 6. 434 • Inference in the Multiparameter Case Chapter 6 f • 12. " lOk vary freely. I).3. 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2.. case. then 130 as defined by (6.3. with (Xi I Z....4.. .. . • I 1 • • (a)P[Z.. Xi) are i..5 1. . Let Xl. a 2 = 0"5 is of the form: Reject if (1/0". ..11. This is an approximation (for large k.. 3..d. . I < i < k are independent. ~ 8 m + [1(8 m )Dl(8 m ).)llmi~i(1 'lri). Zi and so that (Zi.Oio)("Cooked data"). .. Compute the Rao test statistic for H : (32 case.4). Verify that (6.. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. X k be independent Xi '" N (Oi. .i. in the logistic regression model.1 14.11 are obtained as realization of i..". ..18) tends 2 to Xrq' Hint: (Xi . · . 010.l 1 . mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6. Tn. I... Show that the likelihO<Xt ratio test of H : O} = 010 . Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI.i. Nk) ~ M(n.(Ni ) < nOiO(1 . . E {z(lJ. .) ~ B(m.4) is as claimed formula (2.4. n) and simplification of a model under which (N1.3) l:::~ I (Xi .
contains an open L. your result coincides with (6.)z'j dC.5). Yn ) are i.. (d) Gaussian GLM.= 1 . Assume that there exist functions h(y. Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1.) . d".F. k.b(Oi)} p(y. Find the canonical link function and show that when g is the canonical link. h(y...Section 6. T.. (e) Suppose that Y. . and v(.).. J JrJ 4. . . 05. (y.9). ball about "k=l A" zlil in RP . .) and C(T) such that the model for Yi can be written as O.T)exp { C(T) where T is known. Wn). g("i) = zT {3. the deviance is 5. . (a) Show that the likelihood equations are ~ i=I L. give the asymptotic distribution of y'n({3 ..(3). . C(T).7 Problems and Complements 435 (b) The linear span of {ZII).. T).. under appropriate conditions.5. Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . <..5. Let YI.. . b(9).) = ~ Var(Y)/c(T) b"(O). b(B).5.. 9 = (b')l. . k.ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j. j = 1.. the resuit of (c) coincides with (6. . Show that.y .. J .p. Give 0. P(I"). In the random design case... L. Show that for the Gaussian linear model with known variance D(y. T). g(p. and v(.T). ~ . . then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) .) and v(. distribution... Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi. .) ~ 1/11("i)( d~i/ d". Set ~ = g(.9). Give 0. as (Z. (c) Suppose (Z" Y I ). C(T).1 = 0 V fJ~ . T. Suppose Y..i. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known).O(z)) where O(z) solves 1/(0) = gI(zT {3).Oi)~h(y.(.d. 00 d" ~ a(3j (b) Show that the Fisher information is Z]. ~ N("" <7. Y) and that given Z = z.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. . and b' and 9 are monotone.). h(y. has the Poisson.. . b(9). Show that when 9 is the canonical link... (Zn. Wi = w(". 1'0) = jy 1'01 2 /<7. Y follow the model p(y.
Suppose ADA6 are valid.Q+1>'" . t E R. and R n are the corresponding test statistics.6. W n and R n are computed under the assumption of the Gaussian linear model with a 2 known.(3p ~ (3o.10) creates a valid level u test. n . I: .6. I 4. the unique median of p. . hence.3. Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold. Show that 0: 2 given in (6.s(t). if VarpDl(X.d. Wald.15) by verifying the condition of Theorem 6. I "j 5. the infonnation bound and asymptotic variance of Vri(X 1'). but if f~(x) = ~ exp Ix 1'1. then under H. I .2. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6. l 1 . if P has a positive density v(P).3 is as given in (6. that is.j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6.3. in fact.1 under this model and verifying the fonnula given. P.3) then f}(P) = f}o. .6.Xn are i. Suppose Xl. Note: 2 log An. Apply Theorem 6.7. . then the Rao test does not in general have the correct asymptotic level.6. 0 < Vae f < 00.10).. (3) where s(t) is the continuous distribution function of a random variable symmetric about 0. " . 0'2). By Problem 5. /10) is estimated by 1(80 ). . • I 3. replacing &2 by 172 in (6.). 2.14) is a consistent estimate of2 VarpX(l) in Example 6. Show that. = 1" (b) Show that if f is N(I'. ! I I ! I .6. and Rao tests are still asymptotically equivalent in the sense that if 2 log An. O' 2(p)/0'2 = 2/1r.p under the sole assumption that E€ = 0.6. .6.1. In the hinary data regression model of Section 6.6. Wn Wn + op(l) + op(I).i.1. Consider the linear model of Example 6. Show that the LR.O' 2 (p») = 1/4f(v(p». Show that the standard Wald test forthe problem of Example 6.1) ~ . then 0'2(P) < 0'2. set) = 1. .6 1.. i 6.3 and. 7.4. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. let 1r = s(z. f}o) is used. . Establish (6. (6. j. then O' 2(p) > 0'2 = Varp(X. .2 and the hypothesis (3q+l = (3o. (a) Show that if f is symmetric about 1'. W n . then it is.2. but that if the estimate ~ L:~ dDIllDljT (Xi. then v( P) . then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0..
. are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n.{3lp) p and {3(P) ~ (/31.i . O)T and deduce that (c) EPE(p) = RSS(p) + ~.1.(p) the corresponding fitted value.2. 1973.2 is an unbiased estimate of EPE(P). Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows. 0.lpI)2 be the residual sum of squares. (Model Selection) Consider the classical Gaussian linear model (6. ..d.9.7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models..2 (b) Show that (1 + ~D + .I EL(Y.2 is known...(b) continue to hold if we assume the GaussMarkov model. . then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular. . 8.jP)2 where ILl ) = z.' i=l . i = 1. i . . .. Zi) : 1 <i< n}. the model with 13d+l = . Let RSS(p) = 2JY. at Zll'" . But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1.. f3p. L:~ 1 (P. . Show that if the correct model has Jri given by s as above and {3 = {3o. ~ ~ (d) Show that (a). Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d .. ... 1 V. for instance). Hint: Apply Theorem 6. where Xln) ~ {(Y. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution. (a) Show that EPE(p) ~ .1) Yi = J1i + ti. y~p) and.lp)2 Here Yi" l ' •• 1 Y.d..p. _. 1 n. .. A natural goal to entertain is to obtain new values Yi". Let f3(p) be the LSE under this assumption and Y. Zi are ddimensional vectors for covariate (factor) values.p don't matter. Zi.Section 6.9. that Zj is bounded with probability 1 and let ih(X ln )).. 1 n.....1. be the MLE for the logit model. /3P+I = '" = /3d ~ O.Zn and evaluate the performance of Yjep) . (b) Suppose that Zi are realizations of U.. Suppose that . .. i = 1. hence. Gaussian with mean zero J1i = Z T {3 and variance 0'2. where ti are i.
438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii.) .6.1 (1) From the L A. i=1 Derive the result for the canonical model. + Evaluate EPE for (i) ~i ~ "IZ.' . 1 "'( i=1 . 1969. AND F. which are not multinomial. this makes no sense for the model we discussed in this section. The Analysis of Binary Data London: Methuen. Y. . . Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. but it is reasonable.4. Heart Study after Dixon and Massey (1969). 1 I .i2} such that the EPE in case (i) is smaller than in case (d) and vice versa. 3rd 00. W. 6. if we consider alternatives to H. EPE(p) n R SS( p) = . Note for Section 6. DIXON. . Use 0'2 = 1 and n = 10. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. Y/" j1~. MASSEY.16). . LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6.9 REFERENCES Cox.L . t12 and {Z/I. i 6. A.9 for a discussion of densities with heavy tails. (b) The result depends only on the mean and covariance structure of the i=l"". . and (ii) 'T}i = /31 Zil + f32zi2. . Note for Section 6. 1 n n L (I'.n. we might envision the possibility that an overzealous assistant of Mendel "cooked" the data... Give values of /31. ~(p) n .z. For instance. Hint: (a) Note that ".2 (1) See Problem 3. New York: McGrawHill. 1970.8 NOTES Note for Section 6. Introduction to Statistical Analysis. ti. ! . The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6..4 (1) R. R.  J .~I'. )I'i).". D.5.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I. New York: McGraw-Hill, 1961.

HABERMAN, S., The Analysis of Frequency Data. Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications. New York: Wiley, 1952.

HUBER, P., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., Univ. of California Press, 221-233 (1967).

KASS, R., J. KADANE, AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc., 36, 383-393 (1987).

LAPLACE, P. S., "Sur quelques points du systeme du monde," Memoires de l'Academie des Sciences de Paris (Reprinted in Oeuvres Completes, 11, 475-558, Paris: Gauthier-Villars) (1789).

MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).

McCULLAGH, P., AND J. NELDER, Generalized Linear Models, second edition. London: Chapman and Hall, 1983, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian Hare and the Laplacian Tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. WRIGHT, AND R. DYKSTRA, Order Restricted Statistical Inference. New York: Wiley, 1988.

SCHEFFE, H., The Analysis of Variance. New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics. New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply.

The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize.
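This stabilization can be illustrated numerically. The following sketch is our own illustration, not part of the text: it simulates repetitions of an experiment in which an event A has probability 0.3 and prints the running relative frequency, which drifts toward 0.3 as the number of repetitions grows.

```python
import random

def relative_frequency(p, n, seed=0):
    """Simulate n repetitions of an experiment in which the event A has
    probability p, and return the running relative frequency n_A / n."""
    rng = random.Random(seed)
    n_A = 0
    freqs = []
    for i in range(1, n + 1):
        if rng.random() < p:   # A occurs on this repetition
            n_A += 1
        freqs.append(n_A / i)
    return freqs

if __name__ == "__main__":
    freqs = relative_frequency(p=0.3, n=100_000)
    for n in (10, 100, 1000, 10_000, 100_000):
        print(f"n = {n:6d}   n_A/n = {freqs[n - 1]:.4f}")
```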
This long-term relative frequency $n_A/n$, where $n_A$ is the number of times the possible outcome $A$ occurs in $n$ repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability.

Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." For a discussion of this approach and further references the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by $\Omega$.

A.1.2 A sample point is any member of $\Omega$ and is typically denoted by $\omega$.

A.1.3 Subsets of $\Omega$ are called events. We denote events by $A$, $B$, $C$, and so on or by a description of their members. If $\omega \in \Omega$, $\{\omega\}$ is called an elementary event. If $A$ contains more than one point, it is called a composite event. The complement of $A$ is denoted by $A^c$; the null set or impossible event is denoted by $\emptyset$. The relation between the experiment and the model is given by the correspondence "$A$ occurs if and only if the actual outcome of the experiment is a member of $A$." The set operations we have mentioned have interpretations also. For example, the relation $A \subset B$ between sets considered as events means that the occurrence of $A$ implies the occurrence of $B$. In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols $\cup$, $\cap$, $^c$, $-$, $\subset$ for union, intersection, complementation, set theoretic difference, and inclusion as is usual in elementary set theory.

A.1.4 We will let $\mathcal{A}$ denote a class of subsets of $\Omega$ to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability $P$ to every subset of $\Omega$. $\mathcal{A}$ is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; Loeve, 1977). A probability distribution or measure is a nonnegative function $P$ on $\mathcal{A}$ having the following properties:

(i) $P(\Omega) = 1$;
(ii) If $A_1, A_2, \ldots$ are pairwise disjoint sets in $\mathcal{A}$, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

Recall that $\bigcup_{i=1}^{\infty} A_i$ is just the collection of points that are in any one of the sets $A_i$, and that two sets are disjoint if they have no points in common.
A.1.5 The three objects $\Omega$, $\mathcal{A}$, and $P$ together describe a random experiment mathematically. We shall refer to the triple $(\Omega, \mathcal{A}, P)$ either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of $\mathcal{A}$.

References: Gnedenko (1967) Chapter I, Section 8; Grimmett and Stirzaker (1992) Section 1.3; Hoel, Port, and Stone (1971) Sections 1.1-1.3; Parzen (1960) Chapter 1, Sections 1-5; Pitman (1993) Section 1.3.

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of $P$.

A.2.1 If $A \subset B$, then $P(B - A) = P(B) - P(A)$, and hence $P(B) \ge P(A)$.

A.2.2 $P(A^c) = 1 - P(A)$.

A.2.3 $P(\emptyset) = 0$.

A.2.4 $0 \le P(A) \le 1$.

A.2.5 $P\left(\bigcup_{n=1}^{\infty} A_n\right) \le \sum_{n=1}^{\infty} P(A_n)$.

A.2.6 $P\left(\bigcap_{i=1}^{n} A_i\right) \ge 1 - \sum_{i=1}^{n} P(A_i^c)$ (Bonferroni's inequality).

A.2.7 If $A_1 \subset A_2 \subset \cdots \subset A_n \subset \cdots$, then $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$.

References: Gnedenko (1967) Chapter I, Section 8; Grimmett and Stirzaker (1992) Section 1.3; Hoel, Port, and Stone (1971) Section 1.3; Parzen (1960) Chapter 1, Sections 4-5; Pitman (1993) Section 1.3.

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if $\Omega$ is finite or countably infinite and every subset of $\Omega$ is assigned a probability. In this case, we can write $\Omega = \{\omega_1, \omega_2, \ldots\}$ and, by axiom (ii) of (A.1.4), we have for any event $A$,
$$P(A) = \sum_{\omega_i \in A} P(\{\omega_i\}). \tag{A.3.2}$$
I B). are (pairwise) disjoint events and P(B) > 0. . guinea pigs.:. ~f 1 . . I B) = ~ PiA. say N.' . shaking well. flowers. etc. For large N.(A n B j ). From a heuristic point of view P(A I B) is the chance we would assign to the event A if we were told that B has occurred. (A.3) yield U. (A. the identity A = . by l P(A I B) ~ PtA n B) P(B) .~ 1 1 ~• 1 .3) 1 I A.:c:.:c Number of elements in A N . is an experiment leading to the model of (A.=:. . j=l (A.4)(ii) and (A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE I j 1 1 ! . the function P(. and P( A ) ~ j .3.3).• P (Q A. Transposition of the denominator in (AA. (A.. then P( A I B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. . 1 1 PiA n B) = P(B)P(A I B). Sections 67 Pitman (1993) Section l.l) gives the multiplication rule.1 • 1 . and drawing.3. Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another. which we write PtA I B).4 Suppose that WI. WN are the members of some population (humans.=. for fixed B as before.4.4.444 A Review of Basic Probability Theory Appendix A An important special case arises when n has a finite number of elements.4.. (A. then .4. A.. (A.4) . Given an event B such that P( B) > 0 and any other event A. n PiA) = LP(A I Bj)P(Bj ). .. machines. i References Gnedenko (1967) Chapter I.1.l n' " I"I ". selecting at random.. Then P( {w}) = 1/ N for every wEn..1) If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment..2) In fact. '< = ~=~:. . a random number table or computer can be used. i • i: " i:.3. we define the conditional probability of A given B.I B) is a probability measure on (fl. all of which are equally likely.). Such selection can be carried out if N is small by putting the "names" of the Wi in a hopper. Sections 45 Parzen (1960) Chapter I.• 1 B n are (pairwise) disjoint events of positive probability whose union is fl.4..3) If B l l B 2 . A) which is referred to as the conditional probability measure given B. If A" A 2 .
.i k } of the integers {l. Chapter 3..}.) j=l k (AA.. . . ." ...En such that P(B I n . Port. BnJl (AA. Ifall theP(A i ) are positive.. (AA..) = II P(A. we can combine (A.. B ll is written P(A B I •.4. (AA. ..• ... and (A. I PeA I B.i.8) > 0. relation (A..4) and obtain Bayes rule .3).. References Gnedenko (1967) Chapter I.. . Simple algebra leads to the multiplication rule. .. ..I_1 ..8) may be written P(A I B) ~ P(A) (AA. n B n ) > O.II) for any j and {i" . B I .} such thatj ct {i" .. I. B 2 ) . An are said to be independent if P(A i ..9) In other words. n··· nA i .S parzen (1960) Chapter 2. ... n Bnd > O.n}...)P(B. B n ) and for any events A.···.I J (AA.) ~ P(A J ) (AA.7) whenever P(B I n . P(B" I Bl. A and B are independent if knowledge of B does not affect the probability of A. . Two events A and B are said to be independent if P(A n B) If P( B) ~ P(A)P(B). P(B 1 n·· n B n ) ~ P(B 1 )P(B2 I BJlP(B3 I ill.4. P(il.) A) ~ ""_ PIA I B )p(BT L. .S) The conditional probability of A given B I defined by . The events All . and Stone (1971) Sections lA..lO) is equivalent to requiring that P(A J I A.... . Sections IA Pittnan (1993) Section lA .1 ). the relation (AA.Section AA Conditional Probability and Independence 445 If P( A) is positive.4. . Sections 9 Grimmett and Stirzaker (1992) Section IA Hoel.. . Section 4..A.i.lO) for any subset {iI.
A.5 COMPOUND EXPERIMENTS

There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first draw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. Informally, a compound experiment is one made up of two or more component experiments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6, where examples of compound experiments are given.

If we are given $n$ experiments (probability models) $\mathcal{E}_1, \ldots, \mathcal{E}_n$ with respective sample spaces $\Omega_1, \ldots, \Omega_n$, the ($n$ stage) compound experiment consists in performing component experiments $\mathcal{E}_1, \ldots, \mathcal{E}_n$ and recording all $n$ outcomes. The sample space $\Omega$ of the $n$ stage compound experiment is by definition $\Omega_1 \times \cdots \times \Omega_n = \{(\omega_1, \ldots, \omega_n) : \omega_i \in \Omega_i, 1 \le i \le n\}$. The interpretation of the sample space $\Omega$ is that $(\omega_1, \ldots, \omega_n)$ is a sample point in $\Omega$ if and only if $\omega_1$ is the outcome of $\mathcal{E}_1$, $\omega_2$ is the outcome of $\mathcal{E}_2$, and so on.

A.5.1 Recall that if $A_1, \ldots, A_n$ are events, the Cartesian product $A_1 \times \cdots \times A_n$ of $A_1, \ldots, A_n$ is by definition $\{(\omega_1, \ldots, \omega_n) : \omega_i \in A_i, 1 \le i \le n\}$. If $A_i \in \mathcal{A}_i$, the sigma field corresponding to $\mathcal{E}_i$, then $A_i$ corresponds to $\Omega_1 \times \cdots \times \Omega_{i-1} \times A_i \times \Omega_{i+1} \times \cdots \times \Omega_n$ in the compound experiment. To say that $\mathcal{E}_i$ has had outcome $\omega_i^0 \in \Omega_i$ corresponds to the occurrence of the compound event (in $\Omega$) given by $\Omega_1 \times \cdots \times \Omega_{i-1} \times \{\omega_i^0\} \times \Omega_{i+1} \times \cdots \times \Omega_n$.

If we want to make the $\mathcal{E}_i$ independent, then intuitively we should have all classes of events $\mathcal{A}_1, \ldots, \mathcal{A}_n$ independent, that is, for $A_i \in \mathcal{A}_i$,
$$P(A_1 \times \cdots \times A_n) = P(A_1 \times \Omega_2 \times \cdots \times \Omega_n)\,P(\Omega_1 \times A_2 \times \cdots \times \Omega_n) \cdots P(\Omega_1 \times \cdots \times \Omega_{n-1} \times A_n). \tag{A.5.2}$$

If we are given probabilities $P_i$ on $(\Omega_i, \mathcal{A}_i)$, $1 \le i \le n$, and define
$$P(A_1 \times \cdots \times A_n) = P_1(A_1) \cdots P_n(A_n), \tag{A.5.3}$$
then (A.5.2) holds. It may be shown (Billingsley, 1995; Chung, 1974; Loeve, 1977) that if $P$ is defined by (A.5.3) for events $A_1 \times \cdots \times A_n$ with $A_i \in \mathcal{A}_i$, then it can be uniquely extended to the sigma field $\mathcal{A}$ specified in note (1) at the end of this appendix. We shall speak of independent experiments $\mathcal{E}_1, \ldots, \mathcal{E}_n$ if the $n$ stage compound experiment has its probability structure specified by (A.5.3). In the discrete case, (A.5.3) holds provided that
$$P(\{(\omega_1, \ldots, \omega_n)\}) = P_1(\{\omega_1\}) \cdots P_n(\{\omega_n\}) \quad \text{for all } \omega_i \in \Omega_i. \tag{A.5.4}$$
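As a small numerical illustration of (A.5.3) and (A.5.4) (our sketch, not part of the text; the coin-and-die example and all names are our own choices), the following builds the compound probability for two independent discrete component experiments and checks the product rule on a rectangle event.

```python
from itertools import product
from fractions import Fraction

# Two independent component experiments: a fair coin and a fair die.
P1 = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
P2 = {i: Fraction(1, 6) for i in range(1, 7)}

# (A.5.4): in the discrete case the probability of a compound sample point
# is the product of the component probabilities.
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}
print(sum(P.values()))                    # 1, so P is a probability

# (A.5.3): P(A1 x A2) = P1(A1) * P2(A2), here for A1 = {H}, A2 = {1, 2}.
A1, A2 = {"H"}, {1, 2}
lhs = sum(P[(w1, w2)] for w1 in A1 for w2 in A2)
print(lhs, P1["H"] * (P2[1] + P2[2]))     # both equal 1/6
```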
Specifying $P$ when the $\mathcal{E}_i$ are dependent is more complicated. In the discrete case we know $P$ once we have specified $P(\{(\omega_1, \ldots, \omega_n)\})$ for each $(\omega_1, \ldots, \omega_n)$ with $\omega_i \in \Omega_i$. By the multiplication rule (A.4.7) we have
$$P(\{(\omega_1, \ldots, \omega_n)\}) = P(\mathcal{E}_1 \text{ has outcome } \omega_1)\,P(\mathcal{E}_2 \text{ has outcome } \omega_2 \mid \mathcal{E}_1 \text{ has outcome } \omega_1) \cdots P(\mathcal{E}_n \text{ has outcome } \omega_n \mid \mathcal{E}_1 \text{ has outcome } \omega_1, \ldots, \mathcal{E}_{n-1} \text{ has outcome } \omega_{n-1}). \tag{A.5.5}$$
The probability structure is determined by these conditional probabilities and conversely.

References: Grimmett and Stirzaker (1992) Section 1.6; Hoel, Port, and Stone (1971) Section 1.5; Parzen (1960) Chapter 3.

A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign $P(\{S\}) = p$, we shall refer to such an experiment as a Bernoulli trial with probability of success $p$. The simplest example of such a Bernoulli trial is tossing a coin with probability $p$ of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment $n$ times independently, we say we have performed $n$ Bernoulli trials with success probability $p$. If $\Omega$ is the sample space of the compound experiment, any point $\omega \in \Omega$ is an $n$-dimensional vector of S's and F's and, by (A.5.4),
$$P(\{\omega\}) = p^{k(\omega)}(1-p)^{n-k(\omega)}, \tag{A.6.2}$$
where $k(\omega)$ is the number of S's appearing in $\omega$. If $A_k$ is the event [exactly $k$ S's occur], then
$$P(A_k) = \binom{n}{k} p^k (1-p)^{n-k}, \tag{A.6.3}$$
where
$$\binom{n}{k} = \frac{n!}{k!(n-k)!}.$$
The formula (A.6.3) is known as the binomial probability.
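As a quick numerical check of (A.6.2) and (A.6.3) (our sketch, not part of the text), the binomial probability can be computed directly from the formula and also obtained by summing the probabilities of all sample points with exactly k successes.

```python
from math import comb
from itertools import product

def binomial(n, k, p):
    """P(A_k) = C(n, k) * p**k * (1 - p)**(n - k), as in (A.6.3)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_by_enumeration(n, k, p):
    """Sum P({omega}) of (A.6.2) over all outcomes with exactly k successes."""
    total = 0.0
    for omega in product("SF", repeat=n):
        if omega.count("S") == k:
            total += p**k * (1 - p) ** (n - k)
    return total

print(binomial(5, 2, 0.3))                 # 0.3087
print(binomial_by_enumeration(5, 2, 0.3))  # same value, by direct enumeration
```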
. Sectiou 11 Hoel.6) where the k i are natural numbers adding up to n..1 j ..N (1 . .6.n)!' If the case drawn is replaced before the next drawing.. If Np of the members of n have a "special" characteristic S and N (1 ~ p) have the opposite characteristic F and A k = (exactly k "special" individuals are obtained in the sample). ..S If we have a finite population of cases = {WI"" WN} and we select cases Wi successively at random n times without replacement.P))nk k (N)n = (A6.4 Parzen (1960) Chapter 3. exactly kqwq's are observed). . and the component experiments are independent and P( {a}) = liNn. . = N! (N ..0. . . n . Np). n P({a}) ~ (N)n where 1 I (A.6.1O) J for max(O. PtA ) k =( n ) (Np). and P(Ak) = 0 otherwise. A.) of the compound experiment. I A. i .k q is the event (exactly k l WI 's are observed. the component experiments are not independent and.k. i . P) independently n times.7 PROBABILITIES ON EUCLIDEAN SPACE Random experiments whose outcomes are real numbers playa central role in theory and practice. .S) as follows. .6. · · • A. (N)n .7 If we perform an experiment given by (. Sections 14 Pitman (1993) Section 2. 10) is known as the hypergeometric probability. I' . . we are sampling with replacement.1 . . .6. .p)) < k < min( n. The fonnula (A. then P( k" .) A = n! k k k !. __________________J i . When n is finite the tenn. we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (n. and Stone (1971) Section 2.1 . . with replacement is added to distinguish this situation from that described in (A. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space. . ·Pq t (A.(N(I. 1 .6. exactly k 2 wz's are observed. If AkJ.9) . . for any outcome a = (Wil"" 1Wi.448 A Review of Basic Probability Theory Appendix A where k~(w) = number of times Wi appears in the sequence w..A l P). then ~ ..6.. kq!Pt' . A. Port. References Gnedeuko (1967) Chapter 2.
A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space. We shall use the notation $R^k$ for $k$-dimensional Euclidean space and denote members of $R^k$ by symbols such as $\mathbf{x}$ or $(x_1, \ldots, x_k)^T$, where $(\;)^T$ denotes transpose. We will write $R$ for $R^1$ and $\mathcal{B}$ for $\mathcal{B}^1$.

A.7.1 If $(a_1, b_1), \ldots, (a_k, b_k)$ are $k$ open intervals, we shall call the set $(a_1, b_1) \times \cdots \times (a_k, b_k) = \{(x_1, \ldots, x_k) : a_i < x_i < b_i,\ 1 \le i \le k\}$ an open $k$ rectangle.

A.7.2 The Borel field in $R^k$, which we denote by $\mathcal{B}^k$, is defined to be the smallest sigma field having all open $k$ rectangles as members. Any subset of $R^k$ we might conceivably be interested in turns out to be a member of $\mathcal{B}^k$.

A.7.3 A discrete (probability) distribution on $R^k$ is a probability measure $P$ such that $\sum_{i=1}^{\infty} P(\{\mathbf{x}_i\}) = 1$ for some sequence of points $\{\mathbf{x}_i\}$ in $R^k$. This definition is consistent with (A.3.1) because the study of this model and that of the model that has $\Omega = \{\mathbf{x}_1, \mathbf{x}_2, \ldots\}$ are equivalent; only an $\mathbf{x}_i$ can occur as an outcome of the experiment. The frequency function $p$ of a discrete distribution is defined on $R^k$ by
$$p(\mathbf{x}) = P(\{\mathbf{x}\}). \tag{A.7.4}$$
Conversely, any nonnegative function $p$ on $R^k$ vanishing except on a sequence $\{\mathbf{x}_1, \mathbf{x}_2, \ldots\}$ of vectors and satisfying $\sum_{i=1}^{\infty} p(\mathbf{x}_i) = 1$ defines a unique discrete probability distribution by the relation
$$P(A) = \sum_{\mathbf{x}_i \in A} p(\mathbf{x}_i). \tag{A.7.5}$$

A.7.6 A nonnegative function $p$ on $R^k$, which is integrable and which has
$$\int_{R^k} p(\mathbf{x})\, d\mathbf{x} = 1,$$
is called a density function.

A.7.7 A continuous probability distribution on $R^k$ is a probability $P$ that is defined by the relation
$$P(A) = \int_A p(\mathbf{x})\, d\mathbf{x} \tag{A.7.8}$$
for some density function $p$ and all events $A$, where $d\mathbf{x}$ denotes $dx_1 \cdots dx_k$. It may be shown that a function $P$ so defined satisfies (A.1.4). Recall that the integral on the right of (A.7.8) is by definition $\int_{R^k} 1_A(\mathbf{x}) p(\mathbf{x})\, d\mathbf{x}$, where $1_A(\mathbf{x}) = 1$ if $\mathbf{x} \in A$, and 0 otherwise. Integrals should be interpreted in the sense of Lebesgue. However, for practical purposes, Riemann integrals are adequate. Geometrically, $P(A)$ is the volume of the "cylinder" with base $A$ and height $p(\mathbf{x})$ at $\mathbf{x}$. Distributions $P$ defined by (A.7.8) are usually called absolutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely.
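The following small numerical check (our sketch, not part of the text; the particular density is an arbitrary choice) illustrates (A.7.6) and (A.7.8) for k = 1: a Riemann sum shows the function integrates to 1 and gives the probability of an interval.

```python
# Check that p(x) = 2x on (0, 1) is a density and compute P((0, 1/2]).
def p(x):
    return 2.0 * x if 0.0 < x < 1.0 else 0.0

def integrate(f, a, b, n=100_000):
    """Simple midpoint Riemann sum; adequate here, as the text notes."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

print(integrate(p, 0.0, 1.0))   # ~1.0, so p is a density
print(integrate(p, 0.0, 0.5))   # ~0.25 = P((0, 1/2])
```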
A.7.9 Although in a continuous model $P(\{x\}) = 0$ for every $x$, the density function has an operational interpretation close to that of the frequency function. For instance, if $p$ is a continuous density on $R$, $x_0$ and $x_1$ are in $R$, and $h$ is close to 0, then by the mean value theorem
$$P([x_0 - h, x_0 + h]) \approx 2hp(x_0) \quad \text{and} \quad \frac{P([x_0 - h, x_0 + h])}{P([x_1 - h, x_1 + h])} \approx \frac{p(x_0)}{p(x_1)}. \tag{A.7.10}$$
The ratio $p(x_0)/p(x_1)$ can, thus, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of $x_0$ than one in a neighborhood of $x_1$.

A.7.11 The distribution function (d.f.) $F$ is defined by
$$F(x_1, \ldots, x_k) = P((-\infty, x_1] \times \cdots \times (-\infty, x_k]). \tag{A.7.12}$$
The d.f. defines $P$ in the sense that if $P$ and $Q$ are two probabilities with the same d.f., then $P = Q$.(1) When $k = 1$, $F$ is a function of a real variable characterized by the following properties:

$$x \le y \Rightarrow F(x) \le F(y) \quad \text{(Monotone)} \tag{A.7.13}$$
$$x_n \downarrow x \Rightarrow F(x_n) \to F(x) \quad \text{(Continuous from the right)} \tag{A.7.14}$$
$$\lim_{x \to \infty} F(x) = 1 \tag{A.7.15}$$
$$\lim_{x \to -\infty} F(x) = 0. \tag{A.7.16}$$

We always have $F(x) - F(x - 0) = P(\{x\})$.(2) Thus,
$$F \text{ is continuous at } x \text{ if and only if } P(\{x\}) = 0. \tag{A.7.17}$$
It may be shown that any function $F$ satisfying (A.7.13)-(A.7.16) defines a unique $P$ on the real line.

References: Gnedenko (1967) Chapter 4, Sections 21-22; Hoel, Port, and Stone (1971) Sections 3.1-3.2, 5.1-5.2; Parzen (1960) Chapter 4, Sections 1-4; Pitman (1993) Sections 3.1, 4.1, 4.5.
A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the concentration of a certain pollutant in the atmosphere, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable $X$ is a function from $\Omega$ to $R$ such that the set $\{\omega : X(\omega) \in B\} = X^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}$.(1)

A.8.2 A random vector $\mathbf{X} = (X_1, \ldots, X_k)^T$ is a $k$-tuple of random variables, or equivalently a function from $\Omega$ to $R^k$ such that the set $\{\omega : \mathbf{X}(\omega) \in B\} = \mathbf{X}^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}^k$.(1) For $k = 1$, random vectors are just random variables. The event $\mathbf{X}^{-1}(B)$ will usually be written $[\mathbf{X} \in B]$, and $P([\mathbf{X} \in B])$ will be written $P[\mathbf{X} \in B]$.

A.8.3 The probability distribution of a random vector $\mathbf{X}$ is, by definition, the probability measure $P_{\mathbf{X}}$ in the model $(R^k, \mathcal{B}^k, P_{\mathbf{X}})$ given by
$$P_{\mathbf{X}}(B) = P[\mathbf{X} \in B]. \tag{A.8.3}$$
The probability of any event that is expressible purely in terms of $\mathbf{X}$ can be calculated if we know only the probability distribution of $\mathbf{X}$. In the discrete case this means we need only know the frequency function, and in the continuous case the density.

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. Thus, from (A.7.5) and (A.7.8),
$$P[\mathbf{X} \in A] = \sum_{\mathbf{x} \in A} p(\mathbf{x}) \text{ if } \mathbf{X} \text{ is discrete}, \qquad P[\mathbf{X} \in A] = \int_A p(\mathbf{x})\, d\mathbf{x} \text{ if } \mathbf{X} \text{ is continuous.} \tag{A.8.5}$$
The subscript $X$ or $\mathbf{X}$ will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to, unless the reference is clear from the context, in which case they will be omitted. Similarly, we will refer to the frequency function, density, d.f., and so on of a random vector when we are, in fact, referring to those features of its probability distribution. When we are interested in particular random variables or vectors, we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined.

The study of real- or vector-valued functions of a random vector $\mathbf{X}$ is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Let $g$ be any function from $R^k$ to $R^m$, $k, m \ge 1$, such that(2) $g^{-1}(B) = \{\mathbf{y} \in R^k : g(\mathbf{y}) \in B\}$ is in $\mathcal{B}^k$ for every $B \in \mathcal{B}^m$.
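The definitions (A.8.1)-(A.8.5) can be made concrete on a small finite sample space. The sketch below is our own illustration, not part of the text: it takes three fair coin tosses, defines the random variable X = number of heads as a function on the sample space, and pushes P forward to obtain the frequency function of X and a probability of the form P[X in B].

```python
from itertools import product
from fractions import Fraction

# Sample space for three fair coin tosses, each point having probability 1/8.
omega = list(product("HT", repeat=3))
P = {w: Fraction(1, 8) for w in omega}

def X(w):
    """A random variable: the number of heads in the outcome w."""
    return w.count("H")

# Frequency function of X, obtained by summing P over {w : X(w) = x}.
p_X = {}
for w, pr in P.items():
    p_X[X(w)] = p_X.get(X(w), Fraction(0)) + pr
print(dict(sorted(p_X.items())))   # {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

# P[X in B] for B = {2, 3}, computed from the frequency function.
print(sum(pr for x, pr in p_X.items() if x in {2, 3}))   # 1/2
```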
y). If g(X) ~ aX + 1'.X)2. V). . (A.(5) (A.8)) that X is a marginal density function given by px(x) ~ 1: P(X. y). i . Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.: for every B E Bill. if (X. . The probability distribution of g(X) is completely detennined by that of X through L:: P[g(X) E BI = PIX E gI(B)]. Then g(X) is continuous with density given by .452 A Review of BClsic ProbClbility Theory Appendix A [J} E BI. .92/ with 91 (X) = k.11) and (A.y)dy.1 1 Xi = X and 92(X) = k.8) Suppose that X is continuous with density PX and 9 is realvalued and onetoone(3) on an open set S such that P[X E 5] = 1.8. then g(X) is discrete and has frequency function Pg(X)(t) = L {x:g(x)=t} Px(x). and 0 otherwise.S. .6) An example of a transformation often used in statistics is g (91. • II ! I • .11) Similarly. Another common example is g(X) = (min{X. Furthennore. The (marginal) frequency or density of X is found as in (A. max{X. .8. a # 0.y)(x.12) 1 . y (A. assume that the derivative l of 9 exists and does not vanish on S. (A.Y) (x.1') .7. and X is continuous. (A.S. j . ! for Pg(x)(I) ~ PX(gl(t)) Ig'(g 1(1))1 (A.8. known as the marginal frequency function.9) t E g(S). a random vector obtained by putting two random vectors together.12) by summing or integrating out over yin P(X. These notions generalize to the case Z = (X.8.1 E~' l(X i . This is called the change of variable formula.8.jPX a (A.8.Y).7) If X is discrete with frequency function Px.8. then the frequency function of X.8. is given by(4) i I ( PX(X) = LP(X.8. then 1 (I . it may be shown (as a consequence of (A.7) and (A.)'. Pg(X) (I) = j.y)(x.8) it follows that if (X.Y). Then the random tran~form(lti(m g( X) is defined by g(X)(w) = g(X(w)). .1O) From (A.). . y)T is continuous with density p(X. Yf is a discrete random vector with frequency function p(X.
In practice, all random variables are discrete because there is no instrument that can measure with perfect accuracy. Nevertheless, it is common in statistics to work with continuous distributions, which may be easier to deal with. The justification for this may be theoretical or pragmatic. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical model. Or else, the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A.15 and B.7. Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.

A.8.13 A convention: We shall write $X = Y$ if the probability of the event $[X \ne Y]$ is 0.

References: Gnedenko (1967) Chapter 4, Sections 21-24; Grimmett and Stirzaker (1992) Chapter 4; Hoel, Port, and Stone (1971) Chapters 3 and 5; Parzen (1960) Chapter 7; Pitman (1993) Section 4.4.

A.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS

A.9.1 Two random variables $X_1$ and $X_2$ are said to be independent if and only if for all sets $A$ and $B$ in $\mathcal{B}$, the events $[X_1 \in A]$ and $[X_2 \in B]$ are independent.

A.9.2 The random variables $X_1, \ldots, X_n$ are said to be (mutually) independent if and only if for any sets $A_1, \ldots, A_n$ in $\mathcal{B}$, the events $[X_1 \in A_1], \ldots, [X_n \in A_n]$ are independent.

A.9.3 By (A.9.1) and (A.9.2), if $X$ and $Y$ are independent, so are $g(X)$ and $h(Y)$, whatever be $g$ and $h$. For example, if $(X_1, X_2)$ and $(Y_1, Y_2)$ are independent, so are $X_1 + X_2$ and $Y_1 Y_2$.

Suppose $\mathbf{X} = (X_1, \ldots, X_n)$ is either a discrete or continuous random vector. Then the random variables $X_1, \ldots, X_n$ are independent if, and only if, either of the following two conditions holds:

$$F_{\mathbf{X}}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdots F_{X_n}(x_n) \text{ for all } (x_1, \ldots, x_n); \tag{A.9.4}$$
$$p_{\mathbf{X}}(x_1, \ldots, x_n) = p_{X_1}(x_1) \cdots p_{X_n}(x_n) \text{ for all } (x_1, \ldots, x_n). \tag{A.9.5}$$

A.9.6 If the $X_i$ are all continuous and independent, then $\mathbf{X} = (X_1, \ldots, X_n)$ is continuous.

A.9.7 The preceding equivalences are valid for random vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ (not necessarily of the same dimensionality) with $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_n)$. To generalize these definitions to random vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ we need only use the events $[\mathbf{X}_i \in A_i]$, where $A_i$ is a set in the range of $\mathbf{X}_i$.
A.9.8 If $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are independent identically distributed $k$-dimensional random vectors with d.f. $F_{\mathbf{X}}$ or density (frequency function) $p_{\mathbf{X}}$, then $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is called a random sample of size $n$ from a population with d.f. $F_{\mathbf{X}}$ or density (frequency function) $p_{\mathbf{X}}$. In statistics, such a random sample is often obtained by selecting $n$ members at random in the sense of (A.3.4) from a population and measuring $k$ characteristics on each member.

A.9.9 If we perform $n$ Bernoulli trials with probability of success $p$ and we let $X_i$ be the indicator of the event (success on the $i$th trial), then the $X_i$ form a sample from a distribution that assigns probability $p$ to 1 and $(1-p)$ to 0. Such samples will be referred to as the indicators of $n$ Bernoulli trials with probability of success $p$.

References: Gnedenko (1967) Chapter 4, Sections 23-24; Grimmett and Stirzaker (1992) Chapters 3 and 4; Hoel, Port, and Stone (1971) Chapter 3; Parzen (1960) Chapter 7, Sections 6-7; Pitman (1993) Section 2.5.

A.10 THE EXPECTATION OF A RANDOM VARIABLE

Let $X$ be the height of an individual sampled at random from a finite population. Then a reasonable measure of the center of the distribution of $X$ is the average height of an individual in the given population. If $x_1, \ldots, x_q$ are the only heights present in the population, it follows that this average is given by $\sum_{i=1}^{q} x_i P[X = x_i]$, where $P[X = x_i]$ is just the proportion of individuals of height $x_i$ in the population. The same quantity arises (approximately) if we use the long-run frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In line with these ideas we develop the general concept of expectation as follows.

A.10.1 If $X$ is a nonnegative, discrete random variable with possible values $\{x_1, x_2, \ldots\}$, we define the expectation or mean of $X$, written $E(X)$, by
$$E(X) = \sum_{i=1}^{\infty} x_i\, p_X(x_i). \tag{A.10.1}$$
(Infinity is a possible value of $E(X)$. Take, for instance, $x_i = i$ and $p_X(i) = \dfrac{1}{i(i+1)}$, $i = 1, 2, \ldots$.)

A.10.2 If $A$ is any event, we define the random variable $1_A$, the indicator of the event $A$, by
$$1_A(\omega) = 1 \text{ if } \omega \in A, \quad 0 \text{ otherwise.} \tag{A.10.2}$$

A.10.3 More generally, if $X$ is discrete, decompose $\{x_1, x_2, \ldots\}$ into two sets $A$ and $B$, where $A$ consists of all nonnegative $x_i$ and $B$ of all negative $x_i$. If either $\sum_{x_i \in A} x_i\, p_X(x_i) < \infty$ or $\sum_{x_i \in B} |x_i|\, p_X(x_i) < \infty$, we define $E(X)$ unambiguously by (A.10.1). Otherwise, we leave $E(X)$ undefined.
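The definition (A.10.1) and the infinite-mean example in parentheses can be explored numerically. The sketch below is our own illustration, not part of the text: it sums x p(x) over a truncated support for a distribution with finite mean, and then shows the partial sums for p_X(i) = 1/(i(i+1)) growing without bound.

```python
def expectation(pmf, support):
    """Partial sum of x * p(x) over a (possibly truncated) support."""
    return sum(x * pmf(x) for x in support)

# Geometric-type example with finite mean E(X) = 1/p = 2.
p = 0.5
print(expectation(lambda k: p * (1 - p) ** (k - 1), range(1, 10_000)))

# The text's example: p_X(i) = 1/(i(i+1)). The partial sums of i * p_X(i)
# keep growing, reflecting E(X) = infinity.
for n in (10, 100, 1000, 10_000):
    print(n, expectation(lambda i: 1.0 / (i * (i + 1)), range(1, n + 1)))
```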
lOA) If X is an ndimensional random vector.8From (A. we have 00 E(IXI) ~ L i=l IXiIPx(Xi). (A.'"":.9.rn". then E(X) = c. it is natural to attempt a definition of the expectation via approximation from the discrete case. 10. then it may be shown that 00 E(g(X)) = Lg(Xi)PX(Xi) i=l (A. 10.7) it follows that if X < Y and E(X).J:.ES( x.6) Taking g(x) = L:~ 1 O:'iXi we obtain the fundamental relationship (A. Otherwise. . n. we leave E(X) undefined.. an are constants and E(!Xil) < 00. then E(X) < E(Y).. A. 10. Jo= xpx(x)dx or A.tO A random variable X is said to be integrable if E(IXI) < 00. I 0. then E(X) = P(A).3) = IA (cf (A. we define E(X) unambiguously by (A.p. If X (A.) < 00.. 10.)Px (.I). i = 1. 10.c.7) if Qt."_"_b_. E(Y) are defined. If X is a constant."_0_"_A". . .v') = c for all w. E(X) is left undefined.V. .l"O_T_h''CEx. If X is a continuous random variable. Here are some properties of the expectation that hold when X is discrete. It may be shown that if X is a continuous k~dimensional random vector and g(X) is any random variable such that JR' { Ig(x)IPx(x)dx < 00.. .lO.5) As a consequence of this result.9)). and if E(lg(X) I) < 00. .'_""o"_of_'""R_'_"d"o:.IO. (A. . X(t..IO."5'. Otherwise..9) as the definition of the expectation or mean of X f~oo xpx(x)dx is finite. Those familiar with Lebesgue integration will realize that this leads to E(X) = 1: xpx(x)dx whenever (A. if 9 is a realvalued function on Rn... 00 455 or L.
1.2) xkpx(x)dx if X is continuous. E(X k ) j = Lxkpx(x) if X is discrete . In the continuous case dJ1(x) = dx and M(X) is called Lebesgue measure. discrete case f: f References i~l (AIO. 10.. continuous case.. A. Sections 14 Pitman (1993) Sections 3. 'I I' I. 10. 3A.. g(X)Px(x)dx. the moments depend on the distribution of X only...5) and (A. 10. .7).H. By (A.. We assume that all moments written here exist.12) where F denotes the distribution function of X and P is the probability function of X defined by (AS. (AIO.5) and (A 10. 4..IO.IRk g(x)dP(x) (A. 4. The fonnulae (A 10.. It is possible to define the expectation of a random variable in general using discrete approximations.B) g(x)p(x)dx..3. I In general.3 Hoel. (A.3..ll MOMENTS 1 . We will often refer to p(x) as the density of X in the discrete case as well as the continuous case. I0. • g(x)dP(x) = L g(Xi)p(Xi). The interested reader may consult an advanced text such as Chung (1974). It I! II .S) as well as continuous analogues of (A. and Stone (1971) Sections 4.5) and (A.! If k is any natural number and X is a random variable. In the discrete case J1 assigns weight one to each of the points in {x : p(x) > O} and it is called counting measure.. Section 26 Grimmett and Stirzaker (1992) Sections 3. I0.lO.' • 1: x I (A I 1. A convenient notation is dP(x) = p(x)dl'(x).1 parzen (1960) Chapter 5. j i . j . I .. Chapter 3. 7. POrl. lOA).11)... .6) hold. (A. I I) are both sometimes written as E(g(X)) ~ r ink = g(x)dF(x) or r . Chapter S.1 I 1 ..5). and (AIO. .. Chung (1974) Chapter 3 Gnedenko (1967) Chapter 5. j A. I: We refer to J1 = IlP as the dominating measure for P.!I) In the continuous case expectation properties (A. the kth moment of X is defined to be the expectation of X k. II. 1 .3).456 then E(g(X» exists and E(g(X» A Review of BaSIC Probability Theory Appendix A = 1.. which means J ..
A.11.3 The distribution of a random variable is typically uniquely specified by its moments. This is the case, for instance, if the random variable possesses a moment generating function (cf. Section A.12).

A.11.4 The $k$th central moment of $X$ is by definition $E[(X - E(X))^k]$, the $k$th moment of $X - E(X)$, and is denoted by $\mu_k$.

A.11.5 The second central moment is called the variance of $X$ and will be written $\operatorname{Var} X$. The variance of $X$ is finite if and only if the second moment of $X$ is finite. The nonnegative square root of $\operatorname{Var} X$ is called the standard deviation of $X$. The standard deviation measures the spread of the distribution of $X$ about its expectation. It is also called a measure of scale. Another measure of the same type is $E(|X - E(X)|)$, which is often referred to as the mean deviation.

A.11.6 If $a$ and $b$ are constants, then
$$\operatorname{Var}(aX + b) = a^2 \operatorname{Var} X. \tag{A.11.6}$$

A.11.7 If $X$ is any random variable with well-defined (finite) mean and variance, the standardized version or $Z$-score of $X$ is the random variable $Z = (X - E(X))/\sqrt{\operatorname{Var} X}$. By (A.10.7) and (A.11.6), $E(Z) = 0$ and $\operatorname{Var} Z = 1$.

A.11.9 If $E(X^2) = 0$, then $X = 0$. If $\operatorname{Var} X = 0$, then $X = E(X)$ (a constant). These results follow, for example, from (A.15.2).

A.11.10 The third and fourth central moments are used in the coefficient of skewness $\gamma_1$ and the kurtosis $\gamma_2$, which are defined by
$$\gamma_1 = \frac{\mu_3}{\sigma^3}, \qquad \gamma_2 = \frac{\mu_4}{\sigma^4} - 3,$$
where $\sigma^2 = \operatorname{Var} X$. These descriptive measures are useful in comparing the shapes of various frequently used densities. See also Section A.12, where $\gamma_1$ and $\gamma_2$ are expressed in terms of cumulants.

A.11.11 If $Y = a + bX$ with $b > 0$, then the coefficient of skewness and the kurtosis of $Y$ are the same as those of $X$. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $\gamma_1 = \gamma_2 = 0$.
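The central moments and the descriptive measures above are easy to compute for simulated data. The sketch below is our own illustration, not part of the text: for a normal sample the empirical coefficient of skewness and kurtosis should both be near 0, in line with the remark about the normal distribution.

```python
import random

def central_moment(xs, k):
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

rng = random.Random(1)
xs = [rng.gauss(10.0, 2.0) for _ in range(100_000)]

var = central_moment(xs, 2)
sd = var ** 0.5
gamma1 = central_moment(xs, 3) / sd ** 3        # coefficient of skewness
gamma2 = central_moment(xs, 4) / sd ** 4 - 3.0  # kurtosis
print(var, gamma1, gamma2)   # roughly 4, 0, 0 for this normal sample
```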
. i .).) (A.E(X J!)( X. (2) (XI . A proof of the CauchySchwartz inequality is given in Remark 1.1.)) and using (AI 0. X. such that E(Zf) < 00.X 2)..458 A Review of Basic Prob<lbility Theory Appendix A called the cOl'Orionce of X 1 and. Equality holds if and only if one of Zll Z2 equals 0 or 2 1 = aZ2 for some constant a. denoted by Corr(XI1 X2). is defined whenever Xl and X 2 are not constant and the variances of Xl and X 2 are finite by (A.. 0 # 0) of XI' l il . = a + oXI . 7).lI. we get the formula Var X = E(X') . = X. . .) (X.IIl4) If Xi and Xf are distributed as Xl and X 2 and are independent of Xl and X 2. Z. The correlation inequality correspcnds to the special case ZI = Xl . • i .11l3) (A. This is the correlation inequality.).[E(X)j2 (A.) = lfwe put XI ~ ~E(XI .I6) .E(X I ).\ 1.E(X I ») = CO~(X~X. + dX. with equality holding if and only if (1) Xl or X 2 is a constant Or ! • • . I liS) The covariance is defined whenever Xl and X 2 have finite variances and in that case (AII.E( X.I US) The correlation of Xl and X 2 is the covariance of the standardized versions of Xl and X 2· The correlation inequality is equivalent to the statement (A. . .) = ac Cov(X I .X. J I • for any two random variables Z" Z. (AI 117) j r . ar '2 .4. we obtain the relations.)(X. Equality holds if and only if X.19) 1 • t . =X in (A. o.) + od Cov(X"X. is linear function (X.E(X.X. . I 1 1 j I I . E(Zil < 00.E(X. X 3 ) + be Cov(X" X 3 ) and + ad Cov(X I . Cov(aX I + oX" eX. By expanding the product (XI .14). The correlation of Xl and X 2.3) and (AID.X. X.). It may be obtained from the CauchySchwartz inequality. then Cov(XI.. •  II .\:2 and is written Cov(.I 1.
The correlation coefficient roughly measures the amount and sign of linear relationship between $X_1$ and $X_2$. It is $-1$ or $1$ in the case of a perfect relationship ($X_2 = a + bX_1$, $b < 0$ or $b > 0$, respectively). See also Section 1.4.

If $X_1, \ldots, X_n$ have finite variances, we obtain as a consequence of (A.11.13) the relation
$$\operatorname{Var}(X_1 + \cdots + X_n) = \sum_{i=1}^{n} \operatorname{Var} X_i + 2\sum_{i<j} \operatorname{Cov}(X_i, X_j). \tag{A.11.21}$$

If $X_1$ and $X_2$ are independent and $X_1$ and $X_2$ are integrable, then
$$E(X_1 X_2) = E(X_1)E(X_2), \tag{A.11.22}$$
or, in view of (A.11.14),
$$\operatorname{Cov}(X_1, X_2) = \operatorname{Corr}(X_1, X_2) = 0. \tag{A.11.23}$$
This may be checked directly. It is not true in general that $X_1$ and $X_2$ that satisfy (A.11.23) (i.e., are uncorrelated) need be independent. As a consequence of (A.11.21) and (A.11.23), we see that if $X_1, \ldots, X_n$ are independent with finite variances, then
$$\operatorname{Var}(X_1 + \cdots + X_n) = \sum_{i=1}^{n} \operatorname{Var} X_i. \tag{A.11.24}$$

References: Gnedenko (1967) Chapter 5, Sections 27, 28, 30; Hoel, Port, and Stone (1971) Chapter 4; Parzen (1960) Chapters 5 and 8; Pitman (1993) Section 6.4.

A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS

A.12.1 If $E(e^{s_0|X|}) < \infty$ for some $s_0 > 0$, then $M_X(s) = E(e^{sX})$ is well defined for $|s| \le s_0$ and is called the moment generating function of $X$. By (A.10.5) and (A.10.11),
$$M_X(s) = \sum_{i=1}^{\infty} e^{s x_i}\, p_X(x_i) \text{ if } X \text{ is discrete}, \qquad M_X(s) = \int_{-\infty}^{\infty} e^{sx} p_X(x)\, dx \text{ if } X \text{ is continuous.} \tag{A.12.2}$$

If $M_X$ is well defined in a neighborhood $\{s : |s| < s_0\}$ of zero, all moments of $X$ are finite and
$$M_X(s) = \sum_{k=0}^{\infty} \frac{E(X^k)}{k!} s^k, \quad |s| < s_0. \tag{A.12.3}$$
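As a small numerical check of the expansion (A.12.3) (our sketch, not part of the text), the following compares the closed-form moment generating function of a Bernoulli(p) variable with the series built from its moments, which all equal p for k at least 1.

```python
from math import exp, factorial

p = 0.3  # Bernoulli(p): M_X(s) = (1 - p) + p * e^s, and E(X^k) = p for k >= 1

def mgf(s):
    return (1 - p) + p * exp(s)

def mgf_series(s, terms=30):
    """Right-hand side of (A.12.3): sum over k of E(X^k) * s^k / k!, with E(X^0) = 1."""
    return 1.0 + sum(p * s**k / factorial(k) for k in range(1, terms))

for s in (-1.0, 0.1, 0.5, 1.0):
    print(s, mgf(s), mgf_series(s))   # the two columns agree
```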