INTRODUCTION TO STATISTICS AND ECONOMETRICS
.
7 Definitions of a Random Variable 19 Discrete Random Variables 20 Univariate Continuous Random Variables 27 Bivariate Continuous Random Variables 29 Distribution Function 43 Change of Variables 4 7 Joint Distribution of Discrete and Continuous Random Variables 57 59 EXERCISES .1 What Is Probability? 1 1.4 Conditional Probability and Independence 2.CONTENTS Preface xi 1 INTRODUCTION 1.2 3.4 3.5 Probability Calculations 13 EXERCISES 17 10 3 R A N D O M VARIABLES A N D PROBABILITY DISTRIBUTIONS 3.6 3.1 3.1 Introduction 5 2.2 Axioms of Probability 5 2.3 3.3 Counting Techniques 7 2.5 3.2 What Is Statistics? 2 2 PROBABILITY 5 2.
4 Examples 107 EXERCISES 109 100 19 BlVARlATE REGRESSION M O D E L 10.4 9..3 Normal Approximation of Binomial 104 6.3 Tests of Hypotheses 248 EXERCISES 253 11 POINT ESTIMATION  ELEMENTS O F M A T R I X ANALYSIS  257 .2 4.1 Introduction 160 8.2 Confidence Intervals 161 8.5 Definition of Basic Terms 257 Matrix Operations 259 Determinants and Inverses 260 Simultaneous Linear Equations 264 Properties of the Symmetric Matrix 270 278 EXERCISES .1 What Is an Estimator? 112 7.1 Modes of Convergence 100 6.4 11.2 5.3 4.1 9.3 Maximum Likelihood Estimator: Definition and Computation 133 '7.n Z .3 5.7 Introduction 182 Type I and Type I1 Errors 183 NeymanPearson Lemma 188 Simple against Composite 194 Composite against Composite 201 Examples of Hypothesis Tests 205 Testing about a Vector Parameter 210 219 EXERCISES LARGE SAMPLE THEORY 6. 7.4 Binomial Random Variables 87 Normal Random Variables 89 Bivariate Normal Random Variables 92 Multivariate Normal Random Variables 97 98 87 EXERCISES 9.3 11.4 Maximum Likelihood Estimator: Properties 138 EXERCISES 151 11.2 Laws of Large Numbers and Central Limit Theorems 103 6.2 Properties of Estimators 116 7.5 9.2 11.1 11.3 Bayesian Method 168 EXERCISES 178 EXERCISES 9 TESTS OF HYPOTHESES BINOMIAL A N D NORMAL R A N D O M VARIABLES 5.1 Introduction 228 10.1 4.4 Expected Value 61 Higher Moments 67 Covariance and Correlation 70 Conditional Mean and Variance 77 83 61 8 INTERVAL E S T I M A T I O N 8.3 9.1 5.2 Least Squares Estimators 230 10.2 9.viii Contents Contents ix 160 4 MOMENTS 4.6 9.
I devote a lot of space to these and other fundamental concepts because I believe that it is far better for a student to have a solid .1 13.4 12. At the same time. The present book is aimed at filling that gap. and the properties of the maximum likelihood estimator. and other social science disciplines. or they give it a superficial treatment.5 13. I discuss these topics without undue use of mathematics and with many illustrative examples and diagrams.5 Introduction 281 Least Squares Estimators 283 Constrained Least Squares Estimators Tests of Hypotheses 299 Selection of Regressors 308 310 PREFACE 296 EXERCISES 13 ECONOMETRIC MODELS 314 13. they usually contain only a cursory discussion of regression analysis and seldom cover various generalizations of the classical regression model important in econometrics and other social science applications. Examples are best prediction and best linear prediction. large sample theory. statistics. Chapters 1 through 9 cover probability and statistics and can be taught in a semester course for advanced undergraduates or firstyear graduate students.3 12. Moreover. The prerequisites are one year of calculus and an ability to think mathematically. conditional density of the form f (x I x < y). In addition. but either they do not include statistics proper. in most of these textbooks the selection of topics is far from ideal from an econometricianyspoint of view.2 12. many exercises are given at the end of each chapter (except Chapters 1 and 13). In these chapters I emphasize certain topics which are important in econometrics but which are often overlooked by statistics textbooks at this level.6 Generalized Least Sauares 314 Time Series Regression 325 Simultaneous Equations Model 327 Nonlinear Regression Model 330 Qualitative Response Model 332 Censored or Truncated Regression Model (Tobit Model) 339 13. My own course on this material has been taken by both undergraduate and graduate students in economics.2 13. there are many textbooks on econometrics. the joint distribution of a continuous and a discrete random variable.4 13.1 12.3 13.x Contents 12 MULTIPLE REGRESSION MODEL 281 12.7 Duration Model 343 Appendix: Distribution Theory References Name Index Subject Index 353 357 361 363 Although there are many textbooks on statistics.
especially in the realistic case of testing a composite null against a composite alternative. and duration models. I also thank Michael Aronson. (independent and identically distributed) sample are extended to the regression model. Yoshiko. for which I use my Advanced Econometrics (Harvard University Press.4) and certain other statistical models extensively used in econometrics and other social science applications (13. this book is written mainly from the classical point of view) but because it provides a pedagogically useful framework for consideration of many fundamental issues in statistical inference.4 needs to be supplemented by additional readings. which discusses qualitative response models. Finally. however. I also believe that students should be trained to question the validity and reasonableness of conventional statistical techniques. I am also indebted to my students Fumihiro Goto and Dongseok Kim for carefully checking the entire manuscript for typographical and more substantial errors. The first part of the chapter is a quick overview of the topics. I do so not because I believe it is superior (in fact. Chapters 10 through 13 can be taught in the semester after the semester that covers Chapters 1 through 9. Chapter 10 presents the bivariate classical regression model in the conventional summation notation. 1 discuss various generalizations of the classical regression model (Sections 13. who read all or part of the manuscript and gave me valuable comments. B y studying it in earnest. and Chapter 13 studied independently. I also present a critical evaluation of the classical method of hypothesis testing. I dedicate this book to my wife. I give a thorough analysis of the problem of choosing estimators.1 through 13. I frequently have recourse to Bayesian statistics. take responsibility for the remaining errors.1 through 13. Peter Robinson. Dongseok Kim also prepared all the figures in the book. general editor at Harvard University Press. and James Powell. Therefore. Chapter 12 gives the multiple classical regression model in matrix notation. The second part. is a more extensive introduction to these important sub~ects. including a comparison of various criteria for ranking estimators. the reader should be able to understand Chapters 12 and 13 as well as the brief sections in Chapters 5 and 9 that use matrix notation. It is expected that those who complete the present textbook will be able to understand my advanced textbook.d. in Chapter 13. for students with less background. In discussing these issues as well as other problematic areas of classical statistics. censored and truncated regression models.  . Alternatively. In Chapters 10 and 12 the concepts and the methods studied in Chapters 1 through 9 in the framework of the i..i. for constant encouragement and guidance.7). I alone. I am grateful to Gene Savin. At Stanford about half of the students who finish a yearlong course in statistics and econometrics go on to take a year's course in advanced econometrics. 1985).Under this plan.xii Preface Preface xiii knowledge of the basic facts about random variables than to have a superficial knowledge of the latest techniques. Chapter 11 is a brief introduction to matrix analysis. though perhaps not longlasting effect. and Elizabeth Gretz and Vivian Wheeler for carefully checking the manuscript and suggesting numerous stylistic changes that considerably enhanced its readability. who for over twenty years has made a steadfast effort to bridge the gap between two cultures. Chapters 1 through 12 may be taught in a year.5 through 13. the material in Sections 13. Her work.
.
.
and in making decisions it is useful for us to know the probability of certain events. A Bayesian can talk about the probability of any one of these events or statements. If we asked a related but different questionwhat is the probability that a man who died of lung cancer was a cigarette smoker?the experiment would be simpler. This mode of thinking is indispensable for a statistician. For example. before deciding to gamble. Thus we state a second working definition of statistics: Statistics is the science of observing data and making inferences about the characteristics of a random mechanism that has generated the data. the Federal Reserve Board needs to assess the probabilistic impact of a change in the rate on the unemployment rate and on inflation. 1. A reasonable way to determine a probability should take into account the past record of an event in question or. In determining the discount rate. one of which corresponds to heads and the other to tails. It may be argued that a frequency interpretation of (2) is possible to the extent that some of Plato's assertions have been proved true by a later study and some false. What is the probability that a person who smokes more than ten cigarettes a day will contract lung cancer? (This may not be the optimal way. In statistics we study whether it is indeed reasonable and. with p being the proportion of heads balls. we should imagine a box that contains balls. Its only significance is that we can toss a coin as many times as we wish. in what sense. The statistician regards every event whose outcome is unknown to be like drawing . This is not an essential difference. A statistician looks at the world as full of balls that have been drawn by God and examines the balls in order to estimate the characteristics ("proportion") of boxes from which the balls have been drawn. these two interpretations of probability lead to two different methods of statistical inference. Most people will think it reasonable to use the ratio of heads over tosses as an estimate. a classical statistician can do so only for the event (1). consider the question of whether or not cigarette smoking is associated with lung cancer. we need to paraphrase the question to make it more readily accessible to statistical analysis. which may require specific tools of analysis. We are ready for our first working definition of statistics: Statistics is the science of assigning a probability to a n event on the basis of experiments. although in the short run we may be lucky and avoid the consequences of a haphazard decision. otherwise we will lose in the long run.2 1 I Introduction 1. whenever possible. corresponding to cigarette smokers. For example. Although in this book I present mainly the classical method. the results of a deliberate experiment. Tossing a coin with the probability of heads equal to p is identical to choosing a ball at random from a box that contains two types of balls. Consider estimating the probability of heads by tossing a particular coin many times. We want to know the probability of rain when we decide whether or not to take an umbrella in the morning.2 I What Is Statistics? 3 (3) The probability of obtaining heads when we toss a particular coin is '/. It is advisable to determine these probabilities in a reasonable way. whereas in the present example the statistician must work with whatever sample God has provided. but in economics or other behavioral sciences we must often work with a limited sample..2 WHAT I S STATISTICS? In our everyday life we must continuously make decisions in the face of uncertainty. some of the balls have lung cancer marked o n them and the rest do not. whereas in this example it is as though God (or a god) has tossed a coin and we simply observe the outcome. (Such an experiment would be a costly one. whereas statement (2) or (3) is either true or false as it deals with a particular thingone of a kind. But in that case we are considering any assertion of Plato's. but we choose it for the sake of illustration. rather than the particular one regarding Atlantis.a ball at random from a box that contains various types of balls in certain proportions. Note that ( I ) is sometimes true and sometimes false as it is repeatedly observed. if so. Drawing a ball at random corresponds to choosing a cigarette smoker at random and observing him until he dies to see whether or not he contracts lung cancer. As we shall see in later chapters.) To apply the boxball analogy to this example. In the physical sciences we are often able to conduct our own experiments.because only (1) is concerned with a repeatable event. I will present Bayesian method whenever I believe it offers more attractive solutions. One way is to ask.) This example differs from the example of coin tossing in that in coin tossing we create our own sample. The two methods are complementary. First. we would want to know the probability of winning. . and different situations call for different methods.
1 to heads and 0 to tails.2. The characteristics of a continuous random variable are captured by a density function. Probability is a subject which can be studied independently of statistics.) Or what is the probability a person will win a game in tennis if the probability of his winning a point is p? The answer is For example.2.0. A discrete random variable is characterized by the probabilities of the outcomes.525. and the second.65. The random mechanism whose outcome is the height (measured in feet) of a Stanford student is another random variable.9~' 0. if p = 0. see Examples 2. We use the term prob ability distribution as a broader concept which refers to either a set of discrete probabilities or a density function. As a more complicated example. In these calculations we have not engaged in any statistical inference. a continuous random variable (assuming hypothetically that height can be measured to an infinite degree of accuracy). Now we can compose a third and final definition: Statistics is the science of estimating the probability distribution of a random variable on the basis of repeated observations drawn. (The answer is derived under the assumption that the ten schools make independent decisions. A random mechanism whose outcomes are real numbers is called a random variable.from the 2. 2. the formula gives 0. 4 1 I Introduction Coin tossing is an example of a random mechanism whose outcomes are objects called heads and tails. what is the p r o b ability that a student will be accepted by at least one graduate school if she applies to ten schools for each of which the probability of acceptance is 0. These terms inevitably remain vague until they are illustrated. .1 INTRODUCTION In this chapter we shall define probability mathematically and learn how to calculate the probability of a complex event when the probabilities of simple events are given. For example. the statistician assigns numbers to objects: for example. which is defined in such a way that the probability of any interval is equal to the area under the density function over that interval.2 AXIOMS OF PROBABILITY Definitions of a few commonly used terms follow. The first is called a discrete random variable.1? The answer is 1 . In order to facilitate mathematical analysis.2.1 and 2. it forms the foundation of statistics. what is the probability that a head comes up twice in a row when we toss an unbiased coin? We shall learn that the answer is I/q.51.
so that n ( S ) = 36. The calculation is especially easy when the sample space consists of a finite number of simple events with equal probabilities. 4 .1. TT}. i = 1. When the sample space is continuous. 2 . An event which cannot be a union of other events.j) represent the event that i shows on the first die and j on the second. . EXAMPLE 2 . . every possible subset of the sample space) in a way which is consistent with the probability axioms. f l Aj = RI for all i # j ) . as in Example 2. . A = { 2 . We have E X A M P L E 2. the class of all the intervals contained in ( 0 . We want probability to follow the same rule. j = 1 . Axiom ( 3 ) suggests that often the easiest way to calculate the probability of a composite event is to sum the probabilities of all the simple events that constitute it.HT. Composite event. . E X A M P L E 2 . 100) and their unions satisfies the condition. for relative frequency follows the same rule. Let the ordered pair (i. it is not possible to do so. such as Chung (1974). In such a case we restrict our attention to a smaller class of events to which we can assign probabilities in a manner consistent with the axioms. . Compute the p r o b ability that the sum of the two numbers is equal to each of the integers 2 through 12.2. . 3 . The event that a head occurs at least once: HH U HT U TH. at the roll of a die. If. A is the event that the die shows 1 and B the event that it shows 2. j)I i. Events of interest are intervals contained in the above interval.5. TH. Then we have Axioms of Probability Two examples of this rule follow. The reader who wishes to study this problem is advised to consult a book on the theory of probability. it is possible to assign probability to every event (that is.3 COUNTING TECHNIQUES EXAMPLE 2 . 2 . the relative frequency of A U B (either 1 or 2 ) is clearly the sum of the relative frequencies of A and B. . Sample space: Real interval (0. Then S = {(i. In the subsequent discussion we shall implicitly be dealing with such a class. where S is the sample space.6 2 I Probability 2.2 . as in Example 2. When the sample space is discrete.2. 1 Experiment: Tossing a coin twice. A pair of fair dice are rolled once. 2 . P ( A ) = 0. 2 Experiment: Reading the temperature (F) at Stanford at noon on October 1.3. 2. 2. Event. A.3. A subset of the sample space. ( 3 ) If (A. The set of all the possible outcomes of an experiment. Let n ( A ) be the number of the simple events contained in subset A of the sample space S. 2.2. The third rule is consistent with the frequency interpretation of probability. ( 2 ) P ( S ) = 1. ( 1 ) P ( A ) 2 0 for any event A. Therefore. I What is the probability that an even number will show in a roll of a fair die? We have n ( S ) = 6.1 Simple Events with Equal Probabilities A probability is a nonnegative number we assign to every event. then P(A1 U A2 U . An event which is not a simple event. For example. a situation which often occurs in practice.3 ( Counting Techniques 7 Sample space. 6 ] . use of the word probability. The first two rules are reasonable and consistent with the everyday . 6 )and hence n ( A ) = 3. . . Sample space: { H H .]. are mutually exclusive (that is. . however. Simple event.100).) = P ( A I ) + P(A2) + . . . . . The axioms of probability are the rules we agree to follow when we assign probabilities.
Therefore. ng.1 P:=n!/(nr)!. ( 2 .1 we shall solve the same problem in an alternative way. Of these. THEOREM 2. 1 ) make the same combination. 2) and ( 2 . by C:.wheren!readsnfactorialanddenotes n ( n .2 The number of combznatzons of taking r elements from n elements is the number of distinct sets consisting of r distinct elements which can be formed out of a set of n distinct elements and is denoted E X A M P L E 2. The desired probability P is thus given c?.3 Compute the probability of getting two of a kind and three of a kind (a "full house") when five dice are rolled. We shall take as the sample space the set of all the distinct ordered 5tuples ( n l . Let the ordered pair (i. 1)l = 2. ( 3 . 2. 3 ) . which is also and similarly for hearts and diamonds.3.3. we must multiply ($ by four.3 ) . The number of the distinct ordered pairs. by In Example 2. Given a particular (i.)     Proot In the first position we can choose any one of n elements and in the second position n . The number of permutations of taking r elements from n elements is the number of distinct ordered sets consisting of r distinct elements which can be formed out of a set of n distinct elements and is denoted by P:. be the number on the ith die.( 3 . we have P: = 6. n 5 ) .j = 2) n [ ( l . Let n. 2 ) ] = 3.2 ) . ( 1 . what is the probability that it will contain three aces? We shall take as the sample space the set of distinct poker hands without regard to a particular order in which the five cards are drawn.8 2 I Probability = 2. We must also count the number of the hands that contain three aces but not the ace of spades. therefore. Therefore we conclude that the desired probability P is given by The formulae for the numbers of permutations and combinations are useful for counting the number of simple events in many practical problems. D E F I N 1 T I 0 N 2. n ( S ) = c.2. C p . (We define O! = 1. ( 3 . ( 1 .1 and so on. the number of the hands that contain three aces but not the ace of clubs is equal to the number of ways of choosing the 48 two remaining cards out of the 48 nonaces: namely.r 1 elements. See Exercise 2. THEOREM 2.3.1 For example. Therefore. j). . Thus In the example of taking two numbers from three.therefore. and so on. l ) . l ) .2 Permutations and Combinations Prooj It follows directly from the observation that for each combination. nq.l ) ]= 1. and finally in the rth position we can choose one of n . ng. . j) mean that i is the number that appears twice and j is the number that appears three times. Note that the order of the elements matters in permutation but not in combination.5.so that n ( S ) = 65. and 3 are (1. R +  D EF l N I T I 0 N 2. Therefore.. 1 ) . we can choose two dice out of five which show i: there are C: ways to do so.3.3.3.2) . the permutations of taking two numbers from the three numbers 1. 3 ) . is P:. the number of permutations is the product of all these numbers. (2. n(z + j = 4 ) = n [ ( l .1.l ) ( n .'. 2 ) . ( 2 .2 n ( i + j = 3 ) = n [ ( l . 2 .3 1 Counting Techniques 9 n ( i 4. R EXAMPLE 2. r! different permutations are possible.3.4 If a poker hand of five cards is drawn from a deck. 2 ) . ( 2 .
it makes sense to talk about the conditional probability that number one will show in a roll of a die given that an odd number has occurred. ( 2 ) P ( A B) = 1 for any event A 3 B. from axiom ( 3 ) of probability. Most textbooks adopt the latter approach. Therefore we may postulate either axioms ( 2 ) .4. THEOREM 2 . 2.and ( 3 ) are analogous to the corresponding axioms of probability. . then Bayes' theorem follows easily from the rules of probability but is listed separately here because of its special usefulness. . Since E fl Al.1 as the only axiom of conditional probability. From Theorem 2. E fl A. . .1) P ( A I B) = P ( A f l B I B) + P ( A n fl( B). U A. O The reason axiom (I) was not used in the above proof is that axiom ( 1 ) follows from the other three axioms. 2.)P(A. 2 (Bayes) Let eventsAl. It is easy to show the converse. j=l P(E I Az)P(A. . . and ( 4 ) imply Theorem 2.1 shows P ( A I B ) = n ( A fl B ) / n ( B ) . . 4 . are mutually exclusive and their union is equal to E. . 2 . . (2. we can prove + 0 . .4. (3). If the conditioning event B consists of simple events with equal p r o b ability. .1) to obtain I The concept of conditional probability is intuitively easy to understand. for any pair of events A and B in a sample space. But from axioms ( 2 ) and ( 3 ) we can easily deduce that P(C B ) = 0 i f C fl B = 0." denoted by P ( A I B ) . Theorem 2.) Proof. 4 . Therefore we can eliminate the last term of (2. Using the four axioms of conditional probability. I B) = P ( A I B) + p(A21 B ) + .3 may be used to calculate a conditional probability in this case. THEOREM 2 . .1. The theorem follows by putting H = A fl B and G = B in axiom ( 4 ) and noting P ( B I B) = 1 because of axiom ( 2 ) . 1 Prooj From axiom ( 3 ) we have (2. Therefore. we have.4. then P(A1 U A2 U . and establish axioms that govern the calculation of conditional probabilities. ( 3 ) . Theorem 2. .) i = l .1 CONDITIONAL PROBABILITY AND INDEPENDENCE Axioms of Conditional Probability where B denotes the complement of B.) ( 1 ) P ( A I B) r 0 for any event A. ( 2 ) . E fl AZ. this conditional probability can be regarded as the limit of the ratio of the number of times one occurs to the number of times an odd number occurs. and ( 4 ) or. . we have P ( A I B ) =P(AflB)/P(B)foranypairofeventsAand B such that P ( B ) > 0. are mutually exclusive. . Then P(AzI E ) = . . ( 4 ) If B 3 H and B 3 G and P(G) P(H IB) . . . 4 . n. . Axiom ( 4 ) is justified by observing that the relative frequency of H versus G remains the same before and after B is known to have happened. C P(E I A.10 2 I Probability 2.4. ( 3 ) If (4fl BJ. .2 Bayes' Theorem Axioms of Conditional Probability (In the following axioms it is assumed that P ( B ) > 0. They mean that we can treat conditional probability just like probability by regarding B as the sample space.4.4.1. A. .P(H) Axioms ( I ) .4. In the frequency interpretation.4 2. Let E be an arbitrary event such that P ( E ) > 0. .4 1 Conditional Probability and Independence 11 2.4. provided P ( B ) > 0 . Thus we have proved that (2). .) = 1 and P ( 4 ) > 0 for each i.2) P(A B) I = P ( A fl Bl B ) . more simply. the counting techniques of Section 2. . be mutually exclusive such that P(A1 U Ap U . In general we shall consider the "conditional probability of A given B. For example.4. i = 1.
independence in the sense of Definition 2.4.1. I  Note that independence between A f l C and B or between B f l C and A follows from the above. .4.12 2 I Probability 2. It means that the probability of occurrence of A is not affected by whether or not B has occurred. Then A. aces and nonaces. which may be stated as the independence between A fl B and C.3) and (2. and C are said to be mutually indqbendent if the following equalities hold: .4. .4) and by noting that P ( E n A. Since the outcome of the second toss of a coin can be reasonably assumed to be independent of the outcome of the first.5 PROBABILITY CALCULATIONS We have now studied all the rules of probability for discrete events: the axioms of probability and conditional probability and the definition of independence. Let A. The following are examples of calculating probabilities using these rules.5. . O 2. . Definition 2.) by Theorem 2. B." We can now recursively define mutual independence for any number of events: D E F I N I T I O N 2. A*. We shall first compute the probability that three aces turn up in a particular sequence: for example.4.4. . But that is not enough. In the present approach we recognize only two types of cards. we have E X A M P L E 2. B. B.1 needs to be generalized in order to define the mutual independence of three or more events. and let C be the event that either both tosses yield heads or both tosses yield tails. suppose the first three cards are aces and the last two nonaces. The following example shows that pairwise independence does not imply mutual independence. 2. D E F I N IT I o N 2 .4. . and C are pairwise independent but not mutually independent.3. denote the event that the ith card is a nonace. the above equality is equivalent to P ( A ) P ( B ) = P ( A n B ) or to P ( B ) = P ( B I A ) . P(A.4. . by the repeated application of Theorem 2.1.) = P(AI)P(A2) . Let A be the event that a head appears in the first toss of a coin. A. that is. 4 . we shall solve the same problem that appears in Example 2.1. P ( A f l B) = P ( A n B I C). A . that is. whereas  P ( A ) P ( B ) P ( C )=I/. denote the event that the ith card is an ace and let N.1 between any pair. 2 Events A.3 Events Al. Because of Theorem 2. are said to be mutually independent if any proper subset of the events are mutually independent and P(A1 f l A2 The term "independence" has a clear intuitive meaning.)P(A. because P ( A f l B n C ) = P ( A f l B) = 1/4.4. We do not want A and B put together to influence C. n A.3 Statistical Independence We shall first define the concept of statistical (stochastic) independence for a pair of events.5 1 Probability Calculations 13 Thus the theorem follows from (2.. First we shall ask what we mean by the mutual independence of three events. and C . Clearly we mean pairwise independence. .4.).) = P ( E I A. the above formula enables us to calculate the probability of obtaining heads twice in a row when tossing a fair coin to be 1/4. Thus we should have n . let B be the event that a head appears in the second toss. without paying any attention to the other characteristicssuits or numbers.4. Using the axioms of conditional probability. Then. To summarize. Henceforth it will be referred to simply as "independence.
pB. (In other words. . B. P ( B I C) = l/g.5. (2. and (2.5. Suppose that there are four balls in a box with the numbers 1 through 4 written on them. by our assumptions. assume that P ( A ( B 1 7 C) = P ( A B ) and that P ( A I B n C) = P ( A B ) .6) There are C! ways in which three aces can appear in five cards. and PC.If we try Therefore. with respective probabilities PA.1) by c:. A. The result of Example 2. Figure 2.5. But it does.1). EXAMPLE 2. What is the probability that 1 and 2 are drawn given that 1 or 2 is drawn? What is the probability that 1 and 2 are drawn given that 1 is drawn? By Theorem 2.5.4. and C.P(l and 2) P ( l or 2) 2. By repeated application of Theorem 2. and we pick two balls at random.3 is somewhat counterintuitive: once we have learned that 1 or 2 has been drawn.2 Suppose that three events.14 2 I Probability 2..4 There is an experiment for which there are three outcomes. and C.= .4).1. C or C does not affect the probability of A.) Calculate P ( A I C) and P ( A I Since A = ( A n B) U ( A fl B ) . P ( A I B ) = '/2.5) we obtain P ( A I C ) = 5/1?.5.5.2) P ( A I C) = P ( A f l B 1 C) + P ( A n B I C). learning further that 1 has been drawn does not seem to contain any more relevant information about the event that 1 and 2 are drawn. Ei * P(l and 2 1 1 or 2) = P [ ( l and 2) f l (1 or 2)] P ( l or 2) .2). from (2. P ( l and 2 1 1 ) =  P [ ( l and 2) n 11 P(l) P(l and 2) P(1) (2. we have by axiom (3) of conditional probability E X A M PLE I c).5.5.. if B or B is known. Finally. A.3 Probability calculation may sometimes be counterintuitive.5 / Probability Calculations 15 Similarly.5. and each way has the same probability as (2. Therefore the answer to the problem is obtained by multiplying (2.4) 1 1 1 P(A f l B I C) = P(A I B)P(B I C ) = .5. affect each other in the following way: P ( B I C ) = '/2. Calculating P ( A 1 C) is left as a simple exercise. Furthermore. E XA M PLE 2. P ( A I B) = 1/3. .5. B. (2.4.1 we have (2.1 illustrates the relationship among the relevant events in this example.5. we have = P(A I B fl C)P(B I C). 2 2 4 . Similarly.
2) Suppose that the Stanford faculty consists of 40 percent Democrats and 60 percent Republicans.  . Then we have by Theorem 2. be the event that A happens in the ith trial.1 Characterization of events this experiment repeatedly. (1) Let A. We shall solve this problem by two methods. Thus 3.4.1 implies (9). . Then we have EXERCISES 1. E X A M P L E 2. (Section 2. . Let P be the desired probability.C.5. (Section 2.16 2 I Probability I Exercises 17 1and 4 lor2 Sample space lation actually has cancer.C) = (A n B) .2.4.2) Prove (a) A n ( B U G ) = ( A n B) U ( A n C ) .5 This is an application of Bayes' theorem. and (4) of the axioms of conditional probability.005 of the popu 4. Let C indicate the event that this individual actually has cancer and let T be the event that the test shows he has cancer. and define Bi and C.  (2) We claim that the desired probability is essentially the same thing as the conditional probability of A given that A C WB has o c c m d .1) Show that Theorem 2. .4. (Section 2. (Section 2. which in this case has turned out to be correct.C) n (B . which gives the same answer as the first method.2 (Bayes) I 3 and 4 F lC UR E 2. (Section 2. The second method is an intuitive approach.1) Complete Example 2. Assuming that 0. If 10 percent of the Democrats and 70 percent of the Republicans vote for Bush. Suppose that a cancer diagnostic test is 95% accurate both on those who do have cancer and on those who do not have cancer.3. (b) A U ( B n C ) = ( A U B ) n ( A U C ) . what is the probability that a Stanford faculty member who voted for Bush is a Republican? 5.4. (3).4. what is the probability that A occurs before B does? Assume pc # 0. (c) (A .3) Fill each of the seven disjoint regions described in the figure below ". substantiated by the result of the rigorous first approach. 2. similarly. given that the test says he has cancer.3. compute the probability that a particular individual has cancer.
6).2 as a random mechanism whose outcomes are real numbers.7) are not.4. 7.2. over a sample space. At our level of study. R A N D O M VARIABLES A N D PROBABILITY DISTRIBUTIONS 6.1 2 A random variable is a realvalued function defined . (2. (Section 2. (Section 2. (b) (2.7) are satisfied but (2. (2.1 is just as good.5) A die is rolled successively until the ace turns up. (Section 2. We have already loosely defined the term random variable in Section 1.5) Calculate P ( A ) C) in Example 2. In general.1 DEFINITIONS OF A RANDOM VARIABLE 8.5).4. what is the probability that you will be admitted into at least one school? Find the limit of this probability as n goes to infinity. (2.4. (Section 2.) 3.4. and (2.4. 1 A random variable is a variable that takes values accord 9.6).6)."we think of all the possible values it can take. I .5) If the probability of being admitted into a graduate school is l / n and you apply to n schools. The customary textbook definition of a random variable is as follows: D E F I N IT I 0 N 3.71.5 that the ace will turn up at least once? ing to a certain probability distribution. what is the probability of winning a tennis game under the "noad" scoring? (The first player who wins four points wins the game.4.4. (2. Note that the idea of a . We have mentioned discrete and continuous random variables: the discrete random variable takes a countable number of real numbers with preassigned probabilities. and the continuous random variable takes a continuum of values in the real line according to the rule determined by a density function. and (2. we can simply state D E F I N ITI o N 3. 10.4. when we speak of a "random variable.4.4." we think in addition of the probability distribution according to which it takes all possible values. Definition 3.8) is not.8) is satisfied but (2. (c) (2.18 2 I Probability by an integer representing the number of simple events with equal probabilities in such a way that (a) (2. and (2. When we speak of a "variable.5). Defining a random variable as a function has a certain advantage which becomes apparent at a more advanced stage of probability theory.5) Compute the probability of obtaining four of a kind when five dice are rolled.4. (Section 2.5).4.1. How many rolls are necessary before the probability is at least 0.5.8) are all satisfied. Later in this chapter we shall also mention a random variable that is a mixture of these two types.5) If the probability of winning a point is p.
.1. P 7 Probability Sample space H H I 7 1 P 1 HT I TH I TT I 3.2. It indicates the little difference there is between Definition 3. 1 Experiment: A throw of a fair die. It means the random variable X takes value x.1 A discrete random variable is a variable that takes a countable number of real numbers with certain probabilities. We must. .= 1. we can forget about the original sample space and pay attention only to what values a random variable takes with what probabilities. Experiment: Tossing a fair coin twice.1 to the case of a discrete random variable as follows: D E F I N ITI o N 3.20 3 1 Random Variables and Probability Distributions EXAMPLE 3 . of course. have Z:='=..p.1. The arrows indicate mappings from the sample space to the random variables.2 as well.2.1 and Definition 3. 2. for a sample space always has a probability function associated with it and this determines the probability distribution of a particular random variable. n. . we shallillustrate how a probability function defined over the events in a sample space determines the probability distribution of a random variable.) = p. In the next section. We specialize Definition 3. EXAMPLE 3 . n may be w in some cases.2 3. Note that Xi can hardly be distinguished from the sample space itself. Note that the probability distribution of X2 can be derived from the sample space: P(X2 = 1) = '/ 2 and P(X2 = 0) = %. 2 . with probability p. P B P 6 1 X (number of heads in 1st toss) 1 6 1 6 Probability Sample space 6 Y (number of heads in 2nd toss) In almost all our problems involving random variables.. i = 1. . .2 1 Discrete Random Variables 21 probability distribution is firmly embedded in Definition 3. It is customary to lues it takes by by a capi denote a random lowercase letters. 2 .1.2. The probability distribution of a discrete random variable is completely characterized by the equation P(X = x.1.1 DISCRETE RANDOM VARIABLES Univariate Random Variables The following are examples of several random variables defined over a given sample space. 2 3.
) .4. . . .1) P ( ~I I ~ j ) p ( ~j )P(x1 I ~ P(x2 yj)Pbj) j ) I P(x2 I~ j ) for every j.4. 1 ) is not a real number. Proo$ ("only if" part). n and/or m may be in some cases.2.1 we defined independence between a pair of events. If X and Y are independent. AMixed to the end of Table 3.1 3. Since a quantity such as ( 1 .)P(Y = y.) for all i.1 are independent if and only if every row is proportional to any other row.2 shows the values taken jointly by two random variables X and Y.I :g . i = 1 .2.2. again. n.2 Bivariate Random Variables Probability distribution of a bivariate random variable The last row in Example 3.1 are a column and a row representing marginal probabilities and poi = Cy=. D E F I NIT Io N 3 . Thus we have D E F I N IT I o N 3. the probability distribution of one of the univariate random variables is given a special name: mcrginal probability distribution..) = P ( X = x. we can define Conditionalprobability P(X = xi :. . That is to say. the first two rows.. .1. J= 1 Discrete random variables X and Y with the probability distribution given in Table 3. 2 . . Using Theorem 2. Therefore. . = E~=lpij comes from the positions of the marginal probabilities in the table.2 A bivariate discrete random variable is a variable that takes a countable number of points on the plane with certain probabilities. m. j = 1 . n. Because of probability axiom ( 3 ) of Section 2. P(X = x. . It is instructive to represent the probability distribution of a bivariate random variable in an n X m table.Y = y.(The word marginal calculated by the rules pi.1. we do not have a random variable here as defined in Definition 3. 2. the first two rows are proportional to each other. equivalently. We call pq the joint probability. every column is proportional to any other column.2.3 which does not depend on j. See Table 3.pij. Y = y .) and the event (Y = y. T H Eo R E M 3.) are independent for all i... we have by Definition 3.2. 2. 2. or... In Definition 2. Y = y.I ifP(Y=yj)>o. The same argument holds for any pair of rows and any pair of columns. for example. j.) P(Y = yj) (3.2 1 Discrete Random Variables 23 3. .) B y looking at the table we can quickly determine whether X and Y are independent or not according to the following theorem. i = 1. . 3 Discrete random variables are said to be independent if the event ( X = x.1 Marginal probability m P ( X = x . P22 Pzm PZo The probability distribution of a bivariate random variable is determined by the equations P ( X = xi. the marginal probability is related to the joint probabilities by the following relationship. But it is convenient to have a name for a pair of random variables put together.2.22 3 1 Random Variables and Probability Distributions TABLE 3. Here we shall define independence between a pair of two discrete random variables. ) = x P ( X = x .2. We have P2j Plj (Y = yj) = P(X = x. Consider. When we have a bivariate random variable in mind. Y = yj) = pq.2. j.1.
3 Let the joint probability distribution of X and Y be given by The probability distribution of a trivariate random variable is determined by the equations P ( X = xi.. P(x.) = C C P ( X = x. . Y.2. q.2 as follows. . . . Y = y . 2 . Definition 3.2.2. P(Y = 1 IX = I P(X. .1.andj.5 generalizes Definition 3. P(xi) for every i and j. From (3.2. D E F I N IT I o N 3. .2.)=C.2. we can define Marginal probability P(X = x.i = 1. to be unity for every j..4) we have (3. 0 We shall give two examples of nonindependent random variables..4 A Tvariate discrete random variable is a variable that takes a countable number of points on the Tdimensional Euclidean space with certain probabilities..2.3 Multivariate Random Variables P(xiI yj) = cj . Therefore X and Y are independent. E X A M P L E 3. P ( Y = l I X = O ) ='/4 P ( Y = O I X = O ) ='/. .24 3 1 Random Variables and Probability Distributions 3.3) P ( Y = . . Z = zk) = pUk.) P(Y = y.2. . m.~.2.4 =OF   Conditional pobability P(X = x8.. and therefore X cannot be regarded as a function of x2. ' Note that this example does not contradict Theorem 3. % % Then we have P(Y = 1 I X = 1) = (%)/(%I = % and P(Y = 1 I X = (%)/ (%) = 2/5.P(X~I foreveryi.3) by P(y. j. The word function implies that each value of the domain is mapped to a uniquevalue of the range. n.I I X = I ) ='/. 7%. z = zk) > 0.=I h= 1 m 9 i = 1 .2. . .5) ( J P(xJ . 2. E X AM PLE 3. .Y = yj I Z = zk) = P(X = x. Then from (3. . we determine c.2.I~. k = 1.6) ' P(q ' J for ) every i. Y =y.2. Therefore (3. .2. . Summing both sides over i. . but x2and Y' Random variables X and Y defined below are not indeare independent. Y = y. . P(xk) for every i and k.2. Y = yl.2. j = 1. As in Section 3. 3.3) and (3.4) 0 ) = %.5..2.k. Z = zk)..2 1 Discrete Random Variables 25 ("if' part).2. Suppose all the rows are proportional to each other.. We can generalize Definition 3.3. . 2. . I Y = y.) and sum over j to get (3.) i f P ( Y = y. . . and ir. 2. z = 2. Z P(Z = zk) if P ( Z = zk) > 0 = zk) pendent. z = 2.) Multiply both sides of (3. P(xb) .) = ctk . P(X = x.1) we have (3. Z = zk) = P(X = x.. which shows that X and Y are not independent.
. Let the sample space S be the set of eight integers 1 through 8 with the equal probability of '/s assigned to each of the eight integers. For a precise definition. Z = 1) = P ( i = 4) = '/. For such a 1)iPCg ?. 2 .1: A continuous random variable is a variable that takes a continuum of values in the real line according to the = P(X = 1. but X. The following defines a continuous random variable and a density at the same time. = 0 otherwise. . and therefore it does not matter whether < or 5 is used within the probability bracket. . . then X is a continuous random variable and f (x) is called its density function. Or we can define it in P ( X = ~ . There are many possible answers. Z = 1 for i even. 6 (3. In most practical applications.. and/or x:! = w. Y. = 0 otherwise Y=1 for35z56.1) P(xi 5 X 5 x2) = Ixy(x)dx XI X = 1 for i 5 4. Y P ( X = 1. see. = 0 otherwise.. EXAMPLE 3 . for example. 1 An example of mutually independent random variables follows. 2 . It follows from Definition 3.5. We shall allow xl = a. Then.. define EXAMPLE 3 . but we can.3. as illustrated by Example 3. and Z are mutually independent because Then X. Find three random variables (realvalued functions) defined over S which are mutually independent. k .3 1 Univariate Continuous Random Variables 27 DEFINITION 3. We assume that the reader is familiar with the Riemann integral. 5 Suppose X and Y are independent random variables which each take values 1 or 1 with probability 0. Z = ~ ) = P ( ~ = I ~ X = ~ . f (x) will be continuous except possibly for a finite number of discontinuities. Y independent because 3.2. We need to make this definition more precise. .5 Afinite set of discrete randomvariablesX. x2 satisfying xl 5 XZ. Y = yj. . Y = ~ ) P ( X = ~ .26 3 1 Random Variables and Probability Distributions 3.)P(Z zk) .1 whereas In Chapter 2 we briefly discussed a continuous sample space.I)* .1. z . . however. Y = ~ . . Apostol (1974). 2. we define a continuous random variable as a realvalued function defined over a continuous sample space. j.1) simply as the area under f (x) over the interval [xl. .5 and define Z = XY.... and so on for all eight possible outcomes. Y = 1) = '/4. 3 . If there is a nonnegative function f (x) defined over the whole line such that D E F I N ITI o N 3 . we must have J"f(x)dx = 1. . by axiom (2) of probability. and Z are not mutually Then Z is independent of either X or Y.3 UNlVARlATE CONTINUOUS RANDOM VARIABLES Density Function 3. forall i. x2].2.= = Zk. Y = ~ ) a way analogous to Definition 3.2.3.) = P ( X = xj)P(Y = y.3. are mutually independent if P(X = x. For our discussion it is sufficient for the reader to regard the righthand side of (3. rule determined by a density function. Y = 1. P(X = l ) P ( Y = l ) P ( Z = 1) = '/s. Following Definition 3.3. It is important to note that pairwise independence does not imply mutual independence.2.1 that the probability that a continuous random variable takes any single value is zero. for example. = P(X = l)P(Y = for any XI.
1 Bivariate Density Function Suppose that a random variable X has density f ( x ) and that [a. it must be nonnegative and the total volume under it must be equal to 1 because of the probability axioms. We can easily verify that f ( x 1 a (3. it will satisfy (3. 3 .5) f( x I S ) = = P(X E S ) 0 f(x' for x S.3. The conditional density of X b. The rule is that the probability that a bivariate random variable falls into any region on the plane is equal to the volume of the space under the density function over that region. Then the conditional density of X given X E S. D E F l N IT I 0 N 3. then (X. For such a function we may change the order of integration so that we have (3. yl 5 B. y) is called the joint density function. provided that $I: f (x)dx # 0.4.4. for any closed interval [xl. Then.2 Conditional Density Function 3. D E F I N I T I O N 3.4 BlVARlATE CONTINUOUS RANDOM VARIABLES 3.3.1 If there is a nonnegative function f ( x . in addition to satisfylng the above two conditions. denoted by f ( x a 5 X 5 b). b ] . and the details are provided by Definition 3.1 Now we want to ask the question: Is there a function such that its integral over [ x l . denoted by f ( x 1 S ) . 3.1) exists. y1. Y) is a bivariatt.2)? The answer is yes.2.4. D EFI N IT I o N 3 . b] is a certain closed interval such that P ( a 5 X S b) > 0.2.for a 5 x 5 b. b] 3 [ x l . is defined by I f f (x. 2 We may loosely say that the bivariate continuous random variable is a variable that takes a continuum of values on the plane according to the rule determined by a joint density function defined over the plane. xp] contained in [a.4.4.x p ] is equal to the conditional probability given in (3.28 3 1 Random Variables and Probability Distributions 3.1. as desired. X Z .4 1 Bivariate Continuous Random Variables 29 function the Riemann integral on the righthand side of (3. = for any X I . We shall give a few examples concerning the joint density and the evaluation of the joint probability of the form given in (3.3.3. otherwise. . y) is continuous except possibly over a finite number of smooth curves on the plane.3 Let X have the density f ( x ) and let S be a subset of the real line such that P ( X E S ) > 0. We shall give a more precise definition similar to Definition 3. y) defined over the whole plane such that given a (3.1).3. From the above result it is not difficult to understand the following generalization of Definition 3.xp].3. which was given for a univariate continuous random variable. In order for a function f (x. y) to be a joint density function. and therefore f ( x ) can be a density function. is defined by I (4 f ( x I a I X 5 b) = f .3.3.we have from Theorem 2.1) and hence qualify as a joint density function. yp satisfylng xl 5 xp.3. The second condition may be mathematically expressed as 0 otherwise.3) 5 X 2 Let X have density f ( x ).4) 5 X I b) defined above satisfies P ( x l ~ X I x 2 1 a 5 X < b ) = f(xIa5XSb)dx I: whenever [a.3. continuous random variable and f (x.
and (3. We write this statement mathematically as (3.9).4. y ) = 1 for 0 < x < 1 .4). Let S be as in Figure 3. Y ) falls into the shaded quarter of the unit circle in Figure 3.9) P t (X.1.4. x > O .3 Domain of a double integral: I Domain of a double integral: 11 Domain of a double integral: 111 where U and V are functions of x. (3. 4 .6). y) is a complicated function. a n d O o t h e r w i s e . Iff (x. y > 0 .2. calculate P ( X > Y ) and P(X' + Y2 < 1 ) . The event (x2+ Y* < 1 ) means that ( X . We shall show the algebraic approach to evaluating the double integral (3. we can treat any general region of practical interest since it can be expressed as a union of such regions. Since P ( X > Y ) is the volume under the density over the triangle. the probability that ( X . a = 0.4.2 = Putting U = eCY. the inside of a circle) can be also evaluated as a double integral off (x. V = y.4.4. V = x. Y ) falls into a more general region (for example. Assuming f ( x . which is 1. Y). Therefore.4.4.4.4. where S is a subset of the plane.4.2 FIGURE 3. y) plane. 1 Iff(x. Then we have . P ( X > Y ) = 1/2. p ( x 2 + Y2 < 1 ) = ~ / 4Note .30 3 1 Random Variables and Probability Distributions 3. We shall consider only the case where S is a region surrounded by two vertical line segments and two functions of x on the ( x . Therefore.which will have a much more general usage than the geometric approach.4. a = 1.2e1.1 FIGURE 3. EXAMPLE 3. y)dxdy. and b = 1 in (3. Y ) . y).4. w h a t is P ( X > 1 .y) = ~ y e ~ ( " + ~ ) . Once we know how to evaluate the double integral over such a region.7) = [)us]: + = e l + [ . Y ) E Sl = I1 f (x. we need the following formula for integration by parts: FIGURE 3. this intuitive interpretation will enable us to calculate the double integral without actually This geometric approach of evaluating the double integral (3. The event ( X > Y ) means that ( X .e .~ ~= : 1 .3. y ) is a simple function.7)we obtain If a function f (x. that the square in each figure indicates the total range of ( X .5) we have (3.1)we have Y Y To evaluate each of the single integrals that appear in (3.we have = eLX.9) may be intuitively interpreted as the volume of the space under f (x. Therefore from (3.9)may not work iff (x. Y ) falls into the shaded triangle in Figure 3.4 1 Bivariate Continuous Random Variables 31 E X A M P L E 3 . Y < I ) ? B y (3.3. Putting U b = in (3. y) is a joint density function of a bivariate random variable (X. it must equal the area of the triangle times the height of the density. as in the following example.5).4). A region surrounded by two horizontal line segments and two functions of y may be similarly treated.4. as shown in Figure 3. and performing the mathematical operation of integration. y) over S . 0 < y < 1 and 0 otherwise. The double integral on the righthand side of (3.
4.~ ) . b 1. To evaluate P(X > Y ) . EXAMPLE 3. Y) falls into the shaded region of Figure 3.4.) h(xJ f ( q .4. we must reverse the role of x and y and put a = 0.yldxdy = 1: [r h(x) f(x.5.10). The following examples use (3. +'(t) denotes the derivative of with respect to t.10) to this problem.4.4. Event (Y < %) means that ( X .10) so that we have EXAMPLE 3 . we shall put x = cos 0. b = %. g ( x ) = p ( x 2 + Y* < 1) = 1: (17~) dx = To evaluate the last integral above. yjdy . Summing both sides of (3. and h ( x ) = 0 so that we have (3. 0 < y < 1 .x. B * 0. .4.16) d o = 8 cos 0 1 'iT +o r =. . .10) so that we have m.The following relationship between a marginal probability and a joint probability is obviously true. h(y) = 0 in (3.4.we put f ( x .4 the volume to be evaluated is drawn as a loaflike figure.4.4.4.2 are special cases of region S depicted in Figure 3. We shall show graphically why the righthand side of (3. the p r o b ability pertaining to one of the variables. Then. 3. 2 0 4 Volume of the ith slice z f (xi. we get (3. Here.4. y ) = 1.11) over i.).3 We shall calculate the two probabilities asked for in Example 3.10).14) To evaluate P(X' + y2 < l ) . 4 Volume z .411 if is a monotonic function such that +(tl) = xl and +(t2) = xp.y. y ) = 24 xy for 0 < x < 1.15) f (x)dx = XI / [ ~ ( t+'(tjdt. g(y) = 1 . 4 .4 Double integral (3.1 and 3. Note that the shaded regions in When we are considering a bivariate random variable (X.we put f (x. In order to apply (3.I I 32 3 1 Random Variables and Probability Distributions 3.10) to evaluate the joint probability.4.10) 11 S f (x. y)dy] dx.12) Suppose f (x.y)dy .4.4. Next. We have approximately (3.4.4.2 M a r g i n a l Density But the limit of the righthand side of (3. (x.4. y) over S.. such as P(xl 5 X 5 q ) or P(yl 5 Y 5 y2).4 1 Bivariate Continuous Random Variables 33 Figures 3. since dx/d0 = sin 0 and sin20 + cos20 = 1.is called the marginalpobability.x and = 0 otherwise. g(x) = x. and h(x) = 0 in (3. we need the following formula for integration by change of variables: FI CU R E 3.12) as n goes to is clearly equal to the righthand side of (3. a = 0 . Calculate P(Y < %). we have + + (3. b = 1. In Figure 3.3.10) indeed gives the volume under f (x. (xi . Y ) .2 using formula (3. The first slice of the loaf is also described in the figure. y) = 1.x .= 1 2 /s. )] fl (3.
Y) have thejoint density f(x.1 shows how a marginal density is related to a joint density. h(x) = yl. say. 3 Let (X. Proo$ We only need to show that the righthand side of (3. y 1 S) = = f (ay) P[(X.y) be theQointdensityofXandYandletf(x) be the marginal density of X.3 Conditional Density We shall extend the notion of conditional density in Definitions 3.18) by x E S where S is an arbitrary subset of the real line.3.we have . 1 We are also interested in defining the conditional density for one of the variables above.We assume that P[(X. y S). Similarly. Formally. D E F I N I T I O N 3 .3 to the case of bivariate random variables. We shall explicitly define it for the case that S has the form of Figure 3.4. Y) given (X.8. A generalization of Definition 3.Y) have thejoint density f(x. Since in this case the subset S can be characterized as yl 5 Y 5 yn. Theorem 3. b = m. X. 2 Let (X.4 1 Bivariate Continuous Random Variables 35 3.3. when we are considering a bivariate random variable (X. Y ) .19) equation (3. Y) E S.4.4.3.y) and l e t s be a subset of the plane which has a shape as in Figure 3. It may be instructive to write down the formula (3. 4 . Domain of a double integral for Example 3. Y) E S] > 0. Then = 0 otherwise.21) F l G UR E 3. y I S) with respect to y. one may replace xl 5 X 5 x 2 in both sides of (3. Y) E S. and g(x) = yn in Figure 3.2 and 3.4. denoted by f (x I S). We have For an application of this definition. 4 . Y ) E Sl 0 for (x.3. 4 . We shall consider first the situation where the conditioning event has a positive probability and second the situation where the conditioning event has zero probability.3. it can be obtained by integrating f (x.22) explicitly for a simple case where a = a. is defined by I (3.5 f (x. Under the first situation we shall define both the joint conditional density and the conditional density involving only one of the variables. Y) E S] > 0. see Example 3.4.34 3 1 Random Variables and Probability Distributions 3. is defined by Letf(x. Then the conditional density of X given (X.1).3. otherwise. y) E S.y) a n d l e t s be a subset of the plane such that P[(X.3 is straightforward: D E F I N I T I O N 3 . THEOREM 3 .3. given a conditioning event involving both X and Y.4.4 More generally.4. denoted by f (x.4. Then the conditional density of (X. the density function of one of the variables is called the margznal density.
since P(Y = yl + cX) = 0.4. (3.26) can be obtained by defining for all xl.23) is integrated over an arbitrary interval [xl.4 provided the denominator is positive.4. Next we shall seek to define the conditional probability when the conditioning event has zero probability. h(x) = yl + cx. if it exists. denoted by f ( x Y = yl cX).4. I yl 5 Y 5 y2).26) is as follows.y1 + cx)dx P(xl = 5 X 1 5 xz I Y = yl 5 + cX) + cX 5 Therefore the theorem follows from (3.4 THEOREM 3 .4. where yl and c are arbitrary constants.y) 1 y = yl + cx]. 2 1 Bivariate Continuous Random Variables 37 (3. S = { ( x .xp]. (x. An alternative way to derive the conditional density (3.3 can be verified by noting . 5 The conditional density of X given Y = yl + cX.4.4.1   given Y = yl (3.4. write as a separate theorem: . Proof. xp satisfying xl Now we can prove 5 q.24) + The conditional probability that X falls into [xl. + cX).25). 4 .4.4.36 3 1 Random Variables and Probability Distributions 3. y)dydx The conditional density f ( x Y = yl I + c x ) exists and is given by The reasonableness of Definition 3. B y putting a = m. and g ( x ) = y:. and (3. We begin with the definition of the conditional probability Pixl 5 X 5 x g 1 Y = y1 + c X ) and then seek to obtain the function of x that yields this probability when it is integrated over the interval [xl. + cx in Figure 3. b = m. y)dy f ( x .2. where yl < yz.4.27). we have from (3. see Example 3.4.23) f(x I yl 5 Y 5 yn)  Il' I f (x.9.4.4.22)  DEFINITION 3 . DEFINITION 3. is defined to be a function that satisfies I + Then the formula (3. it yields the conditional probability P(xl 5 X 5 x:. 4 .3. Next we have  For an application of Theorem 3.4.4.that when (3.x2].q ] cX is defined by rm f by the mean value theorem of integration. We have J \&. y) plane: that is to say. YIuyuX by Theorem 2. Note that this conditional probability cannot be subjected to Theorem 2.24). We shall confine our attention to the case where the conditioning event S represents a line on the (x. 0 Y 5 lim P(xl Y 2+Y X 5 xpr yl y:.1.
and independence.4.2.6 in the same way that Definition 3. stated without proof. Figure 3. y) = f (x)f (y) for all x and y.1 to a multivariate case. consider Examples 3. 5 X 5 x2)P(y1 5 Y 5 y2) We shall give examples involving marginal density. D E F 1 N I T I 0 N 3 . F I G u R E 3. I x2.) 3. are said to be mutually independent if Finally.3.1. . respectively. 4 . Then X and Y are independent if and only if S is a rectangle (allowing a or to be an end point) with sides parallel to the axes and f (x. z.4. One can ascertain this fact by noting that f (%. The following theorem. .5 Examples This definition can be shown to be equivalent to stating P(x1 5 X 5 xg.6 Joint density and marginal density i T H E o R E M 3. The area of the shaded regim represents f (yl).3/q) is outside of the triangle whereas both f (x = %) and f (y = %) are clearly nonzero. In Example 3.5. y) = g(x) h(y) over S.). . is more apparent. ye? over S.2. In Example 3. .1 and 3. even though the joint density 24xy factors out as the product of a function of x alone and that of y alone over S.4. Note that if g(x) = cf (x) for some c.4 1 Bivariate Continuous Random Variables 39 for all X . X and Y are independent because S is a rectangle and xye(x+~)= xeX .30). conditional densitJr. but the (We have never defined a multivariate joint density f (x.y) >(lover S and f (x.6 describes the joint density and the marginal density appwing in the righthand side of (3. This may be a timeconsuming process. A finite set of continuous random variables X .4. 3/4) = 0 since the point (1/2.4. 4 . we shall define the notion of independence between two continuous random variables. Thus stated. Z. y1 5 y2.3 i The conditional density of X given Y = yl.4. X and Y are not independent because S is a triangle. we should obtain the marginal densities and check whether their product equals the joint density.38 3 1 Random Variables and Probability Distributions 3. 4 LetSbeasubsetoftheplanesuch thatf(x.5 generalizes 3.4. . 4 . Definition 3.4.4. 7 provided that f (yl) > 0. yl.6 implies that in order to check the independence between a pair of continuous random variables. provides a quicker method for determining independence. 6 Continuous random variables X and Y are said to be independent iff (x. . denoted by As examples of using Theorem 3. D EF 1 N I T 1 0 N 3 .4. where g(x) and h(y) are some functions of x and y . . Y.3. which defined independence for a pair of discrete random variables.4. My) = ctf(y). The next definition generalizes Definition 3. its connection to Definition 3. .4.4. THEOREM 3 . y reader should be able to generalize Definition 3. as shown in Figure 3. yg such that xp 5 xp.4. y1 5 Y 5 yg) = P(x. y) = 0 outside S.2.
Calculate P ( 0 < X < 0. y) = 24xy for 0 < x < 1 and 0 < y < 1 . Calculate marginal density f ( y ).4.   Therefore That X and Y are not independent is immediately known because f (x. Calculate P(0 < Y < S/4 1 X = %) .4. y) = % over the rectangle determined by the four corner points (1.41). as can be seen either from (3. Therefore =0 otherwise. B y Theorem 3. we have (3.34). 1) and = 0 otherwise. 9) = '/b .5 1 Y = 0.5) and P ( 0 < X < 0.4. whereas the range of the second integral is from 0 to 1/2. O).4. To calculate P(0 < X < 0.5 in (3. We should calculate f ( y ) separately for y 2 0 and y < 0 because the range of integration with respect to x differs in two cases.5 E X A M P L E 3 . we must first obtain the is equal to marginal density f ( y ) .1 we have x6.40) the range of the first integral is from 0 to 5/4.40 3 1 Random Variables and Probability Distributions 3. y ) = (9/2)(x2 + y2) f o r 0 < x < 1. Such an observation is very important in solving this kind of problem. 6 Let f (x. B y Theorem 3. f (x.y. y) cannot be expressed as the product of a function of x alone and a function of y alone. since f (x. and (0.y)dx = i"' n~ 2 dx=l+y if1 I y < 0 .4 1 Bivariate Continuous Random Variables 41 Suppose f ( x . Note that in (3.5 are very useful in this regard. We can ascertain this fact by showing that and (3.5).7 Putting y = 0. f ( x . (.O < y < 1 and = 0 otherwise.4.5 10 < Y < 0.5) = '/5.4.4.39) or from Figure 3. y) is integrated with respect to x from m to CQ but % is integrated from y . Suppose f (x. 4 . We have EXAMPLE 3.5. l ) .5 1 Y = 0.5) = . for example. y) be the same as in Example 3.4.5 1 0 < Y < 0.3. we must f i s t obtain the conditional density f ( x y). We have By a simple double integration it is easy to determine that the numerator To obtain the denominator. O). This is because f (y I x = 1/2) = 8y only for 0 < y < '/2 and f (y I x = '/9) = 0 for % < y.x and = 0 otherwise. and diagrams such as Figure 3.35) f (x I Y = 3 12 2 0.42) f (y) = m f ( x . Note that in (3. That is. Therefore Therefore Therefore P ( 0 < X < 0. We have EXAMPLE 3.4. ( 0 . 7 7     .1 to 1 .1.5) and determine if X and Y are independent.4.4.4.+ x .
This example is an application of Definition 3. Then its distribution is a step function with a jump of length pi at x. the value of the distribution function is at the solid point instead of the empty point.22) we have = 0 otherwise. The answer is immediately obtained by putting yl = %. 5 . 1 . but this cannot be done for a continuous random variable.8 Suppose f(x.5 X ) . F(m) = 0. This dichotomy can sometimes be a source of inconvenience. Let X be a finite discrete random variable such that P ( X = xi) = pi. This can be done by using a cumulative distribution function (or. Thus E X A M P L E 3. and g(x) = 1 in Figure 3. which can be defined for any random variable.2. i = I . Figure 3.4.5.41) as the area of the shaded region. E X A M P L E 3. h(x) = x. and in certain situations it is better to treat all the random variables in a unified way. y) = 1 for 0 i x 5 1 and 0 5 y 5 1 and < Y).4.y) and = 0 if x is outside the interval. Obtain f (x X I F(x) = P(X < x) for every real x. %). The distribution function of a continuous random variable X with is given by density function f ( a ) .4. denoted by F ( . 0 < y < 1 and = 0 otherwise. Conversely. a continuous random variable has a density function but a discrete random variable does not. as shown in Figure 3. .3. c = 1 in (3. This example is an application of Theorem 3. Some textbooks define the distribution function as F(x) = P(X 5 x). .4. for 0 <x < 1 2 =0 otherwise. At each point of jump.4. Obtain the conditional density f (x I Y = 0.5 DISTRIBUTION FUNCTION 1 FI C u R E 3. 2. =0 otherwise.4. is continuous from the left.1) if x is within the interval (y .3. is defined by (3.7 Marginal density As we have seen so far.9 + From the definition and the axioms of probability it follows directly that F is a monotonically nondecreasing function. 1 The (cumulative) distribution function of a random variable X. a discrete random variable is characterized by specifying the probability with which the random variable takes each single value.26) and noting that the range of X given Y = % IX is the interval (0.y) = 1 f o r 0 < x < 1 .7 describes (3. 3. indicating the fact that the function is continuous from the left. and F(m) = I.4. D E F I N I TI 0 N 3 . .42 3 ( Random Variables and Probability Distributions 3. .5 1 Distribution Function 43 = 2. Put a = 0.1. b = 1. more simply. Assume f(x. n. Then the distribution function can be shown to be continuous from the right. a distribution function).) . Then from (3.8.
44
3
1
Random Variables and Probability Distributions
3.5
1
Distribution Function
45
?J
Then F(x) = 0 if x 5 0. For x > 0 we have
,'/2dt
= [,t/?;
= 1
,
x/z.
EXAMPLE 3.5.2
Suppose foro<x<l, otherwise.
=
(3.5.6)
f(x)=2(1x)
=
0
Clearly, F(x) = 0 for x 5 0 and F(x)
1 for x 2 1. For 0 < x < 1 we have
FIG u R E 3 . 8
Distribution function of a discrete random variable
=O+2
I:,
(1t)dt=2
2
From (3.5.2) we can deduce that the density function is the derivative of the distribution function and that the distribution function of a continuous random variable is continuous everywhere. The probability that a random variable falls into a closed interval can be easily evaluated if the distribution function is known, because of the following relationship:
Example 3.5.3 gives the distribution function of a mixture of a discrete and a continuous random variable.
EXAMPLE 3.5.3
Consider
If X is a continuous random variable, P(X = x2) = 0; hence it may be omitted from the terms in (3.5.3). The following two examples show how to obtain the distribution function using the relationship (3.5.2), when the density function is given.
EXAMPLE 3.5.1
This function is graphed in Figure 3.9. The random variable in question takes the value 0 with probability % and takes a continuum of values between % and 1 according to the uniform density over the interval with height 1.
Suppose
(3.5.4)
j(x)
=
1 2
for x > 0,
A mixture random variable is quite common in economic applications. For example, the amount of money a randomly chosen person spends on the purchase of a new car in a given year is such a random variable because we can reasonably assume that it is equal to 0 with a positive probability and yet takes a continuum of values over an interval. We have defined pairwise independence (Definitions 3.2.3 and 3.4.6)
A
*"
A*
3
1
Random Variables and Probability Distributions
DEFINITION 3.5.4
3.6
1
Change of Variables
47
Bivariate random variables (Xl,YI), (X2,Y2),. . . , (X,, Y,) are mutually independent if for any points xl, x2, . . . , X, and yl,
3'2,.
.  ,.Yn,
I I I
F I G u R E 3.9
Distribution function of a mixture random variable
and mutual independence (Definitions 3.2.5 and 3.4.7) first for discrete random variables and then for continuous random variables. Here we shall give the definition of mutual independence that is valid for any sort of random variable: discrete or continuous or otherwise. We shall not give the definition of pairwise independence, because it is merely a special case of mutual independence. As a preliminary we need to define the multivariate distribution function F(xl, xn, . . . ,x,) for n random variables X1, X2, . . . , Xn by F(xl, q, . . . , x,) = P(X1 < XI, X2 < xp, . . . ,X, < x,). Random variables XI, XZ, . . . , Xn are said to be mutually independent if for any points XI, xe, . . . , x,,
D EF l N I T I O N 3 . 5 . 2
Note that in Definition 3.5.4 nothing is said about the independence or nonindependence of X, and Y,. Definition 3.5.4 can be straightforwardly generalized to trivariate random variables and so on or even to the case where groups (terms inside parentheses) contain varying numbers of random variables. We shall not state such generalizations here. Note also that Definition 3.5.4 can be straightforwardly generalized to the case of an infinite sequence of bivariate random variables. Finally, we state without proof:
THEOREM 3.5.1
f
Let 4 and be arbitrary functions. If a finite set of random variables X, Y, Z, . . . are independent of another finite set of random variables U, V, W, . . . , then +(X, Y, Z, . . .) is independent of +(U, . . .).
+
v,w,
$ '
Just as we have defined conditional probability and conditional density, we can define the conditional distribution function.
D E F I N I T I O N 3.5.5 LetXandYberandomvariablesandletSbeasubset of the (x, y) plane. Then the conditional distribution function of X given S, denoted by F(x 1 S), is defined by
(3.5.9)
F(xl,x2, . . . , x , ) = F ( x I ) F ( x 2 ) . . . F ( x , ) .
Equation (3.5.9) is equivalent to saying
(3.5.10)
P(XlES1,X2ES2 , . . . ,XnESn)
= P(Xl E S,) P (X, E S,)
(3.5.12)
E S,)
F(x(S) = P ( X < x I ( X , Y ) ES).
. . . P(X,
for any subsets of the real line S1, S2, . . . , Sn for which the probabilities in (3.5.10) make sense. Written thus, its connection to Definition 2.4.3 concerning the mutual independence of events is more apparent. Definitions 3.2.5 and 3.4.7 can be derived as theorems from Definition 3.5.2. We still need a few more definitions of independence, all of which pertain to general random variables.
D E F I N IT I o N 3 . 5 . 3 An infinite set of random variables are mutually &dependent if any finite subset of them are mutually independent.
Note that the conditional density f (x I S) defined in Definition 3.4.3 may be derived by differentiating (3.5.12) with respect to x.
3.6 CHANCE OF VARIABLES
In this section we shall primarily study how to derive the probability distribution of a random variable Y from that of another random variable X when Y is given as a function, say +(X), of X. The problem is simple if X and Y are discrete, as we saw in Section 3.2.1; here we shall assume that they are continuous.
'
48
3
1
Random Variables and Probability Distributions
3.6
1
Change of Variables
49
We shall initially deal with monotonic functions (that is, either strictly increasing or decreasing) and later consider other cases. We shall first prove a theorem formally and then illustrate it by a diagram.
T H E O R E M 3.6.1
The term in absolute value on the righthand side of (3.6.1) is called the Jacobian of transfmation. Since d+vdy = (d+/dx)', we can write (3.6.1) as
(3.6.9)

L e t f ( x ) bethedensityofXandletY=+(X),where+ is a monotonic differentiable function. Then the density g(y) of Y is given
by
g(y) = ('I  (or, mnemonically, g ( y )idyl = f ( x )ldx\), Idy/&l
which is a more convenient formula than (3.6.1) in most cases. However, since the righthand side of (3.6.9) is still given as a function of x, one must replace x with +  l ( y ) to obtain the final answer.
+I

where
is the inverse function of 4. (Do nOt mistake it for 1 over 4.)
Proot We have
E X A M P L E 3.6.1 Suppose f ( x ) = 1 for 0 < x < 1 and = 0 otherwise. Assuming Y = x', obtain the density g(y) of Y. Since dy/dx = 2x, we have by (3.6.9)
Suppose
(3.6.3)
+ is increasing. Then we have from (3.6.4)
P(Y < y) = P[X
<
But, since x =
6, we have from (3.6.10)
Denote the distribution functions of Y and X by G ( . ) and F ( . ) , respectively. Then (3.6.3) can be written as
(3.6.4)
G(y) = ~ [ $  ' ( y ) l .
Differentiating both sides of (3.6.4) with respect to y, we obtain
It is a good idea to check for the accuracy of the result by examining that the obtained function is indeed a density. The test will be passed in this case, because (3.6.11) is clearly nonnegative and we have
Next, suppose
(3.6.6)
4 is decreasing. Then we have from
(3.6.2)


P(Y < y) = PIX >
The same result can be obtained by using the distribution Fi%%etirm=K without using Theorem 3.6.1, as follows. We have
(3.6.13)

which can be rewritten as
(3.6.7)
G(y) = P(Y
< y)
= P(X'
< y)
= P(X
< 5)
G(y) = 1  F [ +  ' ( ~ ) ] .
Differentiating both sides of (3.6.'7), we obtain Therefore, differentiating (3.6.13) with respect toy, we obtain
The theorem follows from (3.6.5) and (3.6.8). R
2 can be generalized and stated formally& follows.1 will not work. In this example we have considered an increasing function.21) G(y) = F(y) . Therefore we have approximately Figure 3.16) we have (3. Since Y lies between y and y Ay if and only if X lies between x and x Ax.6.6. however. If AX is small then Ay is also small. 1 < x < 1.50 3 ( Random Variables and Probability Distributions 3.6 ( Change of Variables 51 Since we can make Ax arbitrarily small. Differentiating (3. and the area of (1) is approximately f (x)Ax and the area of (2) is approximately g(y)Ay. We haw + + (3.6. 4 6 ' The result of Example 3. Therefore + F lC U R E 3.17) in fact holds exactly. as it does not utilize the power of Theorem 3. .10 illustrates the result of Theorem 3.1.1 1 is helpful even in a formal approach. using the distribution function. and = x2 if X < 0.6. we get From (3.6. Figure 3. (3.6 )  2 =  2 +  O<y<l.6. or through the graphic approach.6. E X A M P L E 3. Therefore But if Ax is small.11 we must have area (3) = area (1) area (2). find g(y).1 0 Change of variables: onetoone case Therefore This latter method is more lengthy. the formula of Theorem 3. either through the formal a p proach. In the case of a nonmonotonic function. of being more fundamental.6.22) g(y) = f (y) + f( . shaded regions (1) and (2) must have the same area. It has the advantage. It is clear that we would need the absolute value of d$/dx if we were to consider a decreasing function instead.F ( .6.6. We shall first employ a graphic approach.6.6 ) . but we can get the correct result if we understand the process by which the formula is derived.6.21) with respect to y. we also have approximately (3.1. In Figure 3.15) g(y)Ay f (x)hx.15) and (3.2 Given f (x) = %.
O<z<l.5 X ) . 1 . We shall always use the method in which the distribution function is obtained first and then the density is obtained by differentiation. See Figure 3. Obtain the density of Z defined by Z = Y / X .6. Y 5 z ) . Calculate the density function g ( z ) of Z = max(X. hence.6. . Later in this section we shall discuss an alternative method. Since X and Y are independent. y) = 1 for 0 < x < 1.9. 2 Suppose the inverse of y = +(x) is multivalwd and can be written as = area for 0 < z < 1. Obtain the conditional density f (x I + .5. we conclude that g ( z ) = 22. 0 < y < 1 and Y = 0. This problem was solved earlier. Since the density g ( z ) is the derivative of the distribution function.26) with respect to z.12. called the h s u m e again f(x.5 So far we have studied the transformation of one random variable into another. A =5 2 1 =1areaB=lfor22 22 Note that nyindicates the possibility that the number of values of x varies with y. In the next three examples we shall show how to obtain the density of a random variable which is a function of two other random variables. 0 < y < 1 and = 0 otherwise. Let F(.4 F I G uR E 3 . y) = 1 for 0 < x < 1.1 for z r 1. Then <x  T IIE 0 ff E M 3 .6. in Example 3.X . The present solution is more complicated but serves as an exercise. Then the density g(. the probability of the two events is the same.6. where f (.27) f(z) = for o < z < 1.0. Define Z = Y . but the present method is more fundamental and will work even when the Jacobian method fails. we get (3. For any z. Y ) . 2z2 EXAMPLE 3.25) P(Z 5 z) = P(X 5 z)P(Y 5 z) =z2.) of Y is given by Differentiating (3. Then we have = 0 otherwise. we have (3. 6 . y) = 1 for 0 < 1 and 0 < y < 1. EXAMPLE 3. EXAMPLE 3. 0 < z < 1. but here we shall present an alternative solution using the distribution function.6 ( Change of Variables 53 Jacobian method.) be the distribution function of Z.6. the event ( Z 5 z) is equivalent to the event (X 5 z.6.1. 1 1 Change of variables: nononetoone case Let X and Y have the joint density f (x.4. which is a generalization of Theorem 3.6.3 Assume f(x.52 3 1 Random Variables and Probability Distributions 3.) is the density of X and +' is the derivative of +.
x2) be the joint density of a bivariate random variable (XI.6. 1 3 Domainofajointdensity =z + x + 0.4 FIGURE 3 .32) (3.x < z < 0.6 1 Change of Variables 55 I 0 F I G U R E 3.X2) and let (Y1. T H E O R E M 3 .5x. Suppose alla22 .13.54 3 1 Random Variables and Probability Distributions 3.5 .5 . . we finally get (3.6. z) is indicated by the shaded region in Figure 3. f(b1l~l+ b 1 2 ~ 2b .5.6.z)=l. 21~1 + b22~2) Ialla22 . 0.6. 3 Let f (xl. Therefore Theorem 3.6.5.35) g(yi.31). (3.6.6.a12a21# 0 so that (3. must be appropriately determined.Y2) be defined by a linear transformation Therefore. . y2) = f (x I Y = 0.a12a211 Then the joint density g(yl. 0.6. otherwise.1 2 1 X Illustration for Example 3. the range of (yl.x. n)over which g is positiw. from (3.5 + X) = f (x I Z = 0) = 2 =0 for 0 < x < 0.5x<z<0. that is. O<x<1.6. from (3..33) can be solved for XI and X2 as The domain of the joint density f (x. 6 . From (3.Y2) is given by Therefore. y2) of (Y1.1 to a linear transformation of a bivariate random variable into another.30) f(x.3 generalizes Theorem 3.29) and the marginal density of X.30) and (3.6. where the support of g.6.30) we get X2 = b21Y1 + bZ2Y2.
x 2 ) = 4x1x2f0r0 5 x 1 I . what is the joint density of (Yl. (~11x1 + ~12x2 + a11AX1. The linear mapping (3.( X I + Ax.) or the conditional probability P(yi x). . Thus the support of g is given as the inside of the parallelogram in Figure EXAMPLE 3. where the coordinates of its four cornerscounterclockwise starting from the southwest cornerare given by ( X I . . ( X I + Ax.39) I 2 0 5 .Y 1 + . q l X l + q 2 X 2 f aZ2AX2). f ( x I yi). and ( X I . the best way to characterize the relationship seems to be to specify either the conditional density f ( x y.6.6.X2).Xl + aZ2X2+ aglAXl + aZ2AX2).P(yi).Y2)? Solving (3.6. and P(yi 1 x) related to one another? ( 2 ) Is there a bivariate I I . That this is needed can be best understood by the following geometric consideration.X2.6.6.35).al2a21 is the determinant of the 2 X 2 matrix FIG u R E 3.6 Next we derive the support of g. Since 0 5 xl r 1 and 0 S % 5 1. i = 1.6 Suppose f ( x l . we obtain  . whose coordinates are given by (allX1 + a12X2. .a12asllappearing on the righthand side of (3.6.). 2. (allX1 + 42x2 + allAX1 + a12AX2. and (allX1 + a12X2+ a12AX2.and if we suppose for simplicity that all the a's are positive and that alla22 . If we assume that X and Y are related to each other.56 3 1 Random Variables and Probability Distributions 3. Chapter 11 shows that alla22 . I 4 Illustration for Exampk 5. 3.6. Let X be a continuous random variable with density f ( x ) and let Y be a discrete random variable taking values yi.35) is called the Jacobian of transformation. X Z ) .7 1 joint Distribution 57 The absolute value lalla22.     .. we have from (3. a22X21. . with probabilities P(yi).7 J O I N T DISTRIBUTION OF DISCRETE AND CONTINUOUS RANDOM VARIABLES 1.al2a21 > 0. Y2 = X1 .1 and0 5 x 2 5 3.we immediately obtain In Section 3.14. then the area of the parallelogram must be (~11~2 2 a12a21)8X1AX2.X:! + AX2).33) maps this rectangle to a parallelogram on aZ1X1 + the YlY2 plane.6.36) for X1 and X2. x 2 + AX.3 can be generalized to a linear transformation of a general nvariate random variable into another.6.4 we studied the joint distribution of continuous random variables. The area of the rectangle is AX1AX2. a21X1 + a22X2 + a21AX1).31) (3.6. In this section we ask two questions: (I) How are the four quantities f ( x ) .2 we studied the joint distribution of discrete random variables and in Section 3. Consider a small rectangle on the X l l X 2 plane. n. In some applications we need to understand the characteristics of the joint distribution of a discrete and a continuous random variable. Theorem 3. Inserting the appropriate numbers into (3..Y2 5 1 3 3 B y using matrix notation. If (3.36) Y l = Xl + 2x2 .a2.
7..7.1 by probability axiom (3). 1 x) involves the conditioning event that happens with zero probability.4. y. Suppose the density of X is given by f(x) = 1.8 and 1. x 5 X .n) and S = {y.: for any a 5 b. .I i E I}? Note that.5). Y < 0.3) Let the joint density of (X. 0 (a) P(X < 0. x > 0.4. Thus f (x I y. = = x 2EI 5. Y E S) = Sb. 2. so does P(y.6) P(a 5 X 5 b. and Z be binary random variables each taking two values. O<y<l.2. P(a 5 X 5 6 . O<x<l. (b) Calculate P(0 < Y < 3/4 1 X = 0. . Hence. f(r(~)=2xy+2(1x)(ly). where I is a subset of integers (1.3 and 3. 1 x 5 X 5 x + E) as E goes to zero. (Section 3.) by Theorem 2. Since the conditional probability P(y. I .)P(y.)P(Y = yt) by Theorem E 5 x + €1 Given the density f (x.3) Let X be the midterm score of a student and Y be his final exam score.4.) defined in the second question. like any other conditional density defined in Sections 3. y) = 2(x + y) . (c) P(Y < 0.58 3 1 Random Variables and Probability Distributions I Exercises 59 function +(x. calcluate 3. x)f ( 4 . which provides the answer to the fim question. Next consider What is the probability that he will get an A on the final if he got an A on the midterm? 4. (b) Y = x2.5. we need to define it as the limit of P(y.2).) must satisfy where S = {y. 1 or 0. . ( 2 E I}. CzEI$(x. (b) P(X < 0. y.) such that P(a 5 X 5 b. . Y) be given by (a) Calculate marginal density f (x). ) I f( 4 by the mean d u e theorem. < x < 1. Y E S ) Given f (x) = exp(x). . by (3.61 + E) x * I Y = y. The score is scaled to range between 0 and 1. (Section 3.5).5). y.1) and the conditional density of Y given X is given by . 2.+o P(x 5 X 5 x P(x 5 X = lim E+O by Theorem 2. Thus we have EXERCISES 1. find the density of the variable (a) Y = 2 X + 1 .5). f (x I y. (Section 3. Specify ajoint distribution in such a way that the three variables are pairwise independent but not mutually independent. Y . 0 < y < x.f (x Y ~ ) ~ ( Y .) plays the role of the bivariate function +(x. (Section 3.4.1 by (3.4.3) Let X.4. and grade A is given to a score between 0. Y = y.)dx. O< x< 1 P(x 5 X 5 x + E) 2.3)  = lim P(Y = y. (Section 3.
= m and C+ is finite. . < 0. . Let Xi be the payoff of a particular gamble made for the ith time. 0 =XY.1. (Section 3. (Section 3.) + Cx.) 6. Define Y by Y=X+l = ifO<X<l if 1 2X < X < 0. We shall define the expected value of a random variable. if a fair coin is tossed and we gain one dollar when a head comes up and nothing when a tail comes up. y) = 1.1 . 2.P(x.2. If X is the payoff of a gamble (that is.). (Section 3.) if the series converges absolutely.5 for 1 4 MOMENTS < x < 1.P(x.60 3 1 Random Variables and Probability Distributions (c) Y = 1/X. then we say EX = m. Then the average gain from repeating this gamble n times is ~ .1.is finite.6) Assuming f (x. < x < 1. This is a consequence of . 0 < y < 1.. the expected value of this gamble is 50 cents. is defined to be EX = Z~="=. If C. t > 0. denoted by EX. If C+ = and C. 7. Obtain the density of Y.1 EXPECTED VALUE 8. We can formalize this statement as follows.)= GO and Cx.P(x. Let X be a discrete random variable taking the value x. .' z $ ~ xand ~ .). Find the conditional density of X given Y if X = U and Y = U + V.P(x. I The expected value has an important practical meaning. We can write EX = C+x. (The symbol log refers to natural logarithm throughout.6) Let X have density f (x) = 0. first. if you gain xi dollars with probability P(xi)). with probability P(x.P(x.1 and. (d) Y = log X. it converges to EX in probability.6) Suppose that U and V are independent with density f (t) = exp(t). we say EX does not exist. It means that if we repeat this gamble many times we will gain 50 cents on the average.) = m. obtain the density of Z 4. then we say EX = GO. D E F I N I TI o N 4. for a discrete random variable i n Definition 4. . If C+x. where in the first summation we sum for i such that x. for a continw>m random variable in Definition 4. the expected value signifies the amount you can expect to gain on the average. Then the expected value (expectation or mean) of X. i = 1. For example. > 0 and in the second summation we sum for i such that x. second.
because it means minimizing the maximum loss (loss may be defined as negative gain).1. EX = a.1. J ! x f (x)dx = m. J ! x f (x)dx = m and 5.2. the utility function is logarithmic. being a measure of its central location. We can think of many other strategies which may be regarded as reasonable by certain people in certain situations. If.) EX is sometimes called the populatzon mean so that it may be distinguished from the sample mean.  Besides having an important practical meaning as the fair price of a gamble. and the median m is defined by the equation P(X 5 m) = %. Coming back to the aforementioned gamble that pays one dollar when a head comes up. decision making under uncertainty always means choosing one out of a set of random variables X(d) that vary as d varies within the decision set D. we write EX = m. To illustrate this point. where U denotes the utility function. Obviously. EU(X). the expected value is a very important characteristic of a p r o b ability distribution. We shall learn in Chapter '7 that the sample mean is a good estimator of the population mean. whereas a risk averter may pay only 10 cents. Petersburg gamble for any price higher than two dollars. however. A good. The decision to gamble for c cents or not can be thought of as choosing between two random variables XI and X P . DEFlN I T I O N 4 . By changing the utility function. Hence.2. Such a strategy is called the minimax strategy. by Definition 4. we write EX = m. the real worth of the St. For the exact definition of convergence in probability. (More exactly. More generally.)I = log 4. then mode (X) < m < EX. an extremely riskaverse person may choose the d that maximizes minX(d). the expected value of X. we may say that the fair price of the gamble is 50 cents. For example.1 m EX A positively skewed density EU(X) for some U. we say the expected value does not exist. is defined to be EX = x f (x)dx = c~ and $2 x f (x)dx if the integral is absolutely convergent. A risk taker may be willing to pay as much as 90 cents to gamble. it is the sample mean based on a sample of size n. 2 Let X be a conthuous random variable with density f (x). however. A coin is tossed repeatedly until a head comes up. Not all the decision strategies can be regarded as the maximization of I I I mode F I G U R E 4. consider the following example. One way to resolve this paradox is to note that what one should maximize is not EX itself but. The quantity ~'C. 1 . Such a person will not undertake the St. is called the sample mean. The payoff of this gamble is represented by the random variable X that takes the value 2' with probability 2'.1.1 1 Expected Value 63 Theorem 6. see Definition 6. If the density is positively skewed as in Figure 4. simple exposition of this and related topics can be found in Arrow (1965). where XI takes value 1 with probability % and 0 with probability % and X2 takes value c with probability 1. If . the utility of gaining four dollars for certainty.1. . Then. and 2' dollars are paid if a head comes up for the first time in the ith toss. for example. Petersburg gamble is merely E log X = log 2 . Here X(d) is the random gain (in dollars) that results from choosing a decision d. The other important measures of central location are mode and median.! " J x f (x)dx is finite. one can represent various degrees of riskaverseness. however. then y = EX = m = mode (X). If J : xf (x)dx = w and . that everybody should pay exactly 50 cents to play. If the density function f (x) is bellshaped and symmetric around x = p. If . z ('/.1. Choosing the value of d that maximizes EX(d) may not necessarily be a reasonable decision strategy. nobody would pay m dollars for this gambIe. denoted by EX. Petersburg Paradox. where minx means the minimum possible value X can take. The mode is a value of x for which f (x) is the maximum. This does not mean." because the Swiss mathematician Daniel Bernoulli wrote about it while visiting St. Z = :. How much this gamble is worth depends upon the subjective evaluation of the risk involved.=~X. The three measures of central location are computed in the following examples.62 4 I Moments 4. Petersburg Academy. xf (x)dx is finite. This example is called the "St. rather.
1.) is continuous.. m ..2 show a simple way to calculate the expected value of a function of a random variable.15).1 1 Expected Value 65 f(x) = 2 ~ .4. Note that the density of Example 4. EXY = EXEY. y. If +(. O Let X be a continuous random variable with densityf (x) and let +() be a function for which the integral below can be defined. If X and Y are random variables and L X    We shall not prove this theorem. Then T H E o R E M 4. .7 when (X. y)..1.2 and Theorems 4.1.64 4 I Moments 4.and let +() be an arbitrary function. The same value is obtained by either procedure.3) T H E O R E M 4. The mode is again 1.1 . i = 1.1 and then using Theorem 4. . 1 . then the proof is an easy consequence of approximation (3.2.ThenEX > =ST 2x'dx= 2.Since 1/2 = 1 : X'dx = . Therefore m = 6The mode is clearly 1. j = 1. .1.1. Then we have (4. yldxdy. Although we have defined the expected value so far only for a discrete or a continuous random variable.) be an arbitrary function. Y)= Pro@ Define Y = + ( X ) Then Y takes value +(xi) with probability P(xi). be an arbitrary function.1. . The following three theorems characterize the properties of operator E.1) follows from Definition 4.1 + Theorem 4.1. 1 .).~ f o r x1> . and P are EY = r If X and Y are independent random variables. E+(X) can be computed either directly from (4.1.1.1.2 r m m +(x. y).6.1. . Then Theorems 4.1 and 4. Y) be a bivariate continuous random variable with joint density function f (x. which we state without proof.5) above or by first obtaining the marginal density f (x) by Theorem 3. provided that the expected values exist. Y) be a bivariate discrete random variable taking value (x.2.1.2 has a fatter tail (that is. T H E O R E M 4. 2.ThenEX=sTx'dx=m. THE 0REM 4.1. we have m = 2. and monotonic. and let +(.1.The median m must satisfy '/2 = 2xW3dx = m* 1.1. The proofs of Theorems 4. the density converges more slowly to 0 in the tail) than that of Example 4. Let Y = +(X) and denote the density of Y by g(y).5 is trivial.4. Then a) +(x. E X A M P L E 4.).2 Let (X.1.) with probability P(x. y)f (x.3 and 4. Y) is either discrete or continuous follow easily from Definitions 4. . i.1. Then T H E O R E M 4.. because the proof involves a level of analysis more advanced than that of this book. . T H E 0 R E M 4 .1.  Note that given f (x.m 1 + 1.6 constants. yg(y)dy = lim Zyg(y)Ay = lim Z+(x)f(x)Ax The proof of Theorem 4.~ f o r x1. and let +(. 2. E(aX T H E o R E M 4 .1.1.1. y.1. . Therefore (4. Let X be a discrete random variable taking value x. 3 f(x) = ~ . with probability P(x.1.1. which has pushed both the mean and the median to the right.5 Ifaisaconstant.1. .1 The following is a similar generalization of Theorem 4. 4 Let (X. affecting the mean much more than the median.1 can be easily generalized to the case of a random variable obtained as a function of two other random variables. differentiable. the following theorems are true for any random variable. E X A M P L E 4. E a = a .1 and 4.7 + PY) = clEX + PEY.6 and 4.1.


66
4
I
Moments
4.2
(4.1.6)
1
Higher Moments
67

Theorem 4.1.7 is very useful. For example, let X and Y denote the face numbers showing in two dice rolled independently. Then, since EX = E Y = 7/2, we have EXY = 49/4 by Theorem 4.1.7. Calculating EXY directly from Definition 4.1.3 without using this theorem wodd be quite timeconsuming. Theorems 4.1.6 and 4.1.7 may be used to evaluate the expected value of a mixture random variable which is partly discrete and partly continuous. Let X be a discrete random variable taking value x, with probability P ( x , ) , i = 1, 2, . . . . Let Y be a continuous random variable with density f ( y ) . Let W be a binary random variable taking two values, 1 and 0, with probability p and 1  p, respectively, and, furthermore, assume that W is inde~endent of either X or Y. Define a new random variable Z = W X + 1 (1  W)Y. Another way to define Z is to say that Z is equal to X with probability p and equal to Y with probability 1  p. A random variable such as Z is called a mixture random variable. Using Theorems 4.1.6 and 4.1.7, we have EZ = EWEX E ( l  W ) E Y . But since EW = p from Definition 4.1.1, we get EZ = pEX + ( 1  P)EY. We shall write a generalization of this result as a theorem.
Y
=
0 if 5
5
5
X < 10,
5

if 10 5 X
15.
Clearly, Y is a mixture random variable that takes 0 with probability % and a continuum of values in the interval [2,3] according to the density f ( y ) = 1/2. Therefore, by Theorem 4.1.8,we have
(4.1.7)
EY=O.+
:/:
[/lo 5
5 ydy=.
4
Alternatively, EY may be obtained directly from Theorem 4.1.2 by taking I $ to be a function defined in (4.1.6).Thus
(4.1.8)

E Y = + ~( X ) ~ ( X ) ~ X = $ ~ ; + ( X ) ~ X
 rn 
+
10 10
+(x)dx
+
r5
10

$(x)~x]
5
T H E O R E M 4.1.8 Let X be a mixture random variable taking discrete value xi, i = 1, 2, . . . , n, with probability pi and a continuum of values in interval [a, b] according to densityf ( x ) : that is, if [ a , b] 3 [xl, x2],P ( x I I X 5 x2) = 5:: f(x)dx. Then EX = Cy=lxipi + $b, xf ( x ) d x . (Note that we must have Z,"_,p, 5: f ( x ) d x = 1.)
4.2
HIGHER MOMENTS
+
The following example from economics indicates another way in which a mixture random variable may be generated and its mean calculated.
E X A M P L E 4 . 1 . 3 Suppose that in a given year an individual buys a car if and only if his annual income is greater than 10 (ten thousand dollars) and that if he does buy a car, the one he buys costs onefifth of his income. Assuming that his income is a continuous random variabIe with uniform density defined over the interval [5, 151, compute the expected amount of money he spends on a car. Let X be his income and let Y be his expenditure on a car. Then Y is related to X by
As noted in Section 4.1, the expected value, or the mean, is a measure of the central location of the probability distribution of a random variable. Although it is probably the single most important measure of the characteristics of a probability distribution, it alone cannot capture all of the characteristics. For example, in the cointossing gamble of the previous section, suppose one must choose between two random variables, X 1and X2, when XI is 1 o r 0 with probability 0.5 for each value and X2is 0.5 with probability 1. Though the two random variables have the same mean, they are obviously very different. The characteristics of the probability distribution of random variable X can be represented by a sequence of moments defined either as
(4.2.1)
kth moment around zero = E x k
or
(4.2.2)
kth moment around mean = E ( X
 23x1~.
68
4
I
Moments
EXAMPLE 4.2.2
4.2
1
Higher Moments
69
Knowing all the moments (either around zero o r around the mean) for k = 1, 2, . . . , is equivalent to knowing the probability distribution completely. The expected value (or the mean) is the first moment around zero. Since either xk or (X  EX)^ is clearly a continuous function of x, moments can be evaluated using the formulae i n Theorems 4.1.1 and 4.1.2. As we defined sample mean in the previous section, we can similarly define the sample klh moment around zero. Let X I , X2, . . . , Xn be mutually independent and identically distributed as X. Then, ~  ' Z , " , ~ X :is the sample kth moment around zero based on a sample of size n. Like the sample mean, the sample kth moment converges to the population kth moment in probability, as will be shown in Chapter 6. Next to the mean, by far the most important moment is the second moment around the mean, whi variance of X by VX, we have
DEFINITION 4.2.1
X has density f(x) = 1/(2a), a
< x < a.
that the d n c e Note that we previously obtained EX = 2, which S ~ O W S is more strongly affected by the fat tail.

Examples 4.2.4 and 4.2.5 illustrate the use of the: s e c o d farmula of Definition 4.2.1 for computing the variance.
A die is loaded so that the probability of a given face turning up is proportional to the number on that face. Calculate the mean and variance for X, the face number showing. We have, by Definition 4.1.1,
E X A M PL E 4.2.4

VX
=
E(X

EX)^ EX)^.
=
EX^
 
 



The second equality in this de the squared term in the above and using Theorem 4.1.6. It gives a more convenient formula than the first. It says that the variance is the mean of the square minus the square of the mean. The square root of the variance is called the standard deviation and is denoted by a. (Therefore variance is sometimes denoted by a%nstead of V.) From the definition it is clear that X = 0 if and only if X = EX VX r 0 for any random variable and that V with probability 1. The variance measures the degree of dispersion of a probability distribution. In the example of the cointossing gamble we have V X I = '/4 and VX2 = 0. (As can be deduced from the definition, the variance of any constant is 0.) The following three examples indicate that the variance is an effective measure of dispersion.
EXAMPLE 4 . 2 . 1
Next, using Theorem 4.1.1,





Therefore, by Definition 4.2.1,
(4.2.5)
V X = 21

169 = 20 9
9
.
X has density f ( x ) = 2(1  x) for 0 < x otherwise. Compute VX. By Definition 4.1.2 we have
E X A M P L E 4.2.5
< 1 and = 0
X=a
= a
with probability % with probability % B y Theorem 4.1.2 we have
I
V X = EX' = a2.

70
4
1
Moments
4.3
1
Covariance and Correlation
71
(4.2.7)
EX'
=
2
1 /' (x2  x3)dx =  . o 6
Therefore, by Definition 4.2.1,
6, we can show that the sample covariance converges to the population covariance in probability. It is apparent from the definition that Cov > 0 if X  EX and Y  EY tend to have the same sign and that Cov < 0 if they tend to have the opposite signs, which is illustrated by
EXAMPLE 4 . 3 . 1
The following useful theorem is an easy consequence of the ckfmiticm of the variance.
THEOREM 4 . 2 . 1
(X, Y) = (1, 1
= ( 1, 1) =
with probability 4 2 , with probability 4 2 , with probability (1  a)/2, with probability (1  a ) / 2 .
IfaandPareconstants,wehave
(1,  1
( 11
Note that Theorem 4.2.1 shows that adding a constant to a random variable does not change its variance. This makes intuitive sense because adding a constant changes only the central location of the probability distribution and not its dispersion, of which the variance is a measure. We shall seldom need to know any other moment, but we mention the third moment around the mean. It is 0 if the probability distribution is symmetric around the mean, positive if it is positively skewed as in Figure 4.1, and negative if it is negatively skewed as the mirror image of Figure 4.1 would be.
4.3
=
Since EX = EY
=
0,
= a 
Cov(X, Y) = EXY
(1  a ) = 2a  1.
Note that in this example Cov = 0 if a = %, which is the case of independence between X and Y . More generally, we have
T H E0 RE M 4 . 3 . 1
If X and Y are independent, Cov(X, Y) = 0 provided that VX and W exist.
COVARIANCE A N D CORRELATION
Covariance, denoted by Cov(X, Y) o r uxy, is a measure of the relationship
between two random variables X and Y and is defined by
DEFINITION 4.3.1
The proof follows immediately from the second formula of Definition 4.3.1 and Theorem 4.1.7. The next example shows that the converse of Theorem 4.3.1 is not necessarily true.
EXAMPLE 4 . 3 . 2
COV(X,Y) = E[(X  EX)(Y  EY)] =EXY  EXEY. Let the joint probability dbtdb~tion of @ , Y ) be @v&n by
The second equality follows from expanding (X  EX) (Y  EY) as the sum of four terms and then applying Theorem 4.1.6. Note that because of Theorem 4.1.6 the covariance can be also written as E[(X  EX)Y] or E[(Y  EY)X]. Let (XI,Y 1 ) , (X2,Y2), . . . , (Xn,Yn) be mutually independent in the sense of Definition 3.5.4 and identically distributed as (X, Y). Then we define the sample covariance by n'~in,,(x~  2)(Yi  p), where 2 and P are the sample means of X and Y , respectively. Using the results of Chapter

and the same variance of return..2 that the conclusion of Theorem 4.3.2.3.3 Theorem 4. f o r O < x < l = It is clear from Theorem 4. what will be the mean and variance of the annual return on your portfolio? (b) What if you buy two shares of each stock? Let X.2 gives a useful formula for computing the variance of the sum or the difference of two random variables. Then where the marginal probabilities are also s h m ..4 illustrate the use of the second formula of Definition 4.1. Y) be given by The proof follows immediately from the definitions of variance and covariance.6. 3 . we can easily show that the variance of the sum of independent random variables is equal to the sum of the variances. Consumption) is larger if both variables are measured in cents than in .3. which we state as T H Eo R E M 4 . Cov(X. Therefore.3 holds if we merely assume Cov(X.S There are five stocks. Y).1.= . N = 5/8..72 4 I Moments 4. (a) If you buy ten shares of one stock. 2. Assume that the returns from the five stocks are pairwise independent.3.3.by Definition 4.3.6. (a) E(lOX.1.by symmetry 12 7 : .S/16 = ' /..3. Combining Theorems 4.1 and 4. .)= 10p by Theorem 4. be pairwise independent. 3  Let X. EY = . Examples 4.1 and Theorem 4.1. Let the joint probability distribution of (X. and EXY = 1/4.3.3 1 Covariance and Correlation 73 Clearly.) = 100u2 by Theorem 4. Calculate Cov(X.3. Y) = '/4 . As an application of Theorem 4.3. X and Y are not independent by Theorem 3.3.3. by Definition 4. consider E X A M P L E 4 .3 and 4.. E X AM P L E 4. each of which sells for $100 per share and has the same expected annual return per share. 3 . n.3. Cov(Income. Y) .3. EXAMPLE 4.2. Y) = EXY = 0. (b) E(2 c&~x. .) = 0 for every pair such that i f j. 2 V(X 'C Y) = V X + VY 'C 2 COV(X. We have  : . Then. but we have Cov(X. 3 .2.2. u2. W have EX = 1/2.1 for computing the covariance.1. and V(lOX. . 49 1 Cov(X.) = 10p by Theorem 4. p. i = 1. f(x.3.1. Y ) = . 3 144 144 1 A weakness of covariance as a measure of relationship is that it depends on the units with which X and Y are measured. . be the return per share from the ith stock. THEOREM 4 ... 0 otherwise.3.X. and v(~c:=~xJ= 20a2 by Theorem 4. For example.y) = x + y .4 Letthejointdensitybe and O < y < 1.
we have (4. PY) = Correlation (X.3. Y) = Cov(X.EX) . Y). X. the minimum meansquarederror linear predictor) of Y based on X is given by a* P*X. we have (4.74 4 I Moments 4.3. we obtain (4.3). u x ' UY Correlation is often denoted by pxy or simply p. E[(X .8) and (4.X(Y  2 0 for any A. The problem can be mathematically formulated as (4. we obtain the CauchySchwartz inequality Thus we have proved The theorem follows immediately from (4.3.4) Minimize E(Y . 5 aa . If p > 0 (p < 0). Expanding the squared term. 3 . Y) .3 1 Covariance and Correlation 75 dollars.3. Define = a* + P*X and U = Y ?. as If a and P are nonzero constants.a  PX)' with respect to a and P.6) and (4.2).2a  2EY + PPEX = 0 s 5 1.2 . we can write the minimand. 3 . Y .3. putting X = Cov/VY into the lefthand side of (4. Since the expected value of a nonnegative ran&m wriab$e is nonnegative. c 6 Correlation (X. Solving (4. We have . We shall interpret the word best in the sense of minimizing the mean squared error of prediction. Equating the derivatives to zero.3. and In particular. we get Expanding the squared term. we say X and Y are positively (negatively) correlated.1) as a ZPEX'  2EXY + 2aEX = 0. The latter will be called either the prediction error or the residual.2k Cov 2 0 for any k.7) = Proof.3.3. defined by DEFINITION 4.3. among all the possible linear functions of another random variable. where a* and p* are defined by (4. O If p = 0. This problem has a bearing on the correlation coefficient because the proportion of the variance of Y that is explained T H E0 RE M 4 .3.3. We also have THEOREM 4 . 6 The best linear predictor (or more exactly. we say X and Y are uncorrelated. We next consider the problem of finding the best predictor of one random variable. denoted by S. This weakness is remedied by considering cumlation (coeflcient).3.7) simultaneously for a and P and denoting the optimal values by a* and P*.6)  Correlation(aX. It is easy to prove T H E OR E M 4 . We shall solve this problem by calculus. 3 . and (4. 4 by the best linear predictor based on X turns out to be equal to the square of the correlation coefficient between Y and X. + Next we shall ask what proportion of VY is explained by a* + P*X and what proportion is left unexplained.9).2) V X + X'VY . as we shall see below.
using the conditional probability defined in Section 3.2.2. i. y. f') and the part which is uncorrelated with X (namely. 4 .4. consider Example 4. When a = 1/2. Therefore p = 0.) be an arbitrary function. . Y) .3.1 by (4.p2 proportion to the second part. We call W the mean squared prediction errorof the best linear predictor. Let +(.11). y. or it may be regarded as a random variable. given X.1 t i = p2VY by Definition 4. .2. we can take a further expectation of it using the probability distribution of X.4. Let +(. In the next section we shall obtain the best predictor and compare it with the best linear predictor. a p2 proportion of the variance of Y is attributable to the first part and a 1 . I XI. As a further illustration of the point that p is a measure of linear dependence. Then the conditional mean of +(X.1 and 4. We have A nonlinear dependence may imply a very small value of p. the degree of linear dependence is at the minimum. It may be evaluated at a particular value that X assumes.4 = W + (P*)~vx. denoted by E[+(X.3. (4. This may be thought of as another example where no correlation does not imply independence. there is an exact linear relationship between X and Y with a positive slope. If we treat it as a random variable.1. (4.2 and 3.3. p p Let (X. This result suggests that the correlation coefficient is a measure of the degree of a linear relationship between a pair of random variables.3.I @ by Definition 4. Let (X.8) and (4.. be an arbitrary function.2 = (1  p2)w by Definition 4.1. .1. j = 1.2. Y) = EXY = Ex3 = 0. 4.3. Y ) be a bivariate continuous random variable with conditional densityf (y 1 x).3. .3. Y) C O N D I T I O N A L M E A N A N D VARIANCE by Theorem 4. .3. We also have In Chapters 2 and 3 we noted that conditional probability and conditional density satisfy all the requirements for probability and density. Here we shall give two definitions: one for the discrete bivariate random variables and the other concerning the conditional density given in Theorem 3.2 and the various types of conditional densities defined in Section 3.2. U). Combining (4. we can define the conditional mean in a way similar to that of Definitions 4. Since V X =V Y = 1 in that example.1 = P*Cov(X.10). and also suppose that X has a symmetric density around EX = 0. Suppose that there is an exact nonlinear relationship between X and Y defined by Y = x2. Then the conditional mean of +(X..4. When a = I. Y) =0 I @ by Definition 4. Y).76 4 I Moments 4. 2. being a function of the random variable X. Then Cov(X. Y) I XI or by Edx+(X. is defined by D EF l N l T I 0 N 4. Y) given X is defined by DEFl N IT10 N 4 . Therefore..10).12).3. Y ) given X. p = 2a .2P* Cov(X. we can say that any random variable Y can be written as the sum of the two partsthe part which is expressed as a linear function of another random variable X (namely. there is an exact linear relationship with a negative slope. 1 X) be the conditional probability of Y = y.3. . The following theorem shows what happens.10) V? = (P*)~vx by Theorem 4. 2 a) The conditional mean Eylx+(X.3.3. and (4.4 1 Conditional Mean and Variance 77 (4. Let P(y.1 m = COV(?.Y) is a function only of X. Y ) be a bivariate discrete random variable taking values (x.3.).)P(y.4. When a = 0.3.1 again.3.1) EYlX+(X> Y) = ]=I C + ( X .
2 can be alternatively obtained using the result of Section 3.4.7).4. THEOREM 4 . This problem may be solved in two ways. use Definition 4.6) and (4.1 4.) Proof: We shall prove it only for the case of continuous random variables. use Theorem 4.E X ( E Y ~ X + ) ~ .6) ExVylx+ = ~4~.4.4) V 4 ( X . f ( y I x) Supposef(x) = l f o r O < x < 1 and =Ootherwiseand < y < x and = 0 otherwise. Y ) = ExVylx4(X.4. 1 ( l a w o f i t e r a t e d m e a n s ) E 4 ( X . 4 . Y ) ..4) in computing the unconditional mean and variance.1. First. 4 . (The symbol Ex indicates that the expectation is taken treating X as a random variable. we have (4. Proof Since P(Y=OIX=x)=lx.1: = x' for 0 Second.4 1 Conditional Mean and Variance 79 THEOREM 4 . Find EY and W. Y ) f VxEq&(X. \ \ The following examples show the advantage of using the righthand side of (4. x. 4 .4.  78 4 I Moments E X A M P L E 4. 2 But we have Therefore. The result obtained in Example 4. the proof is easier for the case of discrete random variables. We have . .4. Calculate EY. O < x 1. 2 ThemarginaldensityofXisgiven by ffx) = 1 . The conditional probability of Y given X is given by P(Y = 1 IX = x) = < (4.4. Y ) = ExEdx4(X. by adding both sides of (4.2: The following theorem is sometimes useful in computing variance. It says that the variance is equal to the mean of the conditional mriance plus the variance of the conditional mean.3) and (4.4.7. EY = 1 ExEylxY = ExX = . 2 EXAMPLE 4 . as follows.4.
4. where the crossproduct has dropped out because Therefore (4.9): EY = 7 / 2 .8) and (4. = 0 if Y is odd.2.3 1(Y 1 X). VX = 16/3. Let Y be the face number showing.E(Y I X ) ] + [E(Y I X ) . .80 4 1 Moments 4. We have 1 ) E[Y  (b(x)12 = E{[Y . W = 35/12. Cov = 7 / 3 .2. Put ?= ( 2 1 / 8 ) + ( 7 / 1 6 ) X . P(X = 0 . Therefore  Minimize E[Y  (b(x)12 with respect to +. E n d the best predictor and the best linear predictor of Y based on X. The following table gives E(Y X). 4 .10) To compute the best linear predictor. EY' = 91/6. P(X = 1 . 4 . EX = 2. there is a simple solution. Y = 1) = Pol = 0. E X A M P L E 4 . In the previous section we solved the problem of optimally predicting Y by a linear function of X.3. using the definition of the mean and the variance of a discrete random variable which takes on either 1 o r 0. Y = 0 ) = Poo = 0. The problem can be mathematically formulated as (4.11) is clearly minimized by choosing + ( X ) Thus we have proved T H EoR EM 4. 3 A fair die is rolled.4. P(X = 0 .3.4 1 Conditional Mean and Variance 81 In the next example we shall compare the best predictor with the best linear predictor. We shall compute the mean squared error of prediction for each predictor:  . we must first compute the moments that appear on the righthand sides of (4. I X 0 2 4 6 Therefore the mean and variance of Y can be shown to be the same as obtained above.2.4. 4 Let the joint probability distribution of X and Y be given as follows: P ( X = 1. E X A M P L E 4 .3. Here we shall consider the problem of optimally predicting Y by a general function of X. the minimum meansquarederror predictor) of Y based on X is given by E(Y I X ) . Define X by the rule X = Y if Y is even.+(x)])' where the values taken by Y and X are indicated by empty circles in Figure 4. Y = 1 ) = Pll = 0. EX' = EXY = 28/3.Thus we have Despite the apparent complexity of the problem. Y = 0 ) = Plo = 0. Obtain the The best predictor (or more exactly.3.
This result shows that the best predictor is identical with the best linear predictor. 4. which is a linear function of X.4. (Section 4.12) we readily obtain E(Y I X) = 0.82 4 I Moments = E. (Section 4.11) we obtain (4. Find V(X I Y). Compute EX and VX assuming the probability of heads is equal to p.4 0.2) Let X be the number of tosses required until a head comes up. From (4. and in the second line ten minutes.2 Comparison of best predictor and best linear predictor best predictor and the best linear predictor of Y as functions of X and calculate the mean squared prediction error of each predictor.8) and (4.2) Let the probability distribution of (X. Inserting these values into equations (4. (Section 4.1) A station is served by two independent bus lines going to the same destination.2X. and Cov(X.9) yields P* = 0. The moments of X and Y can be calculated as follows: EX = EY = 0. Y) be given by PII/(PII + Pio) Poi/ (Poi + Poo).3.12) . Both equations can be combined into one as (4. 3.Ex[E(Y x ) ~ ] + I 2YE(Y ( X ) ] I Best linear predictor. From (4.25.4.yEyix[~' E(Y x ) ~ = EY2 . Its mean squared prediction error (MSPE) can be calculated as follows: + .3.05. You get on the first bus that comes.2) Let the density of X be given by Best predictor. VX = W = 0.5.4.X). EXERCISES F I G U R E 4. In the first line buses come at a regular interval of five minutes. We have E(Y 1 X = 1) and E(Y I X = 0) = = 1. What is the expected waiting time? 2.14) MSPE =W  Cov(X9 vx v2= ap4.2 and a* = 0. but as an illustration we shall obtain two predictors separately. (Section 4.4. 1 % 3/s E(Y I X) = [ P I I / ( ~ I+ I PIO)IX+ [ P o I / ( ~ o+ I Poo)] (1 . Y) = 0.3.
(b) Are X + Y and X . find EZ and E(Z I X = 1). Y). and Cov(X. EY = 2. 10.3) Suppose X and Y are independent with EX = 1.4) With the same density as in Exercise 6. (Section 4. Supply your own definition of the phrase "no value in predicting X. each of which takes only two Let X.4Prediction) Let the joint probability distribution of X and Y be given by I (a) Show that X + Y and X . 11 and for some c in (0. otherwise. and W = 1. but Y is of no value in predicting X.4. (Section 4.4Prediction) Give an example in which X can be used to predict Y perfectly. y) = 2 for 0 x.4Prediction) Y = 1.4) . P ( Z = 1 1 Y = 1) = 0. 13. 5. Compute VX and Cov(X.   8.84 4 I Moments = I Exercises for 0 85 f( 4 x < x < 1. Given P(X = 1) = 0. X) = 0." 17. 7. + y for 0 < x < 1 and 0 < < x < 1 and 0 < y < Obtain the best predictor and the best linear predictor of Y as fumtions of X and calculate the mean squared prediction error for eack predictor.3) Let (X.4Prediction) Let X be uniformly distributed over [0. P(Y = 1 1 X = 1) = 0. Obtain Cov(X. VX = W = 2. Calculate Cov(Z. find the best linear predictor of Y based on 2. 1) define .4) Let X = 1 with probability p and 0 with probability 1 . y) = x y < 1. Y). Let the conditional density of Y given X = 1 be uniform over 0 < y < 1 and given X = 0 be uniform over 0 < y < 2. P(Y = 1 1 X = 0) = 0. If we Suppose EX = EY = 0. Y ) begiven by < x < 1 and 0 < y < 1. 14. (Section 4. 12. =2x = forl<x<2. Y). Determine a and p so that V(aX + PY) = 1 and Cov(aX PY.3) Let EX = EY = 0. (Section 4. Compute Cov(X. (Section 4. and Z be random variables. and Cov(X. Assume that the probability distribu 16. Define Z = X + Y and W = XY. (Section 4. (Section 4. Y values: 1 and 0. (Section 4. y) = 1 for 0 Calculate VX. Y) = 0. Y) have joint density f (x. obtain E(X Y = X I + 0.(20/19)Y are uncorrelated.4) Let f (x. + +   . Y) have joint density f (x. Y) = 1.5.5). VX = V define Z = X Y. VX = 1. (Section 4. (Section 4.p. (Section 4.5. (Section 4.7. W). (Section 4. 15.3.6.3) Let (X.3) Let the probability distribution o f (X.. 9.(20/19)Y independent? 6. Obtain E(X I X < Y). 0 11. P(Z = 1 1 Y = 0) = 0.
Compute their respective mean squared prediction errors and directly compare them. 2. 1 .1. (Section 4. which is called a binary or Bernoulli random variable.4Prediction) Suppose U and V are independent with exponential distribution with parameter A. Such a random variable often appears in practice (for example. p). Symbolically we write X . denoted variances V(X) and V(U). More formally we state D E F I N I T I O N 5 .2) X = . p).)Find the best predictor and the best linear predictor of X + Y given X .B ( n .1. i = 1.1) is distributed as B(l.4Prediction) Suppose that X and Y are independent. THEOREM 5.1 for the definition.1 B I N O M I A L R A N D O M VARIABLES t Let X be the number of successes in n independent trials of some experiment whose outcome is "success" or "failure" when the probability of success in each trial is p.~ . (T is exponentially distributed with parameter h if its density is given by h exp(kt) for t > 0.1.=I Y. where U = x . = 1 with probability p = 0 with probability 1  p = q. 19. . with the probability distribution given by I Y. the number of heads in n tosses) and is called a binomial random variable.).1 (5. Find the best predictor and the best linear predictor of Y given X. (See Section 5. . (Section 4. 1 Let (Y.2. n .86 4 I Moments Find the best predictor of X given Y. defined in (5.) Define X = U + V and Y = UV. I . is called a binomial random variable. Then the random variable X defined by n (5. 5.3) Forthebinomialrando~1MI. .Y . p ) . each distributed as B ( l . 2 and compare the BINOMIAL AND NORMAL R A N D O M VARIABLES 18. Note that Y .1.iableXwehave = P(X = k ) ~ t p ~ q ~ . .
7) (5.2).1. 1 Let X be the number of heads in five tosses of a fair coin.5) EX = np. D E F I N I T I O N 5.9) x n 1= The norrnaldensityisgivenby  EY. Note that the above derivation of the mean and variance is much simpler than the direct derivation using (5. for example. I*. O 1 When X has the above density.3.1.5 and V X = 1.= npq by Theorem 4. to give formula (5.1. = p E Y ~= p for every i for every i The normal distribution is by far the most important continuous distribution used in statistics.1  VY.1. Hoel (1984. 2 .1.4) and (5.1. and a. We have THEOREM 5 .1. however. Many reasons for its importance will become apparent as we study its properties below. This is a special case of the socalled central limit theorem.2 N O R M A L R A N D O M VARIABLES EY.6) (5.4) and (5. Therefore by (5. Prooj The probability that the first k trials are successes and the remaining n .p2 = pq for every i EX = (5. b e ThenEX = pandvx = a'. We should mention that the binomial random variable X defined in Definition 5.1. and all positive u by a rather complicated procedure using polar coordinates. we must multiply the above probability by C. 1 Jmm ~ e t ~ N(P.1.6 1 n 5 V X = 1= W . = p .k trials are failures is equal to pkqnk. The direct evaluation of a general integral Jb.3. we write symbolically X . See. In this example we have n = 5 and p = 0. Obtain the probability distribution of X and calculate EX and VX.5).1.3. Since k successes can occur in any of C.N(y.For example. 78). The normal density is completely characterized by two parameters.1. Using (5.3). which we shall discuss in Chapter 6.the mean and variance of X can be obtained by the following steps: (5. combinations with an equal probability.3). we have . = np by Theorem 4. Proof.Using (5. a 2 ) .we have EX = 2.1.3) we have We can verify f(x)dx = 1 for all IJ.1. in the direct derivation we must compute EXAMPLE 5 . p.1.1.2.2 1 Normal Random Variables 89 (5.88 5 1 Binomial and Normal Random Variables 5.1 is approximately normally distributed when n is large. Such an integral may be approximately evaluated from a normal p r o b ability table or by a computer program based on a numerical method.u').5.25.8) 5. Examples of the normal approximation of binomial are given in Section 6. We have Evaluating (5.11) for each k . VX = npq. f (x)dx is difficult because the normal density does not have an indefinite integral. 1 .
2. which is called the standard normal random variable.9 probability. calculateP(4 < X < 8 ) .3 < Z < 1) where Z . E X A M P L E 5 .2.1) it is clear that f (x) is symmetric and bellshaped around p.0.P(Z < 3) 0. what is the largest value that u can have? which shows that the larger u is. We will often need to evaluate the probability P(xl < X < x2) when X is N(p.2.p) / u is N(0. 1).2.3). from (5. Y and .N ( a + pp.N(10. p2u2). andletY a f (3X.N(O. 1 f (p) = G u .Thenwehave Proot Using Theorem 3.4).2.2.2. 2 .2 1 Normal Random Variables = 91 (uz + p) exp(z2/ 2)dz by putting z = u Y .1) c l = = P(Z < 1) . Theorem 5.1574.2.2.2. and (5. r Sometimes the problem specifies a probability and asks one to determine the variance. we have EX = p. the flatter f (x) is. we have The righthand side of (5.0013 from the standard normal table From (5.1587 .2 shows an important property of a normal random vanable: a linear function of a normal random variable is again normal. using integration by parts = u 2.N ( a + Pp.90 5 ( Binomial and Normal Random Variables 5. 1). p2u2). as in the following example. u2). To study the effect of o on the shape off (x).4). Defining Z in the above way.1 AssumingX.2.1.2 T H E O R E M 5. 2 z exp(z 2/2)dz by putting z =  x CT E X A M P L E 5. 2 Assume that the life in hours of a light bulb is normally distributed with mean 100. observe (5. Therefore. ~ e be ~ t ( p ~ u2) . EX = p follows directly from this fact.2 is that if X is N(p. . = P ( .2)) (5. u2). If it is required that the life should exceed 80 with at least 0.1. Next we have A useful corollary of Theorem 5. then Z = (X .8) can be evaluated from the probability table of the standard normal distribution. O because the integrand in (5. the density g(y) of Y is given by But we have Therefore.6.2.6) = 0.4) is the density of N(0. by Definition 5.
and finally (5. We can also prove that (5.100)/u. Therefore Cov(X. Then the marginal densities f (x) and f (y) and the conditional densities f (y I x) and f (x I y) are univariate normal densities. VX = u i .3. The next theorem shows avery important property of the bivariate normal distribution.11) and (5. hence Correlation(X. Y) = p. Y) = puyux.3 BlVARlATE N O R M A L R A N D O M VARIABLES = lr f2dy m because f does not depend on y because f2is a normal density. By symmetry we have Y  Therefore we can conclude that the conditional distribution of Y given X = x is N[py p u f l i l ( x . then a X + PY is normal.2. = f l D E F I N IT I o N 5 . 5. 1 The bivariate normal density is defined by Therefore we immediately see X N(py.3.6. Y) have the density (5.1) is indeed the only function o f x and y that possesses these properties.3 1 Bivariate Normal Random Variables 93 92 Let X be the life of a light bulb.). y) can be rewritten as  L stants. Y) = p. 3 .All the assertions of the theorem follow from (5. ~ (u.3.2. We have by Theorem 4. u2). and we have EX = px. u i ) and f is the demity of N[p+ p ~ ~ u ~px).4.N(100.2. Correlation(X.). a. If X and Y are bivariate normal and cr and (3 are con . Y) = p. 1 Let (X. We have From the standard normal table we see that + From (5.1).px). Next we have . (5.I 5 1 Binomial and Normal Random Variables 5.N(px.3.p2)]. EY = py. V Y = u:.3.3.All that is left to show is that Correlation(X.(l x .1.(1 . 3 . T H E0 RE M 5 . Then X determine u2 so as to satis5 .1) as a definition and then derived its various properties in Theorem 5.10) is equivalent to where f is the density of N(px.1 + THEOREM 5 . u.3) without much difficulty. O In the above discussion we have given the bivariate normal density (5. a. 3 .P2)].12) we conclude that we must have u 5 15. We must Defining Z = (X .6) EXY = ExE(XY I X) = Ex[XE(Y ) X)] = Ex[Xpy  + P ~ Y U X ~ X(X px)I   = P X ~ Y+ puyux. 2  Proot The joint density f (x.
2 repeatedly.8) Let { X . . py p*(~*.that g(t) is a normal density.4.the best predictor and the best linear predictor coincide in the case of the normal distributionanother interesting feature of normality.is N ( p . u .3. 2.2) is precisely the best linear predictor of Y based on X. Therefore X and Y are independent by Definition 3.P*')] and f~ is that of N ( p x . Z is also normal because of Theorem 5. we have THEOREM 5. 11 1 ) for an example of a pair of univariate normal random variables which are jointly not normal.3. If.8) as C.11).3 Differentiating both sides of (5.6.3.3.2 and 5.3 that E(Y X ) is the best predictor of Y based on X. we need to prove the t h e m miy for the case that p = 1. Z) = 0.2) follows readily from Theorems 5.i = 1. It may be worthwhile to point out that (5.Recall that these three equations imply that for any pair of random variables X and Y there exists a random variable Z such that I I But clearly h is the density of N [ a p . The following is another important property of the bivariate normal distribution..3. Y ) = 0. Note that the expression for E(Y X ) obtained in (5.10) Y = py UY +p( x . By applying Theorem 5. Then we have (5.3.4.3.1).3.3.3.2) before we proved Theorems 5. therefore. (5.)(x . If we put p = 0 in (5. as before.10). and Cov(X.3. we immediately see that f (x.4 and equations (4. (l/n)Cr=IX. n.3.3 1 Bivariate Normal Random Variables 95 Proof.12). then X and Y are independent.3.3.~ ~ (":)2(l .2.7) with respect to t md density of W by g ( .px) + U Y Z .3. 'JX EZ = 0.3. (4. In the preceding discussion we proved (5. ) .p.1 and equation (5.2. 1 Proof.4.3.ar / x)f (x)& 1 I f X and Y are bivariate normal and Cov(X. It is important to note that the conclusion of Theorem 5. we can easily prove that a linear combination of nvariate normal random variables is normal. X and Y are bivariate normal. which was obtained in Theorem 4. a * ) . Because of Theorem 5.4 g(t) = m f ( t .2. ) . ) . m j(x. T HEoR EM 5. we have (5. Define W = wX + Y.2 and 5.94 5 1 Binomial and Normal Random Variables 5. .3.1).p2. Since we showed in Theorem 4.3. y)dydx 2 [rm " " f (Y 1 x)dY]f (x)dx. .6.3.3. VZ = 1 .3. using Theorem 5. O If we define ('J? )* = a2u$ + we can rewrite (5.3.7) P(W c t) = = /E m r.2 does not necessarily follow if we merely assume that each of X and Y is univariately normal.3. We conclude. In particular. y) = f (x) f ( y ) . and (4. in addition.3. . be pairwise independent and Then identically distributed as N ( p .4. which implies that . +Z~opxc and ~ ~p* = (pey + aox)/u. D + + 1 . a 2 / n ) . See Ferguson (1967./a. Therefore Z and X are independent because of Theorem 5.
4 shows. . we arrive at (5.4.4 MULTIVARIATE N O R M A L R A N D O M VARIABLES In this section we present results on multivariate normal variables in matrix notation. the above density is reduced to the bivariate density (5. .3.1 = 5. in particular. however. Then. the linearity of E(Y I X ) does not imply the joint normality of X and Y.. and. respectively.3.4. 3 . D E F I N I T I O N 5. a.3. . .4 1 Multivariate Normal &ndom Variables 97 E(Z I X ) = EZ = 0 and V ( Z I X) = V Z = 1 . ditional mean and variance of both sides of (5.p2. The following two examples are applications of Theorems 5.). (Throughout this section.1).3. Calculate P(2. a matrix is denoted by a boldface capital letter and a vector by a boldface lowercase letter.5? Put X.3.Defining Z . ~.) We write their elements explicitly as follows: Therefore. how large a sample should you take so that the probability is at least 0. as Example 4.. .2) we have E X A M P LE 5. j = 1.1 We say x is multiuariate normal with mean p and if its density is given by variancecovariance matrix 8.4.3..y)/3. Suppose X and Y are distributed jointly normal with EX 1. the inequality above is . ]A).Z . and the correlation coefficient p = '/2. The results of this section will not be used directly until Section 9..2. 9 ) .2 1 X = 3 ) . . .3. Defining the standard normal Z = equivalent to G(a.3.3.3. = Vx.4. l ) . We sometimes write u: for aI. Using (5. 2. = Cov(x. Conversely.  1 " x). EY = 2. n. Y given X = 3 is N ( 3 .1.1 and 4.we have E X A M P L E 5 .3. Examples 4. . . Let x be an ndimensional column vector with E x = p and Yx = 2. Therefore.8 that your estimate will not be in error by more than 0. the answer is 60. i.10).1 and 5..2). . Note that a i = 1.N ( 0 . VX = VY = %. The student unfamiliar with matrix analysis should read Chapter 11 before this section. 5.3. Now we state without proof generalizations of Theorems 5.13.96 5 1 Binomial and Normal Random Variables 5. denoted N ( p .N ( . 2 If you wish to estimate the mean of a normal population whose variance is 9. n. 2. . : ) .2 also indicate the same point. want to choose n so that The reader should verify that in the case of n = 2. and 5.N ( p .=l X.=..3.2 < Y < 3. Therefore.7 and Chapters 12 and 13.4. taking the conwhich implies n > 59. x. by Theorem 5.
1) and N ( 1 .Ez) '1./36.i. Then any subvector of x.W)Y. Define X and Y by Y=X1U. VX. where ZI1 = V y = E[(y . W B(l. and W are mutually independent and distributed as X N ( l . Obtain E(Y I X) and V(Y I X). THEOREM 5 .29). a').1 L e t x W N ( p . Z ) a n d l e t y a n d z bedefinedasin Theorem 5.z'). S and T are independent and distributed as N(0. 9). Y) BN(0.) be i. is multivariate normal. respectively. EY = 2. and the conditional distribution of y given z (similarly for z given y) is multivariate normal with 5. (Section 5. 0. 6. (Section 5. Y N(2. VX = 4.Z12(z22)'ZP~.Ez)(z Ez) 'I. (independent and identically distributed) drawings from bivariate normal random variables with EX = l .  +   a * 3. Let x . ZZ = Vz = E[(z . Let X be the number of aces that turn up. Y. 2.1) Five fair dice are rolled once. Then Ax . 2 Let (XI.4.75. I). (Section 5. Partition 2 conformably as tor and the best linear predictor of Y given X and calculate their respective mean squared prediction errors.3) Suppose U and V are independent and each is distributed as N(0. and Cov(X. such as y or z. l ) .3) Suppose (X. Z and ) let A be an m X n matrix of constants such that m 5 n and the rows of A are linearly independent.98 5 1 Binomial and Normal Random Variables I Exercises 99 THEOREM 5. Calculate P ( p > 3 .Ey)(y . Compute EX.Ey)']. Z12= E[(y . Find the best predictor and the best linear predictor of y2 given X and find their respective mean squared prediction errors. where f (y) and f (z) are the multivariate densities of y and z. 4. Calculate P(U < 5) where u = wx (1 . meaning that X and Y are bivariate normal with zero ineans and unit variances and correlation p. If Z12= 0. p). EXERCISES 1. (Section 5. W = 9.2) Let X = S and Y = T + T S ~where .N ( ~ . (Section 5. 1. 1. y and z are independent. That is to say. f (x) = f (y)f (z). 4). Y) = 2. (Section 5.5). 4 . and /36 = ~:21~.N(Ap.2) Suppose X.d. and ZZ1= (&) '.4. respectively. 8 )andpartitionxf = (yf.Ey) (z .N ( p .3) and v(y I 2) = 2 1 1 . 3 Supposex. X=2Y3v.1.whereyis hdimensional and z is &dimensional such that h + k = n. Find the best predic . 0.Y. Define = ~ : 2 ~ ~ .  THEOREM 5 . and P(X 2 4). 4 .
1 . Thus we have D E F I N I T I O N 6 . 3 ( c o n v e r g e n c e i n m e a n square) A sequence {X. Most of the theorems will be stated without proofs. We could only talk about the probability of (6. Let us first review the definition of the convergence of a sequence of real numbers. Wewrite x. converges to the distribution function F of X at every continuity point of F. There is still another mode of convergence.6. We write X..1) holds with a probability approaching 1 as n goes to infinity. A definition can be found in any of the aforementioned books. and the third.). we write X . For the proofs the reader should consult. . A sequence of random variables { X . for a sequence of random variables.1) being true. In this chapter we shall make the notions of these convergences more precise. 1 . is obvious from the context). Serfling (1980).. I The following two theorems state that convergence in mean square implies convergence in probability. . . The first two are examples of a law of large numbers.. 1 ~ s e ~ u e n c e o f r e a l n ~ b e r s ( a . . . Rao (1973). .) is said to converge to X in mean square if limn. 5 X as n + cc or plim. were a random variable. D E F I N I T I O N 6 . 3 x . Y.X. D E F I N I T I O N 6 .x)' = 0. . which. in turn. converges to the probability of heads. implies convergence in distribution.} have the same limit distribution. If a. almost sure convergence.. 6. the if clause may be paraphrased as follows: if lim P(\X. that a sample mean converges to the population mean. and in Chapter 5. and we call F the limit distribution of {X. In Chapter 1 we mentioned that the empirical frequency r / n .. 2. issaid to converge to a real number a if for any E > 0 there exists an integer N such that for all n > N we have D E F I N I T I O N 6 .1 to a sequence of random variables.}and {Y. . . that the binomial variable is approximately distributed as a normal variable.6. where r is the number of heads in n tosses of a coin. = X." (Alternatively. 4 ( c o n v e r g e n c e i n d i s t r i b u t i o n ) A sequence {X. 2 ( c o n v e r g e n c e i n p r o b a b i l i t y ) . we could not have (6. an example of a central limit theorem. The last equality reads "the probabzlity limit of X. of X. 1 .X I < E) = 1 for any E > 0.) is said to converge to X i n distribution if the distribution function F. or Amemiya (1985). is said to converge to a random variable X i n probabzlzty if for any E > 0 and 6 > 0 there exists an integer N such that for all n > N we have P(IX. because it would be sometimes true and sometimes false. but we will not use it here. This suggests that we should modify the definition in such a way that the conclusion states that (6.1. } . n = 1. If {X.1. convergence in mean square and convergence in distributzon.)  We have already alluded to results in large sample theory without stating them in exact terms. for example.XI< E) > 1 . $ X. n = 1.1.1 MODES OF CONVERGENCE Unlike the case of the convergence of a sequence of constants. for which only one mode of convergence is sufficient. 1 .1 1 Modes of Convergence 101 I LARGE SAMPLE T H E O R Y Now we want to generalize Definition 6. is X.2 . ] . we need two other modes of convergence.1) exactly. Chung (1974).1.. We write X. E(X. in Chapter 4.
+N(0. 4 ax.X2.] be i.).)is an important necessary condition. but inequality (6. 1 . 2. . i = 1.1.=. . define = ~'C. A central limit theorem (CLT) specifies the conditions under which Z. (iii) (X. . of (6. .$ 6. . = p . which concerns the almost sure convergence.) Note that the conclusion of Theorem 6.) with EXi = p.1. .2 LAWS OF LARGE NUMBERS A N D CENTRAL L I M I T THEOREMS X.1 is deduced from the following inequality due to Clhebyshev. ) so that .. x.i = 1.& X . YJn) is the same as the distribution of g(X1.i.102 6 1 Large Sample Theory 6. K.1.)are uncorrelated with EX.EX.$ x J x. We shall write 2.2 that X.1. 1 ( K h i n c h i n e ) where f. 2 . therefore. and {X. converge jointly to {X. = a2. .d. 0. Then the limit distribution of g(X1. .Y2. THEOREM 6.) is any nonnegative continuous function.1.1 (Chebyshev). . THEOREM 6 . 5 p. Now we ask the question.1.+ X 3 x. 1). because it is degenerate. In certain situations it will be easier to apply z.1. . it should be nondegenerate. 1).OLJ). Theorem 6.1 (Chebyshev). because VZ.i. . which is useful on its own: where g(. exists. Suppose that g is a continuous function except for finite discontinuities.1 can be obtained from a different set of assumptions on (X. = p and VX. be a vector of random variables with a fixed finite number of elements.1. . Let g be a function continuous at a constant vector pointa. . THEOREM 6 .1.d. converges in distribution to a standard normal random variable.2) to be X. $ 0 is to show rf.)in distribution. f ix)dx.3 Let (Xi) be independent and identically distributed (i. 5 p. The following two theorems are very useful in proving the convergence of a sequence of functions of random variables. Let (X. 2 . . For example. plim Y. take g(x) to be x2 and take X. X .1. A law of large numbers (LLN) specifies the conditions under .EX. + N(0. however. .Ex.& a J g ( X .)"'(z. what is an approximate distribution of 2. Let X . In many applications the simplest way to show X. . .% a. . (ii) X.. . when n is large? Suppose a law of large numbers holds for a sequence { X . . .(x)dx m 2 e2 Is Given a sequence of random variables {Xi]. 1 .X~. converges to 0 in probability.X. 2 ( L i n d e b e r g . a 2 . . .3) Eg(X. More precisely. .J. Here we have assumed the existence of the density for simplicity. 4 ( S l u t s k y ) (i) x. XK. 1 ( C h e b y s h e v ) X. Then X.$ X and Y..E x .) exists./Y. provided a # 0. it means the following: if F. 3 0 and then to apply Theorem 6. It follows from Theorem 6. . For if the limit distribution of Z. with EXi and VX.. Chebyshev's inequality follows from the simple result: (6. To prove Theorem 6.) = jm g(x)/. is the distribution function of Z.)if we use Theorem 6. 2. Yln. 5 x + a. .EX.Ez.L e v y ) and Liapounov. and S = (x I g(x) r E ' ) .Here the joint convergence of {X. . = (VX.. if {X.. however. 2. . .).2 M 1 Laws of Large Numbers and Cent L h i t Tharems 103 THEOREM 6 .] to {X. We state without proof the following generalization of the Slutsky theorem.Y. We shall state two central limit theoremsLindebergL6vy THEOREM 6 . XZn. x.1. provided that Eg(X. . and therefore the distinction is unnecessary here. = 1 for all n. We do not use the strong law of convergence.2) is true for any sequence of random variables.2. 1 . i = 1. X/a. + Y. 5 0. ) 5 g ( a ) .. = cr2. XKn. Then 2.. ThenX. .a].then  THEOREM 6 . This law is sometimes which referred to as a weak law of large numbers to distinguish it from a strong law of large numbers. . It is more meaningful to inquire into the limit distribution of Z. .1. = a.(x) is the density function of X. It is an uninteresting limit distribution. then 4 p. by Theorem 6. x. 2 x.. ~f x. .
3 If 5% of the labor force is unemployed. we can conclude (6.5 < X* < 1. In Definition 5.5 and V X = 1. we shall approximate binomial X by normal X* N(2.. If 1/3 li n j m z=l :f. pq/n) or that X A (np.5) As we stated in the last paragraph of Section 6. 2 . P ( X = 2) by P(1. Since EX = 2. . 5 N(O. We now introduce the term asymptotic distnbution.5 < X* < 0. ) . The figure suggests that P ( X = 1 ) should be approximated by P(0. The same is true of P ( X = 5 ) .1.. 1 ) .1.8.12) and (6. In the terminology of Definition 6. is asymptotically normal with the asymptotic m a n E X . + N ( 0 . it may be approximated either by P ( X * < 0. we draw the probability step function of binomial X and the density function of normal X* in Figure 6.5. 2 EXAMPLE 6 . after some rounding off. because it makes the sum of the approximate probabilities equal to unity.2. 4 N ( 0 .3 N O R M A L APPROXIMATION OF B I N O M I A L Here we shall consider in detail the normal approximation of a binomial variable as an application of the LindebergLevy CLT (Theorem 6... 1.:) [i . 1  (6. with EY. central limit theorems provide conditions under which the limit distribution of 2. = a : . we may replace 5 above by A. 1.0. V z . X = C.1.2 and Figure 6.5). which means the "approximate distribution when n is large. = 1 with probability p and = 0 with probability q = 1 .) is N ( 0 . vX. because its condition is more difficult to verify." This last statement may also be stated as "x.d.2. 3 . and so on.2). 3 .2)." Given the mathematical result Z.3.13) = Let {X.2. npq).}:that is. EX. Both are special cases of the most general CLT.2) f(x) = 1 exp [ . and E(IX.5].1. is N(EX.1 1 2 3 4 5 These two CLTs are complementary: the assumptions of one are more restrictive in some respects and less restrictive in other respects than those of the other.) . EXAMPLE 6 . N ( 0 .1/2 = 0. however. 3 ( L i a p o u n o v ) VX.5. and the asymptotic variance vX.25). The true probabilities and their approximations are given in Table 6.5)'/2.2. Bernoulli variables {Y. however." These statements should be regarded merely as more intuitive paraphrases of the result Z.where Y.."=IY. 0 FIGURE 6. we shall make statements such as "the asymp totic distribution of Z.5). is N ( 0 .1. Let X be as defined in Example 5. 1). We shall consider three examples of a normal approximation of a binomial. I ) .} satisfy the conditions of the LindebergLevy CLT. is N(EX.). = p and V Y . Then EX = 1 and V X = 0.  p. = pq. = (vX.1)  Using (5. The former seems preferable. ms.p. 1 Normal Approximation of Binomial 105 THEOREM 6 .25) is. The densityfunction f ( x ) of N(2.5 < X* < 2.104 6 1 Large Sample Theory 6. Or we may state alternatively that X / n A N ( p .5).}be independent with EX.Note that it would be meaningless to say that "the limit distribution of X .1. 3 .25 in this case.. EXAMPLE 6 . what is the probability that one finds three or more unemployed workers among m p 4 N(0. We shall not use it in this book.4. 1).3.1 we defined a binomial variable X as a sum of i. The results are summarized in Table 6.2. then Z. l ) . which is due to Lindeberg and Feller." Normal approximation of B(5.1. As for P ( X = 0).3.i.( x . 1 ) ) or "the asymptotic distribution of X . Since {Y. Change the above example t o p = 0. 1)" (written as Z. 6.)"~(X.5) or P(0.3 = p.
p). Then X B ( 1 2 . 0.106 T A B L E 6. We first calculate the exact probability:  . where we first assume p = 0.05.1 6 1 Large Sample Theory 6.5) 1 Approximation X Probability twelve randomly chosen workers? What if 50% of the labor force is unemployed? Let X be the number of unemployed workers among twelve workers.4 1 Examples 107 Normal approximation of B(5.
p)" . Assume also .2.d. In the first case. EXAMPLE 6 .' ~ .p)/converge to N ( 0 .1 (Chebyshev) Vx or Theorem 6..2) +  fn i l l 2 We therefore conclude (6.] satisfies the conditions of . with EX. X = p. = py and W . the desired result follows from Theorem 6. defined as S .2..lY.]. 2.]be i.=~cT The ~ .3 (Liapounov) in this case becomes EXAMPLE 6 .)and {Y. > 0 for all i.2. . B y Theorem 6. we should assume that {X..d. (Section 6.4. it converges to the same constant in probability. by Theorem 6. Obtain the asymptotic distribution of F/x. EXAMPLE 6 . is clearly a continuous function of n'C. so that we can use (iii) of Theorem 6.i. and let {Y..4.3.i. In Example 6.4.. note E ( 2 . (Section 6. The next step is to find an appropriate normalization of (FIX . = px # 0 and VX. with the common mean p > 0 and common variance a2 and define X = n .1) and (6. Using (iii) of Theorem 6. x~ 1..2.2 n . (Section 6. Obtain the probability limit and the asymptotic that Y log E. is that this last quantity should converge to 0 as n goes to infinity.Xf converges to a2 in probability. . 4 .=.="=.Y = 1 n n C.4. Y.1.3) It is known that 5% of a daily output of machines are defective. therefore. f = ~ and ~.1) Prove that if a sequence of random variables converges to a constant in distribution. Because S.. Define W.) are identically distributed in addition to being independent. be i. required condition.pI3 = m3 Under what conditions on a: does (2 . 2 converge to (pX)2 in probability. 4 . Assume that {X.1 (Khinchine).] are independent. Then (W.= a.1. n. What is the probability that a sample of 10 contains 2 or more defective .1) Give an example of a sequence of random variables which converges to a constant in probability but not in mean square and an example of a sequence of random variables which converges in distribution but not in probability.Prove that the sample variance.=~ B y Khinchine's LLN (Theorem 6. In the second case.py/pX) to make it converge to a proper random variable. i = 1. obtain from (6. .1) we have plim. 3 x2.d. with EY.pyx. Assume that {X.1. with a finite mean p and a finite variance a*.i. . p/X 4 py/pX. indicate distribution of x clearly which theorems of Chapter 6 are used. = pXY.4 (Slutsky).1. + Then we can readily see that the numerator will converge to a normal variable with an appropriate normalization and the denominator will 4. E n'C.xXf and X.d. 4 .) are independent of each other. I ) ? The condition of Theorem 6.i.1 (Khinchine). = a. . 2.1.3.1. E ~ % n dplim.]and (Y. .2 (LindebergLevy) &a..4) r A sufficient condition to ensure this is X EXERCISES Let (X. 2 px and P 4 px. assume further that EIX.' c ~ .n c. we where a&= pis.2.. 4 Let ( X .)be i. For this purpose note the identity 3. At each step of the derivation.108 6 1 Large Sample Theory ) Exercises 109  We can answer this question by using either Theorem 6.. ] be i. (Section 6. ~ .4 (Slutsky). Therefore Theorem 6.2) Let {X. Therefore.
}.d. X+E Explain carefully each step of the derivation and at each step indicate what convergence theorems you have used. 132) and derive its asymptotic distribution. .}are needed in order for 4 = Z(X. 5. where 2. +. and P = ~'z:=~Y.4) n P ( X = k) = (A%4 /kt Derive the probability limit and the asymptotic bution of the Suppose X . (Section 6. 11. 11 and Y N[exp(a). . = n'~:=lx. (b) l / x . .i.8)'/n to converge to u2in probability? What more assumptions are needed for its asymptotic normality? 9.} be as in Example 6.05 of p with probability at least 0. and the covariance UXY. Note that EX = VX = A and v(X2) = 4h3 6h2 A. n.i. Y. (Section 6. Let {X.. Derive the asymptotic distribution of X E . observations on = ~'z:=.=I (b) Obtain the limit distribution of . 6. Y).4) Let {X.4) Suppose {X. (Section 6.2. (Section 6. and cr. + + 8. 2. be i.3) There is a coin which produces heads with an unknown probability p.i. with the means p .. What more assumptions on (X.d.4) Let (X.} be independent with EX = p and VX = u2.9? 10. independent of each other.We are to (X. (Section 6... Obtain the asymptotic distribution of (a) X2. If a theorem has a wellknown name. (Section 6. ~and py. with EX (a) Obtain " I = 0 and VX = u2 < co. the variances u. Prove the consistency of fi (see Definition 7..}are i.4.x. . plim n' ntm (X. i = 1.d. p.} be i.5. Otherwise. I].4) Let (X. describe it. and define estimate p by fi = logX/logF. How many times should we throw this coin if the proportion of heads is to lie within 0.  based on a sample of size n.110 6 1 Large Sample Theory I Exercises 111 machines? Solve this exercise both by using the binomial distribution and by using the normal approximation.N[exp(ap). . you may simply refer to it. . Y.4.XI+'). .
. They are also referred to by the same name. we call {(X. 1. The Bayesian method. This is an act of interval estimation. .1. We call the basic random variable X.1 WHAT I S AN ESTIMATOR? Sample Moments In Chapter 4 we defined population moments of various kinds.).5. . The goal of point estimation is to obtain a singlevalued estimate of a parameter in question. We define I . whose probability distribution we wish to estimate. . .2. X. Once we observe them.i. each of which has the same distribution as X. will be discussed in Chapter 8.5. .6. X2. X2. . . suppose we want to estimate the probability ( p ) of heads on a given coin toss on the basis of five heads in ten tosses. 0. X P . Y). . Here we shall define the corresponding sample moments. (We say that { X .d. xe.0.5 is an act of point estimation. . . Xn) a sample of size n. such as (1. In this chapter we discuss estimation from the standpoint of classical statistics. z = 1.5.. the population. with a particular degree of a confidence. . sample.). For example.1 1 What Is an Estimator? 113 for a given coin. 6. 1. 7. We can define X = 1 if a head appears and = 0 if a tail appears.) are random variables before we observe them.) For example.  C x)k. Sample moments are "natural" estimators of the corresponding population moments.. and we call (XI. X. say.7. If we denote the random variable in question by X. in which point estimation and interval estimation are more closely connected. At most we can say that p lies within an interval. the goal of interval estimation is to determine the degree of confidence we can attach to the statement that the true value of a parameter lies within a given interval. is the height of the ith student randomly chosen. suppose we want to estimate the probability ( p ) of heads Sample kth moment around the mean  1 " n 1 = 1 (X. We can never be perfectly sure that the true value of p is 0. If (X. . Guessing p to be 0. ..) 1 a bivariate sample of size n on a bivariate population (X. . . Y). they become a sequence of numbers.0. however. .X. .3. . the n repeated observations in mathematical terms mean a sequence of n mutually independent random variables XI..Y. Y. These observed values will be denoted by lowercase letters (xl. Note that (XI.1   Chapters 7 and 8 are both concerned with estimation: Chapter 7 with point estimation and Chapter 8 with interval estimation. We define Sample mean Sample variance In Chapter 1 we stated that statistics is the science of estimating the probability distribution of a random variable on the basis of repeated observations drawn from the same random variable. x.7).4 and have the same distribution as (X. 2. . . If X is the height of a male Stanford student. represents the outcome of the ith toss of the same coin.) or (5. ) are i.). . . (0. are mutually independent in the sense of Definition 3. n. . 7.8.9. .0. Then X.
Thus an estimator is a statistic used to estimate a parameter. .6.2) EX^= x j k p l . be the outcome of the ith roll of a die and define Y .1 (Chebyshev). The following way of representing the observed values of the sample moments is instructive. . .. X2. X.1.) = l/n. We can solve these six equations for the six unknowns (pi] and express each pj as a function of the five moments. regardless of the type of X. = 0 otherwise. .) be the observed values of a sample and define a discrete random variable X* such that P(X* = x. is determined). Note that X* is always discrete. . Mathematically we express it as We call any function of a sample by the name statistic. = 1 if X. . (that is. The fil just defined can be expressed as a function of the sample. we obtain estimators of {pi). 1 where p. . follows from (1) and (2) above because of Theorem 6.5. . Its observed value is called an estimate. 1. . using the term "good" in its loose everyday meaning. A "natural" estimator in this case is the ratio of the number of times the ace appears in n rolls to ndenote it by j l . In general. . (3) Using Theorem 6. by 5 and 2 sx . . We shall call X* the empirical image of X and its probability distribution the empi~ical distribution of X. (7. we know that V X = u2/n. In Section 7. is uniquely determined when X. we = EX.. . and the remaining five identities fork = 1.2 Estimators in General Sample correlation Sample Covariance SXSY The observed values of the sample moments are also called by the same names.2. Then the moments of X* are the observed values of the sample moments of X. Although. .3 we shall learn that j1 is a maximum likelihood estimator. . x2. If we replace these five moments with their corresponding sample moments. which shows that the degree of dispersion of the distribution of the sample mean around the population mean is inversely proportional to the sample size n. we estimate a parameter 8 by some function of the sample. jl is a function of XI. 5 are the definitions of the first five moments around zero. estimator sometimes coincides with the maximum likelihood estimator. We shall show that it is in fact a function of moments. it . respectively.= k = 0. . . . the method of moments as the method of moments. (2) Suppose that VX = a2is finite. Then p1 = (l/n)Cr=lYIY.1. we know that EX = EX.114 7 1 Point Estimation 7. = P(X = j).2) reduces to the identity which states that the sum of the probabilities is unity. . . . 2. ( I ) Using Theorem 4. . Since Y. 7. Let X. Note that an estimator is a random variable.1. When k = 0. They are defined by replacing the capital letters in the definitions above by the corresponding lowercase letters.3. = 1 and Y . But let us concentrate on the sample mean and see what we can ascertain about its properties.3. Then.2. 2.1. using Theorem 4.1. j = 1. 2. The observed values of the sample mean and the sample variance are denoted. we can say that the sample mean is a "good" estimator of the population mean. Y .. is not a moment. 6. i = 1. the same result also know that plim.This method of obtaining estimators is known as in this case. Let (xl. . We stated above that the parameter p.. Consider the following six identities: 6 (7.1 1 What Is an Estimator? 115 Sample covariance On the basis of these results. n. which means that the population mean is close to a "center" of the distribution of the sample mean. We have mentioned that sample moments are "natural" estimators of population moments. 2 . . If VX is finite. is a function of X. An example is the probability (pl) that the ace will turn up in a roll of a die. . Are they good estimators? This question cannot be answered precisely until we define the term "good" in Section 7.x. We may sometimes want to estimate a parameter of a distribution other than a moment.1 (Khinchine's law of large numbers).
71.3 Nonparametric Estimation  . Nonparametric estimation for a continuous random variable is therefore useful only when the sample size is very large. The difficulty of this approach is characterized by a dilemma: if d is large.1 Population: X = 1 with probability p. the distribution is not specified and the first few moments are estimated. assuming that it is zero outside the interval [4. because it does not use the information contained in the higher moments. the approximation of a density by a probability step function cannot be good. In the distributionspecific method. = 0 with probability 1 . We must divide this interval into 3 / d small intervals with length d and then estimate the ordinate of the density function over each of the small intervals by the number of students whose height falls into that interval divid6d by the sample size n.2 1 Properties of Estimators 117 is in general not as good as the maximum likelihood estimator. . In Figure 7.    In parametric estimation we can use two methods. For example.2.2. normaland these parameters are estimated. (2) Distributionfree method. suppose we want to estimate the density of the height of a Stanford male student. but if d is small. many intervals will contain only a small number of observations unless n is very large.1 Ranking Estimators F I G U R E 7.  7. the  distribution is assumed to belong to a class of functions that are characterized by a fixed and finite number of parametersfor example.1 Probability step functions of estimators EXAMPLE 7. In the distributionfree method. .fi Sample: (XI.1 we show the probability step functions of the three estimators for four different values of the parameter p. This example shows two kinds of ambiguities which arise when we try to rank the three estimators.116 7 1 Point Estimation 7. The reader who wishes to study nonparametric density estimation should consult Silverman (1986). In this book we shall discuss only parametric estimation.X 2 ) .1. In nonparametric estimation we attempt to estimate the probability distribution itself. Inherent problems exist in ranking estitnat~rs.2 PROPERTIES OF ESTIMAWRS 7. 7. . (1) Distributionspec@ method. The estimation of a probability distribution is simple for a discrete random variable taking a few number of values but poses problems for a continuous random variable. as illustrated by the following example.
0 ) * 5 E ( Y .) C3 (4).0) for every E. X is "better" than Y.2 Various Measures of Closeness The ambiguity of the first kind is resolved once we decide on a measure of closeness between the estimator and the parameter. say. see Hwang (1985). p = 3/4. Here X (solid line) and Y (dashed line) are two random variables defined over the sample space [0.1 THEOREM 7 . 3 (2) 3 (3) a n d ( 3 ) p (2).118 7 1 Point Estimation 9. E. (3) Eg(lX .4 idea of stochastic dominance is also used in the finance literame.01) = F I G U R E 7. (2) T dominates W for p = 0. 2 .Y 119 (1) For a particular value of the parameter. Criteria (3) and (4) are sometimes referred to as universal dominance and stochastic dominance. 4 IY .0) for every continuous function g(.0)'. fof example. 5 = 0 otherwise.2.0) 5 Eh. Criteria (I) through ( 5 ) are transitive.2 1 Roperties of Estimators .(X . 1985. 2 THEOREM 7 .01) 5 E ~ ( ~Y0)) for every continuous and nondecreasing function g. meaning that one does not imply the other.) which is nonincreasing for x < 0 and nondecreasing for x > 0.(X . These ambiguities are due to the inherent nature of the problem and should not be lightly dealt with.(z) = 1 if lzl 2 1.0 1 > IY . In the following discussion we shall denote two competing estimators by X and Y and the parameter by 0. (2) Eg(X .2 Illustration for Theorem 7. (We allow for the possibility of a tie. r > . we say X is strictly preferred to Y.(Y . (Obvious. (4) is equivalent to stating Then Eh. respectively. it is not clear which of the three estimators is preferred. but W dominates T for p = %. 2 . Adopting a particular measure of closeness is thus equivalent to defining the term better: (The term stn"ctZy better is defined analogously.0 1 5 IY .2. X 7.01) (4) ? (6). Note that 0 is always a fixed number in the present analysis.) O THEOREM 7 .) Or. The theorem follows from the fact that a continuous function can be approximated to any desired degree of accuracy by a linear combination of step functions. see. If X is preferred to Y and Y is not preferred to X.) (3) (3) + (5) and (5) p (3). f l i ) t s . Each of the six statements below gives the condition under which estimator X is prefmed to estimator Y .2.0) Eg(Y . (6) is not. The p r o b .01). (5) E ( X .2. we might say. and it is not easy to choose a particular one.0 that Eh. (6) P(IX . Huang and Litzenberger (1988). But because we usually must choose one estimator over the others. THEOREM 7.Therefore.0 1 > E) for every E. for a rigorous proof. Sketch o f ProoJ: Define h. There are many reasonable measures of closeness.. The h 1 Pro$ Consider Figure 7. 11. (Obvious. however. 2 . The reader should verify this. In this section we shall consider six measures of closeness and establish relationships among them.0) = P(IX .0 1< 1 r E). (See Hwang.) (1) P(IX .0 1 > E) 5 P(IY . we shall have to find some way to get around the ambiguities.. (4) P([X . P(IX .
Then X is strictly preferred to Y by criterion ( 6 ) . Then T is preferred to S by criterion ( 1 ) . Since X .8 c l THEOREM 7 . 6 ( 1 ) =3 ( 6 ) and ( 6 ) p (1). We also assume that 0 = 0.0 > 0 and Y . (6) . The righthand side of ( 6 ) is zero if ( I ) holds. But. as already noted. criteria ( 2 ) and ( 3 ) are equivalent.3. ( 3 ) + ( 1 ) . defined in Figure 7. p ( 1 ) . 5 ( 1 ) =3 ( 3 ) and ( 3 ) p ( 1 ) .120 7 1 Point Estimation 7.2. 2 . O > 0 and Y (1) ? (2). Define a function go in such a way that go(%) = go(%) = 1 and go(%) = %.2. Consider any pair of random variables X and Y such that X . Proof.01) r P [ ~ ( I X . Therefore.Therefore X is preferred to Y by criterion ( 2 ) . This shows that ( 1 ) does not imply ( 2 ) .because Ego( S . 2 .012 = 4 5/4 and E(Y .4.but P(IY . 0 Pro$ Consider estimators S and T in Example 7.2. as before. Pro05 ( 6 ) .= P(IX .This shows that ( 2 ) does not imply ( 1 ) .5.2. ( 2 ) and (3) are equivalent. Then T is not preferred to S by criterion ( 2 ) .01) r Eg(lY . Then.0 > 0 in this example. O > L THEOREM 7 . 2 .4. .0)' = 4.2. Clearly.2. T ~ U S 1.0 1 5 I Y .01) = * 0 FIG u R E 7.1 when p = 3/4. Consider X and Y . Xis preferred to Y by criterion ( 3 ) . IX . 7 Proof.01) 5 g ( [~ Therefore.0 . defined in Figure 7.01) = P ( Y < X ) < 1. and a dashed line between a pair of criteria means that one does not imply the other. and ( 4 ) ? (6) by T h e e rem 7. Eg(lX . Next. Then. by our construction. 15I Y 8 1 g(lX .0 ~ ( I Y .0 1 r IY . Since g is nondecreasing.2. We have shown that X is preferred to Y by criterion (4).eferred to Y by criterion ( 4 ) .  Proof.0 > 0. Consider X and Y .2.9 THEOREM 7 . Then ( 6 ) (1) must hold.3  : 1 Illustration for Theorem 7.p).2. therefore Y is strictly preferred to X by criterion (5).01) for every continuous and nondecreasing function g. an arrow indicates the direction of an implication. But clearly X is not preferred to Y by criterion ( 1 ) . But ( 3 ) and ( 4 ) are equivalent by Theorem 7. ferred to X by criterion ( 6 ) .2. In the figure. we assume that 0 = 0. In Figure 7. as we noted in the proof of Theorem 7. But E(X . consider X and Y .p) < Ego(T .3.3. But P(IX .whereas Y is strictly preferred to X by criterion ( 6 ) . O + The results obtained above are summarized in Figure 7.01). and.01) 5 ( 1 ) 3 (3). X is preferred to Y by criterion ( 3 ) because of Theorem 7. defined in Figure 7.Y is pre15I X . X is strictly p. X (solid line) and Y (dashed line) are defined over the same sample space as in Figure 7.2 1 Properties of Estimators 121 ability distribution defined over the sample space is assumed to be such that the probability of any interval is equal to its length.
1 . we say that T is better than S. (Here 8 denotes the parameter space. If 6 is an estimator of 0.8)' for at least one value of 0 in 8.2.0 ) 2 < E(Y . known as the mean squaredierro7: We shall follow this practice and define the term better in terms of this criterion throughout this book.) D E F I N I T I O N 7. as we shall illustrate by referring again to Example 7. B y adopting the mean squared error criterion. we have eliminated (though somewhat arbitrarily) the ambiguity of the first kind (see the end of Section 7. statisticians have most frequently used criterion ( 5 ) .2.1.2.1). FIGURE 7. W is the best estimator.2 are reasonable (except possibly criterion (6).2.p)2 = p(l . This should be distinguished from the statement that T is better than S at a specific value of p.1p(l p) 2  2 .3/41' = 3/16.2.1) E(T . In Example 7. 4 Relations among various criteria 7.5. we state L e t X a n d Y b e twoestimatorsof0. E(S . When T dominates S as in this example. for each value of the parameter.2.0)' 5 E(Y .122 7 1 Point Estimation 7.5 Mean squared errors of estimators in Example 7. Now we can rank estimators according to this criterion though there may still be ties. however. and there is no a priori reason to prefer one over the others in every situation. 11.WesayXisbetter (or more eficient) than Y if E(X .1.p).% ) 2 = 1/16. for the moment.3 Mean Squared Error Although all the criteria defined in Section 7. it is the closed interval [O.2. the set of all the possible values the parameter can take.2.2 4 P~O~SS&S of f&mat~rs 123 (7.2. and E(W .012 for all 0 E 8 and E(X . The mean squared errors of the three estimators as functions of p are obtained as (7. They are drawn as three solid curves in Figure 7.2) E(S .3/4)' = 3/32. because it is not transitive). because T is better for some values of p and W is better for other values of p.) It is evident from the figure that T clearly dominates S but that T and W cannot be unequivocally ranked. for this value of the parameter p. (Ignore the dashed curve. Therefore. We can easily calculate the mean squared errors of the three estimators in Example 7. unless otherwise noted.2.1: E(T . The ambiguity of the second kind remains. More formally.1 F IG UR E 7 . we call ~ ( 6 8)' the mean squared errm of the estimator.p).
DEFINITION 7.2. 2 . In our example. A certain degree of arbitrariness is unavoidable in this strategy. in Example 7. therefore. Minimax strategy. In many .5 that W does well for the values of p around %. For example.4 Strategies for Choosing an Estimator The mean squared error of Z is computed to be How can we resolve the ambiguity of the second kind and choose between two admissible estimators. we see that Z is chosen both by the subjective strategy with the uniform prior density for p and by the minimax strategy. it is usually not discussed in a textbook written in the framework of classical statistics. we say that the estimator is inadmissible. 7. we choose the estimator for which the largest possible value of the mean squared error is the smallest. One of the classes most commonly considered is that of linear unbiased estimators. One possible way to combine the two estimators is to define (7. T and W.1. We first define 6 issaidtobean unbimedestimatorof0if~6= Ofor all 0 E 8. One strategy is to compare the graphs of the mean squared errors for T and W in Figure 7. D E F I N I T I 0 N 7 .5 and to choose one after considering the a priori likely values of p.2.4) Z = XI + X2 + 1 4 Thus. as we shall explain in Chapter 8. whereas T does well for the values of p near 0 or 1.  . practical situations the statistician prefers a biased estimator with a small mean squared error to an unbiased estimator with a large mean squared error. 2 Let 6 be an estimator of 0. We call E8 . We formally define ~ e6 tbe an estimator of 0.5.2. I t b a dninimaxestimatwif.2. This strategy is highly subjective. 7.124 7 1 Point Estimation 7. T and S are unbiased and W and Z are biased. W. It is more in the spirit of Bayesian statistics. We say that 6 is inadmissible if there is another estimator which is better in the sense of Definition 7.3 and is graphed as the dashed curve in Figure 7. for any other estimator 0.2.2. Although unbiasedness is a desirable property of an estimator. T and W are equally good by this criterion.1. This strategy may be regarded as the most pessimistic and riskaverse approach. in Example 7. We would then choose the estimator which has the minimum area under the mean squared error function. 11. although.2. S is inadmissible and T and W admissible.5 Best Linear Unbiased Estimator Neither of the two strategies discussed in Section 7. although the second is less objectionable to them.1. According to the minimax strategy.2.1. a Bayesian would proceed in an entirely different manner. An estimator is admissible if it is not inadmissible. T is the best estimator within the class consisting of only T and S. it should not be regarded as an absolutely necessary condition.1? Subective strategy. T is preferred to W by this strategy.2.0 bias. Their primary strategy is that of defining a certain class of estimators within which we can find the best estimator in the sense of Definition 7. 0 f Among the three estimators in Example 7. For example. We can ignore all the inadmissible estimators and pay a t t a d o n only to the class of admissible estimators.2. rather than comparing the mean squared errors of estimators. When we compare the three estimators T. This suggests that we can perhaps combine the two estimators and produce an estimator which is better than either in some sense. We see in Figure 7. In Chapter 8 we shall learn that Z is a Bayes estimator. we have D E F I N I T I O N 7.4 max ~ 0 (6 012 5 max ~ ( 6 O12. as in the case of S by T in the above example.2 ) Properties of Estimators 125 When an estimator is dominated by another estimator. in Example 7. In our example.2.1. if we eliminate W and Z from our consideration. suppose we believe a priori that any value of p is equally likely and express this situation by a uniform density over the interval [0.2. and Z.4 is the primary strategy of classical statisticians.
For example.11) we conclude that MSE(Z) only if < MSE(T) if and Note that the second equality above holds because E [ ( 6 . for any estimator 6 of 0. . MSE(Z) < hrlSE(T) In the following example we shall generalize Example 7. T H E o R E M 7. 0 Since ( n 1 ) / ( 2 n for everv n if + + 1 ) is a decreasing function of n. The class of linear estimators consists of estimators which can be ex . we have Therefore ES. X2. That is. .2.2 Population: X = 1 with probability p.2.8) and (7.2.2. We have Since E T = p.10.2. the sample variance defined there is biased.2 1 Properties of Estimators 127 Theorem 7.126 7 1 Point Estimation 7. E X A M P L E 7. as we show in (7.1.2.~ 6( )~ 80)]= ( ~6 0 ) ~ (6 ~ 6 ) 0. .10. The same cannot necessarily be said of all the other moments defined in that section. Therefore.1 0 where MSE stands for mean squared error.2. = 0 with probability 1 . This formula is convenient when we calculate the mean squared error of an estimator.P.1 to the case of a general sample of size n and compare the mean squared errors of the generalized versions of the estimators T and Z using Theorem 7.Xn). Sample: (XI.13).For this reason some authors define the sample variance by dividing the sum of squares by n . We have The mean squared error is the sum of the variance and the bias squared. = ( n .2. the sample mean is generally an unbiased estimator of the population mean. . Estimators: T = As we stated in Section 7.1 instead of n to produce an unbiased estimator of the population variance.1. using Theorem 7.l ) a 2 / n . we obtain ProoJ It follows from the identity From (7.10 gives a formula which relates the bias to the mean squared error.2.
we obtain where we used the condition Zr=la. THEOREM 7.X P . We shall prove a slightly more general minimization problem.2. From a purely mathematical standpoint. .2. The theorem follows by noting that the lefthand side of the first equality of (7. . (7. the following theorem is one of the most important in mathematical statistics.14) G=1 and.20) is the sum of squares and hence nonnegative.Xl and impose the unbiasedness condition (7. Theorem '7. or BLUE. 2 . 0 (Note that we could define the class of linear estimators as a0 + C:=la.2. Therefore the theorem follows from (7..19) clearly holds if and only if a.14)  EX alX. Consider the class of linear estimators of p which can be written in the form Z. = 1 to obtain the second equality.Xl = a2 Now consider the identity I af.) Prooj We have Consider the problem of minimizing ~ : = ~ with a f re= 1.~=laa.2.2. .17) V a.2. } . noting that the lefthand side of (7.2.18) is the sum of squared terms and hence nonnegative.2. Consider the identity (7.14) would ensure that a0 = 0. n and (7.11 The equality in (7.1 is merely a special case of this theorem. The unbiasedness condition (7.19). moreover.128 7 1 Point Estimation pressed as a linear function of the sample ( X I . because the unbiasedness condition (7. t= n 1 Then (7. = p.17).1 are linear estimators.2.2.b.2. This class is considered primarily for its mathematical convenience rather than for its practical usefulness. Therefore.) We now know that the dominance of T over S in Example '7. the equality in (7.2. 0 . = l / n for all i.) subject to the condition C. This would not change the theorem.Xl with a constant term. i = 1 .nbeindependentandhavethe common mean p and variance a2. = l / n . problem is given by Proof.2. .a.16) u2 V z = . Despite the caveats we have expressed concerning unbiased estimators and linear estimators.2.2. of the population mean."=la. .X n ) All four estimators considered in Example 7.15) VX 5 V /= a i X j for all {ail satisfy~ng(7. = 1.2.2.b. (In words. .} subject to condition solution to minimizing ~ r ~ with CZ".1 2  L e t { X .16). .11 provides the ~ a respect : to (a. . The solution to this spect to {a. THEOREM 7. and (7. which has a wide applicability.2.15) holds if and only if a. the sample mean is the best linear unbiased estimatol.2:14) implies Zr=lai = 1.
2. 2.c.2. US. (7. . the problem is that of minimizing ~ : = ~ csubject ~ u ~ to C k l c i = 1.23) with respect to z.3 0 Parameter to estimate: p . .2. 2 Estimators: k1 = . n.12 by putting b.2. Therefore the solution is ci = U ~ ~ / C ~ = ~ U ~ ~ .x2dx = 02/3. We have EX^ = 0'$. f 2. Theorem 7. 2 .12. n.5 that b2 is the biascorrected maximum likelihood estimator.24).1. Assume that 6i are uncorrelated.2. . where M is a known constant. < Z) = zn .2 1 haperties of Estimators 131 (Theorem 7. .2. . where pi = 1 and M = 1.2.11 shows that the sample mean has a minimum variance (and hence minimum mean squared error) among all the linear unbiased estimators. this problem is reduced to the minimization problem of Theorem 7. Thus it is a special case of Example 7. More rigorously. Since the unbiasedness condition is equivalent to the condition C:'. = .2. Then we have for any 0 < z < 0. i = 1.  0" Differentiating (7.2.11 follows from Theorem 7. . 4 L e t & . nonlinear estimat Using (7. Example 7. the solution is E X A M P L E 7.4. Hence E X A M P L E 7 . we know that Z 5 0 and Z approaches 0 as n increases.2.23) G(z) = P(Z < z) = P ( X I < z)P(X2 < z) . . i = 1. = 1 for all i. . n.2. beunbiasedestirnatorsof0with is unbiased and variances a:. . Therefore it makes sense to multiply Z by a factor which is greater than 1 but decreases monotonically to 1 to estimate 0. = 1. we can calculate (7. i = 1. Choose (ci]SO that has a minimum variance. . Therefore VX = 02/12.2.2. . Let Xi be'the return per share of the ith stock. . We have already seen that a biased estimator.) We shall give two examples of the application of Theorem 7. Determine s so as to minimize V(ZL1ciXi) Put EXi = pi and VXi = subject to M = Zr=ficipi. and let ci be the number of shares of the ith stock to purchase. If we put a. An intuitive motivation for the second estimator is as follows: Since 0 is the upper bound of X. we shall show in Example 7.5 provides a case in which the sample mean is dominated by an unbiased.25) and EZ = 0" "I" zndz =  n 0 n+10 Therefore .12. can have a smaller mean squared error than the sample mean for some values of the parameter.3. such as W and Z of Example 7. P(X. 2 .2. we obtain  . Therefore.130 7 1 Point Estimation 7. respecLet G(z) and g(z) be the distribution and density function o tively. Assume that Xi are uncorrelated. = ciui and bi = p i / ( M u i ) .
132
.
7
1
Point Estimation
7.3
1
Maximum Likelihood Estimator: Win#ion
133
Since (7.2.25) shows that (7.2.27),
G2 is
an unbiased estimator, we have, ushg
\
7.3
M A X I M U M LIKELIHOOD ESTIMATOR: DEFINITION AND COMPUTATION Discrete Sample
7.3.1
Comparing (7.2.22) and (7.2.27), we conclude that MSE(b2) 5 MSE(bI), with equality holding if and only if n = 1.
7.2.6 Asymptotic Properties
Thus far we have discussed only the finite sample properties of estimators. It is frequently difficult, however, to obtain the exact moments, let alone the exact distribution, of estimators. In such cases we must obtain an approximation of the distribution or the moments. Asymptotic approximation is obtained by considering the limit of the sample size going to infinity. In Chapter 6 we studied the techniques necessary for this most useful approximation. One of the most important asymptotic properties of an estimator is consistency.
D E F I N I T Io N 7 . 2 . 5
Suppose we want to estimate the probability (p) that a head will appear for a particular coin; we toss it ten times and a head appears nine times. Call this event A. Then we suspect that the coin is loaded in favor of heads: in other words, we conclude that p = % is not likely. If p were '/2, event A would be expected to occur only once in a hundred times, since we have P(A I p = 1/2) = ~ i ~ ( 1 / G 2) 0.01. ~ ~ In the same situation p = y4is more likely, because P(A I p = 3/4) = ~ ; O ( g / q ) ' ( 1 / 4 ) 0.19, and p = g/ln is even more likely, because P(A I p = 9/10) = ~;O(9/10)~('/10) 0.39. Thus it makes sense to call P(A I p) = cAOpg(l  p) the likelihood function of p given event A. Note that it is the probability of event A given p, but we give it a different name when we regard it as a function of p. The maximum likelihood estimator of p is the value of p that maximizes P(A I p), which in our example is equal to 9/1+ More generally, we state
D E F I N I T I O N 7 . 3 . 1 Let (XI,&,. . . ,X,) be a random sample on a discrete population characterized by a vector of parameters 0 = (01,02, . . . , OK) and let x, be the observed value of X,. Then we call
We say 0. (See Definition 6.1.2.)
6 is a consistent estimator of 0 if plim,,
6=
In Examples 6.4.1 and 6.4.3, we gave conditions under which the sample mean and the sample variance are consistent estimators of their respective population counterparts. We can also show that under reasonable assump tions, all the sample moments are consistent estimators of their population values. Another desirable property of an estimator is asymptotic normality. (See Section 6.2.) In Example 6.4.2 we gave conditions under which the sample mean is asymptotically normal. Under reasonable assumptions all the moments can be shown to be asymptotically normal. We may even say that all the consistent estimators we are likely to encounter in practice are asymptotically normal. Consistent and asymptotically normal estimators can be ranked by Definition 7.2.1, using the asymptotic variance in lieu of the exact mean squared error. This defines the term asymptotically better or asymptotically eficient.
the likelihood function of 0 given (xl, x2, . . . , x,), and we call the value of 0 that maximizes L the maximum likelihood estimatol: Recall that the purpose of estimation is to pick a probability distribution among many (usually infinite) probability distributions that could have generated given observations. Maximum likelihood estimation means choosing that probability distribution under which the observed values could have occurred with the highest probability. It therefore makes good intuitive sense. In addition, we shall show in Section 7.4 that the maximum likelihood estimator has good asymptotic properties. The following two examples show how to derive the maximum likelihood estimator in the case of a discrete sample.
134
7 ( Point Estimation
I
7.3
1
Maximum Likelihood Estimator: Definition
135
E X A M P L E 7 . 3 . 1 SupposexB(n,p) andtheobservedvalueofXis k. The likelihood function of P is given by 1
We shall maximize log L rather than L because it is simpler ("log" refers to natural logarithm throughout this book). Since log is a monotonically increasing function, the value of the maximum likelihood estimator is unchanged by this transformation. We have (7.3.2) log L = log C;
mum likelihood estimator is the same as before, meaning that the extra information is irrelevant in this case. In other words, as far as the estimation of p is concerned, what matters is the total number of heads and not the particular order in which heads and tails appear. A function of a sample, such as Cr='=,xi in the present case, that contains all the necessary information about a parameter is called a suficient statistic. This is a generalization of Example 7.3.1. Let Xi, i = 1, 2, . . . , n, be a discrete random variable which takes K integer values 1, 2, . . . , K with probabilities pl, &, . . . ,pK. This is called the multinornial distribution. (The subsequent argument is valid if Xi takes a finite number of distinct values, not necessarily integers.) Let nj, j = 1, 2, . . . , K, be the number of times we observe X = j. (Thus X?=,nj = n.) The likelihood function is given by
E X A M P L E 7.3.2
+ k log p + (n

k) log(1  p).
Setting the derivative with respect to p equal to 0 yields
Solving (7.3.3) and denoting the maximum & 1 & d obtain
estimator b y $, we
where c To be complete, we should check to see that ('7.3.4) gives a maximum rather than any other stationary point by showing that a210g L / ~ P 'evaluated at p = k/n is negative. This example arises if we want to estimate the probability of heads on the basis of the information that heads came up k times in n tosses. Suppose that we are given more complete information: whether each toss has resulted in a head or a tail. Define X, = 1 if the ith toss shows a head and = 0 if it is a tail. Let x, be the observed value of X,, which is, of course, also 1 or 0. The likelihood function is given by (7.3.8)
=
n!/(n1!n2! . log L = log c
 nK!).The log likelihood function is given by
+
K
nj log pi.
j= 1
Differentiate (7.3.8) with respect to pl, p2, . . . , pKl, noting that 1  pl  p2  . . .  pKl, and set the derivatives equal to zero: (7.3.9)

pK
=
a log L  1 n.  nK = 0 ,
aPj
Pj
PK
j = 1 , 2, . . . , K1.
Adding the identity nK/pK = nK/pK to the above, we can write the K equations as (7.3.10) pj=anj, j = l , 2 , . . . , K,
2=1
Taking the logarithm, we have (7.3.6) log L = x, log p
where a is a constant which does not depend on j. Summing both sides of (7.3.10) with respect to j and noting that CF1pj = 1 and CFlnj = n yields
] [
+
n
x, log(1  p).
But, since k = X:=lxz, (7.3.6) is the same as (7.3.2) aside from a amstant term, which does not matter in the maximization. Therefore the maxi
1
(7.31)
a =
1 . n
Therefore, from (7.9.10) and (7.3.11) we obtain the maximum likeiihood estimator
136
7
1
Point Estimation
7.3
1
Maximum Ukdiho$)$ EstSmator: M d t i o n
137
The die example of Section 7.1.2 is a special case of this example.
7.3.2 Continuous Sample
and (7.3.18)
c2=
(x,  3'.
1 = ~
For the continuous case, the principle of the maximum likelihood estimator is essentially the same as for the discrete case, and we need to m o w Definition 7.3.1 only slightly.
They are the sample mean and the sample variance, respectively.
7.3.3 Computation
. . ,Xn) be a random sample on a continuous population with a density function f (lo), where 0 = (el, 02, . . . ,
DEFINITION 7.3.2 Let (X1,X2,.
OK), and let xi be the observed value of Xi. Then we call L = lIr='=, f(xi 1 0) the likelihood function of 0 given (xl, xz, . . . , x,) and the value of 0 that maximizes L, the maximum likelihood estimatoz
E X A M P L E 7.3.3 Let {X,), i = 1,
2 , . . . , n, be a random sample on N(p, a2)and let {x,] be their observed values. Then the likelihood function is given by
(7.3.13) so that (7.3.14)
L
=
" 1 I1 exp GU [
r=l

1 202
In all the examples of the maximum likelihood estimator in the preceding sections, it has been possible to solve the likelihood equation explicitly, equating the derivative of the log likelihood to zero, as in (7.3.3). The likelihood equation is often so highly nonlinear in the parameters, however, that it can be solved only by some method of iteration. The most common method of iteration is the NmtonRaphson method, which can be used to maximize or minimize a general function, not just the likelihood function, and is based on a quadratic approximation of the maximand or minimand. Let Q(8) be the function we want to maximize (or minimize). Its quadratic Taylor expansion around an initial value 81 is given by
n n 1 " log L =  log(27~)   log u2  2 2 2u , = I
'C
(XI
 p12.
Equating the derivatives to zero, we obtain (7.3.15) and (7.3.16)
 = 
where the derivatives are evaluated at 81. The secondround estimator of the iteration, denoted G2, is the value of 8 that maximizes the righthand side of the above equation. Therefore,
aiog L I ap
, a%0
,A)
=
U E = ~
aiogL au2
n
+
2u2
,
1 " 2a ,=I
(x, 
,A)' =
0.
The maximum likelihood estimator of p and u2, denoted as l i , and e2, are obtained by solving (7.3.15) and (7.3.16). (Do they indeed give a maximum?) Therefore we have
Next G2 can be used as the initial value to compute the thirdround estimator, and the iteration should be repeated until it converges. Whether the iteration will converge to the global maximum, rather than some other stationary point, and, if it does, how fast it converges depend upon the shape of Q and the initial value. Various modifications have been proposed to improve the convergence.
. For a rigorous discussion.4. from (7.4 1 Maximum Likelihood Estimator: Properties 139 7. because the theorem uses the phrase "under general conditions.4. .4.1) a log L 2 ~ ( 82 )  1 a2 log L E ae2 where the fourth equality follows from noting that E(l/L) (a2~/a0*) = $ ( a 2 ~ / a 0 2 )= d~ a2/ae2($~dx) = 0. In Section 7.) W = EY' = E  a2 log L  ae2 We also have Therefore (7. . also have (7. Xnl 0) be the likelihood function and let 6(x1.3.4. the expectation operation.4 M A X I M U M LIKELIHOOD ESTIMATOR: PROPERTIES.X..3).") Put X = 6 and Y = a log L/a0 in Theorem 7.) be an unbiased estimator of 0. In Section 7. In this section.138 7 1 Point Estimation 7. . . W e 7. X2. X. we have (7. Then.2 and 7. THEOREM 7. Sketch of Pro05 ( Arigorous proof is obviously not possible.4. which is closely related to the Cram&Rao lower bound. .2) and (7. where we are concerned with the properties of the maximum likelihood estimator.4 examples are given. . . . see Arnemiya (1985). Note that E. which makes the likelihood function itself a random variable.3 the likelihood function was always evaluated at the observed values of the sample.4)  The righthand side is known as the Cram&Rao lower bound (CRLB). under general conditions. some results are given without full mathematical rigor. Therefore. xs.4.4. . . R . In Sections 7. L  a a log L ae ae ae We shall derive a lower bound to the variance of an unbiased estimator and show that in certain cases the variance of the maximum likelihood estimator attains the lower bound. .1 CramCrRao Lower Bound a2 10.4.3) . X2.4. In Section 7. To avoid mathematical complexity. .x*.Xn.4. . ..1. we need to evaluate the likelihood function at the random variables XI. .3 we define the concept of asymptotic efficiency.4. . We show this by means of the Cram&Rao lower bound.1 (CramerRao) Let L(Xl. (In Section 7. Then we have where the integral is an ntuple integral with respect to X I .3 we show the consistency and the asymptotic normality of the maximum likelihood estimator under general conditions. however. is taken with respect to the random variables XI.4.1) follows from the CauchySchwartz inequality (4. X2. . because there we were only concerned with the definition and computation of the maximum likelihood estimator.4.1 we show that the maximum likelihood estimator is the best unbiased estimator under certain conditions.3) we have (7.X 2 .4.
. .10) en(@) = 1 log L. If. 0)..2).4.4.3.2 log f ( X .6. and (7. to converge to oO. where a random variable Xiappears in the argument off because we need to consider the property of the likelihood function as a random variable. a random variable. we implicitly assumed that they were evaluated at the true value of the parameter. note by (7.3) again with respect to p.4.d.) Next we shall show why we can expect Q. P(l . Therefore we obtain (7.2).3).2 Consistency where we have substituted X for k because here we must treat L . Suppose {X.(In the present analysis it is essential to distinguish 0.p) n The maximum likelihood estimator can be shown to be consistent under general conditions. the fourth equality of (7. Let {X.3.4. from 00. with the density f (x..4.4. we obtain  F l C U RE 7. the conditions are violated because the fifth equality of (7. for example. If &(0) converges to ~ ( 0 1we .5).1). are essentially the conditions on L which justify interchanging the derivative and the integration operations in (7.p)/n by (5.f. we obtain E X A M P L E 7.4.7) / CRLB = . 0).1. It can be also shown that even if (T' is unknown and estimated. which attains the global maximum at the true value of 0.5) do not hold.i.2. Differentiating (7.3.4.140 7 1 Point Estimation 7. and the third equality of (7. We shall give two examples in which the maximum likelihood estimator attains the CramCrRao lower bound. so that p is the only parameter to estimate. Therefore we can apply Khinchine's LLN (Theorem 6. denoted Q(0). in other words. Whenever L or its derivatives appeared in the equations. the maximum likelihood estimator.(B) =t=1 1 Since Vp = p(l . This was unnecessary in the analysis of the preceding section.]be as in Example 7.5).3 (normal density) except that we now assume u2 is known.4. Note that a ( 0 ) is maximized at 6.4. We shall only provide the essential ingredients of the are i.4.15) again with respect to p. CRLB =  u2 .10) that a ( 0 ) is ( l / n ) times the sum of i. random variables. E X A M P L E 7..ur.4 1 Maximum LikdItrood Estimator: P~operties 141 The unspecified general conditions.4. The discrete case can proof.9) . unless it was noted otherwise.4. Differentiating (7. the domain of the likelihood function. Therefore the maximum likelihood estimator attains the CramCrRao lower bound. the true value.To answer the first question.8) Therefore (7.3). a2 log L ap2 n . To prove the consistency of the maximum likelihood estimator. is the best unbiased estimator. Define (7. known as regularity conditions. we essentially need to show that Qn(0) converges in probability to a nonstochastic function of 0. p) as in Example 7.3. should expect 6. denoted OO. (7.(0) to converge to Q(0) and why we can expect Q(0) to be maximized at OO. the maximum likelihood estimator attains the CramCrRao lower bound and hence is the best unbiased estimator.4. This is illustrated in Figure 7. n But we have previously shown that V(x) = u2/n.6 Convergence of log likelihood functions 7. the support of L (the domain of L over which L is positive) depends on 0.i.1 Let X B(n. R is the best unbiased estimator.} be similarly analyzed. (7.d.
+ + (7.3). it is not a constant) and let g(.15) in Examples 7.we conclude Let the likelihood function be L(Xl. because where 0* lies between 8 and go.2 ( J e n s e n )  = (Here we interpret the maximum likelihood estimator as a solution to the likelihood equation obtained by equating the derivative to zero.4. . But we have (the derivatives being evaluated at throughout) 7. by (iii) of Theorem 6. alogL/d0 evaluated at 8 is zero.we obtain Therefore we obtain from (7. . 00) in Theorem 7.00). under general conditions.2.4.4.4. 0) < E log f (X.. 0).A)b] > Ag(a) (1 . a log L We have essentially proved the consistency of the global maximum likelihood estimato7: To prove the consistency of a local maximum likelihood estimator we should replace (7.4.17) 0 =  But the righthand side of the above inequality is equal to zero. 3 as in (7. 90) if 0 # OO. the maximum likelihood estimator 8 is asymptotically distributed as x..1 and 7.41) Eg(X) < g(EX).4.4.2.22) as .2) or (7.) be a strictly concave function.14) by the statement that the derivative of Q(0) is zero at Oo.4.12) and (7.4. rather than the global maximum likelihood estimator. The reader should verify (7. .4. (7. Solving for (8 (8 .13) (7. Since the asymptotic normality can be proved only for this local maximum likelihood estimator.4 1 Maximum Likelihood Estimator: Properties 143 provided that E log f (X. Therefore pIim.A)g(b) for any a < b and 0 < A < 1.4. 0) f(X9 0) < log EE log f(X. 0) < m. 00) f (X.). Then . Therefore.14) But we can show (see the paragraph following this proof) that 1 E log f (X.001. That is to say.E log f (X. We expand it in a Taylor series around O0 to obtain Let X be a proper random variable (that is.80) We may paraphrase (7.4 (Slutsky).4.. 0)/f (X. Then.) Sketch ofproof: By definition. In other words. where we have simply written f for f (X.4.142 7 1 Point Estimation Qn(0) = Q(0) 7. To answer the second question. g[Aa (1 . 4 . this is precisely what we showed in (7. X2.4. (Jensen k inequality) Taking g to be log and X to be f (X.4. we need T H E O R E M 7. .22) G (8 .1. we should show and  But assuming we can interchange the derivative and the expectation operation.2). I 0). henceforth this is always what we mean by the maximum likelihood estimator. we obtain (7. (7..12) f(X. 00) if 0 # OO.4..3 Asymptotic Normality T H EO REM 7 . o* .4.
we obtain (7.4.4. A significant consequence of Theorem '7. X .4. n 2 log x.31) again. 0 < x < 1. Solving (7. Obtain the maximum likelihood estimator of EX(= p) based on n we obtain from (7.144 7 1 Point Estimation I I " 7.33) 7..30) log L = n log S 1 .27) for 0.4.2.31) a log L dk n *.3 .1 (7. we have > 1.4.(12p)n +  2 " 1% xi.2. Therefore we define DEFINITION 7.4.34) E log X = (1 + 0) EL1.(I . . (7. 0 > 1. .1). Differentiating (7. C Equating (7.1). 1 Differentiating (7.16).4. I: p2(1 .29).4.3 is that the asymptotic variance of the maximum likelihood estimator is identical with the Cram&Rao lower bound given in (7.4.33) .4 Examples We shall give three examples to illustrate the properties of the maximum likelihood estimator and to compare it with the other estimators. = I " 1% x.4. a 2 1 0 g ~ .4.27) defines a onetoone function and 0 0 < p < 1. a= Thus the maximum likelihood estimator is asymptotically efficient e e n tially by definition.4.28) into (7. we obtain the maximum likelihood estimator A consistent estimator is said to be asymptoticalZy eficient if its asymptotic distribution is given by ('7. Somewhat more loosely than the above.4. and compare its asymptotic variance with the variance of the sample mean We have x.4. This is almost like (but not quite the same as) saying that the maximum likelihood estimator has the smallest asymptotic variance among all the consistent estimators.p12 .31) to zero. (xelog x)dx = EL Let X have density f (x) = (1 + O)xe. (7. .4.4.PI + 1 (1 .4. . log(1 S 0) + 0 C log x. .p13t = l Since we have.2).4. we must have and that the righthand side satisfies the conditions for the LindebergL6vy CLT (Theorem 6.19) follows from noting that 1 Since (7.p12 (1 .4 1 Maximum Likelihood Estimator: Properties 145 Finally. X2. I= 1 Inserting (7.4.20) follows from noting that The log likelihood function in terms of 0 is given by (7.30) with respect to p yields (7. The convergence result (7. using integration by parts.4. EXAMPLE 7.29) log L = n. .2p I :log x. the conclusion of the theorem follows from the identity observations XI.4. we can express the log likelihood function in terms of p as and that the righthand side satisfies the conditions for Khinchine's LLN (Theorem 6.
32) is a nonlinear function of the sample. Since @ in (7. In this example.3.36) and (7.3.34). the asymptotic variance of @.1) Therefore 2 Therefore the consistency of ji follows from Theorem 6. denoted AV(@).32) as AV(@)= p2(1 .because we must treat @ as a random variable.4.2. Even then. and 7. we can state The second equality with LD in (7.which expressed @ as an explicit function of the sample.1. Remark 2 .3: . without appealing to the general result of Section 7. with mean ( p . the Therefore.we have by Khinchine's LLN (Theorem 6. We have where X .  Finally. by Theorem 7.3. has been substituted for x.i. In a situation such as this example. This is not possible in many applications. as in Examples 7. ~8 . Similarly.4. however.3. means that both sides have the same limit distribution and is a conse .AV(P) = (2 'I4 > 0 for 0 <  P)n There are several points worth noting with regard to this example.43).4.4. But since {log X.1 ) / p as given in (7. 7. from (7. solving alogL/ap = 0 for p led to the closedform solution (7. in such cases the maximum likelihood estimator can be defined only implicitly by the likelihood equation. which we state as remarks.3.4.4.1.) are i.4. as pointed out in Section 7.d.2.146 7 1 Point Estimation I 7.2.4.32). we can derive the asymptotic normality directly without appealing to Theorem 7. Remark 4.39).is given by (7.p12 n ~ C Next we obtain the variance of the sample mean. Remark 1. Remark 3.3.1. That is why our asymptotic results are useful. we conclude (7. of the estimator are difficult to find. the asymptotic variance can be obtained by the method presented here.36) consistency can be directly shown by using the convergence theorems of Chapter 6. For this purpose rewrite (7.4.3.4 1 Maximum Likelihood Estimator: Properties 147 Therefore. let alone the exact distribution.3.4. .4. as defined in Definition 6. the exact mean and variance.4. where the maximum likelihood estimator is explicitly written as a function of the sample.440) I < 1.
3 (normal density).". if two parameters and O2 are related by a onetone continuous function O1 = g(02).47) log L =  n n 2 1 ' ' log (2.4.30). G . From these two figures it is clear that log L is maximized at a positive value of y when CElx. The function A is an even function depicted in Figure 7.2).2. and looks like Figure 7.7.4.1.p12 2 2 2~ t = 1 C which can be written as There are two roots for the above. = g(&).43) is a consequence of the LindebergL6vy CLT (Theorem 6. The shape of the function B depends on the sign of C:=.3. < 0.8. Assume that p Z 0.7 I Illustration for function (7.47) with respect to p equal to zero yields (7. > 0 and at a negative value of p when C.> 0 and the negative root if Cr=lx. obtain the maximum likelihood estimator of p and directly prove its consistency.4.4 Assuming u2 = p2 in Example 7.") .4. E X A M P L E 7.x. given by (7. so that p is the sole parameter of the distribution.7 (xi . We first expressed the log likelihood function in terms of p in (7.lx. one positive.4.log p .x.14) we have F I c u R E 7.4.48) We shall study the shape of log L as a function of y.148 7 1 Point Estimation T64 f Maximum Likelihood Edrnat~r: Pmperties 129 quence of (iii) of Theorem 6.3.29) with respect to 0 and inserted the maximum likelihood estimator of 0 into the last term of (7. From (7.30) and found the value of p that maximizes (7.. More generally. Setting the derivative of (7.the respective maximum likelihood estimators are related by 8. one negative. < 0.4.4.4. Therefore Remark 5 .49) B = * := 1 C xi CL and C is a constant term ahat does not depend on the parameter k.4.4 (Slutsky). which can be obtained as follows: B y integration by parts.27). We know from the argument of the preceding paragraph that the positive root is the maximum likelihood estimator if Z. Here we need the variance of log X. The convergence in distribution appearing next in (7. We would get the same estimator if we maximized (7.
49) Next we shall directly prove the consistency of the maximum likelihood estimator in this example.4.53) 1=1 XI plim . 2.=ljkfi. But because of (7. 1.2. Clearly.2. j = 1. each distributed as U(0.(p + qpl) 2.2. ..".54) plim n+m 1=I  n 2p2..4. by Theorem 6. Therefore the asymptotic distribution cannot be obtained by the standard procedure given in Section '7. . The likelihood function of the model is given by (7. st j.5. 0).1).1. Define Yj. > 0 and the negative .2) which shows that the positive root is consistent if IJ. = n' Z:=&. We have.4.8 Illustration for function (7. therefore.3. the signs of Z. and 2. defined in that example is the biascorrected maximum likelihood estimator.4.3. . (a) Find a method of moments estimate of A. Since p = 0/2.55) plim ntm &=2 (p 1 5 . . E X A M P L E 7. Further define $ . take three values 1. satisfies n n1 Z.2. j = 1. the support of the likelihood function depends on the unknown parameter 0 and.4.5 Let the model be the same as in Example 7.56) 1 L = . . and root is consistent if p Suppose X1 and X 2 are independent. 3.5. pl. we have (7.2) Let X.150 7 1 Point Estimation I Exercises 151 p are the same with probability approaching one as n goes to infinity. (Section 7.1. using Khinchine's LLN (Theorem 6. (Section 7.1. the maximum likelihood estimator of 8 is 2.=~X. and 3 with probabilities pl. the maximum likelihood estimator of p is 2/2 because of remark 5 of Example 7. . 2. therefore. Therefore the maximum likelihood estimator is consistent.4. Thus we see that &. EXERCISES C (7. the regularity conditions do not hold. the observed value of Z defined in Example 7.4. = 1 if X. (Section 7.2.2. = 0 if X.53). 1 w) 2 = (7. FIGuRE 7.2) Let X1. = j and Y. n where z = max(xl. . xn). = Z.4. 2. XZ)and X1 X2 as two estimators of 0 by criterion (6) in Section 7. Compare max(X1. = otherwise. xp. < 0.3. (b) Find a method of moments estimate of A different from the one in (a). and 3.lx. + .4. k = 0. Therefore. Then show that j. .for 8 en O 2 z.. and ps. Xn be independent with exponential distribution with parameter A. and 3. In this example. ntm n 1.IJ.
d. Assume that these forecasts are independent and that each forecast is accurate with a known probability n. Note that we o b serve neither X nor Y . whereas (6) is not. 5. 11. 12. Find the maximum likelihood estimator of p based  .1) A proportion p of n jurors always acquit everyone. (Section 7.3) Let XI. and X and Y are independent.) is convex if for any a < b and any 0 < A < 1.p . 0) 5/4. The remaining I .p) 5 Eg(S .3. p) and kt two estimators of p be defined as follows: Obtain the mean squared errors of the two estimators. +  6. and "2 = 0" eight times. P ( Y = 1) =I/. Supposing that twenty i. Define Z = X Y .) 9. (Section 7. and X 3 be independent binary random variables taking 1 with probability p and 0 with probability 1 . (Section 7.2.p. Define two estimators = and = (~/2) + (1/4). where = (XI + X2 + X3)/3. denoted by p.p) based on XI and X2.2. observations on Z yield "Z = 2" four times.3.3. (Section 7.p. Suppose the only information we have about p consists of the forecasts of n people published in the Stanford Daily. Ag(a) (1 .0 1 < c) 2 P(IY . (Section 7. each taking the value of 1 with probability p and 0 with probability 1 . Can you say one estimator is better than the other? 1 13.3) Let XI and X2 be independent. Show that Eg(T .3) Let X1.1) Suppose the probability distribution of X and Y is given as follows: P(X=l) P(Y = 7.2.2. (A more general theorem can be proved: in this model the sample mean is the best linear unbiased estimator of p in terms of an arbitrary convex loss function..p) for any convex function g and for any p. p).A)g(b) 2 g[Aa + (1 . compute the maximum likelihood estimator of p. Define the following two estimators of 0 = p(l .2) Let XI and X2 be independent.0 1 < E) for every E > 0 and > for at least one value of E. You may consider the special case where n = 2 and the true value of p is equal to 3/4. = P(X=O) = 1 . what is your maximum likelihood estimator of p? + Which estimator do you prefer? Why? 8. (Section 7. If n = 5 and r = 3.p. (Section 7.2) Show that criteria ( I ) through (5) are transitive.3.1) Suppose we want to estimate the probability that Stanford will win a football game.9 and acquit a criminal with probability 0. Note that a function g(. (Section 7." Consider the binary model: P(X. and let each take 1 and 0 with probability p and 1 . X2. (Section 7. and X3 be independently distributed r18 B(l.5) Suppose we define better in the following way: "Estimator X is better than Y in the estimation of 0 if P(Ix . =p.A)b]. = 0) = 1 .p.2. how would you estimate p? Justify your choice of estimator.2.2. If it is known that the probability a defendant has committed a crime is 0. (Section 7.i. Let two estimators of p be defined by T = (X1 + X2)/2 and S = XI. regardless of whether a defendant has committed a crime or not.1) Let X B(n. = 1) = p and P(X. "2 = 1" eight times. X2.5. find the maximum likelihood estimator of p when we observe that r jurors have acquitted the defendant. For what values of p is the mean squared error of jn smaller than that of $I? x 10. If r of them say Stanford will win.152 7 1 Point Estimation I Exercises 153 4. Show that the sample mean 2 is not the best linear unbiased estimator.p proportion ofjurors acquit a defendantwho has not committed a crime with probability 0.
2) Suppose that XI . See Box and Cox (1964). 1) and Y N(2p.2) Let XI.5. independent of each other. = 1) = a. 2.d.3. . . (a) Derive the secondround estimator k2 of the NewtonRaphson iteration (7. be a sample from the Cauchy distribution with the density f (x.2) The density of X is given by f(x) = 3/(40) f o r 0 5 x 5 0. .3.Ox). and 4.2) Let the density function of X be given by f (x) = 2x/0 = for 0 5 x 5 0. 0) = { ~ [+ l (x .3) 16. .3.5. . X. (Section 7.15) and (7. X.1) < x 5 20. P(X. Assuming that a sample of size 4 from this distribution yielded observations 1. n.1) Given f (x) = 0 exp(. (Section 7. the maximum likelihood estimator of 0 is XI. where 0 < 0 < 1. (Section 7.0. (b) For the following data. .i.4. 20. . This is called the BoxCox transformation. (Section 7.d. .51. be a sample drawn from a uniform distribution U[0 . (b) Find the maximum likelihood estimator of EX. find the maximum likelihood estimator of a. x > 0. assuming you know a priori that 0 5 p 5 0. (Section 7. Suppose X N(p. (a) Show that the maximum likelihood estimator of 0 is the same as the least absolute deviations estimator that minimizes C(X. with the common density f (x) = (1/2) exp(. = 0) = 1 .3.5.i. observations on X and Ny i.   . 3. the likelihood function has multiple maxima.0)2])1. (Section 7.4. observations on Y and show that it is best unbiased. (c) Show that the maximum likelihood estimator of EX is best unbiased.0. .3. . .i. calculate the maximum likelihood estimator of 0. . (Section 7. Obtain the maximum likelihood estimator of p based on Nx i. (b) Find the exact mean and variance of the maximum likelihood estimator of a assuming that n = 4 and the true value of a is 1.14).3. I ) . (a) Find the maximum likelihood estimator of 0. . (Section 7. . 2(x .)" 10 has a standard normal distribution. are independent and that it is known that (x.154 7 1 Point Estimation ( Exercises 155 on a single observation on X. i = 1. Derive its variance for the case of n = 3. 21.a. derive the maximum likelihood estimator of 0. .. (Section '7.5. compute k2: 17. . = 1/(40) for 0 =0 22. Assume XI < xp.19). i = 1. otherwise.1x1) (the Laplace or doubleexponential density).2) i and 6' obtained by solving (7. 23. . 15. X.i.d. . 18. Supposing that two independent observations on X yield xl and x2. . sample Y1.16) indeed Show that i maximize log L given by (7.1)/(0 .3.  Suppose that XI.2) Let XI.3. (a) Given i. Y. Y2. (b) Show that if n = 2. are i.3.3. . (Section 7. starting from an initial guess that ^XI = 1. (b) Show that it is also equal to the median of (Y. n. 0 + 0. 19. 0 > 0.]. 14.1) Suppose the probability distribution of X and Y is given as foBows: P(X. .01. (a) Show that if n = 1.d. . and the maximum likelihood estimator is not unique.1) for 0 < x 5 1. .3. Find the maximum likelihood estimator of 0. .3. .
5). respectively.4. . 1) and N(0.1. (Section 7.3) Exercises 157 24. Calculate the maximum likelihood estimator of p and an estimate of its asymptotic variance.2) Let (XI. be i. We are to observe XI. .~ . + + . . n. Y 2 .5. .].3) Suppose that X has the HardyWeinberg distribution. * . Let X and Y be independent and distributed as N(p. . . n. 2. (Section 7.5. Define 0 = exp(A). X = 2 four times. . .]. Find the maximum likelihood estimator of 0 based on a sample of size n from f and obtain its asymptotic variance in two ways: (a) Using an explicit formula for the maximum likelihood estimator. Y. 1.3) Let {X. (c) Show that the maximum likelihood estimator attains the Cram&Rao lower bound in this model.]and {X2. be i.i. Suppose that the observed values of {X. Also obtain its asymptotic variance and compare it with the variance of the sample mean. . 0 > 0. Find the maximum likelihood estimator of p assuming we know 0 < p 5 0. is unknown.) are (2.X2. . (a) Find the maximum likelihood estimate of p.4.3) Let (X. (Section 7. = 2 with probability 2p(1 . p.8 probability. . z = 1. 30. Obtain the maximum likelihood estimator of p and prove its consistency.3) Let {Xi].]. . where we assume p > 0. be independent of each other and across i. (Section 7.1.3) In the same model as in Exercise 30. (Section 7. 2. 25. . 28. 31.. be i. where 0 < p < 1. 33. . N(p. with P(X > t ) = exp(kt). V log (1 X) = o . 2.3) Suppose f (x) = 0/(1 + x ) ' " .4.4. Suppose the experiments yielded the following sequence of numbers: 1. we perform ten experiments. . 2. . 32.i = 1.156 7 1 Point Estimation 1 $9. n. H i n t :E log (1 X) = 0'. (b) Using the CramtrRao lower bound. . 5.4. n. It is also known that those who have committed a crime are acquitted with 0. 26. Suppose we observe X = 1 three times. Prove its consistency.).0. 3.d.4.d.4. each distributed as B(1. . p). Prove the consistency of jil = 1 and obtain their asymptotic distributions as N goes to infinity. 0 < x < m. 1. .3. 5. .i = 1.3. Derive the asymptotic variance of the maximum likelihood estimator of p based on a combined sample of (XI. . (b) Obtain an estimate of the variance of the maximum likelihood estimator. .4.i. where p > 0. . (Section 7.3) It is known that in a certain criminal court those who have not committed a crime are always acquitted. . . (Section 7. let N. .]and {Y.4.4. Compute the maximum likelihood estimator of p and an estimate of its asymptotic variance. Assume that all the X's are independent of all the Y's. be a random sample on N(p.d. (Section 7.2. . 2. ~ ( p ' I).2. 27. Xn) and (Yi. i = 1.i. what is your estimate of the true proportion of people who have not committed a crime? Also obtain the estimate of the mean squared error of your estimator. p). = 3 with probability (1  p)2. . .3) Using a coin whose probability of a head. and X = 3 three times.0. be the number of times and of ji2 = X = i in N trials. i = 1. respectively. If 30 people are acquitted among 100 people who are brought to the court. .z = 1.p). (Section 7.3.2. 2 . (Section 7.2 probability and are convicted with 0. . 1) and (1. X2. I ) and let (Y. Find the maximum likelihood estimator of 0 and its asymptotic variance.]. . X = 1 with probability p . In each experiment we toss the coin until a head appears and record the number of tosses required.4. . p). 1. .
(Section 7.3) Suppose f (x) = 0hexp(x/0).(3N1/N) and G2 = (3N2/ N) .25 5 0 5 1.2. 156.0)/3. (a) Find EX. (Section 7.1.a) for a < x < b.@ . 950.5. where N is unknown. Observe a sample of size n from f .3) Suppose that P(X = 1) = (1 . 273. 36.4. and we see that Y = 1 happens N1 times in N trials. Compare the asymptotic variances of the following two estimators of 0 = b . and VX. where we assume 0. (Section 7. (b) 8 = 243~(x.4. P ( Y = 1 1 X = 0) = 0. P(Y=OIX=l) = I . Suppose we observe only Y and not X.. Let X. denote the number obtained on the ith drawing. . (b) 8 = w.3) Abox contains cards on which are written consecutive whole numbers 1 through N. and 585. (Section 7.4. 0 > 0. We are to draw cards at random from the box with replacement.3) Suppose f (x) = l / ( b . where the K numbers drawn. x r 0. 34.15) in Examples 7. Observe a sample of size n from f.a: (a) 8 = maximum likelihood estimator (derive it). (Section 7.4. N 35.4. Compute their variances.3) Verify (7.1 and 7. P(X = 0) = 1  0. Derive the maximum likelihood estimator and compute its asymptotic variance.I.4. Do you think f i is a good estimator what is the numerical value of k? of N? Why or why not? 39. P ( Y = l I X = l ) =0. 38. Define = I . . Find an explicit formula for the maximum likelihood estimator of 0 and derive its asymptotic distribution. P(Y = 0 1 X = 0) = 0.4. (c) If five drawings produced numbers 411.g)'/n. and P(X = 3) = '/S Suppose Xis observed N times and let N. P(X = 2) = (1 + 0)/3. Find EN and V N . be the number of times X = i.5. is the average value of (b) Define estimator = 2 z .4.3) Let the joint distribution of X and Y be given as follows: P(X = 1) = 0.  37.4.Compare the asymptotic variances of the following two estimators of 0: (a) 8 = maximum likelihood estimator (derive it). (Section 7.158 7 1 Point Estimation I Exercises 159  .
p ) . has the same practical connotation as the word probability. (An exception occurs in Example 8. in effect. equivalently. . If we toss the coin 100 times and get 50 heads.8. the more fully it captures the relevant information contained in the sample. b] . we shall define a confidence interval mainly when the estimator used to construct it is either normal or asymptotically normal. Although there are certain important differences. we do not use it concerning parameters. The word confidence. Because we can never be certain that the true value of the parameter is exactly equal to an estimate. the greater confidence we have that 0 is close to the observed value of 8. a better estimator. suppose we want to know the true probability.2.p ) / n ] . Xn) is a given estimator of a parameter 0 based on the sample XI. we would like to know how much confidence we can have that 0 lies in a given interval. 0.4. . This is an act of interval estimation and utilizes more information contained in 8. This restriction is not a serious one. be distributed as B(1. carries out statistical inference. we will have more confidence that p is very close to %.6 or 0. although we are fairly certain that p will not be 0.9 or 0. "a 0. The classical statistician's use of the word confidence may be somewhat like letting probability in through the back door. 8. given 6. . . We would like to be able to make a statement such as "the true value is believed to lie within a certain interval with such and such wnjidence. Note that we have used the word conjidence here and have deliberately avoided using the word probability. X2. Thus." A confidence interval is constructed using some estimator of the parameter in question. More information is contained in 8: namely. suppose that 8(x1. Therefore. In Section 8. which may be biased in either direction. p. . in classical statistics we use the word probability only when a probabilistic statement can be tested by repeated observations.1. where a chisquare distribution is used to construct a confidence interval concerning a variance. We toss it ten times and get five heads. ." This degree of confidence obviously depends on how good an estimator is. E X A M P L E 8. Xn The estimator 8 summarizes the information concerning 0 contained in the sample.p ( l . therefore. For example.2 1 Confidence Intervals 161 8. i = 3. because most reasonable estimators are at least asymp totically normal. we would like to know how close the true value is likely to be to an estimated value in addition to just obtaining the estimate.) The concept of confidence or confidence intervals can be best understood through examples. however. N [ p . Our point estimate using the sample mean is %. . . who uses the word probability for any situation.2. Although some textbooks define it in a more general way.2 CONFIDENCE INTERVALS 1 IE We shall assume that confidence is a number between 0 and 1 and use it in statements such as "a parameter 0 lies in the interval [a. the smaller the mean squared error of 8. 2 .95 confidence. X2." or.5. but we must still allow for the possibility that p may be. the classical and Bayesian methods of inference often lead to a conclusion that is essentially the same except for a difference in the choice of words. b] with 0.1 T =X A Let X. n. More generally. How should we express the information contained in 8 about 0 in the most meaningful way? Writing down the observed value of 8 is not enoughthis is the act of point estimation. say.3 we shall examine how the Bayesian statistician. because we will have. . .1 INTRODUCTION Obtaining an estimate of a parameter is not the final purpose of statistical inference. The better the estimator. As discussed in Section 1. Then .1.95 conjidence interval for 0 is [a. . of getting a head on a given coin. we have .
Moreover.4).p)'/p(l . which in this example is the interval [0.3) may be equivalently written as = 0. Then we could .2. we have which may be further rewritten as where Thus a 95% confidence interval keeps getting shorter as n increasesa reasonable result. however.1) we have approximately Suppose an observed value of T is t .2. and C(a* < p < b*) = yk*. It is easy to see that this function shoots up to infinity near P = 0 and 1. because it concerns a random variable T. we have from (8. (8.4) can be written as yk = P(IZI < k). because there are many important sets for which confidence cannot be defined. since yk = 0. respectively.2. For example. By definition. we have C ( I 1 )2 C(Z2).5. Then we define mnjkhw b y If n = 100 and t which reads "the confidence that P lies in the interval defined by the inequality within the bracket is y c or "the yk confidence interval of p is as indicated by the inequality within the bracket.6) is a legitimate one. The probabilistic statement (8. assuming n = 10 and t = 0. Confidence is not as useful as probability. b) and (a*.96. Equation (8. For this purpose. C(a < p < a*) cannot be uniquely determined from definition (8.8) Then we can evaluate the value of yk for various values of k from the standard normal table. confidence clearly satisfies probability axioms ( 1 ) and ( 2 ) if in ( 2 ) we interpret the sample space as the parameter space.1 shows that if interval Zl contains interval Z2.4) is motivated by the observation that the probability that T lies within a certain distance from p is equal to the confidence that p lies within the same distance from an observed value of T. Let us construct a 95% confidence interval of p. 1 ) .2. consider n ( t .2. Thus the intervals (a. l ) and define (8. t ) and increasing in ( t . and we may further define C [ ( a< p < a*) U (b* < p < b)l = yk . Then.h2(T) 1 contains pwith probability yk. we have a proportionately large confidence that the parameter is close to the observed value of the estimator.4) defines C(a < p < b) = yk.8) is appealing as it equates the probability that a random interval contains p to the confidence that an observed value of the random interval contains p. In Bayesian statistics we would be able to treat p as a random variable and hence construct its density function. (8. Figure 8.2.2." Definition (8.95 when k = 1.p) as a function of p for fixed values of n and t. This is definitely a shortcoming of the confidence approach.2) Similarly. Note that this definition establishes a kind of mutual relationship between the estimator and the parameter in the sense that if the estimator as a random variable is close to the parameter with a large probability.yk*. This function is graphed in Figure 8.2 1 Confidence Intervals 163 Let Z be N ( 0 .2. It states that a random interval [hl ( T ) . Next we want to study how confidence intervals change ask changes for fixed values of n and t. and is decreasing in the interval ( 0 .4).162 8 1 Interval Estimation 8. b*) correspond to the yk and yk* confidence intervals. From (8.2.5.1.2.This suggests that we may extend the definition of confidence to a Iarger class of sets than in (8. even after such an extension. so that confidence satisfies probability axiom (3) as well. We have also drawn horizontal lines whose ordinates are equal to k%nd k*2.2. 11. attains the minimum value of 0 at p = t. Definition (8. For example.
~(p.88 < p < 6.wherepisunknown and a2is known. For. Its density is symmetric around 0 and approaches that of N(0. .(x. we can construct another by raising the portion of the function over (0.1 degrees offeedom. given T = t.cr~ 1.2. We may be tempted to define N(t.2 1 Confidence Intervals 165 I Therefore.1 Therefore the interval is (5.2 E X A M P L E 8 . t) and lowering the portion over (t. . ) . n = 10. of t. shown above.2.04. Note that (8. It depends only on n and is called the Student's t distributionwith n . where yk for various values of k can be computed or read from the Student's t table. the function obtained by eliminating the left half of the normal density N(t. 02/n) and doubling the right half will also serve as a confidence density. We can construct a 95% confidence interval by putting t = 6.14) yn = P(lt.2.2. the greater the confidence that p lies within the same distance from t . i~. ~et~. Define EXAMPLE 8. = 1 .12) defines confidence only for intervals with the center at t.96 in (8.ll  ~ < k). See Theorem 5 of the Appendix for its derivation. n .04) and that the average height of ten students is observed to be 6 (in feet). there is no unique function ("confidence density. ~ ( p . . Let T = be the estimator of p and let s2 = be the estimator of a2. Then the probability distribution ~'c:=.2. a2/n) as a confidence density for p. with both p and a2unknown.p) is known and has been tabulated. For example.1 Construction of confidence intervals in Example 8. Define   (8." so to speak) such that the area under the curve over an interval gives the confidence of the interval as defined in (8.n.4).12).15) S .2. o * )i . In other words.12) as F I G U R E 8.2. Then we have (8. This is not possible in the confidence approach. given one such function. we define confidence Thus the greater the probability that T lies within a certain distance from p.2. 2 . a2/n). 1) by the right amount. calculate the probability that p lies in a given interval simply as the area under the density function over that interval. = . 0.l = S'(T . and k = 1. a2= 0. We have T If N(p. Suppose that the height of the Stanford male student is distributed as N(p. 2 . 1) as n goes to infinity. but this is one among infinite functions for which the area under the curve gives the confidence defined in (8.164 8 ( Interval Estimation I I 8. . 3 Suppose that X.12).2.
2 in (8. See Section 10. a').It is natural to use the sample variance defined by s2 = n n 1 C.18) does not follow from equation (11) of the Appendix. The larger interval seems reasonable. 2 . This example differs from the examples we have considered so far in that (8. assume that {Xi} are independent of {Y. given the observed values 2. This time we want to define a confidence interval on a2. Putting t = 6 and s = 0. b.19). and that a random sample of 40 unemployment spells of male workers lasted 40 days on the where k1 and k2 are chosen so as to satisfy (8.4 E X A M P L E 8 . and   where we can get varying intervals by varying a. and proceed as follows:  Thus.19) yields Consider the same data on Stanford students used in Example 8. . . = 2 ' into (8. .2. u 2 ) . consider constructing a 0. we get Therefore the 95% confidence interval of p is (5. because in the present example we have less precise information. a  yconfidence interval is < ~ k).2. . s$ = (2.2. . then. N ( p .5) . I The assumption that X and Y have the same where yk = ~ ( l t ~ .2.24) ui P(kl < X:l < k2) = 7 .04. s i . and d? We reverse the procedure and start out with a statistic of which we know the distribution and see if we can form an interval like that in (8. S = s. 2 inserting k = 2.].2. variance is crucial. given in Theorem 3 of the Appendix. nx = 35.2 1 Confidence Intervals 167 Given T = t.95.2.2.3. c. Using it. 2. because otherwise (8. but assume that u2is unknown and estimated by s'. as in Example 8. we define confidence by average with a standard deviation of 2 days. we can define the confidence interval for px .21). used in the case of A s an application of the formula (8. E X A M P L E 8. and d.2. j = 40. .5 days. . given that a random sample of 35 unemployment spells of female workers lasted 42 days on the average with a standard deviation of 2.2. and s.95 confidence interval for the true difference between the average lengths of unemployment spells for female and male workers. .2.166 8 1 Interval Estimation 8.16).85 < p < 6. . . as shown in Theorem 6 of the Appendix. 5 Let X. ny.24) does not determine kl and k2 uniquely. i = 1.21) for various values of a.. given the observed value defined by  ? of s'. letYi N ( p y .=l(X. n. Note that this interval is slightly larger than the one obtained in the previous example. we would like to define a confidence interval of the form  x)2. u 2 ) .2. which is observed to be 0. b. can we calculate the probability of the event (8. nx.. .Then. c. Since ~ ( l t< ~ 2~ ) ( 0. with both p and u2 unknown. j. 2 = 42.2.i = 1 . In practice it is customary to determine these two values so as to satisfy . . ny = 40. i = 1.3.py by rk Therefore. L e t x i N ( p x . and s. We begin by observing ns2/u2 X:l.15).1 for a method which can be f a. A crucial question is. + ~ .
23) yields the confidence interval (27. is either normal or asymptotically normal. which does not differ greatly from the one obtained by the more exact method in Example 8.2. (8. .23).4. as is usually the case. consider the same data given at the end of Example 8. we may define confidence approximately by where Z is N ( 0 . using the estimator.26). we can define a density function over the parameter space and thereby consider the probability that a parameter lies in any given interval. whereas in the Bayesian statistics the posterior distribution is obtained directly from the sample without defining any estimator. (26.2. as will be shown below. 8.5. Then the posterior distribution of P given the sample. This probability distribution. we obtain an alternative confidence interval. Note that in classical statistics an estimator is defined first and then confidence intervals are constructed using the estimator. or 8. Assume that the height is normally distributed.2. kl = 74. kl and kg can be computed or read from the table of chisquare distribution. is calculated via Bayes' theorem as follows: . 6 Besides the preceding five examples. which he has before he draws any marble.3 BAYESIAN METHOD E X A M P L E 8 . respectively. as estimator of 0.27) S ' A N ( U ~ u4/50). 45.50).2. Then. because in it we can treat a parameter as a random variable and therefore define a probability distribution for it. we can define estimators using the posterior distribution if we wish. denoted by A.2.2.2. moreover. it is better to define confidence by the method given under the respective examples.22. If the parameter space is continuous. 48. We shall subsequently show examples of the posterior distribution and how to derive it.2. but a shortcoming of this method is i 1 [ i Suppose he obtains three red marbles and two white marbles in five drawings. even though we can also use the approximate method proposed in this example. by Theorem 4 of the Appendix. As an application of the formula (8. Suppose he believes that p = % is as three times as likely as p = '/2.2. The fraction of the marbles that are red is known to be either p = '/s or p = %.26). l ) and t and v are the observed values of T and V.2. 8. given that the sample variance computed from a random sample of 100 students gave 36 inches. 3 = 36.5. which was proved in Theorem 2. where the true value of the parameter is likely to lie. 3 .02. If the situations of Examples 8. 2 . that confidence can be defined only for a certain restricted sets of intervals. We are to guess the value of p after taking a sample of five drawings (replacing each marble drawn before drawing another).79.56 into (8. The two methods are thus opposite in this respect. It is derived by Bayes' theorem. defined over the parameter space. embodies all the information an investigator can obtain from the sample as well as from the a priori information. 1 Suppose there is a sack containing a mixture of red marbles and white marbles.98). If. The Bayesian expresses the subjective a priori belief about the value of p.5 actually occur. In the Bayesian method this problem is alleviated. the variance of T is consistently estimated by some estimator V. This is accomplished by constructing confidence intervals. As an application of the formula (8.3 1 Bayesian Method 1 69 Given y. in the form of what is called the prior distribution. and k p = 129. there are many situations where T. see DeGroot (1970) and Zellner (1971). Inserting n = 100.3.2. consider constructing a 95% confidence interval for the true variance u2 of the height of the Stanford male student.4.1.2. E X A M P L E 8 . called the posterior distribution. so that his prior distribution is We have stated earlier that the goal of statistical inference is not merely to obtain an estimator but to be able to say. After the posterior distribution has been obtained. For more discussion of the Bayesian method. 8.168 8 1 Interval Estimation 8. Estimating the asymptotic variance a4/50 by 36'/50 and using (8.
If the Bayesian's prior distribution assigned equal probability to p = l/g and p = % instead of (8. where the expectation is calculated using the posterior distribution. which assigns a larger probability to the event p = %. which indicates a greater likelihood that p = y2than p = '/s. however. because probability to him merely represents the degree of uncertainty.2).3 1 Bayesian Method 171 Loss matrix in estimation State of Nature Decision p=% p=% This calculation shows how the prior information embodied in (8. The Bayesian. whereas the Bayesian allows his conclusion to be influenced by a strong prior belief indicating a greater probability that p = %. he incurs a loss ye Thus the Bayesian regards the act of choosing a point estimate as a game played against nature. What if we drew five red marbles and two white marbles. If he wanted to make a point estimate.the Bayesian may or may not wish to pick either p = '/3 or p = % as his point estimate. It indicates a higher value of p than the Bayesian's a priori beliefs: it has yielded the posterior distribution (8. The difference occurs because the classical statistician o b tains information only from the sample. then his estimate would be the same as the maximum likelihood estimate.2). Suppose we change the question slightly as follows. because it contains all he could possibly know about the situation. therefore. the classical statistician must view the event ( p = %) as a statement which is either true or false and cannot assign a probability to it. (8.174) 8 1 Interval Estimation TABLE 8 . This estimate is different from the maximum likelihood estimate obtained by the classical statistician under the same circumstances. Since this is a repeatable event.1). let us assume y. 1 8. is free to assign a probability to it. the classical statistician can talk meaningfully about the probability of this event. if he chooses p = 1/3 when p = % is in fact true. There are four sacks containing red marbles and white marbles. In the wording of the present question.3. We are to pick one of the four sacks at random and draw five marbles. One of them contains an equal number of red and white marbles and three of them contain twice as many white marbles as red marbles. For example. The reader should recognize the subtle difference between this and the previous question. = y2. the posterior distribution now becomes . the event ( p = %) means the event that we pick the sack that contains the equal number of red marbles and white marbles.3. If three red and two white marbles are drawn. in the previous question.3. what is the probability that the sack with the equal number of red and white marbles was picked? Answering this question using Bayes' theorem. as given in Table 8.2) would be sufficient. there is only one sack. he chooses p = '/3 as his point estimate if For simplicity. Given the posterior distribution (8.3.39 as before. The prior probability in the previous question is purely subjective. hence.1)has been modified by the sample. In this case the Bayesian's point estimate will be p = %. instead? Denoting this event by B. He chooses the decision for which the expected loss is the smallest. we obtain 0. If he simply wanted to know the truth of the situation.3. whereas the corresponding probability in the present question has an objective basis.1. he would consider the loss he would incur in making a wrong decision. In the present example. In contrast.
we can write Bayes' formula in this example as These calculations show that the Bayesian inference based on the uniform prior density leads to results similar to those obtained in classical inference.3 I Bayesian Method 173 Using (8.We have from (8. We shall now consider the general problem of choosing a point estimate of a parameter 0 given its posterior density. This problem is a generalization of the game against nature considered earlier.12) where the denominator is the marginal probability that X = k.3. for the purpose of illustration.2. In Example 8. Let 8 be the estimate and assume that the loss of making a wrong estimate is given by (8.3. We shall assume that n = 10 and k = 5 in the model of Example 8.3.2366 < p < 0.8) for nonnegative integers ( l + n + m ) ! nandm.7.7).the Bayesian can evaluate the probability that p falls into any given interval. Therefore we have Loss = (8 . Using the result of Section 3.2 Let X be distributed as B ( n . assuming y. f l ( 0 ) . n! m! Note that the expectation is taken with respect to 0 in the above equation.o ) ~ . It is more realistic to assume that p can take any real number between 0 and 1. p ) .9) we obtained the 95% confidence interval as (0. say.2.3.1.p)nk k!(n . that is.7) In this case the Bayesian would also pick p = % as his estimate. = y2 as before.13) with respect to 8 to 0 .2. Then the Bayesian chooses 8 so as to minimize  ( n + I ) ! pk(l .k)! where the second equality above follows from the identity (8. because the information contained in the sample has now dominated his a priori information. Equating the derivative of (8. E X A M P L E 8. since 0 is the random variable and 6 is the control variable.3.172 8 1 Interval Estimation 8.8) we can calculate the 80% confidence interval We have from (8. the conditional density of p given X = k.3. we assumed that p could take only two values.1 and compare the Bayesian posterior probability with the confidence obtained there. Suppose we a priori think any value of p between 0 and 1 is equally likely.3.7) Suppose the observed value of X is k and we want to derive the posterior density of p. In (8. we obtain .3.7634). This situation can be expressed by the prior density From (8.
Therefore we finally obtain E X A M P L E 8 .13).3. the posterior density is proportional to the likelihood function. For example. be the observed value of X. . Although the classical statistician uses an intuitive word.7).. 3 . k2). x2.7). Classical statistics may therefore be criticized. In this case the difference between the maximum likelihood estimator and the Bayes estimator can be characterized by saying that the former chooses the maximum of the posterior density. Let x. Let us apply the result (8. .We shall write the exponent part successively as . p = 1. u2). if a head comes up in a single toss (k = 1. that is.3. As n approaches infinity.14) we have assumed that it is permissible to differentiate the integrand in (8.3. where a2 is assumed known. for ignoring the shape of the posterior density except for the location of its maximum. as long as the estimator is judged to be a good one by the standard of classical statistics. .3. Note that if the prior density is uniform. . the Bayes estimator is the expected value of 0 where the expectation is taken with respect to its posterior distribution. an intermediate step between classical and Bayesian principles. As this example shows. as in (8.15) to our example by putting 0 = p and f l(p) = f ( p 1 X = k) in (8. was proposed by Birnbaum (1962). because it can give her an estimator which may prove to be desirable by her own standard. n = I).3.3. we obtain I 1: Suppose the prior density of p is N ( p O . Then the likelihood function of p given the vector x = (xl. both estimators converge to the true value of p in probability.3.1 74 8 1 Interval Estimation 8. where cl is chosen to satisfy JLf ( p ( x) d t ~= . In words. from the Bayesian point of view. 2. .) is given by We call this the Bayes estimator (or. It gives a more reasonable estimate of p than the maximum likelihood estimator when n is small. x.3 1 Bayesian Method 175 Note that in obtaining (8. the Bayesian method is sometimes a useful tool of analysis even for the classical statistician. whereas the latter chooses its average.2 and found to have a smaller mean squared error than the maximum likelihood estimator k / n over a relatively wide range of the parameter value. . n.2. Then the posterior density of p given x is by the Bayes rule This is exactly the estimator Z that was defined in Example 7. . 3 Let { X . "likelihood. the Bayes estimator under the squared error loss). i = 1." she is not willing to make full use of its implication. as we can see from (8. . the Bayesian estimate p = % seems more reasonable than the maximum likelihood estimate.5). 1. Using the formula (8. Nothing prevents the classical statistician from using an estimator derived following the Bayesian principle. ) be independent and identically distributed as N ( p . more precisely. however.8) again. The likelihood principle.
0 otherwise.2 and 8. Assuming that the prior density of 0 is given by 8.3.24).22) becomes   1 (log 0. We might think that a uniform density over the whole parameter domain is the right prior that represents total ignorance. we could have obtained a result identical to (8.22).3.3.(Cf. This.3.1. might unduly influence statistical inference.2.3. however.21) by calculating One weakness of Bayesian statistics is the possibility that a prior distribution. Example 7. but this is not necessarily so. nlo2). This result is a consequence of the fact that 2 is a sufficient statistic for the estimation of p.3 1 Bayesian Method 177 Therefore we have f (0) = = 5 for 9.3.23) is what we mentioned as one possible confidence density in Example 8.8)(10. EXAMPLE Let X be uniformly distributed over the interval (0.5 < 0 < 9. A.4.2. Fisher and his followers in an effort to establish statistics as an objective science. is not always possible. This weakness could be eliminated if statisticians could agree upon a reasonable prior distribution which represents total prior ignorance (such as the one considered in Examples 8.3. The classical school. implies a nonuniform prior over 0: . which is the product of a researcher's subjective judgment. was developed by R.5).3.) Since we have . for 1 < p < 2 .3.22) depends on the sample x only through 3.) As we let A approach infinity in (8. a uniform prior over p. Then (8.3.12) whenever the latter is defined. f ( p ) = l . where cg = ( Therefore we conclude that the posterior distribution of p given x is I calculate the posterior density of 0 given that an observation of X is 10. if parameters 0 and p are related by 0 = p'.3.23) coincides with the confidence given by (8.24) f (3 1 p) = exp G u [ n ( i .p)2] 2u2 Using (8. 10. (Cf. Example 7. I / * ) ( / U X ) in order for f ( p X) to be a density. we have (8. Note that the righthand side of (8. the prior distribution approaches that which represents total prior ignorance. The probability calculated by (8. in fact.N ( p . For example.4 =0 otherwise.0) Note that (8. We have E ( p I x) in the above formula is suggestive.2.1 76 8 1 Interval Estimation 8.3) in every case. as it is the optimally weighted average of 2 and PO.7.2.3.5 .
p). construct an 80% confidence interval on the difference of the mean rates of recovery of the two groups ( p l . (Section 8.2) If 50 students in an econometrics class took on the average 35 minutes to solve an exam problem with a variance of 10 minutes.. Calculate the posterior probabilities Pl(p) = P ( p ) X1 = 1) and P2(P) = P(p I X1 = 1. 2. (Section 8.1.2. X2 = 1). Comparison of Bayesian and classical schools Classical school Use confidence intervals as substitute. what is her estimate of p?  EXERCISES 1. for 1/2 < 0 = O otherwise. How many times should you toss the coin in order that the 95% confidence interval for p is less than or equal to 0. Answer using both exact and asymptotic methods. p. Obtain an 80% confidence interval for 0 assuming 2 = 10. (Section 8.2.2) The heights (in feet) of five randomly chosen male Stanford students were 6.2) A particular drug was given to a group of 100 patients (Group I ) . . Exercises 179 f(0) = K 2 . 5.7. and the prior density of 0 is given by I * f(0) = 1/0' for 0 r 1.2 summarizes the advantages and disadvantages of Bayesian school vis2vis classical statistics. N(0. If sample size is large. *Can use good estimator such as sample mean without assuming any distribution. . . 6.2 4.i = 1. *No need to calculate complicated integrals. Also calculate P ( p I Xp = 1) using Pl (p) as the prior probabilities. Bayes inference may be robust against misspecification of distribution. (Section 8. otherwise.3)  A  . of a particular coin.2) Suppose X. *Bayes estimator is good. Let the prior probabilities of p be given by P ( p = %) = 0.5. 6. even by classical standard. (Section 8. Note: Asterisk indicates school's advantage.  Table 8. p).p2).3) Suppose the density of X is given by /I f(x1 0) = 1/0 for 0 =0 5 x 5 0.3) Let XI and X2 be independent and let each be B(1. 5. Bayesian school *Can make exact inference using posterior distribution. Find a 90% confidence interval for the mean height. 100. . and two heads appear in two tosses. Use prior distribution that represents total ignorance. If her prior density is f (p) = 6p(l . otherwise. TAB L E 8. *No need to obtain distribution of estimator.2) Suppose you have a coin for which the probability of a head is the unknown parameter p.5 and P ( p = 5/4) = 0. maximum likelihood estimator is just as good.8. 5.3.. e2). .. =0 . Compare it with P2(p). and 6. "Objective inference. assuming the height is normally distributed.  A    A Bayesian is to estimate the probability of a head. 0 5 p I 1. (Section 8. (Section 8. Assuming that 60 patients of Group 1 and 50 patients of Group 2 recovered. 178 8 1 Interval Estimation 1 < 1.5 in length? 8. and no drug was given to another group of 100 patients (Group 2). (Section 8. 3. construct a 90% confidence interval for the true standard deviation of the time it takes students to solve the given problem.
(Section 8. 11. (Section 8. p) and the prior density of p is given by f (p) = 1 for Let X 0 5 p 5 1. (a) Find the maximum likelihood estimator of 0 and obtain its exact mean squared error. . derive the Bayes estimator of 0 based on a single observation of X.3) The density of X. (Section 8. (a) Find the maximum likelihood estimator of 0. p. 9. otherwise.A)fz(x). is given by where e = p . (b) Assuming the prior density of 0 is f (0) = 02 for 0 2 1. Assuming the prior density of A is uniform over the interval [0.1)/(0 . Derive the maxiwhere f mum likelihood estimator of A based on one observation on X. find the Bayes estimator of p.5 for 0 5 0 =0 2. we observe that a head appears in the kth toss.$1. what is your estimate p? What if your prior density of p is given by f (p) = 1 for 0 5 p 5 12 and define {Y.0). assuming that the observed value of X is 2.3) Suppose that a head comes u p in one toss of a coin. 16. 14. f (0) = 0. given an unknown parameter A E [O. i = 1 and 2. (b) Find the Bayes estimator of 0 using the uniform prior density of 0 given by 2x/0 for 0 5 x 5 0. Assuming the uniform prior density. (Section 8. given X = 1. We do not observe { X i ) . and P ( p = 4/5) = 1/3 and your loss function is given by Ip . (Section 8. change the loss function to L(e) = Obtain the Bayes estimate of p. lei.1]. given X = 1. derive the Bayes estimator of X based on one observation on X.3) Suppose the density of X is given by f (x I A) = Af1(x) + (1 . Obtain the Bayes estimate of p. In the experiment of tossing the coin until a head appears. If your prior probability distribution of the probability of a head. is given by P ( p = %) = 1/3. = 2(x . 11. where 0 < 0 < 1.3) We have a coin for which the probability of a head is p. Suppose the loss function L ( . 12.3) B(1.p. ) is given by  Suppose we observe {Yi).i. (Section 8. (Section 8. Obtain its exact mean squared error.d. Assuming the prior densityf (0) = 60(1 . P(p = %) = l/g. find the Bayes estimate of 0. 15.3) Let ( X i ]be i.1) for 0 < x 5 1. and find Y 1 = 1 and Y p = 0.3) In the preceding exercise. and f 2 ( . with the density I Exercises 181 Obtain the Bayes estimate of 0. where we assume 0 r 0 5 2.) by 10.180 8 1 Interval Estimation 13. Suppose we want to estimate 0 on the basis of one observation on X . (Section 8. ) are known density functions.3) Let the density function of X be given by f(x) = =0 otherwise.
Thus X is an nvariate random variable taking values in En. however. the complement of R in En. D E F l N ITION 9 . In hypothesis testing we pay special attention to a particular set of values of the parameter space and decide if that set is likely or not. the other is that the classical theory of hypothesis testing is woefully inadequate for the realistic case.) if X E R. In practice. In Chapter '7 we called a statistic used to estimate a parameter an estimator. and the alternative hypothesis.9. ndimensional Euclidean space. denoted Ho. true. For example. A Type 1 when H I is true). the assumption that p = % in the binomial distribution is a simple hypothesis and the assumption that p > '/2 is a composite hypothesis.2 and 9. Most textbooks. we shall treat a critical region as a subset of E.  The question of how to determine the critical region ideally should depend on the cost of making a wrong decision. In hypothesis tests we choose between two competing hypotheses: the null hypothesis. . The set R is called the region o f rejection or the critical region of the test. . Thus the question of hypothesis testing mathematically concerns how we determine the critical region. the most interesting case is testing a composite hypothesis against a composite hypothesis. 1 . .1 INTRODUCTION There are two kinds of hypotheses: one concerns the form of a probability distribution. and the other concerns the parameters of a probability distribution when its form is known. and the hypothesis that the mean of a normally distributed sample is equal to a certain value is an example of the second. D E F l N IT10 N 9. Then a test of the hypothesis Hamathematically means determining a subset R of En such that we reject Ho (and therefore accept H I ) if X E R. a test of a hypothesis is often based on the value of a real function of the sample (a statistic). Sections 9. 1 A hypothesis is called simple if it specifies the values of all the parameters of a probability distribution. The purpose of estimation is to consider the whole parameter space and guess what values of the parameter are more likely than others.1 .4 and 9.2 1 Type I and Type I1 Errors 183 9 TESTS OF HYPOTHESES As we shall show in Section 9. In this regard it is useful to define the following two types of error. The hypothesis that a sample follows the normal distribution rather than some other distribution is an example of the first. A Type I error is the error of re~ecting Ha when it is 1error is the error of accepting Howhen it is false (that is. We make the decision on the basis of the sample (XI. If T(X) is such a statistic. Xn).3. the critical region is a subset R of the real line such that we re~ect H a if T(X) E R. 2 . There are two reasons: one is that we can learn about a more complicated realistic case by studying a simpler case. devote the greatest amount of space to the study of the simple against simple case. because the event T(X) E R can always be regarded as defining a subset of the space of X. denoted simply as X. Throughout this chapter we shall deal with tests of hypotheses of the second kind only.2 TYPE I AND TYPE II ERRORS 9. In Sections 9. and we accept Ha (and therefore reject H. A hypothesis may be either simple or composite.5 will deal with the case where one or both of the two competing hypotheses may be composite. In the general discussion that follows. it is called composite. 9. A statistic which is used to test a hypothesis is called a test statistic. Specifjrlng the mean of a normal distribution is a composite hypothesis if its variance is unspecified. compared with some other set.3 we shall assume that both the null and the alternative hypotheses are simple. .X2. denoted HI. Otherwise..
are given by (9. The remaining tests that we need to consider are termed admissible tests.2.2. and suppose we are to test Ho: p = % against HI: p = 3/4 on the basis of one observation on X.8)P2. Classical statisticians usually fail to do this. or the socalled loss function.6 respectively. The first test is better (or more powerful) than the second test if al 5 a 2 and PI 5 Pp with a strict inequality holding for at least one of the 5. as illustrated in Figure 9.1. If the probabilities of the two types of error P2).2 plots the characteristics of the eight tests on the a. respectively.2. Figure 9. Ifwe cannot determine that one test is better than another by Definition The probabilities of the two types of error are crucial in the choice of a critical region.2. say. Such a test can be performed if a researcher has a coin whose probability of a head is 6 . should ideally be devised by considering the relative costs of the two types of error. p).1 Relationship between a and P Pq) be the characteristics of two DEFlN I T I O N 9 . + ( 1 . where 6 is chosen a priori. a and P are represented by the areas of the shaded regions. Any point on the line segments connecting (1)(4)(7)(8) except the end points themselves represents the characteristics of an admissible ran . We denote the probability of Type I error by a and that of Type I1 error by P. we should choose the one with the smaller value of P. P a = 6a1 + ( 1 .P). tests. 3 are as small as possible. Therefore we can write mathematically 9. R1 and R P .6 ) a 2 and 6 3 = SP. if Type I error is much more costly than Type I1 error. DE F I N IT10 N 9 .1 describes the characteristics of all the nonrandomized tests. therefore. Construct all possible nonrandomized tests for this problem and calculate the values of a and p for each test. 2 .3) The following examples will illustrate the relationship between a and as well as the notion of admissible tests.2 1 Type I and Type II Errors 185 sider only the critical regions of the form R = { x I x > c].respectively. If we con E X A M P L E 9 . A n optimal test. In Section 9. Otherwise it is called admissible. 2 Let (al. Even if we do not know the relative costs of the two types of error.with probabilities 6 and 1 . and she decides in advance that she will choose R1 if a toss of the coin yields a head and R2 otherwise. we should devise a test so as to make a small even though it would imply a large value for P. Such a test is called a randomized test. the two types of error for the randomized test. we must consider the relative costs of the two types of errors. and R2 are ( a l . because a consideration of the costs tends to bring in a subjective element. p ) the characteristics of the test. denoted as ( a . this much is certain: given two tests with the same value of a.3 we shall show how the Bayesian statistician determines the best test by explicit consideration of the costs. Table 9. P I ) and (ap. P plane. however. 2 . the probabilities of for R . We call the values of ( a . 1 Let X be distributed as B(2.2 is useful to the extent that we can eliminate from consideration any test which is "worse" than another test.2. 2 .P I ) and (a2. Definition 9. and > The probability of Type I error is also called the size of a test.184 9 1 Hypotheses 9.2. In the figure the densities of X under the null and the alternative hypotheses are f ( x 1 H o ) and f ( x I H I ) . 3 A test is called inadmissible if there exists another test which is better in the sense of Definition 9. We want to use a test for which both a and / Making a small tends to make P large and vice versa. Sometimes it is useful to consider a test which chooses two critical regions. For example. Thus we define FIG u R E 9.
the reader should carefully distinguish two terms. In stating these definitions we identify a test with a critical region. We are to test Ho: 0 = 0 against H1: 0 = 1 on the basis of a single observation on X.1 TWO types of errors in a binomial example Test R R a = P(RI Ho) P= I HI) if X = 2.5 We shall illustrate the two terms using Example 9.) 5 a ) . it is inadmissible because it is dominated by some randomized tests based on (4) and (7).2 FIG u R E 9. size and level.3.1. =1+0x domized test. (7).2.2. (3).2 we defined the more powerful of two tests. and (5) are all dominated by (4) in the sense of Definition 9.2.2. w) where 0 5 t 5 1. For example. are graphed in Figure 9. In the two definitions that follow. Represent graphically the characteristics of all the admissible tests. we do not need to use the word &el.2.1 5 x < 8. denoted by fo(x) and fl(x).2. the randomized test that chooses the critical regions of tests (4) and (7) with the equal probability of '/2 has the characteristics a = % and p = 1/4 and therefore dominates (6). . otherwise. Note that if we are allowed randomization. The most powerful nonrandomized test of level S/s is (4). such that a(R.1 where all the possible tests are enumerated. D E F I N I T I O N 9.2. (R) 5 (R1). Such a randomized test can be performed by choosing Hoif X = 0. but the definitions apply to a randomized test as well. P(R) 5 P(Rl). Although test (6) is not dominated by any other nonrandomized test. a is represented by the area of the .186 9 1 Hypotheses 9.2. The most powerful randomized test of size S/s is S/4 . it is natural to talk about the most powerful test. Intuitively it is obvious that the critical region of an admissible nonrandomized test is a halfline of the form [t. E X A M P L E 9. (It may not be unique.2. choosing HI =0 for 0 5 x 5 0 + 1 . (4) + '/.2 1 Type I and Type II Errors 187 T A B L E 9. D E F I N I T I O N 9. and. flipping a coin and choosing H o if it is a head and H I otherwise. In Figure 9.4) f (x) = 1 . Tests (2).2 LetX have the density Two types of errors in a binomial example (9. In Definition 9.) Risthemostpowerfultestofleuelaifa(R) s o l a n d f o r any test R1 of level a (that is. if X = 1. We can state: The most powerful test of size '/4 is (4). When we consider a specific problem such as Example 9.3.0 +x for 0 . It is clear that the set of tests whose characteristics lie on the line segments constitutes the set of all the admissible tests.4 R i s the m o s t p o w e r f u ~ t e s t o f s i z e a i f a ( R= ) a and for any test R1 of size a . The densities of X under the two hypotheses.
For her it is a matter of choosing between Ho and H I given the posten'orprobabilities P ( H oI x ) and P ( H l I x) where x is the observed value of X. A Bayesian interpretation of the NeymanPearson lemma will be pedagogically useful here. by the area of the darker triangle.2. In other words. T H EoREM 9. For example. FIG u R E 9 .188 9 1 Hypotheses 9. is given by 9. the Bayesian problem may be formulated as that of determining a critical region R in the domain of X so as to (9. &.3. which we state without proof. her solution is given by the rule (9. Therefore. The latter is due to Neyman and Pearson Alternatively.4.3 NEYMANPEARSON LEMMA In this section we study the Bayesian strategy of choosing an optimal test among all the admissible tests and a practical method which enables us to find a best test of a given size. 4 A set of admissible characteristics Eliminating t from (9.1 Reject Ho if y l p ( H o1 x) < y2P(Hl I x ) . A more general result concerning the set of admissible characteristics is given in the following theorem.2. her critical region. if we choose H o when H I is in fact true. convex function which starts at a point within [O.P plane is a continuous. 3 Densities under two hypotheses lightly shaded triangle and algebraically. we incur a loss y2.2.3) Minimize $ ( R ) = ylP(HoI X E R) P(X E R) + yzP(H1I X E R)P(x E R).1] on the p axis and ends at a point within [0. no randomized test can be admissible in this situation.3.5) yields and is stated in the lemma that bears their names. The set of admissible characteristics plotted on the a. monotonically decreasing. Assuming that the Bayesian chooses the decision for which the expected loss is smaller.2.1) Equation (9. Suppose the loss of making a wrong decision is as given in Table 9. Because of the convexity of the curve. where the expectation is taken with respect to the posterior distribution. I ] on the ol axis. Every point on the curve represents the characteristics of an admissible nonrandomized test.3 1 NeymanPearson Lemma 189 F lG U R E 9 .6) is graphed in Figure 9. . We first consider how the Bayesian would solve the problem of hypothesis testing.
When the minimand is written in the provided that such c exists.5) 444) = y1P(HoI Rl n Ro)P(R1 n Ro) + YlP(H0 l Rl n RO)P(Rl n 4) n Ro)P(R.3 ) NeymanPearson Lemma 191 LOSS matrix in hypothesis testing State of Nature Decision' Ho H1 Yn 0 Ho H1 o Y1 We shall show that & as defined by (9.p plane. Multiply both sides of the inequality in (9.5). (Here. Then we have and (9. This fact is the basis of the NeymanPearson lemma.2) is indeed the solution of (9.5). which shows the convexity of the curve of admissible characteristics. The second and the third terms of (9.3.2 9 1 Hypotheses 9. Then the Bayesian optimal test & can be written as Thus we have proved THEOREM 9 .3. The best he can do. By virtue of Theorem 9. Compare the terms on the righthand side of (9. the best critical region of size a is given by We can rewrite + ( R ) as where L is the likelihood function and c (the critical value) is determined so as to satisfy = Y ~ P ( H and ~ ) . 0 may be a vector. 3 . touches the line that lies closest to the origin among all the straight lines with the slope equal to qo/ql. respectively.1. f i l fl6).4) are smaller than the third and the second terms of (9.) . P ( H o ) and P ( H l ) are the pior where q o = ylP(Ho).2. The first and fourth terms are identical. ) P ( H . n & ) ~ ( . against H I : 0 = 01.3. If the curve is differentiable as in Figure 9.7) cannot be carried out. ) . such as those drawn in Figures 9. Therefore we have + Y2PW1 I Rl form of (9. as well as in the following analysis. This attitude of the classical statistician is analogous to that of the economist who obtains the Pareto optimality condition without specifying the weights on two people's utilities in the social welfare function.P e a r s o n lemma) In testing Ho: 8 = 8. 1.hence he does not wish to specify the ratio q o / q l . therefore.190 T A B L E 9.4) with those on the righthand side of (9. Let R1 be some other set in the domain of X.2 and 9.3.3. the above analysis implies that every admissible test is the Bayesian optimal test corresponding to some value of the ratio q o / q l . Let L ( x ) be thejoint density or probability of X depending on whether X is continuous or discrete. ) . is to obtain the set of admissible tests.4.3).2) by L ( x ) and replace P(H. 1 ( N e y m a n . it becomes clear that the Bayesian optimal test Ro is determined at the point where the curve of the admissible characteristics on the a. probabilities for the two hypotheses.3. without which the minimization of (9.3.4.l x ) L ( x ) with L ( x I H .2).3.3. the point of the characteristics of the Bayesian optimal test is the point of tangency between the curve of admissible characteristics and the straight line with slope qo/ql The classical statistician does not wish to specify the losses yl and y2 or the prior probabilities P ( H o )and P ( H . because of the definition of & given in (9.i = 0.3.3.7). n 4) + y2P(Hl ( R.
even in situations where the NeymanPearson lemma does not indicate the best test of a given size a.3.15) and collecting terms.16) ( p l .12) and collecting terms.9).3. CI The NeymanPearson test is admissible because it is a Bayes test. p) and let x be its observed value. E X A M P L E 9 . Therefore.3. 3 > c for some c. If pl < Po.3. The choice of a is in principle left to the researcher.a2). and (8). does not always work. The result is consistent with our intuition. however.2.1. it is often possible to find a reasonable critical region on an intuitive ground. from (9.3. There is a tendency.the inequality in (9. There are often situations where a univariate statistic is used to test a hypothesis about a parameter.1.12) "(l pa . EXAMPLE 9 . in Example 9. Therefore.P1)nx  get Suppose p1 > Po. for example. .3.17) f > PO. where d is determined by P(X > d I H.2 9. T H Eo R E M 9 .2. Let & be as defined in (9. as the following counterexample shows.: p = p. > c for some c.the best critical region of size a is of the form > d.) = a.3. the better the test becomes.i= 1. by (9. who should determine it based on subjective evaluation of the relative costs of the two types of error. . (7). (4).01.7).&). we should do well if we used the best available estimator of a parameter as a test statistic to test a hypothesis concerning the parameter. 1 Let X be distributed as B(n. and there is no c that satisfies (9. 3 . Given a test statistic. Let xibe the observed value of Xi..3. The best critical region for testing Ho: p = po against H I : p = p1 is. Therefore the best critical region of size a is side of (9. Then.3. is. Then the term inside the parentheses on the lefthand is positive.15) Proof.3.9).3.17)is reversed. 2 " 2 Therefore if (9.3.3.1 the NeymanPearson test consists of (I). we obtain n (9.. PO)"" Taking the logarithm of both sides of (9.where a2is assumed known. 3 .such a statistic is called a test statistic.n. where d is determined by P(X > d I Ho) = a. from (9. Taking the logarithm of both sides of (9. In both examples the critical region is reduced to a subset of the domain of a univariate statistic (which in both cases is a sufficient statistic).3.13) defined by (9.14) Let the density of X be given by x > d.. If p1 < po.10) for a = %. it is not possible to have a(R)5 a(Ro) and P(R) 5 P(&) with a strict inequality in at least one. Common sense tells us that the better the estimator we use as a test statistic.2).A small value is often selected because of the classical statistician's reluctance to abandon the null hypothesis until the evidence of the sample becomes overwhelming. This result is also consistent with our intuition.192 9 1 Hypotheses EXAMPLE 9.3 1 NeymanPearson Lemma 193 The last clause in the theorem is necessary because.05 or 0. Intuition. The best critical region for testing Ho: p = po against H. (9.po) i= 1 xi > o log c + 2 ( p l . however. We shall consider a few examples of application of Theorem 9.14) is reversed. the inequality in (9. 2 LetXibedistributedasN(p. The Bayes test is admissible. As stated in Section 9. for the classical statistician automatically to choose a = 0. 3 . we (9.3.3.3.
4 1 Simple against Composite F I G u R E 9. The Find the NeyrnanPearson test of Ho: 0 = 0 against HI: 0 = densities under Hoand H1 are shown in Figure 9.2.5 Densities in a counterintuitive case f > 0. In this regard it is useful to consider the concept of the power function. = 1.4. where 8 1 is a subset of the parameter space. 1 If the distribution of the sample X depends on a vector of parameters 0. because here the P value (the probability of accepting Ho when H1 is true) is not uniquely determined if 8 1 contains more than one element.6 The NeymanPearson critical region in a counterintuitive case hypothesis.4 defined the concept of the most powerful test of size ct in the case of testing a simple against a simple I Using the idea of the power function. The NeymanPearson critical region.4 SIMPLE AGAINST COMPOSITE D E F I N I T I O N 9 . The shape of the function (9.2  el (9.2) Q1(8) 2 &(0) for all 0 E 8.19) changes with O1. Now we shall turn to the case where the null hypothesis is simple and the alternative hypothesis is composite. 9. In the present case we need to modify it. Definition 9. Then we say that the first test is uniformly better (or unvormly more powerful) than the second test in testing Ho: 0 = Oo against H1: 0 E if Q1(OO)= &(OO) and D E F I N I T I O N 9. denoted R. LetQ1(0) a n d a ( 0 ) be the power functions oftwo tests respectively.3. it is reduced to the simple hypothesis considered in the previous sections. In the figureVitis drawn assuming 0. we define the power function of the test based on the critical region R by We have so far considered only situations in which both the null and the alternative hypotheses are simple in the sense of Definition 9. If 8 1 consists of a single element.5. .4.194 9 1 Hypotheses 9.1. 4 . is identified in Figure 9. we can rank tests of a simple null hypothesis against a composite alternative hypothesis by the following definition.6. We can mathematically express the present case as testing Ho: 0 = O0 against HI: 0 E 81.1. We have I F 1 G U R E 9.
4 1 Simple against Composite 197 and (9. Note that we have 0 5 A 5 1 because the subset of the parameter space within which the supremum is taken contains 8. the NeymanPearson lemma provides a practical way to find the most powerful test of a given size a.1) we have el. even when it does not.4. which may be thought of as a generalization of the NeymanPearson test. the UMP test of a given size a may not always exist. usually gives (9. . but in others it is not. In the present case. Its graph is shown in Figure 9.: 0 = 0. D E F I N IT I o N 9 .7. If x / n > Po. Obtain and draw the power function of the test based on the critical region R = [0.L(x.. given the observation X = x.: 0 = 0.4.we have Q(O1) = 1  el.2 In the case where both the null and the alternative hypotheses are simple.4 and 9. and H 1 : 0 E is a subset of the parameter space 0. against HI: 0 E if P(R I 0. The following is a generalization of Definitions 9. In some of them the test is UMP. against H I is defined by the critical region = 0 otherwise.p).4. This time we shall state it for size and indicate the necessary modification for level in parentheses. E X A M P L E 9. el (9. F Ic u R E 9 .PO)"" < c for a certain c.1 Power function LetXhave thedensity the UMP test if a UMP test exists.) = ( 5 )a.) = (5) a and for any other test R1 such that P(R11 8. If x / n 5 po. R sup L(0) . which means that H o is accepted for any value of a less than 1. P(R I 0 ) 2 P(Rl where c is chosen to satisfy P(A < c I H o ) = a for a certain specified value of a. Therefore the critical region of the likelihood ratio test is given by E X A M PL E 9. p). The likelihood function is L(x.: p = against H I :p > Po. is what we earlier called a. we have I 0 ) for any 0 E 0.5) L(00) < c. standing for supremum. The socalled likelihood ratio test. 4 .) element equal to 01.3) Q I ( 0 ) > @ ( 0 ) for at least one 0 E Note that Q(0. . 7 The following is an example of the power function. maxpZp.: 0 = 1 against H I : 0 > 1 on the basis of one observation on X. clearly A = 1. a ) . B y (9. 3 A test R is the uniformly most powerful (UMP) test of size (level) a for testing H. DEFINITION 9 . We are to test H.196 9 1 Hypotheses 9. 4 . means the least upper bound and is equal to the maximum if the latter exists.4.'.5.and if el consists of a single p. however. p) = C:px(l .4. Sup..2. Let X be distributed as B ( n . the likelihood ratio test is known to have good asymptotic properties.2. 4 Let L(x 1 0 ) be the likelihood function and let the where null and alternative hypotheses be H. Below we give several examples of the likelihood ratio test.6) = p a . where the alternative hypothesis is composite.75. Then the likelihood ratio test of H . p) is attained at p = x / n . We are to test H.4.
This test cannot be UMP. we accept Ho.4. 1/2 on the basis of one observation + . log c > .a)' i n (~f)'.4.4.where a2is assumed known. because it is not a NeymanPearson test against a specific value of P.4 1 Simple against Composite 199 Therefore.t log po . L e t t h e s a m p l e b e X . then A = 1 because we can write C(x.4.. So suppose 2 > po.8) under the null hypothesis (approximately) equal to a. In a twotail test such as (9. where d is determined by P(8 > d 1 Ho) = a. 4 12 > d. since f > yo. Therefore we again obtain (9.3.3. then h = 1. the likelihood ratio test in this case is characterized by the critical region (9.t ) log (1 . If x 5 0. but this time without the further constraint that f > PO. this test is UMP.t ) log (1 .3 P(I~ > d I HO) = a. so we accept Ho.(1 . attaining its maximum at p = f. . Then the denominator of h attains a maximum at y = f .3 and test Ho: 0 = 0 against H I : 0 > 0 on the basis of one observation x.4. (Note that c need not be determined. i = 1 . . Let x.4.4.6]. The assumptions are the same as those of Example 9. be the observed value of X.8) does not depend on the value of p.6) and divid'ig by n. 4 . as discussed in Section 8. Tests such as (9. 4 .11) are called onetail tests. Then the numerator of A is equal to 1/[2(1 + x)'] and the denominator is equal to 1/2. Therefore assume x > 0. Therefore the likelihood ratio test is to reject Ho if x > d. 5 Consider the model of Example 9. which was obtained in Example 9.1) and because the test defined by (9.12) are called twetail tests. That the UMP test does not exist in this case can be seen more readily by noting that the NeymanPearson test in this example depends on a particular value of 0.7) I 9.4. .a confidence interval of y is defined by 1 2< d.4.10).~ ( p . where d is chosen appropriately.7) is an increasing function of t whenever t > Po. n. From Example 8.2 the 1 . Then the denominator in (9.4. we obtain (9. where d is determined by X where d should be determined so as to make the probability of event (9. E X A M P L E 9 . whereas tests such as (9.3 except that H I : y f yo.6 We are to test No: 0 = 1/2 against H I :0 Suppose Xhasauniformdensityover [0. . where d is the same as in (9. therefore. . we have AMPLE 9.4.4.8) and (9.4.198 9 1 Hypotheses Taking the logarithm of both sides of (9.12) we could perform the same test using a confidence interval. . We are to test Ho: y = po against H I : p > po The likelihood ratio test is to reject Ho if If n 5 yo.4.3.3.t) .) This test is UMP because it is the NeyrnanPearson test against any specific value of p > Po (see Example 9.4. . t log 1 + (1 . c r ' ) . Ho should be rejected if and only if yo lies outside the confidence interval.&) f > d. This test is not UMP because it is not a NeyrnanPearson test.12).2.4. it is equivalent to (9.8) For the same reason as in the previous example. Therefore.p)2 = Z(xi .9) is maximized with respect to the freely varying P. EXAMPLE9. 2 . Since it can be shown by differentiation that the lefthand side of (9.2. Therefore the critical region is EXAMPLE 9 . Therefore.O < 6 5 1. n where we have put t = x/n.4.
Consider the same model as in Example 9. In such a case the following theorem is useful. 11. the exact probability of A < c can be either calculated exactly or read from appropriate tables. if there are r parameters and Ha specifies the values of all of them. 11 should be part of the critical region. A = 1. 5 .5). and show that it is UMP. Let the null and alternative hypotheses be Ha: 0 E 8 0 and H I : 0 E el.8 x. 1 Fl G UR E 9. Next we must determine d so as to satisfy EXAMPLE 9 .8). therefore. draw its power function.200 9 1 Hypotheses 9.5. We conclude that the critical region is [O. r0.A is the same as in (9. where €lo and el are subsets of the parameter space 8.5. D E F I N IT I o N 9 .) 9. Since the numerator of the likelihood ratio attains its maximum at p = Po.4. Then.. Derive the likelihood ratio test of size %. The following are examples of the likelihood ratio test. 5 . 0.251 U [0. In all the examples of the likelihood ratio test considered thus far. 4 .5. There are cases.5. accept Ho. Here we define the concept of the UMP test as follows: Power function of a likelihood ratio test A test R is the uniformly most powerful test of size P(R I 0) = (I a ) and for any other test R1 such that (level) a if supeEeo supeEeo P(Rl 1 0) = ( I a) we have P(R 1 0) 2 P(RI 1 0) for any 0 E el.2 Let L(x 0) be the likelihood function.6).8. 0.4.2.4. but here test Ha:p 5 Po against H I :p > Po. Suppose the portion A of [O. Next assume that x E [O. First. 2 log A is asymptotically distributed as chisquare with the degrees of freedo~n equal to the number of exact restrictions implied by . This completes the proof.4. 0.5. Then part of the power function shifts downward to the broken curve. 1 where c is chosen to satisfy supeoP(A < c 0) = a for a certain specified value of a.. This implies that c = %. note that A = 0 for x E [0. therefore. the degrees of freedom are r. this situation is the most realistic.5 Composite against Composite 1 201 A H. 11.251 is removed from the critical region and the portion B is added in such a way that the size remains the same. however. 11 should be part of any reasonable critical region.5).5 COMPOSITE AGAINST COMPOSITE In this section we consider testing a composite null hypothesis against a composite alternative hypothesis. I f x / n 5 p. Henceforth suppose x / n > po. because this portion does not affect the size and can only increase the power. where P(A < c) cannot be easily evaluated. Then we have For the present situation we define the likelihood ratio test as follows. DEFlN I T I O N 9. As noted earlier. (For example. where c should satisfy P(2X < c I Ha) = %. Therefore the critical region is again given by (9. THEOREM 9 . 1 I Let A be the likelihood ratio test statistic defined in (9. first note that [0. Then the likelihood ratio test of Ha against H I is defined by the critical region I Therefore we reject Ho if 2x < c. To show that this is UMP. Its power function is depicted as a solid curve in Figure 9.
N(p.But since R is the U M P test of Ho: p = la. there is no need to compute the supremum. . that L1( 0 ) and & ( O ) are simple step functions defined by (9.. therefore (9. Then the Bayesian rejects H o if . This test is UMP. the critical region (9.~ 1 .2. we have  Theref ore where f (0 ( x) is the posterior density of 0. E X A M P L E 9. In Section 9. the .(x. we have Therefore the value of d is also the same as in Example 9. To see this.10) where b2= n'Z~=l(x.4.5. is reduced to which can be equivalently written as Recall that (9. .8) is distributed as Student's t with n . k can be computed or read from the appropriate table.l ~ r = l (~R i ) Therefore ~ ~ the critical region is (9. Then we have L1(0) = 0 = yl for 0 > 0. We are to test Ho: p. # po. Then it follows that P ( R . 2. > IJ.5."=.2 where e2is the unbiased estimator of u2 defined by ( n . u2)by 0. . let R be the test defined above and let R1 be some other test such that I p) 5 a.7) (62/&2)n'2< c for some c.2.8).l)'Z. (~T)~'~($)"$X~ [. If x/n 5 po. Let the sample be X . e2= n .5. Note that since P ( R I H o ) is uniquely determined in this example in spite of the composite null hypothesis. for 0 5 O0 and (9.]: ./ql.2. .1 degrees of freedom.z ) ~ Since the lefthand side of (9. n.5 Composite against Composite 1 203 But since P ( X / n > d I p) can be shown to be a monomnicaily increasing function of p.4. Let h ( 0 ) be the loss incurred by choosing Ho. Here we shall do the same for the composite against composite case.3 we gave a Bayesian interpretation of the classical method for the case of testing a simple null hypothesis against a simple alternative hypothesis. Here. as can be seen in (9. &&ore. = po and 0 < u2< m against H I : p.6) where sup L(0) = 00u0. and we shall see that the classical theory of hypothesis testing becomes more problematic. 202 9 1 Hypotheses 9. Let us first see how the Bayesian would solve the problem of testing Ho: 0 5 O0 against H I : 0 > OO.5. Henceforth suppose x / n > Po. we have P ( R I I p) 5 P ( R I p) for all p > po by the result of Example 9.5. accept Ho. 02). for simplicity.O and 0 < u2 < m. u" with unknown u2.. . = y2 for 0 r OO. In this case the same test can be performed using the confidence interval defined in Example 8.8) If the alternative hypothesis specifies p should be modified by putting the absolute value sign around the lefthand side. 1 Po) 5 a.5.9). in addition to the problem of not being able to evaluate q.2. A = 1. i= 1.11) is the basis for interpreting the NeymanPearson test. and L1(8) by choosing H1. Suppose.3.5. . Denoting ( p .5.3. against H I : p > Po. In this case the losses are as given in Table 9.
4 204 1 9 1 Hypotheses 9.P ) ~ .11).17) j ( \>(p)f(p 1 y)dp > j>(p)f(p 1 W P .6.5. and assume that the net benefit to society of approving the drug can be represented by a function U ( p ) . Then f ( p I x) and f(p I y) f In the preceding sections we have studied the theory of hypothesis testing.13) is an increw ing function in x. We toss this coin ten times and a head comes up eight times. Next we try to express (9.6 1 Examples of Hypothesis Tests and k(p) = f(p 1 x) . Note that in this decision problem.13). which establishes the result that the lefthand side of (9.5.6 EXAMPLES OF HYPOTHESIS TESTS Now cross only once.12) may not be a good substitute for the lefthand side of (9. Let p be the probability of a cure when the drug is administered to a patient.nondecreasing in p.15) is 0 if p = 1 and is monotonically increasing as p decreases to 0.13) is equivalent to (9.whereas the righthand side is smaller than U ( p * ) . and define h(p) = f (p I y ) .13) more explicitly as an inequality concerning x.If the classical statistician were to approximate the Bayesian decision.5.f 205 classical statistician faces the additional problem of not being able to make sense of L(x [ H I ) and L(x I Ho). consider the problem of deciding whether or not we should approve a certain drug on the basis of observing x cures in n independent trials. Her decision rule is of the same form as (9. except that she will determine c so as to conform to a preassigned size a.5.however. hypothesis testing on the parameter p is not explicitly considered.13) is essentially the same kind as (9. The decision rule (9.5.5.5.18). E X A M P L E 9.5.1) H. put f ( p x) = f(p [ y).: p = % and H I : p > %.* .5. we should approve the drug if (9.7). suppose y > x. size)?What if the significance level is lo%? From the wording of the question we know we must put I The lefthand side of (9.9). The likelihood ratio test is essentially equivalent to rejecting No if 1. Then j1 U(P)h(P)dp> l ~ ( p ) k ( p ) d' p h(P)dP Tk(p)dp P* 0 because the lefthand side is greater than U ( p * ) .5. To see this.16) p* ( p 1 x) ( p I y). Should we conclude that the coin is biased at the 5% significance level (more precisely.f we have (9.5.1 ( m e a n o f b i n o m i a l ) It is expected that a particular coin is biased in such a way that a head is more probable than a tail. Therefore (9.But (9.5.16) is equivalent to A problem here is that the lefthand side of (9. from (8.5. Sometimes a statistical decision problem we face in practice need not and/or cannot be phrased as the problem of testing a hypothesis on a parameter. first. Let p* be the solution to (9.15) such that p* # 0 or (9.5. this equality can be written as ( p I x) = ( n + l)Czpx(l . paraphrase the problem into that of testing hypothesis Ho: p 2 po versus H I : p < po for a certain constant po and then use the likelihood ratio test. If p f 0 or 1.5.5. (9.14) where c is determined by (9. . In this section we shall apply it to various practical problems. The classical statistician facing this decision problem will. assuming for simplicity that f ( p I x) is derived from a uniform prior density: that is. 9. For example. except possibly at p = 0 or 1. where f ( p I x) is the posterior density of p given x.6.3. she would have to engage in a rather intricate thought process in order to let her Po and a reflect the utility consideration. According to the Bayesian principle.18) x > c.
EXAMPLE 9.2. 6 . (9. rather than determining the critical region for a given size. This decision can sometimes be difficult.6. Student's t with 9 degrees of freedom. 3 ( m e a n o f n o r m a l .2. we know that the best test of a given size a should use as the test statistic and its critical region should be given by x where c (the critical value) should be chosen to satisfy (9.05 or a = 0. EXAMPLE 9 .4) we conclude that Ho should be accepted if a = 0.3) for either a = 0.016) under Ho. p).58) = 0.6.6. 0.a EXAMPLE 9 . called the pvalue: that is. the number of heads in ten tosses. what if the italicized phrase were removed from Example 9.7) where Z is N ( 0 .6.6 1 Examples of Hypothesis Tests 207 From Example 9.2. which would imply a different conclusion from the previous one. We are to test Ho: = 5. to say "Hoshould be accepted if a < 0.8) a ( x. the critical region should be chosen as (9.3) for a given value of a." a twotail test would be indicated. we were to add. For example. should we accept Ho at the 5% significance level? What if the significance level is lo%? From Example 9. Suppose theheight ' ) . If. v a r i a n c e u n k n o w n ) . 6 .1. Then we should calculate the pvalue (9.16. while in 1990 the average .4 foot.16.6. If the sample average of 10 students yields 6 . = 6.4. 1 ) .6. From (9." This is another reason why it is wise to calculate the pvalue. .6. In such a case we must provide our own. We have under Ho From (9.055 and rejected if a > 0.8) = t 9 . 6 Therefore.8) > c.1? Then the matter becomes somewhat ambiguous.4 ( d i f f e r e n c e o f means o f n o r m a l .6.2.8 against H.9) We have m(x 6  5.055.1. We must determine whether to use a onetail test or a twotail test from the wording of the problem. where c is determined by P(L? > c [ Ho) = a. It is perfectly appropriate. by Example 9. Assume the ' is unknown and we same model as in Example 9. instead of the italicized phrase.5.6) Since L? x > c.6. Another caveat: Sometimes a problem may not specify the size.8. wherec is determined by P (t9 > c) = or.N(5.6.6. and the critical region should be of the form  to be 0. determining the critical region by (9. we have P ( g > 6 ) = P ( Z > 1.6) and then checking if the observed value f falls into the region is equivalent to calculating the pvalue P ( x > f I Ho) and then checking if it is smaller than a. where a is the prescribed size. however.: p. v a r i a n c e k n o w n ) Suppose that in 1970 the average height of 25 Stanford male students was 6 feet with a standard deviation of 0.3. In this kind of question there is no need to determine c by solving (9.0571. where a ' is known of the Stanford male student is distributed as N ( F . v a r i a n c e k n o w n ) Therefore we conclude that Ho should be accepted at the 5% significance level but rejected at 10%. 2 ( m e a n o f n o r m a l . we know that we should use X B(10. as the test statistic.6. In fact. "but the direction of bias is a priori unknown. Instead we should calculate the probability that we will obtain the values of X greater than or equal to the observed value under the null hypothesis.5.7) we conclude that H o should be accepted if a is 5% and rejected if it is 10%. as before.9 1 Hypotheses 9. Note that. except that now a have the unbiased estimator of variance 6' = 0. in this particular question there is no value of c which exactly satisfies (9.05 and rejected if a = 0.
)/200. Let Y .2 with a standard deviation of 0. we have where we have assumed the independence of X and Y. we can get a better estimate of the common value by pooling the two samples to obtain (46 + 51)/(200 + 300). be the height of the ith student in the 1970 sample and 2d 5x = ( Z : ! ~ X .12) because it is believed that the height of young American males has been increasing during these years. EXAMPLE 9 . One way is to estimate px by 46/200 and fi by 51/300. we have under Ho  hypotheses can be expressed as (9. Define 7 = ( ~ : 2 ~ ~ . Therefore this example essentially belongs to the same category as Example 9.021 and . ) / ~ O . v a r i a n c e u n k n o w n ) we conclude that Ho should be accepted if a < 0. 6 ( d i f f e r e n c e o f means o f n o r m a l .6. If we define 7 = (C:!: Y.4 into (9.12).6. ny = 25. The only difference is that in the present example the variance of the test statistic X .4. = 1 if the ith man favors the proposition and = 0 otherwise. Using the latter method. Is there a real difference of opinion between men and women on this proposition? Define Y . EXAMPLE 9 .6. we have asymptotically This example is the same as Example 9.6.208 9 1 Hypotheses 9. Y = 6.)/300 and = (2. By Theorem 6 of the Appendix we have under Ho x Inserting nx = 30.18) above. Once we formulate the problem mathematically as (9. However.:: X. Define py = P(Yl = 1 ) and px = P ( X . Should we conclude that the mean height of Stanford male students increased in this period? Assume that the sample standard deviation is equal to the population standard deviation. respectively. We have Therefore we conclude that Ho should be accepted if a < 0.6..097.02 and rejected if a > 0. = 1 if the ith woman favors the proposition and = 0 otherwise. = 1 ) . = 6. Sx = 0. Similarly define X . But since px = f i under Ho. 5 ( d i f f e r e n c e s o f means o f b i n o m i a l ) In ap0ll51 of 300 men favored a certain proposition. and Sy = 0.4 except that now we shall not assume that the sample standard deviation is equal to the population standard deviation.02.3 foot. Since we have under Ho we conclude that Ho should be accepted if a < 0. )a/ n Assuming the normality and independence of XI and Y. the 1990 sample.12) Ho: px  py = 0 and H I : px  py > 0.11) and (9. 6 .6. we realize that this example is actually of the same type as Example 9.2. The competing hypotheses are (9.6 1 Examples of Hypothesis Tests 209 height of 30 students was 6.fi # 0. Since we have under Ho We have chosen H I as in (9. we calculate the observed value of the Student's t variable to be 2. =a : .6. and X .3.097 and rejected if u > 0.077.I? under Ho should be estimated in a special way.2. 6 . and 46 of 200 women favored it. we shall assume a .15) Ho:px  fi = 0 and H I :px .6.6.
1 we consider the case where Z is completely known. It is intuitively reasonable that an optimal critical region should be outside some enclosure containing eO.Olol.010 l2 + (61 . We consider the problem of testing Ho:0 = O0 against H1:0 00.21).2 the case where 2 is known only u p to a scalar multiple.7.6.810)2 + ( 6 2 . a . 2 is a K X K variancecovariance matrix: that is.21)  nys. .where 0 is a Kdimensional vector of parameters.7 1 Testing about a Vector Parameter 21 1 rejected if a > 0. 02)'and O0 = (010. under the null hypothesis : a =. we shall illustrate our results in the tw~dimensional case. 4  Xny1. By Theorem 3 of the Appendix. We are to use the test statistic 0 N(0.7 (difference of variances) i and (9. we have E X A M P L E 9. This weakness is alleviated by the following strategy: (9. I . the alternative hypothesis . That would amount to the test: (9.136. where cis chosen so as to make the probability of Type I error equal to a given value a. 2 = ~ (6 0 ) ( 8 . we need to assume a$ = a .136 and is a rejected if a > 0.4 very much.021.6.7. In using the Student's t test in Example 9. we conclude that Ho should be accepted if a < 0. An undesirable feature of this choice can be demonstrated as follows: Suppose ~6~ is much larger than ~ 6 2 Then .1) Reject Ho if (61 . for the latter could be a result of the large variability of 6 1 rather than the falseness of the null hypothesis. and in Section 9.02012 a2 2 'c. + Consider the case of K = 2. OZ0)'. (Throughout this section a matrix is denoted by a boldface capital letter and a vector by a boldface lowercase letter.6.2) Reject Ho if (61 .) In +  for some c.7 TESTING ABOUT A VECTOR PARAMETER Those who are not familiar with matrix analysis should study Chapter 11 before reading this section. ) where .6. Inserting the same numerical values as in Example 9. The results of this chapter will not be needed to understand Chapter 10. .9 2 Critical region for testing about two parameters Applying Definition 3 of the Appendix and (9.7.9. : u Section 9.6 into the lefthand side of (9.210 9 1 Hypotheses 0 2 9.6.0)'. ) . as depicted in Figure 9. Therefore. What should be the specific shape of the enclosure? An obvious first choice would be a circle with O0 at its center. it is wise to test this hypothesis. In this particular example the use of the Student's t statistic has not changed the result of Example 9. F I C U R E 9.0 ~ 0 ) ~c > 9.559.we have. a : . But we have 9. Insofar as possible. a should be more cause for rejecting No than an large value of 162 equally large value of 10.6.7.6.1 VarianceCovariance Matrix Assumed Known Since a twotail test is appropriate here (that is.7.22) yields the value 0.6. We can write 0 = (el.
that is. where A is the diagonal matrix of the characteristic roots of A (see Theorem 11.'(~ 0. 2. Then.This result is a consequence of the following important theorem.l ( x .I) as the test testing Ho:A0 = AOO statistic. we should n= exp 0 [+(6  % ) 1 ~ . By Definition 1 of the Appendix. a12 = ~ o v ( 682).p) ' A . An additional justification of the test (9.1) if a : =a . Let H be the orthogonal matrix which diagonalizes A.(0) @ But the maximand in the denominator clearly attains its unique &m at 0 = 0. the optimal test should be defined by (9. ~ . the inequality in (9. By this transformation the original testing problem can be paraphrased as against A0 # ABO using A0 N(AOO.3) A AO0) ~ > c. not necessarily diagonal nor identity.7.7.7. we obtain A'"(x .*:.7.7. where A is a positive definite matrix. and p3.7. = p.' / ~ A . note that by (5.A B ~ ) ' ( Suppose x is an nvector distributed as N ( p .Following (11.4).i * where A"' is the diagonal matrix obtained by taking the (1/2)th power of each diagonal element of A. That is. is a positive definite matrix. (x .2) if u12= 0 and.1) and (9. and 3 with respective p r o b abilities pl. We shall now proceed on the intuitively reasonable premise that if a: = and 01.7. so that c can be computed to conform to a specified value of a. Another attractive feature of the test (9.3) can be written as (9. We should not be completely satisfied by this solution either. A s an illustration of the above. = . using h  Proof.) I 1i (0 .1 / 2= 1 and ~'/2~1/2 = A1. to (9.4) Reject Ho if (0 .Suppose . = A'(Ar)'. Reject Ho if (A0 .p ) ' ~ .2) represents the region outside an ellipse with O0 at its center. But AXA' = I implies I. where <g / I 3 a .5).5. 1 (9. elongated horizontally.versus HI:not Ho.~ /(x ' .7. THE 0 R E M 9 . consider a threesided die (assume that such a die exists) which yields numbers 1. (9.4. by our premise.@ ) l ~ . suggests its deficiency.p ) N ( 0 . we can easily show that A'/~A~ .5) Reject Ho if Therefore. further.~ .1 / 2 ~ ~ 1 In the twodimensional case.1 we can find a matrix A such that AZA' = I.212 9 1 Hypotheses 9.7.7 1 Testing about a Vector Parameter 213 ! where a : = vil and a : = ~ 8 Geometrically.1 (~ 0.7.5.define ~ . that I Then by Theorem 11. A). 1 (9.7. Then (x . which implies Z' = A'A.1 / 2= H ~ . = A. ~ . Thus.4) is the fact that 022   4 4 1 under the null hypothesis. because the fact that this critical region does not depend on the covariance.5. H'AH (9.p) 0  .) > c. I). Therefore.1). . 7 .= 0.7) Test Ho:p.7. To see this.4. = p. 3 .O ~ ) ' Z .p) X : .1).5) is reduced to (9. We are to test the hypothesis that the die is not loaded versus the hypothesis that it is loaded on the basis of n independent rolls.4) is provided by the fact that it is a likelihood ratio test. p2. Note that (9.
2j1 .7).2pl .7.8) Test Ho:pl The lefthand side of the above inequality is asymptotically distributed as under the null hypothesis. we can write (9. for example.14) 2n(log 3 2 log d.7.p2. Suppose.7 1 Testing about a Vector Parameter 215 (9.9) is outside the parallel dashed lines and inside the triangle that defines the total feasible region.11) is not identical with the likelihood ratio test. namely. If we define j l and $2 as the relative frequencies of 1 and 2 in n rolls. under the null hypothesis.214 9 1 Hypotheses 9.10) holds only asymptotically.5) becomes (97. Now we apply the test (9.4. a reasonable test would be to (9.1 we transform the above inequality to (9.$21 >c 2 (n log 3 + nl log jl + n2log j2+ n3 log p3) = for some c.9) as a solution of the original testing problem > 2 log d./n and defining c equivalently as (9.7.13) A. the null hypothesis can be stated as 1 . 3p3 f 2. as we shall show below.7. .7. which is extremely unlikely under the original null hypothesis.7. will lead to an acceptance by this test.11) is called the generalized Wald test.13) 2 log A = g = 1 .7. Since (9.4.7. In order to make use of Theorem 9.11) Reject H O FIG u R E 9. = n. which can be approximately determined from the standard normal table because of the asymptotic normality of pl and In Figure 9.7. By Definition 9. (9. is the number of times j appears in n rolls.7.7.10 the critical region of the test (9.10 Critical region for testing the mean of a threesided die If we should be constrained to use any of the univariate testing methods expounded in the preceding sections.7. We have. A weakness of the test (9.7. Noting that j .4 the likelihood ratio test of the p r o b lem (9.7) is Xg  + 2p2 + 3p3 = 2 versus HI: p1 + 2p2 3. but that would not be entirely satisfactory.pl . Next we derive the likelihood ratio test and compare it with the generalized Wald test.9) Reject Ho if 1 1 . + jl log pl + log j2+ j3 log j3) > c. In such a case. the test (9. (9. Therefore (9.7. we would somehow have to reduce the problem to one with a single parameter.7.5) to the original problem (9. we decide to test the hypothesis that the expected value of the outcome of the roll is consistent with that of an unloaded die.p2 Since p = 0.7) is obvious: an outcome such as j1 = 0 and j2 = 1. where n.
however.'( ~Oo)/K W/M .For what kind of estimator where can we compute the distribution of the statistic above.O ~ ) ' Q .Xn.7. In this case it seems intuitively reasonable to reject H o if 1= (9. by Definition 3 of the Appendix. Q is a known positive definite matrix and u2 is an unknown scalar parameter.7. Therefore. have.[ p r o ] versus HI: [ z j * [g] and apply it to the three similar terms within the parentheses of the lefthand side of (9.N ( p y . is some reasonable estimator of u2.19) 1 I= I u 2 2 . If we are given a statistic W such that which is independent of Appendix (9.7 1 Testing about a Vector Parameter 21 7 Wald Test Likelihood Ratio Test One solution is presented below. Figure 9. so we shall give a couple of examples.11) and (9. u4 9. defining 6 ' = W / M will enable us to determine c appropriately.05.7.+n.F ( K .7. use the Taylor expansion  .7. .1 1 describes the acceptance region of the two tests for the case of n = 50 and c = 6. :L~ We Y . so that we can determine c so as to conform to a given size a? c2 .1 216 9 1 Hypotheses 9. sometimes a good solution if 2 = a 2 ~ where .2 VarianceCovariance Matrix Known u p t o a Scalar Multiple and because of Theorems 1 and 3 of the Appendix.Pxo . 1 Suppose X pendent of each other.7.7. Suppose that 8 = n ~ ' ~ and ~ E 2 =~ ~ .14). We are to test He: To show the approximate equality of the lefthand side of (9. on the basis of nx and ny independent observations on X and Y.7.~ ~X . F l G U R E 9. Note that ~ ( ~ > 22 6) 0. Assuming the availability of such W may seem arbitrary. from Definition 1 and Theorem 1 of the Appendix. respectively.17) Xi in (9. M).61. we obtain by Definition 3 of the (0 . There is no optimal solution to our problem if Z is completely unknown.14). ~ ( p ~ . There is. a and ') Y E X A M P L E 9 .6). u2)are inde Px .1 1 The 5% acceptance Mans of the generalized Wald test and the likelihood ratio test Therefore. We assume that the common variance u2is unknown. We first note (9. 7 .2.
7. (Section 9.7.y. Find a critical region of a = 0.71 .7. derive the Bayesian optimal critical region. ny. is distributed as N(p. Y N(py.4. Let y.2) Given the density f(x) = 1/0. nr. = 2 against HI: p = 3 on the basis of four independent observations on X. we have under Ho. respectively..] ^ 3. u2) are mutually independent. by Theorems 1 and 3 of the Appendix. Similarly. Then we have 1.u2). 4). and Z N(pz. . 2. (Section 9.. Is the region unique? If not. We are to test Ho: 0 = 0.218 9 1 Hypotheses EXERCISES I Exercises 219 We should reject Ho if the above statistic is larger than a certain value.2 against H I : 8 = 0. . and 0 elsewhere. We are to test Ho: px = py = pz versus HI: not H o on the basis of nx. 2 Suppose that X N(px. and S the sample variances based on nx. and n~ observations. E X A M P L E 9 . find the Bayesian optimal critical region. s.pz. b e c a w of Theorem 9.py.20) to conform to a preassigned size of the test.5 which minimizes p and compute the value of p. (Section 9. X1 = x . (a) List all the nonrandomized admissible tests.23) are independent. ny. 16). 7 .2) Suppose that X has the following probability distribution: X = 1 with probability 0 where where 0 5 9 5 1/3. u2 Xn . and Z. (b) Find the most powerful nonrandomized test of size 0. which we can determine from (9.3. and ^X2 = . respectively.7. 61.1. lets. Y.25 on the basis of one observation on X. 4. and we want to test Ho: p = 25 against HI: p = 30. (c) Find the most powerful randomized test of size 0. Define hl = px . define the class of such regions. Suppose the loss matrix is given by  True state Since the chisquare variables in (9. and 2 be the sample averages based : be on nx. we have Decision where e is Euler's e ( = 2. and nz observations.2. We want to test Ho: y But. . respectively. we are to test the hypothesis Ho:0 = 2 against HI: 0 = 3 by means of a single observed value of X. Assuming the prior probabilities P(Ho) = P(HI) = 0. Assuming that the prior probabilities of Hoand H1 are equal and the costs of the Type I and I1 errors are equal.    a. (Section 9.SQl k. 0 < x < 0. and nz independent observations on X.3) An estimator T of a parameter p.3) Let X N(p.22) and (9.5. ). a2). X2 = p~ .
25.4) Suppose that a bivariate random variable (X. y 5 1. 0 < x < 1. . Derive its power function. (Section 9. (a) Find the likelihood ratio test of size 0 < a < 1. Suppose that 100 independent tosses yielded N I ones. . 2. (Section 9. y) 9. and 0 < h < m.220 9 1 Hypotheses I Exercises 221 Calculate the probabilities of Type I a m d Ttrpe I f emrs for this critical region. .4) Given the density f (x) = 1/0. . 2. (b) If N = 2. N p = 13. N5 = 17. N g = 14. where we assume 0 5 0 < 1.4) Suppose (X. 4 and 5 each occurs with probability 1/5. . (Section 9. test the hypothesis Ho: 0 = 2 against HI: 0 > 2 by means of a single observed value of X. If you cannot. and 6 occurs with probability (a) If number j appears N. . . we are to x 5 p. 11.k = 1. and 0 elsewhere. 12. < M. = 1/ ( p i ) . (Section 9.times. .8. Consider the test which rejects Hoif X > c.3) Letf(x) = 0 exp( Ox). 13. (Section 9.3) Supposing f (x) = (1 O)xe. define the best test you can think of and justify it from either intuitive or logical consideration. Indicate how to determine the critical region if the size of the test is a. (c) If N1 = 16. Determine c so that a = 1/4 and draw the graph of its power function. Choose a = 0. 2.2. 0 5 Y 5 A .). and N g threes. P(X = k) = p(l .L.5 on the basis of a single observation on (X. (b) Show that it is not the uniformly most powerful test of size a. p2. Find the uniformly most powerful test of the hypothesis 0 = 1 of size a = 0.25. Y) with a = 0. 0 > 0. in N throws of the die. (Section 9. (a) Derive the likelihood ratio test. . 6 . You may use the normal approximation. Derive: (a) the NeymanPearson optimal critical region. . (Section 9. xO. we are to test Ho: 0 = O0 against H1: 0 = O1 < Oo. 0 < x < 0. That is. should you reject the null hypothesis at the 5% significance level? What about at lo%? You may use the normal approximation. (b) the Bayesian optimal critical region. Y) have density f (x. assuming a = 0. PI = p2 = 2/5 against H I : PI = Obtain a NeymanPearson test of Ho: % and p2 = %. (Section 9.P)~'. Find the NeymanPearson test based on a sample of size n. + 7. Y) is uniformly distributed over the square defined by 0 5 x. . We want to test Ho: 0 = 1 against HI: 0 = 2 on the basis of one observation on X.2. 10. (c) Prove that the likelihood ratio test of the problem is the uniformly most powerful test of size 0.4) Random variables X and Y have a joint density 6. obtain the most powerful test of size '/4 and compute its p value. We are to test Ho: 0 = 0.4) Let X be the number of trials needed before a success (with probabilityp) occurs.3) We wish to test the null hypothesis that a die is fair against the alternative hypothesis that each of numbers 1. .05.05. N 2 twos. Compare it with the power function of the critical region consisting of the numbers {1. (Section 9. (b) Obtain the power function of the likelihood ratio test (or your alternative test) and sketch its graph.3) Let X be the outcome of tossing a threesided die with the numbers 1. 0 0 < I.5 against H1: 0 # 0. Y). j = 1.9. and 3 occurs with probability Xo. x r 0. 3. and N6 = 18.and p3. Find the power function for testing Ho: p = 1/4 if the critical region consists of the numbers k = 1. assuming that P(Ho) = P(H1) and that the loss of Type I error is 2 and the loss of Type I1 error is 5. and 3 occurring with probabilities PI.01 based on a single observation of X and Y . N4 = 22. 0 > 0. define the NeymanPearson test. We are to test Ho: p = A = 1 versus H I : not Hoon the basis of one observation on (X. 2.
9 for 0. (Section 9.6) Suppose you roll a die 100 times and the average number showing on the face turns out to be 4. Assume that the loss matrix is the same as in Exercise 16. How large a percentage point difference must be observed for you to be able to conclude that Clinton is ahead of Bush at the significance level of 5%? 21.0. Y).5.5 =2(p0. (Section 9.05. (Section 9. 21.5 against H1:0 > 0. 16. Suppose that we are given a single observation x of X. 5 < p 5 1. 01. how large should x be for the drug to be approved? 20. (Section 9. (a) Derive the Bayes estimate of 0.5) +1 for 2 5 0 5 2 a n d O 5 x 5 1. 18.1 5 0 5 1. Y 10) = 0' for o 5x I :0. y) = 20* for x = + y 5 0. 11 and that x patients out of n randomly chosen homogeneous patients have been observed to be cured by the drug.4) The joint density of X and Y is given by f (x.05.5) Random variables X and Y have a joint density f (x. (Section 9. 0 I and the prior density of 0 is f (0) = 20. We test Ho: 0 = 0. 19. If twenty races are won by the stimulated runner. (Section 9. show how a Bayesian tests Ho: 0 5 0. (b) Derive its power function and draw its graph. Suppose that the net social utility from a commercial production of the drug is given by U(p) = 0. Formulate a Bayesian decision rule regarding whether or not the drug should be approved. (Section 9. Find the Bayesian test of Ho: 0 2 % against H I : 0 < % based on a .25. (b) Assuming that the costs of the Type I and I1 errors are the same. find the Bayes test of Ho: 0 E [ l . (a) Derive the likelihood ratio test of size 0. assuming the prior density single observation of each of X and Y f (0) = 1/0.6) One hundred randomly selected people are polled on their preference between George Bush and Bill Clinton. If n = 2. in which one runner is given a stimulant and another is not.5 against H I : 0 # 0. (Section 9. on the basis of one observation on (X. 1. Assume that the loss matrix is given by Suppose that a prior density of p is uniform over the interval [0.222 9 1 Hypotheses I Exercises 223 14. (c) Show that it is the uniformly most powerful test of size 0. where we a 5 1.25. oI 5 0.51 versus H I : 0 E (1.5) Let p be the probability that a patient having a particular disease is cured by a new drug.6) We throw a die 20 times.5. 0. 0 otherwise. 15. for 0 .5) :x 5 0. 22. 0 5 y.5. Suppose that the density of X given 0 is f (x 1 0) = 2x/e2.O < 0 < 1. 21 on the basis of one observation on X.1 5 0 I :1. Show that this test is h e uniformly most powerful test of size 0. Assuming that the prior density of 0 is uniform over [ l .5) for 0 5 p 5 0. should you decide that the stimulant has an effect at the 1% significance level? What about at 5%? 17. (Section 9. 0 5 x.6) Thirty races are run.5. (Section 9. Obtain the likelihood ratio test of Ho: 0 = 2 against H1:0 < 2 on the basis of one observation of X at CY = 0. Is it reasonable to conclude that the die is loaded? Why? 23.5) Let X be uniformly distributed over [O.4) The density of X is given by f(x) = 0(x . 1 comes up four times and 2 comes up seven .
City A n X D E City B 9 9 2 n"(x.6) The price of a certain food item was sampled in various stores in two cities. can we conclude that the fertilizer is effective at the 5% significance level? Is it at the 1% significance level? Assume that the yields are normally distributed.~)' 18 10 2 28. showed Clinton ahead by 20 points. based on a sample of 5000 voters. (Section 9. Participant A B C D Weight after (lbs) 126 125 129 131 128 130 135 142 Farm A B C Y~eld without fertilizer (tons) 5 6 7 8 9 Yield with fertilizer (tons) 7 8 7 10 10 25. (Section 9. Should we reject the hypothesis at 5%? What about at 10%? 24.6) The following data are from an experiment to study the effect of training on the duration of unemployment. On the basis of our experiment. of which 78 graduated. Would you accept the claim made for the diet? Weight before (lbs) unemployment for those without tdnhg. (Section 9.6) According to the Stanford Observer (October 19'77). Test the hypothesis that there is no difference between the mean prices of the particular food item in the two cities using the 5% and 10% significance levels. showed Clinton ahead by 23 points.224 9 1 Hypotheses times. Let pl be the probability that 1 comes up and p2 be the probability that 2 comes up. (Section 9. The weights of seven women who followed the diet.6) One preelection poll. based on a sample of 3000 voters. and Y be the duration for those with training: x y 35 31 42 37 17 21 55 10 24 28 Assuming the twosample normal model with equal variances. Among the 1024 students were 84 athletes.6) The accompanying table shows the yields (tons per hectare) of a certain agricultural product in five experimental farms with and without an application of a certain fertilizer. are given in the accompanying table. (Section 9. whereas another poll. (Section 9. 1024 male students entered Stanford in the fall of 1972 and 885 graduated. Assume that the prices are normally distributed with the same variance (unknown) in each city. can we conclude that training has an effect at the 5% significance level? What about at lo%? 2'7. recorded before and after the twcweek period of dieting. test the hypothesis pl = p2 = '/6 against the negation of that hypothesis. Other things being equal. Are the results significantly different at the 5% significance level? How about at lo%? 26. and the results were as given below. Let X be the duration of . Would you conclude that the graduation record of athletes is superior to that of nonathletes at the 1% or 5% significance level? 29.6) It is claimed that a new diet will reduce a person's weight by an average of 10 pounds in two weeks.
The students all took the same test.3 8.7) In Exercise 25 above. in Group 2. passes the test.2. test Ho: p1 = p2 = pg against H I : not Ho. b2.05 or at a = 0. where 6' = (bl. We are to test Ho: pl = p.3. Choose the 5% significance level. of the variances 32. 35. (Section 9. Students are homogeneous within each group.r2 of n2 students passed the test. Use it to answer problem (a) above.a 2 )for class i = 1. 2.8 34. and 1. Assume that the test results across the students are independent. $2.7) Test the hypothesis p1 = ~2 = p5 using the estimators bI. 31. (a) Using the asymptotic normality of jl = r l / n l and jP = r2/n2.1? (b) Derive the likelihood ratio test for the problem.3 Class 3 Test the hypothesis that the mean prices in the three cities are the same. . (Section 9. Score in Class 1 8.7) There are three classes of five students each. b2.1 Class 2 7. pP = ( ~ 1 p2> . p3).0 6.and @a are 4.A). derive the Wald test for the problem. (Section 9. Given nl = 20. Let pl and p2 be the probability that a student in Group 1 and in Group 2. r1 = 14. (Section 9. = 0. and their test scores were as shown in the accompanying table. (Section 9. and 7 2 = 16. add one more column as follows: City C '  Assume that the observed values of b l . respectively. and fis having the joint distribution j i N(p. rl of nl students passed a test.5 against H I : not Ho. b 3 ) . Assuming that the test scores are independently distributed as N ( p i .7) In Group 1.6) Using the data of Exercise 26 above.226 9 1 Hypotheses I Exercises 227 1 30. should you re~ect H o at a = 0.6) Using the data of Exercise 27 abow*test the e q * at the 10% significance level. 7. and respectively. Choose the size of the test to be 1% and 5%. q = 40.8 7. test the equa£irp of the dances at the 10% significance level. (Section 9.
.}and {y.1) y. in a consumption function consumption is usually regarded as a dependent variable since it is assumed to depend on the value of income. say. and a ' are unknown parameters that we wish to estimate. y) = f (y I x)f (x). {x. y) on the basis of independent observations {x.T. . but it is not essential for the argument. The problem we want to examine is how to make an inference about f (x. In Chapters 10. or a regressox For example. In the present c h a p ter we shall consider the relationship between two random variables. 2 . The linear regression model with all the above assumptions is called the clcwr sical regression model. whereas income is regarded as an independent variable since its value may safely be assumed to be determined independently of consumption. Here. In this chapter we shall assume that the conditional mean is linear in x and the conditional variance is a constant independent of x. . We also assume that x. Since we can always write f (x. y. The reader should determine from the context whether a symbol denotes a random variable or its observed value. we can state that the purpose of bivariate regression analysis is to make a statistical inference on the conditional density f (y I x) based on independent observations of x and y. . We define the bivariate linear regression model as follows: (10.1 INTRODUCTION In Chapters 1 through 9 we studied statistical inference about the distribution of a single random variable on the basis of independent observations on the variable. Note that we assume {x. .}. 2. T. mutivariate) statistical analysis. we mean the inference about the joint distribution of x and y. We call this bivariate (more generally. I I 10. the independent variable. 12.)are known constants. f (y I x).i. with Eu. Let (X. B y the inference about the relationship between two random variables x and y. and a variable such as x is called an independent variable. Regression analysis is useful in situations where the value of one variable. t = 1. we shall need the additional assumption that {u. y) . 2. Bivariate regression analysis is a branch of bivariate statistical analysis in which attention is focused on the conditional density of one variable given the other. fi. This is equivalent to assuming that (10.1 1 Introduction 229 i 1 10 BIVARIATE REGRESSION MODEL 10. We make this assumption to simpllfy the following explanation. Thus. . we often want to estimate only the first few moments of the densitynotably. on x and y. x. A variable such as y is called a dependent variable or an endogenous vanable.]are observable random variables.).1) specifies the mean and variance of the conditional distribution of y given x. = a ' . and 13 we shall study statistical inference about the relationship among more than one random variable. = a + fix. .. the mean and the variance. Thus far we have considered statistical inference about F based on the observed values {x. instead. + u. Let us assume that x and y are continuous random variables with the joint density function f (x. and {u. where {y. we may not always try to estimate the conditional density itself. .] to be known constants rather than random variables. t = 1.] are unobservable random variables which are i. one can determine this question empirically.1. We shall continue to call x. .]are normally . . an exogenous vanabk.t = 1. At some points in the subsequent discussion. regression analysis implies that for the moment we ignore the estimation off (x).. As in the single variate statistical inference. It is wise to choose as the independent variable the variable whose values are easier to predict. is not equal to a constant for all t. = 0 and Vu. is determined through a certain physical or behavioral process after the value of the other variable. x and y. . From now on we shall drop the convention of denoting a random variable by a capital letter and its observed value by a lowercase letter because of the need to denote a matrix by a capital letter. a. T.d. be a sequence of independent random variables with the same distribution F.1.) of W t l . is determined. . In situations where theory does not clearly designate which of the two variables should be the dependent variable or the independent variable.
the least squares (LS) estimators of a and P. in general. and u2 in the bivariate linear regression model (10. it is possible that E(y I x) is linear in x after a suitable transformationsuch as. and so forth.. Minimizing the sum of squares of distances in any other direction would result in a different line. In Section 13. as in Figure 10. denoted by & and p. we can draw a line so as to minimize the sum of absolute deviations. In Figure 10. But in some applications t may represent the tth person. Our assumption concerning {u.1 Definition In this section we study the estimation of the parameters a. Given a joint distribution of x and y. The linearity assumption may be regarded simply as a starting point. x. for example. consumption and income). where y* and x* are the original variables (say. If we are dealing with a time series of observations. Still. and so on). However. The T observations on y and x can be plotted in a secalled scatter diagram. t refers to the tth period (year..). and the like. since if E(y* I x*) is nonlinear in x*.. but there are a multitude of ways to draw such a line.1.4 we shall briefly discuss nonlinear regression models. In that figure each dot represents a vector of observations on y and x.1). xi). We have also drawn a straight line through the scattered dots and labeled the point of intersection between the line and the dashed perpendicular line that goes through (j. a reasonable person would draw a line somewhere through a configuration of the scattered dots. We have used the subscript t to denote a particular observation on each variable. P. taking only two values). Vu. In Chapter 13 we shall also briefly discuss models in which (q) are serially correlated (that is. F 1 G u R E 1 0.2 1 Least Squares Estimators 231 distributed. P.u. Then the problem of estimating a and p can be geometrically interpreted as the problem of drawing a straight line such that its slope is an estimate of P and its intercept is an estimate of a. We first consider estimating a and p.) is indicated by h. # 0 even if t s) or heteroscedastic (that is. can be defined as the values of CY and P which minimize the sum of squared residuals . In the subsequent discussion the reader should pay special attention to the following question: In what sense and under what conditions is the least squares estimator the best estimator? Algebraically.2. = 0. We have labeled one dot as the vector (y.2 LEAST SQUARES ESTIMATORS 10. Alternatively.) as (5. as we have seen in Chapters 4 and 5. tth firm. and u2in the linear regression model because of its computational simplicity and certain other desirable properties. nonlinear in x. how shall we choose one method? The least squares method has proved to be by far the most popular method for estimating a. tth nation.1.1. or the sum of the fourth power of the deviations.1. which we shall show below.230 10 1 Bivariate Regression Model 10. varies with t).1) specifies completely the conditional distribution of y given x. Eu. x. Gauss in a publication dated 1821 proposed the least squares method in which a line is drawn in such a way that the sum of squares of the vertical distances between the line and each dot is minimized. it should by no means be regarded as the best estimator in every situation. Data which are not time series are called crosssection data. the linearity assumption is not so stringent as it may seem.. Then (10.1 Scatter diagram + 10. the vertical distance between the line and the point (y. We can go on forever defining different lines. E(y I x) is. The assumption that the conditional mean of y is linear in x is made for the sake of mathematic& convenience.] may also be regarded as a starting point. Since Eu. Two notable exceptions are the cases where x and y are jointly normal and x is binary (that is. x. Another simple method would be simply to connect the two dots signifying the largest and smallest values of x. y = logy* and x = log x*. month.
.' c ~:2 ' and s. . a sequence of T ones.} has been removed. 2 . .2.9).) Mathematically. The precise meaning of the statement above is as follows. In Section 10.] after the effect of the unity regressor has been removed.* without the 6.2.2.2. (See the general definition in Section 11. s: is the sample variance of x.2.y)'. where we have defined 7 = T'c%. .3.2. and sXy is the sample covariance.9) Czi.2.2Z(yt . based on the unit regressor as p and call it the least squares predictor of y. as defined in (10.2) and (10.2.3)  as aP . We shall call this fact the orthogonality between the least squares residual and a regressor.3). Then defined in (10.}. regarding P as the slope coefficient on the only independent variable of the model.3) simultaneously for a and P yields the following solutions: zLI as the coefficient on another sequence of known constantsnamely. on x. Solving (10.27. which defined the best linear unbiased predictor.] after the effect of {x. where t is not included in the sample period ( 1 . by the sample mean.. we can treat a and p on an equal basis by regarding a which is actually the deviation of x. = T'zxtyt ..x.3. Thus the least squares estimates can be regarded as the natural estimates of the coefficients of the best linear predictor of y given x.2.2. 0. Define the error of the predictor as and call it the least squares residual. We define the error made by the least squares predictor as where jJ is the value that minimizes X(x. It is interesting to note that (10.2) and (10. that is. Under this symmetric treatment we should call j.2.2 ( Least Squares Estimators 233 Differentiating S with respect to a and tives to 0. jJ = 2.4) and (10.6) the least squares predictor of y. can be interpreted as the least squares estimator of the coefficient of the regression of y. ) x= .2. .4). we can express the orthogonality as 1 and (10.232 10 ) Bivariate Regression Model 10. But as long as we can regard (x. = 2. . that is. . Note that J and 2 are sample means. based on the unity regressor and {x. T). and I5 as measuring the effect of the unity regressor on {y. So far we have treated a and p in a nonsymmetric way. In other words. We shall call this sequence of ones the unity regressox This symmetric treatment is useful in understanding the mechanism of the least squares method.2.}on {y. where X should be understood to mean unless otherwise noted. namely x. we obtain P and equating the partial den* 7 and (10. We define Note that (10. We shall present a useful interpretation of the least squares estimators 1 5and by means of the abovementioned symmetric treatment.8) and (10. and calling a the intercept. we are predicting x. There is an important relationship between the error of the least squares pre&ction and the regressors: the sum of the product of the least squares residual and a regressor is zero.a  ~ x . The least squares estimator (j can be interpreted as measuring the effect of {x. 2 = T'zx.]as known constants.5) can be obtained by substituting sample moments for the corresponding population moments in the formulae (4. y.8) and (4.. Define the least squares predictor of x.9) follow from (10. s: = T . from the sample mean since ? .2.6 below we discuss the prediction of a "future" value of y. respectively.. = 0.
2. is the least squares estimator of : without the intercept.3.2. *so (10. ~(11. the variance of we can show that &. Using (10.234 10 1 Bivariate Regression Model 10.17) yields Reversing the roles of {x.2. = 0 and {x:] are constants by m assumptions.2.12) and using (10. .2. & minimizes C(y.a ~ .4) and (10.2.23) 0 can be evaluated as follows: by Theorem 4. In other words.2.).2.12). we d&* squares predictor of the unit regressor based on {x. we obtain from (10.14) cl:ut &. Then. this intwpfeatiorlitis more natural to write fi as 0 Inserting (10. even though there is of course no need to predict 1 in the usual sense.2.16) rather than (10.16) and using (10.2.1. cx: . 0.2 Properties of & and fi First.as we can easily verify that 0 Similarly. we obtain the means and the variances of the least squares estimators & and For this purpose it is convenient to use the formulae (10. . Therefore Cx.]and the unity regressor. First. Similarly.2. note that the denominator of the expression for given in (10.1. inserting (10. on 1 )~ that words.21) and 10.) ~(x:)~ [C(x2.2.22) E& = a.2 1 Least Squares Estimators 235 intercept: that is. (10.2.)*l2 by Theorem 4. let us see what we can learn from the means and the variances obtained above.a =. we have froan (10.2. 6 = .12) and (10.2. where 8 minimizes C(l . minimizes C(y.20) ~p = p.2. if we define which implies (10.1.1) into (10. VO .19). their unbiasedness is clearly a desirable property.6xJ2. Next.2.)on the unity regressor or in the regression of the unity regressor on {x. How good are the least squares estimators? Before we compare them with other estimators.6. defined in (10.}as the least Since Eu. The orthogonality between the least squares residual and a regressor is also true in the regression of {x.2. The variance of & has a similar property.19) and Theorem 4..2. In other the coefficient of the regression of y. is an unbiased estimator of (10.1) into (10.23) is equal to T times the sample variance of x. This is another desirable property.5).2.)~ We call it the predictor of 1 for the sake of symmetric treatment.5).2.3 Note that this formula of & has a form similar to as given in (10. Therefore under to go to zero at about the reasonable circumstances we can expect same rate as the inverse of the sample size T. A problem arises if x.18) yields (10.21) 0 P.1 vp =  1 [Z(xf1 I u2 V(C x:u.~ x : ) ~In .
which may be regarded as a natural generalization of the sample mean.2 1 Least Squares Estimators 237 stays nearly constant for all t.29) and (10. The class of linear unbiased estimators is defined by imposing the following condition on {c. The results obtained above can be derived as special cases of this general result by putting dl = 0 and d2 = 1 for the estimation of p. but that the sample mean is best (in the sense of smallest mean squared error) among all the linear unbiased estimators. where {c. is a constant for all t. for then both c(x:)~ and ~ ( 1 : ) are ~ small.27) Zc.]satisfymg (10.30) follows from (10.12) we can easily verify that unbiased estimators.27).2. we cannot clearly distinguish the effects of {x.2.33) Cctx.2.30) follows from the following identity similar to the one used in the proof of Theorem 7. Consider the estimation of an arbitrary linear combination of the parameters dla t d2P.]. we define the class of linear estimators of dla d2p by E.28).) Intuitively speaking.2.2.2. since in that case the least squares estimators cannot be defined. we note that proving that $ is the best linear unbiased estimator (BLUE) of P is equivalent to proving (10.2.2.31) also shows that equality holds in (10.)' CC: for all {c.2. and putting dl = 1 and d2 = 0 for the estimation of a.31) is the sum of squared terms and hence is nonnegative. and (10.) are arbitrary constants. Again.236 10 1 Bivariate Regression Model 10.26) is equivalent to the conditions (10.11). The unbiasedness condition implies + From (10.2. (Note that when we defined the bivariate regression model we excluded the possibility that x.30) if and only if c.32) Cc.) when xt is nearly constant. (10.) Let us consider the estimation of p. The proof of the best linear unbiasedness of & is similar and therefore left as an exercise. Equation (10.28).x. Comparing (10. = 0 and since Zc. = x:/~(x. We have p is a member of the class of linear (10. we see that the condition (10.2.23).2.) on the unity regressor is precisely the sample mean of {y. (Note that the least squares estimator of the coefficient in the regression of {y. we give only a partial proof here.1) into the lefthand side of (10.30) 5 I ~(x. = 0.2. although its significance for the desirability of the estimators will not be discussed until Chapter 12.2. But (10.* = 1 using (10.y. Note that (10. = d2.y.2. We can establish the same fact regarding the least squares estimators. The problem of large variances caused by a closeness of regressors is called the problem of multicollinearity.2.]: (10.y. Actually. the least squares estimator.2.12: Recall that in Chapter '7 we showed that we can define a variety of estimators with mean squared errors smaller than that of the sample mean for some values of the parameter to be estimated.*)~in other words.1.2. = dl and (10. . The class of linear estimators of p is defined by Z. Then dl& f d2Pis the best linear unbiased estimator of dla + drip.31) because the lefthand side of (10.2.26) ECc. and inasmuch as we shall present a much simpler proof using matrix analysis in Chapter 12. we can prove a stronger result. Because the proof of this general result is lengthy.2.2.2. = P for all ci and f3.}and the unity regressor on (y.2'7) and (10. Inserting (10.26) and using Eu. For the sake of completeness we shall derive the covariance between & and p.
We shall evaluate ~ 6From ~ (10. Since {u.Q.29).37) 10.23) and (10.41).2. although the bias diminishes to zero as T goes to infinity.3 Estimation of 0 ' Taking the expectation of (10. we can write Equation (10.) defined in (10. and using (10.32) and (10. 10.2.38) cC: = XU:  C(Q. Using 6' we can estimate and V& given in (10.] were observable.7). Using (10.2)u2 and hence 62is a biased estimator of u2.2.44) implies that ~6~= T'(T .u.2. (10.2.2.9) yields We shall not obtain the variance of 6 ' here.42) Finally. we obtain from (10.2.2. in certain situations some of these estimators may be more desirable than the least squares.2.2. we should note that the proof of the best linear unbiasedness of the least squares estimator depends on our assumption that {u. linear unbiasedness of the least variance is given by ~ ~ ~ ( e squares estimator follows from the identity (10...* = ~ ( c : ) ~ .238 10 I Bivariate Regression Model (10. It is well to remember at this point that we can construct many biased and/or nonlinear estimators which have smaller mean squared errors than the least squares estimators for certain values of the parameters.2. except to note that (10.36) by Q. . If {u.40). Then u2 can be estimated by because of (10.2.2.2. its The f )best ~ .] are not observable. If we prefer an unbiased estimator. Similarly.)'.c. from which we obtain Then the least squares estimator dl& + d2P can be written as C C ~ and ~ . Also. Moreover.2 1 Least Squares Estimators 239 The variance of Zctytis again given by (10. from (10.2. Therefore.2.2.2.2. summing over t. the most natural estimator of u2 would be the sample variance T'XU:.15) by 1 : and summing over t yields because of (10. we can use the estimator defined by Multiplying both sides of (10. Dkfine CG: = Zu.36) and (10.2.38) and (10.38).2.11) by : x and summing over t yields We shall now consider the estimation of u2.33) imply Cc.24) by substituting 6' for u2 in the respective formulae. Although the use of the term bast squares here is not as compelling as in the case of & and p.2.2.7) .2.8) and (10. we must first predict them by the least squares residuals (Q.18). in Section 10.43) we conclude that which we shall call the least squares estimator of u2.39) yields But multiplying both sides of (10. we use it because it is an estimator based on the least squares residuals.2.1'7).3 we shall .2. multiplying both sides of (10.] are serially uncorrelated with a constant variance. we have We omit the proof of this identity. and (10.
55) lim x(lt*)' T4 =w and which is the square of the sample correlation coefficient between {y.2.5) and (10. .24) converge to zero.2. the smaller the G .2. is the sample variance of {y. .)and btl.] or on another independent sequence {s.)~2 f i ~ x . ' the better the fit. s. This decision should be made on the basis of various considerations such as how accurate and plausible the estimates of regression coefficients are.48) TS. we conclude that & and are consistent if fi p.51) we obtain In this section we prove the consistency and the asymptotic normality of and the consistency of (i2 under the least squares estimators & and suitable assumptions about the regressor {x. it is usually a good idea for a researcher to plot (6. it makes sense to choose the equation with the higher R'.50) CQ: = C(y.]. we shall discuss this issue further.  Inserting (10. P  a . = minZ(y. A systematic pattern in that plot indicates that the assumptions of the model may be incorrect.1. a.2.] and {x.2.we shall use the measure of the goodness of fit known as R square.2. 0 5 R' 5 1. The statistic G2 may be regarded as a measure of the goodness f i t . therefore. * ~ .2.2.46) and (10. To prove the consistency of & and we use Theorem 6.2. how accurately the future values of the independent variables can be predicted. That is.2 1 Least Squares Estimators 241 E indicate the distribution of Z C : .). from (10.a)'.If {y.4 Asymptotic Properties of Least Squares Estimators Therefore. we need only show that the variances given in (10. the regression is good. Since (10. This Here s statistic does not depend on the unit of measurement of either {y.5. where In practice we often face a situation in which we must decide whether {y. One purpose of regression analysis is to explain the variation of {y. as well as its variance.46) as the square of the sample correlation between {y.47) TI?' = minqy.1.] are the predictors of {u.From (10. a we have (i2 5 s : . using (10.j)' + fi'~(x2. from which one could derive more information.].2. In Section 12. The statistic G2 is merely one statistic derived from the least squares residual {zi.23) and (10.] or (xJ.12) into the righthand side of (10. Then we must respecify the model. we have (10.2.50) yields Finally.2.].j)'.or by including other independent variables in the righthand side of the regression equation. Therefore. 10. assuming the normality of (%I.240 10 1 Bivariate Regression Model 10. we must choose between two regression equations and .].} are to be regressed on {x.]. fi fi (10. . and so on.2.they should behave much like (u.] are explained well by {x.7). = T'IC(~. since 6' depends on the unit of measurement of {y.). . Other things being equal. Since both & and are unbiased estimators of the respective parameters. However.].}by we say that the fit of the variation of {x.].2.)' and (10. namely.]. perhaps by allowing serial correlation or heteroscedasticity in {u. We can interpret R2 defined in (10.px.].) against time.2.11). . Since {C. which states that convergence in mean square implies consistency.
242
10
1
Bivariate Regression Model
10.2 ( Least Squares Estimators
243
We shall rewrite these conditions in terms of the original variables {x,}. Since ~ ( 1 : ) and ~ (ZZ:)~ are the sums of squared prediction errors in predicting the unity regressor by {x,) and in predicting {x,) by the unity regressor, respectively, the condition that the two regressors are distinctly different in some sense is essential for (10.2.55)and (10.2.56)to hold. and (z,], t = 1,2, . . . , T, we measure Given the sequences of constants (x,] the degree of closeness of the two sequences by the index
Finally, we state our result as
T H E O R E M 10.2.1
In the bivariate regression model (10.1.1), the least squares estimators & and fi are consistent if
(10.2.64) lim EX: = a
T+m
i
and
Then we have 0
5 p ;
5
1. To show p ;
5
1, consider the identity
Note that when we defined the bivariate regression model in Section
Since (10.2.58)holds for any A, it holds in particular when
10.1, we assumed PT f 1.The assumption (10.2.65)states that p~ 1 holds is in general not restrictive. in the limit as well. The condition (10.2.64) Examples of sequences that do not satisfy (10.2.64)are x, = t' and x, = 2" but we do not commonly encounter these sequences in practice. we have Next we prove the consistency of 6'. From (10.2.38)
(10.2.66)
+
xu: 1 2 6* =   Z(d, u,). T T
Inserting (10.2.59) into the righthand side of (10.2.58)and noting that the lefthand side of (10.2.58) is the sum of nonnegative terms and hence is nonnegative, we obtain the CauchySchwartz inequality:
Since {u:) are i.i.d. with mean a2,we have by the law of large numbers (Theorem 6.2.1)
Equation (10.2.43)and the Chebyshev's inequality (6.1.2)i m p l y (See Theorem 4.3.5 for another version of the CauchySchwartz inequality.) The desired inequality p : 5 1 follows from (10.2.60). Note that p ; = I if and only if x, = z, for all t and &. = 0 if and only if {x,)and {z,) are x , %= 0). orthogonal (that is, E Using the index (10.2.57) with z, = 1,we can write
Therefore the consistency of 62follows from (10.2.66), (10.2.67), and (10.2.68) because of Theorem 6.1.3. We shall prove the asymptotic normality of & and From (10.2.19) and (10.2.21)we note that both fi  P and &  a can be written in expressions of the form
0.
and where {z,] is a certain sequence of constants. Since the variance of (10.2.69) goes to zero if Cz; goes to infinity, we transform (10.2.69)so that
I
244
10
1
Bivariate Regression Model
10.2
1
Least Squares Estimators
245
the transformed sequence has a constant variance for atE T. This is accomplished by considering the sequence
Next, using (10.2.61) and (10.2.62), we have
since the variance of (10.2.70) is unity for all T. We need to obtain the conditions on {z,}such that the limit distribution of (10.2.70) is N(0, 1). The answer is provided by the following theorem:
THEOREM 1 0 . 2 . 2
Let {u,} be i.i.d. with mean zero and a constantvariance u2 as in the model (10.1.1). If
n,ax z:
(10.2.71) lim
Tjm
 0,
zz:
Therefore (1:) satisfy the condition (10.2.71) if we assume (10.2.65) and (10.2.74). Thus we have proved that Theorem 10.2.2 implies the following theorem:
T H E O R E M 1 0.2.3
then
In the bivariate regression model (10.1.1), assume further (10.2.65) and (10.2.74). Then we have
Note that if zt = 1 for all t, (10.2.71) is clearly satisfied and this theorem is reduced to the LindebergL6vy central limit theorem (Theorem 6.2.2). Accordingly, this theorem may be regarded as a generalization of the LindebergL6vy theorem. It can be proved using the LindebergFeller central limit theorem; see Amemiya (1985, p. 96). 3 and &  a by putting z, = We shall apply the result (10.2.72) to fi  6 x : and z, = 1 : in turn. Using (10.2.63), we have (10.2.73) max (x:)' :  max (xt  2) 5 4 max x ~(x:)~ (1  p;)zx; (1  p;)Zx:
2
and
Using the terminology introduced in Section 6.2, we can say that & and
p are asymptotically normal with their respective means and variances.
Therefore {x:] satisfy the condition (10.2.71) if we assume (10.2.65) and
I
(10.2.74) lirn
T+m
'
2 max xt
Ex;
' 0.

Note that the condition (10.2.74) is stronger than (10.2.64), which was required for the consistency proof; this is not surprising since the asymp totic normality is a stronger result than consistency. We should point out, however, that (10.2.74) is only mildly more restrictive than (10.2.64). In of this fact, the reader should try to construct a order to be co~lvinced sequence which satisfies (10.2.64) but not (10.2.74). The conclusion of Theorem 10.2.3 states that & and fi are asymptotically
246
10
1
Bivariate Regression Model
10.2
1
Least Squares Estimators
247
normal when each estimator is considered separately. The assumptions of that theorem are actually sufficient to prove the joint asymptotic normality of & and that is, the joint distribution of the random variables defined in (10.2.76) and (10.2.77) converges to a joint normal distribution with zero means, unit variances, and the covariance which is equal to the limit of the covariance. We shall state this result as a theorem in Chapter 12, where we discuss the general regression model in matrix notation.
6;
Solving (10.2.81) for a2yields the maximum likelihood estimator, which is identical to the least squares estimator k2. These results constitute a generalization of the results in Example 7.3.3. In Section 12.2.5 we shall show that the least squares estimators & and are best unbiased if {u,]are normal.
p
10.2.6 Prediction
10.2.5
Maximum Likelihood Estimators
In this section we show that if we assume the normality of {u,]in the model (10.1.1), the least squares estimators 2, and 6 ' are also the maximum likelihood estimators. The likelihood function of the parameters (that is, the joint density of y1, y2, . . . , p)is given by
6,
The need to predict a value of the dependent variable outside the sample (a future value if we are dealing with time series) when the corresponding value of the independent variable is known arises frequently in practice. We add the following "prediction period" equation to the model (10.1.1):
where yp and u p are both unobservable, xp is a known constant, and up is independent of {u,),t = 1, 2, . . . , T, with Eup = 0 and Vup = u2.Note that the parameters a,p, and a2are the same as in the model (10.1.1). Consider the class of predictors of yp which can be written in the form
Taking the natural logarithm of both sides of (10.2.78), we have
(10.2.79)
log L =

T log 2 n  T log u 2  1 E(y,  a  PX,)? 2 2
where & and are arbitrary unbiased estimators of a and P, which are linear in {Y,], t = 1, 2, . . . , T. We call this the class of linear unbiased predictors of yp. The mean squared prediction error of jp is given by
(10.2.84)
p
Since log L depends on a and P only via the last term of the righthand side of (10.2.79), the maximum likelihood estimators of a and P are identical to the least squares estimators. Inserting & and into the righthand side of (10.2.79), we obtain the socalled concentrated loglikelihood function, which depends only on a2.
E(yp  jP)'
+ pxp)  (a+ = u2 + V ( &+ Pxp),
=
E{up  [(&
p
where the second equality follows from the independence of u p and {y,), t = l , 2 , . . . , T. The least squares predictw of yp is given by
(10.2.80)
log L*
T log = 2
T log a2  1 2~:. 2a  2 2a2 It is clearly a member of the class defined in (10.2.83). Since V(& Oxp) 5 V ( &ipxp)because of the result of Section 10.2.2, we conclude that the least squares predictor is the best linear unbiased predictol: We have now reduced the problem of prediction to the problem of estimating a linear combination of a and P.
Differentiating (10.2.80) with respect to u2 and equating the derivative to zero yields
(10.2.81)
+
d log L*
=
T 
du2
2a2
1 + E2i: =O.
za4
We can do so by the method of induction. we have 1 I 10. we shall postpone it until Chapter 12.3.3.3.5.2.3. Therefore.5) is simply @ .3.5) and (10.5) is a p proximately correct for a large sample even if (u.1) and (10.2) we must show that u' Cu ^f can be written as a sum of the squares of T . We shall consider the null hypothesis Ho: f3 = Po.2.76). but since they are jointly normal. From Definition 2 of the Appendix we know that in order to construct a Student's t statistic.4.3. Using (10.1. A hypothesis on a can be similarly dealt with.}are not normal.] are not normal. Since is a good estimator of P. Equation (10.3. we need a chisquare variable that is distributed independently of (10. (10.2) are independent. where a simpler proof using matrix analysis is given.2) are independent by Theorem 3. it implies their independence by Theorem 5. using (10. Note that the lefthand side of (10.3 ( Tests of Hypotheses 249 10.Po divided by the square root of an unbiased estimate of its variance.2. however.45). it is reasonable to expect that a test statistic which essentially depends on 0 is also a good one.2.17). A test on the null hypothesis a = a.d. The test is not exact if {u. provided that the assumptions for the asymptotic normality are satisfied.5 we showed that a hypothesis on the mean of a nomral i.3.248 10 ( Bivariate Regression Model 10. Using Definition 2 of the Appendix. can be performed using a similar result: To prove (10.3 TESTS OF HYPOTHESES 10. is zero.] are normally distributed. We now prove that (10.3.3.2. depending on the alternative hypothesis.2 Tests for Structural Change 1 i \ j Suppose we have two regression regimes i and I where each equation satisfies the assumptions of the model (10. sample with an unknown variance can be tested using the Student's t statistic. In the next two paragraphs we show that U~ZG: fits this specification. so we see from (10. we obtain In Section 9. We could use either a onetail or a twotail test. the distribution of would be completely specified and we could perform the standard normal test. We state without proof that where 1 5 ' is the unbiased estimator of 0' defined in (10. we assume that {ult]and {u2.3.19) that p since Zxf = 0 by (10.1).] . Throughout this section we assume that {u. In addition.1). we conclude that under Ho 6 under the null hypothesis Ho.3. Therefore. A similar test can be devised for testing hypothkses on a and P in the bivariate regession model.1.2.2. we must use a Student's t test. if u2 were known. the test based on (10. We denote Vult = o : and VuZt= u: .i.2 independent standard normal variables.11). A linear combination of normal random variables is normally distributed by Theorem 5.19).1) and (10. Since this proof is rather cumbersome.3.4) shows that the covariance between and ii. If u2 is unknown.1 Student's t Test Therefore. Because of the asymptotic normality given in (10.2. which is usually the case. where Po is a known specified value. as in the proof of Theorem 3 of the Appendix.
2) implies and T2 f . C2i.we have under Ho Setting a : =a .3.8) in the postwar period. under either the null or the alternative hypothesis. Since they are independent of each other because {ult} and (up.8).3 ( Tests of Hypotheses 251 are normally distributed and independent of each other. The difficulty of this situation arises from the fact that assuming a (10..Let 1 3 1 and 132 be the least squares estimators of PI and pp obtained from equations (10.8).3.X 2T .zi:.2. we consider a test of the null hypothesis Ho: pl = P2 without : = a. depending hypothesis a on the alternative hypothesis.3. Then (10..).3.t (10. we have by Definition 3 of the Appendix + ZTZ (10.3. For example.5). assuming a : = a .2 ~ Therefore. This tworegression model is useful to analyze the possible occurrence of a structural change from one period to another.7) and (10. .3..11) p . respectively. A simple test of this hypothesis can be constructed by using the chisquare variables defined in (10.~ .3. . t = x2. Several procedures are available to cope with this socalled BehrmFisher .3.7) and (10.3. = P2 without asBefore discussing the difficult problem of testing : = a : . by Theorem 1 of the Appendix. in (10. (10.9) and (10.3.14) cannot be derived from (10.3.3. The null hypothesis can be tested where Ci2 = (T .12) are independent. a2 0 : where we have set TI + T2 = T.12) t=1  TI Tz + t= 1 .3. we study the test of the null hypothesis Ho: PI = P2.respectively. we have p : .3.14) in either a onetail or a twotail test. Then.3.]are independent.4) ($2 using (10. and x .11).3.13) without assuming a : = a. We can construct a Student's t statistic similar to the one defined in (10.11). A onetail or a twotail test should be used. t t= 1   2 XT.R2 as in (10. First.10) and (10. Since (10.3. we have by Definition 2 of the Appendix Note that a : and a : drop out of the formula above under the null : = a.3.7) may represent a relationship between y and x in the prewar period and (10. Finally.13) simplifies it to Let {2ilt} and (2i2J be the least squares residuals calculated from (10.3.250 10 1 Bivariate Regression Model 10. let us consider testing the null hypothesis Ho: a : = suming a a : . depending on the alternative hypothesis. defining xTt = XI.
Assume (u.64) but not (10. The problem is to estimate the unknown parameter p in the relationship y$ = P$.1.2./C. are chosen: 5. That is to say. t = 1. . .1) assume that a = 0..have Evt = EX:. Since E( = 1 and since EX: = v by Theorem 2 of the Appendix. are i. is how we should determine the degrees of freedom v in such a way that vS is approximately X:.where ( is defined by cT=P~ denoted by 0. and 3. Y + v.1) obtain the constrained kmt squares estimator of P.i. = (T2 .4) Give an example of a sequence that satisfies (10.5.2) In the model (10. Derive its mean squared error without assuming that or = P. and (u.74). EXERCISES I . (Section 10. . 2. = x.) is a bivariate i. . into the righthand side of (10.d.px. (Section 10. {y. random variable with mean zero and vari2 ances ut and u .3. p~ p pR 4. we should determine v by v = 2(V()'. .16) is approximately true is equivalent to the assumption that US.)are unobservable random variables. . (Section 10. and covariance a . and 3. (Section 10. prove the J 2. where 6: = (TI . and 6.i. .2. The remaining question.2. on the basis of observations (y. 2. based on the assumption a = P. 13. and x.2. The assump tion that (10. .2)2i:. 2. and . z i : .d.3. Show that if in fact a = p. the mean squared error of is smaller than that of the least squares estimator p p p.]and (x. T. In practice.3. with the distribution P(ut = 1 ) = P(ul = 1) = 0. where (y?} and ( ~ $ are 1 unknown constants.' . 5..9) and (10. Also assume that {q). v must be estimated by inserting I?. t + 1.1'7) to obtain (10.4) Suppose that y. .252 10 I Bivariate Regression Model I Exercises 253 problem.3. P = I .]are observable random variables. Create your own data by generating {ut} according to the above scheme and calculate and for T = 25 and T = 50. Obtain the mean and mean squared error of the reverse least squares estimator (minimizing the sum of squares of the deviations in the direction of the xaxis) defined by = & l y : / ~ ~ = l y tand x t compare them with those of the 3 3 2 least squares estimator p = Zl=lytx. t = 1. but we shall present only onethe method proposed by Welch (1938). = y$ + ut and x.2) In the model (10.) and {v.Obtain the probability limit Since vX: = 2v by Theorem 2 of the Appendix. .} and {x. 2. denoted v. v.)'. is approximately distributed as X: for an appropriately chosen value of v.=lxt. . therefore. Then we can apply Definition 2 of the Appendix to (10.18) and then choosing the integer that most closely satisfies v = 2 ( ~ 6 ) . .1. T.2)'ZT2 .16). T = 3.2.). minimizes z T = ~ ( ~p .2) Following the proof of the best linear unbiasedness of same for &.3. We now equate the variances of ve and x:: 3. = t fort = 1. Welch's method is based on the assumption that the following is a p proximately true when appropriate degrees of freedom. (Section 10.For other methods see Kendall and Stuart (1973).2.2. we .
.d.d. T.50 Period 2 4. p .86 98..70 92.) Are these estimators better or worse than the least squares e s t i m a m and &? Explain. yp.)in ascending order and define x(l) 5 x(2) 5 . .. u2). .30 (1) j5 = & + +x5. 2. (Section 10..i. is the wage rate (dollars per hour) of the ith person and x. p This is 6.16 99.Prove the con P=7 and X2 .S. 2. . u. = u? Arrange {x. 0.). N(0.}are known constants and equal to (2.4) Consider a bivariate regression model y.] are i. y4).24 93. x .53 95.69 100. with Eu. D.00 3 where we assume limT. We consider two predictors of y5: + Y : x: 3. . . 10. The data are given by the following table: Number of people Male Female 20 10 Sample mean of wage rate 5 4 Sample variance of wage rate 3.30 5.75 3.00 4.i. .. = 0 and Vu. / ~ .I 254 10 1 Bivariate Regression Model I = E.5'1 h=  Pz. (Section 10. (Section 10. 19721979.) and (y. .C. . Government Printing Office. t = 1. where (x.1) Test the hypothesis that there is no gender difference in the wage rate by estimating the regression model P=Cy.6) Consider a bivariate regression model y.d.}are i. = 1 or 0.=I Cx2 7.. . y3. .. 2.2) The accompanying table gives the annual U.2'  I n . = 0 and Vu. (Source: Economic Report of the President.94 95.3. assume also that c= 1 and obtain the probability limit of T T Obtain the mean squared prediction errors of the two predictors. + u.2.2.i. limT+ known as an errorsinvariables model. where {x. depending on whether the ith person is male or female.Let S be T / 2 if T is even and (T + 1)/2 if T is odd. (Section 10. .60 3.30 4.4) In the model of the preceding exercise. = 0'. = a px. Washington. Also define + + 1 where y.assuming = ~ x . We wish to predict y 5 on the basis of observations (yl.3. 5 x(q. with Eu. Exercises 255 T 2 of = ~ ~ l ~ .40 6. For what values of a and p is j5 preferred to j5? 9. (Section 10.2.25 99.]are known constants and {u.80 5. t = 1. where & and are the least squares estimators based on the first four observations on (x.. = a fix. 1992. 5. 19801986. We assume that {u. 22 = d < a. p Period 1 8.R1 = c sistency of and & defined by < lim.} are i. and Period 2. 0. 4) and {u. data on hourly wage rates (y) and labor productivity (x) in two periods: Period 1.
256 10 1 Bivariate Regression Model (a) Calculate the linear regression equations of y and x for each period and test whether the two lines differ in slope. A matrix. Graybill (1969) described specific a p plications in statistics. here denoted by a boldface capital letter. . In this chapter we present basic results in matrix analysis. The multiple regression model with many independent variables can be much more effectively analyzed by using vector and matrix notation. and Bellman's discussion of them is especially good. Symmetric matrices play a major role in statistics. For the other proofs we refer the reader to Bellman (1970). we prove only those theorems which are so fundamental that the reader can learn important facts from the process of proof itself. Since our goal is to familiarize the reader with basic results. Et6MlLMTS OF M A T R I X A N A L Y S I S I In Chapter 10 we discussed the bivariate regression model using summation notation. chapter 4). Additional useful results. (c) Test the equality of the slope coefficients without assuming the equality of the variances. for example. Johnston (1984.1 D E F I N I T I O N OF BASIC TERMS Matrix. 11. is a rectangular array of real numbers arranged as follows: . assuming that the error variances are the same in both regressions. appendix). appendix). especially with respect to nonsymmetric matrices. Anderson (1984. Marcus and Minc (1964). (b) Test the equality df the error variances. or Arnemiya (1985. For concise introductions to matrix analysis see. may be found in a compact paperback volume.
1). From the definition it is clear that matrix multiplication is defined only when the number of columns of the first matrix is equal to the number of rows of the second matrix. For example. jth element is equal to a]. to be an n X m matrix whose i. a real number). The following example illustrates the definition of matrix multiplication: is a symmetric matrix. jth element is ca. Addition or subtraction. we have = {a. (A vector will be denoted by a boldface lowercase letter.1. Kcto?: An n X 1 matrix is called an ncomponent column vector. . a square matrix A is symmetric if A' = A.1) and Matrix multiplication. If A and B are matrices of the same size and A A +. indicating that its i. denoted by A'. For example. If a square matrix A is the same as its transpose.1. . is defined as an m X n matrix whose i. a. jth element is equal to a . we define cA or Ac. Square matrix..1) is a square matrix if n = m. and a 1 X n matrix is called an ncomponent row vector. Let A be as in (11.) If b is a column vector. Diagonal matrix. For example. Transpose.]. az2.. = b. 11. is called an n X m (read "n by m") matrix. A is called a symmetric matrix. . The exception is when one of the matrices is a scalarthe case for which multiplication was previously defined. A in (11. In other words. jth element (the element in the ith row and jth column) is u. An n X n diagonal matrix whose diagonal elements are all ones is called the identity matrix of size n and is denoted by I. if the size of the matrix is apparent from the context.2 MATRIX OPERATIONS = {b. a vector with a prime (transpose sign) means a row vector and a vector without a prime signifies a column vector.] and B B if and only if a.. .1. A matrix which has the same number of rows and columns is called a square matrix. Then the transpose of A.1) and suppose that n = m (square matrix). C = AB is an n X r matrix whose i. Then.. b' (transpose of b) is a row vector. Normally.]. jth element cll is equal to C&larkbkl. which has n rows and m columns.. then we write A = Equality. Then.] and B = {b.]. If A and B are matrices of the same size and A = {a. If A and B are square matrices of the same size. The other elements are offdiagonal elements. Thus. as in (11. In other words. both AB and BA are defined and are square matrices of the same size as A and B. Scalar multiplication..1. Let A be as in (11. Sometimes it is more simply written as I. For example.258 11 I Elements of Matrix Analysis 11. for every i and j. Matrix A may also be denoted by the symbol {a. the product of a scalar and a matrix. then Note that the transpose of a matrix is o b t a i n 4 by rewriting its columns as rows. are called diagonal elements. every element of A is multiplied by c. 2 b. . Elements all.2 1 Matrix Operations 259 A matrix such as A in (11. A square matrix whose offdiagonal elements are all zero is called a diagonal matrix. AB and BA are not in general equal. Let A be an n X m matrix {azl) let B be an m X r matrix {b.1).1) and let c be a scalar (that is.]. However. Symmetric matrix. Identity matrix.1.]. .1. B is a matrix of the same size as A and B whose i. Let A be as in (11.
Then we define the determinant of A. b. i = 1. = A. a Z 3 ) . . 2 . or that A is postmultiplied by B. is the scalar itself.a 2 ( i ) . we may say either that B is premultiplied by A. . . .1) matrix has already been defined. r. .. Then it is easy to show that 1.A 1= { a I . and I.(i). chosen in such a way that none of the elements lie on the same row.3 1 Determinants and Inverses 261 is given by [:] :] [Y I: = " In describing AB. . . . given inductively on the assumption that the &terminant of an ( n . Vectors a and b are said to be orthogonal if a'b = 0. . Consider a 2 X 2 matrix r 1 A = ( a l .b. .l~. and N = 2 for (az1. . Throughout this section. and consider the sequence [rl ( i ) . namely a'a. . . b e t h e 1 / ( n .a l s ) . in the case of a 3 X 3 matrix. we write A as a collection of its columns: (11.an. or a scalar. =1 IfAB is defined. by the above rule of matrix multiplication.Let rl (i) be the row number of al ( i ) .1) X ( n .b2.) I A l= t=l (1)N(')al(i)a2(i) .N = 1 for ( a l l . . denoted by [A[.i 260 11 I Elements of Matrix Analysis 11. is defined by The determinant of a 3 X 3 matrix where al. and so on. which is called the vector product of a and b . .3 DETERMINANTS AND INVERSES 1 Alternatively. . . a. a$$).r 2 ( i ) . Let a' be a row vector ( a l . . . . be the identity matrices of size n and m. 3 e t. Let A be an n X m matrix and let I. The determinant of a 1 X I matrix. THEOREM 1 1 .1) matrix obtained by deleting the ath row and the jth column from A. are n X 1 column vectors. .) and let b be a column vector such that its transpose b' = ( b l . . . I The j above can be arbitrarily chosen as any integer 1 through n without The term (l)'+J!A.u p . respectively.1 ) I ( n . r2( i ). we have arb = Cy=laa.1. element as I 1 11.3. 1 Now we present a formal definition. all the matrices are square and n X n. . 2 . Before we give a formal definition of the determinant of a square matrix. For example. a.).by [al ( i ) . a. . a r b = bra. .). r .a. . let us give some examples. . Consider a sequence of n numbers defined by the rule that the first number is an element of al (the first column of A ) . The vector product of a and itself. n!. . the determinant may be defined as follows. 3 . i / ~ ~ ~ l N l T l 0 ~ 1 1L . The proof of the following useful theorem is simple and is left as an exercise. a n d l e t A z . a. is called the inner product of a. . .A = A and Ah. .l)'+'a. I I A l =x n 2 (. denoted by [A/or det A. One can define n! distinct such sequences and denote the ith sequence. a 2. .jI is called the cofactor of the changing the value of /A]..azp.. . 2. ( i ) ] . Let N ( i ) be the smallest number of transpositions by which [rl ( i ). n ] .Then we have n I 11. (AB)' = B'A'. . Clearly. a .(i) ] can be obtained from [ l .as (1 1 .as2.3. Then. . ] b e a n n X n m a t r i x . First.. the second number is an element of a2. N = 0 for the sequence ( a l l . and so on. . . Its determinant.. ( i ) ] .
C. We have same size. is the matrix defined by b 1 = IA'(. but B and C need not be. 3. We now define the inverse of a square matrix. we may state all the results concerning the determinant in terms of the column vectors only as we have done in (11.1 and 11.3. the determinant is zero.3.3.2 11.3. This theorem follows immediately from Theorem 11.3. the determinant changes the sign.) Then 1 ~ = ~ IAl 1IBI if A and B are square matrices of the Proof.3.3.2 r provided that bl # 0. T H Eo R E M 1 1.3. 7 If the two adjacent columns are interchanged. the determinant I is zero... then B = A' and BI = A. 3 . but can be directly derived from Definition 11. 3 This theorem can be easily proved from Definitions 11.3.3.3.3. THEOREM 11.. The use of the word "inverse" is justified by the following theorem. (As a corollary.3. since the effect of interchanging adjacent columns is either increasing or decreasing N ( i ) by one.3.3 1 Determinants and Inverses 263 Let us state several useful theorems concerning the detest THEOREM 11.3. as given in Definition 11. 3 . Here (l)'"l&. The determinant of a matrix in which any row is a zero vector is also zero because of Theorem 11. I A I This theorem can be proved directly from (11.5).5 is rather involved.6) AI =  1 {(l)'''l~. we can easily prove the theorem without including the word "adjacent.3. a n d D b e matrices such that If any two columns are identical. T H E O R E M 1 1 .3.3) and (11. The theorem follows immediately from the identity ABBlA' = I.3).8 LetA.1. If any column consists only of zeroes. The proof of this theorem is apparent from (11.262 11 I Elements of Matrix Analysis DEF~NITION 11.1 The inverse of a matrix A.3. denoted by A'.5). The proof of Theorem 11.5). Theorem 11. (Note that A and D must be square. where 0 denotes a matrix of appropriate size which consists entirely of . then (AB)' = B'A'.2 and Theorem 11.5 is square and ID\ st 0.l.I] is the matrix whose z. T H E0 RE M 1 1 .l is the cofactor of a!. Because of the theorem.4.1. (11.B.3.l~. jth element is (l)''~~A. since the same results would hold in terms of the row vectors.3. It implies that if AB = I. but only for a matrix with a nonzero determinant.2 follows immediately from (11. and {(l)'+'IA.4 If A and B are square matrices of the same size such that (A1 f 0 and 1 B I # 0.") THEOREM 11.1.3. THEOREM 1 1 .
4.4 SIMULTANEOUS LINEAR EQUATIONS 3 1 since any point on the line xl + 2x2 = 1 satisfies (11. Consider the following n linear equations: I This matrix satisfies the condition because constitute the unique solution to XI = 2yl  n.3. 4 .3.1) with n = m. provided that the inverse on the lefthand side exists. Next consider Throughout this section.1. ~.3.5 yields (11. .8) is equal to I taking the determinant of both sides of (11. we can express the last clause of the previous sentence as THEOREM 1 1 . K. The matrix where E = A . simply premultiply both sides by ' 1 But there are infinite solutions to the system : 11.3. F = D . we shall assume that X is nXKwithKSn. Ax = y. (such that). . Define x = (xl. Generally.1) can be written in matrix notation as (114. .2) can be solved in terms of x for any y.8) is unity and the determinant of A [Dl. 3 . We can ascertain from (11.t. In general. + + does not satisfy the abovementioned condition because there is dearly no solution to the linear system Proot To prove this theorem. . and s. the solution to Ax = y is unique. 1 A s e t o f v e c t o r s x l . ~. .BD'c. it has infinite solutions for some other y. 9 Let us consider a couple of examples. the righthand side of (11. .4 1 Simultaneous Linear Equations 265 zeroes.x:! = n .)' and y = (yl. if A is such that Ax = y has no solution for some y.5). .y. and F' = D' D'cE'BD'..4. A will denote an n X n square matrix and X a matrix that is not necessarily square. . . E' = A' A'BF'CA'. . Otherwise independent if ~ f < c.264 11 I Elements of Matrix Analysis 11. Using the notation 3 (there exists).5) that the determinant of the first matrix of the lefthand side of (11.7). Then (11. .8) and using Theorem 11. y2. It can be shown that if A satisfies the said condition. Therefore. 2. We now embark on a general discussion.CA'B. = 0 for all i = 1.)' and let A be as in (11.3.x. V (for any). l = 0 implies c. x ~ i s s a i d to belinearly =x. .4. O BD'cI 1 1i 1 A major goal of this section is to obtain a necessary and sufficient condition on A such that (11. .yl for any yl and yp. x2. it is linearly dependent. . . in which the major results are given as a series of definitions and theorems.3.2) D E F I N I T I O N 1 1 .
Since 1 1 1 = 1. x.a Combining Theorems 11.4. . y) as X' in Theorem 11. THEOREM 11.' ~ .p.+~) and assume without loss of generality that xn.2) and (2. . y) is not CI. y)c = 0. A"exists. .. .. . Then Axl = el. .2 shows that (A.4.4.x. D THEOREM 11.4. But x # 0 means that at least one element of x is nonzero.4. But (1.a. .) and IAl = F(al.4. a.en) = I. . where xl is the first element of x. . ( ProoJ Assume K = n .3. there is a vector x # 0 such that Ax = 0. a2. we have ( 1 . Ax = y. A i s n o t CI =3 b l = 0. Prooj Using the matrix (A.t. .4.1. we conclude that the following five statements are equivalent: [A[f 0. stated below as Theorem 11.1 . .x.4. .a2. Ax2 = e2.cn+l)'?Write the nth row of X' as (xnl.4.3 and 11.2 THEOREM 11. Solving the last equation of X'c = 0 for cn+l and inserting its value into the remaining equations yields n .4) are linearly dependmt hecause c2 = I:[ I:[ + [:] The theorem is clearly true for n = 2. . THEOREM 11. . . 8 Pro$ (a) If IAl f 0. Therefore 1 The converse of Theorem 11. .5. = en may be summarized as AX = I by setting X = (xl. Since A is CI. If the column vectors of a matrix (not necessarily square) are linearly independent. .1 without loss of generality. therefore.. Premultiplying both sides by A yields Ax = y because of Theorem 11.266 11 I Elements of Matrix Analysis 11. we prove three other theorems first. ap...4. where X' is n X (n + 1) and c = (cl.4. x2. and consider n + 1. a.1.3.6. namely. IfX' is K X n. the coefficient on y in c is nonzero and solving for y yields Ax = y. . Ax = y. as. . Set x = A . .5 A is CI + IAi # 0. we say the matrix is row independent (RI) . a3.4 immediately yields the converse of Theorem 11. Since A is not CI.4 1 Simultaneous Linear Equations 267 For example. Ax. if the row vectors are linearly independent. A' exists by Definition 11. we say the matrix is column independent (abbreviated as CI). . Pmo$ Write A = (al. Therefore there exists c f 0 such that (A. . . . Q IA~ F(Ax. . Assume xl # 0 without loss of generality.t. We prove the theorem by induction. . A is CI.3.4 IAl f 0 a v y 3 x s. . f 0 follows from Theorem 11. Is there a vector c f 0 such that X'c = 0.+~f 0. . a. .1 equations for which the theorem was assumed to hold.1 v y 3 x s.). be a column vector with 1 in the ith position and 0 everywhere else.2 and the A l = 0. . a.2. . But the lefthand side of (11. DE F I N ITI o N 1 I . l ) and (1.3. . Assume that it is true for n.4. a2. From Definition 11. So the prescribed c exists.3.) + .) = ~1141 + x2F(a2. e2. the vectors ( 1 . c2. . O THEOREM 11.3.) and noting that (el.4. .4. . 2) are Imearly iaadepenht because 2 only for cl = c = 0.. .) + x3F(a3.K row vectors of zeroes at the bottom of X'. is more difficult to prove. (%) Let e.1. Q righthand side is xlbl by Theorem 11.2 From the results derived thus far. .a2. . for otherwise we can affix n .3 A is CI =3 can be satisfied by setting cl = 2 and c p = 1. where K < n.8) is zero by Theorem 11.5.Xf is not CI.
is full rank if and only if X'X is nonsingular. where K 5 n. .K) matrix Z of rank n . where K 5 n.8. = [ X'X 0 9 Pro$ RR(X) > K contradicts Theorem 11. k:] .5 CR(X) or.4. by Theorems 11. K 5 n. suppose that rank(X) = r < K. 6 Let an n X K matrixx be full rank.. Then there exists an n X ( n .4.5. where B.4 THEOREM 1 1 . but this contradicts the assumption that X is CI. Then by Theorem 11. Proof. We have . and suppose that there exists an n X ( n .8. THEOREM 1 1 . is determined by fill rank.5 below.4.4.6. Suppose Sc + Zd = 0 for some vectors c and d. To prove the "if" part. But since CR(X) = RR(X). B y reversing the rows and columns in the above argument. is the n X n matrix obtained by replacing the ith column of A by the vector y. RR(X). Collect these n . RR(X) = K. z)( # 0. Then (XI1. 4 . 4 . c = 0.4. 8 An n X K matrix X. Let S be r linearly independent columns of X.3.2) for x as x = A .K). Z)] > n. In this method x. there exists avector zl # 0 such that X'z. M~:B daat a q w e matrix is full rank if and only if it is nonsingular. we denote the maximal number of linearly independent column vectors of X by CR(X) (read "column rank of X") and the maximal number of linearly independent row vectors of X by RR(X) (read "row rank of X"). (x'xI f 0 and ID\ # 0.4. RR(X1) = r. we can solve (11.2. 7 where D = Z'Z is a diagonal matrix.. note that X'Xc = 0 =0 c1X'Xc = 0 + c = 0.r ) matrix. Suppose rank(X) = r > K . Then rank(X) 5 K. Let Xll consist of a subset of the row vectors of XI such that XI. which is a contradiction. Then.K ) matrix Z such that (X. Then. equivalently.K vectors and define Z as (zl. z. 4 . TO prove the "only if' part.2. z. . where Y is an arbitrary r X (K . Therefore. . the ith element of x. then rank(X) = K. note that Xc = 0 3 X'Xc 0 Proof. For an arbitrary matrixx. is called the rank of X. R THEOREM 1 1 . DEFINITION 11. o D EF 1 N ITI 0 N 1 1. we say X is Let X be a matrix not necessarily square. THEOREM 1 1 . Therefore. = 0. Next.4.2. Therefore RR(X) 2 CR(X).4. Clearly.. Let X1 consist of a subset of the column vectors of X such that Xl is n X r and CI. Z) is nonsingular and X'Z = 0. If rank(X) = min(number of rows. is r X r and RI. If W'X = 0 for some matrix W with n columns implies that rank(W) 5 n .4. . we say A L nonsingulal: If A is nonsingular.11. by Theorem 11.3.4.3 If any of the above five statements holds. Then CR[(X. Again by Theorem 11. 9 Proof. and so on. Ll T H Eo R E M 1 1.4.K. .Y) is RI. 1 0 ProoJ Suppose that X is n X K.4. d = 0. By the same theorem. If RR(X) < K. Because of Theorem 11. 4 . z1)'z2 = 0.1 and 11. 4 . there exists a vector cw # 0 such that X a = 0 because of Theorem 11. and CR(X) = r 5 K. Premultiplying by S t yields S'Sc = 0. . + c = 0. The remainder of this section is concerned with the relationship between nonsingularity and the concept called full ran$ see Definition 11.K such that Z'X = 0.' ~ An . Premultiplying by Z' yields Z'Zd = 0. number of columns). where K < n.268 11 I Elements of Matrix Analysis 1 1. Let an n X K matrix X.10.4 / Simultaneous Linear Equations 269 D E F I N IT I o N 1 1 . I(x. alternative solution is given by CramerS rule. there exists a vector z2 # 0 such that (X. be C1. Let S be as defined above. we can similarly show RR(X) r CR(X). X'Z = 0.
Y'B' is full rank. The reader should verify. Premultiplying (1 1.5. rank(BX) = rank(X) if B is nonsingular (NS). in the ith diagonal position. 54). THEOREM 1 1 . Also by Theorem 11. Inverting both sides of (11. But this implies that a = 0.5) A'(=AA) = H D ( A )HI. Denote A by D(A. 2. Proof. which is the ith diagonal element of A. Theorem 11. respectively. Then clearly A' = D(x.Then a'Z1 = 0 because B is NS. The ith column of H is called the characteristic vector (or eigenuector) of A corresponding to the characteristic root of A. that (115. Premultiplying and postmultiplying (11. if we ignore the order in which they are arranged. Thus the orthogonal diagonalization (11. Throughout this section. the inverse of a matrix defined in Definition 11. Proof. I I Given a symmetric matrix A.10. p.1) H'AH = A. Proo_f: See Bellman (1970. Clearly.1) by H yields . The diagonal elements of A are called the characteristic roots (or eigenualues) of A. there exists an orthogonal .5. however. a matrix operation f (A) can be reduced to the corresponding scalar operation by the formula (11. A will denote an n X n symmetric matrix and X a matrix that is not necessarily square.9.)]H1.A11 = 0. there exists afullrank (n . More generally. since H'AH = A would still hold if we changed the order of the diagonal elements of A and the order of the corresponding columns of 14 Let be a characteristic root of A and let h be the corresponding characteristic vector. for example.1) has enabled us to reduce the calculation of the matrix inversion to that of the ordinary scalar inversion. We can write Y'X = Y'B'BX. The set of the characteristic roots of a given matrix is unique. For example. 5 .4. we obtain (115. Note that H and A are not uniquely determined for a given symmetric matrix A.5.54 f (A) = HD[f(h. The following theorem about the diagonalization of a symmetric matrix is central to this section.1 1 For any matrix X not necessarily square.9 there exists a fullrank (n .5.4.1 since H'H = I implies H' = H'. matrix H (that is. Therefore rl 5 r:. (115. O 11.5.3.7) I A .9. since Z is full rank.5 PROPERTIES OF THE SYMMETRIC MATRIX H.5. 2 where A is a diagonal matrix. Then.2) and using Theorem 11. and noting that HH' = H'H = I. We shall often assume that X is n X K with K 5 n.3. ~ For any symmetric matrix A. indicating that it is a diagonal matrix with A.1 is important in that it establishes a close relationship between matrix operations and scalar operations. But this contradicts the assumption of the theorem. By Theorem 11..r) matrix W such that S'W = 0 and rank(W) = n .1) by H and H' respectively.r. Let rank(BX) and rank(X) be rl and rz. Since B is NS. suppose that a'ZrB = 0 for some a. there exists an n X ( n .2 is related to the usual inverse of a scalar in the following sense.4.270 11 I Elements of Matrix Analysis 11.rl) X n matrix Z' such that Z'BX = 0. which play a major role in multivariate statistical analysis.2) A =Hm'.4. how can we find A and H? The faglowing theorem will aid us. and rl = rz.). a square matrix satisfying H r H = I) such that (11.) Therefore r2 5 rl by the first part of Theorem 11. rank(X) = K .7 yields Now we shall study the properties of symmetric matrices.').6) Ah = hh and (115.4. Therefore. Z'B is also full rank. T H E0 R E M 1 1 5 .re) X n matrix Y such that Y'X = 0. (To see this.5 1 Properties of the Symmetric Matrix 271 1 4 3 by Theorem 11. 0 THEOREM 1 1.
7).5.6 Let XI and A . From this definition some of the theorems presented below hold for general square matrices. 4 = rank(A) [: I:[ = 3 [I:] and xy + xf = 1 For any matrices X and Y not necessarily square. Solving THE 0 R E M 1 1 . the reader may assume that the matrix in question is symmetric. p. Proof.1) and HH' = I .4. = A. have Therefore the characteristic roots are 3 and 1. 56).5 ( Properties of the Symmetric Matrix 273 Singling out the ith column of both sides of (11.8) yields Ah.XI)h = 0 and using Theorem 11.5 simultaneously for yl and yp.1 proves (11. Using (11. O Let us find the characteristic roots and vectors of the matrix The rank of a square matrix is equal to the number of its nonzero characteristic roots.3 11. we obtain yl = fi' and y2 = fi'~ (yl = @I and y2 = @I also constitute a solution. is the ith diagonal element of A and h.1) can be written in this case as . be the largest and the smallest characteristic roots. Then A and B can be diagonalized by the same orthogonal matrix if and only if AB = BA. T H E o R E hn 1 1 . ProoJ: See Bellman (1970.5.5. p. Using (11. The characteristic roots of any square matrix can be also defined by (11. Proof.2).6) as (A . where A.272 11 I Elements of Matrix Analysis THE 0 RE M 1 1 . Even when a theorem holds for a general square matrix. Whenever we speak of the characteristic roots of a matrix.. This proves (11.6). Solving Proof.7).5. of an n X n symmetric matrix A.h. See Bellman (1970.5.7 by Theorem 11. Then for every nonzero ncomponent vector x. simultaneously for x l and xp.4.5. whenever both XY and YX are defined. The following are useful theorems concerning characteristic roots.) The diagonalization (11.we have . Let A and B be symmetric matrices of the same size. 96).is the ith column of H.4.5. Writing (11.11 . respectively. we obtain xl = x2 = *I.11 by Theorem 11.5. Suppose that n l of the roots are nonzero. the nonzero characteristic roots of XY and YX are the same. T H E O R E M 1 1 5 .4.5.5. we have   = rank(AHr) = rank(HA) by Theorem 11. we shall prove it only for symmetric matrices. 5 . We shall prove the theorem for an n X n symmetric matrix A.
5.) If A is positive definite. Therefore A > 0 implies d'Ad > 0 and X'AX > 0.1 0 + Proof: We shall prove the theorem only for a symmetric matrix A. A > 0 does imply that all the diagonal elements are positive. is defined as the sum of the diagonal elements of the matrix. Taking the determinant of both sides of (11.5.Aisposztivedefinite The determinant of a square matrix is the product of its characteristic roots. which plays an important role in statistics. (Negative definite and nonpositive definite or negatzve semzdefinite are similarly defined. we write A > B.2) and using Theorems Similarly. we have . if A . The theorem follows immediately from Theorem 11.5. then d # 0.1 The trace of a square matrix.3. T H E0 RE M 1 1 .") Proof.5 (11. D E F I N I T I O N 11. since the characteristic roots of A' are the reciprocals of the characteristic roots of A because of (11.5. Using (11.5." "negative. Proof: We shall prove the theorem only for a symmetric matrix A.5.5.5. If X is full rank.h ) z 2 O and z l ( A . tr XY = tr YX. D E F I N I T I o N 1 1 . R THEOREM 11. implies the theorem." or "nonpositive. The trace of a square matrix is the sum of its characProof. Then csX'AXc = dsAd. 11. H'H = I implies I H I = 1. T H E O R E M 1 1.5.11) 1 Properties of the Symmetric Matrix = 275 where z = H'x. R The following useful theorem can be proved directly from the definition of matrix multiplication. 0 IAJ We now define another important scalar representation of a square matrix called the trace.274 11 I Elements of Matrix Analysis < 11. Then. is another important scalar representation of a matrix. we have X'AX r 0. A symmetric matrix is positive definite if and only if its characteristic roots are all positive.1 and 11. which we examined in Section 11.3. then A > 0 3 X'AX > 0. we use the symbol 2.X.8.9 LetAbeannXnsymmetricmatrixandletXbean n X K matri'k where K 5 n. Let X and Y be any matrices.5. For nonnegative definiteness. if x'Ax > 0 for every nvector x such that x 0. Moreover.) More generally.9) follow from e'(hlI .5. such that XY and YX are both defined.5. If x'Ax 2 0.3.5. Let c be an arbitrary nonzero vector of K components.12 There is a close connection between the trace and the characteristic roots.5. T H E0 RE M 1 1 . 0 teristic roots.8 THEOREM 11. The inequalities (11.7 tr A = tr HAH' tr AH'H = tr A . we say that A is nonnegative dejinite or positive semidejinite.I)z 2 0. = [ A ( which .10. Proof: The theorem follows from Theorem 11. The following theorem establishes a close connection between the two concepts. The inequality symbol should not be regarded as meaning that every element of A is positive. R We now introduce an important concept called positive definiteness. R Each characteristic rootbf a matrix can be regarded as a real function of the matrix which captures certain characteristics of that matrix. if rank(X) = K. (If A is diagonal. (The theorem is also true if we change the word "positive" to "nonnegative. We deal only with symmetric matrices. we write A > 0. denoted by the notation tr. not necessarily square.3).B is positive definite.5.2 1 IfAisannXnsymmetricmatrix. The determinant of a matrix.5. T H E O R E M 1 1.1 1 A >0 J AI > 0. Then A 2 0 *X'AX 2 0. and define d = Xc. since the determinant of Therefore a diagonal matrix is the product of the diagonal elements.5 yields IA~ = IHI' I A~.Since A 2 0 implies d'Ad 2 0.6.2) and Theorem 11.
9 3 ) .] and B = [2 O 0 0.P = M. The projection matrix M = z(z'z)'z'. Then we say that 8 is better than 0 if A 5 B for any parameter value and A f B for at least one value of the parameter. . It immediately follows from Theorem 11. Since ( X . that is. 1 6 P = P' = p2.14) by ( X . 3 I A l < IBI.4. Set yl = Xcl and y2 = Zc2.14 y = yl + y2 such that Pyl = yl and Py2 = 0 . it implies that c'0 is at least as good as c'0 for estimating c ' 0 for an arbitrary vector c of the same size as 0 .trA < tr B.) D E F I N ITI o N 1 1 . (Both A and B can be shown to be nonnegative definite directly from Definition 11. THEOREM 1 1 .1 we defined the goodness of an estimator using the mean squared error as the criterion. How do we compare two vector estimators of a vector of parameters? The following is a natural generalization of Definition 7.1 to the case of vector estimation.14 that Py = yl. Unfortunately. We must use some other criteria to rank estimators.5.0 ) ' and B = ~ ( 8 0 ) (8 .1. For example. Hence we call P a projection matrix. Z ) c = Xcl Zcp. Any square matrix A for which A = called an idempotent m a t r i x . 5 . there exists an nvector c such that y = ( X .14) (I . where Z is as defined in the proof of Theorem 11. The two most commonly used are the trace and the determinant. Note that A < B implies tr A < tr B because of Theorem 11.' X I . This matrix plays a very important role in the theory of the least squares estimator developed in Chapter 12.5.5. Z) is nonsingular. 1 3 Let A and B be symmetric positive definite matrices of the same size. Z) is nonsingular and X'Z = 0. In the remainder of this section we discuss the properties of a particular positive definite matrix of the form P = x ( x ' x ) .5 1. since the resulting vector yl = Xcl is a linear combination of the columns of X .5.9. T HE0R E M 1 1.5.Z) . 5 .O) . .2. z)' yields the desired result.14. plays the opposite role from the projection matrix P. Ci THEOREM 1 1 . Let A and B be their respective mean squared error matrix. the converse is not necessarily true. Postmultiplying both sides of (11. 5 . We call this operation the projection of y onto the space spanned by the columns of X . Let 0 and 8 be estimators of a vector parameter 0 . An arbitrary ndimensional vector y can be written as Proof.  276 11 Elements of Matrix Analysis 11.P  M ) ( X . We have (11. Theorem 11. where X is an n X K matrix of rank K. Then A r B + B' 2 A'.12. Z )= (X.5.13). It can be also shown that A < B implies IAl < IBI. and A > B 3 B' > A'. Namely. The question we now pose is.0 ) ' . Then clearly Pyl = yl and Py2 = 0 . By Theorem 11.12).2. 1 5 1 . Recall that in Definition '7.2.5. Next we discuss application of the above theorems concerning a positive definite matrix to the theory of estimation of multiple parameters. A = [:. In neither example can we establish that A 2 B or B r A. consider (11. In (11. we cannot always rank two estimators by this definition alone.(X. The proof is somewhat involved and hence is omitted.16 states that P is a symmetric idempotent matrix. is This can be easily verified. A = ~ ( 0 0 ) ( 0 . Proof. My = y2. Proof: See Bellman (19?0. Thus we see that this definition is a reasonable generalization of Definition 7. 0 is at least as good as 8 for estimating any element of 0 .(0.9. and in (11. there exists an n X (n . 5 .5.5. p. R + Note that if 0 is better than 8 in the sense of this definition.5.K ) matrix Z such that ( X .Z) = 0 .5 1 Properties of the Symmetric Matrix 277 T H € 0 RE M 1 1 .5. More generally.2. In each case.
3. and second.18 below. verify it for the A and B given in Exercise 3 above.*)'X. by using the inverse of the matrix. (Section 11. 5 .~ x . 1 9 L e t X b e a n n X KmatrixofrankK. by using Cramer's rule: I 7. Since. [:4k:] I:[ = 6.K) fullrank matrix Z such that PZ = 0. 1 Proof.Xp) such that X1 is n X K I and Xg is n X K2 and K1 + K2 = K . which in turn implies A = 0. ] then x ~ .4. 3.5. S .14.4) Solve the following equations for xl for x2. R THEOREM 1 1 . 1 7 rank(P) = K.x.3) Venfy = IAl IBI.5. By Theorem 11.3. (Section 11. (Section 11. Thus the theorem follows from Theorem 11. by using the inverse of the matrix.K. there exists an n X (n . If you cannot prove it.5. But since the second matrix is the identity of size K .278 11 1 Elements of Matrix Analysis I A= Exercises 279 THEOREM 1 1 . (Section 11.4) Solve the following equations for xl.4 the nonzero characteristic roots of x(x'x)'x'and (x'x)'X'X are the same. by using Cramer's rule: Characteristic roots of P consist of K ones and n . If we define X l = [I . I 8 4. first.PartitionXas X = (XI.) R THEOREM 1 1.3) Using Theorem 11.5. 5 . which implies rank(W) 5 n .K zeroes.10. W = XA + ZB for some matrices A and B. and xs. (An alternative proof is to use Theorem 11.4) Find the rank of the matrix Proof. (Section 11.3 and Theorem 11. x2. Proof. by Theorem 11. PW = 0 implies XA = 0.~X. first.)"x. 5. its characteristic roots are K ones.we have x(x'x)'x' = xl(x. As we have s h o w in the proof of Theorem 11. prove its corollary obtai word "adjacent" from the theorem. Suppose PW = 0 for some matrix W with n rows.5. and second.*(~.14. Therefore W = ZB. x ~ ) . (Section 11. + X.*~. [::] and B = [: i i. The theorem follows from noting that 2.3) (A + B)' = A'(A\ fB')'A' whenever all the Prove A" inverses exist. where .x ~ ( ~ .
2. 2. . . 14. (Section 11. after we rewrite (12. = 0 and Vu.) are observable random variables. . with Eu.5) Find the inverse of the matrix I dimension as I. The linear regression model with these assumptions is called the classical regression model. We define the multiple linear regres sion model as follows: and its characteristic vectors and roots. Compute x(x'x)'X where {y. . 15. except for the discussion of F tests in Section 12. We shall state the assump tion on {x.}. The organization of this chapter is similar to that of Chapter 10. and PI.5) Let A be a symmetric matrix whose characteristic r o w are less than one in absolute value. PR and (r2 are unknown parameters that we wish to estimate.4. it is positive definite.280 11 I Elements of Matrix Analysis and compute A' 10.8.  12 MULTIPLE REGRESSION MODEL Prove Theorem 11.i. .5.] are unobservable random variables which are i. The results of Chapter 11 on matrix analysis will be used extensively.1. a boldface capital letter will denote a matrix and a boldface lowercase letter will denote a vector.5) Suppose that A and B are symmetric positive definite matrices of the same size. = a ' .i = 1.d. (Section 11. The multiple regression model should be distinguished from the multivariate regression model. In this chapter we consider the multiple regression modelthe regression model with one dependent variable and many independent variables. (Section 11. . . Most of the results of this chapter are multivariate generalizations of those in Chapter 10.5) Define r + xx' where x is a vector of the same In Chapter 10 we considered the bivariate regression modelthe regression model with one dependent variable and one independent variable. {xi. As before. (u. PZ. . T are known constants. which refers to a set of many regression equations.1) in matrix notation. .]later. . 12.1 INTRODUCTION 13. (Section 11. (Section 11.5) Compute 5. Show that if AB is symmetric. . . . K and t = 1. Show that 12.
2) as a single vector equation.1.1. . Our assumptions on {u. we can find a subset of K1 columns of X which are linearly independent.T X Kg. .) The assumption that X is full rank is not restrictive.).13) y = X$ + u.1) to (12.=~u:. The reader must infer what 0 represents from the context. Without loss of generality assume that the subset consists of the first K.1. i = 1. Note that u'u.4. 2. K.4) Eu = 0 .A). .1. which the reader should also infer from the context.4. Define the Kdimensional row vector $ = (xtl.. . x(p). x2.3) as 12.5) is of the size T. . x(n). PK)I . Another way to express this assumption is to state that X'X is nonsingular.1) with respect to pcand equating the partial derivative to 0.] which minimize the sum of squared residuals Differentiating (12. and hence X = XI(I. . T.5) and using Theorem 11. A)P and XI is full rank.1.*. . . .8 yields Eu'u = U'T. . Note that in the bivariate regression model this assumption is equivalent to assuming that x. because of the following observation.1.). 2.1) can be written as (12. where X = I. .4). is not constant for all t. .2).1. . .. . . the real advantage of matrix notation is that we can write the T equations in (12. . . u2.X(*).5) Euu' = a I. We denote the columns of X by x(.2) as and (121.5) fully. .8.x ( ~ )The ] assumption rank(X) = K is equivalent to assuming that x(I. We shall denote a vector consisting of only zeroes and a matrix consisting of only zeroes by the same symbol 0. x ( ~is . Thus. . columns of X and partition X = (XI. Define the column vectors y = (yl. X(K) are linearly independent. by Definition 11. we obtain where pl = (I. Therefore we can rewrite the regression equation (12. .2 1 1 Least Squares Estimators 283 BS We shall rewrite (12.1.2 In practice. 1 We assume rank(X) = K. 0 denotes a column vector consisting of T zeroes. xtK) and the Kdimensional column vector P = (PI.% ! 1 yt = xlP + u.2 LEAST SQUARES ESTIMATORS 12.. Then (12.1. Although we have simplified the notation by going from (12.1. usually taken to be the vector consisting of T ones.4. Suppose rank(X) = K1 < K.1) in vector and matrix notation in two steps.)imply in terms of the vector u (121. Taking the trace of both sides of (12.Then we can write X2 = XIA for some K1 X K2 matrix A. P2. in the multiple regression model (12.X2). X = [x(. .1) are defined as the values of {P. (See Theorem 11. . .5.1.2. t = 1. . uT)' and define the T X K matrix X whose tth row is equal to xi so that X' = (x. Then we can rewrite (12. . 2 (12.1. . . T is a scalar and can be written in the summation notation as c. . . .x. The identity matrix on the righthand side of (12.1 Definition The least squares estimators of the regression coefficients {PC]. for most of our results do not require it. the reader should write out the elements of the T X T matrix uu'. . . .xT).where XI is T X K1 and X2. a row vector times a column vector.p ) ' and u = (ul. To understand (12. X(K). .. .. . ~ ~ In (12. Then.2. . . But we shall not make this assumption specifically as part of the linear regression model.282 12 1 Multiple Regression Model 12.y*.
2.8)by noting that it is equivalent to (12.2.2. we can rewrite (12.4) as Define the vector of partial derivatives Then The reader should verify (12.2) and rewrite it as Dizxfi {Dl).2).2. .. 2.2. .2.284 12 1 Multiple Regression Model 12. If we assume K = 2 and put x ( l ) = 1 and X ( Z ) = X .1. We now show how to obtain the same result without using the surnmation notation at all.11) and M = I . . 1)' be the vector of T ones and define x = ( x l .5) gives the least squares estimators & and @ defined in Section 10.2. The P and M above are projection matrices. are the solutions to the K equations in (12. Premultiplying (12.7) the vector of the least squares w. The reader should verify that making the same substitution + p2 1 and . . (12.13) represents the decomposition of y into the vector which is spanned by the columns of X and the vector which is orthogonal to the columns of X.9) and f = Py and where we have defined ( X ' X )"elds fi = (81.Then we can write (12. the multiple regression model is reduced to the bivariate regression model discussed in Chapter 10.2.4) by (12. .2. Premultiplying (12.sidtds is defined by where C should be understood to mean c : = . 1.5. .2.6) we define the least squares predictor of the vector y. Partition X conformably as X = a Kgvector such that K1 ( X I ..2. . P g .12) by X' and noting that X'M = 0.5.2.2.X 2 ) .+ BKzxtl~tK = zxtlyts fiK&tzxt~ Pgzx?z CX~~Y~.6) i 1 1 is the same as that explained in Theorem 11..5). Suppose we partition fit = (B. Let 1 = ( 1 .2.2.2 1 Least Squares Estimators 285 The least squares estimators i = 1. .2. . Using the vector and matrix notation defined in Section 12.10) respectively as (12.9) + Pgz~tl~t2 Plz~tg~tl * + . we can write (12.2. 7 = xfi..14. Note that the decomposition of y defined by since we have assumed the nonsingularity of X'X.2. . .) where fil is a K1vector and is Kg = K. in (12. Equation (12.2.9). 8.2.8) to 0 and solving for 3 f yields (12.8) and (10. As a generalization of (10.x2. .2. as I (12.We put PI = @.xT)'.2).3) i 1 As a generalization of (10.4) an explicit formula for a subvector of f i . We can write (12. we obtain S(p) = (y .3) as Defining P = x(x'x)'x' (12. . unless otherwise noted.2. It is useful to derive from (12. .12) El = My.2.in (12. whose properties were d i s cussed in Section 11. K. .. . Equating (12. .Xp)'(y Xp) = y'y  2y'xp + p'X'XP which signifies the orthogonality between the regressors and the least squares residuals and is a generalization of equations (10.P.2. denoted by f.1) in vector notation as (12.2. + + .2. P K ) '.
the elements of ~ ( are j large.2.15) yields (jl = (x.2. (12. The theorem derived below shows that (j is the best linear unbiased estimator (BLUE). Inserting (12.we have . jth element of ~ ( isj the covariance between 6 .. M2y.x.24).6.2. .we have = = ( X ' X )'XI (EUU')x(x'x) uZ(x'x)'. Similarly.2. 6.17) (j2 and inserting it into (12.23). Then. where M1 = I . THEOREM 1 2 .5.20).22) yields the variances V& and given.2.2. where M2 = I .3) into the lefthand side of (12.2 Finite Sample Properties of B We shall obtain the mean and the variancecovariance matrix of the least squares estimator (j. We define the class of linear estimators of P to be the class of estimators which can be written as C ' y for some T X K constant matrix C .21) By inserting (12.18) are reduced to (10. Since p* = P I.M a r k o v ) The third equality above follows from Theorem 4.2.18) (j2 = (x.6.2.iZ 286 12 1 Multiple Regression Model 12.we obtain Since X is a matrix of constants (a nonstochastic matrix) and Eu = 0 k y the assumptions of our model.2.20) to the variance of the ith element of (j.2.22) exists and is finite. and X2 = X . that setting X = (1. pJ).2.16) for (12. We call this largeness of the elements of ~ ( due to the nearsingularity of X'X the problem of multicollinearity. if the determinant X'X is nearly zero. which was obtained in (10. 12. the least squares estimator (j is unbiased.y. Define the variancecovariance matrix of 6. . (12.X I ( X . respectively.1.2 1 Least Squares Estimators 287 1 4 Solving (12. In other words.25).2. and Note that ~ ( isj equal to C O V ( ~ . as we can see from the definition of the inverse given in Definition 11. X 1 ) . j ~ ( =j p + (xrx)'X'EU = p.2.x ) in (12.2. VP O). That is.C ' u because of (12.M ~ X ~ ) .2.16) and (10. where C is a T X K constant matrix satisfying (12. (12. formulae (12. or more exactly. The fourth equality follows from the assump tion Euu' = 0'1. 2 .2. Then.24) and (10.2.3) into the righthand side of (12.2.2. If X'X is nearly singular.2. . Proof.1. x 2 ) . Next we shall prove that the least squares estimator (j is the best linear unbiased estimator of p. In the special case where X1 = 1. The least squares estimator (j is a member of this class where C' = (x'x) 'xl.3. The offdiagonal element yields Cov(&. as every variancecovariance matrix should be. we note that (12. Since we have assumed the nonsingularity of X'X in our model.3.2. where we have used the word best in the sense of Definition 11. .5).1. the variancecovariance matrix (12.2.' X .' X .19) and (12. We define the class of linear unbiased estimators as a subset of the class of linear estimators which are unbiased.24).23) is equivalent to (12.12).M&~)'x. The ith diagonal element of the variancecovariance matrix ~ ( is jequal Let p* = C ' y where C is a T X K constant matrix such that C'X = I.. The reader should verift. denoted as V (12. (j is better than P* if P* # (j. The i. since each element is a linear combination of the ? of the matrix (x'x)'x'uu'x(x'x)' elements of the matrix uu'.2. the vector of T ones. 1 ( C a u s s . C'X = I.23) EC'y = p.' X M.2. in (10. we impose vP.1.respectively. ' Thus the class of linear unbiased estimators is the class of estimators which can be written as C'y.24) ~ ( =j ~ ( ( j ~(j)(( j~ 6 ) ' .2. from Theorem 4.23) as the diagonal elements of the 2 X 2 variancecovariance matrix.17) and (12.(X. symmetric. using (12.2.
2. we see that 6.) 12.K) by Theorem 11.46) as Estimation of o2 As in Section 10.2.6 u2 tr M since Euu' = u21 u2(T . Then.30). (For further discussion of the ridge and related estimators.3.27) and (12. we note that Theorem 12. that the bias diminishes to zero as T goes to infinity.26).2.3. In particular.12). If we define a T X T matrix Note .19. = ~ .But Z'Z 2 0. that L is where 1 is the vector of T ones. An unbiased estimator of o2is defined by (12. meaning that Vp* . Suppose we wish to estimate an arbitrary linear combination of the regression coefficients.2. ~6~= T'(T .4 for the distribution of Q'Qunder the assumption that (u. we can define R ' by the same formula as Using 6 equation (10.5. is the better estimator of p.2.2. bv Theorem 11. and partition X = (1.2. From the discussion following Definition 11. although biased. by choosing d to be the vector consisting of one in the ith position and zero elsewhere.27) ii = My = it M(XP + u) = Mu. chapter 2. Using (12.K)u'. X?). by Theorem 11.46). there are biased or nonlinear estimators that are better than the least squares for certain values of the parameters.28) EQrii= Eu'Mu =E =E = = = tr u'Mu tr Muu' VB trM(Euur) byTheorem4.5.x(x'x)'.2.29) 6' = a'& TK See Section 12. we can rewrite (10. say drP. O (12. The estimator. is better than the least squares for certain values of the parameters because the addition of yI reduces the variance.2 regarding the sample mean.288 12 1 Multiple Regression Model 12. see Amemiya. than Pf Thus the best linear unbiasedness of 6i and fi proved in Section 10. 1985. multiply out the four terms within the square brackets above and use (12. Finally.1 implies that d'fi is better (in the sense of smaller mean squared error) than any other linear unbiased estimator d'P*.5.5. the premultiplication of a vector by L subtracts the average of the elements of the vector from each element.' y ' ~ y the projection matrix which projects any vector onto the space orthogonal to 1.2. In deriving is useful to note that (12.27) we obtain To verify the last equality.2 1 Least Squares Estimators 289 = = C' (Euu' ) C u2c1c using the least squares residuals ii defined in (12. The improvement is especially noteworthy when X'X is nearly singular. meaning that Z'Z is a nonnegative definite matrix. we can write s. Note. however. Therefore VP* r u2(X'X)'.u2(X'X)' is a nonnegative definite maand trix.2. the theorem follows from observing that u2(X'X)' = using Definition 11. hence &2 is a biased estimator of u2.2. ' defined in (12. The last term above can be written as U'Z'Z by setting Z = C .) are normally distributed.2.1. As we demonstrated in Section 7. since M~ = M since ufMu is a scalar by Theorem 11. we define the least squares estimator h2 as Suppose that the first column of X is the vector of ones.24).2 follows as a special case of Theorem 12.2.3.5.2.3 1 1 Therefore. An example is the ridge estimator defined as (X'X + y1)'Xry for an appropriately chosen constant y.8 From (12.1. . In other words.18.2.5.12.2.
2 . in this example.2.2.. the variance of the least squares estimatbr of each of the two parameters converges to u2/2 as T goes to infinity. Because of (12. See Section 12. 12.27) we have .(XrX) + implies that every diagonal element of X'X goes to infinity as T goes to infinity.2..5.4 Asymptotic Properties of Least Squares Estimators so that <. 3 In the multiple regression model (12.26) is a consistent estimator of 0 ' .)x(. . Using the results of Section 11. we have from (12. 62 as defined in (12. and Note that (12. (7*'7*) But the righthand side of (12.1.2.22). where Al denotes the largest characteristic root. i = 1. 2.35) we sometimes call R. 2 .34) defines the least squares predictor of y* based on X& SOwe can rewrite (12. Note that the assumption X.2. Since + 0.1.(XfX) + m.2.. XI[(XfX)'1 + 0 implies tr (x'x)~ + 0 because the trace is the sum of the charac= u2(x'x)" as we obtained teristic roots by Theorem 11. as we show below.1.2. The converse of this result is not true. the square root of R'. Therefore Xs(XfX) + m implies x~[(x'x)'1 + 0. Therefore AS(XrX) + m implies <. which in turn implies in (12.290 12 1 Multiple Regression Model 12. )can be written as \ Y We now seek an interpretation of (12. . Suppose X has two columns.2 .2 1 Least Squares Estimators 291 # ! VP. For this purpose define y* = Ly.33) as (12. Our theorem follows from Theorem 6.+ m. X: = LX2.)x(.37) THEOREM 1 2 . the ith diagonal element of X'X.u'u ulPu (12. be the Tvector that has 1 in the ith position and 0 elsewhere.2.6.1).2. we find that the characteristic roots of X'X are 1 and 2T . From (12.3. 2.2. + 0 f a i = I.36) is greater than or equal to XS(XrX) by Theorem 11.+ m. THEOREM Pro$ Equation (11. ) x ( .5 for a discussion of the necessity to modify R2 in o r d e r _ _ to use it as a criterion for choosing a regression equation. the multiple correlation coefJident. K.26) and (12.2. Then. Let e.9.2. Since the characteristic roots of (x'x)"re all positive. we have ! which is the square of the sample correlation coefficient between y* and f*. 2 In the multiple regression model (12. p for X.52).2. Therefore we do not have X.5.40) u .1). Then.33) that gmeralizes the intepretation given by (10. tr (X'X)' + 0 implies tr 1 2 . < .2.5. where XS(X1X) denotes the smallest characteristic root of X'X..1. We can prove this as follows. . VD 1 Proof. .1. the least squares estimator @ is a consistent estimator of p if XS(X1X) + m. the first of which consists of T ones and the second of which has zero in the first position and one elsewhere. Thus.35) R' = (y*'7*12 (Y*'Y*) .3) shows that the characteristic roots of (x'x)' are the reciprocals of the characteristic roots of X'X. Solving In this section we prove the consistency and the asymptotic normality of and the consistency of 6 ' under suitable assumptions on the regressor matrix X.T T 9 .
)]'I2 as its ith diagonal element.2.2.2.Xp).)x(. From (12.45) can be obtained from Theorem 10.2.1.2.5 by regarding the 2.2.45) is a generalization of equations (10. the sufficient condition for the asymptotic normality of (12.1)..2.1.2. 2 .2.67).p) + N(0. 0 Let x ( ~be ) the ith column of X and let X(o be the T X (K submatrix of X obtained by removing x ( ~from ) X.3) into the righthand side of (12.26).5 Maximum Likelihood Estimators B y a derivation similar to (12. T Define Z = XS'. Z'Z = R exists and is nonsingular. 12. Taking the natural logarithm of (12.2.2. we need the following vector generalization.dO. In the multiple regression model (12.2.1) Using (12.47) yields (12.1. the least squares estimators (j and 6' are also the maximum likelihood estimators.19) and (10.2.40).2.43) because of Theorem 6. 1985.2.1.68).we have Eu'Pu = u 2 ~Therefore.2. Then s(B .2.2).2. Since (12..48) it is apparent that the maximum likelihood estimator of p is identical to the least squares estimator To show that the maximum likelihood estimator of u2 is the 6 ' defined in (12. by Chebyshev's inequality (6. and assume that lim.45) has the same form as the expression (10. Then we have .2.292 12 = 1 1 Multiple Regression Model 1 4 12. B THEOREM 1 2 .41) x(x'x)'x'  as before.2.28). Suppose that a2 log L/d0aO1 denotes a matrix whose i. (12.2.2. 4 1 Least Squares Estimators 293 where P (12.1).4. Define In this section we show that if we assume the normality of {u.R a o ) Let 6" be any unbiased estimator of a vector parameter 0 and let VO* be its variancecovariance matrix. The following theorem generalizes Theorem 10.81) as that defined in (12.]in the model (12. (For proof of the theorem.2 and shows that the elements of (j are jointly asymptotically normal under the given assump tions.44) and noting that M(&.the likelihood function of the parameters p and u2 can be written in vector notation as .the reader should follow the discussion in Section 10. As we show& i equation (10. U'R'). where S is the K X K diagonal matrix with [x. 20 Inserting (12.2 THEOREM 1 2 ..) = 0.2. jth element is equal to d2 log L/de. We also show that (j is the best unbiased estimator in this case. B.21).we have for any e2 which implies (12. 5 ( C r a m B r . that appears in equations (10. chapter 3).xf3)'(y .2. 2 .12).2.2.2.1. .2.77).17). we have Note that (12.assume that 4 plim u'u 2 =a . a d (12. Using the multivariate normal density (5.43) plim = u'P u T 0 The consistency of 6' follows from (12.3. see Amemiya.80) and (10.41).1).we can write the ith element of (j as which is a generalization of equation (10. To show that is best unbiased under the normality assumption of u.48) log L = log T 2 T 2~ log 2 2 1 u 7 (y .
55) we conclude that if (gX is any unbd estimator of 0. if not.2.(X'y .2.1.6 12. u2)' and calculate the CramCrRao lower bound for the log L given in (12.P)' 5 E(P* . theleastsquares 1 3 i i 12.54)  a2 log L  apa(U2) .1. the corresponding predictor jp = xi6 is better than yp* .56) is the variancecovariance matrix of fi. In Section 11. and Eup = 0 and Vup= u2.3.2.Note that p and u2 are the same as in (12. we can rank the two estimators by the t . (12. it shows that if an estimator fi is better than an estimator p* in the sense of Definition 11.2.5.48). I f u i s n o r m a l i n themodel (12.2.1). and (12. We put 0 = ( P I . The righthand side of (12. we immediately see that the least squares predictor jp = x# is the best linear unbiased predictor of yp.2. we a f f i x the following "prediction period" equation to the model (12.2.jp12 = E[upxi@  P)]' (12.1).6 Prediction As in Section 10.1): where yp and up are both unobservable and xp is a Kvector of known constants.. Let fi and p* be the two estimators of p.1.5.We have fi is best unbiased.59) x+ E(yp. In particular. Thus. But since the righthand side of (12.52).2.294 12 [ Multiple Regression Model T H E O R E M 12.P)(P* . u 1 From (12.1..49) is called the C r a k .4. We assume that upis distributed independently of the vector u.2.6.56) The second equality follows from the independence of up and u in view of Theorem 3. Equation (12.53). we have proved the following generalization of l h m p l e 7.x$* in the sense that the former has the smaller mean squared prediction error.2.2. Let fi be an arbitrary estimator of f3 based on y and define the predictor i I : i We obtain the mean squared prediction error af j pconditional as (12. i E(fi .54) we obtain From (12.2.5.PI' or the reverse inequality.5 we demonstrated that we may not be always able to show either L vp* 2 $(x'x)I. (12.2 1 Least Squares Estimators 295 1 f estimator where r is in the sense given in connection with Definition 11. by restricting 6 to the class of linear unbiased estimators.49) and (12.2.2.X'XP).2.2.2.P)(fi .59) establishes a close relationship between the criterion of prediction and the criterion of estimation.R a o (matrix) lower bound.
the constraints mean that a Klcomponent subset of p is specified to take certain values. One weakness of this criterion is that xp may not always be known at the time when we must choose the estimator. Then the problem is equivalent to the minimization of 6'XrX6subject to Q'6 = y.1).2) subject to (12.5) for 6 gives . As another example. The essential part of the mean squared prediction error (12. K 2 = K.1).3. We assume that the constraints are of the form and where Q is a K X q matrix of known constants and c is qvector of known constants.1) when there are certain linear constraints about the elements of p. we must often predict xp before we can predict yp.3.61) plus a2 the unconditional mmn squared prediction erro?: 12. The constrained least squares (CLS) estimator of P. Accordingly we now treat xp as a random vector and take the expectation of x P ( 6 . we minimize (12. where we shall discuss tests of the linear hypothesis (12. Thus.p)(p .3 C O N S T R A I N E D LEAST SQUARES ESTIMATORS Instead of directly minimizing (12. We assume q < K and rank(Q) = q.The righthand side of (12.5.3.2.3) under (12.2) is minimized without constraint at the least squares estimator 0. denoted P+.58).3 cannot be found. For example. The study of this subject is useful for its own sake.59).2.3. whereas the remaining K2 elements are allowed to vary freely.1).c = y. In this section we consider the estimation of the parameters P and a2in the model (12. we can rewrite (12.3. provides another scalar criterion by which we can rank estimators. it also provides a basis for the next section. we obtain subject to the constraints specified in (12.296 12 ( Multiple Regression Model 12. In practice. Using (12.1). Put . is defined to be the value of f3 that minimizes the sum of the squared residuals: + which means that the second moments of the regressors remain the same from the sample period to the prediction period. 0) where I is the identity matrix of size K1 and 0 is the K 1 X K2 matrix of zeroes such that K . the case where Q' is a row vector of ones and c = 1 corresponds to the restriction that the sum of the regression parameters is unity.1) embody many common constraints which occur in practice.61) is a useful criterion by which to choose an estimator in situations where the best estimator in the sense of Definition 11. Solving (12.3.We assume that Constraints of the form (12. Writing for the sum of the squares of the least squares residuals.2.3.1.p = 6 and Q'@.3.2.3.p)'xp. if Q' = (I. In Section 12.3.2. The solution is obtained by equating the derivatives of p with respect to 6 and a qvector of Lagrange multipliers X to zero.3 1 Constrained Least Squares Estimators 297 trace or the determinant of the mean squared error matrix. which is mathematically simpler.2) as S(D) where E* denotes the expectation taken with respect to xp. We shall call (12.1 we showed that (12.
Zlc is the vector of dependent variables and a (K . Using A we can transform equation (12. and & consists of the last K . we obtain (12. Z2 and c are all known constants. S) is nonsingular.12) Since Z1.14) I E I Transforming 6 and y into the original variables.3.9) = (x'x)'Q[Q'(X'X)~Q]'~. (The proof can be found in Amemiya. 1985.3.3. The R is not unique.q) matrix R such that. we immediately see that ED' = P.3.] are normal in the model (12.16) we have used the same symbol as the CLS 0' because the righthand side of (12.10) if X'X is nonsingular.3.3. where Z = XA'. Throughout the section we assume the multiple regression model (12. Now define A = (Q. We can give an alternative derivation of the CLS P+.[Q' (x'x)~Q]'~.q) matrix S such that (Q.q columns of Z. we have reduced the problem of estimating K parameters subject to q constraints to the problem of estimating K .3.1) and if (12.4. inserting (12. (12.ZIc = Z2a + U.3.1) the null hypothesis.1) is true.1) is true.15) I & = ( z .3) as follows: R(R'X'XR)'R'x'~ + [I .~ z .13). first. If {u.16) P + = A' = p by Since the second term within the braces above is nonnegative definite.14) to obtain (12. We shall call (12. 4 Finally.7) into (12. R)'. by the transformation of (12.3.3.298 12 1 Multiple Regression Model 12.3. R) is nonsingular and.R(R'X'XR)'R'x'x]Q(Q'Q)".10) under the assumption that (12. From (12. R'Q = 0.6) and solving for A.1) is true. I The corresponding estimator of u2 is given by Taking the expectation of (12.Q(Q'Q)'Q'IS satisfies our two conditions. the constrained least squares estimators pt and 02+ are the maximum likelihood estimators. (Q. we have VP' 5 V f i .9 assures us that we can find a K X (K .4 1 Tests of Hypotheses 299 Inserting (12. We can apply the least squares method to (12.9)vector a constitutes the unknown regression coefficients.16) can be shown to be identical to the righthand side of (12.10) y .1.3.3.We should expect this result.1) with the normality of u. second.1. a = R'P.3. Finding R is easy for the following reason. I:[ In (12. We discuss . (~ Zlc) VP' = u2{(~'X)" (x'x)'Q[Qr (X'X)'Q]'Q~ (x'x)'1.c). we can write the solution as (12.1.4. equation (12.3.3.) 12. 1 p ' = fi  (x'x)'Q[Q'(x'x)'Q]'(Q'fi . It can be shown that if (12.3.13).8) A 6 = . Thus. Then R defined by R = [I . and then estimate (12. Theorem 11. we get (12.3. any value that satisfies these conditions will do.q parameters without constraint. Suppose we find a K X (K . chapter 1. for fi ignores the constraints.4 TESTS OF HYPOTHESES 12.3.3.8) back into (12.3.10) as (12.1) as a testable hypothesis and develop testing procedures.3. the CLS P+ is the best linear unbiased estimator. since the distribution of the test statistics we use is derived under the normality assumption. We can evaluate V P + from (12. Zl consists of the first q columns of Z.3.3.3.1 Introduction In this section we regard the linear constraints of (12. z ~ ) .5) and sdvlng for 8.14) represents a multiple regression model in which y .
Student's t with T .2. 4 . the F test.2) are independent.4) denotes a diagonal matrix that has one in the first T .2. by Definition 2 of the Appendix. As preliminaries.1) and (12.2).4. Next. Since iilii = ulMu 12. we need only show that (j and Q are uncorrelated.2) and (12. where 6 is the square root of the unbiased estimator of u2 defined in equation (12.8) are independent because of Theorem 12.4. we have TK where w = H'v and wiis the ith element of w. The random variables defined in (12. and a test for structural change (a special case of the F test).4 1 Tests of Hypotheses 301 Student's t test.7) Because of Theorem 11.We use a onetail or twotail test.4. we derive the distribution of (j and G'ii and related results.4.Inthemadel(121. I). Pro$ We need only show the independence of (j and Q because of Theorem 3.we immediately see that (j is normally distributed if y is normal. Applying Theorem 5. depending on the alternative hypothesis.29).2 to (j defined in (12.1) and (12.18.2.4. Note that the denominator in (12.we can write 0 .27).12). 2 In the model (12. Pro$ If we define v = u'u.4.2.4.4.4. The null hypothesis Q'P = c can be tested by the statistic (12.2.2. 0 zT~:w?  . q = 1. Hence. where the righthand side of (12.2. a 2 ~(x'x)'Q] ' under the null hypothesis (that is. Since (j is normal as shown above. discussed in the next section. The F test. Note that here Q' is a row vector and c is a scalar. with the normality of u we have (x'x)'x' (Euu') M = a2 (x'x)'X'M =o.5.4.4.K by Definition 1 of the Appendix.1. Therefore. must be used if q > 1. there exists a T X T matrix H such that H'H = I and Q'$ . we obtain THEOREM 1 2 .4.K degrees of freedom.6) E(B .P)Q' = E(x'x)'X'UU'M = THEOREM 12. This follows from (12.5.300 12 I Multiple Regression Model 12.N(0.5).2 Student's t Test The t test is ideal when we have a single constraint.1.4. Therefore.K diagonal positions and zero elsewhere. that is. Since w X.1 LetQbeasdefinedin(12. I ) . Using the mean and variance obtained in Section 12. But since (j and Q are jointly normally distributed by Theorem 5. the random variables defined in (12.4. we have v from (12.1) with the normality of u.N ( 0 . we have (12.2.N [c. let us show the independence of (12. if Q'P = c). in that order. .9).9) is an estimate of the standard deviation of the numerator.1).4.4.
11) if constrained least squares residuals can be easily computed.4.4.K s(P+) . I ) .Inserting these values into (12.12) we have (12.7 that this would be the likelihood ratio test if (3 were normal and the generalized Wald test if (3 were only asymptotically normal.10) and (12. where P1 is a K.). I)(x'x)'(o. .z(z'z)'Z'IZ. Therefore. q > 1 ) . Then the null hypothesis becomes Z$. From equation (12.4. the chisquare variables (12.4. (12.P + ) .4.3.4.13) (Q'B .we see that if q = 1 (and therefore Q ' is a row vector). the smaller s ( P + ) becomes. since a onetail test is possible only with the t test.11) Therefore. K2 ii'ii . by Theorem 11.K). = C.U.10) right away and reject the null hypothesis if the lefthand side were greater than a certain value.K ( @ 2 .11) is the square of the t statistic (12.T . Then the null hypothesis becomes Pi = Pi.4. Q is a Kvector of ones.4. Therefore. (12.3 The (12. respectively. The F statistic can be alternatively written as follows. and c = 0 .14) provides a more convenient form for computation than (12. The F statistic 7 given in (12.S(@) 7 = . Using the regression equation (12. 4 ~(li)  F Test In this section we consider the test of the null hypothesis Q ' P = c against the alternative hypothesis Q'f3 Z c when it involves more than one constraint (that is. Also note that (12. Note that s ( P +) is always nonnegative by the definition of p f and (3.15) and T . F(q. where 0 is the K2 X K1 matrix of zeroes. The reader will recall from Section 9.9) and (12.' Q ] .12) s ( P +) ~ ( p =) ((3 .4. we could use the test statistic (12.~B.12 1 Multiple Regression Model 12.2.F ( K 2 .4.3) we have where Z1 = [I (12.17) S(P+) .3.F(q. the F statistic (12.p2) q = . z~(z. The null hypothesis Q ' P = c is rejected if 7 > d. Consider the case where the f3 is partitioned as P' = ( P i . Finally.11) alternatively as (12.we have ~(8) ~(0) and If a ' were known.4.vector and p2 is a K2vector such that K 1 + K2 = K. 3 c will play a central role in the test statistic. The result (12.14) may be directly verified.4.4.4.11). where d is determined so that P ( q > d ) is equal to a certain prescribed significance level under the null hypothesis.7.4. Comparing (12.P2)'[(0. This hypothesis can be written in the form Q ' P = c by putting Q ' = ( 0 . ( Q r @. 12.z~)'z. T .9).11) takes on a variety of forms as we insert specific values into Q and c . TK). and c = D2. In this case the t test cannot be used.4. Partition X as X = (XI.18) ((3= ) U'Z~(Z. Therefore we can write (12.Z~)~Z.4. X2) conform .~ q TK  4 . by Definition 3 of the Appendix.. unspecified.4.4.P + ) ' x ' x ( ( ~. and the closer Q'f3 is to c .13).1..17) are independent because [I .4.4.4. = 0 . Since (3 and ii are independent as shown in the argument leading to Theorem 12. by Theorem 9. I is the identity matrix of size K 2 .18) somewhat. we have (12.4.4 1 Tests of Hypotheses 303 a The following are some of the values of Q and c that frequently occur in practice: The ith element of Q is unity and all other elements are zero.5.2) and (12.K ) .3.4. P.4.7) is valid even if g > 1 because of Theorem 5.' ( ~ ' ( j C) ii'ii .This fact indicates that if q = 1 we must use the t test rather than the F test. ] z . and the null hypothesis specifies p2 = p2 and leaves p.11) yields (12.4. Then the null hypothesis is simply Pi= c.c)' [ Q r ( X ' x ) . 1)'1'(@2.4. The distribuAgain ~ ' (tion of ~ ' (given 3 in (12. s ( P +) ~ ( ( 3 )= ( ~ ' (3 c)'[Q'(x'x)'Q] 1 I From equations (12.c).2.10) are independent. The ith and jth elements of Q are 1 and 1.14) T .19. We can simplify (12.
we have from equation (12.4.18) as (12.4. where Therefore (12. 1 ~ . ~ ~ .4.4 Tests for Structural Change Suppose we have two regression regimes and .33). Furthermore. X ~ ) . we can combine the two equations as where the vectors and matrices in (12. we assume = 0. Let s(B) be the sum of the squared where we have defined Z = (x.' x . We want to test the null hypothesis PI = P2 assuming cr: = o i (= a').2.4. and ul and u2 are normally distributed with zero means and the variancecovariance matrix xi)'. where fil = ( X . and define M1 = I by Theorem 11.4.2. q = K*. .22) Using the definition of R as Since cr: = cr: (= cr2).4.25) have T 2 rows.' Xand We can obtain the same result using (12. we can rewrite (12.24) have T 1 rows and those in (12. we have x~(x x. Then.' l l ' Also. Inserting these values into (12.4.11) yields the test statistic since Q'Q = ytLy  y ' ~ ~ 2 ( X .3. ~ ~ 2by ) (12. .2.K (82 K2 P 2 ) ' ~ .1.4.4 1 Tests of Hypotheses 305 ably with the partition of P.9.K ) . so that p1 is a scalar coefficient on the first column of X .24) and (12. ~ y 12.~ .4.4.4. K = 2K*.~ X . (12.4.25) without making use of the hypothesis PI = P2. which we assume to be the vector of ones (denoted by 1 ) .we further rewrite (12.19) becomes L = I .26) we combined equations (12. X1 is a T I X K* matrix and X2 is a T 2 X K* matrix.20) where K 1 = 1.4. In (12.14). Then MI in (12.that is. residuals from (12.24) and (12.1) by putting T = T I + T 2 . 1 ) .~~ fi2 = ( X .2.26). Of particular interest is a special case of (12. X ~ ) . If we do use it.F(Kp.I ) .304 12 1 Multiple Regression Model 12. T .3. and c = 0. combine equations (12.14).4. . We can represent our hypothesis P1 = p2 as a standard linear hypothesis of the form (12.25) as Therefore.32).4.20) can now be written as ' given in (12.26) is the same as the model (12. This test can be handled as a special case of the F test presented in the preceding section.4.20) 'q =. Q' = ( I .. ~ 1 ~ 2 ( 88 22 ) iilii .1) with normality. To apply the F test to the problem. .4.T .
4.27) look very different.4. before performing the F test discussed above. X ~ f i ) '(~ ~ 2~0 ) . denoted by i.4. The hypothesis P1 = P2 is merely one of the many linear hypotheses that can be imposed on the P of the model (12. . P2 = fin. Then using (12. So far we have considered the test of the hypothesis PI = P2 under the assumption that u: = a : . that is. We may wish to test the hypothesis a: = a : .'(y2 . The likelihood function of the model defined by (12.There may be a situation where we want to test the equality of a subset of P1 to the corresponding subset of P2. . and = = 6. 2 u2  The value of L attained when it is maximized subject to the constraints = P2. however. Under the null hypothesis that a: = a : (= a') we have In Section 10. can be obtained by evaluating the parameters of L at the constrained maximum likelihood estimates: Dl = fi2 (= p).24) and (12. Welch's approximate t test does not effectively generalize to the multiple regression model. For example. they can be shown to be equivalent in the same way that we showed the equivalence between (12. These may be obtained as follows: pl Step I . but q = KT.x ~ @ ~)X2O2). Calculate Unlike the F test developed in Section 12. and c = 0 .we have Even though (12.4. when the sample size is large.4.4. So we shall mention two simple procedures that can be used when the variances are unequal. Unfortunately. since either a large or a small value of the statistic in (12. Both procedures are valid only asymptotically. and I$.3.4.3.4 1 Tests of Hypotheses 307 and let s(P+) be the sum of the squared r e s i d t d h r n (12. We do not discuss this t test here.3. @ ~ )' X(I & ~). and 6: = ~ .4. we should use a twotail test here.can be obtained by evaluating the parameters of L at P 1 = B1. ' ( ~ r . 0 . we use the t test rather than the F test for the reason given in the last section.T. (12.26). since it is analogous to the one discussed in Section 10.25) is given by + The value of L attained when it is maximized without constraint. and I$.34) is a reason for rejecting the null hypothesis.14).%3). = ~ ~ ' (y~ l l u.4. The first is the likelihood ratio test.2.14).31) and (12.4. O). we have by Definition 3 of the Appendix Step 2.4. that is. If. Calculate Since these two chisquare variables are independent by the assumption of the model. denoted by E. '(~~ and (12. if the subset consists of the first KT elements of both P1 and P2. we put T = T 1 T 2 and K = 2K* as before.306 12 1 Multiple Regression Model 12.33)  ~ i M 2 ~ 2XT*K*. ~ a. Q' = (I.4.2 we presented Welch's method of testing the equality of regression coefficients without assuming the equality of variances in the bivariate regression model.~ Z ' ] ~ .11) and (12. 6.30) s ( P += ) yl[I .Z ( Z ~ Z ) . I. we wish to test the equality of a single element of P1 to the corresponding element of fin.
First.2. Theil offers the following justification for his corrected R2. any decision rule can be made equivalent to the choice of a particular value of c. substituting 6. the larger R2 tends to be.29). and 6.27) on them. assume X = ( X I . P obtained in step 3 Continue this process until the estimates converge. R . for 6. Then. and u Second.5.24) and (12. 12. Step 4.5. denoted R2.4. respectively.4. p. = y l [ I .Suppose the vectors p and y have K and H elements. Theil's corrected R2.4. We stated that. Without loss of generality.25) by 6 and define the new equation where K is the number of regressors.5 1 Selection of Regressors 309 Step 3.3 we briefly discussed the problem of choosing between two bivariate regression equations with the same dependent variable.4. Repeat step 2. = by2. the estimates obtained at the end of step 1 and step 2 may be used without changing the asymptotic result (12. and 6. other things being equal. he shows that In Section 10.1.5.1) is equivalent to accepting the hypothesis P2 = 0. An important special case of the problem considered above is when S is a subset of X.5.respectively. . Xc = fix2. The second test is derived by the following simple procedure. other things being equal. it makes sense to choose the equation with the higher R2.5. Theil (1961.36) is The null hypothesis large. because the greater the number of regressors. Since. where y. and & : .' s l ] y / ( T  H). By Theorem 9. substituting the estimates of for fi. In the extreme case where the number of regressors equals the number of ob' = 1 .4. and perform the F test (12.1.2) is the true model.20) with set equal to 0. treat (12. we need to adjust it somehow for the degrees of freedom.X2) and S = X I .1) and (12.5 SELECTION OF REGRESSORS choosing the equation with the largest R2 is equivalent to choosing the defined in (12.a regression equation.37) as the given equations.2). it no if the expectation is taken assuming that (12. and U: = 6u2.2. Theil proposed choosing the equation with the largest R2.308 12 1 Multiple Regression Model 12. PP . In practice.4. since 6 is a consistent estimator of ul/u2. choosing (12. That is. Here. The method works asymptotically because the variance of $ is approximately the same as that of ul when T I and T 2 are large. respectively.3).36) given below. and define 6 = k 1 / k 2 . and 6 . The justification is merely intuitive and not very strong. So if we are to use R2 as a criterion for choosing servations.4.s ( s ' s ) . equation with the smallest e2. is defined by = P2 is rejected when the statistic in (12. approximately. be the unbiased estimators of the error variances in regression equations (12. It can be shown that the use of Theil's R2 is equivalent to c = 1 . But the F test of the hypothesis accepts it if q < c. Partition p conformably as P' = ( P i . when both T I and T2 are large) longer makes sense to choose the equation with the higher R2.4.31). from (12.2) over (12. where r\ is as given in (12.we have asymptotically (that is. pi). Therefore. Repeat step 1. Finally. 213) proposed one such adjustment.2. Let 6: and 6. multiply both sides of (12. : by 6. we consider choosing between two multiple regression equations and where each equation satisfies the assumptions of model (12. If H # K. Then. estimate u. respectively.
sample {X. What value of c is implied by the customary choice of the 5% significance level? The answer depends on the degrees of freedom of the F test: K . as the first K1 columns of X and x i p as the first K1 elements of x.1 2. i = 1. .1 gives the value of c for selected values of the degrees of freedom. in which the value of c ranges roughly from 1.4.8 to 2.  Let Ij = (x'x)'x'~ and fi = (S'X)'S'~.X~)'X. though an improvement over the unadjusted still tends to favor a regression equation with more regressors. (Section 12. (Section 12. These results suggest that Theil's E2. without using Theorem 12.p + up.20).2) Consider the regression model y = px + u. . EXERCISES 1.obtain the unconditional mean squared prediction errors of the predictors xkfi and xipp:. without using the CramCrRao lower bound. 3. 0.2) Consider the regression model .5) Suppose the joint distribution of X and Y is given by the following table: Critical values of F test implied by 5% significance level KH 1 T . Euu' = 4. (b) Derive the Cram6rRao lower bound. Obtain the mean squared errors of the following two estimators: E 1 8 1 8 1 ta I fix* x 'x and D=?Y z'x' B where z' = (1. Akaike (1973).~.H. . (Section 12.1.6) In the model y = XP + u and yp = x.d.05..0. and Sawa and Hiromatsu (1973) obtained solutions to this problem on the basis of three different principles and arrived at similar recommendations.K. . without using 6. 1. We have definedx.2.O).). where h = 0.Y. where fi = (xlX)'~'yand p: = (X. T A B L E 1 2.465 (a) Derive an explicit formula for the maximum likelihood estimator of a based on i. Which estimator is preferred? Answer directly. The results cast some doubt on the customary choice of 5%.2.K 30 c 0.i. 1.2. (Section 12. T .13) satisfies the two conditions given there. .3. and + u. Table 12.H and T .2.2. Note that K ..1..3) Show that R defined in the paragraphs before (12. n.4. Assume that T is even. (Section 12. Under what circumstances can the second predictor be regarded as superior to the first? 5. The table is calculated by solving for c in P[F(K .K) < c ] = 0. u is a Tvector such that Eu = 0 and Euu' = u21~. where P is a scalar unknown parameter.2) Consider the regression model y = XP I. fi. x is a Tvector consisting entirely of ones. .H appears as K2 in (12.2. (Section 12. .310 12 1 Multiple Regression Model i I Exercises 311 i Mallows (1964). and derive its asymptotic distribution directly. 2.where Show directly that Ij is a better estimator than Theorem 12. .
and P3. You may assume that the u. The data are given by '~1x1 + Plzl + u1 and y2 = a2x2 P2z2 u2.). which we write as PI. each consisting of four observations: I 10. and u3 are independent normal with zero mean and constant variance. . are normal with mean zero and their variance is constant for all t and for all three industries.3) We want to estimate a CobbDouglas production function log Q. (Section 12.3) 9. . 8. and L. 1. + pl = s + pp and = 0 versus H I : not Ho. The data are as follows: in each of three industries A. where y and u are eightcomponent vectors. We assume that PI varies among the three industries. P2. test Ho:a. B. (b) Test pl = P2 = P3 at the 5% significance level. = p1 + PI log Kt + p3 log L.3) In the following regression model. up.3) Solve Exercise 35 of Chapter 9 in a regression framework.4. Assuming that the observed values of y' are (2." at the 5% 7. + q. Use the 5% significance level. I). = a3 and significance level.4. T. . test the null hypothesis P2 = P1 against the alternative hypothesis P2 > at the 5% significance level. Write detailed instructions on how to perform such a test. t = 1. + + where ul and up are independent of each other and distributed as N(0. where 1 is a fourcomponent vector consisting only of ones. ~ ' 1 ~ ) . and the elements of ul. PI = p2 = f3. 11. are distributed independently of the u. . (Section 12.4. and that the K . X is an 8 X 3 matrix.. y 1 = P2 Consider the regression model y = XP + u.312 12 1 Multiple Regression Model I Exercises 313 where pl and P2 are scalar unknown parameters and u N(0.0. (Section 12.  Test the null hypothesis "al = a:. 2. . u21. and C and test the hypothesis that P2 is the same for industries A and B and p3 is the same for industries B and C (jointly. and f3 is a threecomponent vector of unknown parameters. (Section 12. The data are given as follows: (a) Test P2 = P1 against p2 > P1 at the 5% significance level. We want to test hypotheses on the elements of f3.3) Consider three bivariate regression models. (Section 12.4. not separately).4.
Premultiplying (13.1 through 13. and hence the least squares estimator applied to (13. 13.3 arise as the assumption of exogeneity of the regressors is removed.'/'.1 Known VarianceCovariance Matrix The multiple regression model studied in Chapter 12 is by far the most frequently used statistical model in all the applied disciplines. the diagonal elements of A are positive by Theorem 11. by Theorem 11.1 1 Generalized Least Squares 315 13 ECONOMETRIC MODELS where we assume that X is a fullrank T X K matrix of known constants and u is a Tdimensional vector of random variables such that Eu = 0 and (13.1. Moreover. We assume only that 2 is a positive definite matrix.) Therefore (13. ~ and that (I.3) : * f = X*P f u*.6 = 2 . It is also the basic model from which various other models can be derived.5 through 13.2 and 13.6.1 / 2 2 1/2 8 112 21/2 13.i 13. since 2 is positive definite.1) by I. Then.6. . Our presentation will focus on the fundamental results. including econometrics. where = D{x.10.3) is a classical regression model. the models of Section 13. Eu* = 0 and x"~~.1."~]. where y* = X* = 4. where X. For a more detailed study the reader is referred to Amemiya (1985).5. whereas those discussed in Sections 13.12 i e 3 $ EuuP=z. we obtain (131.1 GENERALIZED LEAST SQUARES In this section we consider the regression model (The reader should verify that that ~ .1 we can find an orthogonal matrix H which diagonalizes 2 as H'ZH = A.2). is the ith diagonal element of A.7 are more general models.' / ' 2 ' = / I.5. Since Z is symmetric. In this subsection we develop the theory of generalized least squares under the assumption that Z is known (known up to a scalar multiple. We have given them the common term "econometric models.3) has all the good properties derived in Chapter z'/'x'/" x.5. Using (11.4 may be properly called regression models (models in which the conditional mean of the dependent variable is specified as a function of the independent variables). by Theorem 5 (131." but all of them have been used by researchers in other disciplines as well.1. to be precise).and u* = Z"*U.1.4) Eu*u*' = EX"/'UU'(~'/~)' x1/22(21/2)1 by Theorem 4.1.1. in the remaining subsections we discuss various ways the elements of 2 are specified as a function of a finite number of parameters so that they can be consistently estimated. In this chapter we study various other models frequently used in applied research.1 arise as the assumption of independence or homoscedastidty (constant variance) is removed from the classical regression model.'/~)' = 8I/z from the definitions of these matrices. we define 8'I2 = HA'/~H'.4). The models of Section 13. The models discussed in Sections 13.7 are more general than regression models. and 13. 1 i 1 13. This model differs from the classical regression model only in its general specification of the variancecovariance matrix given in (13. = Z'"x.4 arise as the linearity assumption is removed. For these reasons the model is sometimes called the classical regression model or the standard regression model. The models of Sections 13.5. Finally. where A is the diagonal matrix consisting of the characteristic roots of Z.1. The models of Sections 13.
6. The models discussed in Sections 13. where A'I2 = D { x ~ ' / ~where ].5.1 1 Generalized Least Squares 315 I ECONOMETRIC MODELS where we assume that X is a fullrank T X K matrix of known constants and u is a Tdimensional vector of random variables such that Eu = 0 and I : is a positive definite matrix.) Therefore (13.13. whereas those discussed in Sections 13.1 / 2 13.4). The models of Sections 13. Xi is the ith diagonal element of A. and that (21/2~1 = 2112 from the definitions of these matrices.5. 13. where A is the diagonal matrix consisting of the characteristic roots of Z. Finally.1.1 we can find an orthogonal matrix H which diagonalizes 2 as H'ZH = A.1.1.1. to be precise). we define 2'" = HA'/~H'. and 13.1) y = Xp + u.1 GENERALIZED LEAST SQUARES (The reader should verify that 81/22'/2 = 2.1. Then. For a more detailed study the reader is referred to Amemiya (1985).3) has all the good properties derived in Chapter In this section we consider the regression model (13. Since Z is symmetric.3) is a classical regression model. Using (11.5.4 arise as the linearity assumption is removed.1 arise as the assumption of independence or homoscedasticzty (constant variance) is removed from the classical regression model. the models of Section 13.3) y* = X*P f u*. since Z is positive definite. The models of Section 13. For these reasons the model is sometimes called the classzcal regression model or the standard regression model.1) by I. In this subsection we develop the theory of generalized least squares under the assumption that 2 is known (known u p to a scalar multiple.4 may be properly called regresszon models (models in which the conditional mean of the dependent variable is specified as a function of the independent variables). by Theorem 11.6 = 2 .5.'/2 = I.1 Known VarianceCovariance Matrix The multiple regression model studied in Chapter 12 is by far the most frequently used statistical model in all the applied disciplines. X* 4. the diagonal elements of A are positive by Theorem 11. Eu* = 0 and 2'/2~. including econometrics. In this chapter we study various other models frequently used in applied research.1. and u* = z ' / ' u ." but all of them have been used by researchers in other disciplines as well.2).1.we obtain (131. We have given them the common term "econometric models. Our presentation will focus on the fundamental results. 13.7 are more general than regression models. = where y* = Z'"y.1 through 13. It is also the basic model from which various other models can be derived. that 81/281.5 through 13.1.3 arise as the assumption of exogeneity of the regressors is removed.1 / 2 2 1/2 x 1/2 8 .2 and 13. Moreover. by Theorem = Xl/n'I:(21/2)1 by Theorem 4. in the remaining subsections we discuss various ways the elements of 2 are specified as a function of a finite number of parameters so that they can be consistently estimated.6.10. The models of Sections 13. This model differs We assume only that ' from the classical regression model only in its general specification of the variancecovariance matrix given in (13.7 are more general models.'/~. and hence the least squares estimator applied to (13. Premultiplying (13. .
and @ ( ( i ~ . In the first method. Its variancecovariance matrix.1 that (13. . (See Amemiya.1.4. Let 0 be a vector of unknown parameters of a finite dimension. this model is a special case of the model discussed in Section 13.7) or (12. We have. = (Suppose 2 is known up to a scalar multiple.1.1. we must specify them as depending on a finite number of parameters.1 1 Generalized Least Squares 317 2 12. .1. denoted by (jF.P) has the same limit distribution as fl(& . where a is a scalar positive unknown parameter and Q is a known positive definite matrix. using Theorem 4. Here we relax this assumption and specify more generally that ED = P This assumption of nonconstant variances is called heteroscedastidty. The above can also be directly verified using theorems in Chapter 11.6) Although strict inequality generally holds in (13.26). That is. In each of the models to be discussed. this is the same as (12. T1 + 2. . In the present case.P).1.4. the wezghted least squares estimator I f the variances are unknown. The LS estimator can be also shown to be consistent and asymptotically normal under general conditions in the model (13.2. Then a drops out of formula (13. say. is different from either (13. 2 is a diagonal matrix whose tth diagonal element is equal to a : . The other assumptions remain the same. by (13. . however.22). . 1985.~ x ] . we shall indicate how 0 can be consistently estimated.) The consistency and the asymptotic normality of the GLS estimator follow from Section 12. the variances are assumed to remain at a constant value. there are cases where equality holds. If T I is known.1.5) and using Theorem 4.2 Heteroscedasticity i I In the classical regression model it is assumed that the variance of the error term is constant (homoscedastic).6.1. We call it the generalized least squares (GLS) estimator applied to the original model (13. .1. Denoting the consistent estimator by 0. Since the GLS estimator is the best linear unbiased estimator S estimator is a linear estimator. The GLS estimator in this case is given a special name. .9) ~ f =i E(x'x)'x'uu'x(x'x)' = (x'x) 1x'2x(x'x)1.2. o:.1/2X)'X'~l/2 21/2 Y (x'~lx)'x'x'~.) Inserting (13.5) PG = (X*'X*)~X*'~*  (X'21/".1.2. section 6.3. If 2 is unknown.1).1) because the researcher may use the LS estimator under the mistaken assumption that his model is (at least a p proximately) the classical regression model. (13.1.5) and we have = (x'Q'x)'x'Q'~. 2.316 13 1 Econometric Models 13. Denoting it by BG. The classical regression model is a special case. we can define the feasible generalized least squares (FGLS) estimatol. it under the model (13. In the next three subsections we consider various parameterizations of 2 . If the variances are known.1. f i is ~ consistent. Thus the LS estimator is unbiased even under the model (13. EBG = P and (13. T.10) ( x r x ) .17) V@G where the dependence of 2 on 0 is expressed by the symbol 2(0).6. in the period t = 1.' x ~ z ( 0 ) . There are two main methods of parameterization.1. Under general conditions. its elements cannot be consistently estimated unless we specify them to be functions of a finite number of parameters. It is important to study the properties of the least squares estimator applied to the model (13. T1 and then change to a new constant value of a : in the period t = T1 + 1.' x r ~ x ( x t x ) r ' (x'z'x)'.1.we have (131.18) 13.11) 3 i I 1 i 3 BF = [ ~ ~ z ( e ) .1.1. There we suggested how to estimate and (131.1) and the L follows from Theorem 12. = (~'2'x)'. .1).1) into the final term of (13.1.10). suppose 2 = aQ.1).1. .1.' ~ . in which a = c2and Q = I. we can readily show that (131.
and usfor s f t in the model (12. weak stationarity). we conclude that (13.9).19) as special cases.1. . = .C. we can define the FGLS estimator by : and a : can be the formula (13. Next we evaluate the variances and covariances of {%I. 13.p2 u' Multiplying both sides of (13.1).1 and E . Then the heteroscedasticipconsistent estimator of (13. Even if we do not spec* a. It is defined by Eu~u~ = ~ u2p2 for a 1 1 r. ET with Euo = 0 and Vuo = u2/(1 . See Goldfeld and Quandt (1976) for further discussion of this case.1. Next.4 below. .i.3 Serial Correlation because of (13.9) is defined by Repeating the process for t = 2. T 1 as well as a still estimated. 1 . Conditions (13.15) by u. It can be specified in infinitely various ways. . . we obtain where { E .p2 i Repeating this process. we obtain Under general conditions 7@ can be shown to converge to TV@.12).p2). are independent. . we obtain (13. not necessarily related to x.1. here we consider one particular form of serial correlation associated with the stationaryfirstorder autoregressive model. but the computation and the statistical inference become much more complex.1. ] are i. .2.1.11).17) .16) and (13. Note that (13.1 ( Generalized Least Squares 319 a : and a .  318 13 1 Econometric Models 13.ci: on z. it is specified that Taking the expectation of both sides of (13.1. Repeating the same procedure for t = 2. is a vector of known constants. as a function of a finite number of parameters. with Ect = 0 and Vet = u2. we see that Eul = pEuo + Es1 = 0.1 and taking the expectation.1. .15) by q . 3. Goldfeld and Quandt (1972) considered the case where g(. 6: must be treated as the dependent variable of a nonlinear regression modelsee Section 13. If g(.1.1. .1.) is a linear function and proposed estimating a consistently by regressing . and (13.1?').and a is a vector of unknown parameters.1.16) Eu.d.} be the least squares residuals. we obtain where g(.19) In this section we allow a nonzero correlation between u.1.1. 1 .20) contains (13. . .T . . and define the diagonal matrix D whose tth diagonal element is equal to tf . T. Taking the variance of both sides of (13. If T I is unknown.1.) is a known function. we conclude that Vu. . we can consistently estimate the variancecovariance matrix of the LS estimator given by (13.1.See Eicker (1963) and White (1980). where (i&} are the least squares residuals defined in (12.17) and because u. multiplying both sides of (13. . EZ. = 0 for all t.1. Let {. Using these estimates. In the second method.1.1. It is not difficult to generalize to the case where the variances assume more than two values.2 and taking the expectation.) is nonlinear..and uo is independent of 81.15) for t = 1 .for all t. (13.18). (13. z. 3.1.20) constitute stationarity (more precisely.15) for t = 1 and using our assumptions. Correlation between the values at different periods of a time series is called serial correlation or autocorrelation.
1. . we can compute the GLS estimator of fJ by inserting X' obtained above into (13. u where we should perform the operation z. 1 ? I t can be shown that If p is known. .16) can be written equivalent to En 0 and (13.15) is an empirically useful model to the extent that the error term of the regression may be regarded as the sum of the omitted independent variables. It is defined by .26) u.For example. z .20). we should appropriately modify the definition of R used in (13.5) as (13. Thus the GLS estimator is computed as the LS estimator after this operation is performed on the dependent and the independent variables. if we suppose that {u. . . . The asymptotic distribution of the GLS estimator is unchanged if the first row of R is deleted E n defining .the estimator by (13. however.1. T. 3.1. If. . (13.1. = j=1 p.Note that a2need not be known because it drops out of the formula (13..25) & = (XIR'IIX)'X'RIR~.25).1. .) follow a higherorder autoregres sive process.on both the dependent and independent variables and then apply the LS method.25).1 1 Generalized Least Squares 321 In matrix notation.1.1..1. z2. .320 13 ( Econometric Models 13. we believe that {u.23) Except for the first row. premultiplication of a Tvector z = (zl. + E.90) is Using R.5).. . 2'= 1 R'R. Another important process that gives rise to serial correlation is the movingaverage process. we can write the GLS estimator (13.pz.1. The computation of the GLS estimator is facilitated by noting that (13.5).) follow a pth order autoregres sive model P (13. zT)'by R performs the operation z.1. t = 2. Many economic variables exhibit a pattern of serial correlation similar to that in (13. Therefore the firstorder autoregressive model (13.1.u.~ & ~ p .1. .1. .
1. and { E . it can be shown that P is generally biased. however. C u:t=2 1 which is approximately equal to 2 .1.15) itself cannot be regarded as the classical regression model because ut1 cannot be regarded as nonstochastic.]are in fact unobservable. For example. 2.1.1.1.15). we take the classical regression model as the null hypothesis and the model defined by (12. But it can also be shown that p is consistent and its asymptotic distribution is given by (13.26. we must first .28) by the LS residuals fit = y. a movingaverage process can be well approximated by an autoregressive process as long as its order is taken high enough. 13. we consider the test of independence against serial correlation. we assume that the sequence {p.30) as the %eat statistic. we should try to account for the difference by introducing the specific effects of time and crosssection into the error term of the regression. Still. Namely. we hope to be able to estimate the parameters of a relationship such as a production function or a demand function more efficiently than by using two sets of data separately.1. Before the days of modern computer technology. we may want to estimate production functions using data collected on the annual inputs and outputs of many firms. . . Nevertheless. In mutually independent with the variances a order to find the variancecovariance matrix 2 of this model. are the crosssection specific and timespecific components. inserting 6 into R in (13. researchers used the table of the upper and lower bounds of the statistic compiled by Durbin and Watson (1951).).1 1 Generalized Least Squares 323 where { E J are i.xi@.1. but with more difficulty than for an autoregressive process. and A. Therefore it would be reasonable where y . In particular. (A. .15). At the least. I Since {u.1. 1 . If (u. Finally.1) and (13. Today.29) @ (0 .15) as the alternative hypothesis.).1. For example.1.p) + N ( 0 .24). In the simplest version of such a model.29). a : . were observable.d.d.4 Error Components M o d e l Since (13. b does not possess all the properties of the LS estimator under the classical regression model. Computation of the GLS estimator is still possible in this case. T.15). as before. . the exact pvalue of the statistic can be computed. as follows: and It can be shown that 6 is consistent and has the same asymptotic distribution as b given in (13.322 13 1 Econometric Models 13.). we can compute the FGLS estimator. t = 1. because its distribution can be more easily computed than that of 6. . B y pooling timeseries and crosssection data. It is customary. it shodd be reasonable to replace them in (13. where fi is the LS estimator. random variables with zero mean and are .p2). . T 1=2 to use (13. ~ )are i. to use the DurbznWatson statistic C utut1 (13. respectively. we should not treat timeseries and crosssection data homogeneously.i.1) and (13. We consider next the estimation of p in the regression model defined by (12. however. and define The error components model is useful when we wish to pool timeseries and crosssection data. In the remainder of this section.28) fi = 7.1.1.i.1. we could estimate p by the LS estimator applied to (13. This test is equivalent to testing No: p = 0 versus H I :p # 0 in (13. and a : . of demand functions using data on the quantities and prices collected monthly from many consumers.1.
1. DEFINITION 13. using the consistent estimators of y l . . + E. . Remember that if we arrange the observations in a different way..k.2) + +az~m. . Its inverse can be ~hxrwn t .d. We can write (13.2. = ]=I p . y12.324 13 ( Econometric Models 13. . y.)in the earlier equation are not.ApB + yPJh7). yNT).1).Y I T . . we need a different formula for Z. we need the following definition. Let J K be the K X K matrix consisting entirely of ones.1.1). Y y2 = a. with E E ~ = 0 and V E = ~ u2.y I A . . t = p+l.If we define X and u similarly.37) 1 1 1 Q=IAB+Jw.1.1.it is not a classical regression model because Y cannot be regarded as a nonstochastic matrix. we can write (13.we can estimate P by the sclcalled transformation estimator Although the model superficially resembles (12. yet is consistent and has the same asymptotic distribution as FGLS. .1. or more practically by the FGLS estimator (13. .1. Y N I .11).1. y2. .1.i.1) y. .}are observable. cp+2.1) in matrix notation as (13. +~a?)l. . 0k (13. model differs from (13. . for example.. . T N NT L e t A = {a. Y N ~ .5). 2 where y1 = P a:(o: + Ta. E ~ ) This .32) in the form of (13.3).1. because it does not require estimation of the y's.1.)'. .26) only in that the {y. Then the KroneckerproductA Q B is a KM X LN matrix defined This estimator is computationally simpler than the FGLS estimator. .. y21.1. Then we have (13. .32) in the form of (13. In defining the vector y.}beaK X L m a t r i x a n d l e t B b e a n M X N matrix. it is customary to arrange the observations in the following way: y' = (yll. . 5 Y E ( I m . ~ 2 2 .YZT.. 13.35) 2I = 4 ( J . p can be estimated by the GLS estimator (13. T . . whereas the (u.2.(cr: From the above.. y by defining where A = IN Q JT and B = J N Q IT..To facilitate the derivation of Z.1 where (13.34) = a : ~ U. . and yJ Alternatively.B where { E ~ are ) i.and ( y l . .yP) are independent of (cp+l.1.2 1 Time Series Rwesslon I 325 decide how to write (13.2 TIME SERIES REGRESSION by In this section we consider the pth order autoregressive model P (132. . . p+2. . .
. The model was extensively used by econometricians in the 1950s and 1960s.2).3.3). = u121. + x2P2 + up. See a survey of the topic by Griliches (1967). We shall encounter another such example in + and . ~ and .3 SIMULTANEOUS EQUATIONS MODEL This model can be equivalently written as m (13. we can asymptotically treat (13. EU.2. which tells her . That is. Essentially the same asymptotic results hold for this model as for (13. The estimation of p and y in (13.2). obtained earlier. as it originated in the work of Koyck (1954).1.  326 13 1 Econometric Models 13.+ w ~ . ] are serially correlated.2. or the Koyck lag.and z.2. Economists call this model the distributedlag model.2. The transformation from (13. To return to the specific model (13. 13. the above result shows that even though (13.4) as if it were a classical regression model with the combined regressors X = (Y. we shall illustrate it by a supply and demand model. The term "distributed lag" describes the manner in which the coefficients on q.6) y. u~(x'x)']. This particular lag distribution is referred to as the geometric lag. In such a case we can consistently estimate mental variables (IV) estimator defined by P by the instru fi N [ p . A similar transformation is possible for a higherorder process.2) by including the independent variables on the righthand side as where S is a known nonstochastic matrix of the same dimension as X.2.2) is not a classical regression model. Consider where w .6) are distributed over j.2. VuI = U:I.5). the more S is correlated with X. the better. Since Theorem 12.3. For a more efficient set of instrumental variables. such that plim T'S'X is a nonsingular matrix.2.3. and ul and u2 are unobservable random variables such that Eul = Eu2 = 0. under general conditions.3.2) y 2 yzy.4).6) is called the inversion of the autoregressive process.2.5) presents a special problem if { E .=o A study of the simultaneous equations model was initiated by the researchers of the Cowles Commission at the University of Chicago in the 1940s. Then. It is useful to generalize (13.2.4 implies that 0 A N[P.2. Although it was more frequently employed in macroeconomic analysis. X1 and X2 are known nonstochastic matrices. A buyer comes to the market with the schedule (13.3) Section 13. although the results are more difficult to prove. this problem arises whenever plim T'X'U # 0 in the regression model y = XP u.29). + where yl and y2 are Tdimensional vectors of dependent variables.2. where Z = Euu'. is a special case of (13. Z). In general.5) to (13. plim T' z ~ = ~ # ~ 0.(13. we have where Z is a known nonstochastic matrix. in (13.U. The asymptotic variancecovariance matrix above suggests that. loosely speaking.1). u~(Y'Y)'].~ therefore E ~ the LS estimators of p and y are not consistent. = p1 q.2. Vu2 = U ~ Iand .3 1 Simultaneous Equations Model 327 The LS estimator fi = (Y'Y)'Y'~ is generally biased but is consistent with the asymptotic distribution (13. The reverse transformation is the inversion of the movingaverage process. see Amemiva and Fuller (1967). In this case. the above consideration suggests that z . corresponding to the tth element of the vector. we can asymptotically treat it as if' it were.2. A seller comes to the market with the schedule (13. We now consider a simple special case of (13. which tells him what price (yl) he should offer for each amount (y2) of a good he is to buy at each time period t.2.1 constitute a reasonable set of instrumental variables.We give these equations the following interpretation. It should be noted that the nonstochasticness of S assures plim T~S'U= 0 under fairly general a s sumptions on u in spite of the fact that the u are serially correlated. = yCdzt. Note that (13.
3.3. Express that fact as (13. For example.2) the structural equations and (13.3 1 Simultaneous Equations Model 329 how much (y2) she should sell at each value (yl) of the price offered at each t .3. A salient feature of a structural equation is that the LS estimator is inconsistent because of the correlation between the dependent variable that appears on the righthand side of a regression equation and the error 2 depends term. A structural equation describes the behavior of an economic unit such as a buyer or a seller.6) yields a consistent estimator of . however. PI)'.3.where eland ii2 are the LS estimators.6). Consider the estimation of y1 and PI in (13. Solving the above two equations for yl and y2. PI.5) and (13. discussed in Section 13. the LS estimator applied to (13. we consider the consistent estimation of the parameters of structural equations. the twostage least squares estimator is asymptotically more efficient than any instrumental variables estimator. p's.rr2 are functions of y's and P's. the values of yl and y2 that satisfy both equations simultaneouslynamely. and (13.rrl or m2.3) and (13. any solution to the equation ( e l . or trial and error). If mapping g(.l .we can derive the likelihood function from equations (13. rewrite the equation as We call (13.1). vl.3.3. and define the projection matrix P If we insert S = PZ on the righthand side of (13. p2). For this purpose. Then.328 13 1 Econometric Models 13. we assume the joint normality of ul and u2. 72.nl and . P1. and v2 are appropriately defined. we obtain the twostage least kquares (2SLS) estimator = x(x'x)'x'.3. u:(z'Pz)'1. ) is manytoane.3. c r : (s'z)~s's(z's)~].3.. It can be shown that (13. Nowadays the simultaneous equations model is not so frequently used . and hence of vl and v2.X) and a = (y. It is consistent and asymp totically (13. = g(y1. for example. whereas a reducedform equation represents a purely statistical relaticship. by some kind of market mechanism (for example.3.7) k2s A N [ a .l ~ ' ~ ( ~ ' ~ ) .3. Then the instrumental variables (W) estimator of a is defined by Under general conditions it is consistent and asymptotically (13. This estimator was proposed by Theil (1953).1) and (13. Let X be as defined after (13.and the resulting estimators are expected to possess desirable properties.2. Let S be a known nonstochastic matrix of the same size as Z such that plim T'S'Z is nonsingular.3.5) or (13.3.13) ( ~ 1m2) . we obtain (provided that Y1Y2 # 1): (13.9).4). the help of an auctioneer. plim T(Z'PZ)' 5 plim T ( s ' z ) . the equilibrium price and quantityare determined. and a's yields a consistent and asymptotically efficient estimator. is still consistent but in general not efficient. Since a reducedform equation constitutes a classical regression model.3) y1 = 1 1 @lP1 + X2~1P2 + U1 + ~ 1 ~ 2 ) and the estimates of y's and p's from the LS estimators of ml and n2.10) ikw A N[a. m2.4) the reducedfm equations. ii2) = g(yl. known as the full information maximum likelihood estimator A simple consistent estimator of y's and p's is provided by the instrumental variables method. Note that . Maximizing that function with respect to y's. y2. P2).6). If.3.12) where X consists of the distinct columns of XI and X2 after elimination of any redundant vector and ml.3.1) y2 is correlated with ul because y on ul. If m a p ping g ( .) is onetoone. as we can see in (13.3. Next.3.3. we can uniquely determine In other words.6) y 2 = Xm2: * V2. Rewrite the reducedform equations as where Z = (y2.3. in (13.
where xt is a vector of exogenous variables which. We do not observe Dl or St. f . . f.14) D. with Eu. depending on whether the research knows which of the two variables D.3. Let us illustrate.() is a known function. which is defined by (13.3.. The parameters of this model can be consistently and efficiently estimated by the maximum likelihood estimator. and u .4. but instead observe the actually traded amount Q. Let be the initial value.3.St). In practice we often specify and {u. and labor input. and where Q. We can write (13.3.2). The computation of the maximum likelihood estimator in the second instance is cumbersome. p).i. respectively.16) with the equilibrium condition D. capital input. estimators such as the instrumental variables and the twestage least squares are valuable because they can be effectively used whenever a correlation between the regressors and the error term exists. 13. Expand f. The NewtonRaphson method described in Section 7. and L denote output. unlike the linear regression model.d. = I ft($) = f (xt. is the quantity the seller desires to sell at price Pt. See Chapter 11.1) and (13.4 NONLINEAR REGRESSION MODEL where y. for the tth element. the GaussNewton method.3. we can estimate a2by The nonlinear regression model is defined by The estimators and 6 ' can be shown to be the maximum likelihood estimators if (u.ks model. respectively. The case when the researcher knows is called sample separatiolz. = St leads to a simultaneous equations model similar to (13.4 1 Nonlinear Regression Model 331 as in the 1950s and 1960s.330 13 1 Econometric Models 13. or S. is smaller. J ' . Another example is the CES production function (see Arrow et al.16) Q. may not necessarily be of the same dimension as P. is specifically designed for the nonlinear regression model.5. = 0 and V u . The derivation is analogous to the linear case given in Section 12.3 can be used for this purpose.. Consider (13. Although the simultaneous equations model is of limited use. be it an estimator or a mere guess. 1961): where Dl is the quantity the buyer desires to buy at price P. One reason is that a multivariate time series model has proved to be more useful for prediction than the simultaneous equations model. for the simplest such model and Fuller (1987) for a discussion in depth. p is a Kvector of unknown parameters. The minimization of ST(@) must generally be done by an iterative method. An example of the nonlinear regression model is the CobbDouglas production function with an additive error term. especially when data with time intervals finer than annual are used. Another example is the errorinvanab. we have the case of no sample separation.. = ylPt + X& + ult where f.2..3. when the researcher does not know.@) in a Taylor as series around p = 0 pl Bl .1) in vector notation as This is the disequilzbn'um model proposed by Fair and Jaffee (1972).] are assumed to be jointly normal. Another iterative method. K. Exercise 5. The nonlinear least squares (NLLS) estimator of P is defined as the value of p that minimizes Denoting the NLLS estimator by 0.2. Note that replacing (13. = min (Dl. Another reason is that a disequilibrium model is believed to be more realistic than an equilibrium model. and u are Tvectors having y. There are two different likelihood functions. We have already seen one such example in Section 13. again with the supply and demand model..] are i. and S.
however. the x. The first two examples are binary. In this book we consider only models that involve a single dependent variable. The practical implication of (13. / apr is a Kdimensional row vector whose jth element is the derivative off. provided that we use a f / ~ ? p 'for ( ~ X. In Section 13.4. with respect to the jth element of p.4. As in a regression model. If. what type of occupation a person's job is considered. It is simpler than the NewtonRaphson method because it requires computation of only the first derivatives off. as well as many other issues not dealt with here. as extensive sample survey data describing the behavior of individuals have become available.332 13 1 Econometric Models 13.need not be the original variables such as price and income.5 1 Qualitative Response Model 333 where df. given the independent variables.will include among other factors the price of the car and the person's income. where it was used to analyze phenomena such as whether a patient was cured by a medical treatment. and the vector x.8).4.5 QUALITATIVE RESPONSE MODEL The qualitative response model or discrete variables model is the statistical model that specifies the probability distribution of one or more discrete dependent variables as a function of independent variables. or whether insects died after the administration of an insecticide. It is analogous to a regression model in that it characterizes a relationship between two sets where we assume that y.4. The assumption that y takes the .1 Binary Model i F t j I 9 2 i i / i 1 i We formally define the univariate &nary model by The asymptotic variancecovariance matrix above is comparable to formula (12. and by what mode of transportation during what time interval a commuter travels to his workplace.2 we look at the multinomial model.9) is that of variables. 13.5.4. but differs from a regression model in that not all of the information of the model is fully captured by specifying conditional means and variances of the dependent variables. We can test hypotheses about in the nonlinear regression model by the methods presented in Section 12. The following are some examples: whether or not a consumer buys a car in a given year. takes the values 1 or 0. The &fference is that df/dpr depends on the unknown parameter p and hence is unknown. they could be functions of the original variables.7) into the righthand side of (13. The secondround estimator of the iteration. F is a known distribution function. y. = 1 represents the fact that the ith person buys a car.is a known nonstochastic vector. for example.22) for the LS estimator.2. x. where the dependent variable takes two values. Recently the model has gained great popularity among econometricians. treating the entire lefthand side as the dependent variable and aft/aprlsl as the vector of regressors. whereas NewtonRaphson requires the second derivatives as well. Note that df/dpr above is just like X in Theorem 12. multinomial.1) and rearranging terms. multivariate. and in Section 13. 13.7) holds approximately because the derivatives are evaluated by Inserting (13.4.. b2. The qualitative response model originated in the biometric field.2.5.2. We can show that under general assumptions the NLLS estimator @ is consistent and The above result is analogous to the asymptotic normality of the LS estimator given in Theorem 12. chapter 9). Many of these data are discrete. are discussed at an introductory level in Arnemiya (1981) and at a more advanced level in Arnemiya (1985. The iteration is repeated until it converges. Note that (13. how many cars a household owns. we obtain: Dl.1 we examine the binary model. and P is a vector of unknown parameters. we apply the model to study whether or not a person buys a car in a given year. where the dependent variable takes more than two values. and the last. The multivariate model.5.4. whether o r not a worker is unemployed at a given time. The same remark holds for the models of the subsequent two sections.4. the next two.is obtained as the LS estimator applied to (13. whereas X is assumed to be known.
when A is used. If one researcher fits a given set of data using a probit model and another researcher fits the same data using a logit model. To the extent that the econometrician experiments with various transformations of the original independent variables. The standard normal distributionfunction (see Section 5. we can prove the consistency and the asymptotic normality of the maximum likelihood estimator by an argument similar to the one presented in Sections 7. 334 13 1 Econometric Models 13. with ah/&. where X I .5. The two distribution functions have similar shapes.5. The likelihood function of the model is given by P(y.5. partially employed.4. but the econometrician assumed it to be F. Model (13.u l i is F and define xi = xli . respectively.. We must instead compare a@/ax. Thus we have (13. We assume that (13. rather. as he normally would in a regression model. maximizing L with respect to P by any standard iterative method such as the NewtonRaphson (see Section 7.In most cases these two derivatives will take very similar values.and U.. and 3. . The essential features of the model are unchanged if we choose any other pair of distinct real numbers.2 and '7.5. and selfemployed. and %.3) is straightforward. the vector aF/ax.2) is defined by When F is either @ or A. > Uo. by choosing a function h ( . which may be regarded as the omitted independent variables known to the decision maker but unobservable to the statistician.1) if we assume that the distribution function of uoi . To see this. We assume that the zth person chooses alternative 1 if and only if Ul. the model is called probit. Let U1. the likelihood function is globally concave in P.2) Uli = xiif3 + uli and Uoj = x b i ~ + %i.1). 13.. random variables. When @ is used in (13. The asymptotic distribution of the 3 is given by maximum likelihood estimator f where and the logistic distribution function is defined by where f is the density function of F. %. Although we do not have the i. the model is called logit. 2.d.1) is by the maximum likelihood method. = 1 ) = P(U1. suppose that the true distribution function is G. except that the logistic has a slightly fatter tail than the standard normal.1) the regression coefficients p do not have any intrinsic meaning.) We obtain model (13.1) can be derived from the principle of utility maximization as follows.3.x.5 1 Qualitative Response Model 335 values 1 or 0 is made for mathematical convenience. where the three alternatives are private car. Then. Therefore.5.5. which for convenience we associate with three integers 1.4.) are bivariate i.> uo. and train.) appropriately.. bus.i.. The best way to estimate model (13.3) It is important to remember that in model (13. he can always satisfy G(X: P) = ~ [ h ( x P)] : . One example of the threeresponse model is the commuter's choice of mode of transportation.d. sample here because xi varies with i.2 Multinomial Model I We illustrate the multinomial model by considering the case of three alternatives. it would be meaningless to compare the two estimates of P..5. be the ith person's utilities associated with the alternatives 1 and 0.. The following two distribution functions are most frequently used: standard normal @ and logistic A. the choice of F is not crucial. The important quantity is.i.3.5. are nonstochastic and known. Another example is the worker's choice of three types of employment: being fully employed. and (ul.
uQt) up to an unknown parameter vector 0. .. In the first example. .5 1 Qualitative Response Model 337 We extend (13. the model implies independence from irrelmant alternatives. we should expect inequality < to hold in the place of equality in (13. Therefore. Besides the advantage of having explicit formulae for the probabilities. It is easy to criticize the multinomial logit model from a theoretical point of view. This is cumbersome if the number of alternatives is larger than five. If we specify the joint distribution of (ul.5. to cite McFadden's wellknown example. 2. u3. respectively. An iterative method must be used for maximizing the above with respect to P and 0. 2.2) to the case of three alternatives as the errors are mutually independent (in addition to being independent across i ) and that each is distributed as This was called the Type 1 extremevalue distribution by Johnson and Kotz (1970. First. instead of bus and train.5.336 13 1 Econometric Models 13. this model has the computational advantage of a globally concave likelihood function. . and private car. The probabilities are explicitly given by where (ul. He assumed that and similar equalities involving the two other possible pairs of utilities. = 1 if y. 1. = j. 272). we can express the above probabilities as a function of p and 0.n.13) means that the information that a person has not chosen alternative 2 does not alter the probability that the person prefers 3 to 1. The former assumption is possible because the nonstochastic part can absorb nonzero means. An analogous model based on the normality assumption was estimated by Hausman and Wise (1978).by yJ. In the normal model we must evaluate the probabilities as definite integrals of a joint normal density. We should generally allow for nonzero correlation among the three error terms. p. t of the errors that makes McFadden (1974) proposed a ~ o i ndistribution possible an explicit representation of the probabilities. It is assumed that the individual chooses the alternative with the largest utility. Given that a person has not chosen red bus. 2. and suppose that a person is known to have chosen either bus or car. > Ul. ulr > U31) P(y.. We can assume without loss of generality that their means are zeroes and one of the variances is unity. It is perhaps reasonable to surmise that the nonselection of train indicates the person's dislike of public transportation.). and the latter because multiplication of the three utilities by an identical positive constant does not change their ranking. train. it is likely that she .)are i. our model is defined by P(yz = 1) = P(ulz > U21.) The equality (13.13). Given this information. U21> U3. 1989) has made the problem more manageable than formerly.. = P(yi = 2 ) . . Let us consider whether or not this assumption is reasonable in the two examples we mentioned at the beginning of this section. we might expect her to be more likely to choose car over bus. This argument would be more convincing if alternatives 1 and 2 corresponded to blue bus and red bus.5. the likelihood hnction of the model is given by This model is called the multinomial logit model. One way to spec* the distribution of the u9swould be to assume them to be jointly normal.. u2. = 2) = P(U2. j = 1. (We have suppressed the subscript z above to simplify the notation.i.. If we define binary variables yJ. Second. if we represent the ith person's discrete choice by the variable y. which can be mathematically stated as where Pli = P(yi = 1 ) and P2.. although an advance in the simulation method (see McFadden. If this reasoning is correct. no economist is ready to argue that the utility should be distributed according to the Type I extremevalue distribution. suppose that alternatives 1. up. and 3 correspond to bus.d.
]are assumed to be i. In a given practical problem the researcher must choose a priori which two alternatives should be paired in the nested logit model. It is assumed that {y. If.3.2. = x : p =0 + u. Tobin used this model to explain a household's expenditure (y) on a durable good in a given year as a function of independent variables (x). Suppose that u3 is distributed as (13.1) y. including the price of the durable good and the household's income.XZ~)'P/PI f 1. This model is called the censmed regression model or the Tobit model (after Tobin. . and selfemployed. if' p = 1 the model is reduced to the multinomial logit model. The joint distribution was named Gumbel's Type B bivariate extremevalue distribution by Johnson and Kotz (1972. 2. not necessarily derived from utility maximization. or u2 to infinity. Again. and if the researcher does not know how many obser: 5 0. in analogy to probit). lists several representative applications.6 CENSORED OR TRUNCATED REGRESSION MODEL (TOBIT MODEL) Tobin (1958) proposed the following important model: (136. n .11) and independent of u.5.* = xi6 + u. partially employed. For generalization of the nested logit model to the case of more than three alternatives and to the case of higherlevel nesting.1. sections 9. 256). and (13. If the observations corresponding to yT 5 0 are totally lost. yet the probabilities can be explicitly derived.and U Q .* 5 0. to the extent that the nonselection of "partially employed" can be taken to mean an aversion to work for others.5 and 9. but u 1 and u 2 follow the joint distribution and (13. . we view (13. however.]are observed for all i.3. if {x. we can readily see that each marginal distribution is the same as (13.)are not observed whenever y.338 13 1 Econometric Models 13.5.13). and it is hypothesized f the desired expenditure that a household does not buy the durable good i is zero or negative (a negative expenditure is not possible). Therefore it is useful to estimate this model and test the hypothesis p = 1. If there is a single independent variable x.5. N(0. a2)and x.xs. and 3 correspond to fully employed. Amemiya (1985). see McFadden (1981) or Amemiya (1985.5. is a known nonstochastic vector. we would expect inequality < in (13. . if y : >0 z = 1 . the observed data on y and x in the Tobit model will normally look like Figure 13. suppose that alternatives 1. but {y:] are unobserved if y4 5 0.1) as long as the researcher experiments with various transformations of the independent variables.11).The parameter p measures the (inverse) degree of association between ul and U . . it is much more general than it appears. In the aforementioned examples.*>O. The variable y interpreted as the desired amount of expenditure. it is natural to pair bus and train or fully employed and partially employed. p. + p log q]. that is. By taking either u.)'P exp[(x~. Any multinomial model can be approximated by a multinomial logit model if the researcher is allowed to manipulate the nonstochastic parts of the utilities. ify. We shall explain the nested logit model proposed by McFadden (1977) in the model of three alternatives. The probabilities of the above threeresponse nested logit model are specified by where (u. such that p = 1 implies independence. It is apparent there that the LS estimator of the slope coefficient obtained by regressing . p. the model is called the truncated regmszon vations exist for which y model.5 1 msared or Truncated Regression Model 339 In the second example. It is possible to generalize the multinomial logit model in such a way that the assumption of independence from irrelevant alternatives is removed. precisely for the same reason that the choice of F does not matter much in (13. = 13.d.6. . The Tobit model has been used in many areas of economics. where z.5.16) P& = 1 or 2) = A[(xzi .i. The above model is necessitated by the fact that there are likely to be many : may be households for which the expenditure is zero.2) y.5.6).12) as a purely statistical model.]and {x. 365. Clearly.
The likelihood function of this model is given by (13. The consistent and asymptotically efficient estimator of P and u2 in the Tobit model is obtained by the maximum likelihood estimator. 1 where 4.d. Amemiya (1985. and f ( . Since a test of Type 1 versus Type 2 cannot be translated as a test about the parameters of the Type 2 Tobit model.) are serially correlated (see Robinson. yet the maximum likelihood estimator can be shown to be consistent and asymptotically normal. . = 0 and y.. are observed for all i but that xp.1 Anexampleofcensoreddata and all the y's (including those that are zeroes) on x will be biased and inconsistent. L = 0 P(yT. drawings from a bivariate normal distribution with It is assumed that zero means. up. given y : > 0.) are i. Many generalizations of the Tobit model have been used in empirical research. . see Amemiya (1973). are independent.. > 0. where noand ITl stand for the product over those z for which y2. of which we shall discuss only Types 2 and 5. It is a peculiar product of a probability and a density. = x.Y$ = x. = 0 and y21Z 0.6. however.aP2 + u2t FIGURE 13. and covariance alp. Olsen (1978) proved the global concavity of (13. while the probit maximum likelihood estimator applied to the first equation of (13... is observed only when y?. on xzayields the maximum likelihood estimators of P2 and oi. For the proof. and are the standard normal distribution and density functions and no and 111 stand for the product over those i for which y. Amemiya (1973) proved the consistency and the asymptotic normality of this model as well.7) u 1 +[(y.5) 6 yT.6.3) L = n 0 [l . 5 0. The likelihood function of the model is given by (13. For discussion of these cases. : > 0) stands for the conditional density of y2*. only the sign of y . section 10. The likelihood function of the truncated regression model can be written as (13. > 0. respectively. Another special case of Type 2 is when ula and u . It is assumed that xl.~Ux:P)/u].6.. : is observed and that y2*.6. Type 2 Tobit is the simplest natural generalization of the Tobit model (Type 1) and is defined by (13.] is either nonnormal or hetercscedastic.P1 + ul. when the true distribution of {u. . Although not apparent from the figure. 5 0) n f ( y 2 . I Y. section 10. in which y : = y..6) classifies them into five broad types.6.340 13 1 Econometric Models 13. variances u: and a.5) yields the maximum likelihood estimator of Pl/~l. the choice between the two models must be made in a nonclassical way.5 ( Censored or Truncated Regression Model 341 The Tobit maximum likelihood estimator is consistent even when {u. ~~[(~. see Amemiya (1985. .need not be observed for those z such that y .xlP)/ul. it can further be shown that the LS estimator using only the positive y's is also biased and inconsistent. The Tobit model (Type 1) is a special case of the Type 2 Tobit. It loses its consistency. .i.> 01.3).4) + L= n 1 @ ( X ~ ~ / U ) .@(x:P/u)l n 1 where (ul. respectively.5).I y .6. 1982). In this case the LS regression of the positive y2. > o)P(YT.
and (13. In duration analysis the concept known as hazard plays an important role. We shall initially explain the basic facts about the duration model in the setting . . 2 . . z .of the i. Only when the offered wage exceeds the reservation wage do we observe the actual wage. X f (Gi I zzi I Z$ is the maximum )P2i.i. In the work of Dudley and Montmarquette (1976). is the utility of the jth portfolio of the of Dubin and McFadden (1984).342 13 1 Econometric Models 13. j = l .6.i. 2 . y = z. As is evident from these examples. we can completely characterize the duration model in the i. .15).S.9) The duration model purports to explain the distribution function of a duration variable as a function of independent variables. 2 . (13. In the model. and y . is the product over those i for which $ is the maximum and Pji = P(z: is the maximum). We define the hazard function. how long a patient lives after an operation. and for each i and j the two random variables may be correlated with each other. In some applications the maximum in (13. Introductory books on duration analysis emphasizing each of the areas of application mentioned above are Kalbfleisch and Prentice (1980).7 DURATION MODEL y .8) location.3.2) $ 2 Hazard(t. = xip.6. t + At).8) can be replaced by the minimum without changing the essential features of the model. Their joint distribution is variously specified by researchers. then later introduce the independent variables. The likelihood function of the model is given by (13. We define (13. . Yt=Ytz if $=maxz.d. the above signifies the probability that she dies in the time interval (t. (We have slightly simplified Lee's model.Y. Assuming that the density function f (t) exists to simplify the analysis. and y of the gas and electricity consumption associated with the jth portfolio in the ith household.1) F(t) = P ( T < t)..J .2) where the approximation gets better as At gets smaller. If T refers to the life of a person. . .. t + At) = P(t < T < t + At I T > t) X f (y.. denoted A(t). and economics.3. 13.J = 2. the life of a machine.14). including medicine. engineering. Denoting the duration variable by T. and y2*. T y p e 5 Tobit is defined by (13. Miller (1981). ! I J where H. + Vp. which is the greater of the two. the duration model is useful in many disciplines. which is equal to the offered wage. J where yi. (vector) consists electric and gas appliances of the ith household. given that she has lived up to time t. we have from (13. inclination to give aid to the ith country. represents the offered wage minus the reservation wage (the lowest wage the worker is willing to accept) and y: represents the offered wage. S. . It is assumed that (uji. and Lancaster (1990). The duration variable may be human life.+ = uJ. .7. case by specifying the distribution function L = n n n 1 f(yT.d.i. .3. or the duration of unemployment.) The disequilibrium model defined by (13. vji) are i. is the maximum)Pj. by ' . i = l . The researcher observes the actual wage rate y . y. . t + At). n. z$ is +e net profit accruing to the ith firm from the plant to be built in the jth i Y and call it the hazard of the interval (t. so that aid is given if y. and zfi represents the wage rate of the ith worker in case he joins the union and z:i in case he does not.16) becomes this type if we assume sample separation. In the model .6.7 1 Duration Model 343 In the work of Gronau (1973).sii are observed. xji. y. across i but may be correlated across j. 1 zTi is the maximum)Pli i (137.d. is positive. of Duncan (1980)..7. signifies the measure of the U. determines the actual amount of aid. sample. is the inputoutput vector at the jth location. In the model of Lee (1978).
j : ~ ( ~ ). lnitially there are 1000 people in each group. The exponential model for the life of a machine implies that the machine is always like new.7 1 Duration Model 345 There is a onetoone correspondence between F(t) and A(t).i. starting with the constant term only.2. regardless of how old it may be. and so on. hence it is easier for him to specify the former than the latter. For some other applications (for example. the duration of a marriage) an inverted U shape may be more realistic. for example.. Differentiating (13.4) shows that X(t) is known once F(t) is known.7.5) we have for this model F(t) = 1 . is the same for persons of every age. it is useful to define this concept because sometimes the researcher has a better feel for the hazard function than for the distribution function.(u. gamma. because economic theory does not clearly indicate whether a should be larger or smaller than 1.d. curiously. Since f (t) = aF(t)/at. This model would not be realistic to use for human life. In this example three groups of individuals are associated with three levels of the hazard rate0. The heterogeneity of the sample may not be totally explained by all the independent variables that the researcher can observe. say. and then rising again with age.1. the likelihood function of the model with the unobserved heterogeneity is given by Thus the Weibull model can accommodate an increasing or decreasing hazard function. d ~ ] The vector x.5. and 0. but neither a Ushaped nor an inverted Ushaped hazard . that his maximum likelihood estimator of a a p proached 1 from below as he kept adding the independent variables. and log replacement (unemployment benefit divided by earnings from the last job). 0. for it would imply that the probability a person dies within the next minute. attaining a minimum at youth.344 13 1 Econometric Models 13. Lancaster was interested in testing a = 1.) denotes the conditional likelihood function for the ith person. given v. He found. The simplest generalization of the exponential model is the Weibull model.)are i. remaining high for age 0 to 1. if different individuals are associated with different levels of the hazard function.7) with respect to t.7.1. He introduced independent variables into the model by specifying the hazard function of the ith unemployed worker as exp [. In such a case it would be advisable to introduce into the model an unobservable random variable. The simplest duration model is the one for which the hazard function is constant: This is called the exponential model. this phenomenon is due to the fact that even if the hazard function is constant over time for each individual. contains log age. that 500 people remain at the end of period 1 and the beginning of period 2. the researcher can test exponential versus Weibull by testing a = 1 in the Weibull model. Therefore. known as the unobserved heterogeneity. The next equation shows the converse: (13. log unemployment rate of the area. (13. In one of his models Lancaster (1979) specified the hazard function as Therefore X(t) contains no new information beyond what is contained in F(t). If L. The last row indicates the ratio of the aggregate number of people who die in each period to the number of people who remain at the beginning of the period. we obtain where {u. which acts as a surrogate for the omitted independent variables. &/at < 0). Nevertheless. the Weibull model is reduced to the exponential model.5) ~ ( t= ) I  Lancaster (1979) estimated a Weibull model of unemployment duration. in which the hazard function is specified as When a = 1.7.e'' and f (t) = hi".7. From (13. A more realistic model for human life would be the one in which h(t) has a U shape. As Lancaster showed. an aggregate estimate of the hazard function obtained by treating all the individuals homogeneously will exhibit a declining hazard function (that is. We explain this fact by the illustrative example in Table 13. The first row shows.
Heckman and Singer (1984) studied the properties of the maximum likelihood estimator of the distribution of the unobserved heterogeneity without parametrically specifying it in a general duration model.7.12). by differentiating the above as The computation of the integral in the above two formulae presents a problem in that we must specify the independent variable vector x .3tk Gritz (1993) ho(t) = A exp(y.9). The first step is to obtain the distribution function by the formula (13. Sturm (1991) Next we consider the derivation of the likelihood function of the duration model with the hazard function of the form (13.1 2 7 I  Flinn and Heckman (1982) (13. in the sense that x depends on time t as well as on individual i. second.13.7. as a .7.exp [/( o h0(s)e ~ ~ ( < ~ ) d s ] and then the density function. This formulation is more general than (13.13) Ao(t) = exp +Y t" .(t) = 1 .7. They showed that the maximum likelihood estimator of the distribution is discrete. first. The unobserved heterogeneity can be used with a model more general than Weibull. and.16) F.7 1 Duration Model 347 where the expectation is taken with respect to the distribution of v. A hazard function with independent variables may be written as where ho(t) is referred to as the baselim hazard function.5) as (13. in the sense that the baseline hazard function is general.) As Lancaster introduced the unobserved heterogeneity.7.7. 1 + .. . Some examples of the baseline hazard functions which have been used in econometric applications are as follows: (13.t + y2t2). his estimate of a further approached 1. (The likelihood function of the model without the unobserved heterogeneity will be given later.7.14) (13.15) Ao(t) = pktk' .
(t) must be interpreted as the probability that the spell of the ith person ends in the interval (t. the two models are mathematically equivalent. remains constant within each interval.18) is no longer the correct likelihood function. If the worker accepts the offer.7. t 1). For the correct likelihood function of the second sampling scheme with left censoring. Note that Lancaster's model (13. We mentioned earlier a problem of computing the integral in (13. With data such as unemployment spells. The contribution of a patient in the first category to the likelihood function is the density function evaluated at the observed survival duration. 1992.(t)]X.9) is a special case of such a model. As an illustration. See Moffitt (1985) for the maximum likelihood estimator of a duration model using a discrete approximation.7. 1992) are said to be right censored. The general model with the hazard function (13.6. Maximizing it would overestimate the survival duration. as G.7 1 Duration Model 349 continuous function of s. The contribution to the likelihood function of the spell that ends after k periods is I I f l : [l . The likelihood function depends on a sampling scheme. and December 31. For an econometric application of these estimators. In every period there is a probability A that a wage offer will arrive.(k). It is customary in practice to divide the sample period into intervals and assume that x i . and the contribution of a patient in the second category is the probability that he lived at least until December 31.3). he incurs the search cost c until he is + . whereas for patients of the second category ti refers to the time from the operation to December 31.7. We do so first in the case of discrete time.1992.7. as above. is called the proportional hazard model. see Lehrer (1988). The problem does not arise if we assume is the product over those individuals who died before December where 31.17). let us assume that our data consist of the survival durations of all those who had heart transplant operations at Stanford University from the day of the first such operation there until December 31.7.12). Note a similarity between the above likelihood function and the likelihood function of the Tobit model given in (13.d.i. Next we demonstrate how the exponential duration model can be derived from utility maximization in a simple jobsearch model. We have deliberately chosen the heart transplant example to illustrate two sampling schemes. Then (13. This estimator of P is called the partial maximum likelihood estimatm The baseline hazard Xo(t) can be nonparametrically estimated by the KaplanMeier estimator (1958). 1980) and were still living on that date are said to be lej censored. see Amemiya (1991). 1980. which arises when we specify the hazard function generally as (13. There are two categories of data: those who died before December 31. The survival durations of the patients who had their operations before the first day of observation (in this example January 1. and those who were still living on that date. he will receive the same wage forever.12) may be estimated by a discrete approximation.(t)]. 1980. In fact. its size is distributed i. 1980. In order to obtain consistent estimates of the parameters of this model.7. The survival durations of the patients still living on the last day of observation (in this example December 31. whereas the contribution to the likelihood function of the spell that lasts at least for k periods is l 7 [l . the first sampling scheme is practically impossible because the history of unemployment goes back very far.scheme tends to include more longsurviving patients than shortsurvivingpatients among those who had their operations before January 1. In this case X.348 13 I Econometric Models 13. 1992. 1992.1980. If he rejects it. Cox (1972) showed that in the proportional hazard model P can be estimated without specifying the baseline hazard Xo(t). Now consider another sampling scheme with the same heart transplant data. 1992. but were still living on that date. Suppose we observe only those patients who either had their operations between January 1. Consider a particular unemployed worker. or those who had their operations before January 1. Note that for patients of the first category ti refers to the time from the operation to the death. This assumption simplifes the integral considerably. and Ill is the product over those individuals who were still living on that date. because this sampling . and second in the case of continuous time. 1992. we must either maximize the no The duration model with the hazard function that can be written as a product of the term that depends only on t and the term that depends only on i. Thus the likelihood function is given by correct likelihood function or eliminate from the sample all the patients living on January 1. and if it does arrive.X.16) or (13.X.
It is easy to show that R* satisfies where R = S[(1 .24) V(t) = r n a x [ ~ . and define R* accordingly.7. A new feature in the model of Pakes (1986). Solve (13.c]. Then the Bellman equation is + ( I . call the solution V*. for example. Note further that V appears in both sides of (13. (13. We define c and 6 as before.] . Solve for V .28) f (t) = h P exp(hPt).d. the wage is observed with an error.c] and W(t) has been written simply as W because of our i.23). The worker should accept the wage offer if and only if W > R*.R)] V because of Taking the expectation of both sides and putting EV(t) stationarity. in Lippman and McCall (1976). R)] = I. Let f (t) be the density function of the unemployment duration.7.d.26) for V . When it arrives.7 1 Duration Model 351 utility at b e employed.X)[(1 . The duration T until the wage offer arrives is distributed exponentially with the rate h: that is. This feature makes solution of the Bellman equation considerably more cumbersome.S)V* .26) V = 6 . wdG(w) + RG(R). second. and define R* = 6[(1 . A fuller discussion of the model can be found. . the planning horizon is finite. (13. whcre R = 6K.6)V .7.21) V = 6'AE[max(~. in which W is the net return from the renewal of a patent. The Bellman equation is given by ) . Define P = P(w > R*). The model of Wolpin (1987) introduces the following extensions: first.22) E[max(W. Then we have (13. assumption. we have R)]. the wage is distributed i.28) is approximately equal to (13. The next model we consider is the continuous time version of the previous model. Taking the expectation of both sides and setting m ( t ) stationarity. is that W(t) is serially correlated.7. Note that (13. the reservation wage. (13. The discount rate is 6. as G.350 13 1 Econometric Models where 13.7. call the solution V*. where P = P(W > R*). P ( T > t) = exp(kt).7.7.7.' ~ [ m a x ( ~  V because of + 6'(1  X)R. Then the likelihood function of the worker who accepted the wage in the (t 1)st period is + Many extensions of this basic model have been estimated in econometric applications.i. For a small value of XP.' ~ ( t K .i.21). of which we mention only two. Let V(t) be the maximu~n t.7.S)EV(t + 1) . Thus we have obtained the exponential model.c]. (13.
APPENDIX: DISTRIBUTION THEORY Let {Z. n . and Y . Then the distribution of C : = ~ Zis ~ called the chisquare . If X . then EX = n and VX = 2n.X: and if X and Y are independent. be i. 1). with n degrees of freedom and denoted by DEFINITION 1 (Chisquare Distribution) I Xt T H Eo R EM 1 X + Y .X: . 2. .X:+m . as N(0.. .).d. .X. distribution. . then THEOREM 2 ~f x .i = 1.i.
Then x Let Y be N ( O . by Theorem 1. . I ) .A) and Pmoj Since x .d. as N ( p y . . (7) implies that these two terms are independent by virtue of Theorem 5.i.2. a. the theorem is true for n = 2. and (9) because of Definition 2.and x ~s2 = Let = 1. ny. be the sample variances. s2and Y are independent.d. .u. asN(kx.2.4. by Theorem 3. i = 1. Let and be the sample means and S: and S. Therefore. Then the distribution of & y / C i s called the Student S t distributionwith n degrees of freedom. Define X = (10) (X . 2. . . nI$). we can easily venfji But we have by Theorem 3 C ( X .Znl2 t=l (9)  up Xn1.1. n. We shall denote this distribution by t.) be i. i . Then if u i = u. as N ( 0 . i = 1.i. the lefthand side of (3) is X : .3. Then ~ . . .d.. We have  1 Also.). THEOREM 6 1 T H E OR E M 4 Let (Xi)and x. . . 2 . . . we have C (Xi .N ( p .(. I ) . R Let {Xi)bei. . we have + But the first and second terms of the righthand side above are i n d e pendent because (4) Since Xi. D E F I N I T I O N 2 (Student's t Distribution) Prooj We have THEOREM 5 {Xi) be i. Second.Ax  PY) (X  . assume it is true for n and consider n 1.i. be as defined in Theorem 3. Therefore. the righthand side of (2) is X: by Definition 1. (8). l ) and independent of a chisquare variable x:. (8) x x E(Z.X12 "x ny (Yi  n2 .) (Zn+l .' c % . Therefore..and are jointly normal. n Moreover. the theorem follows from (6). Assume that ( X i ) are independent of {Yi). nxandlet(Y. 2.354 Appendix Distribution Theory 355 But since (Z1 .8 . .z 2 ) / f i N ( 0 .5.212.) = 0.).. 2 which implies by Definition 1 that the square of (5) is X : . O Therefore.
Moreover. Calif. H. . eds. Apostol. Csaki." Econometnca 41: 997101 6. R 1972. Anderson. M. by Definition 2. 1974. of Economzcs and Statzstzcs 43: 225250. pp. m ) . New York McGrawHill. B. Reading. J. DeGroot.. Bellman.. B. 1965. and D. R. M. Minhas. New York: Academic Press. Cox. Therefore. "A Comparative Study of Alternative Estimators in a DistributedLag Model." Journal ofthe A m a can Statistzcal Associatzmz 57: 269326 (with discussion).   Akaike. ser. Clptzmal Statzstzcal Deciszonr.: Add~sonWesley. 2nd ed. Introductzon to Matnx Analyszs. 1973. Arrow. I S . "Information Theory and an Extension of the Maximum Likelihood Principle. Stanford Unmersity. "Cap~talLabor Substitution and Economic Efficiency. 1962.: Harvard University Press. 267281. Aspects of the Theory ofRzskBeanng Helsinki: Acadetnlc Book Store. 1985. K." Econometrica 35: 509529." Journal of Economzc Lzterature 19: 14831536. Zntroductzon to Multzvanate Statzstzcal Analysts. "Qualitative Response Models: A Survey. G. "On the Foundations of Statistical Inference." Journal of the Royal Statistzcal Sonety. Mass. A. Petrov and F. Here n is called the numerator degrees of freedom. S. 235. ser." Journal of the Royal Statutzcal Sonety. and W." in B. W. New York: John Wiley & Sons. Fuller. then ( X / n ) / ( Y / m ) is distributed as F with n and m degrees of freedom and denoted F ( n . "An Analysis of Transformations. Amemiya. New York: McGrawHill. "Regression Models and Life Tables." Revzeu. H. 1961. into (12). 1967. "Regression Analysls When the Dependent Variable Is Truncated Normal. Amemiya. CEPR. H. REFERENCES Finally. R.356 i Appendix where (11) follows from Theorems 1 and 3. Birnbaum. and m the denominator degrees of freedom. Solow.L. Advanced Econometrics Cambridge. A. Box. 2nd ed. R J... T. 2nd ed. 34: 187220 (with discussion). Chenery. Mass. A Course zn Probabilzly Theory. the theorem follows from inserting a . and R." Technical Paper no. Arrow. 1970. D. 1973. Budapest: Akademiai Kiado. 26: 21 1252 (with discussion). 1974. T. 0 ( F D i s t r i b u t i o n ) If X X: and Y X: and if X and Y are independent. B. This is known as the F distrihution. 1984. Mathematzcal Analyszs. M. Chung. 1964. E. 2nd ed. (10) and (11) are independent for the same reason that (8) holds. B. Second Intmational Sy~nposzumon Informatzon Themy. 1970. T. DEFINITION 3 = cr. "A Note on Left Censoring. T. P. 1981. N. 1991. Cox.
Mallows. Quandt. D. D. Gritz. Ferguson. Boston: Houghton Mifflin. Griliches. Fuller. 'The Economics of Job Search: A Survey.J.. Kotz. "Formulation a n d Statistical Analysis of the Mixed. New York: John Wiley & Sons. Johnston. T. "An Econometric Analysis of Residential Electric Appliance Holdings and Consumption. 1976. J. 2.  Johnson." Ewnomehica 47: 939956. F. Koyck. New York: John Wiley & Sons. 11. Marcus.: Ballinger Publishing. Calif. C. F. "Unemployment Insurance and the Distribution of Unemployent Spells. L. 3rd ed. and D. "Econometric Models of Probabilistic Choice. R.A Survey ofMatrix Theory and Matrix Inequalities. The Advanced Theory of Statistics. M. 1985. Continuous/Discrete Dependent Variable Model in Classical Production Theory. and B...." Bzometraka 38: 159178. 1984. The Econometric Analysis of Transition Data. 1985. Z. A. 1990." Applied Economics 20: 195210. 1951. L. Manhattan. J. eds."Journal of Pokttcal Economy 81: S1684199." Cowles Foundation Discussion Paper no. Zarembka. P. and D.. "Econometric Methods for the Duration of Unemployment. 1967. "Asymptotic Normality and Consistency of the Least Squares Estimators for Families of Linear Regressions. Meier. Introduction to Matnces wath Applzcatzons zn Statzstzcs.. Survival Analysis. Miller. ed. "D~stributedLag Models: A Survey. C. G. A. New York: John Wiley & Sons." Amencan Ewnomzc Review 66: 132142. 5th ed. Eicker.. Kans. Kaplan. Amsterdam: NorthHolland Publishing. and C. Measurement Error Models." in C. "Testing for Serial Correlation in Least Squares Regression. and J. W. L. J. 1964. S." in S. Singer. and A. S. vol. Hwang. 1976. Duncan. and H. L.J. McFadden. Gronau." Economehzca 35: 1649. J. Kendall." Paper presented at the Central Region Meeting of the Institute of Mathematical Statistics." in P. Lippman. "Determinants of Marital Instability: A Cox Regression Model. New York: Hafner Press. New York: Academic Press. and P. 1973. 1970. Hoel. Montmarquette. T. M. Amsterdam: NorthHolland Publishing. 474. S. 1988. F.. Heckman. 335.: MIT Press. Mathematzcal Statistzcs. Structural Analysis of Discrete Data with Econometric Applications. and J. and S. New York: McGrawHill. 1958. 1967. 1978. 1980. M. E. 1964. McCall." International Economic Review 19: 415433. Goldfeld. and R. Durbin. E. 1973. "Unionism and Wage Rates: A Simultaneous Equations Model with Qualitative and Limited Dependent Variables. Goldfeld and R. R.. S." Advances z n Econometncs 1: 3595. H. Watson. 1981. Distributions in Statistics: Continuow Multivariate Distributions. Econometric Methods. G. 1984. Dudley. Heckman. "A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data." Econometnca 52: 345362. 1989. N. J. 1984. and G. New York John Wiley & Sons. E. New York: Academic Press. Prentice. 3rd ed.. Huang. "A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration. Minc. New York: Cambridge University Press." Annals of Mathematzcal Statistzcs 34: 447456. Foundationsfor Finann'al Economics.: Wadsworth Publishing. 1981. Continuous Univariate DistributionsI. 1969. Belmont. 1978. Amsterdam: NorthHolland Publishing. Lancaster. J. Wise. 57: 2151. Cambridge. New York: John Wiey & Sons. Litzenberger. pp. F."Jmrnal ofEconometrics 28: 85101. 1972. "Conditional Logit Analysis of Qualitative Choice Behavior. M. G. Boston: Prindle. 1987. Jr. pp." Economtnca 48: 839852. 1974. "Universal Domination and Stochastic Domination: Estimation Simultaneously under a Broad Class of Loss Functions."Journal of the Amenmencan Statistical Association 53: 457481. "The Impact of Training on the Frequency and Duration of Employment.. A. 1963." Journal of Ewnometncs. Mass. J.. 1982. McFadden. Frontiers in Econometrics.  . Mass. 1972. Weber & Schmidt. Jaffee. 1979. pp. StatisticalAnalysir ofFailure TimeData.. 105142." Econometnca 46: 403426." Econometrics 57: 9951026.. "Methods of Estimation for Markets in Disequilibrium " Ewnomehzca 40: 497514. 1972. R. Moffitt. 'Techniques for Estimating Swtching Regressions. A. Kalbfleisch.358 References References 359 Dubin. Studzes in Nonlinear Estzmatzon. 1988. Stuart. 1980. C.. 1954. M. Cambridge. G.. McFadden." Ewnomic Inquiry 14: 155189. M. A. J. 1993. Nonlinear Methods zn Economehzcs. eds. M. "Nonparametric Estimation from Incomplete Observations. M. Quandt. Flinn. and R L. "A Model of the Supply of Bilateral Foreign Aid. L. "The Effects of Children on the Household'sValue of Time. 198272. J. Manski and D. J. 1977." Econometnca 52: 271320.. 1984. "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences. A. L. 1976. C. Fair. "Models for the Analysis of Labor Force Dynamics. E. Lehrer. R. and R. "Choosing Variables in a Linear Regression: A Graphical Aid. Lee.. Hausman.. and D. Graybill. "Qualitative Methods for Analyzing Travel Behavior of Individuals: Some Recent Developments." Annals of Statistics 13: 295314. Introduction to Mathematical Statistics. Distributed Lags and Investment Analysis.
R.. 327.. 244. 349 DeGroot.. "Patents as Options: Some Estimates of the Value of Holding European Patent Stocks. A. 331 Chung. G. 1987. Welch. J. L ." Econometrica 41: 10931101. 330 Gauss. New York:John Wiley & Sons. 174 Box.. D. Tobin. K. W. K. T.338. Z.. 318 Fair. D . R. 338 Koyck. 337. "The Significance of the Difference between Two Means When the Population Variances Are Unequal. 1938. 338. 327. Pakes. L. . A... T. F... 333.. F . 1958... 330 Johnson. R. 138. Hiromatsu. 62 Birnbaum.. 339. 326 Lancaster. The Hague. J.. C. H. B. "A HeteroscedasticityConsistentCovariance Matrix Estimator and a Direct Test for Heteroscedasticity. 342 Lehrer. 347 Gronau... 1978. and T. R J. A n Introduction to Bayesian Inference in EcmomeCrics. 1973. "Reliability and Maintenance in European Nuclear Power Plants: A Structural Analysis of a Controlled Stochastic Process. J. P G. 327. Sawa. T. 342 Duncan. 342 Hausman. "Estimating a Structural Search Model: The Transition from School to Work. A. 341.. 2nd ed. T. J.. H. 257 Griliches. 299. L. 327.. 292. Central Planning Bureau. 257 Kalbfleisch. A. Robinson. T. White. N. L. P. 343 Kaplan. E. Silverman. 340. EconomicForecasts and Policy. "Repeated Least Squares Applied to Complete Equation Systems. 326 Grltz. Rao. 347. M. 155. R. C. New York: John Wiley & Sons. 314. Sturm. 349 Lippman. 323 Eicker. M." Econometrica 50: 2741. T.. H. J. Theil... R.. 336 Heckman." Ewnomehica 46: 121115.. 95 Fisher. G. 288. M. M. Linear Statistical Inference and Its Applications. 336. G... 330 Ferguson. 350 McFadden.. M. 1971. 343. 1980. B. S. 1982. D. M. London: C h a p man & Hall. 7. Serfling. A.. H. 2nd ed. K I. "On the Asymptotic Properties of Estimators of Models Containing Limited Dependent Variables. "Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model. 62." Ph." Econometrica 48: 817438. 317. 1953. S. J. E. E. 119 Hwang. F. F. 1973. 1961. M . 342 Durbtn. A. 342 Dudley. 27 Arrow. 949 Lee. J. W. T. L. 119 McCall. J. 252 Kotz. L. 310 Amemiya... 318 Graybill. R. T. J. 347 Fuller. A. J. J." Mimeographed paper. M . R.. A . C... C. P.360 References Olsen. D.. 401 Anderson. 1980.. Amsterdam: NorthHolland Publishing." Econometric~ 26: 2436. 310 Hoel. 1986. Stanford University. 349 Kendall. 89 Huang.. "Estimation of Relationships for Limited Dependent Variables.. S. B. H. 345.. New York John Wiley & Sons. 231 Goldfeld. 100 Cox. 169 Dubm. 257 Apostol. C. 350 Litzenherger.. 100.J. D." Econometrica 55: 801817. $$8 Johnston." Econometrica 54: 755784. A. 257. J. S. Wolpin.. "Minimax Regret Significance Points for a Preliminary Test in Regression Analysis. J. 1. dissertation. Approximate Theorems ofMathematica1 Statistics.. 118 Jaffee.. J. 155 Chenery. M. 1991. H. 347 Hiromatsu. L. S& . R .. D. Densig Estimation for Statistics and Data Analysis. 331 Bayes. R.. W. I NAME INDEX Akaike. 177 Flinn. 1 Bellman." Biometnka 29: 35C362. 257 Bernoulli.. Zellner. 1986.
329 Tobin.. 333 binomial distribution. 95 test. 339 Watson. H. 257 Minhas. D. 347 Solow.. 339 censoring: right. A. 104. 140... 178 BehrensF~sherproblem. 104 LindebergFeller. B. 247. 103 CES production function. 237.. 185 unbiased estimator. 318 estimation of. R. B. 295 bias. 252 White. R. C.. J. 343 Quandt. 184 Chebyshev's inequality. 350 Zellner. J. 326 baselime hazard function.. L. 341 S U B J E C T INDEX  distribution. J.362 Name Index Sawa. 128.. 116 Singer. 1.. R. H. 340 Pakes. 104 normality. 318 Rao.. 102. 92. 287 linear unb~ased predictor. 347 estimator. 124 linear unbiased estimator (BLUE). 104 autocorrelation of error: definit~on. G. 22 BoxCox transformation. 232 Cauchy distribution.. 331 change of variables.. B. M. C. 75. 257 Meier. T. W. 155 Mallows. 348 central limit theorems (CLT): Liapounov. 174 theorem. 344 Olsen. A. 336 Wolpin. 125 bias of sample variance.. 318 Wise... 132.. 359 Bellrnan equatlon. P... 318 higherorder. 309. 322 autoregressive process: firstorder. P. 100 Robinson. 100 Silverman. 310 Marcus. 292 . E. R. 310 Serfling.. 251.. 242 censored regression model... 347 Theil. 350 Bernoulli (binary) distribution.. S. 331 Moffitt. R. 74. 87 bivariate normal. G. 87 estimator (scalar case). L.. 252 Sturm. 144 mean. 95. S. A. 127 binary model. K. 132 variance... 276 linear predictor. M. 325 inversion of. H. 350 Prentlce. 80. Jr. 349 Montmarquette.. R.99 probability distribution. 323 Welch. 169 admissible and inadmissible estimator. 331 Stuart. 104 efficiency.. 348 left. 343 Minc. 169 Bayesian school. R. A. R. 321. L. 168. B. 244 LindebergLevy. 75. 123 estimator (vector case).. 295 predictor. 154 CauchySchwarz inequality. R. 243. 270 characteristics of test. 47 characteristic roots and vectors. M.. I. C. 349 Miller.
254. 326 Kronecker product. 8 concentrated loglikel~hood function. 47 in relation to zero covariance. 183 idempotent matrix. 332 disequilibrium model. 215. 6. 89 Poisson. 29 distribution between continuous and discrete random variables. 217 for equallty of three means. 87 Cauchy. 3032 integration: by change of variables. 92. 344 distribut~onfree method. 210. 356 F test for equality of two means. 316 generalized Wald test. 356 Student's t.100 correlation (coefficient). 260 Instrumental variables ( N ) estimator in distr~butedlag model. 78 confidence. 287 GaussNewton method. 306 In multiple regresslon. 329 fullrank matnx. 161162 consistent estlmator. 151 We~bull. 334 multlnomial. 276. 343 hazard function.d. 344 E. 314 hypothesis: null and alternat~ve. 263 Iterative methods: GaussNewton. 353 exponential. univariate. 3738 distribution. 112. 167168. 281 CobbDouglas product~on function. 334. 101 in probab~lity. 331 NewtonRaphson. 27 double. 338 HardyWeinberg dlstnbution. 230. 178 classical regression model. 26. 91. See also expected value expected value: of continuous random variable. 25 probability when conditioning event has zero probab~lity. 30 Interval estimation. 103. 165. 213. 12 between a pair of random variables. 351 Fdistribution. bivariate. 12. 261 GaussMarkov theorem. definition. defin~tion. 258 diagonalization of symmetric matrix. 118.36 variance. 65 exponential distribution. 43 distnbutions: Bernoull~(binary). 160162 confidence interval. 26. 229 multiple. 349 joint density. 330 estimator and estimate. convergence in drstribution. 323 error~nvariables model. 182 value. 354 Type I extremevalue. 344. 73 Cramer's rule. 300. 97. 7172. 77 probabdity. 33 by parts. 270 empirical d~stribut~on. 246 condlt~onal density. 157 hazard. 331 cofactor. 344 exponential model. 103. 7 exogenous variable. 343 baseline. 191 cross sectlon. 116 duration models. 64 of mixture random vanable. 268 Cram&Rao lower bound. mutual. 324 Laplace (double exponential) distnbution. 318 homoscedastic~ty. 270 discrete variables model. 135 normal: bivariate. 251. 354 density function.) random variables. 53 Jacoblan of transformation. 12 among more than two events: palrwlse. 327 Independent and identically distributed (i. 49. 28. 229 expectation. 343 DurbinWatson statistic. 110 Student's t. 167168. mutual. 112 independent variable. 22. 46 among bivariate random variables. 277 identity matrix. 116 distributionspecific method. 269 f t Subject lndex 365 chisquare d~str~bution. 61 of function of random variables. 124 test. 115 event. 95. 108 Koyck lag. 294 criter~a for ranklng estimators. 327 un~form. 229 error components model. 213. 260.i. 335 multinom~al logit. 29 dependent variable. 340 goodness of fit. 353 class~cal school. 70 rules of operation. 61. 114 endogenous variable. 99. standard. 39. 296 cr~tical region. 74 covariance. 347 declining aggregate. 98 independence from irrelevant alternatives. 154 logistic. 130. 4. 57 KaplanMeier estimator. 317 heteroscedasticityconsistent estlmator. 77. 100 in mean square.56 Jensen's inequality. 27. 38 among more than two random variables pairwise. 22. 329 integral: Riemann. 306 feas~ble generalued least squares (FGLS) estlmator. 154 chisquare. 326 distr~butionfunct~on. 346 heteroscedasticity. 349 Khinchine's law of large numbers. 261 combination.182 slmple and composite. 47 mean.364 Subject Index diagonal matrix. 114 image. 330. 87 binomial. 10. 35. 302 geometric lag. 137 Jacobian method. 63 of dlscrete random variable. 185 i i 1 independence between a pair of events. 154 . 230 cumulative distribution function. 138. 157 Laplace (doubleexponential) . 342 distr~butedlag model. 229 determinant of matrix. 317 firstorder autoregresslve process. 142 job search model. 99. 86. 218 for equality of variances. 23. 353 E. 229 Inner product. 338 HardyWeinberg. 331 generalized least squares (GLS) estimator. 323 eigen values and eigen vectors. 240 Gumbel's Type B blvariate extremgvalue distribut~on. 327 in simultaneous equations model. 1. multivariate. 302 for structural change. 318 full Information maxlmum l ~ k e l ~ h o o d estlmator. 132 constrained least squares estlmator.300. 356 Gumbel's Type B bivariate extremevalue. 259 inadmiss~ble estimator. 62 degrees of freedom: chisquare. 337 Tobit. See distr~butionfunction decislon making under uncertainty. 326 globally concave likelihood function: probit and logit. 66 rules of operation. 160 Inverse matrix. 86.
335 binomial. 4. 287 multinomial distribution. See selection of regressors orthogonal matrix. 144 !? 1 negative definite matrix. 185.283 finite sample properties. 316 least squares estimator of error variance: definition. 172 separation. 184 rank of matrix. 236. 103 linear independence. 34 probability. 143 multinomial. 288 risk averter and risk taker.25. 234. See also expected value mean squared error. 549 permutation. 169 space. least squares predictor pred~ction error. 22 multivariate: continuous. 122 mean squared error matrix. definition. 62 sample. discrete. 285. 39. 8 point estimation. 243. 70. 328 region of rejection.r . 95. 290 multiple regression model. 63 in estimation. 330. 135 models. 350 residual. 334 loss function: in estimation. 62 R'. 276 mean squared prediction error: population. best predictor. 174 likelihood ratio test: simple versus composite. 285 proportional hazard model. 189 I ' J * t i ? ~ ~ $ t y j & ance. 115 minimax strategy in decision making. 202 vector parameter. 137 consistency. 191 nonlinear least squares (NLLS) estimator. 329 Type 1 Tobit model. 113 density. 197. 270 orthogonality: definition. 275 negative semidefinite matrix. 137. 104 likelihood equation. 6768 most powerful test. 4. 326 multicollinearity. 285 power function.8183 of least squares predictor. 307 limit distribution. best linear unbiased predictor. 253 ridge estimator. 335 logit model. 6 limit (phm). 232. 268 normal distribution: bivariate. 323 population: definition. 260 between L S residual and regressor. 1 . 6 distribution. 212. 27. 114 positive definite matrix. 289 Pareto optimality. 92. See best linear predictor. 275 nested logit model. 275 nonsingular matrix. 136. 240. 113 moment. 229 logistic distribution. 201 binomial. 97 univariate. 342 uniform. 63 method of moments. 334 logit model. 99 linear combination. 68 correlation. 233. 29. 174 McFadden's blue busred bus example. 286 best linear unbiased. 2223. 302 structural change. 349 pvalue. 134 bivariate regresslon. 103. 322 level of test. 199. 243. 247. 112 Poisson distribution. 277. 247. discrete. 114 positive semidefinite matrix. 89 standard. 337 normal model. 25 randomized test. 133 likelihood principle. 281 multivariate normal distribution. 19 univariate: continuous. 182 onetail test. 191 partial maximum likelihood estimator. 144 binary model. 197 composite versus composite. 32 1 inversion of. 348 global versus local. 198.295 least squares residual. 4. 140 reservation wage. 182 regressor. 340 Type 2 Tobit model. 116 of baseline hazard function. 336 multiple regression. 112 St. 275 mean. 203 squared error. 275 covariance. 172 distribution. 61. 232. See likelihood equation null hypothesis. 110 pooling time series and cross section. 142. 124 mode. 265 linear regression. 143 likelihood function: continuous case. 104 LindebergLey's central limit theorem. 6' t probability. 341 Type 5 Tobit model. 293 normal. 101. definition.33 maximum likelihood estimator: asymptotic efficiency. 285. 190 probability: axioms. 169 probability. 334 projection matrix. 25 multivariate regression model. See also least squares residual reverse least squares estimator. 141 definihon: continuous case. 78 laws of large numbers (LLN). mixture. 201 normal. 113.187 moving average process. 91. 295 consistency. 75 prior density. 291 least squares predictor. I51 mean. 142 asymptotic variance. 187. 331 nonlinear regression model. 171 in hypothesis testing. 231. 196. 287 best unbiased.366 Subject Index asymptotic normality. 45. 21. 136 discrete case. 332 random variable: definition. 20 frequency interpretation. See independent variable regularity conditions. 281 Subject Index 367 law of iterated means. 275 nonparametric estimation. 334 normal equation. 296 median. 336 multiple correlation coefficient. 75. distribution. 68. 215 multiple regression. 137 NeymanPearson lemma. Petersbnrg paradox. 98 multivariate. 337 marginal density. 238. 347 nonpositive definite matrix.288 consistency. 190. 349 of unobserved heterogeneity. 62. 62 mean. 206 qualitative response model. 4. 338 NewtonRaphson method. 330 nonnegative definite matrix. 292 under general error covariances. 206 optimal significance level. 101 probit model. 113 posterior momen&.r .295 unconditional. 195 prediction. 136 simultaneous equations model. 245. 237. 63 moments. . 246 cornputahon. 103 least squares (LS) estimator: definition. 268 reduced form. 133 duration model. 127 r . 135 multinomial model. 290 asymptotic normality. 6667 bivariate: continuous. discrete. discrete case. 201 Liapounov's central limit theorem. 97 multivariate random variable. 75. 342 9 . 342.
SeeF test test of independence. 210 vector product.S) estimator. 118 structural change. 289 unemployment duration. 130. 345. 184.342 transformation estimator. 345. 324 transpose of matrix. 151 uniform prior. 196. 206 Type I extremevalue distribution. 183 unbiased estimator: definition. 308 serial correlation. 70 Slutsky's theorem.368 Subject Index Tobit model: Type 1. 118 unobserved heterogeneity. 177 uniformly most powerful (UMP) test. See classical regression model stationarity. See duration model symmetric matrix. 89 universal dominance. 199. 187. 336 in duration model. 328 Student's t distribution. 201 skewness. See generalized Wald test weak stationarity. 329 twctail test. time series. 344 weighted least squares estimator. 249 for structural change in bivariate regression. 339 twostage least squares (2SL.339 Type 2. 258. 307 selection of regressors. 68 normal distribution. 103. 327. 63. 318 significance level. 322 test statistic. 230. 55 survival model. 327 size of test. 91. 68 rules of operation. 209 in bivariate regression. 319 statistic. 354. 173. 207 in testing for difference of means of normal. 230 . 260 Wald test. 252. 239. 195196. 70. 115 stochastic dominance. 301 test for equality of variances. 203. 347 utility maximization in binary model. 258 trace of matrix. 274 truncated regression model. 330 support of density function. 73 variancecovariance matrix. 251 in multiple regression. 193 Theil's corrected R ~309 . 165. 349 variance: definition. 301. See also t statistic sufficient statistic. 317 Welch's method. 341 Type 5. 205 simultaneous equations model. 334 regression model. 135. 176 supply and demand model. 334 in multinomial model. 327 Type I and Type I1 errors. 349 uniform distribution. 270 t statistic in testing for mean of normal. 304 equations. 125 of variance. 319 Weibull model. 183. 102 standard deviation. 201 univariate normal distribution.