
Statistics 111 Updated lecture notes

Fall 2013

Warren J. Ewens, wewens@sas.upenn.edu, Room 324 Leidy Labs (corner 38th St and Hamilton Walk). These notes provide an outline of some of the material to be discussed in the first few lectures of STAT 111. They are important because they provide the framework for all the material given during the entire course. Also, much of this material is not given in the textbook for the course.

Introduction and basic ideas


What is Statistics?

Statistics is the science of analyzing data in whose generation chance has played some part. This explains why statistical methods are important in many areas, including for example sociology, psychology, biology, economics and anthropology. In these areas there are many chance mechanisms at work. First, in biology for example, the random transmission of one chromosome from a pair of chromosomes from parent to offspring introduces a chance mechanism into many areas of genetics. Second, in all the above areas data are usually derived from some random sample of individuals. A different sample would almost certainly yield different data, so the sampling process introduces a second chance element. Finally, in economics the values of quantities such as the Dow Jones industrial average cannot be predicted in advance, since they are affected by many chance events that we cannot know of in advance. Everyone knows that you cannot make much progress in such areas as physics, astronomy and chemistry without using mathematics. Similarly, you cannot make much progress in such areas as psychology, sociology and biology without using statistics. Because of the chance, or random, aspect of the generation of statistical data, it is necessary, in discussing statistics, to also consider aspects of probability theory. The syllabus for this course thus starts with an introduction to probability theory, and this is reflected in these introductory notes. But before discussing probability theory, we have to discuss the relation between probability theory and statistics.

The relation between probability theory and statistics


Most of the examples given in the class concern simple situations and are not taken from the sociological, psychological, etc. contexts. This is done so that the basic ideas of probability theory and statistics will not be confounded with the complexities arising in those areas. So we start here with a simple example concerning the flipping of a coin. Suppose that we have a coin that we suspect of being biased towards heads. To check up on this suspicion we flip the coin (say) 2,000 times and observe the number of heads that we get. If the coin is fair, we would, beforehand, expect to see about 1,000 heads. If, once we flipped the coin, we got 1,973 heads we would obviously (and reasonably) claim that we have very good evidence that the coin is biased towards heads. If you think about it, the reasoning that you went through in coming to this conclusion was something like this: "If the coin is fair it is extremely unlikely that I would get 1,973 heads from 2,000 flips. Thus since I did in fact get 1,973 heads, I have strong evidence that the coin is unfair." Equally obviously, if we got 1,005 heads, we would conclude that we do not have good evidence that the coin is biased towards heads. Again, the reason for coming to this conclusion is that a fair coin can easily give 1,005 (or more) heads from 2,000 flips.

But these are extreme cases, and reality often has to deal with more gray-area cases. What if we saw 1,072 heads? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,072 or more heads if the coin is fair. If this probability is low we might conclude that we have significant evidence that the coin is biased towards heads. If this probability is fairly high we might conclude that we do not have significant evidence that the coin is biased. The conclusion that we draw is an act of statistical inference, or a statistical induction. An inference, or an induction, is a conclusion that we draw about reality, based on some observation or observations. The reason why this is a statistical inference (or induction) is that it is based on a probability calculation. No statistical inference can be made without first making the relevant corresponding probability calculation. In the above example, probability theory calculations (which we will do later) show that the probability of getting 1,072 or more heads from 2,000 flips of a fair coin is very low (less than 0.01). Thus having observed 1,072 heads in our 2,000 flips, we would reasonably conclude that we have significant evidence that the coin is biased.

Here is a more important example. Suppose that we are using some medicine (the "current" medicine) to cure some illness. From long experience we know that, for any person having this illness, the probability that this current medicine cures the patient is 0.8. A new medicine is proposed as being better than the current one. To test whether this claim is justified we plan to conduct a clinical trial, in which the new medicine will be given to 2,000 people suffering from the disease in question. If the new medicine is as effective as the current one we would, beforehand, expect it to cure about 1,600 of these people. If after the clinical trial is conducted the proposed new medicine cured 1,945 people, no one would doubt that it is better than the current medicine. Again, the reason for this opinion is something like: "If the new medicine has the same cure rate as the current one, it is extremely unlikely that it would cure 1,945 people out of 2,000. But it did cure 1,945 people, and therefore I have significant evidence that its cure rate is higher than that of the current medicine." But, equally obviously, if the proposed medicine cured 1,615 people we do not have strong evidence that it is better than the current medicine. The reason for this is that if the new medicine is as effective as the current one, that is if the probability of a cure with the new medicine is the same (0.8) as that for the current medicine, we can easily observe 1,615 (or more) people cured with the new medicine.

Again these are extreme cases, and reality often has to deal with more gray-area cases. What if the new medicine cured 1,628 people? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,628 or more people cured with the new medicine if it is as effective as the current medicine. This probability is about 0.11, and because this is not a really small probability we might conclude that we do not have significant evidence that the new medicine is superior to the current one. Drawing this conclusion is an act of statistical inference.
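As a concrete illustration of the kind of probability calculation involved, here is a minimal sketch in Python (assuming the scipy library; it is not part of these notes, and the course itself uses JMP) that computes the tail probability for the coin example:

```python
# Probability of 1,072 or more heads in 2,000 flips of a fair coin.
from scipy.stats import binom

# binom.sf(k, n, p) gives P(X > k), so use k = 1071 for P(X >= 1072).
p_tail = binom.sf(1071, 2000, 0.5)
print(p_tail)  # about 0.0007, comfortably below 0.01
```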
Statistics is a difficult subject for two reasons. First, we have to think of the situation both before and after our experiment (in the medicine case the experiment is giving the new medicine to the individuals in the clinical trial), and go back and forth several times between these two time points in any statistical operation. This is not easy. Second, before the experiment we have to consider aspects of probability theory. Unfortunately our minds are not wired up well to think in terms of probabilities. (Think of the two fair coins example given in class, and also the Monty Hall situation.) The central point is this: no statistical operation can be carried out without considering the situation before the experiment is performed. Because, at this time point, we do not know what will happen in our experiment, these considerations involve probability calculations. We therefore have to consider the general features of the probability theory "before the experiment" situation and the relation between these aspects and the statistical "after the experiment" aspects. We will do this later, after first looking more closely at the relation between the deductive processes of probability and the inductive processes of statistics.

Deductions (implications) and inductions (inferences)

Probability theory is a deductive activity, and uses deductions (also called implications). It starts with some assumed state of the world (for example, that the coin is fair) and enables us to make various probability calculations relevant to our proposed experiment. Statistics is the converse, or inductive, operation, and uses inductions (also called inferences). It starts with data from our experiment and attempts to make objective statements about the unknown real world (whether the coin is fair or not). These inductive statements are always based on some probability calculation. The relation between probability and statistics can be seen from the following diagram:

    Some unknown reality, and a hypothesis about it
        |                                      ^
        |  Probability theory                  |  Statistical inference
        |  (deductive)                         |  (inductive)
        v                                      |
    Data (what is observed in an experiment), used to test this hypothesis

This diagram makes it clear that to learn how to conduct a statistical procedure we first have to discuss probability on its own. We now do this.

Probability Theory
Events and their probabilities

As has been discussed above, any discussion of Statistics requires a prior discussion of probability theory. In this section an introduction to probability theory is given as it applies to probabilities of events.
Events

An event is something which does or does not happen when some experiment is performed, field survey is conducted, etc. Consider for example a Gallup poll, in which (say) 2,000 people are asked, before an election involving two candidates, Smith and Jones, whether they will vote for Smith or Jones. Here are some events that could occur:

1. More people say that they will vote for Smith than say they will vote for Jones.
2. At least 1,200 people say that they will vote for Jones.
3. Exactly 1,124 people say that they will vote for Smith.

We will later discuss probability theory relating to Gallup polls. However, all the examples given below relate to events involving rolling a six-sided die, since that is a simple and easily understood situation. Here are some events that could occur in that context:

1. An even number turns up.
2. The number 3 turns up.
3. A 3, 4, 5 or a 6 turns up.

Clearly there are many other events that we could consider. Also, with two or more rolls of the die, we have events like "a 6 turns up both times on two rolls of a die", "in ten rolls of a die, a 3 never turns up", and so on.
Notation

We denote events by upper-case letters at the beginning of the alphabet, and also the letter S and the symbol ∅. So in the die-rolling example we might have:

1. A is the event that an even number turns up.
2. B is the event that the number 3 turns up.
3. C is the event that a 3, 4, 5 or a 6 turns up.

The letter S has a special meaning. In the die-rolling example it is the event that the number turning up is 1, 2, 3, 4, 5 or 6. In other words, S is the certain event: it comprises all possible outcomes that could occur. The symbol ∅ also has a special meaning. This is the so-called empty event. It is an event that cannot occur, such as rolling both an even number and an odd number in one single roll of a die. It is an important event when considering intersections of events - see below.
Unions, intersections and complements of events

Given a collection of events we can define various derived events. The most important of these are unions of events, intersections of events, and complements of events. These are defined as follows:

(i) Unions of events: If D and E are events, the union of D and E, written D ∪ E, is the event that either D, or E, or both occur. In the die-rolling example above, A ∪ B is the event that a 2, 3, 4 or 6 turns up, A ∪ C is the event that a 2, 3, 4, 5 or 6 turns up, and B ∪ C is the event that a 3, 4, 5 or a 6 turns up. (Notice that in this case B ∪ C is the same as C.)

(ii) Intersections of events: If D and E are events, the intersection of D and E, written D ∩ E, is the event that both D and E occur. In the die-rolling example above, A ∩ B is the empty event ∅, since A and B cannot both occur, A ∩ C is the event that a 4 or a 6 turns up, and B ∩ C is the event that the number 3 turns up. (Notice that in this case B ∩ C is the same as B.)

(iii) Complements of events: If D is an event, D^c is the complementary event to D. It is the event that D does not occur. In the three examples above, A^c is the event that an odd number turns up, B^c is the event that some number other than 3 turns up, and C^c is the event that a 1 or a 2 turns up.
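These definitions translate directly into code. Here is a minimal Python sketch (an outside illustration, with names chosen to match the die-rolling example above), in which unions, intersections and complements become set operations:

```python
# The die-rolling events as Python sets.
S = {1, 2, 3, 4, 5, 6}   # the certain event
A = {2, 4, 6}            # an even number turns up
B = {3}                  # the number 3 turns up
C = {3, 4, 5, 6}         # a 3, 4, 5 or 6 turns up

print(A | B)   # union A with B: {2, 3, 4, 6}
print(B | C)   # union B with C: {3, 4, 5, 6}, the same as C
print(A & B)   # intersection of A and B: set(), the empty event
print(A & C)   # intersection of A and C: {4, 6}
print(S - C)   # complement of C: {1, 2}
```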

Probabilities of events

The concept of a probability is quite a complex one. These complexities are not discussed here: we will be satisfied with a straightforward intuitive concept of probability as in some sense meaning a long-term frequency. Thus we would say, for a fair coin, that the probability of a head is 1/2, in the sense that we think that in a very large number of flips of this coin, we will get a head almost exactly half the time. We are interested here in the probabilities of events, and we write the probability of the event A as P(A), the probability of the event B as P(B), and so on.

Probabilities of derived events

Suppose that the die in the die-rolling example is fair. Then the probabilities of the various union and intersection events discussed above are as follows:

P(A) = 1/2,        P(B) = 1/6,        P(C) = 2/3,
P(A ∪ B) = 2/3,    P(A ∪ C) = 5/6,    P(B ∪ C) = 2/3,
P(A ∩ B) = 0,      P(A ∩ C) = 1/3,    P(B ∩ C) = 1/6.

Notice that the probability of the empty event is 0. The probabilities for the union and the intersection of two events are linked by the following equation. If D and E are any two events,

P(D ∪ E) = P(D) + P(E) − P(D ∩ E).

To check this equation in one case, we note that P(A ∪ C) = 5/6, and this is given by P(A) + P(C) − P(A ∩ C) = 1/2 + 2/3 − 1/3 = 5/6. The other examples can be checked similarly. It is also always true that for any event D, P(D^c) = 1 − P(D). This is obvious: the probability that D does not occur is 1 minus the probability that D does occur.


Independence of events

Two events D and E are said to be independent if P(D ∩ E) = P(D) × P(E). It can be seen from the calculations given above that A and B are not independent, B and C are not independent, but A and C are independent. The intuitive meaning of independence is that if two events are independent, and you are told that one of them has occurred, then this information does not change the probability that the other event occurs. Thus in the above example, if you are given the information that an even number turned up (event A), then the probability that a 3, 4, 5 or a 6 turns up (event C) is still 2/3, which is the probability of the event C without this information being given. Similarly, if you are told that the number that turned up was a 3, 4, 5 or 6, then the probability that an even number turned up is still 1/2. (The calculations confirming this are given in the next section.) The calculations above assume that the die is fair. For an unfair die we might reach a different conclusion from the one that we reach for a fair die. For example, if the die is biased, so that the probabilities for a 1, 2, 3, 4, 5 or 6 turning up are, respectively, 0.1, 0.3, 0.1, 0.2, 0.2 and 0.1, then P(A) = 0.6, P(C) = 0.6 and P(A ∩ C) = 0.3. Since 0.6 × 0.6 = 0.36 ≠ 0.3, the events A and C are now not independent.
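This independence check is easy to verify numerically. A small Python sketch (the helper prob() is hypothetical, introduced here only for illustration):

```python
# Verifying independence of A and C for the fair and the unfair die.
A = {2, 4, 6}
C = {3, 4, 5, 6}
fair = {face: 1/6 for face in range(1, 7)}
unfair = {1: 0.1, 2: 0.3, 3: 0.1, 4: 0.2, 5: 0.2, 6: 0.1}

def prob(event, dist):
    """P(event) = sum of the probabilities of the faces in the event."""
    return sum(dist[face] for face in event)

for name, die in [("fair", fair), ("unfair", unfair)]:
    lhs = prob(A & C, die)             # P(A and C)
    rhs = prob(A, die) * prob(C, die)  # P(A) times P(C)
    print(name, lhs, rhs)  # fair: 1/3 and 1/3; unfair: 0.3 and 0.36
```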

Conditional probabilities

We often wish to calculate the probability of some event D, given that some other event E has occurred. Such a probability is called a conditional probability, and is denoted P(D|E). The conditional probability P(D|E) is calculated by the formula

P(D|E) = P(D ∩ E) / P(E).     (1)

It is essential to calculate P(D|E) using this formula: using any other approach, and in particular using common sense, will usually give an incorrect answer. If the events D and E are independent, then P(D|E) = P(D). In other words, D and E are independent if the knowledge that E has occurred does not change the probability that D occurs. In the fair die example given above, equation (1) shows that P(A|C) = (1/3)/(2/3) = 1/2, and this is equal to P(A). This confirms that A and C are independent (for a fair die). In the unfair die example given above, equation (1) shows that P(A|C) = 0.3/0.6 = 0.5, and this is not equal to P(A), which is 0.6. This confirms that for this unfair die A and C are not independent.
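Formula (1) can be checked the same way; a self-contained sketch for the fair die:

```python
# P(A|C) = P(A and C) / P(C) for the fair die, as in equation (1).
fair = {face: 1/6 for face in range(1, 7)}
A, C = {2, 4, 6}, {3, 4, 5, 6}
p = lambda event: sum(fair[face] for face in event)

print(p(A & C) / p(C))  # 0.5: equal to P(A), so A and C are independent
```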

Probability: One Discrete Random Variable


Random variables and data

In this section we define some terms that we will use often. We do this in terms of the coin-flipping example, but the corresponding definitions for other examples are easy to imagine. Before we flip the coin, the number of heads that we will get is unknown to us. This number is therefore called a random variable. It is a "before the experiment" concept. By the word data we mean the observed value of a random variable once some experiment is performed. In the coin example, once we have flipped the coin the data is simply the number of heads that we did get. It is the observed value of the random variable once the experiment of flipping the coin is carried out. It is an "after the experiment" concept. To assist us with keeping the distinction between random variables and data clear, and as a matter of notational convention, a random variable (a "before the experiment is carried out" concept) is always denoted by an upper-case Roman letter. We use the upper-case letter X in these notes for this purpose. It is a concept of our mind - at this stage we do not know its value. In the coin example the random variable X is the concept of our mind "the number of heads we will get, tomorrow, when we flip the coin". The notation for the "after the experiment is done" data is the corresponding lower-case letter. So after we have flipped the coin we would denote the number of heads that we did get by the corresponding lower-case letter x. Thus it makes sense, after the coin has been flipped, to say x = 1,142. It does not make sense before the coin is flipped to say X = 1,142. This second statement does not compute.

There are therefore two notational conventions that we always use: upper-case Roman letters for random variables, lower-case Roman letters for data. We will later find a third notational convention (for parameters).



Definition: one discrete random variable

It is convenient to consider separately the cases of discrete and continuous random variables. In this section we give informal definitions for discrete random variables and their probability distributions rather than the formal definitions often found in statistics textbooks. Continuous random variables will be considered in a later section. A discrete random variable is a numerical quantity that, in some future experiment involving some degree of randomness, will take one value from some discrete set of possible values. These possible values are usually known before the experiment: in the coin example the possible values of X, the number of heads that will turn up, tomorrow, when we will flip the coin 2,000 times, are clearly 0, 1, 2, 3, ..., 2,000. In practice the possible values of a discrete random variable often consist of the numbers 0, 1, 2, 3, ..., k, for some number k.

The probability distribution of a discrete random variable; parameters

The probability distribution of a discrete random variable X is a listing of the possible values that this random variable can take, together with their respective probabilities. If there are k possible values of X, namely v_1, v_2, ..., v_k, with respective probabilities P(v_1), P(v_2), ..., P(v_k), this probability distribution can be written generically as

Possible values of X:        v_1      v_2      ...      v_k
Respective probabilities:    P(v_1)   P(v_2)   ...      P(v_k)     (2)

In some cases we know (or hypothesize) the probabilities of the possible values v_1, v_2, ..., v_k. For example, if in the coin example we know that the coin is fair, the probability distribution of X, the number of heads that we would get on two flips of the coin, is:

Possible values of X:        0      1      2
Respective probabilities:    .25    .50    .25     (3)

Here P(0) = .25, P(1) = .50, P(2) = .25. More generally, if the probability of getting a head on any flip is some value θ, the probability distribution of X is:

Possible values of X:        0           1            2
Respective probabilities:    (1 − θ)²    2θ(1 − θ)    θ²     (4)

In this case,

P(0) = (1 − θ)²,    P(1) = 2θ(1 − θ),    P(2) = θ².     (5)

Here θ is a so-called parameter: see more on these below. The probability distribution (5) can be generalized to the case of an arbitrary number of flips of the coin - see (6) below.

The binomial distribution

There are many important discrete probability distributions that arise often in the applications of probability and statistics to real-world problems. Each one of these distributions is appropriate under some collection of requirements specific to that distribution. Here we focus only on the most important of these distributions, namely the binomial distribution, and consider first the requirements for it to be appropriate. The binomial distribution arises if, and only if, all four of the following requirements hold. First, we plan to conduct some fixed number n of trials. (By "fixed" we mean fixed in advance, and not, for example, determined by the outcomes of the trials as they occur.) Second, there must be exactly two possible outcomes on each trial. The two outcomes are often called, for convenience, "success" and "failure". (Here we might regard getting a head on the flip of a coin as a success and a tail as a failure.) Third, the various trials must be independent - the outcome of any trial must not affect the outcome of any other trial. Finally, the probability of success must be the same on all trials. One must be careful when using a binomial distribution to check that all four of these conditions hold. We reasonably believe that these conditions hold when flipping a coin. We often denote the probability of success on each trial by θ, since in practice this is often unknown. That is, it is a parameter. The random variable of interest is the total number X of successes in the n trials. The probability distribution of X is given by the (binomial distribution) formula

P(x) = C(n, x) θ^x (1 − θ)^(n−x),    x = 0, 1, 2, ..., n,     (6)

where C(n, x) is the binomial coefficient, discussed next.

The binomial coefficient C(n, x) is often spoken as "n choose x": it is the number of different orderings in which x successes can arise in the n trials. The factor 2 in (5) is an example of a binomial coefficient, reflecting the fact that there are two orders (success followed by failure, and failure followed by success) in which we can obtain one success and one failure in two trials. In the expression (6), θ is the parameter, and n is called the index, of the binomial distribution. The probabilities in (5) are binomial distribution probabilities for the case n = 2, and can be found from (6) by putting n = 2 and considering the respective values x = 0, 1 and 2.
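Formula (6) can be written out directly in code; a sketch using only Python's standard library:

```python
# The binomial probability formula (6).
from math import comb

def binom_pmf(x, n, theta):
    """P(x successes in n trials, success probability theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

# With n = 2 and theta = 0.5 this reproduces the fair-coin table in (3):
for x in range(3):
    print(x, binom_pmf(x, 2, 0.5))  # 0.25, 0.5, 0.25
```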


Parameters

The quantity θ introduced above is a parameter. In general a parameter is some unknown constant. In the binomial case it is the unknown probability θ of success in (6). Almost all of Statistics consists of:

(i) Estimating the value of a parameter.
(ii) Giving some idea of the precision of our estimate of a parameter (sometimes called the margin of error).
(iii) Testing hypotheses about the value of a parameter.

We shall consider these three activities later in the course. In the coin example, these would be:

(i) Estimating the value of the binomial parameter θ.
(ii) Giving some idea of the precision of our estimate of this parameter.
(iii) Testing hypotheses about the numerical value of this parameter, for example testing the hypothesis that θ = 1/2.

The mean of a discrete random variable


The mean of a random variable is often confused with the concept of an average, and it is important to keep a clear distinction between the two concepts. The mean of the discrete random variable X whose probability distribution is given in (2) above is defined as

v_1 P(v_1) + v_2 P(v_2) + ... + v_k P(v_k).     (7)

In more mathematical notation this is

Σ_{i=1}^{k} v_i P(v_i),     (8)

the summation being over all possible values (v_1, v_2, ..., v_k) that the random variable X can take. As an example, the mean of a random variable having the binomial distribution (6) is

Σ_{x=0}^{n} x C(n, x) θ^x (1 − θ)^(n−x),     (9)

and this can be shown, after some algebra, to be nθ. As a second example, consider the (random) number (which we denote by X) to turn up when a die is rolled. The possible values of X are 1, 2, 3, 4, 5 and 6. If the die is fair, each of these values has probability 1/6. Application of equation (7) shows that the mean of X is

1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + 5 × 1/6 + 6 × 1/6 = 3.5.     (10)

Suppose on the other hand that the die is unfair, and that the probability distribution of the (random) number X to turn up is:

Possible values of X:        1      2      3      4      5      6
Respective probabilities:    0.15   0.25   0.10   0.15   0.30   0.05     (11)

In this case the mean of X is

1 × 0.15 + 2 × 0.25 + 3 × 0.10 + 4 × 0.15 + 5 × 0.30 + 6 × 0.05 = 3.35.     (12)

There are several important points to note about the mean of a discrete random variable:

(i) The notation μ is often used for a mean. In many practical situations the mean of a discrete random variable X is unknown to us, because we do not know the numerical values of the probabilities P(x). That is to say, μ is a parameter, and this is why we use Greek notation for it. As an example, if in the binomial distribution case we do not know the value of the parameter θ, then we do not know the value (= nθ) of the mean of that distribution.

(ii) The mean of a probability distribution is its center of gravity, that is, its knife-edge balance point.

(iii) Testing hypotheses about the value of a mean is perhaps the most important of statistical operations. An important example of tests of hypotheses about means is a t test. Different t tests will be discussed in this course.

(iv) The word "average" is not an alternative for the word "mean", and has a quite different interpretation from that of "mean". This distinction will be discussed often in class.
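A short Python sketch (an outside illustration) computing the means (10) and (12) directly from definition (7):

```python
# The mean of a discrete random variable, from definition (7).
faces = [1, 2, 3, 4, 5, 6]
fair = [1/6] * 6
unfair = [0.15, 0.25, 0.10, 0.15, 0.30, 0.05]

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

print(mean(faces, fair))    # 3.5, as in (10)
print(mean(faces, unfair))  # 3.35, as in (12)
```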

The variance of a discrete random variable


A quantity of importance equal to that of the mean of a random variable is its variance. The variance (denoted by σ²) of the discrete random variable X whose probability distribution is given in (2) above is defined by

σ² = (v_1 − μ)² P(v_1) + (v_2 − μ)² P(v_2) + ... + (v_k − μ)² P(v_k).     (13)

In more mathematical terms we write this as

σ² = Σ_{i=1}^{k} (v_i − μ)² P(v_i),     (14)

the summation being taken over all possible values of the random variable X. In the case of a fair die, we have already calculated (in (10)) the mean of X, the (random) number to turn up on a roll of the die, to be 3.5. Application of (13) shows that the variance of X is

σ² = (1 − 3.5)² × 1/6 + (2 − 3.5)² × 1/6 + (3 − 3.5)² × 1/6 + (4 − 3.5)² × 1/6 + (5 − 3.5)² × 1/6 + (6 − 3.5)² × 1/6 = 35/12.     (15)

There are several important points to note about the variance of a discrete random variable:

(i) The variance has the standard notation σ², anticipated above.

(ii) The variance is a measure of the dispersion of the probability distribution of the random variable around its mean. Thus a random variable with a small variance is likely to be close to its mean (see Figure 1).

[Figure 1: two probability distributions with the same mean, one with smaller variance, one with larger variance.]

(iii) A quantity that is often more useful than the variance of a probability distribution is the standard deviation. This is defined as the positive square root of the variance, and (naturally enough) is denoted by σ.

(iv) The variance, like the mean, is often unknown to us. This is why we denote it by a Greek letter.

(v) The variance of the number of successes in the binomial distribution (6) can be shown, after some algebra, to be nθ(1 − θ).
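A sketch (again Python, not part of the notes) computing the fair-die variance (15) from definition (13), with a check of note (v) in a small binomial case:

```python
# The variance of a discrete random variable, from definition (13).
faces = [1, 2, 3, 4, 5, 6]
mu = sum(v / 6 for v in faces)                 # the mean, 3.5
var = sum((v - mu) ** 2 / 6 for v in faces)    # the variance
print(var, 35 / 12)                            # both 2.9166...

# Note (v) for n = 2, theta = 0.5: variance = n*theta*(1-theta) = 0.5.
probs = {0: 0.25, 1: 0.50, 2: 0.25}            # the distribution in (3)
mu_b = sum(x * p for x, p in probs.items())    # 1.0 (= n*theta)
print(sum((x - mu_b) ** 2 * p for x, p in probs.items()))  # 0.5
```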


Many Random Variables


Introduction

Almost every application of statistical methods in psychology, sociology, biology and similar areas requires the analysis of many observations. For example, if a psychologist wanted to assess the effects of sleep deprivation on the time needed to answer the questions in a questionnaire, he/she would want to test a fairly large number of people in order to get reasonably reliable results. Before this experiment is performed, the various times that the people in the experiment will need to answer the questions are all random variables. In line with the approach in this course, ideas about many observations will often be discussed in the simple case of rolling a die (fair or unfair) many times. Here the observations are the numbers that turn up on the various rolls of the die. If we wish to test whether this die is fair, we would plan to roll it many times, and thus plan to get many observations, before making our assessment. As with the sleep deprivation example, before we actually roll the die the numbers that will turn up on the various rolls are all random variables. To assess the implications of the numbers which do, later, turn up when we get around to rolling the die, and of the times needed in the sleep deprivation example, we have to consider the probability theory for many random variables.

Notation

Since we are now considering many random variables, the notation X for one single random variable is no longer sufficient for us. We denote the first random variable by X_1, the second by X_2, and so on. Suppose that in the die example we denote the planned number of rolls by n. We would then denote the (random) number that will turn up on the first roll of the die by X_1, the (random) number that will turn up on the second roll by X_2, ..., and the (random) number that will turn up on the n-th roll by X_n. As with a single random variable (see the notational conventions above), we need a separate notation for the actual observed numbers that did turn up once the die was rolled (n times). We denote these by x_1, x_2, ..., x_n. To assess (for example) whether we can reasonably assume that the die is fair we would use these numbers, but we would also have to use the theory of the n random variables X_1, X_2, ..., X_n. As in the case of one random variable, a statement in the die example like "X_6 = 4" makes no sense. It does not compute. On the other hand, once the die has been rolled, the statement "x_6 = 4" does make sense. It means that a 4 turned up on the sixth roll of the die. In the sleep example, a statement like "x_6 = 23.7" also makes sense. It means that the time that the sixth person in the experiment took to complete the questionnaire was 23.7 minutes. By contrast, before the experiment was conducted, the time X_6 that the sixth person would take to complete the questionnaire was unknown. It was a random variable. Thus a statement like "X_6 = 23.7" does not make sense.


Independently and identically distributed random variables

The die example introduces two important concepts. We would reasonably assume that X_1, X_2, ..., X_n all have the same probability distribution, since it is the same die that is being rolled each time. For example, we would assume that the probability that a three turns up on roll 77 (whatever it might be) is the same as the probability that a three turns up on roll 144. Further, we would also reasonably assume that the various random variables X_1, X_2, ..., X_n are all independent of each other. That is, we would reasonably assume that the value of any one of these would not affect the value of any other one. Whatever number turned up on roll 77 has no influence on the number turning up on roll 144. Random variables which are independent of each other, and which all have the same probability distribution, are said to be iid (independently and identically distributed). This concept is discussed again below. The assumptions that the various random variables X_1, X_2, ..., X_n are all independent of each other, and that they all have the same probability distribution, are often made in the application of statistical methods. However, in areas such as psychology, sociology and biology that are more scientifically important and complex than rolling a die, the assumption of independently and identically distributed random variables might not be reasonable. Thus if twin sisters were used in the sleep deprivation example, the times that they take to complete the questionnaire might not be independent, since we might expect them to be quite similar because of the common environment and genetic make-up of the twins. If the people in the experiment were not all of the same age it might not be reasonable to assume that the times needed are identically distributed - people of different ages might perhaps be expected to tend to need different amounts of time. Thus in practice care must often be exercised and common sense used when applying the theory of iid random variables in areas such as psychology, sociology and biology.

The mean and variance of a sum and of an average

Given n random variables X_1, X_2, ..., X_n, two very important derived random variables are their sum, denoted by T_n and defined as

T_n = X_1 + X_2 + ... + X_n,     (16)

and their average, denoted by X̄ and defined by

X̄ = (X_1 + X_2 + ... + X_n)/n = T_n/n.     (17)

Since both T_n and X̄ are functions of the random variables X_1, X_2, ..., X_n, they are themselves random variables. In the die example we do not know, before we roll the die, what the sum or the average of the n numbers that will turn up will be. Both the sum and the average, being random variables, each have a probability distribution, and thus each has a mean and a variance. These must be related in some way to the mean and the variance of each of X_1, X_2, ..., X_n. The general theory of many random variables shows that if X_1, X_2, ..., X_n are iid, with (common) mean μ and (common) variance σ², then the mean and the variance of the random variable T_n are, respectively,

mean of T_n = nμ,        variance of T_n = nσ²,     (18)

and the mean and the variance of the random variable X̄ are given respectively by

mean of X̄ = μ,           variance of X̄ = σ²/n.     (19)

In STAT 111 we will call these four formulas the "four magic formulas" and will refer to them often. Thus you have to know them by heart. Equations (18) and (19) apply of course in the particular case n = 2. However in this case two further equations are important. If we define the random variable D by D = X_1 − X_2 (think of D standing for "difference"), then

mean of D = 0,           variance of D = 2σ².     (20)

These are also magic formulas, and we will refer to them several times, especially when making comparison studies. Thus you also have to know these two formulas by heart.

Two generalizations

More generally, suppose that X_1, X_2, ..., X_n are independent random variables with respective means μ_1, μ_2, ..., μ_n and respective variances σ_1², σ_2², ..., σ_n². Then

mean of T_n = μ_1 + μ_2 + ... + μ_n,        variance of T_n = σ_1² + σ_2² + ... + σ_n²,     (21)

and

mean of X̄ = (μ_1 + μ_2 + ... + μ_n)/n,      variance of X̄ = (σ_1² + σ_2² + ... + σ_n²)/n².     (22)

The formulas in (18) and (19) are, respectively, special cases of these formulas. Next, the generalization of the formulas in (20) is that D_ij, defined by D_ij = X_i − X_j, is a random variable and that

mean of D_ij = μ_i − μ_j,        variance of D_ij = σ_i² + σ_j².     (23)

The formulas in (20) are, respectively, special cases of these formulas.

An example of the use of equations (19)

In the case of a fair die, each X_i has mean 3.5 and variance 35/12, as given by (10) and (15), and thus standard deviation √(35/12), or about 1.708. On the other hand if n, the number of rolls, is 1,000, the variance of X̄ = (X_1 + X_2 + ... + X_1000)/1,000 is, from the second equation in (19), 35/12,000. Therefore the standard deviation of X̄ is √(35/12,000), or about 0.0540. This small standard deviation implies that once we roll the die 1,000 times, it is very likely that the observed average of the numbers that actually turned up will be very close to 3.5. This is no more than what intuition suggests. We will later do a JMP experiment to confirm this. Later we will see that if the die is fair, the probability that the observed average x̄ is between the mean of X̄ minus two standard deviations of X̄ (that is, 3.5 − 2 × 0.0540 = 3.392) and the mean of X̄ plus two standard deviations of X̄ (that is, 3.5 + 2 × 0.0540 = 3.608) is about 95%. This statement is one of probability theory. It is an implication, or deduction. So here is a window into Statistics. Suppose we have now rolled the die 1,000 times, and the observed average x̄ is 3.382. This is outside the range 3.392 to 3.608, the range within which the average is about 95% likely to lie if the die is fair. Then we have good evidence that the die is not fair. This claim is an act of Statistics. It is an inference, or induction. We will later make many statistical inferences, all of which will be based on the relevant corresponding probability theory calculation.

The proportion of successes in n binomial trials

The random variable in the binomial distribution is the number of successes in n binomial trials, with probability distribution given in (6). In some applications it is necessary to consider instead the proportion of successes in these trials (more exactly, the proportion of trials leading to success). If X is the number of successes in n binomial trials, then this proportion is X/n, which we will denote by P. P is a discrete random variable, and its possible values are 0, 1/n, 2/n, ..., (n−1)/n, 1. It has a probability distribution which can be found from the binomial distribution (6), since the probability that P = i/n is the same as the probability that X = i for any value of i. Because P is a random variable it has a mean and variance. These are

mean of P = θ,        variance of P = θ(1 − θ)/n.     (24)

These equations bear a similarity to the formulas for the mean and variance of an average given in (19). We will see later, when testing for the equality of two binomial parameters, why it is often necessary to operate via the proportion of trials giving success rather than by the number of trials giving success.
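A simulation sketch (assuming the numpy library; an outside illustration, since the course uses JMP) of the formulas in (19) for the die example: the average of 1,000 fair rolls should have mean 3.5 and standard deviation about 0.0540.

```python
# Simulate many experiments of 1,000 fair-die rolls and look at the
# distribution of the average, checking the "magic formulas" (19).
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=(10_000, 1_000))  # faces 1..6
averages = rolls.mean(axis=1)

print(averages.mean())  # close to the mean 3.5
print(averages.std())   # close to sqrt((35/12)/1000), about 0.0540
print(np.mean((averages > 3.392) & (averages < 3.608)))  # about 0.95
```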


The standard deviation and the standard error

In the die example in the previous section the standard deviation of X̄ = (X_1 + X_2 + ... + X_1000)/1,000 is about 0.0540. The standard deviation of an average such as this is sometimes called the "standard error of the mean". (This terminology is unfortunate and causes much confusion - it should be "the standard deviation of the average". How can a mean have a standard deviation? A mean is a parameter, and only random variables can have a standard deviation.) Many textbooks use this unfortunate terminology. Watch out for it.

Means and averages

It is crucial to remember that a mean and an average are two entirely different things. (The textbook, and other textbooks, are sometimes not too good on making this distinction.) A mean is a parameter, that is, some constant number whose value is often unknown to us. For example, with an unfair die for which the probabilities for the number to turn up on any roll are unknown, the mean of the number to turn up is unknown. It is a parameter which we might wish to estimate or test hypotheses about. We will always denote a mean by the Greek letter μ. By contrast, an average as defined above (i.e. X̄) is a random variable. It has a probability distribution and thus has a mean and a variance. Thus it makes sense to say (as (19) and (22) say) "the mean of the average is such and such". There is also a second concept of an average, and this was already referred to in the die-rolling example above. This is the actual average x̄ of the numbers that actually turned up once the 1,000 rolls were completed. This is a number, for example 3.382. You can think of this as the realized value of the random variable X̄ once the rolling had taken place. Thus there are three related concepts: first a mean (a parameter), second a "before the experiment" average X̄ (a random variable, and a concept of probability theory), and third an "after the experiment" average x̄ (a calculated number, and a concept of Statistics). They are all important and must not be confused with each other. Why do we need all three concepts? Suppose that we wish to estimate a mean (first concept), or to test some hypothesis about a mean (for example, that it is 3.5). We would do this by using the third concept, the "after the experiment" observed average x̄. How good an estimate of the mean x̄ is, or what hypothesis-testing conclusion we might draw given the observed value of x̄, both depend on the properties of the random variable X̄ (the second concept), in particular its mean and variance.

Continuous Random Variables


Definition

Some random variables by their very nature are discrete, such as the number of heads in 2,000 flips of a coin. Other random variables, by contrast, are continuous. Continuous random variables can take any value in some continuous range of values. Measurements such as height and blood pressure are of this type. Here we denote the range of a continuous random variable by (L, H) (L = lowest possible value, H = highest possible value of the continuous random variable), and use this notation throughout. Probabilities for continuous random variables are not allocated to specific values, but rather are allocated to ranges of values. The probability that a continuous random variable takes some specified numerical value is zero. We use the same notation for continuous random variables as we do for discrete random variables, so that we denote a continuous random variable in upper case, for example by X. Every continuous random variable X has an associated density function f(x). The density function f(x) is the continuous random variable analogue of a discrete random variable probability distribution such as (6). This density function can be drawn as a curve in the (x, f(x)) plane. (Examples will be given in class.) The probability that the random variable X takes a value in some given range a to b is the area under this curve between a and b. From a calculus point of view (for those who have a good calculus background) this probability is obtained by integrating this density function over the range a to b. That is, the probability that the (continuous) random variable X having density function f(x) takes a value between a and b (with a < b) is given by

P(a < X < b) = ∫_a^b f(x) dx.     (25)

Because the probability that a continuous random variable takes some specified numerical value is zero, the three probabilities P(a ≤ X < b), P(a < X ≤ b), and P(a ≤ X ≤ b) are also given by the right-hand side in (25). As a particular case of equation (25),

∫_L^H f(x) dx = 1.     (26)

This equation simply states that a random variable must take some value in its range of possible values. For those who do not have a calculus background, don't worry - we will never do any of these integration procedures.
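For those who do want to see (25) and (26) in action, here is a sketch (assuming the scipy library) using the simplest possible density, the uniform density f(x) = 1 on the range (0, 1):

```python
# Numerical check of (25) and (26) for a Uniform(0, 1) density.
from scipy.integrate import quad

f = lambda x: 1.0              # density: f(x) = 1 on (0, 1)
prob, _ = quad(f, 0.2, 0.7)    # P(0.2 < X < 0.7), as in (25)
total, _ = quad(f, 0.0, 1.0)   # integral over the whole range, as in (26)
print(prob, total)             # 0.5 and 1.0
```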


The mean and variance of a continuous random variable

The mean μ and variance σ² of a continuous random variable X having range (L, H) and density function f(x) are defined respectively by

μ = ∫_L^H x f(x) dx     (27)

and

σ² = ∫_L^H (x − μ)² f(x) dx.     (28)

Again, if you do not have a calculus background, don't worry about it. We will never do any of these integration procedures. The main thing to remember is that these definitions are the natural analogues of the corresponding definitions for a discrete random variable: the mean is the center of gravity, or knife-edge balance point, of the density function f(x), and the variance is a measure of the dispersion, or "spread-out-ness", of the density function around the mean. (Examples will be given in class.) Also, the remarks about the mean and the variance of a continuous random variable are very similar to those for a discrete random variable given above. In particular we denote a mean by μ and a variance by σ². In a research context the mean μ and the variance σ² of the random variable of interest are often unknown to us. That is, they are parameters, as is indicated by the Greek notation that we use for them. Many statistical procedures involve estimating, and testing hypotheses about, the mean and the variance of continuous random variables.

The normal distribution

There are many continuous probability distributions relevant to statistical operations. We discuss the most important one in this section, namely the normal, or Gaussian, distribution. The (continuous) random variable X has a normal, or Gaussian, distribution if its range (i.e. set of possible values) is (−∞, +∞) and its density function f(x) is given by

f(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)).     (29)

The shape of this density function is the famous (infamous?) bell-shaped curve. (Here π is the well-known geometrical value of about 3.1416 and e is the equally important exponential constant of about 2.7183.) It can be shown that the mean of this normal distribution is μ and its variance σ², and these parameters are built into the functional form of the distribution, as (29) shows. A random variable having this distribution is said to be an N(μ, σ²) random variable. As stated above, the probability that a continuous random variable takes a value between a and b is found by a calculus operation, which gives the area under the density function of the random variable between a and b. Thus for example the probability that a random variable having a normal distribution with mean 6 and variance 16 takes a value between 5 and 8 is

∫_5^8 (1/√(2π × 16)) e^(−(x−6)²/32) dx.     (30)

Amazingly, the processes of mathematics do not allow us to evaluate the integral in (30): it is just too hard. (This indicates an interesting limit as to what mathematics can do.) So how would we find the probability given in (30)? It has to be done using a chart. So we now have to discuss the normal distribution chart.

There is a whole family of normal distributions, each member of the family corresponding to some pair of (μ, σ²) values. (The case μ = 6 and σ² = 16 just considered is an example of one member of this family.) However, probability charts are available only for one particular member of this family, namely the normal distribution for which μ = 0 and σ² = 1. This is sometimes called the standardized normal distribution, for reasons which will appear shortly. (The normal distribution chart that you will be given refers to this specific member of the normal distribution family.) The way the chart works is best described by a few examples. (We will also do some examples in class.) The chart gives "less than" probabilities for a variety of positive numbers, generically denoted by z. Thus the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 0.5 is 0.6915. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 1.73 is 0.9582. Note that the chart only goes up to the z value 3.09; for any z greater than this, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than z is, to a good enough approximation, 1.

We usually have to consider more complicated examples than this. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 0.5 and 1.73 is 0.9582 − 0.6915 = 0.2667. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 1.23 and 2.46 is 0.9931 − 0.8907 = 0.1024. As a different form of calculation, we often have to find "greater than" probabilities. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value exceeding 1.44 is 1 minus the probability that it takes a value less than 1.44, namely 1 − 0.9251 = 0.0749.

Even more complicated calculations arise when negative numbers are involved. Here we have to use the symmetry of the normal distribution around the value 0. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.22 and 0 is the same as the probability that it takes a value between 0 and +1.22, and this is 0.8888 − 0.5 = 0.3888. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than −0.87 is the same as the probability that it takes a value greater than +0.87, and this is 1 − 0.8078 = 0.1922. Finally, perhaps the most complicated calculation concerns the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between some given negative number and some given positive number. Suppose for example that we want to find the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.28 and +0.44. This is the probability that it takes a value between −1.28 and 0 plus the probability that it takes a value between 0 and +0.44. This in turn is the probability that it takes a value between 0 and +1.28 plus the probability that it takes a value between 0 and +0.44. This is (0.8997 − 0.5000) + (0.6700 − 0.5000) = 0.5697.

Why is there a probability chart only for this one particular member of the normal distribution family? Suppose that a random variable X has the normal distribution (29), that is, with arbitrary mean μ and arbitrary variance σ². Then the standardized random variable Z, defined by Z = (X − μ)/σ, has a normal distribution with mean 0, variance 1 (trust me on this). This standardization procedure can be used to find probabilities for a random variable having any normal distribution. For example, if X is a random variable having a normal distribution with mean 6 and variance 16 (and thus standard deviation 4), P(7 < X < 10), that is the probability of the event 7 < X < 10, can be found by standardizing and creating a Z statistic:

P(7 < X < 10) = P((7 − 6)/4 < (X − 6)/4 < (10 − 6)/4) = P(0.25 < Z < 1),     (31)

and this probability is found from the standardized normal distribution chart (or from computer packages) to be 0.8413 − 0.5987 = 0.2426. As a slightly more complicated example, the probability of the event 4 < X < 11 for the random variable X in the previous paragraph can be found by standardizing and creating a Z statistic:

P(4 < X < 11) = P((4 − 6)/4 < (X − 6)/4 < (11 − 6)/4) = P(−0.5 < Z < 1.25),     (32)

and this probability is found from the kind of manipulations discussed above to be (0.6915 − 0.5000) + (0.8944 − 0.5000) = 0.5859.

Two useful properties of the normal distribution, often used in conjunction with this standardization procedure, are that if the random variable Z has a normal distribution with mean 0 and variance 1, then

P(Z > +1.645) = 0.05     (33)

and

P(−1.96 < Z < +1.96) = 0.95,     (34)

or equivalently

P(Z < −1.96) + P(Z > +1.96) = 0.05.     (35)

The standardized quantity Z, defined as Z = (X − μ)/σ, where X is a random variable with mean μ and standard deviation σ, will be referred to often below, and the symbol Z is reserved, in these notes and in Statistics generally, for this standardized quantity. One of the applications of the normal distribution is to provide approximations for probabilities for various random variables, almost always using the standardized quantity Z. One frequently-used approximation derives from equation (34) by approximating the value 1.96 by 2. This is

P(−2 < Z < +2) ≈ 0.95.     (36)

Remembering that Z = (X − μ)/σ, this equation implies that if X is a random variable having a normal distribution with mean μ and variance σ², then

P(μ − 2σ < X < μ + 2σ) ≈ 0.95.     (37)

A similar calculation, using the normal distribution chart, shows that

P(μ − 2.575σ < X < μ + 2.575σ) ≈ 0.99.     (38)
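All of the chart look-ups above can also be reproduced with a computer package. A sketch assuming scipy, whose norm.cdf plays the role of the chart:

```python
# Reproducing the normal-chart calculations with scipy.
from scipy.stats import norm

print(norm.cdf(0.5))                     # 0.6915: P(Z < 0.5)
print(norm.cdf(1.73) - norm.cdf(0.5))    # 0.2667: P(0.5 < Z < 1.73)
print(1 - norm.cdf(1.44))                # 0.0749: P(Z > 1.44)
print(norm.cdf(0.44) - norm.cdf(-1.28))  # 0.5697: P(-1.28 < Z < 0.44)

# Standardization as in (31) and (32), for X ~ N(6, 16):
mu, sigma = 6, 4
print(norm.cdf((10 - mu)/sigma) - norm.cdf((7 - mu)/sigma))  # 0.2426
print(norm.cdf((11 - mu)/sigma) - norm.cdf((4 - mu)/sigma))  # 0.5859
```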

Applications of (37) often arise from the Central Limit Theorem, discussed immediately below.

The Central Limit Theorem

An important property of an average and of a sum of several random variables derives from the so-called Central Limit Theorem. This states that if the random variables X_1, X_2, ..., X_n are independently and identically distributed, then no matter what the probability distribution of these random variables might be, the average X̄ = (X_1 + X_2 + ... + X_n)/n and the sum X_1 + X_2 + ... + X_n both have approximately a normal distribution. This approximation becomes more accurate the larger n is, and is usually very good for values of n greater than about 50. Since many statistical procedures deal with sums or averages, the Central Limit Theorem ensures that we often deal with the normal distribution in these procedures. Also, we often use the formulas (18) and (19) for the mean and variance of a sum and of an average, and the approximation (37), when doing this. We have already seen an example of this (in the section "An example of the use of equations (19)"). The average X̄ of the numbers to turn up on 1,000 rolls of a fair die is a random variable with mean 3.5 and variance 35/12,000, and thus standard deviation √(35/12,000) ≈ 0.0540. The Central Limit Theorem states that to a very close approximation this average has a normal distribution with this mean and this variance. Then application of (37) shows that, to a very close approximation,

P(3.5 − 2 × 0.0540 < X̄ < 3.5 + 2 × 0.0540) ≈ 0.95,     (39)

and this led to the (probability theory) statement given in the section "An example of the use of equations (19)" that the probability that X̄ takes a value between 3.392 and 3.608 is about 95%. We also saw how this statement gives us a window into Statistics. The Central Limit Theorem also applies to the binomial distribution. Suppose that X has a binomial distribution with index n (the number of trials) and parameter θ (the probability of success on each trial), and thus mean nθ and variance nθ(1 − θ). In the binomial context the Central Limit Theorem states that X has, to a very close approximation, a normal distribution with this mean and this variance. It similarly states that the proportion P of successes has, to a very close approximation, a normal distribution with mean θ and variance θ(1 − θ)/n. Here is an application of this result. Suppose that it is equally likely that a newborn will be a boy as a girl. If this is true, the number of boys in a sample of 2,500 newborns has approximately a normal distribution with mean 1,250 and variance (from note (v) about variances) of 2,500 × 1/2 × 1/2 = 625, and hence standard deviation √625 = 25. Then (38) shows that the probability is about 0.99 that the number of boys in this sample will be between

1,250 − 2.575 × 25    and    1,250 + 2.575 × 25,

that is, between about 1185 and 1315. We saw these numbers in Homework 1. Here is the corresponding window into Statistics. IF a newborn is equally likely to be a boy as a girl, then the probability is about 99% that in a sample of 2,500 newborns, the number of boys that we see will be between 1185 and 1315. (This is a probability theory deduction, or implication. It is a "zig". It is made under the assumption that a newborn is equally likely to be a boy as a girl.) However, when we actually took this sample we saw 1,334 boys. We therefore have good evidence that it is NOT equally likely for a newborn to be a boy as a girl. (This is an induction, or inference. It is a "zag". It is a statement of Statistics. It cannot be made without the corresponding probability theory "zig" calculation.) The above example illustrates how we increase our knowledge in a context involving randomness (here the randomness induced by the sampling process) by a probability theory/Statistics zig-zag process. (In fact it is now known that it is NOT equally likely for a newborn to be a boy as a girl.)
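A sketch (assuming scipy) of the probability calculations behind this zig-zag:

```python
# The newborn example: under the hypothesis theta = 1/2, the number of
# boys among 2,500 newborns is approximately N(1250, 625) by the CLT.
from scipy.stats import norm

n, theta = 2500, 0.5
mu = n * theta                            # 1250
sigma = (n * theta * (1 - theta)) ** 0.5  # 25

# The central 99% range, as in (38): about (1185, 1315).
print(mu - 2.575 * sigma, mu + 2.575 * sigma)

# Observing 1,334 boys would be extremely unlikely under the hypothesis:
print(norm.sf(1334, loc=mu, scale=sigma))  # a very small probability
```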


Statistics

Introduction

So far in these notes we have been contemplating the situation before some experiment is carried out, so that we have been discussing random variables and their properties. We now do our experiment. As indicated above, if before the experiment we had been considering several random variables X_1, X_2, ..., X_n, we denote the actually observed values of these random variables, once the experiment has been carried out, by x_1, x_2, ..., x_n. These observed values are our data. As an example, if an experiment consisted of rolling a die n = 3 times, and after the experiment we observe that a 5 turned up on the first roll and a 3 on both the second and third rolls, we would say that x_1 = 5, x_2 = 3, x_3 = 3. These are our data values. It does not make sense to say, before the experiment has been done, that X_1 = 5, X_2 = 3, X_3 = 3. This comment does not compute. The three main activities of Statistics are the estimation of the numerical values of a parameter or parameters, assessing the accuracy of these estimates, and testing hypotheses about the numerical values of parameters. We now consider each of these in turn.

Estimation (of a parameter)
Comments on the die-rolling example

In much of the discussion in these notes (and the course) so far the values of the various parameters entering the probability distributions considered were taken as being known. A good example was the fair die simulation: we knew in advance that the die is fair, so we knew in advance the values of the mean (3.5) and the variance (35/12) of the number to turn up on any roll of the die. However, in practice these parameters are usually unknown, and must be estimated from data. This means that our JMP die rolling experiment is very atypical. The reason is that in this JMP experiment we know that the die is fair, so that we know for example the mean of the (random variable) average of the numbers turning up after (say) 1,000 of rolls of the die is 3.5. The real-life situation, especially in research, is that we will not know the relevant mean. For example, we might be interested in the mean blood-sugar reading of diabetics. To get some idea about what this mean might be we would take a sample of (say) 1,000 diabetics, measure the blood-sugar reading for each of these 1,000 people and iuse the average of these to estimate this (unknown) mean. This is a natural (as as we see later) correct thing to do. So think of the JMP die-rolling example as a proof of principle: because we know that the die is fair, we know in advance that the mean of the (random variable) average of the numbers turning up after (say) 1,000 of rolls of the die is 3.5. We also know that it has a small variance (35/12,000). This value (3.5) of the mean and this small variance imply that, once we have rolled the die 1,000 times, our actually observed average should be very


close to the mean of 3.5. And this is what we saw happen. This suggests that in a real-life example, where we do not know the numerical value of a mean, using an observed average should give us a pretty good idea of what the mean is. Later we will refine this idea and make it more precise.
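Here is a minimal simulation sketch (not part of the original notes) of this proof of principle, in the spirit of the JMP experiment with 1,000 rolls of a fair die:

import random

rolls = [random.randint(1, 6) for _ in range(1000)]  # 1,000 fair-die rolls
average = sum(rolls) / len(rolls)
print(average)  # almost always close to the mean 3.5, since the variance
                # of the average is only (35/12)/1,000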
General principles

In this section we consider general aspects of estimation procedures. Much of the theory concerning the estimation of parameters is the same for both discrete and continuous random variables, so in this section we use the notation X for both. Let X1, X2, ..., Xn be n independently and identically distributed (iid) random variables, each having a probability distribution P(x; θ) (for discrete random variables) or density function f(x; θ) (for continuous random variables), depending in both cases (as the notation implies) on some unknown parameter θ. We have now done our experiment, so that we have the corresponding data values x1, x2, ..., xn. How can we use these data values to estimate the parameter θ? (Note that we are estimating the parameter θ, not calculating it. Even after we have the data values we still will not know the numerical value of θ. But at least if we use good estimation procedures we should have a reasonable approximate idea of its value.)

Before discussing particular cases we have to consider general principles of estimation. An estimator of the parameter θ is some function of the random variables X1, X2, ..., Xn, and thus may be written θ̂(X1, X2, ..., Xn), a notation that emphasizes that this estimator is itself a random variable. For convenience we generally use the shorthand notation θ̂. (We pronounce θ̂ as "theta-hat".) The quantity θ̂(x1, x2, ..., xn), calculated from the observed data values x1, x2, ..., xn of X1, X2, ..., Xn, is called the estimate of θ. The "hat" notation is a signal to us that we are talking about either an estimator or an estimate. Note the two different words "estimate" and "estimator". The estimate of θ is calculated from our data, and is just some number. How good this estimate is depends on the properties of the (random variable) estimator θ̂, in particular its mean and its variance.

Various desirable criteria have been proposed for an estimator to satisfy, and we now discuss three of these. First, a desirable property of an estimator is that it be unbiased. An estimator θ̂ is said to be an unbiased estimator of θ if its mean value is equal to θ. If θ̂ is an unbiased estimator of θ, we say that the corresponding estimate θ̂(x1, x2, ..., xn), calculated from the observed data values x1, x2, ..., xn, is an unbiased estimate of θ. It is shooting at the right target. Because of the randomness involved in the generation of our data, it will almost certainly not exactly hit the target. But at least it is shooting in the right direction. As an example, think of the average that you got of the numbers that turned up on your 1,000 rolls of a fair die in the JMP experiment. We will later show that, if you did not know the mean (3.5) in the die case, you would use your average to estimate it. Almost certainly your average was not exactly 3.5. But at least it should have been close to 3.5,


since it was shooting at the right target.

Second, if an estimator θ̂ of some parameter θ is unbiased, we would also want the variance of θ̂ to be small, since if it is, the observed value θ̂(x1, x2, ..., xn) calculated from your data, that is, your estimate of θ, should be close to θ. This is why a variance is an important probability theory concept.

Finally, it would also be desirable if θ̂ has, either exactly or approximately, a normal distribution, since then well-known properties of this distribution can be used to assess the properties of θ̂. In particular, we often use the two-standard-deviation rule in assessing the precision of our estimate, and this rule derives from the normal distribution. Fortunately, several of the estimators we consider are unbiased, have a small variance, and have an approximately normal distribution.
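The following small simulation sketch (not part of the original notes) illustrates these criteria concretely. It uses the sample proportion of successes in n = 100 trials, which is the estimator discussed in the next section, as θ̂; the true θ is set to 0.3 purely for illustration.

import random

# Repeat the experiment many times and look at the behavior of the
# estimator (the sample proportion) across repetitions.
theta, n, reps = 0.3, 100, 10000
estimates = [sum(random.random() < theta for _ in range(n)) / n
             for _ in range(reps)]
mean_est = sum(estimates) / reps
var_est = sum((e - mean_est) ** 2 for e in estimates) / reps
print(mean_est)  # close to 0.3: the estimator is unbiased
print(var_est)   # close to 0.3 * 0.7 / 100 = 0.0021: its variance is small

A histogram of the 10,000 estimates would also look approximately normal, in line with the third criterion.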
Estimation of the binomial parameter θ

The binomial distribution gives the probability distribution of the random variable X, the number of successes from n binomial trials with the binomial parameter θ (the probability of success on each trial). This random variable and its probability distribution are purely probability theory concepts. The corresponding statistical question is: I have now done my experiment and observed x successes from n trials. What can I say about θ? In particular, how should I estimate θ? How precise can I expect my estimate to be?

The classical, and perhaps natural, estimate of θ is p = x/n, the observed proportion of successes. This is indeed our estimate of θ. What are the properties of this estimate? These depend on the properties of the random variable P, the (random) proportion of successes before we do the experiment. We know (from the relevant probability theory formulas) that the mean of P is θ and that the variance of P is θ(1 − θ)/n. What does this imply?

First, since we know that the random variable P has a mean of θ, the estimate p of θ, once we have done our experiment, is an unbiased estimate of θ. It is shooting at the right target. That is good news. Just as important, we want to ask: how precise is this estimate? An estimate of a parameter without any indication of its precision is not of much value. The precision of p as an estimate of θ depends on the variance, and thus on the standard deviation, of the random variable P. We know that the variance of P is θ(1 − θ)/n, so that the standard deviation of P is √(θ(1 − θ)/n). We now use two facts. (i) From the Central Limit Theorem as applied to the random variable P, we know that P has, to a very accurate approximation, a normal distribution (with mean θ and variance θ(1 − θ)/n). (ii) Once we know that P has a normal distribution (to a sufficiently good approximation) we


are free to adapt either (37) or (38), which are normal distribution results, to the question of the precision of p as an estimate of θ. First we have to find out what these equations reduce to, or imply, in the binomial distribution context. They become

P(θ − 2√(θ(1 − θ)/n) < P < θ + 2√(θ(1 − θ)/n)) ≈ 0.95,   (40)

and

P(θ − 2.575√(θ(1 − θ)/n) < P < θ + 2.575√(θ(1 − θ)/n)) ≈ 0.99.   (41)

The first inequality implies, in words, something like this: "Before we do our experiment, we can say that the random variable P takes a value within 2√(θ(1 − θ)/n) of θ with probability of about 95%." From this we can say: "After we have done our experiment, it is about 95% likely that the observed proportion p of successes is within 2√(θ(1 − θ)/n) of θ." We now turn this second statement inside-out (using the "if I am within 10 yards of you, then you are within 10 yards of me" idea) and say: "It is about 95% likely that, once we have done our experiment, θ is within 2√(θ(1 − θ)/n) of the observed proportion p of successes." Writing this somewhat loosely, we can say

P(p − 2√(θ(1 − θ)/n) < θ < p + 2√(θ(1 − θ)/n)) ≈ 0.95.   (42)

We still have a problem. Since we do not know the value of θ, we do not know the value of the expression √(θ(1 − θ)/n) occurring twice in (42). However, at least we have an estimate of θ, namely p. Since (42) is already an approximation, we make a further approximation and say, again somewhat loosely,

P(p − 2√(p(1 − p)/n) < θ < p + 2√(p(1 − p)/n)) ≈ 0.95.   (43)

This leads to the so-called (approximate) 95% confidence interval for θ of

p − 2√(p(1 − p)/n)  to  p + 2√(p(1 − p)/n).   (44)


As an example, suppose that n = 1,000 and p is 0.47. The sort of thing we would say is: "I estimate the value of θ to be 0.47, and I am (approximately) 95% certain that θ is between 0.47 − 2√(0.47 × 0.53/1,000) (i.e. 0.4384) and 0.47 + 2√(0.47 × 0.53/1,000) (i.e. 0.5016)." In saying this we have not only indicated our estimate of θ, but we have also given some idea of the precision, or reliability, of that estimate.

Notes on the above.

(i) The range of values 0.4384 to 0.5016 in the above example is usually called a 95% confidence interval for θ. The interpretation of this statement is that we are (approximately) 95% certain that θ is within this range of values. Thus the confidence interval gives us an idea of the precision of the estimate 0.47.

(ii) In research papers, books, etc., you will often see the above result written as θ = 0.47 ± 0.0316.

(iii) The precision of the estimate 0.47, as indicated by the confidence interval, depends on the variance θ(1 − θ)/n of the random variable P. This is why we have to consider random variables, their properties and, in particular, their variances.

(iv) It is a mathematical fact that p(1 − p) can never exceed 1/4. Further, for quite a wide range of values of p near 1/2, p(1 − p) is quite close to 1/4. So if we approximate p(1 − p) by 1/4, and remember that √(1/4) = 1/2, we arrive from (44) at a conservative confidence interval for θ as

p − 1/√n  to  p + 1/√n.   (45)

This formula is quite easy to remember and you may use it in place of (44).

(v) What was the sample size? Suppose that a TV announcer says, before an election between two candidates Smith and Jones, that a Gallup poll predicts that 52% of the voters will vote for Smith, with a margin of error of 3%. The TV announcer has no idea where that 3% (= 0.03) came from, but in effect it came from the (approximate) 95% confidence interval (44) or (more likely) from (45). So we can work out, from (45), how many individuals were in the sample that led to the estimate 52%, or 0.52. All we have to do is to equate 1/√n with 0.03. We find n = 1,111. (Probably their sample size was 1,000, and with this value the margin of error is 1/√1,000 = 0.0316. They just approximated this by 0.03.)

(vi) All of the above relates to an (approximate) 95% confidence interval for θ. If you want to be more conservative, and have a 99% confidence interval, you can start with the inequalities

in (41), which, compared to (40) (which led to our 95% confidence interval), replaces the 2 in (40) by 2.575. Carrying through the same sort of argument that led to (44) and (45), we would arrive at an (approximate) 99% confidence interval for θ of

p − 2.575√(p(1 − p)/n)  to  p + 2.575√(p(1 − p)/n)   (46)

in place of (44), or

p − 1.2875/√n  to  p + 1.2875/√n   (47)

in place of (45).

Example. This example is from the field of medical research. Suppose that someone proposes an entirely new medicine for curing some illness. Beforehand we know nothing about the properties of this medicine, and in particular we do not know the probability θ that it will cure someone of the illness involved. θ is an (unknown) parameter. We want to carry out a clinical trial to estimate θ. Suppose now that we have given the new medicine to 10,000 people with the illness and that, of these, 8,716 were cured. Then we estimate θ to be 0.8716. Since we want to be very precise in a medical context, we might prefer to use the 99% confidence interval (47) instead of the 95% confidence interval (45). Since 1/√10,000 is 0.01, we would say: "I estimate the probability of a cure by 0.8716, and I am about 99% certain that the probability of a cure with this proposed medicine is between 0.8716 − 0.012875 (= 0.8587) and 0.8716 + 0.012875 (= 0.8845)."

(vii) Notice that the lengths of both confidence intervals (45) and (47) are proportional to 1/√n. This means that if you want to be twice as accurate you need four times the sample size, if you want to be three times as accurate you need nine times the sample size, and so on. This is why your medicines are so expensive: the FDA requires considerable accuracy before a medicine can be put on the market, and this often implies that a very large sample size is needed to meet this required level of accuracy.

(viii) Often in research publications the result of an estimation procedure is written as something like: estimate ± some measure of precision of the estimate. Thus the result in the medical example above might be written as something like θ = 0.8716 ± 0.012875. This can be misleading because, for example, it is not indicated whether this is a 95% or a 99% confidence interval. Also, it is not the best way to present the conclusion.

(ix) The width of the confidence interval, and hence the precision of the estimate, ultimately depends on the variance of the random variable P. This is why we have to discuss (a) random variables and (b) variances of random variables.
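As a quick arithmetic check on this section, here is a small Python sketch (not part of the original notes) that computes the 95% interval (44), the conservative interval (45) and the 99% interval (47) for the n = 1,000, p = 0.47 example.

import math

def binomial_ci(p, n, z):
    # Interval of the form p +/- z * sqrt(p(1-p)/n), as in (44) and (46)
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

p, n = 0.47, 1000
print(binomial_ci(p, n, 2))      # (0.4384..., 0.5015...): the interval (44)
print(p - 1 / math.sqrt(n), p + 1 / math.sqrt(n))            # interval (45)
print(p - 1.2875 / math.sqrt(n), p + 1.2875 / math.sqrt(n))  # interval (47)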


Estimation of a mean (µ)

Suppose that we wish to estimate the mean blood sugar level µ of diabetics. We take a random sample of n diabetics and measure their blood sugar levels, getting data values x1, x2, ..., xn. It is natural to estimate the mean µ by the average x̄ of these observed values. What are the properties of this estimate? To answer this question we have to zig-zag backwards and forwards between probability theory and Statistics. We start with probability theory and think of the situation before we got our data. We think of the data values x1, x2, ..., xn as the observed values of n iid random variables X1, X2, ..., Xn, all having some continuous probability density function with (unknown to us) mean µ and (unknown to us) variance σ². Our aim then is to estimate µ from the data and to assess the precision of our estimate. The form of the density function of each X is unknown to us. We can, however, conceptualize this distribution graphically:

[Figure: a generic probability density function.] There is some (unknown to us) probability (the shaded area in the figure) that the blood sugar level of a randomly chosen diabetic lies between the values a and b:


[Figure: the shaded area between a and b under the density curve.] This distribution has some (unknown to us) mean µ (which is what we want to estimate) at the balance point of this density function, as indicated by the arrow in the figure:

[Figure: the same density curve, with an arrow marking the mean at its balance point.] We continue to think of the situation before we get our data. That is, we continue to think in terms of probability theory. We consider the average X̄ of the random variables X1, X2, ..., Xn. Since the mean value of X̄ is µ (from equation (19)), X̄ is an unbiased estimator of µ. Thus x̄ is an unbiased estimate of µ. That is good news: it is shooting at the right target. So our natural estimate is the correct one. Much more important: how precise is it as an estimate of µ? This depends on the variance of X̄. We know (see (19)) that the variance of X̄ is σ²/n, and even though we do not know the value of σ², this result is still useful to us. Next, the Central Limit Theorem shows that the probability distribution of X̄ is approximately normal when n is large, so that to a good approximation we can use the two-standard-deviation rule. These facts lead us to an approximate 95% confidence interval for µ.
The 95% confidence interval for µ

Suppose first that we know the numerical value of σ². (In practice it is very unlikely that we would know this, but we will remove this assumption soon.) The two-standard-deviation rule, deriving from properties of the normal distribution, then shows that for large n,

P(µ − 2σ/√n < X̄ < µ + 2σ/√n) ≈ 0.95.   (48)


The inequalities in (48) can be written in the equivalent "turned inside-out" form

P(X̄ − 2σ/√n < µ < X̄ + 2σ/√n) ≈ 0.95.   (49)

This leads to an approximate 95% confidence interval for µ, given the data values x1, x2, ..., xn, as

x̄ − 2σ/√n  to  x̄ + 2σ/√n.   (50)

This interval is valuable in providing a measure of accuracy of the estimate x̄ of µ. To be told that the estimate of a mean is 14.7 and that it is approximately 95% likely that the mean is between 14.3 and 15.1 is far more useful information than being told only that the estimate of the mean is 14.7.

The main problem with the above is that, in practice, the variance σ² is usually unknown, so that (50) is not immediately applicable. However, it is possible to estimate σ² from the data values x1, x2, ..., xn. The theory here is not easy, so here is a "trust me" result: the estimate s² of σ² found from observed data values x1, x2, ..., xn is

s² = (x1² + x2² + ··· + xn² − n(x̄)²)/(n − 1).   (51)

This leads (see (50)) to an even more approximate 95% confidence interval for µ as

x̄ − 2s/√n  to  x̄ + 2s/√n.   (52)

This estimated confidence interval is useful, since it provides a measure of the accuracy of the estimate x̄ and it can be computed entirely from the data. Further theory shows that in practice it is reasonably accurate.

Some notes on the above

1. The number 2 appearing in the confidence interval (52) comes, eventually, from the two-standard-deviation rule. This rule is only an approximate one so that, as mentioned above, the 95% confidence interval (52) is only reasonably accurate.

2. Why do we have n − 1 in the denominator of the formula (51) for s², and not n (the sample size)? This question leads to the concept of "degrees of freedom", which we shall discuss later.

3. The effect of changing the sample size. Consider two investigators, both interested in the blood sugar levels of diabetics. Suppose that n = 10 for the first investigator (i.e. her sample size was 10) and that n = 40 for the second investigator (i.e. his sample size was 40). The two investigators will estimate µ by their respective values of x̄. Since both estimates

are unbiased, that is, both are shooting at the same target (µ), they should be reasonably close to each other. Similarly, their respective estimates of σ² should be reasonably close to each other, since both are unbiased estimates of σ². On the other hand, the length of the confidence interval for µ for the second investigator will be about half that of the first investigator, since he will have a √40 involved in the calculation of his confidence interval, not the √10 that the first investigator will have (see (52), and note that 1/√40 is half of 1/√10). This leads to the next point.

4. To be twice as accurate you need four times the sample size. To be three times as accurate you need nine times the sample size. This happens because the length of the confidence interval (52) is 4s/√n. The fact that there is a √n in the denominator, and not an n, explains this phenomenon. This is why research is often expensive: to get really accurate estimates one often needs very large sample sizes. To be 10 times as accurate you need 100 times the sample size!

5. How large a sample size do you need, before you do your experiment, in order to get some desired degree of precision of the estimate of the mean µ? One cannot answer this question in advance, since the precision of the estimate depends on σ², which is unknown. Often one runs a pilot experiment to estimate σ², and from this one can get a good idea of what sample size is needed to get the required level of precision, using the above formulas.

6. The quantity s/√n is often called "the standard error of the mean". This statement incorporates three errors. More precisely it should be: "the estimated standard deviation of the estimator of the mean".

A numerical example. For many years corn has been grown using a standard seed processing method. A new method is proposed in which the seed is kiln-dried before sowing. We want to assess various properties of this new method. In particular, we want to estimate µ, the mean yield per acre (in pounds) under the new method, and to find two limits between which we are approximately 95% certain that µ lies. We plan to do this by sowing n = 11 separate acres of land with the new seed type and measuring the yield per acre for each of these 11 acres. At this point, before we do the experiment, these yields are unknown to us. They are random variables, and we think of their values, before we carry out this experiment, as the random variables X1, X2, ..., X11. We know (as above) that the mean of the average X̄ of these random variables is µ, so we know that the estimate x̄ will be unbiased. With this conceptualization behind us, we now apply this new style of seed to our 11 separate acre lots, and we get the following values (pounds per acre):

1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511.

These are our data values, which we have previously generically denoted by x1, x2, ..., xn. Now to our estimation and confidence interval procedures. We estimate the mean µ of the yield per acre by the average

x̄ = (1903 + 1935 + ··· + 1511)/11 = 1841.46.

We know from the above theory that this is an unbiased estimate of µ, the mean yield per acre. To calculate the approximate 95% confidence interval (52) for µ, we first have to calculate s², our estimate of the variance σ² of the probability distribution of yield with this new seed type. The estimate of σ² is, from (51),

s² = ((1903)² + (1935)² + ··· + (1511)² − 11(1841.46)²)/10 = 117,468.9.

Following (52), these calculations lead to our approximate 95% confidence interval for µ as

1841.46 − 2√(117,468.9/11)  to  1841.46 + 2√(117,468.9/11),   (53)

that is, from 1634.78 to 2048.14. Since the individual yields are clearly given rounded to whole numbers, it is not appropriate to be more accurate than this in our final statement, which is: "We estimate the mean µ by 1841, and we are about 95% certain that it is between 1635 and 2048."

Often in research publications the above result might be written µ = 1841 ± 206. This can be misleading because, for example, it is not indicated whether this is a 95% or a 99% confidence interval. Also, it is not the best way to present the conclusion.
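Here is a small Python sketch (not part of the original notes) that reproduces the corn-yield calculation from the raw data:

yields = [1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511]
n = len(yields)                      # 11 acres
xbar = sum(yields) / n               # 1841.45..., the unbiased estimate of mu
s2 = (sum(x * x for x in yields) - n * xbar ** 2) / (n - 1)   # formula (51)
half = 2 * (s2 / n) ** 0.5           # half-width of the interval (52)
print(xbar, s2)                      # about 1841.5 and 117468.9
print(xbar - half, xbar + half)      # about 1634.8 and 2048.1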


Estimating the difference between two binomial parameters

Let's start with an example. Is there a difference between men and women in their attitudes in the pro-life/pro-choice debate? We approach this question from a statistical point of view as follows. Let θ1 be the (unknown) probability that a woman is pro-choice and let θ2 be the (unknown) probability that a man is pro-choice. We are interested in the difference θ1 − θ2. Our plan is to take a sample of n1 women and n2 men and find out for each person whether he/she is pro-life or pro-choice. Our aim then is to estimate θ1 − θ2 and to find an approximate 95% confidence interval for θ1 − θ2.

Suppose now that we have taken our sample, and that x1 of the n1 women are pro-choice and x2 of the n2 men are pro-choice. We would estimate θ1 by x1/n1, which we will write as p1, and would estimate θ2 by x2/n2, which we will write as p2. Thus we would (correctly) estimate θ1 − θ2 by the difference d = p1 − p2. What are the properties of this estimate? These are determined by the properties of the random variable D = P1 − P2, where, before we took our sample, P1 = X1/n1 is the (random) proportion of women who will be pro-choice and P2 = X2/n2 is the (random) proportion of men who will be pro-choice. Both P1 and P2 are random variables. Notice that P1 − P2 is a difference. In comparing two groups we are often involved with differences. That is why we have done some probability theory about differences.

Now we know that the mean of P1 is θ1 (this is one of the "magic formulas" for the proportion of successes in the binomial context), and we also know that the mean of P2 is θ2 (this uses the same magic formula). Thus from the first equation in (23), giving the mean of a difference of two random variables with possibly different means, the mean of D is θ1 − θ2. Thus D is an unbiased estimator of θ1 − θ2, and correspondingly d = p1 − p2 is an unbiased estimate of θ1 − θ2. It is shooting at the right target. It is the estimate of θ1 − θ2 that we will use.

More important: how precise is this estimate? To answer this we have to find the variance of the estimator P1 − P2. Now the variance of P1 is θ1(1 − θ1)/n1 (from the variance of the proportion of successes in n1 binomial trials). Similarly, the variance of P2 is θ2(1 − θ2)/n2. From the second equation in (23), giving the variance of a difference of two random variables with possibly different variances, the variance of D is θ1(1 − θ1)/n1 + θ2(1 − θ2)/n2. We do not of course know the numerical value of this variance, since we do not know the values of θ1 and θ2. However, we have an estimate of θ1, namely p1, and an estimate of θ2, namely p2. So we estimate this variance by p1(1 − p1)/n1 + p2(1 − p2)/n2.

Using the same sort of argument that led to (43), we can then say

P(p1 − p2 − 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2) < θ1 − θ2 < p1 − p2 + 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2)) ≈ 0.95.   (54)

This leads to the so-called (approximate) 95% confidence interval for θ1 − θ2 of

p1 − p2 − 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2)  to  p1 − p2 + 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2).   (55)

These formulas are pretty clumsy, so we carry out the same approximation that we did when estimating a single binomial parameter (see the discussion leading to (45)). That is, we use the mathematical fact that neither p1(1 − p1) nor p2(1 − p2) can ever exceed 1/4. Further, for quite a wide range of values of any fraction f near 1/2, f(1 − f) is quite close to 1/4. So if we approximate both p1(1 − p1) and p2(1 − p2) by 1/4, and remember that √(1/4) = 1/2, we arrive at a conservative confidence interval for θ1 − θ2 as

p1 − p2 − √(1/n1 + 1/n2)  to  p1 − p2 + √(1/n1 + 1/n2).   (56)

Numerical example. Suppose that we interview n1 = 1,000 women and n2 = 800 men on the pro-life/pro-choice question. We find that 624 of the women are pro-choice and 484 of the men are. So we estimate θ1 by p1 = 624/1,000 = 0.624 and we estimate θ2 by p2 = 484/800 = 0.605. So we estimate the difference between the proportion of women who are pro-choice and the proportion of men who are pro-choice to be 0.624 − 0.605 = 0.019. Further, we are approximately 95% certain that the actual difference θ1 − θ2 is between 0.019 − √(1/1,000 + 1/800) = 0.019 − 0.047 = −0.028 and 0.019 + √(1/1,000 + 1/800) = 0.019 + 0.047 = 0.066. A TV commentator would call 0.047 the "margin of error". Later, when we do hypothesis testing, we will see if the estimate 0.019 differs significantly from 0.
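A quick check of this example in Python (not part of the original notes), using the conservative interval (56):

def two_prop_ci(x1, n1, x2, n2):
    # d +/- sqrt(1/n1 + 1/n2), the conservative 95% interval (56)
    d = x1 / n1 - x2 / n2
    moe = (1 / n1 + 1 / n2) ** 0.5       # the "margin of error"
    return d, d - moe, d + moe

print(two_prop_ci(624, 1000, 484, 800))  # (0.019, -0.028..., 0.066...)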
Estimating the difference between two means

As in the previous section, let's start with an example. In fact, we will follow the structure of the last section fairly closely. Is the mean blood pressure of women equal to the mean blood pressure of men? We approach this question from a statistical point of view as follows. Let µ1 be the (unknown) mean blood pressure for a woman and µ2 be the (unknown) mean blood pressure for a man. So we are interested in the difference µ1 − µ2. Our aim is

to take a sample of n1 women and n2 men and measure the blood pressures of all n1 + n2 people. Our aim then is to estimate µ1 − µ2 and to find an approximate 95% confidence interval for µ1 − µ2.

Clearly we estimate the mean blood pressure for women by x̄1, the average of the blood pressures of the n1 women in the sample, and similarly we estimate the mean blood pressure for men by x̄2, the average of the blood pressures of the n2 men in the sample. We then estimate µ1 − µ2 by x̄1 − x̄2.

How accurate is this estimate? This depends on the variance of the random variable X̄1 − X̄2. Using the formula for the variance of a difference, as well as the formula for the variance of an average, this variance is σ1²/n1 + σ2²/n2, where σ1² is the (unknown) variance of blood pressure among women and σ2² is the (unknown) variance of blood pressure among men. We do not know either σ1² or σ2², and these will have to be estimated from the data. If the blood pressures of the n1 women are denoted x11, x12, ..., x1n1, we estimate σ1² (see equation (51)) by

s1² = (x11² + x12² + ··· + x1n1² − n1(x̄1)²)/(n1 − 1).   (57)

Similarly, if the blood pressures of the n2 men are denoted x21, x22, ..., x2n2, we estimate σ2² (see equation (51)) by

s2² = (x21² + x22² + ··· + x2n2² − n2(x̄2)²)/(n2 − 1).   (58)

Thus we estimate σ1²/n1 + σ2²/n2 by s1²/n1 + s2²/n2.

Finally, our approximate 95% confidence interval for µ1 − µ2 is

x̄1 − x̄2 − 2√(s1²/n1 + s2²/n2)  to  x̄1 − x̄2 + 2√(s1²/n1 + s2²/n2).   (59)
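The blood-pressure data themselves are not given in these notes, so the following Python sketch (an illustration, not part of the original notes) just packages formulas (57), (58) and (59) as a function; the two sample lists passed to it at the end are made-up placeholders.

def two_mean_ci(sample1, sample2):
    # Approximate 95% interval (59) for mu1 - mu2
    def stats(xs):
        n = len(xs)
        xbar = sum(xs) / n
        s2 = (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)  # as in (51)
        return n, xbar, s2
    n1, xbar1, s21 = stats(sample1)
    n2, xbar2, s22 = stats(sample2)
    d = xbar1 - xbar2
    half = 2 * (s21 / n1 + s22 / n2) ** 0.5
    return d - half, d + half

# Hypothetical usage with made-up measurements:
print(two_mean_ci([118, 125, 131, 122], [128, 135, 127, 140]))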


Regression

How does one thing depend on another? How does the GNP of a country depend on the number of people in full-time employment? How does the reaction time of a person to some stimulus depend on the amount of sleep deprivation administered to that person? How does the growth height of a plant in a greenhouse depend on the amount of water that we give the plant during the growing period? Many practical questions are of the "how does this depend on that?" type. These questions are answered by the technique of regression.

Regression problems can get pretty complicated, so we consider here only one case of regression: how does some random, non-controllable quantity Y depend on some non-random, controllable quantity x? Notice two things about the notation. First, we denote the random quantity in upper case (see Y above), in accordance with the notational convention of denoting random variables in upper case, and we denote the controllable non-random quantity in lower case (see x above). Second, we are denoting the random variable by Y. Up to now we have denoted random variables using the letter X. We switch the notation from X to Y in the regression context because we will later plot our data values in the standard x-y plane, and it is natural to plot the observed values of the random variable as the y values.

We will use the plant and water example to demonstrate the central regression concepts. First we think of the situation before our experiment, and consider some typical generic plant to which we plan to give x units of water. At this stage the eventual growth height Y of this plant is a random variable: we do not know what it will be. We make the assumption that the mean of Y is of the linear form

mean of Y = α + βx,   (60)

where α and β are parameters, that is, quantities whose values we do not know. In fact, our main aim once the experiment is finished is to estimate the numerical values of these parameters and to get some idea of the precision of our estimates. Note that we assume that the mean growth height potentially depends on x, and indeed our main aim is to assess the way it depends on x. We also assume that

variance of Y = σ²,   (61)

where σ² is another parameter whose value we do not know and which, once the experiment is finished, we wish to estimate. The fact that there is a (positive) variance for Y derives from the fact that there is some, perhaps much, uncertainty about what the value of the plant growth will be after we have done the experiment. There are many factors that we

do not know about, such as soil fertility, which imply that Y is a random variable, with a variance.

So we are involved with three parameters, α, β and σ². We do not know the value of any one of them. As stated above, one of our aims, once the experiment is over, is to estimate these parameters from our data. Of these three, the most important one to us is β. The interpretation of β is that it is the mean increase in growth height per unit increase in the amount of water given. If β = 0 this mean increase is zero, and equation (60) shows that the mean growth height then does not depend on the amount of water that we give the plant. So we will be interested, later, in seeing whether our estimate of β, once we have our data, is close to zero or not.

Taking a break from regression for a moment, equation (60) reminds us that the algebraic equation y = a + bx defines a geometric line in the x-y plane, as shown in the figure below:

The interpretation of a is that it is the intercept of this line on the y axis (as shown). The interpretation of b is that it is the slope of the line. If b = 0 the line is horizontal, and then the values of y for points on the line are all the same, whatever the value of x.

Now back to the regression context. We plan to use some pre-determined number n of plants in our greenhouse experiment, planning to give the plants respectively x1, x2, ..., xn units of water. These x values do not have to be all different from each other, but it is essential that they are not all equal. In fact, there is a strategy question about how we should choose the values x1, x2, ..., xn, which is discussed later.

We are still thinking of the situation before we conduct our experiment. At this stage we conceptualize the growth heights Y1, Y2, ..., Yn of the n plants. (Y1 corresponds to the plant getting x1 units of water, Y2 corresponds to the plant getting x2 units of water, and so on.) These are all random variables: we do not know, in advance of doing the experiment, what values they will take. Then from equation (60), the mean of Y1 is α + βx1, the mean of Y2 is α + βx2, and so on. The variance of Y1 is σ², the variance of Y2 is also σ², and so

on. We assume that the various Yi values are independent. However, they are clearly not assumed to be identically distributed since if, for example, xi ≠ xj, that is, if the amount of water to be given to plant i differs from that to be given to plant j, the means of Yi and Yj are different whenever β ≠ 0 and the assumptions embodied in (60) are true.

Equation (60) shows that the mean of Y is a linear function of x. This means that once we have our data they should (if the assumption in (60) is correct) approximately lie on a straight line. We do not expect them to lie exactly on a straight line: we can expect random deviations from a straight line because of factors unknown to us, such as differences in soil composition among the pots that the various plants are grown in, temperature differences from the environment of one plant to another, etc. The fact that deviations from a straight line are to be expected is captured by the concept of the variance σ². The larger this (unknown to us) variance is, the larger these deviations from a line will tend to be.

All the above refers to the situation before we conduct our experiment. We now do the experiment and obtain growth heights y1, y2, ..., yn. (The plant getting x1 units of water had growth height y1, the plant getting x2 units of water had growth height y2, and so on.) The first thing that we have to do is to plot the (x1, y1), (x2, y2), ..., (xn, yn) values on a graph. Equation (60) shows that when we do this the data points should more or less lie on a straight line. Suppose that our data points are as shown below:

These data points do more or less lie on a straight line. If the data points are more or less on a straight line (as above; deciding this is really a matter of judgement) we can go ahead with our analysis. If they are clearly not on a straight line (see the example at the top of the next page) then you should not proceed with the analysis.


There are methods for dealing with data that clearly do not lie close to a straight line, but we do not consider them here. So from now on we assume that the data are more or less on a straight line. Our first aim is to use the data to estimate α, β and σ². To do this we have to calculate various quantities. These are

x̄ = (x1 + x2 + ··· + xn)/n,   ȳ = (y1 + y2 + ··· + yn)/n,   (62)

as well as the quantities sxx, syy and sxy, defined by

sxx = (x1 − x̄)² + (x2 − x̄)² + ··· + (xn − x̄)²,   (63)
syy = (y1 − ȳ)² + (y2 − ȳ)² + ··· + (yn − ȳ)²,   (64)
sxy = (x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + ··· + (xn − x̄)(yn − ȳ).   (65)

The most important parameter is β, since if β = 0 the growth height for any plant does not depend on the amount of water given to the plant. The derivation of unbiased estimates is complicated here, so we just give the "trust me" results:

We estimate β by b, defined by

b = sxy/sxx.   (66)

We estimate α by a, defined by

a = ȳ − bx̄.   (67)

Finally, we estimate σ² by sr², defined by

sr² = (syy − b²sxx)/(n − 2).   (68)

These are the three estimates that we want for our further analysis.


Notes on this.
1. The suffix r in sr² stands for the word "regression". The formula for sr² relates only to the regression context.

2. You will usually do a regression analysis with a statistical package (we will do an example in class), so in practice you will usually not have to do the computations for these estimates by hand.

3. It can be shown (the math is too difficult to give here) that a is an unbiased estimate of α, that b is an unbiased estimate of β, and that sr² is an unbiased estimate of σ².

4. How accurate is the estimate b of β? Again, here there is some difficult math that you will have to take on trust. The bottom line is that we are about 95% certain that β is between

b − 2sr/√sxx  and  b + 2sr/√sxx.   (69)

This is our (approximate) 95% confidence interval for β. Clearly the 2 in this result comes from the two-standard-deviation rule. You will have to take the 2sr/√sxx part of it on trust.

5. This result introduces a strategy concept into our choice of the values x1, x2, ..., xn, the amounts of water that we plan to put on the various plants. The width of this confidence interval is proportional to 1/√sxx. Thus the larger we make sxx, the shorter the length of this confidence interval, and the more precise we can be about the value of β. How can we make sxx large? We can do this by spreading the x values as far from their average as we reasonably can. However, two further considerations then come into play. We should keep the various x values within the range of values that is of interest to us. Also, we do not want to put half the x values at the lower end of this interesting range and the other half at the upper end. If we did this we would have no idea what is happening in the middle of the range; the picture below illustrates the case where we put about half our x values at the same low value and the other half at the same high value.


So in practice we tend to put quite a few x values near the extremes but also string quite a few x values through the middle of the interesting range. This will be illustrated in a numerical example later.

6. A second result goes in the other direction. Suppose that the amounts of water put on the various plants were close to each other. In other words, the x values would be close to each other, and all would then be close to their average. This would mean that sxx would be small, so 2sr/√sxx would be large, and the confidence interval (69) would be wide. We would then have little confidence in our estimate of β. An even more extreme case arises if we give all plants the same amount of water. Then both sxx and sxy would be zero, and the definition of b shows that we would calculate b as 0/0, which mathematically makes no sense. In fact, the formula for b is telling you: "You can't estimate β with the data that you have." The formula here is definitely sending you a message. In fact it is saying: "You want to assess how the growth height depends on the amount of water given to the plant. If you give all plants the same amount of water there is no way that you can do this." It would be the same as a situation where you wanted to assess how the height of a child depends on his/her age, and all the children in your sample were of exactly the same age. You clearly could not make this assessment with data of that type.

Example. We will do an example from the water and plant growth situation. We have n = 12 plants, to which we gave varying amounts of water (see below). After the experiment we obtained the following data:

Plant number:      1     2     3     4     5     6     7     8     9    10    11    12
Amount of water:  16    16    16    18    18    20    22    24    24    26    26    26
Growth height:  76.2  77.1  75.7  78.1  77.8  79.2  80.2  82.5  80.7  83.1  82.2  83.6

From these we compute x̄ = (16 + 16 + ··· + 26)/12 = 21 and ȳ = (76.2 + 77.1 + ··· + 83.6)/12 = 79.7. Also we find sxx = 188, syy = 83.54 and sxy = 122.4. We now compute our estimate b of β as sxy/sxx = 122.4/188 = 0.6510638. (This result is given to 7 decimal places so as to compare with the JMP printout. In practice you are not justified in giving it to an accuracy greater than that of the data, so in practice we would write b = 0.65.)


Next, our estimate a of α is ȳ − bx̄ = 79.7 − (0.6510638 × 21) = 66.02766. (Again, this result is given to 7 decimal places so as to compare with the JMP printout. In practice you are not justified in giving it to an accuracy greater than that of the data, so in practice we would write a = 66.03.) Finally, we estimate σ² by sr², calculated in this case (see (68)) as

sr² = (83.54 − (0.65)²(188))/10 = 0.411.

How accurate is our estimate b of β? First, from the theory, it is unbiased. It was found by a process that is truly aiming at β. Next, we are approximately 95% certain that β is between

0.65 − 2√(0.411/188)  and  0.65 + 2√(0.411/188),   (70)

that is, from 0.56 to 0.74.

Notes on this

1. Our so-called estimated regression line is y = 66.03 + 0.65x. That is the equation of the line that appears on the JMP screen. We could use this line, for example, to say that we estimate the mean growth height for a plant given 21 units of water to be 66.03 + 0.65 × 21 = 79.68.

2. Never extrapolate beyond the x values used in the experiment. For example, it is not appropriate to say that we estimate the mean growth height for a plant given 1,000 units of water to be 66.03 + 0.65 × 1,000 = 716.03. (You probably would have killed the plant if you gave it that much water.)

3. Later we will consider testing the hypothesis that the growth height of the plant does not depend on the amount of water given to it. This is equivalent to testing the hypothesis β = 0.

4. Notice the choice of the amounts of water in the above example. We gave three plants the lowest amount of water (16) and three plants the highest amount of water (26). We also strung a few values out between these extremes, in accordance with the discussion above about the choice of x values.

We will do this example with JMP in class. There will also be a handout discussing the JMP output and the interpretation of various things in this output. That handout should be regarded as part of these notes.
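Here is a small Python sketch (not part of the original notes) that reproduces the estimates (66)-(68) and the interval (69) from the plant data:

water = [16, 16, 16, 18, 18, 20, 22, 24, 24, 26, 26, 26]
height = [76.2, 77.1, 75.7, 78.1, 77.8, 79.2, 80.2, 82.5, 80.7, 83.1, 82.2, 83.6]
n = len(water)
xbar = sum(water) / n                                  # 21
ybar = sum(height) / n                                 # 79.7
sxx = sum((x - xbar) ** 2 for x in water)              # 188, equation (63)
syy = sum((y - ybar) ** 2 for y in height)             # 83.54, equation (64)
sxy = sum((x - xbar) * (y - ybar)
          for x, y in zip(water, height))              # 122.4, equation (65)
b = sxy / sxx                                          # 0.6510638, equation (66)
a = ybar - b * xbar                                    # 66.02766, equation (67)
sr2 = (syy - b * b * sxx) / (n - 2)                    # equation (68); about 0.385
                                                       # here (the notes round b to
                                                       # 0.65 first and get 0.411)
half = 2 * (sr2 / sxx) ** 0.5                          # half-width of (69)
print(b - half, b + half)                              # roughly 0.56 to 0.74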


Testing hypotheses
Background

In hypothesis testing we attempt to answer questions. Here are some simple examples. Is this coin fair? Is a woman as likely to be left-handed as a man is? Is there any difference between men and women so far as blood pressure is concerned? Does the amount of water given to a plant have any effect on its growth height?

We always re-phrase these questions as questions about parameters: If the probability of a head is θ, is θ = 1/2? If the probability that a woman is left-handed is θ1, and the probability that a man is left-handed is θ2, is θ1 = θ2? If the mean blood pressure for a woman is µ1, and the mean blood pressure for a man is µ2, is µ1 = µ2? Is β = 0? We re-phrase the questions in this way because we know how to estimate parameters and to get some idea of the precision of our estimates. So re-phrasing questions as questions about parameters helps us to answer them. Attempting to answer them is the activity of hypothesis testing.

The general approach to hypothesis testing

We will consider two equivalent approaches to hypothesis testing. The first approach pre-dates the availability of statistical packages, while the second approach is to some extent motivated by the availability of these packages. We will discuss both approaches. Both approaches involve five steps. The first three steps are the same in both approaches, and we consider these three steps first. We will illustrate all steps by considering two problems involving the binomial distribution.

Step 1

Statistical hypothesis testing involves the test of a null hypothesis (which we write in shorthand as H0) against an alternative hypothesis (which we write in shorthand as H1). The first step in a hypothesis testing procedure is to declare the relevant null hypothesis H0 and the relevant alternative hypothesis H1. The null hypothesis, as the name suggests, usually states that nothing interesting is happening. This comment is discussed in more detail below. The choice of null and alternative hypotheses should be made before the data are seen. Also, the nature of the alternative hypothesis must be decided before the data are seen: this too is discussed in more detail below. To decide on a hypothesis as a result of the data is to introduce a bias into the procedure, invalidating any conclusion that might be drawn

from it. Our aim is eventually to accept or to reject the null hypothesis as the result of an objective statistical procedure, using our data in making this decision.

It is important to clarify the meaning of the expression "the null hypothesis is accepted". This expression means that there is no statistically significant evidence for rejecting the null hypothesis in favor of the alternative hypothesis. A better expression for "accepting" is thus "not rejecting". So instead of saying "We accept H0", it is best to say "We do not have significant evidence to reject H0".

The alternative hypothesis will be one of three types: one-sided up, one-sided down, and two-sided. In any one specific situation, which one of these three types is appropriate must be decided in advance of getting the data. The context of the situation will generally make it clear which is the appropriate alternative hypothesis. All the above seems very abstract, so, as stated above, we will illustrate the steps in the hypothesis testing procedure by two examples, both involving the binomial distribution.

Example 1. It is essential for a gambling casino that the various games offered are fair, since an astute gambler will soon notice if they are unfair and bet accordingly. As a simplified example, suppose that one game involves flipping a coin, and it is essential, from the point of view of the casino operator, that this coin be fair. The casino operator now plans to carry out a hypothesis testing procedure. If the probability of getting a head on any flip of the coin is denoted θ, the null hypothesis H0 for the casino operator states that θ = 1/2. (No bias in the coin. Nothing interesting happening.)

In the casino example it is important, from the point of view of the casino operator, to detect a bias of the coin towards either heads or tails (if there is such a bias). Thus in this case the alternative hypothesis H1 is the two-sided alternative θ ≠ 1/2. This alternative hypothesis is said to be composite: it does not specify some numerical value for θ (as H0 does). Instead it specifies a whole collection of values. It often happens that the alternative hypothesis is composite.

Example 2. This example comes from the field of medical research. Suppose that we have been using some medicine for some illness for many years (we will call this the current medicine), and we in effect know from much experience that the probability of a cure with the current medicine is 0.84. A new medicine is proposed, and we wish to assess whether it is better than the current medicine. Here the only interesting possibility is that it is better than

the current medicine: if it is equally effective as the current medicine, or (even worse) less effective than the current medicine, we would not want to introduce it. Let θ be the (unknown) probability of a cure with the new medicine. Here the null hypothesis is θ = 0.84. If this null hypothesis is true, the new medicine is equally effective as the current one, since its cure rate would be equal to that of the current medicine. The natural alternative in this case is one-sided up, namely θ > 0.84. This is the only case of interest to us. This is also a composite hypothesis. Notice how, in both examples, the nature of the alternative hypothesis is determined by the context, and that in both cases the null and alternative hypotheses are stated before the data are seen.

Step 2

Since the decision to accept or reject H0 will be made on the basis of data derived from some random process, it is possible that an incorrect decision will be made: to reject H0 when it is true (a Type I error, or false positive), or to accept H0 when it is false (a Type II error, or false negative). This is illustrated in the following table:

                   H0 is true       H0 is false
We accept H0       OK               Type II error
We reject H0       Type I error     OK

When testing a null hypothesis against an alternative, it is not possible to ensure that the probabilities of making a Type I error and a Type II error are both arbitrarily small, unless we are able to make the number of observations as large as is needed to do this. In practice we are seldom able to get enough observations to do this. This dilemma is resolved in practice by observing that there is often an asymmetry in the implications of making the two types of error. In the two examples given above there might be more concern about making the false positive claim and less concern about making the false negative claim. This would be particularly true in the medicine example: we are anxious not to claim that the new medicine is better than the current one if it is not better. If we make this claim and the new medicine is not better than the current one, many millions of dollars will have been spent manufacturing the new medicine, only to find later that it is not better than the current one. For this reason, a frequently adopted procedure is to focus on the Type I error, and to fix the numerical value of this error at some acceptably low level (usually 1% or 5%), and not to attempt to control the numerical value of the Type II error. The value chosen is denoted α. The choice of the values 1% and 5% is reasonable, but is also clearly arbitrary. The choice of 1% is more conservative than the choice of 5% and is often

made in a medical context. Step 2 of the hypothesis testing procedure consists in choosing the numerical value for the Type I error, that is, in choosing the numerical value of α. This choice is entirely at your discretion. In the two examples that we are considering, we will choose 1% for the medical example and 5% for the coin example.
Step 3

The third step in the hypothesis testing procedure consists in determining a test statistic. This is the quantity calculated from the data whose numerical value leads to acceptance or rejection of the null hypothesis. In the coin example the natural test statistic is the number of heads that we get after we have flipped the coin in our testing procedure. In the medicine case the natural test statistic is the number of people cured with the new medicine in a clinical trial. These are both more or less obvious, and both are the correct test statistics; however, in more complicated cases the choice of a test statistic is not so straightforward.

As stated above, there are two (equivalent) approaches to hypothesis testing. Which approach we use is simply a matter of preference. As also stated above, the first three steps are the same for both approaches. Steps 4 and 5 differ under the two approaches, so we now consider them separately.
Approach 1, Step 4

Under Approach 1, Step 4 of the procedure consists in determining which observed values of the test statistic lead to rejection of H0. This choice is made so as to ensure that the test has the numerical value for the Type I error chosen in Step 2. We first illustrate this step with the medicine example, where the calculations are simpler than in the coin example. First we review Steps 1-3 in this example.

Step 1. We write the (unknown) probability of a cure with the new medicine as θ. The null hypothesis claims that θ = 0.84 and the alternative hypothesis claims that θ > 0.84.

Step 2. Since this is a medical example, we choose a Type I error of 1%.

Step 3. Suppose that we plan to give the new medicine to 5,000 patients. We will reject the null hypothesis if the number x of patients cured with the new medicine is large enough. In other words, x is our test statistic.

Now we proceed to Steps 4 and 5.

Step 4. How large does x, the number cured with the new medicine, have to be before we will reject the null hypothesis? We will reject the null hypothesis if x ≥ A, where A is


chosen so that the Type I error takes the desired value 1%. We now have to do a probability theory zig, and consider, before the clinical trial is conducted, the random variable X, the number of people who will be cured with the new medicine. Then we will reject the null hypothesis if x ≥ A, where (using a probability theory zig) A is chosen so that

P(X ≥ A when θ = 0.84) = 0.01.

How do we calculate the value of A? We will use the Central Limit Theorem and a Z chart. If θ = 0.84, the mean of X is (5,000)(0.84) = 4,200 and the variance of X is (5,000)(0.84)(0.16) = 672, using the formulas for the mean and the variance of a binomial random variable. (Why binomial? Because there are two possible outcomes for each patient: cured or not cured.) Next, to a very close approximation, X can be taken as having a normal distribution with this mean and this variance when the null hypothesis is true. So to this level of approximation, A has to be such that P(X ≥ A) = 0.01, where X has a normal distribution with mean 4,200 and variance 672. We now do a "z-ing": we want

P((X − 4,200)/√672 ≥ (A − 4,200)/√672) = 0.01.
Now, when the null hypothesis is true, (X − 4,200)/√672 is a Z, and the Z charts show that (A − 4,200)/√672 has to be equal to 2.326. (You have to use the Z chart "inside-out" to find this value.) Solving the equation (A − 4,200)/√672 = 2.326, we find that A = 4,260.30. To be conservative, we choose the value 4,261.

To conclude Step 4: we have made the calculation that if the number of people cured with the new medicine is 4,261 or more, we will reject the null hypothesis and claim that the new medicine is superior to the current one. It is now straightforward to do Step 5, so we do it.
Approach 1, Step 5

The final step in the testing procedure is straightforward. We do the clinical trial and count the number of people cured with the new medicine. If this number is 4,261 or larger, we reject the null hypothesis and claim that the new medicine is superior to the current one. If this number is less than 4,261, we say that we do not have significant evidence that the new medicine is better than the current one.
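A quick numerical check of the Step 4 calculation (a sketch, not part of the original notes); the z-value 2.326 is the 1% point of the normal distribution, as read from the Z chart in the text:

import math

# Critical point A with P(X >= A) = 0.01 under H0, normal approximation
n, theta0 = 5000, 0.84
mean = n * theta0                            # 4,200
sd = math.sqrt(n * theta0 * (1 - theta0))    # sqrt(672)
A = mean + 2.326 * sd                        # 4,260.30...
print(math.ceil(A))                          # 4,261, the conservative choice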


Note on this

The value 4,261 is sometimes called the critical point, and the range of values "4,261 or more" is sometimes called the critical region.

The coin example. First we review Steps 1, 2 and 3.

Step 1. We write θ as the probability of a head on each flip. The null hypothesis claims that θ = 1/2 and the alternative hypothesis claims that θ ≠ 1/2.

Step 2. We choose the numerical value of α as 5%.

Step 3. Suppose that we plan to flip the coin 10,000 times in our experiment. The test statistic is x, the number of heads that we get after we have flipped the coin 10,000 times.

Now we proceed to Steps 4 and 5.

Step 4. This test is two-sided, so we will reject the null hypothesis if x is either too large or too small. How large or how small? We will reject the null hypothesis if x ≤ A or if x ≥ B, where A and B have to be chosen so that α = 5%. We now go on a probability theory zig, and consider the random variable X, the random number of heads that we will get, before the experiment is done. We have to choose A and B so that

P(X ≤ A) + P(X ≥ B) = 0.05 when H0 is true.

Choosing A and B so as to satisfy this requirement ensures that the Type I error is indeed 5%. We usually adopt the symmetric requirement

P(X ≤ A) = P(X ≥ B) = 0.025 when H0 is true.

Let us first calculate B. When the null hypothesis is true, X has a binomial distribution with mean 5,000 and variance 2,500 (using the formulas for the mean and the variance of a binomial random variable). The standard deviation of X is thus √2,500 = 50. To a sufficiently close approximation, when the null hypothesis is true, X has a normal distribution with this mean and this standard deviation. Thus when the null hypothesis is true, (X − 5,000)/50 is a Z. Carrying out a Z-ing procedure, we get

P((X − 5,000)/50 ≥ (B − 5,000)/50) = 0.025.


Since (X − 5,000)/50 is a Z when the null hypothesis is true, the Z charts show that (B − 5,000)/50 = 1.96, and solving this equation for B we get B = 5,098.

Carrying out a similar operation for A, we find A = 4,902.

Step 5. This takes us straight to Step 5. We now flip the coin 10,000 times, and if the number of heads is 4,902 or fewer, or 5,098 or more, we reject the null hypothesis and claim that we have significant evidence that the coin is biased. If the number of heads is between 4,903 and 5,097 inclusive, we say that we do not have significant evidence to reject the null hypothesis. That is, we do not have significant evidence to claim that the coin is unfair.

Note on this

The values 4,902 and 5,098 are sometimes called the critical points, and the range of values x ≤ 4,902 or x ≥ 5,098 is sometimes called the critical region.

We now consider Approach 2 to hypothesis testing. As stated above, Steps 1, 2 and 3 are the same under Approach 2 as they are under Approach 1. So we now move to Steps 4 and 5 under Approach 2, again using the coin and the medicine examples.
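Before turning to Approach 2, here is the corresponding numerical check for the coin example (again a sketch, not part of the original notes); 1.96 is the 2.5% point of the normal distribution:

import math

# Two-sided critical points A and B with P(X <= A) = P(X >= B) = 0.025
n, theta0 = 10000, 0.5
mean = n * theta0                            # 5,000
sd = math.sqrt(n * theta0 * (1 - theta0))    # 50
print(mean - 1.96 * sd, mean + 1.96 * sd)    # 4,902.0 and 5,098.0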

Approach 2, Step 4

Under Approach 2 we now do our experiment and note the observed value of the test statistic. Thus in the medicine example we do the clinical trial (with the 5,000 patients) and observe the number of people cured under the new medicine. In the coin example we flip the coin 10,000 times and observe the number of heads that we get.

Approach 2, Step 5

This step involves the calculation of a so-called P-value. Once the data are obtained, we calculate the probability of obtaining the observed value of the test statistic, or one more extreme in the direction indicated by the alternative hypothesis, assuming that the null hypothesis is true. This probability is called the P-value. If the P-value is less than or equal to the chosen Type I error, the null hypothesis is rejected. This procedure always leads to a conclusion identical to that based on the significance point approach.

For example, suppose that in the medicine example the number of people cured under the new medicine was 4,272. Using the normal distribution approximation to the binomial, the P-value is the probability that a random variable X having a normal distribution with

mean 4,200 and variance 672 (the null hypothesis mean and variance) takes a value 4,272 or more. This is a straightforward probability theory zig operation, carried out using a Z-ing procedure and normal distribution charts. We have

P(X ≥ 4,272) = P((X − 4,200)/√672 ≥ (4,272 − 4,200)/√672),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, and (4,272 − 4,200)/√672 = 2.78, we obtain, from the Z chart, a P-value of 0.0027. This is less than the chosen Type I error (0.01), so we reject the null hypothesis. This is exactly the same conclusion that we would have reached using Approach 1, since the observed value 4,272 exceeds the critical point 4,261.

As a different example, suppose that the number cured with the new medicine was 4,250. This observed value does not exceed the critical point 4,261, so under Approach 1 we would not reject the null hypothesis. Using the P-value approach (Approach 2), we would calculate the P-value as

P(X ≥ 4,250) = P((X − 4,200)/√672 ≥ (4,250 − 4,200)/√672),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, and (4,250 − 4,200)/√672 = 1.93, we obtain, from the Z chart, a P-value of 0.0268. This is more than the Type I error of 0.01, so we do not have enough evidence to reject the null hypothesis. That is, we do not have enough evidence to claim that the new medicine is better than the current one. This conclusion agrees with the one we found under Approach 1.

The coin example. The P-value calculation for a two-sided alternative hypothesis, such as in the coin case, is more complicated than in the medicine example. Suppose, for example, that we obtained 5,088 heads from the 10,000 flips. This is 88 more than the null hypothesis mean of 5,000. The P-value is then the probability of obtaining 5,088 or more heads, plus the probability of getting 4,912 or fewer heads, if the coin is fair (that is, if the null hypothesis is true), since values 4,912 or fewer are as extreme as, or more extreme than, the observed value 5,088 for a two-sided alternative. For example, 4,906 is more extreme than 5,088 in that it differs from the null hypothesis mean (5,000) by 94, which is more than the 88 by which 5,088 differs. So, using a normal distribution approximation, the P-value is the probability that a random variable having a normal distribution with mean 5,000 and standard deviation 50 takes a value 5,088 or more, plus the probability that a random variable having a normal distribution with mean 5,000 and standard deviation 50 takes a value 4,912 or fewer. Doing a Z-ing, this is 0.0392 + 0.0392 = 0.0784. This exceeds the value chosen for α (0.05), so we do not have enough evidence to reject the null hypothesis. This agrees with the conclusion that we reached using the significance point approach (see Approach 1, Step 5).

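Finally, here is a Python sketch (not part of the original notes) of the three P-value calculations above. The function Phi is the standard normal distribution function, written with math.erf so that only the standard library is needed.

import math

def Phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Medicine example (one-sided): H0 mean 4,200, variance 672
sd = math.sqrt(672)
print(1 - Phi((4272 - 4200) / sd))   # about 0.0027: reject at the 1% level
print(1 - Phi((4250 - 4200) / sd))   # about 0.027: do not reject

# Coin example (two-sided): H0 mean 5,000, standard deviation 50
print(2 * (1 - Phi((5088 - 5000) / 50)))  # about 0.078: do not reject at 5%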