Statistics 111 Updated lecture notes

Fall 2013

Warren J. Ewens wewens@sas.upenn.edu Room 324 Leidy Labs (corner 38th St and Hamilton Walk). These notes provide an outline of some of the material to be discussed in the first few lectures of STAT 111. They are important because they provide the framework for all the material given during the entire course. Also, much of this material is not given in the textbook for the course.


Introduction and basic ideas
What is Statistics? Statistics is the science of analyzing data in whose generation chance has played some part. This explains why statistical methods are important in many areas, including for example sociology, psychology, biology, economics and anthropology. In these areas there are many chance mechanisms at work. First, in biology for example, the random transmission of one chromosome from a pair of chromosomes from parent to offspring introduces a chance mechanism into many areas of genetics. Second, in all the above areas data are usually derived from some random sample of individuals. A different sample would almost certainly yield different data, so that the sampling process introduces a second chance element. Finally, in economics the values of quantities such as the Dow Jones industrial average cannot be predicted in advance, since their values are affected by many chance events that we cannot know of in advance.

Everyone knows that you cannot make much progress in such areas as physics, astronomy and chemistry without using mathematics. Similarly, you cannot make much progress in such areas as psychology, sociology and biology without using statistics. Because of the chance, or random, aspect in the generation of statistical data, it is necessary, in discussing statistics, to also consider aspects of probability theory. The syllabus for this course thus starts with an introduction to probability theory, and this is reflected in these introductory notes. But before discussing probability theory, we have to discuss the relation between probability theory and statistics.

The relation between probability theory and statistics
Most of the examples given in the class concern simple situations and are not taken from the sociological, psychological, etc. contexts. This is done so that the basic ideas of probability theory and statistics will not be confounded with the complexities arising in those areas. So we start here with a simple example concerning the flipping of a coin. Suppose that we have a coin that we suspect of being biased towards heads. To check up on this suspicion we flip the coin (say) 2,000 times and observe the number of heads that we get. If the coin is fair, we would, beforehand, expect to see about 1,000 heads. If, once we flipped the coin, we got 1,973 heads we would obviously (and reasonably) claim that we have very good evidence that the coin is biased towards heads. If you think about it, the reasoning that you went through in coming to this conclusion was something like this: “If the coin is fair it is extremely unlikely that I would get 1,973 heads from 2,000 flips. Thus since I did in fact get 1,973 heads, I have strong evidence that the coin is unfair.” Equally obviously, if we got 1,005 heads, we would conclude that we do not have good evidence that the coin is biased towards heads. Again, the reason for coming to this conclusion is that a fair coin can easily give 1,005 (or more) heads from 2,000 flips.


But these are extreme cases, and reality often has to deal with more gray-area cases. What if we saw 1,072 heads? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,072 or more heads if the coin is fair. If this probability is low we might conclude that we have significant evidence that the coin is biased towards heads. If this probability is fairly high we might conclude that we do not have significant evidence that the coin is biased. The conclusion that we draw is an act of statistical inference, or a statistical induction. An inference, or an induction, is a conclusion that we draw about reality, based on some observation or observations. The reason why this is a statistical inference (or induction) is that it is based on a probability calculation. No statistical inference can be made without first making the relevant corresponding probability calculation. In the above example, probability theory calculations (which we will do later) show that the probability of getting 1,072 or more heads from 2,000 flips of a fair coin is very low (less than 0.01). Thus having observed 1,072 heads in our 2,000 flips, we would reasonably conclude that we have significant evidence that the coin is biased.

Here is a more important example. Suppose that we are using some medicine (the “current” medicine) to cure some illness. From long experience we know that the probability that this current medicine cures any given patient having this illness is 0.8. A new medicine is proposed as being better than the current one. To test whether this claim is justified we plan to conduct a clinical trial, in which the new medicine will be given to 2,000 people suffering from the disease in question. If the new medicine is as effective as the current one we would, beforehand, expect it to cure about 1,600 of these people. If after the clinical trial is conducted the proposed new medicine cured 1,945 people, no-one would doubt that it is better than the current medicine. Again, the reason for this opinion is something like: “If the new medicine has the same cure rate as the current one, it is extremely unlikely that it would cure 1,945 people out of 2,000. But it did cure 1,945 people, and therefore I have significant evidence that its cure rate is higher than that of the current medicine.” But, equally obviously, if the proposed medicine cured 1,615 people we do not have strong evidence that it is better than the current medicine. The reason for this is that if the new medicine is as effective as the current one, that is if the probability of a cure with the new medicine is the same (0.8) as that for the current medicine, we can easily observe 1,615 (or more) people cured with the new medicine.

Again these are extreme cases, and reality often has to deal with more gray-area cases. What if the new medicine cured 1,628 people? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,628 or more people cured with the new medicine if it is as effective as the current medicine. This probability is about 0.11, and because this is not a really small probability we might conclude that we do not have significant evidence that the new medicine is superior to the current one. Drawing this conclusion is an act of statistical inference. Statistics is a difficult subject for two reasons.
First, we have to think of the situation both before and after our experiment (in the medicine case the experiment is giving the new medicine to the individuals in the clinical trial), and go back and forth several times between these two time points in any statistical operation. This is not easy. Second, before the experiment we have to consider aspects of probability theory. Unfortunately our minds are not wired up well to think in terms of probabilities. (Think of the “two fair coins” example given in class, and also the Monty Hall situation.)

The central point is this: no statistical operation can be carried out without considering the situation before the experiment is performed. Because, at this time point, we do not know what will happen in our experiment, these considerations involve probability calculations. We therefore have to consider the general features of the probability theory “before the experiment” situation and the relation between these aspects and the statistical “after the experiment” aspects. We will do this later, after first looking more closely at the relation between the deductive processes of probability and the inductive processes of statistics.

Deductions (implications) and inductions (inferences)
Probability theory is a deductive activity, and uses deductions (also called implications). It starts with some assumed state of the world (for example that the coin is fair), and enables us to make various probability calculations relevant to our proposed experiment. Statistics is the converse, or inductive, operation, and uses inductions (also called inferences). It starts with data from our experiment and attempts to make objective statements about the unknown real world (whether the coin is fair or not). These inductive statements are always based on some probability calculation. The relation between probability and statistics can be seen from the following diagram:

Probability theory (deductive) →→→→→→

Some unknown reality and a hypothesis about it.

Uses data (what is observed in an experiment) to test this hypothesis.

←←←←←← Statistical inference (inductive) This diagram makes it clear that to learn how to conduct a statistical procedure we first have to discuss probability on its own. We now do this.
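As an illustration of the kind of probability calculation that underlies such an inference, the short Python sketch below (an illustration only; the course software is JMP, and the helper name tail_probability is just a made-up choice) computes the exact probability of seeing 1,072 or more heads in 2,000 flips of a fair coin, the calculation referred to in the coin example above.

```python
from math import comb

def tail_probability(n, k):
    """P(X >= k) when X is the number of heads in n flips of a fair coin:
    count the favourable outcomes and divide by the 2**n equally likely
    sequences of heads and tails."""
    favourable = sum(comb(n, x) for x in range(k, n + 1))
    return favourable / 2 ** n

# Probability of 1,072 or more heads in 2,000 flips of a fair coin.
# The result is well below 0.01, which is why observing 1,072 heads is
# taken as significant evidence that the coin is biased towards heads.
print(tail_probability(2000, 1072))
```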


Probability Theory
Events and their probabilities As has been discussed above, any discussion of Statistics requires a prior discussion of probability theory. In this section an introduction to probability theory is given as it applies to probabilities of events.
Events

An event is something which does or does not happen when some experiment is performed, field survey is conducted, etc. Consider for example a Gallup poll, in which (say) 2,000 people are asked, before an election involving two candidates, Smith and Jones, whether they will vote for Smith or Jones. Here are some events that could occur:
1. More people say that they will vote for Smith than say they will vote for Jones.
2. At least 1,200 people say that they will vote for Jones.
3. Exactly 1,124 people say that they will vote for Smith.
We will later discuss probability theory relating to Gallup polls. However, all the examples given below relate to events involving rolling a six-sided die, since that is a simple and easily understood situation. Here are some events that could occur in that context:
1. An even number turns up.
2. The number 3 turns up.
3. A 3, 4, 5 or a 6 turns up.
Clearly there are many other events that we could consider. Also, with two or more rolls of the die, we have events like “a 6 turns up both times on two rolls of a die”, “in ten rolls of a die, a 3 never turns up”, and so on.
Notation

We denote events by upper-case letters at the beginning of the alphabet, and also the letter S and the symbol ∅. So in the die-rolling example we might have:
1. A is the event that an even number turns up.
2. B is the event that the number 3 turns up.
3. C is the event that a 3, 4, 5 or a 6 turns up.
The letter S has a special meaning. In the die-rolling example it is the event that the number turning up is 1, 2, 3, 4, 5 or 6. In other words, S is the certain event. It comprises all possible events that could occur.


The symbol ∅ also has a special meaning. This is the so-called “empty” event. It is an event that cannot occur, such as rolling both an even number and an odd number in one single roll of a die. It is an important event when considering intersections of events - see below.

Unions, intersections and complements of events
Given a collection of events we can define various derived events. The most important of these are unions of events, intersections of events, and complements of events. These are defined as follows:
(i) Unions of events: If D and E are events, the union of D and E, written D ∪ E, is the event that either D, or E, or both occur.
(ii) Intersections of events: If D and E are events, the intersection of D and E, written D ∩ E, is the event that both D and E occur.
(iii) Complements of events: If D is an event, Dc is the complementary event to D. It is the event that D does not occur.
In the three examples above, A ∪ B is the event that a 2, 3, 4 or 6 turns up, A ∪ C is the event that a 2, 3, 4, 5 or a 6 turns up, and B ∪ C is the event that a 3, 4, 5 or a 6 turns up. (Notice that in this case B ∪ C is the same as C.) A ∩ B is the empty event ∅, since A and B cannot both occur. A ∩ C is the event that a 4 or a 6 turns up, and B ∩ C is the event that the number 3 turns up. (Notice that in this case B ∩ C is the same as B.) Ac is the event that an odd number turns up, Bc is the event that some number other than 3 turns up, and Cc is the event that a 1 or a 2 turns up.

Probabilities of events
The concept of a probability is quite a complex one. These complexities are not discussed here: we will be satisfied with a straightforward intuitive concept of probability as in some sense meaning a long-term frequency. Thus we would say, for a fair coin, that the probability of a head is 1/2, in the sense that we think that in a very large number of flips of this coin, we will get a head almost exactly half the time. We are interested here in the probabilities of events, and we write the probability of the event A as P(A), the probability of the event B as P(B), and so on.

Probabilities of derived events
The probabilities for the union and the intersection of two events are linked by the following equation. If D and E are any two events,
P(D ∪ E) = P(D) + P(E) − P(D ∩ E).
It is always true that for any event D,
P(Dc) = 1 − P(D).
This is obvious: the probability that D does not occur is 1 minus the probability that D does occur. Notice also that the probability of the empty event ∅ is 0.
Suppose that the die in the die-rolling example is fair. Then the probabilities of the various union and intersection events discussed above are as follows:
P(A) = 1/2, P(B) = 1/6, P(C) = 2/3, P(A ∪ B) = 2/3, P(A ∪ C) = 5/6, P(B ∪ C) = 2/3, P(A ∩ B) = 0, P(A ∩ C) = 1/3, P(B ∩ C) = 1/6.
To check the union equation above, we note that P(A ∪ C) = 5/6, and this is given by P(A) + P(C) − P(A ∩ C) = 1/2 + 2/3 − 1/3 = 5/6. The other examples can be checked similarly. The calculations above assume that the die is fair. For an unfair die we might reach a different conclusion than the one that we reach for a fair die.

Independence of events
Two events D and E are said to be independent if P(D ∩ E) = P(D) × P(E). The intuitive meaning of independence is that if two events are independent, and if you are told that one of them has occurred, then this information does not change the probability that the other event occurs. Thus in the above example, if you are given the information that an even number turned up (event A), then the probability that a 3, 4, 5 or a 6 turns up (event C) is still 2/3, which is the probability of the event C without this information being given. Similarly if you are told that the number that turned up was a 3, 4, 5 or 6, then the probability that an even number turned up is still 1/2. (The calculations confirming this are given in the next section.) It can be seen from the calculations given above that A and B are not independent, B and C are not independent, but that A and C are independent.
The calculations change if the die is biased. For example, if the die is biased so that the probabilities for a 1, 2, 3, 4, 5 or 6 turning up are, respectively, 0.1, 0.3, 0.1, 0.2, 0.2 and 0.1, then P(A) = 0.6, P(C) = 0.6 and P(A ∩ C) is 0.3. Since 0.6 × 0.6 = 0.36, which is not equal to 0.3, the events A and C are now not independent.
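The fair-die probabilities above can be checked by brute force. The sketch below (an illustrative Python fragment, not part of the notes) lists the six equally likely outcomes, computes P(A), P(C), P(A ∪ C) and P(A ∩ C) as exact fractions, and confirms both the union equation and the independence of A and C.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}          # each face has probability 1/6
A = {2, 4, 6}                           # an even number turns up
C = {3, 4, 5, 6}                        # a 3, 4, 5 or 6 turns up

def prob(event):
    return Fraction(len(event & outcomes), len(outcomes))

p_union = prob(A | C)                   # P(A ∪ C)
p_inter = prob(A & C)                   # P(A ∩ C)

# Union equation: P(A ∪ C) = P(A) + P(C) - P(A ∩ C) = 5/6
assert p_union == prob(A) + prob(C) - p_inter

# Independence: P(A ∩ C) = P(A) x P(C) = 1/2 x 2/3 = 1/3
assert p_inter == prob(A) * prob(C)

print(prob(A), prob(C), p_union, p_inter)
```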

Conditional probabilities
We often wish to calculate the probability of some event D, given that some other event E has occurred. Such a probability is called a conditional probability, and is denoted P(D|E). The conditional probability P(D|E) is calculated by the formula
P(D|E) = P(D ∩ E) / P(E).     (1)
It is essential to calculate P(D|E) using this formula: using any other approach, and in particular using “common sense”, will usually give an incorrect answer.
If the events D and E are independent, then P(D|E) = P(D). In other words, D and E are independent if the knowledge that E has occurred does not change the probability that D occurs. In the “fair die” example given above, equation (1) shows that P(A|C) = (1/3)/(2/3) = 1/2, and this is equal to P(A). This confirms that A and C are independent (for a fair die). In the “unfair die” example given above, equation (1) shows that P(A|C) = 0.3/0.6 = 0.5, and this is not equal to P(A), which is 0.6. This confirms that for this unfair die A and C are not independent.

Probability: One Discrete Random Variable

Random variables and data
In this section we define some terms that we will use often. We do this in terms of the coin flipping example. Before we flip the coin the number of heads that we will get is unknown to us. It is a concept of our mind - at this stage we do not know its value. This number is therefore called a “random variable”. It is a “before the experiment” concept. In the coin example the random variable X is the “concept of our mind” number of heads we will get, tomorrow, when we flip the coin.
By the word “data” we mean the observed value of a random variable once some experiment is performed. In the coin example, once we have flipped the coin the “data” is simply the number of heads that we did get. It is the observed value of the random variable once the “experiment” of flipping the coin is carried out. It is an “after the experiment” concept.
To assist us with keeping the distinction between random variables and data clear, and as a matter of notational convention, a random variable (a “before the experiment is carried out” concept) is always denoted by an upper case Roman letter. We use the upper-case letter X in these notes for this purpose. The notation for the “after the experiment is done” data is the corresponding lower case letter. So after we have flipped the coin we would denote the number of heads that we did get by the corresponding lower-case letter x. Thus it makes sense, after the coin has been flipped, to say “x = 1,142”. It does not make sense before the coin is flipped to say X = 1,142. This second statement “does not compute”.
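Returning to the conditional probability formula (1), the sketch below (again an illustrative Python fragment; the face probabilities for the biased die are the ones used in the biased-die example above) applies the formula to both the fair and the biased die.

```python
def prob(event, face_probs):
    """P(event) for a die whose face probabilities are given as a dict."""
    return sum(p for face, p in face_probs.items() if face in event)

def conditional(d_event, e_event, face_probs):
    """P(D | E) = P(D and E) / P(E), as in equation (1)."""
    return prob(d_event & e_event, face_probs) / prob(e_event, face_probs)

A, C = {2, 4, 6}, {3, 4, 5, 6}
fair = {face: 1 / 6 for face in range(1, 7)}
biased = {1: 0.1, 2: 0.3, 3: 0.1, 4: 0.2, 5: 0.2, 6: 0.1}

print(conditional(A, C, fair))     # 1/2, equal to P(A): A and C independent
print(conditional(A, C, biased))   # 0.5, but P(A) = 0.6: not independent
```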

There are therefore two notational conventions that we always use: upper-case Roman letters for random variables, lower-case Roman letters for data. We will later find a third notational convention (for “parameters”).

Definition: one discrete random variable
It is convenient to consider separately the cases of discrete and continuous random variables. Continuous random variables will be considered in a later section. In this section we give informal definitions for discrete random variables and their probability distributions rather than the formal definitions often found in statistics textbooks.
A discrete random variable is a numerical quantity that in some future experiment that involves some degree of randomness will take one value from some discrete set of possible values. In practice the possible values of a discrete random variable often consist of the numbers 0, 1, 2, ..., k, for some number k. These possible values are usually known before the experiment: in the coin example the possible values of X, the number of heads that will turn up, tomorrow, when we will flip the coin 2,000 times, are clearly 0, 1, 2, ..., 2,000.

The probability distribution of a discrete random variable; parameters
The probability distribution of a discrete random variable X is a listing of the possible values that this random variable can take, namely v1, v2, ..., vk, together with their respective probabilities P(v1), P(v2), ..., P(vk). If there are k possible values of X, this probability distribution can be written generically as

Possible values of X:        v1      v2      ...     vk
Respective probabilities:    P(v1)   P(v2)   ...     P(vk)     (2)

In some cases we know (or hypothesize) the probabilities of the possible values v1, v2, ..., vk. For example, if in the coin example we know that the coin is fair, the probability distribution of X, the number of heads that we would get on two flips of the coin, is:

Possible values of X:        0      1      2
Respective probabilities:    .25    .50    .25     (3)

Here P(0) = .25, P(1) = .50, P(2) = .25. More generally, if the probability of getting a head on any flip is some value θ, the probability distribution of X is:

Possible values of X:        0           1            2
Respective probabilities:    (1 − θ)²    2θ(1 − θ)    θ²     (4)

In this case,
P(0) = (1 − θ)², P(1) = 2θ(1 − θ), P(2) = θ².     (5)
Here θ is a so-called “parameter”: see more on these below. The factor 2 in (5) is an example of a binomial coefficient - see (6) below. It reflects the fact that there are two orders (success followed by failure and failure followed by success) in which we can obtain one success and one failure in two trials. The probability distribution (5) can be generalized to the case of an arbitrary number of flips of the coin.

The binomial distribution
There are many important discrete probability distributions that arise often in the applications of probability and statistics to real-world problems. Each one of these distributions is appropriate under some collection of requirements specific to that distribution. Here we focus only on the most important of these distributions, namely the binomial distribution, and consider first the requirements for it to be appropriate.
The binomial distribution arises if, and only if, all four of the following requirements hold. First, we plan to conduct some fixed number n of trials. (By “fixed” we mean fixed in advance, and not, for example, determined by the outcomes of the trials as they occur.) Second, there must be exactly two possible outcomes on each trial. The two outcomes are often called, for convenience, “success” and “failure”. (Here we might regard getting a head on the flip of a coin as a success and a tail as a failure.) Third, the various trials must be independent - the outcome of any trial must not affect the outcome of any other trial. Finally, the probability of success must be the same on all trials. We often denote the probability of success on each trial by θ, since in practice this is often unknown: it is a parameter. One must be careful when using a binomial distribution that all four of these conditions hold. We reasonably believe that these conditions hold when flipping a coin.
The random variable of interest is the total number X of successes in the n trials. The probability distribution of X is given by the (binomial distribution) formula
P(x) = (n choose x) θ^x (1 − θ)^(n−x),   x = 0, 1, 2, ..., n.     (6)
In the expression (6), θ is the parameter and n is called the index of the binomial distribution. The binomial coefficient, written (n choose x) and spoken as “n choose x”, is the number of different orderings in which x successes can arise in the n trials. The probabilities in (5) are binomial distribution probabilities for the case n = 2, and can be found from (6) by putting n = 2 and considering the respective values x = 0, 1 and 2.
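A direct way to see formula (6) at work is to compute it numerically. The following Python sketch (illustrative only; the helper name binomial_pmf is not from the notes) evaluates (6) and checks that the n = 2 case reproduces the probabilities in (5).

```python
from math import comb

def binomial_pmf(x, n, theta):
    """P(x) = (n choose x) theta^x (1 - theta)^(n - x), as in (6)."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

theta = 0.3                        # any value of the parameter will do
# n = 2 reproduces (5): (1 - theta)^2, 2 theta (1 - theta), theta^2
print([binomial_pmf(x, 2, theta) for x in range(3)])

# The probabilities over x = 0, 1, ..., n always add to 1.
print(sum(binomial_pmf(x, 20, theta) for x in range(21)))
```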

Parameters
The quantity θ introduced above is a “parameter”. In general a parameter is some unknown constant. In the binomial case it is the unknown probability of success in (6). Almost all of Statistics consists of:
(i) Estimating the value of a parameter.
(ii) Giving some idea of the precision of our estimate of a parameter (sometimes called the “margin of error”).
(iii) Testing hypotheses about the numerical value of a parameter.
In the coin example, these would be:
(i) Estimating the value of the binomial parameter θ.
(ii) Giving some idea of the precision of our estimate of this parameter.
(iii) Testing hypotheses about the numerical value of this parameter, for example testing the hypothesis that θ = 1/2.
We shall consider these three activities later in the course.

The mean of a discrete random variable
The mean of a random variable is often confused with the concept of an average, and it is important to keep a clear distinction between the two concepts. The mean of the discrete random variable X whose probability distribution is given in (2) above is defined as
v1 P(v1) + v2 P(v2) + · · · + vk P(vk).     (7)
In more mathematical notation this is
Σ vi P(vi),     (8)
the summation being over all possible values (v1, v2, ..., vk) that the random variable X can take.
As an example, the mean of a random variable having the binomial distribution (6) is
Σ x (n choose x) θ^x (1 − θ)^(n−x),     (9)
the summation being over x = 0, 1, ..., n, and this can be shown, after some algebra, to be nθ.
As a second example, consider the (random) number (which we denote by X) to turn up when a die is rolled. The possible values of X are 1, 2, 3, 4, 5 and 6. If the die is fair, each

of these values has probability 1/6. Application of equation (7) shows that the mean of X is
1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + 5 × 1/6 + 6 × 1/6 = 3.5.     (10)
Suppose on the other hand that the die is unfair, and that the probability distribution of the (random) number X to turn up is:

Possible values of X:        1      2      3      4      5      6
Respective probabilities:    0.15   0.25   0.10   0.15   0.30   0.05     (11)

In this case the mean of X is
1 × 0.15 + 2 × 0.25 + 3 × 0.10 + 4 × 0.15 + 5 × 0.30 + 6 × 0.05 = 3.35.     (12)
There are several important points to note about the mean of a discrete random variable:
(i) The notation µ is often used for a mean. µ is a parameter, and this is why we use Greek notation for it. In many practical situations the mean µ of a discrete random variable X is unknown to us, because we do not know the numerical values of the probabilities P(x). That is to say, for example, if in the binomial distribution case we do not know the value of the parameter θ, then we do not know the value µ (= nθ) of the mean of that distribution.
(ii) The mean of a probability distribution is its center of gravity, that is its “knife-edge balance point”.
(iii) Testing hypotheses about the value of a mean is perhaps the most important of statistical operations. An important example of tests of hypotheses about means is a t test. Different t tests will be discussed in this course.
(iv) The word “average” is not an alternative for the word “mean”, and has a quite different interpretation from that of “mean”. This distinction will be discussed often in class.

The variance of a discrete random variable
A quantity of importance equal to that of the mean of a random variable is its variance. The variance (denoted by σ²) of the discrete random variable X whose probability distribution is given in (2) above is defined by
σ² = (v1 − µ)² P(v1) + (v2 − µ)² P(v2) + · · · + (vk − µ)² P(vk).     (13)
In more mathematical terms we write this as
σ² = Σ (vi − µ)² P(vi),     (14)

the summation being taken over all possible values of the random variable X.
As an example, in the case of a fair die, we have already calculated (in (10)) the mean of X, the (random) number to turn up on a roll of the die, to be 3.5. Application of (13) shows that the variance of X is
σ² = (1 − 3.5)² × 1/6 + (2 − 3.5)² × 1/6 + (3 − 3.5)² × 1/6 + (4 − 3.5)² × 1/6 + (5 − 3.5)² × 1/6 + (6 − 3.5)² × 1/6 = 35/12.     (15)
There are several important points to note about the variance of a discrete random variable:
(i) The variance has the standard notation σ².
(ii) The variance is a measure of the dispersion of the probability distribution of the random variable around its mean. Thus a random variable with a small variance is likely to be close to its mean (see Figure 1).

[Figure 1: two probability distributions, one with smaller variance and one with larger variance.]

(iii) A quantity that is often more useful than the variance of a probability distribution is the standard deviation. This is defined as the positive square root of the variance, and (naturally enough) is denoted by σ.
(iv) The variance, like the mean, is often unknown to us, as anticipated above. It is a parameter, and this is why we denote it by a Greek letter.
(v) The variance of the number of successes in the binomial distribution (6) can be shown, after some algebra, to be nθ(1 − θ).
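The mean (7) and variance (13) are easy to compute once a probability distribution is written down as a list of (value, probability) pairs. The short Python sketch below (illustrative, not part of the notes) reproduces the fair-die values 3.5 and 35/12 and the unfair-die mean 3.35.

```python
def mean(dist):
    """Equation (7): the sum of value times probability."""
    return sum(v * p for v, p in dist)

def variance(dist):
    """Equation (13): the sum of (value - mean)^2 times probability."""
    mu = mean(dist)
    return sum((v - mu) ** 2 * p for v, p in dist)

fair_die = [(v, 1 / 6) for v in range(1, 7)]
unfair_die = [(1, 0.15), (2, 0.25), (3, 0.10), (4, 0.15), (5, 0.30), (6, 0.05)]

print(mean(fair_die), variance(fair_die))   # 3.5 and 35/12 = 2.9166...
print(mean(unfair_die))                      # 3.35, as in (12)
print(variance(fair_die) ** 0.5)             # the standard deviation
```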

Many Random Variables

Introduction
Almost every application of statistical methods in psychology, sociology, biology and similar areas requires the analysis of many observations. For example, if a psychologist wanted to assess the effects of sleep deprivation on the time needed to answer the questions in a questionnaire, he/she would want to test a fairly large number of people in order to get reasonably reliable results, and thus plan to get many observations. Similarly, if we wish to test whether a die is fair, we would plan to roll it many times. Here the observations are the numbers that turn up on the various rolls of the die. In line with the approach in this course, ideas about many observations will often be discussed in the simple case of rolling a die (fair or unfair) many times. To assess the implications of the numbers which do, later, turn up when we get around to rolling the die, and of the times needed in the sleep deprivation example, we have to consider, before making our assessment, the probability theory for many random variables.

Notation
Since we are now considering many random variables, the notation “X” for one single random variable is no longer sufficient for us. Suppose that in the die example we denote the planned number of rolls by n. Before we actually roll the die, the numbers that will turn up on the various rolls are all random variables. We denote the first random variable by X1, the second by X2, and so on. That is, we would denote the (random) number that will turn up on the first roll of the die by X1, the (random) number that will turn up on the second roll of the die by X2, and so on, up to the (random) number that will turn up on the n-th roll of the die, which we denote by Xn. Similarly, in the sleep deprivation example, the various times that the people in the experiment will need to answer the questions are all random variables before this experiment is performed. For example, the time X6 that the sixth person will take to complete the questionnaire was unknown before the experiment was conducted. It was a random variable.
As with a single random variable (see notes, page 8), we need a separate notation for the actual observed numbers that did turn up once the die was rolled (n times). We denote these by x1, x2, ..., xn. To assess (for example) whether we can reasonably assume that the die is fair we would use these numbers. Thus a statement in the die example like “X6 = 4” makes no sense. It “does not compute”. By contrast, once the die has been rolled, the statement “x6 = 4” does make sense. It means that a 4 turned up on the sixth roll of the die. As with the sleep deprivation example, a statement like “X6 = 23.7” does not make sense. On the other hand, a statement like “x6 = 23.7” does make sense. It means that the time that the sixth person in the experiment took to complete the questionnaire was 23.7 minutes.

Independently and identically distributed random variables
The die example introduces two important concepts. First, since it is the same die that is being rolled each time, we would reasonably assume that the random variables X1, X2, ..., Xn all have the same probability distribution. That is, we would assume that the probability that a three turns up on roll 77 (whatever it might be) is the same as the probability that a three turns up on roll 144. Second, we would reasonably assume that the value of any one of these would not affect the value of any other one. Whatever number turned up on roll 77 has no influence on the number turning up on roll 144. That is, we would reasonably assume that X1, X2, ..., Xn are all independent of each other.
The assumptions that the various random variables X1, X2, ..., Xn are all independent of each other, and that they all have the same probability distribution, are often made in the application of statistical methods. Random variables which are independent of each other, and which all have the same probability distribution, are said to be iid (independently and identically distributed). This concept is discussed again below.
However, in areas such as psychology, sociology and biology that are more scientifically important and complex than rolling a die, the assumption of identically and independently distributed random variables might not be reasonable. If the people in the experiment were not all of the same age it might not be reasonable to assume that the times needed are identically distributed - people of different ages might perhaps be expected to tend to need different amounts of time. Further, if twin sisters were used in the sleep deprivation example, the times that they take to complete the questionnaire might not be independent, since we might expect them to be quite similar because of the common environment and genetic make-up of the twins. Thus in practice care must often be exercised and common sense used when applying the theory of iid random variables in areas such as psychology, sociology and biology.

The mean and variance of a sum and of an average
Given n random variables X1, X2, ..., Xn, two very important derived random variables are their sum, denoted by Tn and defined as
Tn = X1 + X2 + · · · + Xn,     (16)
and their average, denoted by X̄ and defined by
X̄ = (X1 + X2 + · · · + Xn)/n = Tn/n.     (17)
In the die example we do not know, before we roll the die, what the sum or the average of the n numbers that will turn up will be. Since both Tn and X̄ are functions of the random variables X1, X2, ..., Xn, they are themselves random variables. Both the sum and the average, being random variables, each have a probability distribution, and thus each has a mean and a variance. These must be related in some way to the

mean and the variance of each of X1, X2, ..., Xn. The general theory of many random variables shows that if X1, X2, ..., Xn are independent random variables with respective means µ1, µ2, ..., µn and respective variances σ1², σ2², ..., σn², then
mean of Tn = µ1 + µ2 + · · · + µn,    variance of Tn = σ1² + σ2² + · · · + σn²,     (18)
and the mean and the variance of the random variable X̄ are given respectively by
mean of X̄ = (µ1 + µ2 + · · · + µn)/n,    variance of X̄ = (σ1² + σ2² + · · · + σn²)/n².     (19)
Next, suppose that X1, X2, ..., Xn are iid, with (common) mean µ and (common) variance σ². Then
mean of Tn = nµ,    variance of Tn = nσ²,    mean of X̄ = µ,    variance of X̄ = σ²/n.     (20)
In STAT 111 we will call these four formulas the four “magic formulas” and will refer to them often. Thus you have to know them by heart. The formulas in (20) are, respectively, special cases of the formulas in (18) and (19).

Two generalizations
Equations (18) and (19) apply of course in the particular case n = 2. However in this case two further equations are important, especially when making “comparison studies”. Suppose first that X1 and X2 are iid, with (common) mean µ and (common) variance σ². If we define the random variable D by D = X1 − X2 (think of D standing for “difference”), then
mean of D = 0,     (21)
variance of D = 2σ².     (22)
These are also “magic formulas” and we will refer to them several times. More generally, if Xi and Xj are independent random variables with respective means µi and µj and respective variances σi² and σj², the generalization of the formulas in (21) and (22) is that Dij, defined by Dij = Xi − Xj, is a random variable and that
mean of Dij = µi − µj,    variance of Dij = σi² + σj².     (23)
The formulas in (21) and (22) are, respectively, special cases of these formulas. Thus you also have to know these two formulas by heart.
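The four “magic formulas” in (20) can be checked by simulation. The sketch below (a Python illustration; the course itself uses JMP for simulations of this kind) simulates many groups of n fair-die rolls and compares the observed mean and variance of the sum and of the average with nµ, nσ², µ and σ²/n.

```python
import random

def simulate(n, repeats=100_000):
    """Simulate 'repeats' experiments, each consisting of n fair-die rolls."""
    sums, avgs = [], []
    for _ in range(repeats):
        rolls = [random.randint(1, 6) for _ in range(n)]
        sums.append(sum(rolls))
        avgs.append(sum(rolls) / n)
    return sums, avgs

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

n, mu, sigma2 = 10, 3.5, 35 / 12
sums, avgs = simulate(n)
print(mean_var(sums))   # close to (n*mu, n*sigma2) = (35, 29.17)
print(mean_var(avgs))   # close to (mu, sigma2/n)   = (3.5, 0.2917)
```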

each Xi has mean 3.An example of the use of equations (19) In the case of a fair die. In some applications it is necessary to consider instead the proportion of successes in these trials (more exactly. the variance of X ¯ is (35/12. or deduction. This claim is an act of Statistics. This small standard deviation implies that once we roll the die 1. the proportion of trials leading to success). Because P is a random variable it has a mean and variance.0540 = 3.382.5 + 2 × 0. as given by (12) and (15). (24) These equations bear a similarity to the formulas for the mean and variance of an average given in (19). We will later do a JMP experiment to confirm this. It is an implication. it is very likely that the observed average of the numbers that actually turned up will be very close to 3. It has a n probability distribution which can be found from the binomial distribution (6) since the probability that P = i/n is the same as the probability that X = i for any value of i. If X is the number of successes in n binomial trials. and the observed average x ¯ is 3. and thus standard deviation is (35/12). 1) 1 2 P is a discrete random variable. the ¯ = (X1 + X2 + · · · + X1.608. 000). when testing for the equality of two binomial parameters. The proportion of successes in n binomial trials The random variable in the binomial distribution is the number of successes in n binomial trials. 17 .. n . then this proportion is X/n.000 times. all of which will be based on the relevant corresponding probability theory calculation.5.. n . the probability that the observed average x ¯ is ¯ ¯ between its mean of X minus two standard deviations of X .5 and variance 35/12. Therefore the standard deviation of X or about 0. (that is 3. This is no more than what intuition suggests.0540. with probability distribution given in (6).0540 = ¯ plus two standard deviations of X ¯ . 35/12. On the other hand if n. second equation in (19).2 × 0. These are mean of P = θ.000 times.708. the range within which the average is about 95% likely to lie if the die is fair. variance of P = θ(1 − θ)/n. This statement is one of probability theory.000. (n− .. It is an inference.000 )/1. 1. Then we have good evidence that the die is not fair. why it is often necessary to operate via the proportion of trials giving success rather than by the number of trials giving success. which we will denote by P . So here is a window into Statistics. 000 is. We will later make many statistical inferences. (that is 3.5 . We will see later. or about 1. Later we will see that if the die is fair.608) is about 95%. .000. from the number of rolls.392) and its mean of X 3. or induction. This is outside the range 3.392 to 3. If we have now rolled the die 1. and its possible values are 0. is 1.

The standard deviation and the standard error
In the die example in the previous section, the standard deviation of X̄ = (X1 + X2 + · · · + X1,000)/1,000 is about 0.0540. The standard deviation of an average such as this is sometimes called “the standard error of the mean”. (This terminology is unfortunate and causes much confusion - it should be “the standard deviation of the average”. How can a mean have a standard deviation? A mean is a parameter, that is some constant number whose value is often unknown to us, and only random variables can have a standard deviation.) Many textbooks use this unfortunate terminology. (The textbook for this course, and other textbooks, are sometimes not too good on making this distinction.) Watch out for it.

Means and averages
It is crucial to remember that a mean and an average are two entirely different things. A mean is a parameter. It is a parameter which we might wish to estimate, or to test hypotheses about (for example that it is 3.5). We will always denote a mean by the Greek letter µ. For example, with an unfair die for which the probabilities for the number to turn up on any roll are unknown, the mean of the number to turn up is unknown.
By contrast, an average as defined above (i.e. X̄) is a random variable. It is a “before the experiment” concept, and a concept of probability theory. It has a probability distribution and thus has a mean and a variance. Thus it makes sense to say (as (19) and (20) say) “the mean of the average is such and such”. There is also a second concept of an average: the “after the experiment” observed average x̄, a calculated number and a concept of Statistics. This is the actual average x̄ of the numbers that actually turned up once the 1,000 rolls were completed. This is a number, for example 3.382. You can think of this as the realized value of the random variable X̄ once the rolling had taken place.
Thus there are three related concepts: first, a mean (a parameter); second, a “before the experiment” average X̄ (a random variable, and a concept of probability theory); and third, an “after the experiment” average x̄ (a calculated number, and a concept of Statistics). They are all important and must not be confused with each other. Why do we need all three concepts? Suppose that we wish to estimate a mean (the first concept), or to test some hypothesis about a mean. We would do this by using the third concept, the “after the experiment” observed average x̄. How good x̄ is as an estimate of the mean, or what hypothesis testing conclusion we might draw given the data, both depend on the properties of the random variable X̄ (the second concept), in particular its mean and variance.

Continuous Random Variables

Definition
Some random variables by their very nature are discrete, such as the number of heads in 2,000 flips of a coin. Other random variables, by contrast, are continuous. Continuous random variables can take any value in some continuous range of values. Measurements such as height and blood pressure are of this type.
We use the same notation for continuous random variables as we do for discrete random variables, so that we denote a continuous random variable in upper case, for example by X, and use this notation throughout. Here we denote the range of a continuous random variable by (L, H). (L = lowest possible value, H = highest possible value of the continuous random variable.)
Probabilities for continuous random variables are not allocated to specific values, but rather are allocated to ranges of values. The probability that a continuous random variable takes some specified numerical value is zero. Every continuous random variable X has an associated density function f(x). The density function f(x) is the continuous random variable analogue of a discrete random variable probability distribution such as (6). This density function can be drawn as a curve in the (x, f(x)) plane. (Examples will be given in class.) The probability that the random variable X takes a value in some given range a to b is the area under this curve between a and b. From a calculus point of view (for those who have a good calculus background), this probability is obtained by integrating the density function over the range a to b. That is, the probability that the (continuous) random variable X having density function f(x) takes a value between a and b (with a < b) is given by
P(a < X < b) = ∫_a^b f(x) dx.     (25)
For those who do not have a calculus background, don’t worry - we will never do any of these integration procedures. Because the probability that a continuous random variable takes some specified numerical value is zero, the three probabilities Prob(a ≤ X < b), Prob(a < X ≤ b) and Prob(a ≤ X ≤ b) are also given by the right-hand side of (25). As a particular case of equation (25),
∫_L^H f(x) dx = 1.     (26)
This equation simply states that a random variable must take some value in its range of possible values.

Many statistical procedures involve estimating, and testing hypotheses about, the mean and the variance of continuous random variables. In a research context the mean µ and the variance σ² of the random variable of interest are often unknown to us. As is indicated by the Greek notation that we use for them, they are parameters.

The mean and variance of a continuous random variable
The mean µ and variance σ² of a continuous random variable X having range (L, H) and density function f(x) are defined respectively by
µ = ∫_L^H x f(x) dx     (27)
and
σ² = ∫_L^H (x − µ)² f(x) dx.     (28)
As stated above, the probability that a continuous random variable takes a value between a and b is found by a calculus operation, and so are these definitions. We will never do any of these integration procedures, so if you do not have a calculus background, don’t worry about it. The main thing to remember is that these definitions are the natural analogues of the corresponding definitions for a discrete random variable. That is, the remarks about the mean and the variance of a continuous random variable are very similar to those of a discrete random variable given above: the mean is the “center of gravity”, or the “knife-edge balance point”, of the density function f(x), and the variance is a measure of the dispersion, or “spread-out-ness”, of the density function around the mean. (Examples will be given in class.)

The normal distribution
There are many continuous probability distributions relevant to statistical operations. We discuss the most important one in this section, namely the normal, or Gaussian, distribution. The (continuous) random variable X has a normal, or Gaussian, distribution if its range (i.e. set of possible values) is (−∞, ∞) and its density function f(x) is given by
f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)).     (29)
(Here π is the well-known geometrical value of about 3.1416 and e is the equally important exponential constant of about 2.7183.) The shape of this density function is the famous (infamous?) “bell-shaped curve”. It can be shown that the mean of this normal distribution is µ and its variance is σ², and, as (29) shows, these parameters are built into the functional form of the distribution. A random variable having this distribution is said to be an N(µ, σ²) random variable.
As with any continuous random variable, the probability that a normally distributed random variable takes a value between a and b is the area under the density function

of the random variable between a and b. Thus for example the probability that a random variable having a normal distribution with mean 6 and variance 16 takes a value between 5 and 8 is
∫_5^8 (1/(√(2π) √16)) e^(−(x−6)²/32) dx.     (30)
Amazingly, the processes of mathematics do not allow us to evaluate the integral in (30): it is just “too hard”. (This indicates an interesting limit as to what mathematics can do.) So how would we find the probability given in (30)? It has to be done using a chart.
There is a whole family of normal distributions, each member of the family corresponding to some pair of (µ, σ²) values. (The case µ = 6 and σ² = 16 just considered is an example of one member of this family.) However, probability charts are available only for one particular member of this family, namely the normal distribution for which µ = 0 and σ² = 1. This is sometimes called the standardized normal distribution, for reasons which will appear shortly. (The normal distribution chart that you will be given refers to this specific member of the normal distribution family.) So we now have to discuss the normal distribution chart.
The chart gives “less than” probabilities for a variety of positive numbers, generically denoted by “z”. The way that the chart works is best described by a few examples. (We will also do some examples in class.) Thus, for example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 0.5 is 0.6915. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 1.73 is 0.9582. Note that the chart only goes up to the “z” value 3.09; for any “z” greater than this, taking the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than that “z” to be 1 is good enough.
We usually have to consider more complicated examples than this. For example, we often have to find “greater than” probabilities. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value exceeding 1.44 is 1 minus the probability that it takes a value less than 1.44, namely 1 − 0.9251 = 0.0749. As a different form of calculation, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 0.5 and 1.73 is 0.9582 − 0.6915 = 0.2667. Similarly, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 1.23 and 2.46 is 0.9931 − 0.8907 = 0.1024.
Even more complicated calculations arise when negative numbers are involved. Here we have to use the symmetry of the normal distribution around the value 0. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.22 and 0 is the same as the probability that it takes a value between 0 and +1.22, and this is 0.8888 − 0.5000 = 0.3888. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than −0.87 is the same as the probability that it takes a value greater than +0.87, and this is

1 − 0.8078 = 0.1922. Finally, perhaps the most complicated calculation concerns the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between some given negative number and some given positive number. Suppose for example that we want to find the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.28 and +0.44. This is the probability that it takes a value between −1.28 and 0 plus the probability that it takes a value between 0 and +0.44. This in turn is the probability that it takes a value between 0 and +1.28 plus the probability that it takes a value between 0 and +0.44. This is (0.8997 − 0.5000) + (0.6700 − 0.5000) = 0.5697.
Why is there a probability chart only for this one particular member of the normal distribution family? Suppose that a random variable X has the normal distribution (29), with arbitrary mean µ and arbitrary variance σ². Then the “standardized” random variable Z, defined by Z = (X − µ)/σ, has a normal distribution with mean 0, variance 1 (trust me on this). This standardization procedure can be used to find probabilities for a random variable having any normal distribution, with arbitrary mean µ and arbitrary variance σ². For example, if X is a random variable having a normal distribution with mean 6 and variance 16 (and thus standard deviation 4), the probability of the event 7 < X < 10, that is P(7 < X < 10), can be found by standardizing and creating a Z statistic:
P(7 < X < 10) = P((7 − 6)/4 < (X − 6)/4 < (10 − 6)/4) = P(0.25 < Z < 1),     (31)
and this probability is found from the standardized normal distribution chart (or from computer packages) to be 0.8413 − 0.5987 = 0.2426.
As a slightly more complicated example, the probability that the random variable X in the previous paragraph takes a value between 4 and 11, that is the probability of the event 4 < X < 11, can be found by standardizing and creating a Z statistic:
P(4 < X < 11) = P((4 − 6)/4 < (X − 6)/4 < (11 − 6)/4) = P(−0.5 < Z < 1.25),     (32)
and this probability is found from the kind of manipulations discussed above to be (0.6915 − 0.5000) + (0.8944 − 0.5000) = 0.5859.
Two useful properties of the normal distribution, often used in conjunction with this standardization procedure, are that if the random variable Z has a normal distribution with mean 0 and variance 1, then
P(Z > +1.645) = 0.05     (33)
and
P(−1.96 < Z < +1.96) = 0.95,     (34)
or equivalently
P(Z < −1.96) + P(Z > +1.96) = 0.05.     (35)
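In place of the printed chart, the “less than” probabilities for the standardized normal distribution can also be computed directly. The sketch below (an illustrative Python fragment, not part of the course materials) uses the error function to compute Φ(z) = P(Z < z), reproduces some of the chart look-ups above, and carries out the standardization calculations (31) and (32).

```python
from math import erf, sqrt

def phi(z):
    """P(Z < z) for a normal distribution with mean 0 and variance 1."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(phi(0.5), phi(1.73))            # 0.6915 and 0.9582, as on the chart
print(1 - phi(1.44))                   # a "greater than" probability, 0.0749
print(phi(0.44) - phi(-1.28))          # the -1.28 to +0.44 example, 0.5697

def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X normal with mean mu and standard deviation sigma,
    found by standardizing: Z = (X - mu) / sigma."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(normal_prob(7, 10, 6, 4))        # equation (31): about 0.2426
print(normal_prob(4, 11, 6, 4))        # equation (32): about 0.5859
```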

The “standardized” quantity Z, defined as Z = (X − µ)/σ, where X is a random variable with mean µ and standard deviation σ, will be referred to often below, and the symbol Z is reserved, in these notes and in Statistics generally, for this standardized quantity.
One frequently-used approximation derives from equation (34) by approximating the value 1.96 by 2. This is
P(−2 < Z < +2) ≈ 0.95.     (36)
Remembering that Z = (X − µ)/σ, this equation implies that if X is a random variable having a normal distribution with mean µ and variance σ², then
P(µ − 2σ < X < µ + 2σ) ≈ 0.95.     (37)
A similar calculation, using the normal distribution chart, shows that
P(µ − 2.575σ < X < µ + 2.575σ) ≈ 0.99.     (38)
One of the applications of the normal distribution is to provide approximations for probabilities for various random variables. Applications of (37) often arise from the Central Limit Theorem, discussed immediately below.

The Central Limit Theorem
An important property of an average and of a sum of several random variables derives from the so-called “Central Limit Theorem”. This states that if the random variables X1, X2, ..., Xn are independently and identically distributed, then no matter what the probability distribution of these random variables might be, the average X̄ = (X1 + X2 + · · · + Xn)/n and the sum X1 + X2 + · · · + Xn both have approximately a normal distribution. This approximation becomes more accurate the larger n is, and is usually very good for values of n greater than about 50. Since many statistical procedures deal with sums or averages, the Central Limit Theorem ensures that we often deal with the normal distribution in these procedures, almost always using the standardized quantity Z. Also, we often use the formulas (18) and (19) for the mean and variance of a sum and of an average, together with the approximation (37), when doing this.
We have already seen an example of this (in the Section “An example of the use of equations (19)”): the average X̄ of the numbers to turn up on 1,000 rolls of a fair die. The average X̄ is a random variable with mean 3.5 and variance 35/12,000, and thus standard deviation √(35/12,000) ≈ 0.0540. The central limit theorem states that, to a very close approximation,

this average has a normal distribution with this mean and this variance. That is, to a very close approximation,
P(3.5 − 2 × 0.0540 < X̄ < 3.5 + 2 × 0.0540) ≈ 0.95,     (39)
and this led to the (probability theory) statement given in the Section “An example of the use of equations (19)” that the probability that X̄ takes a value between 3.392 and 3.608 is about 95%.
The Central Limit Theorem also applies to the binomial distribution. Suppose that X has a binomial distribution with index n (the number of trials) and parameter θ (the probability of success on each trial), and thus mean nθ and variance nθ(1 − θ). In the binomial context the Central Limit Theorem states that X has, to a very close approximation, a normal distribution with this mean and this variance. It similarly states that the proportion P of successes has, to a very close approximation, a normal distribution with mean θ and variance θ(1 − θ)/n.
Here is an application of this result. Suppose that it is equally likely that a newborn will be a boy as a girl. If this is true, then the number of boys in a sample of 2,500 newborns has approximately a normal distribution with mean 1,250 and variance (from note (v) about variances) of 2,500 × 1/2 × 1/2 = 625, and hence standard deviation √625 = 25. Then (38) shows that the probability is about 0.99 that the number of boys in this sample will be between 1,250 − 2.575 × 25 and 1,250 + 2.575 × 25, that is, about between 1185 and 1315. (This is a probability theory deduction, or implication. It is a “zig”. It is made under the assumption that a newborn is equally likely to be a boy as a girl.)
Here is the corresponding window into Statistics. IF a newborn is equally likely to be a boy as a girl, then the probability is about 99% that in a sample of 2,500 newborns the number of boys that we see will be between 1185 and 1315. However, when we actually took this sample we saw 1,334 boys. (We saw these numbers in Homework 1.) We therefore have good evidence that it is NOT equally likely for a newborn to be a boy as a girl. (This is an induction, or inference. It is a statement of Statistics. It is a “zag”. It cannot be made without the corresponding probability theory “zig” calculation.) (In fact it is now known that it is NOT equally likely for a newborn to be a boy as a girl.)
The above example illustrates how we increase our knowledge in a context involving randomness (here the randomness induced by the sampling process) by a probability theory/Statistics “zig-zag” process.
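The newborn calculation can be reproduced with the same Φ function idea. The following Python sketch (illustrative only) finds the approximate 99% range for the number of boys under the “equally likely” assumption, and the approximate probability of seeing 1,334 or more boys if that assumption were true.

```python
from math import erf, sqrt

def phi(z):
    """P(Z < z) for a standardized normal random variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, theta = 2500, 0.5
mu = n * theta                         # 1,250
sigma = sqrt(n * theta * (1 - theta))  # 25

# Approximate 99% range for the number of boys, as in the use of (38):
# about 1185 to 1315.
print(mu - 2.575 * sigma, mu + 2.575 * sigma)

# Approximate probability of 1,334 or more boys if theta really were 1/2;
# this is tiny, which is the basis of the inference drawn above.
print(1 - phi((1334 - mu) / sigma))
```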

Statistics

Introduction
So far in these notes we have been contemplating the situation before some experiment is carried out, so that we have been discussing random variables and their properties. We now do our experiment. As indicated above, if before the experiment we had been considering several random variables X1, X2, ..., Xn, we denote the actually observed values of these random variables, once the experiment has been carried out, by x1, x2, ..., xn. These observed values are our data. For example, if an experiment consisted of the rolling of a die n = 3 times, and after the experiment we observe that a 5 turned up on the first roll and a 3 on both the second and third rolls, we would say that x1 = 5, x2 = 3, x3 = 3. These are our data values. It does not make sense to say, before the experiment has been done, that X1 = 5, X2 = 3, X3 = 3. This comment “does not compute”.
The three main activities of Statistics are the estimation of the numerical values of a parameter or parameters, assessing the accuracy of these estimates, and testing hypotheses about the numerical values of parameters. We now consider each of these in turn.

Estimation (of a parameter)

Comments on the “die-rolling” example
In much of the discussion in these notes (and the course) so far, the values of the various parameters entering the probability distributions considered were taken as being known. A good example was the “fair die” simulation: we knew in advance that the die is fair, so we knew in advance the values of the mean (3.5) and the variance (35/12) of the number to turn up on any roll of the die, so that we know, for example, that the mean of the (random variable) average of the numbers turning up after (say) 1,000 rolls of the die is 3.5. However, in practice these parameters are usually unknown, and must be estimated from data. As an example, we might be interested in the mean blood-sugar reading of diabetics. To get some idea about what this mean might be we would take a sample of (say) 1,000 diabetics, measure the blood-sugar reading for each of these 1,000 people and use the average of these to estimate this (unknown) mean. This is a natural (and, as we see later, correct) thing to do. The real-life situation, especially in research, is that we will not know the relevant mean.
This means that our JMP “die rolling” experiment is very atypical. The reason is that in this JMP experiment we know that the die is fair. So think of the JMP die-rolling example as a “proof of principle”: because we know that the die is fair, we know in advance that the mean of the (random variable) average of the numbers turning up after (say) 1,000 rolls of the die is 3.5. We also know that it has a small variance (35/12,000). This value (3.5) of the mean and this small variance imply that, once we have rolled the die 1,000 times, our actually observed average should be very

And this is what we saw happen. This suggests that in a real-life example, where we do not know the numerical value of a mean, using an observed average should give us a pretty good idea of what that mean is. As an example, think of the average that you got of the numbers that turned up on your 1,000 rolls of a fair die in the JMP experiment. Almost certainly your average was not exactly 3.5, but at least it should have been close to 3.5. So, if you did not know the mean (3.5) in the die case, you would use your average to estimate it. Later we will refine this idea more precisely.

General principles

In this section we consider general aspects of estimation procedures. Much of the theory concerning estimation of parameters is the same for both discrete and continuous random variables, so in this section we use the notation X for both. Let X1, X2, ..., Xn be n independently and identically distributed (iid) random variables, each having a probability distribution P(x; θ) (for discrete random variables) or density function f(x; θ) (for continuous random variables), depending in both cases (as the notation implies) on some unknown parameter θ. Before discussing particular cases we have to consider general principles of estimation.

We have now done our experiment, so that we now have the corresponding data values x1, x2, ..., xn of X1, X2, ..., Xn. How can we use these data values to estimate the parameter θ? (Note that we are estimating the parameter θ, not calculating it. Even after we have the data values we still will not know what the numerical value of θ is. But at least if we use good estimation procedures we should have a reasonable approximate idea of its value.)

An estimator of the parameter θ is some function of the random variables X1, X2, ..., Xn, and thus may be written θ̂(X1, X2, ..., Xn), a notation that emphasizes that this estimator is itself a random variable. (We pronounce θ̂ as "θ-hat". For convenience we generally use the shorthand notation θ̂.) The quantity θ̂(x1, x2, ..., xn), calculated from the observed data values x1, x2, ..., xn, is called the estimate of θ. Note the two different words estimate and estimator. The estimate of θ is calculated from our data, and will then be just some number. The "hat" terminology is a signal to us that we are talking about either an estimator or an estimate. How good this estimate is depends on the properties of the (random variable) estimator θ̂, in particular its mean and its variance.

Various desirable criteria have been proposed for an estimator to satisfy, and we now discuss three of these. First, a desirable property of an estimator is that it be unbiased. An estimator θ̂ is said to be an unbiased estimator of θ if its mean value is equal to θ. If θ̂ is an unbiased estimator of θ, we say that the corresponding estimate θ̂(x1, x2, ..., xn), calculated from the observed data values x1, x2, ..., xn, is an unbiased estimate of θ. An unbiased estimator is "shooting at the right target". Because of the randomness involved in the generation of our data, it will almost certainly not exactly hit the target, but at least it is shooting in the right direction. We will later show that several of the estimators we consider are unbiased.

Second, we would also want the variance of θ̂ to be small, since if it is, the observed value θ̂(x1, x2, ..., xn) calculated from your data, that is your estimate of θ, should be close to θ, since the estimator was "shooting at the right target" and has little spread around it. This is why a variance is an important probability theory concept.

Finally, it would also be desirable if θ̂ has, either exactly or approximately, a normal distribution, since then well-known properties of this distribution can be used to provide properties of θ̂. In particular, we often use the two-standard-deviation rule in assessing the precision of our estimate, and this rule derives from the normal distribution.

Estimation of the binomial parameter θ

The binomial distribution gives the probability distribution of the random variable X, the number of successes from n binomial trials with the binomial parameter θ (the probability of success on each trial). The corresponding statistical question is: I have now done my experiment and observed x successes from n trials. What can I say about θ? In particular, how should I estimate θ? How precise can I expect my estimate to be?

The classical, and perhaps natural, estimate of θ is p = x/n, the observed proportion of successes. What are the properties of this estimate? These depend on the properties of the random variable P, the (random) proportion of successes before we do the experiment. This random variable and its probability distribution are purely probability theory concepts. We know (from the relevant probability theory formulas) that the mean of P is θ and that the variance of P is θ(1 − θ)/n. What does this imply?

First, since we know that the random variable P has a mean of θ, the estimate p of θ is an unbiased estimate of θ. That is good news: it is "shooting at the right target". Fortunately, several of the estimators we consider are unbiased.

Just as important, once we have done our experiment we want to ask: how precise is this estimate? An estimate of a parameter without any indication of its precision is not of much value. The precision of p as an estimate of θ depends on the variance, and thus on the standard deviation, of the random variable P. We know that the variance of P is θ(1 − θ)/n, so that the standard deviation of P is √(θ(1 − θ)/n).

We now use two facts. (i) From the Central Limit Theorem as applied to the random variable P, we know that P has, to a very accurate approximation, a normal distribution (with mean θ and variance θ(1 − θ)/n). (ii) Once we know that P has a normal distribution (to a sufficiently good approximation), we are free to adapt either (37) or (38), which are normal distribution results, to the question of the precision of p as an estimate of θ.

First we have to find out what these equations reduce to in the binomial distribution context. They become

P(θ − 2√(θ(1 − θ)/n) < P < θ + 2√(θ(1 − θ)/n)) ≈ 0.95    (40)

and

P(θ − 2.575√(θ(1 − θ)/n) < P < θ + 2.575√(θ(1 − θ)/n)) ≈ 0.99.    (41)

The first inequality implies, in words, something like this: "Before we do our experiment we can say that the random variable P takes a value within 2√(θ(1 − θ)/n) of θ with probability of about 95%." From this we can say: "After we have done our experiment, it is about 95% likely that the observed proportion p of successes is within 2√(θ(1 − θ)/n) of θ."

We now turn this second statement "inside-out" (using the "if I am within 10 yards of you, you are within 10 yards of me" idea), and say: "It is about 95% likely that, once we have done our experiment, θ is within 2√(θ(1 − θ)/n) of the observed proportion p of successes." Writing this somewhat loosely, we can say

P(p − 2√(θ(1 − θ)/n) < θ < p + 2√(θ(1 − θ)/n)) ≈ 0.95.    (42)

We still have a problem: since we do not know the value of θ, we do not know the value of the expression √(θ(1 − θ)/n) occurring twice in (42). However, at least we have an estimate of θ, namely p. Since (42) is already an approximation, we make a further approximation and say, again somewhat loosely,

P(p − 2√(p(1 − p)/n) < θ < p + 2√(p(1 − p)/n)) ≈ 0.95.    (43)

This leads to the so-called (approximate) 95% "confidence interval" for θ of

p − 2√(p(1 − p)/n)  to  p + 2√(p(1 − p)/n).    (44)
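As a quick check of formula (44), here is a small Python sketch (illustrative only, not course-supplied code); the numbers 470 successes in 1,000 trials anticipate the example in the next paragraph.

```python
# A small helper implementing the approximate 95% confidence interval (44)
# for the binomial parameter theta.
from math import sqrt

def binomial_ci_95(x, n):
    p = x / n                            # the estimate of theta
    half_width = 2 * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(binomial_ci_95(470, 1000))         # roughly (0.4384, 0.5016)
```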

0316. 000 (i. The sort of thing we would say is: “I estimate the value of θ to be 0. Further. (i) The range of values 0. and I am (approximately) 95% certain that θ is between 0.52. We find n = 1. If you want to be more conservative.e. (v) What was the sample size? Suppose that a TV announcer says.03) came from.4384) and 0.47 as indicated by the confidence interval depends on the variance θ(1 − θ)/n of the random variable P . (45) This formula is quite easy to remember and you may use it in place of (44).5016 in the above example is usually called a “95% confidence interval for θ”. suppose that n = 1.5016).47 − 2 0. from (45). 000 (i. etc. and with this value the “ margin of error” is 1/1. 000 and p is 0. of that estimate. or reliability.As an example. (iv) It is a mathematical fact that p(1 − p) can never exceed 1/4.47.47 × 0. Notes on the above. The interpretation of this statement is that we are (approximately) 95% certain that θ is within this range of values. They just approximated this by 0. before an election between two candidates Smith and Jones. p(1 − p) is quite close to 1/4.03.47. 0. ( Probably their sample size was 1.4384 to 0.53/1. that a Gallup poll predicts that 52% of the voters will vote for Smith. (iii) The precision of the estimate 0.47. how many individuals were in the sample that led to the estimate 52%. The TV announcer has no idea where that 3% ( = 0. and remember that 1/4 = 1/2. “with a margin of error of 3%”. In saying this we have not only indicated our estimate of θ. and have a 99% confidence interval.e. 0. 000 = 0. books. 111. or 0.) (vi) All of the above relates to an (approximate) 95% confidence interval for θ. you can start with the inequalities 29 . but in effect it came from the (approximate) 95% confidence interval (44) or (more likely) from (45). we arrive from (44) at a conservative confidence interval for θ as p − 1/n to p + 1/n.47 × 0. All we have to do is to equate 1/n with 0.000. you will often see the above result written as θ = 0. So we can work out. This is why we have to consider random variables. for quite a wide range of values of p near 1/2. their properties and in particular their variances.03.47 ± 0.47 + 2 0. but we have also given some idea of the precision. So if we approximate p(1 − p) by 1/4.53/1. (ii) In research papers.0316. Thus the confidence interval gives us an idea of the precision of the estimate 0.

(vi) All of the above relates to an (approximate) 95% confidence interval for θ. If you want to be more conservative, and have a 99% confidence interval, you can start with the inequalities in (41), which, compared to (40) (which led to our 95% confidence interval), replace the "2" in (40) by 2.575. Carrying through the same sort of argument that led to (44) and (45), we would arrive at an (approximate) 99% confidence interval for θ of

p − 2.575√(p(1 − p)/n)  to  p + 2.575√(p(1 − p)/n)    (46)

in place of (44), or

p − 1.2875√(1/n)  to  p + 1.2875√(1/n)    (47)

in place of (45).

Example. This example is from the field of medical research. Suppose that someone proposes an entirely new medicine for curing some illness. Beforehand we know nothing about the properties of this medicine, and in particular we do not know the probability θ that it will cure someone of the illness involved. Here θ is an (unknown) parameter. We want to carry out a clinical trial to estimate θ. Suppose now that we have given the new medicine to 10,000 people with the illness and of these, 8,716 were cured. Then we estimate θ to be 0.8716. Since we want to be very precise in a medical context we might prefer to use the 99% confidence interval (47) instead of the 95% confidence interval (45). Since √(1/10,000) is 0.01, and 1.2875 × 0.01 = 0.012875, we would say: "I estimate the probability of a cure with this proposed medicine to be 0.8716, and I am about 99% certain that the probability of a cure with this proposed medicine is between 0.8716 − 0.012875 (= 0.8587) and 0.8716 + 0.012875 (= 0.8845)."

(vii) Notice that the lengths of both confidence intervals (47) and (45) are proportional to 1/√n. This means that if you want to be twice as accurate you need four times the sample size, that if you want to be three times as accurate you need nine times the sample size, and so on. This is why your medicines are so expensive: the FDA requires considerable accuracy before a medicine can be put on the market, and this often implies that a very large sample size is needed to meet this required level of accuracy.

(viii) Often in research publications the result of an estimation procedure is written as something like: "estimate ± some measure of precision of the estimate". Thus the result in the medical example above might be written as something like: "θ = 0.8716 ± 0.012875". This can be misleading because, for example, it is not indicated if this is a 95% or a 99% confidence interval, and it is not the best way to present the conclusion.

(ix) The width of the confidence interval, and hence the precision of the estimate, ultimately depends on the variance of the random variable P. This is why we have to discuss (a) random variables and (b) variances of random variables.
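Here is a sketch checking the arithmetic of the medicine example above, using the conservative 99% interval (47).

```python
# 8,716 cures in 10,000 patients, with the conservative 99% interval (47).
from math import sqrt

x, n = 8716, 10000
p = x / n                              # 0.8716
half_width = 2.575 * 0.5 / sqrt(n)     # 1.2875 * sqrt(1/n) = 0.012875
print(p - half_width, p + half_width)  # about 0.8587 and 0.8845
```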

Estimation of a mean (µ)

Suppose that we wish to estimate the mean blood sugar level of diabetics. We take a random sample of n diabetics and measure their blood sugar levels, getting data values x1, x2, ..., xn. We think of the data values x1, x2, ..., xn as the observed values of n iid random variables X1, X2, ..., Xn, all having some continuous probability density function with (unknown to us) mean µ and (unknown to us) variance σ². The form of the density function of each X is unknown to us. We can however conceptualize about this distribution graphically:-

[Figure: a sketch of this unknown density function.]

There is some (unknown to us) probability (the shaded area below) that the blood sugar level of a randomly chosen diabetic lies between the values a and b:-

[Figure: the same density function with the area between a and b shaded.]

Our aim then is to estimate µ from the data and to assess the precision of our estimate. It is natural to estimate the mean µ by the average x̄ of these observed values. What are the properties of this estimate? To answer these questions we have to zig-zag backwards and forwards between probability theory and Statistics. We start with probability theory and think of the situation before we got our data.

This distribution has some (unknown to us) mean (which is what we want to estimate) at the "balance point" of this density function, as indicated by the arrow:-

[Figure: the density function with an arrow marking the mean at its balance point.]

We continue to think of the situation before we get our data; that is, we continue to think in terms of probability theory. We conceptualize about the average X̄ of the random variables X1, X2, ..., Xn. Since the mean value of X̄ is µ (from equation (19)), X̄ is an unbiased estimator of µ. Thus x̄ is an unbiased estimate of µ. That is good news: it is "shooting at the right target". So our "natural" estimate is the correct one.

Much more important: how precise is x̄ as an estimate of µ? This depends on the variance of X̄. We know (see (19)) that the variance of X̄ is σ²/n. Next, the Central Limit Theorem shows that the probability distribution of X̄ is approximately normal when n is large, so that to a good approximation we can use the two-standard-deviation rule, which derives from properties of the normal distribution.

The 95% confidence interval for µ

Suppose first that we know the numerical value of σ². (In practice it is very unlikely that we would know this, but we will remove this assumption soon, and even though we do not know the value of σ², this result is still useful to us.) The two-standard-deviation rule then shows that for large n,

P(µ − 2σ/√n < X̄ < µ + 2σ/√n) ≈ 0.95.    (48)

These facts lead us to an approximate 95% confidence interval for µ, as follows.

The inequalities (48) can be written in the equivalent "turned inside-out" form

P(X̄ − 2σ/√n < µ < X̄ + 2σ/√n) ≈ 0.95.    (49)

This leads to an approximate 95% confidence interval for µ, as

x̄ − 2σ/√n  to  x̄ + 2σ/√n.    (50)

This interval is valuable in providing a measure of accuracy of the estimate x̄ of µ.

The main problem with the above is that, in practice, the variance σ² is usually unknown, so that (50) is not immediately applicable. However it is possible to estimate σ² from the data values x1, x2, ..., xn. The theory here is not easy, so here is a "trust me" result: the estimate s² of σ², found from the observed data values x1, x2, ..., xn, is

s² = (x1² + x2² + ··· + xn² − n(x̄)²)/(n − 1).    (51)

This leads (see (50)) to an even more approximate 95% confidence interval for µ as

x̄ − 2s/√n  to  x̄ + 2s/√n.    (52)

This estimated confidence interval is useful, since it provides a measure of the accuracy of the estimate x̄ and it can be computed entirely from the data.

Some notes on the above

1. Why do we have n − 1 in the denominator of the formula (51) for s², and not n (the sample size)? This question leads to the concept of "degrees of freedom", which we shall discuss later.

2. The number "2" appearing in the confidence interval (52) comes, eventually, from the two-standard-deviation rule. This rule is only an approximate one, so that, in practice, the 95% confidence interval (52) is only reasonably accurate. Further theory shows that in practice it is reasonably accurate.

3. To be told that the estimate of a mean is 14.7 and that it is approximately 95% likely that the mean is between 14.3 and 15.1 is far more useful information than being told only that the estimate of the mean is 14.7.

4. The effect of changing the sample size. Consider two investigators both interested in the blood sugar levels of diabetics. Suppose that n = 10 for the first investigator (i.e. her sample size was 10) and that n = 40 for the second investigator (i.e. his sample size was 40). The two investigators will estimate µ by their respective values of x̄.

Since both estimates are unbiased, that is, both are "shooting at the same target" (µ), they should be reasonably close to each other. Similarly, since both investigators' values of s² are unbiased estimates of σ², their respective estimates of σ² should be reasonably close to each other. On the other hand, the length of the confidence interval for µ for the second investigator will be about half that of the first investigator, since he will have a √40 involved in the calculation of his confidence interval, not the √10 that the first investigator will have (see (52), and note that 1/√40 is half of 1/√10). The fact that there is a √n in the denominator and not an n explains this phenomenon. To be twice as accurate you need four times the sample size. To be three times as accurate you need nine times the sample size. To be 10 times as accurate you need 100 times the sample size! This leads to the next point.

5. How large a sample size do you need before you do your experiment in order to get some desired degree of precision of the estimate of the mean µ? One cannot answer this question in advance, since the precision of the estimate depends on σ², which is unknown. Often one runs a pilot experiment to estimate σ², and from this one can get a good idea, using the above formulas, of what sample size is needed to get the required level of precision. This is why research is often expensive: to get really accurate estimates one often needs very large sample sizes.

6. The quantity s/√n is often called "the standard error of the mean". This expression incorporates three errors: more precisely, it should be "the estimated standard deviation of the estimator of the mean".

A numerical example. For many years corn has been grown using a standard seed processing method. A new method is proposed in which the seed is kiln-dried before sowing. We want to assess various properties of this new method. In particular we want to estimate µ, the mean yield per acre (in pounds) under the new method, and to find two limits between which we are approximately 95% certain that µ lies. We plan to do this by sowing n = 11 separate acres of land with the new seed type and measuring the yield per acre for each of these 11 acres. At this point, before we carry out this experiment, these yields are unknown to us. They are random variables, and we think of their values, before we do the experiment, as the random variables X1, X2, ..., X11. We know (as above) that the mean of the average X̄ of these random variables is µ, so we know that the estimate x̄ will be unbiased.

With this conceptualization behind us, we now apply this new style of seed to our 11 separate acre lots, and we get the following values (pounds per acre):-

1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511.

These are our data values, which we have previously generically denoted by x1, x2, ..., xn. Now to our estimation and confidence interval procedures.

We estimate the mean µ of the yield per acre by the average

(1903 + 1935 + ··· + 1511)/11 = 1841.46.

We know from the above theory that this is an unbiased estimate of µ.

To calculate the approximate 95% confidence interval (52) for µ we first have to calculate s², our estimate of the variance σ² of the probability distribution of yield with this new seed type. The estimate of σ² is, from (51),

s² = ((1903)² + (1935)² + ··· + (1511)² − 11(1841.46)²)/10 = 117,468.9.

Following (52), these calculations lead to our approximate 95% confidence interval for µ as

1841.46 − 2√117,468.9/√11  to  1841.46 + 2√117,468.9/√11,    (53)

that is, from 1634.78 to 2048.14.

Since the individual yields are clearly given rounded to whole numbers, it is not appropriate to be more accurate than this in our final statement, which is: "We estimate the mean yield per acre by 1841, and we are about 95% certain that it is between 1635 and 2048."

Often in research publications the above result might be written as µ = 1841 ± 206. This can be misleading because, for example, it is not indicated if this is a 95% or a 99% confidence interval, and it is not the best way to present the conclusion.
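The corn-yield arithmetic can be reproduced directly from the data; the following sketch mirrors equations (51) and (52).

```python
# The average, the estimate s^2 of the variance from (51), and the
# interval (52)/(53).
from math import sqrt

yields = [1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511]
n = len(yields)
xbar = sum(yields) / n
s2 = (sum(y * y for y in yields) - n * xbar ** 2) / (n - 1)
half_width = 2 * sqrt(s2) / sqrt(n)
print(round(xbar, 2), round(s2, 1))                        # about 1841.5 and 117469
print(round(xbar - half_width), round(xbar + half_width))  # about 1635 and 2048
```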

Estimating the difference between two binomial parameters

Let's start with an example. Is there a difference between men and women in their attitudes in the pro-life/pro-choice debate? We approach this question from a statistical point of view as follows. Let θ1 be the (unknown) probability that a woman is pro-choice and let θ2 be the (unknown) probability that a man is pro-choice. So we are interested in the difference θ1 − θ2. Our aim is to take a sample of n1 women and n2 men and find out, for each person, whether he/she is pro-life or pro-choice. Our aim then is to estimate θ1 − θ2 and to find an approximate 95% confidence interval for θ1 − θ2.

Before we take our sample, P1 = X1/n1 is the proportion of women who will be pro-choice and P2 = X2/n2 is the proportion of men who will be pro-choice. Both P1 and P2 are random variables. Now we know that the mean of P1 is θ1 (this is one of the magic formulas for the proportion of "successes" in the binomial context) and we also know that the mean of P2 is θ2 (this uses the same magic formula). Notice that P1 − P2 is a difference. In comparing two groups we are often involved with differences. That is why we have done some probability theory about differences. Thus from the first equation in (23), giving the mean of a difference of two random variables with possibly different means, the mean of D = P1 − P2 is θ1 − θ2.

Suppose now that we have taken our sample, and that x1 of the n1 women are pro-choice and x2 of the n2 men are pro-choice. We would estimate θ1 by x1/n1, which we will write as p1, and would estimate θ2 by x2/n2, which we will write as p2. Thus we would (correctly) estimate θ1 − θ2 by the difference d = p1 − p2. What are the properties of this estimate? These are determined by the properties of the random variable D = P1 − P2. Since the mean of D is θ1 − θ2, D is an unbiased estimator of θ1 − θ2, and correspondingly d = p1 − p2 is an unbiased estimate of θ1 − θ2. It is "shooting at the right target". It is the estimate of θ1 − θ2 that we will use.

More important: how precise is this estimate? To answer this we have to find the variance of the estimator P1 − P2. Now the variance of P1 is θ1(1 − θ1)/n1 (from the variance of the proportion of successes in n1 binomial trials). Similarly the variance of P2 is θ2(1 − θ2)/n2. From the second equation in (23), giving the variance of a difference of two random variables with possibly different variances, the variance of D is θ1(1 − θ1)/n1 + θ2(1 − θ2)/n2. We do not of course know the numerical value of this variance, since we do not know the values of θ1 and θ2. However we have an estimate of θ1, namely p1, and an estimate of θ2, namely p2. So we could estimate this variance by p1(1 − p1)/n1 + p2(1 − p2)/n2.

Using the same sort of argument that led to (43), we could then say

P(p1 − p2 − 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2) < θ1 − θ2 < p1 − p2 + 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2)) ≈ 0.95.    (54)

This leads to the so-called (approximate) 95% "confidence interval" for θ1 − θ2 of

p1 − p2 − 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2)  to  p1 − p2 + 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2).    (55)

These formulas are pretty clumsy, so we carry out the same approximation that we did when estimating a single binomial parameter (see the discussion leading to (45)). That is, we use the mathematical fact that neither p1(1 − p1) nor p2(1 − p2) can ever exceed 1/4 and that, for quite a wide range of values of any fraction f near 1/2, f(1 − f) is quite close to 1/4. So if we approximate both p1(1 − p1) and p2(1 − p2) by 1/4, and remember that √(1/4) = 1/2, we arrive at a conservative confidence interval for θ1 − θ2 as

p1 − p2 − √(1/n1 + 1/n2)  to  p1 − p2 + √(1/n1 + 1/n2).    (56)

Numerical example. Suppose that we interview n1 = 1,000 women and n2 = 800 men on the pro-life/pro-choice question. We find that 624 of the women are pro-choice and 484 of the men are. So we estimate θ1 by p1 = 624/1,000 = 0.624 and we estimate θ2 by p2 = 484/800 = 0.605. So we estimate the difference between the proportion of women who are pro-choice and the proportion of men who are pro-choice to be 0.624 − 0.605 = 0.019. Further, √(1/1,000 + 1/800) = 0.047, so we are approximately 95% certain that the actual difference in proportions is between 0.019 − 0.047 = −0.028 and 0.019 + 0.047 = 0.066. A TV commentator would call 0.047 the "margin of error". Later, when we do hypothesis testing, we will see if the estimate 0.019 differs significantly from 0.
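A quick check of the two-proportion arithmetic in this example, using the conservative interval (56) (an illustrative sketch only):

```python
from math import sqrt

p1, p2 = 624 / 1000, 484 / 800
d = p1 - p2
margin = sqrt(1 / 1000 + 1 / 800)
print(round(d, 3), round(margin, 3))               # 0.019 and 0.047
print(round(d - margin, 3), round(d + margin, 3))  # about -0.028 and 0.066
```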

Estimating the difference between two means

As in the previous section, let's start with an example; in fact we will follow the structure of the last section fairly closely. Is the mean blood pressure of women equal to the mean blood pressure of men? We approach this question from a statistical point of view as follows. Let µ1 be the (unknown) mean blood pressure for a woman and µ2 be the (unknown) mean blood pressure for a man. So we are interested in the difference µ1 − µ2. Our aim is to take a sample of n1 women and n2 men and measure the blood pressures of all n1 + n2 people. Our aim then is to estimate µ1 − µ2 and to find an approximate 95% confidence interval for µ1 − µ2.

Clearly we estimate the mean blood pressure for women by x̄1, the average of the blood pressures of the n1 women in the sample, and similarly we estimate the mean blood pressure for men by x̄2, the average of the blood pressures of the n2 men in the sample. We then estimate µ1 − µ2 by x̄1 − x̄2.

How accurate is this estimate? This depends on the variance of the random variable X̄1 − X̄2. Using the formula for the variance of a difference, as well as the formula for the variance of an average, this variance is σ1²/n1 + σ2²/n2, where σ1² is the (unknown) variance of blood pressure among women and σ2² is the (unknown) variance of blood pressure among men. We do not know either σ1² or σ2², and these will have to be estimated from the data.

If the blood pressures of the n1 women are denoted x11, x12, ..., x1n1, we estimate σ1² (see equation (51)) by

s1² = (x11² + x12² + ··· + x1n1² − n1(x̄1)²)/(n1 − 1).    (57)

Similarly, if the blood pressures of the n2 men are denoted x21, x22, ..., x2n2, we estimate σ2² (see equation (51)) by

s2² = (x21² + x22² + ··· + x2n2² − n2(x̄2)²)/(n2 − 1).    (58)

Thus we estimate σ1²/n1 + σ2²/n2 by

s1²/n1 + s2²/n2.    (59)

Finally, our approximate 95% confidence interval for µ1 − µ2 is

x̄1 − x̄2 − 2√(s1²/n1 + s2²/n2)  to  x̄1 − x̄2 + 2√(s1²/n1 + s2²/n2).
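The following sketch shows the whole two-sample calculation end to end. The blood pressure numbers here are made up purely for illustration; they are not data from the course.

```python
from math import sqrt

women = [118, 125, 130, 121, 127, 119, 124]   # made-up illustrative readings
men = [128, 135, 131, 126, 138, 129]

def mean_and_s2(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)   # equation (51)
    return n, xbar, s2

n1, x1bar, s1_2 = mean_and_s2(women)
n2, x2bar, s2_2 = mean_and_s2(men)
d = x1bar - x2bar
half_width = 2 * sqrt(s1_2 / n1 + s2_2 / n2)
print(round(d, 2), round(d - half_width, 2), round(d + half_width, 2))
```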

Regression

How does one thing depend on another? How does the GNP of a country depend on the number of people in full-time employment? How does the reaction time of a person to some stimulus depend on the amount of sleep deprivation administered to that person? How does the growth height of a plant in a greenhouse depend on the amount of water that we give the plant during the growing period? Many practical questions are of the "how does this depend on that?" type. These questions are answered by the technique of regression.

Regression problems can get pretty complicated, so we consider here only one case of regression: how does some random, non-controllable quantity Y depend on some non-random, controllable quantity x? We will use the plant and water example to demonstrate the central regression concepts.

Notice two things about the notation. First, we denote the random quantity in upper case (see Y above). This is in accordance with the notational convention of denoting random variables in upper case. Up to now we have denoted random variables using the letter X; we switch the notation from X to Y in the regression context because we will later plot our data values in the standard x-y plane, and it is natural to plot the observed values of the random variable as the y values. Second, we denote the controllable non-random quantity in lower case (see x above).

First we think of the situation before our experiment, and consider some typical generic plant to which we plan to give x units of water. At this stage the eventual growth height Y of this plant is a random variable: we do not know what it will be. We make the assumption that the mean of Y is of the "linear" form

mean of Y = α + βx,    (60)

where α and β are parameters, that is, quantities whose value we do not know. We also assume that

variance of Y = σ²,    (61)

where σ² is another parameter whose value we do not know. In fact our main aim, once the experiment is finished, is to estimate the numerical values of these parameters and to get some idea of the precision of our estimates. Note that we assume that the mean growth height potentially depends on x (and indeed our main aim is to assess the way it depends on x), whereas the variance does not. The fact that there is a (positive) variance for Y derives from the fact that there is some, perhaps much, uncertainty about what the value of the plant growth will be after we have done the experiment. There are many factors that we do not know about, such as soil fertility, and these imply that Y is a random variable.

As stated above, one of our aims, once the experiment is over, is to estimate these parameters from our data. So we are involved with three parameters, α, β and σ². We do not know the value of any one of them. Of these three the most important one to us is β.

Taking a break from regression for a moment, equation (60) reminds us that the algebraic equation y = a + bx defines a geometric line in the x-y plane, as shown:-

[Figure: a straight line y = a + bx in the x-y plane.]

The interpretation of a is that it is the intercept of this line on the y axis (as shown). The interpretation of b is that it is the slope of the line. If b = 0 the line is horizontal, and then the values of y for points on the line are all the same, whatever the value of x.

Now back to the regression context. The interpretation of β is that it is the mean increase in growth height per unit increase in the amount of water given. If β = 0 this mean increase is zero, and equation (60) shows that the mean growth height does not depend on the amount of water that we give the plant. So we will be interested, later, once we have our data, in seeing if our estimate of β is close to zero or not.

We are still thinking of the situation before we conduct our experiment. We plan to use some pre-determined number n of plants in our greenhouse experiment, planning to give the plants respectively x1, x2, ..., xn units of water. These x values do not have to be all different from each other, but it is essential that they are not all equal. (In fact there is a strategy question about how we would choose the values of x1, x2, ..., xn, which is discussed later.) At this stage we conceptualize about the growth heights Y1, Y2, ..., Yn of the n plants. (Y1 corresponds to the plant getting x1 units of water, Y2 corresponds to the plant getting x2 units of water, and so on.) These are all random variables: we do not know in advance of doing the experiment what values they will take. Then from equation (60), the mean of Y1 is α + βx1, the mean of Y2 is α + βx2, and so on. The variance of Y1 is σ², the variance of Y2 is also σ², and so on, whatever the value of x.
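To see what assumptions (60) and (61) mean operationally, here is a sketch that simulates growth heights from the model. The parameter values α = 66, β = 0.65 and σ = 1 are made up (chosen only to resemble the numerical example given later), and the water amounts are those used in that example.

```python
# Each plant's growth height is drawn from a normal distribution with
# mean alpha + beta * x and variance sigma^2, independently of the others.
import random

random.seed(1)
alpha, beta, sigma = 66.0, 0.65, 1.0
water = [16, 16, 16, 18, 18, 20, 22, 24, 24, 26, 26, 26]
heights = [random.gauss(alpha + beta * x, sigma) for x in water]
print([round(y, 1) for y in heights])
```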

We assume that the various Yi values are independent. However they are clearly not assumed to be identically distributed, since if for example xi ≠ xj, that is, the amount of water to be given to plant i differs from that to be given to plant j, the means of Yi and Yj are different if β ≠ 0 and the assumptions embodied in (60) are true.

All the above refers to the situation before we conduct our experiment. We now do the experiment, and we obtain growth heights y1, y2, ..., yn. (The plant getting x1 units of water had growth height y1, the plant getting x2 units of water had growth height y2, and so on.)

The first thing that we have to do is to plot the (x1, y1), (x2, y2), ..., (xn, yn) values on a graph. Equation (60) shows that the mean of Y is a linear function of x. This means that once we have our data the points should (if the assumption in (60) is correct) approximately lie on a straight line. We do not expect them to lie exactly on a straight line: we can expect random deviations from a straight line because of factors unknown to us such as differences in soil composition among the pots that the various plants are grown in, temperature differences from the environment of one plant to another, and so on. The fact that deviations from a straight line are to be expected is captured by the concept of the variance σ². The larger this (unknown to us) variance is, the larger these deviations from a line would tend to be.

Suppose that our data points are as shown below:-

[Figure: a scatter plot of the (x, y) data points, lying roughly along a straight line.]

These data points "more or less" lie on a straight line. If the data points are more or less on a straight line (as above, and deciding this is really a matter of judgement) we can go ahead with our analysis. If they are clearly not on a straight line (see the example at the top of the next page) then you should not proceed with the analysis.

There are methods for dealing with data that clearly do not lie close to being on a straight line, but we do not consider them here. So from now on we assume that the data are "more or less" on a straight line.

Our first aim is to use the data to estimate α, β and σ². The most important parameter is β, since if β = 0 the growth height for any plant does not depend on the amount of water given to the plant. To estimate these parameters we have to calculate various quantities from the data. These are

x̄ = (x1 + x2 + ··· + xn)/n,    ȳ = (y1 + y2 + ··· + yn)/n,    (62)

as well as the quantities sxx, syy and sxy, defined by

sxx = (x1 − x̄)² + (x2 − x̄)² + ··· + (xn − x̄)²,    (63)

syy = (y1 − ȳ)² + (y2 − ȳ)² + ··· + (yn − ȳ)²,    (64)

sxy = (x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + ··· + (xn − x̄)(yn − ȳ).    (65)

The derivation of unbiased estimates here is complicated, so we just give the "trust me" results. We estimate β by b, defined by

b = sxy/sxx.    (66)

We estimate α by a, defined by

a = ȳ − b x̄.    (67)

Finally, we estimate σ² by s_r², defined by

s_r² = (syy − b² sxx)/(n − 2).    (68)

These are the three estimates that we want for our further analysis.
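Equations (62)-(68) translate directly into a short function; the following is an illustrative sketch (in the course these quantities are read off the JMP printout rather than computed by hand).

```python
# Returns the estimates a, b and s_r^2, together with sxx (needed later for
# the confidence interval for beta).
def regression_estimates(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx                         # estimate of beta, equation (66)
    a = ybar - b * xbar                   # estimate of alpha, equation (67)
    sr2 = (syy - b ** 2 * sxx) / (n - 2)  # estimate of sigma^2, equation (68)
    return a, b, sr2, sxx
```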

Notes on this.

1. It can be shown (the math is too difficult to give here) that a is an unbiased estimate of α, that b is an unbiased estimate of β, and that s_r² is an unbiased estimate of σ². The suffix "r" in s_r² stands for the word "regression": the formula for s_r² relates only to the regression context.

2. You will usually do a regression analysis by a statistical package (we will do an example in class), so in practice you will usually not have to do the computations for these estimates.

3. How accurate is the estimate b of β? Again here there is some difficult math that you will have to take on trust. The bottom line is that we are about 95% certain that β is between

b − 2 s_r/√sxx  and  b + 2 s_r/√sxx.    (69)

This is our (approximate) 95% confidence interval for β. Clearly the "2" in this result comes from the two-standard-deviation rule; you will have to take the 2 s_r/√sxx part of it on trust.

4. The width of this confidence interval is proportional to 1/√sxx. Thus the larger we make sxx, the shorter is the length of this confidence interval and the more precise we can be about the value of β. This result introduces a "strategy" concept into our choice of the values x1, x2, ..., xn, the amounts of water that we plan to put on the various plants. How can we make sxx large? We can do this by spreading the x values as far away from their average as we reasonably can - see the picture below, which illustrates the case where we put about half our x values at the same low value and the other half of our x values at the same high value. However, two further considerations then come into play.

5. We should keep the various x values within the range of values which is of interest to us. Also, we do not want to make half the x values at the lower end of this "interesting range" and the other half at the upper end of this "interesting range". If we did this we would have no idea what is happening in the middle of this range.

So in practice we tend to put quite a few x values near the extremes but also string quite a few x values out in the middle of the interesting range. This will be illustrated in a numerical example later.

6. A second result goes in the other direction. Suppose that the amounts of water put on the various plants were close to each other. In other words, the x values would be close to each other and all would then be close to their average. This would mean that sxx would be small, so 2s_r/√sxx would be large, and the confidence interval (69) would be wide. We would then have little confidence in our estimate of β.

An even more extreme case arises if we give all plants the same amount of water. Then both sxx and sxy would be zero, and the definition of b shows that we would calculate b as 0/0, which mathematically makes no sense. The formula here is definitely sending you a message. In fact the formula for b is telling you: "You can't estimate β with the data that you have". In fact it is saying: "You want to assess how the growth height depends on the amount of water given to the plant. If you give all plants the same amount of water there is no way that you can do this". It would be the same as a situation where you wanted to assess how the height of a child depended on his/her age, and all the children in your sample were of exactly the same age. You clearly could not make this assessment with data of that type.

Example. We will do an example from the "water and plant growth" situation. We have n = 12 plants to which we gave varying amounts of water (see below). After the experiment we obtained the following data:-

Plant number       1     2     3     4     5     6     7     8     9     10    11    12
Amount of water    16    16    16    18    18    20    22    24    24    26    26    26
Growth height      76.2  77.1  · · ·                                            83.6

From these we compute x̄ = (16 + 16 + ··· + 26)/12 = 21 and ȳ = (76.2 + 77.1 + ··· + 83.6)/12 = 79.7. Also we find sxx = 188, syy = 83.54 and sxy = 122.4. We now compute our estimate b of β as sxy/sxx = 122.4/188 = 0.6510638. (This result is given to 7 decimal places so as to compare with the JMP printout.) In practice you are not justified in giving an estimate to an accuracy greater than that of the data, so in practice we would write b = 0.65.

Next, our estimate a of α is ȳ − b x̄ = 79.7 − (0.6510638 × 21) = 66.02766. (Again this result is given to 7 decimal places so as to compare with the JMP printout.) In practice you are not justified in giving it to an accuracy greater than that of the data, so in practice we would write a = 66.03. Our so-called "estimated regression line" is

y = 66.03 + 0.65x.

That is the equation of the line that appears on the JMP screen. We could use this line, for example, to say that we estimate the mean growth height for a plant given 21 units of water to be 66.03 + 0.65 × 21 = 79.68.

Finally, we estimate σ² by s_r², calculated in this case (see (68)) as

s_r² = (83.54 − (0.65)²(188))/10 = 0.411.

How accurate is our estimate b of β? First, from the theory it is unbiased: it was found by a process which is truly "aiming at β". Next, we are approximately 95% certain that β is between

0.65 − 2√0.411/√188  and  0.65 + 2√0.411/√188,    (70)

that is, from 0.56 to 0.74.

Notes on this

1. We will do this example by JMP in class. There will also be a handout discussing the JMP output and the interpretation of various things in this output. That handout should be regarded as part of these notes.

2. Notice the choices of the amounts of water in the above example. We gave three plants the lowest amount of water (16) and three plants the highest amount of water (26). We also strung a few values out between these values, in accordance with the discussion above about the choice of x values.

3. Later we will consider testing the hypothesis that the growth height of the plant does not depend on the amount of water given to it. This is equivalent to testing the hypothesis β = 0.

4. Never extrapolate beyond the x values in the experiment. For example, it is not appropriate to say that we estimate the mean growth height for a plant given 1,000 units of water to be 66.03 + 0.65 × 1,000 = 716.03. (You probably would have killed the plant if you gave it this much water.)
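The arithmetic of this example can be re-run from the summary quantities x̄, ȳ, sxx, syy and sxy quoted above (a sketch; it follows the notes in using b rounded to 0.65 when computing s_r²).

```python
from math import sqrt

xbar, ybar = 21, 79.7
sxx, syy, sxy = 188, 83.54, 122.4

b = sxy / sxx                        # 0.6510638...
a = ybar - b * xbar                  # 66.0277...
sr2 = (syy - 0.65 ** 2 * sxx) / 10   # 0.411, using b rounded to 0.65
half_width = 2 * sqrt(sr2) / sqrt(sxx)
print(round(b, 7), round(a, 5), round(sr2, 3))
print(round(0.65 - half_width, 2), round(0.65 + half_width, 2))   # 0.56 and 0.74
```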

Testing hypotheses

Background

In hypothesis testing we attempt to answer questions. Here are some simple examples. Is this coin fair? Is a woman equally likely to be left-handed as a man is? Is there any difference between men and women so far as blood pressure is concerned? Is there any effect of the amount of water given to a plant on its growth height? We always re-phrase these questions in terms of questions about parameters:-

If the probability of a head is θ, is θ = 1/2? If the probability that a woman is left-handed is θ1, and the probability that a man is left-handed is θ2, is θ1 = θ2? If the mean blood pressure for a woman is µ1, and the mean blood pressure for a man is µ2, is µ1 = µ2? Is β = 0?

We re-phrase these questions in this way because we know how to estimate parameters and to get some idea of the precision of our estimates. So re-phrasing questions in terms of questions about parameters helps us to answer them. Attempting to answer them is an activity of hypothesis testing.

The general approach to hypothesis testing

We will consider two equivalent approaches to hypothesis testing. The first approach predates the availability of statistical packages, while the second approach is to some extent motivated by the availability of these packages. We will discuss both approaches. Both approaches involve five steps. The first three steps in both approaches are the same, and we consider these three steps first. We will illustrate all steps by considering two problems involving the binomial distribution.

Step 1

Statistical hypothesis testing involves the test of a null hypothesis (which we write in shorthand as H0) against an alternative hypothesis (which we write in shorthand as H1). The first step in a hypothesis testing procedure is to declare the relevant null hypothesis H0 and the relevant alternative hypothesis H1. The null hypothesis, as the name suggests, usually states that "nothing interesting is happening".

The choice of null and alternative hypotheses should be made before the data are seen. To decide on a hypothesis as a result of the data is to introduce a bias into the procedure, invalidating any conclusion that might be drawn from it. This comment is discussed in more detail below. Also, the nature of the alternative hypothesis must be decided before the data are seen: this is also discussed in more detail below.

Our aim is eventually to accept or to reject the null hypothesis as the result of an objective statistical procedure, using our data in making this decision. It is important to clarify the meaning of the expression "the null hypothesis is accepted." This expression means that there is no statistically significant evidence for rejecting the null hypothesis in favor of the alternative hypothesis. A better expression for "accepting" is thus "not rejecting." So instead of saying "We accept H0", it is best to say "We do not have significant evidence to reject H0".

All the above seems very abstract, so as stated above we will illustrate the steps in the hypothesis testing procedure by two examples, both involving the binomial distribution.

Example 1. It is essential for a gambling casino that the various games offered are fair, since an astute gambler will soon notice if they are unfair and bet accordingly. As a simplified example, suppose that one game involves flipping a coin, and it is essential, from the point of view of the casino operator, that this coin be fair. The casino operator now plans to carry out a hypothesis testing procedure. If the probability of getting "head" on any flip of the coin is denoted θ, the null hypothesis H0 for the casino operator then states that θ = 1/2. (No bias in the coin. Nothing interesting happening.) In the casino example it is important, from the point of view of the casino operator, to detect a bias of the coin towards either heads or tails (if there is such a bias). Thus in this case the alternative hypothesis H1 is the two-sided alternative θ ≠ 1/2. This alternative hypothesis is said to be "composite": it does not specify some numerical value for θ (as H0 does). Instead it specifies a whole collection of values. It often happens that the alternative hypothesis is composite.

The alternative hypothesis will be one of three types: "one-sided up", "one-sided down", and "two-sided". In any one specific situation, which one of these three types is appropriate must be decided in advance of getting the data. The context of the situation will generally make it clear which is the appropriate alternative hypothesis.

Example 2. This example comes from the field of medical research. Suppose that we have been using some medicine for some illness for many years (we will call this the "current" medicine), and we in effect know from much experience that the probability of a cure with the current medicine is 0.84. A new medicine is proposed and we wish to assess whether it is better than the current medicine. Here the only interesting possibility is that the new medicine is better than the current one.

If the new medicine is equally effective as the current one we would not want to introduce it, since its cure rate would be equal to that of the current medicine; and if it is (even worse) less effective than the current medicine we certainly would not want to introduce it. Let θ be the (unknown) probability of a cure with the new medicine. Here the null hypothesis is θ = 0.84. If this null hypothesis is true the new medicine is equally effective as the current one. The natural alternative in this case is "one-sided up", namely θ > 0.84, since this is the only case of interest to us. This is also a composite hypothesis.

Notice how, in both examples, the nature of the alternative hypothesis is determined by the context, and that in both cases the null and alternative hypotheses are stated before the data are seen.

Step 2

Since the decision to accept or reject H0 will be made on the basis of data derived from some random process, it is possible that an incorrect decision will be made: that is, to reject H0 when it is true (a Type I error, or "false positive"), or to accept H0 when it is false (a Type II error, or "false negative"). This is illustrated in the following table:-

                     H0 is true        H0 is false
We accept H0         OK                Type II error
We reject H0         Type I error      OK

When testing a null hypothesis against an alternative it is not possible to ensure that the probabilities of making a Type I error and a Type II error are both arbitrarily small unless we are able to make the number of observations as large as is needed to do this. In practice we are seldom able to get enough observations to do this. This dilemma is resolved in practice by observing that there is often an asymmetry in the implications of making the two types of error. In the two examples given above there might be more concern about making the false positive claim and less concern about making the false negative claim. This would be particularly true in the "medicine" example: we are anxious not to claim that the new medicine is better than the current one if it is not better. If we make this claim and the new medicine is not better than the current one, many millions of dollars will have been spent manufacturing the new medicine, only to find later that it is not better than the current one.

For this reason, a frequently adopted procedure is to focus on the Type I error, to fix the numerical value of this error at some acceptably low level (usually 1% or 5%), and not to attempt to control the numerical value of the Type II error. The value chosen is denoted α. The choice of the values 1% and 5% is reasonable, but is also clearly arbitrary. The choice 1% is a more conservative one than the choice 5% and is often made in a medical context.

Step 2 of the hypothesis testing procedure consists in choosing the numerical value for the Type I error, that is, in choosing the numerical value of α. This choice is entirely at your discretion. In the two examples that we are considering we will choose 1% for the medical example and 5% for the coin example.

Step 3

The third step in the hypothesis testing procedure consists in determining a test statistic. This is the quantity calculated from the data whose numerical value leads to acceptance or rejection of the null hypothesis. In the coin example the natural test statistic is the number of heads that we will get after we have flipped the coin in our testing procedure. In the medicine case the natural test statistic is the number of people cured with the new medicine in a clinical trial. These are both more or less obvious, and both are the correct test statistics; however, in more complicated cases the choice of a test statistic is not so straightforward.

As stated above there are two (equivalent) approaches to hypothesis testing. As also stated above, the first three steps (as outlined above) are the same for both approaches. Steps 4 and 5 differ under the two approaches, so we now consider them separately. Which approach we use is simply a matter of our preference.

Approach 1

Step 4 Under Approach 1, Step 4 in the procedure consists in determining which observed values of the test statistic lead to rejection of H0. This choice is made so as to ensure that the test has the numerical value for the Type I error chosen in Step 2. We first illustrate this step with the medicine example, where the calculations are simpler than in the coin example.

First we review steps 1-3 in this example. Step 1. We write the (unknown) probability of a cure with the new medicine as θ. The null hypothesis claims that θ = 0.84 and the alternative hypothesis claims that θ > 0.84. Step 2. Since this is a medical example we choose a Type I error of 1%. Step 3. The test statistic is the number of people cured with the new medicine in the clinical trial.

Now we proceed to steps 4 and 5. Step 4. Suppose that we plan to give the new medicine to 5,000 patients. We will reject the null hypothesis if the number x of patients who were cured with the new medicine is large enough. In other words, x is our test statistic. How large does x have to be before we will reject the null hypothesis? We will reject the null hypothesis if x ≥ A, where A is chosen so that the Type I error takes the desired value 1%.

How do we calculate the value of A? We now have to do a probability theory "zig", so we do it. We consider, before the clinical trial is conducted, the random variable X, the number of people who will be cured with the new medicine. A is chosen so that

P(X ≥ A when θ = 0.84) = 0.01.

When the null hypothesis is true, X has a binomial distribution. (Why binomial? Because there are two possible outcomes for each patient: cured or not cured.) Using the formula for the mean and the variance of a binomial random variable, if θ = 0.84 the mean of X is (5,000)(0.84) = 4,200 and the variance of X is (5,000)(0.84)(0.16) = 672. Next, to a very close approximation, X can be taken as having a normal distribution with this mean and this variance when the null hypothesis is true. So to this level of approximation, A has to be such that P(X ≥ A) = 0.01, where X has a normal distribution with mean 4,200 and variance 672.

We will use the central limit theorem and a Z chart. We now do a z-ing: we want

P((X − 4,200)/√672 ≥ (A − 4,200)/√672) = 0.01,

where (using a probability theory "zig") (X − 4,200)/√672 is a Z. The Z charts now show that (A − 4,200)/√672 has to be equal to 2.326. (You have to use the Z chart "inside-out" to find this value.) Solving the equation (A − 4,200)/√672 = 2.326 we find that A = 4260.30. To be conservative, we choose the value 4,261. Then we will reject the null hypothesis if x ≥ 4,261.

To conclude Step 4, we have made the calculation that if the number of people cured with the new medicine is 4,261 or more, we will reject the null hypothesis and claim that the new medicine is superior to the current one.

Step 5

The final step in the testing procedure is straightforward, so it is now easy to do Step 5. We do the clinical trial and count the number of people cured with the new medicine. If this number is 4,261 or larger we reject the null hypothesis and claim that the new medicine is superior to the current one. If this number is less than 4,261 we say that we do not have significant evidence that the new medicine is better than the current one.
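Here is a sketch checking the critical-point calculation for the medicine example (the inverse-Z value 2.326 is read from a chart in class; the code computes it directly).

```python
from math import sqrt, ceil
from statistics import NormalDist

n, theta0, alpha = 5000, 0.84, 0.01
mean = n * theta0                      # 4200
sd = sqrt(n * theta0 * (1 - theta0))   # sqrt(672), about 25.9

z = NormalDist().inv_cdf(1 - alpha)    # about 2.326
A = mean + z * sd                      # about 4260.3
print(round(A, 1), ceil(A))            # 4260.3 and the conservative choice 4261
```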

Note on this. The value 4,261 is sometimes called the "critical point" and the range of values "4,261 or more" is sometimes called the "critical region".

The coin example, Approach 1. First we review steps 1, 2 and 3. Step 1. We write θ as the probability of a head on each flip. The null hypothesis claims that θ = 1/2 and the alternative hypothesis claims that θ ≠ 1/2. Step 2. We choose a numerical value for α of 5%. Step 3. Suppose that we plan to flip the coin 10,000 times in our experiment. The test statistic is x, the number of heads that we will get after we have flipped the coin 10,000 times.

Now we proceed to steps 4 and 5. Step 4. This test is two-sided, so we will reject the null hypothesis if x is either too large or too small. How large or how small? We will reject the null hypothesis if x ≤ A or if x ≥ B, where A and B have to be chosen so that α = 5%. That is, we have to choose A and B so that

P(X ≤ A) + P(X ≥ B) = 0.05 when H0 is true.

We usually adopt the symmetric requirement

P(X ≤ A) = P(X ≥ B) = 0.025 when H0 is true.

Choosing A and B so as to satisfy this requirement ensures that the Type I error is indeed 5%.

We now go on a probability theory "zig", and consider the random variable X, the random number of times we will get heads before the experiment is done. When the null hypothesis is true, X has a binomial distribution with mean 5,000 and variance 2,500 (using the formula for the mean and the variance of a binomial random variable). The standard deviation of X is thus √2,500 = 50. To a sufficiently close approximation, when the null hypothesis is true X has a normal distribution with this mean and this standard deviation. Thus when the null hypothesis is true, (X − 5,000)/50 is a Z. Let us first calculate B. Carrying out a Z-ing procedure, we get

P((X − 5,000)/50 ≥ (B − 5,000)/50) = 0.025.

Since (X − 5,000)/50 is a Z when the null hypothesis is true, the Z charts show that (B − 5,000)/50 = 1.96, and solving this equation for B we get B = 5,098. Carrying out a similar operation for A we find A = 4,902.

Step 5. We now flip the coin 10,000 times. If the number of heads is 4,902 or fewer, or 5,098 or more, we reject the null hypothesis and claim that we have significant evidence that the coin is biased. If the number of heads is between 4,903 and 5,097 inclusive, we say that we do not have significant evidence to reject the null hypothesis; that is, we do not have significant evidence to claim that the coin is unfair.

Note on this. The values 4,902 and 5,098 are sometimes called the "critical points" and the range of values "x ≤ 4,902 or x ≥ 5,098" is sometimes called the "critical region".

Approach 2

We now consider Approach 2 to hypothesis testing, again using the coin and the medicine examples. As stated above, steps 1, 2 and 3 are the same under Approach 2 as they are under Approach 1. So we now move to steps 4 and 5 under Approach 2.

Step 4 Under Approach 2 we now do our experiment and note the observed value of the test statistic. Thus in the medicine example we do the clinical trial (with the 5,000 patients) and observe the number of people cured under the new medicine. In the coin example we flip the coin 10,000 times and observe the number of heads that we got. This takes us straight to Step 5.

Step 5 This step involves the calculation of a so-called P-value. Once the data are obtained we calculate the probability of obtaining the observed value of the test statistic, or one more extreme in the direction indicated by the alternative hypothesis, assuming that the null hypothesis is true. This probability is called the P-value. If the P-value is less than or equal to the chosen Type I error, the null hypothesis is rejected. This procedure always leads to a conclusion identical to that based on the significance point approach.

For example, suppose that in the medicine example the number of people cured under the new medicine was 4,272. Using the normal distribution approximation to the binomial, the P-value is the probability that a random variable X having a normal distribution with mean 4,200 and variance 672 (the null hypothesis mean and variance) takes a value 4,272 or more.

This is a straightforward probability theory "zig" operation, carried out using a Z-ing procedure and normal distribution charts. We have

P(X ≥ 4,272) = P((X − 4,200)/√672 ≥ (4,272 − 4,200)/√672),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, we obtain, from the Z chart, a P-value of 0.0027. This is less than the chosen Type I error (0.01), so we reject the null hypothesis. This agrees with the conclusion that we reached using the significance point approach (see Approach 1, Step 5), since the observed value 4,272 exceeds the critical point 4,261.

As a different example, suppose that the number cured with the new medicine was 4,250. Doing a Z-ing, we would calculate the P-value as

P(X ≥ 4,250) = P((X − 4,200)/√672 ≥ (4,250 − 4,200)/√672),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, we obtain, from the Z chart, a P-value of 0.0268. This is more than the Type I error of 0.01, so we do not have enough evidence to reject the null hypothesis. That is, we do not have enough evidence to claim that the new medicine is better than the current one. This conclusion agrees with the one we found under Approach 1, since the observed value 4,250 does not exceed the critical point 4,261.

The coin example. The P-value calculation for a two-sided alternative hypothesis such as in the coin case is more complicated than in the medicine example. Suppose for example that we obtained 5,088 heads from the 10,000 tosses. This is 88 more than the null hypothesis mean of 5,000. The P-value is then the probability of obtaining 5,088 or more heads plus the probability of getting 4,912 or fewer heads if the coin is fair (that is, if the null hypothesis is true). (Values of 4,912 or fewer are as extreme as, or more extreme than, the observed value 5,088, since they differ from the null hypothesis mean (5,000) by at least as much as 5,088 does.) Using the normal distribution approximation, the P-value is the probability that a random variable having a normal distribution with mean 5,000 and standard deviation 50 takes a value 5,088 or more, plus the probability that a random variable having a normal distribution with mean 5,000 and standard deviation 50 takes a value 4,912 or fewer. So, for this two-sided alternative, we obtain a P-value of 0.0392 + 0.0392 = 0.0784. This is more than the Type I error of 0.05, so we do not have enough evidence to reject the null hypothesis; that is, we do not have significant evidence to claim that the coin is unfair. This is exactly the same conclusion that we would have reached using Approach 1, since the observed value 5,088 lies between the critical points 4,902 and 5,098.
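The P-values quoted above can be checked with the same normal approximations (an illustrative sketch; small differences from the chart values are due to rounding).

```python
from math import sqrt
from statistics import NormalDist

medicine = NormalDist(4200, sqrt(672))
print(round(1 - medicine.cdf(4272), 4))   # about 0.0027 -> reject at 1%
print(round(1 - medicine.cdf(4250), 4))   # about 0.027 (0.0268 from the chart) -> do not reject at 1%

coin = NormalDist(5000, 50)
two_sided = (1 - coin.cdf(5088)) + coin.cdf(4912)
print(round(two_sided, 4))                # about 0.0784 -> do not reject at 5%
```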
