Contents

1 Basic ideas
  1.1 Sample space, events
  1.2 What is probability?
  1.3 Kolmogorov's Axioms
  1.4 Proving things from the axioms
  1.5 Inclusion-Exclusion Principle
  1.6 Other results about sets
  1.7 Sampling
  1.8 Stopping rules
  1.9 Questionnaire results
  1.10 Independence
  1.11 Mutual independence
  1.12 Properties of independence
  1.13 Worked examples
2 Conditional probability
  2.1 What is conditional probability?
  2.2 Genetics
  2.3 The Theorem of Total Probability
  2.4 Sampling revisited
  2.5 Bayes' Theorem
  2.6 Iterated conditional probability
  2.7 Worked examples
3 Random variables
  3.1 What are random variables?
  3.2 Probability mass function
  3.3 Expected value and variance
  3.4 Joint p.m.f. of two random variables
  3.5 Some discrete random variables
  3.6 Continuous random variables
  3.7 Median, quartiles, percentiles
  3.8 Some continuous random variables
  3.9 On using tables
  3.10 Worked examples
4 More on joint distribution
  4.1 Covariance and correlation
  4.2 Conditional random variables
  4.3 Joint distribution of continuous r.v.s
  4.4 Transformation of random variables
  4.5 Worked examples
A Mathematical notation
B Probability and random variables

Chapter 1 Basic ideas
In this chapter, we don't really answer the question 'What is probability?' Nobody has a really good answer to this question. We take a mathematical approach, writing down some basic axioms which probability must satisfy, and making deductions from these. We also look at different kinds of sampling, and examine what it means for events to be independent.

1.1 Sample space, events
The general setting is: we perform an experiment which can have a number of different outcomes. The sample space is the set of all possible outcomes of the experiment. We usually call it S.

It is important to be able to list the outcomes clearly. For example, if I plant ten bean seeds and count the number that germinate, the sample space is
S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
If I toss a coin three times and record the result, the sample space is
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},
where (for example) HTH means 'heads on the first toss, then tails, then heads again'.

Sometimes we can assume that all the outcomes are equally likely. (Don't assume this unless either you are told to, or there is some physical reason for assuming it. In the beans example, it is most unlikely. In the coins example, the assumption will hold if the coin is 'fair': this means that there is no physical reason for it to favour one side over the other.) If all outcomes are equally likely, then each has probability 1/|S|. (Remember that |S| is the number of elements in the set S.)

On this point, Albert Einstein wrote, in his 1905 paper On a heuristic point of view concerning the production and transformation of light (for which he was awarded the Nobel Prize),

    In calculating entropy by molecular-theoretic methods, the word "probability" is often used in a sense differing from the way the word is defined in probability theory. In particular, "cases of equal probability" are often hypothetically stipulated when the theoretical methods employed are definite enough to permit a deduction rather than a stipulation.

In other words: don't just assume that all outcomes are equally likely, especially when you are given enough information to calculate their probabilities!

An event is a subset of S. We can specify an event by listing all the outcomes that make it up. In the above example, let A be the event 'more heads than tails' and B the event 'heads on last throw'. Then
A = {HHH, HHT, HTH, THH},
B = {HHH, HTH, THH, TTH}.

The probability of an event is calculated by adding up the probabilities of all the outcomes comprising that event. So, if all outcomes are equally likely, we have
P(A) = |A| / |S|.
In our example, both A and B have probability 4/8 = 1/2.

An event is simple if it consists of just a single outcome, and is compound otherwise. In the example, A and B are compound events, while the event 'heads on every throw' is simple (as a set, it is {HHH}). If A = {a} is a simple event, then the probability of A is just the probability of the outcome a, and we usually write P(a), which is simpler to write than P({a}). (Note that a is an outcome, while {a} is an event, indeed a simple event.)

We can build new events from old ones:

• A ∪ B (read 'A union B') consists of all the outcomes in A or in B (or both!);
• A ∩ B (read 'A intersection B') consists of all the outcomes in both A and B;
• A \ B (read 'A minus B') consists of all the outcomes in A but not in B;
• A′ (read 'A complement') consists of all outcomes not in A (that is, S \ A);
• ∅ (read 'empty set') for the event which doesn't contain any outcomes.

Note the backward-sloping slash in A \ B; this is not the same as either a vertical slash | or a forward slash /.

In the example, A′ is the event 'more tails than heads', and A ∩ B is the event {HHH, HTH, THH}. Note that P(A ∩ B) = 3/8; this is not equal to P(A) · P(B), despite what you read in some books!
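Calculations with equally likely outcomes are easy to check mechanically. The following short Python sketch (my illustration; the notes themselves use no code) enumerates the three-toss sample space and recomputes P(A), P(B) and P(A ∩ B) by counting:

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses: 'HHH', 'HHT', ..., 'TTT'.
S = [''.join(t) for t in product('HT', repeat=3)]

def prob(event):
    # Equally likely outcomes: P(A) = |A| / |S|.
    return Fraction(len(event), len(S))

A = {s for s in S if s.count('H') > s.count('T')}   # more heads than tails
B = {s for s in S if s[-1] == 'H'}                  # heads on last throw

assert prob(A) == Fraction(1, 2)
assert prob(B) == Fraction(1, 2)
assert prob(A & B) == Fraction(3, 8)   # not equal to P(A) * P(B) = 1/4
```

Using exact Fractions rather than floats keeps the arithmetic identical to the hand calculation.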

1.2 What is probability?
There is really no answer to this question. Some people think of it as 'limiting frequency'. That is, to say that the probability of getting heads when a coin is tossed is 1/2 means that, if the coin is tossed many times, it is likely to come down heads about half the time. But if you toss a coin 1000 times, you are not likely to get exactly 500 heads. You wouldn't be surprised to get only 495; but what about 450, or 100?

Some people would say that you can work out probability by physical arguments, like the one we used for a fair coin. But this argument doesn't work in all cases, and it doesn't explain what probability means.

Some people say it is subjective. You say that the probability of heads in a coin toss is 1/2 because you have no reason for thinking either heads or tails more likely; you might change your view if you knew that the owner of the coin was a magician or a con man. But we can't build a theory on something subjective.

We regard probability as a mathematical construction satisfying some axioms (devised by the Russian mathematician A. N. Kolmogorov). We develop ways of doing calculations with probability, so that (for example) we can calculate how unlikely it is to get 480 or fewer heads in 1000 tosses of a fair coin. The answer agrees well with experiment.
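The 'limiting frequency' idea is easy to illustrate by simulation. A small Python sketch (my own illustration; the seed and the number of tosses are arbitrary choices):

```python
import random

random.seed(1)   # arbitrary seed, for reproducibility
n = 1000

# Simulate n fair coin tosses and count the heads.
heads = sum(random.random() < 0.5 for _ in range(n))

# The relative frequency is close to 1/2, but rarely exactly 500/1000.
print(heads, heads / n)
```

Running this repeatedly with different seeds shows frequencies clustering around 0.5 without ever settling on it exactly, which is just the behaviour the paragraph above describes.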

1.3 Kolmogorov's Axioms

Remember that an event is a subset of the sample space S. A number of events, say A1, A2, ..., are called mutually disjoint or pairwise disjoint if Ai ∩ Aj = ∅ for any two of the events Ai and Aj; that is, no two of the events overlap.

According to Kolmogorov's axioms, each event A has a probability P(A), which is a number. These numbers satisfy three axioms:

Axiom 1: For any event A, we have P(A) ≥ 0.

Axiom 2: P(S) = 1.

Axiom 3: If the events A1, A2, ... are mutually disjoint, then
P(A1 ∪ A2 ∪ ···) = P(A1) + P(A2) + ···.

Note that in Axiom 3 we have the union of events and the sum of numbers. Don't mix these up; never write P(A1) ∪ P(A2), for example.

Sometimes we separate Axiom 3 into two parts: Axiom 3a, for the case where there are only finitely many events A1, A2, ..., An, so that we have
P(A1 ∪ ··· ∪ An) = ∑_{i=1}^{n} P(Ai);
and Axiom 3b for infinitely many. We will only use Axiom 3a, but 3b is important later on. Notice that we write ∑_{i=1}^{n} P(Ai) for P(A1) + P(A2) + ··· + P(An).

1.4 Proving things from the axioms

You can prove simple properties of probability from the axioms. That means, every step must be justified by appealing to an axiom. These properties seem obvious, just as obvious as the axioms; but the point of this game is that we assume only the axioms, and build everything else from that.

Here are some examples of things proved from the axioms. (There is really no difference between a theorem, a proposition, and a corollary; they all have to be proved. Usually, a theorem is a big, important statement; a proposition a rather smaller statement; and a corollary is something that follows quite easily from a theorem or proposition that came before.)

Proposition 1.1 If the event A contains only a finite number of outcomes, say A = {a1, a2, ..., an}, then
P(A) = P(a1) + P(a2) + ··· + P(an).

To prove the proposition, we define a new event Ai containing only the outcome ai, that is, Ai = {ai}, for i = 1, ..., n. Then A1, ..., An are mutually disjoint (each contains only one element, which is in none of the others), and A1 ∪ A2 ∪ ··· ∪ An = A; so by Axiom 3a, we have
P(A) = P(a1) + P(a2) + ··· + P(an).

Corollary 1.2 If the sample space S is finite, say S = {a1, ..., an}, then
P(a1) + P(a2) + ··· + P(an) = 1.

For P(a1) + P(a2) + ··· + P(an) = P(S) by Proposition 1.1, and P(S) = 1 by Axiom 2. Now we see that, if all the n outcomes are equally likely, and their probabilities sum to 1, then each has probability 1/n, that is, 1/|S|. Going back to Proposition 1.1, we see that, if all outcomes are equally likely, then P(A) = |A|/|S| for any event A, justifying the principle we used earlier.

Proposition 1.3 P(A′) = 1 − P(A) for any event A.

Let A1 = A and A2 = A′ (the complement of A). Then A1 ∩ A2 = ∅ (that is, the events A1 and A2 are disjoint), and A1 ∪ A2 = S. So
P(A1) + P(A2) = P(A1 ∪ A2) (Axiom 3)
= P(S) = 1 (Axiom 2).
So P(A) = P(A1) = 1 − P(A2).

Corollary 1.4 P(A) ≤ 1 for any event A.

For 1 − P(A) = P(A′) by Proposition 1.3, and P(A′) ≥ 0 by Axiom 1; so 1 − P(A) ≥ 0, from which we get P(A) ≤ 1. Remember that if you ever calculate a probability to be less than 0 or more than 1, you have made a mistake!

Corollary 1.5 P(∅) = 0.

For ∅ = S′, so P(∅) = 1 − P(S) by Proposition 1.3; and P(S) = 1 by Axiom 2, so P(∅) = 0.

Here is another result. The notation A ⊆ B means that A is contained in B, that is, every outcome in A also belongs to B.

Proposition 1.6 If A ⊆ B, then P(A) ≤ P(B).

This time, take A1 = A and A2 = B \ A. Again we have A1 ∩ A2 = ∅ (since the elements of B \ A are, by definition, not in A), and A1 ∪ A2 = B. So by Axiom 3,
P(A1) + P(A2) = P(A1 ∪ A2) = P(B),
that is, P(A) + P(B \ A) = P(B). Now P(B \ A) ≥ 0 by Axiom 1; so P(A) ≤ P(B), as we had to show.

Notice that once we have proved something, we can use it on the same basis as an axiom to prove further facts.

1.5 Inclusion-Exclusion Principle

[Venn diagram of two sets A and B]

A Venn diagram for two sets A and B suggests that, to find the size of A ∪ B, we add the size of A and the size of B; but then we have included the size of A ∩ B twice, so we have to take it off. In terms of probability:

Proposition 1.7
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

We now prove this from the axioms, using the Venn diagram as a guide. We see that A ∪ B is made up of three parts, namely
A1 = A ∩ B, A2 = A \ B, A3 = B \ A.
Indeed we do have A ∪ B = A1 ∪ A2 ∪ A3, since anything in A ∪ B is in both these sets, or just the first, or just the second. Similarly we have A1 ∪ A2 = A and A1 ∪ A3 = B.

The sets A1, A2, A3 are mutually disjoint. (We have three pairs of sets to check. Now A1 ∩ A2 = ∅, since all elements of A1 belong to B but no elements of A2 do. The arguments for the other two pairs are similar; you should do them yourself.)

So, by Axiom 3, we have
P(A) = P(A1) + P(A2), P(B) = P(A1) + P(A3), P(A ∪ B) = P(A1) + P(A2) + P(A3).
From this we obtain
P(A) + P(B) − P(A ∩ B) = (P(A1) + P(A2)) + (P(A1) + P(A3)) − P(A1)
= P(A1) + P(A2) + P(A3)
= P(A ∪ B),
as required.

The Inclusion-Exclusion Principle extends to more than two events, but gets more complicated. Here it is for three events; try to prove it yourself.

[Venn diagram of three sets A, B and C]

To calculate P(A ∪ B ∪ C), we first add up P(A), P(B), and P(C). The parts in common have been counted twice, so we subtract P(A ∩ B), P(A ∩ C) and P(B ∩ C). But then we find that the outcomes lying in all three sets have been taken off completely, so must be put back; that is, we add P(A ∩ B ∩ C).

Proposition 1.8 For any three events A, B, C, we have
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Can you extend this to any number of events?

1.6 Other results about sets

There are other standard results about sets which are often useful in probability theory. Here are some examples.

Proposition 1.9 Let A, B, C be subsets of S.

Distributive laws: (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C) and (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C).

De Morgan's Laws: (A ∪ B)′ = A′ ∩ B′ and (A ∩ B)′ = A′ ∪ B′.

We will not give formal proofs of these. You should draw Venn diagrams and convince yourself that they work.
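Both the Inclusion-Exclusion Principle and the identities of Proposition 1.9 can be checked on concrete events. A short Python sketch (my own illustration, using two dice rolls as the sample space; the particular events A, B, C are arbitrary choices):

```python
from fractions import Fraction
from itertools import product

# Two rolls of a die: all 36 outcomes, taken as equally likely.
S = set(product(range(1, 7), repeat=2))

def P(E):
    return Fraction(len(E), len(S))

A = {s for s in S if sum(s) >= 10}      # sum at least 10
B = {s for s in S if s[0] == 6}         # first roll is a 6
C = {s for s in S if s[1] % 2 == 0}     # second roll is even

# Propositions 1.7 and 1.8 (Inclusion-Exclusion):
assert P(A | B) == P(A) + P(B) - P(A & B)
assert P(A | B | C) == (P(A) + P(B) + P(C)
                        - P(A & B) - P(A & C) - P(B & C)
                        + P(A & B & C))

# Proposition 1.9: a distributive law and a De Morgan law,
# with complement written as set difference from S.
assert (A & B) | C == (A | C) & (B | C)
assert S - (A | B) == (S - A) & (S - B)
```

Of course, a check on one example is not a proof; it is just a way of convincing yourself that the identities are stated correctly.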

I have four pens in my desk drawer; they are red, green, blue, and purple. I draw a pen; each pen has the same chance of being selected. In this case, the sample space is
S = {R, G, B, P},
where R means 'red pen chosen' and so on. In this case, if A is the event 'red or green pen chosen', then
P(A) = |A| / |S| = 2/4 = 1/2.
More generally, if I have a set of n objects and choose one, with each one equally likely to be chosen, then each of the n outcomes has probability 1/n, and an event consisting of m of the outcomes has probability m/n.

What if we choose more than one pen? We have to be more careful to specify the sample space. First, we have to say whether we are

• sampling with replacement, or
• sampling without replacement.

Sampling with replacement means that we choose a pen, note its colour, put it back and shake the drawer, then choose a pen again (which may be the same pen as before or a different one), and so on until the required number of pens have been chosen. If we choose two pens with replacement, the sample space is
{RR, RG, RB, RP, GR, GG, GB, GP, BR, BG, BB, BP, PR, PG, PB, PP}.
The event 'at least one red pen' is {RR, RG, RB, RP, GR, BR, PR}, and has probability 7/16.

Sampling without replacement means that we choose a pen but do not put it back, so that our final selection cannot include two pens of the same colour. In this case, the sample space for choosing two pens is
{RG, RB, RP, GR, GB, GP, BR, BG, BP, PR, PG, PB},
and the event 'at least one red pen' is {RG, RB, RP, GR, BR, PR}, with probability 6/12 = 1/2.

Now there is another issue, depending on whether we care about the order in which the pens are chosen. We will only consider this in the case of sampling without replacement. It doesn't really matter in this case whether we choose the pens one at a time or simply take two pens out of the drawer, and we are not interested in which pen was chosen first. So in this case the sample space is
{{R,G}, {R,B}, {R,P}, {G,B}, {G,P}, {B,P}},
containing six elements. (Each element is written as a set since, in a set, we don't care which element is first, only which elements are actually present. So the sample space is a set of sets!) The event 'at least one red pen' is {{R,G}, {R,B}, {R,P}}, with probability 3/6 = 1/2. We should not be surprised that this is the same as in the previous case.

There are formulae for the sample space size in these three cases. These involve the following functions:
n! = n(n−1)(n−2) ··· 1,
nPk = n(n−1)(n−2) ··· (n−k+1),
nCk = nPk / k!.
Note that n! is the product of all the whole numbers from 1 to n, and that
nPk = n! / (n−k)!,
so that
nCk = n! / (k!(n−k)!).

Theorem 1.10 The number of selections of k objects from a set of n objects is given in the following table:

                   with replacement   without replacement
ordered sample     n^k                nPk
unordered sample                      nCk

In fact the number that goes in the empty box is (n+k−1)Ck, but this is much harder to prove than the others, and you are very unlikely to need it.

Here are the proofs of the other three cases. First, for sampling with replacement and ordered sample, there are n choices for the first object, and n choices for the second, and so on; we multiply the choices for different objects. (Think of the choices as being described by a branching tree.) The product of k factors each equal to n is n^k.

For sampling without replacement and ordered sample, there are still n choices for the first object, but now only n−1 choices for the second (since we do not replace the first), and n−2 for the third, and so on; there are n−k+1 choices for the kth object, since k−1 have previously been removed and n−(k−1) remain. As before, we multiply. This product is the formula for nPk.

For sampling without replacement and unordered sample, think first of choosing an ordered sample, which we can do in nPk ways. But each unordered sample could be obtained by drawing it in k! different orders. So we divide by k!, obtaining nPk/k! = nCk choices.

In our example with the pens, the numbers in the three boxes are 4² = 16, 4P2 = 12, and 4C2 = 6, in agreement with what we got when we wrote them all out.
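Python's standard math module happens to provide exactly these counting functions (math.perm and math.comb, available since Python 3.8), so Theorem 1.10 can be checked directly for the pens example. A sketch (my illustration):

```python
from math import comb, factorial, perm

n, k = 4, 2   # four pens, choose two

assert n ** k == 16            # ordered, with replacement
assert perm(n, k) == 12        # ordered, without replacement: nPk
assert comb(n, k) == 6         # unordered, without replacement: nCk

# The formulae relating the three functions:
assert perm(n, k) == factorial(n) // factorial(n - k)
assert comb(n, k) == perm(n, k) // factorial(k)

# The 'hard' empty box: unordered sampling with replacement.
assert comb(n + k - 1, k) == 10
```

Integer division (//) is used because all these quantities are exact whole numbers.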

Note that, if we use the phrase 'sampling without replacement, ordered sample', or any other combination, we are assuming that all outcomes are equally likely.

Example The names of the seven days of the week are placed in a hat. Three names are drawn out; these will be the days of the Probability I lectures. What is the probability that no lecture is scheduled at the weekend?

Here the sampling is without replacement, and we can take it to be either ordered or unordered; the answers will be the same. For ordered samples, the size of the sample space is 7P3 = 7 · 6 · 5 = 210. If A is the event 'no lectures at weekends', then A occurs precisely when all three days drawn are weekdays, so |A| = 5P3 = 5 · 4 · 3 = 60. Thus P(A) = 60/210 = 2/7. If we decided to use unordered samples instead, the answer would be 5C3 / 7C3, which is once again 2/7.

Example A six-sided die is rolled twice. What is the probability that the sum of the numbers is at least 10?

This time we are sampling with replacement, since the two numbers may be the same or different. So the number of elements in the sample space is 6² = 36. To obtain a sum of 10 or more, the possibilities for the two numbers are (4,6), (5,5), (5,6), (6,4), (6,5) or (6,6). So the probability of the event is 6/36 = 1/6.

Example A box contains 20 balls, of which 10 are red and 10 are blue. We draw ten balls from the box, and we are interested in the event that exactly 5 of the balls are red and 5 are blue. Do you think that this is more likely to occur if the draws are made with or without replacement?

Let S be the sample space, and A the event that five balls are red and five are blue.

Consider sampling with replacement. Then |S| = 20^10. What is |A|? The number of ways in which we can choose first five red balls and then five blue ones (that is, RRRRRBBBBB) is 10^5 · 10^5 = 10^10. But there are many other ways to get five red and five blue balls: the five red balls could appear in any five of the ten draws. This means that there are 10C5 = 252 different patterns of five Rs and five Bs. So we have
|A| = 252 · 10^10,
and so
P(A) = 252 · 10^10 / 20^10 = 0.246...

Now consider sampling without replacement. If we regard the sample as being ordered, then |S| = 20P10. There are 10P5 ways of choosing five of the ten red balls, and the same for the ten blue balls, and as in the previous case there are 10C5 patterns of red and blue balls. So
|A| = (10P5)² · 10C5,
and
P(A) = (10P5)² · 10C5 / 20P10 = 0.343...
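These two counting arguments are easy to mistype by hand, so here is a Python sketch (my own check, using math.perm and math.comb) that reproduces both probabilities:

```python
from math import comb, perm

# With replacement: |S| = 20^10 and |A| = C(10,5) * 10^10.
p_with = comb(10, 5) * 10**10 / 20**10

# Without replacement, ordered samples: |S| = 20P10, |A| = (10P5)^2 * C(10,5).
p_ordered = perm(10, 5) ** 2 * comb(10, 5) / perm(20, 10)

# Unordered samples must give the same answer: |S| = 20C10, |A| = C(10,5)^2.
p_unordered = comb(10, 5) ** 2 / comb(20, 10)

assert abs(p_ordered - p_unordered) < 1e-12
print(round(p_with, 3), round(p_ordered, 3))   # with < without replacement
```

The agreement of the ordered and unordered computations is a useful consistency check on the counting.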

. This is the same answer as in the case before.20P10 = 0. What is the probability that they are all of different material? What is the probability that they are all of the same material? In this case the sampling is without replacement and the sample is unordered. f p. 12 CHAPTER 1. So the probability is 72/286 = 0. . f f f }. as it should be. So the sample space is S = {p. Let us assume that the probability that you pass the test is 0. you don¶t need to take it again. where for example f f p denotes the outcome that you fail twice and pass on your third attempt. . they should give the same answer. (By Proposition 3. then use the formula for ordered samples. But it is unreasonable here to assume that all the outcomes are equally likely. then your chance of eventually passing the test and getting the certificate would be 3/4. you should first read the question carefully and decide whether the sampling is with or without replacement. then |S| = 20C10. So |A| = (10C5)2. In a sampling problem. does the question say anything about the first object drawn?). no matter how many times you have . If it is without replacement. .2. BASIC IDEAS The event that the three coins are of the same material can occur in 6C3+4C3+3C3 = 20+4+1 = 25 ways.087. the question doesn¶t care about order of choices! So the event is more likely if we sample with replacement. whichever is convenient. and P(A) = (10C5)2 20C10 = 0.8 Stopping rules Suppose that you take a typing proficiency test.8. then use the formula for sampling with replacement. then you can use either ordered or unordered samples. you may be very likely to pass on the first attempt.) Let us further assume that. Of course. We no longer have to count patterns since we don¶t care about the order of the selection.252. your chance of failing is 0. So |S| = 13C3 = 286. . and the probability is 25/286 = 0. You are allowed to take the test up to three times. .343. decide whether the sample is ordered (e.g. if you pass the test. 
If all outcomes were equally likely. If the sample is with replacement.343. There are 10C5 choices of the five red balls and the same for the blue balls. If we regard the sample as being unordered. The event that the three coins are all of different material can occur in 6 ·4 ·3 = 72 ways. Example I have 6 gold coins. since we must have one of the six gold coins. For example. 1. If not. or if it involves throwing a die or coin several times. and so on. . . I take out three coins at random. 4 silver coins and 3 bronze coins in my pocket. f f p. If so.

Then we have
P(p) = 0.8,
P(fp) = 0.2 · 0.8 = 0.16,
P(ffp) = 0.2² · 0.8 = 0.032,
P(fff) = 0.2³ = 0.008.
Thus the probability that you eventually get the certificate is
P({p, fp, ffp}) = 0.8 + 0.16 + 0.032 = 0.992.
Alternatively, you eventually get the certificate unless you fail three times, so the probability is 1 − 0.008 = 0.992.

A stopping rule is a rule of the type described here, namely: continue the experiment until some specified occurrence happens. The experiment may potentially be infinite. For example, if you toss a coin repeatedly until you obtain heads, the sample space is
S = {H, TH, TTH, TTTH, ...},
since in principle you may get arbitrarily large numbers of tails before the first head. (We have to allow all possible outcomes.) In the typing test, the rule is 'stop if either you pass or you have taken the test three times'. This ensures that the sample space is finite.

In the next chapter, we will have more to say about the 'multiplication rule' we used for calculating the probabilities. In the meantime you might like to consider whether it is a reasonable assumption for tossing a coin, or for someone taking a series of tests.

Other kinds of stopping rules are possible. For example, the number of coin tosses might be determined by some other random process, such as the roll of a die; or we might toss a coin until we have obtained heads twice; and so on. We will not deal with these.

1.9 Questionnaire results

The students in the Probability I class in Autumn 2000 filled in the following questionnaire:

1. I have a hat containing 20 balls, 10 red and 10 blue. I draw 10 balls from the hat. I am interested in the event that I draw exactly five red and five blue balls. Do you think that this is more likely if I note the colour of each ball I draw and replace it in the hat, or if I don't replace the balls in the hat after drawing?
   More likely with replacement [ ]   More likely without replacement [ ]

2. What colour are your eyes?
   Blue [ ]   Brown [ ]   Green [ ]   Other [ ]

3. Do you own a mobile phone?
   Yes [ ]   No [ ]

After discarding incomplete questionnaires, the results were as follows:

Answer to question 1:   "More likely with replacement"   "More likely without replacement"
Eyes:                   Brown     Other                  Brown     Other
Mobile phone            35        4                      35        9
No mobile phone         10        3                      7         1

What can we conclude?

Half the class thought that, in the experiment with the coloured balls, sampling with replacement makes the result more likely. In fact, as we saw earlier in the chapter, it is more likely if we sample without replacement. (This doesn't matter, since the students were instructed not to think too hard about it!)

You might expect that eye colour and mobile phone ownership would have no influence on your answer. Let's test this. If true, then of the 87 people with brown eyes, we would expect half (that is, 43 or 44) to answer "with replacement", whereas in fact 45 did; and of the 83 people with mobile phones, half of them (i.e. 41 or 42) would answer "with replacement", whereas in fact 39 of them did. So perhaps we have demonstrated that people who own mobile phones are slightly smarter than average, whereas people with brown eyes are slightly less smart! In fact we have shown no such thing, since our results refer only to the people who filled out the questionnaire. But they do show that these events are not independent, in a sense we will come to soon.

1.10 Independence

Two events A and B are said to be independent if
P(A ∩ B) = P(A) · P(B).
This is the definition of independence of events. If you are asked in an exam to define independence of events, this is the correct answer. Do not say that two events are independent if one has no influence on the other; and under no circumstances say that A and B are independent if A ∩ B = ∅ (this is the statement that A and B are disjoint, which is quite a different thing!). Also, do not ever say that P(A ∩ B) = P(A) · P(B) unless you have some good reason for assuming that A and B are independent (either because this is given in the question, or as in the next-but-one paragraph).

Let us return to the questionnaire example. Suppose that a student is chosen at random from those who filled out the questionnaire. Let A be the event that this student thought that the event was more likely if we sample with replacement; B the event that the student has brown eyes; and C the event that the student has a mobile phone. Then
P(A) = 52/104 = 0.5,
P(B) = 87/104 = 0.8365,
P(C) = 83/104 = 0.7981.
Furthermore,
P(A ∩ B) = 45/104 = 0.4327, P(A) · P(B) = 0.4183,
P(A ∩ C) = 39/104 = 0.375, P(A) · P(C) = 0.3990,
P(B ∩ C) = 70/104 = 0.6731, P(B) · P(C) = 0.6676.
So none of the three pairs is independent, but in a sense B and C 'come closer' than either of the others.

On the other hand, since 83 out of 104 people have mobile phones, if we think that phone ownership and eye colour are independent, we would expect that the same fraction 83/104 of the 87 brown-eyed people would have phones, that is, (83 · 87)/104 = 69.4 people. In fact the number is 70, or as near as we can expect. So indeed it seems that eye colour and phone ownership are more-or-less independent.
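Since every probability here is a count divided by 104, the independence checks reduce to exact integer arithmetic. A Python sketch (my own illustration; the counts are read from the table above, and the helper name is mine):

```python
# Counts from the questionnaire table (104 complete questionnaires).
total = 104
A = 52            # answered "with replacement"
B = 87            # brown eyes
C = 83            # mobile phone
AB, AC, BC = 45, 39, 70   # pairwise intersection counts

def independent(n_xy, n_x, n_y):
    # X and Y are independent iff P(X ∩ Y) = P(X) P(Y),
    # i.e. n_xy / total == (n_x / total) * (n_y / total),
    # i.e. n_xy * total == n_x * n_y, exactly.
    return n_xy * total == n_x * n_y

assert not independent(AB, A, B)
assert not independent(AC, A, C)
assert not independent(BC, B, C)

# B and C 'come closer': under independence we would expect
# 83 * 87 / 104, about 69.4, brown-eyed phone owners; the observed count is 70.
print(B * C / total, BC)
```

Working with the cross-multiplied counts avoids any rounding in the comparison.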

Example I have two red pens, one green pen, and one blue pen. I choose two pens without replacement. Let A be the event that I choose exactly one red pen, and B the event that I choose exactly one green pen. If the pens are called R1, R2, G, B, then
S = {R1R2, R1G, R1B, R2G, R2B, GB},
A = {R1G, R1B, R2G, R2B},
B = {R1G, R2G, GB}.
We have P(A) = 4/6 = 2/3, P(B) = 3/6 = 1/2, and P(A ∩ B) = 2/6 = 1/3 = P(A) · P(B), so A and B are independent.

But before you say 'that's obvious', suppose that I have also a purple pen, and I do the same experiment. This time, if you write down the sample space and the two events and do the calculations, you will find that P(A) = 6/10 = 3/5, P(B) = 4/10 = 2/5, and P(A ∩ B) = 2/10 = 1/5 ≠ P(A) · P(B), so adding one more pen has made the events non-independent!

We see that it is very difficult to tell whether events are independent or not. In practice, assume that events are independent only if either you are told to assume it, or it is the case that the event A has no effect on the outcome of event B; in that case A and B are independent. But this does not apply in the other direction: there might be a very definite connection between A and B, but still it could happen that P(A ∩ B) = P(A) · P(B), so that A and B are independent. We will see an example shortly.

In general, it is always OK to assume that the outcomes of different tosses of a coin, or different throws of a die, are independent. For example, if we roll a fair six-sided die twice, then the probability of getting 4 on the first throw and 5 on the second is 1/36, since we assume that all 36 combinations of the two throws are equally likely; and the separate probabilities of getting 4 on the first throw and of getting 5 on the second are both equal to 1/6. But (1/36) = (1/6) · (1/6), so the two events are independent. This would work just as well for any other combination. This holds even if the outcomes are not all equally likely. (There is one other case where you can assume independence: this is the result of different draws, with replacement, from a set of objects.)

Example Consider the experiment where we toss a fair coin three times and note the results. Each of the eight possible outcomes has probability 1/8. Let A be the event 'there are more heads than tails', and B the event 'the results of the first two tosses are the same'. Then

• A = {HHH, HHT, HTH, THH};
• B = {HHH, HHT, TTH, TTT};
• A ∩ B = {HHH, HHT}.

So P(A) = 1/2, P(B) = 1/2, and P(A ∩ B) = 1/4, so A and B are independent. However, both A and B clearly involve the results of the first two tosses, and it is not possible to make a convincing argument that one of these events has no influence or effect on the other.

P(C) = 1/2. . ‡ C = {HHH. Example In the coin-tossing example. . The correct definition and proposition run as follows. . we only required all pairs of events to be exclusive in order to justify our conclusion. we know that P(B0) = 1íP(B).TTH}. and that the events A\B. ‡ A\C = {HHH.B. all 64 possible outcomes are equally likely. In other words. 18 CHAPTER 1. B the event µfirst and third tosses have the same result.of these events has no influence or effect on the other. For example. P(A\C) = 3/8. in Axiom 3. Now all you really need to know is that the same µphysical¶ arguments that justify that two events (such as two tosses of a coin.THH}.THH. . . We saw in the coin-tossing example above that it is possible to have three events A. and A\B\C are all equal to {HHH.C are not mutually independent. B and C are independent.HTH. so A and C are not independent. also justify that any number of such events are mutually independent. for example. . You should check that P(A) = P(B) = P(C) = 1/2.TTT}. 1. as we saw in Part 1.12. Then. . but A and C are not independent. . and C the event µsecond and third tosses have same result¶.B. From Corollary 4. PROPERTIES OF INDEPENDENCE 17 If all three pairs of events happen to be independent. Are B and C independent? 1. the events Ai1 \Ai2 \···\Aikí1 and Aik are independent. . Also. the events A\B and . the probability of getting the sequence HHTHHT is (1/2)6 = 1/64. Unfortunately it is not true. You will need to know the conclusions. and asked to prove that P(A\B0) = P(A) ·P(B0). though the arguments we use to reach them are not so important. and the same would apply for any other sequence. We say that these events are mutually independent if. if we toss a fair coin six times. .HTH. can we then conclude that P(A\B\C) = P(A) ·P(B) ·P(C)? At first sight this seems very reasonable. A\C. So A. ik with k _ 1. Thus any pair of the three events are independent. or two throws of a die) are independent. 
Let A1, ..., An be events. We say that these events are mutually independent if, given any distinct indices i1, i2, ..., ik with k ≥ 1, the events Ai1 ∩ Ai2 ∩ ··· ∩ Aik−1 and Aik are independent. In other words, any one of the events is independent of the intersection of any number of the other events in the set.

Proposition 1.11 Let A1, ..., An be mutually independent. Then P(A1 ∩ A2 ∩ ··· ∩ An) = P(A1)·P(A2)···P(An).

Now all you really need to know is that the same 'physical' arguments that justify that two events (such as two tosses of a coin, or two throws of a die) are independent also justify that any number of such events are mutually independent. So, for example, if we toss a fair coin six times, all 64 possible outcomes are equally likely; the probability of getting the sequence HHTHHT is (1/2)^6 = 1/64, and the same would apply for any other sequence.

1.12 Properties of independence

Proposition 1.12 If A and B are independent, then A and B′ are independent.

Proof We are given that P(A∩B) = P(A)·P(B), and asked to prove that P(A∩B′) = P(A)·P(B′).

The events A∩B and A∩B′ are disjoint (since no outcome can be both in B and B′), and their union is A (since every outcome in A is either in B or in B′). So, by Axiom 3, we have P(A) = P(A∩B) + P(A∩B′). Thus

P(A∩B′) = P(A) − P(A∩B)
        = P(A) − P(A)·P(B)   (since A and B are independent)
        = P(A)·(1 − P(B))
        = P(A)·P(B′),

which is what we were required to prove.

Corollary 1.13 If A and B are independent, so are A′ and B′. Apply the Proposition twice: first to A and B (to show that A and B′ are independent), and then to B′ and A (to show that B′ and A′ are independent).

More generally, if events A1, ..., An are mutually independent, and we replace some of them by their complements, then the resulting events are mutually independent. We have to be a bit careful though; for example, A and A′ are not usually independent!

Results like the following are also true.

Proposition 1.14 Let events A, B, C be mutually independent. Then A and B∩C are independent, and A and B∪C are independent.

Example Consider the example of the typing proficiency test that we looked at earlier. You are allowed up to three attempts to pass the test. Suppose that your chance of passing the test is 0.8. Suppose also that the events of passing the test on any number of different occasions are mutually independent. Then, by Proposition 1.11, the probability of any sequence of passes and fails is the product of the probabilities of the terms in the sequence. That is, P(p) = 0.8, P(fp) = (0.2)·(0.8), P(ffp) = (0.2)²·(0.8), and P(fff) = (0.2)³, as we claimed in the earlier example. In other words,
mutual independence is the condition we need to justify the argument we used in that example.

Example The electrical apparatus in the diagram works so long as current can flow from left to right. The three components are independent. The probability that component A works is 0.8; the probability that component B works is 0.9; and the probability that component C works is 0.75. Find the probability that the apparatus works.

[Diagram: components A and B in series on one path; component C alone on a parallel path.]

At risk of some confusion, we use the letters A, B and C for the events 'component A works', 'component B works', and 'component C works', respectively. Now the apparatus will work if either A and B are both working, or C is working (or

possibly both). Thus the event we are interested in is (A∩B)∪C. Now

P((A∩B)∪C) = P(A∩B) + P(C) − P(A∩B∩C)   (by Inclusion–Exclusion)
           = P(A)·P(B) + P(C) − P(A)·P(B)·P(C)   (by mutual independence)
           = (0.8)·(0.9) + (0.75) − (0.8)·(0.9)·(0.75)
           = 0.72 + 0.75 − 0.54
           = 0.93.

The problem can also be analysed in a different way. The apparatus will not work if both paths are blocked, that is, if C is not working and one of A and B is also not working. Thus, the event that the apparatus does not work is (A′∪B′)∩C′. By the Distributive Law, this is equal to (A′∩C′)∪(B′∩C′). We have

P((A′∩C′)∪(B′∩C′)) = P(A′∩C′) + P(B′∩C′) − P(A′∩B′∩C′)   (by Inclusion–Exclusion)
                  = P(A′)·P(C′) + P(B′)·P(C′) − P(A′)·P(B′)·P(C′)   (by mutual independence of A′, B′, C′)
                  = (0.2)·(0.25) + (0.1)·(0.25) − (0.2)·(0.1)·(0.25)
                  = 0.05 + 0.025 − 0.005
                  = 0.07,

so the apparatus works with probability 1 − 0.07 = 0.93.

There is a trap here which you should take care to avoid. You might be tempted to say P(A′∩C′) = (0.2)·(0.25) = 0.05 and P(B′∩C′) = (0.1)·(0.25) = 0.025, and conclude that P((A′∩C′)∪(B′∩C′)) = 0.05 + 0.025 − (0.05)·(0.025) = 0.07375 by the Principle of Inclusion and Exclusion. But this is not correct, since the events A′∩C′ and B′∩C′ are not independent!

Example We can always assume that successive tosses of a coin are mutually independent, even if it is not a fair coin. Suppose that I have a coin which has probability 0.6 of coming down heads. I toss the coin three times. What are the probabilities of getting three heads, two heads, one head, or no heads?

Since successive tosses are mutually independent, the probability of three heads is (0.6)³ = 0.216. The probability of tails on any toss is 1 − 0.6 = 0.4. Now the event 'two heads' can occur in three possible ways, as HHT, HTH, or THH, and each of these outcomes has probability (0.6)·(0.6)·(0.4) = 0.144; so the probability of two heads is 3·(0.144) = 0.432. Similarly the probability of one head is 3·(0.6)·(0.4)² = 0.288, and the probability of no heads is (0.4)³ = 0.064. As a check, 0.216 + 0.432 + 0.288 + 0.064 = 1.
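Both calculations above can be checked by brute-force enumeration over the eight states of the components (and the eight coin sequences). This is a sketch, not part of the notes; the variable names are mine:

```python
from itertools import product

# Apparatus: works if (A and B both work) or C works; components independent,
# with P(A) = 0.8, P(B) = 0.9, P(C) = 0.75 as in the text.
pA, pB, pC = 0.8, 0.9, 0.75
works = 0.0
for a, b, c in product([True, False], repeat=3):
    pr = (pA if a else 1 - pA) * (pB if b else 1 - pB) * (pC if c else 1 - pC)
    if (a and b) or c:
        works += pr
print(round(works, 2))  # 0.93

# Biased coin with P(heads) = 0.6, tossed three times (tosses independent):
# accumulate the probability of each number of heads.
heads_dist = {k: 0.0 for k in range(4)}
for seq in product("HT", repeat=3):
    pr = 1.0
    for s in seq:
        pr *= 0.6 if s == "H" else 0.4
    heads_dist[seq.count("H")] += pr
```

Enumerating all outcomes and multiplying the per-component (or per-toss) probabilities is exactly what mutual independence licenses; Inclusion–Exclusion is then unnecessary, which makes this a useful cross-check.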
1.13 Worked examples

Question (a) You go to the shop to buy a toothbrush. The toothbrushes there are red, blue, green, purple and white. The probability that you buy a red toothbrush is three times the probability that you buy a green one; the probability that you buy a blue one is twice the probability that you buy a green one; and the probabilities of buying a green, a purple, and a white toothbrush are all equal. You are certain to buy exactly one toothbrush. For each colour, find the probability that you buy a toothbrush of that colour.

(b) James and Simon share a flat, so it would be confusing if their toothbrushes were the same colour. On the first day of term they both go to the shop to buy a toothbrush. For each of James and Simon, the probability of buying a toothbrush of each colour is as calculated in (a), and their choices are independent. Find the probability that they buy toothbrushes of the same colour.

(c) James and Simon live together for three terms. On the first day of each term they buy new toothbrushes, with probabilities as in (b), independently of what they had bought before. This is the only time that they change their toothbrushes. Is it more likely that they will have differently coloured toothbrushes from each other for all three terms, or that they will sometimes have toothbrushes of the same colour?

Solution (a) Let R, B, G, P, W be the events that you buy a red, blue, green, purple or white toothbrush respectively, and let x = P(G). We are given that P(R) = 3x, P(B) = 2x, and P(P) = P(W) = x. Since these outcomes comprise the whole sample space, Corollary 2 gives 3x + 2x + x + x + x = 1, so x = 1/8. Thus the five probabilities are 3/8, 1/4, 1/8, 1/8, 1/8 respectively.

(b) Let RB denote the event 'James buys a red toothbrush and Simon buys a blue toothbrush', and similarly for the other pairs of colours. By independence (given), we have, for example, P(RR) = (3/8)·(3/8) = 9/64, and similarly for the other outcomes. The event that the toothbrushes have the same colour consists of the five outcomes RR, BB, GG, PP, WW, so its probability is P(RR) + P(BB) + P(GG) + P(PP) + P(WW) = 9/64 + 1/16 + 1/64 + 1/64 + 1/64 = 1/4.
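Part (b) can be verified with exact fractions; this short sketch (names mine, not from the notes) sums p² over the five colours:

```python
from fractions import Fraction

# Colour probabilities from part (a): red 3/8, blue 1/4, green/purple/white 1/8 each.
colours = {"red": Fraction(3, 8), "blue": Fraction(1, 4),
           "green": Fraction(1, 8), "purple": Fraction(1, 8), "white": Fraction(1, 8)}
assert sum(colours.values()) == 1  # the five probabilities form a distribution

# The choices are independent, so P(both buy colour c) = p_c * p_c;
# the same-colour event is the disjoint union over the five colours.
same_colour = sum(p * p for p in colours.values())
print(same_colour)
```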

(c) The event 'different coloured toothbrushes in the i-th term' has probability 3/4 (from part (b)), and these events are independent. So the event 'different coloured toothbrushes in all three terms' has probability (3/4)·(3/4)·(3/4) = 27/64. The event 'same coloured toothbrushes in at least one term' is the complement of this, so has probability 1 − (27/64) = 37/64. So it is more likely that they will have the same colour in at least one term.

Question There are 24 elephants in a game reserve. The warden tags six of the elephants with small radio transmitters and returns them to the reserve. The next month, he randomly selects five elephants from the reserve and counts how many of these elephants are tagged. Assume that no elephants leave or enter the reserve, or die or give birth, between the tagging and the selection. Find the probability that exactly two of the selected elephants are tagged.

Solution The experiment consists of picking the five elephants, not the original choice of six elephants for tagging. Let S be the sample space, and assume that all outcomes of the selection are equally likely. Taking an unordered sample, |S| = 24C5. Let A be the event that exactly two of the selected elephants are tagged. This involves choosing two of the six tagged elephants and three of the eighteen untagged ones, so |A| = 6C2·18C3. Thus

P(A) = 6C2·18C3 / 24C5 = 0.288 to 3 d.p.

Note: Should the sample be ordered or unordered? Since the answer doesn't depend on the order in which the elephants are caught, an unordered sample is preferable. If you want to use an ordered sample, the calculation is P(A) = 6P2·18P3·5C2 / 24P5 = 0.288 to 3 d.p., since it is necessary to multiply by the 5C2 possible patterns of tagged and untagged elephants in a sample of five with two tagged.
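Both counting routes can be checked directly with the standard library's binomial and permutation functions (a sketch; variable names are mine):

```python
from math import comb, perm

# Unordered sample: choose 5 of the 24 elephants; favourable outcomes choose
# 2 of the 6 tagged and 3 of the 18 untagged.
p_unordered = comb(6, 2) * comb(18, 3) / comb(24, 5)

# Ordered sample: permutations, with a factor comb(5, 2) for the possible
# positions of the two tagged elephants within the sequence of five.
p_ordered = perm(6, 2) * perm(18, 3) * comb(5, 2) / perm(24, 5)
```

The two expressions describe the same event in two different sample spaces, so they agree exactly.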

Question A couple are planning to have a family. They decide to stop having children either when they have two boys or when they have four children. Suppose that they are successful in their plan.
(a) Write down the sample space.
(b) Assume that, each time that they have a child, the probability that it is a boy is 1/2, independent of all other times. Find P(E) and P(F), where E = "there are at least two girls" and F = "there are more girls than boys".

Solution (a) S = {BB, BGB, GBB, BGGB, GBGB, GGBB, BGGG, GBGG, GGBG, GGGB, GGGG}.
(b) Here we cannot simply count outcomes, since they are not all equally likely. We have P(BB) = 1/4, P(BGB) = 1/8, P(BGGB) = 1/16, and similarly for the other outcomes: an outcome with k children has probability (1/2)^k. Now E = {BGGB, GBGB, GGBB, BGGG, GBGG, GGBG, GGGB, GGGG}, so P(E) = 8/16 = 1/2; and F = {BGGG, GBGG, GGBG, GGGB, GGGG}, so P(F) = 5/16.

Chapter 2 Conditional probability

In this chapter we develop the technique of conditional probability to deal with cases where events are not independent.

2.1 What is conditional probability?

Alice and Bob are going out to dinner. They toss a fair coin 'best of three' to decide who pays: if there are more heads than tails in the three tosses then Alice pays; otherwise Bob pays. Clearly each has a 50% chance of paying. The sample space is S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, and the events 'Alice pays' and 'Bob pays' are respectively A = {HHH, HHT, HTH, THH} and B = {HTT, THT, TTH, TTT}.

They toss the coin once, and the result is heads; call this event E. How should we now reassess their chances? We have E = {HHH, HHT, HTH, HTT}, and the outcomes favouring 'Alice pays' and 'Bob pays' are now A∩E = {HHH, HHT, HTH} and B∩E = {HTT}. In the new experiment, E becomes the sample space, since the outcomes not in E are no longer possible. Thus the new probabilities that Alice and Bob pay for dinner are 3/4 and 1/4 respectively.

In general, suppose that we are given that an event E has occurred, and we want to compute the probability that another event A occurs. In general, we can no longer just count outcomes, since the outcomes may not be equally likely.
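The family question above shows how to work with unequally likely outcomes: weight each outcome by (1/2)^k, where k is its number of children. A short sketch (not part of the notes; names are mine) verifies the worked answers exactly:

```python
from fractions import Fraction

# Stopping rule: stop at two boys or at four children; each child is a boy
# with probability 1/2, independently, so an outcome with k children has
# probability (1/2)**k.
outcomes = ["BB", "BGB", "GBB", "BGGB", "GBGB", "GGBB",
            "BGGG", "GBGG", "GGBG", "GGGB", "GGGG"]
prob = {w: Fraction(1, 2) ** len(w) for w in outcomes}
assert sum(prob.values()) == 1  # the probabilities sum to 1, as they must

p_E = sum(prob[w] for w in outcomes if w.count("G") >= 2)            # at least two girls
p_F = sum(prob[w] for w in outcomes if w.count("G") > w.count("B"))  # more girls than boys
```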
The correct definition is as follows.

Let E be an event with non-zero probability, and let A be any event. The conditional probability of A given E is defined as

P(A | E) = P(A∩E)/P(E).

Again I emphasise that this is the definition. If you are asked for the definition of conditional probability, it is not enough to say 'the probability of A given that E has occurred', although this is the best way to understand it. There is no reason why event E should occur before event A! Note the vertical bar in the notation: this is P(A | E), not P(A/E) or P(A\E). Note also that the definition only applies in the case where P(E) is not equal to zero, since we have to divide by it, and this would make no sense if P(E) = 0.

To check the formula in our example:

P(A | E) = P(A∩E)/P(E) = (3/8)/(1/2) = 3/4,
P(B | E) = P(B∩E)/P(E) = (1/8)/(1/2) = 1/4.

It may seem like a small matter, but you should be familiar enough with this formula that you can write it down without stopping to think about the names of the events.

Example A random car is chosen among all those passing through Trafalgar Square on a certain day. The probability that the car is yellow is 3/100; the probability that the driver is blonde is 1/5; and the probability that the car is yellow and the driver is blonde is 1/50. Find the conditional probability that the driver is blonde given that the car is yellow.

Solution: If Y is the event 'the car is yellow' and B the event 'the driver is blonde', then we are given that P(Y) = 0.03, P(B) = 0.2, and P(Y∩B) = 0.02. So P(B | Y) = P(B∩Y)/P(Y) = 0.02/0.03 = 0.667 to 3 d.p. Note that we haven't used all the information given: the value of P(B) was not needed.

There is a connection between conditional probability and independence:

Proposition 2.1 Let A and B be events with P(B) ≠ 0. Then A and B are independent if and only if P(A | B) = P(A).

Proof The words 'if and only if' tell us that we have two jobs to do: we have to show that if A and B are independent, then P(A | B) = P(A); and that if P(A | B) = P(A), then A and B are independent.

So first suppose that A and B are independent. Remember that this means that P(A∩B) = P(A)·P(B). Then P(A | B) = P(A∩B)/P(B) = P(A)·P(B)/P(B) = P(A), as we had to prove.

Now suppose that P(A | B) = P(A). In other words, using the definition of conditional probability, P(A∩B)/P(B) = P(A). Now clearing fractions gives P(A∩B) = P(A)·P(B), which is just what the statement 'A and B are independent' means.

This proposition is most likely what people have in mind when they say 'A and B are independent means that B has no effect on A'.

2.2 Genetics

Here is a simplified version of how genes code eye colour, assuming only two colours of eyes. Each person has two genes for eye colour. Each gene is either B or b. If your genes are BB or Bb or bB, you have brown eyes; if your genes are bb, you have blue eyes. A child receives one gene from each of its parents. The gene it receives from its father is one of its father's two genes, each with probability 1/2, and similarly for its mother; and the genes received from father and mother are independent.
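Returning to the Trafalgar Square example for a moment, both the definition and Proposition 2.1's criterion can be checked numerically; exact fractions avoid rounding (a sketch, names mine):

```python
from fractions import Fraction

# Data from the example: P(yellow) = 3/100, P(blonde) = 1/5, P(yellow and blonde) = 1/50.
p_Y, p_B, p_YB = Fraction(3, 100), Fraction(1, 5), Fraction(1, 50)

p_B_given_Y = p_YB / p_Y            # definition of conditional probability
independent = (p_YB == p_Y * p_B)   # Proposition 2.1's criterion, via the definition
```

Since P(B | Y) = 2/3 is much larger than P(B) = 1/5, the two events are far from independent, which the check confirms.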

Example Suppose that John has brown eyes. So do both of John's parents. His sister has blue eyes. What is the probability that John's genes are BB?

Solution John's sister has genes bb, so one b must have come from each parent. Thus each of John's parents is Bb or bB; we may assume Bb. John gets his father's B gene with probability 1/2 and his mother's B gene with probability 1/2, and these are independent, so the probability that he gets BB is 1/4. Similarly for the other combinations: the possibilities for John are (writing the gene from his father first) BB, Bb, bB, bb, each with probability 1/4.

Let X be the event 'John has BB genes' and Y the event 'John has brown eyes'. Then X = {BB} and Y = {BB, Bb, bB}. The question asks us to calculate P(X | Y). This is given by

P(X | Y) = P(X∩Y)/P(Y) = (1/4)/(3/4) = 1/3.

2.3 The Theorem of Total Probability

Sometimes we are faced with a situation where we do not know the probability of an event B, but we know what its probability would be if we were sure that some other event had occurred.

Example An ice-cream seller has to decide whether to order more stock for the Bank Holiday weekend. He estimates that, if the weather is sunny, he has a 90% chance of selling all his stock; if it is cloudy, his chance is 60%; and if it rains, his chance is only 20%. According to the weather forecast, the probability of sunshine is 30%, the probability of cloud is 45%, and the probability of rain is 25%. (We assume that these are all the possible outcomes, so that their probabilities must add up to 100%.) What is the overall probability that the salesman will sell all his stock?

This problem is answered by the Theorem of Total Probability. First we need a definition. The events A1, A2, ..., An form a partition of the sample space if the following two conditions hold:
(a) the events are pairwise disjoint, that is, Ai∩Aj = ∅ for any pair of events Ai and Aj;
(b) A1∪A2∪···∪An = S.
Another way of saying the same thing is that every outcome in the sample space lies in exactly one of the events A1, A2, ..., An. The picture shows the idea of a partition.

[Figure: a rectangle divided into regions A1, A2, ..., An.]

Now we state and prove the Theorem of Total Probability.

Theorem 2.2 Let A1, A2, ..., An form a partition of the sample space with P(Ai) ≠ 0 for all i, and let B be any event. Then P(B) =

∑_{i=1}^{n} P(B | Ai)·P(Ai).

Proof By definition, P(B | Ai) = P(B∩Ai)/P(Ai). Multiplying up, we find that P(B∩Ai) = P(B | Ai)·P(Ai). Now consider the events B∩A1, B∩A2, ..., B∩An. These events are pairwise disjoint, for any outcome lying in both B∩Ai and B∩Aj would lie in both Ai and Aj, and by assumption there are no such outcomes. Moreover, the union of all these events is B, since every outcome lies in one of the Ai. So, by Axiom 3, we conclude that

∑_{i=1}^{n} P(B∩Ai) = P(B).

Substituting our expression for P(B∩Ai) gives the result.

[Figure: a partition A1, A2, ..., An of the sample space, with the event B meeting each part.]

Consider the ice-cream salesman at the start of this section. Let A1 be the event 'it is sunny', A2 the event 'it is cloudy', and A3 the event 'it is rainy'. Then A1, A2 and A3 form a partition of the sample space, and we are given that P(A1) = 0.3, P(A2) = 0.45, P(A3) = 0.25. Let B be the event 'the salesman sells all his stock'. The other information we are given is that P(B | A1) = 0.9, P(B | A2) = 0.6, P(B | A3) = 0.2. By the Theorem of Total Probability,

P(B) = (0.9×0.3) + (0.6×0.45) + (0.2×0.25) = 0.59.

You will now realise that the Theorem of Total Probability is really being used when you calculate probabilities by tree diagrams. It is better to get into the habit of using it directly, since it avoids any accidental assumptions of independence.

One special case of the Theorem of Total Probability is very commonly used, and is worth stating in its own right. For any event A, the events A and A′ form a partition of S. To say that both A and A′ have non-zero probability is just to say that P(A) ≠ 0, 1. Thus we have the following corollary:

Corollary 2.3 Let A and B be events, and suppose that P(A) ≠ 0, 1. Then P(B) = P(B | A)·P(A) + P(B | A′)·P(A′).
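The ice-cream calculation is exactly a weighted sum over the partition, which takes one line of code. A sketch (dictionary keys are my labels for the three weather events):

```python
# The three weather events partition the sample space.
weather = {"sunny": 0.3, "cloudy": 0.45, "rainy": 0.25}
sell_given = {"sunny": 0.9, "cloudy": 0.6, "rainy": 0.2}
assert abs(sum(weather.values()) - 1.0) < 1e-12  # a partition's probabilities sum to 1

# Theorem of Total Probability: P(B) = sum over the partition of P(B | Ai) * P(Ai).
p_sell = sum(sell_given[w] * weather[w] for w in weather)
```

Writing the sum this way, rather than reading probabilities off a tree diagram, keeps the partition assumption explicit.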

2.4 Sampling revisited
We can use the notion of conditional probability to treat sampling problems involving ordered samples. Example I have two red pens, one green pen, and one blue pen. I select two pens without replacement. (a) What is the probability that the first pen chosen is red? (b) What is the probability that the second pen chosen is red? For the first pen, there are four pens of which two are red, so the chance of

selecting a red pen is 2/4 = 1/2. For the second pen, we must separate cases. Let A1 be the event 'first pen red', A2 the event 'first pen green' and A3 the event 'first pen blue'. Then P(A1) = 1/2, P(A2) = P(A3) = 1/4 (arguing as above). Let B be the event 'second pen red'. If the first pen is red, then only one of the three remaining pens is red, so that P(B | A1) = 1/3. On the other hand, if the first pen is green or blue, then two of the remaining pens are red, so P(B | A2) = P(B | A3) = 2/3. By the Theorem of Total Probability,

P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + P(B | A3)P(A3)
     = (1/3)×(1/2) + (2/3)×(1/4) + (2/3)×(1/4) = 1/2.

We have reached by a roundabout argument a conclusion which you might think to be obvious. If we have no information about the first pen, then the second pen is equally likely to be any one of the four, and the probability should be 1/2, just as for the first pen. This argument happens to be correct. But, until your ability to distinguish between correct arguments and plausible-looking false ones is very well developed, you may be safer to stick to the calculation that we did. Beware of obvious-looking arguments in probability! Many clever people have been caught out.
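When in doubt, enumeration settles such arguments: list the twelve equally likely ordered draws and count. A sketch (pen labels are mine, following the text's two reds, one green, one blue):

```python
from fractions import Fraction
from itertools import permutations

# Two red pens, one green, one blue; draw two without replacement.
pens = ["R1", "R2", "G", "B"]
draws = list(permutations(pens, 2))  # 12 equally likely ordered outcomes

p_first_red = Fraction(sum(d[0][0] == "R" for d in draws), len(draws))
p_second_red = Fraction(sum(d[1][0] == "R" for d in draws), len(draws))
```

Both probabilities come out as 1/2, confirming the Total Probability calculation and the symmetry argument at once.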

2.5 Bayes' Theorem
There is a very big difference between P(A | B) and P(B | A). Suppose that a new test is developed to identify people who are liable to suffer from some genetic disease in later life. Of course, no test is perfect; there will be some carriers of the defective gene who test negative, and some non-carriers who test positive. So, for example, let A be the event 'the patient is a carrier', and B the event 'the test result is positive'. The scientists who develop the test are concerned with the probabilities that the test result is wrong, that is, with P(B | A′) and P(B′ | A). However, a patient who has taken the test has different concerns. If I tested positive, what is the chance that I have the disease? If I tested negative, how sure can I be that I am not a carrier? In other words, the patient is interested in P(A | B) and P(A′ | B′). These conditional probabilities are related by Bayes' Theorem:

Theorem 2.4 Let A and B be events with non-zero probability. Then

P(A | B) = P(B | A)·P(A) / P(B).

The proof is not hard. We have P(A | B)·P(B) = P(A∩B) = P(B | A)·P(A), using the definition of conditional probability twice. (Note that we need both A and B to have non-zero probability here.) Now divide this equation by P(B) to get the result.

If P(A) ≠ 0, 1 and P(B) ≠ 0, then we can use Corollary 2.3 to write this as

P(A | B) = P(B | A)·P(A) /

(P(B | A)·P(A) + P(B | A′)·P(A′)).

Bayes' Theorem is often stated in this form.

Example Consider the ice-cream salesman from Section 2.3. Given that he sold all his stock of ice-cream, what is the probability that the weather was sunny? (This question might be asked by the warehouse manager, who doesn't know what the weather was actually like.) Using the same notation that we used before, A1 is the event 'it is sunny' and B the event 'the salesman sells all his stock'. We are asked for P(A1 | B). We were given that P(B | A1) = 0.9 and that P(A1) = 0.3, and we calculated that P(B) = 0.59. So by Bayes' Theorem,

P(A1 | B) = P(B | A1)P(A1)/P(B) = (0.9×0.3)/0.59 = 0.46 to 2 d.p.

Example Consider the clinical test described at the start of this section. Suppose that 1 in 1000 of the population is a carrier of the disease. Suppose also that the probability that a carrier tests negative is 1%, while the probability that a non-carrier tests positive is 5%. (A test achieving these values would be regarded as very successful.) Let A be the event 'the patient is a carrier', and B the event 'the test result is positive'. We are given that P(A) = 0.001 (so that P(A′) = 0.999), and that P(B | A) = 0.99, P(B | A′) = 0.05.

(a) A patient has just had a positive test result. What is the probability that the patient is a carrier? The answer is

P(A | B) = P(B | A)P(A) / (P(B | A)P(A) + P(B | A′)P(A′))
         = (0.99×0.001) / ((0.99×0.001) + (0.05×0.999))
         = 0.00099/0.05094 = 0.0194.

(b) A patient has just had a negative test result. What is the probability that the patient is a carrier? The answer is

P(A | B′) = P(B′ | A)P(A) / (P(B′ | A)P(A) + P(B′ | A′)P(A′))
          = (0.01×0.001) / ((0.01×0.001) + (0.95×0.999))

= 0.00001/0.94906 = 0.00001 to 5 d.p.

So a patient with a negative test result can be reassured; but a patient with a positive test result still has less than 2% chance of being a carrier, so is likely to worry unnecessarily. Of course, these calculations assume that the patient has been selected at random from the population. If the patient has a family history of the disease, the calculations would be quite different.

Example 2% of the population have a certain blood disease in a serious form; 10% have it in a mild form; and 88% don't have it at all. A new blood test is developed; the probability of testing positive is 9/10 if the subject has the serious form, 6/10 if the subject has the mild form, and 1/10 if the subject doesn't have the disease. I have just tested positive. What is the probability that I have the serious form of the disease?

Let A1 be 'has disease in serious form', A2 be 'has disease in mild form', and A3 be 'doesn't have disease', and let B be 'test positive'. Then we are given that A1, A2, A3 form a partition and

P(A1) = 0.02, P(A2) = 0.1, P(A3) = 0.88,
P(B | A1) = 0.9, P(B | A2) = 0.6, P(B | A3) = 0.1.

Thus, by the Theorem of Total Probability, P(B) = 0.9×0.02 + 0.6×0.1 + 0.1×0.88 = 0.166, and then by Bayes' Theorem,

P(A1 | B) = P(B | A1)P(A1)/P(B) = (0.9×0.02)/0.166 = 0.108 to 3 d.p.

2.6 Iterated conditional probability

The conditional probability of C, given that both A and B have occurred, is just P(C | A∩B). Sometimes instead we just write P(C | A, B). It is given by P(C | A, B) = P(C∩A∩B)/P(A∩B), so (assuming that P(A∩B) ≠ 0) we have

P(A∩B∩C) = P(C | A, B)·P(A∩B).

Now we also have P(A∩B) = P(B | A)·P(A), so finally we have

P(A∩B∩C) = P(C | A, B)·P(B | A)·P(A).
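Both Bayes examples above follow the same two-step pattern, Total Probability then Bayes, which a short sketch makes explicit (variable names are mine):

```python
# Clinical test: P(carrier) = 0.001, P(positive | carrier) = 0.99,
# P(positive | non-carrier) = 0.05.
p_carrier = 0.001
p_pos = 0.99 * p_carrier + 0.05 * (1 - p_carrier)   # Theorem of Total Probability
p_carrier_given_pos = 0.99 * p_carrier / p_pos      # Bayes' Theorem

# Blood disease: priors 2% serious, 10% mild, 88% none;
# P(positive | form) = 0.9, 0.6, 0.1 respectively.
prior = {"serious": 0.02, "mild": 0.1, "none": 0.88}
like = {"serious": 0.9, "mild": 0.6, "none": 0.1}
p_pos_blood = sum(like[k] * prior[k] for k in prior)
p_serious_given_pos = like["serious"] * prior["serious"] / p_pos_blood
```

The striking feature in both cases is how small the posterior remains when the prior is small: a positive test moves 0.001 only up to about 0.019, and 0.02 up to about 0.108.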

This generalises to any number of events:

Proposition 2.5 Let A1, ..., An be events, and suppose that P(A1∩···∩An−1) ≠ 0. Then

P(A1∩A2∩···∩An) = P(An | A1, ..., An−1)···P(A2 | A1)·P(A1).

We apply this to the birthday paradox, which is the following statement: if there are 23 or more people in a room, then the chances are better than even that two of them have the same birthday. To simplify the analysis, we ignore 29 February, and assume that the other 365 days are all equally likely as birthdays of a random person. (This is not quite true, but not inaccurate enough to have much effect on the conclusion.)

Suppose that we have n people p1, p2, ..., pn. Let A2 be the event 'p2 has a different birthday from p1'. Then P(A2) = 1 − 1/365, since whatever p1's birthday is, there is a 1 in 365 chance that p2 will have the same birthday.

Let A3 be the event 'p3 has a different birthday from p1 and p2'. It is not straightforward to evaluate P(A3), since we have to consider whether p1 and p2 have the same birthday or not. (See below.) But we can calculate that P(A3 | A2) = 1 − 2/365, since if A2 occurs then p1 and p2 have birthdays on different days, and A3 will occur only if p3's birthday is on neither of these days. What is A2∩A3? It is simply the event that all three people have birthdays on different days. So

P(A2∩A3) = P(A2)·P(A3 | A2) = (1 − 1/365)(1 − 2/365).

Now this process extends. If Ai denotes the event 'pi's birthday is not on the same day as any of p1, ..., pi−1', then P(Ai | A1, ..., Ai−1) = 1 − (i−1)/365, and so by Proposition 2.5,

P(A1∩···∩Ai) = (1 − 1/365)(1 − 2/365)···(1 − (i−1)/365);

that is, this is the probability that all of the people p1, ..., pi have their birthdays on different days. Call this number qi. The numbers qi decrease, since at each step we multiply by a factor less than 1. So there will be some value of n such that qn−1 > 0.5 and qn ≤ 0.5; that is, n is the smallest number of people for which the probability that they all have different birthdays is less than 1/2. By calculation, we find that q22 = 0.5243 and q23 = 0.4927 (to 4 d.p.), so 23 people are enough for the probability of coincidence to be greater than 1/2.
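The numbers q22 and q23 quoted above are easy to reproduce; a sketch (function name is mine):

```python
def q(n, days=365):
    """Probability that n people all have different birthdays,
    all `days` days being equally likely."""
    out = 1.0
    for i in range(1, n):
        out *= 1 - i / days  # Proposition 2.5: multiply the conditional probabilities
    return out

q22, q23 = q(22), q(23)
print(round(q22, 4), round(q23, 4))  # 0.5243 0.4927
```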

Now return to a question we left open before: what is the probability of the event A3? (This is the event that p3 has a different birthday from both p1 and p2.) If p1 and p2 have different birthdays, the probability is 1 − 2/365: this is the calculation we already did. On the other hand, if p1 and p2 have the same birthday, then the probability is 1 − 1/365. These two numbers are P(A3 | A2) and P(A3 | A2′) respectively. So, by the Theorem of Total Probability,

P(A3) = P(A3 | A2)P(A2) + P(A3 | A2′)P(A2′)
      = (1 − 2/365)(1 − 1/365) + (1 − 1/365)·(1/365) = 0.9945 to 4 d.p.

A true story. Some years ago, in a probability class with only ten students, the lecturer started discussing the Birthday Paradox. He said to the class, 'I bet that no two people in the room have the same birthday'. He should have been on safe ground, since q11 = 0.859. (Remember that there are eleven people in the room!) However, a student in the back said 'I'll take the bet', and after a moment all the other students realised that the lecturer would certainly lose his wager. Why? (Answer in the next chapter.)

Problem How many people would you need to pick at random to ensure that the chance of two of them being born in the same month are better than even? Assuming all months equally likely, if Bi is the event that pi is born in a different month from any of p1, ..., pi−1, then as before we find that P(Bi | B1, ..., Bi−1) = 1 − (i−1)/12, so

P(B1∩···∩Bi) = (1 − 1/12)(1 − 2/12)···(1 − (i−1)/12).

We calculate that this probability is (11/12)×(10/12)×(9/12) = 0.5729 for i = 4 and (11/12)×(10/12)×(9/12)×(8/12) = 0.3819 for i = 5. So, with five people, it is more likely that two will have the same birth month.

2.7 Worked examples

Question Each person has two genes for cystic fibrosis. Each gene is either N or C. Each child receives one gene from each parent. If your genes are NN or NC or CN then you are normal; if they are CC then you have cystic fibrosis.

(a) Neither of Sally's parents has cystic fibrosis. Nor does she. However, Sally's sister Hannah does have cystic fibrosis. Find the probability that Sally has

at least one C gene (given that she does not have cystic fibrosis).

(b) In the general population the ratio of N genes to C genes is about 49 to 1. Harry does not have cystic fibrosis. Find the probability that he has at least one C gene (given that he does not have cystic fibrosis).

(c) Harry and Sally plan to have a child. Find the probability that the child will have cystic fibrosis (given that neither Harry nor Sally has it). You can assume that the two genes in a person are independent.

Solution During this solution, we will use a number of times the following principle. Let A and B be events with A ⊆ B. Then A∩B = A, and so

P(A | B) = P(A∩B)/P(B) = P(A)/P(B).

(a) This is the same as the eye colour example discussed earlier. Sally's sister has genes CC, and one gene must have come from each parent; since neither parent is CC, each parent is CN or NC. Now by the basic rules of genetics, all four combinations of genes for a child of these parents, namely CC, CN, NC, NN, have probability 1/4. If S1 is the event 'Sally has at least one C gene', then S1 = {CN, NC, CC}; and if S2 is the event 'Sally does not have cystic fibrosis', then S2 = {CN, NC, NN}. Then

P(S1 | S2) = P(S1∩S2)/P(S2) = (2/4)/(3/4) = 2/3.

(b) We know nothing specific about Harry, so we assume that his genes are randomly and independently selected from the population. We are given that the probability of a random gene being C or N is 1/50 and 49/50 respectively. Then the probabilities of Harry having genes CC, CN, NC, NN are respectively (1/50)², (1/50)·(49/50), (49/50)·(1/50), and (49/50)². So, if H1 is the event 'Harry has at least one C gene', and H2 is the event 'Harry does not have cystic fibrosis', then

P(H1 | H2) = P(H1∩H2)/P(H2) = ((49/2500)+(49/2500)) / ((49/2500)+(49/2500)+(2401/2500)) = 2/51.
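Parts (a) and (b) both condition on 'no disease' by restricting to the genotypes without CC; exact fractions keep the arithmetic honest (a sketch, names mine):

```python
from fractions import Fraction

# Harry's two genes are independent draws, with P(C) = 1/50 and P(N) = 49/50.
pC, pN = Fraction(1, 50), Fraction(49, 50)
geno = {"CC": pC * pC, "CN": pC * pN, "NC": pN * pC, "NN": pN * pN}

p_no_cf = geno["CN"] + geno["NC"] + geno["NN"]   # event H2
p_carrier_and_no_cf = geno["CN"] + geno["NC"]    # event H1 ∩ H2
p_H1_given_H2 = p_carrier_and_no_cf / p_no_cf

# Sally: four equally likely genotypes CC, CN, NC, NN, so
p_S1_given_S2 = Fraction(2, 4) / Fraction(3, 4)
```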

(c) Let X be the event that Harry's and Sally's child has cystic fibrosis. The child can have genes CC only if each parent has at least one C gene, so this can only occur if Harry and Sally both have CN or NC genes. Let S3 = S1∩S2 and H3 = H1∩H2. Then X ⊆ S3∩H3 ⊆ S2∩H2. We are asked to find P(X | S2∩H2). (Remember the principle that we started with!) Thus

P(X | S2∩H2) = P(X)/P(S2∩H2) = (P(X)/P(S3∩H3)) · (P(S3∩H3)/P(S2∩H2)).

Now if Harry and Sally both have CN or NC genes, these genes pass independently to the baby, so P(X | S3∩H3) = P(X)/P(S3∩H3) = 1/4. Also, Harry's and Sally's genes are independent, so P(S3∩H3) = P(S3)·P(H3) and P(S2∩H2) = P(S2)·P(H2). Thus

P(X | S2∩H2) = (1/4) · (P(S1∩S2)/P(S2)) · (P(H1∩H2)/P(H2))
             = (1/4) · P(S1 | S2) · P(H1 | H2)
             = (1/4) · (2/3) · (2/51) = 1/153.

Question The Land of Nod lies in the monsoon zone, and has just two seasons, Wet and Dry. The Wet season lasts for 1/3 of the year, and the Dry season for 2/3 of the year. During the Wet season, the probability that it is raining is 3/4; during the Dry season, the probability that it is raining is 1/6.

(a) I visit the capital city, Oneirabad, on a random day of the year. What is the probability that it is raining when I arrive?

(b) I visit Oneirabad on a random day, and it is raining when I arrive. Given this information, what is the probability that my visit is during the Wet season?

(c) I visit Oneirabad on a random day, and it is raining when I arrive. Given this information, what is the probability that it will be raining when I return to Oneirabad in a year's time? (You may assume that in a year's time the season will be the same as today but, given the season, whether or not it is raining is independent of today's weather.)

(I thank Eduardo Mendes for pointing out a mistake in my previous solution to this problem.)

Solution (a) Let W be the event 'it is the Wet season', D the event 'it is the Dry season', and R the event 'it is raining when I arrive'. We are given that P(W) = 1/3, P(D) = 2/3, P(R | W) = 3/4, P(R | D) = 1/6. By the Theorem of Total Probability,

P(R) = P(R | W)P(W) + P(R | D)P(D) = (3/4)·(1/3) + (1/6)·(2/3) = 13/36.

(b) By Bayes' Theorem,

P(W | R) = P(R | W)P(W)/P(R) = ((3/4)·(1/3))/(13/36) = 9/13.

(c) Let R′ be the event 'it is raining in a year's time'. The information we are given is that P(R ∩ R′ | W) = P(R | W)P(R′ | W), and similarly for D. Thus

P(R ∩ R′) = P(R ∩ R′ | W)P(W) + P(R ∩ R′ | D)P(D) = (3/4)²·(1/3) + (1/6)²·(2/3) = 89/432,
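The Total Probability and Bayes calculations above are easy to verify with exact arithmetic. A minimal Python check (illustrative only; the variable names are ours):

```python
from fractions import Fraction

P_W, P_D = Fraction(1, 3), Fraction(2, 3)      # Wet, Dry seasons
P_R_W, P_R_D = Fraction(3, 4), Fraction(1, 6)  # P(rain | season)

# (a) Theorem of Total Probability
P_R = P_R_W * P_W + P_R_D * P_D

# (b) Bayes' Theorem
P_W_given_R = P_R_W * P_W / P_R

# (c) rain today and in a year's time, independent given the season
P_R_and_R2 = P_R_W**2 * P_W + P_R_D**2 * P_D
P_R2_given_R = P_R_and_R2 / P_R
```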

and so

P(R′ | R) = P(R ∩ R′)/P(R) = (89/432)/(13/36) = 89/156.

Chapter 3 Random variables

In this chapter we define random variables and some related concepts such as probability mass function, expected value, variance, and median, and look at some particularly important types of random variables, including the binomial, Poisson, and normal.

3.1 What are random variables?

The Holy Roman Empire was, in the words of the historian Voltaire, 'neither holy, nor Roman, nor an empire'. Similarly, a random variable is neither random nor a variable:

A random variable is a function defined on a sample space.

The values of the function can be anything at all, but for us they will always be numbers. The standard abbreviation for 'random variable' is r.v.

Example I select at random a student from the class and measure his or her height in centimetres. Here the sample space is the set of students, and the random variable is 'height', which is a function from the set of students to the real numbers: h(S) is the height of student S in centimetres. (Remember that a function is nothing but a rule for associating with each element of its domain set an element of its target or range set. Here the domain set is the sample space S, the set of students in the class, and the target space is the set of real numbers.)

Example I throw a six-sided die twice; I am interested in the sum of the two numbers. Here the sample space is S = {(i, j) : 1 ≤ i, j ≤ 6}, and the random variable F is given by F(i, j) = i + j. The target set is the set {2, 3, ..., 12}.

The two random variables in the above examples are representatives of the two types of random variables that we will consider. These definitions are not quite precise, but more examples should make the idea clearer.

A random variable F is discrete if the values it can take are separated by gaps.

For example, F is discrete if it can take only finitely many values (as in the second example above, where the values are the integers from 2 to 12), or if the values of F are integers (for example, the number of nuclear decays which take place in a second in a sample of radioactive material: the number is an integer, but we can't easily put an upper limit on it).

A random variable is continuous if there are no gaps between its possible values. In the first example, the height of a student could in principle be any real number between certain extreme limits. A random variable whose values range over an interval of real numbers, or even over all real numbers, is continuous.

One could concoct random variables which are neither discrete nor continuous (e.g. the possible values could be 1, 2, 3, or any real number between 4 and 5), but we will not consider such random variables. We begin by considering discrete random variables.

3.2 Probability mass function

Let F be a discrete random variable. The most basic question we can ask is: given any value a in the target set of F, what is the probability that F takes the value a? In other words, if we consider the event

A = {x ∈ S : F(x) = a},

what is P(A)? (Remember that an event is a subset of the sample space.) Since events of this kind are so important, we simplify the notation: we write P(F = a) in place of P({x ∈ S : F(x) = a}).

The probability mass function of a discrete random variable F is the function, formula or table which gives the value of P(F = a) for each element a in the target set of F. If F takes only a few values, it is convenient to list it in a table; otherwise we should give a formula if possible. The standard abbreviation for 'probability mass function' is p.m.f.

(There is a fairly common convention in probability and statistics that random variables are denoted by capital letters and their values by lower-case letters. In fact, it is quite common to use the same letter in lower case for a value of the random variable; thus, we would write P(F = f) in the above example. But remember that this is only a convention, and you are not bound to it.)

Example I toss a fair coin three times. The random variable X gives the number of heads recorded. The possible values of X are 0, 1, 2, 3, and its p.m.f. is

a         0    1    2    3
P(X = a)  1/8  3/8  3/8  1/8

For the sample space is {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, and each outcome is equally likely. The event 'X = 1', when written as a set of outcomes, is equal to {HTT, THT, TTH}, and so has probability 3/8.

Two random variables X and Y are said to have the same distribution if the values they take and their probability mass functions are equal. We write X ∼ Y in this case.
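A p.m.f. like this one can be computed directly from the definition — a random variable is a function on the sample space — by enumerating the sample space. A small Python sketch of the coin example (illustrative, not from the notes):

```python
from fractions import Fraction
from itertools import product

# sample space: the 8 equally likely sequences of heads/tails
pmf = {}
for outcome in product('HT', repeat=3):
    x = outcome.count('H')          # X evaluated at this outcome
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 8)
```

The loop is literally the definition: for each value a, it accumulates the probabilities of all outcomes x with X(x) = a.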

In the above example, if Y is the number of tails recorded during the experiment, then X and Y have the same distribution, even though their actual values are different (indeed, Y = 3 − X).

3.3 Expected value and variance

Let X be a discrete random variable which takes the values a1, a2, ..., an. The expected value or mean of X is the number E(X) given by the formula

E(X) = Σ_{i=1}^{n} a_i P(X = a_i).

That is, we multiply each value of X by the probability that X takes that value, and sum these terms. The expected value is a kind of 'generalised average': if each of the values is equally likely, so that each has probability 1/n, then E(X) = (a1 + ··· + an)/n, which is just the average of the values.

There is an interpretation of the expected value in terms of mechanics. If we put a mass p_i on the axis at position a_i for i = 1, ..., n, where p_i = P(X = a_i), then the centre of mass of all these masses is at the point E(X).

If the random variable X takes infinitely many values, say a1, a2, a3, ..., then we define the expected value of X to be the infinite sum

E(X) = Σ_{i=1}^{∞} a_i P(X = a_i).

Of course, now we have to worry about whether this means anything, that is, whether this infinite series is convergent. This is a question which is discussed at great length in analysis; we won't worry about it too much. In the few examples we consider where there are infinitely many values, the series will usually be a geometric series or something similar, which we know how to sum. In the proofs below, we assume that the number of values is finite.

The variance of X is the number Var(X) given by

Var(X) = E(X²) − E(X)².

Here X² is just the random variable whose values are the squares of the values of X; thus

E(X²) = Σ_{i=1}^{n} a_i² P(X = a_i)

(or an infinite sum, if necessary). If E(X) is a kind of average of the values of X, then Var(X) is a measure of how spread out the values are around their average.

The next result gives another formula for the variance.

Proposition 3.1 Let X be a discrete random variable with E(X) = μ. Then

Var(X) = E((X − μ)²) = Σ_{i=1}^{n} (a_i − μ)² P(X = a_i).

Some people take the conclusion of this proposition as the definition of variance.

Proof We have

Σ_{i=1}^{n} (a_i − μ)² P(X = a_i) = Σ_{i=1}^{n} (a_i² − 2μa_i + μ²) P(X = a_i)
 = (Σ_{i=1}^{n} a_i² P(X = a_i)) − 2μ (Σ_{i=1}^{n} a_i P(X = a_i)) + μ² (Σ_{i=1}^{n} P(X = a_i)).

(What is happening here is that the entire sum consists of n rows with three terms in each row; we add it up by columns instead of by rows, getting three parts with n terms in each part.) The first part is E(X²), the second is E(X) = μ, and the third is 1. (Remember that E(X) = μ, and that Σ_{i=1}^{n} P(X = a_i) = 1, since the events X = a_i form a partition.) Continuing, we find

E((X − μ)²) = E(X²) − 2μE(X) + μ² = E(X²) − 2μ² + μ² = E(X²) − E(X)²,

and we are done.

Example I toss a fair coin three times; X is the number of heads. What are the expected value and variance of X?

E(X) = 0×(1/8) + 1×(3/8) + 2×(3/8) + 3×(1/8) = 3/2,
Var(X) = 0²×(1/8) + 1²×(3/8) + 2²×(3/8) + 3²×(1/8) − (3/2)² = 3/4.

If we calculate the variance using Proposition 3.1, we get

Var(X) = (−3/2)²×(1/8) + (−1/2)²×(3/8) + (1/2)²×(3/8) + (3/2)²×(1/8) = 3/4.
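The two ways of computing the variance — the definition E(X²) − E(X)² and the formula of Proposition 3.1 — can be checked against each other mechanically. A hedged Python sketch for the coin example (names ours):

```python
from fractions import Fraction

pmf = {0: Fraction(1, 8), 1: Fraction(3, 8),
       2: Fraction(3, 8), 3: Fraction(1, 8)}

mu = sum(a * pr for a, pr in pmf.items())                  # E(X)
E2 = sum(a**2 * pr for a, pr in pmf.items())               # E(X^2)

var_def = E2 - mu**2                                       # the definition
var_prop = sum((a - mu)**2 * pr for a, pr in pmf.items())  # Prop. 3.1
```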

Two properties of expected value and variance can be used as a check on your calculations:

• The expected value of X always lies between the smallest and largest values of X.

• The variance of X is never negative. (For the formula in Proposition 3.1 is a sum of terms, each of the form (a_i − μ)² (a square, hence non-negative) times P(X = a_i) (a probability, hence non-negative).)

3.4 Joint p.m.f. of two random variables

Let X be a random variable taking the values a1, ..., an, and let Y be a random variable taking the values b1, ..., bm. We say that X and Y are independent if, for any possible values i and j, the events X = a_i and Y = b_j are independent (as events). Here P(X = a_i, Y = b_j) means the probability of the event that X takes the value a_i and Y takes the value b_j. So we could re-state the definition as follows:

The random variables X and Y are independent if, for any value a_i of X and any value b_j of Y, we have

P(X = a_i, Y = b_j) = P(X = a_i) · P(Y = b_j).

Note the difference between 'independent events' and 'independent random variables'.

Example In Chapter 2, we saw the following: I have two red pens, one green pen, and one blue pen, and I select two pens without replacement. Then the events 'exactly one red pen selected' and 'exactly one green pen selected' turned out to be independent. Let X be the number of red pens selected, and Y the number of green pens selected. Then P(X = 1, Y = 1) = P(X = 1) · P(Y = 1). Are X and Y independent random variables?

No, because P(X = 2) = 1/6 and P(Y = 1) = 1/2, but P(X = 2, Y = 1) = 0 (it is impossible to have two red and one green in a sample of two).

On the other hand, if I roll a die twice, and X and Y are the numbers that come up on the first and second throws, then X and Y will be independent, even if the die is not fair (so that the outcomes are not all equally likely). If we have more than two random variables (for example X, Y, Z), we say that they are mutually independent if the events that the random variables take specific values (for example, X = a, Y = b, Z = c) are mutually independent. (You may want to revise the material on mutually independent events.)

If two random variables X and Y are not independent, then knowing the p.m.f. of each variable does not tell the whole story. The joint probability mass function (or joint p.m.f.) of X and Y is the table giving, for each value a_i of X and each value b_j of Y, the probability that X = a_i and Y = b_j. We arrange the table so that the rows correspond to the values of X and the columns to the values of Y.

Example I have two red pens, one green pen, and one blue pen, and I choose two pens without replacement. Let X be the number of red pens that I choose and Y the number of green pens. Then the joint p.m.f. of X and Y is given by the following table:

          Y
        0     1
X  0    0    1/6
   1   1/3   1/3
   2   1/6    0

Note that summing the entries in the row corresponding to the value a_i gives the probability that X = a_i; that is, the row sums form the p.m.f. of X. Similarly, the column sums form the p.m.f. of Y. (The row and column sums are sometimes called the marginal distributions, or marginals.) In particular, X and Y are independent r.v.s if and only if each entry of the table is equal to the product of its row sum and its column sum. Here the row and column sums give us the p.m.f.s for X and Y:

a         0    1    2        b         0    1
P(X = a)  1/6  2/3  1/6      P(Y = b)  1/2  1/2

What about the expected values of random variables? For expected value it is easy, but for variance it helps if the variables are independent:

Theorem 3.2 Let X and Y be random variables. Then

(a) E(X + Y) = E(X) + E(Y);

(b) if X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).

Now we give the proof of Theorem 3.2.
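The row-sum/column-sum and product criteria above are mechanical, so they lend themselves to a small sketch. The code below (names ours, not from the notes) encodes the pen example's joint table, extracts the marginals, and tests the product criterion; the two-dice table passes the same test:

```python
from fractions import Fraction

# joint p.m.f. of X (red pens) and Y (green pens) from the table
joint = {(0, 0): Fraction(0), (0, 1): Fraction(1, 6),
         (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 3),
         (2, 0): Fraction(1, 6), (2, 1): Fraction(0)}

xs = sorted({a for a, _ in joint})
ys = sorted({b for _, b in joint})
row = {a: sum(joint[a, b] for b in ys) for a in xs}   # p.m.f. of X
col = {b: sum(joint[a, b] for a in xs) for b in ys}   # p.m.f. of Y

def is_independent(joint, row, col):
    """Does each entry equal the product of its row and column sums?"""
    return all(joint[a, b] == row[a] * col[b] for a, b in joint)

# two throws of a fair die, by contrast, give an independent pair
die_joint = {(i, j): Fraction(1, 36)
             for i in range(1, 7) for j in range(1, 7)}
die_marg = {i: Fraction(1, 6) for i in range(1, 7)}
```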

(a) We consider the joint p.m.f. of X and Y. The random variable X + Y takes the values a_i + b_j for i = 1, ..., n and j = 1, ..., m. The probability that it takes a given value c_k is the sum of the probabilities P(X = a_i, Y = b_j) over all i and j such that a_i + b_j = c_k. Thus

E(X + Y) = Σ_k c_k P(X + Y = c_k)
 = Σ_{i=1}^{n} Σ_{j=1}^{m} (a_i + b_j) P(X = a_i, Y = b_j)
 = Σ_{i=1}^{n} a_i (Σ_{j=1}^{m} P(X = a_i, Y = b_j)) + Σ_{j=1}^{m} b_j (Σ_{i=1}^{n} P(X = a_i, Y = b_j)).

Now Σ_{j=1}^{m} P(X = a_i, Y = b_j) is a row sum of the joint p.m.f. table, so it is equal to P(X = a_i); similarly, Σ_{i=1}^{n} P(X = a_i, Y = b_j) is a column sum, and is equal to P(Y = b_j). So

E(X + Y) = Σ_{i=1}^{n} a_i P(X = a_i) + Σ_{j=1}^{m} b_j P(Y = b_j) = E(X) + E(Y).

(b) The variance is a bit trickier; here we have to make the assumption that X and Y are independent. First we calculate

E((X + Y)²) = E(X² + 2XY + Y²) = E(X²) + 2E(XY) + E(Y²),

using part (a) of the Theorem. We have to consider the term E(XY).

Since X and Y are independent, we have P(X = a_i, Y = b_j) = P(X = a_i) · P(Y = b_j). So

E(XY) = Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j P(X = a_i, Y = b_j)
 = Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j P(X = a_i) P(Y = b_j)
 = (Σ_{i=1}^{n} a_i P(X = a_i)) · (Σ_{j=1}^{m} b_j P(Y = b_j))
 = E(X) · E(Y).

So

Var(X + Y) = E((X + Y)²) − (E(X + Y))²
 = (E(X²) + 2E(XY) + E(Y²)) − (E(X)² + 2E(X)E(Y) + E(Y)²)
 = (E(X²) − E(X)²) + 2(E(XY) − E(X)E(Y)) + (E(Y²) − E(Y)²)
 = Var(X) + Var(Y).

To finish this section, we consider constant random variables. (If the thought of a 'constant random variable' worries you, remember that a random variable is not a variable at all but a function, and there is nothing amiss with a constant function.)

Proposition 3.3 Let C be a constant random variable with value c, and let X be any random variable. Then

(a) E(C) = c, Var(C) = 0;

(b) E(X + c) = E(X) + c, Var(X + c) = Var(X);

(c) E(cX) = cE(X), Var(cX) = c²Var(X).

Proof (a) The random variable C takes the single value c with P(C = c) = 1. So E(C) = c · 1 = c. Also, Var(C) = E(C²) − E(C)² = c² − c² = 0. (For C² is a constant random variable with value c².)

(b) This follows immediately from Theorem 3.2, once we observe that the constant random variable C and any random variable X are independent. (This is true because P(X = a, C = c) = P(X = a) · 1.) Then

E(X + c) = E(X) + E(C) = E(X) + c,   Var(X + c) = Var(X) + Var(C) = Var(X).
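Theorem 3.2 can be illustrated concretely by convolving two independent fair dice. This sketch (our construction, assuming fair dice and exact fractions) checks both parts:

```python
from fractions import Fraction
from itertools import product

die = {k: Fraction(1, 6) for k in range(1, 7)}

def E(pmf):
    return sum(a * pr for a, pr in pmf.items())

def Var(pmf):
    return sum(a**2 * pr for a, pr in pmf.items()) - E(pmf)**2

# joint p.m.f. of two independent throws, collapsed onto the sum
total = {}
for (a, pa), (b, pb) in product(die.items(), die.items()):
    total[a + b] = total.get(a + b, Fraction(0)) + pa * pb
```

The inner loop is exactly the argument in the proof: P(X + Y = c_k) is the sum of P(X = a_i)P(Y = b_j) over all pairs with a_i + b_j = c_k.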

(c) If a1, ..., an are the values of X, then ca1, ..., can are the values of cX, and P(cX = ca_i) = P(X = a_i). So

E(cX) = Σ_{i=1}^{n} ca_i P(cX = ca_i) = c Σ_{i=1}^{n} a_i P(X = a_i) = cE(X).

Then

Var(cX) = E(c²X²) − E(cX)² = c²E(X²) − (cE(X))² = c²(E(X²) − E(X)²) = c²Var(X).

3.5 Some discrete random variables

We now look at five types of discrete random variables, each depending on one or more parameters. We describe for each type the situations in which it arises, and give the p.m.f., the expected value, and the variance. If the variable is tabulated in the New Cambridge Statistical Tables, we give the table number, and some examples of using the tables. A summary of this information is given in Appendix B. You should have a copy of the tables to follow the examples.

Before we begin, a comment on the New Cambridge Statistical Tables. They don't give the probability mass function (or p.m.f.) of a discrete random variable, but a closely related function called the cumulative distribution function, or c.d.f. It is defined for a discrete random variable as follows. Let X be a random variable taking the values a1, a2, ..., an, arranged in ascending order: a1 < a2 < ··· < an. Then the c.d.f. of X is given by

F_X(a_i) = P(X ≤ a_i).

We see that it can be expressed in terms of the p.m.f. of X as follows:

F_X(a_i) = P(X = a1) + ··· + P(X = a_i) = Σ_{j=1}^{i} P(X = a_j).

In the other direction, we can recover the p.m.f. from the c.d.f.:

P(X = a_i) = F_X(a_i) − F_X(a_{i−1}).

We won't use the c.d.f. of a discrete random variable except for looking up the tables. It is much more important for continuous random variables!

Bernoulli random variable Bernoulli(p)

A Bernoulli random variable is the simplest type of all. It only takes two values, 0 and 1. So its p.m.f. looks as follows:
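The passage between p.m.f. and c.d.f. is just partial sums one way and differences the other. A brief illustrative sketch for the coin example (names ours):

```python
from fractions import Fraction
from itertools import accumulate

values = [0, 1, 2, 3]
probs = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

# c.d.f. as partial sums: F_X(a_i) = P(X = a_1) + ... + P(X = a_i)
cdf = list(accumulate(probs))

# recovering the p.m.f.: P(X = a_i) = F_X(a_i) - F_X(a_{i-1})
recovered = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
```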

x         0    1
P(X = x)  q    p

Here, p is the probability that X = 1; it can be any number between 0 and 1. Necessarily q, the probability that X = 0, is equal to 1 − p. So p determines everything.

For a Bernoulli random variable, we sometimes describe the experiment as a 'trial', the event X = 1 as 'success', and the event X = 0 as 'failure'. For example, if a biased coin has probability p of coming down heads, then the number of heads that we get when we toss the coin once is a Bernoulli(p) random variable.

More generally, let A be any event in a probability space S. With A, we associate a random variable I_A (remember that a random variable is just a function on S) by the rule

I_A(s) = 1 if s ∈ A, 0 if s ∉ A.

The random variable I_A is called the indicator variable of A, because its value indicates whether or not A occurred. It is a Bernoulli(p) random variable, where p = P(A). (The event I_A = 1 is just the event A.) Some people write 1_A instead of I_A.

Calculation of the expected value and variance of a Bernoulli random variable is easy. Let X ∼ Bernoulli(p). (Remember that ∼ means 'has the same p.m.f. as'.) Then

E(X) = 0 · q + 1 · p = p,
Var(X) = 0² · q + 1² · p − p² = p − p² = pq.

Binomial random variable Bin(n, p)

Remember that for a Bernoulli random variable we describe the event X = 1 as a 'success'. A binomial random variable counts the number of successes in n independent trials, each associated with a Bernoulli(p) random variable. For example, suppose that we have a biased coin for which the probability of heads is p. We toss the coin n times and count the number of heads obtained. This number is a Bin(n, p) random variable.

A Bin(n, p) random variable X takes the values 0, 1, 2, ..., n, and the p.m.f. of X is given by

P(X = k) = nCk q^{n−k} p^k   for k = 0, 1, 2, ..., n,

where q = 1 − p. This is because there are nCk different ways of obtaining k heads in a sequence of n throws (the number of choices of the k positions in which the heads occur), and the probability of getting k heads and n − k tails in a particular order is q^{n−k} p^k.

Note that we have given a formula rather than a table here. For small values of n we could tabulate the results; for example, for Bin(4, p):

k         0    1     2      3     4
P(X = k)  q⁴  4q³p  6q²p²  4qp³  p⁴

Note: when we add up all the probabilities in the table, we get

Σ_{k=0}^{n} nCk q^{n−k} p^k = (q + p)^n = 1,

as it should be: here we used the binomial theorem

(x + y)^n = Σ_{k=0}^{n} nCk x^{n−k} y^k.

(This argument explains the name of the binomial random variable!)

If X ∼ Bin(n, p), then E(X) = np and Var(X) = npq. There are two ways to prove this, an easy way and a harder way. The easy way only works for the binomial, but the harder way is useful for many random variables.

Here is the easy method. We have a coin with probability p of coming down heads, and we toss it n times and count the number X of heads; then X is our Bin(n, p) random variable, which takes values between 0 and n. Let X_k be the random variable defined by

X_k = 1 if we get heads on the kth toss, 0 if we get tails on the kth toss.

In other words, X_k is the indicator variable of the event 'heads on the kth toss'. Now we have

X = X1 + X2 + ··· + Xn

(can you see why?). The variables X1, ..., Xn are independent Bernoulli(p) random variables (since they are defined by different tosses of a coin), so E(X_i) = p and Var(X_i) = pq, as we saw earlier. Then, by Theorem 3.2 (using the fact that the variables are independent),

E(X) = p + p + ··· + p = np,
Var(X) = pq + pq + ··· + pq = npq.

The other method uses a gadget called the probability generating function. We only use it here for calculating expected values and variances, but if you learn more probability theory you will see other uses for it. You can skip it if you wish: I have set it in smaller type for this reason.

Let X be a random variable whose values are non-negative integers. (We don't insist that it takes all possible values.) To save space, we write p_k for the probability P(X = k). The probability generating function of X is the power series

G_X(x) = Σ p_k x^k.

(The sum is over all values k taken by X.) We use the notation [F(x)]_{x=1} for the result of substituting x = 1 in the series F(x).

Proposition 3.4 Let G_X(x) be the probability generating function of a random variable X. Then

(a) [G_X(x)]_{x=1} = 1;

(b) E(X) = [d/dx G_X(x)]_{x=1};

(c) Var(X) = [d²/dx² G_X(x)]_{x=1} + E(X) − E(X)².

Part (a) is just the statement that probabilities add up to 1: when we substitute x = 1 in the power series for G_X(x), we just get Σ p_k.

For part (b), when we differentiate the series term-by-term (you will learn later in Analysis
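For a small n, both the p.m.f. formula and the 'easy method' can be verified by brute force, since the sample space has only 2^n head/tail sequences. An illustrative Python check (our names; n and p chosen arbitrarily):

```python
from fractions import Fraction
from itertools import product
from math import comb

def binom_pmf(n, p, k):
    q = 1 - p
    return comb(n, k) * q**(n - k) * p**k

n, p = 4, Fraction(1, 3)

# the p.m.f. sums to 1 -- the binomial theorem with x = q, y = p
total = sum(binom_pmf(n, p, k) for k in range(n + 1))

# brute force over all 2^n head/tail sequences (1 = heads),
# i.e. X = X_1 + ... + X_n written out in full
E = E2 = Fraction(0)
for seq in product([0, 1], repeat=n):
    h = sum(seq)
    pr = p**h * (1 - p)**(n - h)
    E += h * pr
    E2 += h**2 * pr
Var = E2 - E**2
```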

that this is OK), we get

d/dx G_X(x) = Σ k p_k x^{k−1}.

Putting x = 1 in this series, we get Σ k p_k = E(X).

For part (c), differentiating twice gives

d²/dx² G_X(x) = Σ k(k−1) p_k x^{k−2}.

Putting x = 1 in this series, we get

Σ k(k−1) p_k = Σ k² p_k − Σ k p_k = E(X²) − E(X).

Now adding E(X) − E(X)² gives E(X²) − E(X)², which by definition is Var(X).

Now let us apply this to the binomial random variable X ∼ Bin(n, p). We have p_k = P(X = k) = nCk q^{n−k} p^k, so the probability generating function is

G_X(x) = Σ_{k=0}^{n} nCk q^{n−k} (px)^k = (q + px)^n,

by the Binomial Theorem. Differentiating once, using the Chain Rule, we get

d/dx G_X(x) = np(q + px)^{n−1};

putting x = 1, we find that E(X) = np(q + p)^{n−1} = np. Differentiating again, we get n(n−1)p²(q + px)^{n−2}; putting x = 1 gives n(n−1)p². So

Var(X) = n(n−1)p² + np − n²p² = np − np² = npq,

as before.

The binomial random variable is tabulated in Table 1 of the New Cambridge Statistical Tables [1]. As explained earlier, the tables give the cumulative distribution function. For example, suppose that the probability that a certain coin comes down heads is 0.45. If the coin is tossed 15 times, what is the probability of five or fewer heads? Turning to the page n = 15 in Table 1 and looking at the row 0.45, you read off the answer 0.2608. What is the probability of exactly five heads? This is

P(5 or fewer) − P(4 or fewer) = 0.2608 − 0.1204 = 0.1404.

The tables only go up to p = 0.5. For larger values of p, use the fact that the number of failures in Bin(n, p) is equal to the number of successes in Bin(n, 1 − p). So the probability of five heads in 15 tosses of a coin with p = 0.55 is the probability of ten tails, namely 0.9745 − 0.9231 = 0.0514.

Another interpretation of the binomial random variable concerns sampling. Suppose that we have N balls in a box, of which M are red. We sample n balls from the box with replacement, and let the random variable X be the number of red balls in the sample. Since each ball has probability M/N of being red, and different choices are independent, X ∼ Bin(n, p), where p = M/N is the proportion of red balls in the box.

What about sampling without replacement? This leads us to our next random variable.

Hypergeometric random variable Hg(n, M, N)

Suppose that we have N balls in a box, of which M are red. We sample n balls from the box without replacement, and let the random variable X be the number of
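The table lookups above rest on two identities: P(X = k) = F(k) − F(k−1), and the success/failure symmetry between Bin(n, p) and Bin(n, 1 − p). The sketch below recomputes the quoted values from the p.m.f. directly (we assume the tables are rounded to four decimal places; names ours):

```python
from fractions import Fraction
from math import comb

def pmf(n, p, k):
    """P(X = k) for X ~ Bin(n, p), exact."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(n, p, k):
    """P(X <= k), the quantity the tables list."""
    return sum(pmf(n, p, j) for j in range(k + 1))

n, p = 15, Fraction(45, 100)

# 'exactly five' = '5 or fewer' minus '4 or fewer'
exact5 = cdf(n, p, 5) - cdf(n, p, 4)

# p = 0.55 is beyond the tables: 5 heads there = 10 'failures' at p = 0.45
heads5_p55 = pmf(n, Fraction(55, 100), 5)
via_tables = cdf(n, p, 10) - cdf(n, p, 9)
```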

red balls in the sample. What is the distribution of X?

The number of possible samples of n balls from N is NCn, and all choices are equally likely. The number of ways of choosing k of the M red balls and n − k of the N − M others is MCk · (N−M)C(n−k). Thus the p.m.f. of X is given by the formula

P(X = k) = (MCk · (N−M)C(n−k)) / NCn.

Such an X is called a hypergeometric random variable Hg(n, M, N).

The expected value and variance of a hypergeometric random variable are as follows (we won't go into the proofs):

E(X) = n(M/N),
Var(X) = n (M/N) ((N − M)/N) ((N − n)/(N − 1)).

If we let p = M/N be the proportion of red balls in the box, then E(X) = np, and Var(X) is equal to npq multiplied by a 'correction factor' (N − n)/(N − 1). You should compare these to the values for a binomial random variable.

Example Consider our example of choosing two pens from four, where two pens are red, one green, and one blue. The number X of red pens is a Hg(2, 2, 4) random variable. We calculated earlier that P(X = 0) = 1/6, P(X = 1) = 2/3 and P(X = 2) = 1/6. From this we find by direct calculation that E(X) = 1 and Var(X) = 1/3. These agree with the formulae above.

Thus, if the numbers M and N − M of red and non-red balls in the box are both very large compared to the size n of the sample, then the difference between sampling with and without replacement is very small, and indeed the 'correction factor' (N − n)/(N − 1) is close to 1. So we can say that Hg(n, M, N) is approximately Bin(n, M/N) if n is small compared to M and N − M.

Geometric random variable Geom(p)

The geometric random variable is like the binomial but with a different stopping rule. We have again a coin whose probability of heads is p. Now, instead of tossing it a fixed number of times and counting the heads, we toss it until it comes down heads for the first time, and count the number of times we have tossed the coin. Thus the values of the variable are the positive integers 1, 2, 3, .... (In theory we might never get a head and toss the coin infinitely often; but, as we will see, if p > 0 this possibility is 'infinitely unlikely', i.e. has probability zero.) We always assume that 0 < p < 1.

More generally, the number of independent Bernoulli trials required until the first success is obtained is a geometric random variable.

The p.m.f. of a Geom(p) random variable X is given by

P(X = k) = q^{k−1} p   for k = 1, 2, 3, ...,

where q = 1 − p.
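The Hg(2, 2, 4) example is small enough to recompute from the p.m.f. formula and compare against the stated closed forms, correction factor included. A sketch (names ours):

```python
from fractions import Fraction
from math import comb

def hg_pmf(n, M, N, k):
    """P(X = k) for X ~ Hg(n, M, N)."""
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))

n, M, N = 2, 2, 4
pmf = {k: hg_pmf(n, M, N, k) for k in range(n + 1)}

E = sum(k * pr for k, pr in pmf.items())
Var = sum(k**2 * pr for k, pr in pmf.items()) - E**2

# the stated closed forms, with p = M/N, q = 1 - p
p = Fraction(M, N)
q = 1 - p
E_formula = n * p
Var_formula = n * p * q * Fraction(N - n, N - 1)
```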

For the event X = k means that we get tails on the first k − 1 tosses and heads on the kth, and this event has probability q^{k−1}p, since 'tails' has probability q and different tosses are independent.

Let's add up these probabilities:

Σ_{k=1}^{∞} q^{k−1} p = p + qp + q²p + ··· = p/(1 − q) = 1,

since the series is a geometric progression with first term p and common ratio q, where q < 1. (Just as the binomial theorem shows that probabilities sum to 1 for a binomial random variable, so the geometric progression does for the geometric random variable, and gives its name to the random variable.)

If X ∼ Geom(p), then E(X) = 1/p and Var(X) = q/p². We calculate the expected value and the variance using the probability generating function. We have

G_X(x) = Σ_{k=1}^{∞} q^{k−1} p x^k = px/(1 − qx),

again by summing a geometric progression. Differentiating, we get

d/dx G_X(x) = ((1 − qx)p + pxq)/(1 − qx)² = p/(1 − qx)².

Putting x = 1, we obtain E(X) = p/(1 − q)² = 1/p. Differentiating again gives 2pq/(1 − qx)³, which is 2pq/p³ at x = 1, so

Var(X) = 2pq/p³ + 1/p − 1/p² = q/p².

For example, if we toss a fair coin until heads is obtained, the expected number of tosses until the first head is 2 (so the expected number of tails is 1), and the variance of this number is also 2.
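Because the tail q^k dies off geometrically, the infinite sums above can be checked numerically by truncation. An illustrative sketch (the parameter p = 0.3 and cut-off K are our choices):

```python
p = 0.3          # any 0 < p < 1 works
q = 1 - p
K = 2000         # truncation point; q**K is negligible

ks = range(1, K + 1)
probs = [q**(k - 1) * p for k in ks]

total = sum(probs)
E = sum(k * pr for k, pr in zip(ks, probs))
E2 = sum(k**2 * pr for k, pr in zip(ks, probs))
Var = E2 - E**2
```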

Poisson random variable Poisson(λ)

The Poisson random variable, unlike the ones we have seen before, is very closely connected with continuous things. Suppose that 'incidents' occur at random times, but at a steady rate λ overall. The Poisson random variable X counts the number of incidents which occur in a given interval. The best example is radioactive decay: atomic nuclei decay randomly, but the average number λ which will decay in a given interval is constant. So if, on average, there are 2.4 nuclear decays per second, then the number of decays in one second starting now is a Poisson(2.4) random variable. Another example might be the number of telephone calls a minute to a busy telephone number. Or, if you speak a little French, you might use as a mnemonic the fact that, if I go fishing and the fish are biting at the rate of λ per hour on average, then the number of fish I will catch in the next hour is a Poisson(λ) random variable.

The p.m.f. for a Poisson(λ) variable X is given by the formula

P(X = k) = (λ^k / k!) e^{−λ}   for k = 0, 1, 2, ....

Let's check that these probabilities add up to one:

Σ_{k=0}^{∞} (λ^k / k!) e^{−λ} = e^{λ} · e^{−λ} = 1,

since Σ_{k=0}^{∞} λ^k / k! is the sum of the exponential series.

The expected value and variance of a Poisson(λ) random variable X are given by

E(X) = Var(X) = λ.

Again we use the probability generating function. If X ∼ Poisson(λ), then

G_X(x) = Σ_{k=0}^{∞} ((λx)^k / k!) e^{−λ} = e^{λ(x−1)},

again using the series for the exponential function. Differentiation gives λe^{λ(x−1)}, so E(X) = λ. Differentiating again gives λ²e^{λ(x−1)}, so Var(X) = λ² + λ − λ² = λ.

By analogy with what happened for the binomial and geometric random variables, you might have expected that this random variable would be called 'exponential'. Unfortunately,
this name has been given to a closely-related continuous random variable which we will meet later.

The cumulative distribution function of a Poisson random variable is tabulated in Table 2 of the New Cambridge Statistical Tables. So, for example, if 2.4 fish bite per hour on average, then the probability that I will catch no fish in the next hour is 0.0907, while the probability that I catch five or fewer is 0.9643 (so that the probability that I catch six or more is 0.0357).

There is another situation in which the Poisson distribution arises. Suppose I am looking for some very rare event which only occurs once in 1000 trials on average. So I conduct 1000 independent trials. How many occurrences of the event do I see? This number is really a binomial random variable Bin(1000, 1/1000). But it turns out to be, to a very good approximation, Poisson(1). So, for example, the probability that the event doesn't occur is about 1/e. The general rule is:

If n is large, p is small, and np = λ, then Bin(n, p) can be approximated by Poisson(λ).

3.6 Continuous random variables

We haven't so far really explained what a continuous random variable is. Now let X be a continuous random variable. Its target set is the set of real numbers, or perhaps the non-negative real numbers, or just an interval. The crucial property is that, for any real number a, we have P(X = a) = 0; that is, the probability that the height of a random student, or the time I have to wait for a bus, is precisely a, is zero. So we can't use the probability mass function for continuous random variables; it would always be zero and give no information. We use instead the cumulative distribution function, or c.d.f. Remember from last week that the c.d.f. of the random variable X is the function F_X defined by

F_X(x) = P(X ≤ x).

Note: The name of the function is F_X; the lower-case x refers to the argument of the function, the number which is substituted into the function. It is common but not universal to use as the argument the lower-case version of the name of the random variable. Note that F_X(y) is the same function written in terms of the variable y instead of x, whereas F_Y(x) is the c.d.f. of the random variable Y (which might be a completely different function).

For a continuous random variable X, there is no difference between P(X ≤ x) and P(X < x), since the probability that X takes the precise value x is zero.

The function F_X is an increasing function (this means that F_X(x) ≤ F_X(y) if x < y), and approaches the limits 0 as x → −∞ and 1 as x → ∞. The function is increasing because, if x < y, then

F_X(y) − F_X(x) = P(X ≤ y) − P(X ≤ x) = P(x < X ≤ y) ≥ 0.

Also F_X(∞) = 1 because X must certainly take some finite value, and F_X(−∞) = 0 because no value is smaller than −∞!

Another important function is the probability density function f_X. It is obtained by differentiating the c.d.f.:

f_X(x) = (d/dx) F_X(x).
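The Poisson approximation rule stated above is easy to check numerically. The sketch below is mine, not part of the notes (the function names are my own); it compares the Bin(1000, 1/1000) p.m.f. with Poisson(1) for small k.

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    # P(X = k) for X ~ Bin(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poisson(lam)
    return exp(-lam) * lam**k / factorial(k)

# Rare event: 1000 trials, success probability 1/1000, so lam = np = 1.
for k in range(4):
    print(k, round(binom_pmf(1000, 1/1000, k), 5), round(poisson_pmf(1.0, k), 5))

# In particular, P(no occurrence) is close to 1/e.
print(abs(binom_pmf(1000, 1/1000, 0) - exp(-1)))
```

The two columns printed agree to three decimal places, which is the sense in which Bin(n, p) ≈ Poisson(np) for large n and small p.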

The crucial property of f_X is that, although the probability of getting exactly the value x is zero, the probability of being close to x is proportional to f_X(x): the probability that the value of X lies in a very small interval from x to x+h is approximately f_X(x) · h.

Now f_X(x) is non-negative, since it is the derivative of an increasing function. If we know f_X, then F_X is obtained by integrating: because F_X(−∞) = 0, we have

F_X(x) = ∫ from −∞ to x of f_X(t) dt.

Note the use of the "dummy variable" t in this integral. Note also that

P(a ≤ X ≤ b) = F_X(b) − F_X(a) = ∫ from a to b of f_X(t) dt.

The support of a continuous random variable is the smallest interval containing all values of x where f_X(x) > 0.

Most facts about continuous random variables are obtained by replacing the p.m.f. by the p.d.f., and replacing sums by integrals. Thus, the expected value of X is given by

E(X) = ∫ from −∞ to ∞ of x f_X(x) dx,

and the variance is (as before) Var(X) = E(X²) − E(X)², where

E(X²) = ∫ from −∞ to ∞ of x² f_X(x) dx.

It is also true that Var(X) = E((X − μ)²), where μ = E(X).

There is a mechanical analogy which you may find helpful. Remember that we modelled a discrete random variable X by placing at each value a of X a mass equal to P(X = a). Then the total mass is one, and the expected value of X is the centre of mass. For a continuous random variable, imagine instead a wire of variable thickness, so that the density of the wire (mass per unit length) at the point x is equal to f_X(x). Then again the total mass is one, the mass to the left of x is F_X(x), and again it will hold that the centre of mass is at E(X).

We will see examples of these calculations shortly. But here is a small example to show the ideas. Suppose that the random variable X has p.d.f. given by

f_X(x) = 2x if 0 ≤ x ≤ 1, and 0 otherwise.

The support of X is the interval [0,1]. We check the integral:

∫ from −∞ to ∞ of f_X(x) dx = ∫ from 0 to 1 of 2x dx = [x²] from x=0 to x=1 = 1.
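The integrals in this example can be checked mechanically. The sketch below is not part of the notes; it approximates them with a simple midpoint rule.

```python
# Numerical check of the example p.d.f. f(x) = 2x on [0, 1].

def integrate(f, a, b, n=100000):
    # midpoint-rule approximation of the integral of f from a to b
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 2 * x
total = integrate(f, 0, 1)                     # should be 1
mean = integrate(lambda x: x * f(x), 0, 1)     # should be 2/3
ex2 = integrate(lambda x: x * x * f(x), 0, 1)  # should be 1/2
var = ex2 - mean**2                            # should be 1/18
print(total, mean, var)
```

The same pattern (replace the p.m.f. sum by an integral) works for any continuous density.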

(Study this carefully to see how it works.) We have

E(X) = ∫ from −∞ to ∞ of x f_X(x) dx = ∫ from 0 to 1 of 2x² dx = 2/3,
E(X²) = ∫ from −∞ to ∞ of x² f_X(x) dx = ∫ from 0 to 1 of 2x³ dx = 1/2,
Var(X) = 1/2 − (2/3)² = 1/18.

The cumulative distribution function of X is

F_X(x) = ∫ from −∞ to x of f_X(t) dt = 0 if x < 0, x² if 0 ≤ x ≤ 1, 1 if x > 1.

3.7 Median, quartiles, percentiles

Another measure commonly used for continuous random variables is the median; this is the value m such that "half of the distribution lies to the left of m and half to the right". More formally, m should satisfy F_X(m) = 1/2. It is not the same as the mean or expected value. The lower quartile l and the upper quartile u are similarly defined by F_X(l) = 1/4 and F_X(u) = 3/4. Thus, the probability that X lies between l and u is 3/4 − 1/4 = 1/2, so the quartiles give an estimate of how spread-out the distribution is. More generally, we define the nth percentile of X to be the value x_n such that F_X(x_n) = n/100; that is, the probability that X is smaller than x_n is n%.

If there is a value m such that the graph of y = f_X(x) is symmetric about x = m, then both the expected value and the median of X are equal to m.

In the example at the end of the last section, we saw that E(X) = 2/3. Since F_X(x) = x² for 0 ≤ x ≤ 1, the median satisfies m² = 1/2, so we see that m = 1/√2.
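For this example the c.d.f. F(x) = x² can be inverted exactly, so the median and quartiles fall out directly. The sketch below is mine, not from the notes (the helper name `percentile` is my own).

```python
from math import sqrt

def percentile(q):
    # value x_q with F(x_q) = q, for F(x) = x**2 on [0, 1]
    return sqrt(q)

median = percentile(0.5)    # 1/sqrt(2), about 0.7071
lower_q = percentile(0.25)  # 0.5
upper_q = percentile(0.75)  # sqrt(3)/2, about 0.8660
print(median, lower_q, upper_q)
```

Note that the median 1/√2 differs from the mean 2/3, as the text warns.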

Reminder: If the c.d.f. of the random variable X is F_X and the p.d.f. is f_X, then:

• differentiate F_X to get f_X, and integrate f_X to get F_X;
• use F_X to calculate P(a ≤ X ≤ b) (this is F_X(b) − F_X(a));
• use f_X to calculate E(X) and Var(X), and the median and percentiles of X.

3.8 Some continuous random variables

In this section we examine three important continuous random variables: the uniform, exponential, and normal.

Uniform random variable U(a,b)

Let a and b be real numbers with a < b. A uniform random variable on the interval [a,b] is, roughly speaking, "equally likely to be anywhere in the interval". In other words, its probability density function is constant on the interval [a,b] (and zero outside the interval). What should the constant value c be? The integral of the p.d.f. is the area of a rectangle of height c and base b − a; this must be 1, so c = 1/(b−a). Thus, the p.d.f. of the random variable X ~ U(a,b) is given by

f_X(x) = 1/(b−a) if a ≤ x ≤ b, and 0 otherwise.

By integration, we find that the c.d.f. is

F_X(x) = 0 if x < a, (x−a)/(b−a) if a ≤ x ≤ b, 1 if x > b.

Further calculation (or the symmetry of the p.d.f.) shows that the expected value and the median of X are both given by (a+b)/2 (the midpoint of the interval), while Var(X) = (b−a)²/12.

The uniform random variable doesn't really arise in practical situations. However, it is very useful for simulations. Most computer systems include a random number generator, which apparently produces independent values of a uniform random variable on the interval [0,1]. Of course, since the computer is a deterministic machine, they are not really random; but there should be no obvious pattern to the numbers produced, and in a large number of trials they should be distributed uniformly over the interval. You will learn in the Statistics course how to use a uniform random variable to construct values of other types of discrete or continuous random variables. Its great simplicity makes it the best choice for this purpose. The details are summarised in Appendix B.

Exponential random variable Exp(λ)

The exponential random variable arises in the same situation as the Poisson: be careful not to confuse them! We have events which occur randomly but at a constant average rate of λ per unit time (e.g. radioactive decays, fish biting). The Poisson random variable, which is discrete, counts how many events will occur in the next unit of time. The exponential random variable, which is continuous, measures exactly how long from now it is until the next event occurs. Note that it takes non-negative real numbers as values.
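Before moving on, the claims about U(a,b) can be checked empirically with exactly the kind of random number generator just described. This sketch is mine, not part of the notes.

```python
import random

# Empirical check that U(a, b) has mean (a+b)/2 and variance (b-a)^2/12.
random.seed(1)
a, b, n = 2.0, 5.0, 200000
xs = [random.uniform(a, b) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(mean, var)   # should be close to 3.5 and 0.75
```

With 200000 samples the sample mean and variance land within a few hundredths of the theoretical values (a+b)/2 = 3.5 and (b−a)²/12 = 0.75.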

If X ~ Exp(λ), then the p.d.f. of X is

f_X(x) = 0 if x < 0, λe^(−λx) if x ≥ 0.

By integration, we find the c.d.f. to be

F_X(x) = 0 if x < 0, 1 − e^(−λx) if x ≥ 0.

Further calculation gives E(X) = 1/λ and Var(X) = 1/λ². The median m satisfies 1 − e^(−λm) = 1/2, so that m = (log 2)/λ. (The logarithm is to base e, so that log 2 = 0.69314718056 approximately.)

Normal random variable N(μ, σ²)

The normal random variable is the commonest of all in applications, and the most important. There is a theorem called the central limit theorem which says that, for virtually any random variable X which is not too bizarre, if you take the sum (or the average) of n independent random variables with the same distribution as X, the result will be approximately normal, and will become more and more like a normal variable as n grows. This partly explains why a random variable affected by many independent factors, like a man's height, has an approximately normal distribution.

More precisely, if n is large, then a Bin(n, p) random variable is well approximated by a normal random variable with the same expected value np and the same variance npq. (If you are approximating any discrete random variable by a continuous one, you should make a "continuity correction" – see the next section for details and an example.)

The p.d.f. of the random variable X ~ N(μ, σ²) is given by the formula

f_X(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).

[Figure: the graph of this function for μ = 0, the familiar "bell-shaped curve".]

We have E(X) = μ and Var(X) = σ².
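The exponential facts above can be verified by simulation, using the inverse of the c.d.f. to turn uniform random numbers into Exp(λ) values (this is the construction method mentioned earlier for the uniform random variable). The sketch is mine, not part of the notes.

```python
import random
from math import log

# Sample Exp(lam) by inverting F(x) = 1 - exp(-lam*x):
# if U ~ U(0,1), then X = -log(1-U)/lam has c.d.f. F.
random.seed(1)
lam, n = 2.4, 200000
xs = [-log(1 - random.random()) / lam for _ in range(n)]

mean = sum(xs) / n     # theory: 1/lam
xs.sort()
median = xs[n // 2]    # theory: log(2)/lam
print(mean, 1/lam)
print(median, log(2)/lam)
```

Both the sample mean and the sample median come out within about 0.01 of the theoretical values 1/λ and (log 2)/λ.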

The c.d.f. of X is obtained as usual by integrating the p.d.f. However, it is not possible to write the integral of this function (which, stripped of its constants, is e^(−x²)) in terms of "standard" functions. So there is no alternative but to make tables of its values. The crucial fact that means that we don't have to tabulate the function for all values of μ and σ is the following:

Proposition 3.6 If X ~ N(μ, σ²), and Y = (X − μ)/σ, then Y ~ N(0,1).

So we only need tables of the c.d.f. of N(0,1) – this is the so-called standard normal random variable – and we can find the c.d.f. of any normal random variable. The function is called Φ in the tables; the c.d.f. of the standard normal is given in Table 4 of the New Cambridge Statistical Tables [1].

For example, suppose that X ~ N(6,25). What is the probability that X ≤ 8? Putting Y = (X − 6)/5, so that Y ~ N(0,1), we find that X ≤ 8 if and only if Y ≤ (8−6)/5 = 0.4. From the tables, the probability of this is Φ(0.4) = 0.6554.

The p.d.f. of a standard normal r.v. Y is symmetric about zero. So, for any positive number c,

Φ(−c) = P(Y ≤ −c) = P(Y ≥ c) = 1 − P(Y ≤ c) = 1 − Φ(c).

So it is only necessary to tabulate the function for positive values of its argument. For example, if X ~ N(6,25) and Y = (X − 6)/5 as before, then

P(X ≤ 3) = P(Y ≤ −0.6) = 1 − P(Y ≤ 0.6) = 1 − 0.7257 = 0.2743.

3.9 On using tables

We end this section with a few comments about using tables, not tied particularly to the normal distribution (though most of the examples will come from there).

Interpolation

Any table is limited in the number of entries it contains. Suppose that some function F is tabulated with the input given to three places of decimals. Tabulating something with the input given to one extra decimal place would make the table ten times as bulky! Interpolation can be used to extend the range of values tabulated. It is probably true that F is changing at a roughly constant rate between, say, 0.28 and 0.29; so F(0.283) will be about three-tenths of the way between F(0.28) and F(0.29). For example, if Φ is the c.d.f. of the normal distribution, then Φ(0.28) = 0.6103 and Φ(0.29) = 0.6141, so Φ(0.283) = 0.6114. (Three-tenths of 0.0038 is 0.0011, so the required value is about 0.6103 + 0.0011 = 0.6114.)

Using tables in reverse

This means, if you have a table of values of F, use it to find x such that F(x) is a given value c. Usually, c won't be in the table and we have to interpolate between values x₁ and x₂, where F(x₁) is just less than c and F(x₂) is just greater. For example, suppose that X ~ N(0,1) and we want the upper quartile of X, the value x with Φ(x) = 0.75. From the tables, Φ(0.67) = 0.7486 and Φ(0.68) = 0.7517, so the required value is about 0.6745 (since 0.0014/0.0031 = 0.45). However, the percentile points of the standard normal r.v. are given in Table 5 of the New Cambridge Statistical Tables [1], so you don't need to do this.
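The interpolation recipe above is just a straight-line estimate between two table entries. A small sketch (mine, not from the notes), using the two Φ values quoted in the text:

```python
# Linear interpolation between two tabulated values of a function.

def interpolate(x, x1, f1, x2, f2):
    # assume f changes at a roughly constant rate between x1 and x2
    return f1 + (x - x1) / (x2 - x1) * (f2 - f1)

# Phi(0.28) = 0.6103 and Phi(0.29) = 0.6141, from the text.
approx = interpolate(0.283, 0.28, 0.6103, 0.29, 0.6141)
print(round(approx, 4))   # about 0.6114
```

Using tables in reverse is the same formula solved for x instead of F(x).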

Continuity correction

Suppose we know that a discrete random variable X is well approximated by a continuous random variable Y; for example, X takes integer values, and we are given a table of the c.d.f. of Y and want to find information about X. To say that X can be approximated by Y means that, to a good approximation, P(X = a) is equal to f_Y(a), where f_Y is the p.d.f. of Y. This is equal to the area of a rectangle of height f_Y(a) and base 1 (from a − 0.5 to a + 0.5). This in turn is, to a good approximation, the area under the curve y = f_Y(x) from x = a − 0.5 to x = a + 0.5, since the pieces of the curve above and below the rectangle on either side of x = a will approximately cancel.

[Figure: the curve y = f_Y(x) with a rectangle of height P(X = a) on the base from a − 0.5 to a + 0.5.]

Similarly, suppose that X takes integer values and we want to find P(a ≤ X ≤ b). This probability is equal to P(X = a) + P(X = a+1) + ··· + P(X = b). Adding all these pieces, we find that P(a ≤ X ≤ b) is approximately equal to the area under the curve y = f_Y(x) from x = a − 0.5 to x = b + 0.5. This area is given by F_Y(b + 0.5) − F_Y(a − 0.5), since F_Y is the integral of f_Y.

We summarise the continuity correction: Suppose that the discrete random variable X, taking integer values, is approximated by the continuous random variable Y. Then, where a and b are integers,

P(a ≤ X ≤ b) ≈ P(a − 0.5 ≤ Y ≤ b + 0.5) = F_Y(b + 0.5) − F_Y(a − 0.5).

Similarly, P(X ≤ b) ≈ P(Y ≤ b + 0.5), and P(X ≥ a) ≈ P(Y ≥ a − 0.5). (Here, ≈ means "approximately equal".)

Example The probability that a light bulb will fail in a year is 0.75, and light bulbs fail independently. If 192 bulbs are installed, what is the probability that the number which fail in a year lies between 140 and 150 inclusive?

Solution Let X be the number of light bulbs which fail in a year. Then X ~ Bin(192, 3/4), and so E(X) = 144, Var(X) = 36. So X is approximated by Y ~ N(144, 36), and

P(140 ≤ X ≤ 150) ≈ P(139.5 ≤ Y ≤ 150.5)

by the continuity correction. Let Z = (Y − 144)/6. Then Z ~ N(0,1), and

P(139.5 ≤ Y ≤ 150.5) = P((139.5 − 144)/6 ≤ Z ≤ (150.5 − 144)/6)
= P(−0.75 ≤ Z ≤ 1.083)
= 0.8606 − 0.2268 (from tables)
= 0.6338.
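The quality of this approximation can be judged by computing the exact binomial probability and comparing it with the continuity-corrected normal value. The sketch below is mine, not part of the notes; it uses the error function to evaluate Φ instead of tables.

```python
from math import comb, erf, sqrt

def Phi(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 192, 0.75
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(140, 151))

mu, sigma = n * p, sqrt(n * p * (1 - p))  # 144 and 6
approx = Phi((150.5 - mu) / sigma) - Phi((139.5 - mu) / sigma)
print(round(exact, 4), round(approx, 4))
```

The exact sum and the normal approximation agree to about two decimal places, which is typical for a binomial with this large a variance.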

3.10 Worked examples

Question I roll a fair die twice. Let the random variable X be the maximum of the two numbers obtained, and let Y be the modulus of their difference (that is, the value of Y is the larger number minus the smaller number).

(a) Write down the joint p.m.f. of (X,Y).
(b) Write down the p.m.f. of X, and calculate its expected value and its variance.
(c) Write down the p.m.f. of Y, and calculate its expected value and its variance.
(d) Are the random variables X and Y independent?

Solution (a)

        Y=0    Y=1    Y=2    Y=3    Y=4    Y=5
X=1    1/36    0      0      0      0      0
X=2    1/36   2/36    0      0      0      0
X=3    1/36   2/36   2/36    0      0      0
X=4    1/36   2/36   2/36   2/36    0      0
X=5    1/36   2/36   2/36   2/36   2/36    0
X=6    1/36   2/36   2/36   2/36   2/36   2/36

The best way to produce this is to write out a 6×6 table giving all possible values

for the two throws, work out for each cell what the values of X and Y are, and then count the number of occurrences of each pair of values. For example, X = 5, Y = 2 can occur in two ways: the numbers thrown must be (5,3) or (3,5).

(b) Take row sums:

x          1     2     3     4     5     6
P(X = x)  1/36  3/36  5/36  7/36  9/36  11/36

Hence, in the usual way, E(X) = 161/36 and Var(X) = 2555/1296.

(c) Take column sums:

y          0     1     2     3     4     5
P(Y = y)  6/36  10/36  8/36  6/36  4/36  2/36

and so E(Y) = 35/18 and Var(Y) = 665/324.

(d) No: e.g. P(X = 1, Y = 2) = 0, but P(X = 1) · P(Y = 2) = 8/1296 ≠ 0.

Question An archer shoots an arrow at a target. The distance of the arrow from the centre of the target is a random variable X whose p.d.f. is given by
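The whole dice calculation can be checked by brute force, since there are only 36 equally likely outcomes. The sketch below is mine, not from the notes, and uses exact fractions.

```python
from fractions import Fraction
from collections import Counter

# Enumerate all 36 outcomes and tabulate X = max, Y = |difference|.
joint = Counter()
for d1 in range(1, 7):
    for d2 in range(1, 7):
        joint[(max(d1, d2), abs(d1 - d2))] += Fraction(1, 36)

EX = sum(x * p for (x, _), p in joint.items())
EY = sum(y * p for (_, y), p in joint.items())
print(EX, EY)   # 161/36 and 35/18

# Independence fails: P(X=1, Y=2) = 0 but P(X=1) * P(Y=2) > 0.
pX1 = sum(p for (x, _), p in joint.items() if x == 1)
pY2 = sum(p for (_, y), p in joint.items() if y == 2)
print(joint[(1, 2)], pX1 * pY2)
```

This reproduces the table, the expected values, and the counterexample to independence from parts (a)–(d).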

f_X(x) = (3 + 2x − x²)/9 if 0 ≤ x ≤ 3, and 0 otherwise.

The archer's score is determined as follows:

Distance   X < 0.5   0.5 ≤ X < 1   1 ≤ X < 1.5   1.5 ≤ X < 2   X ≥ 2
Score        10           7             4             1           0

Construct the probability mass function for the archer's score, and find the archer's expected score.

Solution First we work out the probability of the arrow being in each of the given bands:

P(X < 0.5) = F_X(0.5) − F_X(0) = ∫ from 0 to 0.5 of (3 + 2x − x²)/9 dx = [(9x + 3x² − x³)/27] from 0 to 1/2 = 41/216.

Similarly we find that P(0.5 ≤ X < 1) = 47/216, P(1 ≤ X < 1.5) = 47/216, P(1.5 ≤ X < 2) = 41/216, and P(X ≥ 2) = 40/216. So the p.m.f. for the archer's score S is

s            0       1       4       7      10
P(S = s)   40/216  41/216  47/216  47/216  41/216

Hence

E(S) = (41·1 + 47·4 + 47·7 + 41·10)/216 = 121/27.

Question Let T be the lifetime in years of new bus engines. Suppose that T is continuous with probability density function

f_T(x) = 0 for x < 1, d/x³ for x ≥ 1,

for some constant d.

(a) Find the value of d.
(b) Find the mean and median of T.
(c) Suppose that 240 new bus engines are installed at the same time, and that their lifetimes are independent. By making an appropriate approximation, find the probability that at most 10 of the engines last for 4 years or more.

Solution (a) The integral of f_T(x), over the support of T, must be 1. That is,

1 = ∫ from 1 to ∞ of d/x³ dx = [−d/(2x²)] from 1 to ∞ = d/2,

so d = 2.

(b) The c.d.f. of T is obtained by integrating the p.d.f.; it is

F_T(x) = 0 for x < 1, 1 − 1/x² for x ≥ 1.

The mean of T is

∫ from 1 to ∞ of x f_T(x) dx = ∫ from 1 to ∞ of 2/x² dx = 2.

The median is the value m such that F_T(m) = 1/2; that is, 1 − 1/m² = 1/2, or m = √2.

(c) The probability that an engine lasts for four years or more is

1 − F_T(4) = 1 − (1 − (1/4)²) = 1/16.

So, if 240 engines are installed, the number which last for four years or more is a binomial random variable X ~ Bin(240, 1/16), with expected value 240 × (1/16) = 15 and variance 240 × (1/16) × (15/16) = 225/16. We approximate X by Y ~ N(15, (15/4)²). Using the continuity correction,

P(X ≤ 10) ≈ P(Y ≤ 10.5),

and, if Z = (Y − 15)/(15/4), then Z ~ N(0,1), and

P(Y ≤ 10.5) = P(Z ≤ (10.5 − 15)/(15/4)) = P(Z ≤ −1.2) = 1 − P(Z ≤ 1.2) = 0.1151,

using the table of the standard normal distribution. Note that we start with the continuous random variable T, move to the discrete random variable X, and then move on to the continuous random variables Y and Z, where finally Z is standard normal and so is in the tables.

A true story

The answer to the question at the end of the last chapter: as the students in the class obviously knew, the class included a pair of twins! (The twins were Leo and Willy Moser, who both had successful careers as mathematicians.) But what went wrong with our argument for the Birthday Paradox? We assumed (without saying so) that the birthdays of the people in the room were independent; but of course the birthdays of twins are clearly not independent!

Chapter 4: More on joint distribution

We have seen the joint p.m.f. of two discrete random variables X and Y, and we have learned what it means for X and Y to be independent. Now we examine this further to see measures of non-independence and conditional distributions of random variables.

4.1 Covariance and correlation

In this section we consider a pair of discrete random variables X and Y. Remember that X and Y are independent if

P(X = a_i, Y = b_j) = P(X = a_i) · P(Y = b_j)

holds for any pair (a_i, b_j) of values of X and Y. We introduce a number (called the covariance of X and Y) which gives a measure of how far they are from being independent.

Look back at the proof of Theorem 21(b), where we showed that if X and Y are independent then Var(X + Y) = Var(X) + Var(Y). We found that, in any case,

Var(X + Y) = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y)),

and then proved that if X and Y are independent then E(XY) = E(X)E(Y), so that the last term is zero. Now we define the covariance of X and Y to be E(XY) − E(X)E(Y). We write Cov(X,Y) for this quantity. Then the argument we had earlier shows the following:

Theorem 4.1 (a) Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y).
(b) If X and Y are independent, then Cov(X,Y) = 0.

In fact, a more general version of (a), proved by the same argument, says that

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X,Y).    (4.1)

Another quantity closely related to covariance is the correlation coefficient, which is just a "normalised" version of the covariance. It is defined as follows:

corr(X,Y) = Cov(X,Y) / √(Var(X)Var(Y)).

Thus the correlation coefficient is a measure of the extent to which the two variables are related. It is +1 if Y increases linearly with X, 0 if there is no relation between them, and −1 if Y decreases linearly as X increases. More generally, a positive correlation indicates a tendency for larger X values to be associated with larger Y values; a negative value, a tendency for smaller X values to be associated with larger Y values. The point of this is the first part of the following theorem.

Theorem 4.2 Let X and Y be random variables. Then
(a) −1 ≤ corr(X,Y) ≤ 1;
(b) if X and Y are independent, then corr(X,Y) = 0;
(c) if Y = mX + c for some constants m ≠ 0 and c, then corr(X,Y) = +1 if m > 0, and corr(X,Y) = −1 if m < 0.

The proof of the first part is optional: see the end of this section. Part (b) follows immediately from part (b) of the preceding theorem. For part (c), suppose that Y = mX + c. Let E(X) = μ and Var(X) = β, so that E(X²) = μ² + β. Now we just calculate everything in sight:

E(Y) = E(mX + c) = mE(X) + c = mμ + c,
E(Y²) = E(m²X² + 2mcX + c²) = m²(μ² + β) + 2mcμ + c²,
Var(Y) = E(Y²) − E(Y)² = m²β,
E(XY) = E(mX² + cX) = m(μ² + β) + cμ,
Cov(X,Y) = E(XY) − E(X)E(Y) = mβ,
corr(X,Y) = Cov(X,Y)/√(Var(X)Var(Y)) = mβ/√(m²β²) = +1 if m > 0, −1 if m < 0.

But note that this is another check on your calculations: if you calculate a correlation coefficient which is bigger than 1 or smaller than −1, then you have made a mistake.

Example I have two red pens, one green pen, and one blue pen, and I choose two pens without replacement. Let X be the number of red pens that I choose and Y the number of green pens. Then the joint p.m.f. of X and Y is given by the following table:

        Y=0    Y=1
X=0      0     1/6
X=1     1/3    1/3
X=2     1/6     0

From this we can calculate the marginal p.m.f. of X and of Y and hence find their expected values and variances:

E(X) = 1, Var(X) = 1/3, E(Y) = 1/2, Var(Y) = 1/4.

Also, E(XY) = 1/3, since the sum E(XY) = Σ over i,j of a_i b_j P(X = a_i, Y = b_j) contains only one term where all three factors are non-zero.
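These values can be verified mechanically before completing the example. The sketch below is mine, not part of the notes; it recomputes the marginals, E(XY), and the covariance from the joint table with exact fractions.

```python
from fractions import Fraction

# Joint p.m.f. of the pens example: (x, y) -> P(X = x, Y = y).
joint = {(0, 0): Fraction(0),    (0, 1): Fraction(1, 6),
         (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 3),
         (2, 0): Fraction(1, 6), (2, 1): Fraction(0)}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
VarX = sum(x * x * p for (x, y), p in joint.items()) - EX**2
VarY = sum(y * y * p for (x, y), p in joint.items()) - EY**2
cov = EXY - EX * EY
print(EX, VarX, EY, VarY, EXY, cov)   # 1, 1/3, 1/2, 1/4, 1/3, -1/6
```

The printed values agree with the hand calculation in the text.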

Hence

Cov(X,Y) = 1/3 − 1/2 = −1/6,

and

corr(X,Y) = (−1/6)/√((1/3)·(1/4)) = (−1/6)/√(1/12) = −1/√3.

The negative correlation means that small values of X tend to be associated with larger values of Y. Indeed, if X = 0 then Y must be 1, but if X = 1 then Y can be either 0 or 1, and if X = 2 then Y must be 0.

We call two random variables X and Y uncorrelated if Cov(X,Y) = 0, that is, if corr(X,Y) = 0. So we can say: independent random variables are uncorrelated. However, it doesn't work the other way around: uncorrelated random variables need not be independent.

Example We have seen that if X and Y are independent then Cov(X,Y) = 0. Consider the following joint p.m.f.:

         Y=−1   Y=0   Y=1
X=−1     1/5     0    1/5
X=0       0     1/5    0
X=1      1/5     0    1/5

Now calculation shows that E(X) = E(Y) = E(XY) = 0, so Cov(X,Y) = 0 and corr(X,Y) = 0. But X and Y are not independent: for P(X = −1) = 2/5 and P(Y = 0) = 1/5, but P(X = −1, Y = 0) = 0.

Here is the proof that the correlation coefficient lies between −1 and 1. Clearly this is exactly equivalent to proving that its square is at most 1, that is, that Cov(X,Y)² ≤ Var(X) · Var(Y). This depends on the following fact:

Let p, q, r be real numbers with p > 0. Suppose that px² + 2qx + r ≥ 0 for all real numbers x. Then q² ≤ pr.

For, when we plot the graph y = px² + 2qx + r, we get a parabola, and the hypothesis means that this parabola never goes below the X-axis, so that either it lies entirely above the axis, or it touches it in one point. This means that the quadratic equation px² + 2qx + r = 0 either has no real roots, or has two equal real roots. From high-school algebra, we know that this means that q² ≤ pr.

Now let p = Var(X), q = Cov(X,Y), and r = Var(Y). Equation (4.1) shows that

px² + 2qx + r = Var(xX + Y).

(Note that x is an arbitrary real number here and has no connection with the random variable X!) Since the variance of a random variable is never negative, we see that px² + 2qx + r ≥ 0 for all choices of x. Now our argument above shows that q² ≤ pr, that is, Cov(X,Y)² ≤ Var(X) · Var(Y), as required.

4.2 Conditional random variables

Then the joint p. This defines the probability mass function of the conditional random variable X | A. we have E(X | Y = bj) =§ i aiP(X = ai | Y = bj). Suppose that X is a discrete random variable. one green pen. is just P(X = ai | A) = P(A holds and X = ai) P(A) . talk about the conditional expectation E(X | A) =§ i aiP(X = ai | A).2. and I choose two pens without replacement. we have P(X = ai | Y = bj) = P(X = ai. In particular. the marginal distribution of Y.Y = bj) P(Y = bj) . CONDITIONAL RANDOM VARIABLES 71 Now the event A might itself be defined by a random variable.f. for example. So we can. of X and Y is given by the following table: Y 01 0 0 16 X 1 13 13 2 16 0 In this case. Let X be the number of red pens that I choose and Y the number of green pens. The sum of the entries in this column is just P(Y = bj). and one blue pen. We divide the entries in the column by this value to obtain a new distribution of X (whose probabilities add up to 1). A might be the event that Y takes the value bj.f. the conditional distributions of X corresponding to the two values of Y are as follows: a012 P(X = a | Y = 0) 0 23 13 a012 P(X = a | Y = 1) 13 2 30 . 4.m. Then the conditional probability that X takes a certain value ai. we have taken the column of the joint p. Example I have two red pens.m.Remember that the conditional probability of event B given event A is P(B | A) = P(A\B)/P(A). given A. for example. In this case. In other words. table of X and Y corresponding to the value Y = bj.

If we know the conditional expectation of X for all values of Y. we have E(X) = E(X | Y = 0)P(Y = 0)+E(X | Y = 1)P(Y = 1) = (4/3)×(1/2)+(2/3)×(1/2) = 1. we can find the expected value of X: Proposition 4. MORE ON JOINT DISTRIBUTION = § i ai§ j P(X = ai | Y = bj)P(Y = bj) = § j i § aiP(X = ai | Y = bj!P(Y = bj) = § j E(X | Y = bj)P(Y = bj). On the other hand. Example Let us revisit the geometric random variable and calculate its expected value. so if Y = 1. Let Y be the Bernoulli random variable whose value is 1 if the result of the first toss is heads. E(X | Y = 1) = 2 3 . If Y = 1.3 E(X) =§ j E(X | Y = bj)P(Y = bj). Proof: E(X) = § i aiP(X = ai) 72 CHAPTER 4. Recall the situation: I have a coin with probability p of showing heads. In the above example. then the sequence of tosses from that point on has the same distribution as the original experiment. I toss it repeatedly until heads appears for the first time. So E(X) = E(X | Y = 0)P(Y = 0)+E(X | Y = 1)P(Y = 1) = (1+E(X)) ·q+1 · p . then necessarily X = 1. X is the number of tosses.We have E(X | Y = 0) = 4 3 . then we stop the experiment then and there. 0 if it is tails. if Y = 0. so E(X | Y = 0) = 1+E(X) (the 1 counting the first toss). and we have E(X | Y = 1) = 1.

= E(X)(1 − p) + 1.

Rearranging this equation, we find that E(X) = 1/p, confirming our earlier value.

In Proposition 2.1, we saw that independence of events can be characterised in terms of conditional probabilities: A and B are independent if and only if they satisfy P(A | B) = P(A). A similar result holds for independence of random variables:

Proposition 4.4 Let X and Y be discrete random variables. Then X and Y are independent if and only if, for any values a_i and b_j of X and Y respectively, we have

P(X = a_i | Y = b_j) = P(X = a_i).

This is obtained by applying Proposition 15 to the events X = a_i and Y = b_j. It can be stated in the following way: X and Y are independent if the conditional p.m.f. of X | (Y = b_j) is equal to the p.m.f. of X, for any value b_j of Y.

4.3 Joint distribution of continuous r.v.s

For continuous random variables, the covariance and correlation can be defined by the same formulae as in the discrete case, and Equation (4.1) remains valid. But we have to examine what is meant by independence for continuous random variables. The formalism here needs even more concepts from calculus than we have used before: functions of two variables, partial derivatives, double integrals. I assume that this is unfamiliar to you, so this section will be brief and can mostly be skipped.

Let X and Y be continuous random variables. We define X and Y to be independent if

P(X ≤ x, Y ≤ y) = P(X ≤ x) · P(Y ≤ y)

for any x and y. The joint cumulative distribution function of X and Y is the function F_{X,Y} of two real variables given by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y).

(Note that, just as in the one-variable case, X is part of the name of the function, while x is the argument of the function.) Then X and Y are independent if and only if

F_{X,Y}(x, y) = F_X(x) · F_Y(y).

The joint probability density function of X and Y is

f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y);

that is, differentiate with respect to x keeping y constant, and then differentiate with respect to y keeping x constant (or the other way round: the answer is the same for all functions we consider). The probability that the pair of values of (X,Y) corresponds to a point in some region of the plane is obtained by taking the double integral of f_{X,Y} over that region. For example,

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫ from c to d ( ∫ from a to b of f_{X,Y}(x, y) dx ) dy

(the right-hand side means: integrate with respect to x between a and b keeping y fixed; the result is a function of y; integrate this function with respect to y from c to d).
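As a sanity check (mine, not in the notes), here is that double integral computed numerically in the simplest case: X and Y independent U(0,1), so f_{X,Y}(x,y) = 1 on the unit square, and P(a ≤ X ≤ b, c ≤ Y ≤ d) should come out as (b−a)(d−c).

```python
# Nested midpoint-rule approximation of a double integral
# over the rectangle [a, b] x [c, d].

def double_integral(f, a, b, c, d, n=400):
    hx, hy = (b - a) / n, (d - c) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * hx
        for j in range(n):
            y = c + (j + 0.5) * hy
            total += f(x, y) * hx * hy
    return total

# joint p.d.f. of two independent U(0,1) variables on the unit square
f = lambda x, y: 1.0
prob = double_integral(f, 0.2, 0.7, 0.1, 0.4)
print(prob)   # should be close to (0.7-0.2)*(0.4-0.1) = 0.15
```

The inner loop plays the role of the inner integral over x; the outer loop then integrates the result over y.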

The marginal p.d.f. of X is given by

f_X(x) = ∫ from −∞ to ∞ of f_{X,Y}(x, y) dy,

and the marginal p.d.f. of Y is similarly

f_Y(y) = ∫ from −∞ to ∞ of f_{X,Y}(x, y) dx.

The continuous random variables X and Y are independent if and only if f_{X,Y}(x, y) = f_X(x) · f_Y(y). Then the conditional p.d.f. of X | (Y = b) is

f_{X|(Y=b)}(x) = f_{X,Y}(x, b)/f_Y(b),

and, as usual, X and Y are independent if and only if the conditional p.d.f. of X | (Y = b) is equal to the marginal p.d.f. of X, for any value b.

The expected value of XY is, not surprisingly,

E(XY) = ∫ from −∞ to ∞ ∫ from −∞ to ∞ of xy f_{X,Y}(x, y) dx dy,

and then as in the discrete case

Cov(X,Y) = E(XY) − E(X)E(Y), corr(X,Y) = Cov(X,Y)/√(Var(X)Var(Y)).

Finally, and importantly, if X and Y are independent, then Cov(X,Y) = corr(X,Y) = 0 (but not conversely!).

4.4 Transformation of random variables

If a continuous random variable Y is a function of another r.v. X, we can find the distribution of Y in terms of that of X.

Example Let X and Y be random variables. Suppose that X ~ U[0,4] (uniform on [0,4]) and Y = √X. What is the support of Y? Find the cumulative distribution function and the probability density function of Y.

Solution (a) The support of X is [0,4], so, since Y = √X, the support of Y is [0,2].

(b) We have F_X(x) = x/4 for 0 ≤ x ≤ 4. Now

F_Y(y) = P(Y ≤ y) = P(X ≤ y²) = F_X(y²) = y²/4

for 0 ≤ y ≤ 2. (Note that Y ≤ y if and only if X ≤ y².) Of course, F_Y(y) = 0 for y < 0 and F_Y(y) = 1 for y > 2.

(c) We have f_Y(y) =

(d/dy) F_Y(y) = y/2 if 0 ≤ y ≤ 2, and 0 otherwise.

If we know Y as a function of X, say Y = g(X), where g is an increasing function, then the event Y ≤ y is the same as the event X ≤ h(y), where h is the inverse function of g. (In our example, g(x) = √x, and so h(y) = y².) Thus F_Y(y) = F_X(h(y)), and so, by the Chain Rule,

f_Y(y) = f_X(h(y)) h′(y),

where h′ is the derivative of h. (This is because f_X(x) is the derivative of F_X(x) with respect to its argument x, and the Chain Rule says that if x = h(y) we must multiply by h′(y) to find the derivative with respect to y.) Applying this formula in our example, we have

f_Y(y) = (1/4) · 2y = y/2

for 0 ≤ y ≤ 2, since the p.d.f. of X is f_X(x) = 1/4 for 0 ≤ x ≤ 4.

Here is a formal statement of the result, together with the conditions for its validity. However, rather than remember this formula, I recommend going back to the argument we used in the example.

Theorem 4.5 Let X be a continuous random variable. Let g be a real function which is either strictly increasing or strictly decreasing on the support of X, and which is differentiable there. Let Y = g(X). Then
(a) the support of Y is the image of the support of X under g;
(b) the p.d.f. of Y is given by f_Y(y) = f_X(h(y)) |h′(y)|, where h is the inverse function of g.

The argument in (b) is the key. For example, here is the proof of Proposition 3.6: if X ~ N(μ, σ²) and Y = (X − μ)/σ, then Y ~ N(0,1). We have Y = g(X), where g(x) = (x − μ)/σ; this function is everywhere strictly increasing (the graph is a straight line with slope 1/σ), and the inverse function is x = h(y) = σy + μ, with h′(y) = σ. Recall that

f_X(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)),

and so

f_Y(y) = f_X(σy + μ) · σ = (1/√(2π)) e^(−y²/2),

the p.d.f. of a standard normal variable.
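Returning to the U[0,4] example, the transformed c.d.f. F_Y(y) = y²/4 can be checked empirically. The sketch is mine, not part of the notes.

```python
import random
from math import sqrt

# If X ~ U[0, 4] and Y = sqrt(X), then F_Y(y) = y^2/4 on [0, 2].
random.seed(1)
n = 200000
ys = [sqrt(random.uniform(0, 4)) for _ in range(n)]

for y in (0.5, 1.0, 1.5):
    empirical = sum(1 for v in ys if v <= y) / n
    print(y, empirical, y * y / 4)   # the two values should be close
```

The empirical proportions track y²/4 to about two decimal places, as the derivation predicts.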

If the transforming function $g$ is not monotonic (that is, not either increasing or decreasing), then life is a bit more complicated. For example, if $X$ is a random variable taking both positive and negative values, and $Y = X^2$, then a given value $y$ of $Y$ could arise from either of the values $\sqrt{y}$ and $-\sqrt{y}$ of $X$; so we must work out the two contributions and add them up.

Example $X \sim N(0,1)$ and $Y = X^2$. Find the p.d.f. of $Y$.

Solution The p.d.f. of $X$ is $(1/\sqrt{2\pi})\,e^{-x^2/2}$. Let $\Phi(x)$ be its c.d.f., so that $P(X \le x) = \Phi(x)$, and
$$\Phi'(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$
Now $Y = X^2$, so $Y \le y$ if and only if $-\sqrt{y} \le X \le \sqrt{y}$. Thus
$$F_Y(y) = P(Y \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = \Phi(\sqrt{y}) - (1 - \Phi(\sqrt{y})) = 2\Phi(\sqrt{y}) - 1$$
(by symmetry of $N(0,1)$). So
$$f_Y(y) = \frac{d}{dy}F_Y(y) = 2\Phi'(\sqrt{y})\cdot\frac{1}{2\sqrt{y}} \quad\text{(by the Chain Rule)}\quad = \frac{1}{\sqrt{2\pi y}}\,e^{-y/2}.$$
Of course, this is valid for $y > 0$; for $y < 0$, the p.d.f. of $Y$ is zero.

Note the 2 in the line labelled "by the Chain Rule". If you blindly applied the formula of Theorem 4.5, using $h(y) = \sqrt{y}$, you would not get this 2. It arises from the fact that, since $Y = X^2$, each value of $Y$ corresponds to two values of $X$ (one positive, one negative), and, by the symmetry of the p.d.f. of $X$, each value gives the same contribution.

4.5 Worked examples

Question Two numbers $X$ and $Y$ are chosen independently from the uniform distribution on the unit interval $[0,1]$. Let $Z$ be the maximum of the two numbers. Find the p.d.f. of $Z$, and hence find its expected value, variance and median.

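As an illustrative aside (not part of the original notes), answers to questions like this are easy to check by simulation; the estimates below can be compared with the exact values obtained in the solution that follows. The sample size and seed are arbitrary choices.

```python
import random

# Simulate Z = max(X, Y) for X, Y independent U[0,1] draws.
rng = random.Random(7)
z = [max(rng.random(), rng.random()) for _ in range(200_000)]

n = len(z)
mean_z = sum(z) / n                              # estimate of E(Z)
var_z = sum((v - mean_z) ** 2 for v in z) / n    # estimate of Var(Z)
median_z = sorted(z)[n // 2]                     # estimate of the median
```

The estimates settle near 2/3, 1/18 and $1/\sqrt{2} \approx 0.707$ respectively.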
Solution The key to the argument is to notice that $Z = \max(X,Y) \le x$ if and only if $X \le x$ and $Y \le x$. (For, if both $X$ and $Y$ are smaller than a given value $x$, then so is their maximum; but if at least one of them is greater than $x$, then again so is their maximum.) For $0 \le x \le 1$, we have $P(X \le x) = P(Y \le x) = x$. (The variable can be called $x$ in both cases; its name doesn't matter.) So, by independence,
$$P(X \le x \text{ and } Y \le x) = x \cdot x = x^2.$$
Thus $P(Z \le x) = x^2$. Of course this probability is $0$ if $x < 0$ and is $1$ if $x > 1$. So the c.d.f. of $Z$ is
$$F_Z(x) = \begin{cases} 0 & \text{if } x < 0,\\ x^2 & \text{if } 0 \le x \le 1,\\ 1 & \text{if } x > 1.\end{cases}$$
We obtain the p.d.f. of $Z$ by differentiating:
$$f_Z(x) = \begin{cases} 2x & \text{if } 0 < x < 1,\\ 0 & \text{otherwise.}\end{cases}$$
Then we can find $E(Z)$ and $\mathrm{Var}(Z)$ in the usual way:
$$E(Z) = \int_0^1 2x^2\,dx = \frac{2}{3}, \qquad \mathrm{Var}(Z) = \int_0^1 2x^3\,dx - \left(\frac{2}{3}\right)^2 = \frac{1}{18}.$$
The median of $Z$ is the value of $m$ such that $F_Z(m) = 1/2$; that is, $m^2 = 1/2$, or $m = 1/\sqrt{2}$.

Question I roll a fair die bearing the numbers 1 to 6. If $N$ is the number showing on the die, I then toss a fair coin $N$ times. Let $X$ be the number of heads I obtain.
(a) Write down the p.m.f. of $X$.
(b) Calculate $E(X)$ without using this information.

Solution (a) If we were given that $N = n$, then $X$ would be a binomial $\mathrm{Bin}(n,1/2)$ random variable, so
$$P(X = k \mid N = n) = \binom{n}{k}(1/2)^n.$$
Clearly $P(N = n) = 1/6$ for $n = 1, \ldots, 6$. By the Theorem of Total Probability,
$$P(X = k) = \sum_{n=1}^{6} P(X = k \mid N = n)\,P(N = n).$$
So to find $P(X = k)$, we add up the probability that $X = k$ for a $\mathrm{Bin}(n,1/2)$ r.v., for $n = k, \ldots, 6$, and divide by $6$. (We start at $k$ because you can't get $k$ heads with fewer than $k$ coin tosses!) The answer

comes to:

k:         0       1        2       3       4       5      6
P(X = k):  63/384  120/384  99/384  64/384  29/384  8/384  1/384

For example,
$$P(X = 4) = \frac{\binom{4}{4}(1/2)^4 + \binom{5}{4}(1/2)^5 + \binom{6}{4}(1/2)^6}{6} = \frac{4 + 10 + 15}{384} = \frac{29}{384}.$$

(b) By Proposition 4.3,
$$E(X) = \sum_{n=1}^{6} E(X \mid (N = n))\,P(N = n).$$
Now if we are given that $N = n$ then, as we remarked, $X$ has a binomial $\mathrm{Bin}(n,1/2)$ distribution, with expected value $n/2$. So
$$E(X) = \sum_{n=1}^{6} \frac{n}{2}\cdot\frac{1}{6} = \frac{1+2+3+4+5+6}{2\cdot 6} = \frac{7}{4}.$$
Try working it out from the p.m.f., to check that the answer is the same!

Appendix A Mathematical notation

The Greek alphabet

Mathematicians use the Greek alphabet for an extra supply of symbols.

Some, like π, have standard meanings. You don't need to learn this; keep it for reference.

Name     Capital  Lowercase
alpha    A        α
beta     B        β
gamma    Γ        γ
delta    Δ        δ
epsilon  E        ε
zeta     Z        ζ
eta      H        η
theta    Θ        θ
iota     I        ι
kappa    K        κ
lambda   Λ        λ
mu       M        μ
nu       N        ν
xi       Ξ        ξ
omicron  O        o
pi       Π        π
rho      P        ρ
sigma    Σ        σ
tau      T        τ
upsilon  Υ        υ
phi      Φ        φ
chi      X        χ
psi      Ψ        ψ
omega    Ω        ω

Pairs that are often confused are zeta and xi, which look alike, and chi and xi, or epsilon and upsilon, or nu and upsilon, which sound alike. (Apologies to Greek students: you may not recognise this, but it is the Greek alphabet that mathematicians use!)

Numbers

Notation  Meaning          Example
N         Natural numbers  1, 2, 3, ... (some people include 0)
Z         Integers         ..., −2, −1, 0, 1, 2, ...
R         Real numbers     1/2, √2, π, ...
|x|       modulus          |2| = 2, |−3| = 3

a/b       a over b                        12/3 = 4, 2/4 = 0.5
a | b     a divides b                     4 | 12
mCn       m choose n                      5C2 = 10
n!        n factorial                     5! = 120
∑(i=a..b) xi   xa + xa+1 + ··· + xb       ∑(i=1..3) i² = 1² + 2² + 3² = 14 (see section on Summation below)
x ≈ y     x is approximately equal to y

Sets

Notation              Meaning                              Example
{...}                 a set                                {1, 2, 3}  NOTE: {1, 2} = {2, 1}
{x : ...} or {x | ...}  the set of all x such that ...     {x : x² = 4} = {−2, 2}
|A|                   cardinality of A                     |{1, 2, 3}| = 3 (number of elements in A)
A ∪ B                 A union B                            {1, 2, 3} ∪ {2, 4} = {1, 2, 3, 4} (elements in either A or B)
A ∩ B                 A intersection B                     {1, 2, 3} ∩ {2, 4} = {2} (elements in both A and B)
A \ B                 set difference                       {1, 2, 3} \ {2, 4} = {1, 3} (elements in A but not B)
A ⊆ B                 A is a subset of B (or equal)        {1, 3} ⊆ {1, 2, 3}
A′                    complement of A                      everything not in A
∅                     empty set (no elements)              {1, 2} ∩ {3, 4} = ∅
x ∈ A                 x is an element of the set A         2 ∈ {1, 2, 3}
(x, y)                ordered pair                         NOTE: (1, 2) ≠ (2, 1)
A × B                 Cartesian product                    {1, 2} × {1, 3} = {(1,1), (1,3), (2,1), (2,3)} (set of all ordered pairs)

Summation

What is it? Let $a_1, a_2, \ldots, a_n$ be numbers. The notation $\sum_{i=1}^{n} a_i$ (read "sum, from $i$ equals 1 to $n$, of $a_i$") means: add up the numbers $a_1, a_2, \ldots, a_n$; that is,
$$\sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n.$$
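As an aside that is not in the original notes, Python's built-in set type mirrors this notation almost symbol for symbol, so the small examples in the table can be checked mechanically:

```python
A = {1, 2, 3}
B = {2, 4}

union = A | B                  # A ∪ B: elements in either A or B
intersection = A & B           # A ∩ B: elements in both A and B
difference = A - B             # A \ B: elements in A but not B
cardinality = len(A)           # |A|: number of elements in A
is_subset = {1, 3} <= A        # ⊆: subset-or-equal test
is_element = 2 in A            # ∈: membership test

# {x : x^2 = 4}, searching a small range of integers:
solutions = {x for x in range(-10, 11) if x * x == 4}

# Cartesian product {1,2} × {1,3} as a set of ordered pairs:
product = {(x, y) for x in {1, 2} for y in {1, 3}}
```

Note that Python sets, like mathematical sets, are unordered, so `{1, 2} == {2, 1}` holds automatically.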

For example. MATHEMATICAL NOTATION and then add up all the results. 82 APPENDIX A. since (if m and n are different) it is telling us to add up a different number of terms. n § i=1 ai!· m § j=1 . Sometimes I get lazy and don¶t bother to write out the values: I just say§ i ai to mean: add up all the relevant values. and then add the results. § = § § (A. a2 +b2 on the second line. n i=1 (ai+bi) n i=1 ai+ n i+1 bi. 20 § i=10 ai = a10+a11+···+a20. The sum doesn¶t have to start at 1. then we say that E(X) =§ i aiP(X = ai) where the sum is over all i such that ai is a value of the random variable X. The variable i or j is called a ³dummy variable´. and so on.exactly the same thing. The right-hand side says: add the first column (all the as) and the second column (all the bs). The notation m i=1 ai is j=1 aj means § not the same.1) Imagine the as and bs written out with a1 +b1 on the first line. For example. if X is a discrete random variable. The answers must be the same. Manipulation The following three rules hold. The left-hand side says: add the two terms in each line.

$$\left(\sum_{i=1}^{n} a_i\right)\cdot\left(\sum_{j=1}^{m} b_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} a_i b_j. \tag{A.2}$$
The double sum says: add up all these products, for all values of $i$ and $j$. A simple example shows how it works:
$$(a_1 + a_2)(b_1 + b_2) = a_1 b_1 + a_1 b_2 + a_2 b_1 + a_2 b_2.$$
If in place of numbers we have functions of $x$, then we can "differentiate term-by-term":
$$\frac{d}{dx}\sum_{i=1}^{n} f_i(x) = \sum_{i=1}^{n} \frac{d}{dx} f_i(x). \tag{A.3}$$
The left-hand side says: add up the functions and differentiate the sum. The right says: differentiate each function and add up the derivatives.

Infinite sums

Sometimes we meet infinite sums, which we write as $\sum_{i=1}^{\infty} a_i$, for example. This doesn't just mean "add up infinitely many values", since that is not possible. We need Analysis to give us a definition in general. But sometimes we know the answer another way: for example, if $a_i = ar^{i-1}$, where $-1 < r < 1$, then
$$\sum_{i=1}^{\infty} a_i = a + ar + ar^2 + \cdots = \frac{a}{1-r},$$
using the formula for the sum of the "geometric series". Another useful result is the Binomial Theorem:
$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k.$$
You also need to know the sum of the "exponential series"

Do the three rules of the preceding section hold? Sometimes yes. e.Y) correlation coefficient of X and Y 68 X | B conditional random variable 70 X | (Y = b) 71 Bernoulli random variable Bernoulli(p) (p. where q = 1í p. 83 84 APPENDIX B.f. sometimes no. the number of heads in n coin tosses.) E(X) expected value of X 41 Var(X) variance of X 42 Cov(X. sampling with replacement from a population with a proportion p of distinguished elements. In Analysis you will see some answers to this question.i=0 xi i! = 1+x+ x2 2 + x3 6 + x4 24 +··· = ex. ‡ p. 48) ‡ Occurs when there is a single trial with a fixed probability p of success. Also.m.m. Var(X) = pq. p) (p. P(X = 1) = p.f. the rules will be valid. PROBABILITY AND RANDOM VARIABLES Binomial random variable Bin(n. P(X = 0) = q.Y) covariance of X and Y 67 corr(X. ‡ E(X) = p. or same p. .g. 49) ‡ Occurs when we are counting the number of successes in n independent trials with fixed probability p of success in each trial. Notation Meaning Page P(A) probability of A 3 P(A | B) conditional probability of A given B 24 X =Y the values of X and Y are equal X _Y X and Y have the same distribution 41 (that is.d. same p.f. X and Y are random variables. ‡ Takes only the values 0 and 1. In all the examples you meet in this book. A and B are events. Appendix B Probability and random variables Notation In the table.

m.m. ‡ p.f. . F(x) = (0 if x < a.1.M.1. number of tosses until the first head when tossing a coin. and np = P. Exponential random variable Exp(P) (p.M/N) if n is small compared to N. . where q = 1í p. 52) ‡ Describes the number of trials up to and including the first success in a sequence of independent Bernoulli trials.M.n. e. 51) ‡ Occurs when we are sampling n elements without replacement from a population of N elements of which M are distinguished.g. e. ‡ Values 0. (xía)/(bía) if a _ x _ b. then Bin(n. . P(X = k) = eíPPk/k! .2.2. ‡ If n is large. . ‡ E(X) = P. f (x) = _0 if x < 0. ‡ E(X) = (a+b)/2. 59) ‡ Occurs in the same situations as the Poisson random variable.b] (p. ‡ p. ‡ E(X) = np. 0 if x > b. P(X = k) = nCkqník pk for 0 _ k _ n. (any positive integer).b].f. the number of fish caught in a day.2. .s are approximately equal). . 1/(bía) if a _ x _ b. ‡ Values 1.‡ The sum of n independent Bernoulli(p) random variables.f.N) (p. but measures the time from now until the first occurrence of the event.f. ‡ c. Var(X) = q/p2.f. Geometric random variable Geom(p) (p. .f. . ‡ p. 1 if x > b. Hypergeometric random variable Hg(n. . p) is approximately equal to Poisson(P) (in the sense that the p. p is small. ‡ p.N íM. P(X = k) = qkí1p.f. 85 Poisson random variable Poisson(P) (p. Var(X) = P.m. 54) ‡ Describes the number of occurrences of a random event in a fixed time interval. ‡ p. P(X = k) = (MCk ·NíMCník)/NCn.m.2.d. . . . ‡ E(X) = n_M N _. Var(X) = n_M N __N íM N __N ín N í1_. ‡ E(X) = 1/p. where q = 1í p. .m. ‡ Values 0. f (x) = (0 if x < a.d. ‡ Approximately Bin(n. .1. Uniform random variable U[a.n.d.f. ‡ Values 0. Var(X) = (bía)2/12. .g. 58) ‡ Occurs when a number is chosen at random from the interval [a. (any non-negative integer) ‡ p. with all values equally likely. Var(X) = npq.

• c.d.f. F(x) = 0 if x < 0, 1 − e^(−λx) if x ≥ 0.
• E(X) = 1/λ, Var(X) = 1/λ².
• However long you wait, the time until the next occurrence has the same distribution.

Normal random variable N(μ, σ²) (p. 59)
• The limit of the sum (or average) of many independent Bernoulli random variables. This also works for many other types of random variables: this statement is known as the Central Limit Theorem.
• p.d.f. f(x) = (1/(σ√(2π))) e^(−(x−μ)²/2σ²).
• No simple formula for c.d.f.; use tables.
• E(X) = μ, Var(X) = σ².
• Standard normal N(0, 1): if X ∼ N(μ, σ²), then (X − μ)/σ ∼ N(0, 1).
• For large n, Bin(n, p) is approximately N(np, npq).

The c.d.f.s of the Binomial, Poisson, and Standard Normal random variables are tabulated in the New Cambridge Statistical Tables, Tables 1, 2 and 4; in particular, the c.d.f. of N(0, 1) is given in the table.
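As a closing illustration (not part of the original notes), the formulas in this appendix can be spot-checked numerically: the exact Bin(10, 0.3) moments computed from the p.m.f. should match np and npq, and a binomial with large n and small p should be close to its Poisson approximation. The parameter values below are arbitrary choices for the check.

```python
from math import comb, exp, factorial

# Exact Bin(n, p) moments from the p.m.f., to compare with np and npq.
n, p = 10, 0.3
q = 1 - p
pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k * k * pk for k, pk in enumerate(pmf)) - mean**2

# Poisson approximation: Bin(n, p) ~ Poisson(np) for large n, small p.
n2, p2 = 1000, 0.002
lam = n2 * p2
binom3 = comb(n2, 3) * p2**3 * (1 - p2) ** (n2 - 3)   # P(X = 3) exactly
poisson3 = exp(-lam) * lam**3 / factorial(3)          # Poisson(2) at k = 3
```

The binomial and Poisson probabilities agree to three decimal places here, which is the sense of "approximately equal p.m.f.s" used above.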
