# COURSE NOTES STATS 210 Statistical Theory

Department of Statistics, University of Auckland

Contents

1. Probability
   1.1 Introduction
   1.2 Sample spaces
   1.3 Events
   1.4 Partitioning sets and events
   1.5 Probability: a way of measuring sets
   1.6 Probabilities of combined events
   1.7 The Partition Theorem
   1.8 Examples of basic probability calculations
   1.9 Formal probability proofs: non-examinable
   1.10 Conditional Probability
   1.11 Examples of conditional probability and partitions
   1.12 Bayes' Theorem: inverting conditional probabilities
   1.13 Statistical Independence
   1.14 Random Variables
   1.15 Key Probability Results for Chapter 1
   1.16 Chains of events and probability trees: non-examinable
   1.17 Equally likely outcomes and combinatorics: non-examinable


2. Discrete Probability Distributions
   2.1 Introduction
   2.2 The probability function, fX(x)
   2.3 Bernoulli trials
   2.4 Example of the probability function: the Binomial Distribution
   2.5 The cumulative distribution function, FX(x)
   2.6 Hypothesis testing
   2.7 Example: Presidents and deep-sea divers
   2.8 Example: Birthdays and sports professionals
   2.9 Likelihood and estimation
   2.10 Random numbers and histograms
   2.11 Expectation
   2.12 Variable transformations
   2.13 Variance
   2.14 Mean and variance of the Binomial(n, p) distribution


3. Modelling with Discrete Probability Distributions
   3.1 Binomial distribution
   3.2 Geometric distribution
   3.3 Negative Binomial distribution
   3.4 Hypergeometric distribution: sampling without replacement
   3.5 Poisson distribution
   3.6 Subjective modelling


4. Continuous Random Variables
   4.1 Introduction
   4.2 The probability density function
   4.3 The Exponential distribution
   4.4 Likelihood and estimation for continuous random variables
   4.5 Hypothesis tests
   4.6 Expectation and variance
   4.7 Exponential distribution mean and variance
   4.8 The Uniform distribution
   4.9 The Change of Variable Technique: finding the distribution of g(X)
   4.10 Change of variable for non-monotone functions: non-examinable
   4.11 The Gamma distribution
   4.12 The Beta Distribution: non-examinable


5. The Normal Distribution and the Central Limit Theorem
   5.1 The Normal Distribution
   5.2 The Central Limit Theorem (CLT)

6. Wrapping Up
   6.1 Estimators — the good, the bad, and the estimator PDF
   6.2 Hypothesis tests: in search of a distribution

Chapter 1: Probability

1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:
1. Frequentist (based on frequencies), e.g. number of times event occurs / number of opportunities for event to occur.
2. Subjective: probability represents a person's degree of belief that an event will occur, e.g. I think there is an 80% chance it will rain today, written as P(rain) = 0.80.

Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment. Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space. For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.

Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}. An example of a sample point is: HT.

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}.

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}.

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are "gaps" between the different elements, or if the elements can be "listed", even if an infinite list (e.g. 1, 2, 3, . . .). In mathematical language, a sample space is discrete if it is countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:
Ω = {0, 1, 2, 3} (discrete and finite)
Ω = {0, 1, 2, 3, . . .} (discrete, infinite)
Ω = {4.5, 4.6, 4.7} (discrete, finite)
Ω = {HH, HT, TH, TT} (discrete, finite)
Ω = [0, 1] = {all numbers between 0 and 1 inclusive} (continuous, infinite)
Ω = [0°, 90°) and Ω = [90°, 360°) (continuous)

1.3 Events

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics. How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment. Ω is a set, as in the definition, and might seem unexciting, but it lays the ground for a whole mathematical formulation of randomness, in terms of set theory, associated with one of the founders of probability theory, Kolmogorov (1903-1987).

The next concept that you would need to formulate is that of something that happens at random: an event. How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space. That is, any collection of outcomes forms an event. We write A ⊂ Ω: A is a subset of Ω.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.
Let event A be the event that there is exactly one head. We write: A = "exactly one head". Then A = {HT, TH}.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.

Example: Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 1), . . . , (6, 6)}.
Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}.

Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on. We do this in the language of set theory.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today.

[Venn diagram: events Bus, Car, Walk, Bike and Train within the sample space "People in class".]

This sort of diagram representing events in a sample space is called a Venn diagram.

1. Alternatives: the union 'or' operator

We wish to describe an event that is composed of several different alternatives. For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both.

We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus UNION Car". The union operator, ∪, denotes Bus OR Car OR both.

Definition: Let A and B be events on the same sample space Ω: so A ⊂ Ω and B ⊂ Ω. The union of events A and B is written A ∪ B, and is given by
A ∪ B = {s : s ∈ A or s ∈ B or both}.

[Venn diagram: the union of Bus and Car shaded within "People in class".]

To shade the union of Bus and Car, you have to consider what an outcome needs to satisfy for the shaded event to occur. The answer is Bus, OR Car, OR both. So we shade all outcomes in 'Bus' and all outcomes in 'Car'. Overall, we have shaded all outcomes in the UNION of Bus and Car.

Note: Be careful not to confuse 'Or' and 'And'. To represent the set of journeys involving both alternatives, we had to shade everything in Bus AND everything in Car. To remember whether union refers to 'Or' or 'And': the union Bus ∪ Car is the event Bus OR Car OR both, NOT Bus AND Car.

2. Concurrences and coincidences: the intersection 'and' operator

The intersection is an event that occurs when two or more events ALL occur together. For example, consider the event that your journey today involved BOTH a car AND a train. We write the event that you used both car and train as Car ∩ Train, read as "Car INTERSECT Train". The intersection operator, ∩, denotes both Car AND Train together.

Definition: The intersection of events A and B is written A ∩ B and is given by
A ∩ B = {s : s ∈ A AND s ∈ B}.

[Venn diagram: the overlap of Car and Train shaded within "People in class".]

To represent this event, we shade all outcomes in the overlap of Car and Train.

3. Opposites: the complement or 'not' operator

The complement of an event is the opposite of the event: whatever the event was, it didn't happen. For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.

[Venn diagram: all of "People in class" shaded except the event Walk.]

We write the event 'not Walk' as Walk̅ (the overline denotes the complement).

Definition: The complement of event A is written A̅ and is given by A̅ = {s : s ∉ A}.

Examples: Experiment: Pick a person in this class at random.
Sample space: Ω = {all people in class}.
Let event A = "person is male" and event B = "person travelled by bike today".
Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:
1) A: Yes.  2) B: No.  3) A̅: No.  4) B̅: Yes.
5) A̅ ∪ B = {female or bike rider or both}: No.
6) A ∩ B̅ = {male and non-biker}: Yes.
7) A ∩ B = {male and bike rider}: No. A ∩ B did not occur.
8) (A ∩ B)‾ = everything outside A ∩ B: A ∩ B did not occur, so (A ∩ B)‾ did occur. Yes.

Question: What is the event Ω̅?  Ω̅ = ∅.

Challenge: can you express A ∩ B using only a ∪ sign? Answer: A ∩ B = (A̅ ∪ B̅)‾.
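These set operations can be tried out directly in code. The following minimal sketch (Python, which is not part of the original notes; the people and events are invented for illustration) uses Python's built-in set type to form a union, an intersection and a complement within a small sample space.

```python
# A small sample space: five people and the transport they used today.
omega = {"Ann", "Bob", "Cat", "Dan", "Eve"}     # sample space Ω
bus   = {"Ann", "Bob"}                          # event: used the bus
car   = {"Bob", "Cat"}                          # event: used a car
walk  = {"Dan"}                                 # event: walked

motor_vehicle = bus | car        # union:        Bus ∪ Car = {"Ann", "Bob", "Cat"}
bus_and_car   = bus & car        # intersection: Bus ∩ Car = {"Bob"}
not_walk      = omega - walk     # complement:   everything in Ω except Walk

print(motor_vehicle, bus_and_car, not_walk)
```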

Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

Example: [Venn diagrams for three events A, B, C: (a) A ∪ B ∪ C shaded; (b) A ∩ B ∩ C shaded.]

Properties of union, intersection, and complement

The following properties hold and can be checked with Venn diagrams, although Venn diagrams are not used to provide formal proofs.
(i) ∅̅ = Ω and Ω̅ = ∅.
(ii) For any event A, A ∪ A̅ = Ω, and A ∩ A̅ = ∅.
(iii) Commutative: for any events A and B, A ∪ B = B ∪ A, and A ∩ B = B ∩ A.
(iv) (a) (A ∪ B)‾ = A̅ ∩ B̅.  (b) (A ∩ B)‾ = A̅ ∪ B̅.

Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then a × (b + c) = a × b + a × c. However, addition is not distributive over multiplication: a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union. Thus, for any sets A, B, and C:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C), and
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

[Venn diagrams illustrating the two distributive laws for events A, B, C.]

More generally, for several events A and B1, B2, . . . , Bn:
A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn), i.e. A ∪ (⋂_{i=1}^{n} Bi) = ⋂_{i=1}^{n} (A ∪ Bi);
A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn), i.e. A ∩ (⋃_{i=1}^{n} Bi) = ⋃_{i=1}^{n} (A ∩ Bi).

1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.
This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

[Venn diagram: two non-overlapping events A and B.]

Note: Does this mean that A and B are independent? No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

[Venn diagram: non-overlapping events A1, A2, A3.]

Definition: A partition of the sample space Ω is a collection of mutually exclusive events whose union is Ω. That is, sets B1, B2, . . . , Bk form a partition of Ω if Bi ∩ Bj = ∅ for all i, j with i ≠ j, and ⋃_{i=1}^{k} Bi = B1 ∪ B2 ∪ . . . ∪ Bk = Ω.

Examples:
B1, B2, B3, B4 form a partition of Ω: [Venn diagram].  B1, . . . , B5 partition Ω: [Venn diagram].
Important: B and B̅ partition Ω for any event B: [Venn diagram].

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω. If B1, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

[Venn diagram: A overlapping a partition B1, B2, B3, B4 of Ω.]

We will see that this is very useful for finding the probability of event A. This is because it is often easier to find the probability of small 'chunks' of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.

1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow 'measuring chance'. Chance of what? The answer is sets. It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. 'Probability', the name that we give to our chance-measure, is a way of measuring sets.

If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains? We have the same question when setting out to measure chance.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it? In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!). But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they be the same probability?
First set: {Lions win}.  Second set: {All Blacks win}.
Both sets have just one element, but we definitely need to give them different probabilities!

More problems arise when the sets are infinite or continuous. Should the intervals [3, 4] and [13, 14] be the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20], but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.

Most of this course is about probability distributions. A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space. At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution: P(Lions win) = 0.01, P(All Blacks win) = 0.99.

In general, we have the following definition for discrete sample spaces.

Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space. A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:
1. 0 ≤ pi ≤ 1 for all i;
2. Σ_i pi = 1.
We write pi = P(si): pi is called the probability of the event that the outcome is si.

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:
P(A) = Σ_{i∈A} pi.
E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
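As a quick illustration of the definition above, the sketch below (Python is assumed and is not part of the original notes; the distribution used is the two-point rugby example) stores a discrete probability distribution as a dictionary and computes P(A) by summing the probabilities of the sample points in A.

```python
# A discrete probability distribution: each sample point gets a probability,
# the probabilities lie in [0, 1], and they sum to 1.
f = {"Lions win": 0.01, "All Blacks win": 0.99}
assert abs(sum(f.values()) - 1) < 1e-12

def prob(event, dist):
    """P(A) = sum of the probabilities of the sample points in A."""
    return sum(dist[s] for s in event)

A = {"All Blacks win"}
print(prob(A, f))   # 0.99
```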

Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we can not list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course. However, the same principle applies: a continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω. The challenge of defining a probability distribution on a continuous sample space is left till later.

Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.
Axiom 1: P(Ω) = 1.
Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.
Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

Note: The axioms can never be 'proved': they are definitions.
Note: P(∅) = 0.
Note: Remember that an EVENT is a SET: an event is a subset of the sample space.

If our rule for 'measuring sets' satisfies the three axioms, it is a valid probability distribution. It should be clear that the definitions given for discrete sample spaces above will satisfy the axioms.

1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:
1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.
2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:
If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula, which holds for ANY events A, B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
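The union formula can be checked by brute-force enumeration on a small sample space. The following sketch (Python assumed, not part of the original notes; the events A and B are chosen arbitrarily for illustration) lists the 36 equally likely outcomes of throwing two dice and confirms that P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

```python
from itertools import product
from fractions import Fraction

# Sample space for two dice; every outcome is equally likely.
omega = set(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))     # equally likely outcomes

A = {s for s in omega if s[0] + s[1] == 5}     # "sum of the two faces is 5"
B = {s for s in omega if s[0] == 1}            # "first die shows a 1"

print(P(A | B))                                # 1/4
print(P(A) + P(B) - P(A & B))                  # 1/4, matching the formula
```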

Explanation

For any events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The formal proof of this formula is in Section 1.9 (non-examinable). To understand the formula, think of the Venn diagrams:

[Venn diagrams: A and B overlapping; A shown together with B \ (A ∩ B).]

When we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B): P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + (P(B) − P(A ∩ B)).

2. Probability of an intersection

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.13). If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

P(A̅) = 1 − P(A).
This is obvious, but a formal proof is given in Section 1.9.

[Venn diagram: A and its complement A̅.]

. if B1 . . Bm which together cover everything in Ω. . Theorem 1. Bm form a partition of Ω. B1 B3 B2 B4 Also. .7 The Partition Theorem The Partition Theorem is one of the most useful tools for probability calculations.7: The Partition Theorem. . Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i = j . Sets B1 . . Bm form a partition of Ω. . Then for any event A.19 1. . then (A ∩ B1). Note: Recall the formal deﬁnition of a partition. . Ω Recall that a partition of Ω is a collection of non-overlapping sets B1 . . . B2.9. (A ∩ Bm ) form a partition of the set or event A. i=1 . m P(A) = i=1 P(A ∩ Bi ). . . . . . . . B1 B2 A ∩ B1 A ∩ B2 A B3 B4 A ∩ B3 A ∩ B4 The probability of event A is therefore the sum of its parts: P(A) = P(A ∩ B1) + P(A ∩ B2 ) + P(A ∩ B3) + P(A ∩ B4 ).) Let B1 . . The Partition Theorem is a mathematical way of saying the whole is the sum of its parts. . It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts. (Proof in Section 1. and m Bi = Ω.

1.8 Examples of basic probability calculations

An Australian survey asked people what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car. 33% of respondents had children.
Find the probability that a respondent:
(a) chose a large car;
(b) either had children or chose a large car (or both).

First formulate events: Let C = "has children", C̅ = "no children", L = "chooses large car".
Next write down all the information given:
P(C) = 0.33, P(C ∩ L) = 0.13, P(C̅ ∩ L) = 0.12.

(a) Asked for P(L).
P(L) = P(L ∩ C) + P(L ∩ C̅) = P(C ∩ L) + P(C̅ ∩ L) = 0.13 + 0.12 = 0.25. (Partition Theorem)
So P(chooses large car) = 0.25.

(b) Asked for P(L ∪ C).
P(L ∪ C) = P(L) + P(C) − P(L ∩ C) = 0.25 + 0.33 − 0.13 = 0.45. (Section 1.6)

Example 2: Facebook statistics for New Zealand university students aged between 18 and 24 suggest that 22% are interested in music, while 34% are interested in sport.
Formulate events: M = "interested in music", S = "interested in sport".
Information given: P(M) = 0.22, P(S) = 0.34.

(a) What is P(M̅)?
P(M̅) = 1 − P(M) = 1 − 0.22 = 0.78.

(b) What is P(M ∩ S)?
We can not calculate P(M ∩ S) from the information given.

(c) Given the further information that 48% of the students are interested in neither music nor sport, find P(M ∪ S) and P(M ∩ S).
Information given: P(M̅ ∩ S̅) = P((M ∪ S)‾) = 0.48. [Venn diagram: M and S.]
Thus P(M ∪ S) = 1 − P((M ∪ S)‾) = 1 − 0.48 = 0.52.
P(M ∩ S) = P(M) + P(S) − P(M ∪ S) = 0.22 + 0.34 − 0.52 = 0.04. (Section 1.6)
Only 4% of students are interested in both music and sport.

(d) Find the probability that a student is interested in music, but not sport.
P(M ∩ S̅) = P(M) − P(M ∩ S) = 0.22 − 0.04 = 0.18. (Partition Theorem)

1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.
(i) P(∅) = 0.
(ii) P(A̅) = 1 − P(A) for any event A.
(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,
P(A) = Σ_{i=1}^{m} P(A ∩ Bi).
(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A, B.

Proof:
i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive).
So P(A) = P(A ∪ ∅) = P(A) + P(∅) (Axiom 3), hence P(∅) = 0.
ii) Ω = A ∪ A̅, and A ∩ A̅ = ∅ (mutually exclusive).
So 1 = P(Ω) = P(A ∪ A̅) = P(A) + P(A̅). (Axiom 1; Axiom 3)

iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.
Thus (A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅ for i ≠ j, i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also.
So,
Σ_{i=1}^{m} P(A ∩ Bi) = P(⋃_{i=1}^{m} (A ∩ Bi))  (Axiom 3)
= P(A ∩ ⋃_{i=1}^{m} Bi)  (Distributive laws)
= P(A ∩ Ω)
= P(A).

iv) A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω)  (Set theory)
= (A ∩ (B ∪ B̅)) ∪ (B ∩ (A ∪ A̅))  (Set theory)
= (A ∩ B) ∪ (A ∩ B̅) ∪ (B ∩ A) ∪ (B ∩ A̅)  (Distributive laws)
= (A ∩ B) ∪ (A ∩ B̅) ∪ (A̅ ∩ B).
These 3 events are mutually exclusive: e.g. (A ∩ B) ∩ (A ∩ B̅) = A ∩ (B ∩ B̅) = A ∩ ∅ = ∅, etc.
So,
P(A ∪ B) = P(A ∩ B) + P(A ∩ B̅) + P(A̅ ∩ B)  (Axiom 3)
= P(A ∩ B) + [P(A) − P(A ∩ B)]  (from (iii), using B and B̅)
  + [P(B) − P(A ∩ B)]  (from (iii), using A and A̅)
= P(A) + P(B) − P(A ∩ B).

1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem. Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.
Let event A = "get a 6". Let event B = "get 4 or better".
If the die is fair, then P(A) = 1/6 and P(B) = 1/2.
However, if we know that B has occurred, then there is an increased chance that A has occurred:
P(A occurs given that B has occurred) = (result 6) / (result 4 or 5 or 6) = 1/3.
We write P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?
P(B | A) = P(B occurs, given that A has occurred) = P(get ≥ 4, given that we know we got a 6) = 1.

Conditioning as reducing the sample space

Sally wants to use Facebook to find a boyfriend at Uni. Her friend Kate tells her not to bother, because there are more women than men on Facebook. Here are the true 2009 figures for Facebook users at the University of Auckland:

| Relationship status | Male | Female | Total |
|---------------------|------|--------|-------|
| Single              | 660  | 660    | 1320  |
| In a relationship   | 520  | 840    | 1360  |
| Total               | 1180 | 1500   | 2680  |

Before we go any further . . . do you agree with Kate?
No, because out of the SINGLE people on Facebook, there are exactly the same number of men as women!

Conditioning is all about the sample space of interest. The table above shows the following sample space: Ω = {Facebook users at UoA}. Define event M to be: M = {person is male}. Kate is referring to the following probability:
P(M) = (# M's) / (total # in table) = 1180/2680 = 0.44.
Kate is correct that there are more women than men on Facebook, but she is using the wrong sample space so her answer is not relevant.

But the sample space that should interest Sally is different: it is S = {members of Ω who are SINGLE}. Suppose we pick a person from those in the table.

Now suppose we reduce our sample space from Ω = {everyone in the table} to S = {single people in the table}. Then
P(person is male, given that the person is single) = (# single males) / (# singles) = (# who are M and S) / (# who are S) = 660/1320 = 0.5.
We write: P(M | S) = 0.5.
This is the probability that Sally is interested in, and she can rest assured that there are as many single men as single women out there.

Example: Define event R that a person is in a relationship. What is the proportion of males among people in a relationship, P(M | R)?
P(M | R) = (# males in a relationship) / (# in a relationship) = (# who are M and R) / (# who are R) = 520/1360 = 0.38.

We could follow the same working for any pair of events, A and B:
P(A | B) = (# who are A and B) / (# who are B)
= [(# who are A and B) / (# in Ω)] / [(# who are B) / (# in Ω)]
= P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by
P(A | B) = P(A ∩ B) / P(B).
Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of B's only). P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space. Conditioning on event B means changing the sample space to B. Think of P(A | B) as the chance of getting an A, from the set of B's only.
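The definition can be verified against the Facebook table: dividing counts within the reduced sample space gives the same answer as P(A ∩ B)/P(B). A small sketch (Python assumed, not part of the original notes; the helper names are invented) using the counts from the table:

```python
# Counts from the Facebook table: (relationship status, sex) -> count.
counts = {("single", "male"): 660, ("single", "female"): 660,
          ("relationship", "male"): 520, ("relationship", "female"): 840}
total = sum(counts.values())                      # 2680

def P(pred):
    """Probability of an event, given as a predicate on (status, sex)."""
    return sum(n for key, n in counts.items() if pred(key)) / total

M = lambda k: k[1] == "male"                      # person is male
S = lambda k: k[0] == "single"                    # person is single

p_M_given_S = P(lambda k: M(k) and S(k)) / P(S)   # P(M ∩ S) / P(S)
print(round(p_M_given_S, 2))                      # 0.5, as in the notes
```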

Note: In the Facebook example, we found that P(M | S) = 0.5, and P(M | R) = 0.38. This means that a single person on UoA Facebook is equally likely to be male or female, but a person in a relationship is much more likely to be female than male! Why the difference? Your guess is as good as mine, but I think it's because men in a relationship are far too busy buying flowers for their girlfriends to have time to spend on Facebook.

The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1. This shows that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω. If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B). The symbol P( · | B) should behave exactly like the symbol P. For example:
P(C ∪ D) = P(C) + P(D) − P(C ∩ D), so P(C ∪ D | B) = P(C | B) + P(D | B) − P(C ∩ D | B).

Trick for checking conditional probability calculations: A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true. For example, is P(A | B) + P(A̅ | B) = 1?
Answer: Replace B by Ω: this gives P(A | Ω) + P(A̅ | Ω) = P(A) + P(A̅) = 1. So, yes, P(A | B) + P(A̅ | B) = 1 for any other sample space B.

Is P(A | B) + P(A | B̅) = 1?
Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̅. The expression is NOT true. It doesn't make sense to try to add together probabilities from two different sample spaces.

The Multiplication Rule

For any events A and B,
P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).
Proof: Immediate from the definitions:
P(A | B) = P(A ∩ B)/P(B) ⇒ P(A ∩ B) = P(A | B)P(B), and
P(B | A) = P(B ∩ A)/P(A) ⇒ P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).

New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: If B1, . . . , Bm partition Ω, then for any event A,
P(A) = Σ_{i=1}^{m} P(A ∩ Bi) = Σ_{i=1}^{m} P(A | Bi)P(Bi).
Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^{m} P(A | Bi)P(Bi).

Warning: Be sure to use this new version of the Partition Theorem correctly: it is
P(A) = P(A | B1)P(B1) + . . . + P(A | Bm)P(Bm),
NOT P(A) = P(A | B1) + . . . + P(A | Bm).

Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.) Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it. Conditioning on an event is like pretending that you know that the event has happened. For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do, and add up the different possibilities:
P(work on time) = P(work on time | fine) × P(fine) + P(work on time | wet) × P(wet).

1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4. The sample space can be written as Ω = {bus journeys}. We can formulate events as follows: T = "on time", L = "late". From the information given, the events have probabilities: P(T) = 0.6, P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.
Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.

The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded.

(b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.
Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C.
P(C | T) = 0.5, P(C | L) = 0.7, P(N | C) = 0.8, P(N | C̅) = 0.4.

(d) Find the probability that the bus is crowded.
P(C) = P(C | T)P(T) + P(C | L)P(L) = 0.5 × 0.6 + 0.7 × 0.4 = 0.58. (Partition Theorem)

(e) Find the probability that the bus is noisy.
P(N) = P(N | C)P(C) + P(N | C̅)P(C̅) = 0.8 × 0.58 + 0.4 × (1 − 0.58) = 0.632. (Partition Theorem)
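The two Partition Theorem calculations in (d) and (e) are easy to re-do numerically. This sketch (Python assumed, not part of the original notes; the variable names are invented for illustration) reproduces them:

```python
# Bus example: condition on "on time" (T) versus "late" (L), which partition Ω.
p_T, p_L = 0.6, 0.4
p_C_given_T, p_C_given_L = 0.5, 0.7        # crowded, given on time / late
p_N_given_C, p_N_given_notC = 0.8, 0.4     # noisy, given crowded / not crowded

p_C = p_C_given_T * p_T + p_C_given_L * p_L            # Partition Theorem over {T, L}
p_N = p_N_given_C * p_C + p_N_given_notC * (1 - p_C)   # Partition Theorem over {C, not C}
print(round(p_C, 3), round(p_N, 3))                    # 0.58 and 0.632
```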

1.12 Bayes' Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:
P(B | A)P(A) = P(A | B)P(B).
Thus
P(B | A) = P(A | B)P(B) / P(A).   (⋆)
This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702-61), English clergyman and founder of Bayesian Statistics.

Bayes' Theorem allows us to "invert" the conditioning, i.e. to express P(B | A) in terms of P(A | B). This is very useful. For example, it might be easy to calculate P(later event | earlier event), but we might only observe the later event and wish to deduce the probability that the earlier event occurred: P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,
P(Bj | A) = P(A | Bj)P(Bj) / Σ_{i=1}^{m} P(A | Bi)P(Bi).   (Bayes' Theorem)
Proof: Immediate from (⋆) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^{m} P(A | Bi)P(Bi).

Special case of Bayes' Theorem when m = 2: use B and B̅ as the partition of Ω. Then
P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B̅)P(B̅)].

Example: The case of the Perfidious Gardener. Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perfidious gardener who will fail to water the rosebush with probability 2/3. Smith returns from holiday to find the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?

Solution:
First step: formulate events. Let D = "rosebush dies", W = "gardener waters rosebush", W̅ = "gardener fails to water rosebush".
Second step: write down all information given. P(D | W) = 1/2, P(D | W̅) = 3/4, P(W̅) = 2/3 (so P(W) = 1/3).
Third step: write down what we're looking for: P(W̅ | D).
Fourth step: compare this to what we know. We need to invert the conditioning, so use Bayes' Theorem:
P(W̅ | D) = P(D | W̅)P(W̅) / [P(D | W̅)P(W̅) + P(D | W)P(W)] = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3) = 3/4.
So the gardener failed to water the rosebush with probability 3/4.
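The two-set special case of Bayes' Theorem is simple enough to wrap in a small helper. The sketch below (Python assumed, not part of the original notes; the function name bayes is invented) reproduces the gardener calculation exactly, using fractions to avoid rounding:

```python
from fractions import Fraction

def bayes(p_A_given_B, p_B, p_A_given_notB):
    """P(B | A) using the two-set partition {B, not B} (special case of Bayes' Theorem)."""
    numerator = p_A_given_B * p_B
    return numerator / (numerator + p_A_given_notB * (1 - p_B))

# Perfidious Gardener: B = "gardener fails to water", A = "rosebush dies".
p = bayes(Fraction(3, 4), Fraction(2, 3), Fraction(1, 2))
print(p)   # 3/4
```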

Example: The case of the Defective Ketchup Bottle. Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her wig. What is the probability that it came from Factory 1?

Solution:
1. Events: let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".
2. Information given: P(F1) = 0.5, P(F2) = 0.3, P(F3) = 0.2; P(D | F1) = 0.004, P(D | F2) = 0.006, P(D | F3) = 0.012.
3. Looking for: P(F1 | D) (so need to invert conditioning).
4. Bayes' Theorem:
P(F1 | D) = P(D | F1)P(F1) / [P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3)]
= (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
= 0.002 / 0.0062
= 0.322.


1.13 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other. This means
P(A | B) = P(A) and P(B | A) = P(B).

Now P(A | B) = P(A ∩ B) / P(B), so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B).
We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if P(A ∩ B) = P(A)P(B).

For more than two events, we say:

Definition: Events A1, A2, . . . , An are mutually independent if
P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An),
AND the same multiplication rule holds for every subcollection of the events too.
E.g. events A1, A2, A3, A4 are mutually independent if
i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND
ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND
iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Note: If events are physically independent, then they will also be statistically independent.


Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.
1. IF A and B are statistically independent, then P(A ∩ B) = P(A) × P(B).
2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule: P(A ∩ B) = P(A | B)P(B). This still requires us to be able to calculate P(A | B).

Example: Toss a fair coin and a fair die together. The coin and die are physically independent.
Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6} - all 12 items are equally likely.
Let A = "heads" and B = "six".
Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2, and P(B) = P({H6, T6}) = 2/12 = 1/6.
Now P(A ∩ B) = P(Heads and 6) = P({H6}) = 1/12.
But P(A) × P(B) = 1/2 × 1/6 = 1/12 also, so P(A ∩ B) = P(A)P(B) and thus A and B are statistically independent.
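The same check can be automated by enumerating the 12 equally likely outcomes. A sketch (Python assumed, not part of the original notes; the event definitions mirror those above):

```python
from itertools import product
from fractions import Fraction

# The 12 equally likely outcomes of tossing a coin and rolling a die together.
omega = set(product("HT", range(1, 7)))
P = lambda E: Fraction(len(E), len(omega))

A = {s for s in omega if s[0] == "H"}   # "heads"
B = {s for s in omega if s[1] == 6}     # "six"

print(P(A & B), P(A) * P(B))            # both 1/12, so A and B are independent
```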


Pairwise independence does not imply mutual independence

Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random. Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent? Consider P(A ∩ B) = 1/4 (one of 4 balls has both red and white on it). But P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B). Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C). So A, B and C are pairwise independent.

Mutually independent? Consider P(A ∩ B ∩ C) = 1/4 (one of 4 balls), while P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).

So A, B and C are NOT mutually independent, despite being pairwise independent.
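The four-ball example can be verified directly. In the sketch below (Python assumed, not part of the original notes; the ball labels are invented), the events A, B and C are built from the colours on each ball, and the pairwise and three-way product rules are tested:

```python
from fractions import Fraction

# The four balls, described by the colours printed on each (ball 3 has all three).
colours = {0: {"red"}, 1: {"white"}, 2: {"blue"}, 3: {"red", "white", "blue"}}
P = lambda E: Fraction(len(E), len(colours))

A = {i for i, c in colours.items() if "red" in c}
B = {i for i, c in colours.items() if "white" in c}
C = {i for i, c in colours.items() if "blue" in c}

print(P(A & B) == P(A) * P(B))               # True: pairwise independent
print(P(A & C) == P(A) * P(C))               # True
print(P(B & C) == P(B) * P(C))               # True
print(P(A & B & C) == P(A) * P(B) * P(C))    # False: not mutually independent
```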

1.14 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:
1. 'Things that happen' are sets, also called events.
2. We measure chance by measuring sets, using a measure called probability.
Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces:
Ω = {head, tail}; Ω = {same, different}; Ω = {Lions win, All Blacks win}.

All of these sample spaces could be represented more concisely in terms of numbers: Ω = {0, 1}. On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes: for example, the number of heads from 10 tosses of a coin; the number of girls in a three-child family; and so on.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:
1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);
2. quantify how much the outcomes tend to diverge from the average value;
3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?)
The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes.

A random experiment whose possible outcomes are real numbers is called a random variable.

In fact, any random experiment can be made to have outcomes that are real numbers, simply by mapping the sample space Ω onto a set of real numbers using a function. For example: function X : Ω → R with X("Lions win") = 0, X("All Blacks win") = 1.

This gives us our formal definition of a random variable:

Definition: A random variable (r.v.) is a function from a sample space Ω to the real numbers R. We write X : Ω → R.

Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable equates to a random experiment whose outcomes are numbers. A random variable produces random real numbers as the 'outcome' of a random experiment.

Example: Toss a coin 3 times. The sample space is Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
One example of a random variable is X : Ω → R such that, for sample point si, we have X(si) = # heads in outcome si. So X(HHH) = 3, X(THT) = 1, etc.
Another example is Y : Ω → R such that Y(si) = 1 if the 2nd toss is a head, 0 otherwise. Then Y(HTH) = 0, Y(THH) = 1, Y(HHH) = 1, and so on.

Defining random variables serves the dual purposes of:
1. Describing many different sample spaces in the same terms: e.g. Ω = {0, 1} with P(1) = p and P(0) = 1 − p describes EVERY possible experiment with two outcomes.
2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.

Probabilities for random variables

By convention, we use CAPITAL LETTERS for random variables (e.g. X), and lower case letters to represent the values that the random variable takes (e.g. x).
For a sample space Ω and random variable X : Ω → R, and for a real number x,
P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}).

Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8.
Let X : Ω → R, such that X(s) = # heads in s. Then
P(X = 0) = P({TTT}) = 1/8,
P(X = 1) = P({HTT, THT, TTH}) = 3/8,
P(X = 2) = P({HHT, HTH, THH}) = 3/8,
P(X = 3) = P({HHH}) = 1/8.
Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.

Independent random variables

Random variables X and Y are independent if each does not affect the other. Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if
P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y) for all possible values x and y.
We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y). From now on, we will use the following notations interchangeably:
P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).
Thus X and Y are independent if and only if P(X = x, Y = y) = P(X = x)P(Y = y) for ALL possible values x, y.
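The computation of P(X = x) by summing over the sample points {s : X(s) = x} can be mirrored in code. A sketch (Python assumed, not part of the original notes) for X = number of heads in 3 tosses of a fair coin:

```python
from itertools import product
from fractions import Fraction

# Sample space for 3 tosses of a fair coin; each outcome has probability 1/8.
omega = ["".join(s) for s in product("HT", repeat=3)]

def X(s):
    """The random variable X: number of heads in outcome s."""
    return s.count("H")

def prob_X_equals(x):
    """P(X = x) = P({s : X(s) = x})."""
    return Fraction(sum(1 for s in omega if X(s) == x), len(omega))

print([str(prob_X_equals(x)) for x in range(4)])   # ['1/8', '3/8', '3/8', '1/8']
```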

1.15 Key Probability Results for Chapter 1

1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability: P(A | B) = P(A ∩ B) / P(B) for any A, B. Or: P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write
P(A | B) = P(B | A)P(A) / P(B).
This is a simplified version of Bayes' Theorem. It shows how to 'invert' the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem slightly more generalized: for any A, B,
P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | A̅)P(A̅)].
This works because A and A̅ form a partition of the sample space.

5. Complete version of Bayes' Theorem: If sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then
P(Aj | B) = P(B | Aj)P(Aj) / [P(B | A1)P(A1) + . . . + P(B | Am)P(Am)] = P(B | Aj)P(Aj) / Σ_{i=1}^{m} P(B | Ai)P(Ai).

6. Partition Theorem: if A1, . . . , Am partition the sample space, then
P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).
This can also be written as:
P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + . . . + P(B | Am)P(Am).
These are both very useful formulations.

7. Chains of events: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

8. Statistical independence: if A and B are independent, then
P(A ∩ B) = P(A)P(B) and P(A | B) = P(A) and P(B | A) = P(B).

9. Conditional probability: If P(B) > 0, then we can treat P( · | B) just like P: e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2)); and if A1, . . . , Am form a partition of the sample space, then P(A1 | B) + P(A2 | B) + . . . + P(Am | B) = 1.
(Note: it is not generally true that P(A | B) = 1 − P(A | B̅).)
The fact that P( · | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

10. Unions: For any A, B, C,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).

1.16 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.

Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that: (a) they are both white; (b) the second ball is red.

Solution: Let event Wi = "ith ball is white" and Ri = "ith ball is red".

a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1). Now P(W1) = 4/6 and P(W2 | W1) = 3/5.
So P(both white) = P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.

b) Looking for P(2nd ball is red). To find this, we have to condition on what happened in the first draw.
Event "2nd ball is red" is actually event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2). So
P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)  (mutually exclusive)
= P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
= 2/5 × 4/6 + 1/5 × 2/6 = 1/3.

Probability trees

Probability trees are a graphical way of representing the multiplication rule.

[Probability tree: First Draw branches to W1 with P(W1) = 4/6 and to R1 with P(R1) = 2/6; Second Draw branches with P(W2 | W1) = 3/5, P(R2 | W1) = 2/5, P(W2 | R1) = 4/5, P(R2 | R1) = 1/5.]

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g. P(W1 ∩ W2) = 4/6 × 3/5, or P(R1 ∩ W2) = 2/6 × 4/5.

More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:
P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
= P(A3 | A1 ∩ A2)P(A1 ∩ A2)  (multiplication rule)
= P(A3 | A1 ∩ A2)P(A2 | A1)P(A1)  (multiplication rule)
Remember as: P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1).

On the probability tree: [tree diagram with branch probabilities P(A1), P(A2 | A1), P(A3 | A2 ∩ A1), whose product gives P(A1 ∩ A2 ∩ A3).]

In general, for n events A1, A2, . . . , An, we have
P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1) . . . P(An | An−1 ∩ . . . ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?
Answer: P(W1 ∩ R2 ∩ W3) = P(W1)P(R2 | W1)P(W3 | R2 ∩ W1)
= [w/(w + r)] × [r/(w + r − 1)] × [(w − 1)/(w + r − 2)].

1.17 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:
i) Ω = {s1, s2, . . . , sk};
ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;
iii) event A = {s1, s2, . . . , sr} contains r possible outcomes,
then P(A) = (# outcomes in A) / (# outcomes in Ω) = r/k.

Example: For a 3-child family, possible outcomes from oldest to youngest are:
Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, . . . , s8}.
Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = . . . = p8 = 1/8.
Let event A be A = "oldest child is a girl". Then A = {GGG, GGB, GBG, GBB}.
Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects. For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, . . .

1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices. That is, choice (a, b, c) counts separately from choice (b, a, c).
Then
nPr = n(n − 1)(n − 2) . . . (n − r + 1) = n! / (n − r)!
(n choices for the first object, (n − 1) choices for the second, etc.)

2. Number of Combinations, nCr

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice. That is, choice (a, b, c) and choice (b, a, c) are the same.
Then
nCr = nPr / r! = n! / ((n − r)! r!)
(because nPr counts each permutation r! times, and we only want to count it once: so divide nPr by r!)

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.
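For reference, Python's standard library (assumed here, not part of the original notes; available from Python 3.8) exposes these two counting rules directly as math.perm and math.comb:

```python
import math

n, r = 5, 3
print(math.perm(n, r))   # 60: orderings count as different choices, 5 x 4 x 3
print(math.comb(n, r))   # 10: orderings count as the same choice, 60 / 3!
```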

Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?
Order of cards is not important, so use combinations. Number of ways of selecting 5 distinct designs from 12 is 12C5 = 12! / ((12 − 5)! 5!) = 792.

(b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?
Looking for P(at least 2 cards the same) = P(A) (say). Easiest to find P(all 5 cards are different) = P(A̅).
Total number of outcomes is (total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36. (40 choices for the first card, 39 for the second, etc. Note that order matters: e.g. we are counting choice 12345 separately from 23154.)
Number of outcomes in A̅ is (# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24. (40 choices for the first card, 36 for the second, because the 4 cards with the first design are excluded, etc. Order mattered above, so we need order to matter here too.)
So P(A̅) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.
Thus P(A) = P(at least 2 cards are the same design) = 1 − P(A̅) = 1 − 0.392 = 0.608.

Alternative solution if order does not matter on numerator and denominator (much harder method):
P(A̅) = [10C5 × 4^5] / 40C5.
This works because there are 10C5 ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups, so the total number of ways of choosing 5 cards of different designs is 10C5 × 4^5. The total number of ways of choosing 5 cards from 40 is 40C5.
Exercise: Check that this gives the same answer for P(A̅) as before.

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.
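Both methods can be checked numerically. A sketch (Python assumed, not part of the original notes) computing P(all 5 cards are different) by the ordered rule and by the unordered rule, and then the required probability:

```python
import math

# Method 1: order matters in both numerator and denominator.
num_ordered = 40 * 36 * 32 * 28 * 24       # 5 cards, all of different designs
den_ordered = 40 * 39 * 38 * 37 * 36       # any 5 cards
p_all_different = num_ordered / den_ordered

# Method 2: order ignored in both numerator and denominator.
p_all_different_2 = math.comb(10, 5) * 4**5 / math.comb(40, 5)

print(round(p_all_different, 3), round(p_all_different_2, 3))   # both 0.392
print(round(1 - p_all_different, 3))                            # 0.608 = P(at least two the same)
```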

Chapter 2: Discrete Probability Distributions

2.1 Introduction

In the next two chapters we meet several important concepts:

1. Probability distributions, and the probability function fX(x):
• the probability function of a random variable lists the values the random variable can take, and their probabilities.

2. Hypothesis testing:
• I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

3. Likelihood and estimation:
• what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable:
• the expectation of a random variable is the value it takes on average;
• the variance of a random variable measures how much the random variable varies about its average.

5. Change of variable procedures:
• calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X^2.

6. Modelling:
• we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.

2.2 The probability function, fX(x)

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, e.g. {0, 1, 2, . . .}.
The probability function fX(x) lists all possible values of X, and gives a probability to each value.

Example: Which car?
Random experiment: which car? The choice is a Ferrari with probability 1/6, a Porsche with probability 1/6, or an MG with probability 4/6.
Random variable: X. X gives numbers to the possible outcomes. If he chooses: Ferrari ⇒ X = 1; Porsche ⇒ X = 2; MG ⇒ X = 3.
Probability function, fX(x):

| Outcome           | Ferrari | Porsche | MG  |
|-------------------|---------|---------|-----|
| x                 | 1       | 2       | 3   |
| fX(x) = P(X = x)  | 1/6     | 1/6     | 4/6 |

Definition: The probability function, fX(x), for a discrete random variable X, is given by
fX(x) = P(X = x), for all possible outcomes x of X.

We write: P(X = 1) = fX(1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.

We can also write the probability function as:
fX(x) = 1/6 if x = 1; 1/6 if x = 2; 4/6 if x = 3; 0 otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then
X = 0 with probability 0.5, and X = 1 with probability 0.5.
The probability function of X is given by:

x:                    0    1
fX(x) = P(X = x):     0.5  0.5

or: fX(x) = 0.5 if x = 0; 0.5 if x = 1; 0 otherwise.

We write (e.g.) fX(0) = 0.5, fX(1) = 0.5, fX(7.5) = 0, etc. In fact, fX(x) is just a list of probabilities.

Properties of the probability function
i) 0 ≤ fX(x) ≤ 1 for all x: probabilities are always between 0 and 1.
ii) Σx fX(x) = 1: probabilities add to 1 overall.
iii) P(X ∈ A) = Σ_{x in A} fX(x).
e.g. in the car example, P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6.
This is the probability of choosing either a Ferrari or a Porsche.

2.3 Bernoulli trials

Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologian and Jean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:
i) Each trial has only 2 possible outcomes (usually called "Success" and "Failure");
ii) The probability of success, p, remains constant for all trials;
iii) The trials are independent, i.e. the event "success in trial i" does not depend on the outcome of any other trials.

Examples:
1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.
2) Repeated tossing of a fair die: success = "6", failure = "not 6". Each toss is a Bernoulli trial with P(success) = 1/6.

Definition: The random variable Y is called a Bernoulli random variable if it takes only 2 values, 0 and 1. The probability function is
fY(y) = p if y = 1; 1 − p if y = 0.
That is, P(Y = 1) = P("success") = p, and P(Y = 0) = P("failure") = 1 − p.

2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials, each with probability of success p. Then X has the Binomial distribution with parameters n and p. We write X ∼ Binomial(n, p), or X ∼ Bin(n, p).

Thus X ∼ Binomial(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

Probability function
If X ∼ Binomial(n, p), then the probability function for X is

fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x),  for x = 0, 1, . . . , n.

Explanation: An outcome with x successes and (n − x) failures has probability
p^x (1 − p)^(n−x),
where (1) the x successes each have probability p, and (2) the (n − x) failures each have probability (1 − p).

There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select which x of the n trials are to be our "successes". Thus

P(#successes = x) = (#outcomes with x successes) × (prob. of each such outcome)
                  = (n choose x) p^x (1 − p)^(n−x).

Note: fX(x) = 0 if x ∉ {0, 1, 2, . . . , n}.

Check that Σ_{x=0}^{n} fX(x) = 1:
Σ_{x=0}^{n} fX(x) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) = [p + (1 − p)]^n   (Binomial Theorem)
                  = 1^n = 1.

It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.

Example 1: Let X ∼ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x:                    0       1       2       3       4
fX(x) = P(X = x):     0.4096  0.4096  0.1536  0.0256  0.0016

Example 2: Let X be the number of times I get a '6' out of 10 rolls of a fair die.
1. What is the distribution of X?  X ∼ Binomial(n = 10, p = 1/6).
2. What is the probability that X ≥ 2?
P(X ≥ 2) = 1 − P(X < 2) = 1 − P(X = 0) − P(X = 1)
         = 1 − (10 choose 0)(1/6)^0 (1 − 1/6)^10 − (10 choose 1)(1/6)^1 (1 − 1/6)^9
         = 0.515.

Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?
Assume: (i) each child is equally likely to be a boy or a girl; (ii) all children are independent of each other.
Then X ∼ Binomial(n = 3, p = 0.5).
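A quick R check of Example 2 (a sketch, not from the notes; dbinom and pbinom are base R):

1 - dbinom(0, 10, 1/6) - dbinom(1, 10, 1/6)   # 0.515...
1 - pbinom(1, 10, 1/6)                        # same answer via the c.d.f.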

Shape of the Binomial distribution

The shape of the Binomial distribution depends upon the values of n and p. The probability functions for various values of n and p are shown below.
[Figure: probability functions for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1.

Sum of independent Binomial random variables:
If X and Y are independent, with X ∼ Binomial(n, p) and Y ∼ Binomial(m, p), then
X + Y ∼ Bin(n + m, p).
This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.
Note: X and Y must both share the same value of p.

2.5 The cumulative distribution function, FX(x)

We have defined the probability function, fX(x), as fX(x) = P(X = x). The probability function tells us everything there is to know about X.

The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.

Definition: The (cumulative) distribution function (c.d.f.) is
FX(x) = P(X ≤ x),  for −∞ < x < ∞.

If you are asked to 'give the distribution of X', you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

The distribution function FX(x) as a probability sweeper
The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.
[Figure: probability functions of X ∼ Bin(10, 0.5) and X ∼ Bin(10, 0.9).]

Example: Let X ∼ Binomial(2, 1/2).

x:                    0    1    2
fX(x) = P(X = x):     1/4  1/2  1/4

Then FX(x) = P(X ≤ x) =
  0                        if x < 0,
  0.25                     if 0 ≤ x < 1,
  0.25 + 0.5 = 0.75        if 1 ≤ x < 2,
  0.25 + 0.5 + 0.25 = 1    if x ≥ 2.

[Figure: the probability function f(x) and the step function F(x) for X ∼ Binomial(2, 1/2).]

FX(x) gives the cumulative probability up to and including point x. So
FX(x) = Σ_{y ≤ x} fX(y).

Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.
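The same step function can be read off directly in R with pbinom, the command used later in these notes (a small sketch, not from the original text):

pbinom(c(-1, 0, 0.5, 1, 1.9, 2, 5), size = 2, prob = 0.5)
# 0.00 0.25 0.25 0.75 0.75 1.00 1.00  -- constant between the jumps at 0, 1 and 2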

Reading off probabilities from the distribution function

As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities:

fX(x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1) = FX(x) − FX(x − 1)   (if X takes integer values).

This is why the distribution function FX(x) contains as much information as the probability function, fX(x): we can use either one to find the other.

In general:  P(a < X ≤ b) = FX(b) − FX(a)  if b > a.

Proof: P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b) if b > a.
So FX(b) = FX(a) + P(a < X ≤ b), giving FX(b) − FX(a) = P(a < X ≤ b).

Warning: endpoints
Be careful of endpoints and the difference between ≤ and <. For example, P(X < 10) = P(X ≤ 9) = FX(9).

Examples: Let X ∼ Binomial(100, 0.4). In terms of FX(x), what is:
1. P(X ≤ 30)?        FX(30).
2. P(X < 30)?        FX(29).
3. P(X ≥ 56)?        1 − P(X < 56) = 1 − P(X ≤ 55) = 1 − FX(55).
4. P(X > 42)?        1 − P(X ≤ 42) = 1 − FX(42).
5. P(50 ≤ X ≤ 60)?   P(X ≤ 60) − P(X ≤ 49) = FX(60) − FX(49).

Properties of the distribution function
1) F(−∞) = P(X ≤ −∞) = 0, and F(+∞) = P(X ≤ +∞) = 1. (These are true because values are strictly between −∞ and ∞.)
2) FX(x) is a non-decreasing function of x: if x1 < x2, then FX(x1) ≤ FX(x2).
3) P(a < X ≤ b) = FX(b) − FX(a) if b > a.
4) F is right-continuous: lim_{h↓0} F(x + h) = F(x).

[Figure: probabilities for X ∼ Binomial(n = 10, p = 0.5).]

Probability of observing something at least as weird as 9 heads, if the coin is fair:
For X ∼ Binomial(10, 0.5), we need
P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
= (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10
= 0.00977 + 0.00098 + 0.00977 + 0.00098
= 0.021.

Is this weird? Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.
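The same two-sided calculation in R (a sketch using base commands, not part of the original notes):

sum(dbinom(c(0, 1, 9, 10), size = 10, prob = 0.5))   # 0.0215
2 * pbinom(1, 10, 0.5)                               # same answer, by symmetry of the fair coin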

Is the coin fair?
Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.
However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more. We can deduce that EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair.
In fact, this gives us some evidence that the coin is not fair. The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.

Formal hypothesis test
We now formalize the procedure above. Think of the steps:
• We have a question that we want to answer: Is the coin fair?
• There are two alternatives: 1. The coin is fair. 2. The coin is not fair.
• Our observed information is X, the number of heads out of 10 tosses. We observed X = 9.
• We write down the distribution of X if the coin is fair: X ∼ Binomial(10, 0.5).
• We calculate the probability of observing something AT LEAST AS EXTREME as our observation, if the coin is fair: prob = 0.021.
• The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.

Null hypothesis and alternative hypothesis
We express the steps above as two competing hypotheses.
Null hypothesis: the first alternative, that the coin IS fair: H0 : the coin is fair.
Alternative hypothesis: the second alternative, that the coin is NOT fair: H1 : the coin is NOT fair.
In hypothesis testing, we often use this same formulation.
• The null hypothesis is specific. It specifies an exact distribution for our observation: X ∼ Binomial(10, 0.5). Think of 'null hypothesis' as meaning the 'default': the hypothesis we will accept unless we have a good reason not to. We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.
• The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.
We use H0 and H1 to denote the null and alternative hypotheses respectively.
More precisely, we write:
Number of heads, X ∼ Binomial(10, p), and
H0 : p = 0.5 versus H1 : p ≠ 0.5.

p-values
In the hypothesis-testing framework above, we always measure evidence AGAINST the null hypothesis. That is, we believe that our coin is fair unless we see convincing evidence otherwise.
We measure the strength of evidence against H0 using the p-value. In the example above, the p-value was p = 0.021.
In general, the p-value is the probability of observing something AT LEAST AS EXTREME AS OUR OBSERVATION, if H0 is TRUE. This means that SMALL p-values represent STRONG evidence against H0.
Small p-values mean Strong evidence. Large p-values mean Little evidence.
In the example above, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5. It states that, if the null hypothesis is TRUE, we would only have a 2.1% chance of observing something as extreme as 9 heads or 9 tails. A p-value of 0.021 represents quite strong evidence against the null hypothesis. Many of us would see this as strong enough evidence to decide that the null hypothesis is not true.
Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021.

Interpreting the hypothesis test
There are different schools of thought about how a p-value should be interpreted.
• Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.
• Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.
• In this course we use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. However, it is worth bearing in mind that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful. This decision should usually be taken in the context of other scientific information.

Statistical significance
You have probably encountered the idea of statistical significance in other courses. Statistical significance refers to the p-value.
The result of a hypothesis test is significant at the 5% level if the p-value is less than 0.05. This means that the chance of seeing what we did see (9 heads), or more, is less than 5% if the null hypothesis is true.
Saying the test is significant is a quick way of saying that there is evidence against the null hypothesis, usually at the 5% level.

In the coin example, we can say that our test of H0 : p = 0.5 against H1 : p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.
This means:
• we have some evidence that p ≠ 0.5.
It does not mean:
• the difference between p and 0.5 is large, or
• the difference between p and 0.5 is important in practical terms.
Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

Beware! The p-value gives the probability of seeing something as weird as what we did see, if H0 is true. This means that 5% of the time, we will get a p-value < 0.05 EVEN WHEN H0 IS TRUE!! Indeed, about once in every thousand tests, we will get a p-value < 0.001, even though H0 is true! A small p-value does NOT mean that H0 is definitely wrong.

One-sided and two-sided tests
The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads. If we had a good reason, before tossing the coin, to believe that the binomial probability could only be = 0.5 or > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0 : p = 0.5 versus H1 : p > 0.5. This would have the effect of halving the resultant p-value.

2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker. Would you prefer sons? Easy! Just become a US president.
Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

The facts
• The 44 US presidents from George Washington to Barack Obama have had a total of 153 children, comprising 88 sons and only 65 daughters: a sex ratio of 1.4 sons for every daughter.
• Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

Could this happen by chance? Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?
This is the same as the question in Section 2.6.
For the presidents: If I tossed a coin 153 times and got only 65 heads, could I continue to believe that the coin was fair?
For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?

Hypothesis test for the presidents
We set up the competing hypotheses as follows.
Let X be the number of daughters out of 153 presidential children. Then X ∼ Binomial(153, p), where p is the probability that each child is a daughter.
Null hypothesis:        H0 : p = 0.5.
Alternative hypothesis: H1 : p ≠ 0.5.
p-value: We need the probability of getting a result AT LEAST AS EXTREME as X = 65 daughters, if H0 is true and p really is 0.5.
Which results are at least as extreme as X = 65?
X = 0, 1, 2, . . . , 65, for even fewer daughters, and
X = (153 − 65) = 88, 89, . . . , 153, for too many daughters, because we would be just as surprised if we saw ≤ 65 sons, i.e. ≥ 88 daughters.

[Figure: probabilities for X ∼ Binomial(n = 153, p = 0.5).]

Calculating the p-value
The p-value for the president problem is given by
P(X ≤ 65) + P(X ≥ 88), where X ∼ Binomial(153, 0.5).
In principle, we could calculate this as
P(X = 0) + P(X = 1) + . . . + P(X = 65) + P(X = 88) + . . . + P(X = 153)
= (153 choose 0)(0.5)^0(0.5)^153 + (153 choose 1)(0.5)^1(0.5)^152 + . . .
This would take a lot of calculator time! Instead, we use a computer with a package such as R.

R command for the p-value
The R command for calculating the lower-tail p-value for the Binomial(n = 153, p = 0.5) distribution is pbinom(65, 153, 0.5).
Typing this in R gives:
> pbinom(65, 153, 0.5)
[1] 0.03748079
This gives us the lower-tail p-value only: P(X ≤ 65) = 0.0375.
To get the overall p-value, we have two choices:
1. Multiply the lower-tail p-value by 2: 2 × 0.0375 = 0.0750.
In R:
> 2 * pbinom(65, 153, 0.5)
[1] 0.07496158

This works because the upper-tail p-value, by definition, is always going to be the same as the lower-tail p-value. The upper tail gives us the probability of finding something equally surprising at the opposite end of the distribution.
2. Calculate the upper-tail p-value explicitly:
The upper-tail p-value is P(X ≥ 88) = 1 − P(X < 88) = 1 − P(X ≤ 87) = 1 − pbinom(87, 153, 0.5).
In R:
> 1 - pbinom(87, 153, 0.5)
[1] 0.03748079
The overall p-value is the sum of the lower-tail and the upper-tail p-values:
pbinom(65, 153, 0.5) + 1 - pbinom(87, 153, 0.5) = 0.0375 + 0.0375 = 0.0750. (Same as before.)

Note: In the R command pbinom(65, 153, 0.5), the order in which you enter the numbers 65, 153, and 0.5 is important. If you enter them in a different order, you will get a wrong answer or an error. An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the terms in any order.

Note: The R command pbinom is equivalent to the cumulative distribution function for the Binomial distribution:
pbinom(65, 153, 0.5) = P(X ≤ 65) = FX(65), where X ∼ Binomial(153, 0.5).
The overall p-value in this example is 2 × FX(65).

Summary: are presidents more likely to have sons?
Back to our hypothesis test. Recall that X was the number of daughters out of 153 presidential children, and X ∼ Binomial(153, p), where p is the probability that each child is a daughter.
Null hypothesis:        H0 : p = 0.5.
Alternative hypothesis: H1 : p ≠ 0.5.
p-value: 2 × FX(65) = 0.075.
What does this mean? The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would only be a 7.5% chance of observing something as unusual as only 65 daughters out of the total 153 children. This is slightly unusual, but not very unusual.
We conclude that there is no real evidence that presidents are more likely to have sons than daughters. The observations are compatible with the possibility that there is no difference. Does this mean presidents are equally likely to have sons and daughters? No: the observations are also compatible with the possibility that there is a difference. We just don't have enough evidence either way.

Hypothesis test for the deep-sea divers
For the deep-sea divers, there were 190 children: 65 sons, and 125 daughters.
Let X be the number of sons out of 190 diver children. Then X ∼ Binomial(190, p), where p is the probability that each child is a son.
Note: We could just as easily formulate our hypotheses in terms of daughters instead of sons. Because pbinom is defined as a lower-tail probability, however, it is usually easiest to formulate them in terms of the low result (sons).

Null hypothesis:        H0 : p = 0.5.
Alternative hypothesis: H1 : p ≠ 0.5.
p-value: Probability of getting a result AT LEAST AS EXTREME as X = 65 sons, if H0 is true and p really is 0.5.
Results at least as extreme as X = 65 are:
X = 0, 1, 2, . . . , 65, for even fewer sons;
X = (190 − 65), . . . , 190, for the equally surprising result in the opposite direction (too many sons).

[Figure: probabilities for X ∼ Binomial(n = 190, p = 0.5).]

R command for the p-value
p-value = 2 × pbinom(65, 190, 0.5).
Typing this in R gives:
> 2*pbinom(65, 190, 0.5)
[1] 1.603136e-05
This is 0.000016, or a little more than one chance in 100 thousand.

We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters. We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.

What next?
p-values are often badly used in science and business. They are regularly treated as the end point of an analysis, after which no more work is needed. Many scientific journals insist that scientists quote a p-value with every set of results, and often only p-values less than 0.05 are regarded as 'interesting'. The outcome is that some scientists do every analysis they can think of until they finally come up with a p-value of 0.05 or less.
A good statistician will recommend a different attitude. It is very rare in science for numbers and statistics to tell us the full story.
Results like the p-value should be regarded as an investigative starting point, rather than the final conclusion. Why is the p-value small? What possible mechanism could there be for producing this result? If you were a medical statistician and you gave me a p-value, I would ask you for a mechanism. Don't accept that Drug A is better than Drug B only because the p-value says so: find a biochemist who can explain what Drug A does that Drug B doesn't. Don't accept that sun exposure is a cause of skin cancer on the basis of a p-value alone: find a mechanism by which skin is damaged by the sun.
Why might divers have daughters and presidents have sons? Deep-sea divers are thought to have more daughters than sons because the underwater work at high atmospheric pressure lowers the level of the hormone testosterone in the men's blood, which is thought to make them more likely to conceive daughters. For the presidents, your guess is as good as mine . . .

2.8 Example: Birthdays and sports professionals Have you ever wondered what makes a professional sports player? Talent? Dedication? Good coaching? Or is it just that they happen to have the right birthday. . . ? The following text is taken from Malcolm Gladwell’s book Outliers. It describes the play-by-play for the ﬁrst goal scored in the 2007 ﬁnals of the Canadian ice hockey junior league for star players aged 17 to 19. The two teams are the Tigers and Giants. There’s one slight diﬀerence . . . instead of the players’ names, we’re given their birthdays. March 11 starts around one side of the Tigers’ net, leaving the puck for his teammate January 4, who passes it to January 22, who ﬂips it back to March 12, who shoots point-blank at the Tigers’ goalie, April 27. April 27 blocks the shot, but it’s rebounded by Giants’ March 6. He shoots! Tigers defensemen February 9 and February 14 dive to block the puck while January 10 looks on helplessly. March 6 scores! Notice anything funny? Here are some ﬁgures. There were 25 players in the Tigers squad, born between 1986 and 1990. Out of these 25 players, 14 of them were born in January, February, or March. Is it believable that this should happen by chance, or do we have evidence that there is a birthday-eﬀect in becoming a star ice hockey player? Hypothesis test

Let X be the number of the 25 players who are born from January to March.
We need to set up hypotheses of the following form: Null hypothesis: H0 : there is no birthday eﬀect. Alternative hypothesis: H1 : there is a birthday eﬀect.

What is the distribution of X under H0 and under H1?
Under H0, there is no birthday effect. So the probability that each player has a birthday in Jan to March is about 1/4 (3 months out of a possible 12 months). Thus the distribution of X under H0 is X ∼ Binomial(25, 1/4).
Under H1, there is a birthday effect, so X ∼ Binomial(25, p) with p ≠ 1/4.
Our formulation for the hypothesis test is therefore as follows.
Null hypothesis:        H0 : p = 0.25.
Alternative hypothesis: H1 : p ≠ 0.25.
Our observation: Number of Jan to March players, X = 14. The observed proportion of players born from Jan to March is 14/25 = 0.56. This is more than the 0.25 predicted by H0. Is it sufficiently greater than 0.25 to provide evidence against H0? Just using our intuition, we can make a guess, but we might be wrong. The answer also depends on the sample size (25 in this case). We need the p-value to measure the evidence properly.
p-value: Probability of getting a result AT LEAST AS EXTREME as X = 14 Jan to March players, if H0 is true and p really is 0.25.
Results at least as extreme as X = 14 are:
Upper tail: X = 14, 15, . . . , 25, for even more Jan to March players.
Lower tail: an equal probability in the opposite direction, for too few Jan to March players.

Note: We do not need to calculate the values corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.7, because we do not have Binomial probability p = 0.5. In fact, the lower tail probability lies somewhere between 0 and 1 player, but it cannot be specified exactly. We get round this problem for calculating the p-value by just multiplying the upper-tail p-value by 2.

[Figure: probabilities for X ∼ Binomial(n = 25, p = 0.25).]

R command for the p-value
We need twice the UPPER-tail p-value:
p-value = 2 × (1 − pbinom(13, 25, 0.25)).
(Recall P(X ≥ 14) = 1 − P(X ≤ 13).)
Typing this in R gives:
> 2*(1-pbinom(13, 25, 0.25))
[1] 0.001831663
This p-value is very small. It means that if there really was no birthday effect, we would expect to see results as unusual as 14 out of 25 Jan to March players less than 2 in 1000 times. The data are barely compatible with H0.
We conclude that we have strong evidence that there is a birthday effect in this ice hockey team. Something beyond ordinary chance seems to be going on.

Why should there be a birthday effect?
These data are just one example of a much wider and astonishingly strong phenomenon. Professional sports players, not just in ice hockey but in soccer, baseball, and other sports, have strong birthday clustering. Why? It's because these sports select talented players for age-class star teams at young ages, about 10 years old. In ice hockey, the cut-off date for age-class teams is January 1st. A 10-year-old born in December is competing against players who are nearly a year older, born in January, February, and March. The age difference makes a big difference in terms of size, speed, and physical coordination. Most of the 'talented' players at this age are simply older and bigger. But there then follow years in which they get the best coaching and the most practice. By the time they reach 17, these players really are the best.

2.9 Likelihood and estimation
So far, the hypothesis tests have only told us whether the Binomial probability p might be, or probably isn't, equal to the value specified in the null hypothesis. They have told us nothing about the size, or potential importance, of the departure from H0.
For example, for the deep-sea divers, we found that it would be very unlikely to observe as many as 125 daughters out of 190 children if the chance of having a daughter really was p = 0.5. But what does this say about the actual value of p? Remember the p-value for the test was 0.000016. Do you think that:
1. p could be as close to 0.5 as, say, 0.51?
2. p could be as big as 0.8?
No idea! The p-value does not tell us. If there was a huge sample size (number of children), we COULD get a p-value as small as 0.000016 even if the true probability was 0.51. The test doesn't even tell us this much!
Common sense, however, gives us a hint. Because there were almost twice as many daughters as sons, my guess is that the probability of having a daughter is something close to p = 2/3. We need some way of formalizing this.

Estimation
The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter.
In the case of the deep-sea divers, we wish to estimate the probability p that the child of a diver is a daughter. The common-sense estimate to use is
p̂ = (number of daughters) / (total number of children) = 125/190 = 0.658.
However, there are many situations where our common sense fails us. For example, what would we do if we had a regression-model situation (see other courses) and wished to specify an alternative form for p, such as
p = α + β × (diver age)?
How would we estimate the unknown intercept α and slope β, given known information on diver age and number of daughters and sons?
We need a general framework for estimation that can be applied to any situation. Probably the most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation.

Likelihood
Likelihood is one of the most important concepts in statistics. Return to the deep-sea diver example. We know that X ∼ Binomial(190, p), where X is the number of daughters out of 190 children, and we wish to estimate the value of p. The available data is the observed value of X: X = 125.

Suppose for a moment that p = 0.5. What is the probability of observing X = 125?
When X ∼ Binomial(190, 0.5),
P(X = 125) = (190 choose 125)(0.5)^125 (1 − 0.5)^(190−125) = 3.97 × 10^(−6).
Not very likely!!
What about p = 0.6? What would be the probability of observing X = 125 if p = 0.6?
When X ∼ Binomial(190, 0.6),
P(X = 125) = (190 choose 125)(0.6)^125 (1 − 0.6)^(190−125) = 0.016.
This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5. So far, we have discovered that it would be thousands of times more likely to observe X = 125 if p = 0.6 than it would be if p = 0.5. This suggests that p = 0.6 is a better estimate than p = 0.5.
You can probably see where this is heading. If p = 0.6 is a better estimate than p = 0.5, what if we move p even closer to our common-sense estimate of 0.658?
When X ∼ Binomial(190, 0.658),
P(X = 125) = (190 choose 125)(0.658)^125 (1 − 0.658)^(190−125) = 0.061.
This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.

Overall.7).50 0. We can see that the maximum occurs somewhere close to our common-sense estimate of p = 0.01 0.75 0.70 0. This maximum likelihood value of p is our maximum likelihood estimate. P(X = 125) = 190 (0.7)125(1 − 0.65 p 0.7)190−125 125 = 0.658.7? When X ∼ Binomial(190.03 0.05 0.04 0. 0.028. say to p = 0. This is a value of p at which the observation X = 125 is MORE LIKELY than at any other value of p. we can plot a graph showing how likely our observation of X = 125 is under each diﬀerent value of p.658.658.02 0.06 P(X=125) when X~Bin(190. .00 0. 0. p) 0.60 0.7 than under p = 0.82 Can we do any better? What happens if we increase p a little more. This has decreased from the result for p = 0. so our observation of 125 is LESS likely under p = 0.55 0.80 The graph reaches a clear maximum.

The likelihood function
Look at the graph we plotted overleaf:
Horizontal axis: the unknown parameter, p.
Vertical axis: the probability of our observation, X = 125, under this value of p.
This function is called the likelihood function. It is a function of the unknown parameter p. For our fixed observation X = 125, the likelihood function shows how LIKELY the observation 125 is for every different value of p.
The likelihood function is:
L(p) = P(X = 125) when X ∼ Binomial(190, p)
     = (190 choose 125) p^125 (1 − p)^(190−125)
     = (190 choose 125) p^125 (1 − p)^65.
This function of p is the curve shown on the graph on page 82.
In general, if our observation were X = x rather than X = 125, the likelihood function is a function of p giving P(X = x) when X ∼ Binomial(190, p). We write:
L(p ; x) = P(X = x) when X ∼ Binomial(190, p)
         = (190 choose x) p^x (1 − p)^(190−x).

Difference between the likelihood function and the probability function
The likelihood function is a probability of x, but it is a FUNCTION of p. The likelihood gives the probability of a FIXED observation x, for every possible value of the parameter p. Compare this with the probability function, which is the probability of every different value of x, for a FIXED value of p.
[Figure: left, the probability function P(X = x) when p = 0.6, plotted against x; right, the likelihood function P(X = 125) when X ∼ Bin(190, p), plotted against p.]
Probability function, fX(x): function of x for fixed p. Gives P(X = x) as x changes. (p = 0.6 here, but it could be anything.)
Likelihood function, L(p ; x): function of p for fixed x. Gives P(X = x) as p changes. (x = 125 here, but it could be anything.)

Maximizing the likelihood
We have decided that a sensible parameter estimate for p is the maximum likelihood estimate: the value of p at which the observation X = 125 is more likely than at any other value of p. We can find the maximum likelihood estimate using calculus.

The likelihood function is
L(p ; 125) = (190 choose 125) p^125 (1 − p)^65.
We wish to find the value of p that maximizes this expression.
To find the maximizing value of p, differentiate the likelihood with respect to p (Product Rule):
dL/dp = (190 choose 125) { 125 p^124 (1 − p)^65 + p^125 × 65 (1 − p)^64 × (−1) }
      = (190 choose 125) p^124 (1 − p)^64 { 125(1 − p) − 65p }
      = (190 choose 125) p^124 (1 − p)^64 (125 − 190p).
The maximizing value of p occurs when dL/dp = 0. This gives:
(190 choose 125) p^124 (1 − p)^64 (125 − 190p) = 0
⇒ 125 − 190p = 0
⇒ p = 125/190 = 0.658.
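A numerical confirmation of the same maximum in R (a sketch under our own naming, not part of the notes; optimize is base R):

L <- function(p) dbinom(125, size = 190, prob = p)
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum   # about 0.658 = 125/190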

For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate (page 80):
p̂ = (number of daughters) / (total number of children) = 125/190.
This gives us confidence that the method of maximum likelihood is sensible.

The 'hat' notation for an estimate
It is conventional to write the estimated value of a parameter with a 'hat', like this: p̂ = 125/190.
The correct notation for the maximization is:
dL/dp evaluated at p = p̂ equals 0  ⇒  p̂ = 125/190.

Summary of the maximum likelihood procedure
1. Write down the distribution of X in terms of the unknown parameter:
X ∼ Binomial(190, p).
2. Write down the observed value of X:
Observed data: X = 125.
3. Write down the likelihood function for this observed value:
L(p ; 125) = P(X = 125) when X ∼ Binomial(190, p)
           = (190 choose 125) p^125 (1 − p)^65.

4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (190 choose 125) p^124 (1 − p)^64 (125 − 190p) = 0 when p = p̂.
This is the Likelihood Equation.
5. Solve for p̂:
p̂ = 125/190.
This is the maximum likelihood estimate (MLE) of p.

Verifying the maximum
Strictly speaking, when we find the maximum likelihood estimate using dL/dp = 0, we should verify that the result is a maximum (rather than a minimum) by showing that
d²L/dp² < 0 at p = p̂.
In Stats 210, we will be relaxed about this. You will usually be told to assume that the MLE exists. Where possible, it is always best to plot the likelihood function, as on page 82. This confirms that the maximum likelihood estimate exists and is unique. In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later).

Estimators
For the example above, we had observation X = 125, and the maximum likelihood estimate of p was p̂ = 125/190.
It is clear that we could follow through the same working with any value of X, which we can write as X = x, and we would obtain
p̂ = x/190.
Exercise: Check this by maximizing the likelihood using x instead of 125.
This means that even before we have made our observation of X, we can provide a RULE for calculating the maximum likelihood estimate once X is observed:
Rule: Let X ∼ Binomial(190, p). Whatever value of X we observe, the maximum likelihood estimate of p will be
p̂ = X/190.
Note that this expression is now a random variable: it depends on the random value of X. A random variable specifying how an estimate is calculated from an observation is called an estimator.
In the example above, the maximum likelihood estimaTOR of p is p̂ = X/190.
The maximum likelihood estimaTE of p, once we have observed that X = x, is p̂ = x/190.

General maximum likelihood estimator for Binomial(n, p)
Take any situation in which our observation X has the distribution X ∼ Binomial(n, p), where n is KNOWN and p is to be estimated. We make a single observation X = x. Follow the steps on page 86 to find the maximum likelihood estimator for p.
1. Write down the distribution of X in terms of the unknown parameter:
X ∼ Binomial(n, p).   (n is known.)
2. Write down the observed value of X:
Observed data: X = x.
3. Write down the likelihood function for this observed value:
L(p ; x) = P(X = x) when X ∼ Binomial(n, p) = (n choose x) p^x (1 − p)^(n−x).
4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (n choose x) p^(x−1) (1 − p)^(n−x−1) (x − np) = 0 when p = p̂.   (Exercise)
5. Solve for p̂:
p̂ = x/n.
This is the maximum likelihood estimate of p.

The maximum likelihood estimator of p is p̂ = X/n. (Just replace the x in the MLE with an X, to convert from the estimate to the estimator.)
By deriving the general maximum likelihood estimator for any problem of this sort, we can plug in values of n and x to get an instant MLE for any Binomial(n, p) problem in which n is known.

Example: Recall the president problem in Section 2.7. Out of 153 children, 65 were daughters. Let p be the probability that a presidential child is a daughter. What is the maximum likelihood estimate of p?
Solution: Plug in the numbers n = 153, x = 65: the maximum likelihood estimate is
p̂ = x/n = 65/153 = 0.425.
Note: We showed in Section 2.7 that p was not significantly different from 0.5 in this example. However, the MLE of p is definitely different from 0.5, just by a little. This comes back to the meaning of significantly different in the statistical sense. Saying that p is not significantly different from 0.5 just means that we can't DISTINGUISH any difference between p and 0.5 from routine sampling variability. We expect that p probably IS different from 0.5, just by a little. The maximum likelihood estimate gives us the 'best' estimate of p.
Note: We have only considered the class of problems for which X ∼ Binomial(n, p) and n is KNOWN. If n is not known, we have a harder problem: we have two parameters, and one of them (n) should only take discrete values 1, 2, 3, . . . . We will not consider problems of this type in Stats 210.

2.10 Random numbers and histograms
We often wish to generate random numbers from a given distribution. Statistical packages like R have custom-made commands for doing this.
To generate (say) 100 random numbers from the Binomial(n = 190, p = 0.6) distribution in R, we use:
rbinom(100, 190, 0.6)
or in long-hand,
rbinom(n=100, size=190, prob=0.6)
Caution: the R inputs n and size are the opposite to what you might expect: n gives the required sample size, and size gives the Binomial parameter n!

Histograms
The usual graph used to visualise a set of random numbers is the histogram. The height of each bar of the histogram shows how many of the random numbers fall into the interval represented by the bar. For example, if each histogram bar covers an interval of length 5, and if 24 of the random numbers fall between 105 and 110, then the height of the histogram bar for the interval (105, 110) would be 24.
Here are histograms from applying the command rbinom(100, 190, 0.6) three different times.
[Figure: three histograms, each showing 100 random numbers from the Binomial(n = 190, p = 0.6) distribution.]

Note: The histograms above have been specially adjusted so that each histogram bar covers an interval of just one integer. For example, the height of the bar plotted at x = 109 shows how many of the 100 random numbers are equal to 109.
Usually, histogram bars would cover a larger interval, and the histogram would be smoother. For example, on the right is a histogram using the default settings in R, obtained from the command hist(rbinom(100, 190, 0.6)). Each histogram bar covers an interval of 5 integers.
[Figure: Histogram of rbinom(100, 190, 0.6) with the default R settings.]
In all the histograms above, the sum of the heights of all the bars is 100, because there are 100 observations.

Histograms as the sample size increases
Histograms are useful because they show the approximate shape of the underlying probability function. They are also useful for exploring the effect of increasing sample size. All the histograms below have bars covering an interval of 1 integer. They show how the histogram becomes smoother and less erratic as sample size increases. Eventually, with a large enough sample size, the histogram starts to look identical to the probability function.

Note: difference between a histogram and the probability function
The histogram plots OBSERVED FREQUENCIES of a set of random numbers. The probability function plots EXACT PROBABILITIES for the distribution. The histogram should have the same shape as the probability function, especially as the sample size gets large.

[Figure: histograms of rbinom(1000, 190, 0.6), rbinom(10000, 190, 0.6), and rbinom(100000, 190, 0.6) (three repetitions at each sample size), together with the exact probability function P(X = x) for X ∼ Binomial(190, 0.6). The histograms become stable in shape and approach the shape of the probability function as the sample size gets large. The probability function is fixed and exact.]

Note: You will get a different result every time you run this command.

2.11 Expectation
Given a random variable X that measures something, we often want to know: what is the average value of X? For example, here are 30 random observations taken from the distribution X ∼ Binomial(n = 190, p = 0.6):
R command: rbinom(30, 190, 0.6)
116 116 117 122 111 112 114 120 112 102 125 116 97 105 108 117 118 111 116 121 107 113 120 114 114 124 116 118 119 120
The average, or mean, of the first ten values is:
(116 + 116 + . . . + 112 + 102) / 10 = 114.2.
The mean of the first twenty values is:
(116 + 116 + . . . + 116 + 121) / 20 = 113.8.
The mean of the first thirty values is:
(116 + 116 + . . . + 119 + 120) / 30 = 114.7.
The answers all seem to be close to 114. What would happen if we took the average of hundreds of values?
100 values from Binomial(190, 0.6):
R command: mean(rbinom(100, 190, 0.6))
Result: 114.…

1000 values from Binomial(190, 0.6):
R command: mean(rbinom(1000, 190, 0.6))
Result: 114.02
1 million values from Binomial(190, 0.6):
R command: mean(rbinom(1000000, 190, 0.6))
Result: 114.0001
The average seems to be converging to the value 114. The larger the sample size, the closer the average seems to get to 114. If we kept going for larger and larger sample sizes, we would keep getting answers closer and closer to 114. This is because 114 is the DISTRIBUTION MEAN: the mean value that we would get if we were able to draw an infinite sample from the Binomial(190, 0.6) distribution.
This distribution mean is called the expectation, or expected value, of the Binomial(190, 0.6) distribution. It is a FIXED property of the Binomial(190, 0.6) distribution. This means it is a fixed constant: there is nothing random about it.

Definition: The expected value, also called the expectation or mean, of a discrete random variable X, can be written as E(X), EX, or µX, and is given by
µX = E(X) = Σx x fX(x) = Σx x P(X = x).
The expected value is a measure of the centre, or average, of the set of values that X can take, weighted according to the probability of each value. If we took a very large sample of random numbers from the distribution of X, their average would be approximately equal to µX.

Example: Let X ∼ Binomial(n = 190, p = 0.6). What is E(X)?
E(X) = Σ_{x=0}^{190} x P(X = x) = Σ_{x=0}^{190} x (190 choose x)(0.6)^x (0.4)^(190−x).
Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114. We will see why in Section 2.14.

Explanation of the formula for expectation
We will move away from the Binomial distribution for a moment, and use a simpler example. Let the random variable X be defined as
X = 1 with probability 0.9, and X = −1 with probability 0.1.
X takes only the values 1 and −1. What is the 'average' value of X?
Using (1 + (−1))/2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = −1.
Instead, think of observing X many times, say 100 times. Roughly 90 of these 100 times will have X = 1. Roughly 10 of these 100 times will have X = −1. The average of the 100 values will be roughly
(90 × 1 + 10 × (−1)) / 100 = 0.9 × 1 + 0.1 × (−1)   (= 0.8).
We could repeat this for any sample size.

As the sample gets large, the average of the sample will get ever closer to 0.9 × 1 + 0.1 × (−1). This is why the distribution mean is given by
E(X) = P(X = 1) × 1 + P(X = −1) × (−1),
or in general,
E(X) = Σx P(X = x) × x.
E(X) is a fixed constant giving the average value we would get from a large sample of X.

Linear property of expectation
Expectation is a linear operator:
Theorem 2.11: Let a and b be constants. Then
E(aX + b) = aE(X) + b.
Proof: Immediate from the definition of expectation:
E(aX + b) = Σx (ax + b) fX(x) = a Σx x fX(x) + b Σx fX(x) = a E(X) + b × 1.

Example: finding expectation from the probability function

Example 1: Let X ∼ Binomial(3, 0.2). Write down the probability function of X and find E(X).
We have: fX(x) = P(X = x) = (3 choose x)(0.2)^x (0.8)^(3−x) for x = 0, 1, 2, 3.

x:       0      1      2      3
fX(x):   0.512  0.384  0.096  0.008

Then
E(X) = Σ_{x=0}^{3} x fX(x) = 0 × 0.512 + 1 × 0.384 + 2 × 0.096 + 3 × 0.008 = 0.6.
Note: We have E(X) = 0.6 = 3 × 0.2 for X ∼ Binomial(3, 0.2). We will prove in Section 2.14 that whenever X ∼ Binomial(n, p), then E(X) = np.

Example 2: Let Y be Bernoulli(p) (Section 2.3). That is,
Y = 1 with probability p, and 0 with probability 1 − p.
Find E(Y).

y:           0      1
P(Y = y):    1 − p  p

E(Y) = 0 × (1 − p) + 1 × p = p.
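The Example 1 calculation is easy to check in R from the definition (a sketch, not from the notes):

x <- 0:3
sum(x * dbinom(x, size = 3, prob = 0.2))   # 0.6, which equals n*p = 3*0.2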

Expectation of a sum of random variables: E(X + Y)
For ANY random variables X1, X2, . . . , Xn,
E(X1 + X2 + . . . + Xn) = E(X1) + E(X2) + . . . + E(Xn).
In particular, E(X + Y) = E(X) + E(Y) for ANY X and Y. This result holds for any random variables X1, . . . , Xn. It does NOT require X1, . . . , Xn to be independent.
We can summarize this important result by saying:
The expectation of a sum is the sum of the expectations – ALWAYS.
The proof requires multivariate methods, to be studied later in the course.
Note: We can combine the result above with the linear property of expectation. For any constants a1, . . . , an, we have:
E(a1X1 + a2X2 + . . . + anXn) = a1E(X1) + a2E(X2) + . . . + anE(Xn).

Expectation of a product of random variables: E(XY)
There are two cases when finding the expectation of a product:
1. General case: For general X and Y, E(XY) is NOT equal to E(X)E(Y). We have to find E(XY) either using their joint probability function (see later), or using their covariance (see later).
2. Special case, when X and Y are INDEPENDENT: When X and Y are INDEPENDENT,
E(XY) = E(X)E(Y).

2.12 Variable transformations
We often wish to transform random variables through a function. For example, given the random variable X, possible transformations of X include X², √X, 4X³, and so on.
We often summarize all possible variable transformations by referring to Y = g(X) for some function g.
For discrete random variables, it is very easy to find the probability function for Y = g(X), given that the probability function for X is known. Simply change all the values and keep the probabilities the same.

Example 1: Let X ∼ Binomial(3, 0.2), and let Y = X². Find the probability function of Y.
The probability function for X is:

x:           0      1      2      3
P(X = x):    0.512  0.384  0.096  0.008

Thus the probability function for Y = X² is:

y:           0²     1²     2²     3²
P(Y = y):    0.512  0.384  0.096  0.008

This is because Y takes the value 0² whenever X takes the value 0, and so on. Thus the probability that Y = 0² is the same as the probability that X = 0.
Overall, we would write the probability function of Y = X² as:

y:           0      1      4      9
P(Y = y):    0.512  0.384  0.096  0.008

To transform a discrete random variable, transform the values and leave the probabilities alone.

Example 2: Mr Chance hires out giant helium balloons for advertising. His balloons come in three sizes: heights 2m, 3m, and 4m. 50% of Mr Chance's customers choose to hire the cheapest 2m balloon, while 30% hire the 3m balloon and 20% hire the 4m balloon. The amount of helium gas in cubic metres required to fill the balloons is h³/2, where h is the height of the balloon.
Let X be the height of balloon ordered by a random customer. The probability function of X is:

height, x (m):    2    3    4
P(X = x):         0.5  0.3  0.2

Let Y be the amount of gas required: Y = X³/2. Find the probability function of Y.
Transform the values, and leave the probabilities alone. The probability function of Y is:

gas, y (m³):      4    13.5  32
P(Y = y):         0.5  0.3   0.2

Expected value of a transformed random variable
We can find the expectation of a transformed random variable just like any other random variable. For example, in Example 1 we had X ∼ Binomial(3, 0.2) and Y = X². The probability functions for X and for Y = X² are:

x:           0      1      2      3        y:           0      1      4      9
P(X = x):    0.512  0.384  0.096  0.008    P(Y = y):    0.512  0.384  0.096  0.008

Thus the expectation of Y = X² is:
E(Y) = E(X²) = 0 × 0.512 + 1 × 0.384 + 4 × 0.096 + 9 × 0.008 = 0.84.
Note: E(X²) is NOT the same as {E(X)}². Check that {E(X)}² = 0.36.
To make the calculation quicker, we could cut out the middle step of writing down the probability function of Y. Because we transform the values and keep the probabilities the same, if we write g(X) = X², this becomes:
E{g(X)} = E(X²) = g(0) × 0.512 + g(1) × 0.384 + g(2) × 0.096 + g(3) × 0.008.
Clearly the same arguments can be extended to any function g(X) and any discrete random variable X:
E{g(X)} = Σx g(x) P(X = x).

Definition: For any function g and discrete random variable X, the expected value of g(X) is given by
E{g(X)} = Σx g(x) P(X = x) = Σx g(x) fX(x).
Transform the values, and leave the probabilities alone.
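The same 'transform the values, keep the probabilities' rule in R (a sketch, not from the notes), using Example 1 with g(X) = X²:

x  <- 0:3
fx <- dbinom(x, size = 3, prob = 0.2)
sum(x^2 * fx)     # E(X^2) = 0.84
sum(x * fx)^2     # {E(X)}^2 = 0.36 -- not the same thing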

Example: Recall Mr Chance and his balloon-hire business from page 101. Let X be the height of balloon selected by a randomly chosen customer. The probability function of X is:

height, x (m):    2    3    4
P(X = x):         0.5  0.3  0.2

(a) What is the average amount of gas required per customer?
Gas required was X³/2 from page 101. Average gas per customer is E(X³/2):
E(X³/2) = Σx (x³/2) P(X = x) = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2 = 12.45 m³ gas.

(b) Mr Chance charges $400 × h to hire a balloon of height h. What is his expected earning per customer?
Expected earning is E(400X) = 400 × E(X) = 400 × (2 × 0.5 + 3 × 0.3 + 4 × 0.2) = 400 × 2.7 = $1080 per customer.

(c) How much does Mr Chance expect to earn in total from his next 5 customers?
Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is
E(Z1 + Z2 + . . . + Z5) = E(Z1) + E(Z2) + . . . + E(Z5)   (expectation is linear)
= 5 × 1080 = $5400.

Getting the expectation . . .
Suppose X = 3 with probability 3/4, and X = 8 with probability 1/4.
Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8.
So E(X) = (3/4) × 3 + (1/4) × 8: add up the values times how often they occur.
What about E(√X)?
√X = √3 with probability 3/4, and √8 with probability 1/4.

So E(√X) = (3/4) × √3 + (1/4) × √8: add up the values times how often they occur.

Common mistakes (all of these are Wrong!):
i)  E(√X) = √(E X) = √((3/4) × 3 + (1/4) × 8)          Wrong!
ii) E(√X) = (3/4) × 3 + (1/4) × 8                       Wrong!
iii) E(√X) = √((3/4) × 3) + √((1/4) × 8)                Wrong!
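A quick numerical illustration in R (not from the notes) that E(√X) is not the same as √(E(X)) for this two-point distribution:

vals  <- c(3, 8)
probs <- c(3/4, 1/4)
sum(sqrt(vals) * probs)    # E(sqrt(X)) = 2.006...
sqrt(sum(vals * probs))    # sqrt(E(X)) = 2.062... -- different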

2.13 Variance
Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine? $50,000?
No: $50,000 is the average, near the centre of the distribution. About half the time, the money required will be GREATER than the average.
How much money should Mrs Tractor put in the machine if she wants to be 99% certain that there will be enough for the day's transactions? Answer: it depends how much the amount withdrawn varies above and below its mean.
For questions like this, we need the study of variance. Variance is the average squared distance of a random variable from its own mean.

Definition: The variance of a random variable X is written as either Var(X) or σ²_X, and is given by
σ²_X = Var(X) = E[(X − µX)²] = E[(X − EX)²].
Similarly, the variance of a function of X is
Var(g(X)) = E[(g(X) − E(g(X)))²].
Note: The variance is the square of the standard deviation of X, so
sd(X) = √Var(X) = √(σ²_X) = σ_X.

Variance as the average squared distance from the mean
The variance is a measure of how spread out are the values that X can take. It is the average squared distance between a value of X and the central (mean) value, µX.
Var(X) = E[(X − µX)²]:
(1) Take the distance from observed values of X to the central point, µX. Square it to balance positive and negative distances.
(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and µX.

Example: Let X = 3 with probability 3/4, and X = 8 with probability 1/4. Then
E(X) = µX = 3 × (3/4) + 8 × (1/4) = 4.25,
Var(X) = σ²_X = (3/4) × (3 − 4.25)² + (1/4) × (8 − 4.25)² = 4.6875.
When we observe X, we get either 3 or 8: this is random. But µX is fixed at 4.25, and σ²_X is fixed at 4.6875, regardless of the outcome of X.
Note: The mean, µX, and the variance, σ²_X, of X are just numbers: there is nothing random or variable about them.

For a discrete random variable,
Var(X) = E[(X − µX)²] = Σx (x − µX)² fX(x) = Σx (x − µX)² P(X = x).
This uses the definition of the expected value of a function of X:
Var(X) = E(g(X)) where g(X) = (X − µX)².

Theorem 2.13A: (important)
Var(X) = E(X²) − (EX)² = E(X²) − µ²_X.
Proof:
Var(X) = E[(X − µX)²]   (by definition)
       = E[X² − 2 X µX + µ²_X]   (X is a random variable; µX is a constant)
       = E(X²) − 2µX E(X) + µ²_X   (by Theorem 2.11)
       = E(X²) − 2µ²_X + µ²_X
       = E(X²) − µ²_X.

Note: E(X²) = Σx x² fX(x) = Σx x² P(X = x). This is not the same as (EX)². e.g. for
X = 3 with probability 0.75, and X = 8 with probability 0.25:
µX = EX = 4.25, so µ²_X = (EX)² = (4.25)² = 18.0625,
but E(X²) = 3² × (3/4) + 8² × (1/4) = 22.75.
Thus E(X²) ≠ (EX)² in general.
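Theorem 2.13A is easy to verify numerically in R for the two-point example above (a sketch, not from the notes):

vals  <- c(3, 8)
probs <- c(3/4, 1/4)
mu    <- sum(vals * probs)            # 4.25
sum((vals - mu)^2 * probs)            # Var(X) from the definition: 4.6875
sum(vals^2 * probs) - mu^2            # E(X^2) - (EX)^2: 4.6875, the same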

Theorem 2.13B: If a and b are constants and g(x) is a function, then
i) Var(aX + b) = a² Var(X);
ii) Var(a g(X) + b) = a² Var{g(X)}.
Note: These are very different from the corresponding expressions for expectations (Theorem 2.11). Variances are more difficult to manipulate than expectations.
Proof: (part (i))
Var(aX + b) = E[{(aX + b) − E(aX + b)}²]
            = E[{aX + b − aE(X) − b}²]   (by Theorem 2.11)
            = E[{aX − aE(X)}²]
            = E[a²{X − E(X)}²]
            = a² E[{X − E(X)}²]   (by Theorem 2.11)
            = a² Var(X).
Part (ii) follows similarly.

Example: finding expectation and variance from the probability function
Recall Mr Chance's balloons from page 101. The random variable Y is the amount of gas required by a randomly chosen customer. The probability function of Y is:

gas, y (m³):    4    13.5  32
P(Y = y):       0.5  0.3   0.2

Find Var(Y).

3 + 322 × 0. 2. General case: For general X and Y . We have to ﬁnd Var(X + Y ) using their covariance (see later).110 We know that E(Y ) = µY = 12. Variance of a sum of random variables: Var(X + Y ) There are two cases when ﬁnding the variance of a sum: 1.5 − 12.45)2 × 0.5 + 13. Var(X + Y ) = Var(X) + Var(Y ).2 = 112.3 + (32 − 12.45)2 = 112.2 = 267. So Var(Y ) = 267.45)2 × 0.45 from page 103.47 as before.45)2 × 0.475 − (12. Second method: use E(Y 2 ) − µ2 : (usually easier) Y E(Y 2 ) = 42 × 0. Var(X + Y ) is NOT equal to Var(X) + Var(Y ). .47.52 × 0. Special case: when X and Y are INDEPENDENT: When X and Y are INDEPENDENT.5 + (13. First method: use Var(Y ) = E[(Y − µY )2]: Var(Y ) = (4 − 12.475.

3. The probability of getting 8 or more heads is less than 1%. last sale 143 23 18 52 . . . Pick a column of ﬁgures. FALSE : in NZ the probability is about 50%. . The ﬁgures are over 5 times more likely to begin with the digit 1 than with the digit 9. The chance of getting a run of at least 6 heads or 6 tails in a row is less than 10%. 2. Toss a fair coin 200 times.5%. 4. FALSE it is 5. RUE T or FALSE ?? 1. and one teacher of age 50. Answers: 1. FALSE : it is 97%. 4. Consider a classroom with 30 pupils of age 5. or open an atlas at the pages giving country areas or populations. share A Barnett Advantage I AFFCO Air NZ . The probability that the pupils all outlive the teacher is about 90%. TRUE : in fact they are 6.111 Interlude: Guess whether each of the following statements is true or false. 3. Open the Business Herald at the pages giving share prices.5 times more likely. . 2. . Toss a fair coin 10 times.

. then X is the number of successes out of n independent trials. which is the same as X . Yn are independent. Var(X) = Var(Y1 ) + Var(Y2) + . . + E(Yn) (does NOT require independence).14 Mean and variance of the Binomial(n. so Var(X) = npq . then: E(X) = E(Y1) + E(Y2) + . That is. Overall. Yi counts as a 1 if trial i is a success. . + Yn . Note: Each Yi is a Bernoulli(p) random variable (Section 2. . p) distribution Let X ∼ Binomial(n. . . then: E(X) = µX = np 2 Var(X) = σX = np(1 − p). + Yn is the total number of successes out of n independent trials. . . We have mentioned several times that E(X) = np. . where each Yi = 1 with probability p. Now if X = Y1 + Y2 + . + Var(Yn ) (DOES require independence). We now prove this and the additional result for Var(X). . . We often write q = 1 − p. .112 2. 0 with probability 1 − p. Y1 + . p). If X ∼ Binomial(n. and as a 0 if trial i is a failure. This means that we can write: X = Y1 + Y2 + . p). . + Yn . p). . . Easy proof: X as a sum of Bernoulli random variables If X ∼ Binomial(n.3). each with P(success) = p. and Y1 .

. E(Yi2) = 02 × (1 − p) + 12 × p = p.113 The probability function of each Yi is: Thus. + E(Yn) = p +p +. + Var(Yn) = n × p(1 − p). . So E(Yi) = 0 × (1 − p) + 1 × p = p. . Var(Yi) = E(Yi2 ) − (EYi)2 = p − p2 = p(1 − p). . Therefore: E(X) = E(Y1) + E(Y2) + . . y 0 P(Yi = y) 1 − p 1 p Also. And: Var(X) = Var(Y1) + Var(Y2) + .+ p = n × p. Thus we have proved that E(X) = np and Var(X) = np(1 − p).. .
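An informal simulation check of these formulae is easy in R. The sketch below builds each Binomial observation as a sum of Bernoulli trials, exactly as in the proof; the choices n = 10, p = 0.3, the seed, and the number of replications are arbitrary:

    set.seed(1)
    n <- 10; p <- 0.3

    # each Binomial(n, p) observation is a sum of n independent Bernoulli(p) trials
    x <- replicate(100000, sum(rbinom(n, size = 1, prob = p)))

    mean(x); n * p                # sample mean vs np = 3
    var(x);  n * p * (1 - p)      # sample variance vs np(1 - p) = 2.1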

n! = n(n − 1)! etc. and when x = n. This gives. wherever possible: e. continuing. n x E(X) = xfX (x) = x p (1 − p)n−x = x x=0 x=0 n n n x x=0 n! px (1 − p)n−x (n − x)!x! But x 1 = and also the ﬁrst term xfX (x) is 0 when x = 0. x! (x − 1)! n So. n px = p · px−1 E(X) = x=1 n(n − 1)! p · p(x−1) (1 − p)(n−1)−(x−1) [(n − 1) − (x − 1)]!(x − 1)! n = np what we want x=1 n − 1 x−1 p (1 − p)(n−1)−(x−1) x−1 need to show this sum = 1 Finally we let y = x − 1 and let m = n − 1. E(X) = x=1 n! px(1 − p)n−x (n − x)!(x − 1)! Next: make n’s into (n − 1)’s.g. y = 0. So m E(X) = np y=0 m y p (1 − p)m−y y (Binomial Theorem) = np(p + (1 − p))m E(X) = np.114 Hard proof: for mathematicians (non-examinable) We show below how the Binomial mean and variance formulae can be derived directly from the probability function. x’s into (x − 1)’s. When x = 1. . n − x = (n − 1) − (x − 1). y = n − 1 = m. as required.

Note the steps: take out x(x−1) and replace n by (n−2). so instead of ﬁnding E(X 2). it will be easier to ﬁnd E[X(X − 1)] = E(X 2) − E(X) because then we will be able to cancel x(x−1) 1 = (x−2)! . sum=1 by Binomial Theorem So E[X(X − 1)] = n(n − 1)p2 . we used x! = (x−1)! . Thus n E[X(X − 1)] = p n(n − 1) = n(n − 1)p2 2 x=2 m y=0 n − 2 x−2 p (1 − p)(n−2)−(x−2) x−2 m y p (1 − p)m−y y if m = n − 2. y = x − 2. . Thus Var(X) = E(X 2) − (E(X))2 = E(X 2) − E(X) + E(X) − (E(X))2 = E[X(X − 1)] + E(X) − (E(X))2 = n(n − 1)p2 + np − n2 p2 = np(1 − p). x! Here goes: n E[X(X − 1)] = = x=0 n x=0 x(x − 1) n x p (1 − p)n−x x x(x − 1)n(n − 1)(n − 2)! p2p(x−2) (1 − p)(n−2)−(x−2) [(n − 2) − (x − 2)]!(x − 2)!x(x − 1) First two terms (x = 0 and x = 1) are 0 due to the x(x − 1) in the numerator. x 1 For E(X).115 For Var(X). use the same ideas again. x by (x−2) wherever possible.

estimates of parameters should always be accompanied by estimates of their variability. perhaps 0.02? We assess the usefulness of estimators using their variance.658 ± 0. n 190 What is our margin of error on this estimate? Do we believe it is 0. we have: n X n Var(p) = Var = = = 1 Var(X) n2 1 × np(1 − p) n2 p(1 − p) . Make a single observation X = x.3 (say). n for X ∼ Binomial(n. we estimated the probability that a diver has a daughter is p= 125 X = = 0.658. p).116 Variance of the MLE for the Binomial p parameter In Section 2. Reminder: Take any situation in which our observation X has the distribution X ∼ Binomial(n. or do we believe it is very precise. Given p = X . in the deep-sea diver example introduced in Section 2. p) (⋆) . The maximum likelihood estimator of p is p= X . where n is KNOWN and p is to be estimated.7. in other words almost useless.9 we derived the maximum likelihood estimator for the Binomial parameter p. For example.658 ± 0. n In practice.

034 = 0. we do not know the true value of p. n Margin of error = 1. so we cannot calculate the exact Var(p).658 ± 0. so: se(p) = 0. We will study the Central Limit Theorem in later chapters. we should therefore quote: p = 0. Instead.117 In practice.591. n Var (p) = The standard error of p is: se(p) = Var (p). For our ﬁnal answer.658 (0.658. 0. .067 or p = 0. The expression p ± 1. Example: For the deep-sea diver example.658) 190 = 0. with n = 190. We call our estimated variance Var (p): p(1 − p) . we have to ESTIMATE Var(p) by replacing the unknown p in equation (⋆) by p. We usually quote the margin of error associated with p as p(1 − p) .96 × This result occurs because the Central Limit Theorem guarantees that p will be approximately Normally distributed in large samples (large n).96 × se(p) gives an approximate 95% conﬁdence interval for p under the Normal approximation. however.658 × (1 − 0.658 ± 1.96 × se(p) = 1. p = 0.725).96 × 0.034.
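The deep-sea diver numbers above can be reproduced in a couple of lines of R (a sketch; x = 125 and n = 190 are taken from the example):

    x <- 125; n <- 190
    phat <- x / n                          # 0.658
    se   <- sqrt(phat * (1 - phat) / n)    # estimated standard error, about 0.034

    phat + c(-1.96, 1.96) * se             # approximate 95% interval: about (0.591, 0.725)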

is constant for all trials. Variance: Var(X) = np(1 − p). . and variance. 2) Having children: each child can be thought of as a Bernoulli trial with outcomes {girl. 1. First we revise Bernoulli trials and the Binomial distribution. ii) the probability of success.Chapter 3: Modelling with Discrete Probability Distributions 118 In Chapter 2 we introduced several fundamental ideas: hypothesis testing. p. . . each with P(success) = p. . .5. and if X and Y both share the same parameter p. n. Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with 1 P(success) = P(head) = 2 . n x px(1 − p)n−x for x = 0. Sum of independent Binomials: If X ∼ Binomial(n. Each of these was illustrated by the Binomial distribution.1 Binomial distribution Description: X ∼ Binomial(n. 3. boy} and P(girl) = 0. We now introduce several other discrete distributions and discuss their properties and usage. expectation. Probability function: fX (x) = P(X = x) = Mean: E(X) = np. iii) the trials are independent. p). p) if X is the number of successes out of a ﬁxed number n of Bernoulli trials. likelihood. then X + Y ∼ Binomial(n + m. and if X and Y are independent. p) and Y ∼ Binomial(m. Bernoulli Trials A set of Bernoulli trials is a series of trials such that: i) each trial has only 2 possible outcomes: Success and Failure. p).

5).9 if n is large) 0.119 Shape: Usually symmetrical unless p is close to 0 or 1. • The Binomial distribution counts the number of successes out of a ﬁxed number of trials.2 Geometric distribution Like the Binomial distribution. • The Geometric distribution counts the number of trials before the ﬁrst This means that the Geometric distribution counts the number of failures before the ﬁrst success.0 0. If every trial has probability p of success. p = 0. success occurs.08 0.25 0. Then X ∼ Geometric(p).3 0.1 0. p = 0.10 0.5 (symmetrical) 0.20 0. Peaks at approximately np.15 0.0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0.4 n = 10. p = 0. On every jump the ﬁsh has probability p of reaching the top.04 0. Examples: 1) X =number of boys before the ﬁrst girl in a family: X ∼ Geometric(p = 0.2 0.06 0.02 0. 2) Fish jumping up a waterfall.10 90 100 3. .9 (skewed for p close to 1) (less skew for p = 0.12 0. n = 10. Let X be the number of failed jumps before the ﬁsh succeeds.9 n = 100.0 80 0. the Geometric distribution is deﬁned in terms of a sequence of Bernoulli trials. we write: X ∼ Geometric(p).05 0.

. etc. . while the Binomial distribution has probability function P(x failures. x 1−p q = p p q 1−p = 2 2 p p Var(X) = . .g. . ii) Probability function For X ∼ Geometric(p). F SF . x failures For the Binomial distribution.120 Properties of the Geometric distribution i) Description X ∼ Geometric(p) if X is the number of failures before the ﬁrst success in a series of Bernoulli trials with P(success) = p. SF . . F S. failures and successes can occur in any order: e. Explanation: P(X = x) = (1 − p)x need x failures ﬁnal trial must be a success × p Diﬀerence between Geometric and Binomial: For the Geometric distribution. 1. x+1 (1 − p)xp. . 1 success) = (1 − p)xp. . . . . . This is why the Geometric distribution has probability function P(x failures. F . 1 success) = iii) Mean and variance E(X) = For X ∼ Geometric(p). F S . 2. F . F F . fX (x) = P(X = x) = (1 − p)xp for x = 0. the trials must always occur in the order F F .


iv) Sum of independent Geometric random variables If X1 , . . . , Xk are independent, and each Xi ∼ Geometric(p), then X1 + . . . + Xk ∼ Negative Binomial(k, p). v) Shape Geometric probabilities are always greatest at x = 0. The distribution always has a long right tail (positive skew). The length of the tail depends on p. For small p, there could be many failures before the ﬁrst success, so the tail is long. For large p, a success is likely to occur almost immediately, so the tail is short.
(Plots: Geometric(p) probability functions for x = 0, 1, ..., 10, for p = 0.3 (small p), p = 0.5 (moderate p), and p = 0.9 (large p), showing the right tail shortening as p increases.)
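R's dgeom function uses the same convention as these notes (X counts the failures before the first success), so the probability function and the mean and variance formulae can be checked numerically. An illustrative sketch with an arbitrary choice p = 0.3:

    p <- 0.3
    x <- 0:200                                  # range wide enough that the tail is negligible

    all.equal(dgeom(x, p), (1 - p)^x * p)       # probability function matches (1 - p)^x p

    sum(x * dgeom(x, p));    (1 - p) / p        # E(X) = q/p, about 2.33
    sum(x^2 * dgeom(x, p)) - ((1 - p) / p)^2    # Var(X) by the shortcut formula
    (1 - p) / p^2                               # q/p^2, about 7.78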

vi) Likelihood For any random variable, the likelihood function is just the probability function expressed as a function of the unknown parameter. If: • X ∼ Geometric(p); • p is unknown; • the observed value of X is x; then the likelihood function is: L(p ; x) = p(1 − p)x

for 0 < p < 1.

Example: we observe a ﬁsh making 5 failed jumps before reaching the top of a waterfall. We wish to estimate the probability of success for each jump.

Then L(p ; 5) = p(1 − p)5 for 0 < p < 1. Maximize L with respect to p to ﬁnd the MLE, p.
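The maximization can also be done numerically. A sketch using R's optimize function (differentiating log L gives the calculus answer p-hat = 1/(1 + x) = 1/6 here, which the numerical search should reproduce):

    x <- 5                                     # observed number of failed jumps
    L <- function(p) p * (1 - p)^x             # Geometric likelihood L(p ; x)

    optimize(L, interval = c(0, 1), maximum = TRUE)$maximum   # about 0.167 = 1/6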


For mathematicians: proof of Geometric mean and variance formulae (non-examinable)

We wish to prove that E(X) = (1 − p)/p and Var(X) = (1 − p)/p² when X ∼ Geometric(p).

We use the following results:

Σ_{x=1}^∞ x q^(x−1) = 1/(1 − q)²         (for |q| < 1),        (3.1)

and

Σ_{x=2}^∞ x(x − 1) q^(x−2) = 2/(1 − q)³  (for |q| < 1).        (3.2)

Proof of (3.1) and (3.2): Consider the infinite sum of a geometric progression:

Σ_{x=0}^∞ q^x = 1/(1 − q)   (for |q| < 1).

Differentiate both sides with respect to q:

d/dq Σ_{x=0}^∞ q^x = d/dq (1/(1 − q))

Σ_{x=1}^∞ x q^(x−1) = 1/(1 − q)²,

as stated in (3.1). Note that the lower limit of the summation becomes x = 1 because the term for x = 0 vanishes. The proof of (3.2) is obtained similarly, by differentiating both sides of (3.1) with respect to q (Exercise).

Now we can find E(X) and Var(X).

E(X) = Σ_{x=0}^∞ x P(X = x)
     = Σ_{x=0}^∞ x p q^x                  (where q = 1 − p)
     = p q Σ_{x=1}^∞ x q^(x−1)            (lower limit becomes x = 1 because the term in x = 0 is zero)
     = p q × 1/(1 − q)²                   (by equation (3.1))
     = p q × 1/p²                         (because 1 − q = p)
     = q/p,   as required.

For Var(X), we use

Var(X) = E(X²) − (EX)² = E{X(X − 1)} + E(X) − (EX)².        (⋆)

Now

E{X(X − 1)} = Σ_{x=0}^∞ x(x − 1) P(X = x)
            = Σ_{x=0}^∞ x(x − 1) p q^x          (where q = 1 − p)
            = p q² Σ_{x=2}^∞ x(x − 1) q^(x−2)   (note that terms below x = 2 vanish)
            = p q² × 2/(1 − q)³                 (by equation (3.2))
            = 2q²/p².

Thus by (⋆),

Var(X) = 2q²/p² + q/p − (q/p)² = q²/p² + q/p = q(q + p)/p² = q/p²,

as required, because q + p = 1.

p) if X is the number of failures before the k ’th success in a series of Bernoulli trials with P(success) = p. we write: X ∼ NegBin(k. . Examples: 1) X =number of boys before the second girl in a family: X ∼ NegBin(k = 2. p).5).3 Negative Binomial distribution The Negative Binomial distribution is a generalised form of the Geometric distribution: • the Geometric distribution counts the number of failures before the ﬁrst success. . 2. p). p). 1. • the Negative Binomial distribution counts the number of failures before the k ’th success. Let X be the number of papers ? Tom fails in his degree. He passes each paper with probability p. p = 0. . fX (x) = P(X = x) = k+x−1 k p (1 − p)x x for x = 0. independently of all other papers. If every trial has probability p of success. 2) Tom needs to pass 24 papers to complete his degree.124 3. ii) Probability function For X ∼ NegBin(k. Then X ∼ NegBin(24. Properties of the Negative Binomial distribution i) Description X ∼ NegBin(k. .

. iii) Mean and variance E(X) = For X ∼ NegBin(k. . × (1 − p)x x failures Var(X) = These results can be proved from the fact that the Negative Binomial distribution is obtained as the sum of k independent Geometric random variables: X = Y1 + . we need x failures and k successes. with the same value of p. ⇒ E(X) = kE(Yi) = p kq Var(X) = kVar(Yi) = 2 . + Yk . p). if x = 3 failures and k = 2 successes. and X ∼ NegBin(k. so the last trial must be a success. Y ∼ NegBin(m. k(1 − p) kq = p p k(1 − p) kq = 2 p2 p k successes × pk F F SF S F SF F S SF F F S • This leaves x failures and k − 1 successes to occur in any order: a total of k − 1 + x trials. we could have: F F F SS So: P(X = x) = k+x−1 x (k − 1) successes and x failures out of (k − 1 + x) trials. • The trials stop when we reach the k’th success. For example. Yi indept. then X + Y ∼ NegBin(k + m. p). p iv) Sum of independent Negative Binomial random variables If X and Y are independent. p). p). kq . .125 Explanation: • For X = x. where each Yi ∼ Geometric(p).

15 0. Example: Tom fails a total of 4 papers before ﬁnishing his degree. p). k = 3.04 0.0 0. If: • X ∼ NegBin(k. Observation: X = 4 failed papers.08 k = 10. 4) = 24 + 4 − 1 24 p (1 − p)4 = 4 27 24 p (1 − p)4 4 0. Likelihood: L(p . p = 0.3 0.05 0. What is his pass probability for each paper? X =# failed papers before 24 passed papers: X ∼ NegBin(24.02 0.0 0 2 4 6 8 10 12 14 16 18 20 22 24 vi) Likelihood As always.8 0.126 v) Shape The Negative Binomial is ﬂexible in shape. p = 0.5 0. .5 k = 3. Below are the probability functions for various diﬀerent values of k and p. x) = k+x−1 k p (1 − p)x x for 0 < p < 1.10 0. • k is known.06 for 0 < p < 1.5 0.0 0. • p is unknown. p). p. then the likelihood function is: L(p .1 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0. p = 0. • the observed value of X is x.4 0. the likelihood function is the probability function expressed as a function of the unknown parameters. Maximize L with respect to p to ﬁnd the MLE.2 0.
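The maximization can again be checked numerically; a rough sketch in R (the binomial coefficient is constant in p, so it does not affect where the maximum lies; differentiating log L gives p-hat = k/(k + x) = 24/28, roughly 0.857):

    k <- 24; x <- 4                                            # 24 passes needed, 4 failures observed
    L <- function(p) choose(k + x - 1, x) * p^k * (1 - p)^x    # Negative Binomial likelihood

    optimize(L, interval = c(0, 1), maximum = TRUE)$maximum    # about 0.857 = 24/28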

n = 5). M. • the other N − M objects are not special. Then X ∼ Hypergeometric(N. M). n). n + M − N ) to x = min(n. Then X ∼ Hypergeometric(N = 20. M = 8. Example: Ron has a box of Chocolate Frogs. ii) Probability function For X ∼ Hypergeometric(N. i) Description Suppose we have N objects: • M of the N objects are special. Let X = number of the n removed objects that are special. Ron grabs a random handful of 5 chocolate frogs and stuﬀs them into his mouth when he thinks that noone is looking. Let X be the number of dark chocolate frogs he picks.127 3.4 Hypergeometric distribution: sampling without replacement The hypergeometric distribution is used when we are sampling without replace- ment from a ﬁnite population. M x N −M n−x N n fX (x) = P(X = x) = for x = max(0. We remove n objects at random without replacement. There are 20 chocolate frogs in the box. M. n). Eight of them are dark chocolate. and twelve of them are white chocolate. .

n). We need P(X = 2) = 8 2 12 3 20 5 = 28 × 220 = 0. n−x • Total number of ways of choosing x special objects and (n−x) other objects is: M × N −M . x • Number of ways of selecting n − x other objects from the N − M available is: N −M . Thus: number of desired ways = P(X = x) = total number of ways M x N −M n−x N n . iii) Mean and variance For X ∼ Hypergeometric(N. 15504 Same answer as Tutorial 3 Q2(c). E(X) = np Var(X) = np(1 − p) N −n N −1 M N. There are N = 20 chocolates. M. Note: We need 0 ≤ x ≤ M (number of special objects). where p = . Example: What is the probability that Ron selects 3 white and 2 dark chocolates? X =# dark chocolates. including M = 8 dark chocolates.128 Explanation: We need to choose x special objects and n − x other objects.397 . n−x x • Overall number of ways of choosing n objects from N is: N n . this gives us the stated constraint that x = max(0. M). After some working. n + M − N ) to x = min(n. and 0 ≤ n − x ≤ N − M (number of other objects). • Number of ways of selecting x special objects from the M available is: M .
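The same probability is available from R's dhyper function; its arguments are the observed count, the number of special objects M, the number of other objects N − M, and the sample size n. A quick check of the chocolate frog example:

    dhyper(2, m = 8, n = 12, k = 5)                  # P(X = 2) for the example above
    choose(8, 2) * choose(12, 3) / choose(20, 5)     # the same calculation written out: about 0.397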

M. Hypergeometric(N. The Binomial distribution is used when the population is sampled with replace- ment. n) → Binomial(n. = <sum of Binomial probabilities> = 1.10 0.25 Binomial(10.e. n) distribution by the Binomial(n. . 12 ) 30 0.0 0 0. because removing n objects does not change the overall composition of the population very much when n/N is small.10 0.30 0.129 iv) Shape The Hypergeometric distribution is similar to the Binomial distribution when n/N is small. • Binomial probabilities sum to 1 because of the Binomial Theorem: p + (1 − p) n as N → ∞.15 0.15 0.1 we often approximate the Hypergeometric(N. 12. −k: p k 1 − (1 − p) ∞ −k = <sum of NegBin probabilities> = 1. 10) 0.0 0. the Binomial expansion with a negative power.25 0. • Geometric probabilities sum to 1 because they form a Geometric series: p x=0 (1 − p)x = <sum of Geometric probabilities> = 1. • Negative Binomial probabilities sum to 1 by the Negative Binomial expansion: i. As noted above.05 0.05 0 1 2 3 4 5 6 7 8 9 10 0. because these involve sampling without replacement from a ﬁnite population. For n/N < 0. N Hypergeometric(30.20 0. p = M ) distribution. M ) N A note about distribution names Discrete distributions often get their names from mathematical power series. M.20 1 2 3 4 5 6 7 8 9 10 Note: The Hypergeometric distribution can be used for opinion polls.

How many cars will cross the Harbour Bridge today? X ∼ Poisson. The Poisson distribution arose from a mathematical formulation called the Poisson Process. Suppose that: i) all accidents are independent of each other. How many volcanoes will erupt over the next 1000 years? X ∼ Poisson. unfortunately! (give or take 1000 years or so.5 Poisson distribution When is the next volcano due to erupt in Auckland? Any moment now. ii) accidents occur at a constant average rate of λ per year. X. A distribution that counts the number of random events in a ﬁxed space of time is the Poisson distribution. ) A volcano could happen in Auckland this afternoon. . or it might not happen for another 1000 years. . How many road accidents will there be in NZ this year? X ∼ Poisson. published by Sim´on-Denis Poisson in 1837. has the distribution X ∼ Poisson(λ).3. Volcanoes are almost impossible to predict: they seem to happen completely at random. . Example: Let X be the number of road accidents in a year in New Zealand. Then the number of accidents in a year. iii) accidents cannot occur simultaneously. e Poisson Process The Poisson process counts the number of events occurring in a ﬁxed time or space. when events occur independently and at a constant average rate.

Example: XA = number of raisins in a volume A of currant bun. 1. Then XA ∼ Poisson(λA). 2. . for x = 0. Then Xt ∼ Poisson(λt). Note: For a Poisson process in space. .131 Number of accidents in one year Let X be the number of accidents to occur in one year: X ∼ Poisson(λ). Then Xt ∼ Poisson(λt). . . iii) events cannot occur simultaneously. 2. . . let XA = # events in area of size A. 2. 1. Let Xt be the number of events to occur in time t. 1. . . . The probability function for X ∼ Poisson(λ) is λx −λ P(X = x) = e x! Number of accidents in t years Let Xt be the number of accidents to occur in time t years. ii) events occur at a constant average rate of λ per unit time. and (λt)x −λt P(Xt = x) = e x! for x = 0. . General deﬁnition of the Poisson process Take any sequence of random events such that: i) all events are independent. and P(Xt = x) = (λt)x −λt e x! for x = 0.

t + δt]) δt = λ. 2. These conditions can be used to derive a partial diﬀerential equation on a function known as the probability generating function of Xt . iii) lim δt↓0 P(more than one event occurs in time interval[t. The parameter λ is called the rate of the Poisson distribution. It is also used in many other situations. t + δt]) δt = 0. fX (x) = P(X = x) = λx −λ e x! for x = 0.6). non-examinable). . Deﬁnition: The random variables {Xt : t > 0} form a Poisson process with rate λ if: i) events occurring in any time interval are independent of those occurring in any other disjoint time interval. 1. The partial diﬀerential x equation is solved to provide the form P(Xt = x) = (λt) e−λt . . . Its properties are as follows.132 Where does the Poisson formula come from? (Sketch idea. often as a subjective model (see Section 3. The formal deﬁnition of the Poisson process is as follows. i) Probability function For X ∼ Poisson(λ). x! Poisson distribution The Poisson distribution is not just used in the context of the Poisson process. ii) lim δt↓0 P(exactly one event occurs in time interval[t. for mathematicians. .
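R's dpois function implements exactly this probability function, which makes small checks easy. An illustrative sketch with an arbitrary rate λ = 3.5:

    lambda <- 3.5
    x <- 0:100

    all.equal(dpois(x, lambda), lambda^x * exp(-lambda) / factorial(x))   # matches the formula

    sum(dpois(x, lambda))              # probabilities sum to (essentially) 1
    sum(x * dpois(x, lambda))          # the mean equals lambda (see the properties below)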

2. the distribution becomes more and more symmetrical. Y ∼ Poisson(µ).03 80 100 120 140 .15 0.0 60 0.3 0. As λ increases. For small λ.02 0. then iv) Shape X + Y ∼ Poisson(λ + µ). λ is the average number of events per unit time in the Poisson process.1 0. The probability functions for various λ are shown below.05 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0.04 λ = 100 0.2 0.10 0. λ=1 0. E(X) = Var(X) = λ when X ∼ Poisson(λ).133 ii) Mean and variance The mean and variance of the Poisson(λ) distribution are both λ. The variance of the Poisson distribution increases with the mean (in fact. and X ∼ Poisson(λ). This is often the case in real life: there is more uncertainty associated with larger numbers than with smaller numbers. until for large λ it has the familiar bellshaped appearance. The shape of the Poisson distribution depends upon the value of λ. Notes: 1.20 λ = 3. iii) Sum of independent Poisson random variables If X and Y are independent. variance = mean).5 0. It makes sense for E(X) = λ: by deﬁnition.0 0. the distribution has positive (right) skew.01 0.0 0.

For mathematicians: proof of Poisson mean and variance formulae (non-examinable) We wish to prove that E(X) = Var(X) = λ for X ∼ Poisson(λ). Likelihood: λ28 −λ L(λ . • the observed value of X is x. 28) = for 0 < λ < ∞. e 28! ˆ Maximize L with respect to λ to ﬁnd the MLE.134 v) Likelihood As always. Observation: X = 28 babies. λx −λ For X ∼ Poisson(λ). Because we don’t know λ. we have to estimate the variance: ˆ ˆ Var(λ) = λ. Thus the estimator variance is: ˆ Var(λ) = Var(X) = λ. ˆ Similarly. because X ∼ Poisson(λ). . λ. 2. the maximum likelihood estimator of λ is: λ = X . ˆ We ﬁnd that λ = x = 28. 1. If: • X ∼ Poisson(λ). Estimate λ. the probability function is fX (x) = e for x = 0. Example: 28 babies were born in Mt Roskill yesterday. Let X =# babies born in a day in Mt Roskill. • λ is unknown. the likelihood function is the probability function expressed as a function of the unknown parameters. x! . then the likelihood function is: L(λ. x) = λx −λ e x! for 0 < λ < ∞. . . Assume that X ∼ Poisson(λ).

we use: because the sum=1 (sum of Poisson probabilities) . . Var(X) = E(X 2 ) − (EX)2 = E[X(X − 1)] + E(X) − (EX)2 = E[X(X − 1)] + λ − λ2 . So E(X) = λ. λx −λ x(x − 1) e But E[X(X − 1)] = x! x=0 = ∞ x=2 ∞ λx e−λ (x − 2)! ∞ (terms for x = 0 and x = 1 are 0) = λ 2 x=2 ∞ y=0 λx−2 −λ e (writing everything in terms of x − 2) (x − 2)! λy −λ e y! (putting y = x − 2) = λ 2 = λ2 .135 So E(X) = λx −λ xfX (x) = x e x! x=0 x=0 = ∞ x=1 ∞ ∞ λx e−λ (x − 1)! (note that term for x = 0 is 0) = λ ∞ x=1 ∞ y=0 λx−1 −λ e (writing everything in terms of x − 1) (x − 1)! λy −λ e y! (putting y = x − 1) = λ = λ. as required. So Var(X) = E[X(X − 1)] + λ − λ2 = λ2 + λ − λ2 = λ. For Var(X). as required.

we see that the shape of the distribution is roughly Poisson. . The ﬁt of the Poisson distribution is quite good. or the right sort of relationship between variance and mean. In a subjective model. For example. the Binomial distribution describes exactly the distribution of the number of successes in n Bernoulli trials. Let X = number of letters in an English word chosen at random from the dictio- nary. If we plot the frequencies on a barplot. English word lengths: X − 1 ∼ Poisson(6.05 0. If so.6 Subjective modelling Most of the distributions we have talked about in this chapter are exact models for the situation described. such as the right sort of symmetry or skew.10 0. we pick a probability distribution to describe a situation just because it has properties that we think are appropriate to the situation.22) Word lengths from 25109 English words probability 0. However. Example: Distribution of word lengths for English words.0 0 0.136 3. We need to use X ∼ 1 + Poisson because X cannot take the value 0. we will use a subjective model. there is often no exact model available.15 5 10 number of letters 15 20 The Poisson probabilities (with λ estimated by maximum likelihood) are plotted as points overlaying the barplot.

10 10 20 number of strokes 30 It turns out.08 0.7.137 In this example we can not say that the Poisson distribution represents the number of events in a ﬁxed time or space: instead. it is being used as a subjective model for word length. however.02 0. 0.64). . Stroke counts from 13061 Chinese characters The best-ﬁtting Negative Binomial distribution 0.0 0 0. X does not represent the number of failures before the k’th success: the NegBin is a 0 10 20 number of strokes 30 0. The best-ﬁtting Poisson distribution (found by MLE) is overlaid. Best Poisson fit awful.02 The ﬁt of the Poisson distribution is 0.0 0. X is the number of strokes in a randomly chosen character.04 Here are stroke counts from 13061 Chinese characters.06 0. that the Chinese stroke distribution is well-described by a Negative Binomial model.08 (found by MLE) 0.06 probability is NegBin(k = 23. probability 0. The ﬁt is very good. However. p = 0. Can a Poisson distribution ﬁt any data? The answer is no: in fact the Poisson distribution is very inﬂexible.04 subjective model.

001) can be > 0. . We can’t even begin to list all the values that X can take. P(X = x) = 0 for ALL x. When X is continuous. the Emperor Joseph II responded wryly. x fX (x) = P(X = x) 0 p 1 pq 2 pq 2 . however ‘small’.. This means we should use the distribution function. e.g. there are so many numbers in any continuous set that each of them must have probability 0. But suppose that X takes values in a continuous set.0001? 0. 1). P(X = 1) = 0. The probability function is meaningless. for which we can list all the values and their probabilities..1 Introduction When Mozart performed his opera Die Entf¨hrung aus dem u Serail. If there was a probability > 0 for all the numbers in a continuous set. It describes a continuously varying quantity such as time or height. FX (x) = P(X ≤ x). b). 1]? • the smallest number is 0. but P(0. for X ∼ Geometric(p). ∞) or (0.. . In fact.. there simply wouldn’t be enough probability to go round. but what is the next smallest? 0. how would you list all the numbers in the interval [0.999 ≤ X ≤ 1. For example. we are able to assign probabilities to intervals: eg.0000000001? We just end up talking nonsense. A continuous random variable takes values in a continuous interval (a.138 Chapter 4: Continuous Random Variables 4.g. ‘Too many notes.01? 0. even if the list is inﬁnite: e. [0. Mozart!’ In this chapter we meet a diﬀerent problem: too many numbers! We have met discrete random variables. Although we cannot assign a probability to any value of X.

s. So it makes no diﬀerence whether we say P(a < X ≤ b) or P(a ≤ X ≤ b). b]) = F (b) − F (a). P(X = a) = 0. b]) = F (b) − F (a). P(a < X ≤ b) = P(X ∈ (a. • FX (x) is a continuous function: probability accumulates continuously. P(a < X < b) = P(a + 1 ≤ X ≤ b − 1) = FX (b − 1) − FX (a). . x • P(a < X ≤ b) = P(X ∈ (a. For a discrete random variable with values 0.v. . . FX (x) Recall that for discrete random variables: • FX (x) is a step function: • FX (x) = P(X ≤ x). FX (x) probability accumulates in discrete steps. This is not true for a discrete random variable: in fact. for a continuous random variable. 0 x • As before. . 2. 1. Endpoints are very important for discrete r. For a continuous random variable: FX (x) 1 • FX (x) = P(X ≤ x). Endpoints are not important for continuous r.. P(a < X < b) = P(a ≤ X ≤ b) = FX (b) − FX (a).s.139 The cumulative distribution function.v. However. For a continuous random variable.

representing the top 10 times up to 2002 from each of the 168 sprinters. 1st Qu. There are 1680 times in total. However.14 10.21 10.2 10.d. we never really understand the random variable until we’ve seen its ‘face’ — the probability density function. write emails or send txts to each other all day. All-time top-ten 100m sprint times The histogram below shows the best 10 sprint times from the 168 all-time top male 100m sprinters. but you never really know a person until you’ve seen their face. Just like a cell-phone for keeping in touch. We use it all the time to calculate probabilities and to gain an intuitive feel for the shape and nature of the distribution. it is not very good at telling us what the distribution looks like. Median Mean 3rd Qu.f.0 time (s) 10.f.4 .) is the best way to describe and recognise a continuous random variable. it is quite diﬃcult to describe exactly what the probability density function is. 9. Using the p. is like recognising your friends by their faces. For this we use a diﬀerent tool called the probability density function.78 10. here are the summary statistics: Min. Out of interest.08 10. The probability density function (p.8 10.2 The probability density function Although the cumulative distribution function gives us an interval-based tool for dealing with continuous random variables. the cumulative distribution function is a tool for facilitating our interactions with the continuous random variable. Max.15 10. Surprisingly.140 4. You can chat on the phone.41 frequency 100 200 300 0 9.d. In this section we take some time to motivate and describe this fundamental idea.

4 0.01s intervals 20 40 60 80 0 9. The histograms tell us the most intuitive thing we wish to know about the distribution: its shape: • the most probable times are close to 10.4 0.0 time 10. • times below 10.2 10.141 We could plot this histogram using diﬀerent time intervals: 0.0 time 10.0s and above 10.2 seconds.2 10.02s intervals 100 0 50 150 9.8 10.2 10. • the distribution of times has a long left tail (left skew).1s intervals 200 400 600 0 9.8 10.4 We see that each histogram has broadly the same shape.3 seconds have low probability.4 0.0 time 10.8 10.2 10.8 10.05s intervals 100 200 300 0 9. although the heights of the bars and the interval widths are diﬀerent. .0 time 10.

but keeps a constant height? What should that height be? The standardized histogram We now focus on an idealized (smooth) version of the sprint times distribution.4 0.8 probability 0.08 10. or function.2 10.4 0.2 10.4 This doesn’t work.20 10. probability 0. Wider bars have higher probabilities.0 0.0 0.0 0.8 probability 0.0 time interval 10.2 10. rather than using the exact 1680 sprint times observed.0 time interval 10. First idea: plot the probabilities instead of the frequencies. because the height (probability) still depends upon the bar width. that captures the shape of the histograms.4 0. but will keep the same height for any choice of histogram bar width. How can we derive a curve or function that captures the common shape of the histograms.10 9. We are aiming to derive a curve.0 time interval 10. The height of each histogram bar now represents the probability of getting an observation in that bar.04 9. but the problem is that the histograms are not standardized: • every time we change the interval width. the heights of the bars change.8 10.142 We could ﬁt a curve over any of these histograms to show the desired shape. .2 9.

.4 probability / interval width 4 0.1s intervals 0 1 2 3 9. But.02s intervals 0 1 2 3 9.4 This seems to be exactly what we need! The same curve ﬁts nicely over all the histograms and keeps the same height regardless of the bar width.0 time interval 10. probability / interval width 4 0.0 time interval 10.8 10.143 Second idea: plot the probabilities divided by bar width.0 time interval 10. The nice-ﬁtting curve is the probability density function.2 10.8 10.4 probability / interval width 4 0.2 10.05s intervals 0 1 2 3 9. what is it?! . divided by the width of the bar.4 probability / interval width 4 0.8 10.2 10. The height of each histogram bar now represents the probability of getting an observation in that bar. .0 time interval 10. These histograms are called standardized histograms.2 10.01s intervals 0 1 2 3 9.8 10.

the standardized histogram plots the probability of an observation falling between x and x +t. Thus the height of the standardized histogram bar over the interval from x to x + t is: probability P(x ≤ X ≤ x + t) FX (x + t) − FX (x) = = .144 The probability density function We have seen that there is a single curve that ﬁts nicely over any standardized histogram from a given distribution. We will write the p. What is the height of the standardized histogram bar? For an interval from x to x + t. is in fact the limit of the standardized histogram as the bar width approaches zero. This expression should look familiar: it is the derivative of FX (x). The p.d. The p.f. in the sprint times we can have fX (x) = 4.d. divided by the width of the interval. This curve is called the probability density function (p. fX (x) is clearly NOT the probability of x — for example. t.d. fX (x): fX (x) = lim t→0 FX (x + t) − FX (x) t by deﬁnition.f. However.).d.f.d.f.d. of a continuous random variable X as p. Now consider the limit as the histogram bar width (t) goes to 0: this limit is DEFINED TO BE the probability density function at x.f. interval width t t where FX (x) is the cumulative distribution function. = fX (x). . as the histogram bars of the standardized histogram get narrower. curve. the bars get closer and closer to the p. so it is deﬁnitely NOT a probability.f.

d.) is therefore the function ′ fX (x) = FX (x). the probability density function has another major use: • it calculates probabilities by integration. dx fX (x) = It gives: ′ • the RATE at which probability is accumulating at any given point. a . Suppose we want to calculate P(a ≤ X ≤ b).f. unchanging curve that describes the SHAPE of any histogram drawn from the distribution of X . We already know that: P(a ≤ X ≤ b) = FX (b) − FX (a). The probability density function (p. • the SHAPE of the distribution of X .f. Using the probability density function to calculate probabilities As well as showing us the shape of the distribution of X. FX (x).) of X is deﬁned as dFX ′ = FX (x).d. It is deﬁned to be a single. dx so In fact: FX (x) = fX (x) dx b (without constants). Formal deﬁnition of the probability density function Deﬁnition: Let X be a continuous random variable with distribution function FX (x). But we also know that: dFX = fX (x). FX (b) − FX (a) = fX (x) dx.145 The probability density function (p.

Then b P(a ≤ X ≤ b) = P(X ∈ [ a. fX (x) P(a ≤ X ≤ b) is the AREA under the curve fX (x) between a and b. b ] ) = fX (x) dx . curve is equal to the total probability that X takes a value between −∞ and +∞.d.f. which is 1.d. a This means that we can calculate probabilities by integrating the p.146 This is a very important result: Let X be a continuous random variable with probability density function fX (x).d. x . fX (x) total area under the curve fX (x) ∞ is 1 : −∞ fX (x) dx = 1.f. curve is: total area = ∞ −∞ fX (x) dx = FX (∞) − FX (−∞) = 1 − 0 = 1. a b x The total area under the p. This says that the total area under the p.f.

and wish to calculate the distribution function. It’s How can x range from −∞ to x?! . FX (x). u: x Writing FX (x) = −∞ fX (u) du means: fX (u) integrate fX (u) as u ranges from −∞ to x. FX (x) Suppose we know the probability density function. fX (x). Proof: x −∞ f (u)du = FX (x) − FX (−∞) = FX (x) − 0 = FX (x).d.147 Using the p. to calculate the distribution function. FX (x) = P(X ≤ x) x x u Writing FX (x) = fX (x) dx is WRONG and MEANINGLESS: you will LOSE −∞ A MARK every time. nonsense! x −∞ fX (x) dx means: integrate fX (x) as x ranges from −∞ to x. We use the following formula: x Distribution function. Using the dummy variable. FX (x) = −∞ fX (u) du. In words.f.

Using the c.2 10. (i) We need: ∞ −∞ 0 (ii) Find P(1 < X ≤ 3). Example of calculations with the p. fX (x). F(x) 0. otherwise.d.1 to 10.. we can see that it is about 10.0 0.f.2 seconds. f(x) Let fX (x) = (i) Find the constant k.148 Why do we need fX (x)? Why not stick with FX (x)? These graphs show FX (x) and fX (x) from the men’s 100m sprint times (X is a random top ten 100m sprint time). which is the region of highest probability? Using the p. 0 x fX (x) dx = 1 k e−2x dx = 1 ∞ 0 0 dx + −∞ 0 ∞ e−2x k −2 = 1 .4 10.d. k e−2x 0 for 0 < x < ∞. (iii) Find the cumulative distribution function.4 9.8 0.2 10. For example.8 10.4 Just using FX (x) gives us very little intuition about the problem.f.0 x 10. FX (x). we would have to inspect the part of the curve with the steepest gradient: very diﬃcult to see.8 f(x) 0 1 2 3 4 9.0 x 10..f. FX (x).d. for all x.

FX (x) = So overall. (iii) x FX (x) = −∞ 0 fX (u) du x = −∞ 0 du + 0 x 0 2 e−2u du for x > 0 2e−2u = 0+ −2 = −e−2x + e0 = 1 − e−2x = 0.149 −k −∞ (e − e0 ) = 1 2 −k (0 − 1) = 1 2 k = 2. −2x for x > 0. (ii) 3 P(1 < X ≤ 3) = = fX (x) dx 1 3 1 2 e−2x dx 3 1 = 2e−2x −2 = −e−2×3 + e−2×1 = 0. x −∞ 0 du FX (x) = 0 for x ≤ 0. for x > 0.132. When x ≤ 0. 1−e .
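These calculations can be verified numerically, either by integrating the p.d.f. directly or by noting that this fX is in fact the Exponential(2) density met in Section 4.3. A sketch:

    f <- function(x) 2 * exp(-2 * x)             # the p.d.f. found above (k = 2)

    integrate(f, lower = 1, upper = 3)$value     # P(1 < X <= 3), about 0.132
    pexp(3, rate = 2) - pexp(1, rate = 2)        # the same probability via the Exponential(2) c.d.f.
    integrate(f, lower = 0, upper = Inf)$value   # total area under the p.d.f. is 1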

FX (x): x FX (x) = −∞ fX (u) du. Then use: P(a ≤ X ≤ b) = FX (b) − FX (a) for any a.: b P(a ≤ X ≤ b) = fX (x) dx. curve is 1: ∞ −∞ fX (x) dx = 1. . Endpoints: DO NOT MATTER for continuous random variables: P(X ≤ a) = P(X < a) and P(X ≥ a) = P(X > a) . a 2.d.f.150 Total area under the p. The p.d. If you only need to calculate one probability P(a ≤ X ≤ b): integrate the p. If you will need to calculate several probabilities.f. is NOT a probability: fX (x) ≥ 0 always. 1. b.d. it is easiest to ﬁnd the distribution function.f. Calculating probabilities: but we do NOT require fX (x) ≤ 1.

which has led the Auckland Regional Council to assess current volcanic risk by assuming that volcanic eruptions in Auckland follow a Poisson 1 process with rate λ = 1000 volcanoes per year. The Poisson distribution was used to count the 13 Oct 2009? 9J 207 un 4? number of volcanoes that would occur in a ﬁxed space of time. This is because FX (x) = P(X ≤ x) gives us a probability. 1000 Nt is the number of volcanoes to have occurred by time t. (λt)n −λt e . The ﬁrst two eruptions occurred in the Auckland Domain and Albert Park — right underneath us! The most recent. FX (x). Distribution of the waiting time in the Poisson process The length of time between events in the Poisson process is called the waiting time. To ﬁnd the distribution of a continuous random variable. Suppose that {Nt : t > 0} forms a Poisson process with rate λ = We know that Nt ∼ Poisson(λt) .000 years or so.f.nz/environment/volcanoes-of-auckland/. starting from now.3 The Exponential distribution When will the next volcano erupt in Auckland? We never quite answered this question in Chapter 3. see: www. There have been about 20 eruptions in the last 20.000 years. unlike the p. We have not said how long we have to wait for the next volcano: this is a continuous random variable. 1 .arc.govt. eruption was Rangitoto. fX (x). we often work with the cumulative distribution function. For background information. about 600 years ago. n! so P(Nt = n) = .ov 5 N45? 3 2 4.d. Auckland Volcanoes About 50 volcanic eruptions have occurred in Auckland over the last 100. and biggest. We are comfortable with handling and manipulating probabilities.

Overall: FX (x) = P(X ≤ x) = 1 − e−λx 0 for x ≥ 0. This is the ﬁgure given by the Auckland Regional Council at the above web link (under ‘Future Hazards’). Example: What is the probability that there will be a volcanic eruption in Auckland within the next 50 years? Put λ = 1 1000 .049.152 Let X be a continuous random variable giving the number of years waited before the next volcano. There is about a 5% chance that there will be a volcanic eruption in Auckland over the next 50 years. starting now. We need P(X ≤ 50). for x < 0. (i) When x < 0: FX (x) = P(X ≤ x) = P( less than 0 time before next volcano) = 0. . (ii) When x ≥ 0: FX (x) = P(X ≤ x) = P(amount of time waited for next volcano is ≤ x) = P(there is at least one volcano between now and time x) = P(# volcanoes between now and time x is ≥ 1) = P(Nx ≥ 1) = 1 − P(Nx = 0) (λx)0 −λx e = 1− 0! = 1 − e−λx . We will derive an expression for FX (x). P(X ≤ 50) = FX (50) = 1 − e−50/1000 = 0. The distribution of the waiting time X is called the Exponential distribution because of the exponential formula for FX (x).
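This is a one-line calculation in R; pexp gives the distribution function 1 − e^(−λx) directly (a sketch):

    lambda <- 1/1000                # average of one volcano per 1000 years

    1 - exp(-lambda * 50)           # P(X <= 50) = 1 - e^(-50/1000), about 0.049
    pexp(50, rate = lambda)         # the same probability from R's Exponential c.d.f.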

n! e • Deﬁne X to be either the time till the ﬁrst event.153 The Exponential Distribution We have deﬁned the Exponential(λ) distribution to be the distribution of the waiting time (time between events) in a Poisson process with rate λ. or the time between any two events. P. or X ∼ Exp(λ).d. Let X ∼ Exponential(λ). or the time from now until the next event. Distribution function: FX (x) = P(X ≤ x) = 1 − e−λx for x ≥ 0.f. Note: λ > 0 always. • Nt ∼ Poisson(λt) . Probability density function: fX (x) = ′ FX (x) = λe−λx 0 for x ≥ 0. fX (x) Link with the Poisson process C. .. However.f. We write X ∼ Exponential(λ). the Exponential distribution has many other applications: it does not always have to arise from a Poisson process. Then: • Nt is the number of events to occur by time t.d. Let {Nt : t > 0} be a Poisson process with rate λ. 0 for x < 0. just like the Poisson distribution. so P(Nt = n) = (λt)n −λt .. for x < 0. Then X ∼ Exponential(λ). FX (x) = P(X ≤ x). X is called the waiting time of the process.

. memorylessness means that ‘old is as good as new’ — or. This property of the Exponential distribution is called memorylessness: • the distribution of the time from now until the ﬁrst event is the same as the distribution of the time from the start until the ﬁrst event: the time from the start till now has been forgotten! time from start to first event this time forgotten time from now to first event START NOW FIRST EVENT The Exponential distribution is famous for this memoryless property: it is the only memoryless distribution. or the time from now until the next event.Me Memorylessness We have said that the waiting time of the Poisson process can be deﬁned either as the time from the start to the ﬁrst event. ‘new is as bad as old’ ! A memoryless light bulb is quite likely to fail almost immediately. It is not always a desirable property: you don’t want a memoryless waiting time for your bus! The Exponential distribution is often used to model failure times of components: for example X ∼ Exponential(λ) is the amount of time before a light bulb fails. put another way. Memorylessness applies to any Poisson process. The derivation of the Exponential distribution was valid for all of them. because events occur at a constant average rate in the Poisson process. mo a si ry like eve ! zzz z All of these quantities have the same distribution: X ∼ Exponential(λ). In this case. or the time between any two events. memorylessness means that the 600 years we have waited since Rangitoto erupted have counted for nothing. The chance that we still have 1000 years to wait for the next eruption is the same today as it was 600 years ago when Rangitoto erupted. For volcanoes.

FY (y) = P(Y ≤ y) = P(X ≤ t + y | X > t) = P(X ≤ t + y AND X > t) P(X > t) (deﬁnition of conditional probability) P(t < X ≤ t + y) 1 − P(X ≤ t) FX (t + y) − FX (t) 1 − FX (t) = = (1 − e−λ(t+y) ) − (1 − e−λt ) = 1 − (1 − e−λt ) = e−λt − e−λ(t+y) e−λt e−λt (1 − e−λy ) = e−λt Thus the conditional probability of waiting time y extra. = 1 − e−λy . The time t already waited is forgotten. Also.e. given that we have already waited time t (say). Let Y be the amount of extra time waited for the event. given that we have already waited time t. is the same as the probability of waiting time y in total. i. because X is the total time waited. and Y is the time waited after time t.155 For private reading: proof of memorylessness Let X ∼ Exponential(λ) be the total time waited for an event. . This means we need to prove that Y ∼ Exponential(λ). This proves that Y is Exponential(λ) like X. First note that X = t+Y . we must condition on the event {X > t}. So P(Y ≤ y) = P(X ≤ t + y | X > t). because we know that we have already waited time t. We wish to prove that Y has the same distribution as X. that the time t already waited has been ‘forgotten’. So Y ∼ Exponential(λ) as required. Proof: We will work with FY (y) and prove that it is equal to 1 − e−λy .

. Xn are continuous random variables such that: • X1 . . dx • Although the notation fX (x) means something diﬀerent for continuous and discrete random variables. • λ is unknown. then the likelihood is fX (x1)fX (x2) . . • the observed value of X is x. Xn are INDEPENDENT.4 Likelihood and estimation for continuous random variables • For discrete random variables.156 4. we found the likelihood using the probability function. fX (x) = P(X = x). Note: Both discrete and continuous r.. • all the Xi s have the same p. λ. Example: Exponential likelihood Suppose that: • X ∼ Exponential(λ). we ﬁnd the likelihood using the probability density function. x) = fX (x) = λe−λx for 0 < λ < ∞.v. fX (x) = dFX . . it is used in exactly the same way for likelihood and estimation. .f. .d. Then the likelihood function is: L(λ . fX (xn). We estimate λ by setting dL ˆ = 0 to ﬁnd the MLE. . . fX (x). .s have the same deﬁnition for the cumulative distribution function: FX (x) = P(X ≤ x). dλ Two or more independent observations Suppose that X1 . . . • For continuous random variables.

0 0 0. i=1 Thus L(λ . x1. . to be the sample mean of x1. . λ = 1 . . xn) = λn e−λnx for 0 < λ < ∞. xn) = i=1 n fX (xi) λe−λxi i=1 n −λ n i=1 = = λ e xi Deﬁne x = 1 n n i=1 xi for 0 < λ < ∞. . λ = ∞. . 0. x The MLE of λ is ˆ 1 λ= . . . and Xi ∼ Exponential(λ) for all i. xn. . . Solve dL = 0 to ﬁnd the MLE of λ: dλ dL = nλn−1 e−λnx − λn × nx × e−λnx = 0 dλ nλn−1e−λnx (1 − λx) = 0 ⇒ λ = 0. . Find the maximum likelihood estimate of λ.157 Example: Suppose that X1. . 2 4 6 n lambda Solution: L(λ . .004 likelihood 0. 2). x1. . . So n xi = nx. X2 .002 Likelihood graph shown for λ = 2 and n = 10. . . . . Xn are independent. x1. . x10 generated by R command rexp(10. . x .
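The likelihood curve shown here was produced from simulated data of this kind; the sketch below repeats the idea (the seed is arbitrary, so the numbers will not reproduce the figure exactly):

    set.seed(210)
    dat <- rexp(10, rate = 2)       # n = 10 observations from Exponential(lambda = 2)

    1 / mean(dat)                   # the MLE lambda-hat = 1 / xbar, reasonably close to 2

    # the likelihood L(lambda) = lambda^n exp(-lambda * n * xbar), plotted over a grid:
    L <- function(lambda) lambda^length(dat) * exp(-lambda * sum(dat))
    curve(L, from = 0, to = 6, xlab = "lambda", ylab = "likelihood")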

Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 7) = 1 − FX (7). we would expect to see BIGGER values of X . Example with the Exponential distribution A very very old person observes that the waiting time from Rangitoto to the next volcanic eruption in Auckland is 1500 years. This is because X is the time between volcanoes. Suppose H0 : X ∼ Binomial(n = 10. the procedure for hypothesis testing is the same: • Use H0 to specify the distribution of X completely. .158 4. The only diﬀerence is: • endpoints matter for discrete random variables. Suppose H0 : X ∼ Exponential(2). • That is. Example: discrete. and λ is the rate at which volcanoes occur. but not for continuous ran- dom variables. A smaller value of λ means volcanoes occur less often. NOT smaller. and we have observed the value x = 7. and we have observed the value x = 7. Example: continuous. • Make observation x. Test the hypothesis that 1 1 λ = 1000 against the one-sided alternative that λ < 1000 . Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 6) = 1 − FX (6). so the time X between them is BIGGER. Other than this trap. p = 0. 1 Note: If λ < 1000 .5 Hypothesis tests Hypothesis tests for continuous random variables are just like hypothesis tests for discrete random variables. if H0 is true. ﬁnd the probability under the distribution speciﬁed by H0 of seeing an observation further out in the tails than the value x that we have seen. and oﬀer a one-tailed or two-tailed alternative hypothesis H1 .5). • Find the one-tailed or two-tailed p-value as the probability of seeing an observation at least as weird as what we have seen.

0006 1000 2000 x 3000 4000 5000 . So p − value = P(X ≥ 1500) = 1 − P(X ≤ 1500) = 1 − FX (1500) = 0. i. Values weirder than x = 1500 years: all values BIGGER than x = 1500.0002 0. 0.e. H0 : λ = H1 1 1000 1 : λ< 1000 one-tailed test Observation: x = 1500 years. p-value: P(X ≥ 1500) when X ∼ Exponential(λ = 1 1000 ). that volcanoes erupt once every 1000 years on average. 1/1000) There is no evidence against H0. The observation x = 1500 years is consistent with the hypothesis that λ = 1/1000.223. when X ∼ Exponential(λ = 1 1000 ) = 1 − (1 − e−1500/1000) R command: 1-pexp(1500.159 Hypotheses: Let X ∼ Exponential(λ).0010 Interpretation: f(x) 0 0 0.

Imagine fX (x) cut out of cardboard and balanced on a pencil. Expectation is also the balance point of fX (x) for both continuous and discrete X. ′ where fX (x) = FX (x) is the probability density function. then P(X = x) = 0 for all x. In fact.6 Expectation and variance Remember the expectation of a discrete random variable is the long-term av- erage: µX = E(X) = x xP(X = x) = x xfX (x). if X is continuous. replace the probability function with the probability density function. • a ‘sum’ of values multiplied by how common they are: xf (x) or xf (x) dx. . and replace µX = E(X) = x ∞ −∞ by ∞ −∞ : xfX (x) dx. we add in the value and multiply by the proportion of times we would expect to see that value: P(X = x).160 4. E(X) is: • the long-term average of X. Note: There exists no concept of a ‘probability function’ fX (x) = P(X = x) for continuous random variables. (For each value x. The idea behind expectation is the same for both discrete and continuous random variables.) For a continuous random variable.

fX (x) =P(X = x) Transform the values.) Variance If X is continuous. Var(X) = E(X ) − (EX) = 2 2 ∞ −∞ x2 fX (x)dx − (EX)2. X For a continuous random variable. ′ fX (x) =FX (x) (p.d.f. The second expression is usually easier (although not always). we can either compute the variance using Var(X) = E (X − µX ) or 2 = ∞ −∞ (x − µX )2fX (x)dx. leave the probability density alone. . its variance is deﬁned in exactly the same way as a discrete random variable: 2 Var(X) = σX = E (X − µX )2 = E(X 2) − µ2 = E(X 2) − (EX)2.161 Discrete: E(X) = x Continuous: xfX (x) E(X) = ∞ −∞ xfX (x) dx E(g(X)) = x g(x)fX (x) E(g(X)) = ∞ −∞ g(x)fX (x) dx Transform the values. leave the probabilities alone.

4. Y independent. Xn . and for constants a and b: • E(aX + b) =aE(X) + b. . Y independent. . it makes sense that E(X) = λ . + E(Xn). and X1 . the average time waited between events is 1 hour. . X. . if λ = 4 events per hour. For any random variables. The following statements are generally true only when X and Y are INDEPENDENT: • E(XY ) =E(X)E(Y ) when X . continuous or discrete. Note: If X is the waiting time for a Poisson process with rate λ events per year 1 (say). 4 .162 Properties of expectation and variance All properties of expectation and variance are exactly the same for continuous and discrete random variables. • Var(X + Y ) =Var(X) + Var(Y ) when X . • E(ag(X) + b) =aE(g(X)) + b. • Var(aX + b) =a2 Var(X). For example. • E(X1 + . then: E(X) = 1 λ Var(X) = 1 λ2 . • Var(ag(X) + b) =a2 Var(g(X)). . . • E(X + Y ) =E(X) + E(Y ).7 Exponential distribution mean and variance When X ∼ Exponential(λ). . + Xn ) =E(X1) + . Y . .

Proof: E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^∞ x λ e^(−λx) dx.

Integration by parts: recall that ∫ u (dv/dx) dx = uv − ∫ v (du/dx) dx.

Let u = x, and let dv/dx = λe^(−λx). Then du/dx = 1, and v = −e^(−λx).

So E(X) = ∫_0^∞ x λ e^(−λx) dx = [uv]_0^∞ − ∫_0^∞ v (du/dx) dx
        = [−x e^(−λx)]_0^∞ − ∫_0^∞ (−e^(−λx)) dx
        = 0 + [(−1/λ) e^(−λx)]_0^∞
        = (−1/λ) × 0 − (−1/λ) × e^0.

∴ E(X) = 1/λ.

Variance: Var(X) = E(X²) − (EX)² = E(X²) − 1/λ².

Now E(X²) = ∫_{−∞}^{∞} x² fX(x) dx = ∫_0^∞ x² λ e^(−λx) dx.

Let u = x², so du/dx = 2x, and let dv/dx = λe^(−λx), so v = −e^(−λx).

Then E(X²) = [uv]_0^∞ − ∫_0^∞ v (du/dx) dx
           = [−x² e^(−λx)]_0^∞ + ∫_0^∞ 2x e^(−λx) dx
           = 0 + (2/λ) ∫_0^∞ λ x e^(−λx) dx
           = (2/λ) × E(X)
           = 2/λ².

So Var(X) = E(X²) − (EX)² = 2/λ² − 1/λ² = 1/λ².

Var(X) = 1/λ².

so it divides the p. and variance . and Variance For any distribution: • the mean is the average that would be obtained if a large number of observations were drawn from the distribution. so to get a rough guess (not exact).5 each. median.0 0 0. (answers overleaf) 0.f. but it isn’t easy! Have a go at the examples below. • the variance is the average squared distance of observations from the mean.004 50 100 150 x 200 250 300 .008 Guess the mean. median. f(x) 0. Median.f. into two equal areas of 0. The mean is the point where the rod would have to be placed for the cardboard to balance. .164 Interlude: Guess the Mean. we should be able to guess roughly the distribution mean. • the median is the half-way point of the distribution: every observation has a 50-50 chance of being above the median or below the median.012 0.d. it is easiest to guess an average distance from the mean and square it. Given the probability density function of a distribution. . As a hint: • the mean is the balance-point of the distribution. • the variance is the average squared distance of an observation from the mean. is made of cardboard and balanced on a rod.d. • the median is the half-way point. Imagine that the p. and variance.

008 mean (90. the distribution shown is a Lognormal distribution. Variance=1.0.0 0 0. .0. f(x) 0. Answers are written below the graph.004 50 100 150 x 200 250 300 Notes: The mean is larger than the median. Out of interest. The variance is huge . .0 0 1 2 x 3 4 5 Answers: Median = 0. .6) 0.012 median (54.6 0.0) variance = (118) = 13924 2 0. it is quite believable that the average squared distance of an observation from the mean is 1182. Mean=1. Example 2: Try the same again with the example below.2 0.165 Answers: f(x) 0.8 1.693.0 0. This always happens when the distribution has a long right tail (positive skew) like this one.4 0. but when you look at the numbers along the horizontal axis.

or X ∼ U(a. if a ≤ x ≤ b. b). FX (x) 1 0 1 0 1 0 1 0 111 000 11111 00000 1 10 111 000 1 0 111 000 1 0 111 000 1 0 111 000 1 0 111 000 1 0 111 000 1 0 111 000 1 0 0 000000 1111111111 0000000000x 111111 a b Thus   0    1 FX (x) = x−a b−a if x < a. b] if X is equally likely to fall anywhere in the interval [a. if x > b. FX (x) x x FX (x) = −∞ fY (y) dy = a 1 dy b−a x a if a≤x≤b = = x−a b−a y b−a if a ≤ x ≤ b.8 The Uniform distribution X has a Uniform distribution on the interval [a. b]. b]. b]. fX (x) If X ∼ U [a. Probability density function. or X ∼ U[a. We write X ∼ Uniform[a. X ∼ Uniform(a. b). b]. then 1 b−a fX (x) =  0   if a ≤ x ≤ b. .166 4. otherwise. Equivalently. fX (x) 1 b−a 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 a b x Distribution function.


Mean and variance: If X ∼ Uniform[a, b], then

    E(X) = (a + b)/2   and   Var(X) = (b − a)²/12.

Proof:

    E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_a^b x · 1/(b − a) dx
         = 1/(b − a) · [x²/2]_a^b
         = 1/(b − a) · (b² − a²)/2
         = 1/(b − a) · (b − a)(b + a)/2
         = (a + b)/2.

    Var(X) = E[(X − µX)²] = ∫_a^b (x − µX)² · 1/(b − a) dx
           = 1/(b − a) · [(x − µX)³/3]_a^b
           = 1/(b − a) · [(b − µX)³ − (a − µX)³]/3.

But µX = EX = (a + b)/2, so b − µX = (b − a)/2 and a − µX = (a − b)/2.

So

    Var(X) = 1/(b − a) · [(b − a)³ − (a − b)³]/(2³ × 3)
           = [(b − a)³ + (b − a)³] / ((b − a) × 24)
           = (b − a)²/12.

Example: let X ∼ Uniform[0, 1]. Then

    fX(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise.

    µX = E(X) = (0 + 1)/2 = 1/2   (half-way through the interval [0, 1]).

    σ²X = Var(X) = (1/12)(1 − 0)² = 1/12.
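A quick simulation sketch (not from the notes) confirming these values for the Uniform[0, 1] example:

    # Check E(X) = 1/2 and Var(X) = 1/12 for X ~ Uniform[0, 1] by simulation.
    set.seed(2)
    x <- runif(1e6, min = 0, max = 1)
    mean(x)           # should be close to 0.5
    var(x); 1 / 12    # both close to 0.0833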


4.9 The Change of Variable Technique: finding the distribution of g(X)

Let X be a continuous random variable. Suppose:
• the p.d.f. of X, fX(x), is known;
• the r.v. Y is defined as Y = g(X) for some function g;
• we wish to find the p.d.f. of Y.

We use the Change of Variable technique.

Example: Let X ∼ Uniform(0, 1), and let Y = −log(X). The p.d.f. of X is fX(x) = 1 for 0 < x < 1. What is the p.d.f. of Y, fY(y)?

Change of variable technique for monotone functions

Suppose that g(X) is a monotone function R → R. This means that g is an increasing function, or g is a decreasing function. When g is monotone, it is invertible, or (1–1) (‘one-to-one’): that is, for every y there is a unique x such that g(x) = y. This means that the inverse function, g⁻¹(y), is well-defined as a function for a certain range of y. (When g : R → R, as it is here, then g can only be (1–1) if it is monotone.)
[Figure: the monotone function y = g(x) = x² plotted for 0 ≤ x ≤ 1, and its inverse x = g⁻¹(y) = √y plotted for 0 ≤ y ≤ 1.]

Change of Variable formula

Let g : R → R be a monotone function and let Y = g(X). Then the p.d.f. of Y = g(X) is

    fY(y) = fX(g⁻¹(y)) |d/dy g⁻¹(y)|.

Easy way to remember

Write   y = y(x) (= g(x))   and   x = x(y) (= g⁻¹(y)).

Then

    fY(y) = fX(x(y)) |dx/dy|.

Working for change of variable questions

1) Show you have checked that g(x) is monotone over the required range.
2) Write y = y(x) for x in <range of x>, e.g. for a < x < b.
3) So x = x(y) for y in <range of y>: for y(a) < y(x) < y(b) if y is increasing; for y(a) > y(x) > y(b) if y is decreasing.
4) Then |dx/dy| = <expression involving y>.
5) So fY(y) = fX(x(y)) |dx/dy| by the Change of Variable formula, = . . .

Quote the range of values of y as part of the FINAL answer.

Refer back to the question to find fX(x): you often have to deduce this from information like X ∼ Uniform(0, 1) or X ∼ Exponential(λ), or it may be given explicitly.

Example 1: Let X ∼ Uniform(0, 1), and let Y = −log(X). Find the p.d.f. of Y.

[Figure: the curve y = −log(x) for 0 < x < 1.]

1) y(x) = −log(x) is monotone decreasing, so we can apply the Change of Variable formula.
2) Let y = y(x) = −log x for 0 < x < 1.
3) Then x = x(y) = e^{−y} for −log(0) > y > −log(1), i.e. for 0 < y < ∞.
4) |dx/dy| = |d/dy (e^{−y})| = |−e^{−y}| = e^{−y} for 0 < y < ∞.
5) So fY(y) = fX(x(y)) |dx/dy| = fX(e^{−y}) e^{−y} for 0 < y < ∞.

But X ∼ Uniform(0, 1), so fX(x) = 1 for 0 < x < 1  ⇒  fX(e^{−y}) = 1 for 0 < y < ∞.

Thus fY(y) = fX(e^{−y}) e^{−y} = e^{−y} for 0 < y < ∞. So Y ∼ Exponential(1).

Note: There should be no x’s left in the answer! x(y) and |dx/dy| are expressions involving y only.

Note: In change of variable questions, you lose a mark for:
1. not stating that g(x) is monotone over the required range of x;
2. not giving the range of y for which the result holds, as part of the final answer (e.g. for 0 < y < ∞).
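A simulation sketch (not from the notes) confirming Example 1:

    # If X ~ Uniform(0,1) then Y = -log(X) should follow Exponential(1).
    set.seed(3)
    y <- -log(runif(1e5))
    hist(y, breaks = 60, probability = TRUE, main = "Y = -log(X)")
    curve(dexp(x, rate = 1), from = 0, to = 10, add = TRUE, lwd = 2)  # Exponential(1) p.d.f.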

Example 2: Let X be a continuous random variable with p.d.f.

    fX(x) = x³/4   for 0 < x < 2.

Let Y = 1/X. Find the probability density function of Y, fY(y).

The function y(x) = 1/x is monotone decreasing for 0 < x < 2, so we can apply the Change of Variable formula.

Let y = y(x) = 1/x for 0 < x < 2.
Then x = x(y) = 1/y for ∞ > y > 1/2, i.e. for 1/2 < y < ∞.
|dx/dy| = |−y⁻²| = 1/y² for 1/2 < y < ∞.

Change of variable formula:

    fY(y) = fX(x(y)) |dx/dy|
          = (1/4) (x(y))³ |dx/dy|
          = (1/4) × (1/y³) × (1/y²)
          = 1/(4y⁵)   for 1/2 < y < ∞.

Thus

    fY(y) = 1/(4y⁵)   for 1/2 < y < ∞,
            0         otherwise.
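A simulation sketch (not from the notes) of Example 2. Drawing from fX by inverse-CDF sampling is an extra step the notes do not use: here FX(x) = x⁴/16 on (0, 2), so X = 2U^{1/4} with U ∼ Uniform(0, 1).

    # Check fY(y) = 1/(4y^5) for Y = 1/X by simulation.
    set.seed(4)
    x <- 2 * runif(1e5)^(1/4)        # inverse-CDF draw from fX(x) = x^3/4 on (0,2)
    y <- 1 / x
    hist(y[y < 3], breaks = 80, probability = TRUE, main = "Y = 1/X")  # tail above 3 is negligible
    curve(1 / (4 * x^5), from = 0.5, to = 3, add = TRUE, lwd = 2)      # fY(y) = 1/(4y^5), y > 1/2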

For mathematicians: proof of the Change of Variable formula

Separate into cases where g is increasing and where g is decreasing.

i) g increasing

g is increasing if u < w ⇔ g(u) < g(w).   (⋆)

Note that putting u = g⁻¹(x) and w = g⁻¹(y), we obtain

    g⁻¹(x) < g⁻¹(y) ⇔ g(g⁻¹(x)) < g(g⁻¹(y)) ⇔ x < y,

so g⁻¹ is also an increasing function.

Now

    FY(y) = P(Y ≤ y) = P(g(X) ≤ y)
          = P(X ≤ g⁻¹(y))   (put u = X, w = g⁻¹(y) in (⋆))
          = FX(g⁻¹(y)).

So the p.d.f. of Y is

    fY(y) = d/dy FY(y)
          = d/dy FX(g⁻¹(y))
          = F′X(g⁻¹(y)) · d/dy (g⁻¹(y))   (Chain Rule)
          = fX(g⁻¹(y)) · d/dy (g⁻¹(y)).

Now g is increasing, so g⁻¹ is also increasing, so d/dy (g⁻¹(y)) > 0, and thus

    fY(y) = fX(g⁻¹(y)) |d/dy (g⁻¹(y))|   as required.

ii) g decreasing

g is decreasing if u < w ⇔ g(u) > g(w).   (⊛)

(Putting u = g⁻¹(x) and w = g⁻¹(y) in (⊛) gives g⁻¹(x) > g⁻¹(y) ⇔ x < y, so g⁻¹ is also decreasing.)

    FY(y) = P(Y ≤ y) = P(g(X) ≤ y)
          = P(X ≥ g⁻¹(y))   (put u = X, w = g⁻¹(y) in (⊛))
          = 1 − FX(g⁻¹(y)).

Thus the p.d.f. of Y is

    fY(y) = d/dy (1 − FX(g⁻¹(y))) = −fX(g⁻¹(y)) · d/dy g⁻¹(y).

This time, g is decreasing, so g⁻¹ is also decreasing, and thus

    −d/dy g⁻¹(y) = |d/dy g⁻¹(y)|.

So once again,

    fY(y) = fX(g⁻¹(y)) |d/dy g⁻¹(y)|.

4.10 Change of variable for non-monotone functions: non-examinable

Suppose that Y = g(X) and g is not monotone. We wish to find the p.d.f. of Y. We can sometimes do this by using the distribution function directly.

Example: Let X have any distribution, with distribution function FX(x). Let Y = X². Find the p.d.f. of Y.

Clearly, Y ≥ 0, so FY(y) = 0 if y < 0.

For y ≥ 0:

    FY(y) = P(Y ≤ y) = P(X² ≤ y)
          = P(−√y ≤ X ≤ √y)
          = FX(√y) − FX(−√y).

[Figure: the event {X² ≤ y} corresponds to −√y ≤ X ≤ √y on the X axis.]

So

    FY(y) = 0                   if y < 0,
            FX(√y) − FX(−√y)    if y ≥ 0.

So the p.d.f. of Y is

    fY(y) = d/dy FY(y) = d/dy (FX(√y)) − d/dy (FX(−√y))
          = (1/2) y^{−1/2} F′X(√y) + (1/2) y^{−1/2} F′X(−√y)
          = 1/(2√y) · (fX(√y) + fX(−√y))   for y ≥ 0.

∴ fY(y) = 1/(2√y) · (fX(√y) + fX(−√y))  for y ≥ 0, whenever Y = X².

Example: Let X ∼ Normal(0, 1). Find the p.d.f. of Y = X².

The p.d.f. of X is

    fX(x) = (1/√(2π)) e^{−x²/2}   for −∞ < x < ∞.

This is the familiar bell-shaped distribution (see later).

By the result above, Y = X² has p.d.f.

    fY(y) = 1/(2√y) · (1/√(2π)) (e^{−y/2} + e^{−y/2})
          = (1/√(2π)) y^{−1/2} e^{−y/2}   for y ≥ 0.

This is in fact the Chi-squared distribution with ν = 1 degree of freedom. The Chi-squared distribution is a special case of the Gamma distribution (see next section). This example has shown that if X ∼ Normal(0, 1), then Y = X² ∼ Chi-squared(df = 1).
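A simulation sketch (not from the notes) of this last result:

    # If X ~ Normal(0,1) then Y = X^2 should follow the Chi-squared(1) distribution.
    set.seed(5)
    y <- rnorm(1e5)^2
    hist(y, breaks = 100, probability = TRUE, xlim = c(0, 8), main = "Y = X^2")
    curve(dchisq(x, df = 1), from = 0.01, to = 8, add = TRUE, lwd = 2)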

4.11 The Gamma distribution

The Gamma(k, λ) distribution is a very flexible family of distributions. It is defined as the sum of k independent Exponential r.v.s:

    if X1, . . . , Xk ∼ Exponential(λ) and X1, . . . , Xk are independent,
    then X1 + X2 + . . . + Xk ∼ Gamma(k, λ).

Special Case: When k = 1, Gamma(1, λ) = Exponential(λ) (the sum of a single Exponential r.v.).

Probability density function, fX(x)

For X ∼ Gamma(k, λ),

    fX(x) = (λ^k / Γ(k)) x^{k−1} e^{−λx}   if x ≥ 0,
            0                              otherwise.

Here, Γ(k), called the Gamma function of k, is a constant that ensures fX(x) integrates to 1, i.e. ∫_0^∞ fX(x) dx = 1. It is defined as

    Γ(k) = ∫_0^∞ y^{k−1} e^{−y} dy.

When k is an integer, Γ(k) = (k − 1)!.

Mean and variance of the Gamma distribution:

For X ∼ Gamma(k, λ),   E(X) = k/λ   and   Var(X) = k/λ².

Relationship with the Chi-squared distribution

The Chi-squared distribution with ν degrees of freedom, χ²_ν, is a special case of the Gamma distribution:

    χ²_ν = Gamma(k = ν/2, λ = 1/2).

So if Y ∼ χ²_ν, then E(Y) = k/λ = ν, and Var(Y) = k/λ² = 2ν.
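A simulation sketch (not in the notes; k = 5 and λ = 2 are arbitrary illustrative values) of the sum-of-exponentials definition:

    # The sum of k independent Exponential(lambda) r.v.s should be Gamma(k, lambda).
    set.seed(6)
    k <- 5; lambda <- 2
    s <- replicate(1e4, sum(rexp(k, rate = lambda)))   # 10,000 sums of k exponentials
    hist(s, breaks = 60, probability = TRUE, main = "Sum of 5 Exponential(2) r.v.s")
    curve(dgamma(x, shape = k, rate = lambda), add = TRUE, lwd = 2)
    mean(s); k / lambda       # sample mean vs k/lambda = 2.5
    var(s);  k / lambda^2     # sample variance vs k/lambda^2 = 1.25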

Gamma p.d.f.s

[Figure: Gamma p.d.f.s for k = 1, 2 and 5. Notice the right skew (long right tail), and the flexibility in shape controlled by the 2 parameters.]

Distribution function, FX(x)

There is no closed form for the distribution function of the Gamma distribution. If X ∼ Gamma(k, λ), then FX(x) can only be calculated by computer.

[Figure: the corresponding Gamma distribution functions for k = 1, 2 and 5.]

Proof that E(X) = k/λ and Var(X) = k/λ² (non-examinable)

    E(X) = ∫_0^∞ x fX(x) dx
         = ∫_0^∞ x · (λ^k / Γ(k)) x^{k−1} e^{−λx} dx
         = (1/Γ(k)) ∫_0^∞ (λx)^k e^{−λx} dx
         = (1/Γ(k)) ∫_0^∞ y^k e^{−y} (1/λ) dy          (letting y = λx, dx = (1/λ) dy)
         = (1/λ) · Γ(k + 1)/Γ(k)
         = (1/λ) · k Γ(k)/Γ(k)                         (property of the Gamma function)
         = k/λ.

    Var(X) = E(X²) − (EX)²
           = ∫_0^∞ x² fX(x) dx − k²/λ²
           = ∫_0^∞ x² (λ^k / Γ(k)) x^{k−1} e^{−λx} dx − k²/λ²
           = (1/Γ(k)) ∫_0^∞ (1/λ)(λx)^{k+1} e^{−λx} dx − k²/λ²
           = (1/λ²) · (1/Γ(k)) ∫_0^∞ y^{k+1} e^{−y} dy − k²/λ²   (where y = λx, dx = (1/λ) dy)
           = (1/λ²) · Γ(k + 2)/Γ(k) − k²/λ²
           = (1/λ²) · (k + 1)k Γ(k)/Γ(k) − k²/λ²
           = k/λ².

Gamma distribution arising from the Poisson process

Recall that the waiting time between events in a Poisson process with rate λ has the Exponential(λ) distribution. That is, if Xi = time waited between event i − 1 and event i, then Xi ∼ Exp(λ).

The time waited from time 0 to the time of the kth event is X1 + X2 + . . . + Xk, the sum of k independent Exponential(λ) r.v.s. Thus the time waited until the kth event in a Poisson process with rate λ has the Gamma(k, λ) distribution.

Note: There are some similarities between the Exponential(λ) distribution and the (discrete) Geometric(p) distribution. Both distributions describe the ‘waiting time’ before an event. In the same way, the Gamma(k, λ) distribution is similar to the (discrete) Negative Binomial(k, p) distribution, as they both describe the ‘waiting time’ before the kth event.

4.12 The Beta Distribution: non-examinable

The Beta distribution has two parameters, α and β. We write X ∼ Beta(α, β).

P.d.f.:

    f(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1}   for 0 < x < 1,
           0                                   otherwise.

The function B(α, β) is the Beta function and is defined by the integral

    B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx,   for α > 0, β > 0.

It can be shown that

    B(α, β) = Γ(α)Γ(β)/Γ(α + β).

Chapter 5: The Normal Distribution and the Central Limit Theorem

The Normal distribution is the familiar bell-shaped distribution. It is probably the most important distribution in statistics, mainly because of its link with the Central Limit Theorem, which states that any large sum of independent, identically distributed random variables is approximately Normal:

    X1 + X2 + . . . + Xn ∼ approx Normal,   if X1, . . . , Xn are i.i.d. and n is large.

Before studying the Central Limit Theorem, we look at the Normal distribution and some of its general properties.

5.1 The Normal Distribution

The Normal distribution has two parameters: the mean, µ, and the variance, σ². µ and σ² satisfy −∞ < µ < ∞, σ² > 0. We write X ∼ Normal(µ, σ²), or X ∼ N(µ, σ²).

Probability density function, fX(x)

    fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}   for −∞ < x < ∞.

Distribution function, FX(x)

There is no closed form for the distribution function of the Normal distribution. If X ∼ Normal(µ, σ²), then FX(x) can only be calculated by computer.

R command: FX(x) = pnorm(x, mean=µ, sd=sqrt(σ²)).
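For example (the parameter values µ = 2 and σ² = 9 below are purely illustrative):

    # Evaluating Normal quantities in R for X ~ Normal(2, 9):
    pnorm(5, mean = 2, sd = sqrt(9))     # FX(5) = P(X <= 5)
    dnorm(5, mean = 2, sd = 3)           # the p.d.f. fX(5)
    qnorm(0.975, mean = 2, sd = 3)       # the 97.5% quantile of X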

[Figure: the Normal probability density function fX(x) (bell-shaped) and the distribution function FX(x) (S-shaped).]

Mean and Variance

For X ∼ Normal(µ, σ²),   E(X) = µ,   Var(X) = σ².

Linear transformations

If X ∼ Normal(µ, σ²), then for any constants a and b,

    aX + b ∼ Normal(aµ + b, a²σ²).

In particular,

    X ∼ Normal(µ, σ²)  ⇒  (X − µ)/σ ∼ Normal(0, 1).

Z ∼ Normal(0, 1) is called the standard Normal random variable.

Proof: Let a = 1/σ and b = −µ/σ. Then

    Z = aX + b = (X − µ)/σ ∼ Normal(aµ + b, a²σ²) = Normal(µ/σ − µ/σ, σ²/σ²) = Normal(0, 1).

General proof that aX + b ∼ Normal(aµ + b, a²σ²): Let X ∼ Normal(µ, σ²), and let Y = aX + b. We wish to find the distribution of Y. Use the change of variable technique.

1) y(x) = ax + b is monotone, so we can apply the Change of Variable technique.
2) Let y = y(x) = ax + b for −∞ < x < ∞.
3) Then x = x(y) = (y − b)/a for −∞ < y < ∞.
4) |dx/dy| = |1/a| = 1/|a|.
5) So

    fY(y) = fX(x(y)) |dx/dy| = fX((y − b)/a) · 1/|a|.   (⋆)

But X ∼ Normal(µ, σ²), so fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}. Thus

    fX((y − b)/a) = (1/√(2πσ²)) e^{−((y−b)/a − µ)²/(2σ²)}
                  = (1/√(2πσ²)) e^{−(y−(aµ+b))²/(2a²σ²)}.

Returning to (⋆),

    fY(y) = fX((y − b)/a) · 1/|a| = (1/√(2πa²σ²)) e^{−(y−(aµ+b))²/(2a²σ²)}   for −∞ < y < ∞.

But this is the p.d.f. of a Normal(aµ + b, a²σ²) random variable. So, if X ∼ Normal(µ, σ²), then aX + b ∼ Normal(aµ + b, a²σ²).

For mathematicians: properties of the Normal distribution

1. Sums of Normal random variables

If X and Y are independent, and X ∼ Normal(µ1, σ1²), Y ∼ Normal(µ2, σ2²), then

    X + Y ∼ Normal(µ1 + µ2, σ1² + σ2²).

More generally, if X1, X2, . . . , Xn are independent, and Xi ∼ Normal(µi, σi²) for i = 1, . . . , n, then

    a1X1 + a2X2 + . . . + anXn ∼ Normal((a1µ1 + . . . + anµn), (a1²σ1² + . . . + an²σn²)).

This result is non-trivial to prove.

Proof that ∫_{−∞}^{∞} fX(x) dx = 1

The full proof that

    ∫_{−∞}^{∞} fX(x) dx = ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx = 1

relies on the following result.

FACT: ∫_{−∞}^{∞} e^{−y²} dy = √π.

See Calculus courses for details. Using this result, the proof that ∫_{−∞}^{∞} fX(x) dx = 1 follows by using the change of variable y = (x − µ)/(√2 σ) in the integral.

2. Proof that E(X) = µ.

    E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_{−∞}^{∞} x (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx.

Change variable of integration: let z = (x − µ)/σ; then x = σz + µ and dx/dz = σ. Thus

    E(X) = ∫_{−∞}^{∞} (σz + µ) · (1/√(2πσ²)) · e^{−z²/2} · σ dz
         = ∫_{−∞}^{∞} (σz/√(2π)) e^{−z²/2} dz + µ ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

The first integrand is an odd function of z (i.e. g(−z) = −g(z)), so it integrates to 0 over the range −∞ to ∞. The second integral is the p.d.f. of N(0, 1), which integrates to 1.

Thus E(X) = 0 + µ × 1 = µ.

3. Proof that Var(X) = σ².

    Var(X) = E[(X − µ)²]
           = ∫_{−∞}^{∞} (x − µ)² (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx
           = σ² ∫_{−∞}^{∞} (1/√(2π)) z² e^{−z²/2} dz                          (putting z = (x − µ)/σ)
           = σ² { [−(1/√(2π)) z e^{−z²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz }   (integration by parts)
           = σ² {0 + 1}
           = σ².

5.2 The Central Limit Theorem (CLT)

also known as . . . the Piece of Cake Theorem.

The Central Limit Theorem (CLT) is one of the most fundamental results in statistics. In its simplest form, it states that if a large number of independent random variables are drawn from any distribution, then the distribution of their sum (or alternatively their sample average) always converges to the Normal distribution.

Theorem (The Central Limit Theorem): Let X1, . . . , Xn be independent r.v.s with mean µ and variance σ², from ANY distribution. For example, Xi ∼ Binomial(n, p) for each i, so µ = np and σ² = np(1 − p).

Then the sum Sn = X1 + . . . + Xn = Σ_{i=1}^n Xi has a distribution that tends to Normal as n → ∞.

The mean of the Normal distribution is E(Sn) = Σ_{i=1}^n E(Xi) = nµ.

The variance of the Normal distribution is

    Var(Sn) = Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) = nσ²,   because X1, . . . , Xn are independent.

So Sn = X1 + X2 + . . . + Xn → Normal(nµ, nσ²) as n → ∞.

Notes:
1. This is a remarkable theorem, because the limit holds for any distribution of X1, . . . , Xn.
2. A sufficient condition on X for the Central Limit Theorem to apply is that Var(X) is finite. Other versions of the Central Limit Theorem relax the conditions that X1, . . . , Xn are independent and have the same distribution.
3. The speed of convergence of Sn to the Normal distribution depends upon the distribution of X. Skewed distributions converge more slowly than symmetric Normal-like distributions. It is usually safe to assume that the Central Limit Theorem applies whenever n ≥ 30. It might apply for as little as n = 4.

The Central Limit Theorem in action: simulation studies

The following simulation study illustrates the Central Limit Theorem, making use of several of the techniques learnt in STATS 210. We will look particularly at how fast the distribution of Sn converges to the Normal distribution.

Example 1: Triangular distribution: fX(x) = 2x for 0 < x < 1.

[Figure: the triangular p.d.f. f(x) = 2x on 0 < x < 1.]

Find E(X) and Var(X):

    µ = E(X) = ∫_0^1 x fX(x) dx = ∫_0^1 2x² dx = [2x³/3]_0^1 = 2/3.

    σ² = Var(X) = E(X²) − {E(X)}²
       = ∫_0^1 x² fX(x) dx − (2/3)²
       = ∫_0^1 2x³ dx − 4/9
       = [2x⁴/4]_0^1 − 4/9
       = 1/18.

Let Sn = X1 + . . . + Xn where X1, . . . , Xn are independent. Then

    E(Sn) = E(X1 + . . . + Xn) = nµ = 2n/3,
    Var(Sn) = Var(X1 + . . . + Xn) = nσ² = n/18   by independence.

So Sn ∼ approx Normal(2n/3, n/18) for large n, by the Central Limit Theorem.

[Figure: histograms of 10,000 simulated values of Sn = X1 + . . . + Xn for n = 1, 2, 3 and 10, with the Normal p.d.f. Normal(nµ, nσ²) = Normal(2n/3, n/18) superimposed across the top.]

The graph shows histograms of 10 000 values of Sn = X1 + . . . + Xn for n = 1, 2, 3 and 10. Even for n as low as 10, the Normal curve provides a good approximation to Sn = X1 + . . . + Xn.

Example 2: U-shaped distribution: fX(x) = (3/2)x² for −1 < x < 1.

[Figure: the U-shaped p.d.f. f(x) = (3/2)x² on −1 < x < 1.]

We find that E(X) = µ = 0 and Var(X) = σ² = 3/5. (Exercise.)

Let Sn = X1 + . . . + Xn where X1, . . . , Xn are independent. Then

    E(Sn) = E(X1 + . . . + Xn) = nµ = 0,
    Var(Sn) = Var(X1 + . . . + Xn) = nσ² = 3n/5   by independence.

So Sn ∼ approx Normal(0, 3n/5) for large n, by the Central Limit Theorem.

[Figure: histograms of 10,000 simulated values of Sn for n = 1, 2, 3 and 10 from the U-shaped distribution, with the Normal curve superimposed.]

Even with this highly non-Normal distribution for X, the Normal curve is a very good approximation to Sn = X1 + . . . + Xn for n as small as 10.
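A sketch (not from the notes) of how histograms like those for Example 1 can be generated. Drawing from the triangular density by inverse-CDF sampling (FX(x) = x², so X = √U) is an assumption about method, not something the notes specify:

    # 10,000 values of Sn for the triangular density f(x) = 2x on (0,1), with n = 10.
    set.seed(7)
    n <- 10
    Sn <- replicate(10000, sum(sqrt(runif(n))))
    hist(Sn, breaks = 40, probability = TRUE, main = "Sn for n = 10")
    curve(dnorm(x, mean = 2 * n / 3, sd = sqrt(n / 18)), add = TRUE, lwd = 2)  # CLT approximation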

Normal approximation to the Binomial distribution, using the CLT

Let Y ∼ Binomial(n, p). We can think of Y as the sum of n Bernoulli random variables:

    Y = X1 + X2 + . . . + Xn,   where Xi = 1 if trial i is a “success” (prob = p), and 0 otherwise (prob = 1 − p).

Each Xi has µ = E(Xi) = p and σ² = Var(Xi) = p(1 − p). Thus, by the CLT,

    Y = X1 + X2 + . . . + Xn → Normal(nµ, nσ²) = Normal(np, np(1 − p)).

Thus,

    Bin(n, p) → Normal(np, np(1 − p))   as n → ∞ with p fixed,

where np is the mean of Bin(n, p) and np(1 − p) is its variance. The Binomial distribution is therefore well approximated by the Normal distribution when n is large, for any fixed value of p.

The Normal distribution is also a good approximation to the Poisson(λ) distribution when λ is large:

    Poisson(λ) → Normal(λ, λ)   when λ is large.
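A rough numerical illustration in R (the evaluation points 55 and 110 are arbitrary choices, mirroring the n = 100, p = 0.5 and λ = 100 cases pictured next):

    # Binomial and Poisson probabilities vs their Normal approximations.
    dbinom(55, size = 100, prob = 0.5)
    dnorm(55, mean = 100 * 0.5, sd = sqrt(100 * 0.5 * 0.5))   # Normal(np, np(1-p))
    dpois(110, lambda = 100)
    dnorm(110, mean = 100, sd = sqrt(100))                    # Normal(lambda, lambda)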

[Figure: the Binomial(n = 100, p = 0.5) and Poisson(λ = 100) probability functions, both closely matched by Normal curves.]

Why the Piece of Cake Theorem?

• The Central Limit Theorem makes whole realms of statistics into a piece of cake.
• After seeing a theorem this good, you deserve a piece of cake!

Example: Remember the margin of error for an opinion poll?

An opinion pollster wishes to estimate the level of support for Labour in an upcoming election. She interviews n people about their voting preferences. Let p be the true, unknown level of support for the Labour party in New Zealand. Let X be the number of the n people interviewed by the opinion pollster who plan to vote Labour. Then X ∼ Binomial(n, p).

At the end of Chapter 2, we said that the maximum likelihood estimator for p is

    p̂ = X/n.

In a large sample (large n), we now know that X ∼ approx Normal(np, npq), where q = 1 − p. So

    p̂ = X/n ∼ approx Normal(p, pq/n)

(linear transformation of a Normal r.v.).

So

    (p̂ − p) / √(pq/n) ∼ approx Normal(0, 1).

Now if Z ∼ Normal(0, 1), we find (using a computer) that the 95% central probability region of Z is from −1.96 to +1.96:

    P(−1.96 < Z < 1.96) = 0.95.

Check in R: pnorm(1.96, mean=0, sd=1) - pnorm(-1.96, mean=0, sd=1) ≃ 0.95.

Putting Z = (p̂ − p)/√(pq/n), we obtain

    P(−1.96 < (p̂ − p)/√(pq/n) < 1.96) ≃ 0.95.

Rearranging:

    P(p̂ − 1.96 √(pq/n) < p < p̂ + 1.96 √(pq/n)) ≃ 0.95.

This enables us to form an estimated 95% confidence interval for the unknown parameter p:

    estimated 95% confidence interval:  p̂ − 1.96 √(p̂(1 − p̂)/n)   to   p̂ + 1.96 √(p̂(1 − p̂)/n).

The 95% confidence interval has RANDOM end-points, which depend on p̂. About 95% of the time, these random end-points will enclose the true unknown value, p. Confidence intervals are extremely important for helping us to assess how useful our estimate is. A narrow confidence interval suggests a useful estimate (low variance); a wide confidence interval suggests a poor estimate (high variance).

Next time you see the newspapers quoting the margin of error on an opinion poll:

• Remember: margin of error = 1.96 √(p̂(1 − p̂)/n).
• Think: Central Limit Theorem!
• Have: a piece of cake.
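A short R sketch (the poll numbers, x = 350 Labour supporters out of n = 1000 interviewed, are hypothetical and not from the notes):

    # Estimated 95% confidence interval for p.
    n <- 1000; x <- 350
    p.hat <- x / n
    me <- 1.96 * sqrt(p.hat * (1 - p.hat) / n)   # margin of error
    c(p.hat - me, p.hat + me)                    # roughly 0.32 to 0.38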

Using the Central Limit Theorem to find the distribution of the mean, X̄

Let X1, . . . , Xn be independent, identically distributed with mean E(Xi) = µ and variance Var(Xi) = σ² for all i. The sample mean, X̄, is defined as:

    X̄ = (X1 + X2 + . . . + Xn)/n.

So X̄ = Sn/n, where Sn = X1 + . . . + Xn ∼ approx Normal(nµ, nσ²) by the CLT. Because X̄ is a scalar multiple of a Normal r.v. as n grows large, X̄ itself is approximately Normal for large n:

    X̄ = (X1 + X2 + . . . + Xn)/n ∼ approx Normal(µ, σ²/n)   as n → ∞.

The following three statements of the Central Limit Theorem are equivalent:

    X̄ = (X1 + X2 + . . . + Xn)/n ∼ approx Normal(µ, σ²/n)   as n → ∞;

    Sn = X1 + X2 + . . . + Xn ∼ approx Normal(nµ, nσ²)   as n → ∞;

    (Sn − nµ)/√(nσ²) = (X̄ − µ)/√(σ²/n) ∼ approx Normal(0, 1)   as n → ∞.

The essential point to remember about the Central Limit Theorem is that large sums or sample means of independent random variables converge to a Normal distribution, whatever the distribution of the original r.v.s.

More general version of the CLT

A more general form of the CLT states that, if X1, . . . , Xn are independent, and E(Xi) = µi, Var(Xi) = σi² (not necessarily all equal), then

    Zn = Σ_{i=1}^n (Xi − µi) / √(Σ_{i=1}^n σi²) → Normal(0, 1)   as n → ∞.

Other versions of the CLT relax the condition that X1, . . . , Xn are independent.

Chapter 6: Wrapping Up

Probably the two major ideas of this course are:

• likelihood and estimation;
• hypothesis testing.

Most of the techniques that we have studied along the way are to help us with these two goals: expectation, variance, distributions, change of variable, and the Central Limit Theorem. Let’s see how these different ideas all come together.

6.1 Estimators — the good, the bad, and the estimator PDF

We have seen that an estimator is a capital letter replacing a small letter. What’s the point of that?

Example: Let X ∼ Binomial(n, p) with known n and observed value X = x.
• The maximum likelihood estimate of p is p̂ = x/n.
• The maximum likelihood estimator of p is p̂ = X/n.

Example: Let X ∼ Exponential(λ) with observed value X = x.
• The maximum likelihood estimate of λ is λ̂ = 1/x.
• The maximum likelihood estimator of λ is λ̂ = 1/X.

Why are we interested in estimators? The answer is that estimators are random variables. This means they have distributions, means, and variances that tell us how well we can trust our single observation, or estimate, from this distribution.

Good and bad estimators

Suppose that X1, X2, . . . , Xn are independent, and Xi ∼ Exponential(λ) for all i. λ is unknown, and we wish to estimate it.

In Chapter 4 we calculated the maximum likelihood estimator of λ:

    λ̂ = n / (X1 + X2 + . . . + Xn) = n/T,   where T = X1 + . . . + Xn.

Now λ̂ is a random variable with a distribution. For a given value of n, we can calculate the p.d.f. of λ̂. How? We know that T = X1 + . . . + Xn ∼ Gamma(n, λ) when Xi ∼ i.i.d. Exponential(λ). So we know the p.d.f. of T, and λ̂ = n/T, so we can find the p.d.f. of λ̂ using the change of variable technique.

Here are the p.d.f.s of λ̂ for two different values of n:

• Estimator 1: n = 100. 100 pieces of information about λ.
• Estimator 2: n = 10. 10 pieces of information about λ.

[Figure: the p.d.f. of Estimator 1 (n = 100) is focused tightly about the true value λ (unknown); the p.d.f. of Estimator 2 (n = 10) is much more spread out.]

Clearly, the more information we have, the better. The p.d.f. for n = 100 is focused much more tightly about the true value λ (unknown) than the p.d.f. for n = 10.
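A simulation sketch (not from the notes; the true λ is set to 1 purely for illustration, since in practice it is unknown):

    # Sampling distributions of lambda-hat = n / (X1 + ... + Xn) for n = 10 and n = 100.
    set.seed(8)
    lambda <- 1
    lhat10  <- replicate(10000, 10  / sum(rexp(10,  rate = lambda)))
    lhat100 <- replicate(10000, 100 / sum(rexp(100, rate = lambda)))
    plot(density(lhat100), lwd = 2, main = "p.d.f.s of the estimator", xlab = "lambda-hat")
    lines(density(lhat10), lty = 2)   # n = 10: much more spread about the true lambda
    abline(v = lambda)                # the true (normally unknown) value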

It is important to recognise what we do and don’t know in this situation:

What we don’t know:
• the true λ;
• WHERE we are on the p.d.f. curve.

What we do know:
• the p.d.f. curve of our estimator;
• we know we’re SOMEWHERE on that curve.

[Figure: the two estimator p.d.f. curves again, with the true λ (unknown) marked.]

This is why we are so concerned with estimator variance. Because we don’t know where we are on the curve, we need an estimator such that EVERYWHERE on the estimator’s p.d.f. curve is guaranteed to be good.

A good estimator has low estimator variance: everywhere on the estimator’s p.d.f. curve is good. A poor estimator has high estimator variance: some places on the estimator’s p.d.f. curve may be good, while others may be very bad. Because we don’t know where we are on the curve, we can’t trust any estimate from this poor estimator. The estimator variance tells us how much the estimator can be trusted.

Note: We were lucky in this example to happen to know that T = X1 + . . . + Xn ∼ Gamma(n, λ) when Xi ∼ i.i.d. Exponential(λ), so we could find the p.d.f. of our estimator λ̂ = n/T. We won’t usually be so lucky: so what should we do? Use the Central Limit Theorem!

Example: calculating the maximum likelihood estimator

The following question is in the same style as the exam questions.

Let X be a continuous random variable with probability density function

    fX(x) = 2(s − x)/s²   for 0 < x < s,
            0             otherwise,

where s is the maximum value of X and s > 0. Here, s is a parameter to be estimated.

(a) Show that E(X) = s/3.
    Use E(X) = ∫_0^s x fX(x) dx = (2/s²) ∫_0^s (sx − x²) dx.

(b) Show that E(X²) = s²/6.
    Use E(X²) = ∫_0^s x² fX(x) dx = (2/s²) ∫_0^s (sx² − x³) dx.

(c) Find Var(X).
    Use Var(X) = E(X²) − (EX)².   Answer: Var(X) = s²/18.

(d) Suppose that we make a single observation X = x. Write down the likelihood function, L(s; x), and state the range of values of s for which your answer is valid.

    L(s; x) = 2(s − x)/s²   for x < s < ∞.

(e) The likelihood graph for a particular value of x is shown below. Show that the maximum likelihood estimator of s is ŝ = 2X. You should refer to the graph in your answer.

[Figure: the likelihood L(s; x) plotted against s, rising to a single maximum and then decreasing towards zero as s increases.]

L(s; x) = 2s⁻²(s − x), so

    dL/ds = 2(−2s⁻³(s − x) + s⁻²) = 2s⁻³(−2(s − x) + s) = (2/s³)(2x − s).

At the MLE, dL/ds = 0 ⇒ s = 2x or s = ∞. From the graph, we can see that s = ∞ is not the maximum. So ŝ = 2x.

Thus the maximum likelihood estimator is ŝ = 2X.

(f) Find the estimator variance, Var(ŝ), in terms of s. Hence find the estimated variance, in terms of ŝ.

    Var(ŝ) = Var(2X) = 2² Var(X) = 4 × s²/18   (by (c)),

so

    Var(ŝ) = 2s²/9.

So also: the estimated variance is 2ŝ²/9.
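A quick R sketch (not part of the exam question) of the likelihood for the single observation x = 3 used in part (g), confirming numerically that it peaks at s = 2x = 6:

    # Plot L(s; x) = 2(s - x)/s^2 and find its maximum.
    x <- 3
    L <- function(s) 2 * (s - x) / s^2
    curve(L, from = x, to = 20, xlab = "s", ylab = "Likelihood")
    optimize(L, interval = c(x, 100), maximum = TRUE)$maximum   # approximately 6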

(g) Suppose we make the single observation X = 3. Find the maximum likelihood estimate of s, and its estimated variance and standard error.

    ŝ = 2X = 2 × 3 = 6.
    Var(ŝ) = 2ŝ²/9 = (2 × 6²)/9 = 8.
    se(ŝ) = √Var(ŝ) = √8 = 2.82.

This means ŝ is a POOR estimator: the twice standard-error interval would be 6 − 2 × 2.82 to 6 + 2 × 2.82, that is, 0.36 to 11.64! (Taking the twice standard error interval strictly applies only to the Normal distribution, but it is a useful rule of thumb to see how ‘good’ the estimator is.)

(h) Write a sentence in plain English to explain what the maximum likelihood estimate from part (g) represents.

The value ŝ = 6 is the value of s under which the observation X = 3 is more likely than it is at any other value of s.

6.2 Hypothesis tests: in search of a distribution

When we do a hypothesis test, we need a test statistic: some random variable with a distribution that we can specify exactly under H0 and that differs under H1. It is finding the distribution that is the difficult part.

• Weird coin: is my coin fair?
  Let X be the number of heads out of 10 tosses. X ∼ Binomial(10, p). Easy.

• Too many daughters? Do divers have more daughters than sons?
  Let X be the number of daughters out of 190 diver children. X ∼ Binomial(190, p). We have an easy distribution and can do a hypothesis test.
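An illustrative sketch only (the observed counts, 7 heads out of 10 tosses and 112 daughters out of 190 diver children, are made up here and are not taken from the notes):

    # Exact Binomial hypothesis tests for the two examples above.
    binom.test(7, 10, p = 0.5)                               # is the coin fair?
    binom.test(112, 190, p = 0.5, alternative = "greater")   # more daughters than sons?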

• Too long between volcanoes?
  Let X be the length of time between volcanic eruptions. If we assume volcanoes occur as a Poisson process, then X ∼ Exponential(λ). We have a simple distribution and test statistic (X): we can test the observed length of time between eruptions and see whether it is a believable observation under a hypothesized value of λ.

More advanced tests

Most things in life are not as easy as the three examples above. Here are some observations. Do they come from a distribution (any distribution) with mean 0?

    [Data: a sample of observations, all within a few units of 0.]

Answer: yes, they do (Normal(0, 4)), but how can we tell? What about these?

    [Data: a second sample of observations, spread over a much wider range.]

Again, yes they do (Normal(0, 100) this time), but how can we tell? The unknown variance (4 versus 100) interferes, so that the second sample does not cluster about its mean of 0 at all.

If we don’t know that our data are Normal, and we don’t know their underlying variance, what can we use as our X to test whether µ = 0?

Answer: a clever person called W. S. Gossett (1876–1937) worked out an answer. He called himself only ‘Student’, possibly because he (or his employers) wanted it to be kept secret that he was doing his statistical research as part of his employment at Guinness Brewery. The test that ‘Student’ developed is the familiar Student’s t-test. It was originally developed to help Guinness decide how large a sample of people should be used in its beer tastings!
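A sketch of the kind of test described next, Student’s t-test, applied to a simulated stand-in sample (the original data values are not reproduced here):

    # Does a sample come from a distribution with mean 0, when the variance is unknown?
    set.seed(9)
    x <- rnorm(20, mean = 0, sd = 10)   # simulated stand-in for the second sample above
    t.test(x, mu = 0)                   # Student's t-test of H0: mu = 0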

Student used the following test statistic for the unknown mean, µ:

    T = (X̄ − µ) / √( Σ_{i=1}^n (Xi − X̄)² / (n(n − 1)) ).

Under H0 : µ = 0, the distribution of T is known: T has p.d.f.

    fT(t) = Γ(n/2) / (√((n − 1)π) Γ((n − 1)/2)) × (1 + t²/(n − 1))^{−n/2}   for −∞ < t < ∞.

T is the Student’s t-distribution, derived as the ratio of a Normal random variable and an independent Chi-squared random variable. If µ ≠ 0, observations of T will tend to lie out in the tails of this distribution.

Student’s T-statistic gives us a test for µ = 0 (or any other value) that can be used with any large enough sample. The Student’s t-test is exact when the distribution of the original data X1, . . . , Xn is Normal. For other distributions, it is still approximately valid in large samples, by the Central Limit Theorem.

It looks difficult

It is! Most of the statistical tests in common use have deep (and sometimes quite impenetrable) theory behind them. Student did not derive the distribution above without a great deal of hard work. The result, however, is astonishing. With the help of our best friend the Central Limit Theorem, the rest was easy.

The Chi-squared test for testing proportions in a contingency table also has a deep theory. In the Chi-squared goodness-of-fit test, the Pearson’s chi-square test statistic is shown to have a Chi-squared distribution under H0, by the Central Limit Theorem. It produces larger values under H1.

One interesting point to note is the pivotal role of the Central Limit Theorem in all of this. Normal random variables squared produce Chi-squared random variables. Normals divided by Chi-squareds produce t-distributed random variables. A ratio of two Chi-squared distributions produces an F-distributed random variable. The Central Limit Theorem produces approximate Normal distributions. All these things are not coincidental: the Central Limit Theorem rocks!