
Department of Statistics
University of Auckland

Contents

1. Probability
   1.1 Introduction
   1.2 Sample spaces
   1.3 Events
   1.4 Partitioning sets and events
   1.5 Probability: a way of measuring sets
   1.6 Probabilities of combined events
   1.7 The Partition Theorem
   1.8 Examples of basic probability calculations
   1.9 Formal probability proofs: non-examinable
   1.10 Conditional Probability
   1.11 Examples of conditional probability and partitions
   1.12 Bayes' Theorem: inverting conditional probabilities
   1.13 Chains of events and probability trees: non-examinable
   1.14 Simpson's paradox: non-examinable
   1.15 Equally likely outcomes and combinatorics: non-examinable
   1.16 Statistical Independence
   1.17 Random Variables
   1.18 Key Probability Results for Chapter 1

2. Discrete Probability Distributions
   2.1 Introduction
   2.2 The probability function, fX(x)
   2.3 Bernoulli trials
   2.4 Example of the probability function: the Binomial Distribution
   2.5 The cumulative distribution function, FX(x)
   2.6 Hypothesis testing
   2.7 Example: Presidents and deep-sea divers
   2.8 Example: Politicians and the alphabet
   2.9 Likelihood and estimation
   2.10 Random numbers and histograms
   2.11 Expectation
   2.12 Variable transformations
   2.13 Variance
   2.14 Mean and variance of the Binomial(n, p) distribution

3. Modelling with Discrete Probability Distributions
   3.1 Binomial distribution
   3.2 Geometric distribution
   3.3 Negative Binomial distribution
   3.4 Hypergeometric distribution: sampling without replacement
   3.5 Poisson distribution
   3.6 Subjective modelling

4. Continuous Random Variables
   4.1 Introduction
   4.2 The probability density function
   4.3 The Exponential distribution
   4.4 Likelihood and estimation for continuous random variables
   4.5 Hypothesis tests
   4.6 Expectation and variance
   4.7 Exponential distribution mean and variance
   4.8 The Uniform distribution
   4.9 The Change of Variable Technique: finding the distribution of g(X)
   4.10 Change of variable for non-monotone functions: non-examinable
   4.11 The Gamma distribution
   4.12 The Beta Distribution: non-examinable

5. The Normal Distribution and the Central Limit Theorem
   5.1 The Normal Distribution
   5.2 The Central Limit Theorem (CLT)

6. Wrapping Up
   6.1 Estimators — the good, the bad, and the estimator PDF
   6.2 Hypothesis tests: in search of a distribution

Chapter 1: Probability

1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies), e.g.

   P(event) = (number of times event occurs) / (number of opportunities for event to occur);

2. Subjective: probability represents a person's degree of belief that an event will occur, e.g. I think there is an 80% chance it will rain today, written as P(rain) = 0.80.

Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment. Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space. For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.

Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}.
An example of a sample point is: HT.

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}.

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}.

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are "gaps" between the different elements, or if the elements can be "listed", even if an infinite list (e.g. 1, 2, 3, . . .). In mathematical language, a sample space is discrete if it is countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:
Ω = {0, 1, 2, 3}  (discrete and finite)
Ω = {4.5, 4.6, 4.7}  (discrete, finite)
Ω = {HH, HT, TH, TT}  (discrete, finite)
Ω = {1, 2, 3, . . .}  (discrete, infinite)
Ω = [0°, 90°) and Ω = [90°, 360°)  (continuous)
Ω = [0, 1] = {all numbers between 0 and 1 inclusive}  (continuous)

1.3 Events

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics. How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment. This might seem unexciting. However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory, the approach developed by one of the founders of probability theory, Kolmogorov (1903-1987).

The next concept that you would need to formulate is that of something that happens at random, or an event. How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space. That is, any collection of outcomes forms an event. A is a subset of Ω: we write A ⊂ Ω.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.
Let event A be the event that there is exactly one head. We write: A = "exactly one head". Then A = {HT, TH}.

Note: Ω is a subset of itself, as in the definition, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.

Example: Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (6, 6)}.
Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today.

[Venn diagram: events Bus, Car, Walk, Bike, Train drawn inside the sample space "People in class".]

This sort of diagram representing events in a sample space is called a Venn diagram.

Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on. We do this in the language of set theory.

1. Alternatives: the union 'or' operator

We wish to describe an event that is composed of several different alternatives. For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both. We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus UNION Car". The union operator, ∪, denotes Bus OR Car OR both.

Definition: Let A and B be events on the same sample space Ω: so A ⊂ Ω and B ⊂ Ω. The union of events A and B is written A ∪ B, and is given by

A ∪ B = {s : s ∈ A or s ∈ B or both}.

[Venn diagram: the union of Bus and Car shaded inside "People in class".]

To shade the union of Bus and Car, we shade all outcomes in 'Bus' and all outcomes in 'Car'. Overall, we have shaded all outcomes in the UNION of Bus and Car.

Note: Be careful not to confuse 'Or' and 'And'. To shade the union of Bus and Car, we had to shade everything in Bus AND everything in Car. To remember whether union refers to 'Or' or 'And', you have to consider what an outcome needs to satisfy for the shaded event to occur. The answer is Bus, OR Car, OR both: NOT Bus AND Car. To represent the set of journeys involving both alternatives, we shade only the overlap instead.

2. Concurrences and coincidences: the intersection 'and' operator

The intersection is an event that occurs when two or more events ALL occur together. For example, consider the event that your journey today involved BOTH a car AND a train. We write the event that you used both car and train as Car ∩ Train, read as "Car INTERSECT Train". The intersection operator, ∩, denotes both Car AND Train together.

Definition: The intersection of events A and B is written A ∩ B and is given by

A ∩ B = {s : s ∈ A AND s ∈ B}.

[Venn diagram: the overlap of Car and Train shaded inside "People in class".]

To represent this event, we shade all outcomes in the overlap of Car and Train.

3. Opposites: the complement or 'not' operator

The complement of an event is the opposite of the event: whatever the event was, it didn't happen. For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.

[Venn diagram: everything in "People in class" except the event Walk is shaded.]

We write the event 'not Walk' as the complement of Walk, Walk̄.

Definition: The complement of event A is written Ā and is given by Ā = {s : s ∉ A}.

Examples: Experiment: Pick a person in this class at random.
Sample space: Ω = {all people in class}.
Let event A = "person is male" and event B = "person travelled by bike today".
Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.   2) B: No.   3) Ā: No.   4) B̄: Yes.
5) Ā ∪ B = {female or bike rider or both}: No.
6) A ∩ B̄ = {male and non-biker}: Yes.
7) A ∩ B = {male and bike rider}: No. A ∩ B did not occur.
8) The complement of A ∩ B = everything outside A ∩ B: Yes, A ∩ B did not occur, so its complement did occur.

Question: What is the event Ω̄?  Ω̄ = ∅.

Challenge: can you express A ∩ B using only a ∪ sign? Answer: A ∩ B is the complement of (Ā ∪ B̄).

Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events, although they are not used to provide formal proofs. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

[Venn diagrams: (a) A ∪ B ∪ C and (b) A ∩ B ∩ C shaded, for three events A, B, C in Ω.]

Properties of union, intersection, and complement

The following properties hold.

(i) The complement of ∅ is Ω, and the complement of Ω is ∅.

(ii) For any event A, A ∩ Ā = ∅, and A ∪ Ā = Ω.

(iii) Commutative: for any events A and B, A ∪ B = B ∪ A, and A ∩ B = B ∩ A.

(iv) (a) The complement of (A ∪ B) is Ā ∩ B̄.
     (b) The complement of (A ∩ B) is Ā ∪ B̄.

[Venn diagrams illustrating (a) and (b).]

Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then a × (b + c) = a × b + a × c. However, addition is not distributive over multiplication: a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union. Thus, for any sets A, B, and C:

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C), and
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

[Venn diagrams illustrating the two distributive laws.]

More generally, for several events A and B1, B2, . . . , Bn,

A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn),
i.e. A ∪ (∩_{i=1}^{n} Bi) = ∩_{i=1}^{n} (A ∪ Bi),

and

A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn),
i.e. A ∩ (∪_{i=1}^{n} Bi) = ∪_{i=1}^{n} (A ∩ Bi).

1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅. This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa: A EXCLUDES B from happening.

[Venn diagram: two non-overlapping events A and B in Ω.]

Note: Does this mean that A and B are independent? No: quite the opposite, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

[Venn diagram: mutually exclusive events A1, A2, A3 in Ω.]

Definition: A partition of the sample space Ω is a collection of mutually exclusive events whose union is Ω. That is, sets B1, B2, . . . , Bk form a partition of Ω if Bi ∩ Bj = ∅ for all i, j with i ≠ j, and

∪_{i=1}^{k} Bi = B1 ∪ B2 ∪ . . . ∪ Bk = Ω.

Examples: B1, B2, B3, B4 form a partition of Ω; B1, . . . , B5 partition Ω.

[Venn diagrams: Ω divided into non-overlapping pieces B1, B2, B3, B4, and into B1, . . . , B5.]

Important: B and B̄ partition Ω for any event B.

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω. If B1, B2, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

[Venn diagram: event A cut into pieces A ∩ B1, . . . , A ∩ B4 by a partition B1, . . . , B4 of Ω.]

We will see that this is very useful for finding the probability of event A. This is because it is often easier to find the probability of small 'chunks' of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.

1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow 'measuring chance'. Chance of what? The answer is sets. It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. 'Probability', the name that we give to our chance-measure, is a way of measuring sets.

If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains? We have the same question when setting out to measure chance.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it? In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!). But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they be the same probability?

First set: {Lions win}.  Second set: {All Blacks win}.

Both sets have just one element, but we definitely need to give them different probabilities!

More problems arise when the sets are infinite or continuous. Should the intervals [3, 4] and [13, 14] be the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20] — but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.

Most of this course is about probability distributions.

A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space. At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution: P(Lions win) = 0.01, P(All Blacks win) = 0.99.

In general, we have the following definition for discrete sample spaces.

Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space. A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:
1. 0 ≤ pi ≤ 1 for all i;
2. Σi pi = 1.

We write pi = P(si). pi is called the probability of the event that the outcome is si.

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i∈A} pi.

E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
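This summation rule is easy to mirror in code. The following is a minimal illustrative sketch in Python (added here, not part of the original notes), using the rugby example's probabilities; the helper name prob_of_event is invented for illustration:

    # A discrete probability distribution: each sample point gets a probability in [0, 1],
    # and the probabilities sum to 1.
    distribution = {"Lions win": 0.01, "All Blacks win": 0.99}

    def prob_of_event(dist, event):
        """P(A) = sum of the probabilities of the sample points in the event A."""
        return sum(dist[s] for s in event)

    assert abs(sum(distribution.values()) - 1.0) < 1e-12   # total probability is 1
    print(prob_of_event(distribution, {"All Blacks win"}))  # 0.99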

Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we can not list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course. However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω. The challenge of defining a probability distribution on a continuous sample space is left till later.

Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.
Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.
Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then
         P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

Note: The axioms can never be 'proved': they are definitions. If our rule for 'measuring sets' satisfies the three axioms, it is a valid probability distribution. It should be clear that the definitions given above for discrete sample spaces will satisfy the axioms.

Note: P(∅) = 0.

Note: Remember that an EVENT is a SET: an event is a subset of the sample space.

1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:
1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.
2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula. For ANY events A, B:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Explanation

For any events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The formal proof of this formula is in Section 1.9 (non-examinable). To understand the formula, think of the Venn diagrams: when we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B):

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

2. Probability of an intersection

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.16). If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

P(Ā) = 1 − P(A). This is obvious, but a formal proof is given in Section 1.9.
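As a quick numerical sanity check (an illustrative sketch added here, not part of the original notes), the union formula can be verified by enumerating equally likely outcomes for the two-dice example of Section 1.3; the second event, B = "first die shows 1", is chosen here purely for illustration:

    from fractions import Fraction

    # Equally likely outcomes for two dice.
    omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

    def prob(event):
        return Fraction(len(event), len(omega))

    A = {s for s in omega if s[0] + s[1] == 5}   # "sum of two faces is 5"
    B = {s for s in omega if s[0] == 1}          # "first die shows 1" (illustrative choice)

    # P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
    assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
    print(prob(A | B))   # 1/4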

1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, . . . , Bm which together cover everything in Ω. Also, if B1, . . . , Bm form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bm) form a partition of the set or event A.

[Venn diagram: Ω partitioned into B1, B2, B3, B4, with event A cut into pieces A ∩ B1, A ∩ B2, A ∩ B3, A ∩ B4.]

The probability of event A is therefore the sum of its parts:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

The Partition Theorem is a mathematical way of saying the whole is the sum of its parts.

Theorem 1.7: The Partition Theorem. Let B1, . . . , Bm form a partition of Ω. Then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

(Proof in Section 1.9.)

Note: Recall the formal definition of a partition. Sets B1, . . . , Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ∪_{i=1}^{m} Bi = Ω.


1.8 Examples of basic probability calculations

300 Australians were asked about their car preferences in 1998. Of the respondents, 33% had children. The respondents were asked what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car. Find the probability that a randomly chosen respondent: (a) would choose a large car; (b) either has children or would choose a large car (or both).

First formulate events: Let

C = "has children";  C̄ = "no children";  L = "chooses large car".

Next write down all the information given:

P(C) = 0.33;  P(C ∩ L) = 0.13;  P(C̄ ∩ L) = 0.12.

(a) Asked for P(L).

P(L) = P(L ∩ C) + P(L ∩ C̄) = P(C ∩ L) + P(C̄ ∩ L) = 0.13 + 0.12 = 0.25, so P(chooses large car) = 0.25.

(Partition Theorem)

(b) Asked for P(L ∪ C).

P(L ∪ C) = P(L) + P(C) − P(L ∩ C) = 0.25 + 0.33 − 0.13 = 0.45.

(Section 1.6)


Respondents were also asked their opinions on car reliability and fuel consumption. 84% of respondents considered reliability to be of high importance, while 40% considered fuel consumption to be of high importance. Formulate events: R = "considers reliability of high importance", F = "considers fuel consumption of high importance". (c) What is P(R̄)? (d) What is P(R ∩ F)?

Information given: P(R) = 0.84;  P(F) = 0.40.

(c) P(R̄) = 1 − P(R) = 1 − 0.84 = 0.16.

(d) We cannot calculate P(R ∩ F) from the information given.

(e) Given the further information that 12% of respondents considered neither reliability nor fuel consumption to be of high importance, ﬁnd P(R ∪ F ) and P(R ∩ F ).

Information given: P(R̄ ∩ F̄) = 0.12.

[Venn diagram: events R and F, with the region outside R ∪ F (respondents who consider neither of high importance) shaded.]

Thus

P(R ∪ F) = 1 − P(R̄ ∩ F̄) = 1 − 0.12 = 0.88.

This is the probability that the respondent considers either reliability or fuel consumption, or both, of high importance.

P(R ∩ F) = P(R) + P(F) − P(R ∪ F) = 0.84 + 0.40 − 0.88 = 0.36.

(Section 1.6)

Probability that respondent considers BOTH reliability AND fuel consumption of high importance.


(f) Find the probability that a respondent considered reliability, but not fuel consumption, of high importance.

P(R ∩ F̄) = P(R) − P(R ∩ F) = 0.84 − 0.36 = 0.48.

(Partition Theorem)
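The survey calculations above are easy to reproduce programmatically. The following sketch (an illustration added here, not from the original notes) encodes the given percentages and applies the Partition Theorem and the union formula:

    # Given information from the survey (as probabilities).
    p_C = 0.33                 # P(C): respondent has children
    p_C_and_L = 0.13           # P(C ∩ L): has children and chooses large car
    p_notC_and_L = 0.12        # P(C̄ ∩ L): no children and chooses large car
    p_R, p_F = 0.84, 0.40
    p_neither_R_nor_F = 0.12   # P(R̄ ∩ F̄)

    p_L = p_C_and_L + p_notC_and_L         # Partition Theorem: ≈ 0.25
    p_L_or_C = p_L + p_C - p_C_and_L       # union formula:     ≈ 0.45
    p_R_or_F = 1 - p_neither_R_nor_F       # complement:        ≈ 0.88
    p_R_and_F = p_R + p_F - p_R_or_F       # union formula:     ≈ 0.36
    p_R_and_notF = p_R - p_R_and_F         # Partition Theorem: ≈ 0.48

    print(p_L, p_L_or_C, p_R_or_F, p_R_and_F, p_R_and_notF)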

1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.

(i) P(∅) = 0.
(ii) P(Ā) = 1 − P(A) for any event A.
(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,
      P(A) = Σ_{i=1}^{m} P(A ∩ Bi).
(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A, B.

Proof:

i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive).
So P(A) = P(A ∪ ∅) = P(A) + P(∅) by Axiom 3, which gives P(∅) = 0.

ii) Ω = A ∪ Ā, and A ∩ Ā = ∅ (mutually exclusive).
So, by Axioms 1 and 3, 1 = P(Ω) = P(A ∪ Ā) = P(A) + P(Ā).

iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ∪_{i=1}^{m} Bi = Ω.
Thus (A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅ for i ≠ j, i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also. So

Σ_{i=1}^{m} P(A ∩ Bi) = P( ∪_{i=1}^{m} (A ∩ Bi) )    (Axiom 3)
                      = P( A ∩ (∪_{i=1}^{m} Bi) )    (Distributive laws)
                      = P(A ∩ Ω)
                      = P(A).

iv) A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω)                          (Set theory)
          = (A ∩ (B ∪ B̄)) ∪ (B ∩ (A ∪ Ā))              (Set theory)
          = (A ∩ B) ∪ (A ∩ B̄) ∪ (B ∩ A) ∪ (B ∩ Ā)      (Distributive laws)
          = (A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B).

These 3 events are mutually exclusive: e.g. (A ∩ B) ∩ (A ∩ B̄) = A ∩ (B ∩ B̄) = A ∩ ∅ = ∅, etc. So, by Axiom 3,

P(A ∪ B) = P(A ∩ B) + P(A ∩ B̄) + P(Ā ∩ B)
         = P(A ∩ B) + [P(A) − P(A ∩ B)] + [P(B) − P(A ∩ B)]    (from (iii), using the partitions B, B̄ and A, Ā)
         = P(A) + P(B) − P(A ∩ B).
1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem. Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.
Let event A = "get a 6". Let event B = "get an even number".
If the die is fair, then P(A) = 1/6 and P(B) = 1/2.
However, if we know that B has occurred, then there is an increased chance that A has occurred:

P(A occurs given that B has occurred) = P(result 6 | result 2 or 4 or 6) = 1/3.

We write P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?
P(B | A) = P(B occurs, given that A has occurred) = P(get an even number, given that we know we got a 6) = 1.

Conditioning as reducing the sample space

The car survey in Section 1.8 also asked respondents which they valued more highly in a car: ease of parking, or style/prestige. Here are the responses:

                                           Male   Female   Total
  Prestige more important than parking      79      51      130
  Prestige less important than parking      71      99      170
  Total                                    150     150      300

Suppose we pick a respondent at random from all those in the table. Let event A = "respondent thinks that prestige is more important". Then

P(A) = (# A's) / (total # respondents) = 130/300 = 0.43.

However, this probability differs between males and females. Suppose we reduce our sample space from Ω = {all people in table} to B = {all males in table}. Then

P(respondent thinks prestige is more important, given that respondent is male)
  = (# males who favour prestige) / (total # males)
  = (# male A's) / (# males)
  = 79/150 = 0.53.

We write: P(A | B) = 0.53.

We could follow the same working for any pair of events, A and B:

P(A | B) = (# B's who are A) / (total # B's)
         = (# in table who are BOTH B and A) / (# B's)
         = [ (# in B AND A) / (# in Ω) ] / [ (# in B) / (# in Ω) ]
         = P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of B's only). P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space. Conditioning on event B means changing the sample space to B. Think of P(A | B) as the chance of getting an A, from the set of B's only.
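The "reducing the sample space" view can be checked directly against the survey table. Here is a small sketch (illustrative only, not part of the original notes) that computes P(A | B) both by counting within B and as P(A ∩ B)/P(B):

    # Counts from the prestige-versus-parking table.
    counts = {("prestige", "male"): 79, ("prestige", "female"): 51,
              ("parking",  "male"): 71, ("parking",  "female"): 99}
    total = sum(counts.values())   # 300

    # P(A | B) by restricting the sample space to B = males.
    males = {k: v for k, v in counts.items() if k[1] == "male"}
    p_A_given_B_counts = counts[("prestige", "male")] / sum(males.values())

    # P(A | B) as P(A ∩ B) / P(B).
    p_A_and_B = counts[("prestige", "male")] / total
    p_B = sum(males.values()) / total
    p_A_given_B_ratio = p_A_and_B / p_B

    print(p_A_given_B_counts, p_A_given_B_ratio)   # both ≈ 0.53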

The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1. This indicates that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω. If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B). The symbol P( · | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) − P(C ∩ D), so
P(C ∪ D | B) = P(C | B) + P(D | B) − P(C ∩ D | B).

Trick for checking conditional probability calculations: A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true.

For example, is P(A | B) + P(Ā | B) = 1?
Answer: Replace B by Ω: this gives P(A | Ω) + P(Ā | Ω) = P(A) + P(Ā) = 1. So yes, P(A | B) + P(Ā | B) = 1 for any other sample space B.

Is P(A | B) + P(A | B̄) = 1?
Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̄. The expression is NOT true, and in fact it doesn't make sense to try to add together probabilities from two different sample spaces.

The Multiplication Rule

For any events A and B,

P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof: Immediate from the definitions:
P(A | B) = P(A ∩ B)/P(B) ⇒ P(A ∩ B) = P(A | B)P(B), and
P(B | A) = P(B ∩ A)/P(A) ⇒ P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).

New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: If B1, . . . , Bm partition Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi) = Σ_{i=1}^{m} P(A | Bi)P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^{m} P(A | Bi)P(Bi).

Warning: Be careful to use this new version of the Partition Theorem correctly: it is

P(A) = P(A | B1)P(B1) + . . . + P(A | Bm)P(Bm),

NOT P(A) = P(A | B1) + . . . + P(A | Bm).

Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.) Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it. Conditioning on an event is like pretending that you know that the event has happened. For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do — and add up the different possibilities:

P(work on time) = P(work on time | fine) × P(fine) + P(work on time | wet) × P(wet).

1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4. The sample space can be written as Ω = {bus journeys}. We can formulate events as follows: T = "on time"; L = "late". From the information given, the events have probabilities: P(T) = 0.6; P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.


The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded. (b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = “crowded”, N =“noisy”. C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N .

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C. P(C | T ) = 0.5; P(C | L) = 0.7.

P(N | C) = 0.8;

P(N | C̄) = 0.4.

(d) Find the probability that the bus is crowded.

P(C) = P(C | T )P(T ) + P(C | L)P(L) = 0.5 × 0.6 + 0.7 × 0.4 = 0.58.

(Partition Theorem)

(e) Find the probability that the bus is noisy.

P(N) = P(N | C)P(C) + P(N | C̄)P(C̄) = 0.8 × 0.58 + 0.4 × (1 − 0.58) = 0.632.

(Partition Theorem)
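The two partition-theorem calculations for Tom's bus can be written out in a few lines of code. This sketch is an added illustration (not part of the coursebook):

    # Given probabilities for Tom's bus journeys.
    p_T, p_L = 0.6, 0.4                   # on time, late (a partition of Ω)
    p_C_given_T, p_C_given_L = 0.5, 0.7
    p_N_given_C, p_N_given_notC = 0.8, 0.4

    # Partition Theorem, conditional form: P(C) = P(C|T)P(T) + P(C|L)P(L).
    p_C = p_C_given_T * p_T + p_C_given_L * p_L             # 0.58

    # Partition over C and its complement: P(N) = P(N|C)P(C) + P(N|C̄)P(C̄).
    p_N = p_N_given_C * p_C + p_N_given_notC * (1 - p_C)    # 0.632

    print(p_C, p_N)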


1.12 Bayes’ Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:

P(B | A)P(A) = P(A | B)P(B).

Thus

P(B | A) = P(A | B)P(B) / P(A).    (⋆)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702–61), English clergyman and founder of Bayesian Statistics.

Bayes' Theorem allows us to "invert" the conditioning, i.e. to express P(B | A) in terms of P(A | B). This is very useful. For example, it might be easy to calculate P(later event | earlier event), but we might only observe the later event and wish to deduce the probability that the earlier event occurred, P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,

P(Bj | A) = P(A | Bj)P(Bj) / Σ_{i=1}^{m} P(A | Bi)P(Bi).    (Bayes' Theorem)

Proof: Immediate from (⋆) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^{m} P(A | Bi)P(Bi).


Special case of Bayes' Theorem when m = 2: use B and B̄ as the partition of Ω. Then

P(B | A) = P(A | B)P(B) / [ P(A | B)P(B) + P(A | B̄)P(B̄) ].

Example: The case of the Perﬁdious Gardener. Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perﬁdious gardener who will fail to water the rosebush with probability 2/3. Smith returns from holiday to ﬁnd the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?

Solution:

First step: formulate events. Let:
D = "rosebush dies";
W = "gardener waters rosebush";
W̄ = "gardener fails to water rosebush".

Second step: write down all information given:

P(D | W) = 1/2;  P(D | W̄) = 3/4;  P(W̄) = 2/3  (so P(W) = 1/3).

Third step: write down what we're looking for: P(W̄ | D).

Fourth step: compare this to what we know. We need to invert the conditioning, so use Bayes' Theorem:

P(W̄ | D) = P(D | W̄)P(W̄) / [ P(D | W̄)P(W̄) + P(D | W)P(W) ]
          = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3)
          = 3/4.

So the gardener failed to water the rosebush with probability 3/4.

Example: The case of the Defective Ketchup Bottle. Ketchup bottles are produced in 3 diﬀerent factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup ﬁnds a defective bottle in her wig. What is the probability that it came from Factory 1?

Solution:

1. Events: let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".

2. Information given:

P(F1) = 0.5;  P(F2) = 0.3;  P(F3) = 0.2;
P(D | F1) = 0.004;  P(D | F2) = 0.006;  P(D | F3) = 0.012.

3. Looking for:

P(F1 | D)

(so need to invert conditioning).

4. Bayes' Theorem:

P(F1 | D) = P(D | F1)P(F1) / [ P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3) ]
          = (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
          = 0.002 / 0.0062
          = 0.322.
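Bayes' Theorem calculations like the two examples above follow one pattern: multiply each prior by its likelihood and normalise. A generic sketch (illustrative only; the helper name bayes_posterior is invented) applied to the ketchup example:

    def bayes_posterior(priors, likelihoods):
        """Return P(B_j | A) for each j, given priors P(B_j) and likelihoods P(A | B_j)."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)                      # P(A), by the Partition Theorem
        return [j / total for j in joint]

    priors = [0.5, 0.3, 0.2]                    # P(F1), P(F2), P(F3)
    likelihoods = [0.004, 0.006, 0.012]         # P(D | F1), P(D | F2), P(D | F3)
    print(bayes_posterior(priors, likelihoods)) # first entry ≈ 0.322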

1.13 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.

Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that: (a) they are both white; (b) the second ball is red.

Solution: Let event Wi = "ith ball is white" and Ri = "ith ball is red".

a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1).
Now P(W1) = 4/6 and P(W2 | W1) = 3/5.
So P(both white) = P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.

b) Looking for P(2nd ball is red). We can't find this without conditioning on what happened in the first draw. Event "2nd ball is red" is actually event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2). So

P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)    (mutually exclusive)
                   = P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
                   = 2/5 × 4/6 + 1/5 × 2/6
                   = 1/3.

Probability trees

Probability trees are a graphical way of representing the multiplication rule.

[Probability tree: first draw W1 with P(W1) = 4/6 or R1 with P(R1) = 2/6; second draw with P(W2 | W1) = 3/5, P(R2 | W1) = 2/5, P(W2 | R1) = 4/5, P(R2 | R1) = 1/5.]

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g.

P(W1 ∩ W2) = 4/6 × 3/5, or P(R1 ∩ W2) = 2/6 × 4/5.

More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
                = P(A3 | A1 ∩ A2)P(A1 ∩ A2)          (multiplication rule)
                = P(A3 | A1 ∩ A2)P(A2 | A1)P(A1).    (multiplication rule)

Remember as: P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1). On the probability tree, this is the product of the branch probabilities leading to A1 ∩ A2 ∩ A3.

In general, for n events A1, A2, . . . , An, we have

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1) . . . P(An | An−1 ∩ . . . ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?

Answer: P(W1 ∩ R2 ∩ W3) = P(W1)P(R2 | W1)P(W3 | R2 ∩ W1)
                        = [w/(w+r)] × [r/(w+r−1)] × [(w−1)/(w+r−2)].
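The chain-rule answers for sampling without replacement can be cross-checked by exhaustive enumeration. A small illustrative sketch (not from the original notes) confirms P(2nd ball is red) = 1/3 and P(both white) = 2/5 for the box with 4 white and 2 red balls:

    from itertools import permutations
    from fractions import Fraction

    balls = ["W"] * 4 + ["R"] * 2
    # All ordered draws of two balls without replacement (each equally likely).
    draws = list(permutations(range(6), 2))

    second_red = sum(1 for (i, j) in draws if balls[j] == "R")
    both_white = sum(1 for (i, j) in draws if balls[i] == "W" and balls[j] == "W")

    print(Fraction(second_red, len(draws)))   # 1/3
    print(Fraction(both_white, len(draws)))   # 2/5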

Two separate studies say . . . you're better off with AntiCough! So you're better off with AntiCough . . . or are you??? Have a look at the figures:

  Study 1        AntiCough   Other Medicine
  Given to:          25            75
  Cured:             20            58
  % Cured:          80%           77%

  Study 2        AntiCough   Other Medicine
  Given to:          75            25
  Cured:             50            16
  % Cured:          67%           64%

Combine the studies . . . What happens? Never believe what you read. This is Simpson's Paradox.

1.14 Simpson's paradox: non-examinable

It is possible for one treatment (e.g. AntiCough) to be better than another (Other Medicine) in every one of a set of categories (e.g. Study 1 and Study 2), but worse overall! Combining the results overleaf:

                 AntiCough   Other Medicine
  Given to:         100           100
  Cured:             70            74
  % Cured:          70%           74%

Overall, AntiCough has a 4% lower cure percentage (70%), despite being about 3% higher in both Study 1 and Study 2. This effect is known as Simpson's Paradox.

Write C = {cured}, A = {AntiCough}, Ā = {Other Medicine}, S1 = {Study 1}, S2 = {Study 2}. The paradox occurs because

P(C | A) = P(C | A ∩ S1)P(S1 | A) + P(C | A ∩ S2)P(S2 | A), and
P(C | Ā) = P(C | Ā ∩ S1)P(S1 | Ā) + P(C | Ā ∩ S2)P(S2 | Ā).

Although P(C | A ∩ S1) > P(C | Ā ∩ S1), and P(C | A ∩ S2) > P(C | Ā ∩ S2), the other terms can change the overall outcome: P(S1 | A) and P(S2 | A) differ from P(S1 | Ā) and P(S2 | Ā).

1.15 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:
i) Ω = {s1, s2, . . . , sk};
ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;
iii) event A = {s1, s2, . . . , sr} contains r possible outcomes,
then
P(A) = r/k = (# outcomes in A) / (# outcomes in Ω).

Example: For a 3-child family, possible outcomes from oldest to youngest are:

Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, . . . , s8}.

Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = . . . = p8 = 1/8.

Let event A be A = "oldest child is a girl". Then A = {GGG, GGB, GBG, GBB}. Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects. For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, and so on.

1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices. That is, choice (a, b, c) counts separately from choice (b, a, c). Then

nPr = n(n − 1)(n − 2) . . . (n − r + 1) = n! / (n − r)!

(n choices for the first object, (n − 1) choices for the second, etc.)

2. Number of Combinations, nCr = (n choose r)

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice. That is, choice (a, b, c) and choice (b, a, c) are the same. Then

nCr = (n choose r) = nPr / r! = n! / [(n − r)! r!]

(because nPr counts each permutation r! times, and we only want to count it once: so divide nPr by r!).

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.

Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?

Order of cards is not important, so use combinations. Number of ways of selecting 5 distinct designs from 12 is

12C5 = (12 choose 5) = 12! / [(12 − 5)! 5!] = 792.

(b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?

Looking for P(at least 2 cards the same) = P(Ā) (say). Easiest to find P(all 5 cards are different) = P(A).

Number of outcomes in A is (# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24. (40 choices for the first card, 36 for the second because the 4 cards with the first design are excluded, etc. Note that order matters: e.g. we are counting choice 12345 separately from 23154.)

Total number of outcomes is (total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36. (40 choices for the first card, 39 for the second, etc. Note: order mattered above, so we need order to matter here too.)

So P(A) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.

Thus P(Ā) = P(at least 2 cards are the same design) = 1 − P(A) = 1 − 0.392 = 0.608.
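Python's standard library can reproduce these counts directly (math.comb and math.perm exist from Python 3.8). An illustrative check of the two card calculations, added here for reference:

    import math

    # (a) Number of ways to choose 5 different designs from 12.
    print(math.comb(12, 5))               # 792

    # (b) P(all 5 cards have different designs), counting ordered selections.
    favourable = 40 * 36 * 32 * 28 * 24
    total = math.perm(40, 5)              # 40 × 39 × 38 × 37 × 36
    p_all_different = favourable / total
    print(p_all_different)                # ≈ 0.392
    print(1 - p_all_different)            # ≈ 0.608: at least two cards match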

Alternative solution if order does not matter on numerator and denominator (much harder method):

P(A) = [ (10 choose 5) × 4^5 ] / (40 choose 5).

This works because there are (10 choose 5) ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups, so the numerator counts all unordered selections of 5 cards with different designs. The total number of ways of choosing 5 cards from 40 is (40 choose 5).

Exercise: Check that this gives the same answer for P(A) as before.

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.

1.16 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other. This means P(A | B) = P(A) and P(B | A) = P(B). Now

P(A | B) = P(A ∩ B) / P(B),

so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B). We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if P(A ∩ B) = P(A)P(B).

For more than two events, we say:

Definition: Events A1, A2, . . . , An are mutually independent if

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An),

AND the same multiplication rule holds for every subcollection of the events too. E.g. events A1, A2, A3, A4 are mutually independent if
i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND
ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND
iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices:
1. IF A and B are statistically independent, then P(A ∩ B) = P(A) × P(B).
2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule: P(A ∩ B) = P(A | B)P(B). This still requires us to be able to calculate P(A | B).

Note: If events are physically independent, then they will also be statistically independent.

Example: Toss a fair coin and a fair die together. The coin and die are physically independent.

Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6} — all 12 items are equally likely.

Let A = "heads" and B = "six".
Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2, and P(B) = P({H6, T6}) = 2/12 = 1/6.
Now P(A ∩ B) = P(Heads and 6) = P({H6}) = 1/12. But P(A) × P(B) = 1/2 × 1/6 = 1/12 also.
So P(A ∩ B) = P(A)P(B), and thus A and B are statistically independent.

Pairwise independence does not imply mutual independence

Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random. Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent: Consider P(A ∩ B) = 1/4 (one of 4 balls has both red and white on it). But P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B). Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C). So A, B and C are pairwise independent.

Mutually independent? Consider P(A ∩ B ∩ C) = 1/4 (one of 4 balls), while P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C). So A, B and C are NOT mutually independent, despite being pairwise independent.
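The four-ball example can be checked by enumerating the equally likely outcomes. This sketch (an added illustration, not part of the coursebook) confirms that pairwise independence holds while mutual independence fails:

    from fractions import Fraction

    # The four equally likely balls, described by the colours they carry.
    balls = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]

    def prob(colours):
        """P(the drawn ball carries all the given colours)."""
        return Fraction(sum(1 for b in balls if colours <= b), len(balls))

    pA, pB, pC = prob({"red"}), prob({"white"}), prob({"blue"})     # each 1/2
    print(prob({"red", "white"}) == pA * pB)                        # True: pairwise independent
    print(prob({"red", "white", "blue"}) == pA * pB * pC)           # False: not mutually independent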

1.17 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:
1. 'Things that happen' are sets, also called events.
2. We measure chance by measuring sets, using a measure called probability.

Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces:

Ω = {head, tail};  Ω = {same, different};  Ω = {Lions win, All Blacks win}.

All of these sample spaces could be represented more concisely in terms of numbers: Ω = {0, 1}.

On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes. For example, the number of heads from 10 tosses of a coin; the number of girls in a three-child family; and so on.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:
1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);
2. quantify how much the outcomes tend to diverge from the average value;
3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?)

The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes.

A random experiment whose possible outcomes are real numbers is called a random variable.

In fact, any random experiment can be made to have outcomes that are real numbers, simply by mapping the sample space Ω onto a set of real numbers using a function. For example:

function X : Ω → R with X("Lions win") = 0, X("All Blacks win") = 1.

This gives us our formal definition of a random variable:

Definition: A random variable (r.v.) is a function from a sample space Ω to the real numbers R. We write X : Ω → R.

Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable equates to a random experiment whose outcomes are numbers. A random variable produces random real numbers as the 'outcome' of a random experiment.

Defining random variables serves the dual purposes of:
1. Describing many different sample spaces in the same terms: e.g. Ω = {0, 1} with P(1) = p and P(0) = 1 − p describes EVERY possible experiment with two outcomes.
2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.

Example: Toss a coin 3 times. The sample space is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

One example of a random variable is X : Ω → R such that, for sample point si, we have X(si) = # heads in outcome si. So X(HHH) = 3, X(THT) = 1, etc.

Another example is Y : Ω → R such that Y(si) = 1 if the 2nd toss is a head, and 0 otherwise. Then Y(HHH) = 1, Y(HTH) = 0, Y(THH) = 1, etc.

Probabilities for random variables

By convention, we use CAPITAL LETTERS for random variables (e.g. X), and lower case letters to represent the values that the random variable takes (e.g. x).

For a sample space Ω and random variable X : Ω → R, and for a real number x,

P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}).

Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8. Let X : Ω → R, such that X(s) = # heads in s. Then

P(X = 0) = P({TTT}) = 1/8;
P(X = 1) = P({HTT, THT, TTH}) = 3/8;
P(X = 2) = P({HHT, HTH, THH}) = 3/8;
P(X = 3) = P({HHH}) = 1/8.

Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.

Independent random variables

Random variables X and Y are independent if each does not affect the other. Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if

P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y)  for all possible values x and y.
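The probabilities P(X = x) in the three-coin example above come straight from counting outcomes. A short illustrative sketch (added here, not from the original notes):

    from itertools import product
    from fractions import Fraction
    from collections import Counter

    outcomes = list(product("HT", repeat=3))          # 8 equally likely outcomes
    X = Counter(s.count("H") for s in outcomes)       # X(s) = number of heads in s

    for x in sorted(X):
        print(x, Fraction(X[x], len(outcomes)))       # 0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8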

We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y). From now on, we will use the following notations interchangeably:

P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).

Thus X and Y are independent if and only if

P(X = x, Y = y) = P(X = x)P(Y = y)  for ALL possible values x, y.

1.18 Key Probability Results for Chapter 1

1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability: P(A | B) = P(A ∩ B) / P(B) for any A, B. Or: P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write

P(A | B) = P(B | A)P(A) / P(B).

This is a simplified version of Bayes' Theorem. It shows how to 'invert' the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem slightly more generalized: for any A, B,

P(A | B) = P(B | A)P(A) / [ P(B | A)P(A) + P(B | Ā)P(Ā) ].

This works because A and Ā form a partition of the sample space.

5. Complete version of Bayes' Theorem: If sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then

P(Aj | B) = P(B | Aj)P(Aj) / [ P(B | A1)P(A1) + . . . + P(B | Am)P(Am) ]
          = P(B | Aj)P(Aj) / Σ_{i=1}^{m} P(B | Ai)P(Ai).

6. Partition Theorem: if A1, . . . , Am form a partition of the sample space, then

P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).

This can also be written as:

P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + . . . + P(B | Am)P(Am).

These are both very useful formulations.

7. Chains of events: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

8. Statistical independence: if A and B are independent, then

P(A ∩ B) = P(A)P(B), and P(A | B) = P(A), and P(B | A) = P(B).

9. Conditional probability: If P(B) > 0, then we can treat P(· | B) just like P: e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2)); and P(Ā | B) = 1 − P(A | B) for any A. (Note: it is not generally true that P(A | B̄) = 1 − P(A | B).)
If A1, . . . , Am partition the sample space, then P(A1 | B) + P(A2 | B) + . . . + P(Am | B) = 1.
The fact that P(· | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

10. Unions: For any A, B, C,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).

Chapter 2: Discrete Probability Distributions

2.1 Introduction

In the next two chapters we meet several important concepts:

1. Probability distributions, and the probability function fX(x):
• the probability function of a random variable lists the values the random variable can take, and their probabilities.

2. Hypothesis testing:
• I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

3. Likelihood and estimation:
• what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable:
• the expectation of a random variable is the value it takes on average;
• the variance of a random variable measures how much the random variable varies about its average.

5. Change of variable procedures:
• calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X².

6. Modelling:
• we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.

2.2 The probability function, fX (x)

The probability function fX (x) lists all possible values of X, and gives a probability to each value.

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, e.g. {0, 1, 2, . . .}. X gives numbers to the possible outcomes.

Example: Which car?
Random experiment: which car does he choose?
If he chooses . . . a Ferrari ⇒ X = 1; a Porsche ⇒ X = 2; an MG ⇒ X = 3.

Definition: The probability function, fX (x), for a discrete random variable X, is given by
fX (x) = P(X = x), for all possible outcomes x of X.

Example: Which car?
Outcome:                         Ferrari   Porsche   MG
x:                                  1         2       3
fX (x) = P(X = x):                 1/6       1/6     4/6

We write: P(X = 1) = fX (1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.
We can also write the probability function as:
fX (x) = 1/6 if x = 1; 1/6 if x = 2; 4/6 if x = 3; 0 otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then
X = 0 with probability 0.5, 1 with probability 0.5.
The probability function of X is given by:
x:                     0     1
fX (x) = P(X = x):    0.5   0.5
or
fX (x) = 0.5 if x = 0; 0.5 if x = 1; 0 otherwise.
We write (e.g.) fX (0) = 0.5, fX (1) = 0.5, fX (7.5) = 0, etc. fX (x) is just a list of probabilities.

Properties of the probability function
i) 0 ≤ fX (x) ≤ 1 for all x: probabilities are always between 0 and 1.
ii) Σx fX (x) = 1: probabilities add to 1 overall.
iii) P(X ∈ A) = Σ_{x∈A} fX (x).
e.g. in the car example, P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6. This is the probability of choosing either a Ferrari or a Porsche.

2.3 Bernoulli trials

Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologian and Jean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:
i) Each trial has only 2 possible outcomes (usually called "Success" and "Failure");
ii) The probability of success, p, remains constant for all trials;
iii) The trials are independent, i.e. the event "success in trial i" does not depend on the outcome of any other trials.

Examples:
1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.
2) Repeated tossing of a fair die: success = "6", failure = "not 6". Each toss is a Bernoulli trial with P(success) = 1/6.

Definition: The random variable Y is called a Bernoulli random variable if it takes only 2 values, 0 and 1. The probability function is
fY (y) = p if y = 1; 1 − p if y = 0.
That is,
P(Y = 1) = P("success") = p,
P(Y = 0) = P("failure") = 1 − p.

2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials each with probability of success = p. Then X has the Binomial distribution with parameters n and p. We write X ∼ Binomial(n, p), or X ∼ Bin(n, p).

Thus X ∼ Binomial(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

Probability function
If X ∼ Binomial(n, p), then the probability function for X is
fX (x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for x = 0, 1, . . . , n.

Explanation: An outcome with x successes and (n − x) failures has probability
p^x (1 − p)^(n−x),
 (1)      (2)
where: (1) succeeds x times, each with probability p; (2) fails (n − x) times, each with probability (1 − p).
There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select x trials to be our "successes", out of n trials in total.
Thus,
P(#successes = x) = (#outcomes with x successes) × (prob. of each such outcome)
                  = (n choose x) p^x (1 − p)^(n−x).

Note: fX (x) = 0 if x ∉ {0, 1, 2, . . . , n}.

Check that Σ_{x=0}^{n} fX (x) = 1:
Σ_{x=0}^{n} fX (x) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) = [p + (1 − p)]^n   (Binomial Theorem)
                   = 1^n = 1.
It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.
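As a small check of this formula (a sketch for illustration; the values of n and p below are arbitrary), R's built-in dbinom agrees with (n choose x) p^x (1 − p)^(n−x), and the probabilities sum to 1 as the Binomial Theorem shows:

n <- 10; p <- 1/6
x <- 0:n
by.formula <- choose(n, x) * p^x * (1 - p)^(n - x)   # the formula above
by.dbinom  <- dbinom(x, size = n, prob = p)          # R's Binomial probability function
all.equal(by.formula, by.dbinom)                     # TRUE
sum(by.dbinom)                                       # 1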

Example 1: Let X ∼ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x:                     0        1        2        3        4
fX (x) = P(X = x):  0.4096   0.4096   0.1536   0.0256   0.0016

Example 2: Let X be the number of times I get a '6' out of 10 rolls of a fair die.
1. What is the distribution of X?
   X ∼ Binomial(n = 10, p = 1/6).
2. What is the probability that X ≥ 2?
   P(X ≥ 2) = 1 − P(X < 2) = 1 − P(X = 0) − P(X = 1)
            = 1 − (10 choose 0) (1/6)^0 (1 − 1/6)^(10−0) − (10 choose 1) (1/6)^1 (1 − 1/6)^(10−1)
            = 0.515.

Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?
Assume: (i) each child is equally likely to be a boy or a girl; (ii) all children are independent of each other.
Then X ∼ Binomial(n = 3, p = 0.5).

Shape of the Binomial distribution
The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1.
The probability functions for various values of n and p are shown below.
[Figures: probability functions for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

Sum of independent Binomial random variables:
If X and Y are independent, and X ∼ Binomial(n, p), Y ∼ Binomial(m, p), then
X + Y ∼ Bin(n + m, p).
This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.
Note: X and Y must both share the same value of p.
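A quick simulation sketch of this result (for illustration only; the values of n, m and p below are arbitrary):

set.seed(1)
n <- 10; m <- 15; p <- 0.3
s <- rbinom(100000, n, p) + rbinom(100000, m, p)   # X + Y for 100,000 independent pairs
# The empirical distribution of X + Y should match Binomial(n + m, p):
mean(s == 7)            # observed proportion of the value 7
dbinom(7, n + m, p)     # Binomial(25, 0.3) probability of the value 7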

2.5 The cumulative distribution function, FX (x)

We have defined the probability function, fX (x), as fX (x) = P(X = x). The probability function tells us everything there is to know about X.
The cumulative distribution function, or just distribution function, written as FX (x), is an alternative function that also tells us everything there is to know about X.

Definition: The (cumulative) distribution function (c.d.f.) is
FX (x) = P(X ≤ x)   for −∞ < x < ∞.

If you are asked to 'give the distribution of X', you could answer by giving either the distribution function, FX (x), or the probability function, fX (x). Each of these functions encapsulates all possible information about X.

The distribution function FX (x) as a probability sweeper
The cumulative distribution function, FX (x), sweeps up all the probability up to and including the point x.
[Figures: probability functions for X ∼ Bin(10, 0.5) and X ∼ Bin(10, 0.9), with the probability up to and including x shaded.]

Example: Let X ∼ Binomial(2, 1/2).

x:                     0     1     2
fX (x) = P(X = x):    1/4   1/2   1/4

Then
FX (x) = P(X ≤ x) = 0                          if x < 0;
                    0.25                        if 0 ≤ x < 1;
                    0.25 + 0.5 = 0.75           if 1 ≤ x < 2;
                    0.25 + 0.5 + 0.25 = 1       if x ≥ 2.
[Figure: the probability function fX (x) and the step function FX (x) for this example.]

FX (x) gives the cumulative probability up to and including point x. So
FX (x) = Σ_{y ≤ x} fX (y).
Note that FX (x) is a step function: it jumps by amount fX (y) at every point y with positive probability.
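The same example can be checked in R (a small sketch): accumulating the probability function gives the distribution function, which matches R's built-in pbinom.

x  <- 0:2
fx <- dbinom(x, size = 2, prob = 0.5)   # 0.25 0.50 0.25
Fx <- cumsum(fx)                        # 0.25 0.75 1.00
rbind(x = x, fx = fx, Fx = Fx)
pbinom(x, 2, 0.5)                       # same values as Fx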

Reading off probabilities from the distribution function
As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities:
fX (x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1) = FX (x) − FX (x − 1)   (if X takes integer values).
This is why the distribution function FX (x) contains as much information as the probability function, fX (x), because we can use either one to find the other.
In general:
P(a < X ≤ b) = FX (b) − FX (a)   if b > a.
Proof: P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b)   if b > a.
[Diagram: the event {X ≤ b} split into {X ≤ a} and {a < X ≤ b}.]
So FX (b) = FX (a) + P(a < X ≤ b)
⇒ FX (b) − FX (a) = P(a < X ≤ b).

Warning: endpoints
Be careful of endpoints and the difference between ≤ and <. For example,
P(X < 10) = P(X ≤ 9) = FX (9)   (if X takes integer values).

Examples: Let X ∼ Binomial(100, 0.4). In terms of FX (x), what is:
1. P(X ≤ 30)?   FX (30).
2. P(X < 30)?   FX (29).
3. P(X ≥ 56)?   1 − P(X < 56) = 1 − P(X ≤ 55) = 1 − FX (55).
4. P(X > 42)?   1 − P(X ≤ 42) = 1 − FX (42).
5. P(50 ≤ X ≤ 60)?   P(X ≤ 60) − P(X ≤ 49) = FX (60) − FX (49).

Properties of the distribution function
1) F (−∞) = P(X ≤ −∞) = 0, and F (+∞) = P(X ≤ +∞) = 1. (These are true because values are strictly between −∞ and ∞.)
2) FX (x) is a non-decreasing function of x: that is, if x1 < x2, then FX (x1) ≤ FX (x2).
3) P(a < X ≤ b) = FX (b) − FX (a) if b > a.
4) F is right-continuous: that is, lim_{h↓0} F (x + h) = F (x).
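The five probabilities above can be evaluated with R's pbinom, which plays the role of FX for the Binomial distribution (a quick sketch; pbinom is used again in Section 2.7):

pbinom(30, 100, 0.4)                         # 1. P(X <= 30) = FX(30)
pbinom(29, 100, 0.4)                         # 2. P(X < 30) = FX(29)
1 - pbinom(55, 100, 0.4)                     # 3. P(X >= 56)
1 - pbinom(42, 100, 0.4)                     # 4. P(X > 42)
pbinom(60, 100, 0.4) - pbinom(49, 100, 0.4)  # 5. P(50 <= X <= 60)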

2.6 Hypothesis testing

You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

Example: Weird Coin?
I toss a coin 10 times and get 9 heads. How weird is that?

What is 'weird'?
• Getting 9 heads out of 10 tosses: we'll call this weird.
• Getting 10 heads out of 10 tosses: even more weird!
• Getting 8 heads out of 10 tosses: less weird.
• Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
• Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

Set of weird outcomes
If our coin is fair, the outcomes that are as weird or weirder than 9 heads are:
9 heads, 10 heads, 1 head, 0 heads.
So how weird is 9 heads or worse, if the coin is fair? We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair.

Distribution of X, the number of heads out of 10 tosses, if the coin is fair:
X ∼ Binomial(n = 10, p = 0.5).

Probability of observing something at least as weird as 9 heads, if the coin is fair:
P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0), where X ∼ Binomial(10, 0.5).
We have:
P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
= (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10
= 0.00977 + 0.00098 + 0.00977 + 0.00098
= 0.021.
[Figure: probabilities for X ∼ Binomial(n = 10, p = 0.5).]

Is this weird?
Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.

Is the coin fair?
Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.
However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more.
We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair.
In fact, this gives us some evidence that the coin is not fair. The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.

Formal hypothesis test
We now formalize the procedure above. Think of the steps:
• We have a question that we want to answer: Is the coin fair?
• There are two alternatives: 1. The coin is fair. 2. The coin is not fair.
• Our observed information is X, the number of heads out of 10 tosses. We observed X = 9.
• We write down the distribution of X if the coin is fair: X ∼ Binomial(10, 0.5).
• We calculate the probability of observing something AT LEAST AS EXTREME as our observation, if the coin is fair: prob = 0.021.
• The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.
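The 0.021 above can be reproduced in R (a quick check, using the same numbers as the example):

sum(dbinom(c(0, 1, 9, 10), size = 10, prob = 0.5))   # 0.0215
# or, using the distribution function:
pbinom(1, 10, 0.5) + (1 - pbinom(8, 10, 0.5))        # same value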

Null hypothesis and alternative hypothesis
We express the steps above as two competing hypotheses.
Null hypothesis: the first alternative, that the coin IS fair.
Alternative hypothesis: the second alternative, that the coin is NOT fair.
Think of the 'null hypothesis' as meaning the 'default': the hypothesis we will accept unless we have a good reason not to. We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.
We use H0 and H1 to denote the null and alternative hypotheses respectively.
The null hypothesis is H0 : the coin is fair.
The alternative hypothesis is H1 : the coin is NOT fair.
• The null hypothesis is specific. It specifies an exact distribution for our observation: X ∼ Binomial(10, 0.5).
• The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.
In hypothesis testing, we often use this same formulation. More precisely, we write:
Number of heads, X ∼ Binomial(10, p), and
H0 : p = 0.5,
H1 : p ≠ 0.5.

This means that SMALL p-values represent STRONG evidence against H0. which is 0. We measure the strength of evidence against H0 using the p-value.021 as a measure of the strength of evidence against the hypothesis that p = 0. Our hypothesis test is designed to test whether the Binomial probability is p = 0. Small p-values mean Strong evidence. we always measure evidence AGAINST the null hypothesis.021. we believe that our coin is fair unless we see convincing evidence otherwise. . In general. It states that. with the Binomial probability p.5.021 in our example. if the null hypothesis is TRUE. Many of us would see this as strong enough evidence to decide that the null hypothesis is not true. the p-value was p = 0. Note: Be careful not to confuse the term p-value. if H0 is TRUE. That is.67 p-values In the hypothesis-testing framework above. the p-value is the probability of observing something AT LEAST AS EXTREME AS OUR OBSERVATION. To test this. Large p-values mean Little evidence. A p-value of 0.1% chance of observing something as extreme as 9 heads or tails. In the example above.5.021 represents quite strong evidence against the null hypothesis. we calculate the p-value of 0. we would only have a 2.

Saying the test is signiﬁcant is a quick way of saying that there is evidence against the null hypothesis. Under this framework. • Some people go further and use an accept/reject framework. This decision should usually be taken in the context of other scientiﬁc information.05. the stronger the evidence against H0. The smaller the p-value. • Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis.05. The result of a hypothesis test is signiﬁcant at the 5% level if the p-value is less than 0. or more. it is worth bearing in mind that p-values of 0. the null hypothesis H0 should be rejected if the p-value is less than 0. Statistical signiﬁcance refers to the p-value. Statistical signiﬁcance You have probably encountered the idea of statistical signiﬁcance in other courses. However. We do not talk about accepting or rejecting H0 . The p-value measures how far out our observation lies in the tails of the distribution speciﬁed by H0. is less than 5% if the null hypothesis is true.68 Interpreting the hypothesis test There are diﬀerent schools of thought about how a p-value should be interpreted. This means that the chance of seeing what we did see (9 heads). usually at the 5% level. and accepted if the p-value is greater than 0. .05 and less start to suggest that the null hypothesis is doubtful. • In this course we use the strength of evidence interpretation.05 (say).

In the coin example, we can say that our test of H0 : p = 0.5 against H1 : p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021 which is < 0.05.
This means:
• we have some evidence that p ≠ 0.5, i.e. that the coin is not fair.
It does not mean:
• the difference between p and 0.5 is large, or
• the difference between p and 0.5 is important in practical terms.
Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

Beware!
The p-value gives the probability of seeing something as weird as what we did see, if H0 is true. This means that 5% of the time, we will get a p-value < 0.05 EVEN WHEN H0 IS TRUE!! Indeed, about once in every thousand tests, we will get a p-value < 0.001, even though H0 is true! A small p-value does NOT mean that H0 is definitely wrong.

One-sided and two-sided tests
The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads.
If we had a good reason, before tossing the coin, to believe that the binomial probability could only be = 0.5 or > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0 : p = 0.5 versus H1 : p > 0.5. This would have the effect of halving the resultant p-value.

2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker. Would you prefer sons? Easy! Just become a US president.
Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

The facts
• The 43 US presidents from George Washington to George W. Bush have had a total of 151 children, comprising 88 sons and only 63 daughters: a sex ratio of 1.4 sons for every daughter.
• Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

Could this happen by chance? Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?
This is the same as the question in Section 2.6.
For the presidents: If I tossed a coin 151 times and got only 63 heads, could I continue to believe that the coin was fair?
For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?

Hypothesis test for the presidents
We set up the competing hypotheses as follows.
Let X be the number of daughters out of 151 presidential children. Then X ∼ Binomial(151, p), where p is the probability that each child is a daughter.
Null hypothesis: H0 : p = 0.5.
Alternative hypothesis: H1 : p ≠ 0.5.
p-value: We need the probability of getting a result AT LEAST AS EXTREME as X = 63 daughters, if H0 is true and p really is 0.5.
Which results are at least as extreme as X = 63?
X = 0, 1, 2, . . . , 63, for too few daughters.
X = (151 − 63), . . . , 151, i.e. X = 88, . . . , 151, for too many daughters, because we would be just as surprised if we saw ≤ 63 sons, i.e. ≥ (151 − 63) = 88 daughters.

Probabilities for X ∼ Binomial(n = 151, p = 0.5)
[Figure: probability function of X ∼ Binomial(151, 0.5), with both tails marked.]


Calculating the p-value
The p-value for the president problem is given by P(X ≤ 63) + P(X ≥ 88), where X ∼ Binomial(151, 0.5). In principle, we could calculate this as
P(X = 0) + P(X = 1) + . . . + P(X = 63) + P(X = 88) + . . . + P(X = 151)
= (151 choose 0)(0.5)^0(0.5)^151 + (151 choose 1)(0.5)^1(0.5)^150 + . . .

This would take a lot of calculator time! Instead, we use a computer with a package such as R. R command for the p-value The R command for calculating the lower-tail p-value for the Binomial(n = 151, p = 0.5) distribution is pbinom(63, 151, 0.5). Typing this in R gives: > pbinom(63, 151, 0.5) [1] 0.02522393 This gives us the lower-tail p-value only: P(X ≤ 63) = 0.0252. To get the overall p-value, we have two choices: 1. Multiply the lower-tail p-value by 2: 2 × 0.0252 = 0.0504. In R: > 2 * pbinom(63, 151, 0.5) [1] 0.05044785


This works because the upper-tail p-value, by deﬁnition, is always going to be the same as the lower-tail p-value. The upper tail gives us the probability of ﬁnding something equally surprising at the opposite end of the distribution. 2. Calculate the upper-tail p-value explicitly:

The upper-tail p-value is

P(X ≥ 88) = 1 − P(X < 88) = 1 − P(X ≤ 87) = 1 − pbinom(87, 151, 0.5). In R: > 1-pbinom(87, 151, 0.5) [1] 0.02522393 The overall p-value is the sum of the lower-tail and the upper-tail p-values: pbinom(63, 151, 0.5) + 1 - pbinom(87, 151, 0.5) = 0.0252 + 0.0252 = 0.0504.

(Same as before.)

Note: The R command pbinom is equivalent to the cumulative distribution function for the Binomial distribution:

pbinom(63, 151, 0.5) = P(X ≤ 63) = FX (63)

where X ∼ Binomial(151, 0.5) for X ∼ Binomial(151, 0.5).

The overall p-value in this example is 2 × FX (63). Note: In the R command pbinom(63, 151, 0.5), the order that you enter the numbers 63, 151, and 0.5 is important. If you enter them in a diﬀerent order, you will get an error. An alternative is to use the longhand command pbinom(q=63, size=151, prob=0.5), in which case you can enter the terms in any order.

74

Summary: are presidents more likely to have sons? Back to our hypothesis test. Recall that X was the number of daughters out of 151 presidential children, and X ∼ Binomial(151, p), where p is the probability that each child is a daughter. Null hypothesis: Alternative hypothesis: p-value: H0 : p = 0.5. H1 : p = 0.5.

2 × FX (63) = 0.0504.

What does this mean? The p-value of 0.0504 means that, if the presidents really were as likely to have

daughters as sons, there would only be 5.04% chance of observing something as unusual as only 63 daughters out of the total 151 children.

This is unusual, but not extremely unusual. We conclude that there is some evidence that presidents are more likely to have

sons than daughters. However, the observations are also consistent with the possibility that there is no real diﬀerence.

Hypothesis test for the deep-sea divers For the deep-sea divers, there were 190 children: 65 sons, and 125 daughters. Let X be the number of sons out of 190 diver children. Then X ∼ Binomial(190, p), where p is the probability that each child is a son. Note: We could just as easily formulate our hypotheses in terms of daughters instead of sons. Because pbinom is deﬁned as a lower-tail probability, however, it is usually easiest to formulate them in terms of the low result (sons).

75

Null hypothesis: Alternative hypothesis: p-value:

H0 : p = 0.5. H1 : p = 0.5.

Probability of getting a result AT LEAST AS EXTREME as X = 65 sons, if H0 is true and p really is 0.5.

Results at least as extreme as X = 65 are: X = 0, 1, 2, . . . , 65, for even fewer sons. X = (190−65), . . . , 190, for the equally surprising result in the opposite direction

(too many sons).

Probabilities for X ∼ Binomial(n = 190, p = 0.5)

[Figure: probability function of X ∼ Binomial(190, 0.5), with both tails marked.]

R command for the p-value p-value = 2×pbinom(65, 190, 0.5). Typing this in R gives: > 2*pbinom(65, 190, 0.5) [1] 1.603136e-05 This is 0.000016, or a little more than one chance in 100 thousand.

We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters. We have very strong evidence that deep-sea divers are more likely to have daughters than sons.

Why might divers have daughters and presidents have sons?
Deep-sea divers are thought to have more daughters than sons because the underwater work at high atmospheric pressure lowers the level of the hormone testosterone in the men's blood, which is thought to make them more likely to conceive daughters. For the presidents, your guess is as good as mine . . .

What next?
p-values are often badly used in science and business. They are regularly treated as the end point of an analysis, after which no more work is needed. Many scientific journals insist that scientists quote a p-value with every set of results, and often only p-values less than 0.05 are regarded as 'interesting'. The outcome is that some scientists do every analysis they can think of until they finally come up with a p-value of 0.05 or less.
A good statistician will recommend a different attitude. It is very rare in science for numbers and statistics to tell us the full story.
Results like the p-value should be regarded as an investigative starting point, rather than the final conclusion. Why is the p-value small? What possible mechanism could there be for producing this result?
If you were a medical statistician and you gave me a p-value, I would ask you for a mechanism.
Don't accept that Drug A is better than Drug B because the p-value says so: find a biochemist who can explain what Drug A does that Drug B doesn't. Don't accept that sun exposure is a cause of skin cancer on the basis of a p-value: find a mechanism by which skin is damaged by the sun.

2.8 Example: Politicians and the alphabet

What do the following people all have in common: Bush, Blair, Clinton, Clark? They are all elected presidents or prime ministers, and their names are all right at the beginning of the alphabet!
[Illustration: a mock ballot paper ("Mark choice with an X"): Clark, Wombat, Zombie.]
Is it true that political candidates with names at the beginning of the alphabet have an advantage over other candidates, because their names come at the top of the list on the ballot cards?
For the 2001 UK general election, the names of all candidates and the winning candidate can be found on the internet for 590 constituency seats in England, Wales, and Northern Ireland. (Results for Scotland did not include candidate names.) Each seat had three candidates. Candidates from minor parties such as the Monster Raving Loony Party were excluded for this analysis. Candidates are listed on the voting paper in alphabetical order.
Of the 590 winning candidates, 207 were alphabetically first of the three candidates in their constituency.
Is there any evidence that there is an alphabetical advantage in the voting process?
The appropriate tool to use is another hypothesis test.

Hypothesis test
Let X be the number of the 590 winners who are alphabetically first. We need to set up hypotheses of the following form:
Null hypothesis: H0 : there is no alphabetical advantage.
Alternative hypothesis: H1 : there is an alphabetical advantage.

What is the distribution of X under H0 and under H1?
Under H0, there is no alphabetical effect. So the probability that each winner is alphabetically first should be 1/3. (Three candidates for each seat, each with the same probability of being alphabetically first.) Thus the distribution of X under H0 is X ∼ Binomial(590, 1/3).
Under H1, there is an alphabetical effect, so p ≠ 1/3: Number of alphabet-first winners, X ∼ Binomial(590, p).
Our formulation for the hypothesis test is therefore as follows.
Null hypothesis: H0 : p = 1/3.
Alternative hypothesis: H1 : p ≠ 1/3.
Our observation: The observed proportion of winners who were alphabet-first is 207/590 = 0.351. This is a little more than the 1/3 predicted by H0. Is it sufficiently greater than 1/3 to provide evidence against H0? Just using our intuition, it is very hard to guess. We need the p-value to measure the evidence properly.
p-value: Probability of getting a result AT LEAST AS EXTREME as X = 207 alphabet-first winners, if H0 is true and p really is 1/3.
Results at least as extreme as X = 207 are:
Upper tail: X = 207, 208, . . . , 590, for even more alphabet-first winners.
Lower tail: an equal probability in the opposite direction, for too few alphabet-first winners.

Note: We do not need to calculate the values corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.7, because we do not have Binomial probability p = 0.5. In fact, the lower tail probability is from 0 to somewhere between 185 and 186, but it cannot be specified exactly. We get round this problem for calculating the p-value by just multiplying the upper-tail p-value by 2.

Probabilities for X ∼ Binomial(n = 590, p = 1/3)
[Figure: probability function of X ∼ Binomial(590, 1/3) for x between about 140 and 260.]

R command for the p-value
We need twice the UPPER-tail p-value:
p-value = 2 × (1 − pbinom(206, 590, 1/3)).
(Recall P(X ≥ 207) = 1 − P(X ≤ 206).)
Typing this in R gives:
> 2*(1-pbinom(206, 590, 1/3))
[1] 0.3897671
This p-value is large. It means that if there really was no alphabetical advantage, we would expect to see results as unusual as 207 out of 590 alphabet-first winners about 39% of the time.
We conclude that there is no evidence that there was an alphabetical advantage in the 2001 UK election.

of the departure from H0.5 as.5. Because there were almost twice as many daughters as sons.000016 even if the true probability was 0. But what does this say about the actual value of p? Remember the p-value for the test was 0. We need some way of formalizing this. equal to the value speciﬁed in the null hypothesis. if an alphabetical advantage does exist. we found that it would be very unlikely to observe as many as 125 daughters out of 190 children if the chance of having a daughter really was p = 0. They have told us nothing about the size. Do you think that: 1. gives us a hint. p could be as big as 0.51? The test doesn’t even tell us this much! If there was a huge sample size (number of children).9 Likelihood and estimation So far. and that there is an alphabetical advantage that is too small to distinguish from sampling variability. For example. . or probably isn’t.80 Note: This does not mean that the alphabetical advantage does not exist! It simply means that. 2. say. from the evidence given. the hypothesis tests have only told us whether the Binomial probability p might be. for the deep-sea divers. my guess is that the probability of a having a daughter is something close to p = 2/3.000016. 0. p could be as close to 0. we cannot distinguish it from pure chance. 2. Common sense. The evidence is consistent with both the possibility that there is no alphabetical advantage. or potential importance. however. we COULD get a p-value as small as 0.51.8? No idea! The p-value does not tell us.

there are many situations where our common sense fails us. How would we estimate the unknown intercept α and slope β. Return to the deep-sea diver example. such as p = α + β × (diver age). given known information on diver age and number of daughters and sons? We need a general framework for estimation that can be applied to any situation. The value suggested is called the estimate of the parameter. Probably the most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation. Likelihood Likelihood is one of the most important concepts in statistics.81 Estimation The process of using observations to suggest a value for a parameter is called estimation. The common-sense estimate to use is 125 number of daughters = = 0.658. In the case of the deep-sea divers. For example. . p). total number of children 190 p= However. We know that X ∼ Binomial(190. and we wish to estimate the value of p. X is the number of daughters out of 190 children. we wish to estimate the probability p that the child of a diver is a daughter. what would we do if we had a regression-model situation (see other courses) and wished to specify an alternative form for p. The available data is the observed value of X: X = 125.

Suppose for a moment that p = 0.5. What is the probability of observing X = 125?
When X ∼ Binomial(190, 0.5),
P(X = 125) = (190 choose 125) (0.5)^125 (1 − 0.5)^(190−125) = 3.97 × 10^(−6).
Not very likely!!

What about p = 0.6? What would be the probability of observing X = 125 if p = 0.6?
When X ∼ Binomial(190, 0.6),
P(X = 125) = (190 choose 125) (0.6)^125 (1 − 0.6)^(190−125) = 0.016.
This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5.
So far, we have discovered that it would be thousands of times more likely to observe X = 125 if p = 0.6 than it would be if p = 0.5. This suggests that p = 0.6 is a better estimate than p = 0.5.

You can probably see where this is heading. If p = 0.6 is a better estimate than p = 0.5, what if we move p even closer to our common-sense estimate of 0.658?
What about p = 0.658?
When X ∼ Binomial(190, 0.658),
P(X = 125) = (190 choose 125) (0.658)^125 (1 − 0.658)^(190−125) = 0.061.
This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.

Can we do any better? What happens if we increase p a little more, say to p = 0.7?
When X ∼ Binomial(190, 0.7),
P(X = 125) = (190 choose 125) (0.7)^125 (1 − 0.7)^(190−125) = 0.028.
This has decreased from the result for p = 0.658, so our observation of 125 is LESS likely under p = 0.7 than under p = 0.658.

Overall, we can plot a graph showing how likely our observation of X = 125 is under each different value of p.
[Figure: P(X = 125) when X ∼ Bin(190, p), plotted against p for p between 0.50 and 0.80.]
The graph reaches a clear maximum. This is a value of p at which the observation X = 125 is MORE LIKELY than at any other value of p. We can see that the maximum occurs somewhere close to our common-sense estimate of p = 0.658.
This maximum likelihood value of p is our maximum likelihood estimate.

The likelihood function is: L(p) = P(X = 125) when X ∼ Binomial(190. p). In general. if our observation were X = x rather than X = 125. p. x) = P(X = x) when X ∼ Binomial(190. p). We write: L(p . the likelihood function is a function of p giving P(X = x) when X ∼ Binomial(190. 125 This function of p is the curve shown on the graph on page 83. under this value This function is called the likelihood function. p). For our ﬁxed observation X = 125.84 The likelihood function Look at the graph we plotted overleaf: Horizontal axis: The unknown parameter. = 190 x p (1 − p)190−x . x . X = 125. The probability of our observation. It is a function of the unknown parameter p. Vertical axis: of p. the likelihood function shows how LIKELY the observation 125 is for every diﬀerent value of p. = = 190 125 p (1 − p)190−125 125 190 125 p (1 − p)65 .

x).55 0.00 0.02 0.01 0.) Maximizing the likelihood We have decided that a sensible parameter estimate for p is the maximum likelihood estimate: the value of p at which the observation X = 125 is more likely than at any other value of p. The likelihood gives the probability of a FIXED observation x. fX (x). for every possible value of the parameter p.05 P(X=125) when X~Bin(190.85 Diﬀerence between the likelihood function and the probability function The likelihood function is a probability of x. Gives P(X = x) as p changes.06 P(X=x) when p=0.) Probability function. for a FIXED value of p. (x = 125 here. Compare this with the probability function.80 100 120 x 140 Likelihood function.70 0.02 0. but it could be anything. Function of p for ﬁxed x.00 0. We can ﬁnd the maximum likelihood estimate using calculus. (p = 0. 0. Gives P(X = x) as x changes.04 0. .06 0.50 0.04 0. Function of x for ﬁxed p. but it is a FUNCTION of p.65 p 0. but it could be anything.75 0. which is the probability of every diﬀerent value of x.01 0.05 0. L(p . p) 0.03 0.60 0.6 0.6 here.03 0.

The likelihood function is
L(p ; 125) = (190 choose 125) p^125 (1 − p)^65.
We wish to find the value of p that maximizes this expression. The maximizing value of p occurs when dL/dp = 0.
To find the maximizing value of p, differentiate the likelihood with respect to p (Product Rule):
dL/dp = (190 choose 125) × [125 × p^124 × (1 − p)^65 + p^125 × 65 × (1 − p)^64 × (−1)]
      = (190 choose 125) × p^124 × (1 − p)^64 × [125(1 − p) − 65p]
      = (190 choose 125) p^124 (1 − p)^64 (125 − 190p).
This gives:
dL/dp = 0  ⇒  125 − 190p = 0  ⇒  p = 125/190 = 0.658.

For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate (page 81):
p̂ = number of daughters / total number of children = 125/190.
This gives us confidence that the method of maximum likelihood is sensible.

The 'hat' notation for an estimate
It is conventional to write the estimated value of a parameter with a 'hat', like this: p̂. For example, p̂ = 125/190.
The correct notation for the maximization is:
dL/dp evaluated at p = p̂ equals 0  ⇒  p̂ = 125/190.

Summary of the maximum likelihood procedure
1. Write down the distribution of X in terms of the unknown parameter: X ∼ Binomial(190, p).
2. Write down the observed value of X: Observed data: X = 125.
3. Write down the likelihood function for this observed value:
L(p ; 125) = P(X = 125) when X ∼ Binomial(190, p) = (190 choose 125) p^125 (1 − p)^65.
4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (190 choose 125) p^124 (1 − p)^64 (125 − 190p) = 0, when p = p̂.
This is the Likelihood Equation.
5. Solve for p̂:
p̂ = 125/190.
This is the maximum likelihood estimate (MLE) of p.

Verifying the maximum
Strictly speaking, when we find the maximum likelihood estimate using dL/dp = 0, we should verify that the result is a maximum (rather than a minimum) by showing that
d²L/dp² < 0 at p = p̂.
In Stats 210, we will be relaxed about this. You will usually be told to assume that the MLE exists. Where possible, it is always best to plot the likelihood function, as on page 83. This confirms that the maximum likelihood estimate exists and is unique. In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later).
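One way to plot and maximize the likelihood in R (a sketch for illustration; the exact MLE 125/190 is already known from the calculus above, so the numerical answer is just a check):

L <- function(p) dbinom(125, size = 190, prob = p)       # L(p ; 125)
curve(L(x), from = 0.5, to = 0.8, xlab = "p", ylab = "L(p ; 125)")
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum  # about 0.6579
125 / 190                                                # 0.6578947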

the maximum likelihood estimaTOR of p is p= X . 190 x . is p= . the maximum likelihood estimate of p will be p= X . This means that even before we have made our observation of X. and we would obtain p= x . In the example above. 190 X ∼ Binomial(190. once we have observed that X = x.89 Estimators For the example above. 190 Exercise: Check this by maximizing the likelihood using x instead of 125. which we can write as X = x. we can provide a RULE for calculating the maximum likelihood estimate once X is observed: Rule: Let Whatever value of X we observe. 190 The maximum likelihood estimaTE of p. we had observation X = 125. p). A random variable specifying how an estimate is calculated from an observation is called an estimator. and the maximum likelihood estimate of p was 125 . Note that this expression is now a random variable: it depends on the random value of X . p= 190 It is clear that we could follow through the same working with any value of X.

1. Follow the steps on page 87 to ﬁnd the maximum likelihood estimator for p. Solve for p: p= x .90 General maximum likelihood estimator for Binomial(n.) 2. and set to 0 for the maximum: dL = dp n x−1 p (1 − p)n−x−1 x x − np = 0. p). 3. We make a single observation X = x. n This is the maximum likelihood estimate of p. Write down the distribution of X in terms of the unknown parameter: X ∼ Binomial(n. p) = n x p (1 − p)n−x. x) = P(X = x) when X ∼ Binomial(n. (n is known. p). Write down the likelihood function for this observed value: L(p . p) Take any situation in which our observation X has the distribution X ∼ Binomial(n. . x 4. Write down the observed value of X: Observed data: X = x. Diﬀerentiate the likelihood with respect to the parameter. (Exercise) 5. when p = p. where n is KNOWN and p is to be estimated.

333. just by a little. the MLE of p is deﬁnitely diﬀerent from 0. x = 207: the maximum likelihood estimate is p= x 207 = n 590 (0. p) and n is KNOWN. p) problem in which n is known. 207 were alphabetically ﬁrst of the candidates in the seat. to convert from the estimate to the estimator. n (Just replace the x in the MLE with an X .333 just means that we can’t DISTINGUISH any diﬀerence between p and 0. . and one of them (n) should only take discrete values 1. 2. we can plug in values of n and x to get an instant MLE for any Binomial(n. 3. . We will not consider problems of this type in Stats 210. What is the maximum likelihood estimate of p? Solution: Plug in the numbers n = 590. Out of 590 winners.8 that p was not signiﬁcantly diﬀerent from 1/3 = 0.333. This comes back to the meaning of signiﬁcantly diﬀerent in the statistical sense. If n is not known. we have a harder problem: we have two parameters.) By deriving the general maximum likelihood estimator for any problem of this sort. . However. Note: We showed in Section 2..8.333 in this example. Saying that p is not signiﬁcantly diﬀerent from 0. The maximum likelihood estimate gives us the ‘best’ estimate of p. . Let p be the probability that a winning candidate is alphabetically ﬁrst. We expect that p probably IS diﬀerent from 0.91 The maximum likelihood estimator of p is p= X . Note: We have only considered the class of problems for which X ∼ Binomial(n.333 from routine sampling vari- ability. Example: Recall the alphabetic advantage problem in Section 2.351) .

110) would be 24. For example.6) distribution in R. size=190. p = 0. 0. To generate (say) 100 random numbers from the Binomial(n = 190. 8 6 10 8 frequency of x frequency of x frequency of x 6 4 4 2 2 0 0 80 90 100 120 140 80 90 100 120 140 0 1 2 3 4 5 6 7 80 90 100 120 140 Each graph shows 100 random numbers from the Binomial(n = 190. if each histogram bar covers an interval of length 5. . Statistical packages like R have custom-made commands for doing this. p = 0. 190. prob=0.6) or in long-hand. and size gives the Binomial parameter n! Histograms The usual graph used to visualise a set of random numbers is the histogram. 0. and if 24 of the random numbers fall between 105 and 110.6) three diﬀerent times. Here are histograms from applying the command rbinom(100. we use: rbinom(100.6) Caution: the R inputs n and size are the opposite to what you might expect: n gives the required sample size.92 2.6) distribution.10 Random numbers and histograms We often wish to generate random numbers from a given distribution. 190. then the height of the histogram bar for the interval (105. rbinom(n=100. The height of each bar of the histogram shows how many of the random numbers fall into the interval represented by the bar.

the histogram starts to look identical to the probability function. 0. Histograms as the sample size increases Histograms are useful because they show the approximate shape of the underly- ing probability function. Eventually. They show how the histogram becomes smoother and less erratic as sample size increases.6)). obtained from the command hist(rbinom(100. the height of the bar plotted at x = 109 shows how many of the 100 random numbers are equal to 109. For example. 190. All the histograms below have bars covering an interval of 1 integer. Usually. 190. especially as the sample size gets large. and the histogram would be smoother. Each histogram bar covers an interval of 25 30 Histogram of rbinom(100.6) In all the histograms above. The probability function plots EXACT PROBABILITIES for the distribution. histogram bars would cover a larger interval. The histogram should have the same shape as the probability function. 0. . on the right is a histogram using the default settings in R. with a large enough sample size. 0. 190. the sum of the heights of all the bars is 100.6) Frequency 0 5 10 15 20 100 110 120 130 5 integers.93 Note: The histograms above have been specially adjusted so that each histogram bar covers an interval of just one integer. They are also useful for exploring the eﬀect of increasing sample size. because there are 100 observations. For example. rbinom(100. Note: diﬀerence between a histogram and the probability function The histogram plots OBSERVED FREQUENCIES of a set of random numbers.

Sample size 1000: rbinom(1000, 190, 0.6).
Sample size 10,000: rbinom(10000, 190, 0.6).
Sample size 100,000: rbinom(100000, 190, 0.6).
[Figures: histograms of these three samples, shown alongside the probability function P(X = x) for X ∼ Bin(190, 0.6), which is fixed and exact. The histograms become less erratic and approach the shape of the probability function as the sample size gets large.]
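The comparison between a histogram and the probability function can also be drawn directly (a sketch, with an arbitrary sample size): rescale the histogram to proportions and overlay the exact Binomial probabilities.

set.seed(1)
x <- rbinom(10000, size = 190, prob = 0.6)
hist(x, breaks = seq(min(x) - 0.5, max(x) + 0.5, by = 1), freq = FALSE,
     main = "Sample of 10,000 versus probability function", xlab = "x")
points(80:140, dbinom(80:140, 190, 0.6), type = "h", col = "red")   # exact P(X = x)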

2.11 Expectation

Given a random variable X that measures something, we often want to know what is the average value of X. For example, here are 30 random observations taken from the distribution X ∼ Binomial(n = 190, p = 0.6):
R command: rbinom(30, 190, 0.6)
116 116 117 122 111 112 114 120 112 102 125 116 97 105 108 117 118 111 116 121 107 113 120 114 114 124 116 118 119 120
[Figure: probability function P(X = x) for X ∼ Bin(190, 0.6).]

The average, or mean, of the first ten values is:
(116 + 116 + . . . + 112 + 102)/10 = 114.2.
The mean of the first twenty values is:
(116 + 116 + . . . + 116 + 121)/20 = 113.8.
The mean of the first thirty values is:
(116 + 116 + . . . + 119 + 120)/30 = 114.7.
The answers all seem to be close to 114. What would happen if we took the average of hundreds of values?

100 values from Binomial(190, 0.6):
R command: mean(rbinom(100, 190, 0.6))    Result: 114.86
Note: You will get a different result every time you run this command.

This distribution mean is called the expectation. 190. or µX . 0. 0. also called the expectation or mean. of the Bino- mial(190. . If we kept going for larger and larger sample sizes. The larger the sample size. or E(X). and is given by µX = E(X) = x xfX (x) = x xP(X = x) . of a discrete random variable X. This means it is a ﬁxed constant: there is nothing random about it. This is because 114 is the DISTRIBUTION MEAN: the mean value that we would get if we were able to draw an inﬁnite sample from the Binomial(190. we would keep getting answers closer and closer to 114. 0. 0. can be written as either E(X).6): R command: mean(rbinom(1000000. The expected value is a measure of the centre. 0.6)) Result: 114. weighted according to the probability of each value. of the set of values that X can take. or expected value. It is a FIXED property of the Binomial(190.6): R command: mean(rbinom(1000. Deﬁnition: The expected value.6) distribution. or average.0001 The average seems to be converging to the value 114.96 1000 values from Binomial(190.02 1 million values from Binomial(190. 0. If we took a very large sample of random numbers from the distribution of X . 0. the closer the average seems to get to 114.6) distribution. 190.6) distribution.6)) Result: 114. their average would be approximately equal to µX .

Example: Let X ∼ Binomial(n = 190, p = 0.6). What is E(X)?
E(X) = Σx x P(X = x) = Σ_{x=0}^{190} x (190 choose x) (0.6)^x (0.4)^(190−x).
Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114. We will see why in Section 2.14.

Explanation of the formula for expectation
We will move away from the Binomial distribution for a moment, and use a simpler example. Let the random variable X be defined as
X = 1 with probability 0.9, −1 with probability 0.1.
X takes only the values 1 and −1. What is the 'average' value of X?
Using (1 + (−1))/2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = −1.
Instead, think of observing X many times, say 100 times.
Roughly 90 of these 100 times will have X = 1.
Roughly 10 of these 100 times will have X = −1.
The average of the 100 values will be roughly
(90 × 1 + 10 × (−1)) / 100 = 0.8.
(We could repeat this for any sample size.)
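A quick simulation of this argument in R (a sketch, with an arbitrary sample size):

set.seed(1)
x <- sample(c(1, -1), size = 100000, replace = TRUE, prob = c(0.9, 0.1))
mean(x)   # close to 0.9 * 1 + 0.1 * (-1) = 0.8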

. This is why the distribution mean is given by E(X) = P(X = 1) × 1 + P(X = −1) × (−1).11: Let a and b be constants. E(X) = x P(X = x) × x. Then E(aX + b) = aE(X) + b.98 As the sample gets large. or in general. E(aX + b) = x (ax + b)fX (x) xfX (x) + b x x = a fX (x) = a E(X) + b × 1.9 × 1 + 0. Proof: Immediate from the deﬁnition of expectation. E(X) is a ﬁxed constant giving the average value we would get from a large sample of X . the average of the sample will get ever closer to 0.1 × (−1). Linear property of expectation Expectation is a linear operator: Theorem 2.

3. Note: We have: E(X) = 0.6.14 that whenever X ∼ Binomial(n.096 3 0.384 + 2 × 0.2).384 2 0. We will prove in Section 2.6 = 3 × 0.2)x(0.008 = 0. 0. 1 with probability p.008 Then 3 E(X) = x=0 xfX (x) = 0 × 0.8)3−x for x = 0.2 for X ∼ Binomial(3. Example 2: Let Y be Bernoulli(p) (Section 2. 0 with probability 1 − p. 1. 2.2). That is. x 0 0. .3). Write down the probability function of X and ﬁnd E(X). p).512 + 1 × 0. 0 y P(Y = y) 1 − p 1 p E(Y ) = 0 × (1 − p) + 1 × p = p. 0.99 Example: ﬁnding expectation from the probability function Example 1: Let X ∼ Binomial(3. Y = Find E(Y ).096 + 3 × 0.512 1 0. We have: P(X = x) = x fX (x) = P(X = x) 3 (0. then E(X) = np.

. Special case: when X and Y are INDEPENDENT: When X and Y are INDEPENDENT. . .100 Expectation of a sum of random variables: E(X + Y ) For ANY random variables X1 . The proof requires multivariate methods. to be studied later in the course. E(XY ) = E(X)E(Y ). . E(X + Y ) = E(X) + E(Y ) for ANY X and Y . + an E(Xn ). E(XY ) is NOT equal to E(X)E(Y ). . + Xn ) = E(X1) + E(X2) + . . or using their covariance (see later). This result holds for any random variables X1 . We can summarize this important result by saying: The expectation of a sum is the sum of the expectations – ALWAYS. 2. It does NOT require X1 . . . . we have: E (a1 X1 + a2 X2 + . E (X1 + X2 + . . . . an . . For any constants a1 . . Note: We can combine the result above with the linear property of expectation. . . General case: For general X and Y . . Xn to be independent. Xn . + an Xn ) = a1 E(X1) + a2 E(X2) + . . . . In particular. . Expectation of a product of random variables: E(XY ) There are two cases when ﬁnding the expectation of a product: 1. X2. + E(Xn ). . We have to ﬁnd E(XY ) either using their joint probability function (see later). . . . Xn.

Find the probability function of Y . . we would write the probability function of Y = X 2 as: y P(Y = y) 0 0.008 This is because Y takes the value 02 whenever X takes the value 0.384 4 0. 0. possible transformations of X include: √ X2 . Thus the probability that Y = 02 is the same as the probability that X = 0.2. The probability function for X is: x P(X = x) 0 0.2). given the random variable X.12 Variable transformations We often wish to transform random variables through a function.008 Thus the probability function for Y = X 2 is: y P(Y = y) 02 0.384 22 0.512 1 0.096 9 0. Overall. We often summarize all possible variable transformations by referring to Y = g(X) for some function g . X.512 1 0. For example.384 2 0. Example 1: Let X ∼ Binomial(3. it is very easy to ﬁnd the probability function for Y = g(X). For discrete random variables.008 To transform a discrete random variable. and let Y = X 2 . and so on. 4X 3 . given that the probability function for X is known.512 12 0. Simply change all the values and keep the probabilities the same..096 32 0.096 3 0. . transform the values and leave the probabilities alone..

008 9 0. while 30% hire the 3m balloon and 20% hire the 4m balloon.2 Let Y be the amount of gas required: Y = X 3 /2. The amount of helium gas in cubic metres required to ﬁll the balloons is h3 /2. His balloons come in three sizes: heights 2m.5 0.096 4 0. Find the probability function of Y . x (m) P(X = x) 2 0. The probability function for X is: and for Y = X 2 : x P(X = x) y P(Y = y) 0 0.384 2 0.2). Expected value of a transformed random variable We can ﬁnd the expectation of a transformed random variable just like any other random variable. 0. y (m 3 ) P(Y = y) 4 0.2 Transform the values. 50% of Mr Chance’s customers choose to hire the cheapest 2m balloon.3 32 0.096 3 0.008 .3 4 0.5 13. and leave the probabilities alone.384 1 0. Let X be the height of balloon ordered by a random customer. in Example 1 we had X ∼ Binomial(3. the amount of helium gas required for a randomly chosen customer. and 4m. The probability function of X is: height. The probability function of Y is: gas.512 1 0. 3m.512 0 0.Example 2: Mr Chance hires out giant helium balloons for advertising. For example.5 3 0. where h is the height of the balloon. and Y = X 2.

and leave the probabilities alone.384 + g(2) × 0. we could cut out the middle step of writing down the probability function of Y .096 + g(3) × 0. If we write g(X) = X 2 . this becomes: E{g(X)} = E(X 2) = g(0) × 0. 2 Note: E(X 2 ) is NOT the same as {E(X)}2 . Because we transform the values and keep the probabilities the same.096 + 32 × 0. we have: E(X 2 ) = 02 × 0. To make the calculation quicker.384 + 4 × 0. .512 + g(1) × 0.512 + 1 × 0. the expected value of g(X) is given by E{g(X)} = x g(x)P(X = x) = x g(x)fX (x).84. Transform the values.008. Clearly the same arguments can be extended to any function g(X) and any discrete random variable X: E{g(X)} = x g(x)P(X = x). Deﬁnition: For any function g and discrete random variable X.384 + 22 × 0.096 + 9 × 0. Check that {E(X)}2 = 0.103 Thus the expectation of Y = X is: E(Y ) = E(X 2 ) = 0 × 0.008.512 + 12 × 0.008 = 0.36.
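A quick check of these two numbers in R, using the same Binomial(3, 0.2) example:

x  <- 0:3
fx <- dbinom(x, size = 3, prob = 0.2)
sum(x^2 * fx)      # E(X^2) = 0.84
sum(x * fx)^2      # {E(X)}^2 = 0.36, which is different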

Example: Recall Mr Chance and his balloon-hire business from page 102. Let X be the height of balloon selected by a randomly chosen customer. The probability function of X is:
height, x (m):    2     3     4
P(X = x):        0.5   0.3   0.2

(a) What is the average amount of gas required per customer?
Gas required was X³/2 from page 102. Average gas per customer is E(X³/2).
E(X³/2) = Σx (x³/2) P(X = x) = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2 = 12.45 m³ gas.

(b) Mr Chance charges $400 × h to hire a balloon of height h. What is his expected earning per customer?
Expected earning is E(400X).
E(400X) = 400 × E(X)   (expectation is linear)
        = 400 × (2 × 0.5 + 3 × 0.3 + 4 × 0.2)
        = 400 × 2.7 = $1080 per customer.

(c) How much does Mr Chance expect to earn in total from his next 5 customers?
Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is
E(Z1 + Z2 + . . . + Z5) = E(Z1) + E(Z2) + . . . + E(Z5) = 5 × 1080 = $5400.

Getting the expectation . . .
Suppose X = 3 with probability 3/4, 8 with probability 1/4.
Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8.
So E(X) = (3/4) × 3 + (1/4) × 8.   (add up the values times how often they occur)

What about E(√X)?
√X = √3 with probability 3/4, √8 with probability 1/4.
So E(√X) = (3/4) × √3 + (1/4) × √8.   (add up the values times how often they occur)

Common mistakes (each of these is Wrong!):
i) E(√X) = √(E(X)) = √((3/4) × 3 + (1/4) × 8)
ii) E(√X) = (3/4) × 3 + (1/4) × 8
iii) E(√X) = √((3/4) × 3) + √((1/4) × 8) = √(3/4) × √3 + √(1/4) × √8

2.13 Variance

Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine?
$50,000? No: $50,000 is the average. About half the time, the money required will be GREATER than the average.
How much money should Mrs Tractor put in the machine if she wants to be 99% certain that there will be enough for the day's transactions?
Answer: it depends how much the amount withdrawn varies above and below its mean. For questions like this, we need the study of variance.
Variance is the average squared distance of a random variable from its own mean.

Definition: The variance of a random variable X is written as either Var(X) or σX², and is given by
σX² = Var(X) = E[(X − µX)²] = E[(X − EX)²].

Note: The variance is the square of the standard deviation of X, so
sd(X) = √Var(X) = √σX² = σX.
Similarly, the variance of a function of X is
Var(g(X)) = E[(g(X) − E(g(X)))²].

Variance as the average squared distance from the mean
The variance is a measure of how spread out are the values that X can take. It is the average squared distance between a value of X and the central (mean) value, µX.
[Figure: possible values x1, . . . , x6 of X around the central value µX, with the distances x2 − µX and x4 − µX marked.]
Var(X) = E[(X − µX)²]:
(1) Take the distance from observed values of X to the central point, µX. Square it to balance positive and negative distances.
(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and µX.

Note: The mean, µX, and the variance, σX², of X are just numbers: there is nothing random or variable about them.

Example: Let X = 3 with probability 3/4, 8 with probability 1/4. Then
E(X) = µX = 3 × (3/4) + 8 × (1/4) = 4.25,
Var(X) = σX² = (3/4) × (3 − 4.25)² + (1/4) × (8 − 4.25)² = 4.6875.
When we observe X, we get either 3 or 8: this is random. But µX is fixed at 4.25, and σX² is fixed at 4.6875, regardless of the outcome of X.


For a discrete random variable,

Var(X) = E[(X − µX)²] = Σx (x − µX)² fX(x) = Σx (x − µX)² P(X = x).

This uses the definition of the expected value of a function of X:
Var(X) = E(g(X)) where g(X) = (X − µX)².

Theorem 2.13A: (important)

Var(X) = E(X²) − (EX)² = E(X²) − µX².

Proof:
Var(X) = E[(X − µX)²]                (by definition)
       = E[X² − 2XµX + µX²]          (X is a random variable; µX is a constant)
       = E(X²) − 2µX E(X) + µX²      (by Thm 2.11)
       = E(X²) − 2µX² + µX²
       = E(X²) − µX².

Note: E(X²) = Σx x² fX(x) = Σx x² P(X = x). This is not the same as (EX)²:

e.g. X = 3 with probability 0.75, and 8 with probability 0.25.
Then µX = EX = 4.25, so µX² = (EX)² = (4.25)² = 18.0625.
But E(X²) = 3² × (3/4) + 8² × (1/4) = 22.75.

Thus E(X²) ≠ (EX)² in general.
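As a quick numerical check of Theorem 2.13A for this example (an illustrative R sketch, not part of the original notes):

    # X = 3 with probability 0.75, X = 8 with probability 0.25
    x  <- c(3, 8)
    px <- c(0.75, 0.25)
    EX  <- sum(x * px)           # 4.25
    EX2 <- sum(x^2 * px)         # 22.75
    EX2 - EX^2                   # 4.6875 = Var(X)
    sum((x - EX)^2 * px)         # 4.6875 again, the definition E[(X - mu)^2]

Both routes give the same variance, as the theorem guarantees.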


Theorem 2.13B: If a and b are constants and g(x) is a function, then
i) Var(aX + b) = a² Var(X);
ii) Var(a g(X) + b) = a² Var{g(X)}.

Proof: (part (i))
Var(aX + b) = E[{(aX + b) − E(aX + b)}²]
            = E[{aX + b − aE(X) − b}²]       (by Thm 2.11)
            = E[{aX − aE(X)}²]
            = E[a²{X − E(X)}²]
            = a² E[{X − E(X)}²]              (by Thm 2.11)
            = a² Var(X).
Part (ii) follows similarly.

Note: These are very different from the corresponding expressions for expectations (Theorem 2.11). Variances are more difficult to manipulate than expectations.

Example: finding expectation and variance from the probability function

Recall Mr Chance's balloons from page 102. The random variable Y is the amount of gas required by a randomly chosen customer. The probability function of Y is:

gas, y (m³):   4     13.5   32
P(Y = y):      0.5   0.3    0.2

Find Var(Y).


We know that E(Y ) = µY = 12.45 from page 104. First method: use Var(Y ) = E[(Y − µY )2]:

Var(Y ) = (4 − 12.45)2 × 0.5 + (13.5 − 12.45)2 × 0.3 + (32 − 12.45)2 × 0.2

= 112.47.

Second method (usually easier): use E(Y²) − µY²:
E(Y²) = 4² × 0.5 + 13.5² × 0.3 + 32² × 0.2 = 267.475.

So Var(Y ) = 267.475 − (12.45)2 = 112.47

as before.
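A quick check of both methods (an illustrative R sketch, not part of the original notes):

    y  <- c(4, 13.5, 32)
    py <- c(0.5, 0.3, 0.2)
    muY <- sum(y * py)              # 12.45
    sum((y - muY)^2 * py)           # 112.4725  (first method)
    sum(y^2 * py) - muY^2           # 112.4725  (second method)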

Variance of a sum of random variables: Var(X + Y ) There are two cases when ﬁnding the variance of a sum: 1. General case:

For general X and Y , Var(X + Y ) is NOT equal to Var(X) + Var(Y ).

We have to ﬁnd Var(X + Y ) using their covariance (see later).

2. Special case: when X and Y are INDEPENDENT:

When X and Y are INDEPENDENT, Var(X + Y ) = Var(X) + Var(Y ).


Interlude:

Guess whether each of the following statements is true or false.

TRUE or FALSE?

1. Toss a fair coin 10 times. The probability of getting 8 or more heads is less than 1%.

2. Toss a fair coin 200 times. The chance of getting a run of at least 6 heads or 6 tails in a row is less than 10%.

3. Consider a classroom with 30 pupils of age 5, and one teacher of age 50. The probability that the pupils all outlive the teacher is about 90%.

4. Open the Business Herald at the pages giving share prices, or open an atlas at the pages giving country areas or populations. Pick a column of ﬁgures.

share        last sale
A Barnett    143
Advantage    23
AFFCO        18
Air NZ       52
...          ...

The ﬁgures are over 5 times more likely to begin with the digit 1 than with the digit 9.

Answers: 1. FALSE: it is 5.5%. 2. FALSE: it is 97%. 3. FALSE: in NZ the probability is about 50%. 4. TRUE: in fact they are 6.5 times more likely.

2.14 Mean and variance of the Binomial(n, p) distribution

Let X ∼ Binomial(n, p). We have mentioned several times that E(X) = np. We now prove this and the additional result for Var(X).

If X ∼ Binomial(n, p), then:
E(X) = µX = np,
Var(X) = σX² = np(1 − p).
We often write q = 1 − p, so Var(X) = npq.

Easy proof: X as a sum of Bernoulli random variables

If X ∼ Binomial(n, p), then X is the number of successes out of n independent trials, each with P(success) = p. This means that we can write
X = Y1 + Y2 + ... + Yn,
where each Yi = 1 with probability p, and 0 with probability 1 − p. That is, Yi counts as a 1 if trial i is a success, and as a 0 if trial i is a failure. Overall, Y1 + ... + Yn is the total number of successes out of n independent trials, which is the same as X.

Note: Each Yi is a Bernoulli(p) random variable (Section 2.3).

Now if X = Y1 + Y2 + ... + Yn, and Y1, ..., Yn are independent, then:
E(X) = E(Y1) + E(Y2) + ... + E(Yn)           (does NOT require independence),
Var(X) = Var(Y1) + Var(Y2) + ... + Var(Yn)   (DOES require independence).

The probability function of each Yi is:

y:           0       1
P(Yi = y):   1 − p   p

Thus E(Yi) = 0 × (1 − p) + 1 × p = p.
Also, E(Yi²) = 0² × (1 − p) + 1² × p = p,
so Var(Yi) = E(Yi²) − (EYi)² = p − p² = p(1 − p).

Therefore:
E(X) = E(Y1) + E(Y2) + ... + E(Yn) = p + p + ... + p = n × p.
And:
Var(X) = Var(Y1) + Var(Y2) + ... + Var(Yn) = n × p(1 − p).

Thus we have proved that E(X) = np and Var(X) = np(1 − p).
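A quick simulation check of these formulae (an illustrative R sketch; the values n = 10 and p = 0.3 are chosen here only for demonstration):

    set.seed(1)
    x <- rbinom(100000, size = 10, prob = 0.3)
    mean(x)    # close to np = 3
    var(x)     # close to np(1 - p) = 2.1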

and when x = n. y = n − 1 = m. x’s into (x − 1)’s. wherever possible: e.115 Hard proof: for mathematicians (non-examinable) We show below how the Binomial mean and variance formulae can be derived directly from the probability function. So m E(X) = np y=0 m y p (1 − p)m−y y (Binomial Theorem) = np(p + (1 − p))m E(X) = np. E(X) = x=1 n! px(1 − p)n−x (n − x)!(x − 1)! Next: make n’s into (n − 1)’s. This gives.g. continuing. When x = 1. . x! (x − 1)! n So. n x E(X) = xfX (x) = x p (1 − p)n−x = x x=0 x=0 n n n x x=0 n! px (1 − p)n−x (n − x)!x! But x 1 = and also the ﬁrst term xfX (x) is 0 when x = 0. n px = p · px−1 E(X) = x=1 n(n − 1)! p · p(x−1) (1 − p)(n−1)−(x−1) [(n − 1) − (x − 1)]!(x − 1)! n = np what we want x=1 n − 1 x−1 p (1 − p)(n−1)−(x−1) x−1 need to show this sum = 1 Finally we let y = x − 1 and let m = n − 1. n − x = (n − 1) − (x − 1). y = 0. n! = n(n − 1)! etc. as required.

so instead of ﬁnding E(X 2). Note the steps: take out x(x−1) and replace n by (n−2). it will be easier to ﬁnd E[X(X − 1)] = E(X 2) − E(X) because then we will be able to cancel x(x−1) 1 = (x−2)! . y = x − 2. . x! Here goes: n E[X(X − 1)] = = x=0 n x=0 x(x − 1) n x p (1 − p)n−x x x(x − 1)n(n − 1)(n − 2)! p2p(x−2) (1 − p)(n−2)−(x−2) [(n − 2) − (x − 2)]!(x − 2)!x(x − 1) First two terms (x = 0 and x = 1) are 0 due to the x(x − 1) in the numerator. Thus n E[X(X − 1)] = p n(n − 1) = n(n − 1)p2 2 x=2 m y=0 n − 2 x−2 p (1 − p)(n−2)−(x−2) x−2 m y p (1 − p)m−y y if m = n − 2. Thus Var(X) = E(X 2) − (E(X))2 = E(X 2) − E(X) + E(X) − (E(X))2 = E[X(X − 1)] + E(X) − (E(X))2 = n(n − 1)p2 + np − n2 p2 = np(1 − p). x by (x−2) wherever possible. use the same ideas again. x 1 For E(X).116 For Var(X). we used x! = (x−1)! . sum=1 by Binomial Theorem So E[X(X − 1)] = n(n − 1)p2 .

Variance of the MLE for the Binomial p parameter

In Section 2.9 we derived the maximum likelihood estimator for the Binomial parameter p. Reminder: take any situation in which our observation X has the distribution X ∼ Binomial(n, p), where n is KNOWN and p is to be estimated. Make a single observation X = x. The maximum likelihood estimator of p is p̂ = X/n.

For example, in the deep-sea diver example introduced in Section 2.7, we estimated the probability that a diver has a daughter as p̂ = X/n = 125/190 = 0.658.

What is our margin of error on this estimate? Do we believe it is 0.658 ± 0.3 (say), in other words almost useless, or do we believe it is very precise, perhaps 0.658 ± 0.02? We assess the usefulness of estimators using their variance. Given p̂ = X/n, we have:

Var(p̂) = Var(X/n) = (1/n²) Var(X) = (1/n²) × np(1 − p) = p(1 − p)/n    (⋆)

for X ∼ Binomial(n, p).

In practice, however, we do not know the true value of p, so we cannot calculate the exact Var(p̂). Instead, we have to ESTIMATE Var(p̂) by replacing the unknown p in equation (⋆) by p̂. We call our estimated variance Var̂(p̂):

Var̂(p̂) = p̂(1 − p̂)/n.

The standard error of p̂ is se(p̂) = √Var̂(p̂). We usually quote the margin of error associated with p̂ as:

Margin of error = 1.96 × se(p̂) = 1.96 × √(p̂(1 − p̂)/n).

Example: For the deep-sea diver example, with n = 190 and p̂ = 0.658:
se(p̂) = √(0.658 × (1 − 0.658)/190) = 0.034,
Margin of error = 1.96 × 0.034 = 0.067.
For our final answer, we should therefore quote: p̂ = 0.658 ± 0.067, or p̂ = 0.658 (0.591, 0.725).

The expression p̂ ± 1.96 × se(p̂) gives an approximate 95% confidence interval for p under the Normal approximation. This result occurs because the Central Limit Theorem guarantees that p̂ will be approximately Normally distributed in large samples (large n). We will study the Central Limit Theorem in later chapters.
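These numbers can be reproduced directly (an illustrative R sketch, not part of the original notes):

    n <- 190; x <- 125
    phat <- x / n                          # 0.658
    se   <- sqrt(phat * (1 - phat) / n)    # 0.034
    phat + c(-1, 1) * 1.96 * se            # approximately (0.591, 0.725)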

Chapter 3: Modelling with Discrete Probability Distributions

In Chapter 2 we introduced several fundamental ideas: hypothesis testing, likelihood, expectation, and variance. Each of these was illustrated by the Binomial distribution. We now introduce several other discrete distributions and discuss their properties and usage. First we revise Bernoulli trials and the Binomial distribution.

Bernoulli Trials
A set of Bernoulli trials is a series of trials such that:
i) each trial has only 2 possible outcomes: Success and Failure;
ii) the probability of success, p, is constant for all trials;
iii) the trials are independent.
Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2. 2) Having children: each child can be thought of as a Bernoulli trial with outcomes {girl, boy} and P(girl) = 0.5.

3.1 Binomial distribution
Description: X ∼ Binomial(n, p) if X is the number of successes out of a fixed number n of Bernoulli trials, each with P(success) = p.
Probability function: fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x) for x = 0, 1, ..., n.
Mean: E(X) = np.  Variance: Var(X) = np(1 − p).
Sum of independent Binomials: If X ∼ Binomial(n, p) and Y ∼ Binomial(m, p), and if X and Y are independent and both share the same parameter p, then X + Y ∼ Binomial(n + m, p).

Shape: Usually symmetrical unless p is close to 0 or 1. Peaks at approximately np. [Figure: Binomial probability functions for n = 10, p = 0.5 (symmetrical); n = 10, p = 0.9 (skewed for p close to 1); n = 100, p = 0.9 (less skew for p = 0.9 if n is large).]

3.2 Geometric distribution

Like the Binomial distribution, the Geometric distribution is defined in terms of a sequence of Bernoulli trials.
• The Binomial distribution counts the number of successes out of a fixed number of trials.
• The Geometric distribution counts the number of trials before the first success occurs. This means that the Geometric distribution counts the number of failures before the first success.
If every trial has probability p of success, we write: X ∼ Geometric(p).

Examples: 1) X = number of boys before the first girl in a family: X ∼ Geometric(p = 0.5). 2) Fish jumping up a waterfall. On every jump the fish has probability p of reaching the top. Let X be the number of failed jumps before the fish succeeds. Then X ∼ Geometric(p).

Properties of the Geometric distribution

i) Description: X ∼ Geometric(p) if X is the number of failures before the first success in a series of Bernoulli trials with P(success) = p.

ii) Probability function: For X ∼ Geometric(p),
fX(x) = P(X = x) = (1 − p)^x p for x = 0, 1, 2, ....
Explanation: P(X = x) = (1 − p)^x × p: we need x failures, and the final trial must be a success.

Difference between Geometric and Binomial: For the Geometric distribution, the trials must always occur in the order FF...FS (x failures, then the success). For the Binomial distribution, failures and successes can occur in any order: e.g. FF...FS, FSF...F, SFF...F, etc. This is why the Geometric distribution has probability function
P(x failures, 1 success) = (1 − p)^x p,
while the Binomial distribution has probability function
P(x failures, 1 success) = (x + 1 choose x) (1 − p)^x p.

iii) Mean and variance: For X ∼ Geometric(p),
E(X) = (1 − p)/p = q/p,   Var(X) = (1 − p)/p² = q/p².

iv) Sum of independent Geometric random variables: If X1, ..., Xk are independent, and each Xi ∼ Geometric(p), then X1 + ... + Xk ∼ Negative Binomial(k, p) (see later).

v) Shape: Geometric probabilities are always greatest at x = 0. The distribution always has a long right tail (positive skew). The length of the tail depends on p: for small p, there could be many failures before the first success, so the tail is long; for large p, a success is likely to occur almost immediately, so the tail is short. [Figure: probability functions for p = 0.3 (small p), p = 0.5 (moderate p), and p = 0.9 (large p).]

vi) Likelihood: For any random variable, the likelihood function is just the probability function expressed as a function of the unknown parameter. If:
• X ∼ Geometric(p);
• p is unknown;
• the observed value of X is x;
then the likelihood function is L(p; x) = p(1 − p)^x for 0 < p < 1. Maximize L with respect to p to find the MLE.

Example: we observe a fish making 5 failed jumps before reaching the top of a waterfall. We wish to estimate the probability of success for each jump. Then L(p; 5) = p(1 − p)^5 for 0 < p < 1.
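The maximization can be checked numerically (an illustrative R sketch; analytically, for a single Geometric observation x the MLE is p̂ = 1/(1 + x) = 1/6 here):

    L <- function(p) p * (1 - p)^5                  # likelihood for the fish example
    optimize(L, c(0, 1), maximum = TRUE)$maximum    # approximately 0.1667 = 1/6
    dgeom(5, prob = 1/6)    # R's dgeom also counts failures before the first success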

1). Note that the lower limit of the summation becomes x = 1 because the term for x = 0 vanishes. (1 − q)2 as stated in (3. (3.1) and ∞ x=2 x(x − 1)q x−2 = (for |q| < 1). by diﬀerentiating both sides of (3.2) is obtained similarly.1) and (3. xq x−1 = 1 (1 − q)2 2 (1 − q)3 (for |q| < 1).2): Consider the inﬁnite sum of a geometric progression: ∞ x=0 qx = 1 1−q d dq 1 1−q (for |q| < 1). The proof of (3.123 For mathematicians: proof of Geometric mean and variance formulae (non-examinable) We wish to prove that E(X) = We use the following results: ∞ x=1 1−p p and Var(X) = 1−p p2 when X ∼ Geometric(p).2) Proof of (3.1) with respect to q (Exercise). (3. Diﬀerentiate both sides with respect to q: d dq ∞ x=0 ∞ x=1 ∞ x=0 qx = d x 1 (q ) = dq (1 − q)2 xq x−1 = 1 . .

E(X) = = ∞ x=0 ∞ xP(X = x) xpq x xq x xq x−1 (by equation (3. 2 (1 − q)3 2q 2 q Var(X) = 2 + − p p as required.2)) = pq 2 x(x − 1)q x−2 = pq 2 2q 2 = . p2 Thus by (⋆).1)) (because 1 − q = p) (where q = 1 − p) (lower limit becomes x = 1 because term in x = 0 is zero) = p x=0 ∞ x=1 ∞ = pq x=1 = pq = pq = q . we use Var(X) = E(X 2 ) − (EX)2 = E {X(X − 1)} + E(X) − (EX)2 . For Var(X). because q + p = 1. p2 p . p 1 (1 − q)2 1 p2 as required. Now E{X(X − 1)} = = ∞ x=0 ∞ x=0 (⋆) x(x − 1)P(X = x) x(x − 1)pq x ∞ x=2 (where q = 1 − p) (note that terms below x = 2 vanish) (by equation (3. q p 2 = q(q + p) q = 2.124 Now we can ﬁnd E(X) and Var(X).

3.3 Negative Binomial distribution

The Negative Binomial distribution is a generalised form of the Geometric distribution:
• the Geometric distribution counts the number of failures before the first success;
• the Negative Binomial distribution counts the number of failures before the k'th success.
If every trial has probability p of success, we write: X ∼ NegBin(k, p).

Examples: 1) X = number of boys before the second girl in a family: X ∼ NegBin(k = 2, p = 0.5). 2) Tom needs to pass 24 papers to complete his degree. He passes each paper with probability p, independently of all other papers. Let X be the number of papers Tom fails in his degree. Then X ∼ NegBin(24, p).

Properties of the Negative Binomial distribution

i) Description: X ∼ NegBin(k, p) if X is the number of failures before the k'th success in a series of Bernoulli trials with P(success) = p.

ii) Probability function: For X ∼ NegBin(k, p),
fX(x) = P(X = x) = (k + x − 1 choose x) p^k (1 − p)^x for x = 0, 1, 2, ....

Explanation:
• For X = x, we need x failures and k successes.
• The trials stop when we reach the k'th success, so the last trial must be a success.
• This leaves x failures and k − 1 successes to occur in any order: a total of k − 1 + x trials. For example, if x = 3 failures and k = 2 successes, we could have: FFFSS, FFSFS, FSFFS, SFFFS.
So
P(X = x) = (k + x − 1 choose x) × p^k × (1 − p)^x,
where (k + x − 1 choose x) counts the arrangements of (k − 1) successes and x failures over the first (k − 1 + x) trials, p^k is for the k successes, and (1 − p)^x is for the x failures.

iii) Mean and variance: For X ∼ NegBin(k, p),
E(X) = k(1 − p)/p = kq/p,   Var(X) = k(1 − p)/p² = kq/p².
These results can be proved from the fact that the Negative Binomial distribution is obtained as the sum of k independent Geometric random variables:
X = Y1 + ... + Yk, where each Yi ∼ Geometric(p) and the Yi are independent,
⇒ E(X) = kE(Yi) = kq/p and Var(X) = kVar(Yi) = kq/p².

iv) Sum of independent Negative Binomial random variables: If X and Y are independent, with X ∼ NegBin(k, p) and Y ∼ NegBin(m, p) sharing the same value of p, then X + Y ∼ NegBin(k + m, p).

v) Shape: The Negative Binomial is flexible in shape. [Figure: probability functions for various values of k and p, e.g. k = 3, p = 0.5; k = 3, p = 0.8; k = 10, p = 0.5.]

vi) Likelihood: As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If:
• X ∼ NegBin(k, p);
• k is known;
• p is unknown;
• the observed value of X is x;
then the likelihood function is L(p; x) = (k + x − 1 choose x) p^k (1 − p)^x for 0 < p < 1. Maximize L with respect to p to find the MLE.

Example: Tom fails a total of 4 papers before finishing his degree. What is his pass probability for each paper?
X = # failed papers before 24 passed papers: X ∼ NegBin(24, p).
Observation: X = 4 failed papers.
Likelihood: L(p; 4) = (24 + 4 − 1 choose 4) p^24 (1 − p)^4 = (27 choose 4) p^24 (1 − p)^4 for 0 < p < 1.
Maximize L with respect to p to find the MLE.
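A numerical check of the maximization (an illustrative R sketch; setting dL/dp = 0 gives the closed-form answer p̂ = k/(k + x) = 24/28 ≈ 0.857):

    L <- function(p) choose(27, 4) * p^24 * (1 - p)^4
    optimize(L, c(0, 1), maximum = TRUE)$maximum    # approximately 0.857
    dnbinom(4, size = 24, prob = 24/28)   # R's dnbinom uses the same "failures before the k'th success" definition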

3.4 Hypergeometric distribution: sampling without replacement

The hypergeometric distribution is used when we are sampling without replacement from a finite population.

i) Description: Suppose we have N objects:
• M of the N objects are special;
• the other N − M objects are not special.
We remove n objects at random without replacement. Let X = number of the n removed objects that are special. Then X ∼ Hypergeometric(N, M, n).

Example: Ron has a box of Chocolate Frogs. There are 20 chocolate frogs in the box. Eight of them are dark chocolate, and twelve of them are white chocolate. Ron grabs a random handful of 5 chocolate frogs and stuffs them into his mouth when he thinks that no-one is looking. Let X be the number of dark chocolate frogs he picks. Then X ∼ Hypergeometric(N = 20, M = 8, n = 5).

ii) Probability function: For X ∼ Hypergeometric(N, M, n),
fX(x) = P(X = x) = (M choose x)(N − M choose n − x) / (N choose n)
for x = max(0, n + M − N) to x = min(n, M).

Explanation: We need to choose x special objects and n − x other objects.
• Number of ways of selecting x special objects from the M available: (M choose x).
• Number of ways of selecting n − x other objects from the N − M available: (N − M choose n − x).
• Total number of ways of choosing x special objects and (n − x) other objects: (M choose x) × (N − M choose n − x).
• Overall number of ways of choosing n objects from N: (N choose n).
Thus
P(X = x) = number of desired ways / total number of ways = (M choose x)(N − M choose n − x) / (N choose n).

Note: We need 0 ≤ x ≤ M (number of special objects) and 0 ≤ n − x ≤ N − M (number of other objects). After some working, this gives us the stated constraint that x runs from max(0, n + M − N) to min(n, M).

Example: What is the probability that Ron selects 3 white and 2 dark chocolates?
X = # dark chocolates. There are N = 20 chocolates, including M = 8 dark chocolates. We need
P(X = 2) = (8 choose 2)(12 choose 3) / (20 choose 5) = (28 × 220)/15504 = 0.397.
Same answer as Tutorial 3 Q2(c).

iii) Mean and variance: For X ∼ Hypergeometric(N, M, n),
E(X) = np,   Var(X) = np(1 − p) (N − n)/(N − 1),   where p = M/N.
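The chocolate example above can be checked directly (an illustrative R sketch, not part of the original notes):

    choose(8, 2) * choose(12, 3) / choose(20, 5)   # 0.3973
    dhyper(2, m = 8, n = 12, k = 5)                # same answer; in R, m = special objects, n = others, k = sample size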

iv) Shape: The Hypergeometric distribution is similar to the Binomial distribution when n/N is small, because removing n objects does not change the overall composition of the population very much when n/N is small. [Figure: Hypergeometric(30, 12, 10) compared with Binomial(10, 12/30).] For n/N < 0.1 we often approximate the Hypergeometric(N, M, n) distribution by the Binomial(n, p = M/N) distribution. As N → ∞, Hypergeometric(N, M, n) → Binomial(n, M/N).

Note: The Hypergeometric distribution can be used for opinion polls, because these involve sampling without replacement from a finite population. The Binomial distribution is used when the population is sampled with replacement.

A note about distribution names
Discrete distributions often get their names from mathematical power series.
• Binomial probabilities sum to 1 because of the Binomial Theorem: (p + (1 − p))^n = <sum of Binomial probabilities> = 1.
• Negative Binomial probabilities sum to 1 by the Negative Binomial expansion, i.e. the Binomial expansion with a negative power, −k: p^k (1 − (1 − p))^(−k) = <sum of NegBin probabilities> = 1.
• Geometric probabilities sum to 1 because they form a Geometric series: p Σ_{x=0}^{∞} (1 − p)^x = <sum of Geometric probabilities> = 1.

3.5 Poisson distribution

When is the next volcano due to erupt in Auckland? Any moment now, unfortunately! (give or take 1000 years or so...) A volcano could happen in Auckland this afternoon, or it might not happen for another 1000 years. Volcanoes are almost impossible to predict: they seem to happen completely at random. A distribution that counts the number of random events in a fixed space of time is the Poisson distribution.

How many road accidents will there be in NZ this year? X ∼ Poisson.
How many volcanoes will erupt over the next 1000 years? X ∼ Poisson.
How many cars will cross the Harbour Bridge today? X ∼ Poisson.

The Poisson distribution arose from a mathematical formulation called the Poisson Process, published by Siméon-Denis Poisson in 1837.

Poisson Process
The Poisson process counts the number of events occurring in a fixed time or space, when events occur independently and at a constant average rate.

Example: Let X be the number of road accidents in a year in New Zealand. Suppose that:
i) all accidents are independent of each other;
ii) accidents occur at a constant average rate of λ per year;
iii) accidents cannot occur simultaneously.
Then the number of accidents in a year, X, has the distribution X ∼ Poisson(λ).

Number of accidents in one year
Let X be the number of accidents to occur in one year: X ∼ Poisson(λ). The probability function for X ∼ Poisson(λ) is
P(X = x) = (λ^x / x!) e^(−λ) for x = 0, 1, 2, ....

Number of accidents in t years
Let Xt be the number of accidents to occur in time t years. Then Xt ∼ Poisson(λt), and
P(Xt = x) = ((λt)^x / x!) e^(−λt) for x = 0, 1, 2, ....

General definition of the Poisson process
Take any sequence of random events such that:
i) all events are independent;
ii) events occur at a constant average rate of λ per unit time;
iii) events cannot occur simultaneously.
Let Xt be the number of events to occur in time t. Then Xt ∼ Poisson(λt), and
P(Xt = x) = ((λt)^x / x!) e^(−λt) for x = 0, 1, 2, ....

Note: For a Poisson process in space, let XA = # events in an area (or volume) of size A. Then XA ∼ Poisson(λA). Example: XA = number of raisins in a volume A of currant bun.

Where does the Poisson formula come from? (Sketch idea, for mathematicians; non-examinable.)
The formal definition of the Poisson process is as follows.
Definition: The random variables {Xt : t > 0} form a Poisson process with rate λ if:
i) events occurring in any time interval are independent of those occurring in any other disjoint time interval;
ii) lim_{δt↓0} P(exactly one event occurs in time interval [t, t + δt]) / δt = λ;
iii) lim_{δt↓0} P(more than one event occurs in time interval [t, t + δt]) / δt = 0.
These conditions can be used to derive a partial differential equation for a function known as the probability generating function of Xt. The partial differential equation is solved to provide the form P(Xt = x) = ((λt)^x / x!) e^(−λt).

Poisson distribution
The Poisson distribution is not just used in the context of the Poisson process. It is also used in many other situations, often as a subjective model (see Section 3.6). Its properties are as follows.
i) Probability function: For X ∼ Poisson(λ),
fX(x) = P(X = x) = (λ^x / x!) e^(−λ) for x = 0, 1, 2, ....
The parameter λ is called the rate of the Poisson distribution.

ii) Mean and variance: The mean and variance of the Poisson(λ) distribution are both λ:
E(X) = Var(X) = λ when X ∼ Poisson(λ).
Notes: 1. It makes sense that E(X) = λ: by definition, λ is the average number of events per unit time in the Poisson process. 2. The variance of the Poisson distribution increases with the mean (in fact, variance = mean). This is often the case in real life: there is more uncertainty associated with larger numbers than with smaller numbers.

iii) Sum of independent Poisson random variables: If X and Y are independent, with X ∼ Poisson(λ) and Y ∼ Poisson(µ), then X + Y ∼ Poisson(λ + µ).

iv) Shape: The shape of the Poisson distribution depends upon the value of λ. For small λ, the distribution has positive (right) skew. As λ increases, the distribution becomes more and more symmetrical, until for large λ it has the familiar bell-shaped appearance. [Figure: probability functions for λ = 1, λ = 3.5, and λ = 100.]
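A quick simulation check that the mean and variance are both λ (an illustrative R sketch; λ = 3.5 is chosen only for demonstration):

    set.seed(1)
    x <- rpois(100000, lambda = 3.5)
    mean(x)   # approximately 3.5
    var(x)    # approximately 3.5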

v) Likelihood: As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If:
• X ∼ Poisson(λ);
• λ is unknown;
• the observed value of X is x;
then the likelihood function is L(λ; x) = (λ^x / x!) e^(−λ) for 0 < λ < ∞. Maximize L with respect to λ to find the MLE, λ̂.

Example: 28 babies were born in Mt Roskill yesterday. Estimate λ.
Let X = # babies born in a day in Mt Roskill. Assume that X ∼ Poisson(λ).
Observation: X = 28 babies.
Likelihood: L(λ; 28) = (λ^28 / 28!) e^(−λ) for 0 < λ < ∞.
We find that λ̂ = x = 28.

Similarly, the maximum likelihood estimator of λ is λ̂ = X. Thus the estimator variance is Var(λ̂) = Var(X) = λ, because X ∼ Poisson(λ). Because we don't know λ, we have to estimate the variance: Var̂(λ̂) = λ̂.

For mathematicians: proof of Poisson mean and variance formulae (non-examinable)
We wish to prove that E(X) = Var(X) = λ for X ∼ Poisson(λ). For X ∼ Poisson(λ), the probability function is fX(x) = (λ^x / x!) e^(−λ) for x = 0, 1, 2, ....

So E(X) = Σ_{x=0}^{∞} x fX(x) = Σ_{x=0}^{∞} x (λ^x / x!) e^(−λ)
= Σ_{x=1}^{∞} (λ^x / (x − 1)!) e^(−λ)             (the term for x = 0 is 0)
= λ Σ_{x=1}^{∞} (λ^(x−1) / (x − 1)!) e^(−λ)       (writing everything in terms of x − 1)
= λ Σ_{y=0}^{∞} (λ^y / y!) e^(−λ)                 (putting y = x − 1)
= λ, because the sum = 1 (sum of Poisson probabilities).
So E(X) = λ, as required.

For Var(X), we use:
Var(X) = E(X²) − (EX)² = E[X(X − 1)] + E(X) − (EX)² = E[X(X − 1)] + λ − λ².
But E[X(X − 1)] = Σ_{x=0}^{∞} x(x − 1)(λ^x / x!) e^(−λ)
= Σ_{x=2}^{∞} (λ^x / (x − 2)!) e^(−λ)             (terms for x = 0 and x = 1 are 0)
= λ² Σ_{x=2}^{∞} (λ^(x−2) / (x − 2)!) e^(−λ)      (writing everything in terms of x − 2)
= λ² Σ_{y=0}^{∞} (λ^y / y!) e^(−λ)                (putting y = x − 2)
= λ².
So Var(X) = E[X(X − 1)] + λ − λ² = λ² + λ − λ² = λ, as required.

3.6 Subjective modelling

Most of the distributions we have talked about in this chapter are exact models for the situation described. For example, the Binomial distribution describes exactly the distribution of the number of successes in n Bernoulli trials. However, there is often no exact model available. If so, we will use a subjective model. In a subjective model, we pick a probability distribution to describe a situation just because it has properties that we think are appropriate to the situation, such as the right sort of symmetry or skew, or the right sort of relationship between variance and mean.

Example: Distribution of word lengths for English words.
Let X = number of letters in an English word chosen at random from the dictionary. If we plot the frequencies on a barplot, we see that the shape of the distribution is roughly Poisson. [Figure: word lengths from 25109 English words, modelled as X − 1 ∼ Poisson(6.22); the Poisson probabilities (with λ estimated by maximum likelihood) are plotted as points overlaying the barplot.] We need to use X ∼ 1 + Poisson because X cannot take the value 0. The fit of the Poisson distribution is quite good.

In this example we can not say that the Poisson distribution represents the number of events in a fixed time or space: instead, it is being used as a subjective model for word length.

Can a Poisson distribution fit any data? The answer is no: in fact the Poisson distribution is very inflexible. Here are stroke counts from 13061 Chinese characters; X is the number of strokes in a randomly chosen character. The best-fitting Poisson distribution (found by MLE) is overlaid, and the fit of the Poisson distribution is awful. [Figure: stroke counts from 13061 Chinese characters with the best Poisson fit.]

It turns out, however, that the Chinese stroke distribution is well-described by a Negative Binomial model. [Figure: the best-fitting Negative Binomial distribution, NegBin(k = 23.7, p = 0.64), found by MLE.] The fit is very good. However, X does not represent the number of failures before the k'th success: the NegBin is a subjective model.

g.001) can be > 0. b). It describes a continuously varying quantity such as time or height. 1]? • the smallest number is 0. but P(0. This means we should use the distribution function. even if the list is inﬁnite: e. the Emperor Joseph II responded wryly. But suppose that X takes values in a continuous set. When X is continuous.g. e. ‘Too many notes. FX (x) = P(X ≤ x).. there simply wouldn’t be enough probability to go round. x fX (x) = P(X = x) 0 p 1 pq 2 pq 2 . there are so many numbers in any continuous set that each of them must have probability 0.1 Introduction When Mozart performed his opera Die Entf¨hrung aus dem u Serail.. but what is the next smallest? 0. In fact.0001? 0. Although we cannot assign a probability to any value of X. P(X = 1) = 0. for X ∼ Geometric(p). 1). . how would you list all the numbers in the interval [0. for which we can list all the values and their probabilities.. however ‘small’.999 ≤ X ≤ 1.. Mozart!’ In this chapter we meet a diﬀerent problem: too many numbers! We have met discrete random variables.0000000001? We just end up talking nonsense.01? 0.139 Chapter 4: Continuous Random Variables 4. we are able to assign probabilities to intervals: eg. ∞) or (0. [0. We can’t even begin to list all the values that X can take. For example. . A continuous random variable takes values in a continuous interval (a. If there was a probability > 0 for all the numbers in a continuous set. P(X = x) = 0 for ALL x. The probability function is meaningless.

The cumulative distribution function, FX(x)

Recall that for discrete random variables:
• FX(x) = P(X ≤ x);
• FX(x) is a step function: probability accumulates in discrete steps. [Figure.]

For a continuous random variable:
• FX(x) = P(X ≤ x);
• FX(x) is a continuous function: probability accumulates continuously. [Figure.]

As before, P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) − F(a). However, for a continuous random variable, P(X = a) = 0, so it makes no difference whether we say P(a < X ≤ b) or P(a ≤ X ≤ b). For a continuous random variable,
P(a < X < b) = P(a ≤ X ≤ b) = FX(b) − FX(a).

This is not true for a discrete random variable: in fact, for a discrete random variable with values 0, 1, 2, ...,
P(a < X < b) = P(a + 1 ≤ X ≤ b − 1) = FX(b − 1) − FX(a).
Endpoints are not important for continuous r.v.s. Endpoints are very important for discrete r.v.s.

4.2 The probability density function

Although the cumulative distribution function gives us an interval-based tool for dealing with continuous random variables, it is not very good at telling us what the distribution looks like. For this we use a different tool called the probability density function. The probability density function (p.d.f.) is the best way to describe and recognise a continuous random variable. We use it all the time to calculate probabilities and to gain an intuitive feel for the shape and nature of the distribution. Surprisingly, it is quite difficult to describe exactly what the probability density function is. In this section we take some time to motivate and describe this fundamental idea.

Using the p.d.f. is like recognising your friends by their faces. You can chat on the phone, write emails or send txts to each other all day, but you never really know a person until you've seen their face. Just like a cell-phone for keeping in touch, the cumulative distribution function is a tool for facilitating our interactions with the continuous random variable. However, we never really understand the random variable until we've seen its 'face': the probability density function.

All-time top-ten 100m sprint times
The histogram below shows the best 10 sprint times from the 168 all-time top male 100m sprinters, representing the top 10 times up to 2002 from each of the 168 sprinters. There are 1680 times in total. [Figure: histogram of frequency against time (s).] Out of interest, here are the summary statistics: Min. 9.78, 1st Qu. 10.08, Median 10.15, Mean 10.14, 3rd Qu. 10.21, Max. 10.41.

We could plot this histogram using different time intervals. [Figure: histograms of the same data with 0.1s, 0.05s, 0.02s, and 0.01s intervals.] We see that each histogram has broadly the same shape, although the heights of the bars and the interval widths are different.

The histograms tell us the most intuitive thing we wish to know about the distribution: its shape:
• the most probable times are close to 10.2 seconds;
• times below about 10.0 seconds and above about 10.3 seconds have low probability;
• the distribution of times has a long left tail (left skew).

We could fit a curve over any of these histograms to show the desired shape, but the problem is that the histograms are not standardized: every time we change the interval width, the heights of the bars change. How can we derive a curve or function that captures the common shape of the histograms, but keeps a constant height? What should that height be?

First idea: plot the probabilities instead of the frequencies. The height of each histogram bar now represents the probability of getting an observation in that bar. [Figure: probability histograms for three different interval widths.] This doesn't work, because the height (probability) still depends upon the bar width: wider bars have higher probabilities.

The standardized histogram
We now focus on an idealized (smooth) version of the sprint times distribution, rather than using the exact 1680 sprint times observed. We are aiming to derive a curve, or function, that captures the shape of the histograms, but will keep the same height for any choice of histogram bar width.

Second idea: plot the probabilities divided by bar width. The height of each histogram bar now represents the probability of getting an observation in that bar, divided by the width of the bar. These histograms are called standardized histograms. [Figure: standardized histograms (probability / interval width) for 0.1s, 0.05s, 0.02s, and 0.01s intervals.] This seems to be exactly what we need! The same curve fits nicely over all the histograms and keeps the same height regardless of the bar width. The nice-fitting curve is the probability density function. But... what is it?!

The probability density function

We have seen that there is a single curve that fits nicely over any standardized histogram from a given distribution. This curve is called the probability density function (p.d.f.). We will write the p.d.f. of a continuous random variable X as fX(x). The p.d.f. fX(x) is in fact the limit of the standardized histogram as the bar width approaches zero.

What is the height of the standardized histogram bar? For an interval from x to x + t, the standardized histogram plots the probability of an observation falling between x and x + t, divided by the width of the interval, t. Thus the height of the standardized histogram bar over the interval from x to x + t is:
probability / interval width = P(x ≤ X ≤ x + t) / t = (FX(x + t) − FX(x)) / t,
where FX(x) is the cumulative distribution function.

Now consider the limit as the histogram bar width t goes to 0: this limit is DEFINED TO BE the probability density function at x:
fX(x) = lim_{t→0} (FX(x + t) − FX(x)) / t   by definition.
This expression should look familiar: it is the derivative of FX(x). So, as the bars of the standardized histogram get narrower, the bars get closer and closer to the p.d.f. curve.

The p.d.f. fX(x) is clearly NOT the probability of x: for example, in the sprint times we can have fX(x) = 4, so it is definitely NOT a probability.

Formal definition of the probability density function

Definition: Let X be a continuous random variable with distribution function FX(x). The probability density function (p.d.f.) of X is defined as
fX(x) = dFX/dx = F′X(x).

It gives:
• the RATE at which probability is accumulating at any given point, F′X(x);
• the SHAPE of the distribution of X.
The p.d.f. is defined to be a single, unchanging curve that describes the SHAPE of any histogram drawn from the distribution of X.

Using the probability density function to calculate probabilities
As well as showing us the shape of the distribution of X, the probability density function has another major use:
• it calculates probabilities by integration.
Suppose we want to calculate P(a ≤ X ≤ b). We already know that P(a ≤ X ≤ b) = FX(b) − FX(a). But we also know that dFX/dx = fX(x), so FX(x) = ∫ fX(x) dx (without constants). In fact:
FX(b) − FX(a) = ∫_a^b fX(x) dx.

This is a very important result: Let X be a continuous random variable with probability density function fX(x). Then

P(a ≤ X ≤ b) = P(X ∈ [a, b]) = ∫_a^b fX(x) dx.

This means that we can calculate probabilities by integrating the p.d.f.: P(a ≤ X ≤ b) is the AREA under the curve fX(x) between a and b. [Figure.]

The total area under the p.d.f. curve is:
total area = ∫_{−∞}^{∞} fX(x) dx = FX(∞) − FX(−∞) = 1 − 0 = 1.
This says that the total area under the p.d.f. curve is equal to the total probability that X takes a value between −∞ and +∞, which is 1. [Figure.]

Using the p.d.f. to calculate the distribution function, FX(x)

Suppose we know the probability density function, fX(x), and wish to calculate the distribution function, FX(x). We use the following formula:

Distribution function: FX(x) = ∫_{−∞}^{x} fX(u) du.

Proof: ∫_{−∞}^{x} fX(u) du = FX(x) − FX(−∞) = FX(x) − 0 = FX(x).

In words, writing FX(x) = ∫_{−∞}^{x} fX(u) du means: integrate fX(u) as the dummy variable u ranges from −∞ to x.

Writing FX(x) = ∫_{−∞}^{x} fX(x) dx is WRONG and MEANINGLESS: you will LOSE A MARK every time. It's nonsense! ∫_{−∞}^{x} fX(x) dx would mean: integrate fX(x) as x ranges from −∞ to x. How can x range from −∞ to x?!

Why do we need fX(x)? Why not stick with FX(x)?

These graphs show FX(x) and fX(x) from the men's 100m sprint times (X is a random top ten 100m sprint time). [Figure: F(x) and f(x) plotted against x.] Just using FX(x) gives us very little intuition about the problem. For example, which is the region of highest probability? Using the c.d.f., FX(x), we would have to inspect the part of the curve with the steepest gradient: very difficult to see. Using the p.d.f., fX(x), we can see that it is about 10.1 to 10.2 seconds.

Example of calculations with the p.d.f.

Let fX(x) = k e^(−2x) for 0 < x < ∞, and 0 otherwise.
(i) Find the constant k.
(ii) Find P(1 < X ≤ 3).
(iii) Find the cumulative distribution function, FX(x), for all x.

(i) We need ∫_{−∞}^{∞} fX(x) dx = 1, i.e.
∫_{−∞}^{0} 0 dx + ∫_{0}^{∞} k e^(−2x) dx = 1
[k e^(−2x) / (−2)]_0^∞ = 1

(−k/2)(e^(−∞) − e^0) = 1
(−k/2)(0 − 1) = 1
k = 2.

(ii) P(1 < X ≤ 3) = ∫_1^3 fX(x) dx = ∫_1^3 2 e^(−2x) dx = [2 e^(−2x) / (−2)]_1^3 = −e^(−2×3) + e^(−2×1) = 0.132.

(iii) FX(x) = ∫_{−∞}^{x} fX(u) du.
When x ≤ 0: FX(x) = ∫_{−∞}^{x} 0 du = 0.
When x > 0: FX(x) = ∫_{−∞}^{0} 0 du + ∫_0^x 2 e^(−2u) du = 0 + [2 e^(−2u) / (−2)]_0^x = −e^(−2x) + e^0 = 1 − e^(−2x).

So overall,
FX(x) = 0 for x ≤ 0, and FX(x) = 1 − e^(−2x) for x > 0.

Calculating probabilities:
1. If you only need to calculate one probability P(a ≤ X ≤ b): integrate the p.d.f.:
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx.
2. If you will need to calculate several probabilities, it is easiest to find the distribution function first:
FX(x) = ∫_{−∞}^{x} fX(u) du.
Then use: P(a ≤ X ≤ b) = FX(b) − FX(a) for any a, b.

Endpoints: DO NOT MATTER for continuous random variables:
P(X ≤ a) = P(X < a) and P(X ≥ a) = P(X > a).

The p.d.f. is NOT a probability: fX(x) ≥ 0 always, but we do NOT require fX(x) ≤ 1.

Total area under the p.d.f. curve is 1: ∫_{−∞}^{∞} fX(x) dx = 1.

4.3 The Exponential distribution

When will the next volcano erupt in Auckland? We never quite answered this question in Chapter 3. The Poisson distribution was used to count the number of volcanoes that would occur in a fixed space of time. We have not said how long we have to wait for the next volcano: this is a continuous random variable.

Auckland Volcanoes
About 50 volcanic eruptions have occurred in Auckland over the last 100,000 years or so. The first two eruptions occurred in the Auckland Domain and Albert Park, right underneath us! The most recent, and biggest, eruption was Rangitoto, about 600 years ago. There have been about 20 eruptions in the last 20,000 years, which has led the Auckland Regional Council to assess current volcanic risk by assuming that volcanic eruptions in Auckland follow a Poisson process with rate λ = 1/1000 volcanoes per year. For background information, see: www.arc.govt.nz/environment/volcanoes-of-auckland/.

Distribution of the waiting time in the Poisson process
The length of time between events in the Poisson process is called the waiting time. To find the distribution of a continuous random variable, we often work with the cumulative distribution function, FX(x). This is because FX(x) = P(X ≤ x) gives us a probability, unlike the p.d.f. fX(x), and we are comfortable with handling and manipulating probabilities.

Suppose that {Nt : t > 0} forms a Poisson process with rate λ = 1/1000. Nt is the number of volcanoes to have occurred by time t, starting from now. We know that Nt ∼ Poisson(λt), so P(Nt = n) = ((λt)^n / n!) e^(−λt).

Let X be a continuous random variable giving the number of years waited before the next volcano, starting now. We will derive an expression for FX(x).

(i) When x < 0: FX(x) = P(X ≤ x) = P(less than 0 time before next volcano) = 0.

(ii) When x ≥ 0:
FX(x) = P(X ≤ x)
      = P(amount of time waited for next volcano is ≤ x)
      = P(there is at least one volcano between now and time x)
      = P(# volcanoes between now and time x is ≥ 1)
      = P(Nx ≥ 1)
      = 1 − P(Nx = 0)
      = 1 − ((λx)^0 / 0!) e^(−λx)
      = 1 − e^(−λx).

Overall: FX(x) = P(X ≤ x) = 1 − e^(−λx) for x ≥ 0, and 0 for x < 0.

The distribution of the waiting time X is called the Exponential distribution because of the exponential formula for FX(x).

Example: What is the probability that there will be a volcanic eruption in Auckland within the next 50 years?
Put λ = 1/1000. We need P(X ≤ 50):
P(X ≤ 50) = FX(50) = 1 − e^(−50/1000) = 0.049.
There is about a 5% chance that there will be a volcanic eruption in Auckland over the next 50 years. This is the figure given by the Auckland Regional Council at the above web link (under 'Future Hazards').
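This probability can be checked with R's built-in Exponential c.d.f. (an illustrative sketch, not part of the original notes):

    1 - exp(-50/1000)          # 0.0488
    pexp(50, rate = 1/1000)    # same answer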

The Exponential Distribution

We have defined the Exponential(λ) distribution to be the distribution of the waiting time (time between events) in a Poisson process with rate λ. We write X ∼ Exponential(λ), or X ∼ Exp(λ).

Distribution function: FX(x) = P(X ≤ x) = 1 − e^(−λx) for x ≥ 0, and 0 for x < 0.
Probability density function: fX(x) = F′X(x) = λ e^(−λx) for x ≥ 0, and 0 for x < 0.
Note: λ > 0 always.

Link with the Poisson process: Let {Nt : t > 0} be a Poisson process with rate λ. Then:
• Nt is the number of events to occur by time t;
• Nt ∼ Poisson(λt), so P(Nt = n) = ((λt)^n / n!) e^(−λt);
• define X to be either the time till the first event, or the time from now until the next event, or the time between any two events. Then X ∼ Exponential(λ). X is called the waiting time of the process.
However, the Exponential distribution has many other applications: it does not always have to arise from a Poisson process, just like the Poisson distribution.

Memorylessness

We have said that the waiting time of the Poisson process can be defined either as the time from the start to the first event, or the time from now until the next event, or the time between any two events. All of these quantities have the same distribution: X ∼ Exponential(λ). The derivation of the Exponential distribution was valid for all of them, because events occur at a constant average rate in the Poisson process.

This property of the Exponential distribution is called memorylessness:
• the distribution of the time from now until the first event is the same as the distribution of the time from the start until the first event: the time from the start till now has been forgotten!
[Figure: time from start to first event; this time forgotten; time from now to first event; START, NOW, FIRST EVENT.]

The Exponential distribution is famous for this memoryless property: it is the only memoryless distribution. For volcanoes, memorylessness means that the 600 years we have waited since Rangitoto erupted have counted for nothing. The chance that we still have 1000 years to wait for the next eruption is the same today as it was 600 years ago when Rangitoto erupted.

In general, memorylessness means that 'old is as good as new', or, put another way, 'new is as bad as old'! A memoryless light bulb is quite likely to fail almost immediately. The Exponential distribution is often used to model failure times of components: for example, X ∼ Exponential(λ) is the amount of time before a light bulb fails. Memorylessness is not always a desirable property: you don't want a memoryless waiting time for your bus! Memorylessness applies to any Poisson process.

given that we have already waited time t (say). This proves that Y is Exponential(λ) like X.156 For private reading: proof of memorylessness Let X ∼ Exponential(λ) be the total time waited for an event. = 1 − e−λy . we must condition on the event {X > t}. and Y is the time waited after time t. i. So Y ∼ Exponential(λ) as required. given that we have already waited time t. is the same as the probability of waiting time y in total. because we know that we have already waited time t. We wish to prove that Y has the same distribution as X. Proof: We will work with FY (y) and prove that it is equal to 1 − e−λy . that the time t already waited has been ‘forgotten’.e. Let Y be the amount of extra time waited for the event. The time t already waited is forgotten. First note that X = t+Y . . because X is the total time waited. So P(Y ≤ y) = P(X ≤ t + y | X > t). This means we need to prove that Y ∼ Exponential(λ). Also. FY (y) = P(Y ≤ y) = P(X ≤ t + y | X > t) = P(X ≤ t + y AND X > t) P(X > t) (deﬁnition of conditional probability) P(t < X ≤ t + y) 1 − P(X ≤ t) FX (t + y) − FX (t) 1 − FX (t) (1 − e−λ(t+y) ) − (1 − e−λt ) 1 − (1 − e−λt ) = = = e−λt − e−λ(t+y) = e−λt e−λt (1 − e−λy ) = e−λt Thus the conditional probability of waiting time y extra.

4.4 Likelihood and estimation for continuous random variables

• For discrete random variables, we found the likelihood using the probability function, fX(x) = P(X = x).
• For continuous random variables, we find the likelihood using the probability density function, fX(x) = dFX/dx.
• Although the notation fX(x) means something different for continuous and discrete random variables, it is used in exactly the same way for likelihood and estimation.

Note: Both discrete and continuous r.v.s have the same definition for the cumulative distribution function: FX(x) = P(X ≤ x).

Example: Exponential likelihood. Suppose that:
• X ∼ Exponential(λ);
• λ is unknown;
• the observed value of X is x.
Then the likelihood function is L(λ; x) = fX(x) = λ e^(−λx) for 0 < λ < ∞.
We estimate λ by setting dL/dλ = 0 to find the MLE.

Two or more independent observations
Suppose that X1, ..., Xn are continuous random variables such that:
• X1, ..., Xn are INDEPENDENT;
• all the Xi's have the same p.d.f., fX(x);
then the likelihood is fX(x1) fX(x2) ... fX(xn).

Example: Suppose that X1, X2, ..., Xn are independent, and Xi ∼ Exponential(λ) for all i. Find the maximum likelihood estimate of λ.

Solution:
L(λ; x1, ..., xn) = Π_{i=1}^{n} fX(xi) = Π_{i=1}^{n} λ e^(−λ xi) = λ^n e^(−λ Σ xi) for 0 < λ < ∞.
Define x̄ = (1/n) Σ_{i=1}^{n} xi to be the sample mean of x1, ..., xn, so Σ_{i=1}^{n} xi = n x̄. Thus
L(λ; x1, ..., xn) = λ^n e^(−λ n x̄) for 0 < λ < ∞.
[Figure: likelihood graph shown for λ = 2 and n = 10, with x1, ..., x10 generated by the R command rexp(10, 2).]

Solve dL/dλ = 0 to find the MLE of λ:
dL/dλ = n λ^(n−1) e^(−λ n x̄) − λ^n × n x̄ × e^(−λ n x̄) = 0
n λ^(n−1) e^(−λ n x̄) (1 − λ x̄) = 0
⇒ λ = 0, λ = ∞, or λ = 1/x̄.
The MLE of λ is λ̂ = 1/x̄.
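A short simulation in the spirit of the figure above (an illustrative R sketch):

    set.seed(1)
    x <- rexp(10, rate = 2)   # n = 10 waiting times with true rate lambda = 2
    1 / mean(x)               # the MLE 1/xbar; close to 2, but varies from sample to sample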

4.5 Hypothesis tests

Hypothesis tests for continuous random variables are just like hypothesis tests for discrete random variables. The only difference is:
• endpoints matter for discrete random variables, but not for continuous random variables.

Example: discrete. Suppose H0: X ∼ Binomial(n = 10, p = 0.5), and we have observed the value x = 7. Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 6) = 1 − FX(6).
Example: continuous. Suppose H0: X ∼ Exponential(2), and we have observed the value x = 7. Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 7) = 1 − FX(7).

Other than this trap, the procedure for hypothesis testing is the same:
• Use H0 to specify the distribution of X completely, and offer a one-tailed or two-tailed alternative hypothesis H1.
• Make observation x.
• Find the one-tailed or two-tailed p-value as the probability of seeing an observation at least as weird as what we have seen. That is, find the probability under the distribution specified by H0 of seeing an observation further out in the tails than the value x that we have seen.

Example with the Exponential distribution
A very very old person observes that the waiting time from Rangitoto to the next volcanic eruption in Auckland is 1500 years. Test the hypothesis that λ = 1/1000 against the one-sided alternative that λ < 1/1000.
Note: If λ < 1/1000, we would expect to see BIGGER values of X, NOT smaller. This is because X is the time between volcanoes, and λ is the rate at which volcanoes occur. A smaller value of λ means volcanoes occur less often, so the time X between them is BIGGER.

Hypotheses: Let X ∼ Exponential(λ).
H0: λ = 1/1000;
H1: λ < 1/1000 (one-tailed test).
Observation: x = 1500 years.
Values weirder than x = 1500 years: all values BIGGER than x = 1500.
p-value: P(X ≥ 1500) when X ∼ Exponential(λ = 1/1000).
So p-value = P(X ≥ 1500) = 1 − P(X ≤ 1500) = 1 − FX(1500) = 1 − (1 − e^(−1500/1000)) = 0.223.
R command: 1-pexp(1500, 1/1000).
[Figure: p.d.f. of the Exponential(1/1000) distribution with the tail area beyond 1500 shaded.]

Interpretation: There is no evidence against H0. The observation x = 1500 years is consistent with the hypothesis that λ = 1/1000, i.e. that volcanoes erupt once every 1000 years on average.

4.6 Expectation and variance

Remember the expectation of a discrete random variable is the long-term average:
µX = E(X) = Σx x P(X = x) = Σx x fX(x).
(For each value x, we add in the value and multiply by the proportion of times we would expect to see that value: P(X = x).)

For a continuous random variable, replace the probability function with the probability density function, and replace Σx by ∫_{−∞}^{∞}:
µX = E(X) = ∫_{−∞}^{∞} x fX(x) dx,
where fX(x) = F′X(x) is the probability density function.

Note: There exists no concept of a 'probability function' fX(x) = P(X = x) for continuous random variables. In fact, if X is continuous, then P(X = x) = 0 for all x.

The idea behind expectation is the same for both discrete and continuous random variables. E(X) is:
• the long-term average of X;
• a 'sum' of values multiplied by how common they are: Σ x f(x) or ∫ x f(x) dx.
Expectation is also the balance point of fX(x) for both continuous and discrete X. Imagine fX(x) cut out of cardboard and balanced on a pencil.

Discrete:   E(X) = Σx x fX(x);   E(g(X)) = Σx g(x) fX(x);   fX(x) = P(X = x).
Continuous: E(X) = ∫_{−∞}^{∞} x fX(x) dx;   E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx;   fX(x) = F′X(x) (p.d.f.).
In both cases: transform the values, and leave the probabilities (or the probability density) alone.

Variance
If X is continuous, its variance is defined in exactly the same way as for a discrete random variable:
Var(X) = σX² = E[(X − µX)²] = E(X²) − µX² = E(X²) − (EX)².
For a continuous random variable, we can either compute the variance using
Var(X) = E[(X − µX)²] = ∫_{−∞}^{∞} (x − µX)² fX(x) dx,
or
Var(X) = E(X²) − (EX)² = ∫_{−∞}^{∞} x² fX(x) dx − (EX)².
The second expression is usually easier (although not always).

Properties of expectation and variance

All properties of expectation and variance are exactly the same for continuous and discrete random variables. For any random variables X, Y, and X1, ..., Xn, continuous or discrete, and for constants a and b:
• E(aX + b) = aE(X) + b;
• E(a g(X) + b) = aE(g(X)) + b;
• E(X + Y) = E(X) + E(Y);
• E(X1 + ... + Xn) = E(X1) + ... + E(Xn);
• Var(aX + b) = a² Var(X);
• Var(a g(X) + b) = a² Var(g(X)).
The following statements are generally true only when X and Y are INDEPENDENT:
• E(XY) = E(X)E(Y) when X, Y independent;
• Var(X + Y) = Var(X) + Var(Y) when X, Y independent.

4.7 Exponential distribution mean and variance

When X ∼ Exponential(λ), then:
E(X) = 1/λ,   Var(X) = 1/λ².
Note: If X is the waiting time for a Poisson process with rate λ events per year (say), it makes sense that E(X) = 1/λ. For example, if λ = 4 events per hour, the average time waited between events is 1/4 hour.

Proof: E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^∞ x λ e^(−λx) dx.

Integration by parts: recall that ∫ u (dv/dx) dx = uv − ∫ v (du/dx) dx.
Let u = x and dv/dx = λ e^(−λx). Then du/dx = 1 and v = −e^(−λx). So
E(X) = ∫_0^∞ x λ e^(−λx) dx = [uv]_0^∞ − ∫_0^∞ v (du/dx) dx
     = [−x e^(−λx)]_0^∞ − ∫_0^∞ (−e^(−λx)) dx
     = 0 + [(−1/λ) e^(−λx)]_0^∞
     = (−1/λ) × 0 − (−1/λ) × e^0 = 1/λ.
∴ E(X) = 1/λ.

Variance: Var(X) = E(X²) − (EX)² = E(X²) − 1/λ².
Now E(X²) = ∫_{−∞}^{∞} x² fX(x) dx = ∫_0^∞ x² λ e^(−λx) dx.
Let u = x² and dv/dx = λ e^(−λx). Then du/dx = 2x and v = −e^(−λx), so
E(X²) = [uv]_0^∞ − ∫_0^∞ v (du/dx) dx = [−x² e^(−λx)]_0^∞ + ∫_0^∞ 2x e^(−λx) dx
      = 0 + (2/λ) ∫_0^∞ λ x e^(−λx) dx = (2/λ) × E(X) = 2/λ².
So Var(X) = E(X²) − (EX)² = 2/λ² − 1/λ² = 1/λ².

Interlude: Guess the Mean, Median, and Variance

For any distribution:
• the mean is the average that would be obtained if a large number of observations were drawn from the distribution;
• the median is the half-way point of the distribution: every observation has a 50-50 chance of being above the median or below the median;
• the variance is the average squared distance of an observation from the mean.

Given the probability density function of a distribution, we should be able to guess roughly the distribution mean, median, and variance... but it isn't easy! Have a go at the examples below. As a hint:
• the mean is the balance-point of the distribution. Imagine that the p.d.f. is made of cardboard and balanced on a rod; the mean is the point where the rod would have to be placed for the cardboard to balance;
• the median is the half-way point, so it divides the p.d.f. into two equal areas of 0.5 each;
• the variance is the average squared distance of observations from the mean, so to get a rough guess (not exact), it is easiest to guess an average distance from the mean and square it.

[Figure: a p.d.f. f(x) plotted for x from 0 to about 300. Guess the mean, median, and variance. (Answers overleaf.)]

Answers: [Figure: the same p.d.f. with the median (54.6), mean (90.0), and variance = (118)² = 13924 marked.]

Notes: The mean is larger than the median. This always happens when the distribution has a long right tail (positive skew) like this one. The variance is huge... but when you look at the numbers along the horizontal axis, it is quite believable that the average squared distance of an observation from the mean is 118². Out of interest, the distribution shown is a Lognormal distribution.

Example 2: Try the same again with the example below. Answers are written below the graph.
[Figure: a p.d.f. f(x) plotted for x from 0 to 5.]
Answers: Median = 0.693; Mean = 1.0; Variance = 1.0.

4.8 The Uniform distribution

X has a Uniform distribution on the interval [a, b] if X is equally likely to fall anywhere in the interval [a, b].

We write X ∼ Uniform[a, b], or X ∼ U[a, b]. Equivalently, X ∼ Uniform(a, b), or X ∼ U(a, b).

Probability density function, fX(x)

If X ∼ U[a, b], then

fX(x) = 1/(b − a)   if a ≤ x ≤ b,
        0           otherwise.

[Figure: fX(x) is flat at height 1/(b − a) between a and b.]

Distribution function, FX(x)

FX(x) = ∫_{−∞}^{x} fY(y) dy = ∫_a^x 1/(b − a) dy = [y/(b − a)]_a^x = (x − a)/(b − a)   if a ≤ x ≤ b.

Thus

FX(x) = 0                 if x < a,
        (x − a)/(b − a)   if a ≤ x ≤ b,
        1                 if x > b.

[Figure: FX(x) rises linearly from 0 at x = a to 1 at x = b.]

Mean and variance: If X ∼ Uniform[a, b],

E(X) = (a + b)/2        Var(X) = (b − a)²/12.

Proof:

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_a^b x · 1/(b − a) dx
     = (1/(b − a)) [x²/2]_a^b
     = (1/(b − a)) · (b² − a²)/2
     = (1/(b − a)) · (b − a)(b + a)/2
     = (a + b)/2.

Var(X) = E[(X − µX)²] = ∫_a^b (x − µX)² · 1/(b − a) dx
       = (1/(b − a)) [(x − µX)³/3]_a^b
       = [(b − µX)³ − (a − µX)³] / (3(b − a)).

But µX = EX = (a + b)/2, so b − µX = (b − a)/2 and a − µX = (a − b)/2. So

Var(X) = (1/(b − a)) · [(b − a)³ − (a − b)³] / (2³ × 3)
       = [(b − a)³ + (b − a)³] / ((b − a) × 24)
       = (b − a)²/12.

Example: let X ∼ Uniform[0, 1]. Then

fX(x) = 1 if 0 ≤ x ≤ 1, 0 otherwise.

µX = E(X) = (0 + 1)/2 = 1/2   (half-way through the interval [0, 1]).

σ²X = Var(X) = (1 − 0)²/12 = 1/12.
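As an illustrative check (the end-points a = 2 and b = 8 below are arbitrary choices, not from the notes), the Uniform mean and variance formulas can be verified by simulation in R:

    # Simulation check of the Uniform[a, b] mean and variance
    set.seed(1)
    a <- 2; b <- 8
    x <- runif(100000, min = a, max = b)
    mean(x)                      # close to (a + b)/2 = 5
    var(x)                       # close to (b - a)^2/12 = 3
    punif(5, min = a, max = b)   # F_X(5) = (5 - a)/(b - a) = 0.5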

4.9 The Change of Variable Technique: finding the distribution of g(X)

Let X be a continuous random variable. Suppose

• the p.d.f. of X, fX(x), is known;
• the r.v. Y is defined as Y = g(X) for some function g;
• we wish to find the p.d.f. of Y, fY(y).

We use the Change of Variable technique.

Example: Let X ∼ Uniform(0, 1), and let Y = −log(X). The p.d.f. of X is fX(x) = 1 for 0 < x < 1. What is the p.d.f. of Y, fY(y)?

Change of variable technique for monotone functions

Suppose that g(X) is a monotone function R → R. This means that g is an increasing function, or g is a decreasing function. When g is monotone, it is invertible, or (1–1) ('one-to-one'). That is, for every y there is a unique x such that g(x) = y. This means that the inverse function, g⁻¹(y), is well-defined as a function for a certain range of y. (When g : R → R, g can only be (1–1) if it is monotone.)

[Figure: y = g(x) = x² and its inverse x = g⁻¹(y) = √y, plotted for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1.]

Change of Variable formula

Let g : R → R be a monotone function and let Y = g(X). Then the p.d.f. of Y = g(X) is

fY(y) = fX(g⁻¹(y)) |d/dy g⁻¹(y)|.

Easy way to remember

Write   y = y(x) (= g(x)),   x = x(y) (= g⁻¹(y)).

Then   fY(y) = fX(x(y)) |dx/dy|.

Working for change of variable questions

1) Show you have checked that g(x) is monotone over the required range.
2) Write y = y(x) for x in <range of x>.
3) So x = x(y) for y in <range of y>: for y(a) < y(x) < y(b) if y is increasing; for y(a) > y(x) > y(b) if y is decreasing.
4) Then |dx/dy| = <expression involving y>.
5) So fY(y) = fX(x(y)) |dx/dy| by the Change of Variable formula. Refer back to the question to find fX(x): you often have to deduce this from information like X ∼ Uniform(0, 1) or X ∼ Exponential(λ), or it may be given explicitly.

Quote the range of values of y as part of the FINAL answer.

Note: There should be no x's left in the answer! x(y) and |dx/dy| are expressions involving y only.

Example 1: Let X ∼ Uniform(0, 1), and let Y = −log(X). Find the p.d.f. of Y.

[Figure: y = −log(x) plotted for 0 < x < 1.]

1) y(x) = −log(x) is monotone decreasing, so we can apply the Change of Variable formula.
2) Let y = y(x) = −log x for 0 < x < 1.
3) Then x = x(y) = e^{−y} for −log(0) > y > −log(1), i.e. for 0 < y < ∞.
4) |dx/dy| = |d/dy (e^{−y})| = |−e^{−y}| = e^{−y} for 0 < y < ∞.
5) So fY(y) = fX(x(y)) |dx/dy| = fX(e^{−y}) e^{−y} for 0 < y < ∞.

But X ∼ Uniform(0, 1), so fX(x) = 1 for 0 < x < 1, so fX(e^{−y}) = 1 for 0 < y < ∞.

Thus fY(y) = fX(e^{−y}) e^{−y} = e^{−y} for 0 < y < ∞. So Y ∼ Exponential(1).

Note: In change of variable questions, you lose a mark for:

1. not stating that g(x) is monotone over the required range of x;
2. not giving the range of y for which the result holds, as part of the final answer (e.g. fY(y) = ... for 0 < y < ∞).
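Returning to Example 1, the result Y = −log(X) ∼ Exponential(1) is easy to check by simulation in R. This is an illustrative sketch only, not part of the derivation:

    # Y = -log(X) with X ~ Uniform(0, 1) should follow Exponential(1)
    set.seed(1)
    y <- -log(runif(100000))
    hist(y, breaks = 100, freq = FALSE, main = "Y = -log(X)")
    curve(dexp(x, rate = 1), from = 0, to = 10, add = TRUE, lwd = 2)

The histogram of simulated values of Y should sit underneath the Exponential(1) density curve.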

Example 2: Let X be a continuous random variable with p.d.f.

fX(x) = (1/4) x³   for 0 < x < 2,
        0          otherwise.

Let Y = 1/X. Find the probability density function of Y, fY(y).

The function y(x) = 1/x is monotone decreasing for 0 < x < 2, so we can apply the Change of Variable formula.

Let y = y(x) = 1/x for 0 < x < 2.
Then x = x(y) = 1/y for 1/2 < y < ∞.
|dx/dy| = |−y⁻²| = 1/y² for 1/2 < y < ∞.

Change of variable formula:

fY(y) = fX(x(y)) |dx/dy|
      = (1/4)(x(y))³ |dx/dy|
      = (1/4) × (1/y³) × (1/y²)
      = 1/(4y⁵)   for 1/2 < y < ∞.

Thus

fY(y) = 1/(4y⁵)   for 1/2 < y < ∞,
        0         otherwise.

For mathematicians: proof of the change of variable formula

Separate into cases where g is increasing and where g is decreasing.

i) g increasing

g is increasing if u < w ⇔ g(u) < g(w).   (⋆)

Note that putting u = g⁻¹(x) and w = g⁻¹(y), we obtain

g⁻¹(x) < g⁻¹(y) ⇔ g(g⁻¹(x)) < g(g⁻¹(y)) ⇔ x < y,

so g⁻¹ is also an increasing function.

Now

FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y))   (put u = X, w = g⁻¹(y) in (⋆))
      = FX(g⁻¹(y)).

Thus the p.d.f. of Y is

fY(y) = d/dy FY(y) = d/dy FX(g⁻¹(y)) = F'X(g⁻¹(y)) · d/dy (g⁻¹(y))   (Chain Rule)
      = fX(g⁻¹(y)) · d/dy (g⁻¹(y)).

Now g is increasing, so g⁻¹ is also increasing (by the argument above), so d/dy (g⁻¹(y)) > 0, and thus d/dy (g⁻¹(y)) = |d/dy (g⁻¹(y))|, so

fY(y) = fX(g⁻¹(y)) |d/dy (g⁻¹(y))|   as required.

ii) g decreasing, i.e. u > w ⇔ g(u) < g(w).   (⊛)

(Putting u = g⁻¹(x) and w = g⁻¹(y) in (⊛) gives g⁻¹(x) > g⁻¹(y) ⇔ x < y, so g⁻¹ is also decreasing.)

FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g⁻¹(y))   (put u = X, w = g⁻¹(y))
      = 1 − FX(g⁻¹(y)).

So the p.d.f. of Y is

fY(y) = d/dy [1 − FX(g⁻¹(y))] = −fX(g⁻¹(y)) · d/dy (g⁻¹(y)).

This time, g is decreasing, so g⁻¹ is also decreasing, and thus

d/dy (g⁻¹(y)) < 0,   so   −d/dy (g⁻¹(y)) = |d/dy (g⁻¹(y))|.

So once again,

fY(y) = fX(g⁻¹(y)) |d/dy (g⁻¹(y))|.

4.10 Change of variable for non-monotone functions: non-examinable

Suppose that Y = g(X) and g is not monotone. We wish to find the p.d.f. of Y. We can sometimes do this by using the distribution function directly.

Example: Let X have any distribution, with distribution function FX(x). Let Y = X². Find the p.d.f. of Y.

Clearly, Y ≥ 0, so FY(y) = 0 if y < 0.

For y ≥ 0:

FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = FX(√y) − FX(−√y).

[Figure: the event {X² ≤ y} corresponds to {−√y ≤ X ≤ √y}.]

So

FY(y) = 0                   if y < 0,
        FX(√y) − FX(−√y)    if y ≥ 0.

175 So the p. 1). of X is: 1 2 fX (x) = √ e−x /2. The p.d. of Y is fY (y) = d √ √ d d FY = (FX ( y)) − (FX (− y)) dy dy dy 1 1 √ ′ √ ′ = 1 y − 2 FX ( y) + 1 y − 2 FX (− y) 2 2 = 1 √ √ √ fX ( y) + fX (− y) 2 y for y ≥ 0. of Y = X 2 . This is the familiar bell-shaped distribution (see later). The Chi-squared distribution is a special case of the Gamma distribution (see next section). . fY (y) = 1 1 √ · √ (e−y/2 + e−y/2) 2 y 2π 1 = √ y −1/2e−y/2 for y ≥ 0.d. then Y = X 2 ∼Chi-squared(df=1). Y = X 2 has p.f. 2π This is in fact the Chi-squared distribution with ν = 1 degrees of freedom. 2π Find the p.d. ∴ 1 √ √ fY (y) = √ fX ( y) + fX (− y) for y ≥ 0. 1). whenever Y = X 2 .f.f.d.f. 2 y Example: Let X ∼ Normal(0. This example has shown that if X ∼ Normal(0. By the result above.

4.11 The Gamma distribution

The Gamma(k, λ) distribution is a very flexible family of distributions.

It is defined as the sum of k independent Exponential r.v.s: if X1, ..., Xk ∼ Exponential(λ) and X1, ..., Xk are independent, then

X1 + X2 + ... + Xk ∼ Gamma(k, λ).

Special Case: When k = 1, Gamma(1, λ) = Exponential(λ) (the sum of a single Exponential r.v.).

Probability density function, fX(x)

For X ∼ Gamma(k, λ),

fX(x) = (λ^k / Γ(k)) x^{k−1} e^{−λx}   if x ≥ 0,
        0                              otherwise.

Here, Γ(k), called the Gamma function of k, is a constant that ensures fX(x) integrates to 1, i.e. ∫_0^∞ fX(x) dx = 1. It is defined as

Γ(k) = ∫_0^∞ y^{k−1} e^{−y} dy.

When k is an integer, Γ(k) = (k − 1)!

Mean and variance of the Gamma distribution:

For X ∼ Gamma(k, λ),   E(X) = k/λ   and   Var(X) = k/λ².

Relationship with the Chi-squared distribution

The Chi-squared distribution with ν degrees of freedom, χ²_ν, is a special case of the Gamma distribution:

χ²_ν = Gamma(k = ν/2, λ = 1/2).

So if Y ∼ χ²_ν, then E(Y) = k/λ = ν, and Var(Y) = k/λ² = 2ν.
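The defining property above (a sum of k independent Exponential(λ) variables has the Gamma(k, λ) distribution) can be checked by simulation. The values k = 5 and λ = 2 below are arbitrary choices for illustration:

    # Sum of k independent Exponential(lambda) r.v.s compared with Gamma(k, lambda)
    set.seed(1)
    k <- 5; lambda <- 2
    sums <- replicate(50000, sum(rexp(k, rate = lambda)))
    hist(sums, breaks = 100, freq = FALSE, main = "Sum of 5 Exponential(2) r.v.s")
    curve(dgamma(x, shape = k, rate = lambda), from = 0, to = 10, add = TRUE, lwd = 2)
    mean(sums)    # close to k/lambda = 2.5
    var(sums)     # close to k/lambda^2 = 1.25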

Gamma p.d.f.s

[Figure: Gamma p.d.f.s for k = 1, k = 2, and k = 5.]

Notice: right skew (long right tail); flexibility in shape controlled by the 2 parameters, k and λ.

Distribution function, FX(x)

There is no closed form for the distribution function of the Gamma distribution. If X ∼ Gamma(k, λ), then FX(x) can only be calculated by computer.

λ ∞ 0 (letting y = λx.178 Proof that E(X) = ∞ 0 k λ and Var(X) = ∞ 0 k λ2 (non-examinable) EX = xfX (x) dx = = = = = = λk xk−1 −λx x· e dx Γ(k) Γ(k) ∞ k −λx dx 0 (λx) e ∞ k −y 1 0 y e ( λ ) dy Γ(k) 1 Γ(k + 1) · λ Γ(k) 1 k Γ(k) · λ Γ(k) k . Var(X) = E(X ) − (EX) 2 2 = = 0 k2 x fX (x) dx − 2 λ 2 ∞ x2λk xk−1e−λx k2 dx − 2 Γ(k) λ Γ(k) ∞ k+1 −y e 0 y = ∞ 1 k+1 −λx e dx 0 ( λ )(λx) k2 − 2 λ k2 λ2 where y = λx. 1 dx = dy λ 1 = 2· λ dy Γ(k) − 1 Γ(k + 2) k 2 − 2 = 2· λ Γ(k) λ 1 (k + 1)k Γ(k) k 2 = 2 − 2 λ Γ(k) λ = k . dx dy 1 = λ) (property of the Gamma function). λ2 .

Gamma distribution arising from the Poisson process

Recall that the waiting time between events in a Poisson process with rate λ has the Exponential(λ) distribution. That is, if Xi = time waited between event i − 1 and event i, then Xi ∼ Exp(λ).

The time waited from time 0 to the time of the kth event is X1 + X2 + ... + Xk, the sum of k independent Exponential(λ) r.v.s.

Thus the time waited until the kth event in a Poisson process with rate λ has the Gamma(k, λ) distribution.

Note: There are some similarities between the Exponential(λ) distribution and the (discrete) Geometric(p) distribution. Both distributions describe the 'waiting time' before an event. In the same way, the Gamma(k, λ) distribution is similar to the (discrete) Negative Binomial(k, p) distribution, as they both describe the 'waiting time' before the kth event.

4.12 The Beta Distribution: non-examinable

The Beta distribution has two parameters, α and β. We write X ∼ Beta(α, β).

P.d.f.:

f(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1}   for 0 < x < 1,
       0                                   otherwise.

The function B(α, β) is the Beta function and is defined by the integral

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx,   for α > 0, β > 0.

It can be shown that

B(α, β) = Γ(α)Γ(β) / Γ(α + β).

Chapter 5: The Normal Distribution and the Central Limit Theorem

The Normal distribution is the familiar bell-shaped distribution. It is probably the most important distribution in statistics, mainly because of its link with the Central Limit Theorem, which states that any large sum of independent, identically distributed random variables is approximately Normal:

X1 + X2 + ... + Xn ∼ approx Normal   if X1, ..., Xn are i.i.d. and n is large.

Before studying the Central Limit Theorem, we look at the Normal distribution and some of its general properties.

5.1 The Normal Distribution

The Normal distribution has two parameters, the mean, µ, and the variance, σ².

µ and σ² satisfy −∞ < µ < ∞, σ² > 0.

We write X ∼ Normal(µ, σ²), or X ∼ N(µ, σ²).

Probability density function, fX(x)

fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}   for −∞ < x < ∞.

Distribution function, FX(x)

There is no closed form for the distribution function of the Normal distribution. If X ∼ Normal(µ, σ²), then FX(x) can only be calculated by computer.

R command: FX(x) = pnorm(x, mean=µ, sd=sqrt(σ²)).
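For example (an illustration with invented numbers, not from the notes): if X ∼ Normal(µ = 70, σ² = 16), then P(65 < X < 75) can be computed with pnorm as above:

    # P(65 < X < 75) when X ~ Normal(mean 70, variance 16)
    mu <- 70; sigma2 <- 16
    pnorm(75, mean = mu, sd = sqrt(sigma2)) - pnorm(65, mean = mu, sd = sqrt(sigma2))
    # approximately 0.789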

[Figure: the Normal probability density function, fX(x), and distribution function, FX(x).]

Mean and Variance

For X ∼ Normal(µ, σ²),   E(X) = µ,   Var(X) = σ².

Linear transformations

If X ∼ Normal(µ, σ²), then for any constants a and b,

aX + b ∼ Normal(aµ + b, a²σ²).

In particular,

X ∼ Normal(µ, σ²)   ⇒   (X − µ)/σ ∼ Normal(0, 1).

Z = (X − µ)/σ ∼ Normal(0, 1) is called the standard Normal random variable.

Proof: Let a = 1/σ and b = −µ/σ. Let Z = aX + b = (X − µ)/σ. Then

Z ∼ Normal(aµ + b, a²σ²) = Normal(µ/σ − µ/σ, σ²/σ²) = Normal(0, 1).

General proof that aX + b ∼ Normal(aµ + b, a²σ²):

Let X ∼ Normal(µ, σ²), and let Y = aX + b. We wish to find the distribution of Y. Use the change of variable technique.

1) y(x) = ax + b is monotone, so we can apply the Change of Variable technique.
2) Let y = y(x) = ax + b for −∞ < x < ∞.
3) Then x = x(y) = (y − b)/a for −∞ < y < ∞.
4) |dx/dy| = |1/a| = 1/|a|.
5) So fY(y) = fX(x(y)) |dx/dy| = fX((y − b)/a) · 1/|a|   for −∞ < y < ∞.   (⋆)

But X ∼ Normal(µ, σ²), so fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/2σ²}.

Thus

fX((y − b)/a) = (1/√(2πσ²)) e^{−((y−b)/a − µ)²/2σ²} = (1/√(2πσ²)) e^{−(y−(aµ+b))²/2a²σ²}.

Returning to (⋆),

fY(y) = fX((y − b)/a) · 1/|a| = (1/√(2πa²σ²)) e^{−(y−(aµ+b))²/2a²σ²}   for −∞ < y < ∞.

But this is the p.d.f. of a Normal(aµ + b, a²σ²) random variable.

So, if X ∼ Normal(µ, σ²), then aX + b ∼ Normal(aµ + b, a²σ²).

Sums of Normal random variables

If X and Y are independent, and X ∼ Normal(µ1, σ1²), Y ∼ Normal(µ2, σ2²), then

X + Y ∼ Normal(µ1 + µ2, σ1² + σ2²).

More generally, if X1, X2, ..., Xn are independent, and Xi ∼ Normal(µi, σi²) for i = 1, ..., n, then

a1X1 + a2X2 + ... + anXn ∼ Normal(a1µ1 + ... + anµn, a1²σ1² + ... + an²σn²).

For mathematicians: properties of the Normal distribution

1. Proof that ∫_{−∞}^{∞} fX(x) dx = 1.

The full proof that

∫_{−∞}^{∞} fX(x) dx = ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx = 1

relies on the following result:

FACT: ∫_{−∞}^{∞} e^{−y²} dy = √π.

This result is non-trivial to prove. See Calculus courses for details. Using this result, the proof that ∫_{−∞}^{∞} fX(x) dx = 1 follows by using the change of variable y = (x − µ)/(√2 σ) in the integral.

2. Proof that E(X) = µ.

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_{−∞}^{∞} x (1/√(2πσ²)) e^{−(x−µ)²/2σ²} dx.

Change variable of integration: let z = (x − µ)/σ: then x = σz + µ and dx/dz = σ. Thus

E(X) = ∫_{−∞}^{∞} (σz + µ) · (1/√(2πσ²)) · e^{−z²/2} · σ dz
     = ∫_{−∞}^{∞} (σz/√(2π)) e^{−z²/2} dz + µ ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

The first integrand is an odd function of z (i.e. g(−z) = −g(z)), so it integrates to 0 over the range −∞ to ∞. The second integral is the p.d.f. of N(0, 1), so it integrates to 1.

Thus E(X) = 0 + µ × 1 = µ.

3. Proof that Var(X) = σ².

Var(X) = E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx
       = σ² ∫_{−∞}^{∞} (1/√(2π)) z² e^{−z²/2} dz                    (putting z = (x − µ)/σ)
       = σ² { [(1/√(2π))(−z e^{−z²/2})]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz }   (integration by parts)
       = σ² {0 + 1}
       = σ².

5.2 The Central Limit Theorem (CLT)

also known as ... the Piece of Cake Theorem.

The Central Limit Theorem (CLT) is one of the most fundamental results in statistics. In its simplest form, it states that if a large number of independent random variables are drawn from any distribution, then the distribution of their sum (or alternatively their sample average) always converges to the Normal distribution.

Theorem (The Central Limit Theorem): Let X1, ..., Xn be independent r.v.s with mean µ and variance σ², from ANY distribution.

For example, Xi ∼ Binomial(n, p) for each i, so µ = np and σ² = np(1 − p).

Then the sum Sn = X1 + ... + Xn = Σ_{i=1}^n Xi has a distribution that tends to Normal as n → ∞.

The mean of the Normal distribution is E(Sn) = Σ_{i=1}^n E(Xi) = nµ.

The variance of the Normal distribution is

Var(Sn) = Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) = nσ²,   because X1, ..., Xn are independent.

So Sn = X1 + X2 + ... + Xn → Normal(nµ, nσ²) as n → ∞.

Notes:

1. This is a remarkable theorem, because the limit holds for any distribution of X1, ..., Xn.

2. A sufficient condition on X for the Central Limit Theorem to apply is that Var(X) is finite. Other versions of the Central Limit Theorem relax the conditions that X1, ..., Xn are independent and have the same distribution.

3. The speed of convergence of Sn to the Normal distribution depends upon the distribution of X. Skewed distributions converge more slowly than symmetric Normal-like distributions. It is usually safe to assume that the Central Limit Theorem applies whenever n ≥ 30. It might apply for as little as n = 4.

The Central Limit Theorem in action: simulation studies

The following simulation study illustrates the Central Limit Theorem, making use of several of the techniques learnt in STATS 210. We will look particularly at how fast the distribution of Sn converges to the Normal distribution.

Example 1: Triangular distribution: fX(x) = 2x for 0 < x < 1.

[Figure: the triangular p.d.f. f(x) = 2x on (0, 1).]

Find E(X) and Var(X):

µ = E(X) = ∫_0^1 x fX(x) dx = ∫_0^1 2x² dx = [2x³/3]_0^1 = 2/3.

σ² = Var(X) = E(X²) − {E(X)}²
            = ∫_0^1 x² fX(x) dx − (2/3)²
            = ∫_0^1 2x³ dx − 4/9
            = [2x⁴/4]_0^1 − 4/9
            = 1/2 − 4/9
            = 1/18.

Let Sn = X1 + ... + Xn, where X1, ..., Xn are independent.

Then E(Sn) = E(X1 + ... + Xn) = nµ = 2n/3,

and Var(Sn) = Var(X1 + ... + Xn) = nσ² = n/18, by independence.

So Sn ∼ approx Normal(2n/3, n/18) for large n, by the Central Limit Theorem.

The graph shows histograms of 10 000 values of Sn = X1 + ... + Xn for n = 1, 2, 3, and 10. The Normal p.d.f., Normal(nµ, nσ²) = Normal(2n/3, n/18), is superimposed across the top. Even for n as low as 10, the Normal curve is a very good approximation.

[Figure: histograms of Sn for n = 1, 2, 3, and 10, each with the approximating Normal curve superimposed.]

Example 2: U-shaped distribution: fX(x) = (3/2)x² for −1 < x < 1.

[Figure: the U-shaped p.d.f. on (−1, 1).]

We find that E(X) = µ = 0 and Var(X) = σ² = 3/5. (Exercise)

Let Sn = X1 + ... + Xn, where X1, ..., Xn are independent.

Then E(Sn) = E(X1 + ... + Xn) = nµ = 0,

and Var(Sn) = Var(X1 + ... + Xn) = nσ² = 3n/5, by independence.

So Sn ∼ approx Normal(0, 3n/5) for large n, by the Central Limit Theorem.

Even with this highly non-Normal distribution for X, the Normal curve provides a good approximation to Sn = X1 + ... + Xn for n as small as 10.
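Simulation studies like these are straightforward to reproduce. The sketch below is one possible implementation for Example 1 (it is not the original course code): it uses the change of variable fact that if U ∼ Uniform(0, 1) then √U has p.d.f. 2x on (0, 1), simulates 10 000 values of Sn for each n, and overlays the approximating Normal curve.

    # CLT simulation: S_n for the triangular density f(x) = 2x on (0, 1)
    set.seed(1)
    par(mfrow = c(1, 4))
    for (n in c(1, 2, 3, 10)) {
      Sn <- replicate(10000, sum(sqrt(runif(n))))   # sqrt(U) has p.d.f. 2x on (0, 1)
      hist(Sn, breaks = 50, freq = FALSE, main = paste("n =", n))
      curve(dnorm(x, mean = 2*n/3, sd = sqrt(n/18)), add = TRUE, lwd = 2)
    }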

[Figure: histograms of Sn for the U-shaped distribution, n = 1, 2, 3, and 10, each with the approximating Normal curve superimposed.]

Normal approximation to the Binomial distribution, using the CLT

Let Y ∼ Binomial(n, p). We can think of Y as the sum of n Bernoulli random variables:

Y = X1 + X2 + ... + Xn,   where Xi = 1 if trial i is a "success" (prob = p), 0 otherwise (prob = 1 − p).

So Y = X1 + ... + Xn, and each Xi has µ = E(Xi) = p and σ² = Var(Xi) = p(1 − p). Thus by the CLT,

Y = X1 + X2 + ... + Xn → Normal(nµ, nσ²) = Normal(np, np(1 − p)).

Thus,

Bin(n, p) → Normal( np , np(1 − p) )   as n → ∞ with p fixed,
                    (mean of Bin(n, p), var of Bin(n, p))

The Binomial distribution is therefore well approximated by the Normal distribution when n is large, for any fixed value of p.

The Normal distribution is also a good approximation to the Poisson(λ) distribution when λ is large:

Poisson(λ) → Normal(λ, λ)   when λ is large.
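The quality of the Normal approximation to the Binomial is easy to see in R. The sketch below is an illustration (using n = 100 and p = 0.5, as in the figure on the next page): it plots the Binomial probability function with the approximating Normal density on top.

    # Normal approximation to the Binomial(n = 100, p = 0.5) distribution
    n <- 100; p <- 0.5
    y <- 30:70
    plot(y, dbinom(y, size = n, prob = p), type = "h", ylab = "P(Y = y)")
    curve(dnorm(x, mean = n*p, sd = sqrt(n*p*(1 - p))), add = TRUE, lwd = 2)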

[Figure: the Binomial(n = 100, p = 0.5) and Poisson(λ = 100) probability functions, each closely matched by its Normal approximation.]

Why the Piece of Cake Theorem?

• The Central Limit Theorem makes whole realms of statistics into a piece of cake.
• After seeing a theorem this good, you deserve a piece of cake!

Example: Remember the margin of error for an opinion poll?

An opinion pollster wishes to estimate the level of support for Labour in an upcoming election. She interviews n people about their voting preferences. Let p be the true, unknown level of support for the Labour party in New Zealand. Let X be the number of the n people interviewed by the opinion pollster who plan to vote Labour. Then X ∼ Binomial(n, p).

At the end of Chapter 2, we said that the maximum likelihood estimator for p is

p̂ = X/n.

In a large sample (large n), we now know that X ∼ approx Normal(np, npq). So

p̂ = X/n ∼ approx Normal(p, pq/n),   where q = 1 − p   (linear transformation of a Normal r.v.).

So

(p̂ − p)/√(pq/n) ∼ approx Normal(0, 1).

Now if Z ∼ Normal(0, 1), we find (using a computer) that the 95% central probability region of Z is from −1.96 to +1.96:

P(−1.96 < Z < 1.96) = 0.95.

Check in R: pnorm(1.96, mean=0, sd=1) - pnorm(-1.96, mean=0, sd=1).

Putting Z = (p̂ − p)/√(pq/n), we obtain

P(−1.96 < (p̂ − p)/√(pq/n) < 1.96) ≃ 0.95.

Rearranging:

P( p̂ − 1.96 √(pq/n) < p < p̂ + 1.96 √(pq/n) ) ≃ 0.95.

This enables us to form an estimated 95% confidence interval for the unknown parameter p:

estimated 95% confidence interval is   p̂ − 1.96 √(p̂(1 − p̂)/n)   to   p̂ + 1.96 √(p̂(1 − p̂)/n).

The 95% confidence interval has RANDOM end-points, which depend on p̂. About 95% of the time, these random end-points will enclose the true unknown value, p.

Confidence intervals are extremely important for helping us to assess how useful our estimate is. A narrow confidence interval suggests a useful estimate (low variance); a wide confidence interval suggests a poor estimate (high variance).

Next time you see the newspapers quoting the margin of error on an opinion poll:

• Remember: margin of error = 1.96 √(p̂(1 − p̂)/n).
• Think: Central Limit Theorem!
• Have: a piece of cake.
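As a numerical illustration (the poll numbers n = 1000 and x = 380 are invented for this sketch, not taken from the notes), the estimate, margin of error, and estimated 95% confidence interval can be computed in R:

    # Estimate, margin of error, and estimated 95% confidence interval for a poll
    n <- 1000; x <- 380
    phat <- x/n                                 # maximum likelihood estimate of p
    moe  <- 1.96 * sqrt(phat * (1 - phat)/n)    # margin of error
    c(phat - moe, phat + moe)                   # approximately (0.350, 0.410)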

Using the Central Limit Theorem to find the distribution of the mean, X̄

Let X1, ..., Xn be independent, identically distributed with mean E(Xi) = µ and variance Var(Xi) = σ² for all i.

The sample mean, X̄, is defined as:

X̄ = (X1 + X2 + ... + Xn)/n.

So X̄ = Sn/n, where Sn = X1 + ... + Xn ∼ approx Normal(nµ, nσ²) by the CLT.

Because X̄ is a scalar multiple of a Normal r.v. as n grows large, X̄ itself is approximately Normal for large n:

(X1 + X2 + ... + Xn)/n ∼ approx Normal(µ, σ²/n)   as n → ∞.

The following three statements of the Central Limit Theorem are equivalent:

X̄ = (X1 + X2 + ... + Xn)/n ∼ approx Normal(µ, σ²/n)   as n → ∞.

Sn = X1 + X2 + ... + Xn ∼ approx Normal(nµ, nσ²)   as n → ∞.

(Sn − nµ)/√(nσ²) = (X̄ − µ)/√(σ²/n) ∼ approx Normal(0, 1)   as n → ∞.

The essential point to remember about the Central Limit Theorem is that large sums or sample means of independent random variables converge to a Normal distribution, whatever the distribution of the original r.v.s.

More general version of the CLT

A more general form of the CLT states that, if X1, ..., Xn are independent, and E(Xi) = µi, Var(Xi) = σi² (not necessarily all equal), then

Zn = Σ_{i=1}^n (Xi − µi) / √( Σ_{i=1}^n σi² ) → Normal(0, 1)   as n → ∞.

Other versions of the CLT relax the condition that X1, ..., Xn are independent.

Chapter 6: Wrapping Up

Probably the two major ideas of this course are:

• likelihood and estimation;
• hypothesis testing.

Most of the techniques that we have studied along the way are to help us with these two goals: expectation, variance, distributions, change of variable, and the Central Limit Theorem.

Let's see how these different ideas all come together.

6.1 Estimators — the good, the bad, and the estimator PDF

We have seen that an estimator is a capital letter replacing a small letter. What's the point of that?

Example: Let X ∼ Binomial(n, p) with known n and observed value X = x.

• The maximum likelihood estimate of p is p̂ = x/n.
• The maximum likelihood estimator of p is p̂ = X/n.

Example: Let X ∼ Exponential(λ) with observed value X = x.

• The maximum likelihood estimate of λ is λ̂ = 1/x.
• The maximum likelihood estimator of λ is λ̂ = 1/X.

Why are we interested in estimators? The answer is that estimators are random variables. This means they have distributions, means, and variances that tell us how well we can trust our single observation, or estimate, from this distribution.

Good and bad estimators

Suppose that X1, X2, ..., Xn are independent, and Xi ∼ Exponential(λ) for all i. λ is unknown, and we wish to estimate it.

In Chapter 4 we calculated the maximum likelihood estimator of λ:

λ̂ = 1/X̄ = n/(X1 + X2 + ... + Xn) = n/T.

Now λ̂ is a random variable with a distribution. For a given value of n, we can calculate the p.d.f. of λ̂. How?

We know that T = X1 + ... + Xn ∼ Gamma(n, λ) when Xi ∼ i.i.d. Exponential(λ). So we know the p.d.f. of T, and we can find the p.d.f. of λ̂ using the change of variable technique.

Here are the p.d.f.s of λ̂ for two different values of n:

• Estimator 1: n = 100. 100 pieces of information about λ.
• Estimator 2: n = 10. 10 pieces of information about λ.

[Figure: the p.d.f. of Estimator 1 is focused tightly about the true λ (unknown); the p.d.f. of Estimator 2 is much more spread out.]

Clearly, the more information we have, the better. The p.d.f. for n = 100 is focused much more tightly about the true value λ (unknown) than the p.d.f. for n = 10.
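The same picture can be produced by simulation rather than by the change of variable technique. The sketch below is an illustration only: it fixes a 'true' value λ = 2 (an arbitrary choice), repeatedly draws samples of size n = 10 and n = 100, and computes λ̂ = n/ΣXi each time.

    # Sampling distribution of lambda-hat = n/(X_1 + ... + X_n) for n = 10 and n = 100
    set.seed(1)
    lambda <- 2                                   # 'true' value, chosen for illustration
    lamhat10  <- replicate(10000, 10/sum(rexp(10,  rate = lambda)))
    lamhat100 <- replicate(10000, 100/sum(rexp(100, rate = lambda)))
    par(mfrow = c(1, 2))
    hist(lamhat10,  breaks = 50, main = "n = 10");  abline(v = lambda, lwd = 2)
    hist(lamhat100, breaks = 50, main = "n = 100"); abline(v = lambda, lwd = 2)

The histogram for n = 100 clusters much more tightly about the true λ than the histogram for n = 10.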

It is important to recognise what we do and don't know in this situation:

What we don't know:
• the true λ;
• WHERE we are on the p.d.f. curve.

What we do know:
• the p.d.f. of our estimator λ̂ = n/T;
• we know we're SOMEWHERE on that curve.

This is why we are so concerned with estimator variance. The estimator variance tells us how much the estimator can be trusted.

A poor estimator has high estimator variance: some places on the estimator's p.d.f. curve may be good, while others may be very bad. Because we don't know where we are on the curve, we can't trust any estimate from this poor estimator.

A good estimator has low estimator variance: everywhere on the estimator's p.d.f. curve is guaranteed to be good. So we need an estimator such that EVERYWHERE on the estimator's p.d.f. curve is good!

[Figure: p.d.f.s of Estimator 1 (low variance, every likely value close to the true λ) and Estimator 2 (high variance), with the true λ (unknown) marked.]

Note: We were lucky in this example to happen to know that T = X1 + ... + Xn ∼ Gamma(n, λ) when Xi ∼ i.i.d. Exponential(λ), so we could find the p.d.f. of λ̂. We won't usually be so lucky: so what should we do? Use the Central Limit Theorem!

Example: calculating the maximum likelihood estimator

The following question is in the same style as the exam questions.

Let X be a continuous random variable with probability density function

fX(x) = 2(s − x)/s²   for 0 < x < s,
        0             otherwise.

Here, s is a parameter to be estimated, where s is the maximum value of X and s > 0.

(a) Show that E(X) = s/3.

Use E(X) = ∫_0^s x fX(x) dx = (2/s²) ∫_0^s (sx − x²) dx.

(b) Show that E(X²) = s²/6.

Use E(X²) = ∫_0^s x² fX(x) dx = (2/s²) ∫_0^s (sx² − x³) dx.

(c) Find Var(X).

Use Var(X) = E(X²) − (EX)².   Answer: Var(X) = s²/18.

(d) Suppose that we make a single observation X = x. Write down the likelihood function, L(s; x), and state the range of values of s for which your answer is valid.

L(s; x) = 2(s − x)/s²   for x < s < ∞.

(e) The likelihood graph for a particular value of x is shown here. Show that the maximum likelihood estimator of s is ŝ = 2X. You should refer to the graph in your answer.

[Figure: the likelihood L(s; x) plotted against s (roughly 3 to 9), rising from 0 to a single maximum and then decreasing towards 0 as s increases.]

L(s; x) = 2s⁻²(s − x), so

dL/ds = 2[−2s⁻³(s − x) + s⁻²]
      = 2s⁻³(−2(s − x) + s)
      = (2/s³)(2x − s).

dL/ds = 0   ⇒   s = ∞ or s = 2x.

From the graph, we can see that s = ∞ is not the maximum. So ŝ = 2x.

Thus the maximum likelihood estimator is ŝ = 2X.

(f) Find the estimator variance, Var(ŝ), in terms of s. Hence find the estimated variance, in terms of ŝ.

Var(ŝ) = Var(2X) = 2² Var(X) = 4 × s²/18   by (c)
       = 2s²/9.

So also: the estimated variance is 2ŝ²/9.
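If we could not maximise L(s; x) by calculus, we could do it numerically. The sketch below is an illustration only: it uses R's optimize() on the likelihood for an arbitrary single observation, here x = 3, searching over x < s < 100 (an arbitrary upper bound), and returns a maximum near ŝ = 2x = 6.

    # Numerical maximisation of L(s; x) = 2(s - x)/s^2 for a single observation x
    x <- 3
    L <- function(s) 2 * (s - x)/s^2
    optimize(L, interval = c(x, 100), maximum = TRUE)$maximum   # approximately 6 = 2x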

(g) Suppose we make the single observation X = 3. Find the maximum likelihood estimate of s, and its estimated variance and standard error.

ŝ = 2x = 2 × 3 = 6.

Estimated variance: 2ŝ²/9 = (2 × 6²)/9 = 8.

se(ŝ) = √8 = 2.82.

(h) Write a sentence in plain English to explain what the maximum likelihood estimate from part (g) represents.

The value ŝ = 6 is the value of s under which the observation X = 3 is more likely than it is at any other value of s.

This means ŝ is a POOR estimator: the twice standard-error interval would be 6 − 2 × 2.82 to 6 + 2 × 2.82, that is, 0.36 to 11.64! Taking the twice standard-error interval strictly applies only to the Normal distribution, but it is a useful rule of thumb to see how 'good' the estimator is.

6.2 Hypothesis tests: in search of a distribution

When we do a hypothesis test, we need a test statistic: some random variable with a distribution that we can specify exactly under H0 and that differs under H1. It is finding the distribution that is the difficult part.

• Weird coin: is my coin fair? Let X be the number of heads out of 10 tosses. X ∼ Binomial(10, p). Easy. We have an easy distribution and can do a hypothesis test.

• Too many daughters? Do divers have more daughters than sons? Let X be the number of daughters out of 190 diver children. X ∼ Binomial(190, p). Easy.

• Too long between volcanoes? Let X be the length of time between volcanic eruptions. If we assume volcanoes occur as a Poisson process, then X ∼ Exponential(λ). We have a simple distribution and test statistic (X): we can test the observed length of time between eruptions and see if this is a believable observation under a hypothesized value of λ.

More advanced tests

Most things in life are not as easy as the three examples above.

Here are some observations. Do they come from a distribution (any distribution) with mean 0?

[A sample of observations was printed here.]

Answer: yes, they are Normal(0, 4), but how can we tell?

What about these?

[A second, much more spread-out, sample of observations was printed here.]

Answer: yes they do (Normal(0, 100) this time), but how can we tell? The unknown variance (4 versus 100) interferes, so that the second sample does not cluster about its mean of 0 at all.

If we don't know that our data are Normal, and we don't know their underlying variance, what can we use as our X to test whether µ = 0?

Answer: a clever person called W. S. Gossett (1876–1937) worked out an answer. He called himself only 'Student', possibly because he (or his employers) wanted it to be kept secret that he was doing his statistical research as part of his employment at Guinness Brewery.

The test that 'Student' developed is the familiar Student's t-test. It was originally developed to help Guinness decide how large a sample of people should be used in its beer tastings!
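In R, Student's t-test is carried out with the built-in t.test() function. The sketch below uses simulated stand-in data (the observations printed above cannot be reliably reproduced here), so the numbers are illustrative only:

    # One-sample Student's t-test of H0: mu = 0
    set.seed(1)
    obs <- rnorm(20, mean = 0, sd = 2)   # simulated stand-in for an observed sample
    t.test(obs, mu = 0)                  # prints the t statistic and p-value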

Student used the following test statistic for the unknown mean, µ:

T = (X̄ − µ) / √( Σ_{i=1}^n (Xi − X̄)² / (n(n − 1)) ).

Under H0: µ = 0, the distribution of T is known: T has p.d.f.

fT(t) = Γ(n/2) / ( √((n − 1)π) Γ((n − 1)/2) ) · (1 + t²/(n − 1))^{−n/2}   for −∞ < t < ∞.

T has the Student's t-distribution, derived as the ratio of a Normal random variable and an independent Chi-squared random variable. If µ ≠ 0, observations of T will tend to lie out in the tails of this distribution.

As you can probably guess, Student did not derive the distribution above without a great deal of hard work. The result, however, is astonishing. With the help of our best friend the Central Limit Theorem, Student's T-statistic gives us a test for µ = 0 (or any other value) that can be used with any large enough sample. The Student's t-test is exact when the distribution of the original data X1, ..., Xn is Normal. For other distributions, it is still approximately valid in large samples, by the Central Limit Theorem.

It looks difficult. It is! Most of the statistical tests in common use have deep (and sometimes quite impenetrable) theory behind them, but once researchers had derived the distribution of a suitable test statistic, the rest was easy.

The Chi-squared test for testing proportions in a contingency table also has a deep theory. In the Chi-squared goodness-of-fit test, the Pearson's chi-square test statistic is shown, by the Central Limit Theorem, to have a Chi-squared distribution under H0. It produces larger values under H1.

One interesting point to note is the pivotal role of the Central Limit Theorem in all of this. The Central Limit Theorem produces approximate Normal distributions. Normal random variables squared produce Chi-squared random variables. Normals divided by Chi-squareds produce t-distributed random variables. A ratio of two Chi-squared distributions produces an F-distributed random variable. All these things are not coincidental: the Central Limit Theorem rocks!
