
COURSE NOTES STATS 210 Statistical Theory

Department of Statistics University of Auckland

Contents

1. Probability
   1.1 Introduction
   1.2 Sample spaces
   1.3 Events
   1.4 Partitioning sets and events
   1.5 Probability: a way of measuring sets
   1.6 Probabilities of combined events
   1.7 The Partition Theorem
   1.8 Examples of basic probability calculations
   1.9 Formal probability proofs: non-examinable
   1.10 Conditional Probability
   1.11 Examples of conditional probability and partitions
   1.12 Bayes' Theorem: inverting conditional probabilities
   1.13 Statistical Independence
   1.14 Random Variables
   1.15 Key Probability Results for Chapter 1
   1.16 Chains of events and probability trees: non-examinable
   1.17 Equally likely outcomes and combinatorics: non-examinable

2. Discrete Probability Distributions
   2.1 Introduction
   2.2 The probability function, fX(x)
   2.3 Bernoulli trials
   2.4 Example of the probability function: the Binomial Distribution
   2.5 The cumulative distribution function, FX(x)
   2.6 Hypothesis testing
   2.7 Example: Presidents and deep-sea divers
   2.8 Example: Birthdays and sports professionals
   2.9 Likelihood and estimation
   2.10 Random numbers and histograms
   2.11 Expectation
   2.12 Variable transformations
   2.13 Variance
   2.14 Mean and variance of the Binomial(n, p) distribution

3. Modelling with Discrete Probability Distributions
   3.1 Binomial distribution
   3.2 Geometric distribution
   3.3 Negative Binomial distribution
   3.4 Hypergeometric distribution: sampling without replacement
   3.5 Poisson distribution
   3.6 Subjective modelling

4. Continuous Random Variables
   4.1 Introduction
   4.2 The probability density function
   4.3 The Exponential distribution
   4.4 Likelihood and estimation for continuous random variables
   4.5 Hypothesis tests
   4.6 Expectation and variance
   4.7 Exponential distribution mean and variance
   4.8 The Uniform distribution
   4.9 The Change of Variable Technique: finding the distribution of g(X)
   4.10 Change of variable for non-monotone functions: non-examinable
   4.11 The Gamma distribution
   4.12 The Beta Distribution: non-examinable

5. The Normal Distribution and the Central Limit Theorem
   5.1 The Normal Distribution
   5.2 The Central Limit Theorem (CLT)

6. Wrapping Up
   6.1 Estimators — the good, the bad, and the estimator PDF
   6.2 Hypothesis tests: in search of a distribution

Chapter 1: Probability

1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies), e.g.
   (number of times event occurs) / (number of opportunities for event to occur).

2. Subjective: probability represents a person's degree of belief that an event will occur, e.g. I think there is an 80% chance it will rain today, written as P(rain) = 0.80.

Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment. Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space. For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.

Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}
An example of a sample point is: HT

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are "gaps" between the different elements, or if the elements can be "listed", even if the list is infinite (e.g. 1, 2, 3, ...). In mathematical language, a sample space is discrete if it is countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1] = {all numbers between 0 and 1 inclusive}).

Examples:
Ω = {0, 1, 2, 3} (discrete and finite)
Ω = {4.5, 4.6, 4.7} (discrete, finite)
Ω = {HH, HT, TH, TT} (discrete, finite)
Ω = {1, 2, 3, 4, ...} (discrete, infinite)
Ω = [0°, 360°) (continuous)
Ω = [0, 1] (continuous)

1.3 Events

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics. How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment, and might seem unexciting. However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory. This formulation is due to Kolmogorov (1903-1987), one of the founders of probability theory.

The next concept that you would need to formulate is that of something that happens at random, or an event. How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space. That is, any collection of outcomes forms an event.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.
Let event A be the event that there is exactly one head. We write: A = "exactly one head". Then A = {HT, TH}.
A is a subset of Ω, as in the definition. We write A ⊂ Ω.

Example: Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), ..., (1, 6), (2, 1), (2, 2), ..., (2, 6), ..., (6, 1), ..., (6, 6)}
Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}

Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on. We do this in the language of set theory.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today.

[Venn diagram: events Bus, Car, Walk, Bike, Train on the sample space of people in class]

This sort of diagram representing events in a sample space is called a Venn diagram.
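As a quick illustration (not part of the original notes), the two-dice sample space and the event A = "sum of two faces is 5" can be enumerated directly:

```python
from itertools import product

# Sample space for throwing 2 dice: all ordered pairs (i, j)
omega = set(product(range(1, 7), repeat=2))

# Event A = "sum of two faces is 5": a subset of the sample space
A = {s for s in omega if sum(s) == 5}

print(len(omega))   # 36 sample points
print(sorted(A))    # [(1, 4), (2, 3), (3, 2), (4, 1)]
```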

1. Alternatives: the union 'or' operator

We wish to describe an event that is composed of several different alternatives. For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both. We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus UNION Car". The union operator, ∪, denotes Bus OR Car OR both.

Definition: Let A and B be events on the same sample space Ω: so A ⊂ Ω and B ⊂ Ω. The union of events A and B is written A ∪ B, and is given by

A ∪ B = {s : s ∈ A or s ∈ B or both}.

To shade the union of Bus and Car on the Venn diagram, you have to consider what an outcome needs to satisfy for the shaded event to occur. The answer is Bus, OR Car, OR both. So we shade all outcomes in 'Bus' and all outcomes in 'Car'. Overall, we have shaded all outcomes in the UNION of Bus and Car.

Note: Be careful not to confuse 'Or' and 'And'. To remember whether union refers to 'Or' or 'And': to shade the union we had to shade everything in Bus AND everything in Car, but the shaded event is Bus OR Car, NOT Bus AND Car.

2. Concurrences and coincidences: the intersection 'and' operator

The intersection is an event that occurs when two or more events ALL occur together. For example, consider the event that your journey today involved BOTH a car AND a train. We write the event that you used both car and train as Car ∩ Train, read as "Car INTERSECT Train". The intersection operator, ∩, denotes both Car AND Train together. To represent this event, we shade all outcomes in the overlap of Car and Train.

Definition: The intersection of events A and B is written A ∩ B and is given by

A ∩ B = {s : s ∈ A AND s ∈ B}.

3. Opposites: the complement or 'not' operator

The complement of an event is the opposite of the event: whatever the event was, it didn't happen. For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.
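The three operators map directly onto Python set operations. Here is a minimal sketch with invented class members (the names and transport data are hypothetical, for illustration only):

```python
# Hypothetical sample space: people in the class, with invented transport events
omega = {"ann", "bob", "cat", "dan", "eve"}
bus, car, walk = {"ann", "bob"}, {"bob", "cat"}, {"dan"}

motor_vehicle = bus | car    # union Bus ∪ Car: bus OR car OR both
both = bus & car             # intersection Bus ∩ Car: bus AND car together
not_walk = omega - walk      # complement of Walk within the sample space

print(sorted(motor_vehicle))   # ['ann', 'bob', 'cat']
```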

We write the event 'not Walk' as Walk̄ (Walk with an overbar).

Definition: The complement of event A is written Ā and is given by Ā = {s : s ∉ A}.

Examples: Experiment: Pick a person in this class at random.
Sample space: Ω = {all people in class}.
Let event A = "person is male" and event B = "person travelled by bike today".
Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.
2) Ā: No.
3) B: No.
4) B̄: Yes.
5) Ā ∪ B = {female or bike rider or both}: No.
6) A ∩ B̄ = {male and non-biker}: Yes.
7) A ∩ B = {male and bike rider}: No, A ∩ B did not occur.
8) The complement of A ∩ B = everything outside A ∩ B: A ∩ B did not occur, so its complement did occur. Yes.

Question: What is the event Ω̄? Ω̄ = ∅.

Challenge: can you express A ∩ B using only a ∪ sign and complements? Answer: A ∩ B is the complement of Ā ∪ B̄.

Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

Example: [Venn diagrams showing (a) A ∪ B ∪ C and (b) A ∩ B ∩ C for three events A, B, C]

Properties of union, intersection, and complement

The following properties hold, although they are not used to provide formal proofs.

(i) ∅̄ = Ω and Ω̄ = ∅.
(ii) For any event A, A ∪ Ā = Ω, and A ∩ Ā = ∅.
(iii) Commutative: for any events A and B, A ∪ B = B ∪ A, and A ∩ B = B ∩ A.
(iv) (a) The complement of A ∪ B is Ā ∩ B̄. (b) The complement of A ∩ B is Ā ∪ B̄.

Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then a × (b + c) = a × b + a × c. However, addition is not distributive over multiplication: a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union. Thus, for any sets A, B, and C:

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C), and
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

More generally, for several events A and B1, B2, ..., Bn:

A ∪ (B1 ∩ B2 ∩ ... ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ ... ∩ (A ∪ Bn),
i.e. A ∪ (⋂ᵢ₌₁ⁿ Bᵢ) = ⋂ᵢ₌₁ⁿ (A ∪ Bᵢ),

and

A ∩ (B1 ∪ B2 ∪ ... ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ ... ∪ (A ∩ Bn),
i.e. A ∩ (⋃ᵢ₌₁ⁿ Bᵢ) = ⋃ᵢ₌₁ⁿ (A ∩ Bᵢ).
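The distributive laws can be spot-checked numerically on random subsets of a small universe. This is a sanity check, not a proof, and is not part of the original notes:

```python
import random

# Spot-check the distributive laws on randomly generated sets
random.seed(42)
universe = range(20)

def rand_subset():
    return {x for x in universe if random.random() < 0.5}

for _ in range(100):
    A, B, C = rand_subset(), rand_subset(), rand_subset()
    assert A | (B & C) == (A | B) & (A | C)   # union distributes over intersection
    assert A & (B | C) == (A & B) | (A & C)   # intersection distributes over union
print("all trials passed")
```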

1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅. This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

Note: Does this mean that A and B are independent? No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, ..., Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

Definition: A partition of the sample space Ω is a collection of mutually exclusive events whose union is Ω. That is, sets B1, B2, ..., Bk form a partition of Ω if Bi ∩ Bj = ∅ for all i, j with i ≠ j, and

⋃ᵢ₌₁ᵏ Bᵢ = B1 ∪ B2 ∪ ... ∪ Bk = Ω.

Examples:

[Venn diagrams: B1, ..., B4 form a partition of Ω; B1, ..., B5 partition Ω]

Important: B and B̄ partition Ω for any event B.

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω. If B1, ..., Bk form a partition of Ω, then (A ∩ B1), ..., (A ∩ Bk) form a partition of A.

We will see that this is very useful for finding the probability of event A. This is because it is often easier to find the probability of small 'chunks' of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.

1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow 'measuring chance'. Chance of what? The answer is sets. It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. 'Probability', the name that we give to our chance-measure, is a way of measuring sets.

If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains? We have the same question when setting out to measure chance.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it? In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!). But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they be the same probability?

First set: {Lions win}. Second set: {All Blacks win}.

Both sets have just one element, but we definitely need to give them different probabilities!

More problems arise when the sets are infinite or continuous. Should the intervals [3, 4] and [13, 14] have the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20]; but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.

Most of this course is about probability distributions. A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space. At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution: P(Lions win) = 0.01, P(All Blacks win) = 0.99.

Discrete probability distributions

Definition: Let Ω = {s1, s2, ...} be a discrete sample space. A discrete probability distribution on Ω is a set of real numbers {p1, p2, ...} associated with the sample points {s1, s2, ...} such that:

1. 0 ≤ pi ≤ 1 for all i;
2. Σᵢ pi = 1.

pi is called the probability of the event that the outcome is si. We write: pi = P(si).

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i ∈ A} pi.

E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
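The two defining conditions and the summation rule for P(A) can be sketched in a few lines. This uses a fair die as a hypothetical example distribution, not one from the notes:

```python
from fractions import Fraction

# A discrete probability distribution on Ω = {1, ..., 6} (fair die):
# each sample point s_i gets probability p_i, and the p_i sum to 1
p = {s: Fraction(1, 6) for s in range(1, 7)}
assert all(0 <= pi <= 1 for pi in p.values())
assert sum(p.values()) == 1

def prob(A):
    """P(A) = sum of the probabilities p_i of the elements of A."""
    return sum(p[s] for s in A)

print(prob({3, 5}))   # 1/3
```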

Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we can not list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course. However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω. The challenge of defining a probability distribution on a continuous sample space is left till later.

Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.
Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.
Axiom 3: If A1, A2, ..., An are mutually exclusive events (no overlap), then
P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An).

If our rule for 'measuring sets' satisfies the three axioms, it is a valid probability distribution. It should be clear that the definitions given above for a discrete sample space satisfy the axioms.

Note: The axioms can never be 'proved': they are definitions.
Note: P(∅) = 0.
Note: Remember that an EVENT is a SET: an event is a subset of the sample space.

1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:

1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.
2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula. For ANY events A, B:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
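The Case 2 formula can be verified by brute-force counting of equally likely outcomes. A sketch using two fair dice (an illustration, not an example from the notes):

```python
from fractions import Fraction
from itertools import product

# Equally likely outcomes on two fair dice; P(E) = |E| / |Ω|
omega = set(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {s for s in omega if sum(s) == 5}   # "sum of faces is 5"
B = {s for s in omega if s[0] == 1}     # "first die shows 1"

# The union formula: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 1/4
```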

Explanation: For any events A and B, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection, B \ (A ∩ B). Alternatively, think of the Venn diagrams: when we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B):

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

This is obvious from the diagrams, but a formal proof is given in Section 1.9 (non-examinable).

2. Probability of an intersection

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.13). If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

P(Ā) = 1 − P(A).

The formal proof of this formula is in Section 1.9.

1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, ..., Bm which together cover everything in Ω. Also, if B1, ..., Bm form a partition of Ω, then (A ∩ B1), ..., (A ∩ Bm) form a partition of the set or event A. The probability of event A is therefore the sum of its parts; for example, with m = 4:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

Theorem 1.7: The Partition Theorem. Let B1, ..., Bm form a partition of Ω. Then for any event A,

P(A) = Σᵢ₌₁ᵐ P(A ∩ Bi).

(Proof in Section 1.9.)

Note: Recall the formal definition of a partition. Sets B1, ..., Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ⋃ᵢ₌₁ᵐ Bi = Ω. The Partition Theorem is a mathematical way of saying the whole is the sum of its parts.
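The theorem can be demonstrated by counting. A sketch (not from the notes) partitioning the two-dice sample space by the value of the first die:

```python
from fractions import Fraction
from itertools import product

# Two fair dice; P(E) = |E| / |Ω|
omega = set(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {s for s in omega if sum(s) == 7}   # event A = "sum is 7"

# Partition Ω by the value of the first die: B_1, ..., B_6
parts = [{s for s in omega if s[0] == k} for k in range(1, 7)]

# Partition Theorem: P(A) = Σ P(A ∩ B_k)
assert P(A) == sum(P(A & B_k) for B_k in parts)
print(P(A))   # 1/6
```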

1.8 Examples of basic probability calculations

An Australian survey asked people what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car. 33% of respondents had children.

Find the probability that a respondent:
(a) chose a large car;
(b) either had children or chose a large car (or both).

First formulate events: Let C = "has children", C̄ = "no children", L = "chooses large car".

Next write down all the information given:
P(C) = 0.33, P(C ∩ L) = 0.13, P(C̄ ∩ L) = 0.12.

(a) Asked for P(L).
P(L) = P(L ∩ C) + P(L ∩ C̄) = 0.13 + 0.12 = 0.25. (Partition Theorem)
So P(chooses large car) = 0.25.

(b) Asked for P(L ∪ C).
P(L ∪ C) = P(L) + P(C) − P(L ∩ C) = 0.25 + 0.33 − 0.13 = 0.45. (Section 1.6)
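The two survey answers can be checked numerically, applying the Partition Theorem and then the union formula to the figures given above:

```python
# Numbers from the car survey example
p_C, p_C_and_L, p_notC_and_L = 0.33, 0.13, 0.12

# (a) Partition Theorem: split L over the partition {C, not-C}
p_L = p_C_and_L + p_notC_and_L

# (b) Union formula: P(L ∪ C) = P(L) + P(C) − P(L ∩ C)
p_L_or_C = p_L + p_C - p_C_and_L

print(round(p_L, 2), round(p_L_or_C, 2))   # 0.25 0.45
```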

Example 2: Facebook statistics for New Zealand university students aged between 18 and 24 suggest that 22% are interested in music, while 34% are interested in sport.

Formulate events: M = "interested in music", S = "interested in sport".
Information given: P(M) = 0.22, P(S) = 0.34.

(a) What is P(M̄)?
P(M̄) = 1 − P(M) = 1 − 0.22 = 0.78.

(b) What is P(M ∩ S)?
We can not calculate P(M ∩ S) from the information given.

(c) Given the further information that 48% of the students are interested in neither music nor sport, find P(M ∪ S) and P(M ∩ S).

The event "neither music nor sport" is the complement of M ∪ S, so the information given is that this complement has probability 0.48. Thus

P(M ∪ S) = 1 − 0.48 = 0.52.
P(M ∩ S) = P(M) + P(S) − P(M ∪ S) = 0.22 + 0.34 − 0.52 = 0.04.

Only 4% of students are interested in both music and sport.

(d) Find the probability that a student is interested in music, but not sport.
P(M ∩ S̄) = P(M) − P(M ∩ S) = 0.22 − 0.04 = 0.18. (Partition Theorem)

1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.
(i) P(∅) = 0.
(ii) P(Ā) = 1 − P(A) for any event A.
(iii) (Partition Theorem.) If B1, B2, ..., Bm form a partition of Ω, then for any event A,
P(A) = Σᵢ₌₁ᵐ P(A ∩ Bi).
(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A, B.

Proof:
i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive). So
P(A) = P(A ∪ ∅) = P(A) + P(∅) (Axiom 3) ⇒ P(∅) = 0.

ii) Ω = A ∪ Ā, and A ∩ Ā = ∅ (mutually exclusive). So
1 = P(Ω) = P(A ∪ Ā) = P(A) + P(Ā). (Axiom 1, Axiom 3)

iii) Suppose B1, ..., Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ⋃ᵢ₌₁ᵐ Bi = Ω. Thus, for i ≠ j,

(A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅,

i.e. (A ∩ B1), ..., (A ∩ Bm) are mutually exclusive also. So,

Σᵢ₌₁ᵐ P(A ∩ Bi) = P(⋃ᵢ₌₁ᵐ (A ∩ Bi)) (Axiom 3)
= P(A ∩ ⋃ᵢ₌₁ᵐ Bi) (Distributive laws)
= P(A ∩ Ω)
= P(A).

iv) A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω) (set theory)
= (A ∩ (B ∪ B̄)) ∪ (B ∩ (A ∪ Ā)) (set theory)
= (A ∩ B) ∪ (A ∩ B̄) ∪ (B ∩ A) ∪ (B ∩ Ā) (distributive laws)
= (A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B).

These 3 events are mutually exclusive: e.g. (A ∩ B) ∩ (A ∩ B̄) = A ∩ (B ∩ B̄) = A ∩ ∅ = ∅, etc. So,

P(A ∪ B) = P(A ∩ B) + P(A ∩ B̄) + P(Ā ∩ B) (Axiom 3)
= P(A ∩ B) + [P(A) − P(A ∩ B)] (from (iii), using B and B̄)
+ [P(B) − P(A ∩ B)] (from (iii), using A and Ā)
= P(A) + P(B) − P(A ∩ B).

1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem. Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.
Let event A = "get 6". Let event B = "get 4 or better".
If the die is fair, then P(A) = 1/6 and P(B) = 1/2.
However, if we know that B has occurred, then there is an increased chance that A has occurred:

P(A occurs given that B has occurred) = (result 6) / (result 4 or 5 or 6) = 1/3.

We write P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?
P(B | A) = P(B occurs, given that A has occurred) = P(get ≥ 4, given that we know we got a 6) = 1.

Conditioning as reducing the sample space

Sally wants to use Facebook to find a boyfriend at Uni. Her friend Kate tells her not to bother, because there are more women than men on Facebook. Here are the true 2009 figures for Facebook users at the University of Auckland:

Relationship status   Male   Female   Total
Single                 660      660    1320
In a relationship      520      840    1360
Total                 1180     1500    2680

Before we go any further ... do you agree with Kate? No, because out of the SINGLE people on Facebook, there are exactly the same number of men as women! Conditioning is all about the sample space of interest.

The table above shows the following sample space: Ω = {Facebook users at UoA}. Define event M to be: M = {person is male}. Kate is referring to the following probability:

P(M) = (# M's) / (total # in table) = 1180/2680 = 0.44.

Kate is correct that there are more women than men on Facebook, but she is using the wrong sample space, so her answer is not relevant. The sample space that should interest Sally is different: it is S = {members of Ω who are SINGLE}.

Now suppose we reduce our sample space from Ω = {everyone in the table} to S = {single people in the table}. Then

P(person is male, given that the person is single)
= (# single males) / (# singles)
= (# who are M and S) / (# who are S)
= 660/1320 = 0.5.

We write: P(M | S) = 0.5.

This is the probability that Sally is interested in, and she can rest assured that there are as many single men as single women out there.

Example: Define event R that a person is in a relationship. What is the proportion of males among people in a relationship, P(M | R)?

P(M | R) = (# males in a relationship) / (# in a relationship)
= (# who are M and R) / (# who are R)
= 520/1360 = 0.38.
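The "reduce the sample space" idea is just counting within the conditioning set. A sketch using the 2009 UoA Facebook counts from the table above:

```python
# Counts from the 2009 UoA Facebook table
counts = {
    ("male", "single"): 660, ("female", "single"): 660,
    ("male", "relationship"): 520, ("female", "relationship"): 840,
}

def cond_prob(sex, status):
    """P(sex | status): count within the reduced sample space `status`."""
    n_status = sum(n for (s, st), n in counts.items() if st == status)
    return counts[(sex, status)] / n_status

print(round(cond_prob("male", "single"), 2))        # 0.5
print(round(cond_prob("male", "relationship"), 2))  # 0.38
```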

We could follow the same working for any pair of events, A and B:

P(A | B) = (# who are A and B) / (# who are B)
= [(# who are A and B) / (# in Ω)] / [(# who are B) / (# in Ω)]
= P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of B's only). P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space. Conditioning on event B means changing the sample space to B. Think of P(A | B) as the chance of getting an A, from the set of B's only.

The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1. This shows that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω. If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( | B).

The symbol P( | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) − P(C ∩ D),
so P(C ∪ D | B) = P(C | B) + P(D | B) − P(C ∩ D | B).

Note: In the Facebook example, we found that P(M | S) = 0.5, and P(M | R) = 0.38. This means that a single person on UoA Facebook is equally likely to be male or female, but a person in a relationship is much more likely to be female than male! Why the difference? Your guess is as good as mine, but I think it's because men in a relationship are far too busy buying flowers for their girlfriends to have time to spend on Facebook.

Trick for checking conditional probability calculations: A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true. For example, is P(A | B) + P(Ā | B) = 1? Replace B by Ω: this gives

P(A | Ω) + P(Ā | Ω) = P(A) + P(Ā) = 1.

So, yes, P(A | B) + P(Ā | B) = 1 for any other sample space B.

Is P(A | B) + P(A | B̄) = 1? Try to replace the conditioning set by Ω: we can't! There are two different conditioning sets, B and B̄, and it doesn't make sense to add together probabilities from two different sample spaces. The expression is NOT true.

The Multiplication Rule

For any events A and B,

P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof: Immediate from the definitions:

P(A | B) = P(A ∩ B)/P(B) ⇒ P(A ∩ B) = P(A | B)P(B),

and

P(B | A) = P(B ∩ A)/P(A) ⇒ P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).

New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: if B1, ..., Bm partition the sample space, then for any event A,

P(A) = Σ(i=1 to m) P(A ∩ Bi) = Σ(i=1 to m) P(A | Bi)P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ(i=1 to m) P(A | Bi)P(Bi).

Warning: Be sure to use this new version of the Partition Theorem correctly. It is

P(A) = P(A | B1)P(B1) + ... + P(A | Bm)P(Bm),

NOT P(A) = P(A | B1) + ... + P(A | Bm).

Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.) Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it. Conditioning on an event is like pretending that you know that the event has happened. For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do, and add up the different possibilities:

P(work on time) = P(work on time | fine) × P(fine) + P(work on time | wet) × P(wet).

1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4. The sample space can be written as Ω = {bus journeys}. We can formulate events as follows: T = "on time", L = "late". From the information given, the events have probabilities P(T) = 0.6 and P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (their probabilities sum to 1), and there is no overlap in the events by definition.

The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded.

(b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C.

P(C | T) = 0.5, P(C | L) = 0.7, P(N | C) = 0.8, P(N | C̄) = 0.4.

(d) Find the probability that the bus is crowded.

P(C) = P(C | T)P(T) + P(C | L)P(L)   (Partition Theorem)
     = 0.5 × 0.6 + 0.7 × 0.4
     = 0.58.

(e) Find the probability that the bus is noisy.

P(N) = P(N | C)P(C) + P(N | C̄)P(C̄)   (Partition Theorem)
     = 0.8 × 0.58 + 0.4 × (1 − 0.58)
     = 0.632.
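The two Partition Theorem calculations above are easy to reproduce as a short sketch (the variable names are mine):

```python
# Given information for the bus example.
p_T, p_L = 0.6, 0.4                   # on time / late: a partition of the journeys
p_C_given_T, p_C_given_L = 0.5, 0.7   # crowded, given on time / late
p_N_given_C, p_N_given_notC = 0.8, 0.4

# (d) Partition Theorem over {T, L}: P(C) = P(C|T)P(T) + P(C|L)P(L).
p_C = p_C_given_T * p_T + p_C_given_L * p_L

# (e) Partition Theorem again, now over {C, not-C}.
p_N = p_N_given_C * p_C + p_N_given_notC * (1 - p_C)
```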

1.12 Bayes' Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:

P(B | A)P(A) = P(A | B)P(B).

Thus

P(B | A) = P(A | B)P(B) / P(A).   (⋆)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702–61), English clergyman and founder of Bayesian Statistics. Bayes' Theorem allows us to "invert" the conditioning, i.e. to express P(B | A) in terms of P(A | B). This is very useful. For example, it might be easy to calculate P(later event | earlier event), but we might only observe the later event and wish to deduce the probability that the earlier event occurred: P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, ..., Bm form a partition of Ω. Then for any event A, and for any j = 1, ..., m,

P(Bj | A) = P(A | Bj)P(Bj) / Σ(i=1 to m) P(A | Bi)P(Bi).   (Bayes' Theorem)

Proof: Immediate from (⋆) (put B = Bj), and the Partition Rule, which gives P(A) = Σ(i=1 to m) P(A | Bi)P(Bi).

Special case of Bayes' Theorem when m = 2: use B and B̄ as the partition of Ω. Then

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B̄)P(B̄)].

Example: The case of the Perfidious Gardener. Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perfidious gardener who will fail to water the rosebush with probability 2/3. Smith returns from holiday to find the rosebush ... DEAD!!! What is the probability that the gardener did not water it?

Solution:

First step: formulate events. Let:
D = "rosebush dies";
W = "gardener waters rosebush";
W̄ = "gardener fails to water rosebush".

Second step: write down all information given:
P(D | W) = 1/2, P(D | W̄) = 3/4, P(W̄) = 2/3 (so P(W) = 1/3).

Third step: write down what we're looking for: P(W̄ | D).

Fourth step: compare this to what we know. We need to invert the conditioning, so use Bayes' Theorem:

P(W̄ | D) = P(D | W̄)P(W̄) / [P(D | W̄)P(W̄) + P(D | W)P(W)]
         = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3)
         = 3/4.

So the gardener failed to water the rosebush with probability 3/4.
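The gardener calculation can be checked exactly with fractions; this is a sketch of the two-set special case of Bayes' Theorem, with my own variable names:

```python
from fractions import Fraction

p_notW = Fraction(2, 3)          # gardener fails to water
p_W = 1 - p_notW                 # gardener waters: 1/3
p_D_given_W = Fraction(1, 2)     # rosebush dies if watered
p_D_given_notW = Fraction(3, 4)  # rosebush dies if not watered

# Denominator: Partition Theorem over {W, not-W}.
p_D = p_D_given_notW * p_notW + p_D_given_W * p_W

# Bayes' Theorem: invert the conditioning.
p_notW_given_D = p_D_given_notW * p_notW / p_D
```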

Example: The case of the Defective Ketchup Bottle. Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her wig. What is the probability that it came from Factory 1?

Solution:

1. Events: let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".

2. Information given:
P(F1) = 0.5, P(F2) = 0.3, P(F3) = 0.2;
P(D | F1) = 0.004, P(D | F2) = 0.006, P(D | F3) = 0.012.

3. Looking for: P(F1 | D) (so we need to invert the conditioning).

4. Bayes' Theorem:

P(F1 | D) = P(D | F1)P(F1) / [P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3)]
          = (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
          = 0.002 / 0.0062
          = 0.322.
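The full m-way form of Bayes' Theorem is short enough to write as a general helper; the sketch below (my own function, not part of the notes) reproduces the ketchup calculation:

```python
def bayes(priors, likelihoods, j):
    """P(B_j | A) for a partition with priors P(B_i) and likelihoods P(A | B_i)."""
    total = sum(p * l for p, l in zip(priors, likelihoods))  # Partition Theorem
    return priors[j] * likelihoods[j] / total

priors = [0.5, 0.3, 0.2]             # factory shares of total output
likelihoods = [0.004, 0.006, 0.012]  # P(defective | factory i)

p_F1_given_D = bayes(priors, likelihoods, 0)  # approximately 0.322
```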


1.13 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other. This means

P(A | B) = P(A)   and   P(B | A) = P(B).

Now P(A | B) = P(A ∩ B)/P(B), so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B).
We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if P(A ∩ B) = P(A)P(B).

For more than two events, we say:

Definition: Events A1, A2, ..., An are mutually independent if

P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2)...P(An),

AND the same multiplication rule holds for every subcollection of the events too.

E.g. events A1, A2, A3, A4 are mutually independent if

i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND
ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND
iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Note: If events are physically independent, then they will also be statistically independent.


Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.

1. IF A and B are statistically independent, then P(A ∩ B) = P(A) × P(B).

2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule: P(A ∩ B) = P(A | B)P(B). This still requires us to be able to calculate P(A | B).

Example: Toss a fair coin and a fair die together. The coin and die are physically independent.

Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}: all 12 outcomes are equally likely.

Let A = "heads" and B = "six".

Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2,
and P(B) = P({H6, T6}) = 2/12 = 1/6.

Now P(A ∩ B) = P(Heads and 6) = P({H6}) = 1/12.

But P(A) × P(B) = 1/2 × 1/6 = 1/12 also.

So P(A ∩ B) = P(A)P(B), and thus A and B are statistically independent.
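The coin-and-die check can be done by enumerating all 12 equally likely outcomes; a minimal sketch (my own helper names):

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely outcomes of tossing a coin and rolling a die together.
omega = [coin + die for coin, die in product("HT", "123456")]

def prob(event):
    """P(event) = (# favourable outcomes) / (# outcomes in omega)."""
    return Fraction(sum(1 for s in omega if event(s)), len(omega))

def A(s): return s[0] == "H"   # "heads"
def B(s): return s[1] == "6"   # "six"

p_A, p_B = prob(A), prob(B)               # 1/2 and 1/6
p_AB = prob(lambda s: A(s) and B(s))      # P({H6}) = 1/12
assert p_AB == p_A * p_B                  # statistically independent
```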


Pairwise independence does not imply mutual independence

Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random. Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent:
Consider P(A ∩ B) = 1/4 (one of the 4 balls has both red and white on it).
But P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B).
Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C). So A, B and C are pairwise independent.

Mutually independent?
Consider P(A ∩ B ∩ C) = 1/4 (one of the 4 balls), while
P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).

So A, B and C are NOT mutually independent, despite being pairwise independent.
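The jar example is small enough to verify exhaustively; a sketch (describing each ball by its set of colours, a representation of my own choosing):

```python
from fractions import Fraction

# The four equally likely balls, described by the colours they show.
balls = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]

def prob(*colours):
    """P(the drawn ball shows all of the given colours)."""
    hits = sum(1 for b in balls if set(colours) <= b)
    return Fraction(hits, len(balls))

# Pairwise independent: P(A ∩ B) = 1/4 = P(A)P(B), and likewise for each pair.
assert prob("red", "white") == prob("red") * prob("white") == Fraction(1, 4)
assert prob("red", "blue") == prob("red") * prob("blue")
assert prob("white", "blue") == prob("white") * prob("blue")

# But NOT mutually independent: P(A ∩ B ∩ C) = 1/4, not 1/8.
triple = prob("red", "white", "blue")
assert triple == Fraction(1, 4) != prob("red") * prob("white") * prob("blue")
```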

1.14 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:

1. 'Things that happen' are sets, also called events.
2. We measure chance by measuring sets, using a measure called probability.

Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces: Ω = {head, tail}; Ω = {same, different}; Ω = {Lions win, All Blacks win}.

All of these sample spaces could be represented more concisely in terms of numbers: Ω = {0, 1}. On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes: for example, the number of girls in a three-child family, or the number of heads from 10 tosses of a coin.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:

1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);
2. quantify how much the outcomes tend to diverge from the average value;
3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?).

The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes. A random experiment whose possible outcomes are real numbers is called a random variable.

In fact, any random experiment can be made to have outcomes that are real numbers, simply by mapping the sample space Ω onto a set of real numbers using a function. For example: define the function X : Ω → R by X("Lions win") = 0 and X("All Blacks win") = 1.

This gives us our formal definition of a random variable:

Definition: A random variable (r.v.) is a function from a sample space Ω to the real numbers R. We write X : Ω → R.

etc. etc. For a sample space Ω and random variable X : Ω → R.g. 0 otherwise. TTH. Intuitively. and lower case letters to represent the values that the random variable takes (e. Y (T HH) = 1. P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}). HTH. . variances. The sample space is Ω = {HHH. Giving a name to a large class of random experiments that genuinely produce random numbers.g. we use CAPITAL LETTERS for random variables (e. Y (HHH) = 1. Probabilities for random variables By convention. Describing many different sample spaces in the same terms: e. Ω = {0. So X(HHH) = 3. x). THT. HTT. Then Y (HT H) = 0. and for which we want to develop general rules for finding averages. 1} with P(1) = p and P(0) = 1 − p describes EVERY possible experiment with two outcomes. HHT. and for a real number x. Defining random variables serves the dual purposes of: 1. remember that a random variable equates to a random experiment whose outcomes are numbers. THH. the intuitive definition of a random variable is probably more useful. X(T HT ) = 1. and so on. A random variable produces random real numbers as the ‘outcome’ of a random experiment. 2. we have X(si ) = # heads in outcome si . Example: Toss a coin 3 times. Another example is Y : Ω → R such that Y (si) = 1 if 2nd toss is a head.39 Although this is the formal definition. TTT} One example of a random variable is X : Ω → R such that. X). for sample point si .g. relationships.

Example: Toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = ... = P(TTT) = 1/8. Let X : Ω → R be such that X(s) = # heads in s. Then:

P(X = 0) = P({TTT}) = 1/8,
P(X = 1) = P({HTT, THT, TTH}) = 3/8,
P(X = 2) = P({HHT, HTH, THH}) = 3/8,
P(X = 3) = P({HHH}) = 1/8.

Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.

Independent random variables

Random variables X and Y are independent if each does not affect the other. Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if

P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y)

for all possible values x and y.

We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y). From now on, we will use the following notations interchangeably:

P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).

Thus X and Y are independent if and only if P(X = x, Y = y) = P(X = x)P(Y = y) for ALL possible values x, y.
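The probabilities for X above come straight from counting sample points; a sketch that enumerates the 8 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# The 8 equally likely outcomes of 3 tosses of a fair coin.
omega = ["".join(s) for s in product("HT", repeat=3)]

def pmf_heads(x):
    """P(X = x), where X(s) = number of heads in outcome s."""
    return Fraction(sum(1 for s in omega if s.count("H") == x), len(omega))

probs = [pmf_heads(x) for x in range(4)]   # [1/8, 3/8, 3/8, 1/8]
assert sum(probs) == 1
```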

1.15 Key Probability Results for Chapter 1

1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability: P(A | B) = P(A ∩ B)/P(B) for any A, B. Or: P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write P(A | B) = P(B | A)P(A)/P(B). This is a simplified version of Bayes' Theorem. It shows how to 'invert' the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem slightly more generalized: for any A, B,

P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | Ā)P(Ā)].

This works because A and Ā form a partition of the sample space.

5. Complete version of Bayes' Theorem: If sets A1, ..., Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then

P(Aj | B) = P(B | Aj)P(Aj) / [P(B | A1)P(A1) + ... + P(B | Am)P(Am)]
          = P(B | Aj)P(Aj) / Σ(i=1 to m) P(B | Ai)P(Ai).

6. Partition Theorem: if A1, ..., Am partition the sample space, then

P(B) = P(B ∩ A1) + P(B ∩ A2) + ... + P(B ∩ Am).

This can also be written as:

P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + ... + P(B | Am)P(Am).

These are both very useful formulations.

7. Chains of events: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

8. Statistical independence: if A and B are independent, then

P(A ∩ B) = P(A)P(B)   and   P(A | B) = P(A)   and   P(B | A) = P(B).

9. Unions: For any A, B, C,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).

10. Conditional probability: If P(B) > 0, then we can treat P(· | B) just like P: e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2)); if A1, ..., Am partition the sample space, then P(A1 | B) + P(A2 | B) + ... + P(Am | B) = 1; and P(Ā | B) = 1 − P(A | B) for any A, B. (Note: it is not generally true that P(A | B̄) = 1 − P(A | B).) The fact that P(· | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

1.16 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.

Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that: (a) they are both white; (b) the second ball is red.

Solution: Let event Wi = "ith ball is white" and Ri = "ith ball is red".

a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1).
Now P(W1) = 4/6 and P(W2 | W1) = 3/5.
So P(both white) = P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.

b) Looking for P(2nd ball is red). To find this, we have to condition on what happened in the first draw. Event "2nd ball is red" is actually event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2).

So P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)   (mutually exclusive)
= P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
= 2/5 × 4/6 + 1/5 × 2/6
= 1/3.
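Sampling without replacement is easy to check by enumerating every equally likely ordered pair of draws; a sketch with my own helper names:

```python
from fractions import Fraction
from itertools import permutations

# Box of 4 white and 2 red balls; draws are ordered pairs of distinct balls.
box = ["W"] * 4 + ["R"] * 2
draws = list(permutations(range(6), 2))   # 30 equally likely ordered pairs

def prob(event):
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

p_both_white = prob(lambda d: box[d[0]] == "W" and box[d[1]] == "W")  # 2/5
p_second_red = prob(lambda d: box[d[1]] == "R")                        # 1/3
```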

Probability trees

Probability trees are a graphical way of representing the multiplication rule.

[Tree diagram for the two-ball example: the first draw branches to W1 with P(W1) = 4/6 or R1 with P(R1) = 2/6; from W1 the second draw branches to W2 or R2 with P(W2 | W1) = 3/5 and P(R2 | W1) = 2/5; from R1 it branches to W2 or R2 with P(W2 | R1) = 4/5 and P(R2 | R1) = 1/5.]

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g.

P(W1 ∩ W2) = 4/6 × 3/5,   or   P(R1 ∩ W2) = 2/6 × 4/5.

More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
= P(A3 | A1 ∩ A2)P(A1 ∩ A2)   (multiplication rule)
= P(A3 | A1 ∩ A2)P(A2 | A1)P(A1)   (multiplication rule)

Remember as: P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1).

On the probability tree, multiplying the branch probabilities P(A1), P(A2 | A1) and P(A3 | A2 ∩ A1) along a path gives P(A1 ∩ A2 ∩ A3).

In general, for n events A1, A2, ..., An, we have

P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1) ... P(An | An−1 ∩ ... ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?

Answer:

P(W1 ∩ R2 ∩ W3) = P(W1)P(R2 | W1)P(W3 | R2 ∩ W1)
= [w/(w + r)] × [r/(w + r − 1)] × [(w − 1)/(w + r − 2)].

1.17 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:

i) Ω = {s1, s2, ..., sk};
ii) each outcome si is equally likely, so p1 = p2 = ... = pk = 1/k;
iii) event A = {s1, s2, ..., sr} contains r possible outcomes,

then P(A) = r/k = (# outcomes in A) / (# outcomes in Ω).

Example: For a 3-child family, possible outcomes from oldest to youngest are:

Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, ..., s8}.

Let {p1, p2, ..., p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = ... = p8 = 1/8.

Let event A be A = "oldest child is a girl". Then A = {GGG, GGB, GBG, GBB}. Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects. For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, ...

1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices. That is, choice (a, b, c) counts separately from choice (b, a, c). Then

nPr = n(n − 1)(n − 2)...(n − r + 1) = n! / (n − r)!

(n choices for the first object, (n − 1) choices for the second, etc.)

2. Number of Combinations, nCr = (n choose r)

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice. That is, choice (a, b, c) and choice (b, a, c) are the same. Then

nCr = (n choose r) = nPr / r! = n! / [(n − r)! r!]

(because nPr counts each combination r! times, and we only want to count it once: so divide nPr by r!).

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.

Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?

Order of cards is not important, so use combinations. The number of ways of selecting 5 distinct designs from 12 is

12C5 = (12 choose 5) = 12! / [(12 − 5)! 5!] = 792.

(b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?

Looking for P(at least 2 cards the same) = P(A), say. Easiest to find P(all 5 cards are different) = P(Ā).

Number of outcomes in Ā is (# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24. (40 choices for the first card, 36 for the second, because the 4 cards with the first design are excluded, and so on. Note that order matters: e.g. we are counting choice 12345 separately from choice 23154.)

Total number of outcomes is (total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36. (Order mattered above, so we need order to matter here too.)

So P(Ā) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.

Thus P(A) = P(at least 2 cards are the same design) = 1 − P(Ā) = 1 − 0.392 = 0.608.
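Both parts of the great-aunts example are quick to verify; the loop below builds P(all different) factor by factor, matching the ordered-selection argument above (variable names are mine):

```python
from fractions import Fraction
from math import comb

# (a) Choosing 5 different designs from 12, order irrelevant.
n_design_choices = comb(12, 5)        # 792

# (b) 40 cards: 10 designs, 4 copies of each; 5 cards drawn in order.
p_all_different = Fraction(1)
for i in range(5):
    # After i distinct designs have been used, 40 - 4i of the remaining
    # 40 - i cards avoid those designs.
    p_all_different *= Fraction(40 - 4 * i, 40 - i)

p_some_repeat = 1 - p_all_different   # approximately 0.608
```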

Alternative solution if order does not matter on numerator and denominator (much harder method):

P(Ā) = (10 choose 5) × 4^5 / (40 choose 5).

This works because there are (10 choose 5) ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups. The total number of ways of choosing 5 cards from 40 is (40 choose 5).

Exercise: Check that this gives the same answer for P(Ā) as before.

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.

Chapter 2: Discrete Probability Distributions

2.1 Introduction

In the next two chapters we meet several important concepts:

1. Probability distributions, and the probability function fX(x):
• the probability function of a random variable lists the values the random variable can take, and their probabilities.

2. Hypothesis testing:
• I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

3. Likelihood and estimation:
• what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable:
• the expectation of a random variable is the value it takes on average;
• the variance of a random variable measures how much the random variable varies about its average.

5. Change of variable procedures:
• calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X².

6. Modelling:
• we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.

2.2 The probability function, fX(x)

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, e.g. {0, 1, 2, ...}.

Example: Which car? Random experiment: which car does he choose? Random variable: X, giving numbers to the possible outcomes: Ferrari ⇒ X = 1; Porsche ⇒ X = 2; MG ⇒ X = 3.

Definition: The probability function, fX(x), is given by

fX(x) = P(X = x)

for all possible outcomes x of X. The probability function fX(x) lists all possible values of X, and gives a probability to each value.

If he chooses a Ferrari with probability 1/6, a Porsche with probability 1/6, and an MG with probability 4/6, the probability function is:

x:      1    2    3
fX(x):  1/6  1/6  4/6

We write: P(X = 1) = fX(1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.

We can also write the probability function as:

fX(x) = 1/6 if x = 1; 1/6 if x = 2; 4/6 if x = 3; 0 otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then X = 0 with probability 0.5, and X = 1 with probability 0.5. The probability function of X is given by:

x:      0    1
fX(x):  0.5  0.5

or equivalently, fX(x) = 0.5 if x = 0; 0.5 if x = 1; 0 otherwise.

We write (e.g.) fX(0) = 0.5, fX(1) = 0.5, fX(7.5) = 0, etc. fX(x) is just a list of probabilities.

Properties of the probability function

i) 0 ≤ fX(x) ≤ 1 for all x: probabilities are always between 0 and 1.

ii) Σx fX(x) = 1: probabilities add to 1 overall.

iii) P(X ∈ A) = Σ(x ∈ A) fX(x).

E.g. in the car example, P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6. This is the probability of choosing either a Ferrari or a Porsche.

2.3 Bernoulli trials

Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologist and Jean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:

i) each trial has only 2 possible outcomes (usually called "Success" and "Failure");
ii) the probability of success, p, remains constant for all trials;
iii) the trials are independent, i.e. the event "success in trial i" does not depend on the outcome of any other trials.

Examples:
1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.
2) Repeated tossing of a fair die: success = "6", failure = "not 6". Each toss is a Bernoulli trial with P(success) = 1/6.

Definition: The random variable Y is called a Bernoulli random variable if it takes only 2 values, 0 and 1. The probability function is

fY(y) = p if y = 1; 1 − p if y = 0.

That is, P(Y = 1) = P("success") = p, and P(Y = 0) = P("failure") = 1 − p.

2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials each with probability of success = p. Then X has the Binomial distribution with parameters n and p. We write X ~ Bin(n, p), or X ~ Binomial(n, p). Thus X ~ Bin(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

Probability function

If X ~ Binomial(n, p), then the probability function for X is

fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for x = 0, 1, ..., n.

Explanation: An outcome with x successes and (n − x) failures has probability

p^x (1 − p)^(n−x),

where the factor p^x arises because the outcome succeeds x times, each with probability p, and the factor (1 − p)^(n−x) arises because it fails (n − x) times, each with probability (1 − p).

There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select x trials to be our "successes", out of n trials in total. Thus,

P(# successes = x) = (# outcomes with x successes) × (prob. of each such outcome)
                   = (n choose x) p^x (1 − p)^(n−x).

Note: fX(x) = 0 if x ∉ {0, 1, 2, ..., n}.

Check that Σ(x=0 to n) fX(x) = 1:

Σ(x=0 to n) fX(x) = Σ(x=0 to n) (n choose x) p^x (1 − p)^(n−x)
                  = [p + (1 − p)]^n   (Binomial Theorem)
                  = 1^n = 1.

It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.

Example 1: Let X ~ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x:      0       1       2       3       4
fX(x):  0.4096  0.4096  0.1536  0.0256  0.0016

Example 2: Let X be the number of times I get a '6' out of 10 rolls of a fair die.
1. What is the distribution of X?
2. What is the probability that X ≥ 2?

1. X ~ Binomial(n = 10, p = 1/6).
2. P(X ≥ 2) = 1 − P(X < 2) = 1 − P(X = 0) − P(X = 1)
   = 1 − (10 choose 0)(1/6)^0 (1 − 1/6)^10 − (10 choose 1)(1/6)^1 (1 − 1/6)^9
   = 0.515.

Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?

Assume: (i) each child is equally likely to be a boy or a girl; (ii) all children are independent of each other. Then X ~ Binomial(n = 3, p = 0.5).
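The Binomial probability function and Example 2 can be checked with a few lines of Python; this is a sketch, not the notes' own code:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example 1: the n = 4, p = 0.2 table; probabilities sum to 1.
table = [binom_pmf(x, 4, 0.2) for x in range(5)]

# Example 2: number of sixes in 10 rolls of a fair die.
p_at_least_2 = 1 - binom_pmf(0, 10, 1/6) - binom_pmf(1, 10, 1/6)
```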

Shape of the Binomial distribution

The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1.

[Figure: probability functions for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

Sum of independent Binomial random variables: If X and Y are independent, and X ~ Binomial(n, p), Y ~ Binomial(m, p), then X + Y ~ Bin(n + m, p).

This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.

Note: X and Y must both share the same value of p.

2.5 The cumulative distribution function, FX(x)

We have defined the probability function, fX(x), as fX(x) = P(X = x). The probability function tells us everything there is to know about X. The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.

Definition: The (cumulative) distribution function (c.d.f.) is

FX(x) = P(X ≤ x)   for −∞ < x < ∞.

If you are asked to 'give the distribution of X', you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

The distribution function FX(x) as a probability sweeper

The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.

[Figure: probability functions for X ~ Bin(10, 0.5) and X ~ Bin(10, 0.9).]

Example: Let X ~ Binomial(2, 1/2).

x:      0    1    2
fX(x):  1/4  1/2  1/4

Then

FX(x) = P(X ≤ x) = 0                        if x < 0;
                   0.25                     if 0 ≤ x < 1;
                   0.25 + 0.5 = 0.75        if 1 ≤ x < 2;
                   0.25 + 0.5 + 0.25 = 1    if x ≥ 2.

[Figure: the probability function f(x), and the step-shaped distribution function F(x) rising from 1/4 to 3/4 to 1.]

FX(x) gives the cumulative probability up to and including point x. So

FX(x) = Σ(y ≤ x) fX(y).

Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.
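The step values of FX for this example follow directly from summing the probability function; a sketch using exact fractions (my own function names):

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """F_X(x) = P(X <= x): sweep up all probability at or below x."""
    return sum(binom_pmf(y, n, p) for y in range(0, x + 1))

half = Fraction(1, 2)
# X ~ Binomial(2, 1/2): the step function jumps to 1/4, then 3/4, then 1.
cdf_values = [binom_cdf(x, 2, half) for x in (0, 1, 2)]
```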

So FX (b) = FX (a) + P(a < X ≤ b) ⇒ FX (b) − FX (a) = P(a < X ≤ b). . fX (x). fX (x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1) = FX (x) − FX (x − 1). (if X takes integer values) This is why the distribution function FX (x) contains as much information as the probability function. In general: P(a < X ≤ b) = FX (b) − FX (a) Proof: P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b) a X≤b X ≤a a<X≤b b if b > a.60 Reading off probabilities from the distribution function As well as using the probability function to find the distribution function. we can also use the distribution function to find probabilities. because we can use either one to find the other.

Warning: endpoints

Be careful of endpoints and the difference between ≤ and <. For example, P(X < 10) = P(X ≤ 9) = FX (9).

Examples: Let X ∼ Binomial(100, 0.4). In terms of FX (x), what is:
1. P(X ≤ 30)?  FX (30).
2. P(X < 30)?  P(X ≤ 29) = FX (29).
3. P(X > 42)?  1 − P(X ≤ 42) = 1 − FX (42).
4. P(X ≥ 56)?  1 − P(X < 56) = 1 − P(X ≤ 55) = 1 − FX (55).
5. P(50 ≤ X ≤ 60)?  P(X ≤ 60) − P(X ≤ 49) = FX (60) − FX (49).

Properties of the distribution function

1) F(−∞) = P(X ≤ −∞) = 0, and F(+∞) = P(X ≤ +∞) = 1. (These are true because values are strictly between −∞ and ∞.)
2) FX (x) is a non-decreasing function of x: that is, if x1 < x2, then FX (x1) ≤ FX (x2).
3) P(a < X ≤ b) = FX (b) − FX (a) if b > a.
4) F is right-continuous: that is, lim_{h↓0} F(x + h) = F(x).
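As an illustrative sketch (in Python rather than the R used later in these notes), the five endpoint examples can be computed directly from a home-made CDF. The helper name F is hypothetical:

```python
from math import comb

def F(x, n=100, p=0.4):
    # FX(x) = P(X <= x) for X ~ Binomial(100, 0.4); F is an illustrative helper
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(x + 1))

p_le_30 = F(30)          # 1. P(X <= 30)       = FX(30)
p_lt_30 = F(29)          # 2. P(X < 30)        = FX(29): mind the endpoint!
p_gt_42 = 1 - F(42)      # 3. P(X > 42)        = 1 - FX(42)
p_ge_56 = 1 - F(55)      # 4. P(X >= 56)       = 1 - FX(55)
p_50_60 = F(60) - F(49)  # 5. P(50 <= X <= 60) = FX(60) - FX(49)
```

Note that P(X ≤ 30) and P(X < 30) differ by exactly fX (30), as the endpoint warning above says.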

2.6 Hypothesis testing

You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

Example: Weird Coin?

I toss a coin 10 times and get 9 heads. How weird is that?

What is 'weird'?
• Getting 9 heads out of 10 tosses: we'll call this weird.
• Getting 10 heads out of 10 tosses: even more weird!
• Getting 8 heads out of 10 tosses: less weird.
• Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
• Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

Set of weird outcomes

If our coin is fair, the outcomes that are as weird or weirder than 9 heads are: 9 heads, 10 heads, 1 head, 0 heads.

So how weird is 9 heads or worse, if the coin is fair? We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair. Distribution of X, the number of heads, if the coin is fair: X ∼ Binomial(n = 10, p = 0.5).

Probability of observing something at least as weird as 9 heads, if the coin is fair:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0), where X ∼ Binomial(10, 0.5).

For X ∼ Binomial(10, 0.5), we have:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
= C(10,9)(0.5)^9(0.5)^1 + C(10,10)(0.5)^10(0.5)^0 + C(10,1)(0.5)^1(0.5)^9 + C(10,0)(0.5)^0(0.5)^10
= 0.00977 + 0.00098 + 0.00977 + 0.00098
= 0.021.

[Figure: probabilities P(X = x) for X ∼ Binomial(n = 10, p = 0.5), x = 0, ..., 10.]

Is this weird? Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.
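The 0.021 calculation can be checked in a few lines of Python (an illustrative sketch, not part of the original notes; the helper name pmf is hypothetical):

```python
from math import comb

def pmf(x, n=10, p=0.5):
    # P(X = x) for X ~ Binomial(10, 0.5)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# outcomes at least as weird as 9 heads: 9, 10, 1 and 0 heads
p_value = pmf(9) + pmf(10) + pmf(1) + pmf(0)
print(round(p_value, 3))  # 0.021
```

The exact value is 22/1024 ≈ 0.0215, which rounds to the 0.021 quoted in the notes.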

Is the coin fair?

Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more. However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more. We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.

We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair. In fact, this gives us some evidence that the coin is not fair. The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.

Formal hypothesis test

We now formalize the procedure above. Think of the steps:
• We have a question that we want to answer: Is the coin fair?
• There are two alternatives: 1. The coin is fair. 2. The coin is not fair.
• Our observed information is X, the number of heads out of 10 tosses. We write down the distribution of X if the coin is fair: X ∼ Binomial(10, 0.5).
• We calculate the probability of observing something AT LEAST AS EXTREME as our observation, X = 9, if the coin is fair: prob = 0.021.

Null hypothesis and alternative hypothesis

We express the steps above as two competing hypotheses. We use H0 and H1 to denote the null and alternative hypotheses respectively.

Null hypothesis: the first alternative, that the coin IS fair. The null hypothesis is H0 : the coin is fair.

Alternative hypothesis: the second alternative, that the coin is NOT fair. The alternative hypothesis is H1 : the coin is NOT fair.

• The null hypothesis is specific. It specifies an exact distribution for our observation: X ∼ Binomial(10, 0.5).
• The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.

Think of 'null hypothesis' as meaning the 'default': the hypothesis we will accept unless we have a good reason not to. We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.

In hypothesis testing, we often use this same formulation. More precisely, we write:

Number of heads, X ∼ Binomial(10, p), and
H0 : p = 0.5
H1 : p ≠ 0.5.

p-values

In the hypothesis-testing framework above, we measure the strength of evidence against H0 using the p-value. In the example above, the p-value was p = 0.021. That is, the p-value is the probability of observing something AT LEAST AS EXTREME AS OUR OBSERVATION, if the null hypothesis is TRUE. It states that, if H0 is TRUE, we would only have a 2.1% chance of observing something as extreme as 9 heads or tails.

In general, we always measure evidence AGAINST the null hypothesis. This means that SMALL p-values represent STRONG evidence against H0.

Small p-values mean Strong evidence.
Large p-values mean Little evidence.

A p-value of 0.021 represents quite strong evidence against the null hypothesis. Many of us would see this as strong enough evidence to decide that the null hypothesis is not true. That is, we believe that our coin is fair unless we see convincing evidence otherwise, and here we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.

Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5.

Interpreting the hypothesis test

There are different schools of thought about how a p-value should be interpreted.

• Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.

• Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.

• In this course we use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. However, it is worth bearing in mind that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful. This decision should usually be taken in the context of other scientific information.

Statistical significance

You have probably encountered the idea of statistical significance in other courses. Statistical significance refers to the p-value. The result of a hypothesis test is significant at the 5% level if the p-value is less than 0.05. This means that the chance of seeing what we did see (9 heads), or more, is less than 5% if the null hypothesis is true. Saying the test is significant is a quick way of saying that there is evidence against the null hypothesis, usually at the 5% level.

In the coin example, we can say that our test of H0 : p = 0.5 against H1 : p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.

This means:
• we have some evidence that p ≠ 0.5.

It does not mean:
• the difference between p and 0.5 is large, or
• the difference between p and 0.5 is important in practical terms.

Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

Beware! The p-value gives the probability of seeing something as weird as what we did see, if H0 is true. This means that 5% of the time, we will get a p-value < 0.05 EVEN WHEN H0 IS TRUE!! Indeed, about once in every thousand tests, we will get a p-value < 0.001, even though H0 is true! A small p-value does NOT mean that H0 is definitely wrong.

One-sided and two-sided tests

The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads. If we had a good reason, before tossing the coin, to believe that the binomial probability could only be = 0.5 or > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0 : p = 0.5 versus H1 : p > 0.5. This would have the effect of halving the resultant p-value.

2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker. Would you prefer sons? Easy! Just become a US president.

Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

The facts

• The 44 US presidents from George Washington to Barack Obama have had a total of 153 children, comprising 88 sons and only 65 daughters: a sex ratio of 1.4 sons for every daughter.
• Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

Could this happen by chance? Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters? This is the same as the question in Section 2.6.

For the presidents: If I tossed a coin 153 times and got only 65 heads, could I continue to believe that the coin was fair?

For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?

Hypothesis test for the presidents

We set up the competing hypotheses as follows. Let X be the number of daughters out of 153 presidential children. Then X ∼ Binomial(153, p), where p is the probability that each child is a daughter.

Null hypothesis: H0 : p = 0.5.
Alternative hypothesis: H1 : p ≠ 0.5.
p-value: We need the probability of getting a result AT LEAST AS EXTREME as X = 65 daughters, if H0 is true and p really is 0.5.

Which results are at least as extreme as X = 65?

X = 0, 1, 2, ..., 65, for even fewer daughters, and
X = (153 − 65), ..., 153, i.e. X = 88, ..., 153, for too many daughters, because we would be just as surprised if we saw ≤ 65 sons, i.e. ≥ 88 daughters.

[Figure: probabilities for X ∼ Binomial(n = 153, p = 0.5), x = 0 to 153.]

153.5)153 + (0.5) [1] 0. 0.03748079 0.71 Calculating the p-value The p-value for the president problem is given by P(X ≤ 65) + P(X ≥ 88) where X ∼ Binomial(153.02 0. we have two choices: 1. p = 0. . .06 0 20 40 60 80 100 120 140 This gives us the lower-tail p-value only: P(X ≤ 65) = 0. R command for the p-value The R command for calculating the lower-tail p-value for the Binomial(n = 153. + P(X = 153) = 153 153 (0. In R: > 2 * pbinom(65. . 0.5). 0. In principle.5) distribution is pbinom(65.0750.0375 = 0. Multiply the lower-tail p-value by 2: 2 × 0.5). .5)152 + . 153.07496158 .04 0. . 153. + P(X = 65) + P(X = 88) + . 0 1 This would take a lot of calculator time! Instead. To get the overall p-value.00 0. .5)1(0. 0.0375. we use a computer with a package such as R. Typing this in R gives: > pbinom(65.5) [1] 0.5)0(0. we could calculate this as P(X = 0) + P(X = 1) + .

This works because the upper-tail p-value, by definition, is always going to be the same as the lower-tail p-value. The upper tail gives us the probability of finding something equally surprising at the opposite end of the distribution.

2. Calculate the upper-tail p-value explicitly: The upper-tail p-value is

P(X ≥ 88) = 1 − P(X < 88) = 1 − P(X ≤ 87) = 1 − pbinom(87, 153, 0.5).

In R:

> 1-pbinom(87, 153, 0.5)
[1] 0.03748079

The overall p-value is the sum of the lower-tail and the upper-tail p-values:

pbinom(65, 153, 0.5) + 1 - pbinom(87, 153, 0.5) = 0.0375 + 0.0375 = 0.0750.

(Same as before.)

Note: In the R command pbinom(65, 153, 0.5), the order that you enter the numbers 65, 153, and 0.5 is important. If you enter them in a different order, you will get an error. An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the terms in any order.

Note: The R command pbinom is equivalent to the cumulative distribution function for the Binomial distribution:

pbinom(65, 153, 0.5) = P(X ≤ 65) = FX (65) for X ∼ Binomial(153, 0.5).

The overall p-value in this example is 2 × FX (65).
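Readers without R can mimic pbinom with a short Python function. This is an illustrative sketch, not part of the original notes; the function name deliberately mirrors R's:

```python
from math import comb

def pbinom(q, size, prob):
    # lower-tail probability P(X <= q), mirroring R's pbinom(q, size, prob)
    return sum(comb(size, k) * prob**k * (1 - prob)**(size - k)
               for k in range(q + 1))

lower = pbinom(65, 153, 0.5)      # P(X <= 65), about 0.0375
upper = 1 - pbinom(87, 153, 0.5)  # P(X >= 88), equal by symmetry
p_value = lower + upper           # about 0.0750, same as 2 * lower
```

Because the Binomial(153, 0.5) distribution is symmetric, the two tails agree, which is why doubling the lower tail gives the same answer.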

Summary: are presidents more likely to have sons?

Back to our hypothesis test. Recall that X was the number of daughters out of 153 presidential children, and X ∼ Binomial(153, p), where p is the probability that each child is a daughter. The p-value is 2 × FX (65) = 0.075.

What does this mean? The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would only be a 7.5% chance of observing something as unusual as only 65 daughters out of the total 153 children. This is slightly unusual, but not very unusual.

We conclude that there is no real evidence that presidents are more likely to have sons than daughters. The observations are compatible with the possibility that there is no difference. Does this mean presidents are equally likely to have sons and daughters? No: the observations are also compatible with the possibility that there is a difference. We just don't have enough evidence either way.

Hypothesis test for the deep-sea divers

For the deep-sea divers, there were 190 children: 65 sons, and 125 daughters. Let X be the number of sons out of 190 diver children. Then X ∼ Binomial(190, p), where p is the probability that each child is a son.

Note: We could just as easily formulate our hypotheses in terms of daughters instead of sons. Because pbinom is defined as a lower-tail probability, however, it is usually easiest to formulate them in terms of the low result (sons).


Null hypothesis: H0 : p = 0.5.
Alternative hypothesis: H1 : p ≠ 0.5.
p-value: Probability of getting a result AT LEAST AS EXTREME as X = 65 sons, if H0 is true and p really is 0.5.

Results at least as extreme as X = 65 are:
X = 0, 1, 2, ..., 65, for even fewer sons.
X = (190 − 65), ..., 190, for the equally surprising result in the opposite direction (too many sons).
[Figure: probabilities for X ∼ Binomial(n = 190, p = 0.5), x = 0 to 190.]

R command for the p-value

p-value = 2 × pbinom(65, 190, 0.5). Typing this in R gives:

> 2*pbinom(65, 190, 0.5)
[1] 1.603136e-05

This is 0.000016, or a little more than one chance in 100 thousand.
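The same calculation can be cross-checked without R; this Python sketch (illustrative only, not part of the notes) rebuilds pbinom from the Binomial probability function:

```python
from math import comb

def pbinom(q, size, prob):
    # lower-tail P(X <= q) for X ~ Binomial(size, prob), like R's pbinom
    return sum(comb(size, k) * prob**k * (1 - prob)**(size - k)
               for k in range(q + 1))

# two-sided p-value for the divers: twice the lower tail P(X <= 65)
p_value = 2 * pbinom(65, 190, 0.5)
print(p_value)  # roughly 1.6e-05
```

Python's exact integer binomial coefficients make this feasible even for n = 190.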


We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters.

We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.

What next?

p-values are often badly used in science and business. They are regularly treated as the end point of an analysis, after which no more work is needed. Many scientific journals insist that scientists quote a p-value with every set of results, and often only p-values less than 0.05 are regarded as 'interesting'. The outcome is that some scientists do every analysis they can think of until they finally come up with a p-value of 0.05 or less.

A good statistician will recommend a different attitude. It is very rare in science for numbers and statistics to tell us the full story.
Results like the p-value should be regarded as an investigative starting point, rather than the final conclusion. Why is the p-value small? What possible mechanism could there be for producing this result? If you were a medical statistician and you gave me a p-value, I would ask you for a mechanism. Don’t accept that Drug A is better than Drug B only because the p-value says so: find a biochemist who can explain what Drug A does that Drug B doesn’t. Don’t accept that sun exposure is a cause of skin cancer on the basis of a p-value alone: find a mechanism by which skin is damaged by the sun. Why might divers have daughters and presidents have sons? Deep-sea divers are thought to have more daughters than sons because the underwater work at high atmospheric pressure lowers the level of the hormone testosterone in the men’s blood, which is thought to make them more likely to conceive daughters. For the presidents, your guess is as good as mine . . .

2.8 Example: Birthdays and sports professionals Have you ever wondered what makes a professional sports player? Talent? Dedication? Good coaching? Or is it just that they happen to have the right birthday. . . ? The following text is taken from Malcolm Gladwell’s book Outliers. It describes the play-by-play for the first goal scored in the 2007 finals of the Canadian ice hockey junior league for star players aged 17 to 19. The two teams are the Tigers and Giants. There’s one slight difference . . . instead of the players’ names, we’re given their birthdays. March 11 starts around one side of the Tigers’ net, leaving the puck for his teammate January 4, who passes it to January 22, who flips it back to March 12, who shoots point-blank at the Tigers’ goalie, April 27. April 27 blocks the shot, but it’s rebounded by Giants’ March 6. He shoots! Tigers defensemen February 9 and February 14 dive to block the puck while January 10 looks on helplessly. March 6 scores! Notice anything funny? Here are some figures. There were 25 players in the Tigers squad, born between 1986 and 1990. Out of these 25 players, 14 of them were born in January, February, or March. Is it believable that this should happen by chance, or do we have evidence that there is a birthday-effect in becoming a star ice hockey player? Hypothesis test

Let X be the number of the 25 players who are born from January to March.
We need to set up hypotheses of the following form: Null hypothesis: H0 : there is no birthday effect. Alternative hypothesis: H1 : there is a birthday effect.

What is the distribution of X under H0 and under H1?

Under H0, there is no birthday effect. So the probability that each player has a birthday in Jan to March is about 1/4 (3 months out of a possible 12 months), so p = 1/4. Thus the distribution of X under H0 is X ∼ Binomial(25, 1/4).

Under H1, there is a birthday effect, and X ∼ Binomial(25, p) for some p ≠ 0.25.

Our formulation for the hypothesis test is therefore as follows.

Number of Jan to March players, X ∼ Binomial(25, p).

Null hypothesis: H0 : p = 0.25.
Alternative hypothesis: H1 : p ≠ 0.25.

Our observation: The observed proportion of players born from Jan to March is 14/25 = 0.56. This is more than the 0.25 predicted by H0. Is it sufficiently greater than 0.25 to provide evidence against H0? Just using our intuition, we can make a guess, but we might be wrong. The answer also depends on the sample size (25 in this case). We need the p-value to measure the evidence properly.

p-value: Probability of getting a result AT LEAST AS EXTREME as X = 14 Jan to March players, if H0 is true and p really is 0.25.

Results at least as extreme as X = 14 are:
Upper tail: X = 14, 15, ..., 25, for even more Jan to March players.
Lower tail: an equal probability in the opposite direction, for too few Jan to March players.

Note: We do not need to calculate the values corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.7, because we do not have Binomial probability p = 0.5. In fact, the lower tail probability lies somewhere between 0 and 1 player, but it cannot be specified exactly. We get round this problem for calculating the p-value by just multiplying the upper-tail p-value by 2.

[Figure: probabilities for X ∼ Binomial(n = 25, p = 0.25), x = 0 to 25.]

R command for the p-value

We need twice the UPPER-tail p-value:

p-value = 2 × (1 − pbinom(13, 25, 0.25)).

(Recall P(X ≥ 14) = 1 − P(X ≤ 13).) Typing this in R gives:

> 2*(1-pbinom(13, 25, 0.25))
[1] 0.001831663

This p-value is very small. It means that if there really was no birthday effect, we would expect to see results as unusual as 14 out of 25 Jan to March players less than 2 in 1000 times. The data are barely compatible with H0. We conclude that we have strong evidence that there is a birthday effect in this ice hockey team. Something beyond ordinary chance seems to be going on.
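Again the R result can be cross-checked in Python (an illustrative sketch, not part of the notes; pbinom is a home-made stand-in for R's function):

```python
from math import comb

def pbinom(q, size, prob):
    # lower-tail P(X <= q), mirroring R's pbinom
    return sum(comb(size, k) * prob**k * (1 - prob)**(size - k)
               for k in range(q + 1))

# twice the upper-tail probability: P(X >= 14) = 1 - P(X <= 13)
p_value = 2 * (1 - pbinom(13, 25, 0.25))
print(round(p_value, 6))  # about 0.001832
```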

Why should there be a birthday effect?

These data are just one example of a much wider, and astonishingly strong, phenomenon. Professional sports players, not just in ice hockey, but in soccer, baseball, and other sports, have strong birthday clustering. Why? It's because these sports select talented players for age-class star teams at young ages, about 10 years old. In ice hockey, the cut-off date for age-class teams is January 1st. A 10-year-old born in December is competing against players who are nearly a year older, born in January, February, and March. The age difference makes a big difference in terms of size, speed, and physical coordination. Most of the 'talented' players at this age are simply older and bigger. But there then follow years in which they get the best coaching and the most practice. By the time they reach 17, these players really are the best.

2.9 Likelihood and estimation

So far, the hypothesis tests have only told us whether the Binomial probability p might be, or probably isn't, equal to the value specified in the null hypothesis. They have told us nothing about the size, or potential importance, of the departure from H0.

For example, for the deep-sea divers, we found that it would be very unlikely to observe as many as 125 daughters out of 190 children if the chance of having a daughter really was p = 0.5. But what does this say about the actual value of p? Remember the p-value for the test was 0.000016. Do you think that:

1. p could be as close to 0.5 as, say, 0.51?
2. p could be as big as 0.8?

No idea! The p-value does not tell us. The test doesn't even tell us this much: if there was a huge sample size (number of children), we COULD get a p-value as small as 0.000016 even if the true probability was 0.51.

Common sense, however, gives us a hint. Because there were almost twice as many daughters as sons, my guess is that the probability of having a daughter is something close to p = 2/3. We need some way of formalizing this.

Estimation

The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter.

In the case of the deep-sea divers, we wish to estimate the probability p that the child of a diver is a daughter. The common-sense estimate to use is

p = number of daughters / total number of children = 125/190 = 0.658.

However, there are many situations where our common sense fails us. For example, what would we do if we had a regression-model situation (see other courses) and wished to specify an alternative form for p, such as

p = α + β × (diver age)?

How would we estimate the unknown intercept α and slope β, given known information on diver age and number of daughters and sons? We need a general framework for estimation that can be applied to any situation. Probably the most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation.

Likelihood

Likelihood is one of the most important concepts in statistics. Return to the deep-sea diver example. X is the number of daughters out of 190 children. We know that X ∼ Binomial(190, p), and we wish to estimate the value of p. The available data is the observed value of X: X = 125.

Suppose for a moment that p = 0.5. What is the probability of observing X = 125? When X ∼ Binomial(190, 0.5),

P(X = 125) = C(190, 125)(0.5)^125 (1 − 0.5)^(190−125) = 3.97 × 10^−6.

Not very likely!!

What about p = 0.6? What would be the probability of observing X = 125 if p = 0.6? When X ∼ Binomial(190, 0.6),

P(X = 125) = C(190, 125)(0.6)^125 (1 − 0.6)^(190−125) = 0.016.

This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5. So far, we have discovered that it would be thousands of times more likely to observe X = 125 if p = 0.6 than it would be if p = 0.5. This suggests that p = 0.6 is a better estimate than p = 0.5.

You can probably see where this is heading. If p = 0.6 is a better estimate than p = 0.5, what if we move p even closer to our common-sense estimate of 0.658? When X ∼ Binomial(190, 0.658),

P(X = 125) = C(190, 125)(0.658)^125 (1 − 0.658)^(190−125) = 0.061.

This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.

Can we do any better? What happens if we increase p a little more, say to p = 0.7? When X ∼ Binomial(190, 0.7),

P(X = 125) = C(190, 125)(0.7)^125 (1 − 0.7)^(190−125) = 0.028.

This has decreased from the result for p = 0.658, so our observation of 125 is LESS likely under p = 0.7 than under p = 0.658.

Overall, we can plot a graph showing how likely our observation of X = 125 is under each different value of p.

[Figure: P(X = 125) when X ∼ Bin(190, p), plotted against p for 0.50 ≤ p ≤ 0.80.]

The graph reaches a clear maximum. This is a value of p at which the observation X = 125 is MORE LIKELY than at any other value of p. We can see that the maximum occurs somewhere close to our common-sense estimate of p = 0.658. This maximum likelihood value of p is our maximum likelihood estimate.

The likelihood function

Look at the graph we plotted overleaf:

Horizontal axis: the unknown parameter, p.
Vertical axis: the probability of our observation, X = 125, under this value of p.

This function is called the likelihood function. It is a function of the unknown parameter p. For our fixed observation X = 125, the likelihood function shows how LIKELY the observation 125 is for every different value of p. The likelihood function is:

L(p) = P(X = 125) when X ∼ Binomial(190, p)
     = C(190, 125) p^125 (1 − p)^(190−125)
     = C(190, 125) p^125 (1 − p)^65.

This function of p is the curve shown on the graph on page 82.

In general, if our observation were X = x rather than X = 125, the likelihood function is a function of p giving P(X = x) when X ∼ Binomial(190, p). We write:

L(p ; x) = P(X = x) when X ∼ Binomial(190, p)
         = C(190, x) p^x (1 − p)^(190−x).

Difference between the likelihood function and the probability function

The likelihood function is a probability of x, but it is a FUNCTION of p. The likelihood gives the probability of a FIXED observation x, for every possible value of the parameter p. Compare this with the probability function, fX (x), which is the probability of every different value of x, for a FIXED value of p.

[Figure: two panels. Left, the likelihood function P(X = 125) when X ∼ Bin(190, p): a function of p for fixed x (x = 125 here, but it could be anything); gives P(X = x) as p changes. Right, the probability function P(X = x) when p = 0.6: a function of x for fixed p (p = 0.6 here, but it could be anything); gives P(X = x) as x changes.]

Maximizing the likelihood

We have decided that a sensible parameter estimate for p is the maximum likelihood estimate: the value of p at which the observation X = 125 is more likely than at any other value of p. We can find the maximum likelihood estimate using calculus.

The likelihood function is

L(p ; 125) = C(190, 125) p^125 (1 − p)^65.

We wish to find the value of p that maximizes this expression. To find the maximizing value of p, differentiate the likelihood with respect to p (Product Rule):

dL/dp = C(190, 125) [ 125 p^124 (1 − p)^65 + p^125 × 65 (1 − p)^64 × (−1) ]
      = C(190, 125) p^124 (1 − p)^64 [ 125(1 − p) − 65p ]
      = C(190, 125) p^124 (1 − p)^64 (125 − 190p).

The maximizing value of p occurs when dL/dp = 0. This gives:

C(190, 125) p^124 (1 − p)^64 (125 − 190p) = 0
⇒ 125 − 190p = 0
⇒ p = 125/190 = 0.658.

The correct notation for the maximization is:

dL/dp = 0 when p = p̂  ⇒  p̂ = 125/190.

The 'hat' notation for an estimate

It is conventional to write the estimated value of a parameter with a 'hat', like this: p̂. For example, p̂ = 125/190.

For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate (page 80):

p̂ = number of daughters / total number of children = 125/190.

This gives us confidence that the method of maximum likelihood is sensible.

Summary of the maximum likelihood procedure

1. Write down the distribution of X in terms of the unknown parameter:
   X ∼ Binomial(190, p).

2. Write down the observed value of X:
   Observed data: X = 125.

3. Write down the likelihood function for this observed value:
   L(p ; 125) = P(X = 125) when X ∼ Binomial(190, p)
              = C(190, 125) p^125 (1 − p)^65.

4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:

   dL/dp = C(190, 125) p^124 (1 − p)^64 (125 − 190p) = 0 when p = p̂.

   This is the Likelihood Equation.

5. Solve for p̂:
   p̂ = 125/190.
   This is the maximum likelihood estimate (MLE) of p.

Verifying the maximum

Strictly speaking, when we find the maximum likelihood estimate using dL/dp = 0, we should verify that the result is a maximum (rather than a minimum) by showing that

d²L/dp² < 0 when p = p̂.

In Stats 210, we will be relaxed about this. You will usually be told to assume that the MLE exists. Where possible, it is always best to plot the likelihood function, as on page 82. This confirms that the maximum likelihood estimate exists and is unique. In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later).
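The calculus above can also be double-checked numerically. This Python sketch (illustrative only, not part of the notes) confirms that the likelihood for the diver data peaks at p = 125/190:

```python
from math import comb

def L(p, n=190, x=125):
    # likelihood L(p; x) = C(n, x) p^x (1 - p)^(n - x) for the fixed data x = 125
    return comb(n, x) * p**x * (1 - p)**(n - x)

mle = 125 / 190  # the value found by calculus, about 0.658

# the likelihood at the MLE beats the values tried earlier in the notes
assert L(mle) > L(0.5) and L(mle) > L(0.6) and L(mle) > L(0.7)

# a crude grid search over p agrees with the calculus to 3 decimal places
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=L)
print(best)  # 0.658
```

Plotting L(p) over this grid would reproduce the curve on page 82.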

Estimators

For the example above, we had observation X = 125, and the maximum likelihood estimate of p was

p̂ = 125/190.

It is clear that we could follow through the same working with any value of X, which we can write as X = x, and we would obtain

p̂ = x/190.

Exercise: Check this by maximizing the likelihood using x instead of 125.

This means that even before we have made our observation of X, we can provide a RULE for calculating the maximum likelihood estimate once X is observed:

Rule: Let X ∼ Binomial(190, p). Whatever value of X we observe, the maximum likelihood estimate of p will be

p̂ = X/190.

Note that this expression is now a random variable: it depends on the random value of X. A random variable specifying how an estimate is calculated from an observation is called an estimator. In the example above, the maximum likelihood estimaTOR of p is

p̂ = X/190.

The maximum likelihood estimaTE of p, once we have observed that X = x, is

p̂ = x/190.

General maximum likelihood estimator for Binomial(n, p)

Take any situation in which our observation X has the distribution X ∼ Binomial(n, p), where n is KNOWN and p is to be estimated. We make a single observation X = x. Follow the steps on page 86 to find the maximum likelihood estimator for p.

1. Write down the distribution of X in terms of the unknown parameter:
   X ∼ Binomial(n, p). (n is known.)

2. Write down the observed value of X:
   Observed data: X = x.

3. Write down the likelihood function for this observed value:
   L(p ; x) = P(X = x) when X ∼ Binomial(n, p) = C(n, x) p^x (1 − p)^(n−x).

4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
   dL/dp = C(n, x) p^(x−1) (1 − p)^(n−x−1) (x − np) = 0 when p = p̂. (Exercise)

5. Solve for p̂:
   p̂ = x/n.
   This is the maximum likelihood estimate of p.

The maximum likelihood estimator of p is

p̂ = X/n.

(Just replace the x in the MLE with an X, to convert from the estimate to the estimator.)

By deriving the general maximum likelihood estimator for any problem of this sort, we can plug in values of n and x to get an instant MLE for any Binomial(n, p) problem in which n is known.

Example: Recall the president problem in Section 2.7. Out of 153 children, 65 were daughters. Let p be the probability that a presidential child is a daughter. What is the maximum likelihood estimate of p?

Solution: Plug in the numbers n = 153, x = 65: the maximum likelihood estimate is

p̂ = x/n = 65/153 = 0.425.

Note: We showed in Section 2.7 that p was not significantly different from 0.5 in this example. However, the MLE of p is definitely different from 0.5, just by a little. This comes back to the meaning of significantly different in the statistical sense. Saying that p is not significantly different from 0.5 just means that we can't DISTINGUISH any difference between p and 0.5 from routine sampling variability. We expect that p probably IS different from 0.5, just by a little. The maximum likelihood estimate gives us the 'best' estimate of p.

Note: We have only considered the class of problems for which X ∼ Binomial(n, p) and n is KNOWN. If n is not known, we have a harder problem: we have two parameters, and one of them (n) should only take discrete values 1, 2, 3, . . . . We will not consider problems of this type in Stats 210.
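Since the general answer is just p̂ = x/n, the "instant MLE" rule can be captured in a one-line helper. A hypothetical Python sketch (the name `binomial_mle` is ours, not from the notes):

```python
def binomial_mle(x, n):
    # MLE of p when x successes are observed out of n known Binomial trials.
    if not 0 <= x <= n:
        raise ValueError("need 0 <= x <= n")
    return x / n

# President example: 65 daughters out of 153 children.
print(round(binomial_mle(65, 153), 3))   # 0.425

# Deep-sea diver example from earlier: 125 daughters out of 190 children.
print(round(binomial_mle(125, 190), 3))  # 0.658
```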

2.10 Random numbers and histograms

We often wish to generate random numbers from a given distribution. Statistical packages like R have custom-made commands for doing this. To generate (say) 100 random numbers from the Binomial(n = 190, p = 0.6) distribution in R, we use:

rbinom(100, 190, 0.6)

or in long-hand,

rbinom(n=100, size=190, prob=0.6)

Caution: the R inputs n and size are the opposite to what you might expect: n gives the required sample size, and size gives the Binomial parameter n!

Histograms

The usual graph used to visualise a set of random numbers is the histogram. The height of each bar of the histogram shows how many of the random numbers fall into the interval represented by the bar. For example, if each histogram bar covers an interval of length 5, and if 24 of the random numbers fall between 105 and 110, then the height of the histogram bar for the interval (105, 110) would be 24.

Here are histograms from applying the command rbinom(100, 190, 0.6) three different times.

[Figures: three histograms; vertical axis: frequency of x; horizontal axis: x from 80 to 140.]

Each graph shows 100 random numbers from the Binomial(n = 190, p = 0.6) distribution.
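Python's standard library has no rbinom, but a Binomial draw is just a count of successes over n Bernoulli trials, so the R command can be mimicked directly. A sketch (the helper below is our own, not part of the notes):

```python
import random

random.seed(1)  # make the run reproducible

def rbinom(sample_size, size, prob):
    # Mimics R's rbinom(n, size, prob): each draw counts the successes
    # in `size` independent Bernoulli(prob) trials.
    return [sum(random.random() < prob for _ in range(size))
            for _ in range(sample_size)]

sample = rbinom(100, 190, 0.6)
print(len(sample))                           # 100 random numbers
print(min(sample) >= 0, max(sample) <= 190)  # all within 0..190
```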

Note: The histograms above have been specially adjusted so that each histogram bar covers an interval of just one integer. For example, the height of the bar plotted at x = 109 shows how many of the 100 random numbers are equal to 109. In all the histograms above, the sum of the heights of all the bars is 100, because there are 100 observations.

Usually, histogram bars would cover a larger interval, and the histogram would be smoother. For example, on the right is a histogram using the default settings in R, obtained from the command hist(rbinom(100, 190, 0.6)).

[Figure: Histogram of rbinom(100, 190, 0.6); Frequency from 0 to 20; each histogram bar covers an interval of 5 integers.]

Histograms as the sample size increases

Histograms are useful because they show the approximate shape of the underlying probability function. They are also useful for exploring the effect of increasing sample size. The histograms that follow show how the histogram becomes smoother and less erratic as sample size increases. All of them have bars covering an interval of 1 integer. Eventually, with a large enough sample size, the histogram starts to look identical to the probability function.

Note: difference between a histogram and the probability function

The histogram plots OBSERVED FREQUENCIES of a set of random numbers. The probability function plots EXACT PROBABILITIES for the distribution. The histogram should have the same shape as the probability function, especially as the sample size gets large.

Sample size 1000: rbinom(1000, 190, 0.6)
Sample size 10,000: rbinom(10000, 190, 0.6)
Sample size 100,000: rbinom(100000, 190, 0.6)

[Figures: three histograms at each sample size, together with the probability function P(X = x) for X ∼ Binomial(190, 0.6).]

The probability function is fixed and exact. The histograms become stable in shape and approach the shape of the probability function as sample size gets large.

2.11 Expectation

Given a random variable X that measures something, we often want to know what is the average value of X. For example, here are 30 random observations taken from the distribution X ∼ Binomial(n = 190, p = 0.6):

R command: rbinom(30, 190, 0.6)

116 116 117 122 111 112 114 120 112 102
125 116  97 105 108 117 118 111 116 121
107 113 120 114 114 124 116 118 119 120

The average, or mean, of the first ten values is:

(116 + 116 + . . . + 112 + 102) / 10 = 114.2.

The mean of the first twenty values is:

(116 + 116 + . . . + 116 + 121) / 20 = 113.8.

The mean of the first thirty values is:

(116 + 116 + . . . + 119 + 120) / 30 = 114.7.

The answers all seem to be close to 114. What would happen if we took the average of hundreds of values?

100 values from Binomial(190, 0.6):

R command: mean(rbinom(100, 190, 0.6))
Result: 114.86

Note: You will get a different result every time you run this command.

1000 values from Binomial(190, 0.6):

R command: mean(rbinom(1000, 190, 0.6))
Result: 114.02

1 million values from Binomial(190, 0.6):

R command: mean(rbinom(1000000, 190, 0.6))
Result: 114.0001

The average seems to be converging to the value 114. The larger the sample size, the closer the average seems to get to 114. If we kept going for larger and larger sample sizes, we would keep getting answers closer and closer to 114.

This is because 114 is the DISTRIBUTION MEAN: the mean value that we would get if we were able to draw an infinite sample from the Binomial(190, 0.6) distribution. It is a FIXED property of the Binomial(190, 0.6) distribution. This means it is a fixed constant: there is nothing random about it.

This distribution mean is called the expectation, or expected value, of the Binomial(190, 0.6) distribution.

Definition: The expected value, also called the expectation or mean, of a discrete random variable X, can be written as either E(X), or EX, or µX, and is given by

µX = E(X) = Σx x fX(x) = Σx x P(X = x).

The expected value is a measure of the centre, or average, of the set of values that X can take, weighted according to the probability of each value. If we took a very large sample of random numbers from the distribution of X, their average would be approximately equal to µX.
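The behaviour described above — sample averages settling down to the distribution mean 114 — can be sketched with simulated Binomial draws. This uses Python in place of R's mean(rbinom(...)); the helper names are ours.

```python
import random

random.seed(42)

def binom_draw(n=190, p=0.6):
    # One draw from Binomial(n, p): count successes in n Bernoulli(p) trials.
    return sum(random.random() < p for _ in range(n))

averages = {}
for size in (100, 10000):
    averages[size] = sum(binom_draw() for _ in range(size)) / size

# Both averages land close to the distribution mean, 114.
print({k: round(v, 2) for k, v in averages.items()})
```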

Example: Let X ∼ Binomial(n = 190, p = 0.6). What is E(X)?

E(X) = Σ_{x=0}^{190} x P(X = x) = Σ_{x=0}^{190} x (190 choose x) (0.6)^x (0.4)^(190−x).

Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114. We will see why in Section 2.14.

Explanation of the formula for expectation

We will move away from the Binomial distribution for a moment, and use a simpler example. Let the random variable X be defined as

X = 1 with probability 0.9,
    −1 with probability 0.1.

What is the 'average' value of X?

Using (1 + (−1))/2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = −1.

Instead, think of observing X many times, say 100 times. Roughly 90 of these 100 times will have X = 1. Roughly 10 of these 100 times will have X = −1. The average of the 100 values will be roughly

(90 × 1 + 10 × (−1)) / 100 = 0.9 × 1 + 0.1 × (−1)   ( = 0.8 ).

We could repeat this for any sample size.

As the sample gets large, the average of the sample will get ever closer to 0.9 × 1 + 0.1 × (−1). This is why the distribution mean is given by

E(X) = P(X = 1) × 1 + P(X = −1) × (−1),

or in general,

E(X) = Σx P(X = x) × x.

E(X) is a fixed constant giving the average value we would get from a large sample of X.

Linear property of expectation

Expectation is a linear operator:

Theorem 2.11: Let a and b be constants. Then

E(aX + b) = aE(X) + b.

Proof: Immediate from the definition of expectation.

E(aX + b) = Σx (ax + b) fX(x)
          = a Σx x fX(x) + b Σx fX(x)
          = a E(X) + b × 1.

Example: finding expectation from the probability function

Example 1: Let X ∼ Binomial(3, 0.2). Write down the probability function of X and find E(X).

We have:

fX(x) = P(X = x) = (3 choose x) (0.2)^x (0.8)^(3−x)   for x = 0, 1, 2, 3.

x           0      1      2      3
P(X = x)    0.512  0.384  0.096  0.008

Then

E(X) = Σ_{x=0}^{3} x fX(x) = 0 × 0.512 + 1 × 0.384 + 2 × 0.096 + 3 × 0.008 = 0.6.

Note: We have E(X) = 0.6 = 3 × 0.2 for X ∼ Binomial(3, 0.2). We will prove in Section 2.14 that whenever X ∼ Binomial(n, p), then E(X) = np.

Example 2: Let Y be Bernoulli(p) (Section 2.3). That is,

Y = 1 with probability p,
    0 with probability 1 − p.

Find E(Y).

y           0      1
P(Y = y)    1 − p  p

E(Y) = 0 × (1 − p) + 1 × p = p.
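Example 1 can be checked by computing the probability function and the weighted sum directly. A Python sketch (not from the notes):

```python
from math import comb

n, p = 3, 0.2
pf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

# pf matches the table: {0: 0.512, 1: 0.384, 2: 0.096, 3: 0.008}
expectation = sum(x * fx for x, fx in pf.items())
print(round(expectation, 3))   # 0.6, which equals n * p
```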

Expectation of a sum of random variables: E(X + Y)

For ANY random variables X1, X2, . . . , Xn,

E(X1 + X2 + . . . + Xn) = E(X1) + E(X2) + . . . + E(Xn).

In particular, E(X + Y) = E(X) + E(Y) for ANY X and Y. This result holds for any random variables X1, . . . , Xn: it does NOT require X1, . . . , Xn to be independent. We can summarize this important result by saying:

The expectation of a sum is the sum of the expectations – ALWAYS.

The proof requires multivariate methods, to be studied later in the course.

Note: We can combine the result above with the linear property of expectation. For any constants a1, . . . , an, we have:

E(a1 X1 + a2 X2 + . . . + an Xn) = a1 E(X1) + a2 E(X2) + . . . + an E(Xn).

Expectation of a product of random variables: E(XY)

There are two cases when finding the expectation of a product:

1. General case: For general X and Y, E(XY) is NOT equal to E(X)E(Y). We have to find E(XY) either using their joint probability function (see later), or using their covariance (see later).

2. Special case: when X and Y are INDEPENDENT: When X and Y are INDEPENDENT,

E(XY) = E(X)E(Y).

2.12 Variable transformations

We often wish to transform random variables through a function. For example, given the random variable X, possible transformations of X include X², √X, 4X³, and so on. We often summarize all possible variable transformations by referring to Y = g(X) for some function g.

For discrete random variables, it is very easy to find the probability function for Y = g(X), given that the probability function for X is known. Simply change all the values and keep the probabilities the same.

Example 1: Let X ∼ Binomial(3, 0.2), and let Y = X². Find the probability function of Y.

The probability function for X is:

x           0      1      2      3
P(X = x)    0.512  0.384  0.096  0.008

Thus the probability function for Y = X² is:

y           0²     1²     2²     3²
P(Y = y)    0.512  0.384  0.096  0.008

This is because Y takes the value 0² whenever X takes the value 0, and so on. Thus the probability that Y = 0² is the same as the probability that X = 0. Overall, we would write the probability function of Y = X² as:

y           0      1      4      9
P(Y = y)    0.512  0.384  0.096  0.008

To transform a discrete random variable, transform the values and leave the probabilities alone.

Example 2: Mr Chance hires out giant helium balloons for advertising. His balloons come in three sizes: heights 2m, 3m, and 4m. 50% of Mr Chance's customers choose to hire the cheapest 2m balloon, while 30% hire the 3m balloon and 20% hire the 4m balloon. The amount of helium gas in cubic metres required to fill the balloons is h³/2, where h is the height of the balloon.

Let X be the height of balloon ordered by a random customer, and let Y = X³/2 be the amount of helium gas required for a randomly chosen customer. Find the probability function of Y.

The probability function of X is:

height, x (m)   2    3    4
P(X = x)        0.5  0.3  0.2

Transform the values, and leave the probabilities alone. The probability function of Y is:

gas, y (m³)   4    13.5  32
P(Y = y)      0.5  0.3   0.2

Expected value of a transformed random variable

We can find the expectation of a transformed random variable just like any other random variable. For example, in Example 1 we had X ∼ Binomial(3, 0.2) and Y = X². The probability function for X is:

x           0      1      2      3
P(X = x)    0.512  0.384  0.096  0.008

and for Y = X²:

y           0      1      4      9
P(Y = y)    0.512  0.384  0.096  0.008

Thus the expectation of Y = X² is:

E(Y) = E(X²) = 0 × 0.512 + 1 × 0.384 + 4 × 0.096 + 9 × 0.008 = 0.84.

To make the calculation quicker, we could cut out the middle step of writing down the probability function of Y. Because we transform the values and keep the probabilities the same, we have:

E(X²) = 0² × 0.512 + 1² × 0.384 + 2² × 0.096 + 3² × 0.008 = 0.84.

If we write g(X) = X², this becomes:

E{g(X)} = E(X²) = g(0) × 0.512 + g(1) × 0.384 + g(2) × 0.096 + g(3) × 0.008.

Clearly the same arguments can be extended to any function g(X) and any discrete random variable X:

E{g(X)} = Σx g(x) P(X = x).

Transform the values, and leave the probabilities alone.

Definition: For any function g and discrete random variable X, the expected value of g(X) is given by

E{g(X)} = Σx g(x) P(X = x) = Σx g(x) fX(x).

Note: E(X²) is NOT the same as {E(X)}². Check that {E(X)}² = 0.36, whereas E(X²) = 0.84.
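The rule E{g(X)} = Σ g(x) P(X = x) — and the warning that E(X²) is not {E(X)}² — can be sketched as follows (Python; the helper `expect` is our own name, not from the notes):

```python
from math import comb

n, p = 3, 0.2
pf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def expect(g):
    # E{g(X)}: transform the values with g, keep the probabilities the same.
    return sum(g(x) * fx for x, fx in pf.items())

e_x2 = expect(lambda x: x**2)       # E(X^2)
sq_ex = expect(lambda x: x)**2      # {E(X)}^2
print(round(e_x2, 2), round(sq_ex, 2))   # 0.84 0.36 -- not the same!
```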

Example: Recall Mr Chance and his balloon-hire business from page 101. Let X be the height of balloon selected by a randomly chosen customer. The probability function of X is:

height, x (m)   2    3    4
P(X = x)        0.5  0.3  0.2

(a) What is the average amount of gas required per customer?

Gas required was X³/2 from page 101. Average gas per customer is E(X³/2):

E(X³/2) = Σx (x³/2) × P(X = x) = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2 = 12.45 m³ gas.

(b) Mr Chance charges $400×h to hire a balloon of height h. What is his expected earning per customer?

Expected earning is E(400X):

E(400X) = 400 × E(X) = 400 × (2 × 0.5 + 3 × 0.3 + 4 × 0.2) = 400 × 2.7 = $1080 per customer.

(c) How much does Mr Chance expect to earn in total from his next 5 customers?

Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is

E(Z1 + Z2 + . . . + Z5) = E(Z1) + E(Z2) + . . . + E(Z5)   (expectation is linear)
                        = 5 × 1080
                        = $5400.
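Parts (a)–(c) reduce to three weighted sums, which can be sketched as follows (Python; variable names are ours):

```python
# Balloon heights and hire probabilities from the example.
pf = {2: 0.5, 3: 0.3, 4: 0.2}

gas = sum((x**3 / 2) * prob for x, prob in pf.items())    # (a) E(X^3 / 2)
earning = 400 * sum(x * prob for x, prob in pf.items())   # (b) E(400X) = 400 E(X)
total = 5 * earning                                       # (c) linearity over 5 customers

print(round(gas, 2), round(earning, 2), round(total, 2))  # 12.45 1080.0 5400.0
```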

Getting the expectation

Suppose

X = 3 with probability 3/4,
    8 with probability 1/4.

Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8. So

E(X) = (3/4) × 3 + (1/4) × 8:

add up the values times how often they occur.

What about E(√X)?

√X = √3 with probability 3/4,
     √8 with probability 1/4.

Add up the values times how often they occur:

E(√X) = (3/4) × √3 + (1/4) × √8.

Common mistakes:

i) E(√X) = √( (3/4) × 3 + (1/4) × 8 )   Wrong!

ii) E(√X) = (3/4) × 3 + (1/4) × 8   Wrong!

iii) E(√X) = √( (3/4) × 3 ) + √( (1/4) × 8 )   Wrong!

2.13 Variance

Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine?

$50,000? No: $50,000 is the average, near the centre of the distribution. About half the time, the money required will be GREATER than the average.

How much money should Mrs Tractor put in the machine if she wants to be 99% certain that there will be enough for the day's transactions? Answer: it depends how much the amount withdrawn varies above and below its mean.

For questions like this, we need the study of variance. Variance is the average squared distance of a random variable from its own mean.

Definition: The variance of a random variable X is written as either Var(X) or σ²X, and is given by

σ²X = Var(X) = E[ (X − µX)² ] = E[ (X − EX)² ].

Similarly, the variance of a function of X is

Var(g(X)) = E[ { g(X) − E(g(X)) }² ].

Note: The variance is the square of the standard deviation of X, so

sd(X) = √Var(X) = √(σ²X) = σX.

Variance as the average squared distance from the mean

The variance is a measure of how spread out are the values that X can take. It is the average squared distance between a value of X and the central (mean) value, µX.

Var(X) = E[ (X − µX)² ]:

(1) Take the distance from each observed value of X to the central point, µX. Square it to balance positive and negative distances.

(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and µX.

[Figure: possible values x1, . . . , x6 of X on a line, with distances such as x2 − µX and x4 − µX measured from the central value µX.]

Note: The mean, µX, and the variance, σ²X, of X are just numbers: there is nothing random or variable about them.

Example: Let

X = 3 with probability 3/4,
    8 with probability 1/4.

Then

E(X) = µX = 3 × (3/4) + 8 × (1/4) = 4.25,

Var(X) = σ²X = (3/4) × (3 − 4.25)² + (1/4) × (8 − 4.25)² = 4.6875.

When we observe X, we get either 3 or 8: this is random. But µX is fixed at 4.25, and σ²X is fixed at 4.6875, regardless of the outcome of X.

For a discrete random variable,

Var(X) = E[ (X − µX)² ] = Σx (x − µX)² fX(x) = Σx (x − µX)² P(X = x).

This uses the definition of the expected value of a function of X: Var(X) = E(g(X)) where g(X) = (X − µX)².

Note: e.g. E(X²) = Σx x² fX(x) = Σx x² P(X = x). This is not the same as (EX)². For example, let

X = 3 with probability 0.75,
    8 with probability 0.25.

Then µX = EX = 4.25, so µ²X = (EX)² = (4.25)² = 18.0625. But

E(X²) = 3² × (3/4) + 8² × (1/4) = 22.75.

Thus E(X²) ≠ (EX)² in general.

Theorem 2.13A: (important)

Var(X) = E(X²) − (EX)² = E(X²) − µ²X.

Proof:

Var(X) = E[ (X − µX)² ]   (by definition)
       = E[ X² − 2XµX + µ²X ]   (X is a random variable; µX is constant)
       = E(X²) − 2µX E(X) + µ²X   (by Theorem 2.11)
       = E(X²) − 2µ²X + µ²X
       = E(X²) − µ²X.
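Theorem 2.13A can be checked on the two-point example above: the definition E[(X − µX)²] and the shortcut E(X²) − µ²X give the same answer, 4.6875. A Python sketch (not from the notes):

```python
# X = 3 with probability 3/4, 8 with probability 1/4.
pf = {3: 0.75, 8: 0.25}

mean = sum(x * p for x, p in pf.items())                        # 4.25
var_definition = sum((x - mean)**2 * p for x, p in pf.items())  # E[(X - mu)^2]
var_shortcut = sum(x**2 * p for x, p in pf.items()) - mean**2   # E(X^2) - mu^2

print(mean, var_definition, var_shortcut)   # 4.25 4.6875 4.6875
```

All the values here are exact in binary floating point, which is why the two methods agree to the last digit.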

Theorem 2.13B: If a and b are constants and g(x) is a function, then

i) Var(aX + b) = a² Var(X);

ii) Var(a g(X) + b) = a² Var{g(X)}.

Note: These are very different from the corresponding expressions for expectations (Theorem 2.11). Variances are more difficult to manipulate than expectations.

Proof: (part (i))

Var(aX + b) = E[ { (aX + b) − E(aX + b) }² ]
            = E[ { aX + b − aE(X) − b }² ]   (by Theorem 2.11)
            = E[ { aX − aE(X) }² ]
            = E[ a² { X − E(X) }² ]
            = a² E[ { X − E(X) }² ]   (by Theorem 2.11)
            = a² Var(X).

Part (ii) follows similarly.

Example: finding expectation and variance from the probability function

Recall Mr Chance's balloons from page 101. The random variable Y is the amount of gas required by a randomly chosen customer. The probability function of Y is:

gas, y (m³)   4    13.5  32
P(Y = y)      0.5  0.3   0.2

Find Var(Y).

We know that E(Y) = µY = 12.45 from page 103.

First method: use Var(Y) = E[ (Y − µY)² ]:

Var(Y) = (4 − 12.45)² × 0.5 + (13.5 − 12.45)² × 0.3 + (32 − 12.45)² × 0.2 = 112.47.

Second method: use E(Y²) − µ²Y:   (usually easier)

E(Y²) = 4² × 0.5 + 13.5² × 0.3 + 32² × 0.2 = 267.475.

So Var(Y) = 267.475 − (12.45)² = 112.47, as before.

Variance of a sum of random variables: Var(X + Y)

There are two cases when finding the variance of a sum:

1. General case: For general X and Y, Var(X + Y) is NOT equal to Var(X) + Var(Y). We have to find Var(X + Y) using their covariance (see later).

2. Special case: when X and Y are INDEPENDENT: When X and Y are INDEPENDENT,

Var(X + Y) = Var(X) + Var(Y).
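Both methods for Var(Y) can be verified numerically (Python sketch; variable names are ours):

```python
# Gas required y (m^3) and probabilities from the balloon example.
pf = {4.0: 0.5, 13.5: 0.3, 32.0: 0.2}

mu = sum(y * p for y, p in pf.items())                      # 12.45

var_first = sum((y - mu)**2 * p for y, p in pf.items())     # E[(Y - mu)^2]
var_second = sum(y**2 * p for y, p in pf.items()) - mu**2   # E(Y^2) - mu^2

print(round(var_first, 4), round(var_second, 4))   # both 112.4725
```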

Interlude: TRUE or FALSE?

Guess whether each of the following statements is true or false.

1. Toss a fair coin 10 times. The probability of getting 8 or more heads is less than 1%.

2. Toss a fair coin 200 times. The chance of getting a run of at least 6 heads or 6 tails in a row is less than 10%.

3. Consider a classroom with 30 pupils of age 5, and one teacher of age 50. The probability that the pupils all outlive the teacher is about 90%.

4. Open the Business Herald at the pages giving share prices, or open an atlas at the pages giving country areas or populations. Pick a column of figures.

share         last sale
Barnett       143
Advantage I   23
AFFCO         18
Air NZ        52
. . .         . . .

The figures are over 5 times more likely to begin with the digit 1 than with the digit 9.

Answers:
1. FALSE: it is 5.5%.
2. FALSE: it is 97%.
3. FALSE: in NZ the probability is about 50%.
4. TRUE: in fact they are 6.5 times more likely.

2.14 Mean and variance of the Binomial(n, p) distribution

Let X ∼ Binomial(n, p). We have mentioned several times that E(X) = np. We now prove this and the additional result for Var(X).

If X ∼ Binomial(n, p), then:

E(X) = µX = np,
Var(X) = σ²X = np(1 − p).

We often write q = 1 − p, so Var(X) = npq.

Easy proof: X as a sum of Bernoulli random variables

If X ∼ Binomial(n, p), then X is the number of successes out of n independent trials, each with P(success) = p. This means that we can write:

X = Y1 + Y2 + . . . + Yn,

where each

Yi = 1 with probability p,
     0 with probability 1 − p.

That is, Yi counts as a 1 if trial i is a success, and as a 0 if trial i is a failure. Overall, Y1 + . . . + Yn is the total number of successes out of n independent trials, which is the same as X.

Note: Each Yi is a Bernoulli(p) random variable (Section 2.3).

Now if X = Y1 + Y2 + . . . + Yn, and Y1, . . . , Yn are independent, then:

E(X) = E(Y1) + E(Y2) + . . . + E(Yn)   (does NOT require independence),

Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn)   (DOES require independence).

The probability function of each Yi is:

y            0      1
P(Yi = y)    1 − p  p

Thus,

E(Yi) = 0 × (1 − p) + 1 × p = p.

Also,

E(Yi²) = 0² × (1 − p) + 1² × p = p,

so

Var(Yi) = E(Yi²) − (EYi)² = p − p² = p(1 − p).

Therefore:

E(X) = E(Y1) + E(Y2) + . . . + E(Yn) = p + p + . . . + p = n × p.

And:

Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn) = n × p(1 − p).

Thus we have proved that E(X) = np and Var(X) = np(1 − p).
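The results E(X) = np and Var(X) = np(1 − p) can also be confirmed numerically from the probability function itself, for the Binomial(190, 0.6) distribution used throughout this chapter. A Python sketch (not from the notes):

```python
from math import comb

n, p = 190, 0.6
pf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * fx for x, fx in enumerate(pf))
var = sum(x**2 * fx for x, fx in enumerate(pf)) - mean**2

print(round(mean, 4), round(var, 4))   # 114.0 45.6, i.e. np and np(1-p)
```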

Hard proof: for mathematicians (non-examinable)

We show below how the Binomial mean and variance formulae can be derived directly from the probability function.

E(X) = Σ_{x=0}^{n} x fX(x) = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^(n−x) = Σ_{x=0}^{n} x · n! / ((n − x)! x!) · p^x (1 − p)^(n−x).

But x/x! = 1/(x − 1)!, and also the first term x fX(x) is 0 when x = 0. So

E(X) = Σ_{x=1}^{n} n! / ((n − x)! (x − 1)!) · p^x (1 − p)^(n−x).

Next: make n's into (n − 1)'s, and x's into (x − 1)'s, wherever possible: e.g. n! = n(n − 1)!, p^x = p · p^(x−1), n − x = (n − 1) − (x − 1). This gives

E(X) = Σ_{x=1}^{n} n(n − 1)! / ([(n − 1) − (x − 1)]! (x − 1)!) · p · p^(x−1) (1 − p)^((n−1)−(x−1))

     = np Σ_{x=1}^{n} ((n − 1) choose (x − 1)) p^(x−1) (1 − p)^((n−1)−(x−1)).

Here np is what we want; we need to show that the remaining sum equals 1.

Finally we let y = x − 1 and m = n − 1. When x = 1, y = 0; and when x = n, y = n − 1 = m. So, continuing,

E(X) = np Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y)

     = np (p + (1 − p))^m   (Binomial Theorem)

     = np,

as required.

For Var(X), use the same ideas again. For E(X), we used x/x! = 1/(x − 1)!; so instead of finding E(X²), it will be easier to find

E[X(X − 1)] = E(X²) − E(X),

because then we will be able to cancel: x(x − 1)/x! = 1/(x − 2)!.

Here goes:

E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) (n choose x) p^x (1 − p)^(n−x)

            = Σ_{x=2}^{n} x(x − 1) n(n − 1)(n − 2)! / ([(n − 2) − (x − 2)]! (x − 2)! x(x − 1)) · p² p^(x−2) (1 − p)^((n−2)−(x−2)).

(The first two terms, x = 0 and x = 1, are 0 due to the x(x − 1) in the numerator. Note the steps: take out x(x − 1), and replace n by (n − 2) and x by (x − 2) wherever possible.)

Thus, with m = n − 2 and y = x − 2,

E[X(X − 1)] = n(n − 1)p² Σ_{x=2}^{n} ((n − 2) choose (x − 2)) p^(x−2) (1 − p)^((n−2)−(x−2))

            = n(n − 1)p² Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y)

            = n(n − 1)p²,

because the sum equals 1 by the Binomial Theorem. So E[X(X − 1)] = n(n − 1)p². Thus

Var(X) = E(X²) − (E(X))²
       = E(X²) − E(X) + E(X) − (E(X))²
       = E[X(X − 1)] + E(X) − (E(X))²
       = n(n − 1)p² + np − n²p²
       = np(1 − p).

Variance of the MLE for the Binomial p parameter

In Section 2.9 we derived the maximum likelihood estimator for the Binomial parameter p.

Reminder: Take any situation in which our observation X has the distribution X ∼ Binomial(n, p), where n is KNOWN and p is to be estimated. We make a single observation X = x. The maximum likelihood estimator of p is

p̂ = X/n.

In practice, estimates of parameters should always be accompanied by estimates of their variability. For example, in the deep-sea diver example introduced in Section 2.7, we estimated the probability that a diver has a daughter to be

p̂ = X/n = 125/190 = 0.658.

What is our margin of error on this estimate? Do we believe it is 0.658 ± 0.3 (say), in other words almost useless, or do we believe it is very precise, perhaps 0.658 ± 0.02?

We assess the usefulness of estimators using their variance. Given p̂ = X/n for X ∼ Binomial(n, p), we have:

Var(p̂) = Var(X/n)
        = (1/n²) Var(X)
        = (1/n²) × np(1 − p)
        = p(1 − p)/n.   (⋆)

In practice, however, we do not know the true value of p, so we cannot calculate the exact Var(p̂). Instead, we have to ESTIMATE Var(p̂) by replacing the unknown p in equation (⋆) by p̂. We call our estimated variance Vâr(p̂):

Vâr(p̂) = p̂(1 − p̂)/n.

The standard error of p̂ is:

se(p̂) = √Vâr(p̂).

We usually quote the margin of error associated with p̂ as

Margin of error = 1.96 × se(p̂) = 1.96 × √( p̂(1 − p̂)/n ).

This result occurs because the Central Limit Theorem guarantees that p̂ will be approximately Normally distributed in large samples (large n). We will study the Central Limit Theorem in later chapters. The expression p̂ ± 1.96 × se(p̂) gives an approximate 95% confidence interval for p under the Normal approximation.

Example: For the deep-sea diver example, with n = 190 and p̂ = 0.658:

se(p̂) = √( 0.658 × (1 − 0.658) / 190 ) = 0.034.

For our final answer, we should therefore quote:

p̂ = 0.658 ± 1.96 × 0.034 = 0.658 ± 0.067,  or  p̂ = 0.658 (0.591, 0.725).
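The standard-error calculation for the diver example can be reproduced as follows (Python sketch; names are ours). Note that computing from the unrounded p̂ moves the lower endpoint from the notes' 0.591 to 0.590, because the notes round p̂ and the margin first.

```python
from math import sqrt

x, n = 125, 190
p_hat = x / n                             # 0.6579 (4 dp)

se = sqrt(p_hat * (1 - p_hat) / n)        # estimated standard error of p-hat
margin = 1.96 * se                        # approximate 95% margin of error
ci = (p_hat - margin, p_hat + margin)

print(round(p_hat, 3), round(se, 3), round(margin, 3))   # 0.658 0.034 0.067
print(round(ci[0], 3), round(ci[1], 3))                  # about (0.590, 0.725)
```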

Chapter 3: Modelling with Discrete Probability Distributions

In Chapter 2 we introduced several fundamental ideas: hypothesis testing, likelihood, expectation, and variance. Each of these was illustrated by the Binomial distribution. We now introduce several other discrete distributions and discuss their properties and usage. First we revise Bernoulli trials and the Binomial distribution.

Bernoulli Trials

A set of Bernoulli trials is a series of trials such that:

i) each trial has only 2 possible outcomes: Success and Failure;
ii) the probability of success, p, is constant for all trials;
iii) the trials are independent.

Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.

2) Having children: each child can be thought of as a Bernoulli trial with outcomes {girl, boy} and P(girl) = 0.5.

3.1 Binomial distribution

Description: X ∼ Binomial(n, p) if X is the number of successes out of a fixed number n of Bernoulli trials, each with P(success) = p.

Probability function: fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x) for x = 0, 1, . . . , n.

Mean: E(X) = np.

Variance: Var(X) = np(1 − p).

Sum of independent Binomials: If X ∼ Binomial(n, p) and Y ∼ Binomial(m, p), and if X and Y are independent, and if X and Y both share the same parameter p, then

X + Y ∼ Binomial(n + m, p).

Shape: Usually symmetrical unless p is close to 0 or 1. Peaks at approximately np.

[Figures: probability functions for n = 10, p = 0.5 (symmetrical); n = 10, p = 0.9 (skewed for p close to 1); n = 100, p = 0.9 (less skew for p = 0.9 if n is large).]

3.2 Geometric distribution

Like the Binomial distribution, the Geometric distribution is defined in terms of a sequence of Bernoulli trials.

• The Binomial distribution counts the number of successes out of a fixed number of trials.
• The Geometric distribution counts the number of trials before the first success occurs. This means that the Geometric distribution counts the number of failures before the first success.

If every trial has probability p of success, we write: X ∼ Geometric(p).

Examples: 1) X = number of boys before the first girl in a family: X ∼ Geometric(p = 0.5).

2) Fish jumping up a waterfall. On every jump the fish has probability p of reaching the top. Let X be the number of failed jumps before the fish succeeds. Then X ∼ Geometric(p).

Properties of the Geometric distribution

i) Description

X ∼ Geometric(p) if X is the number of failures before the first success in a series of Bernoulli trials with P(success) = p.

ii) Probability function

For X ∼ Geometric(p),

fX(x) = P(X = x) = (1 − p)^x p   for x = 0, 1, 2, . . . .

Explanation:

P(X = x) = (1 − p)^x × p:

we need x failures, and the final trial must be a success.

Difference between Geometric and Binomial: For the Geometric distribution, the trials must always occur in the order F F . . . F S (x failures, then 1 success). For the Binomial distribution, failures and successes can occur in any order: e.g. F F . . . F S, F S F . . . F, F F S F . . . F, etc. This is why the Geometric distribution has probability function

P(x failures, 1 success) = (1 − p)^x p,

while the Binomial distribution has probability function

P(x failures, 1 success) = ((x + 1) choose x) (1 − p)^x p.

iii) Mean and variance

For X ∼ Geometric(p),

E(X) = (1 − p)/p = q/p,

Var(X) = (1 − p)/p² = q/p².
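The mean and variance formulae can be checked against the probability function by truncating the infinite sum far into the tail. A Python sketch (p = 0.25 is our own illustrative choice, not a value from the notes):

```python
p = 0.25          # illustrative value, not from the notes
q = 1 - p

# Geometric pmf: P(X = x) = q^x * p (x failures before the first success).
pf = [q**x * p for x in range(2000)]      # tail beyond 2000 is negligible

mean = sum(x * fx for x, fx in enumerate(pf))
var = sum(x**2 * fx for x, fx in enumerate(pf)) - mean**2

print(round(mean, 6), round(var, 6))   # 3.0 12.0, i.e. q/p and q/p^2
```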


iv) Sum of independent Geometric random variables

If X1, . . . , Xk are independent, and each Xi ∼ Geometric(p), then

X1 + . . . + Xk ∼ Negative Binomial(k, p)   (see later).

v) Shape

Geometric probabilities are always greatest at x = 0. The distribution always has a long right tail (positive skew). The length of the tail depends on p. For small p, there could be many failures before the first success, so the tail is long. For large p, a success is likely to occur almost immediately, so the tail is short.
[Figures: Geometric probability functions for p = 0.3 (small p), p = 0.5 (moderate p), and p = 0.9 (large p), each plotted for x = 0, 1, . . . , 10.]

vi) Likelihood

For any random variable, the likelihood function is just the probability function expressed as a function of the unknown parameter. If:

• X ∼ Geometric(p);
• p is unknown;
• the observed value of X is x;

then the likelihood function is:

L(p ; x) = p(1 − p)^x

for 0 < p < 1.

Example: we observe a fish making 5 failed jumps before reaching the top of a waterfall. We wish to estimate the probability of success for each jump.

Then L(p ; 5) = p(1 − p)^5 for 0 < p < 1. Maximize L with respect to p to find the MLE, p̂.
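Maximizing L(p ; 5) = p(1 − p)^5 numerically confirms the calculus answer, which works out to p̂ = 1/(1 + 5) = 1/6 — i.e. one success in the six observed jumps. A Python sketch (not from the notes):

```python
def L(p, x=5):
    # Geometric likelihood for x = 5 observed failures: L(p) = p * (1 - p)^x
    return p * (1 - p)**x

grid = [i / 100000 for i in range(1, 100000)]
p_hat = max(grid, key=L)

print(round(p_hat, 4))   # 0.1667, close to the analytic MLE 1/6
```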


For mathematicians: proof of Geometric mean and variance formulae (non-examinable) We wish to prove that E(X) = We use the following results:
∞ x=1 1−p p

and Var(X) =

1−p p2

when X ∼ Geometric(p).

xq x−1 =

1 (1 − q)2 2 (1 − q)3

(for |q| < 1),

(3.1)

and

∞ x=2

x(x − 1)q x−2 =

(for |q| < 1).

(3.2)

Proof of (3.1) and (3.2): Consider the infinite sum of a geometric progression:
∞ x=0

qx =

1 1−q d dq 1 1−q

(for |q| < 1).

Differentiate both sides with respect to q: d dq
∞ x=0 ∞ x=1 ∞ x=0

qx

=

d x 1 (q ) = dq (1 − q)2 xq x−1 = 1 , (1 − q)2 as stated in (3.1).

Note that the lower limit of the summation becomes x = 1 because the term for x = 0 vanishes. The proof of (3.2) is obtained similarly, by differentiating both sides of (3.1) with respect to q (Exercise).

123

Now we can find E(X) and Var(X).

E(X) = Σ_{x=0}^∞ x P(X = x)
     = Σ_{x=0}^∞ x p q^x                  (where q = 1 − p)
     = p q Σ_{x=1}^∞ x q^(x−1)            (lower limit becomes x = 1 because the term at x = 0 is zero)
     = p q × 1/(1 − q)^2                  (by equation (3.1))
     = p q × 1/p^2                        (because 1 − q = p)
     = q/p,

as required.

For Var(X), we use

Var(X) = E(X^2) − (EX)^2 = E{X(X − 1)} + E(X) − (EX)^2.    (⋆)

Now

E{X(X − 1)} = Σ_{x=0}^∞ x(x − 1) P(X = x)
            = Σ_{x=0}^∞ x(x − 1) p q^x        (where q = 1 − p)
            = p q^2 Σ_{x=2}^∞ x(x − 1) q^(x−2)  (note that terms below x = 2 vanish)
            = p q^2 × 2/(1 − q)^3              (by equation (3.2))
            = 2q^2/p^2.

Thus by (⋆),

Var(X) = 2q^2/p^2 + q/p − (q/p)^2 = q^2/p^2 + q/p = q(q + p)/p^2 = q/p^2,

as required, because q + p = 1.
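For interest (not from the notes), the Geometric mean and variance formulae can be verified numerically by truncating the infinite sums:

```python
p = 0.3
q = 1 - p

# Truncate the infinite sums; q**x is negligible long before x = 2000.
mean = sum(x * p * q**x for x in range(2000))
second_moment = sum(x * x * p * q**x for x in range(2000))
var = second_moment - mean**2

# Expect mean ≈ q/p = 2.3333 and var ≈ q/p^2 = 7.7778.
print(round(mean, 4), round(var, 4))
```

The choice p = 0.3 is arbitrary; any 0 < p < 1 gives the same agreement with q/p and q/p².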

3.3 Negative Binomial distribution

The Negative Binomial distribution is a generalised form of the Geometric distribution:
• the Geometric distribution counts the number of failures before the first success;
• the Negative Binomial distribution counts the number of failures before the k'th success.

If X is the number of failures before the k'th success in a series of Bernoulli trials, where every trial has probability p of success, we write: X ∼ NegBin(k, p).

Examples:
1) X = number of boys before the second girl in a family: X ∼ NegBin(k = 2, p = 0.5).
2) Tom needs to pass 24 papers to complete his degree. He passes each paper with probability p, independently of all other papers. Let X be the number of papers Tom fails in his degree. Then X ∼ NegBin(24, p).

Properties of the Negative Binomial distribution

i) Description

X ∼ NegBin(k, p) if X is the number of failures before the k'th success in a series of Bernoulli trials with P(success) = p.

ii) Probability function

For X ∼ NegBin(k, p),

fX(x) = P(X = x) = (k+x−1 choose x) p^k (1 − p)^x    for x = 0, 1, 2, ...

Explanation:
• For X = x, we need x failures and k successes.
• The trials stop when we reach the k'th success, so the last trial must be a success.
• This leaves x failures and k − 1 successes to occur in any order: a total of k − 1 + x trials.

For example, if x = 3 failures and k = 2 successes, we could have:
F F F S S,  F F S F S,  F S F F S,  S F F F S.

So:
P(X = x) = (k+x−1 choose x) × p^k × (1 − p)^x,
where (k+x−1 choose x) counts the arrangements of (k − 1) successes and x failures among the first (k − 1 + x) trials, p^k is for the k successes, and (1 − p)^x is for the x failures.

iii) Mean and variance

For X ∼ NegBin(k, p),
E(X) = k(1 − p)/p = kq/p,
Var(X) = k(1 − p)/p^2 = kq/p^2.

These results can be proved from the fact that the Negative Binomial distribution is obtained as the sum of k independent Geometric random variables:
X = Y1 + ... + Yk, where each Yi ∼ Geometric(p), Yi independent.
⇒ E(X) = k E(Yi) = kq/p,  Var(X) = k Var(Yi) = kq/p^2.

iv) Sum of independent Negative Binomial random variables

If X and Y are independent, and X ∼ NegBin(k, p), Y ∼ NegBin(m, p) (with the same value of p), then X + Y ∼ NegBin(k + m, p).

v) Shape

The Negative Binomial is flexible in shape. Below are the probability functions for various different values of k and p.

[Figure: Negative Binomial probability functions for several (k, p) combinations, e.g. k = 3 and k = 10 with various p.]

vi) Likelihood

As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If:
• X ∼ NegBin(k, p);
• k is known;
• p is unknown;
• the observed value of X is x;
then the likelihood function is:

L(p ; x) = (k+x−1 choose x) p^k (1 − p)^x    for 0 < p < 1.

Maximize L with respect to p to find the MLE, p̂.

Example: Tom fails a total of 4 papers before finishing his degree. What is his pass probability for each paper?

X = # failed papers before 24 passed papers: X ∼ NegBin(24, p).
Observation: X = 4 failed papers.
Likelihood:
L(p ; 4) = (24+4−1 choose 4) p^24 (1 − p)^4 = (27 choose 4) p^24 (1 − p)^4    for 0 < p < 1.
Maximize L with respect to p to find the MLE, p̂.
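For interest (not from the notes): Tom's likelihood can be maximized numerically. Setting d/dp log L = k/p − x/(1 − p) = 0 gives p̂ = k/(k + x) = 24/28 ≈ 0.857, and a grid search agrees:

```python
from math import comb

def negbin_likelihood(p, k, x):
    """L(p; x) = C(k+x-1, x) * p**k * (1-p)**x."""
    return comb(k + x - 1, x) * p**k * (1 - p)**x

# Tom's example: k = 24 passes needed, x = 4 failures observed.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=lambda p: negbin_likelihood(p, 24, 4))

# Analytic MLE is k/(k + x) = 24/28 ≈ 0.8571.
print(round(p_hat, 4))
```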

3.4 Hypergeometric distribution: sampling without replacement

The Hypergeometric distribution is used when we are sampling without replacement from a finite population.

i) Description

Suppose we have N objects:
• M of the N objects are special;
• the other N − M objects are not special.
We remove n objects at random without replacement. Let X = number of the n removed objects that are special. Then X ∼ Hypergeometric(N, M, n).

Example: Ron has a box of Chocolate Frogs. There are 20 chocolate frogs in the box. Eight of them are dark chocolate, and twelve of them are white chocolate. Ron grabs a random handful of 5 chocolate frogs and stuffs them into his mouth when he thinks that no-one is looking. Let X be the number of dark chocolate frogs he picks. Then X ∼ Hypergeometric(N = 20, M = 8, n = 5).

ii) Probability function

For X ∼ Hypergeometric(N, M, n),

fX(x) = P(X = x) = (M choose x)(N−M choose n−x) / (N choose n)

for x = max(0, n + M − N) to x = min(n, M).

Explanation: We need to choose x special objects and n − x other objects.
• Number of ways of selecting x special objects from the M available: (M choose x).
• Number of ways of selecting n − x other objects from the N − M available: (N−M choose n−x).
• Total number of ways of choosing x special objects and (n − x) other objects: (M choose x) × (N−M choose n−x).
• Overall number of ways of choosing n objects from N: (N choose n).
Thus:
P(X = x) = number of desired ways / total number of ways = (M choose x)(N−M choose n−x) / (N choose n).

Note: We need 0 ≤ x ≤ M (number of special objects), and 0 ≤ n − x ≤ N − M (number of other objects). After some working, this gives us the stated constraint that x = max(0, n + M − N) to x = min(n, M).

Example: What is the probability that Ron selects 3 white and 2 dark chocolates?

X = # dark chocolates. There are N = 20 chocolates, including M = 8 dark chocolates. We need

P(X = 2) = (8 choose 2)(12 choose 3) / (20 choose 5) = (28 × 220)/15504 = 0.397.

Same answer as Tutorial 3 Q2(c).

iii) Mean and variance

For X ∼ Hypergeometric(N, M, n),
E(X) = np,
Var(X) = np(1 − p) (N − n)/(N − 1),    where p = M/N.
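For interest (not from the notes), the chocolate-frog calculation can be reproduced directly from the counting formula:

```python
from math import comb

def hypergeom_pmf(x, N, M, n):
    """P(X = x) for X ~ Hypergeometric(N, M, n): sampling without replacement."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Ron's example: N = 20 frogs, M = 8 dark, n = 5 drawn; 2 dark selected.
prob = hypergeom_pmf(2, 20, 8, 5)
print(round(prob, 3))  # ≈ 0.397
```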

iv) Shape

The Hypergeometric distribution is similar to the Binomial distribution when n/N is small, because removing n objects does not change the overall composition of the population very much when n/N is small.

[Figure: Hypergeometric(30, 12, 10) and Binomial(10, 12/30) probability functions compared.]

As noted above, for n/N < 0.1 we often approximate the Hypergeometric(N, M, n) distribution by the Binomial(n, p = M/N) distribution:
Hypergeometric(N, M, n) → Binomial(n, M/N) as N → ∞.

Note: The Hypergeometric distribution can be used for opinion polls, because these involve sampling without replacement from a finite population. The Binomial distribution is used when the population is sampled with replacement.

A note about distribution names

Discrete distributions often get their names from mathematical power series.
• Binomial probabilities sum to 1 because of the Binomial Theorem:
(p + (1 − p))^n = <sum of Binomial probabilities> = 1.
• Negative Binomial probabilities sum to 1 by the Negative Binomial expansion: i.e. the Binomial expansion with a negative power, −k:
p^k (1 − (1 − p))^(−k) = <sum of NegBin probabilities> = 1.
• Geometric probabilities sum to 1 because they form a Geometric series:
Σ_{x=0}^∞ p(1 − p)^x = <sum of Geometric probabilities> = 1.
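For interest (not from the notes): the Binomial approximation can be checked by comparing the two probability functions. The parameters below (N = 1000, M = 400, n = 10, so n/N = 0.01) are chosen for illustration; with n/N this small the two pmfs nearly coincide.

```python
from math import comb

def hyper_pmf(x, N, M, n):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Largest pointwise difference between Hypergeometric(1000, 400, 10)
# and the approximating Binomial(10, 0.4).
diff = max(abs(hyper_pmf(x, 1000, 400, 10) - binom_pmf(x, 10, 0.4))
           for x in range(11))
print(diff < 0.01)  # the approximation error is tiny
```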

3.5 Poisson distribution

When is the next volcano due to erupt in Auckland? Any moment now, unfortunately! (give or take 1000 years or so...) A volcano could happen in Auckland this afternoon, or it might not happen for another 1000 years. Volcanoes are almost impossible to predict: they seem to happen completely at random. A distribution that counts the number of random events in a fixed space of time is the Poisson distribution.

Example: Let X be the number of road accidents in a year in New Zealand. Suppose that:
i) all accidents are independent of each other;
ii) accidents occur at a constant average rate of λ per year;
iii) accidents cannot occur simultaneously.
Then the number of accidents in a year, X, has the distribution X ∼ Poisson(λ).

How many road accidents will there be in NZ this year? X ∼ Poisson.
How many volcanoes will erupt over the next 1000 years? X ∼ Poisson.
How many cars will cross the Harbour Bridge today? X ∼ Poisson.

Poisson Process

The Poisson distribution arose from a mathematical formulation called the Poisson Process, published by Siméon-Denis Poisson in 1837. The Poisson process counts the number of events occurring in a fixed time or space, when events occur independently and at a constant average rate.

Number of accidents in one year

Let X be the number of accidents to occur in one year: X ∼ Poisson(λ). The probability function for X ∼ Poisson(λ) is

P(X = x) = (λ^x / x!) e^(−λ)    for x = 0, 1, 2, ...

Number of accidents in t years

Let Xt be the number of accidents to occur in time t years. Then Xt ∼ Poisson(λt), and

P(Xt = x) = ((λt)^x / x!) e^(−λt)    for x = 0, 1, 2, ...

General definition of the Poisson process

Take any sequence of random events such that:
i) all events are independent;
ii) events occur at a constant average rate of λ per unit time;
iii) events cannot occur simultaneously.
Let Xt be the number of events to occur in time t. Then Xt ∼ Poisson(λt), and

P(Xt = x) = ((λt)^x / x!) e^(−λt)    for x = 0, 1, 2, ...

Note: For a Poisson process in space, let XA = # events in area of size A. Then XA ∼ Poisson(λA). Example: XA = number of raisins in a volume A of currant bun.

Where does the Poisson formula come from? (Sketch idea, for mathematicians: non-examinable)

The formal definition of the Poisson process is as follows.

Definition: The random variables {Xt : t > 0} form a Poisson process with rate λ if:
i) events occurring in any time interval are independent of those occurring in any other disjoint time interval;
ii) lim_{δt↓0} P(exactly one event occurs in time interval [t, t + δt]) / δt = λ;
iii) lim_{δt↓0} P(more than one event occurs in time interval [t, t + δt]) / δt = 0.

These conditions can be used to derive a partial differential equation on a function known as the probability generating function of Xt. The partial differential equation is solved to provide the form P(Xt = x) = (λt)^x e^(−λt) / x!.

Poisson distribution

The Poisson distribution is not just used in the context of the Poisson process. It is also used in many other situations, often as a subjective model (see Section 3.6). Its properties are as follows.

i) Probability function

For X ∼ Poisson(λ),
fX(x) = P(X = x) = (λ^x / x!) e^(−λ)    for x = 0, 1, 2, ...
The parameter λ is called the rate of the Poisson distribution.

ii) Mean and variance

The mean and variance of the Poisson(λ) distribution are both λ:
E(X) = Var(X) = λ when X ∼ Poisson(λ).

Notes:
1. It makes sense that E(X) = λ: by definition, λ is the average number of events per unit time in the Poisson process.
2. The variance of the Poisson distribution increases with the mean (in fact, variance = mean). This is often the case in real life: there is more uncertainty associated with larger numbers than with smaller numbers.

iii) Sum of independent Poisson random variables

If X and Y are independent, and X ∼ Poisson(λ), Y ∼ Poisson(µ), then X + Y ∼ Poisson(λ + µ).

iv) Shape

The shape of the Poisson distribution depends upon the value of λ. For small λ, the distribution has positive (right) skew. As λ increases, the distribution becomes more and more symmetrical, until for large λ it has the familiar bell-shaped appearance.

[Figure: Poisson probability functions for λ = 1, λ = 3.5 and λ = 100.]

v) Likelihood

As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If:
• X ∼ Poisson(λ);
• λ is unknown;
• the observed value of X is x;
then the likelihood function is:

L(λ ; x) = (λ^x / x!) e^(−λ)    for 0 < λ < ∞.

Example: 28 babies were born in Mt Roskill yesterday. Estimate λ.

Let X = # babies born in a day in Mt Roskill. Assume that X ∼ Poisson(λ).
Observation: X = 28 babies.
Likelihood:
L(λ ; 28) = (λ^28 / 28!) e^(−λ)    for 0 < λ < ∞.
Maximize L with respect to λ to find the MLE, λ̂. We find that λ̂ = x = 28.

Similarly, the maximum likelihood estimator of λ is λ̂ = X. Thus the estimator variance is Var(λ̂) = Var(X) = λ, because X ∼ Poisson(λ). Because we don't know λ, we have to estimate the variance: V̂ar(λ̂) = λ̂.

For mathematicians: proof of Poisson mean and variance formulae (non-examinable)

We wish to prove that E(X) = Var(X) = λ for X ∼ Poisson(λ). For X ∼ Poisson(λ), the probability function is fX(x) = (λ^x / x!) e^(−λ) for x = 0, 1, 2, ...
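For interest (not from the notes): the Mt Roskill estimate λ̂ = x = 28 can be confirmed by maximizing the log-likelihood numerically over a grid of λ values:

```python
import math

def poisson_loglik(lam, x):
    """log L(lam; x) = x*log(lam) - lam - log(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

# Grid search over lambda in (0, 100) for the observation x = 28.
grid = [i / 10 for i in range(1, 1000)]
lam_hat = max(grid, key=lambda lam: poisson_loglik(lam, 28))

# d/dλ log L = x/λ - 1 = 0 gives λ̂ = x = 28.
print(lam_hat)  # 28.0
```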

So

E(X) = Σ_{x=0}^∞ x fX(x) = Σ_{x=0}^∞ x (λ^x / x!) e^(−λ)
     = Σ_{x=1}^∞ λ^x e^(−λ) / (x − 1)!          (note that the term for x = 0 is 0)
     = λ Σ_{x=1}^∞ λ^(x−1) e^(−λ) / (x − 1)!    (writing everything in terms of x − 1)
     = λ Σ_{y=0}^∞ (λ^y / y!) e^(−λ)            (putting y = x − 1)
     = λ,                                        (because the sum = 1: sum of Poisson probabilities)

so E(X) = λ, as required.

For Var(X), we use:

Var(X) = E(X^2) − (EX)^2 = E[X(X − 1)] + E(X) − (EX)^2 = E[X(X − 1)] + λ − λ^2.

But

E[X(X − 1)] = Σ_{x=0}^∞ x(x − 1) (λ^x / x!) e^(−λ)
            = Σ_{x=2}^∞ λ^x e^(−λ) / (x − 2)!          (terms for x = 0 and x = 1 are 0)
            = λ^2 Σ_{x=2}^∞ λ^(x−2) e^(−λ) / (x − 2)!  (writing everything in terms of x − 2)
            = λ^2 Σ_{y=0}^∞ (λ^y / y!) e^(−λ)          (putting y = x − 2)
            = λ^2.

So Var(X) = E[X(X − 1)] + λ − λ^2 = λ^2 + λ − λ^2 = λ, as required.

3.6 Subjective modelling

Most of the distributions we have talked about in this chapter are exact models for the situation described. For example, the Binomial distribution describes exactly the distribution of the number of successes in n Bernoulli trials. However, there is often no exact model available. If so, we will use a subjective model. In a subjective model, we pick a probability distribution to describe a situation just because it has properties that we think are appropriate to the situation, such as the right sort of symmetry or skew, or the right sort of relationship between variance and mean.

Example: Distribution of word lengths for English words.

Let X = number of letters in an English word chosen at random from the dictionary. If we plot the frequencies on a barplot, we see that the shape of the distribution is roughly Poisson. We need to use X ∼ 1 + Poisson because X cannot take the value 0.

English word lengths: X − 1 ∼ Poisson(6.22)

[Figure: barplot of word lengths from 25109 English words, with fitted Poisson probabilities overlaid.]

The Poisson probabilities (with λ estimated by maximum likelihood) are plotted as points overlaying the barplot. The fit of the Poisson distribution is quite good.

In this example we can not say that the Poisson distribution represents the number of events in a fixed time or space: instead, it is being used as a subjective model for word length.

Can a Poisson distribution fit any data? The answer is no: in fact the Poisson distribution is very inflexible.

Here are stroke counts from 13061 Chinese characters. X is the number of strokes in a randomly chosen character. The best-fitting Poisson distribution (found by MLE) is overlaid. The fit of the Poisson distribution is awful.

[Figure: barplot of stroke counts from 13061 Chinese characters, with the best Poisson fit overlaid.]

It turns out, however, that the Chinese stroke distribution is well-described by a Negative Binomial model. The best-fitting Negative Binomial distribution is NegBin(k = 23.7, p = 0.64), found by MLE.

[Figure: the best-fitting Negative Binomial distribution overlaid on the stroke counts.]

The fit is very good. However, X does not represent the number of failures before the k'th success: the NegBin is a subjective model.

Chapter 4: Continuous Random Variables

4.1 Introduction

When Mozart performed his opera Die Entführung aus dem Serail, the Emperor Joseph II responded wryly, 'Too many notes, Mozart!' In this chapter we meet a different problem: too many numbers!

We have met discrete random variables, for which we can list all the values and their probabilities, even if the list is infinite: e.g. for X ∼ Geometric(p),

x:                   0    1    2     ...
fX(x) = P(X = x):    p    pq   pq^2  ...

But suppose that X takes values in a continuous set, e.g. [0, ∞) or (0, 1). A continuous random variable takes values in a continuous interval (a, b). It describes a continuously varying quantity such as time or height.

For example, how would you list all the numbers in the interval [0, 1]?
• the smallest number is 0, but what is the next smallest? 0.01? 0.0001? 0.0000000001? We just end up talking nonsense.
We can't even begin to list all the values that X can take. In fact, there are so many numbers in any continuous set that each of them must have probability 0. If there was a probability > 0 for all the numbers in a continuous set, however 'small', there simply wouldn't be enough probability to go round.

When X is continuous, P(X = x) = 0 for ALL x. The probability function is meaningless.

Although we cannot assign a probability to any value of X, we are able to assign probabilities to intervals: e.g. P(X = 1) = 0, but P(0.999 ≤ X ≤ 1.001) can be > 0. This means we should use the distribution function, FX(x) = P(X ≤ x).

The cumulative distribution function, FX(x)

Recall that for discrete random variables:
• FX(x) = P(X ≤ x);
• FX(x) is a step function: probability accumulates in discrete steps;
• P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) − F(a).

For a continuous random variable:
• FX(x) = P(X ≤ x);
• FX(x) is a continuous function: probability accumulates continuously;
• as before, P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) − F(a).

However, for a continuous random variable, P(X = a) = 0. So it makes no difference whether we say P(a < X ≤ b) or P(a ≤ X ≤ b). For a continuous random variable:

P(a < X < b) = P(a ≤ X ≤ b) = FX(b) − FX(a).

Endpoints are not important for continuous r.v.s.

This is not true for a discrete random variable: in fact, for a discrete random variable with values 0, 1, 2, ...,

P(a < X < b) = P(a + 1 ≤ X ≤ b − 1) = FX(b − 1) − FX(a).

Endpoints are very important for discrete r.v.s.

4.2 The probability density function

Although the cumulative distribution function gives us an interval-based tool for dealing with continuous random variables, it is not very good at telling us what the distribution looks like. For this we use a different tool called the probability density function.

The probability density function (p.d.f.) is the best way to describe and recognise a continuous random variable. Just like a cell-phone for keeping in touch, the cumulative distribution function is a tool for facilitating our interactions with the continuous random variable. We use it all the time to calculate probabilities and to gain an intuitive feel for the shape and nature of the distribution. Using the p.d.f. is like recognising your friends by their faces. You can chat on the phone, write emails or send txts to each other all day, but you never really know a person until you've seen their face. Similarly, we never really understand the random variable until we've seen its 'face': the probability density function. Surprisingly, it is quite difficult to describe exactly what the probability density function is. In this section we take some time to motivate and describe this fundamental idea.

All-time top-ten 100m sprint times

The histogram below shows the best 10 sprint times from the 168 all-time top male 100m sprinters. There are 1680 times in total, representing the top 10 times up to 2002 from each of the 168 sprinters. Out of interest, here are the summary statistics:

Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
9.78   10.08     10.15    10.14   10.21     10.41

[Figure: histogram of the 1680 sprint times (seconds), from about 9.8 to 10.4.]

We could plot this histogram using different time intervals:

[Figure: histograms of the sprint times with bar widths 0.1s, 0.05s, 0.02s and 0.01s.]

We see that each histogram has broadly the same shape, although the heights of the bars and the interval widths are different.

The histograms tell us the most intuitive thing we wish to know about the distribution: its shape:
• the most probable times are close to 10.2 seconds;
• the distribution of times has a long left tail (left skew);
• times below 10.0 seconds and above 10.3 seconds have low probability.

We could fit a curve over any of these histograms to show the desired shape, but the problem is that the histograms are not standardized:
• every time we change the interval width, the heights of the bars change.

How can we derive a curve or function that captures the common shape of the histograms, but keeps a constant height? What should that height be?

The standardized histogram

We now focus on an idealized (smooth) version of the sprint times distribution, rather than using the exact 1680 sprint times observed. We are aiming to derive a curve, or function, that captures the shape of the histograms, but will keep the same height for any choice of histogram bar width.

First idea: plot the probabilities instead of the frequencies. The height of each histogram bar now represents the probability of getting an observation in that bar.

[Figure: probability histograms of the sprint times for various bar widths.]

This doesn't work, because the height (probability) still depends upon the bar width. Wider bars have higher probabilities.

Second idea: plot the probabilities divided by bar width. The height of each histogram bar now represents the probability of getting an observation in that bar, divided by the width of the bar.

[Figure: standardized histograms (probability / interval width) for bar widths 0.1s, 0.05s, 0.02s and 0.01s, with the same curve overlaid on each.]

These histograms are called standardized histograms. This seems to be exactly what we need! The same curve fits nicely over all the histograms and keeps the same height regardless of the bar width. The nice-fitting curve is the probability density function. But... what is it?!

The probability density function

We have seen that there is a single curve that fits nicely over any standardized histogram from a given distribution. This curve is called the probability density function (p.d.f.). We will write the p.d.f. of a continuous random variable X as fX(x). The p.d.f. fX(x) is in fact the limit of the standardized histogram as the bar width approaches zero.

What is the height of the standardized histogram bar?

For an interval from x to x + t, the standardized histogram plots the probability of an observation falling between x and x + t, divided by the width of the interval, t. Thus the height of the standardized histogram bar over the interval from x to x + t is:

probability / interval width = P(x ≤ X ≤ x + t) / t = (FX(x + t) − FX(x)) / t,

where FX(x) is the cumulative distribution function.

Now consider the limit as the histogram bar width t goes to 0: this limit is DEFINED TO BE the probability density function at x, fX(x):

fX(x) = lim_{t→0} (FX(x + t) − FX(x)) / t    by definition.

This expression should look familiar: it is the derivative of FX(x).

So as the histogram bars of the standardized histogram get narrower, the bars get closer and closer to the p.d.f. curve. However, fX(x) is clearly NOT the probability of x: for example, in the sprint times we can have fX(x) = 4, so it is definitely NOT a probability.

The probability density function (p.d.f.) is therefore the function fX(x) = F′X(x). It is defined to be a single, unchanging curve that describes the SHAPE of any histogram drawn from the distribution of X.

Formal definition of the probability density function

Definition: Let X be a continuous random variable with distribution function FX(x). The probability density function (p.d.f.) of X is defined as

fX(x) = dFX/dx = F′X(x).

It gives:
• the RATE at which probability is accumulating at any given point;
• the SHAPE of the distribution of X.

Using the probability density function to calculate probabilities

As well as showing us the shape of the distribution of X, the probability density function has another major use:
• it calculates probabilities by integration.

Suppose we want to calculate P(a ≤ X ≤ b). We already know that:

P(a ≤ X ≤ b) = FX(b) − FX(a).

But we also know that:

dFX/dx = fX(x),    so    FX(x) = ∫ fX(x) dx    (without constants).

In fact:

FX(b) − FX(a) = ∫_a^b fX(x) dx.

This is a very important result:

Let X be a continuous random variable with probability density function fX(x). Then

P(a ≤ X ≤ b) = P(X ∈ [a, b]) = ∫_a^b fX(x) dx.

This means that we can calculate probabilities by integrating the p.d.f.: P(a ≤ X ≤ b) is the AREA under the curve fX(x) between a and b.

The total area under the p.d.f. curve is:

total area = ∫_{−∞}^∞ fX(x) dx = FX(∞) − FX(−∞) = 1 − 0 = 1.

This says that the total area under the p.d.f. curve is equal to the total probability that X takes a value between −∞ and +∞, which is 1.

[Figure: the area under fX(x) between a and b, and the total area under the curve, which is 1.]

Using the p.d.f. to calculate the distribution function, FX(x)

Suppose we know the probability density function, fX(x), and wish to calculate the distribution function, FX(x). We use the following formula:

Distribution function:    FX(x) = ∫_{−∞}^x fX(u) du.

Proof: ∫_{−∞}^x fX(u) du = FX(x) − FX(−∞) = FX(x) − 0 = FX(x).

Using the dummy variable, u: writing FX(x) = ∫_{−∞}^x fX(u) du means: integrate fX(u) as u ranges from −∞ to x.

Writing FX(x) = ∫_{−∞}^x fX(x) dx is WRONG and MEANINGLESS: you will LOSE A MARK every time. In words, ∫_{−∞}^x fX(x) dx means: integrate fX(x) as x ranges from −∞ to x. It's nonsense! How can x range from −∞ to x?!

Why do we need fX(x)? Why not stick with FX(x)?

These graphs show FX(x) and fX(x) from the men's 100m sprint times (X is a random top ten 100m sprint time).

[Figure: F(x) and f(x) for the sprint times.]

Just using FX(x) gives us very little intuition about the problem. For example, which is the region of highest probability? Using the c.d.f., we would have to inspect the part of the curve with the steepest gradient: very difficult to see. Using the p.d.f., we can see that it is about 10.1 to 10.2 seconds.

Example of calculations with the p.d.f.

Let fX(x) = k e^(−2x) for 0 < x < ∞, and fX(x) = 0 otherwise.

(i) Find the constant k.
(ii) Find P(1 < X ≤ 3).
(iii) Find the cumulative distribution function, FX(x), for all x.

(i) We need:

∫_{−∞}^∞ fX(x) dx = 1

∫_{−∞}^0 0 dx + ∫_0^∞ k e^(−2x) dx = 1

k [e^(−2x)/(−2)]_0^∞ = 1

(−k/2)(e^(−∞) − e^0) = 1

(−k/2)(0 − 1) = 1

k = 2.

(ii)

P(1 < X ≤ 3) = ∫_1^3 fX(x) dx
             = ∫_1^3 2 e^(−2x) dx
             = [2 e^(−2x)/(−2)]_1^3
             = −e^(−2×3) + e^(−2×1)
             = 0.1329.

(iii)

FX(x) = ∫_{−∞}^x fX(u) du.

When x ≤ 0:
FX(x) = ∫_{−∞}^x 0 du = 0.

When x > 0:
FX(x) = ∫_{−∞}^0 0 du + ∫_0^x 2 e^(−2u) du
      = 0 + [2 e^(−2u)/(−2)]_0^x
      = −e^(−2x) + e^0
      = 1 − e^(−2x).

So overall,

FX(x) = 0 for x ≤ 0,  and  FX(x) = 1 − e^(−2x) for x > 0.
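For interest (not from the notes): both parts of the example can be checked by numerical integration of the p.d.f. with a simple midpoint rule. The upper limit 20 stands in for ∞, since e^(−2x) is negligible beyond it.

```python
import math

def f(x):
    """The example pdf: 2*exp(-2x) for x > 0, 0 otherwise (with k = 2)."""
    return 2 * math.exp(-2 * x) if x > 0 else 0.0

def integrate(g, a, b, n=100000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 0, 20)   # total area under the pdf: should be 1
prob = integrate(f, 1, 3)     # P(1 < X <= 3): should be e^-2 - e^-6

print(round(total, 4), round(prob, 4))  # ≈ 1.0 and 0.1329
```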

Total area under the p.d.f. curve is 1:

∫_{−∞}^∞ fX(x) dx = 1.

The p.d.f. is NOT a probability: fX(x) ≥ 0 always, but we do NOT require fX(x) ≤ 1.

Calculating probabilities:

1. If you only need to calculate one probability P(a ≤ X ≤ b): integrate the p.d.f.:
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx.

2. If you will need to calculate several probabilities, it is easiest to find the distribution function, FX(x):
FX(x) = ∫_{−∞}^x fX(u) du.
Then use: P(a ≤ X ≤ b) = FX(b) − FX(a) for any a, b.

Endpoints: DO NOT MATTER for continuous random variables:
P(X ≤ a) = P(X < a) and P(X ≥ a) = P(X > a).

4.3 The Exponential distribution

When will the next volcano erupt in Auckland? We never quite answered this question in Chapter 3. The Poisson distribution was used to count the number of volcanoes that would occur in a fixed space of time. We have not said how long we have to wait for the next volcano: this is a continuous random variable.

Auckland Volcanoes

About 50 volcanic eruptions have occurred in Auckland over the last 100,000 years or so. The first two eruptions occurred in the Auckland Domain and Albert Park: right underneath us! The most recent, and biggest, eruption was Rangitoto, about 600 years ago. There have been about 20 eruptions in the last 20,000 years, which has led the Auckland Regional Council to assess current volcanic risk by assuming that volcanic eruptions in Auckland follow a Poisson process with rate λ = 1/1000 volcanoes per year. For background information, see: www.arc.govt.nz/environment/volcanoes-of-auckland/.

Distribution of the waiting time in the Poisson process

The length of time between events in the Poisson process is called the waiting time. To find the distribution of a continuous random variable, we often work with the cumulative distribution function, FX(x). This is because FX(x) = P(X ≤ x) gives us a probability, unlike the p.d.f. fX(x). We are comfortable with handling and manipulating probabilities.

Suppose that {Nt : t > 0} forms a Poisson process with rate λ = 1/1000. We know that Nt ∼ Poisson(λt), so

P(Nt = n) = ((λt)^n / n!) e^(−λt).

Let X be a continuous random variable giving the number of years waited before the next volcano, starting now. We will derive an expression for FX(x).

(i) When x < 0:
FX(x) = P(X ≤ x) = P(less than 0 time before next volcano) = 0.

(ii) When x ≥ 0:
FX(x) = P(X ≤ x)
      = P(amount of time waited for next volcano is ≤ x)
      = P(there is at least one volcano between now and time x)
      = P(# volcanoes between now and time x is ≥ 1)
      = P(Nx ≥ 1)
      = 1 − P(Nx = 0)
      = 1 − ((λx)^0 / 0!) e^(−λx)
      = 1 − e^(−λx).

Overall:

FX(x) = P(X ≤ x) = 1 − e^(−λx) for x ≥ 0,  and  FX(x) = 0 for x < 0.

The distribution of the waiting time X is called the Exponential distribution because of the exponential formula for FX(x).

Example: What is the probability that there will be a volcanic eruption in Auckland within the next 50 years?

Put λ = 1/1000. We need P(X ≤ 50):

P(X ≤ 50) = FX(50) = 1 − e^(−50/1000) = 0.049.

There is about a 5% chance that there will be a volcanic eruption in Auckland over the next 50 years. This is the figure given by the Auckland Regional Council at the above web link (under 'Future Hazards').
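For interest (not from the notes), the volcano calculation is one line of arithmetic with the Exponential c.d.f.:

```python
import math

lam = 1 / 1000  # eruptions per year, as assumed in the notes

def exp_cdf(x, lam):
    """F(x) = 1 - exp(-lam*x) for x >= 0; 0 for x < 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

# P(eruption within the next 50 years)
p50 = exp_cdf(50, lam)
print(round(p50, 3))  # ≈ 0.049, i.e. about a 5% chance
```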

The Exponential Distribution

We have defined the Exponential(λ) distribution to be the distribution of the waiting time (time between events) in a Poisson process with rate λ. We write X ∼ Exponential(λ), or X ∼ Exp(λ). However, the Exponential distribution has many other applications: it does not always have to arise from a Poisson process.

Let X ∼ Exponential(λ). Note: λ > 0 always.

Distribution function:
FX(x) = P(X ≤ x) = 1 − e^(−λx) for x ≥ 0,  and  FX(x) = 0 for x < 0.

Probability density function:
fX(x) = F′X(x) = λ e^(−λx) for x ≥ 0,  and  fX(x) = 0 for x < 0.

Link with the Poisson process

Let {Nt : t > 0} be a Poisson process with rate λ. Then:
• Nt is the number of events to occur by time t;
• Nt ∼ Poisson(λt), so P(Nt = n) = ((λt)^n / n!) e^(−λt);
• define X to be either the time till the first event, or the time from now until the next event, or the time between any two events. Then X ∼ Exponential(λ). X is called the waiting time of the process.

Memorylessness

We have said that the waiting time of the Poisson process can be defined either as the time from the start to the first event, or the time from now until the next event, or the time between any two events. All of these quantities have the same distribution: X ∼ Exponential(λ). The derivation of the Exponential distribution was valid for all of them, because events occur at a constant average rate in the Poisson process.

This property of the Exponential distribution is called memorylessness:
• the distribution of the time from now until the first event is the same as the distribution of the time from the start until the first event: the time from the start till now has been forgotten!

[Diagram: "time from start to first event" versus "this time forgotten" followed by "time from now to first event", marked START, NOW, FIRST EVENT.]

The Exponential distribution is famous for this memoryless property: it is the only memoryless distribution.

For volcanoes, memorylessness means that the 600 years we have waited since Rangitoto erupted have counted for nothing. The chance that we still have 1000 years to wait for the next eruption is the same today as it was 600 years ago when Rangitoto erupted.

Memorylessness applies to any Poisson process. It is not always a desirable property: you don't want a memoryless waiting time for your bus! The Exponential distribution is often used to model failure times of components: for example X ∼ Exponential(λ) is the amount of time before a light bulb fails. In this case, memorylessness means that 'old is as good as new', or, put another way, 'new is as bad as old'! A memoryless light bulb is quite likely to fail almost immediately.

For private reading: proof of memorylessness

Let X ∼ Exponential(λ) be the total time waited for an event. Let Y be the amount of extra time waited for the event, given that we have already waited time t (say). We wish to prove that Y has the same distribution as X, i.e. that the time t already waited has been 'forgotten'. This means we need to prove that Y ∼ Exponential(λ).

Proof: We will work with FY(y) and prove that it is equal to 1 − e^(−λy).

First note that X = t + Y, because X is the total time waited, and Y is the time waited after time t. Also, we must condition on the event {X > t}, because we know that we have already waited time t. So P(Y ≤ y) = P(X ≤ t + y | X > t).

FY(y) = P(Y ≤ y)
      = P(X ≤ t + y | X > t)
      = P(X ≤ t + y AND X > t) / P(X > t)    (definition of conditional probability)
      = P(t < X ≤ t + y) / (1 − P(X ≤ t))
      = (FX(t + y) − FX(t)) / (1 − FX(t))
      = ((1 − e^(−λ(t+y))) − (1 − e^(−λt))) / (1 − (1 − e^(−λt)))
      = (e^(−λt) − e^(−λ(t+y))) / e^(−λt)
      = e^(−λt)(1 − e^(−λy)) / e^(−λt)
      = 1 − e^(−λy).

So Y ∼ Exponential(λ) as required. This proves that Y is Exponential(λ) like X. Thus the conditional probability of waiting time y extra, given that we have already waited time t, is the same as the probability of waiting time y in total. The time t already waited is forgotten.
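The algebra above can be spot-checked numerically. This Python sketch (λ = 0.5 is an arbitrary illustrative rate) verifies the memoryless identity P(X > t + y | X > t) = P(X > y) over a small grid of t and y, using the survival function 1 − FX(x) = e^(−λx):

```python
import math

lam = 0.5  # illustrative rate

def surv(x):
    """P(X > x) = 1 - F_X(x) = e^(-lam x) for X ~ Exponential(lam)."""
    return math.exp(-lam * x)

# Largest deviation of P(X > t+y | X > t) from P(X > y) over the grid
max_diff = max(
    abs(surv(t + y) / surv(t) - surv(y))
    for t in (0.0, 1.0, 10.0)
    for y in (0.5, 2.0, 7.0)
)
print(max_diff)  # zero up to floating-point rounding
```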

4.4 Likelihood and estimation for continuous random variables

• For discrete random variables, we found the likelihood using the probability function, fX(x) = P(X = x).
• For continuous random variables, we find the likelihood using the probability density function, fX(x) = dFX/dx.

Note: Both discrete and continuous r.v.s have the same definition for the cumulative distribution function: FX(x) = P(X ≤ x). Although the notation fX(x) means something different for continuous and discrete random variables, it is used in exactly the same way for likelihood and estimation.

Example: Exponential likelihood

Suppose that:
• X ∼ Exponential(λ);
• λ is unknown;
• the observed value of X is x.

Then the likelihood function is: L(λ ; x) = fX(x) = λe^(−λx) for 0 < λ < ∞. We estimate λ by setting dL/dλ = 0 to find the MLE, λ̂.

Two or more independent observations

Suppose that X1, . . . , Xn are continuous random variables such that:
• X1, . . . , Xn are INDEPENDENT;
• all the Xi s have the same p.d.f., fX(x).

Then the likelihood is fX(x1)fX(x2) . . . fX(xn).

Example: Suppose that X1, . . . , Xn are independent, and Xi ∼ Exponential(λ) for all i. Find the maximum likelihood estimate of λ.

[Figure: likelihood graph shown for λ = 2 and n = 10; x1, . . . , x10 generated by R command rexp(10, 2).]

Solution:

L(λ ; x1, . . . , xn) = ∏ fX(xi) = ∏ λe^(−λxi) = λ^n e^(−λ Σ xi)   for 0 < λ < ∞.

Define x̄ = (1/n) Σ xi, the sample mean of x1, . . . , xn. So Σ xi = nx̄. Thus

L(λ ; x1, . . . , xn) = λ^n e^(−λnx̄)   for 0 < λ < ∞.

Solve dL/dλ = 0 to find the MLE of λ:

dL/dλ = nλ^(n−1) e^(−λnx̄) − λ^n × nx̄ × e^(−λnx̄) = 0
nλ^(n−1) e^(−λnx̄) (1 − λx̄) = 0
⇒ λ = 0, λ = ∞, or λ = 1/x̄.

The MLE of λ is λ̂ = 1/x̄.
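The closed-form answer λ̂ = 1/x̄ can be checked against a direct numerical maximisation of the log-likelihood l(λ) = n log λ − λnx̄. A Python sketch (the notes' own example uses R's rexp; the true rate 2.0, sample size, and grid here are illustrative):

```python
import math
import random

random.seed(1)
true_lam, n = 2.0, 1000
xs = [random.expovariate(true_lam) for _ in range(n)]

xbar = sum(xs) / n
mle = 1 / xbar  # closed-form MLE: lambda-hat = 1 / sample mean

# Grid search over the log-likelihood l(lam) = n log(lam) - lam * n * xbar;
# since l is concave, the best grid point sits next to the closed-form answer.
grid = [0.01 * k for k in range(1, 1001)]
best = max(grid, key=lambda lam: n * math.log(lam) - lam * n * xbar)
print(mle, best)
```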

4.5 Hypothesis tests

Hypothesis tests for continuous random variables are just like hypothesis tests for discrete random variables. The only difference is:

• endpoints matter for discrete random variables, but not for continuous random variables.

Example: discrete. Suppose H0 : X ∼ Binomial(n = 10, p = 0.5), and we have observed the value x = 7. Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 6) = 1 − FX(6).

Example: continuous. Suppose H0 : X ∼ Exponential(2), and we have observed the value x = 7. Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 7) = 1 − FX(7).

Other than this trap, the procedure for hypothesis testing is the same:

• Use H0 to specify the distribution of X completely, and offer a one-tailed or two-tailed alternative hypothesis H1.
• Make observation x.
• Find the one-tailed or two-tailed p-value as the probability of seeing an observation at least as weird as what we have seen. That is, find the probability under the distribution specified by H0 of seeing an observation further out in the tails than the value x that we have seen.

Example with the Exponential distribution

A very very old person observes that the waiting time from Rangitoto to the next volcanic eruption in Auckland is 1500 years. Test the hypothesis that λ = 1/1000 against the one-sided alternative that λ < 1/1000.

Note: If λ < 1/1000, we would expect to see BIGGER values of X, NOT smaller. This is because X is the time between volcanoes, and λ is the rate at which volcanoes occur. A smaller value of λ means volcanoes occur less often, so the time X between them is BIGGER.

Hypotheses: Let X ∼ Exponential(λ).

H0 : λ = 1/1000
H1 : λ < 1/1000   (one-tailed test)

Observation: x = 1500 years.

Values weirder than x = 1500 years: all values BIGGER than x = 1500.

p-value: P(X ≥ 1500) when X ∼ Exponential(λ = 1/1000). So

p-value = P(X ≥ 1500)
        = 1 − P(X ≤ 1500)
        = 1 − FX(1500)   when X ∼ Exponential(λ = 1/1000)
        = 1 − (1 − e^(−1500/1000))
        = 0.223.

R command: 1-pexp(1500, 1/1000)

Interpretation: There is no evidence against H0. The observation x = 1500 years is consistent with the hypothesis that λ = 1/1000, i.e. that volcanoes erupt once every 1000 years on average.

[Figure: the Exponential(1/1000) density f(x) for 0 ≤ x ≤ 5000, with the p-value region x ≥ 1500 shaded.]
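The R command 1-pexp(1500, 1/1000) can be mirrored in Python using the closed-form Exponential distribution function:

```python
import math

lam, x_obs = 1 / 1000, 1500

# p-value = P(X >= 1500) = 1 - F_X(1500) = e^(-lam * 1500)
p_value = 1 - (1 - math.exp(-lam * x_obs))
print(round(p_value, 3))  # 0.223
```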

4.6 Expectation and variance

Remember the expectation of a discrete random variable is the long-term average:

µX = E(X) = Σx xP(X = x) = Σx xfX(x).

(For each value x, we add in the value and multiply by the proportion of times we would expect to see that value: P(X = x).)

For a continuous random variable, replace the probability function with the probability density function, and replace Σx by ∫ over (−∞, ∞):

µX = E(X) = ∫ xfX(x) dx,

where fX(x) = F′X(x) is the probability density function.

Note: There exists no concept of a 'probability function' fX(x) = P(X = x) for continuous random variables. In fact, if X is continuous, then P(X = x) = 0 for all x.

The idea behind expectation is the same for both discrete and continuous random variables. E(X) is:

• the long-term average of X;
• a 'sum' of values multiplied by how common they are: Σ xf(x) or ∫ xf(x) dx.

Expectation is also the balance point of fX(x) for both continuous and discrete X. Imagine fX(x) cut out of cardboard and balanced on a pencil.

Discrete:
  E(X) = Σx xfX(x);   E(g(X)) = Σx g(x)fX(x);   fX(x) = P(X = x).
  Transform the values, leave the probabilities alone.

Continuous:
  E(X) = ∫ xfX(x) dx;   E(g(X)) = ∫ g(x)fX(x) dx;   fX(x) = F′X(x) (p.d.f.).
  Transform the values, leave the probability density alone.

Variance

If X is continuous, its variance is defined in exactly the same way as for a discrete random variable:

Var(X) = σ²X = E[(X − µX)²] = E(X²) − µ²X = E(X²) − (EX)².

For a continuous random variable, we can either compute the variance using

Var(X) = E[(X − µX)²] = ∫ (x − µX)² fX(x) dx,

or

Var(X) = E(X²) − (EX)² = ∫ x² fX(x) dx − (EX)².

The second expression is usually easier (although not always).

Properties of expectation and variance

All properties of expectation and variance are exactly the same for continuous and discrete random variables. For any random variables X, Y, X1, . . . , Xn, continuous or discrete, and for constants a and b:

• E(aX + b) = aE(X) + b;
• E(ag(X) + b) = aE(g(X)) + b;
• E(X + Y) = E(X) + E(Y);
• E(X1 + . . . + Xn) = E(X1) + . . . + E(Xn);
• Var(aX + b) = a² Var(X);
• Var(ag(X) + b) = a² Var(g(X)).

The following statements are generally true only when X and Y are INDEPENDENT:

• E(XY) = E(X)E(Y) when X, Y independent;
• Var(X + Y) = Var(X) + Var(Y) when X, Y independent.

4.7 Exponential distribution mean and variance

When X ∼ Exponential(λ), then:

E(X) = 1/λ,   Var(X) = 1/λ².

Note: If X is the waiting time for a Poisson process with rate λ events per year (say), it makes sense that E(X) = 1/λ. For example, if λ = 4 events per hour, the average time waited between events is 1/4 hour.

Proof:

E(X) = ∫ xfX(x) dx = ∫ from 0 to ∞ of xλe^(−λx) dx.

Integration by parts: recall that ∫ u (dv/dx) dx = uv − ∫ v (du/dx) dx.

Let u = x and dv/dx = λe^(−λx), so du/dx = 1 and v = −e^(−λx). Then

E(X) = [−xe^(−λx)] from 0 to ∞ − ∫ from 0 to ∞ of (−e^(−λx)) dx
     = 0 + [−(1/λ)e^(−λx)] from 0 to ∞
     = −(1/λ) × 0 − (−(1/λ) × e^0)
     = 1/λ.

∴ E(X) = 1/λ.

Variance: Var(X) = E(X²) − (EX)² = E(X²) − 1/λ².

Now E(X²) = ∫ x² fX(x) dx = ∫ from 0 to ∞ of x²λe^(−λx) dx.

Let u = x² and dv/dx = λe^(−λx), so du/dx = 2x and v = −e^(−λx). Then

E(X²) = [−x²e^(−λx)] from 0 to ∞ + ∫ from 0 to ∞ of 2xe^(−λx) dx
      = 0 + (2/λ) ∫ from 0 to ∞ of λxe^(−λx) dx
      = (2/λ) × E(X)
      = 2/λ².

So Var(X) = E(X²) − (EX)² = 2/λ² − (1/λ)² = 1/λ².
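The two integration-by-parts results can be sanity-checked by crude numerical integration. A Python sketch (λ = 2 and the truncation point are illustrative choices; the upper limit 40/λ is far enough into the tail that the truncated probability mass is negligible):

```python
import math

lam = 2.0

def pdf(x):
    """Exponential(lam) density."""
    return lam * math.exp(-lam * x)

# Trapezoidal rule for E(X) and E(X^2) over [0, 40/lam]
N = 200_000
upper = 40 / lam
h = upper / N
ex = ex2 = 0.0
for i in range(N + 1):
    x = i * h
    w = h if 0 < i < N else h / 2  # trapezoid endpoint weights
    fx = pdf(x)
    ex += w * x * fx
    ex2 += w * x * x * fx
var = ex2 - ex ** 2
print(ex, var)  # close to 1/lam = 0.5 and 1/lam^2 = 0.25
```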

Interlude: Guess the Mean, Median, and Variance

Given the probability density function of a distribution, we should be able to guess roughly the distribution mean, median, and variance . . . but it isn't easy! Have a go at the examples below. (Answers overleaf.)

For any distribution:

• the mean is the average that would be obtained if a large number of observations were drawn from the distribution;
• the median is the half-way point of the distribution: every observation has a 50-50 chance of being above the median or below the median;
• the variance is the average squared distance of observations from the mean.

As a hint:

• the mean is the balance-point of the distribution. Imagine that the p.d.f. is made of cardboard and balanced on a rod. The mean is the point where the rod would have to be placed for the cardboard to balance.
• the median is the half-way point, so it divides the p.d.f. into two equal areas of 0.5 each.
• the variance is the average squared distance of an observation from the mean, so to get a rough guess (not exact), it is easiest to guess an average distance from the mean and square it.

[Figure: 'Guess the mean, median, and variance' p.d.f., f(x) plotted for x from 0 to 300.]

Answers:

[Figure: the same p.d.f. with the median (54.6), the mean (90.0), and the variance = (118)² = 13924 marked.]

Notes: The mean is larger than the median. This always happens when the distribution has a long right tail (positive skew) like this one. The variance is huge . . . but when you look at the numbers along the horizontal axis, it is quite believable that the average squared distance of an observation from the mean is 118². Out of interest, the distribution shown is a Lognormal distribution.

Example 2: Try the same again with the example below. Answers are written below the graph.

[Figure: p.d.f., f(x) plotted for x from 0 to 5.]

Answers: Median = 0.693, Mean = 1.0, Variance = 1.0.

4.8 The Uniform distribution

X has a Uniform distribution on the interval [a, b] if X is equally likely to fall anywhere in the interval [a, b]. We write X ∼ Uniform[a, b], or X ∼ U[a, b]. Equivalently, X ∼ Uniform(a, b), or X ∼ U(a, b).

Probability density function, fX(x)

If X ∼ U[a, b], then

fX(x) = 1/(b − a)   if a ≤ x ≤ b;
fX(x) = 0           otherwise.

[Figure: the p.d.f. of the Uniform[a, b] distribution, constant at 1/(b − a) between a and b.]

Distribution function, FX(x)

FX(x) = ∫ from −∞ to x of fY(y) dy = ∫ from a to x of 1/(b − a) dy = [y/(b − a)] from a to x = (x − a)/(b − a)   if a ≤ x ≤ b.

Thus

FX(x) = 0                 if x < a;
FX(x) = (x − a)/(b − a)   if a ≤ x ≤ b;
FX(x) = 1                 if x > b.

[Figure: the distribution function of the Uniform[a, b] distribution, rising linearly from 0 at a to 1 at b.]

167

Mean and variance:

If X ∼ Uniform[a, b],
Proof : E(X) =
∞ −∞

a+b , E(X) = 2
b

(b − a)2 . Var(X) = 12 1 x2 dx = b−a 2 = = = 1 b−a 1 b−a a+b . 2
b a b a

xf (x) dx =
a

1 x b−a

1 · (b2 − a2 ) 2 1 (b − a)(b + a) 2

Var(X) = E[(X − µX ) ] =

2

b a

(x − µX )2 1 (x − µX )3 dx = b−a b−a 3 = 1 b−a

(b − µX )3 − (a − µX )3 3
a−b 2 .

But µX = EX = So, Var(X) =

a+b 2 ,

so

b − µX =

b−a 2

and

a − µX =

1 b−a

(b − a)3 − (a − b)3 23 × 3

(b − a)3 + (b − a)3 = (b − a) × 24 = (b − a)2 . 12

Example: let X ∼ Uniform[0, 1]. Then fX (x) = 1 if 0 ≤ x ≤ 1 0 otherwise.
0+1 2

µX = E(X) =

=

1 2

(half-way through interval [0, 1]).
1 . 12

2 σX = Var(X) =

1 (1 12

− 0)2 =

168
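The formulas E(X) = (a + b)/2 and Var(X) = (b − a)²/12 can be checked by simulation. A Python sketch (the interval [3, 11] and the sample size are illustrative):

```python
import random

random.seed(0)
a, b = 3.0, 11.0  # illustrative interval
n = 200_000

mean_formula = (a + b) / 2        # 7.0
var_formula = (b - a) ** 2 / 12   # 16/3, about 5.33

xs = [random.uniform(a, b) for _ in range(n)]
mean_sim = sum(xs) / n
var_sim = sum((x - mean_sim) ** 2 for x in xs) / n
print(mean_sim, var_sim)
```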

4.9 The Change of Variable Technique: finding the distribution of g(X)

Let X be a continuous random variable. Suppose

• the p.d.f. of X, fX(x), is known;
• the r.v. Y is defined as Y = g(X) for some function g;
• we wish to find the p.d.f. of Y.

We use the Change of Variable technique.

Example: Let X ∼ Uniform(0, 1), and let Y = −log(X). The p.d.f. of X is fX(x) = 1 for 0 < x < 1. What is the p.d.f. of Y, fY(y)?

Change of variable technique for monotone functions

Suppose that g(X) is a monotone function R → R. This means that g is an increasing function, or g is a decreasing function. When g is monotone, it is invertible, or (1–1) ('one-to-one'). That is, for every y there is a unique x such that g(x) = y. This means that the inverse function, g⁻¹(y), is well-defined as a function for a certain range of y. When g : R → R, as it is here, then g can only be (1–1) if it is monotone.
[Figure: graphs of y = g(x) = x² for 0 ≤ x ≤ 1, and of the inverse function x = g⁻¹(y) = √y for 0 ≤ y ≤ 1.]

Change of Variable formula

Let g : R → R be a monotone function and let Y = g(X). Then the p.d.f. of Y = g(X) is

fY(y) = fX(g⁻¹(y)) |d/dy g⁻¹(y)|.

Easy way to remember

Write   y = y(x) (= g(x))   and   x = x(y) (= g⁻¹(y)).

Then   fY(y) = fX(x(y)) |dx/dy|.

Working for change of variable questions

1) Show you have checked that g(x) is monotone over the required range.
2) Write y = y(x) for x in <range of x>, e.g. for a < x < b.
3) So x = x(y) for y in <range of y>: for y(a) < y(x) < y(b) if y is increasing; for y(a) > y(x) > y(b) if y is decreasing.
4) Then |dx/dy| = <expression involving y>.
5) So fY(y) = fX(x(y)) |dx/dy| by the Change of Variable formula, = . . . Quote the range of values of y as part of the FINAL answer.

Refer back to the question to find fX(x): you often have to deduce this from information like X ∼ Uniform(0, 1) or X ∼ Exponential(λ), or it may be given explicitly.

Note: There should be no x's left in the answer! x(y) and |dx/dy| are expressions involving y only.

Example 1: Let X ∼ Uniform(0, 1), and let Y = −log(X). Find the p.d.f. of Y.

1) y(x) = −log(x) is monotone decreasing, so we can apply the Change of Variable formula.
2) Let y = y(x) = −log x for 0 < x < 1.
3) Then x = x(y) = e^(−y) for −log(0) > y > −log(1), i.e. for 0 < y < ∞.
4) |dx/dy| = |d/dy (e^(−y))| = |−e^(−y)| = e^(−y) for 0 < y < ∞.
5) So fY(y) = fX(x(y)) |dx/dy| = fX(e^(−y)) e^(−y) for 0 < y < ∞.

But X ∼ Uniform(0, 1), so fX(x) = 1 for 0 < x < 1 ⇒ fX(e^(−y)) = 1 for 0 < y < ∞.

Thus fY(y) = fX(e^(−y)) e^(−y) = e^(−y) for 0 < y < ∞. So Y ∼ Exponential(1).

Note: In change of variable questions, you lose a mark for:
1. not stating that g(x) is monotone over the required range of x;
2. not giving the range of y for which the result holds as part of the final answer (e.g. for 0 < y < ∞).
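Example 1's conclusion, that Y = −log(X) ∼ Exponential(1) when X ∼ Uniform(0, 1), is easy to check by simulation. A Python sketch (the sample size is illustrative; 1 − random.random() is used so the argument of log lies in (0, 1]):

```python
import math
import random

random.seed(42)
n = 200_000

ys = [-math.log(1 - random.random()) for _ in range(n)]

# If Y ~ Exponential(1): E(Y) = 1 and P(Y <= 1) = 1 - e^(-1)
mean_sim = sum(ys) / n
p_sim = sum(1 for y in ys if y <= 1.0) / n
p_theory = 1 - math.exp(-1.0)
print(mean_sim, p_sim, p_theory)
```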

Example 2: Let X be a continuous random variable with p.d.f.

fX(x) = (1/4)x³ for 0 < x < 2; 0 otherwise.

Let Y = 1/X. Find the probability density function of Y, fY(y).

The function y(x) = 1/x is monotone decreasing for 0 < x < 2, so we can apply the Change of Variable formula.

Let y = y(x) = 1/x for 0 < x < 2.
Then x = x(y) = 1/y for 1/0 > y > 1/2, i.e. for 1/2 < y < ∞.
|dx/dy| = |−y⁻²| = 1/y².

Change of variable formula:

fY(y) = fX(x(y)) |dx/dy|
      = (1/4)(x(y))³ × (1/y²)
      = (1/4) × (1/y³) × (1/y²)
      = 1/(4y⁵)   for 1/2 < y < ∞.

Thus

fY(y) = 1/(4y⁵) for 1/2 < y < ∞; 0 otherwise.

For mathematicians: proof of the change of variable formula

Separate into cases where g is increasing and where g is decreasing.

i) g increasing

g is increasing if u < w ⇔ g(u) < g(w).   (⋆)

Note that putting u = g⁻¹(x) and w = g⁻¹(y) in (⋆), we obtain

g⁻¹(x) < g⁻¹(y) ⇔ g(g⁻¹(x)) < g(g⁻¹(y)) ⇔ x < y,

so g⁻¹ is also an increasing function.

Now

FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y))   (put u = X, w = g⁻¹(y) in (⋆))
      = FX(g⁻¹(y)).

Thus the p.d.f. of Y is

fY(y) = d/dy FY(y)
      = d/dy FX(g⁻¹(y))
      = F′X(g⁻¹(y)) d/dy (g⁻¹(y))   (Chain Rule)
      = fX(g⁻¹(y)) d/dy (g⁻¹(y)).

Now g is increasing, so g⁻¹ is also increasing (by above), so d/dy (g⁻¹(y)) > 0, and thus fY(y) = fX(g⁻¹(y)) |d/dy (g⁻¹(y))| as required.

ii) g decreasing

g is decreasing if u > w ⇔ g(u) < g(w). (Putting u = g⁻¹(x) and w = g⁻¹(y) gives g⁻¹(x) > g⁻¹(y) ⇔ x < y, so g⁻¹ is also decreasing.)

FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g⁻¹(y)) = 1 − FX(g⁻¹(y)).

So the p.d.f. of Y is

fY(y) = d/dy (1 − FX(g⁻¹(y))) = −fX(g⁻¹(y)) d/dy g⁻¹(y).

This time, g is decreasing, so g⁻¹ is also decreasing, and thus d/dy g⁻¹(y) < 0, so −d/dy g⁻¹(y) = |d/dy g⁻¹(y)|. So once again,

fY(y) = fX(g⁻¹(y)) |d/dy g⁻¹(y)|.

4.10 Change of variable for non-monotone functions: non-examinable

Suppose that Y = g(X) and g is not monotone. We wish to find the p.d.f. of Y. We can sometimes do this by using the distribution function directly.

Example: Let X have any distribution, with distribution function FX(x). Let Y = X². Find the p.d.f. of Y.

Clearly, Y ≥ 0, so FY(y) = 0 if y < 0. For y ≥ 0:

FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = FX(√y) − FX(−√y).

[Figure: parabola y = x², showing that X² ≤ y corresponds to −√y ≤ X ≤ √y.]

So

FY(y) = 0 if y < 0;
FY(y) = FX(√y) − FX(−√y) if y ≥ 0.

So the p.d.f. of Y is

fY(y) = d/dy FY = d/dy (FX(√y)) − d/dy (FX(−√y))
      = (1/2) y^(−1/2) F′X(√y) + (1/2) y^(−1/2) F′X(−√y)
      = (1/(2√y)) (fX(√y) + fX(−√y))   for y ≥ 0.

∴ fY(y) = (1/(2√y)) (fX(√y) + fX(−√y)) for y ≥ 0, whenever Y = X².

Example: Let X ∼ Normal(0, 1). This is the familiar bell-shaped distribution (see later). The p.d.f. of X is:

fX(x) = (1/√(2π)) e^(−x²/2).

Find the p.d.f. of Y = X². By the result above,

fY(y) = (1/(2√y)) · (1/√(2π)) (e^(−y/2) + e^(−y/2))
      = (1/√(2π)) y^(−1/2) e^(−y/2)   for y ≥ 0.

This is in fact the Chi-squared distribution with ν = 1 degrees of freedom. This example has shown that if X ∼ Normal(0, 1), then Y = X² ∼ Chi-squared(df = 1). The Chi-squared distribution is a special case of the Gamma distribution (see next section).

4.11 The Gamma distribution

The Gamma(k, λ) distribution is a very flexible family of distributions. It is defined as the sum of k independent Exponential r.v.s: if X1, . . . , Xk ∼ Exponential(λ) and X1, . . . , Xk are independent, then

X1 + X2 + . . . + Xk ∼ Gamma(k, λ).

Special Case: When k = 1, Gamma(1, λ) = Exponential(λ) (the sum of a single Exponential r.v.).

Probability density function, fX(x)

For X ∼ Gamma(k, λ),

fX(x) = (λ^k / Γ(k)) x^(k−1) e^(−λx) if x ≥ 0; 0 otherwise.

Here, Γ(k), called the Gamma function of k, is a constant that ensures fX(x) integrates to 1, i.e. ∫ from 0 to ∞ of fX(x) dx = 1. It is defined as

Γ(k) = ∫ from 0 to ∞ of y^(k−1) e^(−y) dy.

When k is an integer, Γ(k) = (k − 1)!

Mean and variance of the Gamma distribution

For X ∼ Gamma(k, λ),

E(X) = k/λ   and   Var(X) = k/λ².

Relationship with the Chi-squared distribution

The Chi-squared distribution with ν degrees of freedom, χ²ν, is a special case of the Gamma distribution: χ²ν = Gamma(k = ν/2, λ = 1/2). So if Y ∼ χ²ν, then E(Y) = k/λ = ν, and Var(Y) = k/λ² = 2ν.
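The defining property, that a sum of k independent Exponential(λ) variables is Gamma(k, λ), can be checked against the stated mean k/λ and variance k/λ². A Python sketch (k = 5, λ = 2, and the number of replicates are illustrative):

```python
import random

random.seed(7)
k, lam = 5, 2.0
reps = 100_000

# Each draw: X = X1 + ... + Xk with Xi ~ Exponential(lam), so X ~ Gamma(k, lam)
sums = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(reps)]

mean_sim = sum(sums) / reps                              # near k/lam = 2.5
var_sim = sum((s - mean_sim) ** 2 for s in sums) / reps  # near k/lam^2 = 1.25
print(mean_sim, var_sim)
```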

[Figure: Gamma p.d.f.s for k = 1, k = 2 and k = 5. Notice the right skew (long right tail); flexibility in shape is controlled by the two parameters.]

Distribution function, FX(x)

There is no closed form for the distribution function of the Gamma distribution. If X ∼ Gamma(k, λ), then FX(x) can only be calculated by computer.

Proof that E(X) = k/λ and Var(X) = k/λ² (non-examinable)

EX = ∫ from 0 to ∞ of xfX(x) dx
   = ∫ from 0 to ∞ of x · (λ^k x^(k−1) / Γ(k)) e^(−λx) dx
   = (1/Γ(k)) ∫ from 0 to ∞ of (λx)^k e^(−λx) dx
   = (1/Γ(k)) ∫ from 0 to ∞ of y^k e^(−y) (1/λ) dy    (letting y = λx, so dx = (1/λ) dy)
   = (1/λ) · Γ(k + 1)/Γ(k)
   = (1/λ) · kΓ(k)/Γ(k)    (property of the Gamma function)
   = k/λ.

Var(X) = E(X²) − (EX)²
       = ∫ from 0 to ∞ of x² fX(x) dx − k²/λ²
       = ∫ from 0 to ∞ of (x² λ^k x^(k−1) / Γ(k)) e^(−λx) dx − k²/λ²
       = (1/Γ(k)) ∫ from 0 to ∞ of (1/λ)(λx)^(k+1) e^(−λx) dx − k²/λ²
       = (1/λ²) · (1/Γ(k)) ∫ from 0 to ∞ of y^(k+1) e^(−y) dy − k²/λ²    (where y = λx, dx = (1/λ) dy)
       = (1/λ²) · Γ(k + 2)/Γ(k) − k²/λ²
       = (1/λ²) · (k + 1)kΓ(k)/Γ(k) − k²/λ²
       = k/λ².

Gamma distribution arising from the Poisson process

Recall that the waiting time between events in a Poisson process with rate λ has the Exponential(λ) distribution. That is, if Xi = time waited between event i − 1 and event i, then Xi ∼ Exp(λ). The time waited from time 0 to the time of the kth event is X1 + X2 + . . . + Xk, the sum of k independent Exponential(λ) r.v.s. Thus the time waited until the kth event in a Poisson process with rate λ has the Gamma(k, λ) distribution.

Note: There are some similarities between the Exponential(λ) distribution and the (discrete) Geometric(p) distribution. Both distributions describe the 'waiting time' before an event. In the same way, the Gamma(k, λ) distribution is similar to the (discrete) Negative Binomial(k, p) distribution, as they both describe the 'waiting time' before the kth event.

4.12 The Beta Distribution: non-examinable

The Beta distribution has two parameters, α and β. We write X ∼ Beta(α, β).

P.d.f.: f(x) = (1/B(α, β)) x^(α−1)(1 − x)^(β−1) for 0 < x < 1; 0 otherwise.

The function B(α, β) is the Beta function and is defined by the integral

B(α, β) = ∫ from 0 to 1 of x^(α−1)(1 − x)^(β−1) dx,   for α > 0, β > 0.

It can be shown that B(α, β) = Γ(α)Γ(β)/Γ(α + β).

Chapter 5: The Normal Distribution and the Central Limit Theorem

The Normal distribution is the familiar bell-shaped distribution. It is probably the most important distribution in statistics, mainly because of its link with the Central Limit Theorem, which states that any large sum of independent, identically distributed random variables is approximately Normal:

X1 + X2 + . . . + Xn ∼ approx Normal   if X1, . . . , Xn are i.i.d. and n is large.

Before studying the Central Limit Theorem, we look at the Normal distribution and some of its general properties.

5.1 The Normal Distribution

The Normal distribution has two parameters, the mean, µ, and the variance, σ². µ and σ² satisfy −∞ < µ < ∞, σ² > 0. We write X ∼ Normal(µ, σ²), or X ∼ N(µ, σ²).

Probability density function, fX(x)

fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))   for −∞ < x < ∞.

Distribution function, FX(x)

There is no closed form for the distribution function of the Normal distribution. If X ∼ Normal(µ, σ²), then FX(x) can only be calculated by computer.

R command: FX(x) = pnorm(x, mean=µ, sd=sqrt(σ²)).

[Figure: probability density function fX(x) and distribution function FX(x) of the Normal distribution.]

Mean and Variance

For X ∼ Normal(µ, σ²), E(X) = µ and Var(X) = σ².

Linear transformations

If X ∼ Normal(µ, σ²), then for any constants a and b,

aX + b ∼ Normal(aµ + b, a²σ²).

In particular,

X ∼ Normal(µ, σ²) ⇒ (X − µ)/σ ∼ Normal(0, 1).

Proof: Let a = 1/σ and b = −µ/σ, and let Z = aX + b = (X − µ)/σ. Then

Z ∼ Normal(aµ + b, a²σ²) = Normal(µ/σ − µ/σ, σ²/σ²) = Normal(0, 1).

Z ∼ Normal(0, 1) is called the standard Normal random variable.

General proof that aX + b ∼ Normal(aµ + b, a²σ²): Let X ∼ Normal(µ, σ²), and let Y = aX + b. We wish to find the distribution of Y. Use the change of variable technique.

1) y(x) = ax + b is monotone, so we can apply the Change of Variable technique.
2) Let y = y(x) = ax + b for −∞ < x < ∞.
3) Then x = x(y) = (y − b)/a for −∞ < y < ∞.
4) |dx/dy| = |1/a| = 1/|a|.
5) So fY(y) = fX(x(y)) |dx/dy| = fX((y − b)/a) · (1/|a|).   (⋆)

But X ∼ Normal(µ, σ²), so fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)). Thus

fX((y − b)/a) = (1/√(2πσ²)) e^(−((y−b)/a − µ)²/(2σ²)) = (1/√(2πσ²)) e^(−(y−(aµ+b))²/(2a²σ²)).

Returning to (⋆):

fY(y) = fX((y − b)/a) · (1/|a|) = (1/√(2πa²σ²)) e^(−(y−(aµ+b))²/(2a²σ²))   for −∞ < y < ∞.

But this is the p.d.f. of a Normal(aµ + b, a²σ²) random variable. So, if X ∼ Normal(µ, σ²), then aX + b ∼ Normal(aµ + b, a²σ²).

Sums of Normal random variables

If X and Y are independent, and X ∼ Normal(µ1, σ1²), Y ∼ Normal(µ2, σ2²), then

X + Y ∼ Normal(µ1 + µ2, σ1² + σ2²).

More generally, if X1, X2, . . . , Xn are independent, and Xi ∼ Normal(µi, σi²) for i = 1, . . . , n, then

a1X1 + a2X2 + . . . + anXn ∼ Normal((a1µ1 + . . . + anµn), (a1²σ1² + . . . + an²σn²)).

For mathematicians: properties of the Normal distribution

1. Proof that ∫ fX(x) dx = 1.

The full proof that

∫ from −∞ to ∞ of (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)) dx = 1

relies on the following result:

FACT: ∫ from −∞ to ∞ of e^(−y²) dy = √π.

This result is non-trivial to prove. See Calculus courses for details. Using this result, the proof that ∫ fX(x) dx = 1 follows by using the change of variable y = (x − µ)/(√2 σ) in the integral.

2. Proof that E(X) = µ.

E(X) = ∫ xfX(x) dx = ∫ x (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)) dx.

Change variable of integration: let z = (x − µ)/σ: then x = σz + µ and dx/dz = σ. Thus

E(X) = ∫ (σz + µ) · (1/√(2πσ²)) · e^(−z²/2) · σ dz
     = ∫ (σz/√(2π)) · e^(−z²/2) dz + µ ∫ (1/√(2π)) e^(−z²/2) dz.

The first integrand is an odd function of z (i.e. g(−z) = −g(z)), so it integrates to 0 over the range −∞ to ∞. The second integral is 1, because it is the p.d.f. of N(0, 1), which integrates to 1. Thus E(X) = 0 + µ × 1 = µ.

3. Proof that Var(X) = σ².

Var(X) = E[(X − µ)²] = ∫ (x − µ)² (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)) dx
       = σ² (1/√(2π)) ∫ z² e^(−z²/2) dz    (putting z = (x − µ)/σ)
       = σ² { (1/√(2π)) [−z e^(−z²/2)] from −∞ to ∞ + (1/√(2π)) ∫ e^(−z²/2) dz }    (integration by parts)
       = σ² {0 + 1}
       = σ².

5.2 The Central Limit Theorem (CLT)

. . . also known as the Piece of Cake Theorem.

The Central Limit Theorem (CLT) is one of the most fundamental results in statistics. In its simplest form, it states that if a large number of independent random variables are drawn from any distribution, then the distribution of their sum (or alternatively their sample average) always converges to the Normal distribution.

Theorem (The Central Limit Theorem): Let X1, . . . , Xn be independent r.v.s with mean µ and variance σ², from ANY distribution. For example, Xi ∼ Binomial(n, p) for each i, so µ = np and σ² = np(1 − p). Then the sum Sn = X1 + . . . + Xn = Σ Xi has a distribution that tends to Normal as n → ∞.

The mean of the Normal distribution is E(Sn) = Σ E(Xi) = nµ.

The variance of the Normal distribution is

Var(Sn) = Var(Σ Xi) = Σ Var(Xi) = nσ², because X1, . . . , Xn are independent.

So Sn = X1 + X2 + . . . + Xn → Normal(nµ, nσ²) as n → ∞.

Notes:

1. This is a remarkable theorem, because the limit holds for any distribution of X1, . . . , Xn.
2. A sufficient condition on X for the Central Limit Theorem to apply is that Var(X) is finite. Other versions of the Central Limit Theorem relax the conditions that X1, . . . , Xn are independent and have the same distribution.
3. The speed of convergence of Sn to the Normal distribution depends upon the distribution of X. Skewed distributions converge more slowly than symmetric Normal-like distributions. It is usually safe to assume that the Central Limit Theorem applies whenever n ≥ 30. It might apply for as little as n = 4.

The Central Limit Theorem in action: simulation studies

The following simulation study illustrates the Central Limit Theorem, making use of several of the techniques learnt in STATS 210. We will look particularly at how fast the distribution of Sn converges to the Normal distribution.

Example 1: Triangular distribution: fX(x) = 2x for 0 < x < 1.

[Figure: the triangular p.d.f. f(x) = 2x on (0, 1).]

Find E(X) and Var(X):

µ = E(X) = ∫ from 0 to 1 of xfX(x) dx = ∫ from 0 to 1 of 2x² dx = [2x³/3] from 0 to 1 = 2/3.

σ² = Var(X) = E(X²) − {E(X)}²
   = ∫ from 0 to 1 of 2x³ dx − (2/3)²
   = [2x⁴/4] from 0 to 1 − 4/9
   = 1/2 − 4/9
   = 1/18.

Let Sn = X1 + . . . + Xn where X1, . . . , Xn are independent. Then

E(Sn) = E(X1 + . . . + Xn) = nµ = 2n/3,
Var(Sn) = Var(X1 + . . . + Xn) = nσ² = n/18   by independence.

So Sn ∼ approx Normal(2n/3, n/18) for large n, by the Central Limit Theorem.

The graph shows histograms of 10 000 values of Sn = X1 + . . . + Xn for n = 1, 2, 3 and 10. The Normal p.d.f. Normal(nµ, nσ²) = Normal(2n/3, n/18) is superimposed across the top. Even for n as low as 10, the Normal curve is a very good approximation.

[Figure: histograms of Sn for n = 1, 2, 3 and 10, with the approximating Normal curves superimposed.]

Example 2: U-shaped distribution: fX(x) = (3/2)x² for −1 < x < 1.

[Figure: the U-shaped p.d.f. f(x) = (3/2)x² on (−1, 1).]

We find that E(X) = µ = 0 and Var(X) = σ² = 3/5. (Exercise)

Let Sn = X1 + . . . + Xn where X1, . . . , Xn are independent. Then

E(Sn) = E(X1 + . . . + Xn) = nµ = 0,
Var(Sn) = Var(X1 + . . . + Xn) = nσ² = 3n/5   by independence.

So Sn ∼ approx Normal(0, 3n/5) for large n, by the Central Limit Theorem.

Even with this highly non-Normal distribution for X, the Normal curve provides a good approximation to Sn = X1 + . . . + Xn for n as small as 10.

[Figure: histograms of Sn for n = 1, 2, 3 and 10 for the U-shaped distribution, with the approximating Normal curves superimposed.]

Normal approximation to the Binomial distribution, using the CLT

Let Y ∼ Binomial(n, p). We can think of Y as the sum of n Bernoulli random variables:

Y = X1 + X2 + . . . + Xn, where Xi = 1 if trial i is a "success" (prob = p), 0 otherwise (prob = 1 − p).

Each Xi has µ = E(Xi) = p and σ² = Var(Xi) = p(1 − p). So Y = X1 + . . . + Xn → Normal(nµ, nσ²) = Normal(np, np(1 − p)) by the CLT. Thus

Bin(n, p) → Normal(np, np(1 − p))   [mean of Bin(n, p), variance of Bin(n, p)]

as n → ∞ with p fixed. The Binomial distribution is therefore well approximated by the Normal distribution when n is large, for any fixed value of p.

The Normal distribution is also a good approximation to the Poisson(λ) distribution when λ is large:

Poisson(λ) → Normal(λ, λ) when λ is large.
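The quality of the Normal approximation to the Binomial can be checked directly. This Python sketch compares the exact P(X ≤ 55) for X ∼ Binomial(100, 0.5) with the Normal(np, np(1 − p)) approximation, using a continuity correction of 0.5 (the cut-off 55 is an illustrative choice):

```python
import math

n, p = 100, 0.5

# Exact Binomial tail: P(X <= 55)
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(56))

# Normal approximation with continuity correction:
# X ~ approx Normal(np, np(1-p)), so P(X <= 55) is about Phi((55.5 - np)/sd)
mu = n * p
sd = math.sqrt(n * p * (1 - p))
phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
approx = phi((55.5 - mu) / sd)
print(exact, approx)  # both close to 0.864
```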

[Figure: the Binomial(n = 100, p = 0.5) and Poisson(λ = 100) probability functions, each closely matched by a Normal curve.]

Why the Piece of Cake Theorem?

• The Central Limit Theorem makes whole realms of statistics into a piece of cake.
• After seeing a theorem this good, you deserve a piece of cake!

Example: Remember the margin of error for an opinion poll?

An opinion pollster wishes to estimate the level of support for Labour in an upcoming election. She interviews n people about their voting preferences. Let p be the true, unknown level of support for the Labour party in New Zealand. Let X be the number of the n people interviewed by the opinion pollster who plan to vote Labour. Then X ∼ Binomial(n, p).

At the end of Chapter 2, we said that the maximum likelihood estimator for p is

p̂ = X/n.

In a large sample (large n), we now know that X ∼ approx Normal(np, npq), where q = 1 − p. So

p̂ = X/n ∼ approx Normal(p, pq/n)   (linear transformation of Normal r.v.).

So

(p̂ − p)/√(pq/n) ∼ approx Normal(0, 1).

Now if Z ∼ Normal(0, 1), we find (using a computer) that the 95% central probability region of Z is from −1.96 to +1.96:

P(−1.96 < Z < 1.96) = 0.95.

Check in R: pnorm(1.96, mean=0, sd=1) - pnorm(-1.96, mean=0, sd=1)

Putting Z = (p̂ − p)/√(pq/n), we obtain

P(−1.96 < (p̂ − p)/√(pq/n) < 1.96) ≃ 0.95.

Rearranging:

P(p̂ − 1.96√(pq/n) < p < p̂ + 1.96√(pq/n)) ≃ 0.95.

The 95% confidence interval has RANDOM end-points, which depend on p̂. About 95% of the time, these random end-points will enclose the true unknown value p. This enables us to form an estimated 95% confidence interval for the unknown parameter p:

estimated 95% confidence interval is p̂ − 1.96√(p̂(1 − p̂)/n) to p̂ + 1.96√(p̂(1 − p̂)/n).

Confidence intervals are extremely important for helping us to assess how useful our estimate is. A narrow confidence interval suggests a useful estimate (low variance); a wide confidence interval suggests a poor estimate (high variance).

Next time you see the newspapers quoting the margin of error on an opinion poll:

• Remember: margin of error = 1.96√(p̂(1 − p̂)/n).
• Think: Central Limit Theorem!
• Have: a piece of cake.
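The margin-of-error formula is simple enough to compute directly. A Python sketch with made-up illustrative poll numbers (430 Labour supporters out of n = 1000 interviewed):

```python
import math

x, n = 430, 1000  # illustrative poll data
p_hat = x / n
q_hat = 1 - p_hat

# margin of error = 1.96 * sqrt(p_hat * q_hat / n)
margin = 1.96 * math.sqrt(p_hat * q_hat / n)
ci = (p_hat - margin, p_hat + margin)
print(round(margin, 4), ci)  # margin about 0.031, i.e. roughly +/- 3 percentage points
```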

Using the Central Limit Theorem to find the distribution of the mean, X̄

Let X1, . . . , Xn be independent, identically distributed with mean E(Xi) = µ and variance Var(Xi) = σ² for all i. The sample mean, X̄, is defined as:

X̄ = (X1 + X2 + . . . + Xn)/n.

So X̄ = Sn/n, where Sn = X1 + . . . + Xn ∼ approx Normal(nµ, nσ²) by the CLT. Because X̄ is a scalar multiple of a Normal r.v. as n grows large, X̄ itself is approximately Normal for large n:

(X1 + X2 + . . . + Xn)/n ∼ approx Normal(µ, σ²/n)   as n → ∞.

The following three statements of the Central Limit Theorem are equivalent:

X̄ = (X1 + X2 + . . . + Xn)/n ∼ approx Normal(µ, σ²/n)   as n → ∞;
Sn = X1 + X2 + . . . + Xn ∼ approx Normal(nµ, nσ²)   as n → ∞;
(Sn − nµ)/√(nσ²) = (X̄ − µ)/√(σ²/n) ∼ approx Normal(0, 1)   as n → ∞.

The essential point to remember about the Central Limit Theorem is that large sums or sample means of independent random variables converge to a Normal distribution, whatever the distribution of the original r.v.s.

More general version of the CLT

A more general form of the CLT states that, if X1, . . . , Xn are independent, with E(Xi) = µi and Var(Xi) = σi² (not necessarily all equal), then

Zn = Σ (Xi − µi) / √(Σ σi²) → Normal(0, 1)   as n → ∞.

Other versions of the CLT relax the condition that X1, . . . , Xn are independent.

Chapter 6: Wrapping Up

Probably the two major ideas of this course are:

• likelihood and estimation;
• hypothesis testing.

Most of the techniques that we have studied along the way are there to help us with these two goals: expectation, variance, distributions, change of variable, and the Central Limit Theorem. Let's see how these different ideas all come together.

6.1 Estimators — the good, the bad, and the estimator PDF

We have seen that an estimator is a capital letter replacing a small letter. What's the point of that?

Example: Let X ∼ Binomial(n, p) with known n and observed value X = x.
• The maximum likelihood estimate of p is p̂ = x/n.
• The maximum likelihood estimator of p is p̂ = X/n.

Example: Let X ∼ Exponential(λ) with observed value X = x.
• The maximum likelihood estimate of λ is λ̂ = 1/x.
• The maximum likelihood estimator of λ is λ̂ = 1/X.

Why are we interested in estimators? The answer is that estimators are random variables. This means they have distributions, means, and variances that tell us how well we can trust our single observation, or estimate, from this distribution.

Good and bad estimators

Suppose that X1, X2, . . . , Xn are independent, and Xi ∼ Exponential(λ) for all i. λ is unknown, and we wish to estimate it. In Chapter 4 we calculated the maximum likelihood estimator of λ:

λ̂ = n/(X1 + X2 + · · · + Xn) = 1/X̄.

Now λ̂ is a random variable with a distribution. For a given value of n, we can calculate the p.d.f. of λ̂. How? We know that T = X1 + · · · + Xn ∼ Gamma(n, λ) when Xi ∼ i.i.d. Exponential(λ). So we know the p.d.f. of T. Now λ̂ = n/T, so we can find the p.d.f. of λ̂ using the change of variable technique.

Here are the p.d.f.s of λ̂ for two different values of n:

• Estimator 1: n = 100 — 100 pieces of information about λ.
• Estimator 2: n = 10 — 10 pieces of information about λ.

[Figure: f(λ̂) plotted against λ̂, showing the p.d.f. of Estimator 1, the p.d.f. of Estimator 2, and the true λ (unknown).]

Clearly, the more information we have, the better: the p.d.f. for n = 100 is focused much more tightly about the true value λ (unknown) than the p.d.f. for n = 10.
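The contrast between the two estimators can be seen by simulation instead of the change of variable technique. An illustrative sketch (in Python; the true value λ = 2 and the number of replications are made-up choices for the demonstration):

```python
import math
import random

# Draw the estimator lambda-hat = n / (X1 + ... + Xn) repeatedly for
# n = 10 and n = 100, with true lambda = 2, and compare the spreads.
random.seed(42)
true_lam, reps = 2.0, 4000

def draw_lam_hat(n):
    t = sum(random.expovariate(true_lam) for _ in range(n))  # T ~ Gamma(n, lam)
    return n / t

def sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

sd10 = sd([draw_lam_hat(10) for _ in range(reps)])
sd100 = sd([draw_lam_hat(100) for _ in range(reps)])
print(sd10, sd100)  # the n = 100 estimator is far more tightly clustered
```

The simulated standard deviation for n = 100 comes out several times smaller than for n = 10, matching the tight and spread-out p.d.f. curves in the figure.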

It is important to recognise what we do and don't know in this situation:

What we don't know:
• the true λ;
• WHERE we are on the p.d.f. curve.

What we do know:
• the p.d.f. of our estimator λ̂ = n/T;
• that we're SOMEWHERE on that curve.

[Figure: f(λ̂) plotted against λ̂, showing the p.d.f. of Estimator 1, the p.d.f. of Estimator 2, and the true λ (unknown).]

Because we don't know where we are on the curve, we need an estimator such that EVERYWHERE on the estimator's p.d.f. curve is guaranteed to be good.

• A good estimator has low estimator variance: everywhere on the estimator's p.d.f. curve is good.
• A poor estimator has high estimator variance: some places on the estimator's p.d.f. curve may be good, while others may be very bad. Because we don't know where we are on the curve, we can't trust any estimate from this poor estimator.

This is why we are so concerned with estimator variance. The estimator variance tells us how much the estimator can be trusted.

Note: We were lucky in this example to happen to know that T = X1 + · · · + Xn ∼ Gamma(n, λ) when Xi ∼ i.i.d. Exponential(λ), so we could find the p.d.f. of λ̂. We won't usually be so lucky: so what should we do? Use the Central Limit Theorem!

Example: calculating the maximum likelihood estimator

The following question is in the same style as the exam questions.

Let X be a continuous random variable with probability density function

fX(x) = 2(s − x)/s²  for 0 < x < s,
fX(x) = 0  otherwise,

where s is the maximum value of X and s > 0. Here, s is a parameter to be estimated.

(a) Show that E(X) = s/3.
Use E(X) = ∫₀ˢ x fX(x) dx = (2/s²) ∫₀ˢ (sx − x²) dx.

(b) Show that E(X²) = s²/6.
Use E(X²) = ∫₀ˢ x² fX(x) dx = (2/s²) ∫₀ˢ (sx² − x³) dx.

(c) Find Var(X).
Use Var(X) = E(X²) − (EX)². Answer: Var(X) = s²/18.

(d) Suppose that we make a single observation X = x. Write down the likelihood function, L(s; x), and state the range of values of s for which your answer is valid.
Answer: L(s; x) = 2(s − x)/s²  for x < s < ∞.

(e) The likelihood graph for a particular value of x is shown here. Show that the maximum likelihood estimator of s is ŝ = 2X. You should refer to the graph in your answer.

[Figure: likelihood L(s; x) plotted against s for s from 3 to 9, rising from 0 to a single peak and then decreasing; the vertical axis (Likelihood) runs from 0.0 to about 0.15.]
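The moment results in parts (a)–(c) can be checked numerically. A small sketch (in Python, not part of the question; the value s = 5 is an arbitrary choice for the check):

```python
# Verify E(X) = s/3 and Var(X) = s^2/18 for f(x) = 2(s - x)/s^2 on (0, s),
# using a simple midpoint-rule numerical integration with s = 5.
s, n = 5.0, 100000
h = s / n
xs = [(i + 0.5) * h for i in range(n)]
f = lambda x: 2 * (s - x) / s**2
ex = sum(x * f(x) for x in xs) * h        # E(X)
ex2 = sum(x * x * f(x) for x in xs) * h   # E(X^2)
var = ex2 - ex**2
print(ex, var)  # close to s/3 ~ 1.667 and s^2/18 ~ 1.389
```

Any other positive value of s gives the same agreement, since the formulas hold for all s > 0.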

Answer: Write L(s; x) = 2s⁻²(s − x). Then

dL/ds = 2[ −2s⁻³(s − x) + s⁻² ] = 2s⁻³( −2(s − x) + s ) = (2/s³)(2x − s).

So dL/ds = 0 ⇒ s = 2x or s = ∞. From the graph, we can see that s = ∞ is not the maximum. So ŝ = 2x, and thus the maximum likelihood estimator is ŝ = 2X.

(f) Find the estimator variance, Var(ŝ), in terms of s. Hence find the estimated variance, Var̂(ŝ), in terms of ŝ.

Answer: Var(ŝ) = Var(2X) = 2² Var(X) = 4 × s²/18 = 2s²/9, by (c). So also: Var̂(ŝ) = 2ŝ²/9.
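The calculus above can be sanity-checked by brute force. An illustrative sketch (in Python, not part of the question; the observation x = 3 anticipates part (g)):

```python
# Numerically locate the maximum of L(s; x) = 2(s - x)/s^2 over s > x,
# for the observation x = 3. The algebra gives the maximiser s-hat = 2x = 6.
x = 3.0
L = lambda s: 2 * (s - x) / s**2
grid = [x + 0.001 * k for k in range(1, 17001)]  # crude grid over (x, 20]
s_hat = max(grid, key=L)
print(s_hat)  # close to 2 * x = 6
```

The grid search lands (to within the grid spacing) on s = 6, agreeing with the stationary-point calculation and with the peak visible in the likelihood graph.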

(g) Suppose we make the single observation X = 3. Find the maximum likelihood estimate of s, and its estimated variance and standard error.

Answer:
ŝ = 2X = 2 × 3 = 6.
Var̂(ŝ) = 2ŝ²/9 = (2 × 6²)/9 = 8.
se(ŝ) = √Var̂(ŝ) = √8 = 2.82.

(h) Write a sentence in plain English to explain what the maximum likelihood estimate from part (g) represents.

Answer: The value ŝ = 6 is the value of s under which the observation X = 3 is more likely than it is at any other value of s.

This means ŝ is a POOR estimator here: the twice standard-error interval would be 6 − 2 × 2.82 to 6 + 2 × 2.82: that is, 0.36 to 11.64! Taking the twice standard-error interval strictly applies only to the Normal distribution, but it is a useful rule of thumb to see how 'good' the estimator is.

6.2 Hypothesis tests: in search of a distribution

When we do a hypothesis test, we need a test statistic: some random variable with a distribution that we can specify exactly under H0 and that differs under H1. It is finding the distribution that is the difficult part.

• Weird coin: is my coin fair? Let X be the number of heads out of 10 tosses. Then X ∼ Binomial(10, p). Easy: we have an easy distribution and can do a hypothesis test.

• Too many daughters? Do divers have more daughters than sons? Let X be the number of daughters out of 190 diver children. Then X ∼ Binomial(190, p). Easy.
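For Binomial examples like these, the test really is easy, because the null distribution can be computed exactly. An illustrative sketch (in Python; the observed count of 110 daughters is a made-up number, not the figure from the diver study in Chapter 2):

```python
from math import comb

# "Too many daughters?" Under H0: p = 0.5, X ~ Binomial(190, 0.5).
# For a hypothetical observed count x = 110 daughters, the one-sided
# p-value is P(X >= 110 | H0), summed directly from the Binomial pmf.
n, x = 190, 110
p_value = sum(comb(n, k) for k in range(x, n + 1)) / 2**n
print(p_value)  # well below 0.05: evidence against H0 at that count
```

No approximation is needed here; the difficulty the section goes on to describe only arises when the null distribution of a natural test statistic is not known exactly.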

• Too long between volcanoes? Let X be the length of time between volcanic eruptions. If we assume volcanoes occur as a Poisson process, then X ∼ Exponential(λ). We have a simple distribution and test statistic (X): we can test the observed length of time between eruptions and see if this is a believable observation under a hypothesized value of λ.

More advanced tests

Most things in life are not as easy as the three examples above. Here are some observations. Do they come from a distribution (any distribution) with mean 0?

[First sample of observations: all values lie within a few units of 0.]

Answer: yes, they are Normal(0, 4), but how can we tell? What about these?

[Second sample of observations: values range from about −30 to +13.]

Again, yes they do (Normal(0, 100) this time), but how can we tell? The unknown variance (4 versus 100) interferes, so that the second sample does not cluster about its mean of 0 at all.

What test statistic should we use? If we don't know that our data are Normal, and we don't know their underlying variance, what can we use as our X to test whether μ = 0?

Answer: a clever person called W. S. Gosset (1876–1937) worked out an answer. He called himself only 'Student', possibly because he (or his employers) wanted it to be kept secret that he was doing his statistical research as part of his employment at Guinness Brewery. The test that 'Student' developed is the familiar Student's t-test. It was originally developed to help Guinness decide how large a sample of people should be used in its beer tastings!

Student used the following test statistic for the unknown mean, μ:

T = (X̄ − μ) / √( Σᵢ(Xi − X̄)² / (n(n − 1)) )   (sum over i = 1, . . . , n).

Under H0: μ = 0, the distribution of T is known: T has p.d.f.

fT(t) = [ Γ(n/2) / ( √((n − 1)π) Γ((n − 1)/2) ) ] (1 + t²/(n − 1))^(−n/2)  for −∞ < t < ∞.

This is the Student's t-distribution, derived as the ratio of a Normal random variable and an independent Chi-squared random variable. If H0 is false (μ ≠ 0), observations of T will tend to lie out in the tails of this distribution.

It looks difficult. It is! Student did not derive the distribution above without a great deal of hard work. The result, however, is astonishing. Student's T-statistic gives us a test for μ = 0 (or any other value) that can be used with any large enough sample. The Student's t-test is exact when the distribution of the original data X1, . . . , Xn is Normal. For other distributions, it is still approximately valid in large samples, by the Central Limit Theorem.

Most of the statistical tests in common use have deep (and sometimes quite impenetrable) theory behind them, but once researchers had derived the distribution of a suitable test statistic, the rest was easy. As you can probably guess, the Chi-squared test for testing proportions in a contingency table also has a deep theory. In the Chi-squared goodness-of-fit test, the Pearson's chi-square test statistic is shown to have a Chi-squared distribution under H0. It produces larger values under H1.

One interesting point to note is the pivotal role of the Central Limit Theorem in all of this:

• The Central Limit Theorem produces approximate Normal distributions.
• Normal random variables squared produce Chi-squared random variables.
• Normals divided by Chi-squareds produce t-distributed random variables.
• A ratio of two Chi-squared distributions produces an F-distributed random variable.

All these things are not coincidental: the Central Limit Theorem rocks!
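Computing the T-statistic itself is straightforward; it is only its null distribution that took hard work. A minimal sketch (in Python; the eight data values are made up for illustration):

```python
import math

# Student's one-sample T-statistic for testing H0: mu = mu0,
# T = (xbar - mu0) / sqrt( sum((xi - xbar)^2) / (n(n-1)) ).
data = [1.2, -0.4, 0.8, 2.1, -0.3, 0.9, 1.5, 0.2]
mu0 = 0.0
n = len(data)
xbar = sum(data) / n
ss = sum((x - xbar) ** 2 for x in data)           # sum of squared deviations
t = (xbar - mu0) / math.sqrt(ss / (n * (n - 1)))  # T with n - 1 df
print(round(t, 3))
```

The observed value would then be compared against the tails of the t-distribution with n − 1 degrees of freedom given by the p.d.f. above; note that the unknown variance never needs to be supplied, which is exactly what makes the statistic useful.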
