This action might not be possible to undo. Are you sure you want to continue?
Notes on Probability
Peter J. Cameron
ii
Preface
Here are the course lecture notes for the course MAS108, Probability I, at Queen
Mary, University of London, taken by most Mathematics students and some others
in the ﬁrst semester.
The description of the course is as follows:
This course introduces the basic notions of probability theory and de
velops them to the stage where one can begin to use probabilistic
ideas in statistical inference and modelling, and the study of stochastic
processes. Probability axioms. Conditional probability and indepen
dence. Discrete random variables and their distributions. Continuous
distributions. Joint distributions. Independence. Expectations. Mean,
variance, covariance, correlation. Limiting distributions.
The syllabus is as follows:
1. Basic notions of probability. Sample spaces, events, relative frequency,
probability axioms.
2. Finite sample spaces. Methods of enumeration. Combinatorial probability.
3. Conditional probability. Theorem of total probability. Bayes theorem.
4. Independence of two events. Mutual independence of n events. Sampling
with and without replacement.
5. Random variables. Univariate distributions  discrete, continuous, mixed.
Standard distributions  hypergeometric, binomial, geometric, Poisson, uni
form, normal, exponential. Probability mass function, density function, dis
tribution function. Probabilities of events in terms of random variables.
6. Transformations of a single random variable. Mean, variance, median,
quantiles.
7. Joint distribution of two random variables. Marginal and conditional distri
butions. Independence.
iii
iv
8. Covariance, correlation. Means and variances of linear functions of random
variables.
9. Limiting distributions in the Binomial case.
These course notes explain the naterial in the syllabus. They have been “ﬁeld
tested” on the class of 2000. Many of the examples are taken from the course
homework sheets or past exam papers.
Set books The notes cover only material in the Probability I course. The text
books listed below will be useful for other courses on probability and statistics.
You need at most one of the three textbooks listed below, but you will need the
statistical tables.
• Probability and Statistics for Engineering and the Sciences by Jay L. De
vore (ﬁfth edition), published by Wadsworth.
Chapters 2–5 of this book are very close to the material in the notes, both in
order and notation. However, the lectures go into more detail at several points,
especially proofs. If you ﬁnd the course difﬁcult then you are advised to buy
this book, read the corresponding sections straight after the lectures, and do extra
exercises from it.
Other books which you can use instead are:
• Probability and Statistics in Engineering and Management Science by W. W.
Hines and D. C. Montgomery, published by Wiley, Chapters 2–8.
• Mathematical Statistics and Data Analysis by John A. Rice, published by
Wadsworth, Chapters 1–4.
You should also buy a copy of
• New Cambridge Statistical Tables by D. V. Lindley and W. F. Scott, pub
lished by Cambridge University Press.
You need to become familiar with the tables in this book, which will be provided
for you in examinations. All of these books will also be useful to you in the
courses Statistics I and Statistical Inference.
The next book is not compulsory but introduces the ideas in a friendly way:
• Taking Chances: Winning with Probability, by John Haigh, published by
Oxford University Press.
v
Web resources Course material for the MAS108 course is kept on the Web at
the address
http://www.maths.qmw.ac.uk/
˜
pjc/MAS108/
This includes a preliminary version of these notes, together with coursework
sheets, test and past exam papers, and some solutions.
Other web pages of interest include
http://www.dartmouth.edu/
˜
chance/teaching aids/
books articles/probability book/pdf.html
A textbook Introduction to Probability, by Charles M. Grinstead and J. Laurie
Snell, available free, with many exercises.
http://www.math.uah.edu/stat/
The Virtual Laboratories in Probability and Statistics, a set of webbased resources
for students and teachers of probability and statistics, where you can run simula
tions etc.
http://www.newton.cam.ac.uk/wmy2kposters/july/
The Birthday Paradox (poster in the London Underground, July 2000).
http://www.combinatorics.org/Surveys/ds5/VennEJC.html
An article on Venn diagrams by Frank Ruskey, with history and many nice pic
tures.
Web pages for other Queen Mary maths courses can be found from the online
version of the Maths Undergraduate Handbook.
Peter J. Cameron
December 2000
vi
Contents
1 Basic ideas 1
1.1 Sample space, events . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 What is probability? . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Kolmogorov’s Axioms . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Proving things from the axioms . . . . . . . . . . . . . . . . . . . 4
1.5 InclusionExclusion Principle . . . . . . . . . . . . . . . . . . . . 6
1.6 Other results about sets . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.8 Stopping rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 Questionnaire results . . . . . . . . . . . . . . . . . . . . . . . . 13
1.10 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.11 Mutual independence . . . . . . . . . . . . . . . . . . . . . . . . 16
1.12 Properties of independence . . . . . . . . . . . . . . . . . . . . . 17
1.13 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Conditional probability 23
2.1 What is conditional probability? . . . . . . . . . . . . . . . . . . 23
2.2 Genetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 The Theorem of Total Probability . . . . . . . . . . . . . . . . . 26
2.4 Sampling revisited . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Iterated conditional probability . . . . . . . . . . . . . . . . . . . 31
2.7 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Random variables 39
3.1 What are random variables? . . . . . . . . . . . . . . . . . . . . 39
3.2 Probability mass function . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Expected value and variance . . . . . . . . . . . . . . . . . . . . 41
3.4 Joint p.m.f. of two random variables . . . . . . . . . . . . . . . . 43
3.5 Some discrete random variables . . . . . . . . . . . . . . . . . . 47
3.6 Continuous random variables . . . . . . . . . . . . . . . . . . . . 55
vii
viii CONTENTS
3.7 Median, quartiles, percentiles . . . . . . . . . . . . . . . . . . . . 57
3.8 Some continuous random variables . . . . . . . . . . . . . . . . . 58
3.9 On using tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.10 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 More on joint distribution 67
4.1 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . 67
4.2 Conditional random variables . . . . . . . . . . . . . . . . . . . . 70
4.3 Joint distribution of continuous r.v.s . . . . . . . . . . . . . . . . 73
4.4 Transformation of random variables . . . . . . . . . . . . . . . . 74
4.5 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A Mathematical notation 79
B Probability and random variables 83
Chapter 1
Basic ideas
In this chapter, we don’t really answer the question ‘What is probability?’ No
body has a really good answer to this question. We take a mathematical approach,
writing down some basic axioms which probability must satisfy, and making de
ductions from these. We also look at different kinds of sampling, and examine
what it means for events to be independent.
1.1 Sample space, events
The general setting is: We perform an experiment which can have a number of
different outcomes. The sample space is the set of all possible outcomes of the
experiment. We usually call it S.
It is important to be able to list the outcomes clearly. For example, if I plant
ten bean seeds and count the number that germinate, the sample space is
S ={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
If I toss a coin three times and record the result, the sample space is
S ={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},
where (for example) HTH means ‘heads on the ﬁrst toss, then tails, then heads
again’.
Sometimes we can assume that all the outcomes are equally likely. (Don’t
assume this unless either you are told to, or there is some physical reason for
assuming it. In the beans example, it is most unlikely. In the coins example,
the assumption will hold if the coin is ‘fair’: this means that there is no physical
reason for it to favour one side over the other.) If all outcomes are equally likely,
then each has probability 1/S. (Remember that S is the number of elements in
the set S).
1
2 CHAPTER 1. BASIC IDEAS
On this point, Albert Einstein wrote, in his 1905 paper On a heuristic point
of view concerning the production and transformation of light (for which he was
awarded the Nobel Prize),
In calculating entropy by moleculartheoretic methods, the word “prob
ability” is often used in a sense differing from the way the word is
deﬁned in probability theory. In particular, “cases of equal probabil
ity” are often hypothetically stipulated when the theoretical methods
employed are deﬁnite enough to permit a deduction rather than a stip
ulation.
In other words: Don’t just assume that all outcomes are equally likely, especially
when you are given enough information to calculate their probabilities!
An event is a subset of S. We can specify an event by listing all the outcomes
that make it up. In the above example, let A be the event ‘more heads than tails’
and B the event ‘heads on last throw’. Then
A = {HHH, HHT, HTH, THH},
B = {HHH, HTH, THH, TTH}.
The probability of an event is calculated by adding up the probabilities of all
the outcomes comprising that event. So, if all outcomes are equally likely, we
have
P(A) =
A
S
.
In our example, both A and B have probability 4/8 = 1/2.
An event is simple if it consists of just a single outcome, and is compound
otherwise. In the example, A and B are compound events, while the event ‘heads
on every throw’ is simple (as a set, it is {HHH}). If A = {a} is a simple event,
then the probability of A is just the probability of the outcome a, and we usually
write P(a), which is simpler to write than P({a}). (Note that a is an outcome,
while {a} is an event, indeed a simple event.)
We can build new events from old ones:
• A∪B (read ‘A union B’) consists of all the outcomes in A or in B (or both!)
• A∩B (read ‘A intersection B’) consists of all the outcomes in both A and B;
• A\B (read ‘A minus B’) consists of all the outcomes in A but not in B;
• A
(read ‘A complement’) consists of all outcomes not in A (that is, S \A);
• / 0 (read ‘empty set’) for the event which doesn’t contain any outcomes.
1.2. WHAT IS PROBABILITY? 3
Note the backwardsloping slash; this is not the same as either a vertical slash  or
a forward slash /.
In the example, A
is the event ‘more tails than heads’, and A∩B is the event
{HHH, THH, HTH}. Note that P(A∩B) = 3/8; this is not equal to P(A) · P(B),
despite what you read in some books!
1.2 What is probability?
There is really no answer to this question.
Some people think of it as ‘limiting frequency’. That is, to say that the proba
bility of getting heads when a coin is tossed means that, if the coin is tossed many
times, it is likely to come down heads about half the time. But if you toss a coin
1000 times, you are not likely to get exactly 500 heads. You wouldn’t be surprised
to get only 495. But what about 450, or 100?
Some people would say that you can work out probability by physical argu
ments, like the one we used for a fair coin. But this argument doesn’t work in all
cases, and it doesn’t explain what probability means.
Some people say it is subjective. You say that the probability of heads in a
coin toss is 1/2 because you have no reason for thinking either heads or tails more
likely; you might change your view if you knew that the owner of the coin was a
magician or a con man. But we can’t build a theory on something subjective.
We regard probability as a mathematical construction satisfying some axioms
(devised by the Russian mathematician A. N. Kolmogorov). We develop ways of
doing calculations with probability, so that (for example) we can calculate how
unlikely it is to get 480 or fewer heads in 1000 tosses of a fair coin. The answer
agrees well with experiment.
1.3 Kolmogorov’s Axioms
Remember that an event is a subset of the sample space S. A number of events,
say A
1
, A
2
, . . ., are called mutually disjoint or pairwise disjoint if A
i
∩A
j
= / 0 for
any two of the events A
i
and A
j
; that is, no two of the events overlap.
According to Kolmogorov’s axioms, each event A has a probability P(A),
which is a number. These numbers satisfy three axioms:
Axiom 1: For any event A, we have P(A) ≥0.
Axiom 2: P(S) = 1.
4 CHAPTER 1. BASIC IDEAS
Axiom 3: If the events A
1
, A
2
, . . . are pairwise disjoint, then
P(A
1
∪A
2
∪· · ·) = P(A
1
) +P(A
2
) +· · ·
Note that in Axiom 3, we have the union of events and the sum of numbers.
Don’t mix these up; never write P(A
1
) ∪P(A
2
), for example. Sometimes we sep
arate Axiom 3 into two parts: Axiom 3a if there are only ﬁnitely many events
A
1
, A
2
, . . . , A
n
, so that we have
P(A
1
∪· · · ∪A
n
) =
n
∑
i=1
P(A
i
),
and Axiom 3b for inﬁnitely many. We will only use Axiom 3a, but 3b is important
later on.
Notice that we write
n
∑
i=1
P(A
i
)
for
P(A
1
) +P(A
2
) +· · · +P(A
n
).
1.4 Proving things from the axioms
You can prove simple properties of probability from the axioms. That means,
every step must be justiﬁed by appealing to an axiom. These properties seem
obvious, just as obvious as the axioms; but the point of this game is that we assume
only the axioms, and build everything else from that.
Here are some examples of things proved from the axioms. There is really no
difference between a theorem, a proposition, and a corollary; they all have to be
proved. Usually, a theorem is a big, important statement; a proposition a rather
smaller statement; and a corollary is something that follows quite easily from a
theorem or proposition that came before.
Proposition 1.1 If the event A contains only a ﬁnite number of outcomes, say
A ={a
1
, a
2
, . . . , a
n
}, then
P(A) = P(a
1
) +P(a
2
) +· · · +P(a
n
).
To prove the proposition, we deﬁne a new event A
i
containing only the out
come a
i
, that is, A
i
= {a
i
}, for i = 1, . . . , n. Then A
1
, . . . , A
n
are mutually disjoint
1.4. PROVING THINGS FROM THE AXIOMS 5
(each contains only one element which is in none of the others), and A
1
∪A
2
∪
· · · ∪A
n
= A; so by Axiom 3a, we have
P(A) = P(a
1
) +P(a
2
) +· · · +P(a
n
).
Corollary 1.2 If the sample space S is ﬁnite, say S ={a
1
, . . . , a
n
}, then
P(a
1
) +P(a
2
) +· · · +P(a
n
) = 1.
For P(a
1
) +P(a
2
) +· · · +P(a
n
) = P(S) by Proposition 1.1, and P(S) = 1 by
Axiom 2. Notice that once we have proved something, we can use it on the same
basis as an axiom to prove further facts.
Now we see that, if all the n outcomes are equally likely, and their probabil
ities sum to 1, then each has probability 1/n, that is, 1/S. Now going back to
Proposition 1.1, we see that, if all outcomes are equally likely, then
P(A) =
A
S
for any event A, justifying the principle we used earlier.
Proposition 1.3 P(A
) = 1−P(A) for any event A.
Let A
1
= A and A
2
= A
(the complement of A). Then A
1
∩A
2
= / 0 (that is, the
events A
1
and A
2
are disjoint), and A
1
∪A
2
= S. So
P(A
1
) +P(A
2
) = P(A
1
∪A
2
) (Axiom 3)
= P(S)
= 1 (Axiom 2).
So P(A) = P(A
1
) = 1−P(A
2
).
Corollary 1.4 P(A) ≤1 for any event A.
For 1 −P(A) = P(A
) by Proposition 1.3, and P(A
) ≥ 0 by Axiom 1; so 1 −
P(A) ≥0, from which we get P(A) ≤1.
Remember that if you ever calculate a probability to be less than 0 or more
than 1, you have made a mistake!
Corollary 1.5 P(/ 0) = 0.
For / 0 =S
, so P(/ 0) = 1−P(S) by Proposition 1.3; and P(S) = 1 by Axiom 2,
so P(/ 0) = 0.
6 CHAPTER 1. BASIC IDEAS
Here is another result. The notation A ⊆B means that A is contained in B, that
is, every outcome in A also belongs to B.
Proposition 1.6 If A ⊆B, then P(A) ≤P(B).
This time, take A
1
= A, A
2
= B\ A. Again we have A
1
∩A
2
= / 0 (since the
elements of B\A are, by deﬁnition, not in A), and A
1
∪A
2
= B. So by Axiom 3,
P(A
1
) +P(A
2
) = P(A
1
∪A
2
) = P(B).
In other words, P(A) +P(B\A) = P(B). Now P(B\A) ≥0 by Axiom 1; so
P(A) ≤P(B),
as we had to show.
1.5 InclusionExclusion Principle
_
`
¸
_
`
¸
A B
A Venn diagram for two sets A and B suggests that, to ﬁnd the size of A∪B,
we add the size of A and the size of B, but then we have included the size of A∩B
twice, so we have to take it off. In terms of probability:
Proposition 1.7
P(A∪B) = P(A) +P(B) −P(A∩B).
We now prove this from the axioms, using the Venn diagram as a guide. We
see that A∪B is made up of three parts, namely
A
1
= A∩B, A
2
= A\B, A
3
= B\A.
Indeed we do have A∪B = A
1
∪A
2
∪A
3
, since anything in A∪B is in both these
sets or just the ﬁrst or just the second. Similarly we have A
1
∪A
2
=A and A
1
∪A
3
=
B.
The sets A
1
, A
2
, A
3
are mutually disjoint. (We have three pairs of sets to check.
Now A
1
∩A
2
= / 0, since all elements of A
1
belong to B but no elements of A
2
do.
The arguments for the other two pairs are similar – you should do them yourself.)
1.6. OTHER RESULTS ABOUT SETS 7
So, by Axiom 3, we have
P(A) = P(A
1
) +P(A
2
),
P(B) = P(A
1
) +P(A
3
),
P(A∪B) = P(A
1
) +P(A
2
) +P(A
3
).
From this we obtain
P(A) +P(B) −P(A∩B) = (P(A
1
) +P(A
2
)) +(P(A
1
) +P(A
3
)) −P(A
1
)
= P(A
1
) +P(A
2
) +P(A
3
)
= P(A∪B)
as required.
The InclusionExclusion Principle extends to more than two events, but gets
more complicated. Here it is for three events; try to prove it yourself.
_
`
¸
_
`
¸
_
`
¸
A B
C
To calculate P(A∪B∪C), we ﬁrst add up P(A), P(B), and P(C). The parts in
common have been counted twice, so we subtract P(A∩B), P(A∩C) and P(B∩C).
But then we ﬁnd that the outcomes lying in all three sets have been taken off
completely, so must be put back, that is, we add P(A∩B∩C).
Proposition 1.8 For any three events A, B,C, we have
P(A∪B∪C) =P(A)+P(B)+P(C)−P(A∩B)−P(A∩C)−P(B∩C)+P(A∩B∩C).
Can you extend this to any number of events?
1.6 Other results about sets
There are other standard results about sets which are often useful in probability
theory. Here are some examples.
Proposition 1.9 Let A, B,C be subsets of S.
Distributive laws: (A∩B) ∪C = (A∪C) ∩(B∪C) and
(A∪B) ∩C = (A∩C) ∪(B∩C).
De Morgan’s Laws: (A∪B)
= A
∩B
and (A∩B)
= A
∪B
.
We will not give formal proofs of these. You should draw Venn diagrams and
convince yourself that they work.
8 CHAPTER 1. BASIC IDEAS
1.7 Sampling
I have four pens in my desk drawer; they are red, green, blue, and purple. I draw a
pen; each pen has the same chance of being selected. In this case, S ={R, G, B, P},
where R means ‘red pen chosen’ and so on. In this case, if A is the event ‘red or
green pen chosen’, then
P(A) =
A
S
=
2
4
=
1
2
.
More generally, if I have a set of n objects and choose one, with each one
equally likely to be chosen, then each of the n outcomes has probability 1/n, and
an event consisting of m of the outcomes has probability m/n.
What if we choose more than one pen? We have to be more careful to specify
the sample space.
First, we have to say whether we are
• sampling with replacement, or
• sampling without replacement.
Sampling with replacement means that we choose a pen, note its colour, put
it back and shake the drawer, then choose a pen again (which may be the same
pen as before or a different one), and so on until the required number of pens have
been chosen. If we choose two pens with replacement, the sample space is
{RR, RG, RB, RP,
GR, GG, GB, GP,
BR, BG, BB, BP,
PR, PG, PB, PP}
The event ‘at least one red pen’ is {RR, RG, RB, RP, GR, BR, PR}, and has proba
bility 7/16.
Sampling without replacement means that we choose a pen but do not put it
back, so that our ﬁnal selection cannot include two pens of the same colour. In
this case, the sample space for choosing two pens is
{ RG, RB, RP,
GR, GB, GP,
BR, BG, BP,
PR, PG, PB }
and the event ‘at least one red pen’ is {RG, RB, RP, GR, BR, PR}, with probability
6/12 = 1/2.
1.7. SAMPLING 9
Now there is another issue, depending on whether we care about the order in
which the pens are chosen. We will only consider this in the case of sampling
without replacement. It doesn’t really matter in this case whether we choose the
pens one at a time or simply take two pens out of the drawer; and we are not
interested in which pen was chosen ﬁrst. So in this case the sample space is
{{R, G}, {R, B}, {R, P}, {G, B}, {G, P}, {B, P}},
containing six elements. (Each element is written as a set since, in a set, we don’t
care which element is ﬁrst, only which elements are actually present. So the sam
ple space is a set of sets!) The event ‘at least one red pen’ is {{R, G}, {R, B}, {R, P}},
with probability 3/6 = 1/2. We should not be surprised that this is the same as in
the previous case.
There are formulae for the sample space size in these three cases. These in
volve the following functions:
n! = n(n−1)(n−2)· · · 1
n
P
k
= n(n−1)(n−2)· · · (n−k +1)
n
C
k
=
n
P
k
/k!
Note that n! is the product of all the whole numbers from 1 to n; and
n
P
k
=
n!
(n−k)!
,
so that
n
C
k
=
n!
k!(n−k)!
.
Theorem 1.10 The number of selections of k objects from a set of n objects is
given in the following table.
with replacement without replacement
ordered sample n
k n
P
k
unordered sample
n
C
k
In fact the number that goes in the empty box is
n+k−1
C
k
, but this is much
harder to prove than the others, and you are very unlikely to need it.
Here are the proofs of the other three cases. First, for sampling with replace
ment and ordered sample, there are n choices for the ﬁrst object, and n choices
for the second, and so on; we multiply the choices for different objects. (Think of
the choices as being described by a branching tree.) The product of k factors each
equal to n is n
k
.
10 CHAPTER 1. BASIC IDEAS
For sampling without replacement and ordered sample, there are still n choices
for the ﬁrst object, but now only n −1 choices for the second (since we do not
replace the ﬁrst), and n−2 for the third, and so on; there are n−k +1 choices for
the kth object, since k −1 have previously been removed and n−(k −1) remain.
As before, we multiply. This product is the formula for
n
P
k
.
For sampling without replacement and unordered sample, think ﬁrst of choos
ing an ordered sample, which we can do in
n
P
k
ways. But each unordered sample
could be obtained by drawing it in k! different orders. So we divide by k!, obtain
ing
n
P
k
/k! =
n
C
k
choices.
In our example with the pens, the numbers in the three boxes are 4
2
= 16,
4
P
2
= 12, and
4
C
2
= 6, in agreement with what we got when we wrote them all
out.
Note that, if we use the phrase ‘sampling without replacement, ordered sam
ple’, or any other combination, we are assuming that all outcomes are equally
likely.
Example The names of the seven days of the week are placed in a hat. Three
names are drawn out; these will be the days of the Probability I lectures. What is
the probability that no lecture is scheduled at the weekend?
Here the sampling is without replacement, and we can take it to be either
ordered or unordered; the answers will be the same. For ordered samples, the
size of the sample space is
7
P
3
= 7 · 6 · 5 = 210. If A is the event ‘no lectures at
weekends’, then A occurs precisely when all three days drawn are weekdays; so
A =
5
P
3
= 5· 4· 3 = 60. Thus, P(A) = 60/210 = 2/7.
If we decided to use unordered samples instead, the answer would be
5
C
3
/
7
C
3
,
which is once again 2/7.
Example A sixsided die is rolled twice. What is the probability that the sum of
the numbers is at least 10?
This time we are sampling with replacement, since the two numbers may be
the same or different. So the number of elements in the sample space is 6
2
= 36.
To obtain a sum of 10 or more, the possibilities for the two numbers are (4, 6),
(5, 5), (6, 4), (5, 6), (6, 5) or (6, 6). So the probability of the event is 6/36 = 1/6.
Example A box contains 20 balls, of which 10 are red and 10 are blue. We draw
ten balls from the box, and we are interested in the event that exactly 5 of the balls
are red and 5 are blue. Do you think that this is more likely to occur if the draws
are made with or without replacement?
Let S be the sample space, and A the event that ﬁve balls are red and ﬁve are
blue.
1.7. SAMPLING 11
Consider sampling with replacement. Then S = 20
10
. What is A? The
number of ways in which we can choose ﬁrst ﬁve red balls and then ﬁve blue ones
(that is, RRRRRBBBBB), is 10
5
· 10
5
= 10
10
. But there are many other ways to get
ﬁve red and ﬁve blue balls. In fact, the ﬁve red balls could appear in any ﬁve of
the ten draws. This means that there are
10
C
5
= 252 different patterns of ﬁve Rs
and ﬁve Bs. So we have
A = 252· 10
10
,
and so
P(A) =
252· 10
10
20
10
= 0.246. . .
Now consider sampling without replacement. If we regard the sample as being
ordered, then S =
20
P
10
. There are
10
P
5
ways of choosing ﬁve of the ten red
balls, and the same for the ten blue balls, and as in the previous case there are
10
C
5
patterns of red and blue balls. So
A = (
10
P
5
)
2
·
10
C
5
,
and
P(A) =
(
10
P
5
)
2
·
10
C
5
20
P
10
= 0.343. . .
If we regard the sample as being unordered, then S =
20
C
10
. There are
10
C
5
choices of the ﬁve red balls and the same for the blue balls. We no longer have to
count patterns since we don’t care about the order of the selection. So
A = (
10
C
5
)
2
,
and
P(A) =
(
10
C
5
)
2
20
C
10
= 0.343. . .
This is the same answer as in the case before, as it should be; the question doesn’t
care about order of choices!
So the event is more likely if we sample with replacement.
Example I have 6 gold coins, 4 silver coins and 3 bronze coins in my pocket. I
take out three coins at random. What is the probability that they are all of different
material? What is the probability that they are all of the same material?
In this case the sampling is without replacement and the sample is unordered.
So S =
13
C
3
= 286. The event that the three coins are all of different material
can occur in 6· 4· 3 = 72 ways, since we must have one of the six gold coins, and
so on. So the probability is 72/286 = 0.252. . .
12 CHAPTER 1. BASIC IDEAS
The event that the three coins are of the same material can occur in
6
C
3
+
4
C
3
+
3
C
3
= 20+4+1 = 25
ways, and the probability is 25/286 = 0.087. . .
In a sampling problem, you should ﬁrst read the question carefully and decide
whether the sampling is with or without replacement. If it is without replacement,
decide whether the sample is ordered (e.g. does the question say anything about
the ﬁrst object drawn?). If so, then use the formula for ordered samples. If not,
then you can use either ordered or unordered samples, whichever is convenient;
they should give the same answer. If the sample is with replacement, or if it
involves throwing a die or coin several times, then use the formula for sampling
with replacement.
1.8 Stopping rules
Suppose that you take a typing proﬁciency test. You are allowed to take the test
up to three times. Of course, if you pass the test, you don’t need to take it again.
So the sample space is
S ={p, f p, f f p, f f f },
where for example f f p denotes the outcome that you fail twice and pass on your
third attempt.
If all outcomes were equally likely, then your chance of eventually passing the
test and getting the certiﬁcate would be 3/4.
But it is unreasonable here to assume that all the outcomes are equally likely.
For example, you may be very likely to pass on the ﬁrst attempt. Let us assume
that the probability that you pass the test is 0.8. (By Proposition 3, your chance
of failing is 0.2.) Let us further assume that, no matter how many times you have
failed, your chance of passing at the next attempt is still 0.8. Then we have
P(p) = 0.8,
P( f p) = 0.2· 0.8 = 0.16,
P( f f p) = 0.2
2
· 0.8 = 0.032,
P( f f f ) = 0.2
3
= 0.008.
Thus the probability that you eventually get the certiﬁcate is P({p, f p, f f p}) =
0.8+0.16+0.032 =0.992. Alternatively, you eventually get the certiﬁcate unless
you fail three times, so the probability is 1−0.008 = 0.992.
Astopping rule is a rule of the type described here, namely, continue the exper
iment until some speciﬁed occurrence happens. The experiment may potentially
be inﬁnite.
1.9. QUESTIONNAIRE RESULTS 13
For example, if you toss a coin repeatedly until you obtain heads, the sample
space is
S ={H, TH, TTH, TTTH, . . .}
since in principle you may get arbitrarily large numbers of tails before the ﬁrst
head. (We have to allow all possible outcomes.)
In the typing test, the rule is ‘stop if either you pass or you have taken the test
three times’. This ensures that the sample space is ﬁnite.
In the next chapter, we will have more to say about the ‘multiplication rule’ we
used for calculating the probabilities. In the meantime you might like to consider
whether it is a reasonable assumption for tossing a coin, or for someone taking a
series of tests.
Other kinds of stopping rules are possible. For example, the number of coin
tosses might be determined by some other random process such as the roll of a
die; or we might toss a coin until we have obtained heads twice; and so on. We
will not deal with these.
1.9 Questionnaire results
The students in the Probability I class in Autumn 2000 ﬁlled in the following
questionnaire:
1. I have a hat containing 20 balls, 10 red and 10 blue. I draw 10 balls
from the hat. I am interested in the event that I draw exactly ﬁve red and
ﬁve blue balls. Do you think that this is more likely if I note the colour of
each ball I draw and replace it in the hat, or if I don’t replace the balls in
the hat after drawing?
More likely with replacement 2 More likely without replacement 2
2. What colour are your eyes?
Blue 2 Brown 2 Green 2 Other 2
3. Do you own a mobile phone? Yes 2 No 2
After discarding incomplete questionnaires, the results were as follows:
Answer to “More likely “More likely
question with replacement” without replacement”
Eyes Brown Other Brown Other
Mobile phone 35 4 35 9
No mobile phone 10 3 7 1
14 CHAPTER 1. BASIC IDEAS
What can we conclude?
Half the class thought that, in the experiment with the coloured balls, sampling
with replacement make the result more likely. In fact, as we saw in Chapter 1,
actually it is more likely if we sample without replacement. (This doesn’t matter,
since the students were instructed not to think too hard about it!)
You might expect that eye colour and mobile phone ownership would have no
inﬂuence on your answer. Let’s test this. If true, then of the 87 people with brown
eyes, half of them (i.e. 43 or 44) would answer “with replacement”, whereas in
fact 45 did. Also, of the 83 people with mobile phones, we would expect half (that
is, 41 or 42) would answer “with replacement”, whereas in fact 39 of them did. So
perhaps we have demonstrated that people who own mobile phones are slightly
smarter than average, whereas people with brown eyes are slightly less smart!
In fact we have shown no such thing, since our results refer only to the peo
ple who ﬁlled out the questionnaire. But they do show that these events are not
independent, in a sense we will come to soon.
On the other hand, since 83 out of 104 people have mobile phones, if we
think that phone ownership and eye colour are independent, we would expect
that the same fraction 83/104 of the 87 browneyed people would have phones,
i.e. (83 · 87)/104 = 69.4 people. In fact the number is 70, or as near as we can
expect. So indeed it seems that eye colour and phone ownership are moreorless
independent.
1.10 Independence
Two events A and B are said to be independent if
P(A∩B) = P(A) · P(B).
This is the deﬁnition of independence of events. If you are asked in an exam
to deﬁne independence of events, this is the correct answer. Do not say that two
events are independent if one has no inﬂuence on the other; and under no circum
stances say that A and B are independent if A∩B = / 0 (this is the statement that
A and B are disjoint, which is quite a different thing!) Also, do not ever say that
P(A∩B) = P(A) · P(B) unless you have some good reason for assuming that A
and B are independent (either because this is given in the question, or as in the
nextbutone paragraph).
Let us return to the questionnaire example. Suppose that a student is chosen
at random from those who ﬁlled out the questionnaire. Let A be the event that this
student thought that the event was more likely if we sample with replacement; B
the event that the student has brown eyes; and C the event that the student has a
1.10. INDEPENDENCE 15
mobile phone. Then
P(A) = 52/104 = 0.5,
P(B) = 87/104 = 0.8365,
P(C) = 83/104 = 0.7981.
Furthermore,
P(A∩B) = 45/104 = 0.4327, P(A) · P(B) = 0.4183,
P(A∩C) = 39/104 = 0.375, P(A) · P(C) = 0.3990,
P(B∩C) = 70/104 = 0.6731, P(B) ∩P(C) = 0.6676.
So none of the three pairs is independent, but in a sense B and C ‘come closer’
than either of the others, as we noted.
In practice, if it is the case that the event A has no effect on the outcome
of event B, then A and B are independent. But this does not apply in the other
direction. There might be a very deﬁnite connection between A and B, but still it
could happen that P(A∩B) = P(A) · P(B), so that A and B are independent. We
will see an example shortly.
Example If we toss a coin more than once, or roll a die more than once, then
you may assume that different tosses or rolls are independent. More precisely,
if we roll a fair sixsided die twice, then the probability of getting 4 on the ﬁrst
throw and 5 on the second is 1/36, since we assume that all 36 combinations of
the two throws are equally likely. But (1/36) = (1/6) · (1/6), and the separate
probabilities of getting 4 on the ﬁrst throw and of getting 5 on the second are both
equal to 1/6. So the two events are independent. This would work just as well for
any other combination.
In general, it is always OK to assume that the outcomes of different tosses of a
coin, or different throws of a die, are independent. This holds even if the examples
are not all equally likely. We will see an example later.
Example I have two red pens, one green pen, and one blue pen. I choose two
pens without replacement. Let A be the event that I choose exactly one red pen,
and B the event that I choose exactly one green pen.
If the pens are called R
1
, R
2
, G, B, then
S = {R
1
R
2
, R
1
G, R
1
B, R
2
G, R
2
B, GB},
A = {R
1
G, R
1
B, R
2
G, R
2
B},
B = {R
1
G, R
2
G, GB}
16 CHAPTER 1. BASIC IDEAS
We have P(A) =4/6 =2/3, P(B) =3/6 =1/2, P(A∩B) =2/6 =1/3 =P(A)P(B),
so A and B are independent.
But before you say ‘that’s obvious’, suppose that I have also a purple pen,
and I do the same experiment. This time, if you write down the sample space
and the two events and do the calculations, you will ﬁnd that P(A) = 6/10 = 3/5,
P(B) = 4/10 = 2/5, P(A∩B) = 2/10 = 1/5 = P(A)P(B), so adding one more
pen has made the events nonindependent!
We see that it is very difﬁcult to tell whether events are independent or not. In
practice, assume that events are independent only if either you are told to assume
it, or the events are the outcomes of different throws of a coin or die. (There is
one other case where you can assume independence: this is the result of different
draws, with replacement, from a set of objects.)
Example Consider the experiment where we toss a fair coin three times and
note the results. Each of the eight possible outcomes has probability 1/8. Let A
be the event ‘there are more heads than tails’, and B the event ‘the results of the
ﬁrst two tosses are the same’. Then
• A ={HHH, HHT, HTH, THH}, P(A) = 1/2,
• B ={HHH, HHT, TTH, TTT}, P(B) = 1/2,
• A∩B ={HHH, HHT}, P(A∩B) = 1/4;
so A and B are independent. However, both A and B clearly involve the results of
the ﬁrst two tosses and it is not possible to make a convincing argument that one
of these events has no inﬂuence or effect on the other. For example, let C be the
event ‘heads on the last toss’. Then, as we saw in Part 1,
• C ={HHH, HTH, THH, TTH}, P(C) = 1/2,
• A∩C ={HHH, HTH, THH}, P(A∩C) = 3/8;
so A and C are not independent.
Are B and C independent?
1.11 Mutual independence
This section is a bit technical. You will need to know the conclusions, though the
arguments we use to reach them are not so important.
We saw in the cointossing example above that it is possible to have three
events A, B,C so that A and B are independent, B and C are independent, but A and
C are not independent.
1.12. PROPERTIES OF INDEPENDENCE 17
If all three pairs of events happen to be independent, can we then conclude
that P(A∩B∩C) = P(A) · P(B) · P(C)? At ﬁrst sight this seems very reasonable;
in Axiom 3, we only required all pairs of events to be exclusive in order to justify
our conclusion. Unfortunately it is not true. . .
Example In the cointossing example, let A be the event ‘ﬁrst and second tosses
have same result’, B the event ‘ﬁrst and third tosses have the same result, and
C the event ‘second and third tosses have same result’. You should check that
P(A) = P(B) = P(C) = 1/2, and that the events A∩B, B∩C, A∩C, and A∩B∩C
are all equal to {HHH, TTT}, with probability 1/4. Thus any pair of the three
events are independent, but
P(A∩B∩C) = 1/4,
P(A) · P(B) · P(C) = 1/8.
So A, B,C are not mutually independent.
The correct deﬁnition and proposition run as follows.
Let A
1
, . . . , A
n
be events. We say that these events are mutually independent if,
given any distinct indices i
1
, i
2
, . . . , i
k
with k ≥1, the events
A
i
1
∩A
i
2
∩· · · ∩A
i
k−1
and A
i
k
are independent. In other words, any one of the events is independent of the
intersection of any number of the other events in the set.
Proposition 1.11 Let A
1
, . . . , A
n
be mutually independent. Then
P(A
1
∩A
2
∩· · · ∩A
n
) = P(A
1
) · P(A
2
)· · · P(A
n
).
Now all you really need to know is that the same ‘physical’ arguments that
justify that two events (such as two tosses of a coin, or two throws of a die) are
independent, also justify that any number of such events are mutually independent.
So, for example, if we toss a fair coin six times, the probability of getting the
sequence HHTHHT is (1/2)
6
= 1/64, and the same would apply for any other
sequence. In other words, all 64 possible outcomes are equally likely.
1.12 Properties of independence
Proposition 1.12 If A and B are independent, then A and B
are independent.
18 CHAPTER 1. BASIC IDEAS
We are given that P(A∩B) = P(A) · P(B), and asked to prove that P(A∩B
) =
P(A) · P(B
).
From Corollary 4, we know that P(B
) = 1−P(B). Also, the events A∩B and
A∩B
are disjoint (since no outcome can be both in B and B
), and their union
is A (since every event in A is either in B or in B
); so by Axiom 3, we have that
P(A) = P(A∩B) +P(A∩B
). Thus,
P(A∩B
) = P(A) −P(A∩B)
= P(A) −P(A) · P(B)
(since A and B are independent)
= P(A)(1−P(B))
= P(A) · P(B
),
which is what we were required to prove.
Corollary 1.13 If A and B are independent, so are A
and B
.
Apply the Proposition twice, ﬁrst to A and B (to show that A and B
are inde
pendent), and then to B
and A (to show that B
and A
are independent).
More generally, if events A
1
, . . . , A
n
are mutually independent, and we replace
some of them by their complements, then the resulting events are mutually inde
pendent. We have to be a bit careful though. For example, A and A
are not usually
independent!
Results like the following are also true.
Proposition 1.14 Let events A, B, C be mutually independent. Then A and B∩C
are independent, and A and B∪C are independent.
Example Consider the example of the typing proﬁciency test that we looked at
earlier. You are allowed up to three attempts to pass the test.
Suppose that your chance of passing the test is 0.8. Suppose also that the
events of passing the test on any number of different occasions are mutually inde
pendent. Then, by Proposition 1.11, the probability of any sequence of passes and
fails is the product of the probabilities of the terms in the sequence. That is,
P(p) = 0.8, P( f p) = (0.2) · (0.8), P( f f p) = (0.2)
2
· (0.8), P( f f f ) = (0.2)
3
,
as we claimed in the earlier example.
In other words, mutual independence is the condition we need to justify the
argument we used in that example.
1.12. PROPERTIES OF INDEPENDENCE 19
Example
The electrical apparatus in the diagram
works so long as current can ﬂow from left
to right. The three components are inde
pendent. The probability that component A
works is 0.8; the probability that compo
nent B works is 0.9; and the probability that
component C works is 0.75.
Find the probability that the apparatus works.
`
_¸
A
`
_¸
B
`
_¸
C
At risk of some confusion, we use the letters A, B and C for the events ‘com
ponent A works’, ‘component B works’, and ‘component C works’, respectively.
Now the apparatus will work if either A and B are working, or C is working (or
possibly both). Thus the event we are interested in is (A∩B) ∪C.
Now
P((A∩B) ∪C)) = P(A∩B) +P(C) −P(A∩B∩C)
(by Inclusion–Exclusion)
= P(A) · P(B) +P(C) −P(A) · P(B) · P(C)
(by mutual independence)
= (0.8) · (0.9) +(0.75) −(0.8) · (0.9) · (0.75)
= 0.93.
The problem can also be analysed in a different way. The apparatus will not
work if both paths are blocked, that is, if C is not working and one of A and B is
also not working. Thus, the event that the apparatus does not work is (A
∪B
)∩C
.
By the Distributive Law, this is equal to (A
∩C
) ∪(B
∩C
). We have
P((A
∩C
) ∪(B
∩C
) = P(A
∩C
) +P(B
∩C
) −P(A
∩B
∩C
)
(by Inclusion–Exclusion)
= P(A
) · P(C
) +P(B
) · P(C
) −P(A
) · P(B
) · P(C
)
(by mutual independence of A
, B
,C
)
= (0.2) · (0.25) +(0.1) · (0.25) −(0.2) · (0.1) · (0.25)
= 0.07,
so the apparatus works with probability 1−0.07 = 0.93.
There is a trap here which you should take care to avoid. You might be tempted
to say P(A
∩C
) = (0.2) · (0.25) = 0.05, and P(B
∩C
) = (0.1) · (0.25) = 0.025;
and conclude that
P((A
∩C
) ∪(B
∩C
)) = 0.05+0.025−(0.05) · (0.025) = 0.07375
by the Principle of Inclusion and Exclusion. But this is not correct, since the
events A
∩C
and B
∩C
are not independent!
20 CHAPTER 1. BASIC IDEAS
Example We can always assume that successive tosses of a coin are mutually
independent, even if it is not a fair coin. Suppose that I have a coin which has
probability 0.6 of coming down heads. I toss the coin three times. What are the
probabilities of getting three heads, two heads, one head, or no heads?
For three heads, since successive tosses are mutually independent, the proba
bility is (0.6)
3
= 0.216.
The probability of tails on any toss is 1 −0.6 = 0.4. Now the event ‘two
heads’ can occur in three possible ways, as HHT, HTH, or THH. Each outcome
has probability (0.6) · (0.6) · (0.4) = 0.144. So the probability of two heads is
3· (0.144) = 0.432.
Similarly the probability of one head is 3· (0.6) · (0.4)
2
= 0.288, and the prob
ability of no heads is (0.4)
3
= 0.064.
As a check, we have
0.216+0.432+0.288+0.064 = 1.
1.13 Worked examples
Question
(a) You go to the shop to buy a toothbrush. The toothbrushes there are red, blue,
green, purple and white. The probability that you buy a red toothbrush is
three times the probability that you buy a green one; the probability that you
buy a blue one is twice the probability that you buy a green one; the proba
bilities of buying green, purple, and white are all equal. You are certain to
buy exactly one toothbrush. For each colour, ﬁnd the probability that you
buy a toothbrush of that colour.
(b) James and Simon share a ﬂat, so it would be confusing if their toothbrushes
were the same colour. On the ﬁrst day of term they both go to the shop to
buy a toothbrush. For each of James and Simon, the probability of buying
various colours of toothbrush is as calculated in (a), and their choices are
independent. Find the probability that they buy toothbrushes of the same
colour.
(c) James and Simon live together for three terms. On the ﬁrst day of each term
they buy new toothbrushes, with probabilities as in (b), independently of
what they had bought before. This is the only time that they change their
toothbrushes. Find the probablity that James and Simon have differently
coloured toothbrushes from each other for all three terms. Is it more likely
that they will have differently coloured toothbrushes from each other for
1.13. WORKED EXAMPLES 21
all three terms or that they will sometimes have toothbrushes of the same
colour?
Solution
(a) Let R, B, G, P,W be the events that you buy a red, blue, green, purple and
white toothbrush respectively. Let x = P(G). We are given that
P(R) = 3x, P(B) = 2x, P(P) = P(W) = x.
Since these outcomes comprise the whole sample space, Corollary 2 gives
3x +2x +x +x +x = 1,
so x = 1/8. Thus, the probabilities are 3/8, 1/4, 1/8, 1/8, 1/8 respectively.
(b) Let RB denote the event ‘James buys a red toothbrush and Simon buys a blue
toothbrush’, etc. By independence (given), we have, for example,
P(RR) = (3/8) · (3/8) = 9/64.
The event that the toothbrushes have the same colour consists of the ﬁve
outcomes RR, BB, GG, PP, WW, so its probability is
P(RR) +P(BB) +P(GG) +P(PP) +P(WW)
=
9
64
+
1
16
+
1
64
+
1
64
+
1
64
=
1
4
.
(c) The event ‘different coloured toothbrushes in the ith term’ has probability 3/4
(from part (b)), and these events are independent. So the event ‘different
coloured toothbrushes in all three terms’ has probability
3
4
·
3
4
·
3
4
=
27
64
.
The event ‘same coloured toothbrushes in at least one term’ is the comple
ment of the above, so has probability 1 −(27/64) = (37)/(64). So it is
more likely that they will have the same colour in at least one term.
Question There are 24 elephants in a game reserve. The warden tags six of the
elephants with small radio transmitters and returns them to the reserve. The next
month, he randomly selects ﬁve elephants from the reserve. He counts how many
of these elephants are tagged. Assume that no elephants leave or enter the reserve,
or die or give birth, between the tagging and the selection; and that all outcomes
of the selection are equally likely. Find the probability that exactly two of the
selected elephants are tagged, giving the answer correct to 3 decimal places.
22 CHAPTER 1. BASIC IDEAS
Solution The experiment consists of picking the ﬁve elephants, not the original
choice of six elephants for tagging. Let S be the sample space. Then S =
24
C
5
.
Let A be the event that two of the selected elephants are tagged. This involves
choosing two of the six tagged elephants and three of the eighteen untagged ones,
so A =
6
C
2
·
18
C
3
. Thus
P(A) =
6
C
2
·
18
C
3
24
C
5
= 0.288
to 3 d.p.
Note: Should the sample should be ordered or unordered? Since the answer
doesn’t depend on the order in which the elephants are caught, an unordered sam
ple is preferable. If you want to use an ordered sample, the calculation is
P(A) =
6
P
2
·
18
P
3
·
5
C
2
24
P
5
= 0.288,
since it is necessary to multiply by the
5
C
2
possible patterns of tagged and un
tagged elephants in a sample of ﬁve with two tagged.
Question A couple are planning to have a family. They decide to stop having
children either when they have two boys or when they have four children. Sup
pose that they are successful in their plan.
(a) Write down the sample space.
(b) Assume that, each time that they have a child, the probability that it is a
boy is 1/2, independent of all other times. Find P(E) and P(F) where
E = “there are at least two girls”, F = “there are more girls than boys”.
Solution (a) S ={BB, BGB, GBB, BGGB, GBGB, GGBB, BGGG, GBGG,
GGBG, GGGB, GGGG}.
(b) E ={BGGB, GBGB, GGBB, BGGG, GBGG, GGBG, GGGB, GGGG},
F ={BGGG, GBGG, GGBG, GGGB, GGGG}.
Now we have P(BB) = 1/4, P(BGB) = 1/8, P(BGGB) = 1/16, and similarly
for the other outcomes. So P(E) = 8/16 = 1/2, P(F) = 5/16.
Chapter 2
Conditional probability
In this chapter we develop the technique of conditional probability to deal with
cases where events are not independent.
2.1 What is conditional probability?
Alice and Bob are going out to dinner. They toss a fair coin ‘best of three’ to
decide who pays: if there are more heads than tails in the three tosses then Alice
pays, otherwise Bob pays.
Clearly each has a 50% chance of paying. The sample space is
S ={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},
and the events ‘Alice pays’ and ‘Bob pays’ are respectively
A ={HHH, HHT, HTH, THH},
B ={HTT, THT, TTH, TTT}.
They toss the coin once and the result is heads; call this event E. How should
we now reassess their chances? We have
E ={HHH, HHT, HTH, HTT},
and if we are given the information that the result of the ﬁrst toss is heads, then E
now becomes the sample space of the experiment, since the outcomes not in E are
no longer possible. In the new experiment, the outcomes ‘Alice pays’ and ‘Bob
pays’ are
A∩E ={HHH, HHT, HTH},
B∩E ={HTT}.
23
24 CHAPTER 2. CONDITIONAL PROBABILITY
Thus the new probabilities that Alice and Bob pay for dinner are 3/4 and 1/4
respectively.
In general, suppose that we are given that an event E has occurred, and we
want to compute the probability that another event A occurs. In general, we can no
longer count, since the outcomes may not be equally likely. The correct deﬁnition
is as follows.
Let E be an event with nonzero probability, and let A be any event. The
conditional probability of A given E is deﬁned as
P(A  E) =
P(A∩E)
P(E)
.
Again I emphasise that this is the deﬁnition. If you are asked for the deﬁnition
of conditional probability, it is not enough to say “the probability of A given that
E has occurred”, although this is the best way to understand it. There is no reason
why event E should occur before event A!
Note the vertical bar in the notation. This is P(A  E), not P(A/E) or P(A\E).
Note also that the deﬁnition only applies in the case where P(E) is not equal
to zero, since we have to divide by it, and this would make no sense if P(E) = 0.
To check the formula in our example:
P(A  E) =
P(A∩E)
P(E)
=
3/8
1/2
=
3
4
,
P(B  E) =
P(B∩E)
P(E)
=
1/8
1/2
=
1
4
.
It may seem like a small matter, but you should be familiar enough with this
formula that you can write it down without stopping to think about the names of
the events. Thus, for example,
P(A  B) =
P(A∩B)
P(B)
if P(B) = 0.
Example A random car is chosen among all those passing through Trafalgar
Square on a certain day. The probability that the car is yellow is 3/100: the
probability that the driver is blonde is 1/5; and the probability that the car is
yellow and the driver is blonde is 1/50.
Find the conditional probability that the driver is blonde given that the car is
yellow.
2.2. GENETICS 25
Solution: If Y is the event ‘the car is yellow’ and B the event ‘the driver is blonde’,
then we are given that P(Y) = 0.03, P(B) = 0.2, and P(Y ∩B) = 0.02. So
P(B  Y) =
P(B∩Y)
P(Y)
=
0.02
0.03
= 0.667
to 3 d.p. Note that we haven’t used all the information given.
There is a connection between conditional probability and independence:
Proposition 2.1 Let A and B be events with P(B) =0. Then A and B are indepen
dent if and only if P(A  B) = P(A).
Proof The words ‘if and only if’ tell us that we have two jobs to do: we have to
show that if A and B are independent, then P(A  B) = P(A); and that if P(A  B) =
P(A), then A and B are independent.
So ﬁrst suppose that A and B are independent. Remember that this means that
P(A∩B) = P(A) · P(B). Then
P(A  B) =
P(A∩B)
P(B)
=
P(A) · P(B)
P(B)
= P(A),
that is, P(A  B) = P(A), as we had to prove.
Now suppose that P(A  B) = P(A). In other words,
P(A∩B)
P(B)
= P(A),
using the deﬁnition of conditional probability. Now clearing fractions gives
P(A∩B) = P(A) · P(B),
which is just what the statement ‘A and B are independent’ means.
This proposition is most likely what people have in mind when they say ‘A
and B are independent means that B has no effect on A’.
2.2 Genetics
Here is a simpliﬁed version of how genes code eye colour, assuming only two
colours of eyes.
Each person has two genes for eye colour. Each gene is either B or b. A child
receives one gene from each of its parents. The gene it receives from its father
is one of its father’s two genes, each with probability 1/2; and similarly for its
mother. The genes received from father and mother are independent.
If your genes are BB or Bb or bB, you have brown eyes; if your genes are bb,
you have blue eyes.
26 CHAPTER 2. CONDITIONAL PROBABILITY
Example Suppose that John has brown eyes. So do both of John’s parents. His
sister has blue eyes. What is the probability that John’s genes are BB?
Solution John’s sister has genes bb, so one b must have come from each parent.
Thus each of John’s parents is Bb or bB; we may assume Bb. So the possibilities
for John are (writing the gene from his father ﬁrst)
BB, Bb, bB, bb
each with probability 1/4. (For example, John gets his father’s B gene with prob
ability 1/2 and his mother’s B gene with probability 1/2, and these are indepen
dent, so the probability that he gets BB is 1/4. Similarly for the other combina
tions.)
Let X be the event ‘John has BB genes’ and Y the event ‘John has brown
eyes’. Then X = {BB} and Y = {BB, Bb, bB}. The question asks us to calculate
P(X  Y). This is given by
P(X  Y) =
P(X ∩Y)
P(Y)
=
1/4
3/4
= 1/3.
2.3 The Theorem of Total Probability
Sometimes we are faced with a situation where we do not know the probability of
an event B, but we know what its probability would be if we were sure that some
other event had occurred.
Example An icecream seller has to decide whether to order more stock for the
Bank Holiday weekend. He estimates that, if the weather is sunny, he has a 90%
chance of selling all his stock; if it is cloudy, his chance is 60%; and if it rains, his
chance is only 20%. According to the weather forecast, the probability of sunshine
is 30%, the probability of cloud is 45%, and the probability of rain is 25%. (We
assume that these are all the possible outcomes, so that their probabilities must
add up to 100%.) What is the overall probability that the salesman will sell all his
stock?
This problem is answered by the Theorem of Total Probability, which we now
state. First we need a deﬁnition. The events A
1
, A
2
, . . . , A
n
form a partition of the
sample space if the following two conditions hold:
(a) the events are pairwise disjoint, that is, A
i
∩A
j
= / 0 for any pair of events A
i
and A
j
;
(b) A
1
∪A
2
∪· · · ∪A
n
= S.
2.3. THE THEOREM OF TOTAL PROBABILITY 27
Another way of saying the same thing is that every outcome in the sample space
lies in exactly one of the events A
1
, A
2
, . . . , A
n
. The picture shows the idea of a
partition.
A
1
A
2
. . . A
n
Now we state and prove the Theorem of Total Probability.
Theorem 2.2 Let A
1
, A
2
, . . . , A
n
form a partition of the sample space with P(A
i
) =
0 for all i, and let B be any event. Then
P(B) =
n
∑
i=1
P(B  A
i
) · P(A
i
).
Proof By deﬁnition, P(B  A
i
) = P(B∩A
i
)/P(A
i
). Multiplying up, we ﬁnd that
P(B∩A
i
) = P(B  A
i
) · P(A
i
).
Now consider the events B ∩A
1
, B ∩A
2
, . . . , B ∩A
n
. These events are pairwise
disjoint; for any outcome lying in both B∩A
i
and B∩A
j
would lie in both A
i
and
A
j
, and by assumption there are no such outcomes. Moreover, the union of all
these events is B, since every outcome lies in one of the A
i
. So, by Axiom 3, we
conclude that
n
∑
i=1
P(B∩A
i
) = P(B).
Substituting our expression for P(B∩A
i
) gives the result.
A
1
A
2
. . . A
n
_
¸
B
Consider the icecream salesman at the start of this section. Let A
1
be the
event ‘it is sunny’, A
2
the event ‘it is cloudy’, and A
3
the event ‘it is rainy’. Then
A
1
, A
2
and A
3
form a partition of the sample space, and we are given that
P(A
1
) = 0.3, P(A
2
) = 0.45, P(A
3
) = 0.25.
28 CHAPTER 2. CONDITIONAL PROBABILITY
Let B be the event ‘the salesman sells all his stock’. The other information we are
given is that
P(B  A
1
) = 0.9, P(B  A
2
) = 0.6, P(B  A
3
) = 0.2.
By the Theorem of Total Probability,
P(B) = (0.9×0.3) +(0.6×0.45) +(0.2×0.25) = 0.59.
You will now realise that the Theorem of Total Probability is really being used
when you calculate probabilities by tree diagrams. It is better to get into the habit
of using it directly, since it avoids any accidental assumptions of independence.
One special case of the Theorem of Total Probability is very commonly used,
and is worth stating in its own right. For any event A, the events A and A
form a
partition of S. To say that both A and A
have nonzero probability is just to say
that P(A) = 0, 1. Thus we have the following corollary:
Corollary 2.3 Let A and B be events, and suppose that P(A) = 0, 1. Then
P(B) = P(B  A) · P(A) +P(B  A
) · P(A
).
2.4 Sampling revisited
We can use the notion of conditional probability to treat sampling problems in
volving ordered samples.
Example I have two red pens, one green pen, and one blue pen. I select two
pens without replacement.
(a) What is the probability that the ﬁrst pen chosen is red?
(b) What is the probability that the second pen chosen is red?
For the ﬁrst pen, there are four pens of which two are red, so the chance of
selecting a red pen is 2/4 = 1/2.
For the second pen, we must separate cases. Let A
1
be the event ‘ﬁrst pen red’,
A
2
the event ‘ﬁrst pen green’ and A
3
the event ‘ﬁrst pen blue’. Then P(A
1
) = 1/2,
P(A
2
) = P(A
3
) = 1/4 (arguing as above). Let B be the event ‘second pen red’.
If the ﬁrst pen is red, then only one of the three remaining pens is red, so that
P(B  A
1
) = 1/3. On the other hand, if the ﬁrst pen is green or blue, then two of
the remaining pens are red, so P(B  A
2
) = P(B  A
3
) = 2/3.
2.5. BAYES’ THEOREM 29
By the Theorem of Total Probability,
P(B) = P(B  A
1
)P(A
1
) +P(B  A
2
)P(A
2
) +P(B  A
3
)P(A
3
)
= (1/3) ×(1/2) +(2/3) ×(1/4) +(2/3) ×(1/4)
= 1/2.
We have reached by a roundabout argument a conclusion which you might
think to be obvious. If we have no information about the ﬁrst pen, then the second
pen is equally likely to be any one of the four, and the probability should be 1/2,
just as for the ﬁrst pen. This argument happens to be correct. But, until your
ability to distinguish between correct arguments and plausiblelooking false ones
is very well developed, you may be safer to stick to the calculation that we did.
Beware of obviouslooking arguments in probability! Many clever people have
been caught out.
2.5 Bayes’ Theorem
There is a very big difference between P(A  B) and P(B  A).
Suppose that a new test is developed to identify people who are liable to suffer
from some genetic disease in later life. Of course, no test is perfect; there will be
some carriers of the defective gene who test negative, and some noncarriers who
test positive. So, for example, let A be the event ‘the patient is a carrier’, and B
the event ‘the test result is positive’.
The scientists who develop the test are concerned with the probabilities that
the test result is wrong, that is, with P(B  A
) and P(B
 A). However, a patient
who has taken the test has different concerns. If I tested positive, what is the
chance that I have the disease? If I tested negative, how sure can I be that I am not
a carrier? In other words, P(A  B) and P(A
 B
).
These conditional probabilities are related by Bayes’ Theorem:
Theorem 2.4 Let A and B be events with nonzero probability. Then
P(A  B) =
P(B  A) · P(A)
P(B)
.
The proof is not hard. We have
P(A  B) · P(B) = P(A∩B) = P(B  A) · P(A),
using the deﬁnition of conditional probability twice. (Note that we need both A
and B to have nonzero probability here.) Now divide this equation by P(B) to get
the result.
30 CHAPTER 2. CONDITIONAL PROBABILITY
If P(A) = 0, 1 and P(B) = 0, then we can use Corollary 17 to write this as
P(A  B) =
P(B  A) · P(A)
P(B  A) · P(A) +P(B  A
) · P(A
)
.
Bayes’ Theorem is often stated in this form.
Example Consider the icecream salesman from Section 2.3. Given that he sold
all his stock of icecream, what is the probability that the weather was sunny?
(This question might be asked by the warehouse manager who doesn’t know what
the weather was actually like.) Using the same notation that we used before, A
1
is the event ‘it is sunny’ and B the event ‘the salesman sells all his stock’. We are
asked for P(A
1
 B). We were given that P(B  A
1
) =0.9 and that P(A
1
) =0.3, and
we calculated that P(B) = 0.59. So by Bayes’ Theorem,
P(A
1
 B) =
P(B  A
1
)P(A
1
)
P(B)
=
0.9×0.3
0.59
= 0.46
to 2 d.p.
Example Consider the clinical test described at the start of this section. Suppose
that 1 in 1000 of the population is a carrier of the disease. Suppose also that the
probability that a carrier tests negative is 1%, while the probability that a non
carrier tests positive is 5%. (A test achieving these values would be regarded as
very successful.) Let A be the event ‘the patient is a carrier’, and B the event ‘the
test result is positive’. We are given that P(A) = 0.001 (so that P(A
) = 0.999),
and that
P(B  A) = 0.99, P(B  A
) = 0.05.
(a) A patient has just had a positive test result. What is the probability that the
patient is a carrier? The answer is
P(A  B) =
P(B  A)P(A)
P(B  A)P(A) +P(B  A
)P(A
)
=
0.99×0.001
(0.99×0.001) +(0.05×0.999)
=
0.00099
0.05094
= 0.0194.
(b) A patient has just had a negative test result. What is the probability that the
patient is a carrier? The answer is
P(A  B
) =
P(B
 A)P(A)
P(B
 A)P(A) +P(B
 A
)P(A
)
2.6. ITERATED CONDITIONAL PROBABILITY 31
=
0.01×0.001
(0.01×0.001) +(0.95×0.999)
=
0.00001
0.94095
= 0.00001.
So a patient with a negative test result can be reassured; but a patient with a posi
tive test result still has less than 2% chance of being a carrier, so is likely to worry
unnecessarily.
Of course, these calculations assume that the patient has been selected at ran
dom from the population. If the patient has a family history of the disease, the
calculations would be quite different.
Example 2% of the population have a certain blood disease in a serious form;
10% have it in a mild form; and 88% don’t have it at all. A new blood test is
developed; the probability of testing positive is 9/10 if the subject has the serious
form, 6/10 if the subject has the mild form, and 1/10 if the subject doesn’t have
the disease.
I have just tested positive. What is the probability that I have the serious form
of the disease?
Let A
1
be ‘has disease in serious form’, A
2
be ‘has disease in mild form’, and
A
3
be ‘doesn’t have disease’. Let B be ‘test positive’. Then we are given that A
1
,
A
2
, A
3
form a partition and
P(A
1
) = 0.02 P(A
2
) = 0.1 P(A
3
) = 0.88
P(B  A
1
) = 0.9 P(B  A
2
) = 0.6 P(B  A
3
) = 0.1
Thus, by the Theorem of Total Probability,
P(B) = 0.9×0.02+0.6×0.1+0.1×0.88 = 0.166,
and then by Bayes’ Theorem,
P(A
1
 B) =
P(B  A
1
)P(A
1
)
P(B)
=
0.9×0.02
0.166
= 0.108
to 3 d.p.
2.6 Iterated conditional probability
The conditional probability of C, given that both A and B have occurred, is just
P(C  A∩B). Sometimes instead we just write P(C  A, B). It is given by
P(C  A, B) =
P(C∩A∩B)
P(A∩B)
,
32 CHAPTER 2. CONDITIONAL PROBABILITY
so
P(A∩B∩C) = P(C  A, B)P(A∩B).
Now we also have
P(A∩B) = P(B  A)P(A),
so ﬁnally (assuming that P(A∩B) = 0), we have
P(A∩B∩C) = P(C  A, B)P(B  A)P(A).
This generalises to any number of events:
Proposition 2.5 Let A
1
, . . . , A
n
be events. Suppose that P(A
1
∩· · · ∩A
n−1
) = 0.
Then
P(A
1
∩A
2
∩· · · ∩A
n
) = P(A
n
 A
1
, . . . , A
n−1
)· · · P(A
2
 A
1
)P(A
1
).
We apply this to the birthday paradox.
The birthday paradox is the following statement:
If there are 23 or more people in a room, then the chances are better
than even that two of them have the same birthday.
To simplify the analysis, we ignore 29 February, and assume that the other 365
days are all equally likely as birthdays of a random person. (This is not quite true
but not inaccurate enough to have much effect on the conclusion.) Suppose that
we have n people p
1
, p
2
, . . . , p
n
. Let A
2
be the event ‘p
2
has a different birthday
from p
1
’. Then P(A
2
) = 1 −
1
365
, since whatever p
1
’s birthday is, there is a 1 in
365 chance that p
2
will have the same birthday.
Let A
3
be the event ‘p
3
has a different birthday from p
1
and p
2
’. It is not
straightforward to evaluate P(A
3
), since we have to consider whether p
1
and p
2
have the same birthday or not. (See below). But we can calculate that P(A
3

A
2
) = 1−
2
365
, since if A
2
occurs then p
1
and p
2
have birthdays on different days,
and A
3
will occur only if p
3
’s birthday is on neither of these days. So
P(A
2
∩A
3
) = P(A
2
)P(A
3
 A
2
) = (1−
1
365
)(1−
2
365
).
What is A
2
∩A
3
? It is simply the event that all three people have birthdays on
different days.
Now this process extends. If A
i
denotes the event ‘p
i
’s birthday is not on the
same day as any of p
1
, . . . , p
i−1
’, then
P(A
i
 A
1
, . . . , A
i−1
) = 1−
i−1
365
,
2.6. ITERATED CONDITIONAL PROBABILITY 33
and so by Proposition 2.5,
P(A
1
∩· · · ∩A
i
) = (1−
1
365
)(1−
2
365
)· · · (1−
i−1
365
).
Call this number q
i
; it is the probability that all of the people p
1
, . . . , p
i
have
their birthdays on different days.
The numbers q
i
decrease, since at each step we multiply by a factor less than 1.
So there will be some value of n such that
q
n−1
> 0.5, q
n
≤0.5,
that is, n is the smallest number of people for which the probability that they all
have different birthdays is less than 1/2, that is, the probability of at least one
coincidence is greater than 1/2.
By calculation, we ﬁnd that q
22
= 0.5243, q
23
= 0.4927 (to 4 d.p.); so 23
people are enough for the probability of coincidence to be greater than 1/2.
Now return to a question we left open before. What is the probability of the
event A
3
? (This is the event that p
3
has a different birthday from both p
1
and p
2
.)
If p
1
and p
2
have different birthdays, the probability is 1 −
2
365
: this is the
calculation we already did. On the other hand, if p
1
and p
2
have the same birthday,
then the probability is 1−
1
365
. These two numbers are P(A
3
 A
2
) and P(A
3
 A
2
)
respectively. So, by the Theorem of Total Probability,
P(A
3
) = P(A
3
 A
2
)P(A
2
) +P(A
3
 A
2
)P(A
2
)
= (1−
2
365
)(1−
1
365
) +(1−
1
365
)
1
365
= 0.9945
to 4 d.p.
Problem How many people would you need to pick at random to ensure that
the chance of two of them being born in the same month are better than even?
Assuming all months equally likely, if B
i
is the event that p
i
is born in a dif
ferent month from any of p
1
, . . . , p
i−1
, then as before we ﬁnd that
P(B
i
 B
1
, · · · , B
i−1
) = 1−
i−1
12
,
so
P(B
1
∩· · · ∩B
i
) = (1−
1
12
)(1−
2
12
)(1−
i−1
12
).
We calculate that this probability is
(11/12) ×(10/12) ×(9/12) = 0.5729
34 CHAPTER 2. CONDITIONAL PROBABILITY
for i = 4 and
(11/12) ×(10/12) ×(9/12) ×(8/12) = 0.3819
for i = 5. So, with ﬁve people, it is more likely that two will have the same birth
month.
A true story. Some years ago, in a probability class with only ten students, the
lecturer started discussing the Birthday Paradox. He said to the class, “I bet that
no two people in the room have the same birthday”. He should have been on safe
ground, since q
11
= 0.859. (Remember that there are eleven people in the room!)
However, a student in the back said “I’ll take the bet”, and after a moment all the
other students realised that the lecturer would certainly lose his wager. Why?
(Answer in the next chapter.)
2.7 Worked examples
Question Each person has two genes for cystic ﬁbrosis. Each gene is either N
or C. Each child receives one gene from each parent. If your genes are NN or NC
or CN then you are normal; if they are CC then you have cystic ﬁbrosis.
(a) Neither of Sally’s parents has cystic ﬁbrosis. Nor does she. However, Sally’s
sister Hannah does have cystic ﬁbrosis. Find the probability that Sally has
at least one C gene (given that she does not have cystic ﬁbrosis).
(b) In the general population the ratio of N genes to C genes is about 49 to 1.
You can assume that the two genes in a person are independent. Harry does
not have cystic ﬁbrosis. Find the probability that he has at least one C gene
(given that he does not have cystic ﬁbrosis).
(c) Harry and Sally plan to have a child. Find the probability that the child will
have cystic ﬁbrosis (given that neither Harry nor Sally has it).
Solution During this solution, we will use a number of times the following prin
ciple. Let A and B be events with A ⊆B. Then A∩B = A, and so
P(A  B) =
P(A∩B)
P(B)
=
P(A)
P(B)
.
(a) This is the same as the eye colour example discussed earlier. We are given
that Sally’s sister has genes CC, and one gene must come from each parent. But
2.7. WORKED EXAMPLES 35
neither parent is CC, so each parent is CN or NC. Now by the basic rules of
genetics, all the four combinations of genes for a child of these parents, namely
CC,CN, NC, NN, will have probability 1/4.
If S
1
is the event ‘Sally has at least one C gene’, then S
1
={CN, NC,CC}; and
if S
2
is the event ‘Sally does not have cystic ﬁbrosis’, then S
2
= {CN, NC, NN}.
Then
P(S
1
 S
2
) =
P(S
1
∩S
2
)
P(S
2
)
=
2/4
3/4
=
2
3
.
(b) We know nothing speciﬁc about Harry, so we assume that his genes are
randomly and independently selected from the population. We are given that the
probability of a random gene being C or N is 1/50 and 49/50 respectively. Then
the probabilities of Harry having genes CC, CN, NC, NN are respectively (1/50)
2
,
(1/50) · (49/50), (49/50) · (1/50), and (49/50)
2
, respectively. So, if H
1
is the
event ‘Harry has at least one C gene’, and H
2
is the event ‘Harry does not have
cystic ﬁbrosis’, then
P(H
1
 H
2
) =
P(H
1
∩H
2
)
P(H
2
)
=
(49/2500) +(49/2500)
(49/2500) +(49/2500) +(2401/2500)
=
2
51
.
(c) Let X be the event that Harry’s and Sally’s child has cystic ﬁbrosis. As in
(a), this can only occur if Harry and Sally both have CN or NC genes. That is,
X ⊆ S
3
∩H
3
, where S
3
= S
1
∩S
2
and H
3
= H
1
∩H
2
. Now if Harry and Sally are
both CN or NC, these genes pass independently to the baby, and so
P(X  S
3
∩H
3
) =
P(X)
P(S
3
∩H
3
)
=
1
4
.
(Remember the principle that we started with!)
We are asked to ﬁnd P(X  S
2
∩H
2
), in other words (since X ⊆ S
3
∩H
3
⊆
S
2
∩H
2
),
P(X)
P(S
2
∩H
2
)
.
Now Harry’s and Sally’s genes are independent, so
P(S
3
∩H
3
) = P(S
3
) · P(H
3
),
P(S
2
∩H
2
) = P(S
2
) · P(H
2
).
Thus,
P(X)
P(S
2
∩H
2
)
=
P(X)
P(S
3
∩H
3
)
·
P(S
3
∩H
3
)
P(S
2
∩H
2
)
36 CHAPTER 2. CONDITIONAL PROBABILITY
=
1
4
·
P(S
1
∩S
2
)
P(S
2
)
·
P(H
1
∩H
2
)
P(H
2
)
=
1
4
· P(S
1
 S
2
) · P(H
1
 H
2
)
=
1
4
·
2
3
·
2
51
=
1
153
.
I thank Eduardo Mendes for pointing out a mistake in my previous solution to
this problem.
Question The Land of Nod lies in the monsoon zone, and has just two seasons,
Wet and Dry. The Wet season lasts for 1/3 of the year, and the Dry season for 2/3
of the year. During the Wet season, the probability that it is raining is 3/4; during
the Dry season, the probability that it is raining is 1/6.
(a) I visit the capital city, Oneirabad, on a random day of the year. What is the
probability that it is raining when I arrive?
(b) I visit Oneirabad on a random day, and it is raining when I arrive. Given this
information, what is the probability that my visit is during the Wet season?
(c) I visit Oneirabad on a random day, and it is raining when I arrive. Given this
information, what is the probability that it will be raining when I return to
Oneirabad in a year’s time?
(You may assume that in a year’s time the season will be the same as today but,
given the season, whether or not it is raining is independent of today’s weather.)
Solution (a) Let W be the event ‘it is the wet season’, D the event ‘it is the dry
season’, and R the event ‘it is raining when I arrive’. We are given that P(W) =
1/3, P(D) = 2/3, P(R  W) = 3/4, P(R  D) = 1/6. By the ToTP,
P(R) = P(R  W)P(W) +P(R  D)P(D)
= (3/4) · (1/3) +(1/6) · (2/3) = 13/36.
(b) By Bayes’ Theorem,
P(W  R) =
P(R  W)P(W)
P(R)
=
(3/4) · (1/3)
13/36
=
9
13
.
2.7. WORKED EXAMPLES 37
(c) Let R
be the event ‘it is raining in a year’s time’. The information we are
given is that P(R∩R
 W) = P(R  W)P(R
 W) and similarly for D. Thus
P(R∩R
) = P(R∩R
 W)P(W) +P(R∩R
 D)P(D)
= (3/4)
2
· (1/3) +(1/6)
2
· (2/3) =
89
432
,
and so
P(R
 R) =
P(R∩R
)
P(R)
=
89/432
13/36
=
89
156
.
38 CHAPTER 2. CONDITIONAL PROBABILITY
Chapter 3
Random variables
In this chapter we deﬁne random variables and some related concepts such as
probability mass function, expected value, variance, and median; and look at some
particularly important types of random variables including the binomial, Poisson,
and normal.
3.1 What are random variables?
The Holy Roman Empire was, in the words of the historian Voltaire, “neither holy,
nor Roman, nor an empire”. Similarly, a random variable is neither random nor a
variable:
A random variable is a function deﬁned on a sample space.
The values of the function can be anything at all, but for us they will always be
numbers. The standard abbreviation for ‘random variable’ is r.v.
Example I select at random a student from the class and measure his or her
height in centimetres.
Here, the sample space is the set of students; the random variable is ‘height’,
which is a function from the set of students to the real numbers: h(S) is the height
of student S in centimetres. (Remember that a function is nothing but a rule for
associating with each element of its domain set an element of its target or range
set. Here the domain set is the sample space S, the set of students in the class, and
the target space is the set of real numbers.)
Example I throw a sixsided die twice; I am interested in the sum of the two
numbers. Here the sample space is
S ={(i, j) : 1 ≤i, j ≤6},
39
40 CHAPTER 3. RANDOM VARIABLES
and the random variable F is given by F(i, j) = i + j. The target set is the set
{2, 3, . . . , 12}.
The two random variables in the above examples are representatives of the two
types of random variables that we will consider. These deﬁnitions are not quite
precise, but more examples should make the idea clearer.
A random variable F is discrete if the values it can take are separated by gaps.
For example, F is discrete if it can take only ﬁnitely many values (as in the second
example above, where the values are the integers from 2 to 12), or if the values of
F are integers (for example, the number of nuclear decays which take place in a
second in a sample of radioactive material – the number is an integer but we can’t
easily put an upper limit on it.)
A random variable is continuous if there are no gaps between its possible
values. In the ﬁrst example, the height of a student could in principle be any real
number between certain extreme limits. A random variable whose values range
over an interval of real numbers, or even over all real numbers, is continuous.
One could concoct random variables which are neither discrete nor continuous
(e.g. the possible, values could be 1, 2, 3, or any real number between 4 and 5),
but we will not consider such random variables.
We begin by considering discrete random variables.
3.2 Probability mass function
Let F be a discrete random variable. The most basic question we can ask is: given
any value a in the target set of F, what is the probability that F takes the value a?
In other words, if we consider the event
A ={x ∈ S : F(x) = a}
what is P(A)? (Remember that an event is a subset of the sample space.) Since
events of this kind are so important, we simplify the notation: we write
P(F = a)
in place of
P({x ∈ S : F(x) = a}).
(There is a fairly common convention in probability and statistics that random
variables are denoted by capital letters and their values by lowercase letters. In
fact, it is quite common to use the same letter in lower case for a value of the
random variable; thus, we would write P(F = f ) in the above example. But
remember that this is only a convention, and you are not bound to it.)
3.3. EXPECTED VALUE AND VARIANCE 41
The probability mass function of a discrete random variable F is the function,
formula or table which gives the value of P(F =a) for each element a in the target
set of F. If F takes only a few values, it is convenient to list it in a table; otherwise
we should give a formula if possible. The standard abbreviation for ‘probability
mass function’ is p.m.f.
Example I toss a fair coin three times. The random variable X gives the number
of heads recorded. The possible values of X are 0, 1, 2, 3, and its p.m.f. is
a 0 1 2 3
P(X = a)
1
8
3
8
3
8
1
8
For the sample space is {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, and
each outcome is equally likely. The event X = 1, for example, when written as a
set of outcomes, is equal to {HTT, THT, TTH}, and has probability 3/8.
Two random variables X and Y are said to have the same distribution if the
values they take and their probability mass functions are equal. We write X ∼Y
in this case.
In the above example, if Y is the number of tails recorded during the experi
ment, then X and Y have the same distribution, even though their actual values are
different (indeed, Y = 3−X).
3.3 Expected value and variance
Let X be a discrete random variable which takes the values a
1
, . . . , a
n
. The ex
pected value or mean of X is the number E(X) given by the formula
E(X) =
n
∑
i=1
a
i
P(X = a
i
).
That is, we multiply each value of X by the probability that X takes that value,
and sum these terms. The expected value is a kind of ‘generalised average’: if
each of the values is equally likely, so that each has probability 1/n, then E(X) =
(a
1
+· · · +a
n
)/n, which is just the average of the values.
There is an interpretation of the expected value in terms of mechanics. If we
put a mass p
i
on the axis at position a
i
for i = 1, . . . , n, where p
i
= P(X = a
i
), then
the centre of mass of all these masses is at the point E(X).
If the random variable X takes inﬁnitely many values, say a
1
, a
2
, a
3
, . . ., then
we deﬁne the expected value of X to be the inﬁnite sum
E(X) =
∞
∑
i=1
a
i
P(X = a
i
).
42 CHAPTER 3. RANDOM VARIABLES
Of course, now we have to worry about whether this means anything, that is,
whether this inﬁnite series is convergent. This is a question which is discussed
at great length in analysis. We won’t worry about it too much. Usually, discrete
random variables will only have ﬁnitely many values; in the few examples we
consider where there are inﬁnitely many values, the series will usually be a ge
ometric series or something similar, which we know how to sum. In the proofs
below, we assume that the number of values is ﬁnite.
The variance of X is the number Var(X) given by
Var(X) = E(X
2
) −E(X)
2
.
Here, X
2
is just the random variable whose values are the squares of the values of
X. Thus
E(X
2
) =
n
∑
i=1
a
2
i
P(X = a
i
)
(or an inﬁnite sum, if necessary). The next theorem shows that, if E(X) is a kind
of average of the values of X, then Var(X) is a measure of how spreadout the
values are around their average.
Proposition 3.1 Let X be a discrete random variable with E(X) = µ. Then
Var(X) = E((X −µ)
2
) =
n
∑
i=1
(a
i
−µ)
2
P(X = a
i
).
For the second term is equal to the third by deﬁnition, and the third is
n
∑
i=1
(a
i
−µ)
2
P(X = a
i
)
=
n
∑
i=1
(a
2
i
−2µa
i
+µ
2
)P(X = a
i
)
=
_
n
∑
i=1
a
2
i
P(X = a
i
)
_
−2µ
_
n
∑
i=1
a
i
P(X = a
i
)
_
+µ
2
_
n
∑
i=1
P(X = a
i
)
_
.
(What is happening here is that the entire sum consists of n rows with three terms
in each row. We add it up by columns instead of by rows, getting three parts with
n terms in each part.) Continuing, we ﬁnd
E((X −µ)
2
) = E(X
2
) −2µE(X) +µ
2
= E(X
2
) −E(X)
2
,
and we are done. (Remember that E(X) = µ, and that ∑
n
i=1
P(X = a
i
) = 1 since
the events X = a
i
form a partition.)
3.4. JOINT P.M.F. OF TWO RANDOM VARIABLES 43
Some people take the conclusion of this proposition as the deﬁnition of vari
ance.
Example I toss a fair coin three times; X is the number of heads. What are the
expected value and variance of X?
E(X) = 0×(1/8) +1×(3/8) +2×(3/8) +3×(1/8) = 3/2,
Var(X) = 0
2
×(1/8) +1
2
×(3/8) +2
2
×(3/8) +3
2
×(1/8) −(3/2)
2
= 3/4.
If we calculate the variance using Proposition 3.1, we get
Var(X) =
_
−
3
2
_
2
×
1
8
+
_
−
1
2
_
2
×
3
8
+
_
1
2
_
2
×
3
8
+
_
3
2
_
2
×
1
8
=
3
4
.
Two properties of expected value and variance can be used as a check on your
calculations.
• The expected value of X always lies between the smallest and largest values
of X.
• The variance of X is never negative. (For the formula in Proposition 3.1 is
a sum of terms, each of the form (a
i
−µ)
2
(a square, hence nonnegative)
times P(X = a
i
) (a probability, hence nonnegative).
3.4 Joint p.m.f. of two random variables
Let X be a random variable taking the values a
1
, . . . , a
n
, and let Y be a random
variable taking the values b
1
, . . . , b
m
. We say that X and Y are independent if, for
any possible values i and j, we have
P(X = a
i
,Y = b
j
) = P(X = a
i
) · P(Y = b
j
).
Here P(X = a
i
,Y = b
j
) means the probability of the event that X takes the value
a
i
and Y takes the value b
j
. So we could restate the deﬁnition as follows:
The random variables X and Y are independent if, for any value a
i
of
X and any value b
j
of Y, the events X =a
i
and Y =b
j
are independent
(events).
Note the difference between ‘independent events’ and ‘independent random vari
ables’.
44 CHAPTER 3. RANDOM VARIABLES
Example In Chapter 2, we saw the following: I have two red pens, one green
pen, and one blue pen. I select two pens without replacement. Then the events
‘exactly one red pen selected’ and ‘exactly one green pen selected’ turned out to
be independent. Let X be the number of red pens selected, and Y the number of
green pens selected. Then
P(X = 1,Y = 1) = P(X = 1) · P(Y = 1).
Are X and Y independent random variables?
No, because P(X = 2) = 1/6, P(Y = 1) = 1/2, but P(X = 2,Y = 1) = 0 (it is
impossible to have two red and one green in a sample of two).
On the other hand, if I roll a die twice, and X and Y are the numbers that come
up on the ﬁrst and second throws, then X and Y will be independent, even if the
die is not fair (so that the outcomes are not all equally likely).
If we have more than two random variables (for example X,Y, Z), we say that
they are mutually independent if the events that the random variables take speciﬁc
values (for example, X = a, Y = b, Z = c) are mutually independent. (You may
want to revise the material on mutually independent events.)
What about the expected values of random variables? For expected value, it is
easy, but for variance it helps if the variables are independent:
Theorem 3.2 Let X and Y be random variables.
(a) E(X +Y) = E(X) +E(Y).
(b) If X and Y are independent, then Var(X +Y) = Var(X) +Var(Y).
We will see the proof later.
If two random variables X and Y are not independent, then knowing the p.m.f.
of each variable does not tell the whole story. The joint probability mass function
(or joint p.m.f.) of X and Y is the table giving, for each value a
i
of X and each
value b
j
of Y, the probability that X = a
i
and Y = b
j
. We arrange the table so
that the rows correspond to the values of X and the columns to the values of Y.
Note that summing the entries in the row corresponding to the value a
i
gives the
probability that X = a
i
; that is, the row sums form the p.m.f. of X. Similarly the
column sums form the p.m.f. of Y. (The row and column sums are sometimes
called the marginal distributions or marginals.)
In particular, X and Y are independent r.v.s if and only if each entry of the
table is equal to the product of its row sum and its column sum.
3.4. JOINT P.M.F. OF TWO RANDOM VARIABLES 45
Example I have two red pens, one green pen, and one blue pen, and I choose
two pens without replacement. Let X be the number of red pens that I choose and
Y the number of green pens. Then the joint p.m.f. of X and Y is given by the
following table:
Y
0 1
0 0
1
6
X 1
1
3
1
3
2
1
6
0
The row and column sums give us the p.m.f.s for X and Y:
a 0 1 2
P(X = a)
1
6
2
3
1
6
b 0 1
P(Y = b)
1
2
1
2
Now we give the proof of Theorem 3.2.
We consider the joint p.m.f. of X and Y. The random variable X +Y takes the
values a
i
+b
j
for i = 1, . . . , n and j = 1, . . . , m. Now the probability that it takes
a given value c
k
is the sum of the probabilities P(X = a
i
,Y = b
j
) over all i and j
such that a
i
+b
j
= c
k
. Thus,
E(X +Y) =
∑
k
c
k
P(X +Y = c
k
)
=
n
∑
i=1
m
∑
j=1
(a
i
+b
j
)P(X = a
i
,Y = b
j
)
=
_
n
∑
i=1
a
i
m
∑
j=1
P(X = a
i
,Y = b
j
)
_
+
_
m
∑
j=1
b
j
n
∑
i=1
P(X = a
i
,Y = b
j
)
_
.
Now ∑
m
j=1
P(X = a
i
,Y = b
j
) is a row sum of the joint p.m.f. table, so is equal to
P(X = a
i
), and similarly ∑
n
i=1
P(X = a
i
,Y = b
j
) is a column sum and is equal to
P(Y = b
j
). So
E(X +Y) =
n
∑
i=1
a
i
P(X = a
i
) +
m
∑
j=1
b
j
P(Y = b
j
)
= E(X) +E(Y).
The variance is a bit trickier. First we calculate
E((X +Y)
2
) = E(X
2
+2XY +Y
2
) = E(X
2
) +2E(XY) +E(Y
2
),
46 CHAPTER 3. RANDOM VARIABLES
using part (a) of the Theorem. We have to consider the term E(XY). For this, we
have to make the assumption that X and Y are independent, that is,
P(X = a
1
,Y = b
j
) = P(X = a
i
) · P(Y = b
j
).
As before, we have
E(XY) =
n
∑
i=1
m
∑
j=1
a
i
b
j
P(X = a
i
,Y = b
j
)
=
n
∑
i=1
n
∑
j=1
a
i
b
j
P(X = a
i
)P(Y = b
j
)
=
_
n
∑
i=1
a
i
P(X = a
i
)
_
·
_
m
∑
j=1
b
j
P(Y = b
j
)
_
= E(X) · E(Y).
So
Var(X +Y) = E((X +Y)
2
) −(E(X +Y))
2
= (E(X
2
) +2E(XY) +E(Y
2
)) −(E(X)
2
+2E(X)E(Y) +E(Y)
2
)
= (E(X
2
) −E(X)
2
) +2(E(XY) −E(X)E(Y)) +(E(Y
2
) −E(Y)
2
)
= Var(X) +Var(Y).
To ﬁnish this section, we consider constant random variables. (If the thought
of a ‘constant variable’ worries you, remember that a random variable is not a
variable at all but a function, and there is nothing amiss with a constant function.)
Proposition 3.3 Let C be a constant random variable with value c. Let X be any
random variable.
(a) E(C) = c, Var(C) = 0.
(b) E(X +c) = E(X) +c, Var(X +c) = Var(X).
(c) E(cX) = cE(X), Var(cX) = c
2
Var(X).
Proof (a) The random variable C takes the single value c with P(C = c) = 1. So
E(C) = c · 1 = c. Also,
Var(C) = E(C
2
) −E(C)
2
= c
2
−c
2
= 0.
(For C
2
is a constant random variable with value c
2
.)
3.5. SOME DISCRETE RANDOM VARIABLES 47
(b) This follows immediately from Theorem 3.2, once we observe that the
constant random variable C and any random variable X are independent. (This is
true because P(X = a,C = c) = P(X = a) · 1.) Then
E(X +c) = E(X) +E(C) = E(X) +c,
Var(X +c) = Var(X) +Var(C) = Var(X).
(c) If a
1
, . . . , a
n
are the values of X, then ca
1
, . . . , ca
n
are the values of cX, and
P(cX = ca
i
) = P(x = a
i
). So
E(cX) =
n
∑
i=1
ca
i
P(cX = ca
i
)
= c
n
∑
i=1
a
i
P(X = a
i
)
= cE(X).
Then
Var(cX) = E(c
2
X
2
) −E(cX)
2
= c
2
E(X
2
) −(cE(X))
2
= c
2
(E(X
2
) −E(X)
2
)
= c
2
Var(X).
3.5 Some discrete random variables
We now look at ﬁve types of discrete random variables, each depending on one or
more parameters. We describe for each type the situations in which it arises, and
give the p.m.f., the expected value, and the variance. If the variable is tabulated
in the New Cambridge Statistical Tables, we give the table number, and some
examples of using the tables. You should have a copy of the tables to follow the
examples.
A summary of this information is given in Appendix B.
Before we begin, a comment on the New Cambridge Statistical Tables. They
don’t give the probability mass function (or p.m.f.), but a closely related function
called the cumulative distribution function. It is deﬁned for a discrete random
variable as follows.
Let X be a random variable taking values a
1
, a
2
, . . . , a
n
. We assume that these
are arranged in ascending order: a
1
< a
2
<· · · < a
n
. The cumulative distribution
function, or c.d.f., of X is given by
F
X
(a
i
) = P(X ≤a
i
).
48 CHAPTER 3. RANDOM VARIABLES
We see that it can be expressed in terms of the p.m.f. of X as follows:
F
X
(a
i
) = P(X = a
1
) +· · · +P(X = a
i
) =
i
∑
j=1
P(X = a
j
).
In the other direction, we cn recover the p.m.f. from the c.d.f.:
P(X = a
i
) = F
X
(a
i
) −F
X
(a
i−1
).
We won’t use the c.d.f. of a discrete random variable except for looking up
the tables. It is much more important for continuous random variables!
Bernoulli random variable Bernoulli(p)
A Bernoulli random variable is the simplest type of all. It only takes two values,
0 and 1. So its p.m.f. looks as follows:
x 0 1
P(X = x) q p
Here, p is the probability that X = 1; it can be any number between 0 and 1.
Necessarily q (the probability that X = 0) is equal to 1 − p. So p determines
everything.
For a Bernoulli random variable X, we sometimes describe the experiment as
a ‘trial’, the event X = 1 as‘success’, and the event X = 0 as ‘failure’.
For example, if a biased coin has probability p of coming down heads, then
the number of heads that we get when we toss the coin once is a Bernoulli(p)
random variable.
More generally, let A be any event in a probability space S. With A, we asso
ciate a random variable I
A
(remember that a random variable is just a function on
S) by the rule
I
A
(s) =
_
1 if s ∈ A;
0 if s / ∈ A.
The random variable I
A
is called the indicator variable of A, because its value
indicates whether or not A occurred. It is a Bernoulli(p) random variable, where
p = P(A). (The event I
A
= 1 is just the event A.) Some people write 11
A
instead of
I
A
.
Calculation of the expected value and variance of a Bernoulli random variable
is easy. Let X ∼ Bernoulli(p). (Remember that ∼ means “has the same p.m.f.
as”.)
E(X) = 0· q+1· p = p;
Var(X) = 0
2
· q+1
2
· p−p
2
= p−p
2
= pq.
(Remember that q = 1−p.)
3.5. SOME DISCRETE RANDOM VARIABLES 49
Binomial random variable Bin(n, p)
Remember that for a Bernoulli random variable, we describe the event X = 1 as a
‘success’. Now a binomial random variable counts the number of successes in n
independent trials each associated with a Bernoulli(p) random variable.
For example, suppose that we have a biased coin for which the probability of
heads is p. We toss the coin n times and count the number of heads obtained. This
number is a Bin(n, p) random variable.
A Bin(n, p) random variable X takes the values 0, 1, 2, . . . , n, and the p.m.f. of
X is given by
P(X = k) =
n
C
k
q
n−k
p
k
for k = 0, 1, 2, . . . , n, where q = 1−p. This is because there are
n
C
k
different ways
of obtaining k heads in a sequence of n throws (the number of choices of the k
positions in which the heads occur), and the probability of getting k heads and
n−k tails in a particular order is q
n−k
p
k
.
Note that we have given a formula rather than a table here. For small values
we could tabulate the results; for example, for Bin(4, p):
k 0 1 2 3 4
P(X = k) q
4
4q
3
p 6q
2
p
2
4qp
3
p
4
Note: when we add up all the probabilities in the table, we get
n
∑
k=0
n
C
k
q
n−k
p
k
= (q+ p)
n
= 1,
as it should be: here we used the binomial theorem
(x +y)
n
=
n
∑
k=0
n
C
k
x
n−k
y
k
.
(This argument explains the name of the binomial random variable!)
If X ∼Bin(n, p), then
E(X) = np, Var(X) = npq.
There are two ways to prove this, an easy way and a harder way. The easy way
only works for the binomial, but the harder way is useful for many random vari
ables. However, you can skip it if you wish: I have set it in smaller type for this
reason.
Here is the easy method. We have a coin with probability p of coming down
heads, and we toss it n times and count the number X of heads. Then X is our
Bin(n, p) random variable. Let X
k
be the random variable deﬁned by
X
k
=
_
1 if we get heads on the kth toss,
0 if we get tails on the kth toss.
50 CHAPTER 3. RANDOM VARIABLES
In other words, X
i
is the indicator variable of the event ‘heads on the kth toss’.
Now we have
X = X
1
+X
2
+· · · +X
n
(can you see why?), and X
1
, . . . , X
n
are independent Bernoulli(p) randomvariables
(since they are deﬁned by different tosses of a coin). So, as we sawearlier, E(X
i
) =
p, Var(X
i
) = pq. Then, by Theorem 21, since the variables are independent, we
have
E(X) = p+ p+· · · + p = np,
Var(X) = pq+ pq+· · · + pq = npq.
The other method uses a gadget called the probability generating function. We only use it
here for calculating expected values and variances, but if you learn more probability theory you
will see other uses for it. Let X be a random variable whose values are nonnegative integers. (We
don’t insist that it takes all possible values; this method is ﬁne for the binomial Bin(n, p), which
takes values between 0 and n. To save space, we write p
k
for the probability P(X = k). Now the
probability generating function of X is the power series
G
X
(x) =
∑
p
k
x
k
.
(The sum is over all values k taken by X.)
We use the notation [F(x)]
x=1
for the result of substituting x = 1 in the series F(x).
Proposition 3.4 Let G
X
(x) be the probability generating function of a random variable X. Then
(a) [G
X
(x)]
x=1
= 1;
(b) E(X) =
_
d
dx
G
X
(x)
¸
x=1
;
(c) Var(X) =
_
d
2
dx
2
G
X
(x)
_
x=1
+E(X) −E(X)
2
.
Part (a) is just the statement that probabilities add up to 1: when we substitute x = 1 in the
power series for G
X
(x) we just get ∑p
k
.
For part (b), when we differentiate the series termbyterm (you will learn later in Analysis
that this is OK), we get
d
dx
G
X
(x) =
∑
kp
k
x
k−1
.
Now putting x = 1 in this series we get
∑
kp
k
= E(X).
For part (c), differentiating twice gives
d
2
dx
2
G
X
(x) =
∑
k(k −1)p
k
x
k−2
.
Now putting x = 1 in this series we get
∑
k(k −1)p
k
=
∑
k
2
p
k
−
∑
kp
k
= E(X
2
) −E(X).
Adding E(X) and subtracting E(X)
2
gives E(X
2
) −E(X)
2
, which by deﬁnition is Var(X).
3.5. SOME DISCRETE RANDOM VARIABLES 51
Now let us appply this to the binomial random variable X ∼Bin(n, p). We have
p
k
= P(X = k) =
n
C
k
q
n−k
p
k
,
so the probability generating function is
n
∑
k=0
n
C
k
q
n−k
p
k
x
k
= (q+ px)
n
,
by the Binomial Theorem. Putting x = 1 gives (q+ p)
n
= 1, in agreement with Proposition 3.4(a).
Differentiating once, using the Chain Rule, we get np(q+ px)
n−1
. Putting x = 1 we ﬁnd that
E(X) = np.
Differentiating again, we get n(n−1)p
2
(q+ px)
n−2
. Putting x = 1 gives n(n−1)p
2
. Now adding
E(X) −E(X)
2
, we get
Var(X) = n(n−1)p
2
+np−n
2
p
2
= np−np
2
= npq.
The binomial random variable is tabulated in Table 1 of the Cambridge Statis
tical Tables [1]. As explained earlier, the tables give the cumulative distribution
function.
For example, suppose that the probability that a certain coin comes down
heads is 0.45. If the coin is tossed 15 times, what is the probability of ﬁve or
fewer heads? Turning to the page n = 15 in Table 1 and looking at the row 0.45,
you read off the answer 0.2608. What is the probability of exactly ﬁve heads? This
is P(5 or fewer)−P(4 or fewer), and from tables the answer is 0.2608−0.1204 =
0.1404.
The tables only go up to p = 0.5. For larger values of p, use the fact that the
number of failures in Bin(n, p) is equal to the number of successes in Bin(n, 1 −
p). So the probability of ﬁve heads in 15 tosses of a coin with p =0.55 is 0.9745−
0.9231 = 0.0514.
Another interpretation of the binomial random variable concerns sampling.
Suppose that we have N balls in a box, of which M are red. We sample n balls
from the box with replacement; let the random variable X be the number of red
balls in the sample. What is the distribution of X? Since each ball has probability
M/N of being red, and different choices are independent, X ∼ Bin(n, p), where
p = M/N is the proportion of red balls in the sample.
What about sampling without replacement? This leads us to our next random
variable:
Hypergeometric random variable Hg(n, M, N)
Suppose that we have N balls in a box, of which M are red. We sample n balls
from the box without replacement. Let the random variable X be the number of
52 CHAPTER 3. RANDOM VARIABLES
red balls in the sample. Such an X is called a hypergeometric random variable
Hg(n, M, N).
The random variable X can take any of the values 0, 1, 2, . . . , n. Its p.m.f. is
given by the formula
P(X = k) =
M
C
k
·
N−M
C
n−k
N
Cn
.
For the number of samples of n balls from N is
N
C
n
; the number of ways of
choosing k of the M red balls and n−k of the N−M others is
M
C
k
·
N−M
C
n−k
; and
all choices are equally likely.
The expected value and variance of a hypergeometric random variable are as
follows (we won’t go into the proofs):
E(X) = n
_
M
N
_
, Var(X) = n
_
M
N
__
N−M
N
__
N−n
N−1
_
.
You should compare these to the values for a binomial random variable. If we
let p =M/N be the proportion of red balls in the hat, then E(X) =np, and Var(X)
is equal to npq multiplied by a ‘correction factor’ (N−n)/(N−1).
In particular, if the numbers M and N −M of red and nonred balls in the
hat are both very large compared to the size n of the sample, then the difference
between sampling with and without replacement is very small, and indeed the
‘correction factor’ is close to 1. So we can say that Hg(n, M, N) is approximately
Bin(n, M/N) if n is small compared to M and N−M.
Consider our example of choosing two pens from four, where two pens are
red, one green, and one blue. The number X of red pens is a Hg(2, 2, 4) random
variable. We calculated earlier that P(X = 0) = 1/6, P(X = 1) = 2/3 and P(X =
2) =1/6. Fromthis we ﬁnd by direct calculation that E(X) =1 and Var(X) =1/3.
These agree with the formulae above.
Geometric random variable Geom(p)
The geometric random variable is like the binomial but with a different stopping
rule. We have again a coin whose probability of heads is p. Now, instead of
tossing it a ﬁxed number of times and counting the heads, we toss it until it comes
down heads for the ﬁrst time, and count the number of times we have tossed
the coin. Thus, the values of the variable are the positive integers 1, , 2, 3, . . . (In
theory we might never get a head and toss the coin inﬁnitely often, but if p > 0
this possibility is ‘inﬁnitely unlikely’, i.e. has probability zero, as we will see.)
We always assume that 0 < p < 1.
More generally, the number of independent Bernoulli trials required until the
ﬁrst success is obtained is a geometric random variable.
3.5. SOME DISCRETE RANDOM VARIABLES 53
The p.m.f of a Geom(p) random variable is given by
P(X = k) = q
k−1
p,
where q = 1 − p. For the event X = k means that we get tails on the ﬁrst k −1
tosses and heads on the kth, and this event has probability q
k−1
p, since ‘tails’ has
probability q and different tosses are independent.
Let’s add up these probabilities:
∞
∑
k=1
q
k−1
p = p+qp+q
2
p+· · · =
p
1−q
= 1,
since the series is a geometric progression with ﬁrst term p and common ratio
q, where q < 1. (Just as the binomial theorem shows that probabilities sum to 1
for a binomial random variable, and gives its name to the random variable, so the
geometric progression does for the geometric random variable.)
We calculate the expected value and the variance using the probability gener
ating function. If X ∼Geom(p), the result will be that
E(X) = 1/p, Var(X) = q/p
2
.
We have
G
X
(x) =
∞
∑
k=1
q
k−1
px
k
=
px
1−qx
,
again by summing a geometric progression. Differentiating, we get
d
dx
G
X
(x) =
(1−qx)p+ pxq
(1−qx)
2
=
p
(1−qx)
2
.
Putting x = 1, we obtain
E(X) =
p
(1−q)
2
=
1
p
.
Differentiating again gives 2pq/(1−qx)
3
, so
Var(X) =
2pq
p
3
+
1
p
−
1
p
2
=
q
p
2
.
For example, if we toss a fair coin until heads is obtained, the expected number
of tosses until the ﬁrst head is 2 (so the expected number of tails is 1); and the
variance of this number is also 2.
54 CHAPTER 3. RANDOM VARIABLES
Poisson random variable Poisson(λ)
The Poisson random variable, unlike the ones we have seen before, is very closely
connected with continuous things.
Suppose that ‘incidents’ occur at random times, but at a steady rate overall.
The best example is radioactive decay: atomic nuclei decay randomly, but the
average number λ which will decay in a given interval is constant. The Poisson
randomvariabe X counts the number of ‘incidents’ which occur in a given interval.
So if, on average, there are 2.4 nuclear decays per second, then the number of
decays in one second starting now is a Poisson(2.4) random variable.
Another example might be the number of telephone calls a minute to a busy
telephone number.
Although we will not prove it, the p.m.f. for a Poisson(λ) variable X is given
by the formula
P(X = k) =
λ
k
k!
e
−λ
.
Let’s check that these probabilities add up to one. We get
_
∞
∑
k=0
λ
k
k!
_
e
−λ
= e
λ
· e
−λ
= 1,
since the expression in brackets is the sum of the exponential series.
By analogy with what happened for the binomial and geometric random vari
ables, you might have expected that this random variable would be called ‘expo
nential’. Unfortunately, this name has been given to a closelyrelated continuous
random variable which we will meet later. However, if you speak a little French,
you might use as a mnemonic the fact that if I go ﬁshing, and the ﬁsh are biting at
the rate of λ per hour on average, then the number of ﬁsh I will catch in the next
hour is a Poisson(λ) random variable.
The expected value and variance of a Poisson(λ) random variable X are given
by
E(X) = Var(X) = λ.
Again we use the probability generating function. If X ∼Poisson(λ), then
G
X
(x) =
∞
∑
k=0
(λx)
k
k!
e
−λ
= e
λ(x−1)
,
again using the series for the exponential function.
Differentiation gives λe
λ(x−1)
, so E(X) = λ. Differentiating again gives λ
2
e
λ(x−1)
, so
Var(X) = λ
2
+λ−λ
2
= λ.
3.6. CONTINUOUS RANDOM VARIABLES 55
The cumulative distribution function of a Poisson random variable is tabulated
in Table 2 of the New Cambridge Statistical Tables. So, for example, we ﬁnd from
the tables that, if 2.4 ﬁsh bite per hour on average, then the probability that I will
catch no ﬁsh in the next hour is 0.0907, while the probability that I catch at ﬁve or
fewer is 0.9643 (so that the probability that I catch six or more is 0.0357).
There is another situation in which the Poisson distribution arises. Suppose I
am looking for some very rare event which only occurs once in 1000 trials on av
erage. So I conduct 1000 independent trials. How many occurrences of the event
do I see? This number is really a binomial random variable Bin(1000, 1/1000).
But it turns out to be Poisson(1), to a very good approximation. So, for example,
the probability that the event doesn’t occur is about 1/e.
The general rule is:
If n is large, p is small, and np = λ, then Bin(n, p) can be approxi
mated by Poisson(λ).
3.6 Continuous random variables
We haven’t so far really explained what a continuous random variable is. Its target
set is the set of real numbers, or perhaps the nonnegative real numbers or just an
interval. The crucial property is that, for any real number a, we have (X = a) = 0;
that is, the probability that the height of a random student, or the time I have to
wait for a bus, is precisely a, is zero. So we can’t use the probability mass function
for continuous random variables; it would always be zero and give no information.
We use the cumulative distribution function or c.d.f. instead. Remember from
last week that the c.d.f. of the random variable X is the function F
X
deﬁned by
F
X
(x) = P(X ≤x).
Note: The name of the function is F
X
; the lower case x refers to the argument
of the function, the number which is substituted into the function. It is common
but not universal to use as the argument the lowercase version of the name of the
random variable, as here. Note that F
X
(y) is the same function written in terms of
the variable y instead of x, whereas F
Y
(x) is the c.d.f. of the random variable Y
(which might be a completely different function.)
Now let X be a continuous random variable. Then, since the probability that
X takes the precise value x is zero, there is no difference between P(X ≤ x) and
P(X < x).
Proposition 3.5 The c.d.f. is an increasing function (this means that F
X
(x) ≤
F
X
(y) if x < y), and approaches the limits 0 as x →−∞ and 1 as x →∞.
56 CHAPTER 3. RANDOM VARIABLES
The function is increasing because, if x < y, then
F
X
(y) −F
X
(x) = P(X ≤y) −P(X ≤x) = P(x < X ≤y) ≥0.
Also F
X
(∞) =1 because X must certainly take some ﬁnite value; and F
X
(−∞) =0
because no value is smaller than −∞!
Another important function is the probability density function f
X
. It is ob
tained by differentiating the c.d.f.:
f
X
(x) =
d
dx
F
X
(x).
Now f
X
(x) is nonnegative, since it is the derivative of an increasing function. If
we know f
X
(x), then F
X
is obtained by integrating. Because F
X
(−∞) = 0, we
have
F
X
(x) =
Z
x
−∞
f
X
(t)dt.
Note the use of the “dummy variable” t in this integral. Note also that
P(a ≤X ≤b) = F
X
(b) −F
X
(a) =
Z
b
a
f
X
(t)dt.
You can think of the p.d.f. like this: the probability that the value of X lies in a
very small interval from x to x +h is approximately f
X
(x) · h. So, although the
probability of getting exactly the value x is zero, the probability of being close to
x is proportional to f
X
(x).
There is a mechanical analogy which you may ﬁnd helpful. Remember that
we modelled a discrete random variable X by placing at each value a of X a mass
equal to P(X = a). Then the total mass is one, and the expected value of X is
the centre of mass. For a continuous random variable, imagine instead a wire of
variable thickness, so that the density of the wire (mass per unit length) at the
point x is equal to f
X
(x). Then again the total mass is one; the mass to the left of
x is F
X
(x); and again it will hold that the centre of mass is at E(X).
Most facts about continuous random variables are obtained by replacing the
p.m.f. by the p.d.f. and replacing sums by integrals. Thus, the expected value of
X is given by
E(X) =
Z
∞
−∞
x f
X
(x)dx,
and the variance is (as before)
Var(X) = E(X
2
) −E(X)
2
,
where
E(X
2
) =
Z
∞
−∞
x
2
f
X
(x)dx.
It is also true that Var(X) = E((X −µ)
2
), where µ = E(X).
3.7. MEDIAN, QUARTILES, PERCENTILES 57
We will see examples of these calculations shortly. But here is a small example
to show the ideas. The support of a continuous random variable is the smallest
interval containing all values of x where f
X
(x) > 0.
Suppose that the random variable X has p.d.f. given by
f
X
(x) =
_
2x if 0 ≤x ≤1,
0 otherwise.
The support of Xis the interval [0, 1]. We check the integral:
Z
∞
−∞
f
X
(x)dx =
Z
1
0
2x dx =
_
x
2
¸
x=1
x=0
= 1.
The cumulative distribution function of X is
F
X
(x) =
Z
x
−∞
f
X
(t)dt =
_
0 if x < 0,
x
2
if 0 ≤x ≤1,
1 if x > 1.
(Study this carefully to see how it works.) We have
E(X) =
Z
∞
−∞
x f
X
(x)dx =
Z
1
0
2x
2
dx =
2
3
,
E(X
2
) =
Z
∞
−∞
x
2
f
X
(x)dx =
Z
1
0
2x
3
dx =
1
2
,
Var(X) =
1
2
−
_
2
3
_
2
=
1
18
.
3.7 Median, quartiles, percentiles
Another measure commonly used for continuous random variables is the median;
this is the value m such that “half of the distribution lies to the left of m and half to
the right”. More formally, m should satisfy F
X
(m) = 1/2. It is not the same as the
mean or expected value. In the example at the end of the last section, we saw that
E(X) = 2/3. The median of X is the value of m for which F
X
(m) = 1/2. Since
F
X
(x) = x
2
for 0 ≤x ≤1, we see that m = 1/
√
2.
If there is a value m such that the graph of y = f
X
(x) is symmetric about x =m,
then both the expected value and the median of X are equal to m.
The lower quartile l and the upper quartile u are similarly deﬁned by
F
X
(l) = 1/4, F
X
(u) = 3/4.
Thus, the probability that X lies between l and u is 3/4−1/4 = 1/2, so the quar
tiles give an estimate of how spreadout the distribution is. More generally, we
deﬁne the nth percentile of X to be the value of x
n
such that
F
X
(x
n
) = n/100,
58 CHAPTER 3. RANDOM VARIABLES
that is, the probability that X is smaller than x
n
is n%.
Reminder If the c.d.f. of X is F
X
(x) and the p.d.f. is f
X
(x), then
• differentiate F
X
to get f
X
, and integrate f
X
to get F
X
;
• use f
X
to calculate E(X) and Var(X);
• use F
X
to calculate P(a ≤ X ≤ b) (this is F
X
(b) −F
X
(a)), and the median
and percentiles of X.
3.8 Some continuous random variables
In this section we examine three important continuous random variables: the uni
form, exponential, and normal. The details are summarised in Appendix B.
Uniform random variable U(a, b)
Let a and b be real numbers with a <b. A uniform random variable on the interval
[a, b] is, roughly speaking, “equally likely to be anywhere in the interval”. In other
words, its probability density function is constant on the interval [a, b] (and zero
outside the interval). What should the constant value c be? The integral of the
p.d.f. is the area of a rectangle of height c and base b −a; this must be 1, so
c = 1/(b−a). Thus, the p.d.f. of the random variable X ∼U(a, b) is given by
f
X
(x) =
_
1/(b−a) if a ≤x ≤b,
0 otherwise.
By integration, we ﬁnd that the c.d.f. is
F
X
(x) =
_
0 if x < a,
(x −a)/(b−a) if a ≤x ≤b,
1 if x > b.
Further calculation (or the symmetry of the p.d.f.) shows that the expected value
and the median of X are both given by (a +b)/2 (the midpoint of the interval),
while Var(X) = (b−a)
2
/12.
The uniform random variable doesn’t really arise in practical situations. How
ever, it is very useful for simulations. Most computer systems include a random
number generator, which apparently produces independent values of a uniform
random variable on the interval [0, 1]. Of course, they are not really random, since
the computer is a deterministic machine; but there should be no obvious pattern to
3.8. SOME CONTINUOUS RANDOM VARIABLES 59
the numbers produced, and in a large number of trials they should be distributed
uniformly over the interval.
You will learn in the Statistics course how to use a uniform random variable
to construct values of other types of discrete or continuous random variables. Its
great simplicity makes it the best choice for this purpose.
Exponential random variable Exp(λ)
The exponential random variable arises in the same situation as the Poisson: be
careful not to confuse them! We have events which occur randomly but at a con
stant average rate of λ per unit time (e.g. radioactive decays, ﬁsh biting). The
Poisson random variable, which is discrete, counts how many events will occur
in the next unit of time. The exponential random variable, which is continuous,
measures exactly how long from now it is until the next event occurs. Not that it
takes nonnegative real numbers as values.
If X ∼Exp(λ), the p.d.f. of X is
f
X
(x) =
_
0 if x < 0,
λe
−λx
if x ≥0.
By integration, we ﬁnd the c.d.f. to be
F
X
(x) =
_
0 if x < 0,
1−e
−λx
if x ≥0.
Further calculation gives
E(X) = 1/λ, Var(X) = 1/λ
2
.
The median m satisﬁes 1−e
−λm
= 1/2, so that m = log2/λ. (The logarithm is to
base e, so that log2 = 0.69314718056 approximately.
Normal random variable N(µ, σ
2
)
The normal random variable is the commonest of all in applications, and the most
important. There is a theorem called the central limit theorem which says that, for
virtually any random variable X which is not too bizarre, if you take the sum (or
the average) of n independent random variables with the same distribution as X,
the result will be approximately normal, and will become more and more like a
normal variable as n grows. This partly explains why a random variable affected
by many independent factors, like a man’s height, has an approximately normal
distribution.
60 CHAPTER 3. RANDOM VARIABLES
More precisely, if n is large, then a Bin(n, p) random variable is well approx
imated by a normal random variable with the same expected value np and the
same variance npq. (If you are approximating any discrete random variable by a
continuous one, you should make a “continuity correction” – see the next section
for details and an example.)
The p.d.f. of the random variable X ∼N(µ, σ
2
) is given by the formula
f
X
(x) =
1
σ
√
2π
e
−(x−µ)
2
/2σ
2
.
We have E(X) = µ and Var(X) = σ
2
. The picture below shows the graph of this
function for µ = 0, the familiar ‘bellshaped curve’.
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . .
. . . . . .
. . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . .
. . . . . .
. . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
The c.d.f. of X is obtained as usual by integrating the p.d.f. However, it is not
possible to write the integral of this function (which, stripped of its constants, is
e
−x
2
) in terms of ‘standard’ functions. So there is no alternative but to make tables
of its values.
The crucial fact that means that we don’t have to tabulate the function for all
values of µ and σ is the following:
Proposition 3.6 If X ∼N(µ, σ
2
), and Y = (X −µ)/σ, then Y ∼N(0, 1).
So we only need tables of the c.d.f. for N(0, 1) – this is the socalled standard
normal random variable – and we can ﬁnd the c.d.f. of any normal random vari
able. The c.d.f. of the standard normal is given in Table 4 of the New Cambridge
Statistical Tables [1]. The function is called Φ in the tables.
For example, suppose that X ∼ N(6, 25).What is the probability that X ≤ 8?
Putting Y = (X −6)/5, so that Y ∼ N(0, 1), we ﬁnd that X ≤ 8 if and only if
Y ≤(8−6)/5 = 0.4. From the tables, the probability of this is Φ(0.4) = 0.6554.
The p.d.f. of a standard normal r.v. Y is symmetric about zero. This means
that, for any positive number c,
Φ(−c) = P(Y ≤−c) = P(Y ≥c) = 1−P(Y ≤c) = 1−Φ(c).
So it is only necessary to tabulate the function for positive values of its argument.
So, if X ∼N(6, 25) and Y = (X −6)/5 as before, then
P(X ≤3) = P(Y ≤−0.6) = 1−P(Y ≤0.6) = 1−0.7257 = 0.2743.
3.9. ON USING TABLES 61
3.9 On using tables
We end this section with a few comments about using tables, not tied particularly
to the normal distribution (though most of the examples will come from there).
Interpolation
Any table is limited in the number of entries it contains. Tabulating something
with the input given to one extra decimal place would make the table ten times as
bulky! Interpolation can be used to extend the range of values tabulated.
Suppose that some function F is tabulated with the input given to three places
of decimals. It is probably true that F is changing at a roughly constant rate
between, say, 0.28 and 0.29. So F(0.283) will be about threetenths of the way
between F(0.28) and F(0.29).
For example, if Φ is the c.d.f. of the normal distribution, then Φ(0.28) =
0.6103 and Φ(0.29) = 0.6141, so Φ(0.283) = 0.6114. (Threetenths of 0.0038 is
0.0011.)
Using tables in reverse
This means, if you have a table of values of F, use it to ﬁnd x such that F(x) is a
given value c. Usually, c won’t be in the table and we have to interpolate between
values x
1
and x
2
, where F(x
1
) is just less than c and F(x
2
) is just greater.
For example, if Φ is the c.d.f. of the normal distribution, and we want the
upper quartile, then we ﬁnd from tables Φ(0.67) =0.7486 and Φ(0.68) =0.7517,
so the required value is about 0.6745 (since 0.0014/0.0031 = 0.45).
In this case, the percentile points of the standard normal r.v. are given in Table
5 of the New Cambridge Statistical Tables [1], so you don’t need to do this. But
you will ﬁnd it necessary in other cases.
Continuity correction
Suppose we know that a discrete random variable X is well approximated by a
continuous random variable Y. We are given a table of the c.d.f. of Y and want to
ﬁnd information about X. For example, suppose that X takes integer values and
we want to ﬁnd P(a ≤ X ≤ b), where a and b are integers. This probability is
equal to
P(X = a) +P(x = a+1) +· · · +P(X = b).
To say that X can be approximated by Y means that, for example, P(X = a) is
approximately equal to f
Y
(a), where f
Y
is the p.d.f. of Y. This is equal to the area
62 CHAPTER 3. RANDOM VARIABLES
of a rectangle of height f
Y
(a) and base 1 (from a−0.5 to a+0.5). This in turn is,
to a good approximation, the area under the curve y = f
Y
(x) from x = a −0.5 to
x = a+0.5, since the pieces of the curve above and below the rectangle on either
side of x = a will approximately cancel. Similarly for the other values.
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
y=f
Y
(x)
g
P(X=a)
a−0.5 a a+0.5
Adding all these pieces. we ﬁnd that P(a ≤X ≤b) is approximately equal to
the area under the curve y = f
Y
(x) from x = a −0.5 to x = b +0.5. This area is
given by F
Y
(b+0.5) −F
Y
(a−0.5), since F
Y
is the integral of f
Y
. Said otherwise,
this is P(a−0.5 ≤Y ≤b+0.5).
We summarise the continuity correction:
Suppose that the discrete random variable X, taking integer values, is
approximated by the continuous random variable Y. Then
P(a ≤X ≤b) ≈P(a−0.5 ≤Y ≤b+0.5) =F
Y
(b+0.5)−F
Y
(a−0.5).
(Here, ≈ means “approximately equal”.) Similarly, for example, P(X ≤b) ≈
P(Y ≤b+0.5), and P(X ≥a) ≈P(Y ≥a−0.5).
Example The probability that a light bulb will fail in a year is 0.75, and light
bulbs fail independently. If 192 bulbs are installed, what is the probability that the
number which fail in a year lies between 140 and 150 inclusive?
Solution Let X be the number of light bulbs which fail in a year. Then X ∼
Bin(192, 3/4), and so E(X) = 144, Var(X) = 36. So X is approximated by Y ∼
N(144, 36), and
P(140 ≤X ≤150) ≈P(139.5 ≤Y ≤150.5)
by the continuity correction.
3.10. WORKED EXAMPLES 63
Let Z = (Y −144)/6. Then Z ∼N(0, 1), and
P(139.5 ≤Y ≤150.5) = P
_
139.5−144
6
≤Z ≤
150.5−144
6
_
= P(−0.75 ≤Z ≤1.083)
= 0.8606−0.2268 (from tables)
= 0.6338.
3.10 Worked examples
Question I roll a fair die twice. Let the random variable X be the maximum of
the two numbers obtained, and let Y be the modulus of their difference (that is,
the value of Y is the larger number minus the smaller number).
(a) Write down the joint p.m.f. of (X,Y).
(b) Write down the p.m.f. of X, and calculate its expected value and its variance.
(c) Write down the p.m.f. of Y, and calculate its expected value and its variance.
(d) Are the random variables X and Y independent?
Solution (a)
Y
0 1 2 3 4 5
1
1
36
0 0 0 0 0
2
1
36
2
36
0 0 0 0
3
1
36
2
36
2
36
0 0 0
X 4
1
36
2
36
2
36
2
36
0 0
5
1
36
2
36
2
36
2
36
2
36
0
6
1
36
2
36
2
36
2
36
2
36
2
36
The best way to produce this is to write out a 6×6 table giving all possible values
for the two throws, work out for each cell what the values of X and Y are, and
then count the number of occurrences of each pair. For example: X = 5, Y = 2
can occur in two ways: the numbers thrown must be (5, 3) or (3, 5).
(b) Take row sums:
x 1 2 3 4 5 6
P(X = x)
1
36
3
36
5
36
7
36
9
36
11
36
64 CHAPTER 3. RANDOM VARIABLES
Hence in the usual way
E(X) =
161
36
, Var(X) =
2555
1296
.
(c) Take column sums:
y 0 1 2 3 4 5
P(Y = y)
6
36
10
36
8
36
6
36
4
36
2
36
and so
E(Y) =
35
18
, Var(Y) =
665
324
.
(d) No: e.g. P(X = 1,Y = 2) = 0 but P(X = 1) · P(Y = 2) =
8
1296
.
Question An archer shoots an arrow at a target. The distance of the arrow from
the centre of the target is a random variable X whose p.d.f. is given by
f
X
(x) =
_
(3+2x −x
2
)/9 if x ≤3,
0 if x > 3.
The archer’s score is determined as follows:
Distance X < 0.5 0.5 ≤X < 1 1 ≤X < 1.5 1.5 ≤X < 2 X ≥2
Score 10 7 4 1 0
Construct the probability mass function for the archer’s score, and ﬁnd the archer’s
expected score.
Solution First we work out the probability of the arrow being in each of the
given bands:
P(X < 0.5) = F
X
(0.5) −F
X
(0) =
Z
0.5
0
3+2x −x
2
9
dx
=
_
9x +3x
2
−x
3
27
_
1/2
0
=
41
216
.
Similarly we ﬁnd that P(0.5 ≤ X < 1) = 47/216, P(1 ≤ X < 1.5) = 47/216,
P(1.5 ≤X < 2) = 41/216, and P(X ≥2) = 40/216. So the p.m.f. fot the archer’s
score S is
s 0 1 4 7 10
P(S = s)
40
216
41
216
47
216
47
216
41
216
3.10. WORKED EXAMPLES 65
Hence
E(S) =
41+47· 4+47· 7+41· 10
216
=
121
27
.
Question Let T be the lifetime in years of new bus engines. Suppose that T is
continuous with probability density function
f
T
(x) =
_
¸
_
¸
_
0 for x < 1
d
x
3
for x > 1
for some constant d.
(a) Find the value of d.
(b) Find the mean and median of T.
(c) Suppose that 240 new bus engines are installed at the same time, and that
their lifetimes are independent. By making an appropriate approximation,
ﬁnd the probability that at most 10 of the engines last for 4 years or more.
Solution (a) The integral of f
T
(x), over the support of T, must be 1. That is,
1 =
Z
∞
1
d
x
3
dx
=
_
−d
2x
2
_
∞
1
= d/2,
so d = 2.
(b) The c.d.f. of T is obtained by integrating the p.d.f.; that is, it is
F
T
(x) =
_
¸
_
¸
_
0 for x < 1
1−
1
x
2
for x > 1
The mean of T is
Z
∞
1
x f
T
(x) dx =
Z
∞
1
2
x
2
dx = 2.
The median is the value m such that F
T
(m) = 1/2. That is, 1 −1/m
2
= 1/2,
or m =
√
2.
66 CHAPTER 3. RANDOM VARIABLES
(c) The probability that an engine lasts for four years or more is
1−F
T
(4) = 1−
_
1−
_
1
4
_
2
_
=
1
16
.
So, if 240 engines are installed, the number which last for four years or more
is a binomial random variable X ∼ Bin(240, 1/16), with expected value 240 ×
(1/16) = 15 and variance 240×(1/16) ×(15/16) = 225/16.
We approximate X by Y ∼ N(15, (15/4)
2
). Using the continuity correction,
P(X ≤10) ≈P(Y ≤10.5).
Now, if Z = (Y −15)/(15/4), then Z ∼N(0, 1), and
P(Y ≤10.5) = P(Z ≤−1.2)
= 1−P(Z ≤1.2)
= 0.1151
using the table of the standard normal distribution.
Note that we start with the continuous random variable T, move to the discrete
random variable X, and then move on to the continuous random variables Y and
Z, where ﬁnally Z is standard normal and so is in the tables.
A true story The answer to the question at the end of the last chapter: As the
students in the class obviously knew, the class included a pair of twins! (The twins
were Leo and Willy Moser, who both had successful careers as mathematicians.)
But what went wrong with our argument for the Birthday Paradox? We as
sumed (without saying so) that the birthdays of the people in the room were inde
pendent; but of course the birthdays of twins are clearly not independent!
Chapter 4
More on joint distribution
We have seen the joint p.m.f. of two discrete random variables X and Y, and we
have learned what it means for X and Y to be independent. Now we examine
this further to see measures of nonindependence and conditional distributions of
random variables.
4.1 Covariance and correlation
In this section we consider a pair of discrete randomvariables X andY. Remember
that X and Y are independent if
P(X = a
i
,Y = b
j
) = P(X = a
i
) · P(Y = b
j
)
holds for any pair (a
i
, b
j
) of values of X and Y. We introduce a number (called
the covariance of X and Y) which gives a measure of how far they are from being
independent.
Look back at the proof of Theorem 21(b), where we showed that if X and Y
are independent then Var(X +Y) = Var(X) +Var(Y). We found that, in any case,
Var(X +Y) = Var(X) +Var(Y) +2(E(XY) −E(X)E(Y)),
and then proved that if X and Y are independent then E(XY) = E(X)E(Y), so that
the last term is zero.
Now we deﬁne the covariance of X and Y to be E(XY) −E(X)E(Y). We
write Cov(X,Y) for this quantity. Then the argument we had earlier shows the
following:
Theorem 4.1 (a) Var(X +Y) = Var(X) +Var(Y) +2Cov(X,Y).
(b) If X and Y are independent, then Cov(X,Y) = 0.
67
68 CHAPTER 4. MORE ON JOINT DISTRIBUTION
In fact, a more general version of (a), proved by the same argument, says that
Var(aX +bY) = a
2
Var(X) +b
2
Var(Y) +2abCov(X,Y). (4.1)
Another quantity closely related to covariance is the correlation coefﬁcient,
corr(X,Y), which is just a “normalised” version of the covariance. It is deﬁned as
follows:
corr(X,Y) =
Cov(X,Y)
_
Var(X)Var(Y)
.
The point of this is the ﬁrst part of the following theorem.
Theorem 4.2 Let X and Y be random variables. Then
(a) −1 ≤corr(X,Y) ≤1;
(b) if X and Y are independent, then corr(X,Y) = 0;
(c) if Y = mX +c for some constants m = 0 and c, then corr(X,Y) = 1 if m > 0,
and corr(X,Y) =−1 if m < 0.
The proof of the ﬁrst part is optional: see the end of this section. But note that
this is another check on your calculations: if you calculate a correlation coefﬁcient
which is bigger than 1 or smaller than −1, then you have made a mistake. Part (b)
follows immediately from part (b) of the preceding theorem.
For part (c), suppose that Y =mX +c. Let E(X) =µ and Var(X) =α, so that E(X
2
) =µ
2
+α.
Now we just calculate everything in sight.
E(Y) = E(mX +c) = mE(X) +c = mµ+c
E(Y
2
) = E(m
2
X
2
+2mcX +c
2
) = m
2
(µ
2
+α) +2mcµ+c
2
Var(Y) = E(Y
2
) −E(Y)
2
= m
2
α
E(XY) = E(mX
2
+cX) = m(µ
2
+α) +cµ;
Cov(X,Y) = E(XY) −E(X)E(Y) = mα
corr(X,Y) = Cov(X,Y)/
_
Var(X)Var(Y) = mα/
√
m
2
α
2
=
_
+1 if m > 0,
−1 if m < 0.
Thus the correlation coefﬁcient is a measure of the extent to which the two
variables are related. It is +1 if Y increases linearly with X; 0 if there is no relation
between them; and −1 if Y decreases linearly as X increases. More generally, a
positive correlation indicates a tendency for larger X values to be associated with
larger Y values; a negative value, for smaller X values to be associated with larger
Y values.
4.1. COVARIANCE AND CORRELATION 69
Example I have two red pens, one green pen, and one blue pen, and I choose
two pens without replacement. Let X be the number of red pens that I choose and
Y the number of green pens. Then the joint p.m.f. of X and Y is given by the
following table:
Y
0 1
0 0
1
6
X 1
1
3
1
3
2
1
6
0
From this we can calculate the marginal p.m.f. of X and of Y and hence ﬁnd
their expected values and variances:
E(X) = 1, Var(X) = 1/3,
E(Y) = 1/2, Var(Y) = 1/4.
Also, E(XY) = 1/3, since the sum
E(XY) =
∑
i, j
a
i
b
j
P(X = a
i
,Y = b
j
)
contains only one term where all three factors are nonzero. Hence
Cov(X,Y) = 1/3−1/2 =−1/6,
and
corr(X,Y) =
−1/6
_
1/12
=−
1
√
3
.
The negative correlation means that small values of X tend to be associated with
larger values of Y. Indeed, if X = 0 then Y must be 1, and if X = 2 then Y must
be 0, but if X = 1 then Y can be either 0 or 1.
Example We have seen that if X and Y are independent then Cov(X,Y) = 0.
However, it doesn’t work the other way around. Consider the following joint
p.m.f.
Y
−1 0 1
−1
1
5
0
1
5
X 0 0
1
5
0
1
1
5
0
1
5
70 CHAPTER 4. MORE ON JOINT DISTRIBUTION
Now calculation shows that E(X) = E(Y) = E(XY) = 0, so Cov(X,Y) = 0. But
X and Y are not independent: for P(X =−1) = 2/5, P(Y = 0) = 1/5, but P(X =
−1,Y = 0) = 0.
We call two random variables X and Y uncorrelated if Cov(X,Y) =0 (in other
words, if corr(X,Y) = 0). So we can say:
Independent random variables are uncorrelated, but uncorrelated ran
dom variables need not be independent.
Here is the proof that the correlation coefﬁcient lies between −1 and 1. Clearly this is exactly
equivalent to proving that its square is at most 1, that is, that
Cov(X,Y)
2
≤Var(X) · Var(Y).
This depends on the following fact:
Let p, q, r be real numbers with p > 0. Suppose that px
2
+2qx +r ≥ 0 for all real
numbers x. Then q
2
≤ pr.
For, when we plot the graph y = px
2
+2qx +r, we get a parabola; the hypothesis means that this
parabola never goes below the Xaxis, so that either it lies entirely above the axis, or it touches it
in one point. This means that the quadratic equation px
2
+2qx +r = 0 either has no real roots, or
has two equal real roots. From highschool algebra, we know that this means that q
2
≤ pr.
Now let p = Var(X), q = Cov(X,Y), and r = Var(Y). Equation (4.1) shows that
px
2
+2qx +r = Var(xX +Y).
(Note that x is an arbitrary real number here and has no connection with the random variable X!)
Since the variance of a random variable is never negative, we see that px
2
+2qx+r ≥0 for all
choices of x. Now our argument above shows that q
2
≤ pr, that is, Cov(X,Y)
2
≤Var(X) · Var(Y),
as required.
4.2 Conditional random variables
Remember that the conditional probability of event B given event A is P(B  A) =
P(A∩B)/P(A).
Suppose that X is a discrete random variable. Then the conditional probability
that X takes a certain value a
i
, given A, is just
P(X = a
i
 A) =
P(A holds and X = a
i
)
P(A)
.
This deﬁnes the probability mass function of the conditional random variable
X  A.
So we can, for example, talk about the conditional expectation
E(X  A) =
∑
i
a
i
P(X = a
i
 A).
4.2. CONDITIONAL RANDOM VARIABLES 71
Now the event A might itself be deﬁned by a random variable; for example, A
might be the event that Y takes the value b
j
. In this case, we have
P(X = a
i
 Y = b
j
) =
P(X = a
i
,Y = b
j
)
P(Y = b
j
)
.
In other words, we have taken the column of the joint p.m.f. table of X and Y
corresponding to the value Y = b
j
. The sum of the entries in this column is just
P(Y = b
j
), the marginal distribution of Y. We divide the entries in the column by
this value to obtain a new distribution of X (whose probabilities add up to 1).
In particular, we have
E(X  Y = b
j
) =
∑
i
a
i
P(X = a
i
 Y = b
j
).
Example I have two red pens, one green pen, and one blue pen, and I choose
two pens without replacement. Let X be the number of red pens that I choose and
Y the number of green pens. Then the joint p.m.f. of X and Y is given by the
following table:
Y
0 1
0 0
1
6
X 1
1
3
1
3
2
1
6
0
In this case, the conditional distributions of X corresponding to the two values
of Y are as follows:
a 0 1 2
P(X = a  Y = 0) 0
2
3
1
3
a 0 1 2
P(X = a  Y = 1)
1
3
2
3
0
We have
E(X  Y = 0) =
4
3
, E(X  Y = 1) =
2
3
.
If we know the conditional expectation of X for all values of Y, we can ﬁnd
the expected value of X:
Proposition 4.3 E(X) =
∑
j
E(X  Y = b
j
)P(Y = b
j
).
Proof: E(X) =
∑
i
a
i
P(X = a
i
)
72 CHAPTER 4. MORE ON JOINT DISTRIBUTION
=
∑
i
a
i ∑
j
P(X = a
i
 Y = b
j
)P(Y = b
j
)
=
∑
j
_
∑
i
a
i
P(X = a
i
 Y = b
j
_
P(Y = b
j
)
=
∑
j
E(X  Y = b
j
)P(Y = b
j
).
In the above example, we have
E(X) = E(X  Y = 0)P(Y = 0) +E(X  Y = 1)P(Y = 1)
= (4/3) ×(1/2) +(2/3) ×(1/2)
= 1.
Example Let us revisit the geometric random variable and calculate its expected
value. Recall the situation: I have a coin with probability p of showing heads; I
toss it repeatedly until heads appears for the ﬁrst time; X is the number of tosses.
Let Y be the Bernoulli random variable whose value is 1 if the result of the
ﬁrst toss is heads, 0 if it is tails. If Y = 1, then we stop the experiment then and
there; so if Y = 1, then necessarily X = 1, and we have E(X  Y = 1) = 1. On
the other hand, if Y = 0, then the sequence of tosses from that point on has the
same distribution as the original experiment; so E(X  Y = 0) = 1 +E(X) (the 1
counting the ﬁrst toss). So
E(X) = E(X  Y = 0)P(Y = 0) +E(X  Y = 1)P(Y = 1)
= (1+E(X)) · q+1· p
= E(X)(1−p) +1;
rearranging this equation, we ﬁnd that E(X) = 1/p, conﬁrming our earlier value.
In Proposition 2.1, we saw that independence of events can be characterised
in terms of conditional probabilities: A and B are independent if and only if they
satisfy P(A  B) = P(A). A similar result holds for independence of random vari
ables:
Proposition 4.4 Let X and Y be discrete random variables. Then X and Y are
independent if and only if, for any values a
i
and b
j
of X and Y respectively, we
have
P(X = a
i
 Y = b
j
) = P(X = a
i
).
This is obtained by applying Proposition 15 to the events X = a
i
and Y = b
j
.
It can be stated in the following way: X and Y are independent if the conditional
p.m.f. of X  (Y = b
j
) is equal to the p.m.f. of X, for any value b
j
of Y.
4.3. JOINT DISTRIBUTION OF CONTINUOUS R.V.S 73
4.3 Joint distribution of continuous r.v.s
For continuous random variables, the covariance and correlation can be deﬁned
by the same formulae as in the discrete case; and Equation (4.1) remains valid.
But we have to examine what is meant by independence for continuous random
variables. The formalism here needs even more concepts from calculus than we
have used before: functions of two variables, partial derivatives, double integrals.
I assume that this is unfamiliar to you, so this section will be brief and can mostly
be skipped.
Let X andY be continuous randomvariables. The joint cumulative distribution
function of X and Y is the function F
X,Y
of two real variables given by
F
X,Y
(x, y) = P(X ≤x,Y ≤y).
We deﬁne X and Y to be independent if P(X ≤ x,Y ≤ y) = P(X ≤ x) · P(Y ≤ y),
for any x and y, that is, F
X,Y
(x, y) = F
X
(x) · F
Y
(y). (Note that, just as in the one
variable case, X is part of the name of the function, while x is the argument of the
function.)
The joint probability density function of X and Y is
f
X,Y
(x, y) =
∂
2
∂x∂y
F
X,Y
(x, y).
In other words, differentiate with respect to x keeping y constant, and then differ
entiate with respect to y keeping x constant (or the other way round: the answer is
the same for all functions we consider.)
The probability that the pair of values of (X,Y) corresponds to a point in some
region of the plane is obtained by taking the double integral of f
X,Y
over that
region. For example,
P(a ≤X ≤b, c ≤Y ≤d) =
Z
d
c
Z
b
a
f
X,Y
(x, y)dx dy
(the right hand side means, integrate with respect to x between a and b keeping y
ﬁxed; the result is a function of y; integrate this function with respect to y from c
to d.)
The marginal p.d.f. of X is given by
f
X
(x) =
Z
∞
−∞
f
X,Y
(x, y)dy,
and the marginal p.d.f. of Y is similarly
f
Y
(y) =
Z
∞
−∞
f
X,Y
(x, y)dx.
74 CHAPTER 4. MORE ON JOINT DISTRIBUTION
Then the conditional p.d.f. of X  (Y = b) is
f
X(Y=b)
(x) =
f
X,Y
(x, b)
f
Y
(b)
.
The expected value of XY is, not surprisingly,
E(XY) =
Z
∞
−∞
Z
∞
−∞
xy f
X,Y
(x, y)dx dy,
and then as in the discrete case
Cov(X,Y) = E(XY) −E(X)E(Y), corr(X,Y) =
Cov(X,Y)
_
Var(X)Var(Y)
.
Finally, and importantly,
The continuous random variables X and Y are independent if and only
if
f
X,Y
(x, y) = f
X
(x) · f
Y
(y).
As usual this holds if and only if the conditional p.d.f. of X  (Y = b) is equal to
the marginal p.d.f. of X, for any value b. Also, if X and Y are independent, then
Cov(X,Y) = corr(X,Y) = 0 (but not conversely!).
4.4 Transformation of random variables
If a continuous random variable Y is a function of another r.v. X, we can ﬁnd the
distribution of Y in terms of that of X.
Example Let X and Y be random variables. Suppose that X ∼U[0, 4] (uniform
on [0, 4]) and Y =
√
X. What is the support of Y? Find the cumulative distribution
function and the probability density function of Y.
Solution (a) The support of X is [0, 4], and Y =
√
X, so the support of Y is [0, 2].
(b) We have f
X
(x) = x/4 for 0 ≤x ≤4. Now
F
Y
(y) = P(Y ≤y)
= P(X ≤y
2
)
= F
X
(y
2
)
= y
2
/4
4.4. TRANSFORMATION OF RANDOM VARIABLES 75
for 0 ≤ y ≤ 2; of course F
Y
(y) = 0 for y < 0 and F
Y
(y) = 1 for y > 2. (Note that
Y ≤y if and only if X ≤y
2
, since Y =
√
X.)
(c) We have
f
Y
(y) =
d
dy
F
Y
(y) =
_
y/2 if 0 ≤y ≤2,
0 otherwise.
The argument in (b) is the key. If we knowY as a function of X, say Y =g(X),
where g is an increasing function, then the event Y ≤ y is the same as the event
X ≤ h(Y), where h is the inverse function of g. This means that y = g(x) if and
only if x = h(y). (In our example, g(x) =
√
x, and so h(y) = y
2
.) Thus
F
Y
(y) = F
X
(h(y)),
and so, by the Chain Rule,
f
Y
(y) = f
X
(h(y))h
(y),
where h
is the derivative of h. (This is because f
X
(x) is the derivative of F
X
(x)
with respect to its argument x, and the Chain Rule says that if x = h(y) we must
multiply by h
(y) to ﬁnd the derivative with respect to y.)
Applying this formula in our example we have
f
Y
(y) =
1
4
· 2y =
y
2
for 0 ≤y ≤2, since the p.d.f. of X is f
X
(x) = 1/4 for 0 ≤x ≤4.
Here is a formal statement of the result.
Theorem 4.5 Let X be a continuous random variable. Let g be a real function
which is either strictly increasing or strictly decreasing on the support of X, and
which is differentiable there. Let Y = g(X). Then
(a) the support of Y is the image of the support of X under g;
(b) the p.d.f. of Y is given by f
Y
(y) = f
X
(h(y))h
(y), where h is the inverse
function of g.
For example, here is the proof of Proposition 3.6: if X ∼ N(µ, σ
2
) and Y =
(X −µ)/σ, then Y ∼N(0, 1). Recall that
f
X
(x) =
1
σ
√
2π
e
−(x−µ)
2
/2σ
2
.
76 CHAPTER 4. MORE ON JOINT DISTRIBUTION
We have Y = g(X), where g(x) = (x −µ)/σ; this function is everywhere strictly
increasing (the graph is a straight line with slope 1/σ), and the inverse function is
x = h(y) = σy +µ. Thus, h
(y) = σ, and
f
Y
(y) = f
X
(σy +µ) · σ =
1
√
2π
e
−y
2
/2
,
the p.d.f. of a standard normal variable.
However, rather than remember this formula, together with the conditions for
its validity, I recommend going back to the argument we used in the example.
If the transforming function g is not monotonic (that is, not either increasing
or decreasing), then life is a bit more complicated. For example, if X is a random
variable taking both positive and negative values, and Y = X
2
, then a given value
y of Y could arise from either of the values
√
y and −
√
y of X, so we must work
out the two contributions and add them up.
Example X ∼N(0, 1) and Y = X
2
. Find the p.d.f. of Y.
The p.d.f. of X is (1/
√
2π)e
−x
2
/2
. Let Φ(x) be its c.d.f., so that P(X ≤ x) =
Φ(x), and
Φ
(x) =
1
√
2π
e
−x
2
/2
.
Now Y = X
2
, so Y ≤y if and only if −
√
y ≤X ≤
√
y. Thus
F
Y
(y) = P(Y ≤y)
= P(−
_
y ≤X ≤
_
y)
= Φ(
_
y) −Φ(−
_
y)
= Φ(
_
y −(1−Φ(
_
y)) (by symmetry of N(0, 1))
= 2Φ(
_
y) −1.
So
f
Y
(y) =
d
dy
F
Y
(y)
= 2Φ
(
√
y) ·
1
2
√
y
(by the Chain Rule)
=
1
√
2πy
e
−y/2
.
Of course, this is valid for y > 0; for y < 0, the p.d.f. is zero.
4.5. WORKED EXAMPLES 77
Note the 2 in the line labelled “by the Chain Rule”. If you blindly applied
the formula of Theorem 4.5, using h(y) =
√
y, you would not get this 2; it arises
from the fact that, since Y = X
2
, each value of Y corresponds to two values of X
(one positive, one negative), and each value gives the same contribution, by the
symmetry of the p.d.f. of X.
4.5 Worked examples
Question Two numbers X and Y are chosen independently from the uniform
distribution on the unit interval [0, 1]. Let Z be the maximum of the two numbers.
Find the p.d.f. of Z, and hence ﬁnd its expected value, variance and median.
Solution The c.d.f.s of X and Y are identical, that is,
F
X
(x) = F
Y
(x) =
_
0 if x < 0,
x if 0 < x < 1,
1 if x > 1.
(The variable can be called x in both cases; its name doesn’t matter.)
The key to the argument is to notice that
Z = max(X,Y) ≤x if and only if X ≤x and Y ≤x.
(For, if both X and Y are smaller than a given value x, then so is their maximum;
but if at least one of them is greater than x, then again so is their maximum.) For
0 ≤x ≤1, we have P(X ≤x) = P(Y ≤x) = x; by independence,
P(X ≤x and Y ≤x) = x · x = x
2
.
Thus P(Z ≤ x) = x
2
. Of course this probability is 0 if x < 0 and is 1 if x > 1. So
the c.d.f. of Z is
F
Z
(x) =
_
0 if x < 0,
x
2
if 0 < x < 1,
1 if x > 1.
The median of Z is the value of m such that F
Z
(m) = 1/2, that is m
2
= 1/2, or
m = 1/
√
2.
We obtain the p.d.f. of Z by differentiating:
f
Z
(x) =
_
2x if 0 < x < 1,
0 otherwise.
Then we can ﬁnd E(Z) and Var(Z) in the usual way:
E(Z) =
Z
1
0
2x
2
dx =
2
3
, Var(Z) =
Z
1
0
2x
3
dx −
_
2
3
_
2
=
1
18
.
78 CHAPTER 4. MORE ON JOINT DISTRIBUTION
Question I roll a fair die bearing the numbers 1 to 6. If N is the number showing
on the die, I then toss a fair coin N times. Let X be the number of heads I obtain.
(a) Write down the p.m.f. for X.
(b) Calculate E(X) without using this information.
Solution (a) If we were given that N = n, say, then X would be a binomial
Bin(n, 1/2) random variable. So P(X = k  N = n) =
n
C
k
(1/2)
n
.
By the ToTP,
P(X = k) =
6
∑
n=1
P(X = k  N = n)P(N = n).
Clearly P(N = n) = 1/6 for n = 1, . . . , 6. So to ﬁnd P(X = k), we add up the
probability that X = k for a Bin(n, 1/2) r.v. for n = k, . . . , 6 and divide by 6. (We
start at k because you can’t get k heads with fewer than k coin tosses!) The answer
comes to
k 0 1 2 3 4 5 6
P(X = k)
63
384
120
384
99
384
64
384
29
384
8
384
1
384
For example,
P(X = 4) =
4
C
4
(1/2)
4
+
5
C
4
(1/2)
5
+
6
C
4
(1/2)
6
6
=
4+10+15
384
.
(b) By Proposition 4.3,
E(X) =
6
∑
n=1
E(X  (N = n))P(N = n).
Nowif we are given that N =n then, as we remarked, X has a binomial Bin(n, 1/2)
distribution, with expected value n/2. So
E(X) =
6
∑
n=1
(n/2) · (1/6) =
1+2+3+4+5+6
2· 6
=
7
4
.
Try working it out from the p.m.f. to check that the answer is the same!
Appendix A
Mathematical notation
The Greek alphabet
Mathematicians use the Greek alpha
bet for an extra supply of symbols.
Some, like π, have standard meanings.
You don’t need to learn this; keep it
for reference. Apologies to Greek stu
dents: you may not recognise this, but
it is the Greek alphabet that mathe
maticians use!
Pairs that are often confused are zeta
and xi, or nu and upsilon, which look
alike; and chi and xi, or epsilon and
upsilon, which sound alike.
Name Capital Lowercase
alpha A α
beta B β
gamma Γ γ
delta ∆ δ
epsilon E ε
zeta Z ζ
eta H η
theta Θ θ
iota I ι
kappa K κ
lambda Λ λ
mu M µ
nu N ν
xi Ξ ξ
omicron O o
pi Π π
rho P ρ
sigma Σ σ
tau T τ
upsilon ϒ υ
phi Φ φ
chi X χ
psi Ψ ψ
omega Ω ω
79
80 APPENDIX A. MATHEMATICAL NOTATION
Numbers
Notation Meaning Example
N Natural numbers 1, 2, 3, . . .
(some people include 0)
Z Integers . . . , −2, −1, 0, 1, 2, . . .
R Real numbers
1
2
,
√
2, π, . . .
x modulus 2 = 2, −3 = 3
a/b or
a
b
a over b 12/3 = 4, 2/4 = 0.5
a  b a divides b 4  12
m
C
n
or
_
m
n
_
m choose n
5
C
2
= 10
n! n factorial 5! = 120
b
∑
i=a
x
i
x
a
+x
a+1
+· · · +x
b
3
∑
i=1
i
2
= 1
2
+2
2
+3
2
= 14
(see section on Summation below)
x ≈y x is approximately equal to y
Sets
Notation Meaning Example
{. . .} a set {1, 2, 3}
NOTE: {1, 2} ={2, 1}
x ∈ A x is an element of the set A 2 ∈ {1, 2, 3}
{x : . . .} the set of all x such that . . . {x : x
2
= 4} ={−2, 2}
or {x  . . .}
A cardinality of A {1, 2, 3} = 3
(number of elements in A)
A∪B A union B {1, 2, 3}∪{2, 4} ={1, 2, 3, 4}
(elements in either A or B)
A∩B A intersection B {1, 2, 3}∩{2, 4} ={2}
(elements in both A and B)
A\B set difference {1, 2, 3}\{2, 4} ={1, 3}
(elements in A but not B)
A ⊆B A is a subset of B (or equal) {1, 3} ⊆{1, 2, 3}
A
complement of A everything not in A
/ 0 empty set (no elements) {1, 2}∩{3, 4} = / 0
(x, y) ordered pair NOTE: (1, 2) = (2, 1)
A×B Cartesian product {1, 2}×{1, 3} =
(set of all ordered pairs) {(1, 1), (2, 1), (1, 3), (2, 3)}
81
Summation
What is it?
Let a
1
, a
2
, a
3
, . . . be numbers. The notation
n
∑
i=1
a
i
(read “sum, from i equals 1 to n, of a
i
”), means: add up the numbers a
1
, a
2
, . . . , a
n
;
that is,
n
∑
i=1
a
i
= a
1
+a
2
+· · · +a
n
.
The notation
n
∑
j=1
a
j
means exactly the same thing. The variable i or j is called
a “dummy variable”.
The notation
m
∑
i=1
a
i
is not the same, since (if m and n are different) it is telling
us to add up a different number of terms.
The sum doesn’t have to start at 1. For example,
20
∑
i=10
a
i
= a
10
+a
11
+· · · +a
20
.
Sometimes I get lazy and don’t bother to write out the values: I just say
∑
i
a
i
to mean: add up all the relevant values. For example, if X is a discrete random
variable, then we say that
E(X) =
∑
i
a
i
P(X = a
i
)
where the sum is over all i such that a
i
is a value of the random variable X.
Manipulation
The following three rules hold.
n
∑
i=1
(a
i
+b
i
) =
n
∑
i=1
a
i
+
n
∑
i+1
b
i
. (A.1)
Imagine the as and bs written out with a
1
+b
1
on the ﬁrst line, a
2
+b
2
on the
second line, and so on. The lefthand side says: add the two terms in each line,
82 APPENDIX A. MATHEMATICAL NOTATION
and then add up all the results. The righthand side says: add the ﬁrst column (all
the as) and the second column (all the bs), and then add the results. The answers
must be the same.
_
n
∑
i=1
a
i
_
·
_
m
∑
j=1
b
j
_
=
n
∑
i=1
m
∑
j=1
a
i
b
j
. (A.2)
The double sum says add up all these products, for all values of i and j. A simple
example shows how it works: (a
1
+a
2
)(b
1
+b
2
) = a
1
b
1
+a
1
b
2
+a
2
b
1
+a
2
b
2
.
If in place of numbers, we have functions of x, then we can “differentiate
termbyterm”:
d
dx
n
∑
i+1
f
i
(x) =
n
∑
i+1
d
dx
f
i
(x). (A.3)
The lefthand side says: add up the functions and differentiate the sum. The right
says: differentiate each function and add up the derivatives.
Another useful result is the Binomial Theorem:
(x +y)
n
=
n
∑
k=0
n
C
k
x
n−k
y
k
.
Inﬁnite sums
Sometimes we meet inﬁnite sums, which we write as
∞
∑
i=1
a
i
for example. This
doesn’t just mean “add up inﬁnitely many values”, since that is not possible. We
need Analysis to give us a deﬁnition in general. But sometimes we know the
answer another way: for example, if a
i
= ar
i−1
, where −1 < r < 1, then
∞
∑
i=1
a
i
= a+ar +ar
2
+· · · =
a
1−r
,
using the formula for the sum of the “geometric series”. You also need to know
the sum of the “exponential series”
∞
∑
i=0
x
i
i!
= 1+x +
x
2
2
+
x
3
6
+
x
4
24
+· · · = e
x
.
Do the three rules of the preceding section hold? Sometimes yes, sometimes
no. In Analysis you will see some answers to this question.
In all the examples you meet in this book, the rules will be valid.
Appendix B
Probability and random variables
Notation
In the table, A and B are events, X and Y are random variables.
Notation Meaning Page
P(A) probability of A 3
P(A  B) conditional probability of A given B 24
X =Y the values of X and Y are equal
X ∼Y X and Y have the same distribution 41
(that is, same p.m.f. or same p.d.f.)
E(X) expected value of X 41
Var(X) variance of X 42
Cov(X,Y) covariance of X and Y 67
corr(X,Y) correlation coefﬁcient of X and Y 68
X  B conditional random variable 70
X  (Y = b) 71
Bernoulli random variable Bernoulli(p) (p. 48)
• Occurs when there is a single trial with a ﬁxed probability p of success.
• Takes only the values 0 and 1.
• p.m.f. P(X = 0) = q, P(X = 1) = p, where q = 1−p.
• E(X) = p, Var(X) = pq.
83
84 APPENDIX B. PROBABILITY AND RANDOM VARIABLES
Binomial random variable Bin(n, p) (p. 49)
• Occurs when we are counting the number of successes in n independent
trials with ﬁxed probability p of success in each trial, e.g. the number of
heads in n coin tosses. Also, sampling with replacement from a population
with a proportion p of distinguished elements.
• The sum of n independent Bernoulli(p) random variables.
• Values 0, 1, 2, . . . , n.
• p.m.f. P(X = k) =
n
C
k
q
n−k
p
k
for 0 ≤k ≤n, where q = 1−p.
• E(X) = np, Var(X) = npq.
Hypergeometric random variable Hg(n, M, N) (p. 51)
• Occurs when we are sampling n elements without replacement from a pop
ulation of N elements of which M are distinguished.
• Values 0, 1, 2, . . . , n.
• p.m.f. P(X = k) = (
M
C
k
·
N−M
C
n−k
)/
N
C
n
.
• E(X) = n
_
M
N
_
, Var(X) = n
_
M
N
__
N−M
N
__
N−n
N−1
_
.
• Approximately Bin(n, M/N) if n is small compared to N, M, N−M.
Geometric random variable Geom(p) (p. 52)
• Describes the number of trials up to and including the ﬁrst success in a
sequence of independent Bernoulli trials, e.g. number of tosses until the
ﬁrst head when tossing a coin.
• Values 1, 2, . . . (any positive integer).
• p.m.f. P(X = k) = q
k−1
p, where q = 1−p.
• E(X) = 1/p, Var(X) = q/p
2
.
85
Poisson random variable Poisson(λ) (p. 54)
• Describes the number of occurrences of a random event in a ﬁxed time
interval, e.g. the number of ﬁsh caught in a day.
• Values 0, 1, 2, . . . (any nonnegative integer)
• p.m.f. P(X = k) = e
−λ
λ
k
/k! .
• E(X) = λ, Var(X) = λ.
• If n is large, p is small, and np = λ, then Bin(n, p) is approximately equal
to Poisson(λ) (in the sense that the p.m.f.s are approximately equal).
Uniform random variable U[a, b] (p. 58)
• Occurs when a number is chosen at random from the interval [a, b], with all
values equally likely.
• p.d.f. f (x) =
_
0 if x < a,
1/(b−a) if a ≤x ≤b,
0 if x > b.
• c.d.f. F(x) =
_
0 if x < a,
(x −a)/(b−a) if a ≤x ≤b,
1 if x > b.
• E(X) = (a+b)/2, Var(X) = (b−a)
2
/12.
Exponential random variable Exp(λ) (p. 59)
• Occurs in the same situations as the Poisson random variable, but measures
the time from now until the ﬁrst occurrence of the event.
• p.d.f. f (x) =
_
0 if x < 0,
λ e
−λx
if x ≥0.
• c.d.f. F(x) =
_
0 if x < 0,
1−e
−λx
if x ≥0.
• E(X) = 1/λ, Var(X) = 1/λ
2
.
• However long you wait, the time until the next occurrence has the same
distribution.
86 APPENDIX B. PROBABILITY AND RANDOM VARIABLES
Normal random variable N(µ, σ
2
) (p. 59)
• The limit of the sum (or average) of many independent Bernoulli random
variables. This also works for many other types of random variables: this
statement is known as the Central Limit Theorem.
• p.d.f. f (x) =
1
σ
√
2π
e
−(x−µ)
2
/2σ
2
.
• No simple formula for c.d.f.; use tables.
• E(X) = µ, Var(X) = σ
2
.
• For large n, Bin(n, p) is approximately N(np, npq).
• Standard normal N(0, 1) is given in the table. If X ∼ N(µ, σ
2
), then (X −
µ)/σ ∼N(0, 1).
The c.d.f.s of the Binomial, Poisson, and Standard Normal random variables
are tabulated in the New Cambridge Statistical Tables, Tables 1, 2 and 4.
ii
Preface
Here are the course lecture notes for the course MAS108, Probability I, at Queen Mary, University of London, taken by most Mathematics students and some others in the ﬁrst semester. The description of the course is as follows: This course introduces the basic notions of probability theory and develops them to the stage where one can begin to use probabilistic ideas in statistical inference and modelling, and the study of stochastic processes. Probability axioms. Conditional probability and independence. Discrete random variables and their distributions. Continuous distributions. Joint distributions. Independence. Expectations. Mean, variance, covariance, correlation. Limiting distributions. The syllabus is as follows: 1. Basic notions of probability. Sample spaces, events, relative frequency, probability axioms. 2. Finite sample spaces. Methods of enumeration. Combinatorial probability. 3. Conditional probability. Theorem of total probability. Bayes theorem. 4. Independence of two events. Mutual independence of n events. Sampling with and without replacement. 5. Random variables. Univariate distributions  discrete, continuous, mixed. Standard distributions  hypergeometric, binomial, geometric, Poisson, uniform, normal, exponential. Probability mass function, density function, distribution function. Probabilities of events in terms of random variables. 6. Transformations of a single random variable. Mean, variance, median, quantiles. 7. Joint distribution of two random variables. Marginal and conditional distributions. Independence. iii
read the corresponding sections straight after the lectures. published by Wadsworth. C.iv 8. published by Oxford University Press. V. Lindley and W. • Probability and Statistics for Engineering and the Sciences by Jay L. especially proofs. Other books which you can use instead are: • Probability and Statistics in Engineering and Management Science by W. Montgomery. All of these books will also be useful to you in the courses Statistics I and Statistical Inference. You need at most one of the three textbooks listed below. both in order and notation. Hines and D. published by Wadsworth. These course notes explain the naterial in the syllabus. which will be provided for you in examinations. Devore (ﬁfth edition). Scott. by John Haigh. Chapters 1–4. You need to become familiar with the tables in this book. They have been “ﬁeldtested” on the class of 2000. You should also buy a copy of • New Cambridge Statistical Tables by D. Limiting distributions in the Binomial case. Many of the examples are taken from the course homework sheets or past exam papers. Set books The notes cover only material in the Probability I course. Chapters 2–8. the lectures go into more detail at several points. but you will need the statistical tables. W. If you ﬁnd the course difﬁcult then you are advised to buy this book. The next book is not compulsory but introduces the ideas in a friendly way: • Taking Chances: Winning with Probability. correlation. • Mathematical Statistics and Data Analysis by John A. F. published by Cambridge University Press. The textbooks listed below will be useful for other courses on probability and statistics. Rice. However. 9. Means and variances of linear functions of random variables. . and do extra exercises from it. Covariance. Chapters 2–5 of this book are very close to the material in the notes. published by Wiley.
where you can run simulations etc.math. with history and many nice pictures.uk/˜pjc/MAS108/ This includes a preliminary version of these notes. by Charles M.combinatorics. http://www.org/Surveys/ds5/VennEJC.newton. a set of webbased resources for students and teachers of probability and statistics.maths. test and past exam papers.qmw. Web pages for other Queen Mary maths courses can be found from the online version of the Maths Undergraduate Handbook.v Web resources the address Course material for the MAS108 course is kept on the Web at http://www. and some solutions. Peter J.uk/wmy2kposters/july/ The Birthday Paradox (poster in the London Underground.edu/˜chance/teaching aids/ books articles/probability book/pdf.uah.dartmouth.html A textbook Introduction to Probability.cam. Cameron December 2000 . July 2000).ac. http://www. available free. Laurie Snell. Other web pages of interest include http://www. Grinstead and J.ac. together with coursework sheets. http://www.edu/stat/ The Virtual Laboratories in Probability and Statistics.html An article on Venn diagrams by Frank Ruskey. with many exercises.
vi .
. . . . 1. . . .10 Independence . . . 3. . . . . . .5 InclusionExclusion Principle . . . .2 Genetics . . . . . . . . . . . . . . . . . . . . . . 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . .2 What is probability? .6 Other results about sets . . . . . . . . . . . 1. . . . . .6 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12 Properties of independence . . events . . . .7 Worked examples . 3 Random variables 3. . . . . . . . . .8 Stopping rules . . . . . . . . . . . . . . vii . . . . . . . . . . . . . . . . . . .1 What is conditional probability? . . . 1. . . . 3. . . . . . 2. . . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . .Contents 1 Basic ideas 1. . .m. . . . . . . . . . . 3. . . 3. . .13 Worked examples . . . . .3 Expected value and variance . . . . . . . . . . . . 1. . . . . .7 Sampling . . . . . . . .5 Some discrete random variables . . . . . . . . 2 Conditional probability 2.9 Questionnaire results . . . . . . . . . . . . . . . . . . . . 1. . . . . . . . 1. . . . . . . . . . . . . . . . . . . . . . . . .4 Proving things from the axioms . . . . .3 The Theorem of Total Probability 2. . . . . . . . . . . . . . . . . . . . of two random variables 3. . . . . . . . . . . . .4 Sampling revisited . 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Sample space. . . . . . . 1. . . . . . . . . . . . . . . . . . . . . .3 Kolmogorov’s Axioms . . . . . . .f. 1 1 3 3 4 6 7 8 12 13 14 16 17 20 23 23 25 26 28 29 31 34 39 39 40 41 43 47 55 . . . . 1. . . . . . . . . . . . . . . . . . 1. . . . . . . . . . . . .11 Mutual independence . . . . . . . . . . . . . .1 What are random variables? . . . . . . . . . .2 Probability mass function . . . 1. . . . . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . 1. . . . . . . . . . . . . . . . . . . . . . . .4 Joint p. . . . . . . . . . .5 Bayes’ Theorem . .6 Iterated conditional probability . .
. . . . . . . . . . A Mathematical notation B Probability and random variables . . . . . .2 Conditional random variables . . . .viii 3. . . . . . . . .s 4. CONTENTS . . . . . . . . . .4 Transformation of random variables 4. . . . . . . . . . . .8 3. . . . . . 4.5 Worked examples . 4. . . . . . . . . quartiles. . . . . . . . . . 57 58 61 63 67 67 70 73 74 77 79 83 More on joint distribution 4. . . . . . Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Covariance and correlation . . . . . Some continuous random variables On using tables . . . . . . . . .3 Joint distribution of continuous r. . . . . . .7 3. . . . . . . . .9 3. . . . . . . . .v. percentiles . . . . . . . . . . . . . . . . . . . . . . . . .10 4 Median. . . . . . . . . . . . .
writing down some basic axioms which probability must satisfy. (Remember that S  is the number of elements in the set S ). it is most unlikely. 4. T T T }.) If all outcomes are equally likely. or there is some physical reason for assuming it. In the coins example. 1. If I toss a coin three times and record the result. 1 . and examine what it means for events to be independent.1 Sample space. In the beans example. For example. HT T. 6. T HH. and making deductions from these.Chapter 1 Basic ideas In this chapter. the assumption will hold if the coin is ‘fair’: this means that there is no physical reason for it to favour one side over the other. where (for example) HT H means ‘heads on the ﬁrst toss. 2. T HT. 8. We usually call it S . 9. events The general setting is: We perform an experiment which can have a number of different outcomes. HT H. then each has probability 1/S . We also look at different kinds of sampling. we don’t really answer the question ‘What is probability?’ Nobody has a really good answer to this question. if I plant ten bean seeds and count the number that germinate. 7. the sample space is S = {0. Sometimes we can assume that all the outcomes are equally likely. then tails. 1. then heads again’. The sample space is the set of all possible outcomes of the experiment. We take a mathematical approach. 3. T T H. It is important to be able to list the outcomes clearly. HHT. 5. (Don’t assume this unless either you are told to. 10}. the sample space is S = {HHH.
while {a} is an event. Albert Einstein wrote. BASIC IDEAS On this point. B = {HHH. • A (read ‘A complement’) consists of all outcomes not in A (that is. / • 0 (read ‘empty set’) for the event which doesn’t contain any outcomes. So. In other words: Don’t just assume that all outcomes are equally likely. and we usually write P(a). HHT. if all outcomes are equally likely.) We can build new events from old ones: • A ∪ B (read ‘A union B’) consists of all the outcomes in A or in B (or both!) • A ∩ B (read ‘A intersection B’) consists of all the outcomes in both A and B. In particular. HT H. in his 1905 paper On a heuristic point of view concerning the production and transformation of light (for which he was awarded the Nobel Prize). In calculating entropy by moleculartheoretic methods. If A = {a} is a simple event. T HH}. • A \ B (read ‘A minus B’) consists of all the outcomes in A but not in B. we have A . while the event ‘heads on every throw’ is simple (as a set. P(A) = S  In our example.2 CHAPTER 1. An event is simple if it consists of just a single outcome. A and B are compound events. HT H. indeed a simple event. the word “probability” is often used in a sense differing from the way the word is deﬁned in probability theory. In the above example. The probability of an event is calculated by adding up the probabilities of all the outcomes comprising that event. which is simpler to write than P({a}). let A be the event ‘more heads than tails’ and B the event ‘heads on last throw’. and is compound otherwise. Then A = {HHH. then the probability of A is just the probability of the outcome a. . especially when you are given enough information to calculate their probabilities! An event is a subset of S . (Note that a is an outcome. T HH. In the example. it is {HHH}). We can specify an event by listing all the outcomes that make it up. “cases of equal probability” are often hypothetically stipulated when the theoretical methods employed are deﬁnite enough to permit a deduction rather than a stipulation. both A and B have probability 4/8 = 1/2. S \ A). T T H}.
According to Kolmogorov’s axioms. In the example. to say that the probability of getting heads when a coin is tossed means that. this is not equal to P(A) · P(B). These numbers satisfy three axioms: Axiom 1: For any event A.1.2. and it doesn’t explain what probability means. . . N. no two of the events overlap. That is.. We develop ways of doing calculations with probability. But if you toss a coin 1000 times.2 What is probability? There is really no answer to this question. But this argument doesn’t work in all cases.3 Kolmogorov’s Axioms Remember that an event is a subset of the sample space S . Note that P(A ∩ B) = 3/8. and A ∩ B is the event {HHH. We regard probability as a mathematical construction satisfying some axioms (devised by the Russian mathematician A. this is not the same as either a vertical slash  or a forward slash /. Some people think of it as ‘limiting frequency’. which is a number. Kolmogorov). / say A1 . we have P(A) ≥ 0. like the one we used for a fair coin. 1. Some people say it is subjective. Axiom 2: P(S ) = 1. But what about 450. But we can’t build a theory on something subjective. you are not likely to get exactly 500 heads. A number of events. if the coin is tossed many times. that is. HT H}. The answer agrees well with experiment. . T HH. You wouldn’t be surprised to get only 495. are called mutually disjoint or pairwise disjoint if Ai ∩ A j = 0 for any two of the events Ai and A j . you might change your view if you knew that the owner of the coin was a magician or a con man. . or 100? Some people would say that you can work out probability by physical arguments. each event A has a probability P(A). You say that the probability of heads in a coin toss is 1/2 because you have no reason for thinking either heads or tails more likely. WHAT IS PROBABILITY? 3 Note the backwardsloping slash. despite what you read in some books! 1. it is likely to come down heads about half the time. so that (for example) we can calculate how unlikely it is to get 480 or fewer heads in 1000 tosses of a fair coin. A is the event ‘more tails than heads’. A2 .
that is. important statement. Sometimes we separate Axiom 3 into two parts: Axiom 3a if there are only ﬁnitely many events A1 . . just as obvious as the axioms. . . we deﬁne a new event Ai containing only the outcome ai . That means. and a corollary. . Usually. . Proposition 1. and a corollary is something that follows quite easily from a theorem or proposition that came before. There is really no difference between a theorem. i=1 n and Axiom 3b for inﬁnitely many. then P(A1 ∪ A2 ∪ · · ·) = P(A1 ) + P(A2 ) + · · · Note that in Axiom 3. Notice that we write i=1 ∑ P(Ai) n for P(A1 ) + P(A2 ) + · · · + P(An ). . . a proposition a rather smaller statement. . . . A2 . an }. a theorem is a big. n. To prove the proposition. We will only use Axiom 3a. Then A1 . These properties seem obvious. say A = {a1 . . for i = 1. . . An . . then P(A) = P(a1 ) + P(a2 ) + · · · + P(an ). we have the union of events and the sum of numbers. every step must be justiﬁed by appealing to an axiom. but 3b is important later on. for example. Here are some examples of things proved from the axioms. a proposition. never write P(A1 ) ∪ P(A2 ). are pairwise disjoint. Don’t mix these up. . so that we have P(A1 ∪ · · · ∪ An ) = ∑ P(Ai ). A2 . BASIC IDEAS Axiom 3: If the events A1 . . a2 .1 If the event A contains only a ﬁnite number of outcomes. but the point of this game is that we assume only the axioms. and build everything else from that. An are mutually disjoint . they all have to be proved.4 CHAPTER 1. 1. Ai = {ai }.4 Proving things from the axioms You can prove simple properties of probability from the axioms. . . .
3. Corollary 1. Now going back to Proposition 1. so P(0) = 1 − P(S ) by Proposition 1. we see that. say S = {a1 . and their probabilities sum to 1. we can use it on the same basis as an axiom to prove further facts. then P(a1 ) + P(a2 ) + · · · + P(an ) = 1. from which we get P(A) ≤ 1. then each has probability 1/n. and P(A ) ≥ 0 by Axiom 1. Proposition 1. justifying the principle we used earlier. Then A1 ∩ A2 = 0 (that is. you have made a mistake! / Corollary 1. So P(A) = P(A1 ) = 1 − P(A2 ).1. if all the n outcomes are equally likely. then P(A) = A S  for any event A. so 1 − P(A) ≥ 0. PROVING THINGS FROM THE AXIOMS 5 (each contains only one element which is in none of the others). the events A1 and A2 are disjoint). and A1 ∪ A2 = S . and P(S ) = 1 by Axiom 2. / / For 0 = S . For P(a1 ) + P(a2 ) + · · · + P(an ) = P(S ) by Proposition 1.5 P(0) = 0.4 P(A) ≤ 1 for any event A. so by Axiom 3a. .4. . and P(S ) = 1 by Axiom 2. Notice that once we have proved something.3. Corollary 1.3 P(A ) = 1 − P(A) for any event A. / Let A1 = A and A2 = A (the complement of A). an }. . that is.1. and A1 ∪ A2 ∪ · · · ∪ An = A.1. Remember that if you ever calculate a probability to be less than 0 or more than 1. For 1 − P(A) = P(A ) by Proposition 1. if all outcomes are equally likely. . 1/S . Now we see that. / so P(0) = 0. . we have P(A) = P(a1 ) + P(a2 ) + · · · + P(an ).2 If the sample space S is ﬁnite. So P(A1 ) + P(A2 ) = P(A1 ∪ A2 ) (Axiom 3) = P(S ) = 1 (Axiom 2).
The arguments for the other two pairs are similar – you should do them yourself. Proposition 1. take A1 = A. to ﬁnd the size of A ∪ B. namely A1 = A ∩ B. since all elements of A1 belong to B but no elements of A2 do. P(A1 ) + P(A2 ) = P(A1 ∪ A2 ) = P(B). The notation A ⊆ B means that A is contained in B. / Now A1 ∩ A2 = 0. We now prove this from the axioms.6 CHAPTER 1. / This time. In other words.5 InclusionExclusion Principle A B A Venn diagram for two sets A and B suggests that. Similarly we have A1 ∪A2 = A and A1 ∪A3 = B. We see that A ∪ B is made up of three parts. A3 = B \ A. A2 . (We have three pairs of sets to check. then P(A) ≤ P(B). A2 = A \ B.7 P(A ∪ B) = P(A) + P(B) − P(A ∩ B). So by Axiom 3. so P(A) ≤ P(B). Again we have A1 ∩ A2 = 0 (since the elements of B \ A are. since anything in A ∪ B is in both these sets or just the ﬁrst or just the second. A2 = B \ A. using the Venn diagram as a guide. by deﬁnition. Now P(B \ A) ≥ 0 by Axiom 1. every outcome in A also belongs to B. P(A) + P(B \ A) = P(B). but then we have included the size of A ∩ B twice. and A1 ∪ A2 = B. we add the size of A and the size of B. A3 are mutually disjoint. that is. as we had to show. 1. The sets A1 . In terms of probability: Proposition 1.) . Indeed we do have A ∪ B = A1 ∪ A2 ∪ A3 . not in A).6 If A ⊆ B. BASIC IDEAS Here is another result. so we have to take it off.
From this we obtain P(A) + P(B) − P(A ∩ B) = (P(A1 ) + P(A2 )) + (P(A1 ) + P(A3 )) − P(A1 ) = P(A1 ) + P(A2 ) + P(A3 ) = P(A ∪ B) as required. Proposition 1. P(B). But then we ﬁnd that the outcomes lying in all three sets have been taken off completely. Here it is for three events. so must be put back. We will not give formal proofs of these. so we subtract P(A∩B). try to prove it yourself.C be subsets of S . C A B To calculate P(A ∪ B ∪C). B.6 Other results about sets There are other standard results about sets which are often useful in probability theory. but gets more complicated. that is.1. Proposition 1. we ﬁrst add up P(A). P(A∩C) and P(B∩C). The parts in common have been counted twice. OTHER RESULTS ABOUT SETS So. Can you extend this to any number of events? 1. we have P(A) = P(A1 ) + P(A2 ). P(A ∪ B) = P(A1 ) + P(A2 ) + P(A3 ). Here are some examples. and P(C). You should draw Venn diagrams and convince yourself that they work. we add P(A ∩ B ∩C).C.6. De Morgan’s Laws: (A ∪ B) = A ∩ B and (A ∩ B) = A ∪ B . we have P(A∪B∪C) = P(A)+P(B)+P(C)−P(A∩B)−P(A∩C)−P(B∩C)+P(A∩B∩C). by Axiom 3.9 Let A. . B.8 For any three events A. Distributive laws: (A ∩ B) ∪C = (A ∪C) ∩ (B ∪C) and (A ∪ B) ∩C = (A ∩C) ∪ (B ∩C). 7 The InclusionExclusion Principle extends to more than two events. P(B) = P(A1 ) + P(A3 ).
. RB. In this case. RB. then A 2 1 = = . S = {R. Sampling without replacement means that we choose a pen but do not put it back. PR. BP. GB. RP. RG. where R means ‘red pen chosen’ and so on. G. if I have a set of n objects and choose one. the sample space for choosing two pens is { RG. PR. then choose a pen again (which may be the same pen as before or a different one). BR. GP. GP. Sampling with replacement means that we choose a pen. GR. What if we choose more than one pen? We have to be more careful to specify the sample space. BB. and an event consisting of m of the outcomes has probability m/n. with probability 6/12 = 1/2. In this case. with each one equally likely to be chosen. GG.8 CHAPTER 1. RB. green. they are red. BG. RP. PG. P(A) = S  4 2 More generally. PP} The event ‘at least one red pen’ is {RR. each pen has the same chance of being selected. RB. so that our ﬁnal selection cannot include two pens of the same colour. P}. and purple. or • sampling without replacement. In this case. RG. GR. BR. we have to say whether we are • sampling with replacement. PG. RP. blue.7 Sampling I have four pens in my desk drawer. If we choose two pens with replacement. First. then each of the n outcomes has probability 1/n. I draw a pen. and so on until the required number of pens have been chosen. the sample space is {RR. B. note its colour. GR. and has probability 7/16. BP. PR}. if A is the event ‘red or green pen chosen’. RP. PB } and the event ‘at least one red pen’ is {RG. PR}. BR. BASIC IDEAS 1. GB. BR. PB. GR. BG. put it back and shake the drawer.
) The product of k factors each equal to n is nk .1. we multiply the choices for different objects. It doesn’t really matter in this case whether we choose the pens one at a time or simply take two pens out of the drawer. (Each element is written as a set since. with probability 3/6 = 1/2. but this is much harder to prove than the others. {R. {R. for sampling with replacement and ordered sample. k!(n − k)! Theorem 1. (n − k)! so that n Ck = n! . containing six elements. SAMPLING 9 Now there is another issue. There are formulae for the sample space size in these three cases.10 The number of selections of k objects from a set of n objects is given in the following table. and you are very unlikely to need it. {G. {R. G}. P}}.7. we don’t care which element is ﬁrst. only which elements are actually present. So the sample space is a set of sets!) The event ‘at least one red pen’ is {{R. B}. {G. Here are the proofs of the other three cases. and n Pk = n! . P}. and so on. (Think of the choices as being described by a branching tree. {R. G}. These involve the following functions: n! = n(n − 1)(n − 2) · · · 1 n Pk = n(n − 1)(n − 2) · · · (n − k + 1) n Ck = n Pk /k! Note that n! is the product of all the whole numbers from 1 to n. with replacement without replacement nP ordered sample nk k nC unordered sample k In fact the number that goes in the empty box is n+k−1Ck . depending on whether we care about the order in which the pens are chosen. B}. We will only consider this in the case of sampling without replacement. in a set. there are n choices for the ﬁrst object. and n choices for the second. P}}. B}. . {B. P}. First. So in this case the sample space is {{R. and we are not interested in which pen was chosen ﬁrst. We should not be surprised that this is the same as in the previous case.
Example The names of the seven days of the week are placed in a hat. 5) or (6. Example A box contains 20 balls. What is the probability that the sum of the numbers is at least 10? This time we are sampling with replacement. ordered sample’. there are still n choices for the ﬁrst object. the numbers in the three boxes are 42 = 16. Note that. so A = 5 P3 = 5 · 4 · 3 = 60. So the probability of the event is 6/36 = 1/6. the possibilities for the two numbers are (4. and we can take it to be either ordered or unordered. these will be the days of the Probability I lectures. BASIC IDEAS For sampling without replacement and ordered sample. since the two numbers may be the same or different. But each unordered sample could be obtained by drawing it in k! different orders. 4 P = 12. obtaining n Pk /k! = nCk choices. P(A) = 60/210 = 2/7. This product is the formula for n Pk . the size of the sample space is 7 P3 = 7 · 6 · 5 = 210. Thus. If we decided to use unordered samples instead. So the number of elements in the sample space is 62 = 36.10 CHAPTER 1. think ﬁrst of choosing an ordered sample. Do you think that this is more likely to occur if the draws are made with or without replacement? Let S be the sample space. 6). or any other combination. In our example with the pens. 6). Example A sixsided die is rolled twice. then A occurs precisely when all three days drawn are weekdays. which is once again 2/7. we multiply. . To obtain a sum of 10 or more. and so on. If A is the event ‘no lectures at weekends’. For sampling without replacement and unordered sample. and A the event that ﬁve balls are red and ﬁve are blue. For ordered samples. and n − 2 for the third. since k − 1 have previously been removed and n − (k − 1) remain. Three names are drawn out. (5. there are n − k + 1 choices for the kth object. but now only n − 1 choices for the second (since we do not replace the ﬁrst). and 4C = 6. of which 10 are red and 10 are blue. we are assuming that all outcomes are equally likely. 6). 5). the answer would be 5C3 /7C3 . in agreement with what we got when we wrote them all 2 2 out. if we use the phrase ‘sampling without replacement. What is the probability that no lecture is scheduled at the weekend? Here the sampling is without replacement. (6. So we divide by k!. which we can do in n Pk ways. As before. We draw ten balls from the box. 4). (5. (6. and we are interested in the event that exactly 5 of the balls are red and 5 are blue. the answers will be the same.
as it should be. What is A? The number of ways in which we can choose ﬁrst ﬁve red balls and then ﬁve blue ones (that is. P(A) = Example I have 6 gold coins. 20 P 10 and so If we regard the sample as being unordered. and . and P(A) = (10 P5 )2 · 10C5 = 0.252 . Then S  = 2010 . the ﬁve red balls could appear in any ﬁve of the ten draws. then S  = 20 P10 . So 5 P(A) = A = (10 P5 )2 · 10C5 . and the same for the ten blue balls. RRRRRBBBBB). and so on. So we have A = 252 · 1010 . There are 10 P5 ways of choosing ﬁve of the ten red balls. 4 silver coins and 3 bronze coins in my pocket. (10C5 )2 = 0. then S  = 20C10 . So A = (10C5 )2 . So the probability is 72/286 = 0. . So S  = 13C3 = 286. What is the probability that they are all of different material? What is the probability that they are all of the same material? In this case the sampling is without replacement and the sample is unordered. . 20C 10 This is the same answer as in the case before. . SAMPLING 11 Consider sampling with replacement. I take out three coins at random. If we regard the sample as being ordered. The event that the three coins are all of different material can occur in 6 · 4 · 3 = 72 ways. .7. But there are many other ways to get ﬁve red and ﬁve blue balls. the question doesn’t care about order of choices! So the event is more likely if we sample with replacement. . . 252 · 1010 = 0. 2010 Now consider sampling without replacement. is 105 · 105 = 1010 . In fact. since we must have one of the six gold coins. We no longer have to count patterns since we don’t care about the order of the selection. This means that there are 10C5 = 252 different patterns of ﬁve Rs and ﬁve Bs.1.343 . . and as in the previous case there are 10C patterns of red and blue balls.246 .343 . There are 10C5 choices of the ﬁve red balls and the same for the blue balls. .
) Let us further assume that. A stopping rule is a rule of the type described here. . f f p}) = 0. if you pass the test. whichever is convenient. Then we have P(p) P( f p) P( f f p) P( f f f ) = = = = 0. .087 .992. so the probability is 1 − 0.008. you eventually get the certiﬁcate unless you fail three times. BASIC IDEAS The event that the three coins are of the same material can occur in 6 C3 + 4C3 + 3C3 = 20 + 4 + 1 = 25 ways. If all outcomes were equally likely. If the sample is with replacement. continue the experiment until some speciﬁed occurrence happens.g. or if it involves throwing a die or coin several times. 0. (By Proposition 3. you may be very likely to pass on the ﬁrst attempt.008 = 0.16 + 0. and the probability is 25/286 = 0.032. If not.032 = 0.8 + 0. namely. then use the formula for sampling with replacement. f f f }. does the question say anything about the ﬁrst object drawn?).12 CHAPTER 1.16. Let us assume that the probability that you pass the test is 0. your chance of failing is 0. they should give the same answer. f p. decide whether the sample is ordered (e. 0. . 0. f p. then your chance of eventually passing the test and getting the certiﬁcate would be 3/4. In a sampling problem. you don’t need to take it again. where for example f f p denotes the outcome that you fail twice and pass on your third attempt.2 · 0. you should ﬁrst read the question carefully and decide whether the sampling is with or without replacement. So the sample space is S = {p.8. Alternatively. If so. f f p.2. your chance of passing at the next attempt is still 0. For example. But it is unreasonable here to assume that all the outcomes are equally likely.23 = 0. Thus the probability that you eventually get the certiﬁcate is P({p. The experiment may potentially be inﬁnite. then you can use either ordered or unordered samples. then use the formula for ordered samples. 1.8 = 0. Of course.8 = 0.8. no matter how many times you have failed.22 · 0.8 Stopping rules Suppose that you take a typing proﬁciency test. You are allowed to take the test up to three times.992. If it is without replacement.8.
) In the typing test. QUESTIONNAIRE RESULTS 13 For example. In the meantime you might like to consider whether it is a reasonable assumption for tossing a coin. We will not deal with these. . or we might toss a coin until we have obtained heads twice. we will have more to say about the ‘multiplication rule’ we used for calculating the probabilities. or if I don’t replace the balls in the hat after drawing? More likely with replacement 2 2.9 Questionnaire results The students in the Probability I class in Autumn 2000 ﬁlled in the following questionnaire: 1. (We have to allow all possible outcomes. For example. . the results were as follows: Answer to question Eyes Mobile phone No mobile phone “More likely with replacement” Brown Other 35 4 10 3 “More likely without replacement” Brown Other 35 9 7 1 . and so on.} since in principle you may get arbitrarily large numbers of tails before the ﬁrst head. 10 red and 10 blue. T T T H. This ensures that the sample space is ﬁnite. the sample space is S = {H. What colour are your eyes? Blue 2 Brown 2 Green 2 Yes 2 Other 2 More likely without replacement 2 3. .1. Other kinds of stopping rules are possible. In the next chapter. if you toss a coin repeatedly until you obtain heads. the number of coin tosses might be determined by some other random process such as the roll of a die. 1. Do you think that this is more likely if I note the colour of each ball I draw and replace it in the hat. the rule is ‘stop if either you pass or you have taken the test three times’. I have a hat containing 20 balls. I am interested in the event that I draw exactly ﬁve red and ﬁve blue balls. Do you own a mobile phone? No 2 After discarding incomplete questionnaires. T H. or for someone taking a series of tests. I draw 10 balls from the hat. T T H.9.
whereas in fact 45 did. actually it is more likely if we sample without replacement.14 CHAPTER 1. of the 83 people with mobile phones. we would expect that the same fraction 83/104 of the 87 browneyed people would have phones. since 83 out of 104 people have mobile phones. whereas people with brown eyes are slightly less smart! In fact we have shown no such thing. in a sense we will come to soon. Let A be the event that this student thought that the event was more likely if we sample with replacement. BASIC IDEAS What can we conclude? Half the class thought that. So perhaps we have demonstrated that people who own mobile phones are slightly smarter than average. 43 or 44) would answer “with replacement”. in the experiment with the coloured balls. Also. if we think that phone ownership and eye colour are independent. then of the 87 people with brown eyes. do not ever say that P(A ∩ B) = P(A) · P(B) unless you have some good reason for assuming that A and B are independent (either because this is given in the question. 1.4 people. In fact. If true. In fact the number is 70. and under no circum/ stances say that A and B are independent if A ∩ B = 0 (this is the statement that A and B are disjoint. or as in the nextbutone paragraph). On the other hand. i. But they do show that these events are not independent. which is quite a different thing!) Also.e. If you are asked in an exam to deﬁne independence of events. we would expect half (that is. since the students were instructed not to think too hard about it!) You might expect that eye colour and mobile phone ownership would have no inﬂuence on your answer. since our results refer only to the people who ﬁlled out the questionnaire. this is the correct answer. whereas in fact 39 of them did. sampling with replacement make the result more likely.10 Independence P(A ∩ B) = P(A) · P(B). So indeed it seems that eye colour and phone ownership are moreorless independent. 41 or 42) would answer “with replacement”. or as near as we can expect. Let’s test this. B the event that the student has brown eyes. (This doesn’t matter. half of them (i. and C the event that the student has a . (83 · 87)/104 = 69.e. Two events A and B are said to be independent if This is the deﬁnition of independence of events. Let us return to the questionnaire example. Suppose that a student is chosen at random from those who ﬁlled out the questionnaire. as we saw in Chapter 1. Do not say that two events are independent if one has no inﬂuence on the other.
and the separate probabilities of getting 4 on the ﬁrst throw and of getting 5 on the second are both equal to 1/6. If the pens are called R1 . R2 G. INDEPENDENCE mobile phone. R1 B. So the two events are independent. P(A ∩ B) = 45/104 = 0. and B the event that I choose exactly one green pen. but still it could happen that P(A ∩ B) = P(A) · P(B). Example If we toss a coin more than once.4327. but in a sense B and C ‘come closer’ than either of the others. I choose two pens without replacement. R2 B. then A and B are independent. since we assume that all 36 combinations of the two throws are equally likely.5. 15 So none of the three pairs is independent. so that A and B are independent. P(A) · P(C) = 0. or different throws of a die. P(B) ∩ P(C) = 0. Then P(A) = 52/104 = 0.3990. if it is the case that the event A has no effect on the outcome of event B. one green pen.7981. GB}. Example I have two red pens. and one blue pen. But this does not apply in the other direction. A = {R1 G. P(C) = 83/104 = 0. P(B ∩C) = 70/104 = 0.1. There might be a very deﬁnite connection between A and B. Furthermore. G. Let A be the event that I choose exactly one red pen. We will see an example shortly. if we roll a fair sixsided die twice. R2 G. then S = {R1 R2 . then the probability of getting 4 on the ﬁrst throw and 5 on the second is 1/36. then you may assume that different tosses or rolls are independent. This holds even if the examples are not all equally likely. In general.10. P(B) = 87/104 = 0. We will see an example later. But (1/36) = (1/6) · (1/6).6676. or roll a die more than once. GB} . R2 B}. are independent.8365. it is always OK to assume that the outcomes of different tosses of a coin. In practice. This would work just as well for any other combination. P(A) · P(B) = 0. B. as we noted. More precisely. R2 G. P(A ∩C) = 39/104 = 0. R1 G.4183. B = {R1 G. R2 . R1 B.375.6731.
P(A) = 1/2. let C be the event ‘heads on the last toss’. • B = {HHH. In practice. you will ﬁnd that P(A) = 6/10 = 3/5. HT H. You will need to know the conclusions. so A and B are independent. and I do the same experiment. with replacement. P(B) = 1/2.C so that A and B are independent. B and C are independent. though the arguments we use to reach them are not so important. P(A ∩ B) = 1/4. We saw in the cointossing example above that it is possible to have three events A. assume that events are independent only if either you are told to assume it. Let A be the event ‘there are more heads than tails’. . so adding one more pen has made the events nonindependent! We see that it is very difﬁcult to tell whether events are independent or not. But before you say ‘that’s obvious’. from a set of objects.11 Mutual independence This section is a bit technical. T HH}. both A and B clearly involve the results of the ﬁrst two tosses and it is not possible to make a convincing argument that one of these events has no inﬂuence or effect on the other. and B the event ‘the results of the ﬁrst two tosses are the same’. but A and C are not independent. if you write down the sample space and the two events and do the calculations. T HH}. HHT }. P(B) = 4/10 = 2/5. • A ∩ B = {HHH. Each of the eight possible outcomes has probability 1/8. T T T }. so A and C are not independent. Are B and C independent? 1.) Example Consider the experiment where we toss a fair coin three times and note the results. BASIC IDEAS We have P(A) = 4/6 = 2/3. as we saw in Part 1. This time. HT H. HT H. Then • A = {HHH. HHT. HHT. T T H}. • C = {HHH.16 CHAPTER 1. so A and B are independent. or the events are the outcomes of different throws of a coin or die. P(A∩B) = 2/6 = 1/3 = P(A)P(B). suppose that I have also a purple pen. • A ∩C = {HHH. P(A ∩C) = 3/8. Then. P(B) = 3/6 = 1/2. (There is one other case where you can assume independence: this is the result of different draws. T HH. However. For example. T T H. B. P(A ∩ B) = 2/10 = 1/5 = P(A)P(B). P(C) = 1/2.
An be mutually independent. A ∩C. . and the same would apply for any other sequence. i2 . We say that these events are mutually independent if. .12 Properties of independence Proposition 1. B. with probability 1/4. B ∩C. can we then conclude that P(A ∩ B ∩C) = P(A) · P(B) · P(C)? At ﬁrst sight this seems very reasonable. in Axiom 3.11 Let A1 . . any one of the events is independent of the intersection of any number of the other events in the set. and that the events A ∩ B. 1. . ik with k ≥ 1. but P(A ∩ B ∩C) = 1/4. The correct deﬁnition and proposition run as follows. Let A1 . .1. if we toss a fair coin six times. So A. In other words.12 If A and B are independent. P(A) · P(B) · P(C) = 1/8. for example. B the event ‘ﬁrst and third tosses have the same result. and C the event ‘second and third tosses have same result’. . all 64 possible outcomes are equally likely. Proposition 1. PROPERTIES OF INDEPENDENCE 17 If all three pairs of events happen to be independent.C are not mutually independent. So. the events Ai1 ∩ Ai2 ∩ · · · ∩ Aik−1 and Aik are independent. . Thus any pair of the three events are independent. Now all you really need to know is that the same ‘physical’ arguments that justify that two events (such as two tosses of a coin. . then A and B are independent. Then P(A1 ∩ A2 ∩ · · · ∩ An ) = P(A1 ) · P(A2 ) · · · P(An ). . . An be events.12. Unfortunately it is not true. the probability of getting the sequence HHT HHT is (1/2)6 = 1/64. let A be the event ‘ﬁrst and second tosses have same result’. and A ∩ B ∩C are all equal to {HHH. Example In the cointossing example. . . . . we only required all pairs of events to be exclusive in order to justify our conclusion. . In other words. T T T }. or two throws of a die) are independent. given any distinct indices i1 . also justify that any number of such events are mutually independent. You should check that P(A) = P(B) = P(C) = 1/2.
2) · (0. From Corollary 4. if events A1 .8. which is what we were required to prove. In other words. and their union is A (since every event in A is either in B or in B ). A and A are not usually independent! Results like the following are also true.8. and asked to prove that P(A ∩ B ) = P(A) · P(B ).14 Let events A. . and we replace some of them by their complements. You are allowed up to three attempts to pass the test. P(p) = 0. ﬁrst to A and B (to show that A and B are independent). Then. Proposition 1.13 If A and B are independent. by Proposition 1. .8). then the resulting events are mutually independent. Suppose that your chance of passing the test is 0. We have to be a bit careful though. BASIC IDEAS We are given that P(A ∩ B) = P(A) · P(B).2)2 · (0. P( f f f ) = (0. the probability of any sequence of passes and fails is the product of the probabilities of the terms in the sequence. P( f p) = (0. Also. . we have that P(A) = P(A ∩ B) + P(A ∩ B ). mutual independence is the condition we need to justify the argument we used in that example. Suppose also that the events of passing the test on any number of different occasions are mutually independent. An are mutually independent. as we claimed in the earlier example. B. Thus. C be mutually independent. and then to B and A (to show that B and A are independent). . Then A and B ∩C are independent. so are A and B . For example. P( f f p) = (0. P(A ∩ B ) = P(A) − P(A ∩ B) = P(A) − P(A) · P(B) (since A and B are independent) = P(A)(1 − P(B)) = P(A) · P(B ).8). Corollary 1.2)3 . and A and B ∪C are independent. That is. Example Consider the example of the typing proﬁciency test that we looked at earlier. Apply the Proposition twice. . More generally.11. so by Axiom 3.18 CHAPTER 1. we know that P(B ) = 1 − P(B). the events A ∩ B and A ∩ B are disjoint (since no outcome can be both in B and B ).
the event that the apparatus does not work is (A ∪B )∩C . the probability that component B works is 0. We have P((A ∩C ) ∪ (B ∩C ) = P(A ∩C ) + P(B ∩C ) − P(A ∩ B ∩C ) (by Inclusion–Exclusion) = P(A ) · P(C ) + P(B ) · P(C ) − P(A ) · P(B ) · P(C ) (by mutual independence of A .9) + (0.8.C ) = (0. respectively. Thus the event we are interested in is (A ∩ B) ∪C.93.025) = 0. since the events A ∩C and B ∩C are not independent! .25) + (0.8) · (0. and ‘component C works’. 19 A B C At risk of some confusion. so the apparatus works with probability 1 − 0. You might be tempted to say P(A ∩C ) = (0. and conclude that P((A ∩C ) ∪ (B ∩C )) = 0.1) · (0. But this is not correct.1.8) · (0.25) − (0.05 + 0.75) = 0. Thus. this is equal to (A ∩C ) ∪ (B ∩C ).07375 by the Principle of Inclusion and Exclusion.75. that is.025 − (0.25) = 0. PROPERTIES OF INDEPENDENCE Example The electrical apparatus in the diagram works so long as current can ﬂow from left to right. and the probability that component C works is 0.05) · (0.07 = 0. Now the apparatus will work if either A and B are working.05. if C is not working and one of A and B is also not working.25) = 0.2) · (0.93.9) · (0.12.9.025. By the Distributive Law. and P(B ∩C ) = (0.25) = 0. we use the letters A. or C is working (or possibly both). Find the probability that the apparatus works. Now P((A ∩ B) ∪C)) = P(A ∩ B) + P(C) − P(A ∩ B ∩C) (by Inclusion–Exclusion) = P(A) · P(B) + P(C) − P(A) · P(B) · P(C) (by mutual independence) = (0.75) − (0. There is a trap here which you should take care to avoid.1) · (0.2) · (0. The problem can also be analysed in a different way. B and C for the events ‘component A works’.2) · (0.07.1) · (0. ‘component B works’. B . The probability that component A works is 0. The apparatus will not work if both paths are blocked. The three components are independent.
The probability that you buy a red toothbrush is three times the probability that you buy a green one. The probability of tails on any toss is 1 − 0. or no heads? For three heads. Is it more likely that they will have differently coloured toothbrushes from each other for .064. On the ﬁrst day of term they both go to the shop to buy a toothbrush.432 + 0.13 Question Worked examples (a) You go to the shop to buy a toothbrush. This is the only time that they change their toothbrushes. 1. So the probability of two heads is 3 · (0. independently of what they had bought before. Find the probablity that James and Simon have differently coloured toothbrushes from each other for all three terms.6) · (0. The toothbrushes there are red.4. purple. we have 0. with probabilities as in (b). the probability that you buy a blue one is twice the probability that you buy a green one. or T HH. and their choices are independent. and white are all equal. the probability is (0.064 = 1.216. ﬁnd the probability that you buy a toothbrush of that colour. so it would be confusing if their toothbrushes were the same colour. What are the probabilities of getting three heads.288 + 0. the probability of buying various colours of toothbrush is as calculated in (a).6 = 0. purple and white. Similarly the probability of one head is 3 · (0.20 CHAPTER 1.144) = 0. blue. For each colour. (c) James and Simon live together for three terms. Each outcome has probability (0. the probabilities of buying green. since successive tosses are mutually independent. Suppose that I have a coin which has probability 0.6) · (0.4) = 0. BASIC IDEAS Example We can always assume that successive tosses of a coin are mutually independent.432.4)3 = 0.144. as HHT .288.6 of coming down heads. (b) James and Simon share a ﬂat. For each of James and Simon.4)2 = 0. As a check. Now the event ‘two heads’ can occur in three possible ways. On the ﬁrst day of each term they buy new toothbrushes. Find the probability that they buy toothbrushes of the same colour.6) · (0. You are certain to buy exactly one toothbrush. and the probability of no heads is (0. one head. I toss the coin three times.216 + 0. even if it is not a fair coin. HT H.6)3 = 0. green. two heads.
giving the answer correct to 3 decimal places. P. . So it is more likely that they will have the same colour in at least one term.1. and that all outcomes of the selection are equally likely. BB. P(B) = 2x. purple and white toothbrush respectively. 1/4. He counts how many of these elephants are tagged. and these events are independent. P(RR) = (3/8) · (3/8) = 9/64. WW . so its probability is P(RR) + P(BB) + P(GG) + P(PP) + P(WW ) 1 1 1 1 1 9 + + + + = . so x = 1/8. Question There are 24 elephants in a game reserve. Find the probability that exactly two of the selected elephants are tagged. Thus. PP. so has probability 1 − (27/64) = (37)/(64). Let x = P(G). 4 4 4 64 The event ‘same coloured toothbrushes in at least one term’ is the complement of the above. 1/8. G. he randomly selects ﬁve elephants from the reserve. WORKED EXAMPLES 21 all three terms or that they will sometimes have toothbrushes of the same colour? Solution (a) Let R. P(P) = P(W ) = x. Assume that no elephants leave or enter the reserve. = 64 16 64 64 64 4 (c) The event ‘different coloured toothbrushes in the ith term’ has probability 3/4 (from part (b)). for example. The warden tags six of the elephants with small radio transmitters and returns them to the reserve. 1/8. Corollary 2 gives 3x + 2x + x + x + x = 1. By independence (given). etc. blue.13. the probabilities are 3/8. We are given that P(R) = 3x. we have. B. or die or give birth. Since these outcomes comprise the whole sample space. (b) Let RB denote the event ‘James buys a red toothbrush and Simon buys a blue toothbrush’. between the tagging and the selection. The event that the toothbrushes have the same colour consists of the ﬁve outcomes RR. green. GG. So the event ‘different coloured toothbrushes in all three terms’ has probability 3 3 3 27 · · = .W be the events that you buy a red. 1/8 respectively. The next month.
Find P(E) and P(F) where E = “there are at least two girls”. GBGG. (b) E = {BGGB. GBGB.p. and similarly for the other outcomes.288 = 0. (b) Assume that. BGB. BGGG. BASIC IDEAS Solution The experiment consists of picking the ﬁve elephants. GGBG. F = {BGGG. since it is necessary to multiply by the 5C2 possible patterns of tagged and untagged elephants in a sample of ﬁve with two tagged. GBGB. BGGB. P(F) = 5/16. GBGG. They decide to stop having children either when they have two boys or when they have four children. (a) Write down the sample space. Then S  = 24C5 . GGGG}. GGGG}. (a) S = {BB. P(BGB) = 1/8. Thus P(A) = to 3 d. an unordered sample is preferable. GGBB. each time that they have a child. If you want to use an ordered sample. Solution .22 CHAPTER 1. GGBG. GGGG}. GGBB. F = “there are more girls than boys”. This involves choosing two of the six tagged elephants and three of the eighteen untagged ones. Now we have P(BB) = 1/4. Let S be the sample space. so A = 6C2 · 18C3 . Suppose that they are successful in their plan. GGGB. Let A be the event that two of the selected elephants are tagged. GGGB. GBB. Question A couple are planning to have a family. GGBG. P(BGGB) = 1/16. So P(E) = 8/16 = 1/2. BGGG. not the original choice of six elephants for tagging. independent of all other times.288. Note: Should the sample should be ordered or unordered? Since the answer doesn’t depend on the order in which the elephants are caught. the calculation is P(A) = 6 P · 18 P · 5C 2 3 2 24 P 5 6C · 18C 2 3 24C 5 = 0. GGGB. the probability that it is a boy is 1/2. GBGG.
T T T }. 2. They toss the coin once and the result is heads. HHT. T HH. The sample space is S = {HHH. HT T }. T T H. then E now becomes the sample space of the experiment. HHT. HT H}. T T T }. HT H. otherwise Bob pays. HT H. HT T.Chapter 2 Conditional probability In this chapter we develop the technique of conditional probability to deal with cases where events are not independent. B = {HT T. T T H. and if we are given the information that the result of the ﬁrst toss is heads.1 What is conditional probability? Alice and Bob are going out to dinner. 23 . HHT. call this event E. since the outcomes not in E are no longer possible. T HH}. and the events ‘Alice pays’ and ‘Bob pays’ are respectively A = {HHH. HT H. T HT. In the new experiment. They toss a fair coin ‘best of three’ to decide who pays: if there are more heads than tails in the three tosses then Alice pays. B ∩ E = {HT T }. How should we now reassess their chances? We have E = {HHH. HHT. the outcomes ‘Alice pays’ and ‘Bob pays’ are A ∩ E = {HHH. Clearly each has a 50% chance of paying. T HT.
Let E be an event with nonzero probability. although this is the best way to understand it. it is not enough to say “the probability of A given that E has occurred”. P(E) Again I emphasise that this is the deﬁnition. In general. P(B  E) = P(E) 1/2 4 P(A  E) = It may seem like a small matter. and let A be any event. The probability that the car is yellow is 3/100: the probability that the driver is blonde is 1/5. not P(A/E) or P(A \ E). and this would make no sense if P(E) = 0. The conditional probability of A given E is deﬁned as P(A  E) = P(A ∩ E) . If you are asked for the deﬁnition of conditional probability.24 CHAPTER 2. Find the conditional probability that the driver is blonde given that the car is yellow. Thus. To check the formula in our example: P(A ∩ E) 3/8 3 = = . for example. CONDITIONAL PROBABILITY Thus the new probabilities that Alice and Bob pay for dinner are 3/4 and 1/4 respectively. since the outcomes may not be equally likely. The correct deﬁnition is as follows. This is P(A  E). and the probability that the car is yellow and the driver is blonde is 1/50. P(A  B) = if P(B) = 0. we can no longer count. P(E) 1/2 4 P(B ∩ E) 1/8 1 = = . suppose that we are given that an event E has occurred. Note also that the deﬁnition only applies in the case where P(E) is not equal to zero. Example A random car is chosen among all those passing through Trafalgar Square on a certain day. since we have to divide by it. P(A ∩ B) P(B) . There is no reason why event E should occur before event A! Note the vertical bar in the notation. In general. and we want to compute the probability that another event A occurs. but you should be familiar enough with this formula that you can write it down without stopping to think about the names of the events.
P(B) = 0. P(B) P(B) that is.p. P(B) using the deﬁnition of conditional probability. which is just what the statement ‘A and B are independent’ means. GENETICS 25 Solution: If Y is the event ‘the car is yellow’ and B the event ‘the driver is blonde’. There is a connection between conditional probability and independence: Proposition 2. Now suppose that P(A  B) = P(A). Note that we haven’t used all the information given. assuming only two colours of eyes. then we are given that P(Y ) = 0. In other words.02 = = 0.03 to 3 d. A child receives one gene from each of its parents. Now clearing fractions gives P(A ∩ B) = P(A) · P(B). So ﬁrst suppose that A and B are independent. The gene it receives from its father is one of its father’s two genes. Each gene is either B or b.667 P(Y ) 0. Then A and B are independent if and only if P(A  B) = P(A). Remember that this means that P(A ∩ B) = P(A) · P(B).2. and similarly for its mother. then A and B are independent. you have blue eyes. So P(B  Y ) = P(B ∩Y ) 0. as we had to prove. and P(Y ∩ B) = 0. then P(A  B) = P(A). Proof The words ‘if and only if’ tell us that we have two jobs to do: we have to show that if A and B are independent. Each person has two genes for eye colour. Then P(A  B) = P(A ∩ B) P(A) · P(B) = = P(A). you have brown eyes. The genes received from father and mother are independent.03. If your genes are BB or Bb or bB.2. each with probability 1/2.2. P(A  B) = P(A). if your genes are bb. . 2. and that if P(A  B) = P(A). This proposition is most likely what people have in mind when they say ‘A and B are independent means that B has no effect on A’. P(A ∩ B) = P(A).02.1 Let A and B be events with P(B) = 0.2 Genetics Here is a simpliﬁed version of how genes code eye colour.
The events A1 . Bb. He estimates that.26 CHAPTER 2. the probability of cloud is 45%. that is. but we know what its probability would be if we were sure that some other event had occurred. First we need a deﬁnition. . his chance is only 20%.3 The Theorem of Total Probability Sometimes we are faced with a situation where we do not know the probability of an event B. What is the probability that John’s genes are BB? Solution John’s sister has genes bb.) What is the overall probability that the salesman will sell all his stock? This problem is answered by the Theorem of Total Probability. . so one b must have come from each parent. .) Let X be the event ‘John has BB genes’ and Y the event ‘John has brown eyes’. if it is cloudy. (We assume that these are all the possible outcomes. (For example. . The question asks us to calculate P(X  Y ). bb each with probability 1/4. An form a partition of the sample space if the following two conditions hold: / (a) the events are pairwise disjoint. his chance is 60%. which we now state. Similarly for the other combinations. According to the weather forecast. he has a 90% chance of selling all his stock. Thus each of John’s parents is Bb or bB. So the possibilities for John are (writing the gene from his father ﬁrst) BB. and the probability of rain is 25%. P(Y ) 3/4 2. the probability of sunshine is 30%. John gets his father’s B gene with probability 1/2 and his mother’s B gene with probability 1/2. . if the weather is sunny. A2 . His sister has blue eyes. So do both of John’s parents. bB. and if it rains. (b) A1 ∪ A2 ∪ · · · ∪ An = S . Bb. so that their probabilities must add up to 100%. and these are independent. Example An icecream seller has to decide whether to order more stock for the Bank Holiday weekend. bB}. so the probability that he gets BB is 1/4. This is given by P(X  Y ) = P(X ∩Y ) 1/4 = = 1/3. Ai ∩ A j = 0 for any pair of events Ai and A j . we may assume Bb. CONDITIONAL PROBABILITY Example Suppose that John has brown eyes. Then X = {BB} and Y = {BB.
A2 the event ‘it is cloudy’. Moreover.3. B A1 A2 . . . A2 and A3 form a partition of the sample space. A1 A2 . An form a partition of the sample space with P(Ai ) = 0 for all i. Then A1 . A2 . A2 . . P(A2 ) = 0. An . and by assumption there are no such outcomes. P(B  Ai ) = P(B ∩ Ai )/P(Ai ). An Consider the icecream salesman at the start of this section. Now consider the events B ∩ A1 . P(A3 ) = 0. and A3 the event ‘it is rainy’. .45. An Now we state and prove the Theorem of Total Probability. since every outcome lies in one of the Ai . . Theorem 2. we ﬁnd that P(B ∩ Ai ) = P(B  Ai ) · P(Ai ). . i=1 n Proof By deﬁnition. Multiplying up. . So.3. by Axiom 3. we conclude that i=1 ∑ P(B ∩ Ai) = P(B). THE THEOREM OF TOTAL PROBABILITY 27 Another way of saying the same thing is that every outcome in the sample space lies in exactly one of the events A1 . the union of all these events is B. . These events are pairwise disjoint. and we are given that P(A1 ) = 0. . B ∩ An . . Then P(B) = ∑ P(B  Ai ) · P(Ai ). B ∩ A2 . The picture shows the idea of a partition.2 Let A1 . . and let B be any event.2. Let A1 be the event ‘it is sunny’. . n Substituting our expression for P(B ∩ Ai ) gives the result.25. for any outcome lying in both B ∩ Ai and B ∩ A j would lie in both Ai and A j . . . . . .
since it avoids any accidental assumptions of independence. one green pen. (a) What is the probability that the ﬁrst pen chosen is red? (b) What is the probability that the second pen chosen is red? For the ﬁrst pen. then only one of the three remaining pens is red. 1.2. 2. You will now realise that the Theorem of Total Probability is really being used when you calculate probabilities by tree diagrams. if the ﬁrst pen is green or blue. the events A and A form a partition of S .6. One special case of the Theorem of Total Probability is very commonly used.9 × 0. Example I have two red pens. If the ﬁrst pen is red.59. P(B  A3 ) = 0.9. On the other hand. It is better to get into the habit of using it directly. and suppose that P(A) = 0. Let B be the event ‘second pen red’. Then P(B) = P(B  A) · P(A) + P(B  A ) · P(A ). so that P(B  A1 ) = 1/3. The other information we are given is that P(B  A1 ) = 0. To say that both A and A have nonzero probability is just to say that P(A) = 0.3 Let A and B be events. we must separate cases. so the chance of selecting a red pen is 2/4 = 1/2. CONDITIONAL PROBABILITY Let B be the event ‘the salesman sells all his stock’. and one blue pen. Thus we have the following corollary: Corollary 2.3) + (0.4 Sampling revisited We can use the notion of conditional probability to treat sampling problems involving ordered samples. there are four pens of which two are red. For any event A. Let A1 be the event ‘ﬁrst pen red’. For the second pen. A2 the event ‘ﬁrst pen green’ and A3 the event ‘ﬁrst pen blue’. then two of the remaining pens are red.2 × 0. I select two pens without replacement. so P(B  A2 ) = P(B  A3 ) = 2/3. P(B  A2 ) = 0. . Then P(A1 ) = 1/2. By the Theorem of Total Probability. and is worth stating in its own right.28 CHAPTER 2.25) = 0. P(B) = (0. 1.45) + (0. P(A2 ) = P(A3 ) = 1/4 (arguing as above).6 × 0.
then the second pen is equally likely to be any one of the four. P(B) . P(A  B) and P(A  B ). Beware of obviouslooking arguments in probability! Many clever people have been caught out. But. These conditional probabilities are related by Bayes’ Theorem: Theorem 2. We have P(A  B) · P(B) = P(A ∩ B) = P(B  A) · P(A). that is. and the probability should be 1/2.5 Bayes’ Theorem There is a very big difference between P(A  B) and P(B  A). a patient who has taken the test has different concerns. and B the event ‘the test result is positive’. P(B) = P(B  A1 )P(A1 ) + P(B  A2 )P(A2 ) + P(B  A3 )P(A3 ) = (1/3) × (1/2) + (2/3) × (1/4) + (2/3) × (1/4) = 1/2. let A be the event ‘the patient is a carrier’. This argument happens to be correct. how sure can I be that I am not a carrier? In other words. using the deﬁnition of conditional probability twice. If I tested positive. Of course. and some noncarriers who test positive. BAYES’ THEOREM By the Theorem of Total Probability. there will be some carriers of the defective gene who test negative. However. you may be safer to stick to the calculation that we did. no test is perfect. with P(B  A ) and P(B  A).) Now divide this equation by P(B) to get the result. what is the chance that I have the disease? If I tested negative. just as for the ﬁrst pen.4 Let A and B be events with nonzero probability. (Note that we need both A and B to have nonzero probability here. So. until your ability to distinguish between correct arguments and plausiblelooking false ones is very well developed. 2. Suppose that a new test is developed to identify people who are liable to suffer from some genetic disease in later life. If we have no information about the ﬁrst pen. The scientists who develop the test are concerned with the probabilities that the test result is wrong. Then P(A  B) = The proof is not hard. for example.5. 29 We have reached by a roundabout argument a conclusion which you might think to be obvious. P(B  A) · P(A) .2.
p. and we calculated that P(B) = 0.59 (b) A patient has just had a negative test result.46 P(B) 0. Suppose that 1 in 1000 of the population is a carrier of the disease.3. What is the probability that the patient is a carrier? The answer is P(A  B) = P(B  A)P(A) P(B  A)P(A) + P(B  A )P(A ) 0.3. What is the probability that the patient is a carrier? The answer is P(A  B ) = P(B  A)P(A) P(B  A)P(A) + P(B  A )P(A ) . P(A1  B) = to 2 d.59.9 and that P(A1 ) = 0.05094 P(B  A1 )P(A1 ) 0.99 × 0.05.05 × 0. and B the event ‘the test result is positive’. while the probability that a noncarrier tests positive is 5%.001 = (0.99 × 0.001) + (0. Given that he sold all his stock of icecream.9 × 0. and that P(B  A) = 0.001 (so that P(A ) = 0. We are asked for P(A1  B).30 CHAPTER 2. We are given that P(A) = 0. P(B  A ) = 0.00099 = = 0.) Let A be the event ‘the patient is a carrier’. (a) A patient has just had a positive test result.3 = = 0. So by Bayes’ Theorem. (A test achieving these values would be regarded as very successful. P(B  A) · P(A) + P(B  A ) · P(A ) Bayes’ Theorem is often stated in this form. CONDITIONAL PROBABILITY If P(A) = 0.99. then we can use Corollary 17 to write this as P(A  B) = P(B  A) · P(A) .999) 0. A1 is the event ‘it is sunny’ and B the event ‘the salesman sells all his stock’. 1 and P(B) = 0.999). 0. Example Consider the clinical test described at the start of this section.) Using the same notation that we used before.0194. Suppose also that the probability that a carrier tests negative is 1%. Example Consider the icecream salesman from Section 2. We were given that P(B  A1 ) = 0. what is the probability that the weather was sunny? (This question might be asked by the warehouse manager who doesn’t know what the weather was actually like.
the calculations would be quite different. P(B  A1 )P(A1 ) 0. given that both A and B have occurred. and 88% don’t have it at all.95 × 0.02 P(A2 ) = 0.9 × 0. A new blood test is developed.01 × 0. B) = P(C ∩ A ∩ B) .p. Then we are given that A1 . What is the probability that I have the serious form of the disease? Let A1 be ‘has disease in serious form’. by the Theorem of Total Probability. A2 . but a patient with a positive test result still has less than 2% chance of being a carrier.999) 0. Let B be ‘test positive’. Sometimes instead we just write P(C  A. I have just tested positive. and A3 be ‘doesn’t have disease’.2. the probability of testing positive is 9/10 if the subject has the serious form. ITERATED CONDITIONAL PROBABILITY 0.01 × 0.9 P(B  A2 ) = 0.001 (0. so is likely to worry unnecessarily.001) + (0. If the patient has a family history of the disease.94095 = 31 So a patient with a negative test result can be reassured.6.6 Iterated conditional probability The conditional probability of C. P(A1  B) = to 3 d.1 × 0.00001.9 × 0. is just P(C  A ∩ B).1 Thus. P(B) = 0.6 × 0. 0. 6/10 if the subject has the mild form.1 P(A3 ) = 0.88 = 0.88 P(B  A1 ) = 0. It is given by P(C  A.6 P(B  A3 ) = 0. 10% have it in a mild form. A2 be ‘has disease in mild form’.108 P(B) 0.1 + 0. and then by Bayes’ Theorem.02 + 0. Of course.02 = = 0. P(A ∩ B) .166 2.00001 = = 0. these calculations assume that the patient has been selected at random from the population.166. A3 form a partition and P(A1 ) = 0. and 1/10 if the subject doesn’t have the disease. B). Example 2% of the population have a certain blood disease in a serious form.
since we have to consider whether p1 and p2 have the same birthday or not. . p2 . B)P(B  A)P(A). and assume that the other 365 days are all equally likely as birthdays of a random person. Now we also have P(A ∩ B) = P(B  A)P(A). . pn . . . This generalises to any number of events: Proposition 2. The birthday paradox is the following statement: If there are 23 or more people in a room. Let A3 be the event ‘p3 has a different birthday from p1 and p2 ’. . . and A3 will occur only if p3 ’s birthday is on neither of these days. .32 so CHAPTER 2. . 365 . we ignore 29 February. If Ai denotes the event ‘pi ’s birthday is not on the same day as any of p1 . An be events. . It is not straightforward to evaluate P(A3 ). . So 1 2 P(A2 ∩ A3 ) = P(A2 )P(A3  A2 ) = (1 − 365 )(1 − 365 ). . . then the chances are better than even that two of them have the same birthday. we have P(A ∩ B ∩C) = P(C  A. Suppose that P(A1 ∩ · · · ∩ An−1 ) = 0. so ﬁnally (assuming that P(A ∩ B) = 0). Ai−1 ) = 1 − i−1 . Then P(A1 ∩ A2 ∩ · · · ∩ An ) = P(An  A1 .5 Let A1 . since if A2 occurs then p1 and p2 have birthdays on different days. . . B)P(A ∩ B). But we can calculate that P(A3  2 A2 ) = 1 − 365 . What is A2 ∩ A3 ? It is simply the event that all three people have birthdays on different days. . CONDITIONAL PROBABILITY P(A ∩ B ∩C) = P(C  A. . . (This is not quite true but not inaccurate enough to have much effect on the conclusion. Then P(A2 ) = 1 − 365 . We apply this to the birthday paradox. . Now this process extends. (See below). there is a 1 in 365 chance that p2 will have the same birthday. . . pi−1 ’. Let A2 be the event ‘p2 has a different birthday 1 from p1 ’.) Suppose that we have n people p1 . then P(Ai  A1 . To simplify the analysis. An−1 ) · · · P(A2  A1 )P(A1 ). since whatever p1 ’s birthday is.
. by the Theorem of Total Probability. · · · .) 2 If p1 and p2 have different birthdays.5243. that is. that is. 1 then the probability is 1 − 365 .2. it is the probability that all of the people p1 .5. so 23 people are enough for the probability of coincidence to be greater than 1/2. .p. . .).6. qn ≤ 0. ITERATED CONDITIONAL PROBABILITY and so by Proposition 2. On the other hand.p.9945 to 4 d. since at each step we multiply by a factor less than 1. These two numbers are P(A3  A2 ) and P(A3  A2 ) respectively. So. pi−1 .4927 (to 4 d. Now return to a question we left open before. The numbers qi decrease. 12 so 2 1 P(B1 ∩ · · · ∩ Bi ) = (1 − 12 )(1 − 12 )(1 − i−1 ). . then as before we ﬁnd that P(Bi  B1 . the probability of at least one coincidence is greater than 1/2. . q23 = 0. if p1 and p2 have the same birthday.5729 . 365 33 Call this number qi . Problem How many people would you need to pick at random to ensure that the chance of two of them being born in the same month are better than even? Assuming all months equally likely. So there will be some value of n such that qn−1 > 0.5. the probability is 1 − 365 : this is the calculation we already did. By calculation. 12 We calculate that this probability is (11/12) × (10/12) × (9/12) = 0. n is the smallest number of people for which the probability that they all have different birthdays is less than 1/2. pi have their birthdays on different days. we ﬁnd that q22 = 0.5. if Bi is the event that pi is born in a different month from any of p1 . Bi−1 ) = 1 − i−1 . What is the probability of the event A3 ? (This is the event that p3 has a different birthday from both p1 and p2 . . P(A3 ) = P(A3  A2 )P(A2 ) + P(A3  A2 )P(A2 ) 1 1 1 2 = (1 − 365 )(1 − 365 ) + (1 − 365 ) 365 = 0. . 1 2 P(A1 ∩ · · · ∩ Ai ) = (1 − 365 )(1 − 365 ) · · · (1 − i−1 ).
and after a moment all the other students realised that the lecturer would certainly lose his wager. Sally’s sister Hannah does have cystic ﬁbrosis. If your genes are NN or NC or CN then you are normal. You can assume that the two genes in a person are independent. But . Nor does she. Let A and B be events with A ⊆ B. He said to the class. P(B) P(B) (a) This is the same as the eye colour example discussed earlier. (a) Neither of Sally’s parents has cystic ﬁbrosis. Some years ago. since q11 = 0.) 2.859. However. “I bet that no two people in the room have the same birthday”. (c) Harry and Sally plan to have a child. A true story. So. Why? (Answer in the next chapter. the lecturer started discussing the Birthday Paradox. it is more likely that two will have the same birth month.3819 for i = 5. a student in the back said “I’ll take the bet”. Find the probability that the child will have cystic ﬁbrosis (given that neither Harry nor Sally has it). we will use a number of times the following principle. Find the probability that Sally has at least one C gene (given that she does not have cystic ﬁbrosis). in a probability class with only ten students. Find the probability that he has at least one C gene (given that he does not have cystic ﬁbrosis). We are given that Sally’s sister has genes CC. Each child receives one gene from each parent. Each gene is either N or C. if they are CC then you have cystic ﬁbrosis. and so P(A  B) = P(A ∩ B) P(A) = . (Remember that there are eleven people in the room!) However.7 Worked examples Question Each person has two genes for cystic ﬁbrosis. He should have been on safe ground. Then A ∩ B = A. (b) In the general population the ratio of N genes to C genes is about 49 to 1. CONDITIONAL PROBABILITY (11/12) × (10/12) × (9/12) × (8/12) = 0. Harry does not have cystic ﬁbrosis. with ﬁve people. Solution During this solution.34 for i = 4 and CHAPTER 2. and one gene must come from each parent.
P(S3 ∩ H3 ) 4 (Remember the principle that we started with!) We are asked to ﬁnd P(X  S2 ∩ H2 ). CN. will have probability 1/4. Now by the basic rules of genetics. and H2 is the event ‘Harry does not have cystic ﬁbrosis’. WORKED EXAMPLES 35 neither parent is CC.7. all the four combinations of genes for a child of these parents. Then P(S1 ∩ S2 ) 2/4 2 = = . these genes pass independently to the baby. As in (a). P(S2 ∩ H2 ) Now Harry’s and Sally’s genes are independent. X ⊆ S3 ∩ H3 . NC. Thus. P(X) .CC}. Then the probabilities of Harry having genes CC. so P(S3 ∩ H3 ) = P(S3 ) · P(H3 ). namely CC. P(S1  S2 ) = P(S2 ) 3/4 3 (b) We know nothing speciﬁc about Harry. (1/50) · (49/50). So. so we assume that his genes are randomly and independently selected from the population. NC. if H1 is the event ‘Harry has at least one C gene’. then S2 = {CN. NN are respectively (1/50)2 . That is. P(H2 ) (49/2500) + (49/2500) + (2401/2500) 51 (c) Let X be the event that Harry’s and Sally’s child has cystic ﬁbrosis. NC.CN. this can only occur if Harry and Sally both have CN or NC genes. P(X) P(S3 ∩ H3 ) P(X) = · P(S2 ∩ H2 ) P(S3 ∩ H3 ) P(S2 ∩ H2 ) . Now if Harry and Sally are both CN or NC. where S3 = S1 ∩ S2 and H3 = H1 ∩ H2 . NC. then P(H1  H2 ) = (49/2500) + (49/2500) 2 P(H1 ∩ H2 ) = = . so each parent is CN or NC. We are given that the probability of a random gene being C or N is 1/50 and 49/50 respectively. and if S2 is the event ‘Sally does not have cystic ﬁbrosis’. and so P(X  S3 ∩ H3 ) = 1 P(X) = . P(S2 ∩ H2 ) = P(S2 ) · P(H2 ). in other words (since X ⊆ S3 ∩ H3 ⊆ S2 ∩ H2 ). (49/50) · (1/50).2. If S1 is the event ‘Sally has at least one C gene’. NN. then S1 = {CN. NN}. respectively. and (49/50)2 .
P(D) = 2/3. and it is raining when I arrive. given the season. and R the event ‘it is raining when I arrive’. the probability that it is raining is 3/4. what is the probability that it will be raining when I return to Oneirabad in a year’s time? (You may assume that in a year’s time the season will be the same as today but. what is the probability that my visit is during the Wet season? (c) I visit Oneirabad on a random day. Given this information. on a random day of the year. Oneirabad. P(R) 13/36 13 . Question The Land of Nod lies in the monsoon zone. The Wet season lasts for 1/3 of the year. P(W  R) = 9 P(R  W )P(W ) (3/4) · (1/3) = = . Wet and Dry. (b) By Bayes’ Theorem.) Solution (a) Let W be the event ‘it is the wet season’. (a) I visit the capital city. whether or not it is raining is independent of today’s weather. By the ToTP. What is the probability that it is raining when I arrive? (b) I visit Oneirabad on a random day. P(R  D) = 1/6. During the Wet season. the probability that it is raining is 1/6. CONDITIONAL PROBABILITY 1 P(S1 ∩ S2 ) P(H1 ∩ H2 ) · · 4 P(S2 ) P(H2 ) 1 · P(S1  S2 ) · P(H1  H2 ) = 4 1 2 2 = · · 4 3 51 1 = . D the event ‘it is the dry season’. We are given that P(W ) = 1/3. Given this information. and the Dry season for 2/3 of the year. and has just two seasons. P(R) = P(R  W )P(W ) + P(R  D)P(D) = (3/4) · (1/3) + (1/6) · (2/3) = 13/36. and it is raining when I arrive. P(R  W ) = 3/4. 153 = I thank Eduardo Mendes for pointing out a mistake in my previous solution to this problem.36 CHAPTER 2. during the Dry season.
7. Thus P(R ∩ R ) = P(R ∩ R  W )P(W ) + P(R ∩ R  D)P(D) 89 = (3/4)2 · (1/3) + (1/6)2 · (2/3) = . WORKED EXAMPLES 37 (c) Let R be the event ‘it is raining in a year’s time’. 432 and so P(R  R) = P(R ∩ R ) 89/432 89 = = . The information we are given is that P(R ∩ R  W ) = P(R  W )P(R  W ) and similarly for D. P(R) 13/36 156 .2.
38 CHAPTER 2. CONDITIONAL PROBABILITY .
The standard abbreviation for ‘random variable’ is r. expected value. the set of students in the class. I am interested in the sum of the two numbers.) Example I throw a sixsided die twice. (Remember that a function is nothing but a rule for associating with each element of its domain set an element of its target or range set. and median. Example I select at random a student from the class and measure his or her height in centimetres. Here. Similarly. j ≤ 6}. Here the sample space is S = {(i. the random variable is ‘height’. and the target space is the set of real numbers. Here the domain set is the sample space S . “neither holy. and look at some particularly important types of random variables including the binomial. but for us they will always be numbers. The values of the function can be anything at all. nor Roman.Chapter 3 Random variables In this chapter we deﬁne random variables and some related concepts such as probability mass function. and normal. variance.1 What are random variables? The Holy Roman Empire was. 3. which is a function from the set of students to the real numbers: h(S) is the height of student S in centimetres. the sample space is the set of students. 39 . j) : 1 ≤ i. in the words of the historian Voltaire. Poisson. nor an empire”.v. a random variable is neither random nor a variable: A random variable is a function deﬁned on a sample space.
2 Probability mass function Let F be a discrete random variable. or if the values of F are integers (for example. j) = i + j.) . or even over all real numbers. the number of nuclear decays which take place in a second in a sample of radioactive material – the number is an integer but we can’t easily put an upper limit on it.g. or any real number between 4 and 5). and you are not bound to it. the possible. A random variable whose values range over an interval of real numbers. but more examples should make the idea clearer. . The most basic question we can ask is: given any value a in the target set of F. In fact. RANDOM VARIABLES and the random variable F is given by F(i.) A random variable is continuous if there are no gaps between its possible values. We begin by considering discrete random variables. we would write P(F = f ) in the above example. what is the probability that F takes the value a? In other words. 3.40 CHAPTER 3. if we consider the event A = {x ∈ S : F(x) = a} what is P(A)? (Remember that an event is a subset of the sample space. 3. . One could concoct random variables which are neither discrete nor continuous (e.) Since events of this kind are so important. 12}. it is quite common to use the same letter in lower case for a value of the random variable. 2. . These deﬁnitions are not quite precise. The target set is the set {2. For example. 3. we simplify the notation: we write P(F = a) in place of P({x ∈ S : F(x) = a}). (There is a fairly common convention in probability and statistics that random variables are denoted by capital letters and their values by lowercase letters. the height of a student could in principle be any real number between certain extreme limits. F is discrete if it can take only ﬁnitely many values (as in the second example above. but we will not consider such random variables. values could be 1. thus. The two random variables in the above examples are representatives of the two types of random variables that we will consider. In the ﬁrst example. But remember that this is only a convention. is continuous. where the values are the integers from 2 to 12). A random variable F is discrete if the values it can take are separated by gaps. .
The event X = 1. The standard abbreviation for ‘probability mass function’ is p. . and sum these terms. otherwise we should give a formula if possible.. for example. . . The possible values of X are 0. . T HH. say a1 . 3. a3 . . . The expected value is a kind of ‘generalised average’: if each of the values is equally likely. it is convenient to list it in a table. Two random variables X and Y are said to have the same distribution if the values they take and their probability mass functions are equal. T T H.3. i=1 ∞ . then X and Y have the same distribution.f. There is an interpretation of the expected value in terms of mechanics. . T HT.3 Expected value and variance Let X be a discrete random variable which takes the values a1 . and its p. then we deﬁne the expected value of X to be the inﬁnite sum E(X) = ∑ ai P(X = ai ).m. which is just the average of the values. formula or table which gives the value of P(F = a) for each element a in the target set of F. and each outcome is equally likely. If F takes only a few values. 3. .f. an . If we put a mass pi on the axis at position ai for i = 1. so that each has probability 1/n. then E(X) = (a1 + · · · + an )/n. The random variable X gives the number of heads recorded. We write X ∼ Y in this case.3. when written as a set of outcomes. T T T }.m. HT H. T T H}. then the centre of mass of all these masses is at the point E(X). if Y is the number of tails recorded during the experiment. where pi = P(X = ai ). EXPECTED VALUE AND VARIANCE 41 The probability mass function of a discrete random variable F is the function. Example I toss a fair coin three times. T HT. In the above example. HT T. If the random variable X takes inﬁnitely many values. HHT. . 2. a2 . 1. even though their actual values are different (indeed. is a 0 1 2 3 P(X = a) 1 3 3 1 8 8 8 8 For the sample space is {HHH. The expected value or mean of X is the number E(X) given by the formula E(X) = ∑ ai P(X = ai ). n. . we multiply each value of X by the probability that X takes that value. is equal to {HT T. . i=1 n That is. and has probability 3/8. Y = 3 − X).
now we have to worry about whether this means anything. X 2 is just the random variable whose values are the squares of the values of X. the series will usually be a geometric series or something similar. we assume that the number of values is ﬁnite.) . then Var(X) is a measure of how spreadout the values are around their average. RANDOM VARIABLES Of course. whether this inﬁnite series is convergent. We add it up by columns instead of by rows. getting three parts with n terms in each part. discrete random variables will only have ﬁnitely many values. We won’t worry about it too much.) Continuing. and that ∑n P(X = ai ) = 1 since i=1 the events X = ai form a partition. In the proofs below. (What is happening here is that the entire sum consists of n rows with three terms in each row. This is a question which is discussed at great length in analysis. that is. i=1 n For the second term is equal to the third by deﬁnition.42 CHAPTER 3. The variance of X is the number Var(X) given by Var(X) = E(X 2 ) − E(X)2 . Here. The next theorem shows that. Thus E(X ) = ∑ a2 P(X = ai ) i 2 i=1 n (or an inﬁnite sum. Proposition 3. if necessary). and we are done. Then Var(X) = E((X − µ)2 ) = ∑ (ai − µ)2 P(X = ai ). if E(X) is a kind of average of the values of X. which we know how to sum. (Remember that E(X) = µ. Usually.1 Let X be a discrete random variable with E(X) = µ. in the few examples we consider where there are inﬁnitely many values. and the third is i=1 n ∑ (ai − µ)2P(X = ai) ∑ (a2 − 2µai + µ2)P(X = ai) i i=1 n = = i=1 ∑ a2P(X = ai) − 2µ i n i=1 ∑ aiP(X = ai) + µ2 n i=1 ∑ P(X = ai) n . we ﬁnd E((X − µ)2 ) = E(X 2 ) − 2µE(X) + µ2 = E(X 2 ) − E(X)2 .
bm .4. JOINT P.Y = b j ) means the probability of the event that X takes the value ai and Y takes the value b j . each of the form (ai − µ)2 (a square. Note the difference between ‘independent events’ and ‘independent random variables’. hence nonnegative). Var(X) = 02 × (1/8) + 12 × (3/8) + 22 × (3/8) + 32 × (1/8) − (3/2)2 = 3/4. we have P(X = ai . an . .1.4 Joint p. . . X is the number of heads. (For the formula in Proposition 3.m. we get 3 Var(X) = − 2 2 1 1 × + − 8 2 2 3 1 × + 8 2 2 3 3 × + 8 2 2 × 1 3 = . . . 3. the events X = ai and Y = b j are independent (events).3. hence nonnegative) times P(X = ai ) (a probability. We say that X and Y are independent if. • The expected value of X always lies between the smallest and largest values of X. . . • The variance of X is never negative.M. 8 4 Two properties of expected value and variance can be used as a check on your calculations. for any value ai of X and any value b j of Y . .1 is a sum of terms. and let Y be a random variable taking the values b1 . If we calculate the variance using Proposition 3.f.F. for any possible values i and j. Example I toss a fair coin three times.Y = b j ) = P(X = ai ) · P(Y = b j ). Here P(X = ai . of two random variables Let X be a random variable taking the values a1 . What are the expected value and variance of X? E(X) = 0 × (1/8) + 1 × (3/8) + 2 × (3/8) + 3 × (1/8) = 3/2. OF TWO RANDOM VARIABLES 43 Some people take the conclusion of this proposition as the deﬁnition of variance. So we could restate the deﬁnition as follows: The random variables X and Y are independent if. .
Y = b. We will see the proof later.m. If two random variables X and Y are not independent. that is. of each variable does not tell the whole story. and X and Y are the numbers that come up on the ﬁrst and second throws. for each value ai of X and each value b j of Y . Are X and Y independent random variables? No.v.f.m. On the other hand. If we have more than two random variables (for example X. Then the events ‘exactly one red pen selected’ and ‘exactly one green pen selected’ turned out to be independent.f. (The row and column sums are sometimes called the marginal distributions or marginals. of Y . if I roll a die twice. X and Y are independent r. (a) E(X +Y ) = E(X) + E(Y ).f. then X and Y will be independent. one green pen. RANDOM VARIABLES Example In Chapter 2. the row sums form the p.) of X and Y is the table giving.Y. Note that summing the entries in the row corresponding to the value ai gives the probability that X = ai . but for variance it helps if the variables are independent: Theorem 3.m. I select two pens without replacement. . (You may want to revise the material on mutually independent events. and Y the number of green pens selected. (b) If X and Y are independent.) What about the expected values of random variables? For expected value. but P(X = 2. because P(X = 2) = 1/6. The joint probability mass function (or joint p. Let X be the number of red pens selected. Similarly the column sums form the p. Z). Then P(X = 1. of X.f. then Var(X +Y ) = Var(X) + Var(Y ). P(Y = 1) = 1/2. even if the die is not fair (so that the outcomes are not all equally likely).m.) In particular. and one blue pen. We arrange the table so that the rows correspond to the values of X and the columns to the values of Y .Y = 1) = 0 (it is impossible to have two red and one green in a sample of two).s if and only if each entry of the table is equal to the product of its row sum and its column sum. it is easy.2 Let X and Y be random variables. X = a. the probability that X = ai and Y = b j .Y = 1) = P(X = 1) · P(Y = 1). then knowing the p. Z = c) are mutually independent.44 CHAPTER 3. we say that they are mutually independent if the events that the random variables take speciﬁc values (for example. we saw the following: I have two red pens.
f. We consider the joint p. Now the probability that it takes a given value ck is the sum of the probabilities P(X = ai .m.Y = b j ) + j=1 m j=1 ∑ b j ∑ P(X = ai. .Y = b j ) is a column sum and is equal to i=1 P(Y = b j ).m. . i=1 m n Now ∑m P(X = ai . of X and Y is given by the following table: Y 0 1 0 0 1 6 X 1 2 1 3 1 6 1 3 0 The row and column sums give us the p.Y = b j ) i=1 k n m ∑ ai ∑ P(X = ai.Y = b j ) is a row sum of the joint p. so is equal to j=1 P(X = ai ). First we calculate E((X +Y )2 ) = E(X 2 + 2XY +Y 2 ) = E(X 2 ) + 2E(XY ) + E(Y 2 ).M.m. and similarly ∑n P(X = ai .3. The random variable X +Y takes the values ai + b j for i = 1. m.F.s for X and Y : a 0 1 2 P(X = a) 1 6 2 3 1 6 b 0 1 P(Y = b) 1 2 1 2 Now we give the proof of Theorem 3. one green pen. .f.Y = b j ) . . and one blue pen.Y = b j ) over all i and j such that ai + b j = ck .4.f.m. table. So E(X +Y ) = i=1 ∑ aiP(X = ai) + ∑ b j P(Y = b j ) j=1 n m = E(X) + E(Y ). OF TWO RANDOM VARIABLES 45 Example I have two red pens. E(X +Y ) = = = ∑ ck P(X +Y = ck ) i=1 j=1 n ∑ ∑ (ai + b j )P(X = ai. n and j = 1. . Let X be the number of red pens that I choose and Y the number of green pens.f. JOINT P. . . of X and Y . Then the joint p. . The variance is a bit trickier. and I choose two pens without replacement. . Thus.2.
Var(X + c) = Var(X). (c) E(cX) = cE(X).Y = b j ) = P(X = ai ) · P(Y = b j ).46 CHAPTER 3.3 Let C be a constant random variable with value c. remember that a random variable is not a variable at all but a function. that is. We have to consider the term E(XY ). RANDOM VARIABLES using part (a) of the Theorem.) .) Proposition 3. we have to make the assumption that X and Y are independent. For this.Y = b j ) n m ∑ ∑ aib j P(X = ai)P(Y = b j ) i=1 ∑ aiP(X = ai) · j=1 ∑ b j P(Y = b j ) m = E(X) · E(Y ). and there is nothing amiss with a constant function. (b) E(X + c) = E(X) + c. we consider constant random variables. (a) E(C) = c. Var(C) = 0. Also. (For C2 is a constant random variable with value c2 . we have E(XY ) = = = i=1 j=1 n n i=1 j=1 n ∑ ∑ aib j P(X = ai. So E(C) = c · 1 = c. Var(cX) = c2 Var(X). To ﬁnish this section. (If the thought of a ‘constant variable’ worries you. Proof (a) The random variable C takes the single value c with P(C = c) = 1. So Var(X +Y ) = = = = E((X +Y )2 ) − (E(X +Y ))2 (E(X 2 ) + 2E(XY ) + E(Y 2 )) − (E(X)2 + 2E(X)E(Y ) + E(Y )2 ) (E(X 2 ) − E(X)2 ) + 2(E(XY ) − E(X)E(Y )) + (E(Y 2 ) − E(Y )2 ) Var(X) + Var(Y ). Var(C) = E(C2 ) − E(C)2 = c2 − c2 = 0. As before. Let X be any random variable. P(X = a1 .
If the variable is tabulated in the New Cambridge Statistical Tables. Before we begin. the expected value. SOME DISCRETE RANDOM VARIABLES 47 (b) This follows immediately from Theorem 3.5 Some discrete random variables We now look at ﬁve types of discrete random variables. and give the p. You should have a copy of the tables to follow the examples.) Then E(X + c) = E(X) + E(C) = E(X) + c. The cumulative distribution function. (c) If a1 . and some examples of using the tables. . or c. .m. (This is true because P(X = a. we give the table number.f.f. can are the values of cX.). It is deﬁned for a discrete random variable as follows. of X is given by FX (ai ) = P(X ≤ ai ). We assume that these are arranged in ascending order: a1 < a2 < · · · < an . Then Var(cX) = = = = E(c2 X 2 ) − E(cX)2 c2 E(X 2 ) − (cE(X))2 c2 (E(X 2 ) − E(X)2 ) c2 Var(X). Let X be a random variable taking values a1 . So E(cX) = = c ∑ ai P(X = ai ) i=1 i=1 n ∑ caiP(cX = cai) n = cE(X). We describe for each type the situations in which it arises.m. then ca1 . . and the variance.5. each depending on one or more parameters. .f.3. . 3. .2. . A summary of this information is given in Appendix B. . They don’t give the probability mass function (or p. . but a closely related function called the cumulative distribution function. . an . . once we observe that the constant random variable C and any random variable X are independent. and P(cX = cai ) = P(x = ai ). a comment on the New Cambridge Statistical Tables. Var(X + c) = Var(X) + Var(C) = Var(X).d. ...C = c) = P(X = a) · 1. . an are the values of X. a2 .
we associate a random variable IA (remember that a random variable is just a function on S ) by the rule 1 if s ∈ A. then the number of heads that we get when we toss the coin once is a Bernoulli(p) random variable.f. Calculation of the expected value and variance of a Bernoulli random variable is easy. if a biased coin has probability p of coming down heads. let A be any event in a probability space S . RANDOM VARIABLES We see that it can be expressed in terms of the p. So its p. So p determines everything. of a discrete random variable except for looking up the tables. of X as follows: FX (ai ) = P(X = a1 ) + · · · + P(X = ai ) = j=1 ∑ P(X = a j ).m.f. (Remember that ∼ means “has the same p.) E(X) = 0 · q + 1 · p = p. i In the other direction.m. Necessarily q (the probability that X = 0) is equal to 1 − p. For example.) Some people write 1 A instead of IA . p is the probability that X = 1. / The random variable IA is called the indicator variable of A.m. we sometimes describe the experiment as a ‘trial’.f.f.f. More generally. where p = P(A). Var(X) = 02 · q + 12 · p − p2 = p − p2 = pq. It is a Bernoulli(p) random variable. For a Bernoulli random variable X. Let X ∼ Bernoulli(p). (The event IA = 1 is just the event A. We won’t use the c. we cn recover the p.d. because its value indicates whether or not A occurred. and the event X = 0 as ‘failure’. as”. With A. the event X = 1 as‘success’.: P(X = ai ) = FX (ai ) − FX (ai−1 ).f. 0 and 1. It is much more important for continuous random variables! Bernoulli random variable Bernoulli(p) A Bernoulli random variable is the simplest type of all. from the c. It only takes two values. looks as follows: x 0 1 P(X = x) q p Here.) . it can be any number between 0 and 1.d.m.48 CHAPTER 3. IA (s) = 0 if s ∈ A. (Remember that q = 1 − p.
Here is the easy method. for Bin(4. if we get tails on the kth toss. . p) random variable. p).m.f. 2. We have a coin with probability p of coming down heads. but the harder way is useful for many random variables. Var(X) = npq. (x + y)n = as it should be: here we used the binomial theorem k=0 ∑ nCk xn−k yk . . p): k P(X = k) n 0 q4 1 4q3 p 2 6q2 p2 3 4qp3 4 p4 Note: when we add up all the probabilities in the table. where q = 1 − p. Then X is our Bin(n. an easy way and a harder way. p) random variable X takes the values 0. n (This argument explains the name of the binomial random variable!) If X ∼ Bin(n. For example. we get k=0 ∑ nCk qn−k pk = (q + p)n = 1. and we toss it n times and count the number X of heads. 1. 2.3. This number is a Bin(n. The easy way only works for the binomial. . Let Xk be the random variable deﬁned by Xk = 1 0 if we get heads on the kth toss. n. . . of X is given by P(X = k) = nCk qn−k pk for k = 0. A Bin(n. we describe the event X = 1 as a ‘success’.5. and the p. n. However. . Now a binomial random variable counts the number of successes in n independent trials each associated with a Bernoulli(p) random variable. . . suppose that we have a biased coin for which the probability of heads is p. There are two ways to prove this. then E(X) = np. for example. We toss the coin n times and count the number of heads obtained. SOME DISCRETE RANDOM VARIABLES 49 Binomial random variable Bin(n. and the probability of getting k heads and n − k tails in a particular order is qn−k pk . . 1. This is because there are nCk different ways of obtaining k heads in a sequence of n throws (the number of choices of the k positions in which the heads occur). p) random variable. Note that we have given a formula rather than a table here. you can skip it if you wish: I have set it in smaller type for this reason. p) Remember that for a Bernoulli random variable. For small values we could tabulate the results.
Xi is the indicator variable of the event ‘heads on the kth toss’. we have E(X) = p + p + · · · + p = np. Then (a) [GX (x)]x=1 = 1. Now we have X = X1 + X2 + · · · + Xn (can you see why?). Var(X) = pq + pq + · · · + pq = npq. E(Xi ) = p.4 Let GX (x) be the probability generating function of a random variable X. Now the probability generating function of X is the power series GX (x) = ∑ pk xk . p). since the variables are independent. (b) E(X) = (c) Var(X) = d dx GX (x) x=1 . Let X be a random variable whose values are nonnegative integers. we write pk for the probability P(X = k). . which by deﬁnition is Var(X).) We use the notation [F(x)]x=1 for the result of substituting x = 1 in the series F(x). We only use it here for calculating expected values and variances. dx2 Now putting x = 1 in this series we get ∑ k(k − 1)pk = ∑ k2 pk − ∑ kpk = E(X 2 ) − E(X). we get d GX (x) = ∑ kpk xk−1 . dx Now putting x = 1 in this series we get ∑ kpk = E(X). when we differentiate the series termbyterm (you will learn later in Analysis that this is OK). d2 G (x) + E(X) − E(X)2 . . RANDOM VARIABLES In other words. Var(Xi ) = pq. To save space. (We don’t insist that it takes all possible values. So. Proposition 3. which takes values between 0 and n. and X1 . . Adding E(X) and subtracting E(X)2 gives E(X 2 ) − E(X)2 . (The sum is over all values k taken by X. dx2 X x=1 Part (a) is just the statement that probabilities add up to 1: when we substitute x = 1 in the power series for GX (x) we just get ∑ pk . by Theorem 21.50 CHAPTER 3. For part (b). For part (c). Xn are independent Bernoulli(p) random variables (since they are deﬁned by different tosses of a coin). but if you learn more probability theory you will see other uses for it. differentiating twice gives d2 GX (x) = ∑ k(k − 1)pk xk−2 . Then. as we saw earlier. . . this method is ﬁne for the binomial Bin(n. The other method uses a gadget called the probability generating function.
We sample n balls from the box without replacement. SOME DISCRETE RANDOM VARIABLES Now let us appply this to the binomial random variable X ∼ Bin(n. Differentiating once. What is the probability of exactly ﬁve heads? This is P(5 or fewer) − P(4 or fewer). where p = M/N is the proportion of red balls in the sample. of which M are red.5. What about sampling without replacement? This leads us to our next random variable: Hypergeometric random variable Hg(n.9745− 0. we get n(n − 1)p2 (q + px)n−2 . Suppose that we have N balls in a box. Now adding E(X) − E(X)2 . let the random variable X be the number of red balls in the sample.3. suppose that the probability that a certain coin comes down heads is 0.45. Putting x = 1 gives (q + p)n = 1. using the Chain Rule. and from tables the answer is 0. we get np(q + px)n−1 . We have pk = P(X = k) = nCk qn−k pk .2608. in agreement with Proposition 3. M. For larger values of p. If the coin is tossed 15 times. of which M are red. use the fact that the number of failures in Bin(n.1404. what is the probability of ﬁve or fewer heads? Turning to the page n = 15 in Table 1 and looking at the row 0. So the probability of ﬁve heads in 15 tosses of a coin with p = 0. and different choices are independent.2608 − 0.4(a). What is the distribution of X? Since each ball has probability M/N of being red. For example.0514. the tables give the cumulative distribution function. you read off the answer 0. Differentiating again.9231 = 0. The binomial random variable is tabulated in Table 1 of the Cambridge Statistical Tables [1]. so the probability generating function is 51 k=0 ∑ nCk qn−k pk xk = (q + px)n . 1 − p). Another interpretation of the binomial random variable concerns sampling. As explained earlier.55 is 0. X ∼ Bin(n.5. p) is equal to the number of successes in Bin(n. Let the random variable X be the number of .1204 = 0. Putting x = 1 we ﬁnd that E(X) = np. We sample n balls from the box with replacement. we get Var(X) = n(n − 1)p2 + np − n2 p2 = np − np2 = npq. Putting x = 1 gives n(n − 1)p2 . p). n by the Binomial Theorem. N) Suppose that we have N balls in a box.45. The tables only go up to p = 0. p).
2.f. In particular. but if p > 0 this possibility is ‘inﬁnitely unlikely’. RANDOM VARIABLES red balls in the sample. 1. as we will see. . then the difference between sampling with and without replacement is very small. We calculated earlier that P(X = 0) = 1/6. N −1 You should compare these to the values for a binomial random variable. . Its p. We have again a coin whose probability of heads is p. where two pens are red. If we let p = M/N be the proportion of red balls in the hat. is given by the formula MC · N−MC k n−k P(X = k) = . These agree with the formulae above. 2. we toss it until it comes down heads for the ﬁrst time. So we can say that Hg(n. and all choices are equally likely. the number of ways of choosing k of the M red balls and n − k of the N − M others is MCk · N−MCn−k . the number of independent Bernoulli trials required until the ﬁrst success is obtained is a geometric random variable. has probability zero. From this we ﬁnd by direct calculation that E(X) = 1 and Var(X) = 1/3. The random variable X can take any of the values 0. N). 2. Consider our example of choosing two pens from four. (In theory we might never get a head and toss the coin inﬁnitely often. 4) random variable. instead of tossing it a ﬁxed number of times and counting the heads. More generally. n.m. i. 3.e. Thus.52 CHAPTER 3. . . and count the number of times we have tossed the coin. M. N Var(X) = n M N N −M N N −n . . The expected value and variance of a hypergeometric random variable are as follows (we won’t go into the proofs): E(X) = n M . and one blue. if the numbers M and N − M of red and nonred balls in the hat are both very large compared to the size n of the sample. and indeed the ‘correction factor’ is close to 1. M. Now. P(X = 1) = 2/3 and P(X = 2) = 1/6. N) is approximately Bin(n. . The number X of red pens is a Hg(2. . then E(X) = np. Geometric random variable Geom(p) The geometric random variable is like the binomial but with a different stopping rule. and Var(X) is equal to npq multiplied by a ‘correction factor’ (N − n)/(N − 1). N Cn For the number of samples of n balls from N is N Cn .) We always assume that 0 < p < 1. . M/N) if n is small compared to M and N − M. . Such an X is called a hypergeometric random variable Hg(n. one green. the values of the variable are the positive integers 1.
so Var(X) = 2pq 1 1 q + − 2 = 2. so the geometric progression does for the geometric random variable. the expected number of tosses until the ﬁrst head is 2 (so the expected number of tails is 1). if we toss a fair coin until heads is obtained. where q < 1. the result will be that E(X) = 1/p. k=1 ∑ qk−1 pxk = 1 − qx . since ‘tails’ has probability q and different tosses are independent.m. we obtain E(X) = Differentiating again gives 2pq/(1 − qx)3 . We have GX (x) = ∞ Var(X) = q/p2 . ∞ p since the series is a geometric progression with ﬁrst term p and common ratio q. px again by summing a geometric progression. Let’s add up these probabilities: k=1 ∑ qk−1 p = p + qp + q2 p + · · · = 1 − q = 1. If X ∼ Geom(p).5. and gives its name to the random variable. For the event X = k means that we get tails on the ﬁrst k − 1 tosses and heads on the kth. and the variance of this number is also 2. . (Just as the binomial theorem shows that probabilities sum to 1 for a binomial random variable.f of a Geom(p) random variable is given by P(X = k) = qk−1 p. 53 where q = 1 − p. we get (1 − qx)p + pxq p d GX (x) = = . 2 dx (1 − qx) (1 − qx)2 Putting x = 1.3.) We calculate the expected value and the variance using the probability generating function. SOME DISCRETE RANDOM VARIABLES The p. Differentiating. p3 p p p p 1 = . (1 − q)2 p For example. and this event has probability qk−1 p.
is very closely connected with continuous things. By analogy with what happened for the binomial and geometric random variables. then GX (x) = (λx)k −λ e = eλ(x−1) . the p. However. Suppose that ‘incidents’ occur at random times. if you speak a little French. We get λk ∑ k=0 k! ∞ e−λ = eλ · e−λ = 1. for a Poisson(λ) variable X is given by the formula λk P(X = k) = e−λ . Another example might be the number of telephone calls a minute to a busy telephone number. then the number of ﬁsh I will catch in the next hour is a Poisson(λ) random variable. Again we use the probability generating function. Differentiating again gives λ2 eλ(x−1) .4 nuclear decays per second. The Poisson random variabe X counts the number of ‘incidents’ which occur in a given interval. Unfortunately. and the ﬁsh are biting at the rate of λ per hour on average. this name has been given to a closelyrelated continuous random variable which we will meet later. So if. but at a steady rate overall. If X ∼ Poisson(λ). so Var(X) = λ2 + λ − λ2 = λ.54 CHAPTER 3. The best example is radioactive decay: atomic nuclei decay randomly. Although we will not prove it. you might have expected that this random variable would be called ‘exponential’. so E(X) = λ.4) random variable. on average. k! Let’s check that these probabilities add up to one. k=0 k! ∞ ∑ again using the series for the exponential function. there are 2. RANDOM VARIABLES Poisson random variable Poisson(λ) The Poisson random variable.m. then the number of decays in one second starting now is a Poisson(2. .f. Differentiation gives λeλ(x−1) . unlike the ones we have seen before. but the average number λ which will decay in a given interval is constant. since the expression in brackets is the sum of the exponential series. you might use as a mnemonic the fact that if I go ﬁshing. The expected value and variance of a Poisson(λ) random variable X are given by E(X) = Var(X) = λ.
3.6. CONTINUOUS RANDOM VARIABLES
55
The cumulative distribution function of a Poisson random variable is tabulated in Table 2 of the New Cambridge Statistical Tables. So, for example, we ﬁnd from the tables that, if 2.4 ﬁsh bite per hour on average, then the probability that I will catch no ﬁsh in the next hour is 0.0907, while the probability that I catch at ﬁve or fewer is 0.9643 (so that the probability that I catch six or more is 0.0357). There is another situation in which the Poisson distribution arises. Suppose I am looking for some very rare event which only occurs once in 1000 trials on average. So I conduct 1000 independent trials. How many occurrences of the event do I see? This number is really a binomial random variable Bin(1000, 1/1000). But it turns out to be Poisson(1), to a very good approximation. So, for example, the probability that the event doesn’t occur is about 1/e. The general rule is: If n is large, p is small, and np = λ, then Bin(n, p) can be approximated by Poisson(λ).
3.6
Continuous random variables
We haven’t so far really explained what a continuous random variable is. Its target set is the set of real numbers, or perhaps the nonnegative real numbers or just an interval. The crucial property is that, for any real number a, we have (X = a) = 0; that is, the probability that the height of a random student, or the time I have to wait for a bus, is precisely a, is zero. So we can’t use the probability mass function for continuous random variables; it would always be zero and give no information. We use the cumulative distribution function or c.d.f. instead. Remember from last week that the c.d.f. of the random variable X is the function FX deﬁned by FX (x) = P(X ≤ x). Note: The name of the function is FX ; the lower case x refers to the argument of the function, the number which is substituted into the function. It is common but not universal to use as the argument the lowercase version of the name of the random variable, as here. Note that FX (y) is the same function written in terms of the variable y instead of x, whereas FY (x) is the c.d.f. of the random variable Y (which might be a completely different function.) Now let X be a continuous random variable. Then, since the probability that X takes the precise value x is zero, there is no difference between P(X ≤ x) and P(X < x). Proposition 3.5 The c.d.f. is an increasing function (this means that FX (x) ≤ FX (y) if x < y), and approaches the limits 0 as x → −∞ and 1 as x → ∞.
56
CHAPTER 3. RANDOM VARIABLES The function is increasing because, if x < y, then FX (y) − FX (x) = P(X ≤ y) − P(X ≤ x) = P(x < X ≤ y) ≥ 0.
Also FX (∞) = 1 because X must certainly take some ﬁnite value; and FX (−∞) = 0 because no value is smaller than −∞! Another important function is the probability density function fX . It is obtained by differentiating the c.d.f.: d FX (x). dx Now fX (x) is nonnegative, since it is the derivative of an increasing function. If we know fX (x), then FX is obtained by integrating. Because FX (−∞) = 0, we have Z x FX (x) = fX (t)dt. fX (x) =
−∞
Note the use of the “dummy variable” t in this integral. Note also that P(a ≤ X ≤ b) = FX (b) − FX (a) =
Z b
a
fX (t)dt.
You can think of the p.d.f. like this: the probability that the value of X lies in a very small interval from x to x + h is approximately fX (x) · h. So, although the probability of getting exactly the value x is zero, the probability of being close to x is proportional to fX (x). There is a mechanical analogy which you may ﬁnd helpful. Remember that we modelled a discrete random variable X by placing at each value a of X a mass equal to P(X = a). Then the total mass is one, and the expected value of X is the centre of mass. For a continuous random variable, imagine instead a wire of variable thickness, so that the density of the wire (mass per unit length) at the point x is equal to fX (x). Then again the total mass is one; the mass to the left of x is FX (x); and again it will hold that the centre of mass is at E(X). Most facts about continuous random variables are obtained by replacing the p.m.f. by the p.d.f. and replacing sums by integrals. Thus, the expected value of X is given by Z
∞
E(X) =
−∞
x fX (x)dx,
and the variance is (as before) Var(X) = E(X 2 ) − E(X)2 , where E(X ) = It is also true that Var(X) = E((X
2
x2 fX (x)dx. −∞ − µ)2 ), where µ = E(X).
Z ∞
3.7. MEDIAN, QUARTILES, PERCENTILES
57
We will see examples of these calculations shortly. But here is a small example to show the ideas. The support of a continuous random variable is the smallest interval containing all values of x where fX (x) > 0. Suppose that the random variable X has p.d.f. given by 2x if 0 ≤ x ≤ 1, 0 otherwise. The support of Xis the interval [0, 1]. We check the integral: fX (x) =
Z ∞
−∞
Z 1
fX (x)dx =
2x dx = x2
0
x=1 x=0
= 1.
The cumulative distribution function of X is 0 if x < 0, x2 if 0 ≤ x ≤ 1, −∞ 1 if x > 1. (Study this carefully to see how it works.) We have
Z x
FX (x) =
fX (t)dt =
2 2x2 dx = , 3 −∞ 0 Z ∞ Z 1 1 E(X 2 ) = x2 fX (x)dx = 2x3 dx = , 2 −∞ 0 2 2 1 1 − = . Var(X) = 2 3 18 E(X) = x fX (x)dx =
Z ∞
Z 1
3.7
Median, quartiles, percentiles
Another measure commonly used for continuous random variables is the median; this is the value m such that “half of the distribution lies to the left of m and half to the right”. More formally, m should satisfy FX (m) = 1/2. It is not the same as the mean or expected value. In the example at the end of the last section, we saw that E(X) = 2/3. The median of X is the value of m for which FX (m) = 1/2. Since √ FX (x) = x2 for 0 ≤ x ≤ 1, we see that m = 1/ 2. If there is a value m such that the graph of y = fX (x) is symmetric about x = m, then both the expected value and the median of X are equal to m. The lower quartile l and the upper quartile u are similarly deﬁned by FX (l) = 1/4, FX (u) = 3/4.
Thus, the probability that X lies between l and u is 3/4 − 1/4 = 1/2, so the quartiles give an estimate of how spreadout the distribution is. More generally, we deﬁne the nth percentile of X to be the value of xn such that FX (xn ) = n/100,
f. Of course. The details are summarised in Appendix B. b) Let a and b be real numbers with a < b. The uniform random variable doesn’t really arise in practical situations. the probability that X is smaller than xn is n%. then • differentiate FX to get fX . 1 if x > b. “equally likely to be anywhere in the interval”. 3.f. Most computer systems include a random number generator.) shows that the expected value and the median of X are both given by (a + b)/2 (the midpoint of the interval).f. its probability density function is constant on the interval [a. Thus. By integration. and integrate fX to get FX . Reminder If the c.58 CHAPTER 3. of X is FX (x) and the p. is fX (x). What should the constant value c be? The integral of the p. roughly speaking. b] (and zero outside the interval).d. is the area of a rectangle of height c and base b − a. this must be 1. and normal.d. it is very useful for simulations. is FX (x) = 0 if x < a.f.d.8 Some continuous random variables In this section we examine three important continuous random variables: the uniform.d. which apparently produces independent values of a uniform random variable on the interval [0. but there should be no obvious pattern to .f. b) is given by fX (x) = 1/(b − a) if a ≤ x ≤ b. since the computer is a deterministic machine. they are not really random.d. A uniform random variable on the interval [a. 1]. exponential. and the median and percentiles of X. 0 otherwise. RANDOM VARIABLES that is. while Var(X) = (b − a)2 /12. b] is.f. so c = 1/(b − a). the p. Further calculation (or the symmetry of the p. In other words. Uniform random variable U(a. of the random variable X ∼ U(a. we ﬁnd that the c. However.d. • use fX to calculate E(X) and Var(X). • use FX to calculate P(a ≤ X ≤ b) (this is FX (b) − FX (a)). (x − a)/(b − a) if a ≤ x ≤ b.
and in a large number of trials they should be distributed uniformly over the interval. has an approximately normal distribution. Not that it takes nonnegative real numbers as values. Its great simplicity makes it the best choice for this purpose. for virtually any random variable X which is not too bizarre. like a man’s height. measures exactly how long from now it is until the next event occurs. so that m = log 2/λ.3. σ2 ) The normal random variable is the commonest of all in applications. SOME CONTINUOUS RANDOM VARIABLES 59 the numbers produced.8. which is discrete. This partly explains why a random variable affected by many independent factors. ﬁsh biting). if x ≥ 0. (The logarithm is to base e. the p. By integration.69314718056 approximately. to be FX (x) = Further calculation gives E(X) = 1/λ. radioactive decays.d. The Poisson random variable. There is a theorem called the central limit theorem which says that.f. we ﬁnd the c.g. if you take the sum (or the average) of n independent random variables with the same distribution as X. Normal random variable N(µ. 0 1 − e−λx if x < 0. Var(X) = 1/λ2 . counts how many events will occur in the next unit of time. The median m satisﬁes 1 − e−λm = 1/2. Exponential random variable Exp(λ) The exponential random variable arises in the same situation as the Poisson: be careful not to confuse them! We have events which occur randomly but at a constant average rate of λ per unit time (e.f. the result will be approximately normal.d. and the most important. If X ∼ Exp(λ). . You will learn in the Statistics course how to use a uniform random variable to construct values of other types of discrete or continuous random variables. so that log 2 = 0. and will become more and more like a normal variable as n grows. The exponential random variable. of X is fX (x) = 0 λe−λx if x < 0. if x ≥ 0. which is continuous.
.. the probability of this is Φ(0.. of the random variable X ∼ N(µ......... of any normal random variable. So..... ...4........ . is 2 e−x ) in terms of ‘standard’ functions.. . However.............. RANDOM VARIABLES More precisely. it is not possible to write the integral of this function (which..v.. (If you are approximating any discrete random variable by a continuous one.2743.. This means that. σ 2π We have E(X) = µ and Var(X) = σ2 .. .6554..7257 = 0. The crucial fact that means that we don’t have to tabulate the function for all values of µ and σ is the following: Proposition 3. of X is obtained as usual by integrating the p... So we only need tables of the c. if X ∼ N(6..... 25).. .60 CHAPTER 3.... ... .6 If X ∼ N(µ... σ2 ) is given by the formula 2 2 1 fX (x) = √ e−(x−µ) /2σ .. 1) – this is the socalled standard normal random variable – and we can ﬁnd the c. .......f. .d. for N(0.. . . .. p) random variable is well approximated by a normal random variable with the same expected value np and the same variance npq. Φ(−c) = P(Y ≤ −c) = P(Y ≥ c) = 1 − P(Y ≤ c) = 1 − Φ(c).. .... .f.. . The picture below shows the graph of this function for µ = 0... then a Bin(n.... ..f.f..... ..f. and Y = (X − µ)/σ....d......f... . of the standard normal is given in Table 4 of the New Cambridge Statistical Tables [1].. ... ..... you should make a “continuity correction” – see the next section for details and an example. the familiar ‘bellshaped curve’. ........ .4) = 0.. . The c.... 1)... ................ .. The p.. ..... .... .. stripped of its constants... suppose that X ∼ N(6. .. ... So it is only necessary to tabulate the function for positive values of its argument.......... σ2 ).d.. So there is no alternative but to make tables of its values.. 1). . 25) and Y = (X − 6)/5 as before.. ..d.6) = 1 − 0.. . For example.....d.. then P(X ≤ 3) = P(Y ≤ −0. so that Y ∼ N(0... ... ......d. .. Y is symmetric about zero... . From the tables.......... ....d.. The c... . for any positive number c...... . if n is large...... ...) The p... . of a standard normal r... we ﬁnd that X ≤ 8 if and only if Y ≤ (8 − 6)/5 = 0.....f.... . .. ..What is the probability that X ≤ 8? Putting Y = (X − 6)/5... ..... then Y ∼ N(0........ .. The function is called Φ in the tables..6) = 1 − P(Y ≤ 0.......
where F(x1 ) is just less than c and F(x2 ) is just greater.28) = 0.0031 = 0.f. then Φ(0.d.6103 and Φ(0.0038 is 0.6745 (since 0.6114. and we want the upper quartile. So F(0. of Y .v. not tied particularly to the normal distribution (though most of the examples will come from there).29. of the normal distribution. Usually.) Using tables in reverse This means.3. To say that X can be approximated by Y means that. say.d.7486 and Φ(0. so the required value is about 0.283) = 0.9. Suppose that some function F is tabulated with the input given to three places of decimals. It is probably true that F is changing at a roughly constant rate between. This probability is equal to P(X = a) + P(x = a + 1) + · · · + P(X = b).f.d. of the normal distribution. Interpolation Any table is limited in the number of entries it contains. for example.29) = 0.0014/0. This is equal to the area . where a and b are integers. But you will ﬁnd it necessary in other cases. use it to ﬁnd x such that F(x) is a given value c. For example.45).f.28) and F(0. if Φ is the c. In this case.d.7517. where fY is the p.28 and 0. so Φ(0. so you don’t need to do this.6141. if you have a table of values of F. (Threetenths of 0. suppose that X takes integer values and we want to ﬁnd P(a ≤ X ≤ b). We are given a table of the c.29).0011.283) will be about threetenths of the way between F(0. the percentile points of the standard normal r.f. of Y and want to ﬁnd information about X. are given in Table 5 of the New Cambridge Statistical Tables [1]. then we ﬁnd from tables Φ(0.9 On using tables We end this section with a few comments about using tables.67) = 0. For example. P(X = a) is approximately equal to fY (a). For example. if Φ is the c.68) = 0. Tabulating something with the input given to one extra decimal place would make the table ten times as bulky! Interpolation can be used to extend the range of values tabulated. c won’t be in the table and we have to interpolate between values x1 and x2 . Continuity correction Suppose we know that a discrete random variable X is well approximated by a continuous random variable Y . 0. ON USING TABLES 61 3.
5 ≤ Y ≤ b + 0. .. (Here. ... Var(X) = 36. . This in turn is......5)−FY (a−0.. . and P(140 ≤ X ≤ 150) ≈ P(139..5 ≤ Y ≤ 150. we ﬁnd that P(a ≤ X ≤ b) is approximately equal to the area under the curve y = fY (x) from x = a − 0. Then X ∼ Bin(192......) Similarly.. 3/4).... P(X ≤ b) ≈ P(Y ≤ b + 0.. . ..... Similarly for the other values.. .. . Then P(a ≤ X ≤ b) ≈ P(a−0. This area is given by FY (b + 0..5).. 36)...5 a a+0. and P(X ≥ a) ≈ P(Y ≥ a − 0.5 Adding all these pieces..5)...5) − FY (a − 0...... Example The probability that a light bulb will fail in a year is 0. . Said otherwise.5..... .. ....... to a good approximation.5 ≤ Y ≤ b+0. We summarise the continuity correction: Suppose that the discrete random variable X... . So X is approximated by Y ∼ N(144. taking integer values. ......5) = FY (b+0..5 to x = b + 0.... y= f (x) P(X=a) u a−0... since FY is the integral of fY .. Y . .. for example... and so E(X) = 144.5). . is approximated by the continuous random variable Y ......62 CHAPTER 3.......5).5. the area under the curve y = fY (x) from x = a − 0. what is the probability that the number which fail in a year lies between 140 and 150 inclusive? Solution Let X be the number of light bulbs which fail in a year.... RANDOM VARIABLES of a rectangle of height fY (a) and base 1 (from a − 0. If 192 bulbs are installed.5)........75. and light bulbs fail independently....5 to a + 0...5). ..5) by the continuity correction.... since the pieces of the curve above and below the rectangle on either side of x = a will approximately cancel...5 to x = a + 0......... this is P(a − 0. ≈ means “approximately equal”..
63 3. the value of Y is the larger number minus the smaller number). and let Y be the modulus of their difference (that is.6338.5) = P 150.083) = 0.f.f.m.5 ≤ Y ≤ 150. Let the random variable X be the maximum of the two numbers obtained.3. Y = 2 can occur in two ways: the numbers thrown must be (5. work out for each cell what the values of X and Y are. and calculate its expected value and its variance. 5). For example: X = 5. (b) Write down the p. (d) Are the random variables X and Y independent? Solution (a) Y 0 1 2 3 X 4 5 6 1 36 1 36 1 36 1 36 1 36 1 36 1 0 2 36 2 36 2 36 2 36 2 36 2 0 0 2 36 2 36 2 36 2 36 3 0 0 0 2 36 2 36 2 36 4 0 0 0 0 2 36 2 36 5 0 0 0 0 0 2 36 The best way to produce this is to write out a 6 × 6 table giving all possible values for the two throws. 1). and calculate its expected value and its variance.Y ). and P(139.f. of (X. and then count the number of occurrences of each pair. Then Z ∼ N(0.m. 3) or (3.8606 − 0. (c) Write down the p. of Y .10 Worked examples Question I roll a fair die twice.75 ≤ Z ≤ 1. WORKED EXAMPLES Let Z = (Y − 144)/6.5 − 144 ≤Z≤ 6 6 = P(−0.2268 (from tables) = 0. (b) Take row sums: x P(X = x) 1 1 36 2 3 36 3 5 36 4 7 36 5 9 36 6 11 36 .m.10. (a) Write down the joint p. of X.5 − 144 139.
f.f. Var(Y ) = .Y = 2) = 0 but P(X = 1) · P(Y = 2) = 8 1296 .5 1.5 0.5) = 47/216. fot the archer’s score S is s 0 1 4 7 10 40 41 47 47 41 P(S = s) 216 216 216 216 216 .5) − FX (0) = Z 0.d. 216 Similarly we ﬁnd that P(0.64 Hence in the usual way E(X) = (c) Take column sums: y P(Y = y) and so E(Y ) = 0 6 36 CHAPTER 3.m. 36 1 10 36 Var(X) = 2555 .5 ≤ X < 2 X ≥ 2 Score 10 7 4 1 0 Construct the probability mass function for the archer’s score. is given by fX (x) = (3 + 2x − x2 )/9 if x ≤ 3. The distance of the arrow from the centre of the target is a random variable X whose p.5 ≤ X < 1) = 47/216. RANDOM VARIABLES 161 . 1296 4 4 36 2 8 36 3 6 36 5 2 36 35 665 . The archer’s score is determined as follows: Distance X < 0.5 ≤ X < 2) = 41/216.g. and ﬁnd the archer’s expected score. Solution First we work out the probability of the arrow being in each of the given bands: P(X < 0. P(1.5 ≤ X < 1 1 ≤ X < 1.5 3 + 2x − x2 0 9 dx 1/2 0 9x + 3x2 − x3 = 27 41 = . Question An archer shoots an arrow at a target.5) = FX (0. 18 324 (d) No: e. and P(X ≥ 2) = 40/216. P(1 ≤ X < 1. P(X = 1. 0 if x > 3. So the p.
Suppose that T is continuous with probability density function 0 for x < 1 fT (x) = d for x > 1 x3 for some constant d. By making an appropriate approximation. That is. it is 0 for x < 1 FT (x) = 1 − 1 for x > 1 x2 The mean of T is Z ∞ 1 x fT (x) dx = Z ∞ 2 1 x2 dx = 2. (b) Find the mean and median of T . 1 − 1/m2 = 1/2. 1 = dx x3 −d ∞ = 2x2 1 = d/2.d. (b) The c. 1 Z ∞ d so d = 2. Solution (a) The integral of fT (x). or m = 2. must be 1. (c) Suppose that 240 new bus engines are installed at the same time.. and that their lifetimes are independent. WORKED EXAMPLES Hence E(S) = 41 + 47 · 4 + 47 · 7 + 41 · 10 121 = . (a) Find the value of d.f. 216 27 65 Question Let T be the lifetime in years of new bus engines. The√ median is the value m such that FT (m) = 1/2.d. of T is obtained by integrating the p. ﬁnd the probability that at most 10 of the engines last for 4 years or more. That is.f.3. . that is.10. over the support of T .
the number which last for four years or more is a binomial random variable X ∼ Bin(240.1151 using the table of the standard normal distribution. but of course the birthdays of twins are clearly not independent! . 1).2) = 0. P(X ≤ 10) ≈ P(Y ≤ 10. then Z ∼ N(0.5). the class included a pair of twins! (The twins were Leo and Willy Moser. who both had successful careers as mathematicians. Note that we start with the continuous random variable T . move to the discrete random variable X. We approximate X by Y ∼ N(15. Using the continuity correction. Now. if 240 engines are installed. 16 So.5) = P(Z ≤ −1. and then move on to the continuous random variables Y and Z. 1/16). where ﬁnally Z is standard normal and so is in the tables.66 CHAPTER 3. if Z = (Y − 15)/(15/4). with expected value 240 × (1/16) = 15 and variance 240 × (1/16) × (15/16) = 225/16. RANDOM VARIABLES (c) The probability that an engine lasts for four years or more is 1 − FT (4) = 1 − 1 − 1 4 2 = 1 .2) = 1 − P(Z ≤ 1. (15/4)2 ). and P(Y ≤ 10. A true story The answer to the question at the end of the last chapter: As the students in the class obviously knew.) But what went wrong with our argument for the Birthday Paradox? We assumed (without saying so) that the birthdays of the people in the room were independent.
(b) If X and Y are independent.f.Y = b j ) = P(X = ai ) · P(Y = b j ) holds for any pair (ai . We introduce a number (called the covariance of X and Y ) which gives a measure of how far they are from being independent. of two discrete random variables X and Y . Remember that X and Y are independent if P(X = ai .Y ). and then proved that if X and Y are independent then E(XY ) = E(X)E(Y ). in any case.1 Covariance and correlation In this section we consider a pair of discrete random variables X and Y . 4. so that the last term is zero. Var(X +Y ) = Var(X) + Var(Y ) + 2(E(XY ) − E(X)E(Y )). then Cov(X. Now we deﬁne the covariance of X and Y to be E(XY ) − E(X)E(Y ). where we showed that if X and Y are independent then Var(X +Y ) = Var(X) + Var(Y ). Then the argument we had earlier shows the following: Theorem 4.1 (a) Var(X +Y ) = Var(X) + Var(Y ) + 2 Cov(X.Y ) for this quantity. We found that. b j ) of values of X and Y .m. and we have learned what it means for X and Y to be independent.Chapter 4 More on joint distribution We have seen the joint p. Now we examine this further to see measures of nonindependence and conditional distributions of random variables. Look back at the proof of Theorem 21(b). 67 .Y ) = 0. We write Cov(X.
Y ) ≤ 1. The proof of the ﬁrst part is optional: see the end of this section.Y ) = 1 if m > 0.Y ) corr(X. a positive correlation indicates a tendency for larger X values to be associated with larger Y values. Theorem 4. a negative value. a more general version of (a). For part (c). and −1 if Y decreases linearly as X increases.Y ) = E(XY ) − E(X)E(Y ) = mα √ corr(X. It is deﬁned as follows: Cov(X. (4. Now we just calculate everything in sight. for smaller X values to be associated with larger Y values. which is just a “normalised” version of the covariance.Y ). then corr(X. It is +1 if Y increases linearly with X. But note that this is another check on your calculations: if you calculate a correlation coefﬁcient which is bigger than 1 or smaller than −1. proved by the same argument. (c) if Y = mX + c for some constants m = 0 and c. says that Var(aX + bY ) = a2 Var(X) + b2 Var(Y ) + 2ab Cov(X. E(Y ) = E(mX + c) = mE(X) + c = mµ + c E(Y 2 ) = E(m2 X 2 + 2mcX + c2 ) = m2 (µ2 + α) + 2mcµ + c2 Var(Y ) = E(Y 2 ) − E(Y )2 = m2 α E(XY ) = E(mX 2 + cX) = m(µ2 + α) + cµ. = −1 if m < 0.Y ) = Cov(X. then corr(X.Y )/ Var(X) Var(Y ) = mα/ m2 α2 +1 if m > 0. so that E(X 2 ) = µ2 + α. Var(X) Var(Y ) The point of this is the ﬁrst part of the following theorem.68 CHAPTER 4. corr(X.2 Let X and Y be random variables. Cov(X. Part (b) follows immediately from part (b) of the preceding theorem. More generally.Y ). Let E(X) = µ and Var(X) = α. suppose that Y = mX + c.Y ) = −1 if m < 0.1) Another quantity closely related to covariance is the correlation coefﬁcient. and corr(X. MORE ON JOINT DISTRIBUTION In fact. 0 if there is no relation between them. (b) if X and Y are independent.Y ) = . Then (a) −1 ≤ corr(X. then you have made a mistake.Y ) = 0. . Thus the correlation coefﬁcient is a measure of the extent to which the two variables are related.
and one blue pen. E(Y ) = 1/2. Hence Cov(X.4. since the sum E(XY ) = ∑ ai b j P(X = ai .f. j Var(X) = 1/3. one green pen. Consider the following joint p.m. Example We have seen that if X and Y are independent then Cov(X. and I choose two pens without replacement.m. Then the joint p. it doesn’t work the other way around. E(XY ) = 1/3.m. and if X = 2 then Y must be 0. but if X = 1 then Y can be either 0 or 1.Y ) = 0. Y −1 0 1 −1 1 0 1 5 5 X 0 1 0 1 5 1 5 0 1 5 0 .f. if X = 0 then Y must be 1. contains only one term where all three factors are nonzero. 3 1/12 The negative correlation means that small values of X tend to be associated with larger values of Y . COVARIANCE AND CORRELATION 69 Example I have two red pens. Var(Y ) = 1/4.Y ) = 1/3 − 1/2 = −1/6. Indeed. However. and corr(X. of X and of Y and hence ﬁnd their expected values and variances: E(X) = 1.f.Y ) = 1 −1/6 = −√ . of X and Y is given by the following table: Y 0 1 0 0 1 6 X 1 2 1 3 1 6 1 3 0 From this we can calculate the marginal p.Y = b j ) i. Let X be the number of red pens that I choose and Y the number of green pens.1. Also.
Y ) = 0). we know that this means that q2 ≤ pr. We call two random variables X and Y uncorrelated if Cov(X.Y )2 ≤ Var(X) · Var(Y ). talk about the conditional expectation E(X  A) = ∑ ai P(X = ai  A). P(Y = 0) = 1/5. the hypothesis means that this parabola never goes below the Xaxis. but uncorrelated random variables need not be independent. 4. Clearly this is exactly equivalent to proving that its square is at most 1. Now our argument above shows that q2 ≤ pr. given A. This means that the quadratic equation px2 + 2qx + r = 0 either has no real roots. This depends on the following fact: Let p. Then q2 ≤ pr.2 Conditional random variables Remember that the conditional probability of event B given event A is P(B  A) = P(A ∩ B)/P(A).Y = 0) = 0. q = Cov(X. Equation (4. we see that px2 + 2qx + r ≥ 0 for all choices of x. but P(X = −1. we get a parabola. But X and Y are not independent: for P(X = −1) = 2/5. that is. Now let p = Var(X). so Cov(X. So we can say: Independent random variables are uncorrelated. that Cov(X. From highschool algebra. and r = Var(Y ). Then the conditional probability that X takes a certain value ai . Cov(X. MORE ON JOINT DISTRIBUTION Now calculation shows that E(X) = E(Y ) = E(XY ) = 0.Y ) = 0. as required. that is. if corr(X. or has two equal real roots.Y )2 ≤ Var(X) · Var(Y ). Suppose that X is a discrete random variable. so that either it lies entirely above the axis. For. So we can. or it touches it in one point. is just P(X = ai  A) = P(A holds and X = ai ) . Suppose that px2 + 2qx + r ≥ 0 for all real numbers x. when we plot the graph y = px2 + 2qx + r. (Note that x is an arbitrary real number here and has no connection with the random variable X!) Since the variance of a random variable is never negative.Y ). i .70 CHAPTER 4. for example. r be real numbers with p > 0. q. P(A) This deﬁnes the probability mass function of the conditional random variable X  A.1) shows that px2 + 2qx + r = Var(xX +Y ). Here is the proof that the correlation coefﬁcient lies between −1 and 1.Y ) = 0 (in other words.
P(Y = b j ) In other words. and I choose two pens without replacement. E(X  Y = 1) = . Then the joint p. A might be the event that Y takes the value b j . Let X be the number of red pens that I choose and Y the number of green pens.2. table of X and Y corresponding to the value Y = b j .f. CONDITIONAL RANDOM VARIABLES 71 Now the event A might itself be deﬁned by a random variable.Y = b j ) . we have P(X = ai  Y = b j ) = P(X = ai . In particular. i Example I have two red pens. we have taken the column of the joint p. the marginal distribution of Y . one green pen. j Proof: E(X) = ∑ aiP(X = ai) i . the conditional distributions of X corresponding to the two values of Y are as follows: a 0 1 2 P(X = a  Y = 0) 0 We have 2 3 1 3 a 0 1 2 P(X = a  Y = 1) 1 3 2 3 0 4 2 E(X  Y = 0) = . The sum of the entries in this column is just P(Y = b j ). In this case.m. we have E(X  Y = b j ) = ∑ ai P(X = ai  Y = b j ).3 E(X) = ∑ E(X  Y = b j )P(Y = b j ). of X and Y is given by the following table: Y 0 1 0 0 1 6 X 1 2 1 3 1 6 1 3 0 In this case. 3 3 If we know the conditional expectation of X for all values of Y . We divide the entries in the column by this value to obtain a new distribution of X (whose probabilities add up to 1). and one blue pen. for example. we can ﬁnd the expected value of X: Proposition 4.4.m.f.
if Y = 0. Let Y be the Bernoulli random variable whose value is 1 if the result of the ﬁrst toss is heads.72 CHAPTER 4.f. I toss it repeatedly until heads appears for the ﬁrst time. It can be stated in the following way: X and Y are independent if the conditional p. A similar result holds for independence of random variables: Proposition 4. we saw that independence of events can be characterised in terms of conditional probabilities: A and B are independent if and only if they satisfy P(A  B) = P(A). . then we stop the experiment then and there.1. rearranging this equation. and we have E(X  Y = 1) = 1. 0 if it is tails. Recall the situation: I have a coin with probability p of showing heads. we have P(X = ai  Y = b j ) = P(X = ai ). MORE ON JOINT DISTRIBUTION = = = In the above example. ∑ ai ∑ P(X = ai  Y = b j )P(Y = b j ) i j ∑ ∑ aiP(X = ai  Y = b j j i P(Y = b j ) ∑ E(X  Y = b j )P(Y = b j ). so E(X  Y = 0) = 1 + E(X) (the 1 counting the ﬁrst toss). we ﬁnd that E(X) = 1/p. In Proposition 2. so if Y = 1. of X. This is obtained by applying Proposition 15 to the events X = ai and Y = b j . for any value b j of Y .m.4 Let X and Y be discrete random variables. of X  (Y = b j ) is equal to the p. we have E(X) = E(X  Y = 0)P(Y = 0) + E(X  Y = 1)P(Y = 1) = (4/3) × (1/2) + (2/3) × (1/2) = 1. for any values ai and b j of X and Y respectively.m. If Y = 1.f. conﬁrming our earlier value. X is the number of tosses. On the other hand. Then X and Y are independent if and only if. So E(X) = E(X  Y = 0)P(Y = 0) + E(X  Y = 1)P(Y = 1) = (1 + E(X)) · q + 1 · p = E(X)(1 − p) + 1. then necessarily X = 1. then the sequence of tosses from that point on has the same distribution as the original experiment. j Example Let us revisit the geometric random variable and calculate its expected value.
y)dx dy (the right hand side means.v.Y ≤ y) = P(X ≤ x) · P(Y ≤ y). We deﬁne X and Y to be independent if P(X ≤ x. y) = P(X ≤ x.f.3 Joint distribution of continuous r. of Y is similarly Z ∞ fY (y) = −∞ fX. that is. double integrals.1) remains valid. P(a ≤ X ≤ b. the covariance and correlation can be deﬁned by the same formulae as in the discrete case. for any x and y. y) = FX (x) · FY (y). c ≤ Y ≤ d) = Z dZ b c a fX. integrate with respect to x between a and b keeping y ﬁxed. ∂x∂y In other words. The formalism here needs even more concepts from calculus than we have used before: functions of two variables.d. But we have to examine what is meant by independence for continuous random variables.Y (x. partial derivatives.Y (x. X is part of the name of the function.Y (x.Y (x.Y ≤ y).Y (x.V. I assume that this is unfamiliar to you. differentiate with respect to x keeping y constant. integrate this function with respect to y from c to d.Y (x.f. FX. y)dx.4. of X is given by Z ∞ fX (x) = −∞ fX. and Equation (4. just as in the onevariable case. y) = ∂2 FX. while x is the argument of the function. the result is a function of y.3.S 73 4. Let X and Y be continuous random variables. so this section will be brief and can mostly be skipped. y)dy. and then differentiate with respect to y keeping x constant (or the other way round: the answer is the same for all functions we consider.s For continuous random variables. (Note that. JOINT DISTRIBUTION OF CONTINUOUS R.Y of two real variables given by FX.Y ) corresponds to a point in some region of the plane is obtained by taking the double integral of fX. For example.Y (x.) The marginal p. The joint cumulative distribution function of X and Y is the function FX.) The joint probability density function of X and Y is fX.d. and the marginal p. .Y over that region.) The probability that the pair of values of (X. y).
Also.Y ) = Cov(X. b) .Y (x.v.Y ) = corr(X. so the support of Y is [0. we can ﬁnd the distribution of Y in terms of that of X.f.Y ) = E(XY ) − E(X)E(Y ).d.Y (x. of X  (Y = b) is fX(Y =b) (x) = fX. Solution (a) The support of X is [0. of X  (Y = b) is equal to the marginal p.4 Transformation of random variables If a continuous random variable Y is a function of another r.Y ) . As usual this holds if and only if the conditional p. Suppose that X ∼ U[0. What is the support of Y ? Find the cumulative distribution function and the probability density function of Y . and then as in the discrete case Cov(X.d. then Cov(X. not surprisingly. Var(X) Var(Y ) 4. of X. 2]. fY (b) The expected value of XY is.74 CHAPTER 4. P(Y ≤ y) P(X ≤ y2 ) FX (y2 ) y2 /4 .f. corr(X.f. Example Let X √ Y be random variables. Finally. y) = fX (x) · fY (y). Now FY (y) = = = = √ X. and importantly. for any value b. X. and Y = (b) We have fX (x) = x/4 for 0 ≤ x ≤ 4. 4]) and Y = X. The continuous random variables X and Y are independent if and only if fX. if X and Y are independent. MORE ON JOINT DISTRIBUTION Then the conditional p. 4]. y)dx dy.Y ) = 0 (but not conversely!).d. 4] (uniform and on [0.Y (x. Z ∞Z ∞ E(XY ) = −∞ −∞ xy fX.
say Y = g(X). of course FY (y) = 0 for √ < 0 and FY (y) = 1 for y > 2. where h is the derivative of h. σ2 ) and Y = (X − µ)/σ. TRANSFORMATION OF RANDOM VARIABLES 75 for 0 ≤ y ≤ 2.f. where h is the inverse function of g. of X is fX (x) = 1/4 for 0 ≤ x ≤ 4.f. (This is because fX (x) is the derivative of FX (x) with respect to its argument x. Then (a) the support of Y is the image of the support of X under g. Theorem 4. If we know Y as a function of X. where g is an increasing function.) Applying this formula in our example we have fY (y) = y 1 · 2y = 4 2 for 0 ≤ y ≤ 2. Let Y = g(X). and so.) Y ≤ y if and only if X ≤ y (c) We have fY (y) = d FY (y) = y/2 if 0 ≤ y ≤ 2. (Note that y 2 . Recall that 2 2 1 fX (x) = √ e−(x−µ) /2σ . by the Chain Rule.5 Let X be a continuous random variable. since Y = X. The argument in (b) is the key. 1). then Y ∼ N(0. where h is the inverse function of g.4. (b) the p. fY (y) = fX (h(y))h (y). since the p. dy 0 otherwise.d. and which is differentiable there. and so h(y) = y2 . σ 2π .) Thus FY (y) = FX (h(y)). This means that y = g(x) if and √ only if x = h(y).4. Let g be a real function which is either strictly increasing or strictly decreasing on the support of X.6: if X ∼ N(µ. Here is a formal statement of the result. g(x) = x. of Y is given by fY (y) = fX (h(y))h (y). For example. here is the proof of Proposition 3. (In our example. and the Chain Rule says that if x = h(y) we must multiply by h (y) to ﬁnd the derivative with respect to y. then the event Y ≤ y is the same as the event X ≤ h(Y ).d.
f. is zero. where g(x) = (x − µ)/σ. together with the conditions for its validity. . of Y ..d. the p. this function is everywhere strictly increasing (the graph is a straight line with slope 1/σ).d. I recommend going back to the argument we used in the example. 2π √ √ Now Y = X 2 . MORE ON JOINT DISTRIBUTION We have Y = g(X).76 CHAPTER 4. so we must work out the two contributions and add them up. and 2 1 fY (y) = fX (σy + µ) · σ = √ e−y /2 . this is valid for y > 0. X ∼ N(0. for y < 0.d. so Y ≤ y if and only if − y ≤ X ≤ y. so that P(X ≤ x) = Φ(x).f. For example.d. not either increasing or decreasing).f. 2πy (by the Chain Rule) Of course. and Y = X 2 . If the transforming function g is not monotonic (that is. of X is (1/ 2π)e−x /2 . and the inverse function is x = h(y) = σy + µ. h (y) = σ. 2π the p. 1)) = Φ( y) − Φ(− y) = Φ( y − (1 − Φ( y)) = 2Φ( y) − 1. 1) and Y = X 2 . Thus Example FY (y) = P(Y ≤ y) = P(− y ≤ X ≤ y) (by symmetry of N(0. √ 2 The p.d. Thus. then life is a bit more complicated. and 2 1 Φ (x) = √ e−x /2 . if X is a random variable taking both positive and negative values. of a standard normal variable. Let Φ(x) be its c. Find the p. However. So fY (y) = d FY (y) dy 1 √ = 2Φ ( y) · √ 2 y 1 −y/2 = √ e . then a given value √ √ y of Y could arise from either of the values y and − y of X.f.f. rather than remember this formula.
that is m2 = 1/2. of Z. otherwise.Y ) ≤ x if and only if X ≤ x and Y ≤ x. using h(y) = y. Let Z be the maximum of the two numbers. that is. Of course this probability is 0 if x < 0 and is 1 if x > 1.f.5 Worked examples Question Two numbers X and Y are chosen independently from the uniform distribution on the unit interval [0. one negative).d. FX (x) = FY (x) = x if 0 < x < 1. So the c. (The variable can be called x in both cases. you would not get this 2. We obtain the p.) For 0 ≤ x ≤ 1. and each value gives the same contribution. if both X and Y are smaller than a given value x. but if at least one of them is greater than x.f.4. (For.5.d. since Y = X 2 . Solution The c. 1 if x > 1. 4. then again so is their maximum. 3 2 Z 1 Var(Z) = 0 2 2x dx − 3 3 = 1 .f.f. by the symmetry of the p.d.5. Find the p. 18 . of Z by differentiating: fZ (x) = 2x 0 if 0 < x < 1. of X. 1].d.f. it arises from the fact that.d. The√ median of Z is the value of m such that FZ (m) = 1/2. FZ (x) = x2 if 0 < x < 1. each value of Y corresponds to two values of X (one positive. and hence ﬁnd its expected value. we have P(X ≤ x) = P(Y ≤ x) = x. variance and median. If you blindly applied the formula of Theorem 4. of Z is 0 if x < 0. 1 if x > 1. or m = 1/ 2. Thus P(Z ≤ x) = x2 .s of X and Y are identical. 0 if x < 0. P(X ≤ x and Y ≤ x) = x · x = x2 . 2 Then we can ﬁnd E(Z) and Var(Z) in the usual way: Z 1 E(Z) = 0 2 2x dx = .) The key to the argument is to notice that Z = max(X. WORKED EXAMPLES 77 Note the 2 in the line labelled “by the√ Chain Rule”. then so is their maximum. its name doesn’t matter. by independence.
78
CHAPTER 4. MORE ON JOINT DISTRIBUTION
Question I roll a fair die bearing the numbers 1 to 6. If N is the number showing on the die, I then toss a fair coin N times. Let X be the number of heads I obtain. (a) Write down the p.m.f. for X. (b) Calculate E(X) without using this information. Solution (a) If we were given that N = n, say, then X would be a binomial Bin(n, 1/2) random variable. So P(X = k  N = n) = nCk (1/2)n . By the ToTP, P(X = k) =
n=1
∑ P(X = k  N = n)P(N = n).
6
Clearly P(N = n) = 1/6 for n = 1, . . . , 6. So to ﬁnd P(X = k), we add up the probability that X = k for a Bin(n, 1/2) r.v. for n = k, . . . , 6 and divide by 6. (We start at k because you can’t get k heads with fewer than k coin tosses!) The answer comes to 1 2 3 4 5 6 k 0 63 120 99 64 29 8 1 P(X = k) 384 384 384 384 384 384 384 For example, P(X = 4) =
4C (1/2)4 + 5C (1/2)5 + 6C (1/2)6 4 4 4
6
=
4 + 10 + 15 . 384
(b) By Proposition 4.3, E(X) =
n=1
∑ E(X  (N = n))P(N = n).
6
Now if we are given that N = n then, as we remarked, X has a binomial Bin(n, 1/2) distribution, with expected value n/2. So E(X) =
n=1
∑ (n/2) · (1/6) =
6
1+2+3+4+5+6 7 = . 2·6 4
Try working it out from the p.m.f. to check that the answer is the same!
Appendix A Mathematical notation
The Greek alphabet
Name Capital alpha A beta B gamma Γ delta ∆ epsilon E zeta Z eta H theta Θ iota I kappa K lambda Λ mu M nu N xi Ξ omicron O pi Π rho P sigma Σ tau T upsilon ϒ phi Φ chi X psi Ψ omega Ω Lowercase α β γ δ ε ζ η θ ι κ λ µ ν ξ o π ρ σ τ υ φ χ ψ ω
Mathematicians use the Greek alphabet for an extra supply of symbols. Some, like π, have standard meanings. You don’t need to learn this; keep it for reference. Apologies to Greek students: you may not recognise this, but it is the Greek alphabet that mathematicians use! Pairs that are often confused are zeta and xi, or nu and upsilon, which look alike; and chi and xi, or epsilon and upsilon, which sound alike.
79
80
APPENDIX A. MATHEMATICAL NOTATION
Numbers
Notation N Z R x a a/b or b ab
mC or m n n n!
Meaning Natural numbers Integers Real numbers modulus a over b a divides b m choose n n factorial xa + xa+1 + · · · + xb (see section on Summation below) x is approximately equal to y
Example 1, 2, 3, . . . (some people include 0) . . .√ −1, 0, 1, 2, . . . , −2, 1 2 , 2, π, . . . 2 = 2, −3 = 3 12/3 = 4, 2/4 = 0.5 4  12
5C 2
= 10
5! = 120
i=1
i=a
∑ xi
b
∑ i2 = 12 + 22 + 32 = 14
3
x≈y
Sets
Notation {. . .} Meaning a set Example {1, 2, 3} NOTE: {1, 2} = {2, 1} 2 ∈ {1, 2, 3} {x : x2 = 4} = {−2, 2} {1, 2, 3} = 3 {1, 2, 3} ∪ {2, 4} = {1, 2, 3, 4} {1, 2, 3} ∩ {2, 4} = {2} {1, 2, 3} \ {2, 4} = {1, 3} {1, 3} ⊆ {1, 2, 3} everything not in A / {1, 2} ∩ {3, 4} = 0 NOTE: (1, 2) = (2, 1) {1, 2} × {1, 3} = {(1, 1), (2, 1), (1, 3), (2, 3)}
x∈A x is an element of the set A {x : . . .} the set of all x such that . . . or {x  . . .} A cardinality of A (number of elements in A) A∪B A union B (elements in either A or B) A∩B A intersection B (elements in both A and B) A\B set difference (elements in A but not B) A⊆B A is a subset of B (or equal) A complement of A / 0 empty set (no elements) (x, y) ordered pair A×B Cartesian product (set of all ordered pairs)
. The variable i or j is called a “dummy variable”. n n n i=1 ∑ (ai + bi) = ∑ ai + ∑ bi. if X is a discrete random variable.81 Summation What is it? Let a1 . that is. For example. The sum doesn’t have to start at 1. The notation n i=1 ∑ ai (read “sum. . . . a2 . and so on. Manipulation The following three rules hold. then we say that E(X) = ∑ ai P(X = ai ) i where the sum is over all i such that ai is a value of the random variable X. i=1 i+1 (A. a2 . The lefthand side says: add the two terms in each line. of ai ”). For example. a2 + b2 on the second line. an . since (if m and n are different) it is telling us to add up a different number of terms. . a3 . n The notation j=1 m ∑ a j means exactly the same thing. from i equals 1 to n. . . i 20 Sometimes I get lazy and don’t bother to write out the values: I just say ∑ ai to mean: add up all the relevant values. The notation i=1 ∑ ai is not the same. i=10 ∑ ai = a10 + a11 + · · · + a20. . i=1 n ∑ ai = a1 + a2 + · · · + an. be numbers. means: add up the numbers a1 .1) Imagine the as and bs written out with a1 + b1 on the ﬁrst line.
where −1 < r < 1.82 APPENDIX A.3) ∑ dx i+1 i+1 dx The lefthand side says: add up the functions and differentiate the sum.2) The double sum says add up all these products. You also need to know the sum of the “exponential series” xi x2 x3 x4 = 1 + x + + + + · · · = ex . ∞ . then i=1 ∑ ai = a + ar + ar2 + · · · = 1 − r . The answers must be the same. MATHEMATICAL NOTATION and then add up all the results. ∞ a using the formula for the sum of the “geometric series”. and then add the results. This doesn’t just mean “add up inﬁnitely many values”. which we write as i=1 ∑ ai for example. for all values of i and j. sometimes no. We need Analysis to give us a deﬁnition in general. Another useful result is the Binomial Theorem: (x + y)n = n k=0 ∑ nCk xn−k yk . In all the examples you meet in this book. we have functions of x. ∞ Inﬁnite sums Sometimes we meet inﬁnite sums. In Analysis you will see some answers to this question. if ai = ari−1 . then we can “differentiate termbyterm”: n d n d fi (x) = ∑ fi (x). m (A. i=1 ∑ ai · n j=1 ∑ bj m =∑ n i=1 j=1 ∑ ai b j . (A. since that is not possible. ∑ 2 6 24 i=0 i! Do the three rules of the preceding section hold? Sometimes yes. A simple example shows how it works: (a1 + a2 )(b1 + b2 ) = a1 b1 + a1 b2 + a2 b1 + a2 b2 . the rules will be valid. The righthand side says: add the ﬁrst column (all the as) and the second column (all the bs). If in place of numbers. The right says: differentiate each function and add up the derivatives. But sometimes we know the answer another way: for example.
Appendix B Probability and random variables Notation In the table. 48) • Occurs when there is a single trial with a ﬁxed probability p of success. Notation P(A) P(A  B) X =Y X ∼Y Meaning Page probability of A 3 conditional probability of A given B 24 the values of X and Y are equal X and Y have the same distribution 41 (that is. Var(X) = pq.f. • Takes only the values 0 and 1. • E(X) = p. or same p.Y ) covariance of X and Y 67 corr(X. P(X = 1) = p.m.d. P(X = 0) = q. A and B are events. same p.) E(X) expected value of X 41 Var(X) variance of X 42 Cov(X. 83 . • p.m.f.f. X and Y are random variables. where q = 1 − p.Y ) correlation coefﬁcient of X and Y 68 X B conditional random variable 70 X  (Y = b) 71 Bernoulli random variable Bernoulli(p) (p.
f.f. 51) • Occurs when we are sampling n elements without replacement from a population of N elements of which M are distinguished. 2. where q = 1 − p.m. N − M. (any positive integer). the number of heads in n coin tosses. e. e. • E(X) = n M M . • E(X) = np. M/N) if n is small compared to N. . . 49) • Occurs when we are counting the number of successes in n independent trials with ﬁxed probability p of success in each trial. • p.f. • Values 0. number of tosses until the ﬁrst head when tossing a coin. 2. . N −1 • Approximately Bin(n. M. N) (p.g. . sampling with replacement from a population with a proportion p of distinguished elements. . Geometric random variable Geom(p) (p. 52) • Describes the number of trials up to and including the ﬁrst success in a sequence of independent Bernoulli trials. Hypergeometric random variable Hg(n. • Values 1. . p) (p. where q = 1 − p. Var(X) = q/p2 . Var(X) = n N N N −M N N −n . . • Values 0. PROBABILITY AND RANDOM VARIABLES Binomial random variable Bin(n. P(X = k) = (MCk · N−MCn−k )/N Cn .g.m. . . • p. n. 1. • p. P(X = k) = qk−1 p.84 APPENDIX B. . 2. 1. Also. M. • The sum of n independent Bernoulli(p) random variables. Var(X) = npq. . . P(X = k) = nCk qn−k pk for 0 ≤ k ≤ n. • E(X) = 1/p.m. n.
Exponential random variable Exp(λ) (p.g.m. the number of ﬁsh caught in a day. 0 if x > b.m. . 54) • Describes the number of occurrences of a random event in a ﬁxed time interval. if x ≥ 0. f (x) = 0 if x < a.s are approximately equal).f. e. • c. f (x) = • c.f. • p. with all values equally likely. Uniform random variable U[a. 1. P(X = k) = e−λ λk /k! .85 Poisson random variable Poisson(λ) (p.d.d.d.f. • E(X) = λ.f.f. b] (p. (x − a)/(b − a) if a ≤ x ≤ b. Var(X) = (b − a)2 /12. (any nonnegative integer) • p. if x ≥ 0. F(x) = • E(X) = (a + b)/2. • Values 0. and np = λ. . . • E(X) = 1/λ.f. 58) • Occurs when a number is chosen at random from the interval [a. p is small. the time until the next occurrence has the same distribution.d. then Bin(n. 1/(b − a) if a ≤ x ≤ b. Var(X) = 1/λ2 . p) is approximately equal to Poisson(λ) (in the sense that the p. 1 if x > b. 0 if x < a. . 59) • Occurs in the same situations as the Poisson random variable. but measures the time from now until the ﬁrst occurrence of the event. b]. if x < 0. F(x) = 0 λ e−λx 0 1 − e−λx if x < 0. • However long you wait. Var(X) = λ. • If n is large. • p. 2.
86 APPENDIX B. • E(X) = µ.f. 1) is given in the table. PROBABILITY AND RANDOM VARIABLES Normal random variable N(µ. This also works for many other types of random variables: this statement is known as the Central Limit Theorem.s of the Binomial. . • Standard normal N(0. The c. f (x) = √ e−(x−µ) /2σ . 1). Tables 1. Bin(n. use tables. Var(X) = σ2 . σ2 ) (p. and Standard Normal random variables are tabulated in the New Cambridge Statistical Tables. then (X − µ)/σ ∼ N(0.d.d.f. npq). σ2 ). 2 and 4. Poisson. 2 2 1 • p. If X ∼ N(µ. • For large n.d. p) is approximately N(np.. 59) • The limit of the sum (or average) of many independent Bernoulli random variables. σ 2π • No simple formula for c.f.