Chapter 0
The Art of Measuring Sets
A very simple notion is that of a set, or collection of objects. Being so simple, one can
expect it to crop up in great generality in many places, and indeed it does. It has become a
useful building block in mathematics.
There are many ways to describe a set, but the two most important from our point of view
are: listing the objects in the set, or specifying some property that the objects in
the set satisfy. Naturally in order for this description to be satisfactory, the writer has to
accomplish communication with the reader, just like in the case of sequences in the
previous chapter. For example, suppose I were to write {1,2,3,4,…}. Here I have neither
listed all the objects in the set (an impossible task), nor specified a property that the
elements of the set satisfy. Rather I have appealed to your intuition, and expect you to
know that I am describing the set of natural or counting numbers. Note that, on average,
you wouldn't have too much trouble deciding whether a certain object is an element of
this set or not.
On the other hand, if I were to write {2,3,5,17,…} it might be totally unclear to the reader
what set I am talking about. Probably one could speculate I had in mind the set
{x | x = 2^(2^n) + 1, n ≥ 1}, but it would be much better to have clarified. But it could also be
argued that I am just asking for the primes in this collection. Then it would not
necessarily be so easy to decide whether a given object is in the set or not.
To understand the basic notation: whenever one sees { , the set bracket symbol, one reads
it as: the set of all, and the symbol | reads as such that, and of course, as with any other
parentheses, we are required to use } to close the phrase. Thus, {x | x is a prime} reads as
the set of all x's such that x is a prime, or simply the set of primes. Similarly,
{1,3,5,7,9,…} should probably be read as the set of odd positive integers.
One of the definite ingredients in the course will be to count sets, that is, to decide on the
number of distinct or different elements a set has. Thus {2,3,2,4} has 3 elements since
we are not counting occurrences of 2, we are just counting elements. A very common
error when counting sets is that of double counting, another name for counting an
element of a set more than once (it is like claiming that the set above has four elements).
We will use capital letters for sets, such as A, B, C, X, Y, and we will use |A|, |B|, etc. to
denote the cardinality, or number of distinct elements, of the sets A, B, etc. It is not
uncommon to refer to a set with n different elements as an n-set. Thus the set {2,3,4} is a
3-set.
A related notion is that of a subset or subcollection. If A and B are sets, one says A is a
subset of B if every element of A is an element of B (observe it is a conditional
statement), and one writes A ⊆ B or B ⊇ A to express that fact. Thus A = B if and only
if A ⊆ B and B ⊆ A. Note that naturally, since every element of A is an element of A,
A ⊆ A; every set is a subset of itself. Some authors use B ⊂ A to indicate that B ⊆ A and
B ≠ A.
Since every element of the empty set is in any set, ∅ ⊆ A for any set A. But it is not true
that necessarily ∅ ∈ A.
Building new sets from old sets is easy. For example, if A and B are sets, let C consist of
all things in the sets A, B. If A = {1,2,3,4} and B = {3,4,5,6,7}, then C = {1,2,3,4,5,6,7}. This
C is called the union of A and B, and is denoted by C = A ∪ B. Formally defined,
A ∪ B = {x | x ∈ A or x ∈ B}.
The picture in one's mind of a union is very simple: just put the two sets together. A
common way to represent sets graphically is via Venn diagrams (also affectionately
called bubbles): if one uses a bubble to represent a set, then the union A ∪ B is the two
bubbles together.
Also easy is the intersection of two sets. This new set consists of only those objects that
A and B have in common. Letting D denote this intersection, D = {3,4} in the example
above. In symbols, D = A ∩ B, and formally
A ∩ B = {x | x ∈ A and x ∈ B}.
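These operations are easy to experiment with on a computer; here is a minimal sketch in Python, whose built-in sets behave exactly this way (the variable names are ours), using the sets A and B from the example above:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7}

# The union puts the two sets together; the intersection keeps
# only the elements the two sets have in common.
union = A | B
intersection = A & B

assert union == {1, 2, 3, 4, 5, 6, 7}
assert intersection == {3, 4}

# Duplicates are not counted twice: {2,3,2,4} has 3 elements.
assert len({2, 3, 2, 4}) == 3
```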
Of course, if A and B were disjoint, which means they have no element in common,
or, equivalently, their intersection is empty, A ∩ B = ∅, then the size of their union is
the sum of the sizes: |A ∪ B| = |A| + |B|. As modest a claim as this is, it is still useful, and of course it can
be generalized to an arbitrary number of sets as long as they are pairwise disjoint, that is,
no two of them have anything in common.
This rule is often called the first counting principle (we will see the second one later)
or the rule of sum. The subsets involved in a partition are often called the classes of the
partition. Thus, we could count the undergraduates in the university by counting the
freshmen, the sophomores, the juniors and the seniors and adding up the results. A little
bit more interesting is the following
Example 1. We are to toss a coin 3 times and record the results (as heads or tails). Let
C be the set of possible outputs of this experiment. Let A₀ be the subset of outcomes with
0 heads, let A₁ be the subset of outcomes with 1 head, A₂ the outcomes with 2 heads,
and A₃ the one with 3. Then A₀, A₁, A₂, A₃ partition C and so |A₀| + |A₁| + |A₂| + |A₃| = |C|.
Indeed, C = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, while A₀ = {TTT},
A₁ = {HTT, THT, TTH}, A₂ = {HHT, THH, HTH} and A₃ = {HHH}. Note 8 = 1+3+3+1.
Similarly, if D were the set of outcomes with at least 2 heads, then A₂ and A₃ partition D,
so |D| = 4.
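This partition is small enough to enumerate by machine; a sketch in Python (names are ours):

```python
from itertools import product

# Enumerate the 8 outcomes of tossing a coin 3 times and partition
# them by number of heads, as in Example 1.
C = ["".join(t) for t in product("HT", repeat=3)]
classes = {k: [w for w in C if w.count("H") == k] for k in range(4)}

assert len(C) == 8
assert [len(classes[k]) for k in range(4)] == [1, 3, 3, 1]
# The rule of sum: the class sizes add up to |C|.
assert sum(len(v) for v in classes.values()) == len(C)
```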
What happens if the pieces are not disjoint is more complicated, and in order to
understand it, it is best to visualize sets a bit. A way of accomplishing this is again by using
Venn diagrams to represent sets: to represent subsets A, B, C of a universe U, one draws
the usual picture of three overlapping bubbles inside a rectangle.
But data is rarely as accommodating as to come in the desired form. Instead, what's
usually available is the size of A, B, C, A ∩ B, A ∩ C, B ∩ C and A ∩ B ∩ C. What is the
size then of A ∪ B ∪ C? If we simply add |A| + |B| + |C|, an element lying in exactly two of
the sets has been counted twice, so we subtract the sizes of the pairwise intersections. But
then an element lying in all three sets was counted thrice at first and
has been subtracted thrice! So it has not been counted at all; we need to add it in, so the
right expression becomes:
|A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|.
But the best way to go about it, when a Venn diagram can be drawn, is to compute the size
of each of the 8 pieces in the diagram; a specific example should help.
Example 2. Of the cars sold during the month of August, 90 had air
conditioning, 100 had automatic transmission, and 75 had power
steering. Five cars had all of these extras. Twenty cars had none of these
extras. Twenty cars had only air conditioning and 60 cars had only
automatic transmission. Ten cars had both automatic transmission and
power steering. How many cars were sold in August?
Filling in the Venn diagram, we have some of the 8 disjoint pieces readily
available to us, and we can fill in all the available information. From the fact that 10 cars
had both AT & PS, we can infer that 5 had only AT & PS, since 5 had all three, and then
we can successively fill in the rest of the diagram. Note that one has to read the
sentences as inclusive unless otherwise stated; for example, "100 cars had AT" means the
size of the set of cars with automatic transmission was 100.
Now we can answer the question, and indeed we could answer any question about any
combination of sets of cars: the number of sold cars is the sum of all the eight pieces,
which equals 205.
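The successive filling-in of the eight pieces can be traced numerically; a sketch in Python (the variable names are ours):

```python
# The eight disjoint pieces of the three-set Venn diagram of Example 2,
# filled in from the given data.
all_three  = 5
at_ps_only = 10 - all_three                          # 10 had both AT and PS
ac_only    = 20
at_only    = 60
ac_at_only = 100 - at_only - at_ps_only - all_three  # AT totals 100
ac_ps_only = 90 - ac_only - ac_at_only - all_three   # AC totals 90
ps_only    = 75 - ac_ps_only - at_ps_only - all_three  # PS totals 75
none       = 20

total = (ac_only + at_only + ps_only + ac_at_only + ac_ps_only
         + at_ps_only + all_three + none)
assert (ac_at_only, ac_ps_only, ps_only) == (30, 35, 30)
assert total == 205
```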
Example 3. In a survey of 75 consumers, 12 indicated they were going to buy a new car,
18 said they were going to buy a new refrigerator, and 24 said they were going to buy a
new oven. Of these, 6 were going to buy both a car and a refrigerator, 4 were going to buy
a car and an oven, and 10 were going to buy a refrigerator and an oven. Two were to
purchase all three items. Once the information has been processed in the picture, all
questions can be answered. For example, we know that 39 people do not intend to buy
anything.
Example 4. In an eccentric way to organize his research company, the owner boss
decides to assign himself as #1, and then his 99 employees will be assigned a number
from 2 through 100. An employee will then be subservient to all other employees whose
numbers are factors of his/her number; thus, employee #6 would be subservient to the
boss (#1) and to both employees #2 and #3. Of course, everybody is subservient to the
boss.
The boss would like to know how many employees there are going to be responsible just
to him. Easily, an employee has no boss but the boss exactly when their number is a
prime. That will occur exactly when it is not a multiple of 2, 3, 5 or 7. First we will count
all the numbers that are not multiples of 2, 3 nor 5. Let E stand for the set of multiples of
two between 1 and 100, let T stand for the multiples of three, and let F stand for the
multiples of five. What the original question asked for was the number of elements in
region 8, the region outside all three sets (the complement of E ∪ T ∪ F). From the
picture, or by simple logic, this is equivalent to counting E ∪ T ∪ F and then subtracting
that total from 100. We use inclusion-exclusion to count E ∪ T ∪ F. Certainly E has 50
elements, since ⌊100/2⌋ = 50, while T has ⌊100/3⌋ = 33 elements and F, ⌊100/5⌋ = 20
elements, where we let ⌊x⌋ denote the largest whole number not exceeding x. How does a
number get to be in E ∩ T? By being a multiple of 2 and a multiple of 3, in other words a
multiple of their least common multiple, 6, and thus E ∩ T has ⌊100/6⌋ = 16 elements.
Before we finish counting the primes below 100, we need to include another set: S, the
multiples of seven, and thus we need to understand inclusion-exclusion for four sets.
Unfortunately, a Venn diagram cannot be drawn for four or more sets, but a table similar to
the one built above can always be made. However, we have yet another approach to it,
and that is the use of recursion.
Suppose we had now four sets: A, B, C, and D. We use what we know: let C′ = C ∪ D.
Then we have
|A ∪ B ∪ C ∪ D| = |A ∪ B ∪ C′| = |A| + |B| + |C′| − |A ∩ B| − |A ∩ C′| − |B ∩ C′| + |A ∩ B ∩ C′|.
From before we know |C′| = |C| + |D| − |C ∩ D|. What about
|A ∩ C′|? Consider the Venn diagram picture for just these two
sets:
A ∩ C′ = A ∩ (C ∪ D) = (A ∩ C) ∪ (A ∩ D).
This is a distributive law, and if you say it in words (ands and ors) you will convince
yourself that it is true. Similarly,
B ∩ C′ = B ∩ (C ∪ D) = (B ∩ C) ∪ (B ∩ D)
and
A ∩ B ∩ C′ = A ∩ B ∩ (C ∪ D) = (A ∩ B ∩ C) ∪ (A ∩ B ∩ D).
But then
|A ∩ C′| = |A ∩ C| + |A ∩ D| − |A ∩ C ∩ D|
since A ∩ A = A (so (A ∩ C) ∩ (A ∩ D) = A ∩ C ∩ D), and similarly,
|B ∩ C′| = |B ∩ C| + |B ∩ D| − |B ∩ C ∩ D|,
|A ∩ B ∩ C′| = |A ∩ B ∩ C| + |A ∩ B ∩ D| − |A ∩ B ∩ C ∩ D|.
Substituting, we then have
|A ∪ B ∪ C ∪ D| = |A| + |B| + |C| + |D|
  − |A ∩ B| − |A ∩ C| − |A ∩ D| − |B ∩ C| − |B ∩ D| − |C ∩ D|
  + |A ∩ B ∩ C| + |A ∩ B ∩ D| + |A ∩ C ∩ D| + |B ∩ C ∩ D| − |A ∩ B ∩ C ∩ D|.
Note that there is a very nice symmetry in the formula, which makes it quite easy to
remember. We will not bother generalizing it to n sets, but everybody should know that,
as we mentioned above, the general rule is called the inclusion-exclusion
principle, and we have just seen a couple of instances of it.
|E ∪ T ∪ F ∪ S| = |E| + |T| + |F| + |S| − |E ∩ T| − |E ∩ F| − |E ∩ S| − |T ∩ F| − |T ∩ S| − |F ∩ S|
  + |E ∩ T ∩ F| + |E ∩ T ∩ S| + |E ∩ F ∩ S| + |T ∩ F ∩ S| − |E ∩ T ∩ F ∩ S|.
And we get: |E ∪ T ∪ F ∪ S| = 50 + 33 + 20 + 14 − 16 − 10 − 7 − 6 − 4 − 2 + 3 + 2 + 1 + 0 − 0 = 78.
The complement then has 22 elements, but we have included 1 and excluded 2, 3, 5 and 7,
and so the true answer is that there are 25 primes below 100. The list of employees with
only the owner as their boss is
{2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97}.
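The whole computation can be checked by brute force; a sketch in Python (the helper names are ours):

```python
# Employees 2..100 whose only boss is #1 are exactly those with prime
# numbers; compare the brute-force count with inclusion-exclusion.
def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

primes = [n for n in range(2, 101) if is_prime(n)]
assert len(primes) == 25

multiples = lambda d: 100 // d   # the floor ⌊100/d⌋ used in the text
assert (multiples(2), multiples(3), multiples(5), multiples(7)) == (50, 33, 20, 14)

union = (50 + 33 + 20 + 14
         - multiples(6) - multiples(10) - multiples(14)
         - multiples(15) - multiples(21) - multiples(35)
         + multiples(30) + multiples(42) + multiples(70) + multiples(105)
         - multiples(210))
assert union == 78               # |E ∪ T ∪ F ∪ S|, as in the text
```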
So far in this chapter we have learned how to count some sets based on information from
other sets. The construction was basically that of unions, or in linguistic terms, they
were ors. Now we will address the issue of how to count ands.
We start with the set construction that will aid us: the product of two sets. Let A and B
be sets. Then a new set, A cross B, written A × B, can be made out of the ordered pairs
with the first coordinate coming out of A and the second coordinate coming out of B.
For example, if, as before, A = {1,2,3,4} and B = {3,4,5,6,7}, then A × B consists of the 20
pairs:
(1,3) (1,4) (1,5) (1,6) (1,7)
(2,3) (2,4) (2,5) (2,6) (2,7)
(3,3) (3,4) (3,5) (3,6) (3,7)
(4,3) (4,4) (4,5) (4,6) (4,7)
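The same table falls out of Python's standard library; a quick sketch:

```python
from itertools import product

A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7}

# A × B: ordered pairs, first coordinate from A, second from B.
pairs = list(product(A, B))
assert len(pairs) == len(A) * len(B) == 20
assert (1, 3) in pairs and (3, 1) not in pairs   # order matters
```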
Similarly, we could build the cross product of three sets, A × B × C, which consists of the
ordered triples out of A, B and C respectively, meaning the first coordinate is from A,
the second from B and the third from C.
Example 1. Suppose that two players are going to play five games of a specific
diversion or sport. We are interested in recording the winner of each game. How
many ways are there of doing this? Let's call the players A and B. It is clear here that each
individual element of the set to be counted consists of five games, and we will adopt
these as the stages of our development. We could visualize it as ordered five-tuples or as
a tree. In order to be efficient, we will write, for example, ABBAA for (A, B, B, A, A).
Starting at the beginning, the first game can be won by either A or B, so that is our first
branching. At each node of our tree there are going to be two new branches, so we
can easily keep track of how many nodes there are at each level, and since at the end each
terminal node is going to be at level 5 (all branches will have 5 individual branches), all
we will need to know is how many terminal nodes there are. Well, we know there are two
nodes at level 1, and each will give rise to two new nodes, so there will be four nodes at
level 2 (2 + 2), each of which will in turn give rise to two other ones at level 3. So
at this level there will be 8 nodes (2 + 2 + 2 + 2, or better yet 2³). Continuing in this
fashion we get 16 nodes at level 4 and finally we have a total of 32 branches (or nodes at
level 5). The corresponding tree diagram can be drawn; the point, again, of why trees are
useful is that there are as many full branches as there are terminal nodes.
The whole key here, to emphasize again, was that the number of branches at each stage
was independent of where in the tree you were. Although that property is nice,
what is truly essential is that at each level every node has the same
number of branches coming out of it. (It did not matter that all levels
have two branches coming out, but again we repeat, what was
crucial was that at each level all nodes have the same number of
branches coming out.) That way we can keep multiplying.
Note also that we could easily count the number if there were 7 games, or 9 games, or 15
games. Namely, if there are n games, the answer is 2ⁿ.
Often, one of the hardest things to do in mathematics is to realize what you have solved
when you have worked out a problem. Suppose I were to ask you: how many subsets
does {1,2,3,4,5} have? Can you work it out? We already have. In order to build a subset,
you have five decisions to make: whether to put 1 in the subset or not, whether to put 2 in
the subset or not, whether to put 3 in or not, and the same with 4, and the same with 5.
Let's say A corresponds to not putting it in, B to putting it in. Then look at any of the 32 ways to
play five games; for each of them we can find a corresponding subset: for example,
AAAAA corresponds to the empty set, ∅, BBBBB corresponds to the whole set
{1,2,3,4,5}, BAABA corresponds to {1,4} while ABBAB corresponds to {2,3,5}. And the
correspondence goes in reverse too: the subset {1,3,4} corresponds to BABBA. Is it then
clear that there are exactly 32 subsets of {1,2,3,4,5}? And just as we generalized the
games idea, we have that
if X has n elements, then X has 2ⁿ subsets,
of course, including the empty subset and X itself.
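The correspondence between A/B words and subsets can be spelled out in a few lines of Python (names are ours):

```python
from itertools import product

# Each subset of {1,2,3,4,5} corresponds to a string of five A's
# (leave it out) and B's (put it in), as in the text.
X = [1, 2, 3, 4, 5]
subsets = []
for word in product("AB", repeat=5):
    subsets.append({x for x, c in zip(X, word) if c == "B"})

assert len(subsets) == 2 ** 5 == 32
assert set() in subsets and {1, 4} in subsets   # AAAAA and BAABA
```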
It turns out that many counting problems can be worked out by suitable trees where the
number of branches coming out of a node depends only on the level of the node, not on
which node it is nor on the past history of the tree. Before we get more abstract, let's look at
another example.
Second, although we had different choices depending on which branch of our tree we
were on, the number of choices was independent of whom we had chosen. This is a crucial
point. For suppose we had a constraint such as: if Mr. Alberts is elected president, then
Mrs. Chan will not serve as treasurer; then our whole approach is down the drain and we
are reduced to counting branches by just drawing them all. (Actually this is not quite true,
but only for the time being.)
Third, to visualize the set as a collection of triples out of A, B, C, D and E, we must require
that the triples have no repeated entries, and thus this description is harder than the tree.
Before we lose track of it later on, let's enunciate the very important counting principle
we have been using. It is usually referred to as the second counting principle or the
rule of product:
if when building the elements of a set one has t clearly differentiated
decisions (or stages, or levels), and the number of options at each
stage depends only on what stage of the process we are in and not on
the choices made at the previous stages, then the total number
of elements in our set is the product n₁ × n₂ × ⋯ × nₜ, where n₁ is the
number of choices at the first stage, n₂ the number of choices at the
second stage, etcetera.
Example 3. In how many ways can we arrange n people in a line (to pose for a picture)?
We have n choices for our first position, or first stage. For the next stage we only have
n − 1 choices, and for the third stage we have exactly n − 2, and so on, until at the last
stage, the nth stage, we have only one option (whoever is left goes last in the line). Can
you visualize the tree? So the number of ways is n × (n − 1) × (n − 2) × ⋯ × 1, which is
called n factorial, and is denoted by n!. This is one of the functions that will be most
important throughout the course.
n 1 2 3 4 5 6 7 8 9 10
n! 1 2 6 24 120 720 5040 40320 362880 3628800
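The table is easy to regenerate by following the stages of the tree; a sketch in Python (the function name is ours):

```python
import math

# n! = n × (n−1) × ··· × 1: the number of ways to line up n people.
def factorial(n):
    result = 1
    for stage in range(n, 0, -1):   # n choices, then n−1, and so on
        result *= stage
    return result

table = [factorial(n) for n in range(1, 11)]
assert table == [1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800]
assert factorial(0) == 1            # the empty product
assert all(factorial(n) == math.factorial(n) for n in range(11))
```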
What is 0! then? If we see how we get each column of the table from the previous one, we
see that n! = n × (n − 1)!; going backwards, 1 = 1! = 1 × 0!, and so 0! = 1.
But note that we could not succeed in defining (−1)!, since 0! = 0 × (−1)! would require
something multiplied by 0 to produce 1.
There are two cautions associated with the second counting principle: first, you must
have the same number of branches coming out of each node at any given level;
second, you have to be able to differentiate between your levels. We first give
an example addressing the first caution.
Example 4. Two players are going to play a game until one of them has
won 3 hands or sets. How many different ways are there of playing the match? It is clear
here that each individual element of the set to be counted consists of several ‘hands’.
You will get no argument from anybody on the clumsiness of the procedure. How would
you like to do 4 games (i.e., best out of seven)? Even if we become smarter by just
drawing one half of the tree (notice the symmetry), it is still a painful, and not particularly
elucidating, experience. Of course, a machine loves to do this kind of work, and the
program to get it to do the work is not difficult to come up with. But if all we are
interested in is the number of possible ways (more motivation on that later), there are
smarter ways to do it. These are not yet available at this stage of the course.
A large part of the clumsiness in the last example came from the fact that at each stage
the number of branches coming out of a node depended on the past history of the tree,
thus some of the elements in our set (the twenty ways listed above) had 3 individual
branches while others had 4 or 5.
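Indeed, the program is short; a recursive sketch in Python that walks the whole uneven tree (the names are ours):

```python
# Count, by brute-force tree search, the ways a match can be played
# until one player has won `target` hands; some branches of this tree
# end after 3 hands, others after 4 or 5, which is exactly what made
# the hand count depend on the past history.
def count_matches(target, a=0, b=0):
    if a == target or b == target:
        return 1
    # Two branches at this node: A wins the next hand, or B does.
    return count_matches(target, a + 1, b) + count_matches(target, a, b + 1)

assert count_matches(3) == 20   # the twenty ways mentioned in the text
assert count_matches(4) == 70   # best out of seven
```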
Now for the second caution in using the second counting principle: you have to be able to
differentiate between your stages. Let’s change Example 6 slightly:
Example 7. Let’s instead require that out of the five candidates, a committee of three is
to be chosen. Before we start discussing it in the abstract, let’s write down the answer.
There are 10 ways of choosing such a committee. Here are the 10 of them (using initials
to denote the people): {A,B,C}, {A,B,D}, {A,B,E}, {A,C,D}, {A,C,E}, {A,D,E}, {B,C,D},
{B,C,E}, {B,D,E}, {C,D,E}. One is tempted to use a tree to reason this out, but without
some extra reasoning one may be in trouble. Namely, what are you going to choose for
the first level of your tree? What is your first stage? What decision are you
making? Your answer would probably be: I am choosing the first member of
my committee. But is that clearly spelled out? Suppose your committee was
{A,B,C}. Who did you choose first? The difference between this example and
Example 6 is that before, the three persons to be chosen were specifically
differentiated: there was a president, a secretary and a treasurer; now there are
only three people in a committee, non-differentiated. So how do we get the 10
without listing all the possibilities?
Going in reverse, from each of the 10 committees, there are 3! = 6 ways of choosing a
president, a secretary and a treasurer from it since there are three options for president,
and then two for secretary and then only one for treasurer. Of course, that way we obtain
a total of 60 = 5 × 4 × 3 ways of choosing a president, a secretary and a treasurer.
Remember that what is crucial is that the three members of a committee are
indistinguishable from each other while the board members are all differentiable;
pay attention to this distinction, it is terribly important.
Choosings
The last section ended with how to count the number of committees one could make out
of 5 candidates if the committee were to consist of 3 of those candidates. In general, let n
and k be nonnegative integers. Then the number of subcollections (or committees) of size
k (or k-subsets) from a collection of n objects or candidates (an n-set) is called n choose
k and is denoted by C(n, k) (in print, often written as n stacked over k inside parentheses).
The notation is from the 19th century. Other notations abound, including nCk.
Why this name? What you are doing is choosing, out of n friends
that you have, k to come to a party, and you are counting the number of ways of doing
that. Or from n different balls, you are choosing k to put in a bucket.
From the previous section we saw C(5, 3) = 10 (it reads 5 choose 3) since there were 10
subsets of a 5-set that had 3 elements. We will revisit this below.
k people out of n eligible candidates. The numbers C(n, k) are also called binomial
coefficients (the reason for this name will be clarified later),
and many of us find them among the most charming of
numbers.
We do the example with n = 5 and k = 2 in the discussion below. We are starting with
five balls of different colors, and we are going to choose two to go into a bucket (for
whatever reason).
A similar computation yields C(5, 3) = 10. Is this a coincidence? No. Think about it this
way: anytime you choose 2 balls to put in the bucket, you are automatically choosing 3 not
to be placed so.
Hence, the number of ways of choosing 3 out of 5 is the same as the number of
ways of choosing 2 out of 5 (compare pictures).
Of course, if k > 5, C(5, k) = 0, since there is no way to select more than 5 balls from the
collection. Finally, since every collection of balls is accounted for, we must have:
C(5, 0) + C(5, 1) + C(5, 2) + C(5, 3) + C(5, 4) + C(5, 5) = 2⁵ = 32,
since that is the total number of subcollections. In fact, 1 + 5 + 10 + 10 + 5 + 1 = 32.
The same observations from the example can be generalized to arbitrary numbers:
C(n, k) = C(n, n − k), since choosing which k elements to take is the same as choosing
which n − k to leave behind, and
C(n, 0) + C(n, 1) + C(n, 2) + ⋯ + C(n, n) = 2ⁿ,
since the subsets of an n-set are all accounted for, classified by
their size.
More importantly, there is a very nice recursion that they satisfy:
C(n, k) = C(n − 1, k − 1) + C(n − 1, k).
This recursion is due to Pascal, the very bright yet never fully developed French
mathematician of the 17th century.
With this recursion, together with the conditions before the theorem, we can now build
the table of binomial coefficients. Although this table has been known to many people
and many cultures from way before the time of Pascal, it is known, in the Western world
at least, as Pascal's triangle. We are going to let n index the rows of our array while k will
index the columns. We will start with n = k = 0 and grow from there. By the conditions
before the theorem we know our array starts with 1's in the first column and on the main
diagonal, with zeros above the main diagonal.
But now with the recursion we can fill in the rest of the array. Namely, to fill a new row,
one adds to each position the entry just above it and the one above and to the left:
n\k 0 1 2 3 4 5 6 7 8 9 10
0 1
1 1 1
2 1 2 1
3 1 3 3 1
4 1 4 6 4 1
5 1 5 10 10 5 1
6 1 6 15 20 15 6 1
7 1 7 21 35 35 21 7 1
8 1 8 28 56 70 56 28 8 1
9 1 9 36 84 126 126 84 36 9 1
10 1 10 45 120 210 252 210 120 45 10 1
Pascal’s Triangle
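The table can be regenerated directly from the recursion; a sketch in Python (names are ours):

```python
# Pascal's triangle from the recursion C(n,k) = C(n-1,k-1) + C(n-1,k),
# with the boundary conditions C(n,0) = C(n,n) = 1.
rows = [[1]]
for n in range(1, 11):
    prev = rows[-1]
    row = [1]
    for k in range(1, n):
        row.append(prev[k - 1] + prev[k])   # above-left plus above
    row.append(1)
    rows.append(row)

assert rows[10] == [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]
assert all(sum(r) == 2 ** n for n, r in enumerate(rows))   # row sums
```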
Observe the example of the recursion in the table. Besides the observations we made
before the theorem, there are many nice features in Pascal’s triangle. One of the
important ones is that the coefficients increase in each row up to the middle, and then,
because of the symmetry, they decrease. For example, the last row in our table went 1, 10,
45, 120, 210, up to 252 (which is the exact middle).
As we saw before, the rows add up to powers of 2. But what about the alternating row
sums (that is, take alternating signs)? If one experiments a bit, it is not too hard to believe
that the alternating sums are always 0 (for example, the last row in our table gives
1 − 10 + 45 − 120 + 210 − 252 + 210 − 120 + 45 − 10 + 1 = 0).
But as with every recursion, sometimes a closed expression is preferable. This formula,
when used wisely, is computationally superior to the recursion, but the key word is
wisely. We have seen the idea behind the closed expression in the previous section when
we looked at the number of committees consisting of 3 people out of a pool of 5
candidates (or equivalently, 3-subsets of a 5-set). We compared that number with the
number of executive boards with President, Secretary and Treasurer out of the same 5
candidates. We are going to try to contrast the committees with the executive boards.
Remember that what is crucial is that the three members of a committee are
indistinguishable from each other while the board members are all differentiable;
pay attention to this distinction, it is terribly important, and later on, it will be very
important. We have counted the number of executive boards: 60 = 5 × 4 × 3, where 5
is the number of choices for our President, 4 the number of choices for our Secretary, and
3 the ones for the Treasurer.
Suppose we don’t know the number of committees, yet we think of these committees as
indexing the rows of a matrix, while the boards index the columns of the same matrix, so
the matrix is m × 60 where m is the number we are looking for (we know it is actually
C(5, 3) = 10, but suppose for a second that we didn't know this). We are going to fill the
matrix with 0’s and 1’s. We put a 1 in a position if the committee corresponding to that
row has the same elements as the board that corresponds to that column, otherwise we put
a 0 in that position. For example, suppose that some row corresponds to the committee
{A,C,D}. Then in the column corresponding to the board DAC (D is President, A is
Secretary and C is Treasurer) we would put a 1, while in the column corresponding to the
board BAC we would not, and instead we would put a 0. One fundamental (and obvious)
fact of life is that if one has a (0,1) matrix, then the number of 1's in it is independent of
whether we counted them by rows or by columns. Let’s first count the ones in our matrix
by columns: clearly every column has only one 1 in it, the one in the row corresponding
to the committee made up of the members of the board corresponding to that column.
Since there are 60 columns in this matrix, there is a total of 60 1's in it. Now let's count
them by rows. Take a row, say the one corresponding to {A, C, D}. How many 1’s are
there in its row? How many boards can be made from the members of this committee?
How many ways can we order the set? 3!=6. Hence every row has 6 1’s and we can
conclude that there are 10 rows since the number of rows times 6 is 60. Isn’t this neat?
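The matrix argument is also easy to carry out by machine; a sketch in Python (the names are ours):

```python
from itertools import combinations, permutations

# The (0,1)-matrix argument: rows are the 3-committees from {A,B,C,D,E},
# columns are (President, Secretary, Treasurer) boards; a 1 marks a
# board made from exactly the members of that committee.
people = "ABCDE"
committees = list(combinations(people, 3))
boards = list(permutations(people, 3))
assert len(boards) == 60

# Count the 1's by columns: each board matches exactly one committee.
ones_by_columns = sum(1 for b in boards for c in committees if set(b) == set(c))
assert ones_by_columns == 60

# Count the 1's by rows: each committee yields 3! = 6 boards.
assert all(sum(1 for b in boards if set(b) == set(c)) == 6 for c in committees)
assert len(committees) == 60 // 6 == 10
```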
Extend that now. Suppose we have a pool of n candidates for a race in which we are
going to keep track of the first k places. Then how many outcomes are there to the race? By the
second counting principle, since we have n choices for our first stage or decision, n − 1
choices for the second, and so on, we have n × (n − 1) × (n − 2) × ⋯ × (n − k + 1) total
choices. Where did that last factor come from? It should be clear that we are going to
have k factors in our result, since there are k stages to our tree development. The first
factor is n − 0, so the last factor should be n − (k − 1) = n − k + 1. This number is easy to
remember if we rewrite it in the form n!/(n − k)!. Suppose we take an arbitrary k-subset of an
n-set; that particular subset gives rise to how many race finishings? How many ways can
we order the k-subset? We know it is k! ways. (Think of the matrix.) Thus each k-subset
of our n-set, and we know there are, by choice, C(n, k) of them, gives rise to k! of the race
finishings.
If we look at the ratio of two consecutive terms in that row, C(100, k + 1) divided by
C(100, k), we get
( 100! / ((k + 1)!(100 − k − 1)!) ) / ( 100! / (k!(100 − k)!) ),
which equals (100 − k)/(k + 1). This means that as we move along the row from k
to k + 1, we are multiplying by 100 − k and dividing by k + 1, which makes moving along a
row much easier.
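The multiply-and-divide trick turns the whole row into a short loop; a sketch in Python:

```python
import math

# Walk along row 100 of Pascal's triangle: from C(100, k) to
# C(100, k+1), multiply by (100 − k) and divide by (k + 1).
row = [1]
for k in range(100):
    row.append(row[-1] * (100 - k) // (k + 1))

assert len(row) == 101
assert all(row[k] == math.comb(100, k) for k in range(101))
```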
Example 1. How many ways can we toss a coin 5 times so that exactly three heads
appear? From the 32 ways of tossing the coin 5 times (2 × 2 × 2 × 2 × 2), three spots have to
be designated for the heads and the remaining 2 for tails, so there are C(5, 3) = 10 ways. For
example, the subset {2,3,5} corresponds to THHTH.
By now we have developed our counting tools, we have ors (or unions), and we have
ands (or stages) and then we have choosings. The power comes from knowing which
to use when. We will come back later to this subject.
answer is simple: C(10, 4)·C(8, 3), because we must choose the males in the committee and we
must choose the females in the committee. And so our total answer is 210 × 56 = 11,760.
What would the answer be if instead all that is required of the committee (of still 7
people) is that at most 3 females serve? That means that either 3, or 2, or 1, or no females
serve in the committee, and so we have the answer to be:
C(10, 4)·C(8, 3) + C(10, 5)·C(8, 2) + C(10, 6)·C(8, 1) + C(10, 7)·C(8, 0)
= 210 × 56 + 252 × 28 + 210 × 8 + 120 × 1 = 20,616.
Think of the six blanks that we are going to fill in with the letters {R, O, B, E, R, T}:
_ _ _ _ _ _. Of those 6 blanks, 2 have to go to R's. Choose those 2, which we have
C(6, 2) = 15 ways of doing. Then choose the spot for the O: any of 4 ways. Then we have 3
spots left to put the B, 2 for the E and 1 for the T. Thus the number of anagrams is:
15 × 4 × 3 × 2 × 1 = 360.
You can think of obtaining the result in terms of activities: you have six blanks: 2 of them
are going for R, 1 for O, 1 for B, 1 for E and 1 for T.
Of course, one can have variations on the questions. How many anagrams of ROBERT are
there where the vowels are together? Put the vowels together: there are 2 ways of doing
that; then think of them as one letter, V, and we have anagrams of RBRTV, of which
there are 60, so our answer is 120.
Another variation: how many anagrams are there of ROBERT where the vowels come in
alphabetical order (not necessarily together)? There are 360 anagrams, and they come in
pairs that agree on where the vowels occur; only one of each pair has the vowels
in order, so the answer is 360/2 = 180.
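All three anagram counts can be checked by brute force; a sketch in Python:

```python
from itertools import permutations

# Distinct anagrams of ROBERT: 6!/2! because of the repeated R.
anagrams = {"".join(p) for p in permutations("ROBERT")}
assert len(anagrams) == 360

# Variation 1: the vowels O and E are adjacent.
together = [w for w in anagrams if "OE" in w or "EO" in w]
assert len(together) == 120

# Variation 2: the vowels appear in alphabetical order (E before O
# would be alphabetical; the text asks for O, E in word order O < E
# meaning vowels in the order they appear in ROBERT is irrelevant,
# so we count words where E precedes O -- exactly half of them).
in_order = [w for w in anagrams if w.index("E") < w.index("O")]
assert len(in_order) == 180
```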
If we proceed without any thought, we can just multiply this out. Not only boring, but
very inefficient. Instead, let's think about the process of the multiplication and the
powerful distributive law. What we are doing then is taking one of the two
summands from each one of the factors in order to get one of our terms. So each of
our terms is of the form xⁱyʲ where i and j are nonnegative integers and i + j = 6. How
many terms do we have? In order to build a term, we have six stages or decisions (one for
each of the factors) and two options for each of the decisions, so we will have 2⁶ = 64
terms. One term is, clearly, x⁶, which comes up by taking an x from each of the factors,
and that is the only way to obtain x⁶. But another term is xyxyxx (we are indicating by
23
the position what each factor contributed) which equals x 4 y 2 . But when we collect terms
to simplify, x 4 y 2 occurred several times; we have just seen one of those occurrences.
Another is yxxxxy . In total, how many times does x 4 y 2 occur? We have to decide which
of the six factors will contribute y's, the others will contribute x's. So we have to choose 2
out of 6, so there are 6 = 15 terms that equal x 4 y 2 , so its coefficient is 15. This is why
2
these numbers are called binomial coefficients. If we just extend our reasoning to all the
terms, we get
( x + y) 6 = x 6 + 6 x 5 y + 15x 4 y 2 + 20x 3 y 3 + 15x 2 y 4 + 6 xy 5 + y 6 .
Of course, since this is an algebraic identity, it is valid for arbitrary x and y. Thus, if what
we wanted was (2a − b)^6, then all we would have to do is substitute x by 2a and y by
−b (watch that minus sign), in order to obtain

(2a − b)^6 = 64a^6 − 192a^5b + 240a^4b^2 − 160a^3b^3 + 60a^2b^4 − 12ab^5 + b^6.
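Both sets of coefficients can be reproduced with `math.comb` (a sketch; the list names are ours):

```python
from math import comb

# Coefficients of (x + y)^6 from the binomial theorem.
coeffs = [comb(6, i) for i in range(7)]
print(coeffs)  # [1, 6, 15, 20, 15, 6, 1]

# Substituting x -> 2a, y -> -b gives the coefficients of (2a - b)^6.
signed = [comb(6, i) * 2 ** (6 - i) * (-1) ** i for i in range(7)]
print(signed)  # [64, -192, 240, -160, 60, -12, 1]
```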
In general, for any exponent n, the Binomial Theorem states

(x + y)^n = Σ_{i=0}^{n} C(n, i) x^(n−i) y^i.

The sigma notation for sums can be a bit intimidating, but by just writing the terms one
by one all fears are conquered:

(x + y)^n = x^n + C(n,1) x^(n−1) y + C(n,2) x^(n−2) y^2 + C(n,3) x^(n−3) y^3 + ... + y^n.
Suppose we now vary our original question, and ask you to compute (x + y + z)^6 instead.
So we would be looking at

(x + y + z)(x + y + z)(x + y + z)(x + y + z)(x + y + z)(x + y + z).
Here the expansion would be really tedious. But the reasoning we did in our previous
considerations is still valid. Namely, as we expand this product, in order to build a term
we take one summand out of each of the factors; thus our terms are of the form x^i y^j z^k
where i, j and k are nonnegative integers and i + j + k = 6. For example, x^2 y^3 z and x^4 z^2
are both terms. How many terms will we have? 6 stages, 3 options for each decision give
us a total of 3^6 = 729 terms!
How many of those terms collect to x^i y^j z^k? First we choose which i of the six factors
contribute the x's: C(6, i) ways. Then, out of the remaining 6 − i factors, we have to decide
which contribute the y's: C(6 − i, j) ways, and finally, for the z's: C(6 − i − j, k) = 1 way,
since i + j + k = 6. Hence the coefficient of x^i y^j z^k is

C(6, i) × C(6 − i, j) × C(6 − i − j, k) = 6!/(i!(6 − i)!) × (6 − i)!/(j!(6 − i − j)!) × (6 − i − j)!/(k! 0!) = 6!/(i! j! k!)

by canceling. This is a very satisfying expression.
These coefficients are then called multinomial coefficients, as we saw before. What we
are doing in the multinomial coefficients is choosing specific numbers of friends for each
of several different activities.
We finish computing (x + y + z)^6:

(x + y + z)^6 = x^6 + y^6 + z^6 + 6x^5y + 6x^5z + 6xy^5 + 6xz^5 + 6y^5z + 6yz^5 +
15x^4y^2 + 15x^4z^2 + 15x^2y^4 + 15x^2z^4 + 15y^4z^2 + 15y^2z^4 +
30x^4yz + 30xy^4z + 30xyz^4 + 20x^3y^3 + 20x^3z^3 + 20y^3z^3 +
60x^3y^2z + 60x^3yz^2 + 60x^2y^3z + 60x^2yz^3 + 60xy^3z^2 + 60xy^2z^3 +
90x^2y^2z^2.
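The multinomial coefficients 6!/(i! j! k!) are easy to compute, and summing them over all exponent triples must account for all 3^6 raw terms (a Python sketch; the helper name is ours):

```python
from math import factorial

def multinomial(i, j, k):
    """Coefficient of x^i y^j z^k in (x + y + z)^(i+j+k)."""
    return factorial(i + j + k) // (factorial(i) * factorial(j) * factorial(k))

print(multinomial(4, 1, 1))  # 30, the coefficient of x^4 y z
print(multinomial(2, 2, 2))  # 90, the coefficient of x^2 y^2 z^2

# The coefficients over all triples with i + j + k = 6 must sum to 3^6 = 729.
total = sum(multinomial(i, j, 6 - i - j) for i in range(7) for j in range(7 - i))
print(total)  # 729
```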
We will do many more applications of our counting ability in the applied area of
probability in the next chapter.
As we finished the last section, the idea of counting was to be used in the essential notion
of probability. Indeed, the first stated realization that mathematics, specifically
counting, had a role to play in games of chance concerned the rolling of one die. There
we have six equally likely ways to roll a die, so the probability that one rolls a specific
value is 1/6. Not much depth in this analysis, yet it was not stated until the 16th century,
despite people gambling for at least 10,000 years.
Although the largest regular polyhedron one could make has only 20 sides (the
icosahedron), one can always make a dreidel-like object with arbitrarily many sides to
simulate the random choosing of a number from 1 to the number of sides. If that is the
case, then the probability of choosing a specific number will of course be 1/n, where n is
the number of sides. Thus, we could conceive of a dreidel with 1,000 sides, and so when
we roll (or spin) it, the probability of landing on any one side is 1/1000. And then we
could ask what is the probability that we spin a number which is at most 100, and the
obvious answer is 100/1000 = 1/10.
The volume of the large cone is (1/3)π(2r)^2 h, while the volume of the top half is only
(1/3)πr^2 (h/2), one eighth as much.
The previous paragraphs are being used to motivate the need to review some basic
geometric concepts regarding measurement, such as length, angle, distance, area and
volume.
Of course, length is very much associated with counting. Once we have a unit of linear
measure, we count how many times it fits around the room, or whatever we are
attempting to take the length of, and we have arrived at an estimate. Of course, fractions
and eventually real numbers occur naturally in this context.
One can only use common sense to speculate what area is the
oldest in mankind’s memory—but the winner, one can
conjecture, must be the rectangle: the base × height formula
for the area could easily be deduced via multiplication from
brick laying or tiling examples. Again, note the intimate connection to counting.
The next area, after the rectangle, to be computed was, probably, that of
a parallelogram, which is also
base × height.
That this was done early follows from the easy rearrangement of any
parallelogram into a rectangle.
And then the triangle could not be far behind since two of them make a parallelogram:

Area = (1/2) base × height.
Above we discussed the area of a circle—but that computation is certainly much younger
than the others, and it probably stems from the fact that the area of a circle should be half
the circumference times the radius, as the pictures exemplify.
Much more recent than all the previous, and much more relevant to our course, are the
notions from calculus for the calculation of areas and volumes developed by the great
Leibniz and Newton, namely the crucial idea of integration.
Chapter 1
Odds on Favorites
① Probability—the Basic Rules
One of the original motivations behind counting was the beginning of the taming of uncertainty
that occurred in the 16th and early part of the 17th century. Why it took so long to develop even
to the extent it did in those early centuries is indeed interesting, but not for us to speculate
(gambling is very old indeed). What is relevant to us is that by 1600 it was reasonably clear
in many people's minds what some aspects of probability were about. But let's proceed by
example, a historical one. Galileo himself was posed this question, and as usual he analyzed it
correctly.
Example 1. Suppose we are going to play the following game (those early years were mostly
concerned with gambling questions; as mentioned above, gambling is old, definitely
thousands of years old):

we roll 3 dice; if a 9 shows up, I pay you $1; if a 10 shows up, you
pay me $1; if anything else shows up, we roll again.

Naturally you are mistrustful since I am proposing the game, but how do you know I am not the
foolish one for offering it to you, or better, that it may be a fair game and you are just missing the
opportunity to have fun? Of course, if you are just going to play one hand, no calculation is really
necessary and you are just going to make your decision based on your mood, who makes the
offer, etc. But suppose you intend to do this for three hours every Saturday for the next three
years (we live in the age of individual preference). At first thought it seems like a
reasonable game: one can obtain a 9 by rolling

1+2+6, 1+3+5, 1+4+4, 2+2+5, 2+3+4 or 3+3+3,

and a 10 by rolling

1+3+6, 1+4+5, 2+2+6, 2+3+5, 2+4+4 or 3+3+4.
At first thought it seems like a fair and reasonable game. Both numbers can be rolled in six
different ways, as the two lists of possibilities indicate. But what is the logic behind this
attempt? It has something to do with the number of ways of doing something: if one thing
has more ways of occurring than another, then it is more likely to occur. After all, nobody
would play the previous game if the competition were between rolling a 3 and rolling a 10,
since intuitively one feels that a 3 is much rarer than a 10.
Although there is common sense behind this, it is not quite correct; it needs to be improved
upon. The first basic principle that we are going to use for our probability calculations is:

suppose an activity or experiment is to be performed, and we have
equally feasible outcomes; then the probability for a given event to
occur is the number of outcomes that give the desired event divided
by the total number of outcomes.¹
But extra emphasis needs to be placed on the premise of the principle: one must first reduce the
outcomes of the activity to equally feasible outcomes². Then you can start looking at the
probability of the event that you are interested in. Going back to the game in question: What is
the activity in this example? Rolling 3 dice. What are the outcomes? It seems acceptable to say
that the outcomes are, in addition to the two lists above, the corresponding lists for the sums
3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17 and 18, and thus there would be a total of 56
outcomes, so the probability of a 9 would be 6/56, and a 10 would have the same probability.
¹ This is the first enunciated principle in the theory of probability, and the simplest one. As simple as it is, it
was not stated clearly until the 16th century, by the inimitable Cardano, great scholar and scoundrel.

² What equally feasible outcomes are can be in itself a polemic. How do you know a coin is fair? But we will
be naive about the subtleties of statistical analysis, and only insist that, from what we know, we can
honestly claim that the outcomes we are taking are equally feasible.
However, if we apply the same reasoning, then the probability of a 3 is 1/56, and a 4 has the
same probability. So if we keep rolling the three dice for a long time, the number of 3's
occurring should roughly be the same as the number of 4's. It does not take much
experimentation to start doubting our premise, and maybe we should question why we
labeled those outcomes as equally feasible. So let's rethink a bit. Is a roll of 1+1+1 as
equally feasible as a roll of 1+1+2?

Suppose we had a yellow die, a white die and a blue die. Then, to roll a 3, each die would
need to show a 1; but to get a 4, there are three ways, since any of the three dice could
show the 2 while the other two show 1's. It seems like we have some more choices in the
latter situation. Three times as many, actually.
A way out of the quagmire is to take for our outcomes the 216 different ways there are to roll
three dice if one of them is yellow, another white and the third blue. We get the 216 from
6 × 6 × 6. Nobody can argue about the equal feasibility of these 216 outcomes. So we start now
from there. How many ways can we roll a 3? As before, only one way, so the probability is
1/216, not 1/56. But how many ways can we roll a 4? Three ways, as we saw above, so the
probability of a 4 is 3/216. So 4's should occur about three times more often than 3's.

Let's go back to the 9 and the 10 of our game. Of the 216 ways, how many ways can we roll
a 2, a 3 and a 4 (one on each die)? Easily: we have three decisions, which die shows the 2,
which the 3 and which the 4: 3 × 2 × 1 = 6 ways.
Roll of 9   # of Ways      Roll of 10   # of Ways
1+2+6       6              1+3+6        6
1+3+5       6              2+2+6        3
1+4+4       3              1+4+5        6
2+2+5       3              2+3+5        6
2+3+4       6              2+4+4        3
3+3+3       1              3+3+4        3
Total       25             Total        27

So a 10 has probability 27/216 while a 9 has probability 25/216: the game favors whoever bets
on the 10, just as Galileo concluded.
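The counts of 25 and 27 can be verified by enumerating all 216 equally feasible ordered rolls (a brute-force Python sketch; the names are ours):

```python
from itertools import product

# All 216 ordered rolls of three distinguishable (yellow/white/blue) dice.
rolls = list(product(range(1, 7), repeat=3))
assert len(rolls) == 216

nines = sum(1 for r in rolls if sum(r) == 9)
tens = sum(1 for r in rolls if sum(r) == 10)
print(nines, tens)  # 25 27
```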
One of the common errors made in the past by mathematicians (including some of the best,
like Leibniz, D'Alembert and others) is that of presuming equally feasible outcomes to an
experiment without further analysis. In this course we will not have the chance to get too deep
into this subject, but remember always to be careful to set up the outcomes of the experiment
before you start asking about the event that you are interested in, and try to analyze your
outcomes so that they seem, as best as you can tell, equally feasible.
We can enhance the understanding of the previous example by introducing the fundamental
concept of a random variable:

When performing an activity that can lead to different numerical outputs, one is
in the midst of a random variable X.
In the previous example the activity is rolling three dice, and what we are interested in is the
sum of the roll. Therefore the outcomes of our activity consist of the numbers 3 through 18, and
these are the values that our random variable can take. The (probability) distribution of the
variable is the set of probabilities associated with each of the possible outcomes, and one way
to represent the variable is by a simple table such as:
X   3      4      5      6       7       8       9       10      11      12      13      14      15      16     17     18
P   1/216  3/216  6/216  10/216  15/216  21/216  25/216  27/216  27/216  25/216  21/216  15/216  10/216  6/216  3/216  1/216
≈   .0046  .0138  .0277  .0463   .0694   .0972   .1157   .125    .125    .1157   .0972   .0694   .0463   .0277  .0138  .0046
Note that quite a bit of computation went into the table. [A bar chart of the distribution, rising
from .0046 at 3 to .125 at the sums 10 and 11 and falling symmetrically back down, gives a
graphical representation.] We make two key observations about the distribution of a random
variable: each of the probabilities is a number between 0 and 1, and together they add up to 1.
Note that having the distribution of the random variable allows us to answer a variety of
questions. For example, we could ask what is the probability of rolling at least a 15, which is
given by

P(X ≥ 15) = P(X = 15) + P(X = 16) + P(X = 17) + P(X = 18) = 20/216 ≈ 9.26%.

As with counting, the reason we are allowed to add the probabilities is that they are disjoint
events; at no time can two of them occur simultaneously.
Observe that if A, B and C are events, any two of which are disjoint, then

P(A ∪ B ∪ C) = P(A ∪ B) + P(C) = P(A) + P(B) + P(C).
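The whole three-dice distribution, and the P(X ≥ 15) computation, can be double-checked by brute force (a Python sketch; the names are ours):

```python
from itertools import product
from collections import Counter

# Distribution of X = sum of three dice over the 216 ordered rolls.
counts = Counter(sum(r) for r in product(range(1, 7), repeat=3))
print([counts[s] for s in range(3, 19)])
# [1, 3, 6, 10, 15, 21, 25, 27, 27, 25, 21, 15, 10, 6, 3, 1]

# P(X >= 15) adds the disjoint outcomes 15 through 18.
p_at_least_15 = sum(counts[s] for s in range(15, 19)) / 216
print(round(p_at_least_15, 4))  # 0.0926
```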
Example 2. Similarly to the roll of three dice, only easier, we can state that if X is the roll of
two dice, then its distribution is given by
X   2     3     4     5     6     7     8     9     10    11    12
P   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
[A bar chart of this distribution, peaking at 6/36 for the sum 7, accompanies the table.]
In the game of CRAPS, one wins on the first roll by rolling a 7 or an 11, so the probability of
winning on the first roll is 8/36 ≈ 22.22%. In the same game one loses on the first roll by
rolling a 2, 3 or 12, which has probability 4/36 ≈ 11.11% of occurring, and that is the
likelihood of losing on the first roll.
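The two-dice distribution and the CRAPS first-roll probabilities follow from the same enumeration idea (a Python sketch; the names are ours):

```python
from itertools import product
from collections import Counter

# Distribution of the sum of two dice over the 36 ordered rolls.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

win = (counts[7] + counts[11]) / 36            # 7 or 11 wins on the first roll
lose = (counts[2] + counts[3] + counts[12]) / 36  # 2, 3 or 12 loses on the first roll

print(round(win, 4), round(lose, 4))  # 0.2222 0.1111
```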
Example 3. Suppose that a happily married couple has two children. How likely is it that they
will have one of each sex? D'Alembert incorrectly analyzed this by saying there were three
outcomes to the experiment: Two Boys, Two Girls and One-of-Each, so the probability of
One-of-Each would be 1/3. Actually, if we assume that a boy being born is as likely as a girl
being born in any given birth³, then there are four equally feasible outcomes: BB, BG, GB
and GG. Of those, 2 give us children of both sexes, so the probability is 2/4 = 1/2. This
estimate conforms to reality much better. Here we could have considered Y to be the random
variable which is the number of boys among the children; the distribution of Y is given by

Y   0    1    2
P   1/4  1/2  1/4
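Enumerating the four equally feasible birth orders settles the dispute directly (a small Python sketch; the names are ours):

```python
from itertools import product

# The four equally feasible birth orders for two children.
families = list(product("BG", repeat=2))          # BB, BG, GB, GG
one_of_each = [f for f in families if set(f) == {"B", "G"}]
print(len(one_of_each) / len(families))  # 0.5
```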
Now we revisit the first day of class. Suppose you walk into a room with people in it. How
many people do there need to be in the room before you should bet that there are two people
with the same birthday? Of course, if there were a thousand people, you would bet; it is a
sure thing (but probably nobody would bet against you). How about if there are only 100
people? 50?
Before we find the surprising answer, let's look at a simple principle from probability. The
size of a set plus the size of its complement is always the size of the universe (first counting
principle). Equivalently, in symbols, if A and B are disjoint sets with their union being all
possibilities, then |A| + |B| = |U|, where U is the universe of possibilities (all of them), and so
|A|/|U| + |B|/|U| = 1. Thus:

³ As it happens, this is not quite correct. One realization came very early, as soon as statistical tables of
birth were gathered in the 1660's: more boys are born than girls, approximately in an 18-to-17 ratio (girls are
more likely to survive, so not so many need to be produced). The other complication is that a given couple,
because of their chemistry, has a certain small factor of repeating the sex of previous children (this factor is small).
④ the probability for a given event to occur equals 1 − the probability that the event will not occur.

Example 4. Let A be the set of people in the room. We want to compute the probability that at
least two of them have the same birthday. What is the activity? We go around the room asking
for people's birthdays. If |A| = m, then there are 365^m outcomes to our experiment (we have
m decisions or stages and 365 choices for each one of them). It seems tricky to compute the
number of outcomes that give at least two people with the same birthday, but the complement
of this set is that no two of them have the same birthday. How many ways can that occur? For
the first person we ask we have 365 choices, but for the second one we have only 364, and for
the third one 363, etc., and this kind of reasoning should sound familiar.

So we can build a table of probabilities. Let p_n denote the probability of no two persons
having the same birthday when there are n people in the room. We can write down an answer
by simple counting, and we obtain

p_n = 365! / ((365 − n)! 365^n).

Unfortunately, this may not be easy to compute, since 365! has more than 500 digits. A much
better way to do the computation is recursively. Namely, at stage 1 we have 365/365; for the
next one we have our previous result times 364/365, and for the next, the previous result
times 363/365, etc. In short,

p_(n+1) = ((365 − n)/365) p_n.

The table gives the values recursively computed. Thus, amazingly, you should be ready to bet
when there are only 23 people in the room!

n    All Diff.   NOT All Diff.
1    1.00000     0.00000
2    0.99726     0.00274
3    0.99180     0.00820
4    0.98364     0.01636
5    0.97286     0.02714
6    0.95954     0.04046
7    0.94376     0.05624
8    0.92566     0.07434
9    0.90538     0.09462
10   0.88305     0.11695
11   0.85886     0.14114
12   0.83298     0.16702
13   0.80559     0.19441
14   0.77690     0.22310
15   0.74710     0.25290
16   0.71640     0.28360
17   0.68499     0.31501
18   0.65309     0.34691
19   0.62088     0.37912
20   0.58856     0.41144
21   0.55631     0.44369
22   0.52430     0.47570
23   0.49270     0.50730
24   0.46166     0.53834
25   0.43130     0.56870
26   0.40176     0.59824
27   0.37314     0.62686
28   0.34554     0.65446
29   0.31903     0.68097
30   0.29368     0.70632
31   0.26955     0.73045
32   0.24665     0.75335
33   0.22503     0.77497
34   0.20468     0.79532
35   0.18562     0.81438
40   0.10877     0.89123
50   0.02963     0.97037
60   0.00588     0.99412
70   0.00084     0.99916
80   0.00009     0.99991
90   0.00001     0.99999

In a typical classroom of 35 students, the odds that at least two have the same birthday are
better than 4 to 1; and at 40 people they are better than 8 to 1. For example, if we look at
the 43 Presidents of the
United States, we should have a high probability of two of them having the same birthday, and
indeed, Polk and Harding were both born on November 2nd. With 50 people the odds are
better than 32 to 1, and if there are as many as 100 people in the room your odds are
astronomical, better than 3,000,000 to 1. In our own Mathematics Department of 45 faculty
members, the odds were better than 10 to 1 that 2 of us would have the same birthday. As it
turns out, there are 3 of us with the same birthday! And that is not very likely.
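The recursive computation of p_n is only a few lines of Python (a sketch assuming 365 equally likely birthdays, as the text does; the names are ours):

```python
# p_n = probability that n people all have different birthdays,
# computed recursively: p_(n+1) = ((365 - n)/365) * p_n, starting from p_1 = 1.
p = 1.0
for n in range(1, 23):
    p *= (365 - n) / 365

print(round(p, 4))      # 0.4927, i.e. p_23: all 23 birthdays different
print(round(1 - p, 4))  # 0.5073: at least one shared birthday among 23 people
```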
There is a more generic way to view the last rule. An event A is a subevent of event B if
whenever A occurs, B must also occur. Of course, every event is a subevent of the
universe. Another example: rolling a 6 on a die is a subevent of rolling an even number.
Suppose now that A is a subevent of B; then "B but not A" is an event on its own, and since
it is disjoint from A, we get

⑤ the probability of an event, but not a given subevent, is the
probability of the event minus the probability of the subevent.

Thus the probability of an odd roll with two dice but not a seven is 18/36 − 6/36 = 12/36.
Since the union of two events A and B can be seen as the disjoint union of three events:

A but not (A and B),
B but not (A and B),
A and B,

we get that

P(A or B) = [P(A) − P(A and B)] + [P(B) − P(A and B)] + P(A and B),

or equivalently we have

⑥ For any events A and B, P(A or B) = P(A) + P(B) − P(A and B).
Example 5. Suppose that in a lot of bolts, it is determined that 10% are too long, 15% are
too wide, and 7% are both too long and too wide. In selecting a bolt at random, what is the
probability that it will be acceptable in both length and width? Easy:

P(too-long or too-wide) = 0.10 + 0.15 − 0.07 = 0.18.

So the probability of acceptance is 1 − 0.18 = 0.82 = 82%.
Example 6. It is estimated that in 70% of all fatal automobile accidents between two cars, at
least one of the drivers was driving under the influence, and in 25% of them both of the drivers
were DUI. In how many of those accidents was exactly one of the drivers not intoxicated?
Easily, 70 − 25 = 45%.
We summarize all the above observations about the distribution of a random variable:

① The probability of an event is always a number between 0 and 1, inclusive.
② The sum of all the possible probabilities is 1, since we are certain one of them will occur.
③ If A and B are disjoint events, then P(A or B) = P(A) + P(B).
④ The probability for a given event to occur equals 1 − the probability that the event will not occur.
⑤ The probability of an event, but not a given subevent, is the probability of the event minus the probability of the subevent.
⑥ For any events A and B, P(A or B) = P(A) + P(B) − P(A and B).
Although probability was born in the midst of gambling and gaming questions, it soon found
application in social considerations as well. The following last example of this section
illustrates the idea of point of view in a numerical sense.
Example 7. Grandson's Dilemma I. Suppose a grandfather is to distribute 5 crisp, new $100
bills among his three grandchildren, Alphonse, Bertrand and Constance, and we take as
outcomes the C(7,2) = 21 ways of splitting five identical bills among three people; in 15 of
these, someone receives nothing. In this scheme, the probability that at least one grandchild
will not get money is given by 15/21 = 5/7 ≈ 71.43%. So the grandfather will not be surprised
if one or more of the grandchildren is disappointed.

Let us consider the problem from Alphonse's point of view. He perhaps sees himself as equally
likely to get $500, $400, $300, $200, $100 or no dollars. So he considers the likelihood that
he gets nothing as being 1/6 ≈ 16.67%. Thus, he would be considerably surprised to end up
empty-handed.
Example 8. Grandson's Dilemma II. Suppose the grandfather is to distribute, rather than 5
crisp, new $100 bills, five different bills: a $5 bill, a $10, a $20, a $50 and a $100 bill. As
before, he decides that any given bill is as likely to go to any one of the three grandchildren:
Alphonse, Bertrand or Constance. What is the probability that at least one of the grandchildren
will end up being unhappy by not receiving any money?

How many ways can he distribute the money? 3^5 = 243, since the $5 bill can end up in
three different hands, and the $10 also has three choices, et cetera. Next we need to count the
number of ways to distribute the money so that everybody gets some money. Let A be the set
of distributions where Alphonse does not get anything, and likewise B the set of
distributions where Bertrand ends up empty-handed, and similarly for C.

Then we are interested in A ∪ B ∪ C. Our universe has 243 objects as we saw above, and
clearly A has 2^5 = 32 elements, since if Alphonse is to receive nothing, then each bill has
2 choices. On the other hand, A ∩ B has only one element, since Connie then has to receive all
the money, and finally A ∩ B ∩ C = ∅, since the money will end up in the grandchildren's
hands. Then one can easily finish filling in the Venn diagram: each of the three pairwise
regions contains 1 distribution, the triple region is empty, and each of the regions belonging
to just one of A, B and C contains 30, so |A ∪ B ∪ C| = 30 + 30 + 30 + 1 + 1 + 1 = 93.

Thus, the number of distributions in which every child receives some money is
243 − 93 = 150, giving a probability of 150/243 ≈ 61.72%, and consequently the probability
of bringing unhappiness to at least one grandchild is 38.28%. Of course, Alphonse only has a
32/243 ≈ 13.17% chance of getting nothing, so again he might be surprised if this happens.
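The inclusion-exclusion count of 93 can be confirmed by listing all 3^5 = 243 distributions directly (a brute-force Python sketch; the names are ours):

```python
from itertools import product

# Each of the five bills goes to one of the three grandchildren (0, 1, 2):
# 3^5 = 243 equally feasible distributions.
distributions = list(product(range(3), repeat=5))
assert len(distributions) == 243

# A distribution leaves someone empty-handed when fewer than 3 children appear.
someone_empty = sum(1 for d in distributions if len(set(d)) < 3)
print(someone_empty)                   # 93
print(round(someone_empty / 243, 4))   # 0.3827
```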
Example 9. A typical deck of cards consists of 52 cards in 4 suits (spades, hearts, diamonds
and clubs) and 13 denominations (Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King).

Straight Flush
5 cards in sequence in the same suit.
What is the probability of a Straight Flush? There are at least two ways to view our experiment:
one way is that I am dealt 5 cards out of a deck of 52 (as in 5-card draw); another is that I am
dealt one card at a time until I have 5 (as in 5-card stud). Should the probabilities be different?
Of course not. But what may happen is that it is easier to do a problem one way than the other.
Already with the Straight Flush we see that one is superior to the other. With the first
approach, there are C(52,5) = 2,598,960 equally feasible outcomes. To compute the
numerator, how many decisions do we have to make? The suit of the flush (4 choices for this
decision) and the type of straight (10 choices for this decision—just count them on your
fingers by deciding which denomination is the lowest), so in total we have 40 options, and the
probability is 40/2,598,960 ≈ 0.00001539 (you should not hold your breath until you get one
of these). What about the second approach? The first card can be anything, but what about the
second card? We are in trouble. The number of options for the second level of the tree depends
on which branch of the first level you are in (for example, if the first card is a king, then the
second one can only be a 9, 10, J, Q or A: 5 options, while if the first one is an 8, then the
second one can be a 4, 5, 6, 7, 9, 10, J or Q: 8 options). We certainly don't want to start
drawing trees. As it turns out, this tree has 4,320 terminal nodes! (We might let a machine do
this, but certainly not by hand.) This is an important lesson: if you are in trouble counting
the outcomes for an event, then by moving laterally and changing the setup
maybe the trouble can be avoided. In reality, the second approach only worsens as we go
down the list of hands. So we will stay with the first approach.
4-of-a-Kind
All four cards of the same denomination.

For 4-of-a-Kind: we have to decide the denomination (13 ways), and then the odd card (48
ways), so we have 13 × 48 = 624 options all together, giving us a probability of 0.00024.
Full House
3 cards in one denomination, 2 in another.

For a Full House: in order to build a full house we need to decide which 3-of-a-kind we are
going to have (13 options), which suits those 3 cards are going to have, C(4,3) = 4, which
pair (12 options) and the suits for the pair, C(4,2) = 6 options, so the total is
13 × 4 × 12 × 6 = 3,744, and the probability is 0.001440576.
39
Flush
5 cards in the same suit, but not in sequence.

For a Flush: we have to decide the suit (4 options) and which 5 cards out of the 13 in the
suit, which gives C(13,5) = 1,287 options, so we have 4 × 1,287 = 5,148 ways; but these
include the 40 hands that are straights, hence the number is 5,108, and the probability is
0.0019654. Note that if we had missed the subtlety of the 40 hands we had counted before,
the answer wouldn't be that much different: 0.0019807, a difference of 0.000015.
Straight
5 cards in sequence, but not in the same suit.

For a Straight: we have to decide the type of straight (10 options), and then decide the suit
for each of the cards (4 options for each); as before, we don't worry about the straight
flushes, we will just subtract them. So in total we have 10 × 4 × 4 × 4 × 4 × 4 = 10,240
ways, from which we subtract the 40 straight flushes, to give 10,200, so the probability is
0.0039246.
3-of-a-Kind
3 cards in the same denomination, the other 2 different.

For 3-of-a-Kind: choose the denomination (13 ways), the suits for these 3 cards (C(4,3) = 4
ways), the two other denominations (C(12,2) = 66 ways), and the arbitrary suits for the two
new cards: 4^2 = 16. Total = 13 × 4 × 66 × 16 = 54,912. The probability is, thus, 0.021128
(things are getting better).
Two Pairs
2 cards in one denomination, 2 cards in another denomination, fifth card in yet another.

This one is subtle: Two-Pair. One trap that is commonly fallen into is as follows: choose the
denomination for the first pair—we have 13 ways of doing this—then choose the suits for this
pair: C(4,2) = 6 ways. Then we have 12 ways of choosing the second pair, and again 6 ways
of choosing its suits. Now all we have to do is choose the remaining card out of a possible 48,
giving us a grand total of 13 × 6 × 12 × 6 × 48 = 269,568. There are two most foul errors in
this discussion. The latter is the easier one to catch: the 48 is wrong. You are not controlling
full houses! Hence it should be one card out of the remaining 44, giving an adjusted count of
247,104. But one error remains that is most subtle and very common (and tempting—can you
detect it?). Remember the distinction between a committee and an executive board. Can we tell
the difference between the first pair and the second pair? We certainly have counted them as if
we could, and that is definitely wrong. Instead we have counted every hand twice, and the real
count should be 123,552. Just to make you totally comfortable with this, let's count them
another way. Let's start by choosing the two denominations for the two pairs: C(13,2) = 78
ways. Then we have 6 ways of choosing the suits for one of the pairs and another 6 of
choosing them for the other one, and then we have, as before, 44 ways of choosing the extra
card: 78 × 6 × 6 × 44 = 123,552.
Pair
2 cards in one denomination, nothing else.

For a simple Pair, we first have to decide the denomination (13 ways), then the suits for the
pair, C(4,2) = 6. Then we must have three other denominations, C(12,3) = 220, and then
the arbitrary suits for those denominations, 4 × 4 × 4 = 64, which gives us a total of
13 × 6 × 220 × 64 = 1,098,240.
Bust
None of the above.

For a Bust, we must have 5 different denominations, C(13,5) = 1,287, and a suit for each,
4^5 = 1,024. But we have included the Straight Flushes, the Flushes and the Straights, so we
have to subtract them:

1,287 × 1,024 − 40 − 5,108 − 10,200 = 1,302,540.

But wait, you say, I was stupid to have done this calculation, since we could have found the
number of bust hands by subtracting all the previous hands from the total number of hands.
However, it is important to have redundancy, especially in a sophisticated calculation like this
one. The fact that, as the table shows, the numbers add up to the correct total assures us that
we cannot have made just a single error.

Name of hand     Number of ways   Probability
Straight Flush   40               0.0000154
4-of-a-Kind      624              0.0002401
Full House       3,744            0.0014406
Flush            5,108            0.0019654
Straight         10,200           0.0039246
3-of-a-Kind      54,912           0.0211285
Two Pairs        123,552          0.0475390
Pair             1,098,240        0.4225690
Bust             1,302,540        0.5011774
Total            2,598,960        1.0000000
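All of the hand counts above, including the redundancy check that they sum to C(52,5), can be reproduced with `math.comb` (a Python sketch; the variable names are ours):

```python
from math import comb

total = comb(52, 5)                                    # 2,598,960 hands
straight_flush = 10 * 4                                # 40
four_kind      = 13 * 48                               # 624
full_house     = 13 * comb(4, 3) * 12 * comb(4, 2)     # 3,744
flush          = 4 * comb(13, 5) - straight_flush      # 5,108
straight       = 10 * 4**5 - straight_flush            # 10,200
three_kind     = 13 * comb(4, 3) * comb(12, 2) * 16    # 54,912
two_pair       = comb(13, 2) * 6 * 6 * 44              # 123,552
pair           = 13 * comb(4, 2) * comb(12, 3) * 4**3  # 1,098,240

# Redundancy check: the bust hands are whatever is left over.
bust = total - (straight_flush + four_kind + full_house + flush
                + straight + three_kind + two_pair + pair)
print(bust)  # 1302540
```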
② Random Variables
In the last section we encountered the fundamental concept of a random variable. One can
safely say that this is the most fundamental concept in the course. So what is a random
variable? In the last section we looked at a couple of them: the roll of two dice, which had for
its distribution
X   2      3      4      5      6      7      8      9      10     11     12
P   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
≈   .0277  .0556  .0833  .1111  .1388  .1667  .1388  .1111  .0833  .0556  .0277

and the roll of three dice, with distribution
X   3      4      5      6       7       8       9       10      11      12      13      14      15      16     17     18
P   1/216  3/216  6/216  10/216  15/216  21/216  25/216  27/216  27/216  25/216  21/216  15/216  10/216  6/216  3/216  1/216
≈   .0046  .0138  .0277  .0463   .0694   .0972   .1157   .125    .125    .1157   .0972   .0694   .0463   .0277  .0138  .0046
Both of these are examples of discrete random variables, since the only values the random
variable takes are isolated real numbers. The values that the random variable takes are called
the support (or range) of the random variable. For a discrete random variable, each of
the numbers in this collection of (isolated, possibly infinitely many) points (the support) gets
assigned a nonnegative number, which stands for the probability of obtaining that number as
an output of our random variable. The sum of all these probabilities has to be 1, of course.
This collection of probabilities is called the distribution of the (discrete) random
variable. Although this is commonly so designated, it is an unfortunate choice, since as we
will see below the word distribution will acquire a slightly different meaning when we discuss
other types of random variables.

The rest of the computations of probabilities are basically dependent on the fact that these
outcomes are disjoint events, so one simply adds the probabilities.
Example 1. Suppose you come to take a test totally unprepared. The test consists of 10
True/False questions, each of which you will answer at random (but you will answer them
all, since there is no penalty for guessing). How likely is it that you will achieve a passing score
of 70% or better? Let X be the random variable that counts the number of correct answers, so
what is desired is P(X ≥ 7).
First, what is the experiment? Answering the exam. How many ways can you do this? By the
second counting principle, 2¹⁰ = 1024 (10 decisions, 2 choices for each). What is the event we
are pursuing? In how many ways can you answer the exam so that you have exactly 10 correct?
1 way. How about 9 correct? Build your tree: the first stage is to decide which question you are
going to answer incorrectly, and for this stage you have 10 choices. After you have done that, there
are no options left, since the question you are to answer incorrectly has to be answered that way
while all the others have to be answered correctly. So there is a total of (10 choose 1) = 10 ways of
getting 9 correct. What about 8? There we have the decision of which two questions out of
the 10 we are to answer incorrectly; after that we have no options left, so the answer
is (10 choose 2) = 45. Finally, by similar reasoning, we get (10 choose 3) = 120 ways of getting exactly 7 correct.
Continuing this way, we get the probability distribution for the random variable that counts the
number of correct answers:
X     0       1       2        3        4        5        6        7       8       9      10
P  1/1024 10/1024 45/1024 120/1024 210/1024 252/1024 210/1024 120/1024 45/1024 10/1024 1/1024
%   .097    .97     4.4     11.7     20.5     24.6     20.5     11.7    4.4     .97    .097
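The question originally asked, P(X ≥ 7), is the tail sum of this table. A short sketch that confirms both the counts and the answer (the helper `binom_pmf` is ours, not part of the text):

```python
from math import comb

def binom_pmf(n, k, p):
    """P(exactly k successes in n independent tries, each succeeding with probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# True/False exam: 10 questions answered at random, p = 1/2 each
pmf = [binom_pmf(10, k, 0.5) for k in range(11)]
print([round(1024 * q) for q in pmf])  # [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]
print(sum(pmf[7:]))                    # P(X >= 7) = 176/1024 = 0.171875
```

So a passing grade by pure guessing has roughly a 17% chance.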
Example 2. This is a slight but important variation of Example 1. Suppose you come to take
a test (totally unprepared as usual), but that the test is Multiple Choice, with ten questions,
each question having three choices, only one of them correct. What is the probability that
you score at least a 70% on the test? Again let X be the random variable that counts the
number of correct answers.
How many ways can we answer the exam? Easy, 3¹⁰ = 59049. How many ways can we get
all correct? 1 way. Nine correct? (10 choose 9) × 1⁹ × 2¹ = 20. The surprising ingredient here might be
the 2, which is coming from the 2 ways we can answer a question incorrectly. Reviewing the
three factors we get: (10 choose 9) as the number of ways of choosing the questions that we are going to
answer correctly, 1⁹ as the number of ways of answering those questions correctly, and 2¹ as the
number of ways of answering the remaining question incorrectly. How many ways can we get
an 80%? (10 choose 8) × 1⁸ × 2² = 180. Continuing we get
X      0          1          2           3           4          5          6         7         8        9        10
P  1024/59049 5120/59049 11520/59049 15360/59049 13440/59049 8064/59049 3360/59049 960/59049 180/59049 20/59049 1/59049
%    1.73       8.67      19.51       26.01       22.76      13.65      5.67      1.63      .30      .03     .001
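As in Example 1, the answer is the tail sum of the table. A quick check of the counts (a hypothetical sketch, not the text's notation):

```python
from math import comb

# ways to answer a 10-question, 3-choice exam with exactly k correct:
# choose the k correct questions, then pick one of the 2 wrong answers elsewhere
counts = [comb(10, k) * 2**(10 - k) for k in range(11)]
print(counts)                   # [1024, 5120, 11520, 15360, 13440, 8064, 3360, 960, 180, 20, 1]
print(sum(counts))              # 59049 = 3**10 possible answer sheets
print(sum(counts[7:]) / 3**10)  # P(X >= 7) ≈ 0.0197, far worse than True/False
```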
Example 4. The Bernoulli, the Next Simplest Random Variable. The next simplest
random variable is named after one of the founders of the subject, Jacob Bernoulli, who wrote
a very important book named Ars Conjectandi.
The purpose of the Bernoulli random variable is simply to model success vs. failure. It has one
parameter, the probability of success, denoted by p, so B_p, which is how we will refer to this
random variable from now on, has the simple distribution:
B_p    0      1
P    1 − p    p
So usually 1 is success while 0 is failure. Also, we often refer to 1 − p as q (and we will do so
throughout the course). Of course, what success is depends on the user. For example, success
could be rolling a 7 with a pair of dice, or flipping heads with a coin.
But there is another kind of random variable altogether. Let us start with an example we
encountered briefly before.
Example 5. Uniform. Let us consider the random variable X that chooses a number between
0 and 1 at random. This is a randomizer or a random number generator. It will be known as U
for the remainder of the course.
Think of a die, but instead of the numbers 1 through 6 on the faces, we have the numbers
1/6 through 6/6 on them. Then the distribution of a roll looks like a flat line: each of the
values 1/6, 1/3, 1/2, 2/3, 5/6, 1 carries probability 1/6. Similarly, we can think of an
icosahedron with 20 sides labeled 1/20 to 20/20, and then of a 40-sided solid labeled
1/40 to 40/40.
As we proceed to increase the number of sides, we see that the probability of getting any
given value decreases toward 0: from 1/6 = .1667 to 1/20 = .05 to 1/40 = .025 in our three
examples. So in fact, when we talk of a random variable that chooses an arbitrary real number in
the interval [0,1], we have lost the ability to talk of the probability of a single number as being
anything but 0.
First we need to understand the support of the variable—that is, what values can our variable
take. In the example we are discussing, the support is the interval [ 0,1] . If we take an arbitrary
u in this interval, we could ask for the following quantity:
P(u ≤ U ≤ u + h) / h.
And then we could take the limit as h → 0; this would represent the tendency of probability at
u, a potential. In our specific example, that limit is
lim_{h→0} P(u ≤ U ≤ u + h)/h = lim_{h→0} ((u + h) − u)/h = 1
as long as 0 ≤ u ≤ 1, and that agrees with our intuition of what this specific random variable is
doing: namely, a number between 0 and 1 has the same chance as any other number in that
interval of being picked, while numbers outside that interval have no chance of being selected.
Again we could use the same idea of a table (as we did in the discrete case) to represent this
information, except that now, instead of being a discrete table, it is a continuous table:
U     ⋯   −1   ⋯   0   ⋯   u   ⋯   1   ⋯   2   ⋯
f(u)  0    0   0   1   1   1   1   1   0   0   0
And instead of a probability, it is a potential for probability; instead of mass, it is density. It
is usually denoted by f(u). Note that we use little u to denote an arbitrary number in the
support of the random variable U.
Again, for a continuous random variable, individual members of the support lose all
importance, so the probability of any one of them occurring is simply 0; in other words,
P(U = a) = 0 for any number a,
but that does not mean that single elements cannot have a higher potential of occurring than
others. In the previous example, the points in the interval between 0 and 1 had equal potential,
while points outside that interval had no potential (so they were not in the support of the
random variable).
In other words, instead of talking of the probability of getting a specific number, we talk about
the probability of being close to that number. But even more dramatically, for any number a ,
P ( U < a ) = P ( U ≤ a ) , since, again, individual points do not matter.
We know for sure in our example that P(0 ≤ U ≤ 1) = 1 since we have certainty this will
happen. This is closely related to the above mentioned principle:
• The sum of all the possible probabilities is 1 since we are
certain one of them will occur.
But it is no longer a sum, since we are acting on a continuum now, so the correct statement is
that
•′ The integral of the density over all real numbers equals 1 since
we are certain this will occur.
And in fact ∫_{−∞}^{∞} f(x) dx = ∫₀¹ f(x) dx, since there is no density outside that interval, and
in that interval we have f(u) = 1, so ∫_{−∞}^{∞} f(x) dx = ∫₀¹ 1 dx = x |₀¹ = 1, as it should be.
Rather than the table as we did above, the density of a random variable is given as a function.
In our example
f(u) = 1 for 0 ≤ u ≤ 1, and 0 otherwise,
and the graph is given by the picture: a flat line of height 1 over the interval [0, 1].
One could also simply say that the support is the interval [0,1] and in that interval the
density is 1.
What kind of event should we be discussing? Certainly, we can compute P(.2 ≤ U ≤ .45). It
can be viewed two ways:
• ∫_{.2}^{.45} 1 dt = .25. This is nothing but the sum (an integral, since it is continuous) of all
relevant possibilities.
• The area of a rectangle with base of size .25 and height 1, which is .25. Similar to the discrete
arguments: a share of .25 out of a total of 1.
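Both viewpoints agree with a straightforward simulation (a sketch; `random.random()` plays the role of the randomizer U, and the seed is arbitrary):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
N = 100_000
hits = sum(1 for _ in range(N) if 0.2 <= random.random() <= 0.45)
print(hits / N)   # ≈ 0.25, the length of the interval [.2, .45]
```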
An event is a subset of the line, and to compute its probability one simply integrates the
density over that subset. There is another function that is closely associated with the random
variable. It is unfortunately called the distribution (vs. the density) of the random variable,
and so one defines
F(a) = P(X ≤ a) = ∫_{−∞}^{a} f(t) dt.
Often, in order to clarify, the distribution is also referred to as the cumulative distribution.
Its characteristic properties include:
• It is increasing: F(a) ≤ F(b) if a ≤ b,
since the event X ≤ a is a subevent of the event X ≤ b;
• It runs from 0 to 1: F(a) → 0 as a → −∞ and F(a) → 1 as a → ∞.
Of course, if we know the distribution F(a) of a continuous random variable X, then its density
is simply given as its derivative, f(a) = dF(x)/dx evaluated at x = a.
For example, if we use X = U, the uniform random variable on the interval [0,1], then its
density is given by
f(u) = 1 for 0 ≤ u ≤ 1, and 0 otherwise,
while its distribution is
F(u) = 0 for u ≤ 0,   F(u) = u for 0 ≤ u ≤ 1,   F(u) = 1 for u ≥ 1.
Or one could simply say that F(x) is x in the support of U, the rest being obvious. The
graphs of the two are given by the pictures: the density is a flat line at height 1 over [0, 1],
while the distribution ramps linearly from 0 up to 1 over that interval.
One of the advantages of having the distribution is that no more integration is then required.
For example, to compute P(.2 ≤ U ≤ .45), one simply calculates
F(.45) − F(.2) = .45 − .2 = .25.
Suppose next that X has density f(x) = c/x⁴ for x ≥ 10 (and 0 otherwise). The first
computation has to be to find c, and to do that we know that ∫_{10}^{∞} c/x⁴ dx has to
equal 1; this forces c = 3000. To find P(X > 20), all we need to do is compute
∫_{20}^{∞} 3000/x⁴ dx = −1000x⁻³ |_{20}^{∞} = 1/8.
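Both computations can be checked numerically; the sketch below uses a plain midpoint Riemann sum, truncating the infinite tail at a large bound (an assumption of ours, not a step in the text):

```python
def integrate(f, a, b, n=200_000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

density = lambda x: 3000 / x**4        # the value c = 3000 found above
print(integrate(density, 10, 10_000))  # ≈ 1, so the density integrates to 1
print(integrate(density, 20, 10_000))  # ≈ 0.125 = 1/8 = P(X > 20)
```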
We end the section with yet another type of random variable, which, unfortunately, will not be
encountered often in the course.
If we let M denote the arrival time of the mechanic and C that of the chauffeur, then we have
P(M ≤ C) = 1/2, since P(M ≤ C) + P(M ≥ C) = 1 and P(M ≤ C) = P(M ≥ C). But this is
equivalent to P(W = 0) = 1/2.
¹ This word, independently, will be further clarified in a future section.
In this section we introduce some of the most important features of a random variable.
In all we will look at 5 items; some measure the center of the random variable while
others measure the spread. We have already looked at one of the measurements of the
spread: the SUPPORT or RANGE of a random variable, which consists of the values a
random variable can take.
Next we look at perhaps the most important individual bit of information about a random
variable: its mean, or average or expected value. The notion was introduced by the
great Dutch mathematician Huygens (also of the 17th century).
Example 1. Let's look at the following game (once more, not atypical of the 17th
century):
Since one has a chance in six of rolling a 1 with a die, one has an even
chance to roll at least one 1 when one rolls 3 dice. Hence I propose the
following game to you. You will roll 3 dice. If three 1's show up you win a
wonderful $5, if only two 1's, you win $2, while if only one 1 is rolled, you
will still win $1. If, unfortunately, on the other hand, no 1's show up you pay
me only $1 .
Naturally, you are suspicious of my proposition, but it is much better to pinpoint the
reasons for your suspicions. What we need is to compute your expected value when you
play this game; this is equivalent to your average performance in the long run. Of
course, if you are going to play this game just once, then it does not matter what you opt
to do, but as a long-range strategist, you need to compute.
The computation is just common sense. You are basically asking: suppose I played the
game many times, what would happen? What is the appropriate random variable?
In any one roll, you can win either 5, 2 or 1 dollars, or you can lose 1. We know that if
we roll three dice, there are 216 different rolls. Among those rolls, three 1's would show
up once, while two 1's would show up 15 times (15 = 3 × 5: the 3 is the number of
options for which two dice are going to show the two 1's, while the 5 is what the other die
is going to show). How many times does one 1 show up? Choose which die shows the 1
(3 options), and then choose what the other two dice show (5 × 5), for a total of 75 times.
Finally, no 1's will show 125 times (125 = 5 × 5 × 5 = 216 − 1 − 15 − 75). So the random
variable is
X     5      2      1      −1
P   1/216  15/216  75/216  125/216
So if you play the proverbial 216 rolls, you will win $5 once, $2 on exactly 15 rolls, and $1
on 75 occasions, but on 125 rolls you will lose $1, so your winnings are
5 × 1 + 2 × 15 + 1 × 75 − 1 × 125 = −15,
or equivalently, your expectation is
5(1/216) + 2(15/216) + 1(75/216) − 1(125/216) = −15/216 ≈ −6.9¢.
So on the average you will lose $15 in 216 rolls, or approximately 7¢ a roll. Naturally,
you would rather not play a game when your expectation is negative, unless you will
have so much fun you are willing to pay the fee.
Note that the computation is simply the dot product of the two rows, X and P. And that is
exactly the definition of the expectation of a discrete random variable,
E(X) = X · P = Σ_i x_i p_i.
Note that this is a sum, so for a continuous random variable we should be seeing an integral,
E(X) = ∫_{−∞}^{∞} x f(x) dx.
Thus for U, the uniform random variable on [0,1], we get E(U) = ∫₀¹ x dx = 1/2. In this
particular instance, the average did not carry a lot of information: it is just in the middle
of the density, the safest guess.
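The dot-product formula, and the dice-game computation above, are easy to check exactly (a sketch using exact fractions; the helper `expectation` is our name, not the text's):

```python
from fractions import Fraction as F

def expectation(values, probs):
    """E(X) as the dot product of the value row with the probability row."""
    return sum(x * p for x, p in zip(values, probs))

# the dice game: win $5, $2, or $1, or lose $1
values = [5, 2, 1, -1]
probs = [F(1, 216), F(15, 216), F(75, 216), F(125, 216)]
print(expectation(values, probs))   # -5/72, i.e. -15/216: about -6.9 cents a roll
```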
Example 2. Carnival Game. A game is based on the following setup. A bowl contains 4
blue balls marked $100, $50, $20, and $10, respectively; 4 red balls marked $100, $50, $20,
and $10, respectively; 3 black balls marked $100, $50, and $20; 2 yellow balls marked $100
and $50; and 1 green ball marked $100:
Blue    $100  $50  $20  $10
Red     $100  $50  $20  $10
Black   $100  $50  $20
Yellow  $100  $50
Green   $100
Then 3 balls are drawn simultaneously from the bowl at random. The contestant wins the
difference between the largest ball drawn and the smallest ball drawn. Thus, for example, if
we draw 50, 50 and 10, the contestant wins 40.
But we can take the opportunity to introduce other measurements for this random
variable. The simplest notion is that of MODE: the outcome with highest probability. In
this case it is $80.
In the continuous case, the mode is represented by the point(s) where the density achieves
its maximum value. Thus, for the standard uniform, U, every point between 0 and 1
represents the mode. On the other hand, for the sum of two independent standard
uniforms, the mode was 1.
Then the third measure of central tendency is the MEDIAN: This is the outcome that is in
the middle—50% chance above and 50% chance below. One way to compute it in the
example is to think of the 364 possible outcomes. If these are listed in order, then the
middle will be between the 182nd position and the 183rd position. Since both of these are
occupied by 80, that is our median. If the middle positions had been occupied by different
numbers, the median would have been the average of the two numbers.
The median of the standard uniform is also 1/2. One fact worth mentioning is that
the mean, median and mode always occur in alphabetical order, either
decreasing or increasing.
Thus in the example, the mean was approximately $63 while the median and the mode were
both $80.
Often the RANGE is simply given as the interval in which the variable has probability
bigger than 0. Thus, in the gaming example, the range is 0 to 90. Another way to
describe the range is by giving the MIN and MAX values of the random variable (of
course only if these values are finite).
Before we discuss the most recent of these ideas, and the most fundamental measure of
spread, we need to expand further on random variables. Given a discrete random variable
X, the random variable X² simply takes the squares of the values of X with the same
probability (or the same density). Namely, in table form for the discrete case, we have
X    ⋯   x    ⋯
X²   ⋯   x²   ⋯
P    ⋯   p_x  ⋯
Thus,
E(X²) = Σ_x x² p_x
in the discrete case.
In the next section, we will see how to handle X² in the continuous case in full, but what
we require now is E(X²), and that concept is easily extended to
E(X²) = ∫_{−∞}^{∞} x² f(x) dx.
Now we are ready for a fundamental concept, the VARIANCE, together with its square root,
the STANDARD DEVIATION σ. The variance is V(X) = E(X²) − E(X)².
Note that since the standard deviation is the square root of the variance, it is necessary
that the variance be a nonnegative number; we will indeed prove that this is the case
below.
So, in the carnival game, the variance is V(X) = 4690.93 − (63.3)² = 684.04, and the standard
deviation is then σ = √684.04 = 26.15. It is measured in the same units as the mean, in this
case, dollars.
For the standard uniform, E(U²) = ∫_{−∞}^{∞} x² f(x) dx = ∫₀¹ x² dx = 1/3. Thus
V(U) = E(U²) − E(U)² = 1/3 − 1/4 = 1/12, and so σ_U = 1/(2√3).
Example 4. Being hired by a small firm whose average salary is $100,000 sounds very
exciting. But pursuing it further, we learn that there are 13 employees in the firm: the
boss makes $1,000,000, his two vice presidents (his two daughters) make $100,000 each,
and the remaining 10 employees make $10,000 per person. So the distribution of salaries is
S    10⁶    10⁵    10⁴
P    1/13   2/13   10/13
Indeed, E(S) = 100,000, but the median and the mode are both $10,000.
Example 5. The Roll of the Dice. Consider the roll of two dice:
X    2     3     4     5     6     7     8     9    10    11    12
P   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
≈  .0277 .0556 .0833 .1111 .1388 .1667 .1388 .1111 .0833 .0556 .0277
then
E(X) = (2×1 + 3×2 + 4×3 + 5×4 + 6×5 + 7×6 + 8×5 + 9×4 + 10×3 + 11×2 + 12×1)/36 = 7.
The mode is 7 also, and so the median has to be 7 (since they are always in alphabetical
order). To compute the variance, we need as usual E(X²), which equals
E(X²) = (4×1 + 9×2 + 16×3 + 25×4 + 36×5 + 49×6 + 64×5 + 81×4 + 100×3 + 121×2 + 144×1)/36
= 1974/36 ≈ 54.8333.
Hence V(X) = 5.8333 and σ = 2.4152.
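These numbers are easy to verify exactly; a sketch with exact fractions:

```python
from fractions import Fraction as F

# distribution of the sum of two dice
values = list(range(2, 13))
probs = [F(6 - abs(v - 7), 36) for v in values]   # 1/36, 2/36, ..., 6/36, ..., 1/36

E  = sum(x * p for x, p in zip(values, probs))
E2 = sum(x * x * p for x, p in zip(values, probs))
V  = E2 - E**2
print(E, float(E2), float(V), float(V) ** 0.5)    # 7, then ≈ 54.8333, 5.8333, 2.4152
```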
Example 6. Lifetime. A certain type of electronic device has a lifetime T with density
f(t) = 3000/t⁴ for t ≥ 10, and 0 otherwise.
Since the density is a decreasing function, the mode is easily seen to be 10. Now
E(T) = ∫_{10}^{∞} t · 3000/t⁴ dt = −1500/t² |_{10}^{∞} = 15,
so we expect the device to live 15 hours.
The computation of the median is interesting: we need to find a so that
∫_{10}^{a} 3000/t⁴ dt = 1/2.
Or equivalently, F(a) = 1/2, and since we computed F(a) = 1 − 1000/a³, we get that the
median is ∛2000 ≈ 12.60, so 50% of the devices will not last past 12.60 hours, in
between the mode and the mean (as expected).
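Since the distribution F(a) = 1 − 1000/a³ is available in closed form, the median can also be located numerically, for instance by bisection (a sketch under that assumption):

```python
def F(a):
    """Distribution of the lifetime: F(a) = 1 - 1000/a**3 for a >= 10."""
    return 1 - 1000 / a**3 if a >= 10 else 0.0

lo, hi = 10.0, 100.0          # F(10) = 0 < 1/2 < F(100), so the median is bracketed
for _ in range(60):           # halve the bracket until it is tiny
    mid = (lo + hi) / 2
    if F(mid) < 0.5:
        lo = mid
    else:
        hi = mid
print(round(lo, 2))           # 12.6, the cube root of 2000
```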
Back to the waiting time W of the mechanic and the chauffeur (W = 0 with probability 1/2,
with density 1 − w on 0 < w ≤ 1): the median is clearly 0, and so is the mode. For the
expectation, we need to compute
E(W) = 0 × 1/2 + ∫₀¹ t(1 − t) dt = 1/6.
Note P(W ≤ 1/6) = 47/72 ≈ 65.27%. The support is, of course, 0 to 1. Now
E(W²) = 0 × 1/2 + ∫₀¹ t²(1 − t) dt = 1/12,
so V(W) = 1/12 − 1/36 = 1/18.
But the notion of expectation can be used in even more interesting ways. Suppose a piece of
sunken equipment is to be recovered, and each salvage team sent has, independently, a 70%
chance of recovering it.
The first question has little to do with expectation. How many teams should the salvage
company send in order to have at least a 99% chance of recovering the equipment? If n
teams are sent, the probability that the equipment will not be recovered is (.3)ⁿ, so if we
send three teams, we get a 1 − (.3)³ = 1 − .027 = .973 chance of recovery, not enough. But
if we send 4 teams, we get 1 − (.3)⁴ = 1 − .0081 = .9919, so 4 teams will suffice. Of course,
the more teams we send, the more likely it is that the piece of equipment will be
recovered.
One of the reasons that expectations are more important than modes or medians is that
there are several important theorems that are true about expectation as we will see below
in a future section.
The easiest new random variable is one created from an old random variable via
multiplication by a scalar.
Example 1. The Roll of the Dice. Let X denote the roll of 2 dice. Suppose we will play a
game where we get paid triple our roll. Let Y denote the random variable for our
winnings; then certainly Y = 3X, and certainly their distributions will be closely
associated:
X    2     3     4     5     6     7     8     9    10    11    12
Y    6     9    12    15    18    21    24    27    30    33    36
P   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
As exemplified, the basic relation between X and Y = aX is that, for any number n,
P(X = n) = P(Y = an) in the discrete case. Clearly,
E(Y) = E(3X) = 3X · P = 3(X · P) = 3E(X),
and this claim is generic: E(aX) = aE(X). This is at least obvious in the discrete case; it
will become so also in the continuous case soon. Also, a little thought will show that the
median of Y is 3 times the median of X, and the same is true for the mode.
Of course the range of Y is the collection of tripled values of the range of X, as is easily
observed from the table. Now trivially Y² = 9X², so E(Y²) = 9E(X²), and finally we get
V(Y) = E(Y²) − E(Y)² = 9E(X²) − (3E(X))² = 9(E(X²) − E(X)²) = 9V(X),
so σ_Y = 3σ_X.
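The scaling rules E(3X) = 3E(X) and V(3X) = 9V(X) can be confirmed exactly on the dice distribution (a sketch with exact fractions; `E` is our helper name):

```python
from fractions import Fraction as F

values = list(range(2, 13))                       # sum of two dice
probs = [F(6 - abs(v - 7), 36) for v in values]

def E(vals):
    """Expectation of a row of values against the dice probabilities."""
    return sum(v * p for v, p in zip(vals, probs))

X = values
Y = [3 * x for x in X]                            # the tripled game
VX = E([x * x for x in X]) - E(X) ** 2
VY = E([y * y for y in Y]) - E(Y) ** 2
print(E(X), E(Y))    # 7 21
print(VY / VX)       # 9
```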
Suppose now that X is continuous with density f(x). What do we know about the
random variable Y = 3X? Certainly we know that, for any number y,
P(Y ≤ y) = P(3X ≤ y) = P(X ≤ y/3),
and so F_Y(y) = F_X(y/3), and so f_Y(y) = (1/3) f_X(y/3), by the chain rule. Let us consider
a specific example.
Example 2. Lifetime. A certain type of electronic device has had so far a lifetime T with
density
f(t) = 3000/t⁴ for t ≥ 10, and 0 otherwise.
The manufacturers now claim their new product has tripled the life of the old devices, so
the new lifetime is N = 3T. What do we know about N? We know its support is all
numbers ≥ 30, and if n ≥ 30, then
F_N(n) = P(N ≤ n) = P(T ≤ n/3) = F_T(n/3) = 1 − 1000/(n/3)³ = 1 − 27000/n³.
Hence f_N(n) = 81000/n⁴ = (1/3) f_T(n/3). Of course the mode of N is 30, three times the
mode of T, and
E(N) = ∫_{−∞}^{∞} n f(n) dn = ∫_{30}^{∞} n · 81000/n⁴ dn = −40500/n² |_{30}^{∞} = 45.
So E(N) = 3E(T). But the astute reader may easily recognize a much better way to do
this computation, by substituting n = 3t in the integral:
E(N) = ∫_{30}^{∞} n · (81000/n⁴) dn = ∫_{10}^{∞} 3t · (81000/(3t)⁴) · 3 dt = 3 ∫_{10}^{∞} t · (3000/t⁴) dt = 3E(T).
It is clear we use the distribution since it refers to actual probabilities, versus the density,
which only refers to potentials.
Similarly, the median a of N satisfies ∫_{30}^{a} 81000/n⁴ dn = 1/2, and with the same
substitution we get
∫_{10}^{a/3} (81000/(81t⁴)) · 3 dt = ∫_{10}^{a/3} 3000/t⁴ dt = 1/2,
so a/3 is the median of T, and so the median of N is again 3 times the median of T. In the
same way one shows E(N²) = 9E(T²), and so, as before, V(N) = 9V(T).
Observe that we also showed that if X has density f_X(x) and distribution F_X(x), then
Y = aX has density and distribution given respectively by
f_Y(y) = (1/a) f_X(y/a)   and   F_Y(y) = F_X(y/a).
Actually, note that it was much easier to find the distribution first (from which we got the
density by differentiating), rather than the other way around.
The next level for building new random variables is that of adding a constant to a random
variable: Y = X + b. This is often known as a translation (in geometry), and clearly the
support of the random variable is translated accordingly. For example, in the roll of two
dice, suppose somebody claimed they would give you two dollars over the roll. Thus if
X is the roll of the dice, then the winnings would be represented by Y = X + 2. And the
distribution is given by
X    2     3     4     5     6     7     8     9    10    11    12
Y    4     5     6     7     8     9    10    11    12    13    14
P   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Clearly the mode of Y is the mode of X plus 2, and so is the median. For the
expectation, the calculation is interesting:
E(Y) = Σ_i y_i p_i = Σ_i (x_i + 2) p_i = Σ_i x_i p_i + Σ_i 2 p_i = E(X) + 2
E(Y²) = Σ_i y_i² p_i = Σ_i (x_i + 2)² p_i = Σ_i x_i² p_i + Σ_i 4 x_i p_i + Σ_i 4 p_i = E(X²) + 4E(X) + 4.
So
V(Y) = E(Y²) − E(Y)² = E(X²) + 4E(X) + 4 − (E(X) + 2)² = E(X²) − E(X)² = V(X),
namely the variance does not change! Below we will see an alternate definition of
variance that will point out clearly the reason for this last claim.
Let us move to the continuous example. Suppose the manufacturer of our devices
promises now that the devices will last 2 hours longer than before. Again, let N be the
lifetime of our new devices, so now N = T + 2. So our range (or support) will now
be all numbers greater than or equal to 12. Again, as before, the next easiest information
to gather about N is its cumulative distribution: if n ≥ 12, then
F_N(n) = P(N ≤ n) = P(T + 2 ≤ n) = F_T(n − 2) = 1 − 1000/(n − 2)³.
Thus clearly, the mode of N minus 2 is the mode of T. For the expectation we get
E(N) = ∫_{12}^{∞} n · 3000/(n − 2)⁴ dn = ∫_{10}^{∞} (t + 2) · 3000/t⁴ dt
= ∫_{10}^{∞} t · 3000/t⁴ dt + 2 ∫_{10}^{∞} 3000/t⁴ dt = E(T) + 2
by making the substitution n = t + 2, and also using the fact that ∫_{10}^{∞} 3000/t⁴ dt = 1.
In a similar fashion to the discrete case, one can show that the variance of N is the same
as the variance of T, and by extending our reasoning to an arbitrary constant, we get a
nice theorem, which receives its name from the fact that transformations of the form
x ↦ ax + b are called affine: if Y = aX + b with a > 0, then E(Y) = aE(X) + b,
V(Y) = a²V(X), and
F_Y(y) = F_X((y − b)/a).
Proof. Only the last remark is worth arguing. Introduce Z = aX, so Y = Z + b. Then from
above we know F_Z(z) = F_X(z/a), and since F_Y(y) = F_Z(y − b), we get
F_Y(y) = F_X((y − b)/a), and so f_Y(y) = (1/a) f_X((y − b)/a). ∎
Example 3. Uniforms. From now on, we will let U_[p,q] denote the uniform random
variable on the interval [p, q], so its density is given by
f_{U[p,q]}(x) = 1/(q − p) for p ≤ x ≤ q, and 0 otherwise.
As mentioned above, in the special case of the unit interval, [p, q] = [0,1], we will use
simply U.
Consider the random variable Y = (q − p)U + p. Then the range of Y is given by the set
of numbers of the form (q − p)x + p where x is any number between 0 and 1; but then
this set is nothing but the interval [p, q]. And in that set the density is given by
f_Y(y) = (1/(q − p)) f_U((y − p)/(q − p)) = 1/(q − p),
since f_U((y − p)/(q − p)) = 1 as 0 ≤ (y − p)/(q − p) ≤ 1, and so Y = U_[p,q]. Hence we get
that
E(U_[p,q]) = (q − p) · 1/2 + p = (q + p)/2,
the midpoint of the interval. Also the median is located there. For the variance, we have
V(U_[p,q]) = V((q − p)U + p) = (q − p)² V(U) = (q − p)²/12.
The fact that U_[p,q] = (q − p)U + p is very useful, since most numerical programs come
equipped with a randomizer, which is most often U. Thus, if we wanted to model some
other uniform, say U_[−1,2], we could easily build a table of random values by using U.
For example, here are 10 random values for U and the corresponding 10 values for
U_[−1,2] = 3U − 1:
U        U[−1,2]
0.8167   1.4502
0.0714  −0.7859
0.2610  −0.2171
0.0167  −0.9499
0.3007  −0.0980
0.9018   1.7055
0.2681  −0.1956
0.7096   1.1289
0.3979   0.1936
0.6657   0.9970
It has not been stated explicitly, but it is clear that for any random variable whose range is
finite (in other words, one that is bounded above and below), its expectation will lie in the
interval between the highest and lowest values. Of course, the same is true for the mode
and the median. In particular, if a random variable takes only nonnegative values, its
expectation can only be nonnegative.
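The rescaling recipe above can be sketched with a standard randomizer (`random.random()` plays the role of U; the seed is an arbitrary choice of ours):

```python
import random

random.seed(3)
p, q = -1, 2
samples = [(q - p) * random.random() + p for _ in range(100_000)]

print(p <= min(samples) and max(samples) < q)  # True: the support is [-1, 2]
print(sum(samples) / len(samples))             # ≈ 0.5 = (p + q)/2, the midpoint
```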
The discrete case is easiest to describe. For example, let X be the roll of the dice, and let
Y = X² and Z = X³; then their distributions are trivially given by
X    2     3     4     5     6     7     8     9    10    11    12
Y    4     9    16    25    36    49    64    81   100   121   144
Z    8    27    64   125   216   343   512   729  1000  1331  1728
P   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
And it is clear that the mode of Y is the square of the mode of X, while the mode of Z
is the cube of the mode of X in this case. Similarly for the median: the median of the
square is the square of the median (and the same holds for the cube), but the same is not
true of the expectation: E(X) = 7, while E(Y) = 1974/36 ≈ 54.83 and E(Z) = 16758/36 ≈ 465.5.
What happens in the continuous case? At the density level, it is similar to the discrete
case; namely, we see the density of X under the value of X, while Y = X² has the squared
value:
X      ⋯   x       ⋯
Y=X²   ⋯   x²      ⋯
f_X    ⋯   f_X(x)  ⋯
Except this does not give us the density of Y = X². It does however allow us to compute,
as we have done above, the expectation of Y = X² by simply
E(X²) = ∫_{−∞}^{∞} x² f(x) dx.
Similarly, E(X³) is simply given by E(X³) = ∫_{−∞}^{∞} x³ f(x) dx.
For a general function of a continuous random variable, we get the same kind of idea about
the density as we did above. Explicitly, we see the value of X above the corresponding
value of g(X), with the density of X, f_X, below them:
X        ⋯   x       ⋯
Y=g(X)   ⋯   g(x)    ⋯
f_X      ⋯   f_X(x)  ⋯
The cumulative distribution of the variable Y = g(X) is perhaps more easily described. As
usual, if F_Y(b) = P(Y ≤ b) and F_X(a) = P(X ≤ a), we simply have, in a similar fashion to
the discrete case,
F_Y(b) = ∫_{g(t) ≤ b} f_X(t) dt.
Let us observe the discrete case is immediate, while the continuous case follows in a
similar manner.
When the variable is of a mixed type, we have to use both ideas (as in an example we have
seen previously):
Example 4. A Mixed Problem. Returning to the problem of the mechanic and the
chauffeur, we have the density of the waiting time W being
f_W(w) = 1/2 at w = 0, 1 − w for 0 < w ≤ 1, and 0 otherwise.
Suppose the chauffeur requires as payment for the pickup P = 100e^W (this indicates a
high degree of sophistication on the driver), measured in dollars of course. Then what do
we expect to pay the chauffeur? We get
E(P) = 100 · (1/2) + ∫₀¹ 100e^w (1 − w) dw = 50 + 100e − 200 ≈ $121.83.
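The mixed expectation splits into the atom at 0 plus an integral over (0, 1]; a numeric check (a midpoint Riemann sum sketch of ours, not a method from the text):

```python
import math

def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

atom = 100 * 0.5   # W = 0 with probability 1/2, and the payment there is 100e^0 = 100
tail = integrate(lambda w: 100 * math.exp(w) * (1 - w), 0, 1)
print(round(atom + tail, 2))   # 121.83
```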
Example 5. Lifetime. A certain type of electronic device has had so far a lifetime T with
density
f(t) = 3000/t⁴ for t ≥ 10, and 0 otherwise.
The cost (in dollars) of running such a device (through its lifetime) is given by
C = 2T² + 5T + 8. What is the expected cost of having a machine until it runs out? We get
immediately:
E(C) = ∫_{−∞}^{∞} (2t² + 5t + 8) f(t) dt = ∫_{10}^{∞} (2t² + 5t + 8) · 3000/t⁴ dt = $683.
In a similar fashion we could have computed the expectation of any function g(T).
Note however that although we know the expectation of C, the cost, we do not know its
distribution, nor its density, so if we wanted to know, for example, the probability that the
costs will exceed $825, at present we could not readily answer. We could proceed by
simple algebra:
P(C ≥ 825) = P(2T² + 5T + 8 ≥ 825) = P(2T² + 5T − 817 ≥ 0)
= P((T − 19)(2T + 43) ≥ 0) = P(T ≥ 19) = ∫_{19}^{∞} 3000/t⁴ dt = 1000/19³ ≈ .1458,
and therefore, through astonishing algebraic fortune, we would have arrived at the
answer. Providentially, there is a much better process, called the transformation
method, which is a useful one-variable technique.
The idea is to produce the density of C itself: one solves c = g(t) for t and substitutes, to
get rid of t. First we observe that the support of C is all numbers c ≥ 258. In that range,
since c = 2t² + 5t + 8, we get t = (−5 ± √(8c − 39))/4, and since t ≥ 10, we get
t = (−5 + √(8c − 39))/4.
And so for any c ≥ 258,
f_C(c) = 3000 × 256 / ((√(8c − 39) − 5)⁴ √(8c − 39)).
Thus, to answer the question P(C ≥ 825) = ?, we would simply integrate
∫_{825}^{∞} f_C(c) dc ≈ .1458, another instance being P(C ≥ 1000) = ∫_{1000}^{∞} f_C(c) dc ≈ .1071,
and so forth. Indeed we now have as much control over C as we do over T.
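The two routes, integrating f_T past 19 or integrating the transformed density f_C past 825, should agree; a numeric sketch (our midpoint Riemann sum, with the infinite tails truncated at large bounds):

```python
def integrate(f, a, b, n=200_000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f_T = lambda t: 3000 / t**4
f_C = lambda c: 3000 * 256 / (((8*c - 39)**0.5 - 5)**4 * (8*c - 39)**0.5)

# both equal P(C >= 825) = P(T >= 19) = 1000/19**3
print(round(integrate(f_T, 19, 5_000), 4))        # 0.1458
print(round(integrate(f_C, 825, 10_000_000), 4))  # 0.1458
```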
Example 7. Powers. Let X have nonnegative range. Then for any positive real number r,
Y = X^r is a random variable, and the theorem applies. Let y = x^r = g(x); then
g′(x) = r x^(r−1), so
f_Y(y) = f_X(x) / (r x^(r−1)) = f_X(y^(1/r)) / (r y^((r−1)/r)),
a function of y alone.
Thus, for example, if we let X be as in the previous example (density f_X(x) = 2x/9 on
0 ≤ x ≤ 3), and let r = 2, then the range of Y is 0 ≤ y ≤ 9, and if y = x², then
f_Y(y) = (2√y/9) / (2√y) = 1/9,
and we obtain a uniform. On the other hand, if we let Z = X^(1/2), then we obtain
f_Z(z) = f_X(z²) / ((1/2)(z²)^(−1/2)) = (2z²/9) · 2z = 4z³/9
in the range 0 ≤ z ≤ √3.
Example 8. Uniforms Again. If U is the basic uniform, then what is the density or distribution of V = U²? We actually can do this computation two different ways. Recall that on the range 0 ≤ u ≤ 1, F_U(u) = u. Observe that V has the same support, so in that range

F_V(v) = P(V ≤ v) = P(U² ≤ v) = P(U ≤ √v) = √v.

And so in that range (of course without 0, which only reaffirms the idea that singleton points are insignificant and irrelevant to continuous random variables):

f_V(v) = 1/(2√v).

We know for example that P(U ≤ 1/2) = 1/2, but on the other hand, P(V ≤ 1/2) = √2/2 ≈ 0.7071. There is a bigger density at the lower numbers because squaring the numbers in the interval [0, 1] makes them smaller.
But, of course, we could also have used the transformation method, since the derivative is only 0 at the edge of the interval¹, so more directly we get

f_V(v) = f_U(u)/(2u) = 1/(2v^(1/2)).

Similarly, if we let Z = √U, then

f_Z(z) = f_U(u)/((1/2)u^(−1/2)) = 2u^(1/2) = 2z.
Now consider U₍₋₁,₂₎, the uniform on the interval [−1, 2] (with density 1/3 there), and let Y = U²₍₋₁,₂₎, whose support is 0 ≤ y ≤ 4. First let 0 < y ≤ 1; then

F_Y(y) = ∫ from −√y to √y of f_X(t) dt = ∫ from −√y to √y of (1/3) dt = 2√y/3,
¹ Actually, what is essential for the transformation method is that the function be increasing or decreasing on the range.
so for such a y, f_Y(y) = 1/(3√y). Equivalently, we could have reasoned

P(U²₍₋₁,₂₎ ≤ y) = P(−√y ≤ U₍₋₁,₂₎ ≤ √y) = 2√y/3.

On the other hand, let 1 ≤ y ≤ 4; then P(U²₍₋₁,₂₎ ≤ y) = P(−1 ≤ U₍₋₁,₂₎ ≤ √y) = √y/3 + 1/3. So we get
f_Y(y) =  1/(3√y)   for 0 < y ≤ 1,
          1/(6√y)   for 1 ≤ y ≤ 4,
          0         otherwise.
Could we have done this example via the transformation method? No, since the
derivative is 0 inside the interval.
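The piecewise density can be checked by simulation; for instance, integrating 1/(3√y) over (0, 1] gives P(Y ≤ 1) = 2/3:

```python
import random

random.seed(3)
n = 100_000
ys = []
for _ in range(n):
    u = -1 + 3 * random.random()   # uniform on [-1, 2], density 1/3
    ys.append(u * u)               # Y = U^2, support [0, 4]

# From the piecewise density: P(Y <= 1) = integral of 1/(3*sqrt(y)) on (0,1] = 2/3,
# and the remaining mass, integral of 1/(6*sqrt(y)) on [1,4], is 1/3.
p1 = sum(1 for y in ys if y <= 1) / n
print(p1)   # close to 2/3
```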
We end this important section with a fundamental fact, but first a particular instance of a very general theorem below: with μ = E(X), the variance satisfies V(X) = E((X − μ)²). Indeed, expanding the square,

E((X − μ)²) = E(X²) − 2μE(X) + μ² = V(X) + E(X)² − 2μE(X) + μ² = V(X) + μ² − 2μ² + μ² = V(X),

and we are done with the first claim. But since (X − μ)² only takes nonnegative values, its expectation can be 0 only if (X − μ)² is itself the constant 0, that is, X = μ. ∎
In particular, this explains why we can define the standard deviation as the square root of the variance.

This theorem also illustrates part of the reason for the importance of the variance: it measures the (average) distance (in the Euclidean sense) between the random variable and the constant random variable represented by the mean.
Random variables (and the accompanying probability distributions) are always representations of the level of information we possess about an activity or phenomenon. In this section we look at what occurs to a random variable when we gain information. Before we can do that we need to develop the foundational concept of conditional probability.
Example 1. Let us start with a very simple situation. You are to visit a potential customer who is known to have two children. You are speculating on the random variable X that counts the number of boys among his children. Knowing nothing else, the distribution of X (assuming once again boys and girls are equally feasible) is given by

X   0     1     2
P   1/4   1/2   1/4
You arrive at the house and you see a boy playing in the backyard. You ask the customer who the boy is and the customer replies

He is my oldest child.

Let B be the event that the oldest child is a boy. Then what would we conclude about the values of the random variable X if we assume B as a given occurrence? Clearly they have changed to

X|B   0   1     2
P     0   1/2   1/2
Thus, for example, if we let A be the event that both children are boys, then if B is given, the likelihood that A will occur is the same as the likelihood that the second child is a boy, which is simply 1/2.

Suppose instead the customer had only said He is one of my children, and let C be the event that at least one of the children is a boy. Given C, out of the four possibilities for two children, BB, BG, GB and GG, only the last one is ruled out, so our denominator is 3, and again the distribution for X has changed:

X|C   0   1     2
P     0   2/3   1/3

This example deals with the fundamental notion of conditional probability: if A and B are events, with P(B) > 0, the conditional probability of A given B, written P(A|B), is P(A and B)/P(B).
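All three distributions can be confirmed by brute-force enumeration of the four equally likely families (the helper name is ours):

```python
from fractions import Fraction
from itertools import product

# The four equally likely families; the oldest child is listed first.
families = list(product("BG", repeat=2))   # BB, BG, GB, GG

def dist(cond):
    """Distribution of X = number of boys, among families satisfying cond."""
    pool = [f for f in families if cond(f)]
    return {x: Fraction(sum(1 for f in pool if f.count("B") == x), len(pool))
            for x in range(3)}

print(dist(lambda f: True))            # unconditional: 1/4, 1/2, 1/4
print(dist(lambda f: f[0] == "B"))     # given B (oldest is a boy): 0, 1/2, 1/2
print(dist(lambda f: "B" in f))        # given C (at least one boy): 0, 2/3, 1/3
```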
Example 2. The accompanying table lists totals of accidental deaths by age, and also certain specific types, for the United States in 1976.

Age           All Types   Motor Vehicle   Falls    Drowning
Under 5       4,692       1,532           201      720
5 to 14       6,308       3,175           121      1,050
15 to 24      24,316      16,650          463      2,090
25 to 34      13,868      7,888           426      1,060
35 to 44      8,531       4,224           534      520
45 to 54      9,434       4,118           931      500
55 to 64      9,566       3,652           1,340    420
65 to 74      8,823       3,082           1,997    270
75 and over   15,223      2,717           8,123    197
Total         100,761     47,038          14,136   6,827
A randomly selected person from the United States was known to have had an accidental death in 1976. We can address a variety of questions:

• The probability that he/she was 15 or older. This one is very straightforward: there were 100,761 deaths, of which 11,000 were under 15, so the probability we desire is 89761/100761 = 89.08%.
‚ The probability that the cause of death was a motor vehicle accident. Again, readily, 47038/100761 = 46.68%.
These were not conditional statements. The next one is:

ƒ The probability that the cause of death was a motor vehicle accident given that the person was between 15 and 24 years old. Here the pool of people to be considered has become those who are between 15 and 24, so the denominator is 24,316, and the probability is 16650/24316 = 68.47%. Observe that this probability is considerably higher than the previous one, so being between 15 and 24 has a considerable effect on the probability of dying from a motor vehicle accident.
„ Find the probability that the cause of death was a drowning accident given that it was not a motor vehicle accident and the person was 24 or under. Here we have a denominator of 13,959, which is 35,316 (who were 24 or younger) minus 21,357 (who died in motor vehicle accidents), so we have 3860/13959 = 27.65%, considerably higher than the unconditional probability of drowning: 6.77%.
The previous two examples illustrate the effect one event can have on the probability of some other event occurring. The relation given above,

P(A|B) = P(A and B)/P(B) = P(A ∩ B)/P(B),

has two other equivalent formulations, which are equally useful:

P(A and B) = P(B) P(A|B)

and

P(A and B) = P(A) P(B|A).
These latter statements are very useful in situations such as the ones in the following example:
Example 3. We are told that 60% of the population of Harmony believes the mayor should quit, and we are also told that among those people, 60% believe the mayor should be Phyllis. So if we let Q stand for a person believing the mayor should quit, and we let P stand for a person believing the mayor should be Phyllis, then when a person is chosen at random, what we know is that P(Q) = 60%. We are also given that P(P|Q) = 60%, so we know that P(P and Q) = 36%. Note we do not know P(P), since we do not know the support for Phyllis among the people that do not believe the mayor should quit. If we were told that 40% of them favored Phyllis, then we would have that P(P|¬Q) = 40%, and so P(P and ¬Q) = 0.4 × 0.4 = 16%, giving P(P) = 36% + 16% = 52%.

Intuitively, one event may have no effect on another; if that is the case, if for example we are saying event B has no effect on event A, then we should have that P(A|B) = P(A). But as the following shows, much more than that occurs: if P(A|B) = P(A), then also P(B|A) = P(B), and, most symmetrically, P(A and B) = P(A)P(B).
Two events that satisfy the conditions in this fact are called independent events. In many situations we will assume events are independent. Among these are consecutive flips of a coin, tosses of a die, or draws of a card (as long as we shuffle the deck in between draws).

On the other hand, from the example above, we can see that dying from a motor vehicle accident and being between 15 and 24 years of age are not independent events. However, being blue-eyed and being male are probably independent events, so whenever we get a table containing data about blue-eyed males and females and non-blue-eyed males and females, it should approximately be true that P(B|M) = P(B).
Example 4. Consider the random variable that is the toss of two dice:

X   2      3      4      5      6      7      8      9      10     11     12
P   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
Suppose someone claims that the red die shows a 4 (call this event F). How is the random variable affected by this occurrence?

X|F   2   3   4   5     6     7     8     9     10    11   12
P     0   0   0   1/6   1/6   1/6   1/6   1/6   1/6   0    0

Thus P(X = 7|F) = 1/6 = P(X = 7), so X = 7, the event that we roll a 7, and F are independent. But P(X = 6|F) = 1/6 ≠ 5/36 = P(X = 6), so rolling a 6 is not independent of the roll of the red die.
If we instead are given that one of the dice shows a 4 (event G), then the distribution becomes:

X|G   2   3   4   5      6      7      8      9      10     11   12
P     0   0   0   2/11   2/11   2/11   1/11   2/11   2/11   0    0

and we see that none of these rolls and G are independent.
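Both conditional distributions can be confirmed by enumerating the 36 equally likely rolls (the helper name is ours):

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # (red, green): 36 outcomes

def cond_dist(event):
    """Conditional distribution of the sum, given the event."""
    pool = [r for r in rolls if event(r)]
    return {s: Fraction(sum(1 for r in pool if sum(r) == s), len(pool))
            for s in range(2, 13)}

F = cond_dist(lambda r: r[0] == 4)   # the red die shows a 4
G = cond_dist(lambda r: 4 in r)      # at least one die shows a 4
print(F[7], G[7], G[8])
```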
Example 5. Lifetime Once More. A certain type of electronic device has a lifetime T with density

f(t) =  3000/t⁴   for t ≥ 10,
        0         otherwise.
Before, we computed the probability of the device lasting at least 20 hours to be 1/8. Suppose now we are given that the machine has lasted at least 20 hours. What is the probability that it will last at most 30 hours? In other words, we need to compute P(T < 30 | T > 20). Since we have the cumulative distribution from before, F(t) = 1 − 1000/t³, we can readily compute

P(T < 30 | T > 20) = P(T < 30 and T > 20)/P(T > 20) = (F(30) − F(20))/(1 − F(20))
= ((1 − 1000/30³) − (1 − 1000/20³))/(1 − (1 − 1000/20³)) = (1/8 − 1/27)/(1/8) = 1 − 8/27 = 19/27 ≈ 70.37%.

Note that then P(T ≥ 30 | T > 20) = 1 − P(T < 30 | T > 20) = 8/27 ≈ 29.63%.
In a sense, conditional probability has been around the subject from its inception. Two of the first contributors to the subject were the great French mathematicians of the 17th century, Fermat and Pascal, who corresponded on the famous problem of points: a game of chance is interrupted before its end, and the stakes are to be divided fairly according to each player's chances of winning at the moment of interruption.

Example 6. The Problem of Points. You may think of this problem as that of flipping a coin until you get a total of 6 heads or 6 tails. Let us say then that 5 tosses had been heads and 3 had been tails. The problem had been around for a long time, and several proposed answers had been given, including 2:1 and 5:3. Pascal corresponded with Fermat on it, and they both solved it correctly, but in very different ways. We will solve it in a modern, aggressive way, probably intellectually closer to Fermat's
way. Let X be the random variable that counts the number of tosses until the game would have been over. Then its distribution is

X   1     2     3
P   1/2   1/4   1/4

The reasons are: X = 1 must mean the next toss was a head, and there is a 1/2 chance of that; X = 2 occurs only if we first get a tail and then a head; and we will be done in 3 tosses if we either get TTH or TTT. Since the only way the tail player wins is when X = 3, and in only one half of those occurrences, the probability that T wins is 1/8. So the stakes should be divided 7-to-1, or 56 coins for H and 8 for T.
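Fermat's device (play out all three remaining tosses regardless of whether the game has already ended) is easy to enumerate:

```python
from itertools import product

# H needs 1 more head, T needs 3 more tails. Play out all 3 remaining
# tosses anyway; the 8 outcomes are equally likely.
t_wins = 0
for tosses in product("HT", repeat=3):
    # T wins only if all three remaining tosses are tails.
    if tosses.count("T") == 3:
        t_wins += 1
print(t_wins, "out of", 2 ** 3)   # 1 out of 8, so the odds for H are 7-to-1
```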
A word one encounters often in the outside world is odds. In general, the odds for an event are the probability of the event occurring divided by the probability of it not occurring. Equivalently, it is how the stakes should be divided. So in the previous situation, the odds for H to win are 7-to-1. The odds for T to win are 1-to-7. If you want to bet on T, for every $1 you contribute to the pot, your opponent should contribute $7.
We next review another famous problem, but this one more contemporary. It illustrates the subtleties inherent in probability; but just because it is subtle you shouldn't mistrust it, instead you should take the opportunity to fine-tune your brain. It is a famous problem that has amused, confused and bedazzled many people.
Example 7. We are going to play a game where I am going to give you a choice of three doors. Behind one is an extra point for this class; behind the other two, nothing. You pick a door, and all-knowing I, before showing you what is behind your chosen door, open another door which has nothing behind it. I then give you a choice of either retaining your door or switching to the only other unopened door. Suppose you were doing this every day there is class. What should be your standard operating procedure?
Before we discuss the problem as stated, let us solve a simpler problem. Suppose I had just given you a door to choose out of 3 doors. Then nobody would argue that you have a 1/3 chance of guessing the correct door. Is that agreed on?

Now, let's convolute the problem by my showing you a door without a prize. One easily arrived at, yet wrong, conclusion is that it really does not matter whether you have a standard procedure. This wrongful reasoning goes as follows. Originally each door had a 1/3 chance of having the one point behind it. One of them has been eliminated, so now each door has a 1/2 chance of having the point behind it, hence it is really the same if you switch as if you don't switch. Isn't this absolutely reasonable?
I will try to convince you that it is not. But first let me get a little philosophical and point out that what probability tries to accomplish is to measure uncertainty, and that the only uncertainty in this problem is strictly from your point of view. (After all, I know everything.) Hence you have to try to use every bit of information available to you (sort of squeeze blood out of rocks). What piece of information have you not weighed in the argument in the previous paragraph? The fact that although under no circumstances would I show you what is behind your door, I perhaps had a choice of which door to show you, and that I indeed chose the door I chose. More directly put, it is correct to say that at the beginning every door has a 1/3 chance of hiding the point. But what is crucial is that the new information could not have affected the probability behind your door; instead it has affected the probabilities of the other two, one going to 0 and the other to 2/3. And indeed it behooves you to switch doors every time!
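A simulation makes the point vividly (the host rule coded here, opening an empty non-chosen door, is the one described above; when the host has two empty doors to choose from, the tie-break does not affect the result):

```python
import random

random.seed(4)

def play(switch, n=100_000):
    wins = 0
    for _ in range(n):
        prize = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a door that is neither the pick nor the prize.
        opened = next(d for d in range(3) if d != pick and d != prize)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / n

print(play(switch=False))   # about 1/3
print(play(switch=True))    # about 2/3
```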
We next review a classical result: Bayes’ Theorem. In 1761, the Reverend Thomas Bayes
passed away unaware that among his unpublished manuscripts lay a paper that would eventually
make his name known to every student of probability. In his own words,
Given the number of times in which an unknown event has
happened and failed:
Required the chance that its probability of happening in a single trial
lies somewhere between any two degrees of probability that can be
named.
Example 8. Treasure Hunting. There are three chests, each with two drawers. In each drawer there is a coin. In chest Œ there is one gold coin and one silver coin. In chest • there are two silver coins, and in chest Ž there are 2 gold coins. A chest is selected by rolling a fair die. If the die comes up even, chest Œ is selected. If it comes up 1 or 3, chest • is selected, while chest Ž is selected only if a 5 is rolled. Once the chest is selected, one of the drawers is selected at random and the coin in that drawer is observed.
The best way to sort out this information is to use a tree diagram:

[Tree diagram: Start → Œ with probability 1/2, → • with probability 1/3, → Ž with probability 1/6; from Œ: gold 1/2, silver 1/2; from •: silver 1; from Ž: gold 1.]

In the picture, each probability is the probability of traveling the arrow to which it is assigned. Thus the arrow from the start to Œ carries 1/2, and similarly the arrow from • to silver carries 1, because we have to draw silver if we are in chest •.

Note that the last statement was a conditional statement: the probability of silver given that we are in chest •, P(S|•). But we know from conditional probability that then P(S and •) is the product P(•)P(S|•), which equals 1/3.
This exhibits a nice way to compute the probability of going from the starting point to a terminal node: simply multiply the probabilities of the arrows in the path. Now we have the ability to answer a variety of questions:

• What is the probability that the coin observed is gold?

Since we can observe gold by first going to either Œ or Ž, we have

P(G) = P(Œ)P(G|Œ) + P(Ž)P(G|Ž) = 1/2 × 1/2 + 1/6 × 1 = 1/4 + 1/6 = 5/12.

‚ What is the probability that the coin observed is silver?

Similarly to the previous question,

P(S) = P(Œ)P(S|Œ) + P(•)P(S|•) = 1/2 × 1/2 + 1/3 × 1 = 1/4 + 1/3 = 7/12.

Of course, with a bit of reflection we could have deduced this fact with no computation, since it is the nonoccurrence of gold that produces silver.
But so far, we have not used Bayes' ideas. The next question does require going back on the tree, and that is essential to his ideas:

ƒ Given that the coin observed is gold, what is the probability that chest Œ was selected?

Here we have

P(Œ|G) = P(Œ and G)/P(G) = (1/4)/(5/12) = 3/5.

Again, we could have predicted the outcome, since given G, we must have had either Œ or Ž, and these are disjoint events.
Observe that the last question asked the probability of a prior event, and that reversal is essential to Bayes' ideas. The idea of reversing the process is also essential to modern applications such as hypothesis testing.
Example 9. Medical Testing. Suppose a disease afflicts 1 in every 100 people, and a test for it comes out positive on a patient who has the disease 99.99% of the time. A patient has just tested positive. Do we know enough to evaluate the patient's chances of having the disease? No, not really. We are missing a key piece of information; the patient needs to ask the doctor the following simple question: What is the probability of testing positive if I do not have the disease? The question addresses the existence of false positives, and these are rarely openly discussed, but they are essential for the computation. Suppose we find that there is only a 5% chance of a false positive, namely we have a 5% chance of testing positive if we are perfectly healthy. Now we can indeed compute. Setting up the tree, where we let H stand for the event that the patient is healthy and S that the patient has the disease, while + stands for testing positive:

[Tree diagram: Start → H (0.99) and S (0.01); from H: + (0.05), − (0.95); from S: + (0.9999), − (0.0001).]

Thus, what we have is that the patient tested positive. Given that, what is the probability that the patient is sick?
Before we can compute P(S|+), we need to compute P(+). Since one can test positive by either being sick or healthy, we have that

P(+) = P(+ and H) + P(+ and S) = 0.99 × 0.05 + 0.01 × 0.9999 = 0.059499.

And so

P(S|+) = P(S and +)/P(+) = .009999/.059499 = 16.80%.

Thus the odds that the patient indeed is sick are less than 1 to 4. Surprising!!
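The computation in a few lines:

```python
p_sick = 0.01
p_pos_given_sick = 0.9999
p_pos_given_healthy = 0.05     # the false-positive rate

# Total probability of testing positive, then Bayes' rule.
p_pos = (1 - p_sick) * p_pos_given_healthy + p_sick * p_pos_given_sick
p_sick_given_pos = p_sick * p_pos_given_sick / p_pos
print(p_pos, p_sick_given_pos)   # 0.059499 and about 0.168
```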
Example 10. Quality Control. Suppose as part of admission to graduate school a test is given. From past occurrences, it is known that if a candidate is qualified to enter graduate school, he will succeed on the test 95% of the time, while an unqualified candidate will pass the test only 25% of the time. The school has given the test to applicant Lewis, and he has passed. What is the probability he was qualified to attend graduate school? As it is, we do not know enough to answer the question.

The school clarifies that in the past, 80% of the people that applied were indeed qualified. Now we do know enough. The tree is similar to the others:

[Tree diagram: Start → Q (0.8) and U (0.2); from Q: P (0.95), F (0.05); from U: P (0.25), F (0.75).]
So now we can compute P(Q|P) for Lewis. We need, as usual, P(Q and P) = 0.8 × 0.95 = 0.76. But we also need

P(P) = P(P|Q)P(Q) + P(P|U)P(U) = 0.76 + 0.05 = 0.81.

And so

P(Q|P) = 0.76/0.81 = 93.83%.

Thus we are fairly confident of Lewis' ability to succeed in graduate school.
A separate yet interesting question is what P(Q|F) equals. This represents the probability of being left behind (since the test was failed) while still being qualified. In fact,

P(Q|F) = P(Q and F)/P(F) = .04/.19 = 21.05%.
Above we saw how the occurrence of an event creates a new random variable from an old one, such as in the examples about the parent with the two children, or the roll of the dice. In a similar fashion, we can compute the probability of an event by conditioning on a random variable. The following example should be interesting.
Example 11. Craps. In the American game of CRAPS, a shooter rolls two dice. The shooter
wins if she rolls a 7 or 11 to start with, and loses to start with if she rolls a 2, 3 or 12. If the first
roll is anything else, namely, 4, 5, 6, 8, 9 or 10, then that roll becomes the shooter’s point, and
she keeps rolling until either she rolls a 7 (she loses) or her point (she wins). We are interested
in the probability the shooter wins.
Let X be the first roll of the two dice, and let B be a Bernoulli random variable which is 1 if the player wins.

What is the probability the shooter wins if the first roll was a 4, P(B = 1|X = 4)? Given X = 4, we have B = 1 precisely when, rolling until either a 4 or a 7 appears, the shooter gets a 4. Therefore, letting Y be another roll of two dice,

P(B = 1|X = 4) = P(Y = 4 | Y = 4 or Y = 7).

But the latter is easily computed:

P(Y = 4 | Y = 4 or Y = 7) = P(Y = 4 and (Y = 4 or Y = 7))/P(Y = 4 or Y = 7) = P(Y = 4)/P(Y = 4 or Y = 7) = (3/36)/(9/36) = 3/9.
But what we are really interested in is P(B = 1). The following should be clear, since these are disjoint events:

P(B = 1) = Σ from i = 2 to 12 of P(B = 1 and X = i) = Σ from i = 2 to 12 of P(X = i) P(B = 1|X = i).
X          2      3      4      5      6      7      8      9      10     11     12
P          1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
P(B=1|X)   0      0      3/9    4/10   5/11   1      5/11   4/10   3/9    1      0
Putting it all together,

P(B = 1) = 6/36 + 2/36 + 2(3/36 × 3/9 + 4/36 × 4/10 + 5/36 × 5/11) = 244/495 ≈ 49.29%,

just under one half.
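Conditioning on the first roll gives the exact answer, which a simulation of the game confirms:

```python
import random
from fractions import Fraction

# Exact answer by conditioning on the first roll.
ways = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
p_win = Fraction(0)
for first, w in ways.items():
    p_first = Fraction(w, 36)
    if first in (7, 11):
        p_win += p_first                                   # immediate win
    elif first not in (2, 3, 12):
        # Point: win if the point appears before a 7.
        p_win += p_first * Fraction(ways[first], ways[first] + 6)
print(p_win)   # 244/495

# Monte Carlo check.
random.seed(5)
def roll():
    return random.randint(1, 6) + random.randint(1, 6)

wins, n = 0, 100_000
for _ in range(n):
    first = roll()
    if first in (7, 11):
        wins += 1
    elif first not in (2, 3, 12):
        while True:
            r = roll()
            if r == first:
                wins += 1
                break
            if r == 7:
                break
print(wins / n)
```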
‘ Independence

In this last section of this chapter we look at a crucial concept: independent random variables.

We have defined two events A and B to be independent if, equivalently, any one of the following occurs:

P(A|B) = P(A),

or P(B|A) = P(B),

or, most symmetrically,

P(A and B) = P(A)P(B).

Let X and Y be random variables. We say they are independent if and only if for all numbers a and b,

P(X ≤ a and Y ≤ b) = P(X ≤ a)P(Y ≤ b).

In other words, X and Y are independent if the events X ≤ a and Y ≤ b are always independent events.
Now that we have the definition of independent random variables, we can discuss combinations of more than one random variable, at least in this case of independence.¹

¹ We will have combinations of arbitrary random variables in Chapter 3.
One of the most interesting random variables is the sum of n independent Bernoullis with the same p. As usual, let us start with a specific example.

It is time to bring out the nicest theorem about expectations, and one of the nicest theorems in the course:

Thus, the expectation of a sum is the sum of the expectations. Note that no independence of the variables is required; this is true at any time. The complete proof will be postponed. Nevertheless it is worth pointing out that in the independent discrete case, the proof is straightforward:
E(X + Y) = Σᵢ i P(X + Y = i) = Σᵢ Σ over j+k=i of (j + k) P(X = j and Y = k)
= Σᵢ Σ over j+k=i of (j + k) P(X = j)P(Y = k)
= Σⱼ Σₖ [ j P(X = j)P(Y = k) + k P(X = j)P(Y = k) ]
= Σ over j,k of j P(X = j)P(Y = k) + Σ over j,k of k P(X = j)P(Y = k)
= Σⱼ j P(X = j) Σₖ P(Y = k) + Σⱼ P(X = j) Σₖ k P(Y = k)
= Σⱼ j P(X = j) + Σₖ k P(Y = k) = E(X) + E(Y). ∎
Note that thus the random variable X of the previous example will have expectation E(X) = 4p, since the expectation of B_p is p.

Example 3. A father makes a deal with his two daughters. Cicely is to roll a die and Maude is to draw a card, and the father will give his daughters the sum of the roll and the draw (where king is 13, queen is 12, et cetera), in dollars. It is clear, if we let C be Cicely's roll and M be Maude's draw, that these are independent random variables. We are interested in the random variable X = C + M.
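By the theorem, E(X) = E(C) + E(M) = 3.5 + 7 = $10.50; a quick exact check:

```python
from fractions import Fraction

E_C = Fraction(sum(range(1, 7)), 6)      # die: 21/6 = 7/2
E_M = Fraction(sum(range(1, 14)), 13)    # card ranks 1..13: 91/13 = 7
print(E_C + E_M)                         # 21/2, that is, $10.50
```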
The following elegant and subtle example illustrates the power of the theorem.

Example 4. The Case of the Absent-Minded Professor. A professor has written n letters to his friends and has addressed the corresponding n envelopes. But when the time came to put the letters in the envelopes, the professor got distracted with some wonderful mathematical thoughts, and so the professor proceeds to put the letters in the envelopes at random, keeping, of course, one letter to an envelope (the professor is NOT that absent-minded!!). How many letters do we expect to go in the correct envelope?

Let the letters be denoted by 1, 2, …, n, and when we write a permutation of these numbers, the first number listed goes in envelope 1, the second number listed goes in the second envelope, and so on. Thus, for example, 4 2 1 5 3 has only one letter, #2, going to the correct person.
Let us start with some small cases. Let n = 2 , and so there are two possibilities, both
equally feasible—in one both letters get in the correct envelope while in the other none
does, so the average is 1.
Let n = 3; of the six possibilities, the number of correct letters is given in parentheses: 1 2 3 (3), 1 3 2 (1), 2 1 3 (1), 2 3 1 (0), 3 1 2 (0) and 3 2 1 (1). Or equivalently, if we let X be the random variable that counts the number of correct letters, then

X   0     1     2   3
P   1/3   1/2   0   1/6

and E(X) = 0 × 1/3 + 1 × 1/2 + 3 × 1/6 = 1 once more.

Let n = 4. Then, out of the 24 permutations, there is one permutation that has 4 correct letters, 6 that have two correct (pick which two, and then swap the other two), 8 that have one correct (pick the correct one (4 choices), and then there are two permutations of three letters that leave none correct), and the remaining 9 permutations leave none correct, so we get, once again,

E(X) = (1/24)(1 × 4 + 6 × 2 + 8 × 1) = 1.
Let us now consider the general case. Certainly the distribution is not necessarily easy to compute, but the expectation is! For each 1 ≤ i ≤ n, define a Bernoulli random variable Bᵢ as follows:

Bᵢ =  1   if letter i is correct,
      0   otherwise.

Note that the Bᵢ's are not independent, and in fact we know at present little about how they interact. However, the random variable we are interested in is X = B₁ + B₂ + ⋯ + Bₙ, and the distribution of that random variable is not easy, as mentioned above. But E(Bᵢ) = 1/n, since there are (n − 1)! (out of n!) permutations with the i-th letter in the correct spot. But then

E(X) = E(B₁ + ⋯ + Bₙ) = E(B₁) + ⋯ + E(Bₙ) = 1/n + ⋯ + 1/n = 1,

and so, surprisingly, regardless of the number of letters, we should expect exactly one letter to be in the correct envelope.
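A simulation sketch confirms it: the average number of fixed points of a random permutation hovers near 1, whatever n is.

```python
import random

random.seed(6)
n_letters = 10
trials = 100_000
total_correct = 0
for _ in range(trials):
    perm = list(range(n_letters))
    random.shuffle(perm)           # a uniformly random assignment of letters
    total_correct += sum(1 for i, p in enumerate(perm) if i == p)
print(total_correct / trials)      # close to 1, regardless of n_letters
```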
The following theorem is proven via the distributive property of ands over ors: if X and Y are independent, then so are g(X) and Y for any function g. Indeed, if x₁, x₂, … are the values of X for which g(xᵢ) = a, then

P(g(X) = a and Y = b) = Σᵢ P(X = xᵢ and Y = b) = Σᵢ P(X = xᵢ)P(Y = b) = P(g(X) = a)P(Y = b). ∎
We saw before that the expectation of a sum is the sum of the expectations regardless of whether the random variables were independent or not; but in order to state that the expectation of a product is the product of the expectations, one does indeed need independence. In the discrete case,

E(XY) = Σᵢ Σ over jk=i of jk P(X = j)P(Y = k) = Σⱼ j P(X = j) Σₖ k P(Y = k) = E(X)E(Y). ∎
Then, for independent X and Y,

V(X + Y) = E((X + Y)²) − (E(X) + E(Y))² = E(X²) + 2E(XY) + E(Y²) − E(X)² − 2E(X)E(Y) − E(Y)² = V(X) + V(Y),

so we are done. ∎

We will use this corollary extensively throughout the course, but remember that independence was required: the variance of a sum is the sum of the variances only for independent variables.
Example 5. The Unit Circle. Let W be the random variable that selects a point of the unit disk at random. The unit disk consists of the unit circle and its inside. We do have an intuitive sense of what the random variable W is doing. For example, if we let Z = |W| denote the distance from the chosen point to the origin, we could ask for P(Z ≤ 1/3). The question is equivalent to asking: what is the probability the point chosen is within distance 1/3 from the center? The answer is
π(1/3)²/(π · 1²) = 1/9. In fact, in a very similar fashion, we can compute, for every 0 ≤ r ≤ 1, P(Z ≤ r) = r², so F_Z(r) = r², and so its density is f_Z(r) = 2r in the range 0 ≤ r ≤ 1.
1
Now we can compute E ( Z ) as E ( Z ) = ∫ r 2rdr = .
2
0
3
But suppose we try to obtain W, and all we have at our disposal are standard uniforms. Can we build W from them? One idea is to use the polar form of a point in the disk, namely (x, y) = (r cos θ, r sin θ), where r (the distance to the origin) is between 0 and 1 and 0 ≤ θ ≤ 2π. So perhaps, letting U and V be independent standard uniforms, we could try

W = (U cos(2πV), U sin(2πV)).

However, we have an immediate problem with this, since then Z = U, and we know Z is not uniform.
[Scatter plot: points sampled this way cluster near the origin.] The fix is to take instead

W = (√U cos(2πV), √U sin(2πV)),

since then Z = √U has the correct distribution: P(√U ≤ r) = P(U ≤ r²) = r². When we do that we get a more balanced picture. [Scatter plot: points now fill the disk evenly.]

Now consider the first coordinate, X = √U cos(2πV), and its expectation E(X) = E(√U cos(2πV)). But since U and V are independent, so are √U and cos(2πV), hence E(X) = E(√U)E(cos(2πV)); and since E(cos(2πV)) = ∫₀¹ cos(2πv) dv = 0, we get E(X) = 0. This is not surprising, since there is an equal distribution to the right of the y-axis as to the left.
Another concept closely related to the expectation of a product is that of the covariance of two random variables. Let X and Y be random variables. Then their covariance is simply defined by

cov(X, Y) = E(XY) − E(X)E(Y),

the expectation of their product minus the product of their expectations. Note that

cov(X, X) = E(X²) − E(X)² = V(X).
Two variables that have covariance 0 are said to be uncorrelated. The reason for the name uncorrelated is because there is a concept called the coefficient of correlation², defined by

ρ = cov(X, Y)/(σ_X σ_Y).

It is a fact that −1 ≤ ρ ≤ 1.
If X and Y are independent, then cov ( X , Y ) = 0 .
Thus, two independent variables are uncorrelated. But the following example shows
that two variables may be uncorrelated and yet not be independent.
² We will not have much use in this course for the coefficient of correlation, but it is an important concept.
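A standard instance of uncorrelated-but-dependent variables (this particular choice of X and Y is ours, not necessarily the one the text had in mind): X uniform on {−1, 0, 1} and Y = X².

```python
from fractions import Fraction

xs = [-1, 0, 1]
p = Fraction(1, 3)     # X is uniform on {-1, 0, 1}; Y = X^2

E_X = sum(p * x for x in xs)               # 0
E_Y = sum(p * x * x for x in xs)           # 2/3
E_XY = sum(p * x * x * x for x in xs)      # E(X^3) = 0
cov = E_XY - E_X * E_Y
print(cov)                                 # 0: X and Y are uncorrelated

# Yet they are far from independent:
p_joint = Fraction(0)      # P(X = 1 and Y = 0) is impossible
p_prod = p * p             # P(X = 1) * P(Y = 0) = 1/9
print(p_joint == p_prod)   # False
```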
Chapter 2
A Visit to the Zoo
Œ Sampling
In this chapter we will look at some of the most important types of random variables,
both discrete and continuous. In this section we look at two special types of random
variables that occur often in applications. Let us start with a generic example dealing with
the testing of refrigerators as to whether they are defective or not.
Example 1. Suppose a shipment of 10 refrigerators has arrived, of which 3 are defective. You are to choose 4 of the refrigerators for a store. Let X be the number of defective refrigerators in the sample of 4 that you chose. Its distribution is

X   0        1         2        3
P   35/210   105/210   63/210   7/210
    .1667    .5        .3       .0333

since

P(X = 0) = C(3,0)C(7,4)/C(10,4), P(X = 1) = C(3,1)C(7,3)/C(10,4), P(X = 2) = C(3,2)C(7,2)/C(10,4), P(X = 3) = C(3,3)C(7,1)/C(10,4).
Its expectation is E(X) = (105 + 126 + 21)/210 = 252/210 = 1.2, which is not surprising at all: since 30% of the population is defective, 30% of the sample is expected to be defective. The mode is 1, and so is the median.

The variance is (105 + 4 × 63 + 9 × 7)/210 − 1.2² = 420/210 − 1.44 = 2 − 1.44 = 0.56, and so the standard deviation is 0.7483.
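The distribution and its moments can be computed with binomial coefficients (the function name is ours):

```python
from math import comb

def hyper_pmf(N, m, k):
    """P(X = x): x defectives when drawing k from N objects, m defective."""
    return {x: comb(m, x) * comb(N - m, k - x) / comb(N, k)
            for x in range(min(m, k) + 1)}

pmf = hyper_pmf(10, 3, 4)
print(pmf)
mean = sum(x * p for x, p in pmf.items())
print(mean)        # 1.2, which is 4 * (3/10)
```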
Example 2. Production Line. But there is another way to have a sampling mechanism.
Suppose now we are pulling refrigerators from a production line, and we are going to pull
4 of them at random. It is known that 30% of refrigerators from that plant are defective (it
is not recommended we purchase refrigerators from this company). Again we are
interested in the distribution of the number of defectives in the sample.
Let Y denote this random variable. Then what is the probability that Y = 0? If that is to be the case, then the first refrigerator has to be good, and so does the second, etcetera, and since we are pulling them from the production line at random, we treat these events as independent events, and so we get that

P(Y = 0) = .7 × .7 × .7 × .7 = .2401.
What is the probability that Y = 1? In that case, one of the refrigerators has to be defective; which one? The first one? The second one? Etcetera. If the first, we have probability .3 × .7 × .7 × .7, while if the second one is defective, we have .7 × .3 × .7 × .7. Similarly, for the remaining two possibilities (the third one and the fourth), we get the respective probabilities .7 × .7 × .3 × .7 and .7 × .7 × .7 × .3. So each of them has the same probability, .3¹ × .7³, and if we just count the occurrences we will readily have our answer. Obviously we have C(4,1) = 4 choices, so our answer is simply

P(Y = 1) = C(4,1) .3¹ .7³ = .4116.

Proceeding in the same way for the other values:

Y   0             1             2             3             4
P   C(4,0).3⁰.7⁴  C(4,1).3¹.7³  C(4,2).3².7²  C(4,3).3³.7¹  C(4,4).3⁴.7⁰
    .2401         .4116         .2646         .0756         .0081
Not surprisingly, the expectation of Y is the same as the expectation of X in the previous example, E(Y) = 1.2. However, the reasoning is slightly different: if one has a 30% chance of succeeding at something (such as getting a hit at bat), then in 4 attempts one expects 30% of 4, namely 1.2, successes (hits at bat). The mode is also 1 and so is the median.
Both of the previous examples deal with choosing a sample. In Example 1, the sample was picked from a fixed population with two kinds of objects (defective vs. good). Thus, in a hypergeometric random variable, there are 3 parameters: N, the number of individuals in the total population; m, the number of individuals of the distinguished kind; and k, the size of the sample. Such a variable is referred to as H_{N,m,k}. Thus, Example 1 involved H_{10,3,4}.
But one can also think of a binomial as a process of sampling. Rather than selecting
from a fixed limited population, one can think of a growing population (or an unlimited
population such as a production line), as in Example 2. The binomial random variable
has two parameters, n the number of trials and p the probability of success at a trial,
so it will be referred to as B_{n,p}. Thus, Example 2 concerned B_{4,.3}. Although there are
only two parameters, one often associates a third letter with a binomial, and that is
q = 1 − p , which in the example above was q = .7 , the probability of failure.
One could consider a beer factory, where 10% of the bottles are defective (not filled
enough, or overfilled, or broken glass, or not sealed properly). If one selects 20 bottles at
random from the population of bottles, and is interested in the number of defectives, then
one is doing a sampling process, but it is not hypergeometric, since we do not have a
fixed population to choose from, so we are using B_{20,.10}.
The key difference is that each choice is independent of the next, which does not happen
when we are doing a hypergeometric: the choice of a defective refrigerator in the first
example has an effect on the probability of repeating that selection afterwards. Thus,
sometimes one refers to sampling without replacement as a hypergeometric, while one
can think of sampling with replacement as a binomial.
In the examples above we contrasted H_{10,3,4} versus B_{4,.3}. However, if there had been a
large collection of refrigerators rather than 10, we would see a much better fit, since then
choosing the next refrigerator is almost independent from what we have done before. For
example, the table below reflects the values for B_{4,.3} and H_{100,30,4}:

                 0       1       2       3       4
B_{4,.3}        .2401   .4116   .2646   .0756   .0081
H_{100,30,4}    .2338   .4188   .2679   .0725   .0070
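The comparison above is mechanical to reproduce. Here is a minimal sketch in Python (standard library only); the helper names binom_pmf and hyper_pmf are ours:

```python
from math import comb

def binom_pmf(n, p, x):
    """P(B_{n,p} = x): exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def hyper_pmf(N, m, k, x):
    """P(H_{N,m,k} = x): exactly x marked individuals in a sample of k
    drawn without replacement from N individuals, m of which are marked."""
    return comb(m, x) * comb(N - m, k - x) / comb(N, k)

# reproduce the B_{4,.3} versus H_{100,30,4} table above
for x in range(5):
    print(x, round(binom_pmf(4, 0.3, x), 4), round(hyper_pmf(100, 30, 4, x), 4))
```

As N grows with m/N held near p, the two columns printed by this loop come closer together, which is exactly the point of the table.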
Example 3. Suppose a committee of 6 is to be chosen at random from a faculty
consisting of 12 males and 10 females, and all faculty members are equally eligible. Let
X count the number of males in the committee. We recognize this random variable as the
hypergeometric H_{22,12,6}, so E(X) = 6 × 12/22 = 36/11 ≈ 3.27. Also

E(X^2) = 883728/74613 ≈ 11.8441, so V(X) = E(X^2) − E(X)^2 ≈ 1.1334 and σ ≈ 1.0646.
How shocked would we be if the committee were all of one gender? We expect a bit more
than 3 males, so to go to either 0 or 6 males takes a hefty jump of approximately 3 standard
deviations, so it should not be something that one expects to happen, a bit shocking
indeed. In this particular case, we actually have the unlikely probabilities 1.24% (for all
male) and 0.28% (for all female). Thus, conceivably one could go to court on either, but
more safely in the latter case of an all-female committee.
Note that if we had wanted the number of females in the committee, we would be dealing
with H_{22,10,6}. Easily,

H_{22,10,6} = 6 − H_{22,12,6},

so once we understand one of them, the other one follows.
Example 4. Suppose you come to take a test totally unprepared. The test consists of 10
True-False questions, each of which you will answer at random (but you will answer
them all, since there is no penalty for guessing). How likely is it that you will achieve a
passing score of 70% or better? Although this is not quite a sampling problem, if we
consider the random variable X that counts the number of correct answers you obtain,
that random variable is a binomial random variable.
Think of answering each question as an independent event (after all, we are totally
ignorant), so the probability of answering any question correctly is 1/2 (just like the flip of
a coin). Once we assume the independence of consecutive answers (or flips of the coin),
then to get 3 correct answers we have to choose which three questions to answer
correctly (and we have C(10,3) ways to choose them). After we have done that, let us say
questions 1, 5 and 7, then we must have CWWWCWCWWW as our answers, and since
each answer has probability 1/2 of occurring, because of the independence we end up
with C(10,3) × (1/2)^10 as our probability of exactly 3 correct answers, which is the same
as the probability that B_{10,.5} takes the value 3.
Proceeding in this fashion we can compute the distribution of the random variable
X = B_{10,.5} that counts the number of correct answers:

X    0        1         2         3          4          5          6          7          8         9         10
P   1/1024   10/1024   45/1024   120/1024   210/1024   252/1024   210/1024   120/1024   45/1024   10/1024   1/1024
%   .097     .97       4.4       11.7       20.5       24.6       20.5       11.7       4.4       .97       .097
Thus P(X ≥ 7) = 176/1024, and hence the probability of getting a passing score in the
exam is more than 17%. Not bad for total ignorance!
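Since B_{10,.5} puts probability C(10,x)/1024 on each value x, the passing probability is a one-line count; a quick sketch:

```python
from math import comb

# P(X >= 7) for X = B_{10,1/2}: at least 7 correct among 10 guessed True-False answers
favorable = sum(comb(10, x) for x in range(7, 11))  # 120 + 45 + 10 + 1 = 176
passing = favorable / 2**10
print(favorable, passing)  # 176 0.171875
```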
The next example will show a small yet important variation of this example.
Example 5. Suppose you come to take a test (totally unprepared as usual), but that the
test is Multiple Choice, with ten questions and each question having three choices,
only one of them correct. Again we compute the distribution for the random variable that
counts the number of correct answers. Let us repeat the computation of exactly three
correct answers. Again we have C(10,3) = 120 ways of choosing which 3 questions will be
answered correctly, and after that, if for example again we are to have
CWWWCWCWWW, the probability of this is
(1/3)(2/3)(2/3)(2/3)(1/3)(2/3)(1/3)(2/3)(2/3)(2/3) = 2^7/3^10, and so, since 3^10 = 59049,
we arrive at a probability of 120 × 2^7/59049 = 15360/59049 of exactly three correct
answers. The full distribution is:
X    0            1            2             3             4             5            6            7           8           9          10
P   1024/59049   5120/59049   11520/59049   15360/59049   13440/59049   8064/59049   3360/59049   960/59049   180/59049   20/59049   1/59049
%   1.73         8.67         19.51         26.01         22.76         13.65        5.67         1.63        .30         .03        .001
A binomial random variable counts the number of successes among n independent trials,
at each of which we have probability p of success. Of course, what counts as a success is
very much of our choosing. Equivalently, B_{n,p} is the sum of n independent Bernoulli
random variables with probability p: B_{n,p} = B_1 + ⋯ + B_n, where each B_i is a B_p random
variable. Hence, since E(B_p) = p and V(B_p) = pq, we obtain, by the expectation of a sum
and the variance of a sum of independents,

E(B_{n,p}) = np and V(B_{n,p}) = npq.
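These two formulas are easy to check empirically by actually summing Bernoulli draws; a small simulation sketch (the sample size and seed are arbitrary choices of ours):

```python
import random

def simulate_binomial(n, p, trials=100_000, seed=1):
    """Estimate E and V of B_{n,p} by summing n independent Bernoulli(p) draws."""
    random.seed(seed)
    samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, var

mean, var = simulate_binomial(4, 0.3)
print(mean, var)  # should land close to np = 1.2 and npq = 0.84
```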
Let us concentrate our sights on Ms. J, one of the Southern neighbors. What is the
probability that she gets chosen to reside in the golden state? Let J denote that event;
then

P(J) = C(N−1, k−1)/C(N, k) = [(N−1)!/((k−1)!(N−k)!)] × [k!(N−k)!/N!] = k/N,

so P(not J) = (N−k)/N. Let Y = (X | J) − 1 and let Z = X | ¬J. Then we readily observe
that Y = H_{N−1,m−1,k−1} while Z = H_{N−1,m−1,k}, and so, conditioning on J,

E(X) = (k/N) [(m−1)(k−1)/(N−1) + 1] + ((N−k)/N) [(m−1)k/(N−1)]
     = (k/N) (mk − k − m + 1 + N − 1)/(N−1) + ((N−k)/N) (mk − k)/(N−1)
     = (mk^2 − k^2 − mk + Nk)/(N(N−1)) + (Nmk − Nk − mk^2 + k^2)/(N(N−1))
     = (−mk + Nmk)/(N(N−1)) = mk/N.
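The closed form E(H_{N,m,k}) = mk/N can be confirmed directly from the distribution with exact rational arithmetic; a small sketch (the helper name hyper_mean is ours):

```python
from fractions import Fraction
from math import comb

def hyper_mean(N, m, k):
    """E(H_{N,m,k}) computed term by term from the pmf."""
    return sum(Fraction(comb(m, x) * comb(N - m, k - x), comb(N, k)) * x
               for x in range(min(m, k) + 1))

# Example 1 (N=10, m=3, k=4): the mean should be 3*4/10 = 6/5
print(hyper_mean(10, 3, 4))
```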
Since both random variables are associated with sampling, we should not be surprised if
there is some connection between the parameters of the hypergeometric X = H_{N,m,k} and
the binomial Y = B_{n,p}. Clearly the role of n in the binomial is played by k in the
hypergeometric, since that is the size of the sample being taken. Also naturally, the
probability of success, p, is nothing but m/N, and so the probability of failure is
q = (N−m)/N. Having translated, now we see that

E(X) = km/N = np = E(Y)

while

V(X) = k × (m/N) × ((N−m)/N) × ((N−k)/(N−1)) = npq × (N−k)/(N−1) = V(Y) × (N−k)/(N−1),

and the odd term (N−k)/(N−1) tends to 1 if N is very large and k remains small.
Example 6. The Roll of the Dice. Suppose we roll five dice simultaneously. We can
view that as a binomial since the rolls are assumed independent. Let us say the roll of a
particular face counts as a success, so p = 1/6 and n = 5. Then the distribution is given by

X     0       1       2       3      4     5
P%   40.19   40.19   16.08   3.22   .32   .01

The mean is np = 5/6 ≈ 0.83 and so is the standard deviation, √(npq) = √(25/36) ≈ 0.83,
so the possible values that are one σ away from μ (this is a common reference from the
mean) are 0 and 1, and from the distribution we see that in more than 80% of the throws
we will end up in that interval. Not shocking at all.
The ranges of both of these variables are easily discerned. We should say a few words
about the mode and the median of both. Certainly the expectation of neither is
necessarily in the range, so we don't expect the three measurements (mean, median and
mode) to agree, but as the table of examples indicates, they are always close.
Hypergeometric Binomial
N m k Mean Median Mode n p Mean Median Mode
21 11 5 2.6 3 3 12 .85 10.2 10 11
22 11 5 2.50 2 2 or 3 12 .75 9 9 9
23 11 5 2.3 2 2 12 .65 7.8 8 8
23 12 7 3.6 4 4 12 .9 10.8 11 11
33 12 7 2.54 3 2 24 .8 19.2 19 19
43 12 7 1.9 2 2 20 .4 8 8 8
43 22 7 3.58 4 4 25 .25 6.3 6 6
Summarizing, the following table represents the highlight features of the two types of
variables.
In the following example, we will use the binomial distribution to help us make
decisions.
Example 7. Truthful vs. Liar. A university claims that 85% of its students graduate. You
are to test their veracity by setting up a test of their claim. You pick 12 students at
random, and see how many of them graduate. You decide to accept the school’s claim if
at least 8 of the 12 students graduated.
What is the probability that you come to the wrong conclusion if indeed the university's
claim is true? Let Y denote the number of students among the 12 that graduated, so Y is a
binomial random variable with n = 12 and p = .85. Thus we will be wrong if we
encounter Y ≤ 7, so we desire to compute P(Y ≤ 7) = 1 − P(Y ≥ 8), and the relevant
values of P(Y = k) are:

k      8          9          10         11         12
P   0.068284   0.171976   0.292358   0.301218   0.142242

with a sum of 0.976078, and so we will be wrong 2.39% of the time, a truly negligible
possibility. This type of error is called a Type I error.
But there is another kind of error we could make, namely accepting the university's claim
when it is not true. This is called a Type II error. Now assume that the university's rate of
graduation is actually only 60%. What then is the probability that you will accept their
claim as true even though it is not? Then we will be wrong with probability P(Y ≥ 8),
where Y is a binomial random variable with n = 12 and p = .6. The values of P(Y = k) are
then

k      8          9          10         11         12
P   0.212841   0.141894   0.063852   0.017414   0.002177

with sum 0.438178, so we will be wrong nearly 44% of the time.
Thus, in order to reduce the probability of a Type II error, we tighten the test a bit, and we
will call the university truthful only if at least 9 students graduate. So we now want
P(Y ≥ 9), where Y is a binomial random variable with n = 12 and p = .6. Then we have
the total sum of 22.53%, and we have reduced our Type II error by roughly half. Yet in
order to be thorough we should look again at the Type I error. That one has increased
from 2.39% to over 9.21%. Overall, the second test, using 9 students rather than 8 to help
us decide, decreased the Type II error but increased the Type I.
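The Type I / Type II trade-off for the two cutoffs can be tabulated in a few lines; a sketch:

```python
from math import comb

def binom_tail(n, p, k):
    """P(Y >= k) for Y = B_{n,p}."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

for cutoff in (8, 9):
    type1 = 1 - binom_tail(12, 0.85, cutoff)  # reject a truthful university (p = .85)
    type2 = binom_tail(12, 0.60, cutoff)      # accept an untruthful one (p really .6)
    print(cutoff, round(type1, 4), round(type2, 4))
```

Raising the cutoff trades Type II error for Type I error, which is the tension discussed above.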
Finally, we give an example of great historical importance, since it was one of the first
occasions on which pure data was used to draw a scientific conclusion.
Example 8. Boys vs. Girls. Laplace, the great French mathematician of the 18th century,
was one of the first to assert that the probability of giving birth to a baby boy is bigger
than the probability of giving birth to a baby girl. His conclusion was based on the fact
that from 1745 to 1770, there were 251,527 boys born in Paris while only 241,945 girls.
The key method was to assume that boys and girls were equally
feasible, and then to compute the probability of what occurred to
occur. When that probability is too small, we conclude that our
assumption is wrong. This is called hypothesis testing, and it is done
very commonly today.
Laplace's analysis was as follows. Suppose m male births and f female births occur. Let

p = lim_{m+f→∞} m/(m + f)

be the eventual ratio of male births to total births; he decidedly assumed such a limit
existed, and that p represented the probability of being born a boy. He then used calculus
to compute some probabilities by computing some complicated integrals. He finished by
showing that the probability that p ≤ 1/2, given the actual distribution of 251,527 boys
and 241,945 girls, was approximately 10^−42, which made it morally certain that p > 1/2.
He went even further and by comparing the births of London and Paris, he concluded that
boys are even more feasible in London than in Paris!
Today we would treat the random variable Y (the number of boy births) as a binomial
with n = 493,472 and p = 1/2. Then our expectation would be E(Y) = 246,736 boys.
Now we need to compute P(Y ≥ 251,527), and this is the kind of computation that
Laplace was involved in.
But we have in our possession a much more powerful weapon that Laplace did not have:
the standard deviation.
We saw before that for a binomial the standard deviation is σ = √(npq), so in our case
σ = √123,368 ≈ 351.24. Then the fact that what occurred is more than 13 standard
deviations away,

(251,527 − 246,736)/σ = 4,791/σ ≈ 13.6,

tells us that what occurred is virtually impossible under our assumptions, so it is these
that need to be questioned.
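The whole modern computation fits in a few lines; a sketch:

```python
from math import sqrt

n, p = 493_472, 0.5
mean = n * p                   # E(Y) = 246736 expected boys
sigma = sqrt(n * p * (1 - p))  # sqrt(123368), about 351.24
z = (251_527 - mean) / sigma   # how many standard deviations away the data fell
print(mean, round(sigma, 2), round(z, 1))
```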
In the next section we will see other variations of the binomial. It is this distribution
that is overwhelmingly important, much more so than the hypergeometric.
In this section we will look at several random variables closely associated with the
binomial (and necessarily the accompanying Bernoullis). We start with the discrete
waiting variable: we have a Bernoulli B_p, and we are waiting for the first trial in which
success will occur. We do assume consecutive trials are independent. Let G_p be the
number of trials we would have to execute until the first success occurs. Such a
random variable is called a geometric random variable (with parameter p). Thus G_p
can take the values 1, 2, 3, …, and its distribution is given by

G_p   1    2     3       4       …   n           …
P     p    qp    q^2 p   q^3 p   …   q^{n−1} p   …

Below is a table with the probabilities of the first 12 values of G_p for p = 0.1, 0.2, 0.3,
0.4 and 0.5. Also below is the graphical representation of the table.
p \ n   1      2      3      4      5      6      7      8      9      10     11     12
0.1 0.1000 0.0900 0.0810 0.0729 0.0656 0.0590 0.0531 0.0478 0.0430 0.0387 0.0349 0.0314
0.2 0.2000 0.1600 0.1280 0.1024 0.0819 0.0655 0.0524 0.0419 0.0336 0.0268 0.0215 0.0172
0.3 0.3000 0.2100 0.1470 0.1029 0.0720 0.0504 0.0353 0.0247 0.0173 0.0121 0.0085 0.0059
0.4 0.4000 0.2400 0.1440 0.0864 0.0518 0.0311 0.0187 0.0112 0.0067 0.0040 0.0024 0.0015
0.5 0.5000 0.2500 0.1250 0.0625 0.0313 0.0156 0.0078 0.0039 0.0020 0.0010 0.0005 0.0002
Example 1. It is known that 60% of the people of the town like Mildred as a candidate
for Mayor. Her only opponent, Paul, is thus preferred by 40% of the population. What is
the probability that the fifth random person interviewed is the first person to like Paul?
Obviously this is P(G_{.4} = 5) = .6^4 × .4 = 0.05184.
Easily the mode of a geometric variable is 1. The median is only slightly more
interesting. Clearly if p ≥ 0.5, then the median is 1. Otherwise we must have

p(1 + q + q^2 + ⋯ + q^{n−1}) ≥ 1/2,

and so p(1 − q^n)/(1 − q) ≥ 1/2, hence 1/2 ≥ q^n, so −ln 2 ≥ n ln q, and so

n ≥ −ln 2/ln q,

and so, e.g., if p = 0.1 the median is 7, but the median is 4 when p = 0.2, and 2 for
p = 0.3, 0.4.
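The median formula translates directly into code; a sketch (the function name is ours):

```python
from math import ceil, log

def geometric_median(p):
    """Smallest n with P(G_p <= n) >= 1/2, i.e. the least n >= ln 2 / (-ln q)."""
    if p >= 0.5:
        return 1
    return ceil(log(2) / -log(1 - p))

print([geometric_median(p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)])  # [7, 4, 2, 2, 1]
```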
To compute the variance of G_p, we first compute E(G_p^2). Similarly to the argument for
the expectation, we condition on the first trial: with probability p it succeeds and G_p = 1,
while with probability q we have wasted a trial and start over, so G_p behaves like Y + 1,
where Y is again geometric. Thus

E(G_p^2) = p + qE((Y + 1)^2) = p + qE(Y^2 + 2Y + 1) = p + qE(G_p^2) + 2q(1/p) + q.

So pE(G_p^2) = 2q/p + 1, and so

V(G_p) = E(G_p^2) − E(G_p)^2 = 2q/p^2 + 1/p − 1/p^2 = (2q + p − 1)/p^2 = q/p^2.
Closely associated with the geometric is the extended geometric (also known as the
negative binomial), which is waiting for the k-th success, G_{k,p}; in fact G_{1,p} = G_p.
Clearly, G_{k,p} can take the values k, k+1, k+2, …. For n ≥ k, what is the probability
that G_{k,p} = n? Easily, the last attempt must have been a success, and among the previous
n − 1 tries we must have had exactly k − 1 successes (we can see here the relation to the
binomial), so

P(G_{k,p} = n) = C(n−1, k−1) q^{n−k} p^k.

In the table below is a list of probabilities of G_{k,p} for p = 0.35 and k = 2, 3, 4; the entry
in row k and column n is P(G_{k,p} = n) (the first possible value in each row is n = k).

k \ n    2        3        4        5        6        7        8        9        10       11       12       13       14       15
2       0.1225   0.1593   0.1553   0.1346   0.1093   0.0853   0.0647   0.0480   0.0351   0.0254   0.0181   0.0129   0.0091   0.0063
3                0.0429   0.0836   0.1087   0.1177   0.1148   0.1045   0.0905   0.0757   0.0615   0.0488   0.0381   0.0293   0.0222
4                         0.0150   0.0390   0.0634   0.0824   0.0938   0.0975   0.0951   0.0883   0.0789   0.0684   0.0578   0.0478
To compute the expectation and variance of the extended geometric, we simply observe
that the extended geometric is nothing but the sum of k independent geometric random
variables, G_{k,p} = G_p + ⋯ + G_p (k summands), and so

E(G_{k,p}) = k/p.

And since the summands are independent, V(G_{k,p}) = kV(G_p) = kq/p^2.
Example 2. Jim is a professional diver and has found a huge bed of oysters in which on
the average 1 in 4 yields a pearl. He intends to give his betrothed a necklace with 12
pearls. If he wants a margin of two standard deviations that he will have enough oysters
to make the necklace, how many oysters should he bring up? The relevant random
variable is of course G_{12,0.25}. Its expectation is 48, and its variance is
12(3/4)/(1/4)^2 = 144, so σ = 12, and hence he should bring up 48 + 2(12) = 72 oysters.
We now enter an interesting variation of the binomial B_{n,p}, one in which we think of n
as very large and p as very small. Actually, it is used in situations such as the number of
car accidents in a given location in a given amount of time, or the number of customers
entering a given store at a mall in a given hour.
What we have then is an average, λ , for example three customers enter the store in a
given hour, and this average is known to be λ = np = 3 . Then what we have is
P(B_{n,p} = k) = C(n,k) p^k q^{n−k} = [n!/(k!(n−k)!)] (λ/n)^k (1 − λ/n)^{n−k}
             = (λ^k/k!) × [n(n−1)⋯(n−k+1)/n^k] × (1 − λ/n)^n × (1 − λ/n)^{−k}.

If we now let n → ∞, and recall the fundamental fact that lim_{n→∞} (1 − λ/n)^n = e^{−λ},
we get that

P(P_λ = k) = lim_{n→∞} (λ^k/k!) (1)(1 − 1/n)(1 − 2/n)⋯(1 − (k−1)/n)(1 − λ/n)^n (1 − λ/n)^{−k}
           = (λ^k/k!) e^{−λ}.

And thus one defines a random variable to be a Poisson random variable with
parameter λ, P_λ, if it can take the values 0, 1, … with respective probabilities λ^k e^{−λ}/k!.
λ \ k    0          1          2          3          4          5          6          7          8          9          10
1 0.367879 0.367879 0.183940 0.061313 0.015328 0.003066 0.000511 0.000073 0.000009 0.000001 0.000000
2 0.135335 0.270671 0.270671 0.180447 0.090224 0.036089 0.012030 0.003437 0.000859 0.000191 0.000038
3 0.049787 0.149361 0.224042 0.224042 0.168031 0.100819 0.050409 0.021604 0.008102 0.002701 0.000810
4 0.018316 0.073263 0.146525 0.195367 0.195367 0.156293 0.104196 0.059540 0.029770 0.013231 0.005292
Indeed,

E(P_λ) = Σ_{k=0}^∞ k (λ^k/k!) e^{−λ} = Σ_{k=1}^∞ (λ^k/(k−1)!) e^{−λ} = λe^{−λ} Σ_{k=0}^∞ λ^k/k! = λe^{−λ} e^λ = λ.

Similarly,

E(P_λ^2) = Σ_{k=0}^∞ k^2 (λ^k/k!) e^{−λ} = Σ_{k=1}^∞ k (λ^k/(k−1)!) e^{−λ} = Σ_{k=0}^∞ (k+1) (λ^{k+1}/k!) e^{−λ}
         = Σ_{k=0}^∞ k (λ^{k+1}/k!) e^{−λ} + Σ_{k=0}^∞ (λ^{k+1}/k!) e^{−λ}
         = Σ_{k=1}^∞ (λ^{k+1}/(k−1)!) e^{−λ} + λ = Σ_{k=0}^∞ (λ^{k+2}/k!) e^{−λ} + λ = λ^2 + λ,

and so

V(P_λ) = λ^2 + λ − λ^2 = λ.
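Both identities can be checked numerically by truncating the series far enough out that the tail is negligible; a sketch with λ = 3:

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(P_lam = k) = lam^k e^{-lam} / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 3
mean = sum(k * poisson_pmf(lam, k) for k in range(100))
second = sum(k**2 * poisson_pmf(lam, k) for k in range(100))
print(round(mean, 6), round(second - mean**2, 6))  # 3.0 3.0
```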
Example 3. Insect Breeding. Suppose the number of eggs the Brown Recluse Spider
lays is given by a Poisson random variable with λ = 10. Moreover, the probability that an
egg will develop is 1/2. What is the probability that our spider will give rise to exactly 5
new spiders? In order to help us understand, let us start with P(Y = 0), where Y is the
number of baby spiders we will obtain. Clearly
P(Y = 0) = e^{−10} (1 + (10/1!)(1/2) + (10^2/2!)(1/2^2) + (10^3/3!)(1/2^3) + ⋯) = e^{−10} e^5 = e^{−5},

since we will have no eggs, or 1 egg and it will not develop, or two eggs and neither of
which will develop, etcetera. Now, for another specific example, let us compute the
probability that Y = 2. But then we have that P_10 = 2 and both eggs developed, or P_10 = 3
and exactly two of the eggs developed, or P_10 = 4 and exactly two of the eggs developed,
and so on. Since (10^n/n!) C(n,2) (1/2^n) = 5^n/(2!(n−2)!), we have

P(Y = 2) = e^{−10} [(10^2/2!)(1/2^2) + (10^3/3!) C(3,2)(1/2^3) + (10^4/4!) C(4,2)(1/2^4) + ⋯]
         = e^{−10} (5^2/2!) (1 + 5/1! + 5^2/2! + 5^3/3! + ⋯) = e^{−10} (5^2/2!) e^5 = e^{−5} 5^2/2!.

And so we begin to suspect that we get a Poisson distribution also for the number of
spiders. Indeed, by the same reasoning, since (10^n/n!) C(n,k) (1/2^n) = 5^n/(k!(n−k)!),
we get

P(Y = k) = e^{−10} Σ_{n≥k} 5^n/(k!(n−k)!) = e^{−10} (5^k/k!) (1 + 5/1! + 5^2/2! + 5^3/3! + ⋯)
         = e^{−10} (5^k/k!) e^5 = e^{−5} 5^k/k!,

so Y is itself a Poisson random variable with parameter 5. Thus,

P(Y = 5) = e^{−5} 5^5/5! ≈ 0.17547.
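The "thinning" phenomenon in this example (a Poisson number of eggs, each kept independently with probability 1/2, yields a Poisson number of spiders) is easy to verify numerically; a sketch (the helper names are ours):

```python
from math import comb, exp, factorial

def poisson_pmf(lam, k):
    return lam**k * exp(-lam) / factorial(k)

def thinned_pmf(y, terms=80):
    """P(Y = y) where P_10 eggs are laid and each develops with probability 1/2."""
    return sum(poisson_pmf(10, n) * comb(n, y) * 0.5**n
               for n in range(y, y + terms))

for y in range(6):
    print(y, round(thinned_pmf(y), 5), round(poisson_pmf(5, y), 5))  # columns agree
```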
P(X = 0) = (279/280)^196 ≈ 0.4959 and P(X = 1) = 196 × (1/280) × (279/280)^195 ≈ 0.34841.
But perhaps we should switch to the Poisson, and when we divide the total number of
deaths by 280 to compute the average number of deaths, we get λ = 196/280 = 0.7. And
then we would get P(P_λ = 0) = e^{−.7} ≈ 0.49658 and P(P_λ = 1) = e^{−.7}(.7) ≈ 0.34760,
reaffirming the approximation from which we obtained the Poisson.
But also interestingly, we could use the Poisson to predict the number of occurrences of
each of the situations above by multiplying each probability by 280 (and rounding to the
nearest integer):
Number of Deaths 0 1 2 3 4 5
Poisson Probability 0.49658 0.34760 0.12166 0.02838 0.00496 0.00082
Expected # of occurrences 139 97 34 8 1 0
Actuality 144 91 32 11 2 0
Now suppose X = P_10 + P_13 is the sum of two independent Poisson random variables
with parameters 10 and 13, and let us compute the distribution of X. For example,

P(X = 2) = P(P_10 = 2) P(P_13 = 0) + P(P_10 = 1) P(P_13 = 1) + P(P_10 = 0) P(P_13 = 2)
        = e^{−10} (10^2/2!) e^{−13} + e^{−10} (10/1!) e^{−13} (13/1!) + e^{−10} e^{−13} (13^2/2!)
        = e^{−23} (1/2!) (10^2 + 2(10)(13) + 13^2) = e^{−23} 23^2/2!.

Once more,

P(X = 3) = P(P_10 = 3) P(P_13 = 0) + P(P_10 = 2) P(P_13 = 1) + P(P_10 = 1) P(P_13 = 2) + P(P_10 = 0) P(P_13 = 3)
        = e^{−10} (10^3/3!) e^{−13} + e^{−10} (10^2/2!) e^{−13} (13/1!) + e^{−10} (10/1!) e^{−13} (13^2/2!) + e^{−10} e^{−13} (13^3/3!)
        = e^{−23} (1/3!) (10^3 + 3(10^2)(13) + 3(10)(13^2) + 13^3) = e^{−23} 23^3/3!.

In general, we get

P(X = n) = Σ_{k=0}^n P(P_10 = k) P(P_13 = n−k) = Σ_{k=0}^n e^{−10} (10^k/k!) e^{−13} (13^{n−k}/(n−k)!)
        = (e^{−23}/n!) Σ_{k=0}^n [n!/(k!(n−k)!)] 10^k 13^{n−k} = e^{−23} (10 + 13)^n/n! = e^{−23} 23^n/n! = P(P_23 = n).

Thus, P_10 + P_13 = P_23 in the presence of independence. In a similar fashion to the
example, one can show the surprising result that for any λ, μ (as independent random
variables):

P_λ + P_μ = P_{λ+μ}.
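The additivity P_λ + P_μ = P_{λ+μ} is exactly the convolution identity computed above; a numerical spot check:

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    return lam**k * exp(-lam) / factorial(k)

# P(P_10 + P_13 = n) as a convolution, compared with P(P_23 = n)
for n in (0, 10, 23, 40):
    conv = sum(poisson_pmf(10, k) * poisson_pmf(13, n - k) for k in range(n + 1))
    print(n, abs(conv - poisson_pmf(23, n)) < 1e-12)  # True each time
```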
Before we go and discuss the most important of all random variables in the next section,
we take a detour to visit three important theorems. We start with a classical result: The
Law of Averages. We have already encountered James, the eldest of the Bernoullis (the
most prestigious family in mathematical history), and the first theorem we discuss is due
to him.
An old Arab saying proclaims: Indeed he knows not how to know who knows not
also how to unknow. James Bernoulli made fundamental progress in how to test our
knowledge, and in doing so became one of the first statisticians in history. In 1713, eight
years after his death, his nephew Nicholas (son of brother John) published,
posthumously, Jacob’s masterpiece: Ars Conjectandi, an important book, the heir to
Huygens' (who gave us expectation), and the predecessor to Laplace’s (the next theorem).
The following discussion is an interpretation of some of the ideas Bernoulli discussed.
Suppose you have an urn in which you can hear some balls inside.
Unfortunately the balls are too large to get them out, and certainly the urn is to
be preserved, so you don't want to destroy it. However, through the opening you
can see one of the balls (one at a time) inside, and if you rattle the urn you can
perhaps change the ball you are seeing. Indeed, after a little while, through the
top you see balls of two colors: black and white. You spend some time observing the
different colors of the balls you see from the top of the urn, and you
are ready to guess that perhaps there are five balls inside the urn: 3
black and 2 white. But you are an honest person with integrity and you
would like some moral certainty concerning your guess. How would
you go about it? How would we test our knowledge? In Bernoulli's
own words:
...how often a white and how often a black pebble is observed. The question is,
can you do this so often that it becomes ten times, one hundred times, one
thousand times, etc. more probable (that is, it be morally certain) that the
numbers of whites and blacks observed (chosen) are in the same 3:2 ratio as
the pebbles in the urn, rather than in any other ratio?
At the same time, Bernoulli realized that we could only expect an approximation to our
ratio, not an exact ratio. Namely, the more times we did the experiment, the less
likely would it be that we get the exact ratio. Indeed, let's do some computations.
Suppose we observe the urn 500 times, and suppose that our hypothesis of 3 black
and 2 white is correct. Then the probability that we observe exactly 300 black and 200
white is, since we are in the random variable X = B_{500,.6}:

P(X = 300) = C(500,300) (3/5)^300 (2/5)^200 = 500! 3^300 2^200/(300! 200! 5^500) ≈ 3.6%.
In general, Bernoulli knew that if his hypothesis was correct, and he did the experiment
n times, the probability that he would get exactly k black showings was:

C(n,k) (3/5)^k (2/5)^{n−k} = n! 3^k 2^{n−k}/(k!(n−k)! 5^n).

And if we compute, we see that as the number of experiments increases, the probability
that we get exactly the ratio 3:2 decreases.
Let us say, as he did, that we want to stay within 1/50 (2%) of our fixed ratio. Thus if we
do the experiment 50 times, then we expect to have either 29, 30 or 31 successes (black
balls); if we do it 100 times, then we look at the probability of 58, 59, 60, 61 or 62
successes, etcetera.
The table below gives both the number of occurrences of black balls and the probabilities
for each, which have been added up at the bottom of the table. The top row indicates the
number of experiments.
Bernoulli proved the theorem in his Ars Conjectandi, and among the tools he used was
this pretty inequality:

Let a, b, c, d be positive numbers. If a/b < c/d, then (a+c)/(b+d) is in
between the other two: a/b < (a+c)/(b+d) < c/d.
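This "mediant" inequality is mechanical to verify, and note that it depends on the numerator and denominator separately, not just on the ratio. A sketch:

```python
from fractions import Fraction

def mediant(a, b, c, d):
    """Combine a-for-b and c-for-d into (a+c)-for-(b+d)."""
    return Fraction(a + c, b + d)

m = mediant(3, 5, 4, 6)                     # 3-for-5 one day, 4-for-6 the next
print(m)                                    # 7/11
print(Fraction(3, 5) < m < Fraction(4, 6))  # True: the mediant lies in between
print(mediant(3, 5, 2, 3))                  # 5/8: 2-for-3 "equals" 4-for-6, yet gives a different mediant
```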
We cannot help observing that this is an average that does occur in baseball. If a player
goes 3/5 (it is customarily read: 3 for 5) one day and 4/6 another, then her/his combined
total for the two days is 7/11, which is necessarily in between: not as good as her/his best
day and not as bad as her/his worst day. Observe that although 4/6 is the same as 2/3 in
most contexts, it is not so from this point of view, since 5/8 (which is what we get when
we average 3/5 and 2/3) is not the same as 7/11.
Suppose now we let N be the total number of experiments, and because we chose the
fraction 1/50, we are going to let N = 50M. Then the ratio 3:2 occurs when we have 30M
observations of black balls. Hence to be within 1/50 of that ratio means we must have
between 29M and 31M occurrences of black balls (endpoints included). We are
interested then in

ϑ = the probability that the number of occurrences of black balls falls
within these two limits.

What Bernoulli proved was that as N grows, so does ϑ, and that ϑ/(1 − ϑ) becomes
arbitrarily large (moral certainty): it grows without bound as M → ∞ (or equivalently,
N → ∞).
But Bernoulli wanted actual estimates for the number of tries necessary, an estimate
for N, and here perhaps he felt he had failed. For example, he wanted ϑ/(1 − ϑ) ≥ 1000,
and then his estimate for N was 25,550 observations, which to him was a gigantic
number: there were fewer than 3,000 known stars in the skies in his lifetime.
In the early history of probability, the situation most often encountered was that of
performing repeated trials of an experiment, and counting the number of successes
among the trials, however success was defined. In other words, the most common random
variable was the Binomial.
If we vary the numbers a little, how sensitive are our estimates to these variations?
It is here that De Moivre introduced what is one of the most important densities in
probability, the function

y = (1/√(2π)) e^{−x^2/2},

which nowadays is so prevalent. It has many names: the normal distribution, the
Gaussian density function, the astronomer's error law, or simply the bell-shaped curve
(the same shape as any row of Pascal's triangle).
Actually De Moivre introduced the shape y = e^{−x^2/2}, and he knew the integral had to be
found in order to produce a density. It was left to the great Laplace¹ to exactly compute
the area under this curve.

Theorem. The integral of e^{−x^2/2} over the whole real line equals √(2π).
Proof. It was a clever trick that Laplace used to integrate this function. He worked with
the square instead:

(∫_{−∞}^{∞} e^{−x^2/2} dx)^2 = (∫_{−∞}^{∞} e^{−x^2/2} dx)(∫_{−∞}^{∞} e^{−y^2/2} dy) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x^2+y^2)/2} dx dy.

It was here he used polar coordinates, where x = r cos θ, y = r sin θ, and thus,
1 Laplace (1749-1827) is a major name in both mathematics and mechanics. His two masterpieces,
Mécanique Céleste and Théorie Analytique des Probabilités, are both major books in their subjects. The
former is, as its name indicates, the full elucidation of our Solar system following Newton's laws of
planetary motion. The latter is the distillation of all probabilistic knowledge up to the latter half of the
eighteenth century, and remained the major book in the subject for 50 years. Ironically, Laplace's work
represents both the crowning achievement of mechanism, the philosophy that considers the universe to run
as clockwork, and yet, by his emphasis on the importance of probability and statistics, he represents the
beginning of the end of such a philosophy.
x^2 + y^2 = r^2.

By using infinitesimal areas, an early fact from the conversion of areas from one set of
coordinates to the other, we get our integral to be

(∫_{−∞}^{∞} e^{−x^2/2} dx)^2 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x^2+y^2)/2} dx dy = ∫_0^{2π} ∫_0^{∞} r e^{−r^2/2} dr dθ
                         = (∫_0^{2π} dθ)(∫_0^{∞} r e^{−r^2/2} dr) = 2π ∫_0^{∞} r e^{−r^2/2} dr.

But the last integral is much easier since it has the needed r factor: by the substitution
u = e^{−r^2/2}, we have du = −r e^{−r^2/2} dr, and when r = 0, u = 1, while when r = ∞,
u = 0; thus

(∫_{−∞}^{∞} e^{−x^2/2} dx)^2 = 2π ∫_1^0 (−du) = 2π.
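Laplace's value is easy to confirm numerically (no trick needed, just brute force); a sketch using the midpoint rule on a wide interval:

```python
from math import exp, pi, sqrt

# midpoint rule for the integral of e^{-x^2/2}; the tails beyond |x| = 20 are negligible
n_steps, lo, hi = 400_000, -20.0, 20.0
h = (hi - lo) / n_steps
integral = h * sum(exp(-(lo + (i + 0.5) * h) ** 2 / 2) for i in range(n_steps))
print(integral, sqrt(2 * pi))  # both ≈ 2.5066282746
```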
A bit later we will see another tremendous contribution of Laplace connected with the
normal.
And without getting too technical, De Moivre correctly claimed that this unique bell-
shaped curve can be used to approximate any of the binomial problems (once
appropriately calibrated), and this approximation can be used to give an answer to our
query as an integral instead of a sum. In our case, the estimate is: the probability of
between 4,950 and 5,050 heads is 0.6826 (as we will see in the next section).
Unfortunately for De Moivre, he was not able to see how far-reaching his curve was. It
was left to Laplace (and Gauss) to cement the importance of the normal distribution via
the Central Limit Theorem.
Ever since the first encounters with probability theory and Pascal's triangle, the bell shape
of the distribution of the numbers was very apparent. As we saw before, De Moivre had
proven that the normal curve

y = (1/√(2π)) e^{−x^2/2}

was the limit situation when a large number of experiments was performed in which two
outcomes (success and failure) were possible (binomial distributions). As we saw, it was
Laplace that found the constant 1/√(2π).
This fact is known as the Central Limit Theorem, and it was used in the 19th century, as
it is still used today, to apply statistical methods to the social sciences. We will give some
further applications in the next section. But first we clarify what the theorem means.
We have 3 dice, one with the faces marked 1,1,1,2,2,2 (thus the probability of rolling
a 1 is ), another one with the faces marked 2,2,2,2,3,3, and finally the third one
1
2
with 3,3,3,3,3 and 4. You will roll the 3 dice simultaneously and record the sum.
Hence you can roll either a 6, or a 7, or an 8 or a 9. With a little computation we
derive the probabilities of each:

Roll   Probability   # of Ways
6      0.277778      60
7      0.472222      102
8      0.222222      48
9      0.027778      6

[Histogram of these probabilities.]
But suppose that instead of rolling once, you rolled twice, and recorded the average of
your two rolls:
Total   Average   Probability
12      6.0       0.077160
13      6.5       0.262346
14      7.0       0.346451
15      7.5       0.225309
16      8.0       0.075617
17      8.5       0.012346
18      9.0       0.000772

And we can begin to see the reason for the name Central
Limit Theorem. Do it one more time, keeping track of the
average of the three rolls:

Total   Average   Probability
18      6.00      0.021433
19      6.33      0.109311
20      6.67      0.237269
21      7.00      0.286630
22      7.33      0.211677
23      7.67      0.098830
24      8.00      0.029107
25      8.33      0.005208
26      8.67      0.000514
27      9.00      0.000021

[Histograms of the two- and three-roll averages.]
And we perceive more and more the approximation that the theorem claims. We give
three more, the ones with 4 rolls, 5 rolls and 6 rolls, and no more words of explanation.

4 rolls:
Total   Average   Probability
24      6.00      0.005954
25      6.25      0.040485
26      6.50      0.122290
27      6.75      0.216549
28      7.00      0.249915
29      7.25      0.197698
30      7.50      0.109756
31      7.75      0.043034
32      8.00      0.011816
33      8.25      0.002215
34      8.50      0.000269
35      8.75      0.000019
36      9.00      0.000001

5 rolls:
Total   Average   Probability
30      6.0       0.001654
31      6.2       0.014057
32      6.4       0.054411
33      6.6       0.127063
34      6.8       0.199980
35      7.0       0.224450
36      7.2       0.185397
37      7.4       0.114658
38      7.6       0.053485
39      7.8       0.018807
40      8.0       0.004942
41      8.2       0.000953
42      8.4       0.000130
43      8.6       0.000012
44      8.8       0.000001
45      9.0       0.000000

6 rolls:
Total   Average   Probability
36      6.0       0.000459
37      6.2       0.004686
38      6.3       0.022120
39      6.5       0.064159
40      6.7       0.128034
41      6.8       0.186530
42      7.0       0.205459
43      7.2       0.174831
44      7.3       0.116435
45      7.5       0.061111
46      7.7       0.025324
47      7.8       0.008263
48      8.0       0.002107
49      8.2       0.000414
50      8.3       0.000061
51      8.5       0.000007
52      8.7       0.000000
53      8.8       0.000000
54      9.0       0.000000

[Histograms of the 4-, 5- and 6-roll averages.]
But the whole impact of the theorem is that what we started with is not relevant
at all: the starting shape could be very different from the one in the example above, and yet we
would have the same tendency toward the normal curve with its bell shape. To exemplify,
we give six more histograms just like in the example above, but starting with a very
different distribution from the one above.
[Figure: six histograms, as in the example above, for the new starting distribution.]
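All of these tables can be generated mechanically, since the distribution of the sum of n rolls is the n-fold convolution of the one-roll distribution; a minimal sketch:

```python
from fractions import Fraction

# One-roll distribution of the sum, from the first table.
one = {6: Fraction(60, 216), 7: Fraction(102, 216),
       8: Fraction(48, 216), 9: Fraction(6, 216)}

def convolve(d1, d2):
    """Distribution of the sum of two independent rolls."""
    out = {}
    for s, p in d1.items():
        for t, q in d2.items():
            out[s + t] = out.get(s + t, 0) + p * q
    return out

# Three rolls: convolve twice; the tables list the average total/n.
three = convolve(convolve(one, one), one)
```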
THE
NORMAL
LAW OF ERROR
STANDS OUT IN THE
EXPERIENCE OF MANKIND
AS ONE OF THE BROADEST
GENERALIZATIONS OF NATURAL
PHILOSOPHY * IT SERVES AS THE
GUIDING INSTRUMENT IN RESEARCHES
IN THE PHYSICAL AND SOCIAL SCIENCES AND
IN MEDICINE AGRICULTURE AND ENGINEERING
IT IS AN INDISPENSABLE TOOL FOR THE ANALYSIS AND THE
INTERPRETATION OF BASIC DATA OBTAINED BY OBSERVATION AND EXPERIMENT
We end the section with a couple of Russian results from the 19th century.
The next fundamental inequality (Chebyshev's inequality) firmly establishes the standard
deviation as the yardstick for dispersion:
$$P\left( \left| X - \mu \right| \geq k\sigma \right) \leq \frac{1}{k^2}.$$
Proof. Let $Y = (X - \mu)^2$, so $E(Y) = \sigma^2$, and since $Y \geq 0$, we can apply Markov's
inequality to Y. Letting $a = k^2\sigma^2$, we get $P(Y \geq \sigma^2 k^2) \leq \frac{E(Y)}{\sigma^2 k^2} = \frac{1}{k^2}$. And since
$P(Y \geq \sigma^2 k^2) = P(|X - \mu| \geq \sigma k)$, we are done. $\blacksquare$
Note the equivalent statement obtained by replacing $k$ by $\frac{k}{\sigma}$:
$$P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.$$
Example 2. If the mean number of accidents on a given highway is 30 in a week, and the
standard deviation is 10, then how likely is it that it will exceed 50 in a given week?
Since 50 is two standard deviations above the mean, we can readily assert that the
probability is at most 25%.
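A quick sanity check of Chebyshev's bound on an assumed concrete distribution (a binomial with n = 30 and p = ½, chosen by us and not from the text): the true two-standard-deviation tail is far below the guaranteed 25%.

```python
from math import comb, sqrt

# An assumed example distribution: binomial, n = 30, p = 1/2.
n, p = 30, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

k = 2  # two standard deviations, as in the highway example
tail = sum(pr for x, pr in enumerate(pmf) if abs(x - mu) >= k * sigma)
# Chebyshev guarantees tail <= 1/k**2 = 0.25; the true tail here is about 0.04.
```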
First observe that if $X_1, X_2, \ldots, X_n$ are independent random variables with the same
distribution, and hence the same mean and variance, $\mu$ and $\sigma^2$ respectively, then we can
define a new variable, the average, $Y_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$. We can easily observe
that
$$E(Y_n) = \mu$$
Its range is the whole real line, but most of the area is concentrated between −3 and 3. In
fact, $P(-1 \leq Z \leq 1) = 68.25\%$, $P(-2 \leq Z \leq 2) = 95.41\%$ and $P(-3 \leq Z \leq 3) = 99.69\%$.
These were computed using the table; the cumulative distribution of Z, known as $\Phi$, does
not have a closed form. The mode and median are both clearly 0, and indeed so is the
expectation,
$$E(Z) = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = 0,$$
since the integrand is odd. For the second moment, integrating by parts,
$$E(Z^2) = 2\int_{0}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx
= \left[ -\frac{2}{\sqrt{2\pi}}\, x\, e^{-x^2/2} \right]_{0}^{\infty}
+ \frac{2}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-x^2/2}\,dx
= 0 + \frac{2}{\sqrt{2\pi}} \cdot \frac{\sqrt{2\pi}}{2} = 1.$$
So σ = 1 also.
As usual, probabilities are given by areas under the curve, and the
areas have been previously computed in a table. The table (see
appendix) gives the area of the tail as indicated in the picture.
The best procedure in order to compute from the table is to always have a picture of the
desired area in mind.
The time has come to consider other normals. Start by considering a random variable of
the form X = aZ + b . Then E ( X ) = aE ( Z ) + b = b , and V ( X ) = a 2V ( Z ) , so if we let
N µ ,σ = σZ + µ , then E ( N µ ,σ ) = µ , and its standard deviation is σ . More clearly, its
distribution is given by
$$F_{N_{\mu,\sigma}}(y) = P(N_{\mu,\sigma} \leq y) = P(\sigma Z + \mu \leq y)
= P\left( Z \leq \frac{y - \mu}{\sigma} \right)
= \int_{-\infty}^{\frac{y-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx.$$
So by the fundamental theorem of calculus, its density is given by the derivative, which is
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}.$$
In fact, all of these graphs are bell-shaped. The graphs
on the right represent a collection of normals with the
same mean, but with different standard deviations; the
wider the graph, the bigger the standard deviation.
The standard normal is the one in the middle.
But more important than the density is the fact that in order to compute a probability,
all we need is to measure the difference from the mean in terms of the standard deviation:
the expression $\frac{y - \mu}{\sigma}$ gives us all the control. In other words, if we are using a
normal other than the standard normal, all we have to do to compute is translate to the
number of standard deviations away from the mean that we desire. The following example
should be illustrative.
Example 2. Verbal SAT. Scores on the SAT verbal ability follow a normal distribution
with µ = 430 and σ = 100 , so our random variable is N ( 430,100 ) . Suppose 10,000
students take the test. Consider the following.
c How many students scored 530 or higher? How far from the mean are we?
We are 100 points above it, which is one standard deviation, so the answer is the
same as c in the previous example: 15.87%, or 1,587 students.
d How many students scored 653 or higher? Since 653 − 430 = 223 , we are
2.23 standard deviations above the mean, so again the question is the same
question as in part d of the previous example, so the answer is 130
students.
Proceeding in the same fashion, but phrasing the question in terms of students and SAT
scores, rather than standard deviations away from the mean, we get
e How many students scored 330 or lower: 1,587 of them. Similarly 130
students scored 207 or lower.
f How many students scored 530 or lower? By complementation we get
8,413 students.
g How many students scored between 530 and 653: 1587 −130 = 1457 of
them.
h How many students scored between 207 and 530? As before we get the
union of two disjoint events: one with 4,870 students and the other one
with 3,413 for a total of 8,283 students.
i Finally, Lesley’s mother was told that Lesley’s score was in the top 1%.
Thus Lesley’s score was at least 663 points.
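The arithmetic above can be checked with the closed form of Φ in terms of the error function; Phi and count_at_least are our own helper names, not from the text:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, students = 430, 100, 10_000

def count_at_least(score):
    z = (score - mu) / sigma          # standard deviations above the mean
    return students * (1 - Phi(z))

above_530 = count_at_least(530)   # about 1,587 students
above_653 = count_at_least(653)   # about 129 (the text's table rounds to 130)
```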
The following example reiterates the notion that what is crucial is to use the standard
deviation as the yardstick, that σ is the unit of measurement.
Example 3. Career Choices. Thomas recently took two national exams for admission to
graduate school. His score on the biology test was 89, while in the math test was 78.
Since the tests are given to thousands of students, one can safely assume that the scores
on each of them follow a normal distribution.
Thomas found out that on the biology test the average was 82 and the standard deviation
was 4, while on the math test the average was 76 and the standard deviation was 1. For
which subject is Thomas better suited for graduate school: biology or mathematics?
It is clear that Thomas scored above average on both exams, and the differences in points
were 7 and 2, so it would seem that Thomas is more suited for biology. But points on the
test are not the correct measuring stick. In terms of standard deviations, in biology,
Thomas was 1.75 standard deviations above the mean, or equivalently in the top 4.02%
of the test takers. But in math he was 2 standard deviations above, or in the top 2.29%, so
Thomas scored higher in the mathematics test. And his talent in that subject is more
precious than his biological skills.
We return to the original question that motivated De Moivre to discover the normal
curve—the approximation to the binomial.
Example 5. What is the probability of having between 4,950 and 5,050 heads when we
toss a coin 10,000 times? As before, we can readily write an answer, but what numerical
estimate we can attach to it is a different story. The answer is:
$$\sum_{k=4950}^{5050} \binom{10000}{k} \frac{1}{2^{10000}}.$$
Now we can use the normal to approximate the binomial. But what is $\mu$, the mean, and what is
$\sigma$, the standard deviation? We know that $\mu = 5000$, and $\sigma = \sqrt{10000 \cdot \frac{1}{2} \cdot \frac{1}{2}} = \sqrt{2500} = 50$,
and our question simply becomes: what is the probability of being within one standard
deviation (either side) of the mean? So it equals 68.25%.
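Python's exact integer arithmetic lets us evaluate the binomial sum itself, for comparison with the normal estimate:

```python
from math import comb

# Exact value of the sum the normal curve approximates:
# P(4950 <= heads <= 5050) in 10,000 fair-coin tosses.
total = sum(comb(10_000, k) for k in range(4_950, 5_051))
p = total / 2**10_000
# The normal approximation gives about 0.6826 without a continuity
# correction, about 0.6875 with one; the exact value is close to the latter.
```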
n = Admitted Students   E(X)    201 − E(X)   σ = √(n · .3 · .7)   # of standard deviations away from the mean
600                     180     21           11.22497             1.87082
601                     180.3   20.7         11.23432             1.84256
602                     180.6   20.4         11.24366             1.81435
603                     180.9   20.1         11.25300             1.78619
604                     181.2   19.8         11.26233             1.75807
605                     181.5   19.5         11.27165             1.73000
606                     181.8   19.2         11.28096             1.70198
607                     182.1   18.9         11.29026             1.67400
608                     182.4   18.6         11.29956             1.64608
609                     182.7   18.3         11.30885             1.61820
Example 8. Sampling. Suppose that 40% of the adult population of Menaville attend
religious services in any given week. What is the probability that when we interview
1785 adults in the city of Menaville and ask them whether they attended religious
services that week or not, we will get between 37% and 43% of them to say yes (a
3% margin of error)? Again $\sigma = \sqrt{1785 \cdot .4 \cdot .6} = 20.7$. We expect 714 people, and we want
to know the probability $P(768 \geq X \geq 660)$, which readily translates to
$P(2.58 \geq Z \geq -2.58) = 98.97\%$, and so we have a high degree of confidence.
Example 9. Sampling. Suppose again that 40% of the adult population of Menaville attend
religious services in any given week. What is the probability that when we interview
1000 adults and ask them the same question, we will get between 37% and 43% of them
to say yes (a 3% margin of error)? Again $\sigma = \sqrt{1000 \cdot .4 \cdot .6} = 15.49$. Let $X = B_{1000,.4}$ be the random
variable that counts the number of people attending services. We know $E(X) = 400$, and
we want to know the probability $P(430 \geq X \geq 370)$. This readily translates to
$P(1.94 \geq Z \geq -1.94) \approx 94.8\%$.
How many people would we have to interview if we wanted a 99% degree of certainty?
Since $P(2.59 \geq Z \geq -2.59) \geq 99\%$, we need $.03n \geq 2.59\sigma = 2.59\sqrt{.24n}$. This simplifies
to $\sqrt{n} \geq \frac{2.59}{.03}\sqrt{.24} \approx 42.29$, so any $n \geq 1789$ would be sufficient.
The previous two sections were dedicated to the normal distribution which is symmetric
about its mean. However, many distributions can take values that are only on one side of
the axis, positive for example—certainly Z 2 , or any X 2 for that matter, would be such a
random variable. In this section we look at a popular special case of a major family we
will discuss in a later chapter. The exponential random variable is a special case of the
gamma random variable, but it is an important and useful special case of that variable, so
we will isolate its discussion.
Throughout this section $\beta$ will denote a positive real number. Then the random variable
with density
$$f(y) = \begin{cases} \dfrac{1}{\beta}\, e^{-y/\beta} & y > 0 \\ 0 & \text{otherwise} \end{cases}$$
is known as an exponential random variable with parameter $\beta$, and will be denoted by $X_\beta$.
[Figure: the graphs of three such densities, with $\beta = 1$, 2 and 4 respectively.]

Before we discuss examples, we compute the basic properties of the exponential.
Its distribution, to start with, has a particularly nice form:
$$F_{X_\beta}(y) = \int_0^y \frac{1}{\beta}\, e^{-t/\beta}\,dt = \left[ -e^{-t/\beta} \right]_0^y = 1 - e^{-y/\beta}.$$
Thus $P(X_\beta > y) = e^{-y/\beta}$, and this is perhaps the simplest of all descriptions of the
exponential. Clearly, the mode of $X_\beta$ is 0 (it only decays from there), and the median can be found by
solving $1 - e^{-m/\beta} = \frac{1}{2}$, which gives $m = \beta \ln 2$. Integrating by parts,
$E(X_\beta) = \int_0^\infty y\, \frac{1}{\beta}\, e^{-y/\beta}\,dy = \beta$.
Similarly,
$$E(X_\beta^2) = \int_0^\infty y^2\, \frac{1}{\beta}\, e^{-y/\beta}\,dy
= \left[ -y^2 e^{-y/\beta} \right]_0^\infty + \int_0^\infty 2y\, e^{-y/\beta}\,dy
= 0 + 2\beta\, E(X_\beta) = 2\beta^2,$$
so $V(X_\beta) = \beta^2$, and so $\sigma = \beta$.
Example 1. A certain machine uses electronic components each of which has a lifetime
in hours given by X 50 . Thus each component has an expected lifetime of 50 hours (with a
standard deviation of 50 hours!). On the other hand, the median lifetime of a component
is only 50ln 2 ≈ 34.65 hours. We also have that the probability a component will last less
than 50 hours is $P(X_{50} \leq 50) = 1 - e^{-1} \approx .6321$, so $P(X_{50} \geq 50) = e^{-1} \approx .3678$. If we ask,
what the probability is a component will last 100 hours if it has already lasted 50, we
surprisingly get
$$P(X_{50} \geq 100 \mid X_{50} \geq 50) = \frac{P(X_{50} \geq 100)}{P(X_{50} \geq 50)} = \frac{e^{-2}}{e^{-1}} = e^{-1}.$$
We will see below that this is a typical property of the exponential.
Suppose the machine has 5 of those components acting independently, and that it needs 3
of them to work in order for it to operate. What is the probability the machine will
successfully operate for 100 consecutive hours? What we need now is to switch our
thinking to a Bernoulli where success is a component lasting at least 100 hours. But that
is easy: as we saw before, $P(X_{50} \geq 100) = e^{-2}$, so we are considering $B_{e^{-2}}$ as our
Bernoulli, and we have 5 of them, so we are now in a binomial, $B_{5,\,e^{-2}}$, and we want at
least 3 successes. Easily then we get this probability to be very close to 2%.
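The final computation, spelled out:

```python
from math import comb, exp

# Success = a component lasting 100 hours: p = P(X_50 >= 100) = e^(-2).
p = exp(-2)

# At least 3 of the 5 independent components succeed: binomial B(5, p).
prob = sum(comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(3, 6))
# prob is about 0.0200
```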
If we let T denote the waiting time until the first customer comes to the store, where time
is measured in days, then T is a continuous random variable. But then $P(T > t) = e^{-12t}$, so
$$P(T \leq t) = 1 - e^{-12t} = F_{X_{1/12}}(t).$$
The exponential has a unique and distinguishing characteristic, no memory. Once it has
lasted a certain amount of time, the probability it will last a specific time longer is the
same as the probability it would have lasted that specific time to start with!
118
Proof. It is immediate:
$$P(X_\beta \geq s + t \mid X_\beta \geq t) = \frac{P(X_\beta \geq s + t)}{P(X_\beta \geq t)}
= \frac{e^{-(s+t)/\beta}}{e^{-t/\beta}} = e^{-s/\beta} = P(X_\beta \geq s). \qquad \blacksquare$$
One can eventually prove that the exponential is the only random variable that satisfies
this memorylessness property.
Example 3. You walk into a UPS shipping company office just before it closes. The
office has 2 windows with employees serving customers. However, there is a line of 3
people waiting to be served in addition to the two already at the windows. Assume that
the time a customer stays at a window is given by X 10 , measured in minutes. What is the
probability you will be the last customer to leave the office? By the time you get to a
window, it is with probability 1 that the other window will also be occupied. Since the
exponential has no memory, you and the person at the other window have equal chances
of finishing first, so your chances are 50%. We will revisit this example in the next
chapter.
We end the section with a brief discussion of failure rates. Let X be a random variable
that represents the lifetime of some machine. Thus we assume X has positive range, and
density f and distribution F. Then the failure rate of X is defined by
$$\kappa(t) = \frac{f(t)}{1 - F(t)}.$$
The reason for the name is that it measures the proportion of failures in a
small interval after time t, given that the machine has lasted to time t. The following
computation explains further:
$$\lim_{h \to 0} \frac{P(t < X < t + h \mid X > t)}{h}
= \lim_{h \to 0} \frac{P(t < X < t + h)}{h\, P(X > t)}
= \frac{1}{1 - F(t)} \lim_{h \to 0} \frac{F(t + h) - F(t)}{h}
= \frac{f(t)}{1 - F(t)}.$$
In fact, for the exponential, we have
$$\kappa(t) = \frac{f(t)}{1 - F(t)} = \frac{\frac{1}{\beta}\, e^{-t/\beta}}{e^{-t/\beta}} = \frac{1}{\beta},$$
a constant. This is another indication of the memorylessness of the exponential.
Conversely, the failure rate determines the distribution. Since
$\kappa(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt} \ln(1 - F(t))$,
integrating both sides we get $\ln(1 - F(t)) = -\int_0^t \kappa(s)\,ds + c$, and since $F(0) = 0$, we
get $c = 0$. Then by taking $e$ to both sides, we get $1 - F(t) = e^{-\int_0^t \kappa(s)\,ds}$, and so
$$F(t) = 1 - e^{-\int_0^t \kappa(s)\,ds},$$
so
$$f(t) = \kappa(t)\, e^{-\int_0^t \kappa(s)\,ds}.$$
Example 4. A supplier claims its new version of a product has half the failure rate of the
old product; that is, the claim is $\kappa_n(t) = .5\,\kappa_o(t)$. From experience one knows that $\frac{8}{9}$ of
the old machines failed to reach 100 days once they had lasted 80. How would the new
machines act under the same assumption of having lasted 80 days already?
We know that
$$P(X_o \geq 100 \mid X_o \geq 80) = \frac{1 - F_o(100)}{1 - F_o(80)}
= \frac{e^{-\int_0^{100} \kappa_o(s)\,ds}}{e^{-\int_0^{80} \kappa_o(s)\,ds}}
= e^{-\int_{80}^{100} \kappa_o(s)\,ds} = \frac{1}{9}.$$
By the same reasoning, $P(X_n \geq 100 \mid X_n \geq 80) = e^{-\int_{80}^{100} \kappa_n(s)\,ds}$. But $\kappa_n(t) = .5\,\kappa_o(t)$, so
$$e^{-\int_{80}^{100} \kappa_n(s)\,ds} = e^{-.5\int_{80}^{100} \kappa_o(s)\,ds}
= \left( e^{-\int_{80}^{100} \kappa_o(s)\,ds} \right)^{1/2}
= \left( \tfrac{1}{9} \right)^{1/2} = \tfrac{1}{3}.$$
So a full third of the machines will last to 100 days. Thus the net result of halving the
failure rate had the effect of taking the square root of the probability of survival!
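To see the square-root effect concretely, we can take the old failure rate to be constant on [80, 100] (an assumption made only for this illustration; only the integral of κ_o over [80, 100] matters, and any κ_o with the same integral gives the same answer):

```python
from math import exp, log

# Stand-in: constant kappa_o = ln(9)/20 on [80, 100], chosen so the
# old conditional survival probability is exactly 1/9.
kappa_o = log(9) / 20

old_survival = exp(-kappa_o * 20)          # e^{-ln 9} = 1/9
new_survival = exp(-0.5 * kappa_o * 20)    # half the failure rate -> 1/3
```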
Chapter 3
All for One
1. Joint Distributions
In this chapter we finally start looking at arbitrary multivariable situations. We now allow
ordered tuples of variables, and we start with some simple examples to clarify the
meaning of joint distributions.
Example 1. Suppose a bowl contains 10 red balls, 8 blue balls and 6 white balls. Four
balls are chosen simultaneously from the bowl. We let R denote the number of red balls
among the four and we let B denote the number of blue balls among the four. We
consider now the ordered pair ( R, B ) of random variables, and ask what values can this
ordered pair take and with what probabilities? The easiest way to describe them is via a
table (or matrix), where the common denominator of $\binom{24}{4} = 10{,}626$ has purposefully
been left out:
R\B 0 1 2 3 4 Row Sum
0 15 160 420 336 70 1001
1 200 1200 1680 560 0 3640
2 675 2160 1260 0 0 4095
3 720 960 0 0 0 1680
4 210 0 0 0 0 210
Column Sum 1820 4480 3360 896 70
And this table is then known as the joint distribution of the pair ( R, B ) . To explain the
entries in the table, we choose to explain the entry corresponding to R = 1 and B = 2 . But
then the number of ways of choosing the four balls, since we have to choose 1 red, 2 blue
and necessarily 1 white ball respectively, is $\binom{10}{1}\binom{8}{2}\binom{6}{1} = 10 \times 28 \times 6 = 1680$.
As an immediate benefit of having all this information, we get what is known as the
marginal distributions. Namely these are the individual distributions of the single
variables R and B respectively, and we obtain them by looking at the row sums and
column sums of our matrix. Thus the distribution for B is given by
B 0 1 2 3 4
P 1820 4480 3360 896 70
with of course the same denominator of 10626. Observe this is simply the random
variable $H_{24,8,4}$.
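The whole table, and both marginals, can be rebuilt in a few lines; a sketch:

```python
from math import comb

# Joint table of (R, B): choose r red of the 10, b blue of the 8, and the
# remaining 4 - r - b white of the 6.
ways = [[comb(10, r) * comb(8, b) * comb(6, 4 - r - b) if r + b <= 4 else 0
         for b in range(5)]
        for r in range(5)]

total = sum(sum(row) for row in ways)                       # C(24,4) = 10626
row_sums = [sum(row) for row in ways]                       # marginal of R
col_sums = [sum(ways[r][b] for r in range(5)) for b in range(5)]  # marginal of B
```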
121
Naturally, once we have the joint distribution we can ask almost any question involving
the random variables, such as the following 3 queries:
c $P(R + B \geq 3)$ is $\frac{7956}{10626} \approx 74.87\%$, since the only positions on the table that
are relevant are those with $R + B \geq 3$, and they add to 7956.
d Similarly, if our quest is $P(R \geq B)$, then the relevant positions are those on or
below the main diagonal, with a total of 7400, and a consequent probability
of $\frac{7400}{10626} \approx 69.64\%$.
e Finally, to compute $P(R \geq 2 \mid B \leq 1)$,
we need to compute $P(R \geq 2 \text{ and } B \leq 1)$ and $P(B \leq 1)$. Note that for the latter we can
use the marginal of B: $P(B \leq 1) = \frac{1820 + 4480}{10626} = \frac{6300}{10626}$, while the former adds
to $\frac{4725}{10626}$, so the conditional probability is $\frac{4725}{6300} = 75\%$.
Of course, these two random variables are not independent since the number of red balls
certainly has an effect on the number of blue balls—but even more directly from the table
one can see that there are zeroes on the table, yet there are no zeroes on the marginals,
and hence it is not possible for the entry inside the table to be the product of the
respective entries on the marginals, which is what independence means.
In fact, we expect these two random variables not only to not be independent, but to be
negatively correlated—in other words, the larger one of them is, the more likely the
smaller the other one is. To recall, the covariance of two random variables X and Y is
defined by
cov ( X , Y ) = E ( XY ) − E ( X ) E (Y ) .
Then their correlation (also called index of correlation) is simply
$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$
It is a fact that this number is always between −1 and 1. If positive, then the variables are
said to be positively correlated, while if negative, they are negatively correlated. Note that
the covariance alone determines the sign of the correlation.
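For the pair (R, B) of Example 1, the covariance can be computed straight from the joint table, and it is indeed negative:

```python
from math import comb

# Joint probabilities of (R, B) from Example 1.
N = comb(24, 4)  # 10626
p = {(r, b): comb(10, r) * comb(8, b) * comb(6, 4 - r - b) / N
     for r in range(5) for b in range(5) if r + b <= 4}

ER  = sum(r * pr for (r, b), pr in p.items())       # = 4*(10/24) = 5/3
EB  = sum(b * pr for (r, b), pr in p.items())       # = 4*(8/24) = 4/3
ERB = sum(r * b * pr for (r, b), pr in p.items())
cov = ERB - ER * EB                                 # negative, as expected
```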
In a similar fashion to the discrete example above, we can consider a continuous situation
where instead of the table we have a region of the plane consisting of the range of
ordered pairs of the two variables X and Y, for example, and then their joint density
would consist of a function defined above the region in such a way that the total volume
involved would naturally have to be 1.
Example 2. Consider two random variables X and Y whose range is the closed first
quadrant, $x \geq 0$ and $y \geq 0$, and whose joint density is given by the function
$$f(x, y) = c\, e^{-x} e^{-2y}$$
(on that range and 0 otherwise). First of all, in order for this to be a joint density, the
total volume under the surface must be 1, so we must have $\int_0^\infty \int_0^\infty c\, e^{-x} e^{-2y}\,dx\,dy = 1$,
and since $\int_0^\infty e^{-x}\,dx = 1$ and $\int_0^\infty e^{-2y}\,dy = \frac{1}{2}$, we get $c = 2$. Now we can compute,
for example:
c $P(X \geq 1 \text{ and } Y \leq 2)$:
$$\int_0^2 \int_1^\infty 2 e^{-x} e^{-2y}\,dx\,dy
= \int_0^2 2e^{-2y} \left[ -e^{-x} \right]_1^\infty dy
= e^{-1} \int_0^2 2e^{-2y}\,dy
= e^{-1} \left[ -e^{-2y} \right]_0^2
= e^{-1}\left(1 - e^{-4}\right) \approx 0.3611.$$
d $P(X < Y)$.
Again the key is to integrate over the relevant region, here the part of the quadrant
above the line $y = x$:
$$\int_0^\infty \int_0^y 2e^{-x} e^{-2y}\,dx\,dy
= \int_0^\infty 2e^{-2y} \left[ -e^{-x} \right]_0^y dy
= \int_0^\infty 2\left(1 - e^{-y}\right) e^{-2y}\,dy
= \int_0^\infty 2e^{-2y} - 2e^{-3y}\,dy
= \left[ -e^{-2y} + \tfrac{2}{3} e^{-3y} \right]_0^\infty = \tfrac{1}{3}.$$
e $P(X + Y \leq 2)$.
We now have
$$\int_0^2 \int_0^{2-y} 2e^{-x} e^{-2y}\,dx\,dy
= \int_0^2 2e^{-2y} \left[ -e^{-x} \right]_0^{2-y} dy
= \int_0^2 2e^{-2y} - 2e^{-y-2}\,dy
= 1 - 2e^{-2} + e^{-4} \approx .7476.$$
But again, having the joint density allows us to compute the marginal densities, which are
the respective densities of the individual random variables X and Y:
$$f_X(x) = \int_0^\infty 2e^{-x} e^{-2y}\,dy = e^{-x},
\qquad f_Y(y) = \int_0^\infty 2e^{-x} e^{-2y}\,dx = 2e^{-2y}.$$
And we readily observe that the joint density is the product of the two marginals,
$f(x, y) = f_X(x)\, f_Y(y)$.
We previously defined that X and Y are independent if for all numbers a and b,
P ( X ≤ a and Y ≤ b) = P ( X ≤ a)P (Y ≤ b) . In other words, X and Y are independent if the
events X ≤ a and Y ≤ b are always independent events. We use this example to show
that in the case of continuous random variables, independence is equivalent to the
joint density being the product of the marginals. For one direction, if $f(x, y) = f_X(x) f_Y(y)$, then
$$P(X \leq a \text{ and } Y \leq b)
= \int_{-\infty}^b \int_{-\infty}^a f_X(x)\, f_Y(y)\,dx\,dy
= \int_{-\infty}^b f_Y(y)\,dy \int_{-\infty}^a f_X(x)\,dx
= P(X \leq a)\, P(Y \leq b).$$
The middle equality holds because the integrand factors, so the double integral splits as
a product of two single integrals. The other direction follows from similar considerations.
Thus we can assert that the random variables of Example 2 are independent.
Having this characterization of independent random variables, we can then easily show
that independent random variables are uncorrelated—in other words, their covariance
is 0. But we should remind ourselves that variables can be uncorrelated without
necessarily being independent. The brief argument:
$$E(XY) = \int_{-\infty}^\infty \int_{-\infty}^\infty xy\, f(x, y)\,dx\,dy
= \int_{-\infty}^\infty \int_{-\infty}^\infty xy\, f_X(x)\, f_Y(y)\,dx\,dy
= \int_{-\infty}^\infty x f_X(x)\,dx \int_{-\infty}^\infty y f_Y(y)\,dy
= E(X)E(Y).$$
Thus, $\mathrm{cov}(X, Y) = E(XY) - E(X)E(Y) = 0$.
In a future section we will see how to compute the density of functions of random
variables such as $X + Y$ or $XY$. But we already can discuss
Example 3. The Sum of Uniforms. We actually consider the sum of two standard
uniforms. So let $U_1$ and $U_2$ be two independent standard uniforms, and $X = U_1 + U_2$. To
graph an inequality, one graphs the corresponding equation and sees which side of
the line satisfies the inequality. So to graph $X \leq \frac{1}{2}$, we graph the
line $x_1 + x_2 = \frac{1}{2}$ inside the unit square and take the
side containing the origin. We obtain a
shaded triangular region with area $\frac{1}{8}$, so now we
can answer $P(X \leq \frac{1}{2}) = \frac{1}{8}$. To do this in
general, we need to compute $P(X \leq a)$ for
any $0 \leq a \leq 2$. The pictures speak for
themselves: for $0 \leq a \leq 1$ the region is a triangle with area $\frac{a^2}{2}$, while for
$1 \leq a \leq 2$ the region left out is a triangle with area $\frac{(2-a)^2}{2}$,
and so we have that
$$P(X \leq a) = \begin{cases} \dfrac{a^2}{2} & 0 \leq a \leq 1 \\[4pt] 1 - \dfrac{(2-a)^2}{2} & 1 \leq a \leq 2 \end{cases}.$$
So for the density, we obtain
$$f_X(a) = \begin{cases} a & 0 \leq a \leq 1 \\ 2 - a & 1 \leq a \leq 2 \\ 0 & \text{otherwise} \end{cases}.$$
Closely associated with the sum of two random variables is their average. Specifically,
rather than $X = U_1 + U_2$, we would consider $Y = \frac{1}{2}(U_1 + U_2)$. Since $Y = \frac{X}{2}$,
its density is $f_Y(y) = 2 f_X(2y)$; that is,
$$f_Y(y) = \begin{cases} 4y & 0 \leq y \leq \frac{1}{2} \\ 4(1 - y) & \frac{1}{2} \leq y \leq 1 \\ 0 & \text{otherwise} \end{cases}.$$
The following table of values should give a sense of the distribution. One can observe,
for example, how much more often Y is around the middle rather than the extremes:
U1 U2 Y U1 U2 Y U1 U2 Y
0.926023 0.687290 0.806657 0.869625 0.801620 0.835623 0.360271 0.433231 0.396751
0.920365 0.112898 0.516631 0.810914 0.199639 0.505277 0.639698 0.688845 0.664272
0.980040 0.188716 0.584378 0.342260 0.545219 0.443740 0.594755 0.296755 0.445755
0.591154 0.930574 0.760864 0.179582 0.882815 0.531199 0.651303 0.661886 0.656595
0.380800 0.046967 0.213883 0.170059 0.358107 0.264083 0.923143 0.402448 0.662796
0.469614 0.288062 0.378838 0.971122 0.598000 0.784561 0.114371 0.444630 0.279500
0.268003 0.265989 0.266996 0.538538 0.189569 0.364053 0.142091 0.689431 0.415761
0.490173 0.196052 0.343112 0.535925 0.530224 0.533074 0.989111 0.270230 0.629671
0.641882 0.138761 0.390322 0.331605 0.908189 0.619897 0.597512 0.371834 0.484673
0.987128 0.796948 0.892038 0.050085 0.091245 0.070665 0.523877 0.203685 0.363781
0.036852 0.138046 0.087449 0.717419 0.981859 0.849639 0.178458 0.851269 0.514864
0.961909 0.121547 0.541728 0.972951 0.560992 0.766972 0.415218 0.064349 0.239783
In a later section, we will see how to compute the sum of three or more independent
standard uniforms.
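A quick simulation (the seed and sample size are arbitrary choices of ours) agrees with $P(X \leq \frac{1}{2}) = \frac{1}{8}$:

```python
import random

# Simulate X = U1 + U2 and estimate P(X <= 1/2).
random.seed(42)
n = 200_000
hits = sum(1 for _ in range(n) if random.random() + random.random() <= 0.5)
est = hits / n   # should be near 1/8 = 0.125
```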
since we have to choose two fires to be at the homes, 1 fire at an apartment, and the
remaining fire for a dwelling, thus the term $\binom{4}{2,1,1}$, which is known as a multinomial
coefficient and equals $\frac{4!}{2!\,1!\,1!}$. The rest of the computation is just as in the binomial
(Bernoulli) situation.
The three marginals in this case are nothing but binomials. For example, to compute
P ( H = 1) , we would need to add all the situations when H = 1 , and there are 4 such
possibilities: $(1,3,0)$, $(1,2,1)$, $(1,1,2)$ and $(1,0,3)$, giving us a grand total of 0.0574,
which also equals P ( B4,.73 = 1) . The reason why these marginals are binomials is simple:
in order to consider only the random variable H, all other fires have become non-H, so
now we are simply H or not H, success vs. failure. Thus, we know
$E(H) = 4 \times .73 = 2.92$ and $V(H) = 4 \times .73 \times .27 = .7884$. Perhaps the only interesting new
information is the covariances of the individual variables, for example $\mathrm{cov}(H, A)$. To
compute this all we need is E ( HA ) since we already have the individual expectations,
E ( H ) = 4 × .73 and E ( A) = 4 × .20 , respectively. To compute E ( HA ) we return to the
root of all binomials (and multinomials), the Bernoulli.
Now what random variable is $H_i A_i$? Simply 0, since it cannot be that the i-th fire is both
a home and an apartment. What random variable is $H_i A_j$ if $i \neq j$? Since we are assuming
the fires are independent, the distribution of $H_i A_j$ is very simple: it takes the value 1 with
probability $.73 \times .20$ and the value 0 with probability $1 - .73 \times .20$.
And so $E(H_i A_j) = .73 \times .20$, and thus $E(HA) = (4^2 - 4) \times .73 \times .20$. But $E(H) = 4 \times .73$ and
$E(A) = 4 \times .20$, so $E(H)E(A) = (4 \times .73)(4 \times .20) = 4^2 \times .73 \times .20$, and so
$$\mathrm{cov}(H, A) = -4 \times .73 \times .20.$$
Similarly, cov ( H , D ) = −4 × .73 × .07 and cov ( A, D ) = −4 × .20 × .07 .
Thus, on the main diagonal, we place the variances, and this is clearly a symmetric matrix
since cov ( X , Y ) always equals cov (Y , X ) . In our specific example
$$M = \begin{pmatrix} .7884 & -.5840 & -.2044 \\ -.5840 & .6400 & -.0560 \\ -.2044 & -.0560 & .2604 \end{pmatrix}.$$
Naturally, more than the number of fires, one is interested in the costs of the fires.
Suppose that we expect the cost of a house fire to be $25 thousand, while an apartment fire
is $15 thousand, and other dwellings are $5 thousand. Thus,
C = 25H + 15 A + 5D ,
and of course we could easily build the distribution of C from the table above. For
example, if (1,3, 0 ) occurs (with probability 2.33%), the cost will be $70 thousand.
Easily, the expected cost of the fires is given by
E ( C ) = 25E ( H ) + 15E ( A) + 5E ( D ) = $86.4 thousand.
But what is the variance of C, V ( C ) ? Since the variables are not independent, we cannot
just simply add the variances. Rather we have to do a computation—we will do it a little
more generically. Let X = aH + bA + cD where a, b and c are numbers. Then we know
E ( X ) = aE ( H ) + bE ( A) + cE ( D ) , so
$$E(X)^2 = \left( aE(H) + bE(A) + cE(D) \right)^2
= a^2E(H)^2 + b^2E(A)^2 + c^2E(D)^2 + 2abE(H)E(A) + 2acE(H)E(D) + 2bcE(A)E(D).$$
Also,
$$E(X^2) = E\left( (aH + bA + cD)^2 \right)
= a^2E(H^2) + b^2E(A^2) + c^2E(D^2) + 2abE(HA) + 2acE(HD) + 2bcE(AD).$$
Thus,
$$V(X) = E(X^2) - E(X)^2
= a^2V(H) + b^2V(A) + c^2V(D) + 2ab\,\mathrm{cov}(H, A) + 2ac\,\mathrm{cov}(H, D) + 2bc\,\mathrm{cov}(A, D).$$
This expression would be impossible to remember if it were not for the wonderful tool of
matrix multiplications—this is nothing but
$$\begin{pmatrix} a & b & c \end{pmatrix} M \begin{pmatrix} a \\ b \\ c \end{pmatrix}$$
where M is the covariance matrix described above. Thus, in our particular situation,
The method actually extends to compute the covariance of two linear combinations. Thus
⎛ 3⎞
for example if B = 3H + 4 A + 5D , then cov ( C , B ) = ( 25 15 5 ) M ⎜ 4 ⎟ .
⎜ ⎟
⎝5⎠
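For instance, a sketch of the quadratic form that computes V(C) for C = 25H + 15A + 5D, using the matrix M above:

```python
# Covariance matrix M from the fire example, and the quadratic form
# (a b c) M (a b c)^T with (a, b, c) = (25, 15, 5).
M = [[0.7884, -0.5840, -0.2044],
     [-0.5840, 0.6400, -0.0560],
     [-0.2044, -0.0560, 0.2604]]
v = [25, 15, 5]

Mv = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
VC = sum(v[i] * Mv[i] for i in range(3))   # variance of the cost, in $thousand^2
```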
$$\int_0^1 \int_0^y xy\,dx\,dy
= \int_0^1 y \left( \frac{x^2}{2} \Big|_0^y \right) dy
= \int_0^1 \frac{1}{2}\, y^3\,dy = \frac{1}{8},$$
and so $c = 8$.
We may think that X and Y are independent since the joint distribution seems like a
product of two marginals—but the triangular shape of the range should quickly dissuade
us from assuming that. Note that this shape indicates a relationship between X and Y,
especially since the former is always less than or equal to the latter!
In fact, computing the marginals, we get
$$f_X(x) = \int_x^1 8xy\,dy = 4xy^2 \Big|_x^1 = 4x - 4x^3 \quad \text{for } 0 \leq x \leq 1,$$
and
$$f_Y(y) = \int_0^y 8xy\,dx = 4x^2 y \Big|_0^y = 4y^3 \quad \text{for } 0 \leq y \leq 1,$$
and we readily give up on any claims of independence. In fact, if we consider the joint
distribution of the variables, this will become even clearer. As in the case of one variable,
the cumulative distribution of the pair X and Y is given by
$F(a, b) = P(X \leq a \text{ and } Y \leq b)$. If $b \leq a$, then easily
$F(a, b) = F(b, b)$, so without loss we can assume $0 \leq a \leq b \leq 1$.
In this case, the picture aids us in computing:
$$F(a, b) = \int_0^a \int_x^b 8xy\,dy\,dx
= \int_0^a 4x \left( y^2 \Big|_x^b \right) dx
= \int_0^a 4xb^2 - 4x^3\,dx
= a^2 \left( 2b^2 - a^2 \right),$$
and this is their joint distribution. Naturally, easily confirmed is the fact that $f = \frac{\partial^2 F}{\partial x \partial y}$.
As in the case of one variable, one can use the distribution to do a variety of
computations:
c $P(X \leq \frac{1}{2}, Y \leq \frac{1}{2}) = F(\frac{1}{2}, \frac{1}{2}) = \frac{1}{16}$.
d $P(X \leq \frac{3}{4}, Y \leq \frac{1}{2}) = F(\frac{1}{2}, \frac{1}{2}) = \frac{1}{16}$.
e $P(X \leq \frac{1}{2}, Y \geq \frac{1}{2}) = F(\frac{1}{2}, 1) - F(\frac{1}{2}, \frac{1}{2}) = \frac{6}{16}$.
f $P(X \geq \frac{1}{2}, Y \geq \frac{1}{2}) = 1 - F(\frac{1}{2}, 1) = \frac{9}{16}$.
g $P(X \leq \frac{1}{2}, Y \leq \frac{3}{4}) = F(\frac{1}{2}, \frac{3}{4}) = \frac{7}{32}$.
h $P(X \geq \frac{1}{2}, Y \leq \frac{3}{4}) = F(\frac{3}{4}, \frac{3}{4}) - F(\frac{1}{2}, \frac{3}{4}) = \frac{81}{256} - \frac{7}{32} = \frac{25}{256}$.
i $P(X + Y \leq 1)$:
$$\int_0^{1/2} \int_x^{1-x} 8xy\,dy\,dx
= \int_0^{1/2} 4x \left( y^2 \Big|_x^{1-x} \right) dx
= \int_0^{1/2} 4x - 8x^2\,dx = \frac{1}{6}.$$
We can compute $E(X)$ by two methods, since we already have the marginals; using the
marginal,
$$E(X) = \int_0^1 x\left(4x - 4x^3\right)dx = \frac{4}{3} - \frac{4}{5} = \frac{8}{15},
\qquad E(Y) = \int_0^1 y \cdot 4y^3\,dy = \frac{4}{5}.$$
We could also readily get the variances of X and Y: $E(X^2) = \int_0^1 4x^3 - 4x^5\,dx = \frac{1}{3}$, so
$V(X) = \frac{11}{225}$ and $\sigma_X \approx .2211$, and $E(Y^2) = \int_0^1 4y^5\,dy = \frac{2}{3}$, so $V(Y) = \frac{2}{75}$ and $\sigma_Y \approx .1633$.
However, we are not yet ready to compute the covariance matrix. We first need E(XY):
E(XY) = ∫_0^1 ∫_0^y (xy)·8xy dx dy = ∫_0^1 (8/3) y^2 x^3 |_0^y dy = ∫_0^1 (8/3) y^5 dy = 4/9.
And so
cov(X, Y) = E(XY) − E(X)E(Y) = 4/9 − (8/15)(4/5) = 4/225.
Our covariance matrix is then
M = (1/225) ⎛ 11  4 ⎞
            ⎝  4  6 ⎠ .
Having all this information allows us to quickly calculate the main attributes of the random variable Z = Y − X. Without having to compute the density, we know that
E(Z) = 4/5 − 8/15 = 4/15 and V(Z) = (−1, 1) M (−1, 1)^T = (11 − 4 − 4 + 6)/225 = 9/225.
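All of these moments are easy to confirm numerically; the following is a minimal sketch of ours (not from the text), using a midpoint Riemann sum over the triangle 0 ≤ x ≤ y ≤ 1:

```python
# Sketch: E(g(X, Y)) = double integral of g(x, y) * 8xy over 0 <= x <= y <= 1,
# computed with a midpoint rule whose inner y-grid is fitted to [x, 1].
def moment(g, n=300):
    h = 1.0 / n
    s = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        w = (1.0 - x) / n
        for j in range(n):
            y = x + (j + 0.5) * w
            s += g(x, y) * 8 * x * y * h * w
    return s

EX = moment(lambda x, y: x)                                # ~ 8/15
EY = moment(lambda x, y: y)                                # ~ 4/5
cov = moment(lambda x, y: x * y) - EX * EY                 # ~ 4/225
VZ = moment(lambda x, y: (y - x) ** 2) - (EY - EX) ** 2    # ~ 9/225
```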
In this section we have learned several crucial ideas and their uses: joint densities and
joint distributions, both continuous and discrete, marginals, independence, correlation
and covariance, and the covariance matrix.
Transformations
From the outset of the course, we have considered not only random variables but also combinations of them such as sums and products, as well as new variables obtained from old ones by taking a function of that variable, such as X^2 + 3. In a previous section, we looked at the transformation method for one variable; in this section we will look at the extension of this method to the multivariable case.
The multivariable version of the theorem is not that much harder to prove, except it takes a solid understanding of multivariable calculus. First one has to understand what the derivative of a function of several variables is, and it is (of course) a matrix. This matrix is made out of all possible partial derivatives. An example should suffice. Consider the function k : ℝ × ℝ → ℝ × ℝ given by k(x, y) = (x^2 + y^2, 2xy). We can readily see this function as made up of two other simpler functions, g(x, y) = x^2 + y^2 and h(x, y) = 2xy, so that now k(x, y) = (g(x, y), h(x, y)). Then what one means by the derivative of k is the matrix of partials,
⎛ ∂g/∂x  ∂g/∂y ⎞
⎝ ∂h/∂x  ∂h/∂y ⎠ .
In our specific case, it is given by the matrix
⎛ 2x  2y ⎞
⎝ 2y  2x ⎠ .
1. The J is for Jacobian, named after Carl Gustav Jacobi, a nineteenth-century mathematician.
J = ⎛ ∂g/∂x  ∂g/∂y ⎞
    ⎝ ∂h/∂x  ∂h/∂y ⎠
always (at least in the range of X and Y) has nonzero determinant. Then we have, for (z, w) = (g(x, y), h(x, y)),
f_{Z,W}(z, w) = f_{X,Y}(x, y) / |det J| .
The proof will be omitted (it is a consequence of the multivariable chain rule), but we should readily observe the similarity between the two theorems.
Naturally, there is a multivariable version of the theorem for any number of variables.
Example 1. Sums and Differences. Let X and Y have joint density f_{X,Y}. Consider Z = X + Y and W = X − Y. In that case
J = ⎛ 1   1 ⎞
    ⎝ 1  −1 ⎠ ,
whose determinant, −2, is never zero, so if z = x + y and w = x − y, then since 2x = z + w and 2y = z − w, we get immediately
f_{Z,W}(z, w) = f_{X,Y}(x, y) / 2 = (1/2) f_{X,Y}((z + w)/2, (z − w)/2).
Moment-Generating Functions
More than a transformation, the last technique we look at in this section is a transform method, the Laplace transform to be more exact. Ways of encoding sequences into functions (and vice versa) have been a useful device in mathematics for a considerable time, and that is exactly what we do now.
m_X(t) = 1 + E(X) t + E(X^2) t^2/2! + E(X^3) t^3/3! + ⋯ .
One of the wonderful facts about moment-generating functions is that if two random variables have the same moment-generating function, then they have the same distribution; that is, they are identically distributed.
Although a bit intimidating, the last expression above is very useful as the following
examples will illustrate.
Example 8. The Poisson. Consider now P_λ, a Poisson. Then let X = e^{tP_λ}, so that P(X = e^{tk}) = (λ^k/k!) e^{−λ}, and so
m_{P_λ}(t) = Σ_{k=0}^∞ e^{tk} (λ^k/k!) e^{−λ} = e^{−λ} Σ_{k=0}^∞ (λe^t)^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.
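The closed form can be checked against a partial sum of the defining series; this is a sketch of ours (the parameter values are our choices, not from the text):

```python
import math

# Sketch (not from the text): check m(t) = exp(lambda*(e^t - 1)) against a
# partial sum of the defining series sum_k e^{tk} * lambda^k * e^{-lambda} / k!.
def poisson_mgf_series(lam, t, terms=100):
    term = math.exp(-lam)                 # the k = 0 term
    total = term
    for k in range(1, terms):
        term *= lam * math.exp(t) / k     # multiply in (lam * e^t) / k
        total += term
    return total

lam, t = 2.0, 0.5
closed_form = math.exp(lam * (math.exp(t) - 1))
```

Accumulating each term from the previous one avoids computing large factorials directly.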
(1/√(2π)) ∫_{−∞}^∞ e^{tx − x^2/2} dx = (1/√(2π)) ∫_{−∞}^∞ e^{−(1/2)(x^2 − 2tx + t^2) + t^2/2} dx = e^{t^2/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(1/2)(x − t)^2} dx = e^{t^2/2}.
Certainly one would have to agree that some of the expressions are not that memorable, but the following theorem will point out their usefulness.
For the binomial, m_{B_{n,p}}(t) = (q + pe^t)^n. Thus, if we so wanted, we could easily compute the first three moments of the binomial: E(B_{n,p}) = m'_{B_{n,p}}(0) = np, E(B^2_{n,p}) = m''_{B_{n,p}}(0) = n^2p^2 + npq, and E(B^3_{n,p}) = m'''_{B_{n,p}}(0) = np(n^2p^2 + q^2 + 3npq − pq).
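These moment formulas can be verified by direct enumeration of the binomial pmf; here is a sketch of ours (the sample values n = 10, p = 0.3 are our choices). Note that the second moment m''(0) works out to n^2p^2 + npq, the variance npq plus the square of the mean:

```python
import math

# Sketch (not from the text): check the binomial moment formulas by direct
# enumeration of the pmf, for the sample values n = 10, p = 0.3.
n, p = 10, 0.3
q = 1 - p
pmf = [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

m1 = sum(k * w for k, w in enumerate(pmf))         # E(B)   = np
m2 = sum(k ** 2 * w for k, w in enumerate(pmf))    # E(B^2) = n^2 p^2 + npq
m3 = sum(k ** 3 * w for k, w in enumerate(pmf))    # E(B^3) = np(n^2 p^2 + q^2 + 3npq - pq)
```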
m_W(t) = m_X(t) m_Y(t) = e^{σ_1^2 t^2/2 + μ_1 t} e^{σ_2^2 t^2/2 + μ_2 t} = e^{(σ_1^2 + σ_2^2) t^2/2 + (μ_1 + μ_2) t},
In this section we introduce the last family of random variables of the course. Earlier we looked at the normal distribution, which is symmetric about its mean. However, many distributions can take values only on one side of the axis, positive for example; certainly Z^2 would be such a random variable, as would the exponential. The gamma is one of these with nonnegative range, but before we can discuss it we need to do some integration.
Throughout this section α will denote a positive real number. Consider the following definition (which is acceptable since, as y → ∞, e^{−y} converges to 0 much faster than y^{α−1} converges to ∞):
Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy.
Γ(3/2) = (1/2)√π.
Next we extend this integral a bit. First let β > 0 also. Then ∫_0^∞ y^{α−1} e^{−y/β} dy can be computed by the substitution y = βu, and equals β^α Γ(α). So now we are ready to define our new family: a random variable with density
f(y) = y^{α−1} e^{−y/β} / (β^α Γ(α)) for y > 0, and f(y) = 0 otherwise,
is known as a gamma random variable with parameters α and β, and will be denoted by G_{α,β}.
[Graphs: densities of gamma random variables with α = 2 and respective β's, β = 1, 2, 4.]
Computing the two key parameters of G_{α,β} is rather easy:
E(G_{α,β}) = [1/(β^α Γ(α))] ∫_0^∞ y · y^{α−1} e^{−y/β} dy = [1/(β^α Γ(α))] ∫_0^∞ y^α e^{−y/β} dy = β^{α+1} Γ(α+1) / (β^α Γ(α)) = αβ.
And
E(G^2_{α,β}) = [1/(β^α Γ(α))] ∫_0^∞ y^2 y^{α−1} e^{−y/β} dy = [1/(β^α Γ(α))] ∫_0^∞ y^{α+1} e^{−y/β} dy = β^{α+2} Γ(α+2) / (β^α Γ(α)) = β^2 α(α+1).
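Both moments can be confirmed by numeric integration; a sketch of ours (the parameters α = 2, β = 3 are our choices, not from the text):

```python
import math

# Sketch (not from the text): numerically confirm E(G) = alpha*beta and
# E(G^2) = beta^2 * alpha * (alpha + 1), for the sample values alpha = 2, beta = 3.
def gamma_moment(alpha, beta, power, upper=150.0, n=300_000):
    h = upper / n
    norm = beta ** alpha * math.gamma(alpha)   # the constant beta^alpha * Gamma(alpha)
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        total += y ** (power + alpha - 1) * math.exp(-y / beta) / norm * h
    return total

alpha, beta = 2.0, 3.0
mass = gamma_moment(alpha, beta, 0)       # ~ 1 (it is a density)
mean = gamma_moment(alpha, beta, 1)       # ~ alpha * beta = 6
second = gamma_moment(alpha, beta, 2)     # ~ beta^2 * alpha * (alpha + 1) = 54
```

The upper truncation at 150 is harmless here since e^{−y/3} is astronomically small by then.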
m_{G_{α,β}}(t) = E(e^{tG_{α,β}}) = [1/(β^α Γ(α))] ∫_0^∞ e^{ty} y^{α−1} e^{−y/β} dy = [1/(β^α Γ(α))] ∫_0^∞ y^{α−1} e^{−y(1−βt)/β} dy = [1/(β^α Γ(α))] (β/(1−βt))^α Γ(α) = (1/(1−βt))^α.
m_{G_{α,β}+G_{λ,β}}(t) = m_{G_{α,β}}(t) m_{G_{λ,β}}(t) = (1/(1−βt))^α (1/(1−βt))^λ = (1/(1−βt))^{α+λ} = m_{G_{α+λ,β}}(t).
Thus, we have that the sum of n (independent) exponentials with the same mean β is a gamma: X_β + X_β + ⋯ + X_β (n summands) = G_{n,β}. This extends a previous observation in a former section.
For independent gammas X = G_{α,β} and Y = G_{λ,β}, the joint density is
f_{X,Y}(x, y) = x^{α−1} e^{−x/β} y^{λ−1} e^{−y/β} / (β^α Γ(α) β^λ Γ(λ)).
Consider U = X + Y and V = X/(X + Y). The range of U is all nonnegative numbers, and we already know that U = G_{α+λ,β}. The range of V is the unit interval 0 ≤ v ≤ 1. What is the joint density of U and V, and what is the marginal of V? Using the transformation method, we get the matrix
J = ⎛ 1            1           ⎞
    ⎝ y/(x+y)^2   −x/(x+y)^2 ⎠ ,
and so det J = −1/(x + y).
If u = x + y and v = x/(x + y), then x = uv and y = u − uv, so
f_{U,V}(u, v) = f_{X,Y}(x, y) / |det J| = (x + y) x^{α−1} e^{−x/β} y^{λ−1} e^{−y/β} / (β^α Γ(α) β^λ Γ(λ))
= u (uv)^{α−1} e^{−uv/β} (u − uv)^{λ−1} e^{−(u−uv)/β} / (β^α Γ(α) β^λ Γ(λ))
= u^{α+λ−1} v^{α−1} (1 − v)^{λ−1} e^{−u/β} / (β^{α+λ} Γ(α) Γ(λ))
= [e^{−u/β} u^{α+λ−1} / (β^{α+λ} Γ(α+λ))] · [v^{α−1} (1 − v)^{λ−1} Γ(α+λ) / (Γ(α) Γ(λ))]
= f_U(u) · [v^{α−1} (1 − v)^{λ−1} Γ(α+λ) / (Γ(α) Γ(λ))].
From which we can conclude that U and V are independent, and that
f_V(v) = v^{α−1} (1 − v)^{λ−1} Γ(α+λ) / (Γ(α) Γ(λ)).
Such a random variable is known as a beta random variable with parameters α and λ ,
Tα ,λ . Note that as an immediate consequence since this is a density, we get
∫_0^1 v^{α−1} (1 − v)^{λ−1} dv = Γ(α) Γ(λ) / Γ(α + λ).
The following theorem will capture the main properties of the beta random variables. T_{α,λ} has density
f(x) = x^{α−1} (1 − x)^{λ−1} Γ(α+λ) / (Γ(α) Γ(λ))
over its range [0, 1]. Then
E(T_{α,λ}) = α/(α + λ) and V(T_{α,λ}) = αλ/((α + λ)^2 (α + λ + 1)).
Proof. We know that ∫_0^1 x · x^{α−1} (1 − x)^{λ−1} dx = ∫_0^1 x^α (1 − x)^{λ−1} dx = Γ(α+1) Γ(λ) / Γ(α+1+λ). Thus
E(T_{α,λ}) = [Γ(α+1) Γ(λ) / Γ(α+1+λ)] · [Γ(α+λ) / (Γ(α) Γ(λ))] = α/(α + λ),
by the fundamental recursion of the gamma function. Now for the second moment, ∫_0^1 x^2 x^{α−1} (1 − x)^{λ−1} dx = Γ(α+2) Γ(λ) / Γ(α+2+λ) as before, so
E(T^2_{α,λ}) = [Γ(α+2) Γ(λ) / Γ(α+2+λ)] · [Γ(α+λ) / (Γ(α) Γ(λ))] = α(α+1)/((α+λ)(α+λ+1)),
and so the variance is given by
V(T_{α,λ}) = α(α+1)/((α+λ)(α+λ+1)) − (α/(α+λ))^2 = αλ/((α+λ)^2 (α+λ+1)). ∎
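The beta moments can be checked numerically as well; a sketch of ours (the parameters α = 2, λ = 3 are our choices, not from the text):

```python
import math

# Sketch (not from the text): check E(T) = a/(a+l) and
# V(T) = a*l / ((a+l)^2 * (a+l+1)) for the sample values a = 2, l = 3.
def beta_moment(a, l, power, n=100_000):
    c = math.gamma(a + l) / (math.gamma(a) * math.gamma(l))  # normalizing constant
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x ** (power + a - 1) * (1 - x) ** (l - 1) * c * h
    return total

a, l = 2.0, 3.0
m1 = beta_moment(a, l, 1)              # ~ a/(a+l) = 0.4
var = beta_moment(a, l, 2) - m1 ** 2   # ~ a*l/((a+l)^2 (a+l+1)) = 0.04
```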
We do one more type of random variable, one that derives from the gamma. Again, let us use Z to denote N(0,1), the standard normal. Then we consider the random variable Z^2. Certainly its range is the set of nonnegative reals, and for any such number y,
F_{Z^2}(y) = P(Z^2 ≤ y) = P(−√y ≤ Z ≤ √y) = ∫_{−√y}^{√y} f_Z(t) dt = 2 ∫_0^{√y} f_Z(t) dt = 2 ∫_{−∞}^{√y} f_Z(t) dt − 1 = 2 F_Z(√y) − 1,
so
f_{Z^2}(y) = e^{−y/2} / √(2πy) = y^{−1/2} e^{−y/2} / (2^{1/2} Γ(1/2)),
and so Z^2 = G_{1/2,2}, a gamma with expectation 1 and variance 2.
The sum of n independent copies of Z^2 is then G_{n/2,2}, with expectation n and variance 2n. This type of random variable is so important it acquires a special name: it is known as a χ^2 random variable with n degrees of freedom. We will illustrate some of its uses in the examples below. The probabilities of such a random variable are available in tables similar to the one for the normal.
The χ^2 random variable was used by Karl Pearson to develop a goodness-of-fit test. It works as follows:
Example 3. Officers and Horses Again. An example from the past examined the number of cavalry officers killed by horses. The data was as follows:

Number of Deaths            0        1        2        3        4        5
Actuality                 144       91       32       11        2        0

The idea, as before, is to model this occurrence with a Poisson random variable. We obtain λ = .7, and so we have the following table:

Number of Deaths            0        1        2        3        4        5
Poisson Probability       0.49658  0.34761  0.12166  0.02838  0.00496  0.00078
Expected # of occurrences 139.04   97.33    34.07     7.95     1.39     0.22
Actuality                 144       91       32       11        2        0
The idea then is to compare the last two rows of the table. The χ^2 test adds the squares of the differences between what is expected and what occurred, each divided by what is expected; so in our case
(144 − 139.04)^2/139.04 + (91 − 97.33)^2/97.33 + (32 − 34.07)^2/34.07 + (11 − 7.95)^2/7.95 + (2 − 1.39)^2/1.39 + (0 − 0.22)^2/0.22,
which gives a total of 2.3716. This is the key statistic, which is then checked in a χ^2 table (with 5 degrees of freedom) and found to have a reasonable probability of occurring, though not as high as 90%, so one is reasonably satisfied with the model, but not totally certain that the fit is perfect. The reason that 5 degrees of freedom are used is that 6 pieces of data are being compared, but since their sums agree, we only have 5 degrees.
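The whole computation can be reproduced in a few lines; this is a sketch of ours (not from the text). Avoiding the intermediate rounding of the table entries gives a statistic of about 2.35, slightly below the 2.3716 obtained above from the rounded values:

```python
import math

# Sketch (not from the text): recompute lambda and the chi-square
# goodness-of-fit statistic from the horse-kick data above.
deaths = [0, 1, 2, 3, 4, 5]
observed = [144, 91, 32, 11, 2, 0]
total = sum(observed)                                        # 280 observations
lam = sum(k * n for k, n in zip(deaths, observed)) / total   # sample mean = 0.7

# Expected counts under the Poisson(lambda) model.
expected = [total * math.exp(-lam) * lam ** k / math.factorial(k) for k in deaths]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```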
We end the section with another, slightly different application of the χ^2 test.
Example 4. Two-Way Tables. The effectiveness of a new flu vaccine was being tested in a small city. The vaccine was provided free of charge in a two-shot sequence over a period of two weeks to anybody who wanted it. Later, a survey of 1000 townspeople provided the following information:
Status No Vaccine One Shot Two Shots Total
Flu 24 9 13 46
No Flu 289 100 565 954
Total 313 109 578 1000
We attempt to measure whether the vaccine had an effect on whether a person got the flu. If there had been no effect, then we could say that the rows and the columns are independent of each other, so what we should be getting in each cell of the table is the share of the totals for that row and that column; thus in the 1,1-position we would be getting 46 × 313/1000 = 14.40. If we compute each of these we get the table

 14.40    5.01   26.59
298.60  103.99  551.41

Now we are ready to compute the χ^2 statistic as we did before, and we obtain the following table of values

6.40  3.17  6.94
0.31  0.15  0.33

which add up to 17.31.
The only mystery remaining is to decide how many degrees of freedom we have. We have six cells to start with, but we lose one because all the cells add up to the same number, and we lose another because the row sum of the first row in both tables is the same. We do not lose one for the second row, since that had been accounted for with the total of 1000. But we lose 2 more for the first two columns; again, the third column is already accounted for with the total, so we have 2 degrees of freedom remaining. Now the probability of getting as high a value as 17.31 with two degrees of freedom is less than .005. Thus we can reasonably conclude that there is some effect, since what occurred would be highly unlikely if there had been none.
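The expected counts and the statistic are easily recomputed; a sketch of ours (not from the text):

```python
# Sketch (not from the text): recompute the expected counts and the
# chi-square statistic for the vaccine two-way table.
observed = [[24, 9, 13],
            [289, 100, 565]]
row_totals = [sum(row) for row in observed]            # [46, 954]
col_totals = [sum(col) for col in zip(*observed)]      # [313, 109, 578]
grand = sum(row_totals)                                # 1000

# Under independence, each cell gets (row total * column total) / grand total.
expected = [[r * c / grand for c in col_totals] for r in row_totals]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(3))        # ~ 17.31
```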
Conditioning Further
In this last section, we further explore the conditioning of random variables, and end with a brief discussion of order statistics. Let us review via an example what a conditional distribution is.
(n choose k) (λ/(λ+δ))^k (δ/(λ+δ))^{n−k} = P(B_{n, λ/(λ+δ)} = k),
and so we obtain that X | X + Y = n is nothing but a binomial.
Example 2. Let the amount of time a student takes to finish an exam be given by the random variable X with density f_X(x) = x^2/144000 + x/3600 in the range 0 to 60 (we are measuring time in minutes). This random variable has mean 42.5 minutes and a standard deviation of approximately 13.18 minutes. It is known that no one has finished the exam in less than 15 minutes, so if we condition using this fact, we will get a different random variable:
F_{X|X≥15}(x) = P(X ≤ x | X ≥ 15) = P(15 ≤ X ≤ x) / P(X ≥ 15) = (x^3 + 60x^2 − 16875) / 415125
in the range 15 ≤ x ≤ 60. Thus, f_{X|X≥15}(x) = (x^2 + 40x) / 138375, and so its mean is slightly higher at 43.81 minutes, but on the other hand, its standard deviation has decreased a little, to 11.67 minutes. Perhaps the latter random variable is more realistic; only the data will tell.
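The conditional mean and standard deviation quoted above are easy to confirm by numeric integration; a sketch of ours (not from the text):

```python
# Sketch (not from the text): numerically check the conditional mean (~43.81)
# and standard deviation (~11.67) of X | X >= 15 for the exam-time density.
def f(x):
    return x * x / 144000 + x / 3600       # density of X on [0, 60]

n = 200_000
h = (60 - 15) / n
xs = [15 + (i + 0.5) * h for i in range(n)]
mass = sum(f(x) for x in xs) * h           # P(X >= 15)
mean = sum(x * f(x) for x in xs) * h / mass
m2 = sum(x * x * f(x) for x in xs) * h / mass
sd = (m2 - mean * mean) ** 0.5
```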
But can we condition a continuous random variable on an event similar to the one in the first example, namely that of a random variable taking a specific value? Suppose we have two jointly continuous random variables X and Y. Then, for small h and k,
P(x ≤ X ≤ x + h and y ≤ Y ≤ y + k) / P(y ≤ Y ≤ y + k) = [ (1/(hk)) ∫_y^{y+k} ∫_x^{x+h} f_{X,Y}(t, s) dt ds / ( (1/k) ∫_y^{y+k} f_Y(s) ds ) ] · h,
and as h and k shrink, the bracketed ratio approaches f_{X,Y}(x, y) / f_Y(y).
Example 3. Suppose that the range of the random variables X and Y is the (open) unit square, and in that range their joint density is f_{X,Y}(x, y) = (12/5) x (2 − x − y); the constant 12/5 is exactly what makes this integrate to 1. What is the expectation of X?
To do that we could compute the marginal of X, and then from it derive the expectation, or we could simply do the double integral. We opt for the latter, so
E(X) = ∫_0^1 ∫_0^1 x (12/5) x (2 − x − y) dy dx = ∫_0^1 (12/5) x^2 (3/2 − x) dx = 3/5 = 0.6.
But suppose we are given that Y = y; what then is the expectation of X? We are really asking for E(X | Y). So first we need to compute the density of this random variable. Readily, the range of X | Y is the open interval 0 to 1, and there
f_{X|Y}(x) = f_{X,Y}(x, y) / f_Y(y) = x(2 − x − y) / ∫_0^1 x(2 − x − y) dx = x(2 − x − y) / (2/3 − y/2) = 6x(2 − x − y) / (4 − 3y).
Observe that for any y in the interval 0 to 1, this is a density: it has nonnegative values, and
∫_0^1 [6x(2 − x − y)/(4 − 3y)] dx = [1/(4 − 3y)] ∫_0^1 (12x − 6x^2 − 6xy) dx = [1/(4 − 3y)] (6x^2 − 2x^3 − 3x^2 y) |_0^1 = (4 − 3y)/(4 − 3y) = 1.
E(X | Y) = ∫_0^1 x [6x(2 − x − y)/(4 − 3y)] dx = [1/(4 − 3y)] ∫_0^1 (12x^2 − 6x^3 − 6x^2 y) dx = [1/(4 − 3y)] (4x^3 − 1.5x^4 − 2x^3 y) |_0^1 = (2.5 − 2y)/(4 − 3y) = (5 − 4y)/(8 − 6y).
Thus if y = 0.5, we get the conditional expectation of X to be 0.60, while at y = 0, for instance, it would be 0.625; the value of Y has an effect on the expectation of the other random variable. But something remarkable happens: let us now consider the random variable Z = E(X | Y) = (5 − 4Y)/(8 − 6Y), and let us compute E(Z) = E(E(X | Y)). That is simple:
E(Z) = ∫_0^1 ∫_0^1 [(5 − 4y)/(8 − 6y)] (12/5) x (2 − x − y) dx dy = ∫_0^1 [(5 − 4y)/(8 − 6y)] (2/5)(4 − 3y) dy = ∫_0^1 (5 − 4y)/5 dy = (1/5)(5y − 2y^2) |_0^1 = 3/5.
But we have seen that number before, it was the expectation of X—coincidence? NO
WAY—it is a theorem.
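A numeric check (ours, not from the text) makes the theorem concrete for this example; note that 12/5 is the constant that normalizes x(2 − x − y) on the unit square:

```python
# Sketch (not from the text): verify numerically that E(E(X|Y)) = E(X) for
# f(x, y) = (12/5) x (2 - x - y) on the unit square,
# using the conditional expectation E(X | Y = y) = (5 - 4y) / (8 - 6y).
def integrate2d(g, n=400):
    h = 1.0 / n
    return sum(g((i + 0.5) * h, (j + 0.5) * h) * h * h
               for i in range(n) for j in range(n))

f = lambda x, y: (12 / 5) * x * (2 - x - y)

mass = integrate2d(f)                                                  # ~ 1
EX = integrate2d(lambda x, y: x * f(x, y))                             # ~ 3/5
tower = integrate2d(lambda x, y: (5 - 4 * y) / (8 - 6 * y) * f(x, y))  # ~ 3/5
```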
Example 4. ‘Mazing Rats. In order to determine whether rats can distinguish colors, or remember them in any case, a rat is put into a maze with three swinging doors colored red, white, and blue. Behind the red door there is a path to a piece of cheese; the path will take about 3 minutes for the rat to travel. Behind the white door there is a maze that returns the rat to the starting point after roughly 5 minutes, while behind the blue door there is a similar but longer maze that returns the rat to the starting point after 7 minutes. Assuming that the rat is color-blind and memoryless, so it will take any door at random at any time, on the average how long will it take to reach the cheese? Let X be the variable that measures the time until the rat reaches the cheese, and let Y be the door that the rat chooses the first time. Then we assume Y = red, white, or blue with equal probability, 1/3. Now if the first occurs, then X = 3, so E(X | Y = red) = 3. On the other hand, easily E(X | Y = white) = 5 + E(X) and E(X | Y = blue) = 7 + E(X), so
E(X) = (1/3)(3 + 5 + E(X) + 7 + E(X)) = 5 + (2/3) E(X),
and hence we conclude the rat will take 15 minutes on the average to reach the cheese.
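A seeded simulation (a sketch of ours, not from the text) supports the answer of 15 minutes:

```python
import random

# Sketch (not from the text): Monte Carlo check that the expected time to
# the cheese is 15 minutes under the memoryless-rat assumption.
def time_to_cheese(rng):
    t = 0.0
    while True:
        door = rng.choice(["red", "white", "blue"])
        if door == "red":
            return t + 3                       # straight path to the cheese
        t += 5 if door == "white" else 7       # back to the starting point

rng = random.Random(0)
trials = 100_000
avg = sum(time_to_cheese(rng) for _ in range(trials)) / trials   # ~ 15
```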
Rather than give a formal proof of this theorem, we will use a long example to illustrate
why it is true.
Example 5. Consider the following two random variables. One rolls two dice, and Y
records the sum of the two dice while X records the highest value of either die. Their joint
distribution is given by the following table.
X/Y     2     3     4     5     6     7     8     9    10    11    12
1     1/36    0     0     0     0     0     0     0     0     0     0
2       0   2/36  1/36    0     0     0     0     0     0     0     0
3       0     0   2/36  2/36  1/36    0     0     0     0     0     0
4       0     0     0   2/36  2/36  2/36  1/36    0     0     0     0
5       0     0     0     0   2/36  2/36  2/36  2/36  1/36    0     0
6       0     0     0     0     0   2/36  2/36  2/36  2/36  2/36  1/36
Let us use the table to compute the marginals of both X and Y. So now we obtain an
extended table:
X/Y     2     3     4     5     6     7     8     9    10    11    12  |   P
1     1/36    0     0     0     0     0     0     0     0     0     0  |  1/36
2       0   2/36  1/36    0     0     0     0     0     0     0     0  |  3/36
3       0     0   2/36  2/36  1/36    0     0     0     0     0     0  |  5/36
4       0     0     0   2/36  2/36  2/36  1/36    0     0     0     0  |  7/36
5       0     0     0     0   2/36  2/36  2/36  2/36  1/36    0     0  |  9/36
6       0     0     0     0     0   2/36  2/36  2/36  2/36  2/36  1/36 | 11/36
P     1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Let us next compute the random variable E(X | Y). We can actually compute this random variable from the table above:
X/Y     2     3     4     5     6     7     8     9    10    11    12  |   P
1       1     0     0     0     0     0     0     0     0     0     0  |  1/36
2       0     1    1/3    0     0     0     0     0     0     0     0  |  3/36
3       0     0    2/3   1/2   1/5    0     0     0     0     0     0  |  5/36
4       0     0     0    1/2   2/5   1/3   1/5    0     0     0     0  |  7/36
5       0     0     0     0    2/5   1/3   2/5   1/2   1/3    0     0  |  9/36
6       0     0     0     0     0    1/3   2/5   1/2   2/3    1     1  | 11/36
P     1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
E(X|Y)  1     2    8/3   7/2  21/5    5   26/5  11/2  17/3    6     6
and we need to understand the nature of the terms in this table. The 1 in the first column comes from the fact that if Y = 2, then X has to be 1. Let us proceed to the third column (the second is similar to the first). In the third column we see 1/3 and 2/3, which stem from the 1/36 and 2/36 in the original table, since we are apportioning 1 in each column according to the probabilities in that column. Each column is obtained that way. Now we need to quickly observe that the entries in the E(X | Y) row are obtained by taking the expectation of each column.
Finally, if we compute E(E(X | Y)) by using the last row of the table, we will obtain 161/36 = E(X). Why did that happen? What happened to a typical entry inside the table?
First in order to become a probability for the conditional, it got divided by the column
sum, then it got multiplied by the row so it would become a summand in the computation
for the value E ( X  Y ) in that column. But then that value got multiplied by the marginal
of Y in that column, which is the column sum! So all that remained then was the sum of
all the entries times the respective values of X, and that is exactly the expectation of X.
Arguing a bit abstractly, let p_ij denote the probability in the i,j-position of the table, and let c_j be the column sum of the jth column. As observed above, this is the marginal of Y. Let x_i be the value of X in the ith row (in this case x_i = i, but this is irrelevant). Then we start with p_ij/c_j, multiply it by x_i, and add all of these over a given column, so we end up with Σ_i (p_ij/c_j) x_i, and that becomes the value of the random variable E(X | Y) in column j. But now to compute E(E(X | Y)), we take the sum of all of these values over all columns after being multiplied by the marginal of Y:
Σ_j c_j Σ_i (p_ij/c_j) x_i = Σ_j Σ_i p_ij x_i = E(X).
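The bookkeeping above can be carried out in exact arithmetic; a sketch of ours (not from the text):

```python
from fractions import Fraction
from collections import defaultdict

# Sketch (not from the text): rebuild the two-dice table exactly and
# verify that E(E(X|Y)) = E(X) = 161/36.
joint = defaultdict(Fraction)                  # joint[(x, y)] = P(X = x, Y = y)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        joint[(max(d1, d2), d1 + d2)] += Fraction(1, 36)

col = defaultdict(Fraction)                    # marginal of Y (the column sums c_j)
for (x, y), p in joint.items():
    col[y] += p

cond_exp = {y: sum(x * p for (x, yy), p in joint.items() if yy == y) / col[y]
            for y in col}                      # E(X | Y = y), column by column

EX = sum(x * p for (x, y), p in joint.items())
tower = sum(cond_exp[y] * col[y] for y in col)
```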
One should observe that in a similar fashion one proves the expectation of a sum is the sum of the expectations:
Σ_j Σ_i x_i p_ij + Σ_j Σ_i y_j p_ij = E(X) + E(Y).
We end the section (and the course) with a brief discussion of order statistics.
In the last example we looked at the maximum of the roll of two dice. Likewise we can consider the maximum of the roll of three dice. More formally, let D_1, D_2 and D_3 be the rolls of three dice (independent of course), and consider D_max = max{D_1, D_2, D_3}. Then the distribution of D_max is given by

D_max     1       2       3       4       5       6
P       1/216   7/216  19/216  37/216  61/216  91/216

We could have also discussed D_min, whose distribution is instead

D_min     1       2       3       4       5       6
P      91/216  61/216  37/216  19/216   7/216   1/216

Note that E(D_max) = 1071/216 while E(D_min) = 441/216.
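Both distributions follow from a direct enumeration of all 216 equally likely rolls; a sketch of ours (not from the text):

```python
from fractions import Fraction
from itertools import product

# Sketch (not from the text): enumerate all 216 rolls of three dice and
# recover the distributions and expectations of D_max and D_min.
pmax = {k: Fraction(0) for k in range(1, 7)}
pmin = {k: Fraction(0) for k in range(1, 7)}
for roll in product(range(1, 7), repeat=3):
    pmax[max(roll)] += Fraction(1, 216)
    pmin[min(roll)] += Fraction(1, 216)

E_max = sum(k * p for k, p in pmax.items())    # 1071/216
E_min = sum(k * p for k, p in pmin.items())    # 441/216
```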
In the continuous case, if X_1, X_2, …, X_t are independent and identically distributed, then if we let X_max = max{X_1, ..., X_t}, then X_max has the same range as any of the X_i's. Easily X_max ≤ a if and only if X_i ≤ a for all i, and since these are independent events, we obtain P(X_max ≤ a) = Π_i P(X_i ≤ a). So if we let f(x) and F(x) denote the common density and distribution of the X_i's, then the distribution of X_max, F_max, is simply
F_max(a) = (F(a))^t.
For example, if the X_i are uniform on [0, 1] and we write U_max for their maximum, then F_max(a) = a^t, so the density is t a^{t−1}, and E(U_max) = t/(t+1) and E(U_max^2) = t/(t+2), so
V(U_max) = t/(t+2) − (t/(t+1))^2 = t/((t+1)^2 (t+2)).
The distribution of the minimum is just as simple as that of the maximum: X_min ≥ a if and only if X_i ≥ a for all i, and again since these are independent events, we obtain P(X_min ≥ a) = Π_i P(X_i ≥ a). So if f(x) and F(x) are as before, then we have
1 − F_min(a) = (1 − F(a))^t.
For example, let the X_i be exponentials with parameter β. Then we know 1 − F(x) = e^{−x/β}, so 1 − F_min(x) = (e^{−x/β})^t = e^{−tx/β}, and we readily obtain that X_min = X_{β/t}, an exponential where the average has been divided by t.
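A seeded simulation (a sketch of ours, not from the text) illustrates this: the minimum of t exponentials with mean β averages close to β/t.

```python
import random

# Sketch (not from the text): the minimum of t independent exponentials with
# mean beta should average about beta / t; here beta = 4 and t = 5 (our choices).
rng = random.Random(1)
beta, t, trials = 4.0, 5, 100_000
avg = sum(min(rng.expovariate(1 / beta) for _ in range(t))
          for _ in range(trials)) / trials       # ~ beta / t = 0.8
```

Note that `random.expovariate` takes the rate 1/β, not the mean β.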
since the expectation of a sum is the sum of the expectations. Moreover, since the variance of an independent sum is the sum of the variances, and the variance of a constant times a variable is the constant squared times the variance of the variable, we obtain
V(Y_n) = (1/n^2)(nσ^2) = σ^2/n,
so by Chebyshev's Inequality, for any positive integer k,
P(|Y_n − μ| ≥ 1/k) ≤ V(Y_n)/(1/k)^2 = σ^2 k^2 / n,
and we have proven the Law of Large Numbers.
Note that this theorem establishes the crucial role that the expectation plays as opposed to
the mode and the median—there is no comparable theorem about the other measurements
of central tendency.
We end the section with a precise statement of the Central Limit Theorem. First we let Z denote the standard normal, which has density
f(y) = (1/√(2π)) e^{−y^2/2},
and as we will see, it has expectation 0 and standard deviation 1.