
Part 2

Probability Concepts
CHAPTER 4

Basic Probability

1. Introduction

In Chapter 3, we computed summary population parameters and sample statistics such as mean and
standard deviation for datasets. Because we were not given information about whether one observation
was more “likely” than another, we gave them all equal weight. (This is sometimes called the “equally
likely” assumption.) For example, in the formula for the population mean µ of a numerical attribute,
taken over all N members of the population,
µ = (x1 + x2 + · · · + xN)/N = (1/N) Σ xi   (summing xi over i = 1, . . . , N),

each of x1 , . . . , xN is given the same weight, namely 1/N .


But what if we are told some of the observations are more likely than others? Surely we should give the
more likely observations more weight in the formula? The weight we use is the probability of seeing that
observation.
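To see the difference the weights make, here is a small Python sketch; the data values and the probabilities attached to them are invented purely for illustration:

```python
# Illustrative sketch (values are made up): compare the equal-weight mean
# with a probability-weighted mean.
xs = [2, 4, 6]

# Equal weights: every observation gets the same weight 1/N.
equal_mean = sum(xs) / len(xs)                        # (2 + 4 + 6) / 3 = 4.0

# Unequal weights: suppose we are told P(2) = 0.5, P(4) = 0.3, P(6) = 0.2.
ps = [0.5, 0.3, 0.2]
weighted_mean = sum(p * x for p, x in zip(ps, xs))    # 0.5*2 + 0.3*4 + 0.2*6 ≈ 3.4

print(equal_mean, weighted_mean)
```

With equal weights every xi contributes 1/N; with probabilities as weights, the more likely observations pull the mean towards themselves, which is exactly the idea developed in this chapter.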

2. Overview of Probability

The mathematical theory of probability deals with patterns that occur in random events. Probability
is a well-established branch of mathematics that originated in 17th century France: gamblers who had
to leave a game early wanted a “fair” way of dividing the pot. This meant working out the chance
of each winning and dividing the pot in proportion to those chances. Pascal and Fermat1 had a long
written correspondence which led to the founding of probability theory. This branch of mathematics
studies games and other situations which involve chance or uncertainty. Today, probability theory finds
applications in every area of academic activity from finance and economics to physics, and in daily
experience from weather prediction to predicting the risks of new medical treatments.
Probability theory tries to describe random occurrences, e. g., measuring the height of a randomly chosen
person.
We attempt to describe how likely a given outcome or set of outcomes is when we observe a random
process or experiment. To say an experiment is random means: (a) it is repeatable under identical
conditions; (b) the outcome of any particular trial can vary; (c) if we repeat the experiment many times,
we see some statistical regularity in the outcomes. For example:

• We roll a die2 and the possible outcomes are: 1, 2, 3, 4, 5 and 6, corresponding to the side that
turns up;
• We toss a coin with possible outcomes: H (heads) and T (tails);
• A bank gives out a loan and it is repaid in full or not repaid: Repaid (R), Not repaid (N).
1Among other things (Pascal’s triangle, the programming language Pascal named after him, his lifelong ill-health, . . . )
the philosopher and mathematician Blaise Pascal is famous for Pascal’s wager on why he should believe in God: Pascal
reasons that if he believes in God but God does not exist, then he has lost a finite amount (of time); but if he does not
believe in God, yet God exists, then he has lost an infinite reward (eternity in heaven); so purely because of the infinite
difference in consequences, he is better off believing in God. Fermat was a well-off official who contributed to many areas
of maths, especially number theory: his famous Fermat’s last theorem, conjectured in 1637 in the margin of his copy of
Diophantus’s book Arithmetica, opened up whole new areas of mathematics but was not proven until 358 years later in
1994 by Andrew Wiles.
2The word die is singular, while the word dice is its plural.

60 MIS10090

The observation or experiment is often called a trial. Much of the time, we measure some numerical
attribute derived from the outcome (see random variables below) as we can do numerical calculations
on these such as mean and standard deviation.

2.1. Terms in probability. There are some technical terms which you will need to understand to
progress further.
Definition 4.1 (Outcome). An outcome (or observation) is the result of observing a random process or
experiment, that is, the result of carrying out a trial, e. g., measuring a person’s height as 1.8m.
Definition 4.2 (Sample space). The sample space is the set of all possible outcomes, e. g., the set of all
heights of people measured. It is often denoted by Ω or S. It is also known as the probability space.

The sample space depends on what we want to observe in the experiment, e. g., the experiment might
be on all students in the class, but S is different if we measure weights rather than heights.
In the examples earlier, the sample spaces are, respectively: {1, 2, 3, 4, 5, 6}, {H, T } and {R, N }.
Remark 4.3 (For your information: not examinable). The positive integers N = {1, 2, 3, 4, . . .} are also
called the natural numbers or counting numbers. A set S is called countable if it can be put in one-one
correspondence with a subset of N, possibly a finite subset of N. That is, S is countable if you can write
S as a (possibly infinite) indexed list S = {s1 , s2 , s3 , . . .}.
In mathematics, there are different sizes of infinity, and N is the smallest. A set is called uncountable if
it is not countable: this means it has “more” elements than N, i. e., is of a larger infinity, such as R, the
real numbers. R and intervals within R, e. g., [a, b] = {x ∈ R : a ≤ x ≤ b} where a < b, are the only
uncountable sets we look at.
From this comes the fundamental distinction between discrete and continuous that we will meet later in
the context of distributions and random variables. A sample space S is discrete if S is countable (which
includes the case of S being finite); and S is continuous if S is uncountable. Then a discrete distribution
or random variable (see later) is one defined on a countable S; while a continuous distribution or random
variable is one defined on an uncountable S.
Definition 4.4 (Event). An event is a defined set of outcomes (a subset of the sample space). A
single-element event (just one outcome) is sometimes called an elementary event.
Example 4.5. We might say the event “Giant person” is the set of all measurements of a person’s height
as > 2.0m, that is, the subset G = {height : height > 2.0}.
Or if rolling a die, we might define “Even” as the event {2, 4, 6} where an even number is rolled. }
Definition 4.6 (Occurs). An event occurs if in our trial we observe one of the outcomes corresponding
to that event.
Example 4.7. If we measure a person’s height of 2.1m, the event “Giant person” or G occurs.
If our die roll shows 4, then the event “Even” occurs. If the roll shows 2, then the event “Even” occurs.
But if the roll shows 5, then the event “Even” does not occur. }
Example 4.8. Two distinct six-sided dice are rolled and the numbers on their faces noted. Describe the
sample space. Define and describe the events
A = {the outcomes where the sum of the two faces is 6}
B = {the outcomes where both dice show the same number}.
}
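Example 4.8 can be worked through by brute-force enumeration. The sketch below lists the sample space for two distinct dice as ordered pairs and filters out the events A and B:

```python
from itertools import product

# Sample space for two distinct six-sided dice: all ordered pairs (d1, d2).
S = list(product(range(1, 7), repeat=2))
assert len(S) == 36            # 6 faces x 6 faces

# A: outcomes where the sum of the two faces is 6.
A = [(d1, d2) for (d1, d2) in S if d1 + d2 == 6]
# B: outcomes where both dice show the same number.
B = [(d1, d2) for (d1, d2) in S if d1 == d2]

print(len(S), len(A), len(B))  # 36 5 6
```

A contains the five pairs (1, 5), (2, 4), (3, 3), (4, 2), (5, 1), and B contains the six "doubles"; note the two events share the single outcome (3, 3).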
Definition 4.9 (Mutually exclusive). Two or more events are said to be mutually exclusive (or disjoint)
if at most one of them can occur when the experiment is performed, that is, if no two of them have
outcomes in common.

Important: in the next chapter we will meet the concept of independent events. Be aware from the word
Go that mutually exclusive and independent are totally different concepts!
Data Analysis for Decision Makers 61

Example 4.10. Suppose our experiment is “Select one card at random from a deck of cards”: then S
is the set of all 52 cards. Define Event A as “Queen of Diamonds is selected” and Event B as “Queen
of Clubs is selected”. Then Events A and B are mutually exclusive: at most one of the events A, B can
occur (maybe neither occurs). }

Definition 4.11 (Collectively exhaustive). Two or more events are said to be collectively exhaustive if
at least one of the events must occur. The union of these events covers the whole sample space.

Definition 4.12 (Partition). A collection of events is called a partition of the sample space if the events
are both collectively exhaustive and mutually exclusive, that is, exactly one of the events must occur.

Example 4.13. Define events A = {aces}; B = {black cards}; C = {diamonds}; D = {hearts}.


Events A, B, C and D are collectively exhaustive, but not mutually exclusive: e. g., an ace may also be
a heart.
However, events B, C and D are collectively exhaustive and mutually exclusive, that is, they form a
partition of S. }

2.2. Displaying events as Venn diagrams. Since events are subsets of the sample space, one
way to display them is by using Venn diagrams. Each of the events is viewed as a set and combined
with the other events in the appropriate manner (union, intersection, etc.) to represent the problem.
Figure 4.1 shows mutually exclusive events A and B.

Figure 4.1. Two mutually exclusive sets (events) A and B within a sample space S

Venn diagrams are particularly useful when the events are not mutually exclusive, that is, there is some
overlap. See Figure 4.2.
What do we mean by the event A and B? We mean the event which occurs when both A occurs and B
occurs. It is called a joint event since A and B occur jointly, i. e., together. This means that the outcome
observed from the random process is an element of both A and B. In Example 4.13 on cards, the event
A and B occurs if the card drawn is a black ace (ace of spades or ace of clubs).
In set notation, it would be more correct to write A ∩ B for A and B, since A and B is the set A ∩ B in
a Venn diagram. We can use either, but you may find A and B or A & B more intuitive. See Figure 4.3.

Similarly, what do we mean by the event A or B? We mean the event which occurs when either or both
of events A or B occur. This means that the outcome observed from the random process is an element
of either A or B (or both). In Example 4.13 on cards, the event A or B occurs if the card drawn is
either a black card or an ace (or both).
In set notation, it would be more correct to write A ∪ B for A or B, since A or B is the set A ∪ B in a
Venn diagram. Again, we can use either, but you may find A or B more intuitive. See Figure 4.4.

Figure 4.2. Two sets (events) A and B within a sample space S. The joint event
A ∩ B = A and B is in green. The event A ∪ B = A or B is all of the red, blue and
green areas together.

Figure 4.3. Two sets (events) A and B within a sample space S, focussing on the joint
event A ∩ B = A and B = A & B

Figure 4.4. Two sets (events) A and B within a sample space S, focussing on the event
A ∪ B = A or B.

2.3. Probability of an event occurring. We seek to describe how “likely” an event is, e. g., how
likely it is that we will find a “Giant person”.

Definition 4.14. We assign a number between 0 and 1 to an event E to describe the likelihood of E
occurring, with a larger number meaning “more likely”. This number is called the probability of the
event E, and is written as P (E).

These probabilities must obey certain rules, given below. But first, we will consider what probability
might mean: how should we interpret the word “probability”.

3. Interpretations of Probability

There are three major interpretations of the term “probability” of a given event:

Classical / Objectivist / A priori interpretation: If we know all possible outcomes in advance3,
we can calculate the a priori classical probability of a particular event by working out
probabilities in advance (e. g., from first principles):

probability of occurrence = (number of ways the event can occur) / (total number of outcomes)
                          = (number of favourable outcomes) / (total number of outcomes).
This assumes all outcomes are equally likely: an assumption often made unless we have explicit
reason to believe not all outcomes are equally likely.
It is not always possible to use the a priori method since we may not be able to list all
the possible outcomes. Notice that — even with the “equally likely” assumption — we still
need to be able to count the total number of outcomes and the number of favourable outcomes
(where “favourable” means “satisfying the criteria for our chosen event”); later, we will look at
methods of counting and when to use each.
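The a priori calculation above can be sketched directly in code: list the sample space, list the favourable outcomes, and take the ratio, with the "equally likely" assumption built in:

```python
# A priori (classical) probability for one roll of a fair die:
# favourable outcomes / total outcomes, assuming all six faces equally likely.
S = [1, 2, 3, 4, 5, 6]
even = [s for s in S if s % 2 == 0]   # the event "Even" = {2, 4, 6}

p_even = len(even) / len(S)
print(p_even)  # 0.5
```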
Physical / Empirical / Frequency interpretation: We often run experiments (an “empiri-
cal” approach) to see how often certain events occur.
The empirical interpretation of probability is the proportion (or relative frequency) of times
I will observe this event over a very large number of trials (of some physical/objectively mea-
surable process).
For example, if we toss a fair coin a million times, we should see heads about half the time:
P (heads) = 0.5.
empirical probability: work out probabilities based on observing experiments:

probability of occurrence = (number of favourable outcomes observed) / (total number of outcomes observed)

where, again, “favourable” means “satisfying the criteria for our chosen event”. Unlike the a
priori formula, this one does not need the “equally likely” assumption: the observed relative
frequency already reflects how likely each outcome is.
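A quick way to see the frequency interpretation in action is to simulate the coin-tossing experiment; this sketch uses Python's pseudo-random generator with a fixed seed so the run is reproducible:

```python
import random

random.seed(1)  # fixed seed so the experiment is reproducible

# Empirical probability: relative frequency of heads over many simulated
# tosses of a fair coin.
n_trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(n_trials))

p_heads = heads / n_trials
print(p_heads)  # should be close to 0.5
```

The relative frequency is not exactly 0.5 on any finite run, but it settles close to 0.5 as the number of trials grows: this is the "statistical regularity" mentioned earlier.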
Bayesian / Subjectivist / Evidential interpretation: This is the (subjective) degree of be-
lief I have that this event will occur (I may modify this belief in the light of new evidence).
We could consult with an expert in the field of interest who may have insight into the
likelihood of occurrence of events. For example, a barrister may give a professional opinion that
you have a 60% chance of winning a court case. Clearly you are not going to hold the court
case a thousand or a million times to see what proportion you win. Rather, this is the degree
of (informed) belief the barrister has about the outcome “win”.
It is called Bayesian because Bayes’s Theorem (see the next chapter) is used to modify our
degree of belief (probability) in the light of new evidence.

3A priori is Latin meaning “in advance” or “beforehand”.



Dept./Gender    Male   Female   Total
IT                50       80     130
Production       240      350     590
Management       120       80     200
Sales             60       20      80
Total            470      530    1000

Table 4.1. Role and gender of 1000 employees in a company

4. Rules of Probability

4.1. The basic rules.

• Probabilities always lie between 0 and 1, inclusive: 0 ≤ p ≤ 1


• If there are finitely many possible outcomes, all equally likely, then, for an event E,

  P(E) = (number of outcomes in E) / (total number of possible outcomes in the sample space)

  This is just the a priori classical probability we saw before.

• Given two events A and B, the probability of either or both of events A or B occurring is
  calculated using the formula:

  P(A or B) = P(A) + P(B) − P(A and B)
To get an intuition as to why this is so, look again at Figure 4.2: event A is the blue and green
areas, while event B is the red and green areas. Then we can get the area covered by A or B
as all of the red, blue and green areas together. This is the area of A together with the area
of B: but we must subtract the area of the joint event A \ B = A and B (green) to avoid
double-counting.
• (consequence of the previous point) Probabilities of mutually exclusive events add up: if events A1
  and A2 are mutually exclusive, then P(A1 or A2) = P(A1) + P(A2), since P(A1 and A2) = 0. (See
  Figure 4.1: there is no overlap, so there is no danger of double-counting.)

• The sum of the probabilities of a set of mutually exclusive and collectively exhaustive events is 1.
  – This is because the union of the mutually exclusive and collectively exhaustive set of events
    is the whole sample space S.
  – In particular, the total probability over S equals 1: P(S) = 1 (sum the probabilities of all
    outcomes).
  – E. g., if the sample space is S = A1 ∪ A2 ∪ B with A1, A2 and B mutually exclusive
    (non-overlapping), then P(A1) + P(A2) + P(B) = 1.

Figure 4.5. A sample space of three mutually disjoint sets.
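The addition rule is easy to check by brute force. The sketch below builds a standard 52-card deck as a set of (rank, suit) pairs and verifies the rule on the events of Example 4.13 (A = aces, B = black cards); the rank and suit labels are just one convenient encoding:

```python
from itertools import product

# Check the addition rule P(A or B) = P(A) + P(B) - P(A and B)
# on Example 4.13: A = aces, B = black cards, in a standard 52-card deck.
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))
assert len(deck) == 52

A = {card for card in deck if card[0] == "A"}                  # the 4 aces
B = {card for card in deck if card[1] in ("spades", "clubs")}  # the 26 black cards

def p(event):
    return len(event) / len(deck)

lhs = p(A | B)                  # P(A or B): the union of the two events
rhs = p(A) + p(B) - p(A & B)    # addition rule; A & B = the 2 black aces
print(lhs, rhs)                 # both equal 28/52
```

Without subtracting P(A and B), the two black aces would be counted twice, which is exactly the double-counting the green overlap in Figure 4.2 warns about.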

Example 4.15. Table 4.1 shows the numbers of employees with particular roles (rows) and genders
(columns) in a company of 1000 employees.

If we wished to select random employees from this company (a sample space of size 1000), this table
gives us the probabilities.4
The second column represents the event “Employee is male” while the third column represents the event
“Employee is female”. For example, the probability that an employee chosen at random is female is
530/1000 = 0.53 = 53%. The gender events are mutually exclusive; each employee appears in only one
of the columns.
The rows show which department an employee works in. We can use the table to find the probability
that an employee works in sales. Let S be the event that the employee works in the sales department:
then P (S) = 80/1000 = 0.08 = 8%. Again, these are mutually exclusive events: if an employee is in the
sales dept, he/she is not in IT.
There are no other genders and no other departments for this company: all 1,000 employees are counted
somewhere so this listing of events is also collectively exhaustive.
A joint event such as “Female and works in Sales” is represented in the cell of the table where the
respective row (Sales) meets the respective column (Female). We see that P (Female and works in
Sales) = 20/1000 = 0.02 = 2%.
}
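The contingency-table calculations of Example 4.15 can be reproduced with a plain dictionary of cell counts; summing rows or columns gives the marginal probabilities, and a single cell gives a joint probability:

```python
# Table 4.1 as a dict mapping (department, gender) to an employee count.
counts = {
    ("IT", "Male"): 50,          ("IT", "Female"): 80,
    ("Production", "Male"): 240, ("Production", "Female"): 350,
    ("Management", "Male"): 120, ("Management", "Female"): 80,
    ("Sales", "Male"): 60,       ("Sales", "Female"): 20,
}
total = sum(counts.values())
assert total == 1000

# Marginal probabilities: sum a whole column or row over the grand total.
p_female = sum(n for (dept, g), n in counts.items() if g == "Female") / total
p_sales = sum(n for (dept, g), n in counts.items() if dept == "Sales") / total

# Joint probability: a single cell of the table over the grand total.
p_female_and_sales = counts[("Sales", "Female")] / total

print(p_female, p_sales, p_female_and_sales)  # 0.53 0.08 0.02
```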

4.2. Complements.
Definition 4.16. The complement of an event A is the event Ā = S \ A comprising all the outcomes
that are not in A.

If A does not occur, Ā must occur, so we have P(Ā) = 1 − P(A). Thus A and Ā together form a partition
of the sample space.

Figure 4.6. A set (event) A and its complement.

Other notations you may see for the complement of A include ∼A, A′, or Aᶜ.
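In code, the complement is just a set difference, and the complement rule can be checked directly; here for the event "Even" on one die roll:

```python
# Complement rule P(not A) = 1 - P(A), for the event "Even" on one die roll.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
A_complement = S - A    # set difference: all outcomes of S not in A

def p(event):
    return len(event) / len(S)

print(p(A), p(A_complement))  # 0.5 0.5
```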

5. Counting

There are various ways of working out the probability of an event E: as mentioned above, many of these
involve counting
(a) the number of outcomes in event E (“favourable outcomes”); and
(b) the total number of possible outcomes in the sample space;
then (assuming all outcomes are equally likely) taking the ratio of these as
P(E) = favourable / total.
But how do we count? It can depend on several things, such as
• whether the order in which we list items is important;
• whether an item is replaced after we have counted it, or not replaced.
4This is also called a contingency table, or cross-tabulation table, and allows the sample space for a particular problem
to be viewed in a tabular format.

If it is replaced, we could possibly count it again, so it might be repeated;


if it is not replaced, it cannot be drawn or counted again, so cannot be repeated.

We now introduce some standard mathematical5 terms, which we will see used in different parts of the
course.

5.1. Multiplication Principle. Suppose I want to buy a new car. Imagine (this is not very
realistic — just an example) I can choose from 3 makes of car. Each make has 2 models and each model
can be provided in any one of 5 colours. How many choices do I have?
I can choose from 3 makes; for each of those I can choose from 2 models; for each of those I can choose
from 5 colours: that gives 3 × 2 × 5 = 30 choices overall.
For each independent choice, I multiply together the number of options available for that choice.
This works in general for any number of choices and is called the multiplication principle: if I have c1
choices for variable x1, c2 choices for variable x2, . . . , cm choices for variable xm, then in total I have

c1 × c2 × · · · × cm

ways to assign values to all m variables x1, . . . , xm.
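The car example can be enumerated exhaustively with itertools.product, which generates exactly the c1 × c2 × c3 combinations; the concrete make, model and colour names below are placeholders, not from the text:

```python
from itertools import product

# Multiplication principle: 3 makes x 2 models x 5 colours of car.
# The concrete names are invented placeholders.
makes = ["make A", "make B", "make C"]
models = ["model 1", "model 2"]
colours = ["red", "blue", "green", "black", "white"]

choices = list(product(makes, models, colours))
print(len(choices))  # 3 * 2 * 5 = 30
```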

5.2. Factorials. The factorial of a positive integer (whole number) n is written as n! and means
the product of all the whole numbers from n down to 1:

n! = n × (n − 1) × (n − 2) × · · · × 3 × 2 × 1.

For example, “four factorial” is written as 4! and is 4 × 3 × 2 × 1 = 24.
We take 0! to be 1 by convention (it keeps things consistent).
This exclamation mark is a standard notation, so don’t use an exclamation mark after a number unless
you really mean the factorial.
Why use it: the factorial n! gives the number of ways in which n distinguishable objects can be ordered
in n distinguishable boxes. By “distinguishable”, we mean that if any two objects were swapped, the
outcome would be different.

Example 4.17. Suppose we have 4 people to be assigned to 4 different officer positions in a club. The
positions are: President, Vice-president, Secretary and Treasurer. How many different assignments of
officers are there?
Solution: We have n = 4 distinguishable objects to fill 4 positions. Thus, there are 4! different ways of
ordering the people among the positions, that is, 4 × 3 × 2 × 1 = 24 ways or assignments. }
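Example 4.17 can be checked by listing every ordering with itertools.permutations; the names are invented for illustration:

```python
from itertools import permutations
from math import factorial

# Example 4.17: assign 4 people to 4 distinct officer positions.
# The names are invented for illustration.
people = ["Ann", "Ben", "Cara", "Dan"]

# Each ordering of the 4 people fills the positions
# (President, Vice-president, Secretary, Treasurer) in order.
assignments = list(permutations(people))

print(len(assignments), factorial(4))  # 24 24
```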

5.3. Combinations. A choice of k objects, without regard to order and without repetition, selected
from n distinct objects is called a combination of n objects taken k at a time.
In other words, a combination is a way of choosing a subset of k objects from a set of n objects, where
order does not matter (recall that the order in which we list elements in a set is unimportant).
The number of such combinations (the number of ways we can choose a subset of k items from the pool
of n items) is given by:
(n choose k) = n! / (k! (n − k)!)

This is often read as “n choose k”. An alternative notation for (n choose k) is nCk.

5These terms are most used in the branch of mathematics called Combinatorics, which studies counting, combining
and permuting of objects.

Example 4.18. You are going to draw 4 cards from a standard deck of 52 cards. How many different
4-card hands are possible?
Solution: This is a combination problem, because a hand of cards is a subset of cards where the order
does not matter. Therefore, n = 52 and k = 4. The number of possible 4-card hands is

(52 choose 4) = 52! / (4!(52 − 4)!) = 52! / (4! 48!)
             = (52 × 51 × 50 × 49) / (4 × 3 × 2 × 1)     (cancel 48! above and below)

that is, 270,725 different 4-card hands. }
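The same count can be confirmed in code, both from the factorial formula and with Python's built-in math.comb:

```python
from math import comb, factorial

# Example 4.18: the number of 4-card hands from a 52-card deck.
n, k = 52, 4
by_formula = factorial(n) // (factorial(k) * factorial(n - k))

print(by_formula, comb(n, k))  # 270725 270725
```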
Exercise 4.19. A poker hand comprises 5 cards. How many different poker hands can be dealt from a
standard deck of 52 cards?

5.4. Permutations. An ordered arrangement of k objects, without repetition, selected from n
distinct objects, is called a permutation of n objects taken k at a time, and is denoted by:

nPk = n! / (n − k)!.
When you need to count the number of ways you can arrange items where order is important, use
permutations to count.
Thus forming a permutation asks:
• How many ways can we select k things from n things: this is (n choose k) = n! / (k!(n − k)!).
• Then, for each of these ways of selecting, how many ways can we rearrange the k things among
  themselves: this is k!.
This is another way of seeing that

nPk = n! / (n − k)! = k! × n! / ((n − k)! k!) = k! × (n choose k) = k! × nCk.

Rearrangements don’t matter when we are just selecting a subset (combination), but do matter when
we are asking about the number of ordered arrangements: this is why nPk is k! times the size of (n choose k).
Example 4.20. Find the number of four-letter words (whether meaningful or not) which can be formed
out of the letters of the word ROSE, where repetition of letters is not allowed.
Solution: This is a permutation problem because, in a word, letters are arranged with emphasis on their
order (item, time and mite have the same letters but are not the same word!). Therefore, n = 4 and
k = 4, so we compute

4P4 = 4! / (4 − 4)! = 4! / 0! = 24 / 1 = 24 four-letter words.

(Remember that the value of 0! is 1.)
Hence, 24 four-letter words, with or without meaning, can be formed out of the letters of the word
ROSE, where repetition of letters is not allowed.
We can think of this another way: the first place can be filled in 4 different ways by any one of the
4 letters R, O, S, E. Then, the second place can be filled by any one of the remaining 3 letters in
3 different ways, following which the third place can be filled in 2 different ways; following which, the
fourth place can be filled in 1 way. Thus, the number of ways in which the 4 places can be filled, by the
multiplication principle, is 4 × 3 × 2 × 1 = 24. }
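Example 4.20 can likewise be verified by generating every arrangement, and compared against Python's built-in math.perm, which computes nPk directly:

```python
from itertools import permutations
from math import perm

# Example 4.20: four-letter arrangements of the letters of ROSE,
# with no repetition of letters.
words = ["".join(w) for w in permutations("ROSE", 4)]

print(len(words), perm(4, 4))  # 24 24
```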
Exercise 4.21. How many four-letter words (whether meaningful or not) can be formed out of the
letters of the word AROSE? Show the details of your work.
