
CHAPTER 5

Conditional Probability and Bayes' Theorem

1. Introduction

Previously we mentioned the Bayesian interpretation of probability as the (subjective) degree of belief I
have that an event will occur. (I may modify this belief in the light of new evidence.)

It is called Bayesian because Bayes' Theorem is used to modify our degree of belief (probability) in the
light of new evidence. But first we need to discuss how the probability of one event depends on another:
conditional probability.

2. Joint and Conditional Probability

2.1. Joint probability.

Definition 5.1. The joint probability of events A and B is the probability P(A and B) of both events
occurring, sometimes expressed as P(A ∩ B) (say ‘A intersection B’).

In Figure 4.2 in Chapter 4, we saw two sets (events) A and B within a sample space S, with the joint
event A ∩ B = A and B shown in green.

Examples:

• Probability that a random playing card is a red ace (red and an ace)
• Probability that a person is dark-haired and has a limp
• Probability that a town has a railway station and a youth hostel

For the last of these, Figure 5.1 shows the joint event as the intersection of the two sets “Towns with
Stations” (A) and “Towns with Hostels” (B).

Figure 5.1. Joint probability
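As a quick check on the first example above, a short Python sketch can enumerate a standard 52-card deck and count the red aces; the deck construction here is our own illustration, not part of the original notes.

# Joint probability by enumeration: P(red and ace) for a standard deck.
suits = ["hearts", "diamonds", "clubs", "spades"]   # hearts and diamonds are red
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
deck = [(rank, suit) for suit in suits for rank in ranks]

red_aces = [card for card in deck
            if card[0] == "A" and card[1] in ("hearts", "diamonds")]

# Two red aces out of 52 cards: 2/52 = 1/26, roughly 0.0385.
print(len(red_aces) / len(deck))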


2.2. Independent events.

Definition 5.2. Two events A and B are called independent if the occurrence of one event does not
influence the probability of occurrence of the other event.

If the two events A and B are independent, then:


P(A and B) = P(A)P(B)   (multiply probabilities)

In particular, note that A and B being independent is an entirely different thing from having P(A and B) =
0, which we call mutually exclusive. Suppose A and B are mutually exclusive: if A occurs then B cannot
occur, so the occurrence of A has definitely changed the probability of B occurring, to P(B|A) = 0; the
two events are therefore certainly not independent!
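The contrast between independent and mutually exclusive events is easy to check numerically. The Python sketch below is our own illustration (not from the notes), enumerating the 36 equally likely outcomes of rolling two fair dice.

from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
space = set(product(range(1, 7), repeat=2))

A = {o for o in space if o[0] == 6}   # first die shows 6
B = {o for o in space if o[1] == 6}   # second die shows 6
C = {o for o in space if o[0] == 1}   # first die shows 1

def prob(event):
    return len(event) / len(space)

# Independent events: the joint probability equals the product.
print(prob(A & B), prob(A) * prob(B))   # both 1/36 ≈ 0.0278

# Mutually exclusive events: joint probability is 0, not the product.
print(prob(A & C), prob(A) * prob(C))   # 0.0 versus 1/36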

2.3. Conditional probability.

Definition 5.3. The conditional probability of event B on event A is the probability that event B occurs,
given that event A has already occurred. It is defined as
P(B|A) = P(A and B) / P(A).

Notice that if A and B are independent events, then


P(B|A) = P(A and B) / P(A)
       = P(A)P(B) / P(A)    (since A and B are independent)
       = P(B).
That is, the probability of B given that A has occurred is just the normal probability of B: exactly as
we would expect if A and B are independent.
Also note that, rearranging the terms in the definition, P(A and B) = P(B|A)P(A).
By symmetry (since P(A and B) = P(B and A)), we also have P(A and B) = P(A|B)P(B).
Conditional probability is intuitively easier to see in terms of areas in a Venn diagram. Consider the
following question: if we know that a town has a railway station (that is, A has occurred), what is the
probability that it also has a youth hostel (that is, what is P(B|A))?

Figure 5.2. Conditional probability

Taking the conditional probability P(B|A) means we are changing (reducing) the sample space by
restricting to the subset A. Intuitively, if the areas on the diagram are roughly proportional to the
probabilities, we are asking: what is the relative proportion within the event A in which the event B
occurs (the right-hand part of Figure 5.2)?
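To make the picture concrete, here is a short Python sketch with made-up town counts (illustrative only, not real data from the notes):

# Hypothetical counts for 100 towns -- illustrative only.
n_towns = 100
n_station = 40      # towns with a railway station (event A)
n_both = 10         # towns with both a station and a hostel (A and B)

p_a = n_station / n_towns
p_a_and_b = n_both / n_towns

# P(B|A): within the reduced sample space of station towns,
# what proportion also have a hostel?
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)  # 0.25, i.e. 10 of the 40 station towns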

2.3.1. Total probability. If we have an event A and another event B, we can write the probability of
A in terms of two things:

• the probability of A conditioned on B, and
• the probability of A conditioned on the complement B̄.

P(A) = P(A and B) + P(A and B̄)
     = P(A|B)P(B) + P(A|B̄)P(B̄).

We sometimes call this expression the total probability of A with respect to B.


Example: P(Hostel) = P(Hostel|Station)P(Station) + P(Hostel|No Station)P(No Station).
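A quick numerical sketch, again with invented illustrative numbers continuing the towns example, confirms the decomposition:

# Invented numbers for the towns illustration above.
p_station = 0.40                  # P(Station)
p_hostel_given_station = 0.25     # P(Hostel | Station)
p_hostel_given_no_station = 0.10  # P(Hostel | No Station)

# Total probability: weight each conditional by the probability of its case.
p_hostel = (p_hostel_given_station * p_station
            + p_hostel_given_no_station * (1 - p_station))
print(p_hostel)                   # 0.25*0.40 + 0.10*0.60 = 0.16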

2.4. Worked examples.

Example 5.4. From the contingency table in Example 4.11 earlier, let's define
M to be the event that an employee is male and
I to be the event that an employee works in IT. We can work out from the table the probabilities that,
when we select a random employee, that employee is:

• Male: P(M) = 470/1000 = 0.47
• Male and works in IT: P(M and I) = 50/1000 = 0.05
• Male or works in IT: P(M or I) = (470 + 130 − 50)/1000 = 0.55
• Male given that he/she works in IT: P(M|I) = 50/130 ≈ 0.38.

Notice that P(M) ≠ P(M|I), so these events are not independent.
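A minimal Python sketch (variable names are ours) reproduces all four probabilities directly from the counts in the contingency table of Example 4.11:

# Counts from the contingency table of Example 4.11.
total = 1000
male = 470              # row total: male employees
it = 130                # column total: IT employees
male_and_it = 50        # joint count: male employees in IT

p_m = male / total                              # 0.47
p_m_and_i = male_and_it / total                 # 0.05
p_m_or_i = (male + it - male_and_it) / total    # inclusion-exclusion: 0.55
p_m_given_i = male_and_it / it                  # conditional: ≈ 0.38

print(p_m, p_m_and_i, p_m_or_i, round(p_m_given_i, 2))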

3. Bayes' Theorem and Adjusting Probability in Light of New Evidence

3.1. Bayes' Theorem. Recall, to calculate the joint probability of A and B happening we can say

• P(A and B) = P(A|B)P(B), or
• P(A and B) = P(B|A)P(A).

Thus P(B|A)P(A) = P(A|B)P(B), since each is equal to P(A and B).
Divide across by P(A) to get P(B|A) = P(A|B)P(B)/P(A).
Now substitute P(A) with our earlier formula for total probability, to get:
Now substitute P (A) with our earlier formula for total probability, to get:

Theorem 5.5 (Bayes’ theorem). For any two events A and B,


P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B̄)P(B̄)].
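Because every worked example below is an instance of this one formula, a small reusable Python helper is worth writing down. This is our own sketch; the function name and parameters are our choices, not part of the notes.

def bayes_posterior(p_b, p_a_given_b, p_a_given_not_b):
    """Posterior P(B|A) from a prior P(B) and two likelihoods.

    p_b             -- prior probability P(B)
    p_a_given_b     -- P(A|B), probability of the evidence if B holds
    p_a_given_not_b -- P(A|not B), probability of the evidence if B fails
    """
    numerator = p_a_given_b * p_b
    p_a = numerator + p_a_given_not_b * (1 - p_b)  # total probability of A
    return numerator / p_a

# Sanity check: evidence that is equally likely either way leaves the
# prior unchanged.
print(bayes_posterior(0.5, 0.7, 0.7))  # 0.5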

3.1.1. How do we use Bayes' Theorem? Recall the Bayesian interpretation of probability as degree
of belief.
Bayes’ theorem is used to factor in new probability information in order to revise existing probabilities
(called “prior” probabilities: prior to or before the new evidence).
We modify our prior belief (probability) in the light of new evidence to get our posterior belief (proba-
bility). (Posterior means after using the new evidence.)
We’ll look at an example in this lecture of the use of Bayes’ theorem: the trial of John Hinckley.

3.2. Example of incorporating new evidence: John Hinckley’s trial. When John Hinckley
was on trial for his 1981 assassination attempt on US President Ronald Reagan, his defence team
wanted to use an insanity defence. The insanity defence was used in fewer than 2% of criminal trials
and was unsuccessful in three out of every four attempts. They used an expert witness in an attempt to
demonstrate that Hinckley was likely to be schizophrenic (the basis of their insanity plea).
Dr. Daniel R. Weinberger, psychiatrist for the National Institute of Mental Health, testified that:
• 30% of CAT scans on schizophrenic people show brain atrophy (shrinkage);
• Only 2% of scans on non-schizophrenics show atrophy.
Defence lawyer Gregory B. Craig wanted to introduce a CAT scan of Hinckley, which showed brain
atrophy.
So is it the case that the odds of Hinckley having schizophrenia are 30:2, that is, 15:1, which would
mean a probability of 15/(15 + 1) = 15/16 = 93.75%? Let's see . . .
Bayes' theorem can be used to test the implications of the psychiatrist's evidence.
Let A be the event that a CAT scan shows atrophy, and let S be the event that a person is schizophrenic
(so S̄ means the person is not schizophrenic).
• Atrophy is found in 30% of schizophrenics, so P(A|S) = 0.30.
• Atrophy is found in only 2% of non-schizophrenics, so P(A|S̄) = 0.02.
• It is known that 1.5% of the US population suffer from schizophrenia, so the base probability
is P(S) = 0.015 (and therefore P(S̄) = 1 − 0.015 = 0.985).
What is P(S|A)? By Bayes' theorem,

P(S|A) = P(A|S)P(S) / [P(A|S)P(S) + P(A|S̄)P(S̄)]
       = 0.30(0.015) / [0.30(0.015) + 0.02(0.985)]
       ≈ 0.186.
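As a quick arithmetic check, here is a minimal inline Python version of this calculation (our own sketch, not part of the original notes):

p_s = 0.015                 # prior: quoted US base rate of schizophrenia
p_a_given_s = 0.30          # P(A|S): atrophy rate among schizophrenics
p_a_given_not_s = 0.02      # atrophy rate among non-schizophrenics

numerator = p_a_given_s * p_s
posterior = numerator / (numerator + p_a_given_not_s * (1 - p_s))
print(round(posterior, 3))  # 0.186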

Therefore, if the statistical implications of Hinckley's CAT scan had been explained, the defence would
have been arguing that there was only an 18.6% chance that Hinckley suffered from schizophrenia.
However, Bayes' theorem was not used to make such an interpretation.
Hinckley was found not guilty by reason of insanity.
Some members of the jury indicated that they had been strongly persuaded by the evidence put forward
by the psychiatrist.
Does the Bayesian finding of 18.6% show that the psychiatric evidence was misapplied?
Some statisticians have suggested that a jury finding Hinckley to be schizophrenic could be consistent
with Bayes' theorem.¹
The 18.6% hinges on assuming that prior to introducing the CAT scan evidence, Hinckley has a standard
1.5% chance of being schizophrenic. But it could be argued that there are already grounds for suspecting
a higher than normal prior probability of schizophrenia: say 10%. What would this lead to?
A 10% prior probability belief that Hinckley is schizophrenic becomes a 63% posterior probability, after
the positive scan result is taken into account.
Exercise 5.6. As an exercise, test this finding by applying Bayes' theorem to the CAT scan
probabilities, using
• a 10% prior probability of schizophrenia, and
• a 90% prior probability of non-schizophrenia.

Show that you get a 63% posterior probability.
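After working it by hand, you can confirm your answer with the same inline arithmetic as before (a sketch of ours, reusing the testimony's likelihoods):

p_s = 0.10                  # revised 10% prior for Hinckley
numerator = 0.30 * p_s
posterior = numerator / (numerator + 0.02 * (1 - p_s))
print(round(posterior, 3))  # 0.625, i.e. roughly a 63% posterior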
¹ See Barnett, A., Greenberg, I. and Machol, R. E. (1984), “Misapplication Reviews: Hinckley and the Chemical Bath”,
Interfaces, 14(4):48–52.


The not guilty finding sparked public outrage. Following the trial, insanity defence laws were rewritten
by Congress and the defence was abolished by some states. Hinckley remains in care in a psychiatric
institution.

3.3. Bayes’ Theorem in Decision Analysis. Bayes’ Theorem can be applied to decision analysis:
it enables us to factor in the impact of evidence and new information. Prior probabilities are revised in
light of new information, to generate posterior probabilities.
It can also be used to “reverse” conditional probabilities.
For example, medical research may give the probability of observing a symptom (e.g., fever) given that
a person has a particular disease (e.g., flu), that is, P(fever|flu). But usually a doctor is presented with
a patient showing a symptom and wants the probability that the patient has the disease; that is, the
doctor wants to know P(flu|fever).
If the doctor knows the actual or base prevalence of flu in the population (that is, P(flu)), he/she can
work out the probability that a particular patient with fever has flu by using Bayes' Theorem:

P(flu|fever) = P(fever|flu)P(flu) / [P(fever|flu)P(flu) + P(fever|no flu)P(no flu)].

(Of course, P(no flu) = 1 − P(flu).)
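As a sketch only, with placeholder numbers that are deliberately invented (finding realistic values is Exercise 5.7 below), the calculation looks like this in Python:

# Placeholder values, purely illustrative -- see Exercise 5.7.
p_flu = 0.05                 # assumed base prevalence P(flu)
p_fever_given_flu = 0.90     # assumed P(fever | flu)
p_fever_given_no_flu = 0.08  # assumed P(fever | no flu)

numerator = p_fever_given_flu * p_flu
p_flu_given_fever = numerator / (
    numerator + p_fever_given_no_flu * (1 - p_flu))
print(round(p_flu_given_fever, 3))  # ≈ 0.372 with these made-up inputs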
Exercise 5.7. Try this yourself: research online to find reasonable values for P(fever|flu), P(fever|no flu)
and P(flu).

A similar application to this medical one arises in drug testing of athletes.


Exercise 5.8. Suppose that a certain test for drugs can only give either a “positive” or a “negative”
result. From earlier experiments on drug testing, scientists have found that

• if the tested person uses drugs, there is a 0.92 probability that the test will be “positive.”
• if the tested person does not use drugs, there is a 0.96 probability that the test will be “negative.”

Suppose that 10% of the competitors to be tested are in fact drug users.

(a) Work out the probability that a randomly selected competitor will test positive.
(b) Suppose that a randomly selected competitor tests positive.
Work out the probability that he or she actually uses drugs.
(c) Suppose that a randomly selected competitor tests negative.
Work out the probability that he or she actually uses drugs.

The numbers given here are actually realistic. Do the answers surprise you? Consider the potential
advantages and disadvantages of bringing in a mandatory drug testing policy using this particular test.
Example 5.9. A data processing company has 3 locations, A1, A2 and A3, which are responsible for
50%, 30% and 20% of the data processing respectively. Past records reveal that the probability of having
an error with data processed by:

• A1 is 0.01
• A2 is 0.05
• A3 is 0.20

Calculate:

(1) the probability of an error,
(2) the probability that the error is the responsibility of A3, given that an error has been found.

Solution Approach:

Figure 5.3. Partitioning the sample space of all pieces of data processing work into
those processed by A1, A2 and A3. The errors are the event B (in red).

• First, let B represent the event that an error has occurred. We can partition the sample space
of all pieces of data processing work as in Figure 5.3. Note that these areas are not proportional
to the probabilities given.
• Next, label the information given:
– Conditional probabilities P(B|A1) = 0.01, P(B|A2) = 0.05 and P(B|A3) = 0.20;
– Priors P(A1) = 0.5, P(A2) = 0.3 and P(A3) = 0.2.
Then the solutions are as follows:
(1) Probability of an error is just the simple probability that an error has occurred, P(B). Sum up
the ways the errors can occur: we can get an error from A1, A2 or A3:

P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + P(B|A3)P(A3)
     = 0.01 × 0.5 + 0.05 × 0.3 + 0.20 × 0.2
     = 0.06 = 6%.
(2) To find the probability that the error is the responsibility of A3, given an error has occurred,
we use Bayes' Theorem to “flip” the conditional probabilities:
• By Bayes' Theorem,

P(A3|B) = P(B|A3)P(A3) / P(B).

• Recall from (1) above that P(B) = 0.06 = 6%, the simple probability of an error.
• What proportion of the errors is attributable to A3? That is, what is P(A3|B)?

P(A3|B) = (0.20 × 0.2) / 0.06 ≈ 66.67%.
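Finally, a short Python sketch of ours ties the whole example together, computing the total error probability and then flipping the conditional with Bayes' Theorem:

priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}          # share of the workload
error_rates = {"A1": 0.01, "A2": 0.05, "A3": 0.20}  # P(error | location)

# (1) Total probability of an error, summing over the three locations.
p_error = sum(priors[loc] * error_rates[loc] for loc in priors)
print(p_error)                                      # 0.06

# (2) Bayes' Theorem: P(A3|B) = P(B|A3)P(A3) / P(B).
p_a3_given_error = error_rates["A3"] * priors["A3"] / p_error
print(round(p_a3_given_error, 4))                   # 0.6667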
