
Lab 4 - Conditional Probability and Bayes' Rule

In this week's laboratory exercise, we will work through the calculation of conditional probability using an example, and then work through an example of Bayes' rule. These two topics are important tools for answering common statistical questions.

Once again, you can learn more about the language on python.org and Python for Beginners.

This lab exercise is adapted from Blitzstein and Hwang's Introduction To Probability, Second Edition for Conditional Probability and from Tirthajyoti Sarkar's Bayes' rule with a simple and practical example. Because all the code is inside code cells, you can simply run each code cell inline rather than using a separate Python interactive window.

This Jupyter notebook requires Python 3.

Note: This notebook is designed to have you run code cells one by one, and several code cells contain deliberate errors for demonstration
purposes. As a result, if you use the Cell > Run All command, some code cells past the error won't be run. To resume running the code in
each case, use Cell > Run All Below from the cell after the error.

Comments
Many of the examples in this lab exercise include comments. Comments in Python start with the hash character, #, and extend to the end of the physical line. A comment may appear at the start of a line or following whitespace or code, but not within a string literal. A hash character within a string literal is just a hash character. Since comments are intended to clarify code and are not interpreted by Python, they may be omitted when typing in examples.
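
For illustration, here is a minimal sketch (not part of the original lab code) showing where comments may appear:

# This entire line is a comment
x = 42  # a comment may also follow code on the same line
s = "# a hash inside a string literal is not a comment"
print(s)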

1. Conditional Probability
One way to understand conditional probability, often called the frequentist interpretation, is to view it in terms of a large number n of repetitions of an experiment.

For instance, P(A|B) ≈ n_AB / n_B, where n_AB is the number of times that A ∩ B occurs and n_B is the number of times that B occurs.

Let's try this out with a simulation. In the Python code, we will use numpy.random.choice to simulate n families, each with two children.

In [1]:
import numpy as np

np.random.seed(34)

n = 10**5
child1 = np.random.choice([1,2], n, replace=True)
child2 = np.random.choice([1,2], n, replace=True)

print('child1:\n{}\n'.format(child1))

print('child2:\n{}\n'.format(child2))

child1:
[2 1 1 ... 1 2 1]

child2:
[2 2 2 ... 2 2 1]

Here child1 is a NumPy array of length n, where each element is a 1 or a 2. Letting 1 stand for "girl" and 2 stand for "boy", this array represents the gender of the elder child in each of the n families. Similarly, child2 represents the gender of the younger child in each family.

Alternatively, we could have used

In [2]:
np.random.choice(["girl", "boy"], n, replace=True)

Out[2]: array(['boy', 'boy', 'boy', ..., 'boy', 'boy', 'boy'], dtype='<U4')

but it is more convenient to work with numerical values.

Let A be the event that both children are girls and B the event that the elder is a girl. Following the frequentist interpretation, we count the number of repetitions where B occurred and name it n_b, and we also count the number of repetitions where A ∩ B occurred and name it n_ab. Finally, we divide n_ab by n_b to approximate P(A|B).

In [3]:
n_b = np.sum(child1==1)
n_ab = np.sum((child1==1) & (child2==1))

print('P(both girls | elder is girl) = {:0.2F}'.format(n_ab / n_b))

P(both girls | elder is girl) = 0.50


The ampersand & is an elementwise AND, so n_ab is the number of families where both the first child and the second child are girls. When we ran this code, we got 0.50, consistent with the fact that if each child is equally likely to be a girl or a boy at birth, then P(both girls | elder is a girl) = 1/2.

You can verify this by running the Python code below.

In [4]:
n_b = np.sum(child1==2)
n_ab = np.sum((child1==2) & (child2==2))

print('P(both boys | elder is boy) = {:0.2F}'.format(n_ab / n_b))

P(both boys | elder is boy) = 0.50


Returning to our original discussion, now let A be the event that both children are girls and B the event that at least one of the children is a girl. Then A ∩ B is the same as before, but n_b needs to count the number of families where at least one child is a girl. This is accomplished with the elementwise OR operator | (this is not a conditioning bar; it is an inclusive OR, returning True if at least one element is True).

In [5]:
n_b = np.sum((child1==1) | (child2==1))
n_ab = np.sum((child1==1) & (child2==1))

print('P(both girls | at least one girl) = {:0.2F}'.format(n_ab / n_b))

P(both girls | at least one girl) = 0.33

Test your understanding


Can you explain why the conditional probability decreased when the event B changed to at least one of the children being a girl?

Can you think of other definitions of event B? You can change how n_b is calculated to see how the conditional probability changes.

For the first case, both girls given that the elder is a girl: the younger can be either a boy or a girl, giving a probability of 50%, i.e., GG/(GG or GB).

For the second case, both boys given that the elder is a boy: again the younger can be either a girl or a boy, so the probability of both boys given that the elder is a boy is likewise 50%, i.e., BB/(BB or BG).

For the third case, at least one girl occurs in 3 equally likely outcomes while both girls occurs in only 1 of them, so the probability is 1/3, or about 33%, i.e., GG/(GG or GB or BG).

In [8]:
n_b = np.sum((child1==1) | (child2==1))
n_ab = np.sum(((child1==2) & (child2==1)) | ((child1==1) & (child2==2)))

print('P(exactly one girl | at least one girl) = {:0.2F}'.format(n_ab / n_b))

P(exactly one girl | at least one girl) = 0.67

Above is the case where we have exactly one girl given that there is at least one girl. The probability here is 67%, since we have (GB or BG)/(GG or GB or BG).
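
As one more variant (not part of the original lab), here is a minimal sketch that conditions on the younger child being a girl; by symmetry with the elder-child case, the result should be close to 0.50:

n_b = np.sum(child2==1)
n_ab = np.sum((child1==1) & (child2==1))

print('P(both girls | younger is girl) = {:0.2F}'.format(n_ab / n_b))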

2. Bayes' Rule
Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) has been called the most powerful rule of probability and statistics. It describes the
probability of an event, based on prior knowledge of conditions that might be related to the event.

P(A|B) = P(B|A) P(A) / P(B)

where

P(A) = the probability of A being true. This is the prior (our existing knowledge).
P(B|A) = the probability of B being true, given A is true. This is the likelihood.
P(B) = the probability of B being true. This is the marginalization (the evidence).
P(A|B) = the probability of A being true, given B is true. This is the posterior.

Drug Screening Example


Suppose that a test for using a particular drug is 97% sensitive and 95% specific. That is, the test will produce 97% true positive results for drug users and 95% true negative results for non-drug users. These are figures that any screening test accumulates from its history of use. Bayes' rule allows us to combine this kind of data-driven knowledge with a prior to calculate the final probability.

Suppose we also know that 0.5% of the general population are users of the drug. What is the probability that a randomly selected
individual with a positive test is a drug user?

Note that this is the crucial 'prior': a piece of generalized knowledge about the prevalence rate in the population. It represents our prior belief about the probability that a random test subject is a drug user. That means that if we choose a random person from the general population, without any testing, we can only say there is a 0.5% chance of that person being a drug user.

How do we use Bayes' rule in this situation?

Here is the formula for computing the probability that a randomly selected individual with a positive test is a drug user, as per Bayes' rule:

P(User|+) = P(+|User) P(User) / P(+) = P(+|User) P(User) / (P(+|User) P(User) + P(+|Non-user) P(Non-user))

where

P(User) = prevalence rate
P(Non-user) = 1 - prevalence rate
P(+|User) = sensitivity
P(-|Non-user) = specificity
P(+|Non-user) = 1 - specificity

In [11]:
# Python code for calculating the probability that a randomly selected individual
# with a positive test is a drug user

def drug_user(
        prob_th=0.5,
        sensitivity=0.97,
        specificity=0.95,
        prevalence=0.005,
        verbose=True):
    """
    Computes the posterior P(User|+) using Bayes' rule
    """
    p_user = prevalence
    p_non_user = 1 - prevalence
    p_pos_user = sensitivity          # P(+|User)
    p_pos_non_user = 1 - specificity  # P(+|Non-user)

    # Bayes' rule: P(User|+) = P(+|User)P(User) / P(+)
    num = p_pos_user * p_user
    den = p_pos_user * p_user + p_pos_non_user * p_non_user

    prob = num / den
    print(prob)

    if verbose:
        if prob > prob_th:
            print("The test-taker could be a drug user")
        else:
            print("The test-taker may not be a drug user")

    return prob

if __name__ == '__main__':
    drug_user()

0.08882783882783876
The test-taker may not be a drug user

What is fascinating here?


Even with a test that is 97% correct for catching positive cases, and 95% correct for rejecting negative cases, the true probability of being a
drug-user with a positive result is only 8.9%!

If you look at the computations, this is because of the extremely low prevalence rate. The number of false positives outweighs the number
of true positives.

For example, if 1000 individuals are tested, we expect 995 non-users and 5 users. From the 995 non-users, 0.05 × 995 ≈ 50 false positives are expected. From the 5 users, 0.97 × 5 ≈ 5 true positives are expected. So out of roughly 55 positive results, only 5 are genuine!
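
Here is a quick sketch (not part of the original lab) that checks this arithmetic:

# Expected counts among 1000 tested individuals (illustrative check)
n_tested = 1000
prevalence, sensitivity, specificity = 0.005, 0.97, 0.95

users = n_tested * prevalence              # 5 expected users
non_users = n_tested - users               # 995 expected non-users

true_pos = sensitivity * users             # about 4.85 true positives
false_pos = (1 - specificity) * non_users  # about 49.75 false positives

print('true positives  = {:0.2f}'.format(true_pos))
print('false positives = {:0.2f}'.format(false_pos))
print('P(user | +) = {:0.3f}'.format(true_pos / (true_pos + false_pos)))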

Experimentation
You can adjust the values of the arguments passed to the drug_user() function and observe how the probability changes. For example, the cell below re-runs the calculation with a more accurate test.

In [12]:
# Re-run the calculation with a more accurate test: 99% sensitivity and 99% specificity.
# Assigning the result suppresses the extra Out[] display; drug_user() prints the posterior itself.
prob = drug_user(sensitivity=0.99, specificity=0.99)

0.33221476510067094
The test-taker may not be a drug user
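
Going one step further, here is a minimal sketch (not part of the original lab) that sweeps the prevalence while keeping the default test accuracy; drug_user() prints each computed posterior:

# Sweep the prevalence to see how strongly the prior drives the posterior
for prev in [0.005, 0.02, 0.1, 0.5]:
    print('prevalence =', prev)
    _ = drug_user(prevalence=prev, verbose=False)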
