
Introduction to Data Science

Need for Bayes theorem – an example:

A habitual drunk drinks 9 out of 10 days.
When he is drinking, he goes to one of 3 bars, which he picks at random.
Today, social workers are looking for him.
The social workers checked the first two bars, but he wasn't there.
What is the probability that he will be in the 3rd bar?

Is it 90%?
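It is not. A minimal sketch of the answer in Python, using only the numbers given above (the rule it applies is derived on the following slides):

```python
# Priors from the setup: he drinks on 9 of 10 days, and when drinking
# he picks one of the 3 bars uniformly at random.
p_each_bar = 0.9 / 3   # p(he is in any one specific bar) = 0.3
p_sober = 0.1          # p(he is not drinking at all today)

# Evidence: he was not found in bars 1 and 2. The remaining,
# mutually exclusive possibilities are bar 3 or staying sober:
p_evidence = p_each_bar + p_sober   # p(not in bars 1 and 2) = 0.4

# Posterior: probability he is in bar 3, given he wasn't in 1 or 2.
print(p_each_bar / p_evidence)      # ~0.75, not 0.9
```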
Bayes theorem (or rule)
• Allows us to combine, update, and invert conditional probabilities.
• It has a wide range of applications, including the updating of beliefs.
• Philosophy: New evidence (data/observations) doesn't speak for itself. It updates prior beliefs.
• It provides an optimal way to integrate current and prior information.

Thomas Bayes (1701-1761)
Potential for confusion
• Simple probabilities are easy to invert.
• Particularly if the probability space is binary and the events are mutually exclusive.
• You don't need Bayes' theorem for that: p(¬A) = 1 − p(A).
• But this is not the case for conditional probabilities.

[Diagram: two mutually exclusive events A and B]
Revisiting conditional probability

[Venn diagram: overlapping sets A and B within a universe U]

As the (A ∩ B) slice grows as a proportion of B,
being in B becomes more informative about A:

p(A | B) = p(A ∩ B) / p(B)
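A tiny numeric illustration of this definition, with made-up probabilities (the values are purely illustrative, not from the lecture):

```python
# Hypothetical numbers: suppose p(A ∩ B) = 0.2 and p(B) = 0.4.
p_a_and_b = 0.2
p_b = 0.4

# Given that we are in B, half of that region also lies in A:
print(p_a_and_b / p_b)  # p(A|B) = 0.5
```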
Most real-life situations are asymmetric:
Remember Herpes simplex and Alzheimer's?

[Diagram: condition A vs. test outcome + / −, with marginals p(A), p(+), and p(−)]

p(+ | A) = p(+ ∩ A) / p(A)        p(A | +) = p(+ ∩ A) / p(+)

p(− | A) = p(− ∩ A) / p(A)        p(A | −) = p(− ∩ A) / p(−)
Deriving Bayes rule

[Venn diagrams: A ∩ B shown as a slice of B, and as a slice of A]

p(A | B) = p(A ∩ B) / p(B)        p(B | A) = p(A ∩ B) / p(A)

p(A ∩ B) = p(B) ∗ p(A | B)        p(A ∩ B) = p(A) ∗ p(B | A)

p(B) ∗ p(A | B) = p(A) ∗ p(B | A)

p(A | B) = p(A) ∗ p(B | A) / p(B)
Some necessary terminology

p(A | B) = p(A) ∗ p(B | A) / p(B)

Posterior: p(A | B)
Likelihood: p(B | A)
Priors: p(A) and p(B)
What can we do with this?

• Build an optimal spam filter, for instance:

• 1 in 100 messages is spam.
• All spam messages contain certain keywords.
• 1 in 10 messages contains these keywords.
• Would filtering messages with these keywords and flagging them as spam be a good approach, or would we expect too many false positives?
• p(K | S) = 1
• p(S | K) = ?
Casting this in terms of a Bayesian framework:

p(A | B) = p(A) ∗ p(B | A) / p(B)    →    p(S | K) = p(S) ∗ p(K | S) / p(K)

• We know p(K|S) = 1
• We know p(S) = 0.01
• We know p(K) = 0.1
• p(S|K) = (0.01 ∗ 1)/0.1 = 0.1
• Reasonable to filter by these keywords?
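A minimal sketch of this computation in Python; the function name posterior_simple and its argument names are my own, not from the slides:

```python
def posterior_simple(prior, likelihood, evidence):
    """Simple form of Bayes' rule: p(A|B) = p(A) * p(B|A) / p(B)."""
    return prior * likelihood / evidence

# Spam filter: p(S) = 0.01, p(K|S) = 1, p(K) = 0.1
print(posterior_simple(prior=0.01, likelihood=1.0, evidence=0.1))
# ~0.1 -- only 10% of flagged messages would actually be spam
```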
The specific value of the priors matters:

• Say spam is becoming much more common because it is hard to filter out: p(S) = 0.05.
• Now? p(S | K) = p(S) ∗ p(K | S) / p(K)
• p(S | K) = 0.05 ∗ 1 / 0.1 = 0.5
• Say use of these keywords is going out of style outside of spam messages, because of the association of these words with spam: p(K) = 0.0625.
• p(S | K) = 0.05 ∗ 1 / 0.0625 = 0.8
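A short sketch of how the posterior moves with these priors, reusing the hypothetical posterior_simple helper from above (redefined so the snippet stands alone):

```python
def posterior_simple(prior, likelihood, evidence):
    return prior * likelihood / evidence

# p(K|S) = 1 throughout; only the priors p(S) and p(K) change.
for p_s, p_k in [(0.01, 0.1), (0.05, 0.1), (0.05, 0.0625)]:
    print(f"p(S)={p_s}, p(K)={p_k} -> p(S|K)={posterior_simple(p_s, 1.0, p_k):.2f}")
# p(S)=0.01, p(K)=0.1    -> p(S|K)=0.10
# p(S)=0.05, p(K)=0.1    -> p(S|K)=0.50
# p(S)=0.05, p(K)=0.0625 -> p(S|K)=0.80
```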
Why are Bayesian approaches so useful in (data) science?
• We would ideally want to know the probability of a hypothesis (which we use to explain the world), given the data we observed.
• Null hypothesis significance testing gives us the probability of the data given the (null) hypothesis.
• This is not what we want.
• Maybe we can use Bayes to invert this situation:

p(A | B) = p(A) ∗ p(B | A) / p(B)    →    p(H | D) = p(H) ∗ p(D | H) / p(D)
There are several forms of Bayes theorem

• The simple form:

p(A | B) = p(A) ∗ p(B | A) / p(B)

• But what if you don't have p(B)?
• We usually don't have p(B).
• That's where the "explicit form" of Bayes theorem comes in.
• In practice, these two forms are the ones you will most likely use.
Deriving the explicit form of Bayes theorem

[Venn diagram: sampling space S partitioned into A ∩ ¬B, A ∩ B, and B ∩ ¬A]

A = (A ∩ B) ∪ (A ∩ ¬B), so p(A) = p(A ∩ B) + p(A ∩ ¬B)

B = (A ∩ B) ∪ (B ∩ ¬A), so p(B) = p(A ∩ B) + p(B ∩ ¬A)
Finding a way to express p(B) in known terms

• p(B)?
• B = (A ∩ B) ∪ (B ∩ ¬A), and the two parts are disjoint, so:
• p(B) = p(A ∩ B) + p(B ∩ ¬A)
• From the definition of conditional probability: p(B ∩ A) = p(B | A) ∗ p(A) and p(B ∩ ¬A) = p(B | ¬A) ∗ p(¬A)
• p(B) = p(B | A) ∗ p(A) + p(B | ¬A) ∗ p(¬A)
Plugging it back into the simple form

• Simple form: p(A | B) = p(A) ∗ p(B | A) / p(B)

• Plugging what we just derived in for p(B):

p(B) = p(B | A) ∗ p(A) + p(B | ¬A) ∗ p(¬A)

p(A | B) = p(A) ∗ p(B | A) / (p(B | A) ∗ p(A) + p(B | ¬A) ∗ p(¬A))
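A minimal Python sketch of this explicit form; the function name posterior_explicit and its argument names are my own, not from the slides. The denominator implements the law of total probability derived above:

```python
def posterior_explicit(prior, likelihood, likelihood_not):
    """Explicit form: p(A|B) = p(A)*p(B|A) / (p(A)*p(B|A) + p(~A)*p(B|~A)).
    The denominator is p(B), rebuilt via the law of total probability."""
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence
```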
What can we do with the explicit form?
This form is often used in diagnostics:

The prevalence of breast cancer in the population is 1%. A mammogram detects breast cancer, when present, 99% of the time. If no breast cancer is present, the false-positive rate is 10%.
A person's mammogram comes back positive. What is the probability that this person has breast cancer?

This kind of inference has been well explored, e.g. Hoffrage et al. (2000), Science.
Using the explicit form of Bayes theorem:

p(A | B) = p(A) ∗ p(B | A) / (p(A) ∗ p(B | A) + p(¬A) ∗ p(B | ¬A))

• p(A) = 0.01 (prior probability)
• p(B|A) = 0.99 (detection rate)
• p(B|~A) = 0.1 (false positive rate)
• p(~A) = 0.99 (via 1 – p(A), per mutual exclusion)
• p(A|B) = ? (posterior probability)
• p(A|B) = (0.01*0.99)/((0.01*0.99)+(0.99*0.1))
• p(A|B) = 0.091
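The same computation as a self-contained Python sketch, using the hypothetical posterior_explicit helper from above:

```python
def posterior_explicit(prior, likelihood, likelihood_not):
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence

# Prevalence 1%, detection rate 99%, false positive rate 10%:
print(round(posterior_explicit(0.01, 0.99, 0.10), 3))  # 0.091
```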
The likelihood differential matters critically
• The disappointing diagnostic power of this test stems from the fact that the ratio between the probability of a positive test result given disease versus no disease is only ~10:1, in combination with a low prevalence (~1 in 100).
• If prevalence is low (detecting low-incidence events like terrorism, extreme ability, or rare diseases), this differential must be much larger in order to raise the posterior meaningfully from the low prior.
• For instance, if the false positive rate were 100 times lower, at 0.1%, with all other numbers staying the same:

If the false positive rate were 0.1%:
p(A | B) = p(A) ∗ p(B | A) / (p(A) ∗ p(B | A) + p(¬A) ∗ p(B | ¬A))

• p(A) = 0.01 (prior probability)
• p(B|A) = 0.99 (detection rate)
• p(B|~A) = 0.001 (false positive rate)
• p(~A) = 0.99 (via 1 – p(A), per mutual exclusion)
• p(A|B) = ? (posterior probability)
• p(A|B) = (0.01*0.99)/((0.01*0.99)+(0.99*0.001))
• p(A|B) = 0.91
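A sketch contrasting the two false positive rates with the hypothetical helper from above, showing the effect of the likelihood differential:

```python
def posterior_explicit(prior, likelihood, likelihood_not):
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence

for fpr in (0.10, 0.001):  # false positive rate: 10% vs. 0.1%
    print(fpr, round(posterior_explicit(0.01, 0.99, fpr), 3))
# 0.1   -> 0.091 (likelihood ratio ~10:1)
# 0.001 -> 0.909 (likelihood ratio ~1000:1)
```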
A Bayesian framework makes intuitive sense:
• "Extraordinary claims require extraordinary evidence" – p < 0.05 won't do it.
• Extraordinary claims = low prior probability of the hypothesis being true.
• Disagreements between people: same evidence, but arriving at different conclusions because prior beliefs are different.
• Ideology: If prior beliefs are close to 0 or 1, it is very hard for evidence to change those beliefs.
• If they are exactly 0 or 1, it is impossible (see the sketch below).
• There are many applications of this framework.
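A minimal numeric illustration of that last point, using the explicit form; variable names are my own:

```python
def posterior_explicit(prior, likelihood, likelihood_not):
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence

# Strong evidence (likelihood ratio 1000:1) cannot move a prior of 0 or 1:
for prior in (0.0, 0.5, 1.0):
    print(prior, posterior_explicit(prior, 0.999, 0.001))
# 0.0 -> 0.0   (belief stays at 0)
# 0.5 -> 0.999
# 1.0 -> 1.0   (belief stays at 1)
```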
