Frontiers of Computational Journalism
Columbia Journalism School
Week 5: Quantification and Probability

October 6, 2017
This class

• Quantification and Data Quality
• What is statistics for?
• Models made of Counting
• Interpretation Generally
Quantification and Data Quality
Definition of data?
My definition of data:

A collection of similar pieces of recorded information
Structured data
Unstructured data

$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}$$
Different types of counting
• Numeric
o Continuous or discrete
o Units of measurement?
o Non-linear scales?

• Categorical
o finite, e.g. {true, false}
o infinite, e.g. {red, yellow, blue, ... chartreuse ...}
o ordered?
Choices about what to count
GDP = C + I + G + (X - M)
1940 U.S. census enumerator
2010 U.S. census race and
ethnicity questions
Some things that are tricky to quantify,
but usefully quantified anyway
• Intelligence
• Academic performance
• Race, ethnicity, nationality, gender
• Number of incidents of some type
• Income
• Political Ideology
Intentional or unintentional problems
It looks like Lucknow and Kanpur have few traffic accidents, but
deaths data suggests that accidents are not being counted.
Evaluating Data Quality
Internal validity: check the data against itself
• row counts (e.g. all 50 states?)
• related data
• histograms
• do the numbers add up?

External validity: compare the data to something else.
• alternate data sources
• expert knowledge
• previous versions
• common sense!
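A few of the internal checks above (row counts, duplicated keys, whether the numbers add up) can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the field names (`state`, `total`, `urban`, `rural`) and the helper functions are made up for the example and are not from any real dataset.

```python
from collections import Counter

# ASSUMPTION: made-up records, one row per state, with a total and its parts.
rows = [
    {"state": "NY", "total": 100, "urban": 80, "rural": 20},
    {"state": "CA", "total": 150, "urban": 120, "rural": 30},
    {"state": "TX", "total": 90, "urban": 50, "rural": 45},   # parts do not add up
    {"state": "NY", "total": 100, "urban": 80, "rural": 20},  # repeated state
]

EXPECTED_ROWS = 50  # e.g. all 50 states


def check_row_count(rows, expected):
    """Return True if the table has the expected number of rows."""
    return len(rows) == expected


def duplicated_keys(rows, key):
    """Return the key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def rows_not_adding_up(rows, total, parts):
    """Return the rows whose parts do not sum to the stated total."""
    return [row for row in rows if sum(row[p] for p in parts) != row[total]]


print(check_row_count(rows, EXPECTED_ROWS))        # False: only 4 rows
print(duplicated_keys(rows, "state"))              # ['NY']
bad = rows_not_adding_up(rows, "total", ["urban", "rural"])
print([row["state"] for row in bad])               # ['TX']
```

Checks like these only flag rows to look at; deciding whether a flagged row is an error still takes reporting.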
Interview the Data

• Who created this data?
• What is this data supposed to count?
• How was this data actually collected?
• Does it really count what it’s supposed to?
• For what purpose was this data collected?
• How do we know it is complete?
• If the data was collected from people, who was
asked and how?
Interview the Data

• Who is going to look bad or lose money because of
this data?
• Is the data consistent with other sources?
• Is the data consistent from day to day, or when
collected by different people?
• Who has already analyzed it?
• Are there multiple versions?
• Does this data have known problems?
What is statistics for?
Models made of Counting
The Simplest Model:
Counting One Thing
P(Yellow) = 0.6

[Figure: diagram labeled Blue / Yellow]
The Second Simplest model:
Counting Two Things

Pr(Accident) = 0.15

[Figure: diagram labeled Blue / Yellow and Accident / No Accident]
Deadly Force in Black and White, ProPublica 10/10/2014
Relative risk (risk ratio)
AP Clinton Foundation Story
“At least 85 of 154 people from private interests who met or had phone
conversations scheduled with Clinton while she led the State
Department donated to her family charity or pledged commitments to
its international programs, according to a review of State Department

AP Clinton Foundation Story


Not enough information to compute the odds ratio...
which you can tell immediately because four values are required.

P(Accident|Blue) = 0.1

[Figure: diagram labeled Blue / Yellow and Accident / No Accident]
Conditional Probability

Pr(B|A) = Pr(AB)/Pr(A)
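
As a rough worked example of the definition, the sketch below plugs in the taxi figures quoted earlier in this deck: P(Yellow) = 0.6, Pr(Accident) = 0.15 and P(Accident|Blue) = 0.1. It assumes those three numbers describe the same diagram and that every cab is either Blue or Yellow; the variable names are chosen for the example. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

# Figures quoted on the slides (ASSUMPTION: they describe one diagram).
p_yellow = Fraction(6, 10)               # P(Yellow) = 0.6
p_accident = Fraction(15, 100)           # Pr(Accident) = 0.15
p_accident_given_blue = Fraction(1, 10)  # P(Accident|Blue) = 0.1

# ASSUMPTION: every cab is either Blue or Yellow.
p_blue = 1 - p_yellow

# Rearranging Pr(B|A) = Pr(AB)/Pr(A) gives Pr(AB) = Pr(B|A) * Pr(A).
p_accident_and_blue = p_accident_given_blue * p_blue

# Conditioning the other way round reuses the same joint probability.
p_blue_given_accident = p_accident_and_blue / p_accident

print(p_accident_and_blue)    # 1/25, i.e. 0.04
print(p_blue_given_accident)  # 4/15, roughly 0.27
```

Note that P(Accident|Blue) and P(Blue|Accident) come out as different numbers even though both are built from the same joint probability.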
Relative risk as conditional probability

N = a+b+c+d
N(disease) = a+c
N(no disease) = b+d

Pr(disease) = (a+c) / (a+b+c+d)
Pr(disease|smoker) = a / (a+b)
Pr(disease|non-smoker) = c / (c+d)

RR = Pr(disease|smoker)/Pr(disease|non-smoker) = (a/(a+b)) / (c/(c+d))
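
The relative risk formula can be sketched as a small Python function. The cell layout (a, b, c, d) follows the definitions above; the function name and the example counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def relative_risk(a, b, c, d):
    """Risk ratio from a 2x2 table of counts.

    a: smokers with disease        b: smokers without disease
    c: non-smokers with disease    d: non-smokers without disease
    """
    risk_smoker = a / (a + b)       # Pr(disease|smoker)
    risk_non_smoker = c / (c + d)   # Pr(disease|non-smoker)
    return risk_smoker / risk_non_smoker


# ASSUMPTION: made-up counts, not data from any study.
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
print(round(rr, 2))  # 3.0: smokers have about three times the risk
```

All four cells are needed: without b and d there is no denominator for either conditional probability.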
Predicting Cats
Predicting Recidivism

How We Analyzed the COMPAS Recidivism Algorithm,
ProPublica, 2016
Confusion Matrix
ROC curve (for adjustable thresholds)
Base Rates - Taxi Accidents
Imagine you live in a city where 15% of all rides end in
an accident, and last year there were

- 75 accidents involving yellow cabs
- 25 accidents involving blue cabs

Which taxi company is more dangerous?
Base rate
We know

P(accident) = 0.15
P(blue|accident) = 0.25
P(yellow|accident) = 0.75

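We do not know the “base rate”, the share of all rides given by each company:

P(blue), P(yellow)

or equivalently the number of rides each company gave.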
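
To see why the missing base rate matters, here is a minimal sketch that tries two made-up values for the share of rides given by yellow cabs. It takes the accident counts from the slide (75 yellow and 25 blue out of 100 accidents, so yellow's share of accidents is 0.75) and P(accident) = 0.15. The function name `accident_rate` and the two trial base rates are assumptions for illustration only.

```python
def accident_rate(share_of_accidents, p_accident, share_of_rides):
    """Bayes' theorem:
    P(accident|company) = P(company|accident) * P(accident) / P(company)."""
    return share_of_accidents * p_accident / share_of_rides


P_ACCIDENT = 0.15               # from the slide
P_YELLOW_GIVEN_ACCIDENT = 0.75  # 75 of 100 accidents involved yellow cabs
P_BLUE_GIVEN_ACCIDENT = 0.25    # 25 of 100 accidents involved blue cabs

# ASSUMPTION: two hypothetical base rates for P(yellow); the slide gives none.
for p_yellow in (0.5, 0.9):
    p_blue = 1 - p_yellow
    yellow = accident_rate(P_YELLOW_GIVEN_ACCIDENT, P_ACCIDENT, p_yellow)
    blue = accident_rate(P_BLUE_GIVEN_ACCIDENT, P_ACCIDENT, p_blue)
    worse = "yellow" if yellow > blue else "blue"
    print(f"P(yellow)={p_yellow}: P(accident|yellow)={yellow:.3f}, "
          f"P(accident|blue)={blue:.3f}, more dangerous: {worse}")
```

With an even split of rides, yellow looks more dangerous; if yellow gives 90% of the rides, blue does. The accident counts alone cannot settle the question.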

Evidence and Conditional Probability

Hypothesis H = Alice has a cold
Evidence E = we just saw her cough
Alice is coughing. Does she have a cold?

Most people with colds cough

P(coughing|cold) = 0.9
P(A|B) ≠ P(B|A)
Most people with colds cough

P(coughing|cold) = 0.9

but we want

P(cold | coughing)
Bayes’ Theorem

Tells us how to go from Pr(A|B) to Pr(B|A)

Pr(B|A) = Pr(A|B)Pr(B) / Pr(A)
Alice is coughing. Does she have a cold?
Prior P(H) = 0.05 (5% of our friends have a cold)
Likelihood P(E|H) = 0.9 (most people with colds cough)
Base rate P(E) = 0.1 (10% of everyone coughs today)

P(H|E) = P(E|H)P(H)/P(E)
= 0.9 * 0.05 / 0.1
= 0.45

If you believe your initial probability estimates, you should now
believe there's a 45% chance she has a cold.
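
The same arithmetic as a short Python sketch, using the three estimates from this slide. The function name `posterior` and its parameter names are chosen for the example.

```python
def posterior(prior, likelihood, base_rate):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / base_rate


p_cold_given_cough = posterior(
    prior=0.05,       # P(H): 5% of our friends have a cold
    likelihood=0.9,   # P(E|H): most people with colds cough
    base_rate=0.1,    # P(E): 10% of everyone coughs today
)
print(round(p_cold_given_cough, 2))  # 0.45
```

Changing any one input changes the answer: if only 1% of friends had a cold, the same cough would give a posterior of about 9%.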
Information that justifies a belief.

Presented with evidence E for X, we should believe X "more."

In terms of probability, P(X|E) > P(X)
Bayes “learns” from evidence
Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)

Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Posterior Pr(H|E): How likely is H given evidence E?
Likelihood Pr(E|H): Probability of seeing E if H is true
Base rate Pr(E): How commonly do we see E at all?
Prior Pr(H): How likely was H to begin with?
Bayes’ Theorem - Diagnostic tests
Suppose I tell you:

• 14 of 1000 women under 50 have breast cancer
• If a woman has cancer, a mammogram is
positive 75% of the time
• If a woman does not have cancer, a
mammogram is positive 10% of the time

If a woman has a positive mammogram, how likely is
she to have cancer?
The Signal and the Noise, Nate Silver

[Figure: diagram labeled cancer / no cancer and positive / negative]

Pr(positive|cancer) = 0.75

= N(positive & cancer) / N(cancer)

N(cancer) = 4
N(positive & cancer) = 3

Pr(positive|no cancer) = 0.1

= N(positive & no cancer) / N(no cancer)

N(no cancer) = 1000
N(positive & no cancer) = 100

Pr(cancer) = 0.014
= N(cancer) / N
Conditional probabilities

Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%

What is Pr(cancer|positive)?

= 9.6%
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)

Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014

Pr(positive) = Pr(positive|no cancer)Pr(no cancer) +
               Pr(positive|cancer)Pr(cancer)
= 0.10*0.986 + 0.75*0.014
= 0.1091
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)

= (0.75 * 0.014) / (0.1091)

= 0.0962

= 9.6% chance she has cancer
if mammogram is positive
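
The mammogram calculation can be checked with a short Python sketch. It uses only the three numbers given on the slides; the function name `total_probability` and the variable names are chosen for the example. The second half repeats the calculation by counting women out of 1000, which is often easier to explain to readers.

```python
def total_probability(p_e_given_h, p_h, p_e_given_not_h):
    """P(E) = P(E|H) P(H) + P(E|not H) P(not H)."""
    return p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)


p_cancer = 14 / 1000           # 14 of 1000 women under 50
p_pos_given_cancer = 0.75      # positive 75% of the time with cancer
p_pos_given_no_cancer = 0.10   # positive 10% of the time without cancer

p_positive = total_probability(p_pos_given_cancer, p_cancer, p_pos_given_no_cancer)
p_cancer_given_positive = p_pos_given_cancer * p_cancer / p_positive

print(round(p_positive, 4))               # 0.1091
print(round(p_cancer_given_positive, 4))  # 0.0962

# The same answer by counting women out of 1000.
n = 1000
true_positives = n * p_cancer * p_pos_given_cancer            # about 10.5 women
false_positives = n * (1 - p_cancer) * p_pos_given_no_cancer  # about 98.6 women
by_counting = true_positives / (true_positives + false_positives)
print(round(by_counting, 4))              # 0.0962
```

Most positive results come from the much larger group of women without cancer, which is why the posterior stays below 10%.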
Interpretation generally
Same data, different meaning
More than one true story