Frontiers of Computational Journalism
Columbia Journalism School
Week 5: Quantification and Probability
October 6, 2017
This class

• Quantification and Data Quality
• What is statistics for?
• Models made of Counting
• Interpretation Generally
Quantification and Data Quality

Definition of data?

My definition of data: a collection of similar pieces of recorded information.

Structured data

Unstructured data
Quantification

x = (x1, x2, x3, …, xN)
Different types of counting

Numeric
  o Continuous or discrete
  o Units of measurement?
  o Non-linear scales?

Categorical
  o finite, e.g. {true, false}
  o infinite, e.g. {red, yellow, blue, chartreuse…}
  o ordered?

Choices about what to count

GDP = C + I + G + (X - M)
1940 U.S. census enumerator instructions

2010 U.S. census race and ethnicity questions

Some things that are tricky to quantify, but usefully quantified anyway

• Intelligence
• Academic performance
• Race, ethnicity, nationality, gender
• Number of incidents of some type
• Income
• Political ideology

Intentional or unintentional problems

It looks like Lucknow and Kanpur have few traffic accidents, but deaths data suggests that accidents are not being counted.

Evaluating Data Quality

Internal validity: check the data against itself (see the sketch after this list).
• row counts (e.g. all 50 states?)
• related data
• histograms
• do the numbers add up?

External validity: compare the data to something else.
• alternate data sources
• expert knowledge
• previous versions
• common sense!
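A minimal sketch of the internal-validity checks above, using pandas; the file name and column names are hypothetical.

import pandas as pd

# Hypothetical file and columns, for illustration only.
df = pd.read_csv("state_budgets.csv")

# Row counts: do we have all 50 states, and are any duplicated?
print(len(df), df["state"].nunique())

# Do the numbers add up? e.g. line items should sum to the stated total.
print((df["total"] - df[["education", "health", "transport"]].sum(axis=1)).abs().max())

# Histograms / summary stats: eyeball the distribution for impossible or suspicious values.
print(df["total"].describe())

# Missing values often signal collection problems.
print(df.isna().sum())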

Interview the Data

• Who created this data?
• What is this data supposed to count?
• How was this data actually collected?
• Does it really count what it's supposed to?
• For what purpose was this data collected?
• How do we know it is complete?
• If the data was collected from people, who was asked and how?

Interview the Data

• Who is going to look bad or lose money because of this data?
• Is the data consistent with other sources?
• Is the data consistent from day to day, or when collected by different people?
• Who has already analyzed it?
• Are there multiple versions?
• Does this data have known problems?
What is statistics for?

Description

Explanation

Prediction
Models made of Counting

The Simplest Model: Counting One Thing

P(Yellow) = 0.6

(Figure: counts of Blue vs. Yellow)
The Second Simplest Model: Counting Two Things

Pr(Accident) = 0.15

(Figure: counts of Accident vs. No accident)

(Figure: 2×2 table of Blue / Yellow × Accident / No Accident)
Deadly Force in Black and White, ProPublica 10/10/2014
Relative risk (risk ratio)
AP Clinton Foundation Story

"At least 85 of 154 people from private interests who met or had phone conversations scheduled with Clinton while she led the State Department donated to her family charity or pledged commitments to its international programs, according to a review of State Department calendars."

Odds

Not enough information to compute the odds ratio, which you can tell immediately because four values are required: all four cells of the 2×2 table (met / did not meet × donated / did not donate), and the story only gives the "met" row.
(Figure: 2×2 table of Blue / Yellow × Accident / No Accident)

P(Accident|Blue) = 0.1
Conditional Probability

Pr(B|A) = Pr(AB) / Pr(A)
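A small sketch of this definition using made-up joint counts for the cab example (the counts are hypothetical, chosen so the result matches the P(Accident|Blue) = 0.1 on the previous slide).

# Hypothetical joint counts for (color, accident outcome).
counts = {("blue", "accident"): 10, ("blue", "no accident"): 90,
          ("yellow", "accident"): 75, ("yellow", "no accident"): 425}
N = sum(counts.values())

p_blue = (counts[("blue", "accident")] + counts[("blue", "no accident")]) / N
p_blue_and_accident = counts[("blue", "accident")] / N

# Pr(B|A) = Pr(AB) / Pr(A), with A = blue cab, B = accident
p_accident_given_blue = p_blue_and_accident / p_blue
print(p_accident_given_blue)   # 0.1 with these made-up counts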
Relative risk as conditional probability

(2×2 table: a = smokers with the disease, b = smokers without, c = non-smokers with the disease, d = non-smokers without)

N = a + b + c + d
N(disease) = a + c
N(no disease) = b + d

Pr(disease) = (a + c) / (a + b + c + d)
Pr(disease|smoker) = a / (a + b)
Pr(disease|non-smoker) = c / (c + d)

RR = Pr(disease|smoker) / Pr(disease|non-smoker) = (a / (a + b)) / (c / (c + d))
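A short sketch of the relative-risk calculation above; a, b, c, d are the four cells of the 2×2 table, and the numbers are hypothetical.

# Hypothetical 2x2 table:
#               disease   no disease
#   smoker        a=30       b=70
#   non-smoker    c=10       d=90
a, b, c, d = 30, 70, 10, 90

p_disease_smoker = a / (a + b)         # Pr(disease|smoker)
p_disease_nonsmoker = c / (c + d)      # Pr(disease|non-smoker)

rr = p_disease_smoker / p_disease_nonsmoker
print(rr)   # 3.0: in this made-up table, smokers are 3x as likely to have the disease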
Predicting Cats

Predicting Recidivism

How We Analyzed the COMPAS Recidivism Algorithm, ProPublica, 2016
Confusion Matrix

ROC curve (for adjustable thresholds)
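A minimal sketch of how a confusion matrix, and one point on an ROC curve, fall out of counting predictions against outcomes; the scores, labels, and threshold below are made up.

# Hypothetical risk scores and true outcomes (1 = event happened), for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1,   1,   0,   1,   0,   1,   0,   0  ]

def confusion(threshold):
    # Count the four cells of the confusion matrix at a given score threshold.
    tp = sum(1 for s, y in zip(scores, actual) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, actual) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, actual) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, actual) if s < threshold and y == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(0.5)
print(tp, fp, fn, tn)                    # confusion matrix cells at this threshold
print(tp / (tp + fn), fp / (fp + tn))    # true positive rate and false positive rate: one ROC point

# Sweeping the threshold from 0 to 1 and plotting (FPR, TPR) traces out the ROC curve.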
Base Rates - Taxi Accidents

Imagine you live in a city where 15% of all rides end in an accident, and last year there were:

- 75 accidents involving yellow cabs
- 25 accidents involving blue cabs

Which taxi company is more dangerous?
Base rate

We know:
P(accident) = 0.15
P(blue|accident) = 0.25
P(yellow|accident) = 0.75

We do not know the "base rate": P(yellow), or equivalently N(yellow).
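A quick sketch of why the base rate matters: with the accident counts fixed, which company looks more dangerous depends entirely on how many rides each company provides (the ride totals below are hypothetical).

accidents = {"yellow": 75, "blue": 25}

# The answer depends on the unknown base rate: rides per company.
for rides in ({"yellow": 500, "blue": 100},      # hypothetical: mostly yellow rides
              {"yellow": 100, "blue": 100}):     # hypothetical: equal rides
    rate = {c: accidents[c] / rides[c] for c in accidents}
    print(rides, rate)

# With 500 yellow rides vs 100 blue, yellow's accident rate (0.15) is lower than blue's (0.25);
# with equal rides, yellow looks far worse (0.75 vs 0.25).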
Evidence and Conditional Probability

Hypothesis H = Alice has a cold
Evidence E = we just saw her cough
Alice is coughing. Does she have a cold?

Most people with colds cough: P(coughing|cold) = 0.9
P(A|B) ≠ P(B|A)

Most people with colds cough: P(coughing|cold) = 0.9

But we want P(cold|coughing).
Bayes’ Theorem

Tells us how to go from Pr(A|B) to Pr(B|A):

Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
Alice is coughing. Does she have a cold?

Prior       P(H) = 0.05   (5% of our friends have a cold)
Likelihood  P(E|H) = 0.9  (most people with colds cough)
Base rate   P(E) = 0.1    (10% of everyone coughs today)

P(H|E) = P(E|H) P(H) / P(E)
       = 0.9 * 0.05 / 0.1
       = 0.45

If you believe your initial probability estimates, you should now believe there's a 45% chance she has a cold.
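A minimal check of the arithmetic above, written as a reusable Bayes-rule helper (the function name is my own).

def posterior(prior, likelihood, base_rate):
    # Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
    return likelihood * prior / base_rate

print(posterior(prior=0.05, likelihood=0.9, base_rate=0.1))   # 0.45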
Evidence

Information that justifies a belief.

Presented with evidence E for X, we should believe X "more."

In terms of probability, P(X|E) > P(X).
Bayes “learns” from evidence

Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)

or, equivalently,

Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Posterior: how likely is H given evidence E?
Likelihood: probability of seeing E if H is true.
Base rate: how commonly do we see E at all?
Prior: how likely was H to begin with?
Bayes’ Theorem - Diagnostic tests

Suppose I tell you:

• 14 of 1000 women under 50 have breast cancer
• If a woman has cancer, a mammogram is positive 75% of the time
• If a woman does not have cancer, a mammogram is positive 10% of the time

If a woman has a positive mammogram, how likely is she to have cancer?
The Signal and the Noise, Nate Silver
(Figure: frequency diagram of cancer / no cancer × positive / negative)

Pr(positive|cancer) = 0.75 = N(positive & cancer) / N(cancer)

N(cancer) = 4, N(positive & cancer) = 3
Pr(positive|no cancer) = 0.1 = N(positive & no cancer) / N(no cancer)

N(no cancer) = 1000, N(positive & no cancer) = 100

Pr(cancer) = 0.014 = N(cancer) / N
Conditional probabilities

Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%

What is Pr(cancer|positive)?

Pr(cancer|positive) = 9.6%
Bayesian Mammograms

Pr(cancer|positive) = Pr(positive|cancer) Pr(cancer) / Pr(positive)

Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014

Pr(positive) = Pr(positive|no cancer) Pr(no cancer) + Pr(positive|cancer) Pr(cancer)
             = 0.10 * 0.986 + 0.75 * 0.014
             = 0.1091
Bayesian Mammograms

Pr(cancer|positive) = Pr(positive|cancer) Pr(cancer) / Pr(positive)
                    = (0.75 * 0.014) / 0.1091
                    = 0.0962
                    = 9.6% chance she has cancer if the mammogram is positive
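The same arithmetic as above, spelled out in code; Pr(positive) comes from the law of total probability.

p_cancer = 0.014                      # 14 of 1000 women under 50
p_pos_given_cancer = 0.75             # positive mammogram if she has cancer
p_pos_given_no_cancer = 0.10          # positive mammogram if she does not

# Law of total probability for the base rate of a positive test.
p_pos = p_pos_given_no_cancer * (1 - p_cancer) + p_pos_given_cancer * p_cancer

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_pos, p_cancer_given_pos)      # 0.1091, ~0.096 (about 9.6%)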
Interpretation generally

Same data, different meaning

More than one true story