Jonathan Stray, Columbia University, Fall 2015
Syllabus at http://www.compjournalism.com/?p=133


Computational Journalism

Columbia Journalism School

Week 10: Drawing Conclusions from Data

December 4, 2015

Data doesn't speak for itself

Interpreting Data

Data + Context => Meaning

Interpretation

There may be more than one defensible interpretation of a data set.

Our goal in this class is to rule out indefensible interpretations.

**Data interpretation strategy**

• What question are you asking?
• Understand the quantification process
• Quantified randomness and uncertainty
• Models, especially causal models
• Interpretation: context, cognitive biases, generalizability
• Method of competing hypotheses
• Communication methods

**What’s your question?**

The question comes first. It tells you what data you need.

Quantification

The process that creates data.

How was this data created?

**1940 U.S. census enumerator instructions**

**2010 U.S. census race and ethnicity questions**

Intentional or unintentional problems

**Interview the Data**

• Where do these numbers come from?
• Who recorded them? How?
• For what purpose was this data collected?
• How do we know it is complete?
• What are the demographics?
• Is this the right way to quantify this issue?
• Who is not included in these figures?
• Who is going to look bad or lose money as a result of these numbers?
• Is the data consistent from day to day, or when collected by different people?
• What arbitrary choices had to be made to generate the data?
• Is the data consistent with other sources? Who has already analyzed it?
• Does it have known flaws? Are there multiple versions?

**Adventures in Field Definitions**

2004 Election, in Florida, recounted by Matt Waite in Handling Data about Race and Ethnicity:

There were more than 47,000 Floridians on the felon purge list. Of them, only 61—that’s one tenth of one percent—were Hispanic in a state where 17 percent of the population claimed Hispanic as their race.

...

In the state voter registration database, Hispanic is a race. In the state’s criminal history database, Hispanic is an ethnicity. When matched together, and with race as a criteria for matching, the number of matches involving Hispanic people drops to near zero.

Randomness

Quantified uncertainty

Margin of Error

**The probabilities of polling**

If Romney is two points ahead of Obama, 49% to 47%, in a poll with 5.5% margin of error, how likely is it that Obama is actually leading?

Given:

R = 49%, O = 47%
MOE(R) = MOE(O) = ±5.5%

**How likely is it that Obama is actually ahead?**

Let D = R - O = 2%. This is an observed value; if we polled the whole population, we would see the true value D'. We want the probability that Obama is actually ahead, i.e. P(D' < 0).

Margin of error on D ≈ MOE(R) + MOE(O) = ±11%, because R and O are almost perfectly (negatively) correlated: R + O ≈ 100.

For a better analysis, see http://abcnews.go.com/images/PollingUnit/MOEFranklin.pdf, which gives MOE(D) = 10.8%.

[Chart: distribution of the true difference D', split into P(Obama ahead) and P(Romney ahead)]

**Std. dev. of D ≈ MOE(D)/1.96 = ±5.5%, since MOE is quoted as a 95% confidence interval**

Z-score of -D = -2%/5.5% = -0.36

P(z < -0.36) ≈ 0.36, so there is a 36% chance that Romney is not actually ahead, or about 1 in 3.
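The arithmetic above is easy to check in code. A minimal sketch, using only the numbers quoted on the slide (MOE(D) = 10.8%, an observed 2-point lead) and the standard library's normal distribution:

```python
from statistics import NormalDist

# Numbers from the slide; 1.96 is the z-value for a 95% confidence
# interval, which is how margins of error are conventionally quoted.
moe_d = 0.108          # margin of error on the difference D = R - O
sd_d = moe_d / 1.96    # standard deviation of D, about 5.5 points
d_observed = 0.02      # Romney's observed 2-point lead

# P(true difference D' < 0), i.e. the chance Obama is actually ahead
p_obama_ahead = NormalDist(mu=d_observed, sigma=sd_d).cdf(0.0)
print(round(p_obama_ahead, 2))  # about 0.36, roughly 1 chance in 3
```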

Which one is random?

One star per box – not random

Is this die loaded?

Are these two dice loaded?

Two dice: non-uniform distribution

**Two principles of randomness**

1. Random data has patterns in it way more often than you think.
2. This problem gets much more extreme when you have less data.
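Both principles can be demonstrated by simulation. In this sketch (all numbers are invented for illustration), every "county" has exactly the same underlying rate, yet the small counties produce rates that look dramatically high, or suspiciously low, purely by chance:

```python
import random

random.seed(0)
true_rate = 0.01  # identical everywhere: nothing is "causing" anything

def observed_rates(population, n_counties=500):
    """Simulate the observed incidence rate in n_counties counties."""
    rates = []
    for _ in range(n_counties):
        cases = sum(random.random() < true_rate for _ in range(population))
        rates.append(cases / population)
    return rates

small = observed_rates(population=100)   # small counties
large = observed_rates(population=5000)  # large counties

# Small counties routinely show triple the true rate, or zero cases,
# by chance alone; large counties stay close to the true rate.
print(max(small), min(small), max(large))
```

This is exactly the cancer-map trap: the darkest and lightest counties on such a map tend to be the least populous ones.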

Is something causing cancer?

Cancer rate per county. Darker = greater incidence of cancer.

Which of these is real data?

Global temperature record

How likely is it that the temperature won't increase over the next decade?

From The Signal and the Noise, Nate Silver

The Howland Will Trial

It is conceivable that the 14 elderly people who are reported to have died soon after receiving the vaccination died of other causes. Government officials in charge of the program claim that it is all a coincidence, and point out that old people drop dead every day. The American people have even become familiar with a new statistic: Among every 100,000 people 65 to 75 years old, there will be nine or ten deaths in every 24-hour period under most normal circumstances.

Even using the official statistic, it is disconcerting that three elderly people in one clinic in Pittsburgh, all vaccinated within the same hour, should die within a few hours thereafter. This tragedy could occur by chance, but the fact remains that it is extremely improbable that such a group of deaths should take place in such a peculiar cluster by pure coincidence.

- New York Times editorial, 14 October 1976

Assuming that about 40 percent of elderly Americans were vaccinated within the first 11 days of the program, then about 9 million people aged 65 and older would have received the vaccine in early October 1976. Assuming that there were 5,000 clinics nationwide, this would have been 164 vaccinations per clinic per day. A person aged 65 or older has about a 1-in-7,000 chance of dying on any particular day; the odds of at least three such people dying on the same day from among a group of 164 patients are indeed very long, about 480,000 to one against.

However, under our assumptions, there were 55,000 opportunities for this “extremely improbable” event to occur—5,000 clinics, multiplied by 11 days. The odds of this coincidence occurring somewhere in America, therefore, were much shorter—only about 8 to 1.

- Nate Silver, The Signal and the Noise, Ch. 7 footnote 20
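Silver's footnote arithmetic can be reproduced with the binomial distribution, under the assumptions he states (164 patients per clinic-day, a 1-in-7,000 daily death risk, 5,000 clinics over 11 days):

```python
from math import comb

p_death = 1 / 7000   # daily chance of death for a person aged 65 or older
n_patients = 164     # vaccinations per clinic per day

# P(at least 3 deaths among 164 patients on one clinic-day), via binomial
p_cluster = 1.0 - sum(
    comb(n_patients, k) * p_death**k * (1 - p_death)**(n_patients - k)
    for k in range(3)
)
odds_against_one_clinic = 1 / p_cluster   # about 480,000 to 1 against

# ...but there were 55,000 clinic-days for the coincidence to happen
opportunities = 5000 * 11
p_somewhere = 1.0 - (1.0 - p_cluster) ** opportunities
print(odds_against_one_clinic)  # roughly 480,000
print(p_somewhere)              # roughly 0.11, i.e. about 8 to 1 against
```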

Randomization to detect insider trading

Looking at executives' trading in the week before their companies made news, the Journal found that one of every 33 who dipped in and out posted average returns of more than 20% (or avoided 20% downturns) in the following week. By contrast, only one in 117 executives who traded in an annual pattern did that well.

Random Happens

"Unlikely to happen by chance" is only a good argument if you've estimated the chance.

Also: a particular coincidence may be rare, but some coincidence somewhere occurs constantly.

**A more complete theory**

Compare probability of multiple alternatives.

**The Bayesian approach: probability distribution over hypotheses**

E.g. Is the NYPD targeting mosques for stop-and-frisk?

[Chart: probability (0 to 1) assigned to each hypothesis: H0 = never, H1 = once or twice, H2 = routinely]

*Tricky: you have to imagine a hypothesis before you can assign it a probability.

Evidence

Information that justifies a belief.

Presented with evidence E for X, we should believe X "more."

In terms of probability, P(X|E) > P(X)

Strength of Evidence

Is coughing strong or weak evidence for a cold?

Expressed in terms of conditional probabilities.

P(cold|coughing)

High values = strong evidence.

**Don't reverse probabilities!**

In general P(A|B) ≠ P(B|A)

P(coughing|cold) ≈ 0.9

P(cold|coughing) ≈ 0.3

Bayes' theorem gives the relationship

P(A|B) = P(B|A) P(A) / P(B)

**Quantified support for hypotheses**

How likely is a hypothesis H, given evidence E?

Or, what is Pr(H|E)?

It depends on:

how likely H was before E: Pr(H)

how likely E would be if H is true: Pr(E|H)

how common the evidence is: Pr(E)

Bayes' theorem: learning from evidence

Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)

or

Pr(H|E) = Pr(E|H)/Pr(E) × Pr(H)

Pr(H|E), the posterior: how likely is H given evidence E?

Pr(E|H), the model of H: the probability of seeing E if H is true.

Pr(E), the model of E: how commonly do we see E at all?

Pr(H), the prior: how likely was H to begin with?

**Alice is coughing. Does she have a cold?**

Hypothesis H = Alice has a cold

Evidence E = we just saw her cough

**Alice is coughing. Does she have a cold?**

Hypothesis H = Alice has a cold

Evidence E = we just saw her cough

Prior P(H) = 0.05 (5% of our friends have a cold)

Model P(E|H) = 0.9 (most people with colds cough)

Model P(E) = 0.1 (10% of everyone coughs today)

**Alice is coughing. Does she have a cold?**

P(H|E) = P(E|H)P(H)/P(E)

= 0.9 * 0.05 / 0.1

= 0.45

If you believe your initial probability estimates, you should now believe there's a 45% chance she has a cold.
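The slide's computation as a small function (a sketch; the probabilities are the made-up classroom numbers above):

```python
def posterior(prior, likelihood, evidence_prob):
    """Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence_prob

# P(H) = 0.05, P(E|H) = 0.9, P(E) = 0.1, as on the slide
p_cold = posterior(prior=0.05, likelihood=0.9, evidence_prob=0.1)
print(round(p_cold, 2))  # 0.45
```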

Did the stoplight reduce accidents?

**Simulated without stoplight**

[Figure: nine simulated years of accident counts, 0 to 8 accidents per year, with no stoplight]

**Simulated with a 50% effective stoplight**

[Figure: nine simulated years of accident counts at the same intersection, with a 50% effective stoplight]
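Simulations like the ones behind these panels take only a few lines. The assumptions here are illustrative, not from the slide: accident counts per year are Poisson-distributed, the no-stoplight rate is 4 per year, and "50% effective" means the rate is halved:

```python
import math
import random

random.seed(1)

def poisson(lam):
    """Draw one Poisson sample via Knuth's method: count uniform draws
    until their running product falls below e^-lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

# Nine simulated years each, as in the panels above
before = [poisson(4.0) for _ in range(9)]  # no stoplight, ~4 accidents/year
after = [poisson(2.0) for _ in range(9)]   # 50% effective stoplight
print(before)
print(after)

# With one year of data per condition, how often does the (genuinely
# effective) stoplight fail to look better than no stoplight?
trials = 10000
fooled = sum(poisson(2.0) >= poisson(4.0) for _ in range(trials))
print(fooled / trials)  # a sizable fraction of the time
```

The point of the nine panels: with this little data, the two conditions overlap so much that a working stoplight frequently looks useless, or even harmful, by chance.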

Models

They encode our background knowledge and assumptions.

Does chocolate make you smarter?

Does marriage make women safer?

| Occupational Group | Smoking | Mortality |
|---|---|---|
| Farmers, foresters, and fishermen | 77 | 84 |
| Miners and quarrymen | 137 | 116 |
| Gas, coke and chemical makers | 117 | 123 |
| Glass and ceramics makers | 94 | 128 |
| Furnace, forge, foundry, and rolling mill | 116 | 155 |
| Electrical and electronics workers | 102 | 101 |
| Engineering and allied trades | 111 | 118 |
| Woodworkers | 93 | 113 |
| Leather workers | 88 | 104 |
| Textile workers | 102 | 88 |
| Clothing workers | 91 | 104 |
| Food, drink, and tobacco workers | 104 | 129 |
| Paper and printing workers | 107 | 86 |
| Makers of other products | 112 | 96 |

**How correlation happens**

X causes Y

Y causes X

Z (a hidden variable) causes X and Y

random chance!
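The hidden-variable case is worth simulating, because the resulting correlation looks exactly like causation. A sketch with invented numbers: X and Y never influence each other, yet both inherit Z's movements:

```python
import random

random.seed(2)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]    # the hidden variable
x = [zi + random.gauss(0, 0.5) for zi in z]   # Z causes X (plus noise)
y = [zi + random.gauss(0, 0.5) for zi in z]   # Z causes Y (plus noise)

def corr(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

r = corr(x, y)
print(round(r, 2))  # strongly positive, despite no causal link between X and Y
```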


**Guns and firearm homicides?**

X causes Y: if you have a gun, you're going to use it

Y causes X: if it's a dangerous neighborhood, you'll buy a gun

Random chance: the correlation is due to chance

**Beauty and responses**

X causes Y: telling a woman she's beautiful makes her respond less

Z causes X and Y: if a woman is beautiful, 1) she'll respond less, and 2) people will tell her that

Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.

**Beauty and responses**

X causes Y: telling a woman she's beautiful doesn't work

Z causes X and Y: if a woman is beautiful, 1) she'll respond less, and 2) people will tell her that

Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.

What an experiment is: intervene in a network of causes

**Does Facebook news feed cause people to share links?**

**A good model has a theory of the world.**

Bad models, bad inferences.

Interpretation

Context, cognitive biases, generalizability

Bias

A systematic tendency to produce an incorrect answer.

Systematic means it's not a random error. There's a pattern to the errors.

Implies we could do better if we corrected for the pattern.

*Tricky: evaluating bias requires knowledge of the correct answer.

Cognitive biases

Availability heuristic: we use examples that come to mind, instead of statistics.

Preference for earlier information: what we learn first has a much greater effect on our judgment.

Memory formation: whatever seems important at the time is what gets remembered.

Confirmation bias: we seek out and give greater importance to information that confirms our expectations.

Confirmation bias

Comes in many forms.

...unconsciously filtering information that doesn't fit expectations.

...not looking for contrary information.

...not imagining the alternatives.

**The thing about evidence...**

As the amount of information increases, it gets more likely that some information somewhere supports any particular hypothesis.

In other words, if you go looking for confirmation, you will find it. This is not a complete truth-finding method.

**Method of competing hypotheses**

Start with multiple hypotheses H0, H1, ... HN.

(Remember, if you can't imagine it, you can't conclude it!)

Go looking for information that gives you the best ability to discriminate between hypotheses.

Evidence which supports Hi is much less useful than evidence which supports Hi much more than Hj, if the goal is to choose a hypothesis.

**Method of competing hypotheses, quantitative form**

Start with multiple hypotheses H0, H1, ... HN.

Each is a model of what you'd expect to see, P(E|Hi), with initial probability P(Hi).

For each new piece of evidence, use Bayes' rule to update the probability of every hypothesis.

The inference result is the probabilities of the different hypotheses given all evidence:

{ P(H0|E), P(H1|E), ... , P(HN|E) }
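The quantitative method can be sketched in a few lines. Here the three hypotheses echo the earlier stop-and-frisk example, but the likelihood tables P(E|Hi) are invented purely for illustration:

```python
# Three hypotheses, starting from equal priors
priors = {"H0: never": 1 / 3, "H1: sometimes": 1 / 3, "H2: routinely": 1 / 3}

def update(beliefs, likelihoods):
    """One Bayes step: P(Hi|E) ∝ P(E|Hi) * P(Hi), renormalized over all Hi."""
    unnorm = {h: likelihoods[h] * p for h, p in beliefs.items()}
    total = sum(unnorm.values())  # this is P(E), the evidence probability
    return {h: v / total for h, v in unnorm.items()}

# Each piece of evidence is a table of P(E|Hi); feed them in one at a time.
evidence = [
    {"H0: never": 0.1, "H1: sometimes": 0.5, "H2: routinely": 0.9},
    {"H0: never": 0.2, "H1: sometimes": 0.6, "H2: routinely": 0.8},
]
beliefs = priors
for lik in evidence:
    beliefs = update(beliefs, lik)
print(beliefs)  # probabilities shift toward the hypothesis the evidence favors
```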

A difficult example

NYPD performs ~600,000 street stop and frisks per year.

What sorts of conclusions could we draw from this data? How?

Generalizability

Will your conclusions hold in other contexts?

What doesn't a Twitter map show?

NYC population, colored by income

**Stop and Frisk Causation**

Suppose you take the address of every mosque in NYC, and discover that there are 15% more stop-and-frisks within 100m of mosques than the overall average.

Can we conclude that the police are targeting Muslims?
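One way to attack at least the chance hypothesis is a resampling null model: how often would a comparable set of randomly chosen city blocks show a 15% excess? Everything below is synthetic (the per-block stop counts, the number of mosques, and the 100m buffers are abstracted away); it only shows the shape of the test:

```python
import random

random.seed(3)

# Synthetic stop counts for 2,000 city blocks (made-up, skewed distribution)
block_rates = [random.expovariate(1 / 100) for _ in range(2000)]
citywide_avg = sum(block_rates) / len(block_rates)

observed_excess = 0.15  # mosques' blocks show 15% more stops than average
n_mosques = 250         # hypothetical number of mosque-adjacent blocks

def excess_for_random_blocks():
    """Relative excess of a random sample of blocks over the citywide average."""
    sample = random.sample(block_rates, n_mosques)
    return sum(sample) / len(sample) / citywide_avg - 1

# How often do 250 randomly chosen blocks beat the average by 15% or more?
trials = 2000
hits = sum(excess_for_random_blocks() >= observed_excess for _ in range(trials))
print(hits / trials)  # small: a 15% excess is unlikely by chance alone
```

Note what this does and doesn't establish: chance is largely ruled out, but confounders are not. Mosques may simply sit in heavily policed neighborhoods, which is exactly the kind of competing hypothesis the method above demands you test next.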
