Uploaded by Jonathan Stray

From the course Frontiers of Computational Journalism, Columbia University, Fall 2017 http://www.compjournalism.com/?p=206

Computational Journalism

Columbia Journalism School

Week 6: Drawing Conclusions from Data

October 20, 2017

This class

• Randomness and Randomization Testing

• $%@*! P-Values

• Bayesian inference

• Causal Models

• Analysis of Competing Hypotheses

Randomness

Margin of Error

Which one is random?

One star per box – “less” random

Two principles of randomness

1. Random data has “patterns” in it way more often than you think.

2. This problem gets much more extreme when you have less data.
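As a rough illustration of the second principle, the sketch below computes the exact chance that a fair coin shows a lopsided result for different numbers of flips. The 70% threshold and the function name are assumptions chosen for this example, not something from the slides.

```python
from math import comb

def prob_heads_share_at_least(n_flips, threshold_pct=70):
    """Exact chance that a fair coin shows at least threshold_pct% heads."""
    # Smallest head count that reaches the threshold (integer ceiling).
    k_min = (threshold_pct * n_flips + 99) // 100
    # Count the equally likely flip sequences with at least k_min heads.
    favorable = sum(comb(n_flips, k) for k in range(k_min, n_flips + 1))
    return favorable / 2 ** n_flips

# The same "pattern" (70% heads) is common with 10 flips and
# vanishingly rare with 1000 flips.
for n in (10, 100, 1000):
    print(n, prob_heads_share_at_least(n))
```

With only 10 flips, a purely random coin looks "biased" by this measure about one time in six.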

Is this die loaded?

Are these two dice loaded?

Two dice: non-uniform distribution
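A minimal sketch of why two dice give a non-uniform distribution: it enumerates all 36 equally likely outcomes of two fair six-sided dice and tallies the totals. The function name is an assumption made for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def two_dice_distribution(sides=6):
    """Exact probability of each total when rolling two fair dice."""
    faces = range(1, sides + 1)
    counts = Counter(a + b for a, b in product(faces, repeat=2))
    outcomes = sides * sides
    return {total: Fraction(c, outcomes) for total, c in sorted(counts.items())}

dist = two_dice_distribution()
for total, prob in dist.items():
    # One '#' per outcome (out of 36) that produces this total.
    print(f"{total:2d} {'#' * int(prob * 36)}")
```

Each die alone is uniform, but a total of 7 can be made six ways while a total of 2 can be made only one way, so an uneven histogram of sums is not by itself evidence of loaded dice.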

Statistics Without Numbers

Is something causing cancer?

Cancer rate per county. Darker = greater incidence of cancer.

From Graphical Inference for Infovis, Wickham et al.

Which of these is real data?

Global temperature record

How likely is it that the temperature won't increase over the next decade?

From The Signal and the Noise, Nate Silver

It is conceivable that the 14 elderly people who are reported to have died soon after receiving the vaccination died of other causes. Government officials in charge of the program claim that it is all a coincidence, and point out that old people drop dead every day. The American people have even become familiar with a new statistic: Among every 100,000 people 65 to 75 years old, there will be nine or ten deaths in every 24-hour period under most normal circumstances.

Even using the official statistic, it is disconcerting that three elderly people in one clinic in Pittsburgh, all vaccinated within the same hour, should die within a few hours thereafter. This tragedy could occur by chance, but the fact remains that it is extremely improbable that such a group of deaths should take place in such a peculiar cluster by pure coincidence.

- New York Times editorial, 14 October 1976

Assuming that about 40 percent of elderly Americans were vaccinated within the first 11 days of the program, then about 9 million people aged 65 and older would have received the vaccine in early October 1976. Assuming that there were 5,000 clinics nationwide, this would have been 164 vaccinations per clinic per day. A person aged 65 or older has about a 1-in-7,000 chance of dying on any particular day; the odds of at least three such people dying on the same day from among a group of 164 patients are indeed very long, about 480,000 to one against. However, under our assumptions, there were 55,000 opportunities for this “extremely improbable” event to occur—5,000 clinics, multiplied by 11 days. The odds of this coincidence occurring somewhere in America, therefore, were much shorter—only about 8 to 1

- Nate Silver, The Signal and the Noise, Ch. 7 footnote 20
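The arithmetic in the Silver footnote can be checked with a short sketch. It uses only the figures quoted above (164 vaccinations per clinic per day, a 1-in-7,000 daily death risk, 5,000 clinics, 11 days) and assumes, as a simplification, that deaths are independent and that every clinic-day is an independent opportunity. The variable and function names are made up for this example.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial probability of k or more events among n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

P_DEATH_PER_DAY = 1 / 7000          # quoted daily risk for a person 65 or older
PATIENTS_PER_CLINIC_DAY = 164       # quoted vaccinations per clinic per day
OPPORTUNITIES = 5000 * 11           # clinics multiplied by days

# Chance of three or more deaths in one specific clinic on one specific day.
p_one = prob_at_least(3, PATIENTS_PER_CLINIC_DAY, P_DEATH_PER_DAY)
# Chance that this happens in at least one of the clinic-days.
p_any = 1 - (1 - p_one) ** OPPORTUNITIES

odds_one = (1 - p_one) / p_one
odds_any = (1 - p_any) / p_any
print(f"one clinic-day: about {odds_one:,.0f} to 1 against")
print(f"somewhere:      about {odds_any:.1f} to 1 against")
```

Under these assumptions the two results land close to the quoted "about 480,000 to one" and "about 8 to 1".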

The Howland Will Trial

Randomization to detect insider trading

Looking at executives' trading in the week before their companies made news, the Journal found that one of every 33 who dipped in and out posted average returns of more than 20% (or avoided 20% downturns) in the following week. By contrast, only one in 117 executives who traded in an annual pattern did that well.

$%@*! P-Values

P-value

p = Pr(data at least as extreme as your data | null hypothesis)

What’s it good for? What’s it bad for?

From A dirty dozen: twelve p-value misconceptions, S. Goodman

Is one classroom better than another?

T-test for two groups with different variance. Expected to have a t-distribution under the null hypothesis of equal scores.

Reasons for possible differences

Things that depend on which classroom a student is in

Things that don’t depend on which classroom they’re in

Break the relationship

Observed difference between classes

14% of all resamples have a class difference > observed, so p = 0.14
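A minimal sketch of this kind of randomization test follows. Shuffling the pooled scores breaks any relationship between classroom and score, and the p-value is the share of shuffles whose class difference is at least as large as the observed one. The scores are made up for illustration, and the choice of an absolute (two-sided) difference with ">=" is an assumption of this sketch rather than something the slide specifies.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Share of random relabelings with a mean difference >= the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        # Reassign students to "classrooms" at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical test scores for two classrooms (not real data).
class_a = [72, 85, 90, 68, 77, 81, 95, 70]
class_b = [65, 80, 74, 69, 88, 71, 60, 79]
p_value = permutation_p_value(class_a, class_b)
print(f"p = {p_value:.3f}")
```

No distributional assumption is needed here; the resamples themselves supply the reference distribution.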

New samples from the data

Computing the sampling distribution

Bootstrapping: resample with replacement. This gives an excellent approximation of the sampling distribution, even if non-normal.
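One way to sketch bootstrapping: draw many resamples of the same size from the observed data, with replacement, and look at how the statistic of interest varies across them. The scores below are hypothetical, and the percentile interval is one simple summary among several possible ones.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=5_000, seed=0):
    """Sorted means of resamples drawn with replacement from the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))

def percentile_interval(sorted_values, level=0.95):
    """Central interval containing roughly `level` of the resampled values."""
    lo_index = int((1 - level) / 2 * len(sorted_values))
    hi_index = int((1 + level) / 2 * len(sorted_values)) - 1
    return sorted_values[lo_index], sorted_values[hi_index]

# Hypothetical scores from one classroom (not real data).
scores = [72, 85, 90, 68, 77, 81, 95, 70]
means = bootstrap_means(scores)
low, high = percentile_interval(means)
print(f"sample mean {mean(scores):.1f}, 95% interval ({low:.1f}, {high:.1f})")
```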

A dirty dozen: twelve p-value misconceptions, S. Goodman

Bayesian inference

A more complete theory

Compare probability of multiple alternatives.

Did the stoplight reduce accidents?

[Figure: nine panels, numbered 1 to 9, simulated without stoplight]

[Figure: nine panels, numbered 1 to 9, simulated with a 50% effective stoplight]

Bayes “learns” from evidence

Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)

or

Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Posterior, Pr(H|E): How likely is H given evidence E?

Likelihood, Pr(E|H): Probability of seeing E if H is true

Base Rate, Pr(E): How commonly do we see E at all?

Prior, Pr(H): How likely was H to begin with?

Probability distribution over hypotheses

Is the NYPD targeting mosques for stop-and-frisk?

[Figure: probability axis from 0 to 1 over hypotheses H0 (Never), H1 (Once or twice), H2 (Routinely)]

Tricky: you have to imagine a hypothesis before you can assign it a probability.
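A small sketch of updating a probability distribution over the three hypotheses with Bayes' rule. Every number here is invented purely to show the mechanics: the priors, and the likelihoods of some unspecified piece of evidence E under each hypothesis, are assumptions and not estimates of anything.

```python
def bayes_update(prior, likelihood):
    """Return Pr(H|E) for each hypothesis H, given Pr(H) and Pr(E|H)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # Pr(E), summed over all hypotheses
    return {h: value / evidence for h, value in unnormalized.items()}

# Hypothetical prior beliefs over the three hypotheses.
prior = {"H0: never": 0.5, "H1: once or twice": 0.3, "H2: routinely": 0.2}
# Hypothetical chance of seeing evidence E if each hypothesis were true.
likelihood = {"H0: never": 0.1, "H1: once or twice": 0.4, "H2: routinely": 0.8}

posterior = bayes_update(prior, likelihood)
for hypothesis, prob in posterior.items():
    print(f"{hypothesis}: {prob:.3f}")
```

A hypothesis that is missing from the dictionary can never receive any probability, which is the point of the "tricky" note above.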

Parameter Estimation

Computing probability for a continuum of hypotheses

Pr(θ|E) = Pr(E|θ)/Pr(E) * Pr(θ)
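One common way to approximate a posterior over a continuum of hypotheses is to evaluate it on a grid of parameter values. The sketch below assumes a binomial model (some number of "successes" out of a number of trials), a flat prior, and made-up counts; none of these choices come from the slides.

```python
def grid_posterior(successes, trials, grid_size=101):
    """Posterior over a rate theta on an evenly spaced grid, flat prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size
    # Pr(E|theta), dropping the binomial coefficient, which cancels out.
    likelihood = [t**successes * (1 - t) ** (trials - successes) for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # plays the role of Pr(E)
    return thetas, [u / evidence for u in unnormalized]

# Hypothetical data: 7 successes in 20 trials.
thetas, posterior = grid_posterior(7, 20)
best = max(range(len(posterior)), key=posterior.__getitem__)
print(f"most probable theta on the grid: {thetas[best]:.2f}")
```

The result is a whole distribution over theta rather than a single estimate, so its width shows how much the data constrain the parameter.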

Strength of Evidence

Can we find a p-value equivalent?

There is “Bayes factor”

Pr(H1|E)/Pr(H2|E)

= [Pr(E|H1)Pr(H1)/Pr(E)] / [Pr(E|H2)Pr(H2)/Pr(E)]

= Pr(E|H1)/Pr(E|H2) * Pr(H1)/Pr(H2)

Bayes Factor
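The decomposition above can be checked numerically. This sketch compares two simple point hypotheses about a rate under a binomial model and confirms that the posterior odds equal the Bayes factor times the prior odds. The data, the two rates, and the priors are all hypothetical.

```python
from math import comb

def binomial_likelihood(k, n, theta):
    """Pr(E|H): chance of k events in n trials if the true rate is theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def bayes_factor(k, n, theta_1, theta_2):
    """Pr(E|H1) / Pr(E|H2) for two point hypotheses about the rate."""
    return binomial_likelihood(k, n, theta_1) / binomial_likelihood(k, n, theta_2)

def posterior_odds(factor, prior_1, prior_2):
    """Pr(H1|E) / Pr(H2|E) = Bayes factor * prior odds."""
    return factor * prior_1 / prior_2

# Hypothetical evidence: 14 events in 20 trials.
# H1 says the rate is 0.7, H2 says the rate is 0.5.
factor = bayes_factor(14, 20, 0.7, 0.5)
odds = posterior_odds(factor, prior_1=0.5, prior_2=0.5)
print(f"Bayes factor {factor:.2f}, posterior odds {odds:.2f}")
```

Pr(E) never has to be computed, because it cancels in the ratio.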

Ok, but what’s a “significant” Bayes Factor?

From Bayes Factors, Kass and Raftery

The Garden of Forking Paths

I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.

John Bohannon

Science Isn’t Broken, FiveThirtyEight

“Statistical significance” is usually asking the wrong question.

Does the model reproduce the data?

Testing for Racial Discrimination in Police Searches of Motor Vehicles, Simoiu et al.

Causal Models

Does chocolate make you smarter?

| Occupational Group | Smoking | Mortality |
| --- | --- | --- |
| Farmers, foresters, and fishermen | 77 | 84 |
| Miners and quarrymen | 137 | 116 |
| Gas, coke and chemical makers | 117 | 123 |
| Glass and ceramics makers | 94 | 128 |
| Furnace, forge, foundry, and rolling mill | 116 | 155 |
| Electrical and electronics workers | 102 | 101 |
| Engineering and allied trades | 111 | 118 |
| Woodworkers | 93 | 113 |
| Leather workers | 88 | 104 |
| Textile workers | 102 | 88 |
| Clothing workers | 91 | 104 |
| Food, drink, and tobacco workers | 104 | 129 |
| Paper and printing workers | 107 | 86 |
| Makers of other products | 112 | 96 |

Does marriage make women safer?

How correlation happens

X → Y: X causes Y

Y → X: Y causes X

Z → X, Z → Y: Z causes X and Y

(hidden) → X, (hidden) → Y: hidden variable causes X and Y

X, Y (no arrow): random chance!

Guns and firearm homicides?

X → Y: if you have a gun, you're going to use it

Y → X: if it's a dangerous neighborhood, you'll buy a gun

X, Y (no arrow): the correlation is due to chance

Beauty and responses

X → Y: telling a woman she's beautiful makes her respond less

Z → X, Z → Y: if a woman is beautiful, 1) she'll respond less 2) people will tell her that

Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.
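A toy simulation can show how a confounder produces a real correlation with no causal link between X and Y. In this sketch Z drives X upward and Y downward, and X never enters the formula for Y. The linear model, the noise levels, and the sample size are assumptions chosen only for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n = 5_000
z = [rng.gauss(0, 1) for _ in range(n)]      # confounder
x = [zi + rng.gauss(0, 1) for zi in z]       # Z raises X
y = [-zi + rng.gauss(0, 1) for zi in z]      # Z lowers Y; X plays no role

r_all = pearson(x, y)

# Hold the confounder roughly fixed: keep only cases where Z is near zero.
band = [(xi, yi) for zi, xi, yi in zip(z, x, y) if abs(zi) < 0.1]
r_band = pearson([b[0] for b in band], [b[1] for b in band])

print(f"correlation overall:          {r_all:.2f}")
print(f"correlation with Z held near 0: {r_band:.2f}")
```

The overall correlation is clearly negative, yet it largely disappears once Z is held roughly constant, which is what a confounded relationship looks like.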

What an experiment is:

intervene in a network of causes
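To sketch what intervening does, the simulation below compares an observational world, where a hidden Z influences both X and Y, with an experimental world where X is assigned by coin flip. In both worlds X has no effect on Y. The model and all its numbers are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def observational(n, rng):
    """X is influenced by the hidden cause Z, which also drives Y."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [-zi + rng.gauss(0, 1) for zi in z]
    return x, y

def experimental(n, rng):
    """X is set by random assignment, cutting the arrow from Z to X."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [rng.choice([0.0, 1.0]) for _ in range(n)]
    y = [-zi + rng.gauss(0, 1) for zi in z]
    return x, y

rng = random.Random(0)
r_observed = pearson(*observational(5_000, rng))
r_experiment = pearson(*experimental(5_000, rng))
print(f"observational correlation: {r_observed:.2f}")
print(f"experimental correlation:  {r_experiment:.2f}")
```

Random assignment removes the path through Z, so any correlation that survives in the experimental world would have to come from X itself (or from chance).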

Does Facebook news feed cause people to share links?

Analysis of Competing Hypotheses

Cognitive biases

Availability heuristic: we use examples that come to mind, instead of statistics.

Preference for earlier information: what we learn first has a much greater effect on our judgment.

Memory formation: whatever seems important at the time is what gets remembered.

Confirmation bias: we seek out and give greater importance to information that confirms our expectations.

Confirmation bias

Comes in many forms.

...unconsciously filtering information that doesn't fit expectations.

...not looking for contrary information.

...not imagining the alternatives.

Method of competing hypotheses

Start with multiple hypotheses H0, H1, ... HN

(Remember, if you can't imagine it, you can't conclude it!)

Go looking for information that gives you the best ability to discriminate between hypotheses.

Evidence which supports Hi is much less useful than evidence which supports Hi much more than Hj, if the goal is to choose a hypothesis.

In practice: Triangulation

A good conclusion is one which is supported by multiple lines of evidence from multiple methods.

“Philosophy ought to imitate the successful sciences in its methods, so far as to proceed only from tangible premises which can be subjected to careful scrutiny, and to trust rather to the multitude and variety of its arguments than to the conclusiveness of any one. Its reasoning should not form a chain which is no stronger than its weakest link, but a cable whose fibers may be ever so slender, provided they are sufficiently numerous and intimately connected.”

- Charles Sanders Peirce

A difficult example

NYPD performs ~600,000 street stop and frisks per year.

What sorts of conclusions could we draw from this data? How?

Stop and Frisk Causation

Suppose you take the address of every mosque in NYC, and discover that there are 15% more stop-and-frisks within 100m of mosques than the overall average.

Can we conclude that the police are targeting Muslims?
