
Computational Journalism

Columbia Journalism School

Week 9: Drawing Conclusions from Data

November 6, 2013

Lecture 9: Drawing Conclusions

Interpreting Data

Randomness and Significance

Belief, Evidence, and Bias

Correlation and Causation

Data doesn't speak for itself

Interpreting Data

Data + Context => Meaning

Important note

There may be more than one defensible interpretation of a data set.

Our goal in this class is to rule out *indefensible* interpretations.

You see a story in the data

Is it really there?

Why wouldn't there be a story?

• You misunderstand how the data is collected

• The data is incomplete or bad

• The pattern is due to chance

• The pattern is real, but it isn't a causal relationship

• The data doesn't generalize the way you want it to

How was this data created?

Intentional or unintentional problems

What doesn't a Twitter map show?

NYC population colored by income

Interview the Data

• Where do these numbers come from?

• Who recorded them?

• How?

• For what purpose was this data collected?

• How do we know it is complete?

• What are the demographics?

• Is this the right way to quantify this issue?

• Who is not included in these figures?

• Who is going to look bad or lose money as a result of these numbers?

• Is the data consistent from day to day, or when collected by different people?

• What arbitrary choices had to be made to generate the data?

• Is the data consistent with other sources? Who has already analyzed it?

• Does it have known flaws? Are there multiple versions?

Adventures in Field Definitions

2004 election, in Florida, recounted by Matt Waite in *Handling Data about Race and Ethnicity*

There were more than 47,000 Floridians on the felon purge list. Of them, only 61 - that's one tenth of one percent - were Hispanic, in a state where 17 percent of the population claimed Hispanic as their race.

...

In the state voter registration database, Hispanic is a race. In the state's criminal history database, Hispanic is an ethnicity. When matched together, and with race as a criteria for matching, the number of matches involving Hispanic people drops to near zero.

Lecture 9: Drawing Conclusions

Interpreting Data

Randomness and Significance

Hypothesis, Evidence, and Bias

Correlation and Causation

What stats know-how gets you

• You misunderstand how the data is collected

• The data is incomplete or bad

• The pattern is due to chance

• The pattern is real, but it isn't a causal relationship

• The data doesn't generalize the way you want it to

Which one is random?

One star per box - not random

What's the probability of rolling a 6?

P(6) = 1/6

Also: P(1) = 1/6, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6, P(5) = 1/6

Is this die loaded?

Are these two dice loaded?

Two dice: non-uniform distribution
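The non-uniform distribution of two dice can be checked by enumerating outcomes; a minimal sketch in Python (not from the original slides):

```python
from fractions import Fraction

# Exact distribution of one fair die: uniform, P(k) = 1/6 for k = 1..6.
one_die = {k: Fraction(1, 6) for k in range(1, 7)}

# Sum of two dice: convolve the two uniform distributions.
two_dice = {}
for a in one_die:
    for b in one_die:
        two_dice[a + b] = two_dice.get(a + b, 0) + one_die[a] * one_die[b]

for total in sorted(two_dice):
    # Peaks at 7 (P = 6/36), thin tails at 2 and 12 (P = 1/36 each)
    print(total, two_dice[total])
```

Adding two uniform variables gives a triangular distribution: sums near 7 can happen many ways, sums at the extremes only one way.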

Two principles of randomness

1. Random data has patterns in it way more often than you think.

2. This problem gets much more extreme when you have less data.
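A small simulation (illustrative, not from the slides) makes the first principle concrete: most short sequences of fair coin flips contain a "streak" that looks like a pattern.

```python
import random

def longest_run(seq):
    """Length of the longest run of identical consecutive values."""
    best = cur = 1
    for prev, nxt in zip(seq, seq[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

random.seed(0)
trials = 10_000

# Fraction of 20-flip fair-coin sequences containing a run of 4 or more
# identical outcomes in a row.
hits = sum(longest_run([random.random() < 0.5 for _ in range(20)]) >= 4
           for _ in range(trials))
print(hits / trials)  # most short random sequences contain such a streak
```

A four-in-a-row streak feels meaningful, yet it appears in the large majority of purely random 20-flip sequences.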

Is something causing cancer?

Cancer rate per county. Darker = greater incidence of cancer.

Which of these is real data?

Global temperature record

How likely is it that the temperature *won't* increase over the next decade?

From *The Signal and the Noise*, Nate Silver

It is conceivable that the 14 elderly people who are reported to have died soon after receiving the vaccination died of other causes. Government officials in charge of the program claim that it is all a coincidence, and point out that old people drop dead every day. The American people have even become familiar with a new statistic: Among every 100,000 people 65 to 75 years old, there will be nine or ten deaths in every 24-hour period under most normal circumstances.

Even using the official statistic, it is disconcerting that three elderly people in one clinic in Pittsburgh, all vaccinated within the same hour, should die within a few hours thereafter. This tragedy could occur by chance, but the fact remains that it is extremely improbable that such a group of deaths should take place in such a peculiar cluster by pure coincidence.

- *New York Times* editorial, 14 October 1976

Assuming that about 40 percent of elderly Americans were vaccinated within the first 11 days of the program, then about 9 million people aged 65 and older would have received the vaccine in early October 1976. Assuming that there were 5,000 clinics nationwide, this would have been 164 vaccinations per clinic per day. A person aged 65 or older has about a 1-in-7,000 chance of dying on any particular day; the odds of at least three such people dying on the same day from among a group of 164 patients are indeed very long, about 480,000 to one against.

However, under our assumptions, there were 55,000 opportunities for this "extremely improbable" event to occur - 5,000 clinics, multiplied by 11 days. The odds of this coincidence occurring somewhere in America, therefore, were much shorter - only about 8 to 1.

- Nate Silver, *The Signal and the Noise*, Ch. 7, footnote 20
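Silver's arithmetic can be reproduced directly. The sketch below takes his figures as givens (164 vaccinations per clinic per day, a 1-in-7,000 daily death probability, 5,000 clinics over 11 days):

```python
import math

p = 1 / 7000   # daily chance of death for one person aged 65+
n = 164        # vaccinations per clinic per day

# P(at least 3 of the 164 patients die the same day), binomial distribution
p_le2 = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
p_3plus = 1 - p_le2
print(round(1 / p_3plus))  # odds against at ONE clinic: roughly 480,000 to 1

# But there were 5,000 clinics x 11 days = 55,000 chances for it to happen
p_somewhere = 1 - (1 - p_3plus) ** 55_000
print(p_somewhere)  # about 0.11, i.e. odds of only about 8 to 1 against
```

The same event that is astronomically unlikely at a named clinic is quite likely somewhere, once you count all the opportunities for it to occur.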

The probabilities of polling

If Romney is two points ahead of Obama, 49% to 47%, in a poll with a 5.5% margin of error, how likely is it that Obama is actually leading?

Given:

R = 49%, O = 47%

MOE(R) = MOE(O) = ±5.5%

How likely is it that Obama is actually ahead?

Let D = R - O = 2%. This is an observed value; if we polled the whole population, we would see a true value D'. We want to know the probability that Obama is actually ahead, i.e. P(D' < 0).

Margin of error on D = MOE(R) + MOE(O) = ±11%, because they are almost completely dependent: R + O ≈ 100.

For a better analysis, see
http://abcnews.go.com/images/PollingUnit/MOEFranklin.pdf

Gives MOE(D) = 10.8%

Std. dev of D = MOE(D)/1.96 = ±5.5%, as MOE is quoted as a 95% confidence interval.

Z-score of -D = -2%/5.5% = -0.36

P(z < -0.36) = 0.36, so a 36% chance that Romney is not ahead, or about 1 in 3.

[Chart: P(Obama ahead) vs. P(Romney ahead)]
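The z-score step above can be checked with the standard normal CDF; a short sketch using the slide's numbers:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

moe_d = 10.8        # margin of error on the difference D = R - O, in points
sd = moe_d / 1.96   # MOE is quoted as a 95% interval, so sd is about 5.5
z = -2 / sd         # observed lead is 2 points; how often is the true D < 0?
print(normal_cdf(z))  # about 0.36: roughly a 1-in-3 chance Obama actually leads
```

A two-point lead inside an 11-point margin of error on the difference is weak evidence of who is really ahead.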

Random Happens

"Unlikely to happen by chance" is only a good argument if you've estimated the chance.

Also: a *particular* coincidence may be rare, but *some* coincidence somewhere occurs constantly.

Lecture 9: Drawing Conclusions

Interpreting Data

Randomness and Significance

Belief, Evidence, and Bias

Correlation and Causation

Belief

"What you believe" or "what you think is true" is the entire definition.

It says nothing about whether a belief is *actually* true, or how we would determine that.

There are degrees of belief, which can be modeled with a probability distribution over alternative hypotheses.

Belief: a probability distribution over hypotheses

E.g. is the NYPD targeting mosques for stop-and-frisk?

[Chart: belief as a probability between 0 and 1 assigned to each hypothesis: H0 = never, H1 = once or twice, H2 = routinely]

*Tricky: you have to imagine a hypothesis before you can assign it a probability.

Evidence

Information that justifies a belief.

Presented with evidence E for X, we should believe X "more."

In terms of probability: P(X|E) > P(X)

Strength of Evidence

Is coughing strong or weak evidence for a cold?

Expressed in terms of conditional probabilities.

P(cold|coughing)

High values = strong evidence.

Don't reverse probabilities!

In general, P(A|B) ≠ P(B|A)

P(coughing|cold) = 0.9

P(cold|coughing) = 0.3

Bayes' theorem gives the relationship:

P(A|B) = P(B|A) P(A) / P(B)

Quantified support for hypotheses

How likely is a hypothesis H, given evidence E?

Or, what is Pr(H|E)?

It depends on:

how likely H was before E: Pr(H)

how likely the evidence E would be if H is true: Pr(E|H)

how common the evidence is: Pr(E)

Bayes' theorem: learning from evidence

Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)

or

Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Reading the terms:

Pr(H|E): how likely is H, given evidence E?

Pr(H), the prior: how likely was H to begin with?

Pr(E|H), the model of E: probability of seeing E if H is true

Pr(E): how commonly do we see E at all?

Alice is coughing. Does she have a cold?

Hypothesis H = Alice has a cold

Evidence E = we just saw her cough

Prior P(H) = 0.05 (5% of our friends have a cold)

Model P(E|H) = 0.9 (most people with colds cough)

Model P(E) = 0.1 (10% of everyone coughs today)

Alice is coughing. Does she have a cold?

P(H|E) = P(E|H)P(H)/P(E)

= 0.9 * 0.05 / 0.1

= 0.45

If you believe your initial probability estimates, you should now believe there's a 45% chance she has a cold.
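This calculation is trivial to script; the inputs below are the slide's assumed estimates, not measured values:

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_cold = posterior(prior=0.05,      # 5% of our friends have a cold
                   likelihood=0.9,  # most people with colds cough
                   evidence=0.1)    # 10% of everyone coughs today
print(round(p_cold, 2))  # 0.45
```

Note how the answer moves with the inputs: if coughing were rarer overall (smaller P(E)), the same cough would be stronger evidence of a cold.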

Bias

A *systematic* tendency to produce an incorrect answer.

*Systematic* means it's not a random error. There's a pattern to the errors, which implies we could do better if we corrected for the pattern.

*Tricky: evaluating bias requires knowledge of the correct answer.

Cognitive biases

Availability heuristic: we use examples that come to mind, instead of statistics.

Preference for earlier information: what we learn first has a much greater effect on our judgment.

Memory formation: whatever seems important *at the time* is what gets remembered.

Confirmation bias: we seek out and give greater importance to information that confirms our expectations.

Confirmation bias

Comes in many forms.

...unconsciously filtering information that doesn't fit expectations.

...not looking for contrary information.

...not imagining the alternatives.

The thing about evidence...

As the amount of information increases, it gets more likely that some information somewhere supports any particular hypothesis.

In other words, if you go looking for confirmation, you *will* find it. This is not a complete truth-finding method.

Method of competing hypotheses

Start with multiple hypotheses H0, H1, ... Hn

(Remember, if you can't imagine it, you can't conclude it!)

Go looking for information that gives you the best ability to *discriminate* between hypotheses.

Evidence which supports Hi is much less useful than evidence which supports Hi much more than Hj, if the goal is to choose a hypothesis.

Method of competing hypotheses, quantitative form

Start with multiple hypotheses H0, H1, ... Hn

Each is a model of what you'd expect to see, P(E|Hi), with initial probability P(Hi)

For each new piece of evidence, use Bayes' rule to update the probability of all hypotheses.

The inference result is the probability of each hypothesis given all evidence:

{ P(H0|E), P(H1|E), ..., P(Hn|E) }
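The update loop can be sketched as follows. The three hypotheses, their priors, and their evidence models are invented for illustration; only the update rule itself comes from the slides:

```python
# Sequential Bayesian updating over competing hypotheses H0..Hn.
priors = {"H0": 0.5, "H1": 0.3, "H2": 0.2}  # initial P(H_i), assumed
models = {"H0": 0.1, "H1": 0.5, "H2": 0.9}  # P(E|H_i) for one piece of evidence E

def update(beliefs, likelihoods):
    """One Bayes-rule update of every hypothesis on a single piece of evidence."""
    unnormalized = {h: likelihoods[h] * p for h, p in beliefs.items()}
    total = sum(unnormalized.values())  # P(E): how common the evidence is overall
    return {h: v / total for h, v in unnormalized.items()}

beliefs = priors
for _ in range(3):  # observe the same kind of evidence three times
    beliefs = update(beliefs, models)
print(beliefs)  # probability mass shifts toward the hypothesis that best predicts the data
```

Because every hypothesis is renormalized together, evidence that all hypotheses predict equally well changes nothing; only evidence that *discriminates* between them moves the probabilities.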

A good model has a theory of the world.

Bad models, bad inferences

Lecture 9: Drawing Conclusions

Interpreting Data

Randomness and Significance

Hypotheses, Evidence, and Biases

Correlation and Causation

What is "causation"?

[Diagram: X, an observable thing, connected by an interaction to Y, a thing in the world]

How correlation happens

X causes Y

Y causes X

random chance!

a hidden variable Z causes both X and Y

Guns and firearm homicides?

If you have a gun, you're going to use it

If it's a dangerous neighborhood, you'll buy a gun

The correlation is due to chance

Beauty and responses

Telling a woman she's beautiful doesn't work

If a woman is beautiful, 1) she'll respond less, and 2) people will tell her that

Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.

What an experiment is: intervene in a network of causes

Does Facebook news feed cause people to share links?

A difficult example

The NYPD performs ~600,000 street stop-and-frisks per year.

What sorts of conclusions could we draw from this data? How?

Stop and Frisk Causation

Suppose you take the address of every mosque in NYC, and discover that there are 15% more stop-and-frisks within 100m of mosques than the overall average.

Can we conclude that the police are targeting Muslims?

course blog at compjournalism.com