From Algorithms to Stories
Jonathan Stray
Columbia / ProPublica / Overview

or...
three hard lessons in building tools for computational journalism

Science Journalism

Journalism through Science

Computational Journalism

Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents.

- Cohen, Hamilton, Turner, 2011

Links links links!  
bit.ly/OverviewHackers  

Doc sets in journalism

and then...
nobody used it

three years later...

Finalist, 2014 Pulitzer Prize in Public Service

Winner, 2014 Pulitzer Prize in Public Service

Demo: Obama Form Letters

Algorithm agnostic via visualization plugin API
Ships with clustering, word clouds, advanced search...
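
The plugin API means the built-in visualizations are not special: any external view can pull documents from the server and render them. A minimal sketch of what such a plugin-style client might look like; the endpoint path, auth scheme, and response shape below are assumptions for illustration, not the documented Overview plugin API:

```python
# Hedged sketch: fetch documents from an Overview-style REST API so an external
# visualization plugin could render them. Endpoint path, auth convention, and
# JSON shape are assumptions, not the real API contract.
import requests

API_TOKEN = "your-api-token"                       # assumed: per-document-set token
BASE_URL = "https://www.overviewdocs.com/api/v1"   # assumed base URL
DOC_SET_ID = 12345                                 # assumed document set id

def fetch_documents(doc_set_id):
    """Return a list of document records (id, title, text) for one document set."""
    resp = requests.get(
        f"{BASE_URL}/document-sets/{doc_set_id}/documents",
        auth=(API_TOKEN, "x-auth-token"),   # assumed: token-as-username Basic auth
        params={"fields": "text"},          # assumed query parameter
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

if __name__ == "__main__":
    docs = fetch_documents(DOC_SET_ID)
    print(f"Fetched {len(docs)} documents; a word-cloud or clustering view would take it from here.")
```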

Lesson 1
Workflow >> Algorithm

User  testing!
Loaded confirmation link, which goes to /docsets. "Hmm. What do I do now?" Eventually
clicked import link. "I need more guidance what to do next." Import pane opened to DC
login. Looked like he was about to type in credentials. Then: "I can't really do any of these
now." Eventually saw "example document sets" and clicked.
Cloned caracas-cables example set. Waited. Understood when document set import
complete. Then hesitated. Didn't know where to click to open. Eventually clicked.
"In general, you could be way more communicative."
Moved mouse to document list immediately. "For some reason, this drew me." Clicked around
doc list. "What am I looking at?"
Moved to tree view. Clicked + without hesitation to open node. Saw document in viewer
change. "It's not clear what I'm looking at in the viewer." Eventually: "Which document is
showing when I click a node? Is it the first?"
A little later, more conversationally: "I don't know how useful the document list is." He said this
twice at different points. "Is this a comma separated list of documents? It just looks like one
block of text." Suggested a horizontal delimiter of some sort.

The hardest feature to implement
The most requested, the most used

Lesson 2
It's humans + machines

By Maria Kiselyova
(Reuters) - Russian mobile phone operator Vimpelcom has become the
latest company to come under scrutiny over its operations in Uzbekistan, an
authoritarian country where rival MTS had its assets confiscated.
U.S.-listed Vimpelcom, Uzbekistan's biggest mobile operator by subscribers,
said on Wednesday that it was being investigated by the U.S. Securities and
Exchange Commission (SEC) and Dutch authorities.

Demo: Uzbekistan's Telco Bribes

VIS: Visual Investigative Scenarios

Lesson 3
Real data is messy

What researchers choose
• News articles
• Academic literature
• NLP test data sets

What journalists deal with
• PDF dumps
• Printed, scanned emails
• Scraping thousands of pages from an antique site
• CD full of Excel files
• ...
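
Formats like these force a triage step before any analysis can start. A hedged sketch of that step for a directory of PDF dumps: pull whatever embedded text exists and flag files that are probably scans needing OCR. The library choice (pypdf) and the character threshold are assumptions, not part of the talk:

```python
# Hedged sketch: triage a folder of PDF dumps. Files with little or no embedded
# text are probably scanned images and will need OCR before any text analysis.
from pathlib import Path
from pypdf import PdfReader

def triage_pdfs(folder):
    """Split PDFs into those with extractable text and likely scans."""
    has_text, needs_ocr = [], []
    for pdf_path in Path(folder).glob("*.pdf"):
        reader = PdfReader(str(pdf_path))
        text = "".join((page.extract_text() or "") for page in reader.pages)
        # 100 characters is an arbitrary cutoff chosen for illustration.
        (has_text if len(text) > 100 else needs_ocr).append(pdf_path.name)
    return has_text, needs_ocr

if __name__ == "__main__":
    ok, scans = triage_pdfs("document_dump")
    print(f"{len(ok)} PDFs have extractable text; {len(scans)} look like scans (OCR needed)")
```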

Standard Named Entity Recognition not working
Test of OpenCalais against 5 random articles from various sources versus hand-tagged entities

Overall precision = 77%
Overall recall = 30%

...and this is on the cleanest possible data
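
For reference, precision and recall here are the standard set comparisons against the hand-tagged entities. A small sketch with invented toy entity sets (not the actual test articles), showing the same high-precision, low-recall pattern:

```python
# Sketch: precision/recall of an entity extractor against hand-tagged gold entities.
# The entity strings below are toy data, not the articles from the OpenCalais test.
def precision_recall(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

extracted = {"Vimpelcom", "Uzbekistan", "SEC"}                       # toy extractor output
gold = {"Vimpelcom", "Uzbekistan", "SEC", "MTS", "Maria Kiselyova"}  # toy hand-tagged set

p, r = precision_recall(extracted, gold)
print(f"precision = {p:.0%}, recall = {r:.0%}")  # everything extracted was right,
                                                 # but most gold entities were missed
```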

Meta-lesson
You don't know what the user's problem is

Iterative design loop

A number of previous tools aim to help the user “explore” a document collection (such as [6, 9, 10, 12]), though few of these tools have been evaluated with users from a specific target domain who bring their own data, making us suspect that this imprecise term often masks a lack of understanding of actual user tasks.

Six case studies, four of which were "search" tasks (journalist needed to locate known or suspected evidence)

There are surprisingly few papers that comment on the adoption of a visualization tool without the prompting of designers: in a recent survey of eight hundred visualization papers containing an evaluation component, only five commented on adoption [31]

What are the metrics that count?

Evaluation Methods for Topic Models
Wallach et al., 2009

Better metrics?
• How many stories got done?
  o Are you solving a niche problem?
  o Would resources have been better spent on reporting?

• How long did it take to do the story?
  o Is this faster than using text search?
  o Is it even faster than just reading the documents?
  o How much would it have cost to pay someone to do it?

• What happened after the story was published?

Journalism as a cycle
(Diagram: a cycle connecting Data, Reporting, Story, Distribution, User, and Action)

Use it!
overviewproject.org

Code it!
github.com/overview

Thank you!
Knight Foundation, Google Ideas, Open Syllabus Project
