What do Journalists do with Documents?
Jonathan Stray
Columbia Journalism School

Computational Journalism
Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials'
calendars or meeting notes, and regulators' email
messages that no one today has time or money to
mine. With a suite of reporting tools, a journalist will
be able to scan, transcribe, analyze, and visualize the
patterns in these documents.
- Cohen, Hamilton, Turner, 2011

What do Journalists do with Documents?
A Summary

1. Robust Import

The hardest feature to implement
The most requested, the most used

2. Robust Analysis

What researchers choose
• News articles
• Academic literature
• NLP test data sets

What journalists deal with



• PDF dumps
• Printed, scanned emails
• A million pages scraped from an antique site
• CD full of random files

LAPD Crime Descriptions
VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP 
THEN BEGAN HITTING VICTS IN THE FACE

Standard Named Entity Recognition not working
Test of OpenCalais against 5 random articles from various sources
versus hand-tagged entities

Overall precision = 77%
Overall recall = 30%
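
For reference, precision and recall against hand-tagged entities can be computed as in the minimal sketch below. It assumes exact string matching between extracted and hand-tagged entities, which may not be how the test above scored matches; the function name and the entity lists are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Compare extracted entities to hand-tagged ones using exact string match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of hand-tagged entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical entity lists, not data from the OpenCalais test.
gold = {"Suffolk County", "Newsday", "Adam Playford", "LAPD"}
predicted = {"Suffolk County", "Newsday", "Los Angeles"}

precision, recall = precision_recall(predicted, gold)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

High precision with low recall, as in the numbers above, means most extracted entities are right but most entities in the text are never extracted.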

3. Search, not exploration

A number of previous tools aim to help the user
“explore” a document collection (such as [6, 9, 10, 12]),
though few of these tools have been evaluated with
users from a specific target domain who bring their
own data, making us suspect that this imprecise term
often masks a lack of understanding of actual user
tasks.
Overview: The Design, Adoption, and Analysis of a Visual
Document Mining Tool For Investigative Journalists, Brehmer et al.,
2014

Suffolk County public safety committee transcript,
Reference to a body left on the street due to union dispute
Adam Playford, Newsday, 2014
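
A search task of this kind (finding one passage in a long transcript) can be sketched as a simple term search over pages. This is only an illustration, not the tool used in the Newsday story; the function name and the page texts are made up.

```python
import re


def search_pages(pages, terms):
    """Return 1-based page numbers whose text contains every search term."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    ]
    hits = []
    for number, text in enumerate(pages, start=1):
        if all(pattern.search(text) for pattern in patterns):
            hits.append(number)
    return hits


# Hypothetical page texts standing in for a transcript.
pages = [
    "The committee reviewed the budget for the coming year.",
    "A member asked why the body was left on the street during the dispute.",
    "The union representative responded to questions.",
]

hits = search_pages(pages, ["body", "street"])
print(hits)
```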

4. Quantitative Summaries

Counting incident types by date. For "Level 14," ProPublica, 2015
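
Counting incident types by date can be sketched as below. This assumes the incidents have already been extracted from the documents into (date, type) pairs, which is usually the hard part; the records and the function name are made up for illustration and are not data from the story.

```python
from collections import Counter
from datetime import date


def count_by_month(records):
    """Count incidents per (year-month, incident type)."""
    counts = Counter()
    for day, incident_type in records:
        counts[(day.strftime("%Y-%m"), incident_type)] += 1
    return counts


# Hypothetical extracted records.
records = [
    (date(2015, 1, 3), "assault"),
    (date(2015, 1, 17), "assault"),
    (date(2015, 1, 20), "runaway"),
    (date(2015, 2, 2), "assault"),
]

counts = count_by_month(records)
for (month, incident_type), n in sorted(counts.items()):
    print(month, incident_type, n)
```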

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years
Los Angeles Times, 2015

The Child Exchange, Reuters, 2014

5. Interactive Methods

Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al., 2012

Extracting yes/no answers from database of Foreign Corrupt Practices Act cases.
Comparison by Ariana Giorgi

6. Clarity and Accuracy

We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers did
which kind of work for which sorts of
petitioners. For example, in cases where
workers sue their employers, the lawyers
most successful getting cases before the
court were far more likely to represent the
employers rather than the employees.
The Echo Chamber, Reuters, 2014

Evaluation Methods for Topic Models
Wallach et al., 2009

Interpretation refers to the facility with which an
analyst makes inferences about the data through the
lens of a model abstraction. Trust refers to the actual
and perceived accuracy of an analyst’s inferences.
Interpretation and Trust: Designing Model-driven Visualizations for Text
Analysis, Chuang et al., 2012

Things We Need
• Dirty document corpora
• A shared development platform