What do Journalists do with Documents?
Jonathan Stray
Columbia Journalism School

Computational Journalism
Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials'
calendars or meeting notes, and regulators' email
messages that no one today has time or money to
mine. With a suite of reporting tools, a journalist will
be able to scan, transcribe, analyze, and visualize the
patterns in these documents.
- Cohen, Hamilton, Turner, 2011

What do Journalists do with Documents?

A Summary

1. Robust Import

The hardest feature to implement

The most requested, the most used

2. Robust Analysis

What researchers choose

News articles
Academic literature
NLP test data sets

What journalists deal with

PDF dumps
Printed, scanned emails
A million pages scraped from an antique site
CD full of random files

LAPD Crime Descriptions


Standard Named Entity Recognition not working
Test of OpenCalais against 5 random articles from various sources
versus hand-tagged entities

Overall precision = 77%

Overall recall = 30%
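
For readers unfamiliar with the two figures above, the following is a minimal sketch of how precision and recall can be computed when a tool's extracted entities are compared against a hand-tagged set. The entity names and the `precision_recall` helper are hypothetical illustrations, not the data or code used in the OpenCalais test.

```python
def precision_recall(predicted, gold):
    """Compare extracted entities to hand-tagged ones (exact string match)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of hand-tagged entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-tagged entities for one article.
gold = {"Jane Doe", "Acme Corp", "Springfield", "John Roe", "City Council"}
# Hypothetical tool output: two correct entities, one spurious.
predicted = {"Acme Corp", "Springfield", "Monday"}

p, r = precision_recall(predicted, gold)
print(f"precision = {p:.0%}, recall = {r:.0%}")
```

High precision with low recall, as in the test above, means that most of what the tool returns is right but most of the entities in the text are missed.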

3. Search, not exploration

A number of previous tools aim to help the user
explore a document collection (such as [6, 9, 10, 12]),
though few of these tools have been evaluated with
users from a specific target domain who bring their
own data, making us suspect that this imprecise term
often masks a lack of understanding of actual user
Overview: The Design, Adoption, and Analysis of a Visual
Document Mining Tool For Investigative Journalists, Brehmer et al,

Suffolk County public safety committee transcript,

Reference to a body left on the street due to union dispute
Adam Playford, Newsday, 2014

4. Quantitative Summaries

Count incident types by date. For Level 14, ProPublica, 2015
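
A quantitative summary of this kind can be as simple as a tally. The sketch below counts incident types per date using only the Python standard library; the records and field layout are invented for illustration and do not come from the ProPublica story.

```python
from collections import Counter
from datetime import date

# Hypothetical (date, incident type) pairs, e.g. extracted from incident reports.
records = [
    (date(2015, 1, 3), "assault"),
    (date(2015, 1, 3), "assault"),
    (date(2015, 1, 3), "runaway"),
    (date(2015, 1, 4), "assault"),
]

# Tally how many incidents of each type occurred on each date.
counts = Counter((day.isoformat(), kind) for day, kind in records)

for (day, kind), n in sorted(counts.items()):
    print(day, kind, n)
```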

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years

Los Angeles Times, 2015

The Child Exchange, Reuters, 2014

5. Interactive Methods

Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al, 2012

Extracting yes/no answers from database of Foreign Corrupt Practices Act cases.
Comparison by Ariana Giorgi

6. Clarity and Accuracy

We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers did
which kind of work for which sorts of
petitioners. For example, in cases where
workers sue their employers, the lawyers
most successful getting cases before the
court were far more likely to represent the
employers rather than the employees.
The Echo Chamber, Reuters, 2014

Evaluation Methods for Topic Models

Wallach et al., 2009

Interpretation refers to the facility with which an
analyst makes inferences about the data through the
lens of a model abstraction. Trust refers to the actual
and perceived accuracy of an analyst's inferences.
Interpretation and Trust: Designing Model-driven Visualizations for Text
Analysis, Chuang et al. 2012

Things We Need
Dirty document corpora
A shared development platform