Frontiers of

Computational Journalism
Columbia Journalism School
Week 4: Computational Journalism Platforms

September 29, 2017
This class
• What do journalists do with documents
• The Computational Journalism Workbench
• Plate Notation
• NYT Recommender
What do journalists do with documents?
Overview prototype running on Iraq security contractor docs, Feb 2012
Technical troubles with a new system meant that almost 70,000 North
Carolina residents received their food stamps late this summer. That’s
8.5 percent of the clients the state currently serves every
month. The problem was eventually traced to web browser
compatibility issues. WRAL reporter Tyler Dukes obtained 4,500 pages
of emails — on paper — from various government departments and
used DocumentCloud and Overview to piece together this story.
Dukes used Overview’s “topic tree” (TF-IDF clustering) to find a
group of key emails from a listserv.
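The topic tree starts from TF-IDF document vectors: each term is weighted by how often it appears in a document, discounted by how many documents contain it. A minimal pure-Python sketch of the weighting step (the toy “emails” and terms are illustrative, not from the WRAL data):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # document frequency: how many docs contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy example: three short "emails" (made up for illustration)
docs = [
    ["food", "stamps", "delay", "browser"],
    ["food", "stamps", "late", "payment"],
    ["meeting", "agenda", "minutes"],
]
w = tf_idf(docs)
# "food" appears in 2 of 3 docs, so it is weighted lower than "browser"
```

Clustering documents by similarity of these vectors is what groups related emails into branches of the tree.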
What do Journalists do with Documents, Stray 2016
1. Robust Import
The hardest feature to implement
The most requested, the most used
2. Robust Analysis
What researchers choose
• News articles
• Academic literature
• NLP test data sets

What journalists deal with
• PDF dumps
• Printed, scanned emails
• A million pages scraped from an antique site
• CD full of random files
LAPD Crime Descriptions

Entity recognition is not solved!
[Chart: entities found, out of 150]

Incredibly dirty source data. Current methods have low recall (~70%)
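Recall here is the fraction of true entities the extractor actually finds. A minimal sketch of measuring it against a hand-labeled gold set (the entity names and counts are hypothetical, chosen only to illustrate a ~70% recall):

```python
def entity_recall(found, gold):
    """Fraction of gold-standard entities that the extractor found."""
    found, gold = set(found), set(gold)
    return len(found & gold) / len(gold)

# Hypothetical gold set of 10 labeled entities; extractor finds 7
gold = ["LAPD", "Hollywood", "Van Nuys", "77th Street", "Rampart",
        "Pacific", "Harbor", "Newton", "Wilshire", "Olympic"]
found = gold[:7] + ["not-an-entity"]

recall = entity_recall(found, gold)  # 0.7, i.e. ~70%
```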
3. Search, not exploration
A number of previous tools aim to help the user “explore”
a document collection (such as [6, 9, 10, 12]), though few
of these tools have been evaluated with users from a
specific target domain who bring their own data, making
us suspect that this imprecise term often masks a lack of
understanding of actual user tasks.

Overview: The Design, Adoption, and Analysis of a Visual Document
Mining Tool For Investigative Journalists, Brehmer et al, 2014
Suffolk County public safety committee transcript,
Reference to a body left on the street due to a union dispute
(via Adam Playford, Newsday, 2014)
4. Quantitative Summaries
Count incident types by date. For Level 14, ProPublica, 2015
LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years
Los Angeles Times, 2015
The Child Exchange, Reuters, 2014
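Counting incident types by date, as in the Level 14 story, is a one-liner once the incidents are structured. A toy sketch (the dates and incident types are made up for illustration):

```python
from collections import Counter

# Toy incident log: (date, incident type) pairs, illustrative only
incidents = [
    ("2015-03-01", "assault"),
    ("2015-03-01", "restraint"),
    ("2015-03-02", "assault"),
    ("2015-03-02", "assault"),
]

# Tally each (date, type) pair — the quantitative summary behind the chart
counts = Counter(incidents)
# counts[("2015-03-02", "assault")] == 2
```

The hard part in practice is the extraction step that turns dirty documents into those structured pairs, not the counting.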
5. Interactive Methods
Design Study Methodology: Reflections from the Trenches and the Stacks,
Sedlmair et al, 2012
Extracting yes/no answers from a database of Foreign Corrupt Practices
Act cases. Comparison by Ariana Giorgi
6. Clarity and Accuracy
We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers
did which kind of work for which sorts
of petitioners. For example, in cases
where workers sue their employers, the
lawyers most successful at getting cases
before the court were far more likely to
represent the employers rather than
the employees.

The Echo Chamber, Reuters, 2014
Evaluation Methods for Topic Models
Wallach et al., 2009
Interpretation refers to the facility with which an
analyst makes inferences about the data through
the lens of a model abstraction. Trust refers to the
actual and perceived accuracy of an analyst’s
inferences.

Interpretation and Trust: Designing Model-driven Visualizations
for Text Analysis, Chuang et al. 2012
Overview prototype running on Wikileaks cables, early 2012
Overview circa 2014
Overview today
Overview Entity and Multisearch plugins
Overview plugin API
Computational Journalism Workbench
Plate Notation
Probability graphs

Node = variable
Edge = dependence (“sampled from”)
Filled node = observed data
Choose a topic for each word

Both PLSA and LDA model each document as a distribution over
topics. Each word belongs to a single topic.
LDA Plate Notation
[Diagram: LDA in plate notation]
concentration parameter → topics in doc → topic for word → word in doc (observed)
concentration parameter → words in topic → word in doc
Plates: N words in doc, D docs, K topics
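The generative story the diagram encodes can be sketched directly: for each topic, draw a distribution over words; for each document, draw a distribution over topics; for each word slot, choose a topic, then choose a word from that topic. A toy sampler assuming symmetric Dirichlet priors (the sizes and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 8, 2, 5   # topics, vocab size, docs, words per doc
alpha, eta = 0.1, 0.01    # concentration parameters

# words-in-topic distributions: one over the vocabulary per topic (K plate)
phi = rng.dirichlet(np.full(V, eta), size=K)

docs = []
for _ in range(D):                            # D plate
    theta = rng.dirichlet(np.full(K, alpha))  # topics in doc
    words = []
    for _ in range(N):                        # N plate
        z = rng.choice(K, p=theta)            # topic for word
        w = rng.choice(V, p=phi[z])           # word in doc (observed)
        words.append(w)
    docs.append(words)
```

Inference runs this story in reverse: only the words are observed (the filled node), and the topic assignments and distributions are estimated from them.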
New York Times recommender
Combining collaborative filtering
and topic modeling
Collaborative Topic Modeling
[Diagram: collaborative topic modeling in plate notation]
Content side (as in LDA): concentration parameter → topics in doc → topic for word → word in doc, over K topics.
Collaborative side: weight of user selections → per-user topics (with per-user variation) → user rating of doc.
A doc’s topics are “content only” before readers arrive, “content + collaborative” after.
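The rating side of the model can be sketched in a few lines: a user’s predicted rating of a document is the dot product of the user’s topic weights with the document’s topic vector, where that vector is the content topics plus a collaborative offset learned from reader behavior. A toy illustration (the numbers are invented; this is not the NYT’s actual code):

```python
import numpy as np

K = 4  # topics

theta = np.array([0.7, 0.2, 0.1, 0.0])     # topics in doc (content only)
epsilon = np.array([-0.1, 0.0, 0.1, 0.0])  # collaborative offset from readers
user = np.array([0.5, 0.1, 0.9, 0.0])      # weight of user selections per topic

doc = theta + epsilon  # content + collaborative representation
rating = user @ doc    # predicted user rating of doc
```

The offset lets reader behavior pull a document away from what its text alone says it is about, which is the point of combining the two signals.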