Frontiers of Computational Journalism

Columbia Journalism School Week 4: Computational Journalism Platforms

September 29, 2017

This class

What do journalists do with Documents

The Computational Journalism Workbench

Plate Notation

NYT Recommender

What do journalists do with documents?
Overview prototype running on Iraq security contractor docs, Feb 2012
Technical troubles with a new system meant that almost 70,000 North Carolina residents received their food stamps late this summer. That’s 8.5 percent of the number of clients the state currently serves every month. The problem was eventually traced to web browser compatibility issues. WRAL reporter Tyler Dukes obtained 4,500 pages of emails — on paper — from various government departments and used DocumentCloud and Overview to piece together this story.

https://blog.overviewdocs.com/completed-stories/

Used Overview’s “topic tree” (TF-IDF clustering) to find a group of key emails from a listserv.
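As a rough illustration of what TF-IDF clustering works from, the sketch below weights terms by TF-IDF and compares documents by cosine similarity. It is a generic, minimal sketch, not Overview's actual implementation; the function names and the email snippets are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using TF * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical email snippets, invented for illustration.
emails = [
    "browser compatibility issue delays benefits",
    "benefits delays traced to browser issue",
    "lunch menu for friday staff meeting",
]
vecs = tfidf_vectors(emails)
sim_related = cosine(vecs[0], vecs[1])    # share several weighted terms
sim_unrelated = cosine(vecs[0], vecs[2])  # share no terms
```

A clustering step would then group documents whose vectors are close, which is how a cluster of related listserv emails can surface together.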

What do Journalists do with Documents, Stray 2016
1. Robust Import
The hardest feature to implement
The most requested, the most used
2. Robust Analysis
What researchers choose

News articles

Academic literature

NLP test data sets

What journalists deal with

PDF dumps

Printed, scanned emails

A million pages scraped from an antique site

CD full of random files


LAPD Crime Descriptions

VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE


Entity recognition is not solved!


Entities found out of 150


Incredibly dirty source data. Current methods have low recall (~70%)
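Recall here means the share of true entities that the extractor actually finds. A minimal sketch, with invented entity sets standing in for a hand-labelled sample and an extractor's output:

```python
def recall(found, relevant):
    """Fraction of the true entities that the extractor actually found."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(found) & relevant) / len(relevant)

# Hypothetical hand-labelled entities and extractor output.
truth = {"LAPD", "Los Angeles", "John Smith", "Main Street"}
extracted = {"LAPD", "Los Angeles", "Main", "VICTS"}
r = recall(extracted, truth)  # 2 of the 4 true entities were found
```

Assuming 150 is the number of true entities in the checked sample, recall of about 70% would mean roughly 45 entities are never surfaced to the reporter.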

3. Search, not exploration

A number of previous tools aim to help the user “explore” a document collection (such as [6, 9, 10, 12]), though few of these tools have been evaluated with users from a specific target domain who bring their own data, making us suspect that this imprecise term often masks a lack of understanding of actual user tasks.

Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists, Brehmer et al, 2014

Suffolk County public safety committee transcript, Reference to a body left on the street due to union dispute (via Adam Playford, Newsday, 2014)

4. Quantitative Summaries
Count incident types by date. For Level 14, ProPublica, 2015
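A quantitative summary of this kind reduces to counting (date, type) pairs once each document has been tagged. The sketch below uses invented records; the dates and incident types are hypothetical and are not taken from the ProPublica story.

```python
from collections import Counter

# Hypothetical incident records: (date, incident type).
incidents = [
    ("2014-03-01", "restraint"),
    ("2014-03-01", "assault"),
    ("2014-03-01", "restraint"),
    ("2014-03-02", "runaway"),
]

# Count how many times each incident type occurs on each date.
counts = Counter(incidents)
for (date, kind), n in sorted(counts.items()):
    print(date, kind, n)
```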

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years, Los Angeles Times, 2015
The Child Exchange, Reuters, 2014
5. Interactive Methods
Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al, 2012
Extracting yes/no answers from database of Foreign Corrupt Practices Act cases. Comparison by Ariana Giorgi

6. Clarity and Accuracy
We used a machine-learning method known as latent Dirichlet allocation to identify the topics in all 14,400 petitions and to then categorize the briefs. This enabled us to identify which lawyers did which kind of work for which sorts of petitioners. For example, in cases where workers sue their employers, the lawyers most successful getting cases before the court were far more likely to represent the employers rather than the employees.

The Echo Chamber, Reuters, 2014

Evaluation Methods for Topic Models, Wallach et al., 2009

Interpretation refers to the facility with which an analyst makes inferences about the data through the lens of a model abstraction. Trust refers to the actual and perceived accuracy of an analyst’s inferences.

Interpretation and Trust: Designing Model-driven Visualizations for Text Analysis, Chuang et al. 2012

Overview prototype running on Wikileaks cables, early 2012
Overview circa 2014
Overviewdocs.com today
Overview Entity and Multisearch plugins
Overview plugin API
Computational Journalism Workbench
cjworkbench.org
Plate Notation

Probability graphs

Node = variable
Edge = dependence (“sampled from”)
Filled node = observed data

Choose a topic for each word


Both PLSA and LDA model each document as a distribution over topics. Each word belongs to a single topic.
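The "document as a distribution over topics, one topic per word" idea can be sketched as a generative process. This is a minimal sketch of the LDA version (PLSA has no Dirichlet prior on the document mixture); the two toy topics, the value of `alpha`, and all function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document: a topic mixture, then one topic per word."""
    k = len(topic_word)
    theta = sample_dirichlet(alpha, k, rng)  # topics in doc
    topics, words = [], []
    for _ in range(n_words):
        # Choose a single topic for this word from the document's mixture.
        z = rng.choices(range(k), weights=theta)[0]
        vocab = list(topic_word[z])
        # Choose the word from that topic's distribution over words.
        w = rng.choices(vocab, weights=[topic_word[z][v] for v in vocab])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical topics: each maps words to probabilities (words in topics).
topic_word = [
    {"court": 0.5, "lawyer": 0.3, "petition": 0.2},
    {"police": 0.5, "assault": 0.3, "crime": 0.2},
]
rng = random.Random(0)
theta, topics, words = generate_document(topic_word, alpha=0.5, n_words=8, rng=rng)
```

Fitting a topic model runs this story in reverse: given only the observed words, it infers the per-document mixtures and per-topic word distributions.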

LDA Plate Notation
[Figure: LDA plate diagram. Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: N words, D docs, K topics.]
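Read as a probability graph, the LDA plate diagram corresponds to a factored joint distribution. The symbols below are the conventional ones and are an assumption, since the slide labels nodes in words: \alpha and \eta are the topic and word concentration parameters, \theta_d is "topics in doc", z_{d,n} is "topic for word", w_{d,n} is "word in doc", and \beta_k is "words in topics".

```latex
p(\theta, \beta, z, w \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{z_{d,n}})
```

Each product matches one plate (K topics, D docs, N words), and each factor matches one edge into a node. Only w_{d,n} is observed, so it is the filled node.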
New York Times recommender

Combining collaborative filtering and topic modeling


Collaborative Topic Modeling

[Figure: collaborative topic modeling plate diagram. Labels: topic concentration, topics in doc (content), topic for word, word in doc, K topics, weight of user selections, topics in doc (collaborative), user rating of doc, topics for user, variation in per-user topics.]
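One way to read the diagram is that a document's vector is its content topics plus an offset learned from user behaviour, and a predicted rating is the dot product of that vector with the user's topic vector. The sketch below assumes the collaborative topic regression formulation of Wang and Blei (2011); how the NYT recommender implements it is not stated here, and all names and numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict_rating(user_topics, doc_topics, doc_offset=None):
    """Predicted rating = user vector . (content topics + collaborative offset)."""
    if doc_offset is None:
        # No reader data for this document yet: fall back to content only.
        doc_offset = [0.0] * len(doc_topics)
    doc_vector = [t + e for t, e in zip(doc_topics, doc_offset)]
    return dot(user_topics, doc_vector)

user = [0.9, 0.1]     # topics for user (hypothetical)
theta = [0.2, 0.8]    # topics in doc, from content (hypothetical)
offset = [0.5, -0.3]  # shift learned from user selections (hypothetical)

content_only = predict_rating(user, theta)
content_plus_social = predict_rating(user, theta, offset)
```

With no offset the prediction relies on the text alone, which is what lets a brand-new article be recommended before anyone has read it.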

[Figure panels: content only; content + social]