Frontiers of

Computational Journalism
Columbia Journalism School
Week 1: Introduction, Clustering

September 8, 2017
Computational Journalism: Definitions

“Broadly defined, it can involve changing how stories are
discovered, presented, aggregated, monetized, and
archived. Computation can advance journalism by
drawing on innovations in topic detection, video analysis,
personalization, aggregation, visualization, and

- Cohen, Hamilton, Turner, Computational Journalism, 2011
Computational Journalism: Definitions

“Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials' calendars
or meeting notes, and regulators' email messages that no
one today has time or money to mine. With a suite of
reporting tools, a journalist will be able to scan, transcribe,
analyze, and visualize the patterns in these documents.”

- Cohen, Hamilton, Turner, Computational Journalism, 2011
Computational Journalism: Definitions

“The application of computer science to the problems of
public information, knowledge, and belief, by practitioners
who see their mission as outside of both commerce and

- Jonathan Stray, A Computational Journalism Reading List, 2011
Cohen et al. model

Data Reporting


CS for presentation / interaction


Data Reporting

Filter stories for user




Reporting Filtering

CS CS User
Examples of filters

• Facebook news feed
• What an editor puts on the front page
• Google News
• Reddit’s comment system
• Twitter
• New York Times recommendation system
Kony 2012 early network, by Gilad Lotan
CS in Journalism



Reporting Filtering Effects

Journalism as a cycle


Effects Data



User CS

Journalism with algorithms
Journalism about algorithms
Websites Vary Prices, Deals Based on Users' Information
Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012
Computer Science and Journalism

Reporting on society, using computation
Reporting on computation in society
• Information Retrieval
• Visualization
• Natural Language Processing
• Text Analysis
• Sociology
• Filter Design
• Social Network Analysis
• Graph Theory
• Artificial Intelligence
• Knowledge Representation
• Drawing Conclusions
• Cognitive Science
• Statistics
• Epistemology
Course Structure
Unit 1: Filters
Information retrieval, TF-IDF, topic modeling, search engines, social filtering, filtering
system design.

Unit 2: Interpreting Data
Quantification, data quality, statistical basics, causation, prediction, narratives.

Unit 3: Methods
Visualization, knowledge representation, social network analysis, privacy and
security, tracking flow and effects

Some assignments require programming, but
your writing counts for more than your code!

Final project
Code, story, or research

Course blog

40% assignments
40% final project
20% class participation
This class
• Introduction
• Classification and clustering
• Text analysis in journalism
• The Document Vector Space model
Classification and Clustering
Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the
foundation not only for conceptualization, language,
and speech, but also for mathematics, statistics, and
data analysis in general.

Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to
Classification Techniques
Interpreting High Dimensional Data

UK House of Lords voting record, 2000-2012.
N = 1043 lords by M = 1630 votes
2 = aye, 4 = nay, -9 = didn't vote
Distance metric
d(x, y) ≥ 0
- distance is never negative

d(x, x) = 0
- “reflexivity”: zero distance to self

d(x, y) = d(y, x)
- “symmetry”: x to y same as y to x

d(x, z) ≤ d(x, y) + d(y, z)
- “triangle inequality”: going direct is never longer than going via a third point
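
The four properties can be checked numerically on a sample of points. The following is a minimal sketch, assuming NumPy and using Euclidean distance as the example metric; the helper names `euclidean` and `is_metric` and the random test points are illustrative and not part of the slides.

```python
import itertools
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def is_metric(d, points, tol=1e-9):
    """Check the four distance-metric properties on every combination of points."""
    for x in points:
        if abs(d(x, x)) > tol:                      # zero distance to self
            return False
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:                          # never negative
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:       # triangle inequality
            return False
    return True

rng = np.random.default_rng(0)
points = [rng.normal(size=5) for _ in range(8)]
euclidean_ok = is_metric(euclidean, points)

# Squared Euclidean distance fails: d(0, 2) = 4 but d(0, 1) + d(1, 2) = 2.
line = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
squared_ok = is_metric(lambda x, y: euclidean(x, y) ** 2, line)
```

A check like this can only find counterexamples on the sample it is given; passing it does not prove that a function is a metric.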
Distance matrix
Data matrix for M objects of N dimensions
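$$
X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}
= \begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\
x_{2,1} & x_{2,2} & & \\
\vdots & & \ddots & \\
x_{M,1} & & & x_{M,N}
\end{bmatrix}
$$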

Distance matrix
$$
D_{ij} = D_{ji} = d(x_i, x_j), \qquad
D = \begin{bmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\
d_{2,1} & d_{2,2} & & \\
\vdots & & \ddots & \\
d_{M,1} & & & d_{M,M}
\end{bmatrix}
$$
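
A distance matrix can be computed directly from a data matrix. This is a small sketch, assuming NumPy, Euclidean distance as d, and one object per row of X; the function name `distance_matrix` and the three sample points are made up for illustration.

```python
import numpy as np

def distance_matrix(X):
    """Return the M x M matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # shape (M, M, N): all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = distance_matrix(X)
```

The result is symmetric with zeros on the diagonal, as the metric properties require.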
Different clustering algorithms
• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means
o Also LDA and many Bayesian models, from a certain perspective

• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
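
As an illustration of the partitioning approach, here is a minimal K-means sketch (Lloyd's algorithm), assuming NumPy and Euclidean distance. The function name `kmeans`, the choice to pass initial centroids explicitly, and the toy data are assumptions for this example; practical implementations usually pick random starting centroids and run several restarts.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=100):
    """Alternate point assignment and centroid update until assignments stop changing."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)   # copy, so the input is not modified
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                       # converged
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                        # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two obvious groups; both starting centroids are deliberately taken from the first group.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans(X, init_centroids=X[:2])
```

Even with a poor start, the clusters on this toy data settle after a few rounds of adjustment. On real data, different starting points can give different final clusters.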
K-means demo
UK House of Lords voting clusters
Algorithm instructed to separate the lords into five clusters. Output:

1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
Voting clusters with parties
LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
Clustering Algorithm

Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.


Input: data points (feature vectors).
Output: a picture of the points.
Dimensionality reduction
Problem: vector space is high-dimensional. Up to
thousands of dimensions. The screen is two-dimensional.

We have to go from
$x \in \mathbb{R}^N$
to much lower dimensional points
$y \in \mathbb{R}^K$, with $K \ll N$

Probably K=2 or K=3.

Projection from 3 to 2 dimensions
Which direction should we look from?
Principal components analysis: find a linear projection
that preserves greatest variance

Take first K eigenvectors of covariance matrix corresponding to
largest eigenvalues. This gives a K-dimensional sub-space for
Linear projections

Projects in a straight line to the
closest point on the "screen."

y = Px

where P is a K by N matrix.
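
One way to obtain such a P is principal components analysis. The following is a rough sketch, assuming NumPy and data points stored one per row; the function name `pca_projection` and the toy data are illustrative, and centering the data before projecting is a standard step that the slide does not state.

```python
import numpy as np

def pca_projection(X, k):
    """Return (P, Y): P is the K x N projection matrix, Y holds the projected points."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the K largest eigenvalues
    P = eigvecs[:, order].T                      # rows of P are the top K eigenvectors
    Y = Xc @ P.T                                 # y = P x for every centered point
    return P, Y

# Points that lie along the direction (1, 1, 0) in three dimensions.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [3.0, 3.0, 0.0]])
P, Y = pca_projection(X, k=1)
```

For this data the single row of P points along (1, 1, 0), up to sign, because all of the variance lies in that direction.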

Projection from 2 to 1 dimensions
Nonlinear projections
Still going from high-
dimensional x to low-
dimensional y, but now

y = f(x)

for some function f(), not
linear. So, may not
preserve relative
distances, angles, etc.
Fish-eye projection from 3 to 2 dimensions
Multidimensional scaling
Idea: try to preserve distances between points "as much as

If we have the distances between all points in a distance matrix,

$D_{ij} = \lvert x_i - x_j \rvert$ for all i, j

We can recover the original {xi} coordinates exactly (up to rigid
transformations.) Like working out a country map if you know how
far away each city is from every other.
Multidimensional scaling
Torgerson's "classical MDS" algorithm (1952)
Reducing dimension with MDS
Notice: dimension N is not encoded in the distance matrix D (it’s
M by M where M is number of points)

MDS formula (theoretically) allows us to recover point
coordinates {x} in any number of dimensions k.
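
The recovery of coordinates from distances can be sketched as follows. This is the textbook form of classical MDS (double-center the squared distances, then take the top eigenvectors), written with NumPy; it is an illustration under those assumptions rather than a transcription of the formula on the slide, and the name `classical_mds` and the rectangle example are made up.

```python
import numpy as np

def classical_mds(D, k):
    """Embed M points in k dimensions from an M x M matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending order
    order = np.argsort(eigvals)[::-1][:k]        # k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))   # clip round-off negatives
    return eigvecs[:, order] * scale             # M x k coordinates

# Corners of a 3 by 4 rectangle: build D, then recover coordinates from D alone.
X = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
D_recovered = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

The recovered points Y may be rotated, reflected, or shifted relative to X, but their pairwise distances match D. Choosing k smaller than the true dimension gives an approximate embedding instead.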
MDS Stress minimization
The formula actually minimizes “stress”
$$\mathrm{stress}(x) = \sum_{i,j} \left( \lvert x_i - x_j \rvert - d_{ij} \right)^2$$
Think of “springs” between every pair of points. Spring between xi,
xj has rest length dij

Stress is zero if all high-dimensional distances matched exactly in
low dimension.
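
A small sketch of the stress computation, assuming NumPy and the usual squared form of the mismatch between low-dimensional distances and the target distances (the exponent is an assumption here). The function name `stress` and the triangle example are illustrative.

```python
import numpy as np

def stress(Y, D):
    """Sum over all ordered pairs (i, j) of (|y_i - y_j| - d_ij)^2."""
    Y = np.asarray(Y, dtype=float)
    low_d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return float(((low_d - np.asarray(D, dtype=float)) ** 2).sum())

# Target: three points that are all at distance 1 from each other.
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

# In two dimensions an equilateral triangle matches every distance.
Y_2d = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
# On a line, the third point cannot be at distance 1 from both of the others.
Y_1d = np.array([[0.0], [1.0], [0.5]])

stress_2d = stress(Y_2d, D)
stress_1d = stress(Y_1d, D)
```

The two-dimensional layout has essentially zero stress, while the one-dimensional layout has to stretch or compress some of the "springs".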
Multi-dimensional Scaling

Like "flattening" a
stretchy structure into
2D, so that distances
between points are
preserved (as much as
House of Lords MDS plot
Robustness of results
Regarding these analyses of legislative voting, we could still ask:
• Are we modeling the right thing? (What about other
legislative work, e.g. in committee?)
• Are our underlying assumptions correct? (do representatives
really have “ideal points” in a preference space?)
• What are we trying to argue? What will be the effect of
pointing out this result?
Text Analysis in Journalism
USA Today/Twitter Political Issues Index
Politico analysis of GOP primary, 2012
CNN State of the Union Twitter analysis, 2010
The Post obtained draft versions of 12 audits by the inspector general’s office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general’s office and interviewed former and current employees. E-
mails and other internal records also were reviewed.

The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.

Sentiment analysis used by Washington Post, 2014
LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years
Los Angeles Times
The Times analyzed Los Angeles Police Department violent crime
data from 2005 to 2012. Our analysis found that the Los Angeles
Police Department misclassified an estimated 14,000 serious assaults
as minor offenses, artificially lowering the city’s crime levels. To
conduct the analysis, The Times used an algorithm that combined
two machine learning classifiers. Each classifier read in a brief
description of the crime, which it used to determine if it was a minor
or serious assault.

An example of a minor assault reads: "VICTS AND SUSPS BECAME
We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers did
which kind of work for which sorts of
petitioners. For example, in cases where
workers sue their employers, the lawyers
most successful getting cases before the
court were far more likely to represent the
employers rather than the employees.

The Echo Chamber, Reuters
Document Vector Space Model
Documents, not words
We can use clustering and classification techniques if we
can convert documents into vectors.

As before, we want to find numerical “features” that
describe the document.

How do we capture the meaning of a document in
What is this document "about"?
The most commonly occurring words are a pretty good indicator.

30 the
23 to
19 and
19 a
18 animal
17 cruelty
15 of
15 crimes
14 in
14 for
11 that
8 crime
7 we
Features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of document set.

Value on each dimension = # of times word appears in
D1 = “I like databases”
D2 = “I hate hate databases”

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = “term frequency”
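
The matrix for D1 and D2 can be built in a few lines. This is a minimal sketch using only the Python standard library; lowercasing, splitting on whitespace, and sorting the vocabulary alphabetically are choices made for this example.

```python
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}

# Lowercase and split on whitespace (a deliberately naive tokenizer).
tokens = {name: text.lower().split() for name, text in docs.items()}

# The vocabulary defines the dimensions; sorting gives a fixed column order.
vocabulary = sorted({word for words in tokens.values() for word in words})

# One row per document; each entry is tf(t, d), the count of term t in document d.
matrix = {}
for name, words in tokens.items():
    counts = Counter(words)
    matrix[name] = [counts[term] for term in vocabulary]
```

With the columns ordered as `databases, hate, i, like`, D1 becomes `[1, 0, 1, 1]` and D2 becomes `[1, 2, 1, 0]`.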
Aka “Bag of words” model

Throws out word order.

e.g. “soldiers shot civilians” and “civilians shot soldiers”
encoded identically.
The documents come to us as long strings, not individual
words. Tokenization is the process of converting the string
into individual words, or "tokens."

For this course, we will assume a very simple strategy:
o convert all letters to lowercase
o remove all punctuation characters
o separate words based on spaces

Note that this won't work at all for Chinese. It will fail in
many ways even for English. How?
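
The three steps can be sketched with the Python standard library. The function name `tokenize` and the sample sentence are made up for illustration, and "punctuation" is assumed to mean the ASCII characters in `string.punctuation`.

```python
import string

def tokenize(text):
    """Lowercase, strip ASCII punctuation, then split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

tokens = tokenize("The U.S. doesn't track state-of-the-art data, does it?")
# tokens == ['the', 'us', 'doesnt', 'track', 'stateoftheart', 'data', 'does', 'it']
```

The sample output already hints at some problems: "U.S." collapses into "us", and the hyphenated phrase fuses into a single token.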
Distance metric for text

Useful for:
• clustering documents
• finding docs similar to example
• matching a search query

Basic idea: look for overlapping terms
Cosine similarity
Given document vectors a,b define

similarity(a, b) ≡ a • b
If each word occurs exactly once in each document,
equivalent to counting overlapping words.

Note: not a distance function, as similarity increases when
documents are… similar. (What part of the definition of a
distance function is violated here?)
Problem: long documents always win

Let a = “This car runs fast.”
Let b = “My car is old. I want a new car, a shiny car”
Let query = “fast car”

this car runs fast my is old I want a new shiny

a 1 1 1 1 0 0 0 0 0 0 0 0

b 0 3 0 0 1 1 1 1 1 1 1 1

q 0 1 0 1 0 0 0 0 0 0 0 0
Problem: long documents always win

similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2
similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3

Longer document more “similar”, by virtue of repeating
Normalize document vectors

$$\mathrm{similarity}(a, b) \equiv \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert} = \cos(\Theta)$$

returns result in [0,1]
Normalized query example
this car runs fast my is old I want a new shiny

a 1 1 1 1 0 0 0 0 0 0 0 0

b 0 3 0 0 1 1 1 1 1 1 1 1

q 0 1 0 1 0 0 0 0 0 0 0 0

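$$\mathrm{similarity}(a, q) = \frac{2}{\sqrt{4}\,\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707$$

$$\mathrm{similarity}(b, q) = \frac{3}{\sqrt{17}\,\sqrt{2}} \approx 0.514$$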
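
The same numbers can be reproduced in code. This sketch assumes NumPy and copies the vectors a, b and q from the table above; the function name `cosine_similarity` is chosen for this example.

```python
import numpy as np

# Columns: this car runs fast my is old I want a new shiny
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Unnormalized dot products: the longer document b scores higher.
raw_a, raw_b = int(a @ q), int(b @ q)

# Normalized: the shorter, more relevant document a now scores higher.
sim_a, sim_b = cosine_similarity(a, q), cosine_similarity(b, q)
```

Subtracting either similarity from 1 turns it into a dissimilarity score in which 0 means the two vectors point in the same direction.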
Cosine similarity

$$\cos\theta = \mathrm{similarity}(a, b) \equiv \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert}$$
Cosine distance (finally)

$$\mathrm{dist}(a, b) \equiv 1 - \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert}$$
Problem: common words
We want to look at words that “discriminate” among

Stopwords: if all documents contain “the,” are all
documents similar?

Common words: if most documents contain “car” then car
doesn’t tell us much about (contextual) similarity.
Context matters
[Figure: two document collections, “General News” and “Car Reviews”, with each document marked by whether or not it contains “car”.]
Document Frequency

Idea: de-weight common words
Common = appears in many documents

$$\mathrm{df}(t, D) = \frac{\lvert \{ d \in D : t \in d \} \rvert}{\lvert D \rvert}$$

“document frequency” = fraction of docs
containing term
Inverse Document Frequency

Invert (so more common = smaller weight) and take

$$\mathrm{idf}(t, D) = \log \left( \frac{\lvert D \rvert}{\lvert \{ d \in D : t \in d \} \rvert} \right)$$
Multiply term frequency by inverse document frequency

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) = n(t, d) \cdot \log \left( \frac{\lvert D \rvert}{n(t, D)} \right)$$
n(t,d) = number of times term t appears in doc d
n(t,D) = number of docs in D containing t
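
A minimal sketch of the computation, using only the Python standard library and the natural logarithm. The function name `tfidf_vectors` and the three tiny documents are invented for illustration; production libraries often use smoothed or normalized variants of this weighting.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} dict per document."""
    n_docs = len(docs)
    # n(t, D): number of documents containing term t.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)                        # n(t, d)
        vectors.append({term: n * math.log(n_docs / doc_freq[term])
                        for term, n in counts.items()})
    return vectors

docs = [["the", "car", "runs", "fast"],
        ["the", "car", "is", "old"],
        ["the", "court", "ruled"]]
vectors = tfidf_vectors(docs)
```

In this toy corpus "the" appears in every document, so its weight is log(3/3) = 0. "car" appears in two of three documents and gets a smaller weight than "runs", which appears in only one.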
TF-IDF depends on entire corpus
The TF-IDF vector for a document changes if we
add another document to the corpus.

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

if we add a document, D changes!
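
A short illustration of this dependence, assuming the logarithmic inverse document frequency with the natural logarithm; the function name `idf` and the tiny corpus are hypothetical.

```python
import math

def idf(term, docs):
    """log(|D| / number of docs containing term); assumes the term occurs in at least one doc."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

corpus = [{"car", "fast"}, {"court", "ruled"}]
before = idf("car", corpus)                       # log(2 / 1)
after = idf("car", corpus + [{"car", "old"}])     # log(3 / 2)
```

Adding one more document that mentions "car" lowers the weight of "car" in every document vector, including documents that did not change.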

TF-IDF is sensitive to context. The context is all other
What is this document "about"?
Each document is now a vector of TF-IDF scores for every word in the
document. We can look at which words have the top scores.
crimes 0.0675591652263963
cruelty 0.0585772393867342
crime 0.0257614113616027
reporting 0.0208838148975406
animals 0.0179258756717422
michael 0.0156575858658684
category 0.0154564813388897
commit 0.0137447439653709
criminal 0.0134312894429112
societal 0.0124164973052386
trends 0.0119505837811614
conviction 0.0115699047136248
patterns 0.011248045148093
Salton’s description of tf-idf

- from Salton et al, A Vector Space Model for Automatic Indexing, 1975

nj-sentator-menendez corpus, Overview sample files

color = human tags generated from TF-IDF clusters
Cluster Hypothesis
“documents in the same cluster behave similarly with respect to
relevance to information needs”

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement – but the crucial link between
human semantics and mathematical properties.

Articulated as early as 1971, has been shown to hold at web
scale, widely assumed.
Bag of words + TF-IDF hard to beat
Practical win: good precision-recall metrics in tests with human-
tagged document sets.

Still the dominant text indexing scheme used today. (Lucene,
FAST, Google…) Many variants and extensions.

Some, but not much, theory to explain why this works. (E.g. why
that particular IDF formula? why doesn’t indexing bigrams
improve performance?)

Collectively: the vector space document model
No unique “right” clustering
Different distance metrics and clustering algorithms give
different results.

Should we sort incident reports by location, time, actor,
event type, author, cost, casualties…?

There is only context-specific categorization.
And the computer doesn’t understand your context.
Different libraries,
different categories