
Frontiers of Computational Journalism
Columbia Journalism School
Week 2: Filtering Algorithms

September 15, 2017

Journalism as a cycle


Effects Data


stories not covered



Each day, the Associated Press publishes:

~10,000 text stories

~3,000 photographs
~500 videos
+ radio, interactive…
More video on YouTube than produced by TV networks
during entire 20th century.
Google now indexes more web pages
than there are people in the world

400,000,000 tweets per day

estimated 130,000,000 books ever published

10,000 legally-required reports filed by U.S. public
companies every day
All New York Times
articles ever =
0.06 terabytes
(13 million stories,
5k per story)
It’s not information overload, it’s filter failure
- Clay Shirky
This class
Filtering by content
Topic models

Filtering by user interaction

Reddit comment ranking
User-item recommendation

Both together:
Collaborative topic models
System Description

Scrape → Cluster events → Cluster topics → Summarize

Handcrafted list of source URLs (news front pages) and links

followed to depth 4

Then extract the text of each article

Text extraction from HTML
Ideal world: HTML5 “article” tags

“The article element represents a component of a page that

consists of a self-contained composition in a document, page,
application, or site and that is intended to be independently
distributable or reusable, e.g. in syndication.”
- W3C Specification
Text extraction from HTML
Newsblaster paper:

“For each page examined, if the amount of text in the

largest cell of the page (after stripping tags and links) is
greater than some particular constant (currently 512
characters), it is assumed to be a news article, and this text
is extracted.”

(At least it’s simple. This was 2002. How often does this work?)
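A minimal sketch of the heuristic quoted above, using only Python's standard `html.parser`. It assumes a "cell" means a `<td>` element and that text in nested cells counts only toward the innermost cell; the names `CellTextParser` and `extract_article` are illustrative and not taken from the Newsblaster paper.

```python
from html.parser import HTMLParser

MIN_ARTICLE_CHARS = 512  # the constant quoted from the paper


class CellTextParser(HTMLParser):
    """Collect the visible text of every <td> cell, skipping link text."""

    def __init__(self):
        super().__init__()
        self.open_cells = []  # stack of text buffers, one per open <td>
        self.cells = []       # text of every closed cell
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_cells.append([])
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            self.cells.append(" ".join(self.open_cells.pop()))
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Assumption: nested-cell text counts only toward the innermost cell.
        if text and self.open_cells and self.link_depth == 0:
            self.open_cells[-1].append(text)


def extract_article(html, min_chars=MIN_ARTICLE_CHARS):
    """Return the text of the largest cell if it is long enough, else None."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None
```

Pages that do not use table layout have no cells at all, so this sketch returns None for them.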
Text extraction from HTML
Now multiple services/APIs to do this, e.g.
Cluster Events


• encode articles into feature vectors

• cosine distance function
• hierarchical clustering algorithm
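The first two bullets can be sketched as follows. This assumes documents are already tokenized into lists of words, uses one common TF-IDF weighting (term frequency times log of inverse document frequency), and stores vectors as sparse dicts. The function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} TF-IDF vector."""
    n = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (1.0 if either is all zero)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Documents sharing no terms are at distance 1; a document compared with itself is at distance 0.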
Different clustering algorithms

• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
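A minimal sketch of the agglomerative case, assuming MIN means single-link (distance between the closest pair of members) and MAX means complete-link (distance between the farthest pair). The function name and the `num_clusters` stopping rule are assumptions made for illustration.

```python
def agglomerative(points, distance, num_clusters, linkage=min):
    """Merge the two closest clusters until num_clusters remain.

    linkage=min is the MIN (single-link) approach,
    linkage=max is the MAX (complete-link) approach.
    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]  # start with leaves
    while len(clusters) > max(num_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(distance(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters
```

Recording the sequence of merges instead of stopping at `num_clusters` would give the full hierarchy.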
But news is an on-line problem...
Articles arrive one at a time, and must be clustered

Can’t look forward in time, can’t go back and reassign.

Greedy algorithm.
Single pass clustering

put first story in its own cluster

repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
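A runnable sketch of the single pass. The slide does not say how the distance from a story to a cluster is measured or which cluster wins if several qualify; this version assumes each cluster is represented by its first story and picks the closest qualifying cluster.

```python
def single_pass_cluster(stories, distance, threshold):
    """Greedy online clustering: each story is assigned once, in arrival order."""
    clusters = []
    for story in stories:
        best, best_dist = None, threshold
        for cluster in clusters:
            # Assumption: a cluster is represented by its first story.
            d = distance(story, cluster[0])
            if d < best_dist:
                best, best_dist = cluster, d
        if best is not None:
            best.append(story)        # put S in C
        else:
            clusters.append([story])  # put S in new cluster
    return clusters
```

Because earlier assignments are never revisited, the result depends on arrival order and on the threshold T.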
Now sort events into categories

U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again
Each category has pre-assigned TF-IDF coordinate.
Story category = closest point.
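A minimal sketch of "closest point", assuming closeness is measured by cosine similarity between sparse TF-IDF vectors. The category vectors in the usage example are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(story_vec, category_vecs):
    """Return the name of the category whose vector is closest to the story."""
    return max(category_vecs,
               key=lambda name: cosine_similarity(story_vec, category_vecs[name]))


# Hypothetical pre-assigned category coordinates.
categories = {
    "finance": {"bank": 1.0, "rates": 0.8},
    "sports": {"goal": 1.0, "team": 0.9},
}
print(categorize({"rates": 0.5, "fed": 0.7}, categories))
```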
latest story

“finance” category
Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.

Active research area. See references in Newsblaster paper,

and more recent techniques.
Is Newsblaster really a filter?
After all, it shows “all” the news...

Differences with Google News?

Topic Modeling – Matrix Techniques
Problem Statement

Can the computer tell us the “topics” in a document set?

Can the computer organize the documents by “topic”?

Note: TF-IDF tells us the topics of a single document, but

here we want topics of an entire document set.
Simplest possible technique
Sum TF-IDF scores for each word across entire document
set, choose top ranking words.

This is how Overview generates cluster descriptions.
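A sketch of this technique, assuming tokenized documents and IDF computed over the same document set. This is not Overview's actual code; a real system might compute IDF against a larger reference corpus.

```python
import math
from collections import Counter


def top_terms(docs, k=5):
    """Sum each term's TF-IDF score over all documents and return the top k terms."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            totals[term] += (count / len(doc)) * math.log(n / doc_freq[term])
    return [term for term, _ in totals.most_common(k)]
```

With IDF computed this way, a word that appears in every document scores zero and never dominates the description.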

Topic Modeling Algorithms
Basic idea: reduce dimensionality of document vector
space, so each dimension is a topic.

Each document is then a vector of topic weights. We want

to figure out what dimensions and weights give a good
approximation of the full set of words in each document.

Many variants: LSI, PLSI, LDA, NMF

Matrix Factorization

Approximate term-document matrix V as product of two lower rank matrices

V ≈ W H

V: m docs by n terms
W: m docs by r "topics"
H: r "topics" by n terms

Matrix Factorization
A "topic" is a group of words that occur together.

words in this topic

topics in this document

Non-negative Matrix Factorization
All elements of document coordinate matrix W and topic
matrix H must be >= 0

Simple iterative algorithm to compute.

Still have to choose number of topics r
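One simple iterative scheme is the multiplicative update rule of Lee and Seung, sketched here in plain Python on lists of lists. Treating this as the algorithm the slide has in mind is an assumption; the iteration count, seed and `eps` are illustrative.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, r, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative v (m x n) into w (m x r) and h (r x n), all >= 0."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(m)]
    return w, h
```

Each row of `w` holds one document's topic weights and each row of `h` holds one topic's word weights. The updates only multiply non-negative numbers, so the factors stay non-negative.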

Probabilistic Topic Modeling
Latent Dirichlet Allocation
Imagine that each document is written by someone going
through the following process:

1. For each doc d, choose mixture of topics p(z|d)

2. For each word w in d, choose a topic z from p(z|d)
3. Then choose word from p(w|z)

A document has a distribution of topics.

Each topic is a distribution of words.
LDA tries to find these two sets of distributions.

LDA models each document as a distribution over topics. Each

word belongs to a single topic.

LDA models a topic as a distribution over all the words in the corpus.
In each topic, some words are more likely, some are less likely.
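The imagined writing process can be simulated directly. This sketch assumes symmetric Dirichlet priors with concentrations `a` (doc topics) and `b` (topic words), sampled by normalizing gamma draws; the function names and default values are illustrative.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab, a=0.5, b=0.5, seed=0):
    """Generate documents by following the LDA generative process."""
    rng = random.Random(seed)
    # Each topic is a distribution over words: p(w|z)
    topic_words = [sample_dirichlet(rng, b, len(vocab)) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # 1. choose this document's mixture of topics p(z|d)
        doc_topics = sample_dirichlet(rng, a, num_topics)
        words = []
        for _ in range(doc_len):
            # 2. choose a topic z from p(z|d)
            z = rng.choices(range(num_topics), weights=doc_topics)[0]
            # 3. choose a word from p(w|z)
            words.append(rng.choices(vocab, weights=topic_words[z])[0])
        docs.append(words)
    return docs
```

Fitting LDA runs this story in reverse: given only the words, infer the two sets of distributions.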
LDA Plate Notation
[Figure: LDA plate diagram. Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: N words in doc, D docs, K topics.]
Computing LDA
word[d][i]   document words
k            # topics
a            doc topic concentration
b            topic word concentration
n            # docs
len[d]       # words in document
v            vocabulary size
Computing LDA

topics[n][i]        doc/word topic assignments

topic_words[k][v]   topic words dist
doc_topics[n][k]    document topics dist
topics -> topic_words
topic_words[*][*] = b

for d=1..n
    for i=1..len[d]
        topic_words[topics[d][i]][word[d][i]] += 1

for j=1..k
    normalize topic_words[j]
topics -> doc_topics
doc_topics[*][*] = a

for d=1..n
    for i=1..len[d]
        doc_topics[d][topics[d][i]] += 1

for d=1..n
    normalize doc_topics[d]
Update topics
// for each word in document, sample a new topic
for d=1..n
    for i=1..len[d]
        w = word[d][i]
        for t=1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]

        topics[d][i] = sample from p
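The three steps above can be combined into one loop. This Python sketch follows the slides' variable names but uses 0-based indices. The random initial assignment, the default concentrations and the iteration count are assumptions. This is the simplified scheme from the slides, not a tuned collapsed Gibbs sampler.

```python
import random


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def estimate(word, topics, k, v, a, b):
    """Recompute topic_words and doc_topics from the current assignments."""
    topic_words = [[b] * v for _ in range(k)]
    doc_topics = [[a] * k for _ in range(len(word))]
    for d, doc in enumerate(word):
        for i, w in enumerate(doc):
            topic_words[topics[d][i]][w] += 1
            doc_topics[d][topics[d][i]] += 1
    return ([normalize(row) for row in topic_words],
            [normalize(row) for row in doc_topics])


def lda(word, k, v, a=0.1, b=0.01, iterations=50, seed=0):
    """word[d][i] is the vocabulary index (0..v-1) of word i in document d."""
    rng = random.Random(seed)
    # Assumption: start from a uniformly random topic for every word.
    topics = [[rng.randrange(k) for _ in doc] for doc in word]
    for _ in range(iterations):
        topic_words, doc_topics = estimate(word, topics, k, v, a, b)
        # For each word in each document, sample a new topic.
        for d, doc in enumerate(word):
            for i, w in enumerate(doc):
                p = [doc_topics[d][t] * topic_words[t][w] for t in range(k)]
                topics[d][i] = rng.choices(range(k), weights=p)[0]
    topic_words, doc_topics = estimate(word, topics, k, v, a, b)
    return topics, topic_words, doc_topics
```

The weights `p` do not need to be normalized, because `random.choices` accepts relative weights.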

Dimensionality reduction
Output of NMF and LDA is a vector of much lower
dimension for each document. ("Document
coordinates in topic space.")

Dimensions are “concepts” or “topics” instead of words.

Can measure cosine distance, cluster, etc. in this new space.

Comment Ranking
Filtering Comments

Thousands of comments, what are the “good” ones?

Comment voting

Problem: putting comments with most votes at top doesn’t

work. Why?
Reddit Comment Ranking (old)

Up – down votes
plus time decay
Reddit Comment Ranking (new)

v = 11
p = 11/16 = 0.6875

Hypothetically, suppose all N users voted on the comment, and v
out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.
Reddit Comment Ranking

v’ = 1
p’ = 1/3 = 0.333

Actually, only n users out of N vote, giving an observed

approximate proportion p’ = v’/n
Reddit Comment Ranking

p’ = 0.75
p = 0.1875

p’ = 0.333
p = 0.6875

Limited sampling can rank votes wrong when we don’t have

enough data.
Random error in sampling
If we observe a proportion p’ of upvotes from n random users, what is the
distribution of the true proportion p?

Distribution of p’ when p=0.5

Confidence interval
1-𝛼 probability that the true value p will lie within the
central region (when sampled assuming p=p’)
Rank comments by lower bound
of confidence interval
Analytic solution for confidence interval, known as “Wilson score”

p’ = observed proportion of upvotes

n = how many people voted
zα = how certain we want to be before we assume that p’ is
“close” to true p
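A minimal sketch of the lower bound of the Wilson score interval, in terms of the three quantities just defined. The default z = 1.96 (roughly 95% confidence) and the choice to score unvoted comments as 0 are assumptions.

```python
import math


def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote proportion.

    (p + z^2/(2n) - z * sqrt(p(1-p)/n + z^2/(4n^2))) / (1 + z^2/n)
    where p = upvotes / n is the observed proportion.
    """
    if n == 0:
        return 0.0  # assumption: no votes means no evidence
    p = upvotes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# One upvote out of one vote ranks below 90 upvotes out of 100 votes,
# even though its observed proportion is higher.
print(wilson_lower_bound(1, 1), wilson_lower_bound(90, 100))
```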
User-item Recommendation
User-item matrix

Stores “rating” of each user for each item. Could also

be binary variable that says whether user clicked,
liked, starred, shared, purchased...
User-item matrix
• No content analysis. We know nothing about what is “in” each item.
• Typically very sparse – a user hasn’t watched even 1% of all items.
• Filtering problem is guessing “unknown” entry in matrix. High
guessed values are things user would want to see.
Filtering process
How to guess unknown rating?

Basic idea: suggest “similar” items.

Similar items are rated in a similar way by many different users.


Remember, “rating” could be a click, a like, a purchase.

o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”
Similar items
Item similarity

Cosine similarity!
Other distance measures
“adjusted cosine similarity”

Subtracts average rating for each user, to compensate for general

enthusiasm (“most movies suck” vs. “most movies are great”)
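A sketch of both measures over a sparse user-item matrix stored as `{user: {item: rating}}`. It treats unrated entries as zero; some formulations instead restrict the sums to users who rated both items. The function names and example ratings are illustrative.

```python
import math


def item_vector(ratings, item, adjust=False):
    """Sparse {user: rating} column for one item.

    With adjust=True, each user's mean rating is subtracted first.
    """
    column = {}
    for user, rated in ratings.items():
        if item in rated:
            mean = sum(rated.values()) / len(rated) if adjust else 0.0
            column[user] = rated[item] - mean
    return column


def item_similarity(ratings, item_a, item_b, adjust=False):
    """Cosine (or adjusted cosine) similarity between two items."""
    a = item_vector(ratings, item_a, adjust)
    b = item_vector(ratings, item_b, adjust)
    dot = sum(a[u] * b[u] for u in a if u in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 5, "C": 1},
    "bob": {"A": 4, "B": 4, "C": 2},
    "cat": {"A": 1, "B": 2, "C": 5},
}
print(item_similarity(ratings, "A", "C"))
print(item_similarity(ratings, "A", "C", adjust=True))
```

With all-positive ratings, plain cosine similarity is never negative. After subtracting user means, items that users disagree on can come out negatively similar.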
Generating a recommendation

Weighted average of item ratings by their similarity.
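A minimal sketch of the weighted average, assuming item similarities to the target item have already been computed and that only positively similar items contribute. The function name is illustrative.

```python
def predict_rating(user_ratings, similarities):
    """Guess a user's rating of a target item.

    user_ratings: {item: rating} for items the user has already rated.
    similarities: {item: similarity of that item to the target item}.
    Returns None when the user has rated no similar items.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for item, rating in user_ratings.items():
        sim = similarities.get(item, 0.0)
        if sim > 0:  # assumption: ignore dissimilar items
            weighted_sum += sim * rating
            weight_total += sim
    return weighted_sum / weight_total if weight_total else None
```

Items with the highest predicted ratings among those the user has not yet seen become the recommendations.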

Matrix factorization recommender

Note: only sum over observed ratings rij.
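A sketch of fitting such a model by stochastic gradient descent. It assumes each rating rij is approximated by the dot product of a user vector u_i and an item vector v_j, with squared error plus an L2 penalty as the objective. The learning rate, penalty, rank k and function name are illustrative assumptions.

```python
import random


def factorize(ratings, num_users, num_items, k=2, epochs=200,
              lr=0.02, reg=0.05, seed=0):
    """ratings is a list of (i, j, r) triples: user i gave item j rating r."""
    rng = random.Random(seed)
    u = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_users)]
    v = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_items)]
    for _ in range(epochs):
        for i, j, r in ratings:  # only sum over observed ratings
            err = r - sum(a * b for a, b in zip(u[i], v[j]))
            for f in range(k):
                ui, vj = u[i][f], v[j][f]
                u[i][f] += lr * (err * vj - reg * ui)
                v[j][f] += lr * (err * ui - reg * vj)
    return u, v


def predict(u, v, i, j):
    """Predicted rating of item j by user i."""
    return sum(a * b for a, b in zip(u[i], v[j]))
```

After fitting, `predict(u, v, i, j)` fills in any unknown entry of the user-item matrix.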

Matrix factorization plate model
[Figure: plate diagram. Nodes: λv (variation in item topics), v (topics for item), user rating of item, u (topics for user), λu (variation in user topics). Plates: j items, i users.]
New York Times recommender
Combining collaborative filtering
and topic modeling
Content modeling - LDA
[Figure: LDA plate diagram. Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: N words in doc, D docs, K topics.]
Collaborative Topic Modeling
[Figure: collaborative topic model plate diagram. Nodes: topic concentration, topics in doc (content), topic for word, word in doc, weight of user selections, topics in doc (collaborative), user rating of doc, variation in per-user topics, topics for user. Plate: K topics.]
content only

content +