Frontiers of

Computational Journalism
Columbia Journalism School
Week 3: Filtering Algorithms
September 26, 2016

This class

The need for filters
Comment Ranking
Newsblaster
User-item recommendation
New York Times recommendation system

The Need for Filters

Journalism as a cycle

[Diagram: a cycle running Data → Reporting → Filtering → Effects and back to Data, with computer science ("CS") applicable at every stage and users in the loop.]

[Diagram: individual stories (x) pass through a filter before reaching the user; "stories not covered" never enter the pipeline at all.]

Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive…

More video is uploaded to YouTube than was produced by TV networks during the entire 20th century.

Google now indexes more web pages
than there are people in the world
400,000,000 tweets per day
estimated 130,000,000 books ever published

10,000 legally-required reports filed
by U.S. public companies every day

All New York Times
articles ever =
0.06 terabytes
(13 million stories,
5k per story)

"It's not information overload, it's filter failure."
- Clay Shirky

Comment Ranking

Filtering Comments

Thousands of comments, what are the “good” ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn't work. Why? (Older comments have had more time to collect votes, and whatever is already at the top gets seen, and voted on, the most.)

Reddit Comment Ranking (old)

Score = upvotes minus downvotes, plus time decay (newer comments get a boost).
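The "votes plus time decay" recipe matches the "hot" formula in Reddit's open-sourced code (written for ranking links; shown here as an illustrative Python sketch):

from math import log10

def hot(ups, downs, timestamp):
    """Net votes on a log scale plus a recency bonus: a newer post
    needs fewer net votes to outrank an older one. `timestamp` is
    seconds since the Unix epoch; 1134028003 is the launch-era
    epoch constant from Reddit's code."""
    score = ups - downs
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    return sign * order + (timestamp - 1134028003) / 45000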

Reddit Comment Ranking (new)

Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.

Example: N = 16, v = 11, p = 11/16 = 0.6875

Reddit Comment Ranking

Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n.

Example: n = 3, v’ = 1, p’ = 1/3 = 0.333

Reddit Comment Ranking

Comment A: observed p’ = 0.75, true p = 0.1875
Comment B: observed p’ = 0.333, true p = 0.6875

Sorting by the observed proportion puts A above B, but B is actually the better comment: limited sampling can rank comments wrong when we don’t have enough data.

Random error in sampling
If we observe a proportion p’ of upvotes from n random voters, what is the distribution of the true proportion p?

[Plot: distribution of the observed p’ when the true p = 0.5.]
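A quick simulation makes the spread visible (a numpy sketch; the sample sizes are illustrative):

import numpy as np

# Simulate many comments with true p = 0.5 and look at how far the
# observed proportion p' scatters for different numbers of voters n.
rng = np.random.default_rng(0)
for n in (3, 16, 100):
    p_obs = rng.binomial(n, 0.5, size=100_000) / n
    lo, hi = np.percentile(p_obs, [5, 95])
    print(f"n={n:3d}: 90% of observed p' fall in [{lo:.2f}, {hi:.2f}]")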

Confidence interval
Given observed p’, an interval that the true p has probability α of lying inside.

Rank comments by lower bound of confidence interval
Analytic solution for the confidence interval, known as the “Wilson score” interval. Rank by its lower bound:

( p’ + zα²/(2n) − zα · sqrt( p’(1−p’)/n + zα²/(4n²) ) ) / ( 1 + zα²/n )

p’ = observed proportion of upvotes
n = how many people voted
zα = how certain we want to be before we assume that p’ is “close” to the true p (e.g. zα ≈ 1.96 for 95% confidence)
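In code, the ranking key is just this lower bound (a minimal sketch; the function name is ours, and z defaults to 1.96 for 95% confidence):

from math import sqrt

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true
    proportion of upvotes. Sort comments by this value, descending."""
    if n == 0:
        return 0.0
    p = upvotes / n
    center = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - spread) / (1 + z * z / n)

For the earlier examples: wilson_lower_bound(1, 3) ≈ 0.06, while wilson_lower_bound(11, 16) ≈ 0.44.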

Newsblaster

System Description
Scrape → Cluster events → Cluster topics → Summarize

Scrape
Handcrafted list of source URLs (news front pages), with links followed to depth 4.
Then extract the text of each article.

Text extraction from HTML

Ideal world: HTML5 “article” tags

“The article element represents a component of a page that
consists of a self-contained composition in a document, page,
application, or site and that is intended to be independently
distributable or reusable, e.g. in syndication.”
- W3C Specification

The dismal reality of text extraction

Every site is a beautiful flower.

Text extraction from HTML
Newsblaster paper:
“For each page examined, if the amount of text in the
largest cell of the page (after stripping tags and links) is
greater than some particular constant (currently 512
characters), it is assumed to be a news article, and this text
is extracted.”
(At least it’s simple. This was 2002. How often does this work
now?)

Text extraction from HTML
Now there are multiple services/APIs to do this, e.g. readability.com

Cluster Events

Cluster Events
Surprise!
• encode articles into feature vectors
• cosine distance function
• hierarchical clustering algorithm
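One way to realize this recipe, sketched with scikit-learn (the articles and the distance threshold are made up; a real system would tune both):

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "earthquake strikes city overnight",
    "magnitude 6 earthquake hits city center",
    "central bank raises interest rates",
]
X = TfidfVectorizer().fit_transform(articles).toarray()
clusters = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.8,   # merge clusters closer than this
    metric="cosine",          # named `affinity` in scikit-learn < 1.2
    linkage="average",
).fit(X)
print(clusters.labels_)       # the two earthquake stories share a label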

Different clustering algorithms
• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split

But news is an on-line problem...
Articles arrive one at a time, and must be clustered
immediately.
Can’t look forward in time, can’t go back and
reassign.
Greedy algorithm.

Single pass clustering

put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance(S, C) < T
    if found
        put S in C
    else
        put S in new cluster
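The same loop as runnable Python (a sketch: `distance` compares a story to a cluster, e.g. cosine distance to the cluster centroid):

def single_pass_cluster(stories, distance, T):
    """Greedy one-pass clustering: each arriving story joins the
    nearest existing cluster if it is closer than threshold T,
    otherwise it starts a new cluster."""
    clusters = []                    # each cluster is a list of stories
    for s in stories:
        nearest = min(clusters, key=lambda c: distance(s, c), default=None)
        if nearest is not None and distance(s, nearest) < T:
            nearest.append(s)        # put S in C
        else:
            clusters.append([s])     # put S in a new cluster
    return clusters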

Evaluating clusterings
When is one clustering “better” than another?
Ideally, we’d like a quantitative metric.
This is possible if we have training data = human-generated clusters.
Available from the TDT2 corpus (“topic detection and tracking”).

[Plot: error with respect to hand-generated clusters from training data.]

Now sort events into categories
Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.

[Diagram: the latest story plotted in TF-IDF space near the “world” and “finance” category points; it gets the label of the closest one.]
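This “closest point” rule is a nearest-centroid classifier. A minimal sketch (category names and vectors are placeholders; the story and the centroids must come from the same TF-IDF vectorizer):

import numpy as np

def categorize(story_vec, centroids):
    """Return the category whose pre-assigned TF-IDF vector is
    closest to the story, i.e. has the highest cosine similarity."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda name: cos(story_vec, centroids[name]))

# e.g. categorize(story, {"world": world_vec, "finance": finance_vec})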

Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
Difficult problem. See references in Newsblaster
paper, and more recent techniques.

Is Newsblaster really a filter?
After all, it shows “all” the news...
Differences with Google News?

Personalization!
Not every person needs to see the same news.

User-item Recommendation

User-item matrix

Stores “rating” of each user for each item. Could also
be binary variable that says whether user clicked,
liked, starred, shared, purchased...

User-item matrix

No content analysis: we know nothing about what is “in” each item.
Typically very sparse – a user hasn’t watched even 1% of all movies.
The filtering problem is guessing the “unknown” entries in the matrix. High guessed values are things the user would want to see.
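To make the sparsity concrete, a toy matrix (made-up numbers; 0 marks an unknown entry the recommender must guess):

import numpy as np

# Rows = users, columns = items; entries are ratings, 0 = unknown.
R = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 1],
    [0, 2, 0, 5],
])
observed = R > 0
print(f"{observed.sum()} of {R.size} entries observed ({observed.mean():.0%})")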

Filtering process

How to guess unknown rating?
Basic idea: suggest “similar” items.
Similar items are rated in a similar way by many
different users.
Remember, “rating” could be a click, a like, a
purchase.
o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”

Similar items

Item similarity
Cosine similarity! Treat each item as the vector of its ratings across all users:

sim(a, b) = a · b / ( |a| |b| )

Other distance measures
“Adjusted cosine similarity”: subtract each user’s average rating from their ratings before computing the cosine, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”).

Generating a recommendation

Weighted average of the user’s item ratings, weighted by each item’s similarity to the target item:

pred(u, i) = Σj sim(i, j) · ruj / Σj |sim(i, j)|  (summing over items j that user u has rated)
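A numpy sketch of this item-item scheme (unknown ratings stored as 0; a real system would use sparse matrices and adjusted cosine):

import numpy as np

def item_similarity(R):
    """Cosine similarity between the item columns of a ratings matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0          # guard against unrated items
    return (R.T @ R) / np.outer(norms, norms)

def predict(R, u, i):
    """Predict user u's rating of item i as the similarity-weighted
    average of u's known ratings."""
    S = item_similarity(R)
    rated = R[u] > 0                 # items u has actually rated
    total = np.abs(S[i, rated]).sum()
    return (S[i, rated] @ R[u, rated]) / total if total else 0.0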

Matrix factorization recommender

Represent each user i by a vector ui and each item j by a vector vj, and model rij ≈ ui · vj. Fit by minimizing the squared error over observed ratings, plus regularization:

Σ over observed (i, j) of ( rij − ui · vj )² + λ ( Σi |ui|² + Σj |vj|² )

Note: only sum over observed ratings rij.
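A stochastic-gradient sketch of this objective (toy hyperparameters; production systems use alternating least squares or carefully tuned SGD):

import numpy as np

def factorize(R, k=2, steps=500, lr=0.01, lam=0.1, seed=0):
    """Learn U (users x k) and V (items x k) so that U @ V.T
    approximates R on its observed (nonzero) entries, with L2
    regularization weight lam."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))
    V = rng.normal(scale=0.1, size=(R.shape[1], k))
    observed = list(zip(*np.nonzero(R)))    # only observed ratings
    for _ in range(steps):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            ui = U[i].copy()                # pre-update value for V's step
            U[i] += lr * (err * V[j] - lam * U[i])
            V[j] += lr * (err * ui - lam * V[j])
    return U, V                             # predicted matrix: U @ V.T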

Matrix factorization plate model

[Plate diagram: for each of i users, a topic vector u drawn with prior variation λu; for each of j items, a topic vector v drawn with prior variation λv; each observed rating r of an item by a user depends on that user’s u and that item’s v.]

New York Times recommender

Combining collaborative filtering
and topic modeling

Content modeling - LDA

[Plate diagram: for each of D docs, topic proportions drawn using a topic concentration parameter; each of the N words in a doc gets a topic assignment drawn from those proportions, and the observed word is drawn from that topic’s word distribution; the K topics’ word distributions are drawn using a word concentration parameter.]
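A minimal LDA sketch with scikit-learn (toy documents; the Dirichlet priors play the role of the concentration parameters above):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "economy markets stocks fell sharply",
    "central bank rates economy inflation",
    "team won the championship game",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).round(2))   # topic proportions per doc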

Collaborative Topic Modeling

[Plate diagram: as in LDA, each doc has topic proportions that generate its words (content); each doc’s topic weights are also adjusted to explain user selections (collaborative); each user has a topic vector with per-user variation; a user’s rating of a doc depends on the user’s topics and the doc’s adjusted topics. K topics.]

[Plot: recommendation performance, content-only model vs. content + social (collaborative) model.]
