From the course Frontiers of Computational Journalism, Columbia University, Fall 2017 http://www.compjournalism.com/?p=206

Sep 21, 2017

© All Rights Reserved


Computational Journalism

Columbia Journalism School

Week 2: Filtering Algorithms

Journalism as a cycle

[Diagram: journalism as a cycle. Data → Reporting → Filtering → User → Effects → back to Data, with computer science (CS) methods applied at each stage]

[Diagram: many candidate stories, most marked as not covered; filtering selects which stories reach the user]

Each day, the Associated Press publishes:

~3,000 photographs

~500 videos

+ radio, interactive…

More video on YouTube than produced by TV networks during the entire 20th century.

Google now indexes more web pages

than there are people in the world

10,000 legally-required reports filed by U.S. public

companies every day

All New York Times articles ever = 0.06 terabytes (13 million stories, 5 KB per story)

It’s not information overload, it’s filter failure

- Clay Shirky

This class

Filtering by content

Newsblaster

Topic models

Reddit comment ranking

User-item recommendation

Both together:

Collaborative topic models

Newsblaster

System description (pipeline): Scrape → Cluster events → Cluster topics → Summarize

Scrape

Links followed to depth 4.

Text extraction from HTML

Ideal world: HTML5 “article” tags

"The article element represents a component of a page that
consists of a self-contained composition in a document, page,
application, or site and that is intended to be independently
distributable or reusable, e.g. in syndication."

- W3C Specification

Text extraction from HTML

Newsblaster paper:

"If the text in the largest cell of the page (after stripping
tags and links) is greater than some particular constant
(currently 512 characters), it is assumed to be a news article,
and this text is extracted."

(At least it’s simple. This was 2002. How often does this work

now?)

Text extraction from HTML

Now there are multiple services/APIs for this, e.g. readability.com
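Under the assumption that the page keeps its article in a table cell, the 2002 heuristic can be sketched with Python's standard-library HTML parser (class and function names here are ours, not from the paper):

```python
from html.parser import HTMLParser

# Hedged sketch of the Newsblaster-style heuristic: collect the
# text of each <td> cell, take the largest one, and call it the
# article if it beats a threshold (512 characters in the paper).
# Text inside nested cells is credited to the innermost open cell.

class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level of <td> tags
        self.current = []     # text buffers for open cells
        self.cells = []       # completed cell texts

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            self.current.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1
            self.cells.append(self.current.pop())

    def handle_data(self, data):
        if self.current:
            self.current[-1] += data

def extract_article(html, threshold=512):
    parser = CellExtractor()
    parser.feed(html)
    if not parser.cells:
        return None
    best = max(parser.cells, key=len)
    return best if len(best) >= threshold else None
```

On modern, table-free article pages this heuristic mostly fails, which is exactly the point of the question above.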

Cluster Events


Surprise!

• cosine distance function

• hierarchical clustering algorithm

Different clustering algorithms

• Partitioning

o keep adjusting clusters until convergence

o e.g. K-means

• Agglomerative hierarchical

o start with leaves, repeatedly merge clusters

o e.g. MIN and MAX approaches

• Divisive hierarchical

o start with root, repeatedly split clusters

o e.g. binary split

But news is an on-line problem...

Articles arrive one at a time, and must be clustered

immediately.

Greedy algorithm.

Single pass clustering

repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
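The greedy loop above can be made runnable in a few lines. This is a minimal sketch: bag-of-words vectors with raw term counts (instead of TF-IDF) and an assumed threshold T = 0.3.

```python
import math
from collections import Counter

# Single-pass (greedy) clustering with cosine similarity.
# Raw term counts and the threshold T are simplifying assumptions.

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass_cluster(stories, T=0.3):
    clusters = []  # each cluster is a list of story vectors
    for text in stories:
        vec = Counter(text.lower().split())
        best, best_sim = None, T
        for c in clusters:
            centroid = sum(c, Counter())  # merge member vectors
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is not None:
            best.append(vec)        # put S in C
        else:
            clusters.append([vec])  # put S in a new cluster
    return clusters

clusters = single_pass_cluster(["earthquake hits city center",
                                "major earthquake strikes city",
                                "stocks rally on earnings"])
```

The two earthquake stories merge into one cluster; the finance story starts its own.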

Now sort events into categories

Categories:

U.S., World, Finance, Science and Technology, Entertainment, Sports.

TF-IDF, again

Each category has pre-assigned TF-IDF coordinate.

Story category = closest point.

[Diagram: the latest story's vector lies between the "world" and "finance" category coordinates; it is assigned to the closest one]
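A hedged sketch of this nearest-centroid step: the category vectors below are toy word weights we made up, standing in for the pre-assigned TF-IDF coordinates the slides describe.

```python
import math

# Nearest-centroid categorization. The category vectors are
# invented toy weights; in Newsblaster these are pre-assigned
# TF-IDF coordinates for each category.

def cosine(a, b):
    dot = sum(a.get(w, 0.0) * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

categories = {
    "world":   {"minister": 1.0, "treaty": 0.8, "embassy": 0.7},
    "finance": {"stocks": 1.0, "earnings": 0.9, "market": 0.8},
}

def categorize(story_vec):
    # story category = closest category point
    return max(categories, key=lambda c: cosine(story_vec, categories[c]))

print(categorize({"stocks": 1.0, "market": 0.5}))  # prints "finance"
```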

Cluster summarization

Problem: given a set of documents, write a sentence

summarizing them.

(See the Newsblaster papers and more recent summarization techniques.)

Is Newsblaster really a filter?

After all, it shows “all” the news...

Topic Modeling – Matrix Techniques

Problem Statement

Can the computer organize the documents by “topic”?

Here we want the topics of an entire document set.

Simplest possible technique

Sum TF-IDF scores for each word across entire document

set, choose top ranking words.
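The "simplest possible technique" as a runnable sketch, on an invented three-document corpus:

```python
import math
from collections import Counter

# Sum each word's TF-IDF over the whole document set and keep
# the top-ranking words. The example documents are made up.

docs = ["the quake damaged the city",
        "the city rebuilt after the quake",
        "stocks fell as the market wobbled"]

tokenized = [d.split() for d in docs]
n = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))  # document frequency

scores = Counter()
for doc in tokenized:
    for w, tf in Counter(doc).items():
        scores[w] += tf * math.log(n / df[w])  # TF * IDF

top_words = [w for w, _ in scores.most_common(3)]
```

A word like "the", which appears in every document, gets IDF = log(3/3) = 0 and never ranks, which is the whole point of the IDF weighting.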

Topic Modeling Algorithms

Basic idea: reduce dimensionality of document vector

space, so each dimension is a topic.

The trick is to figure out what dimensions and weights give a good approximation of the full set of words in each document.

Matrix Factorization

Factor the document-word matrix V into two lower-rank matrices:

V ≈ W × H

Matrix Factorization

A "topic" is a group of words that occur together.

Non-negative Matrix Factorization

All elements of document coordinate matrix W and topic

matrix H must be >= 0
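A pure-Python sketch of NMF using multiplicative updates (Lee and Seung's algorithm, a standard choice; the slides don't prescribe one). The toy matrix V has two obvious word groups that the factorization should recover as topics.

```python
import random

# V (docs x words) is approximated as W (docs x topics) times
# H (topics x words), with every entry of W and H kept >= 0.
# Toy data and iteration count are illustrative.

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=300, eps=1e-9, seed=0):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# toy term counts: two word groups that should emerge as topics
V = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 3, 2],
     [0, 0, 2, 3]]
W, H = nmf(V, k=2)
```

The multiplicative form of the update is what keeps every entry non-negative: each factor is only ever scaled by a non-negative ratio.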

Probabilistic Topic Modeling

Latent Dirichlet Allocation

Imagine that each document d is written by someone going
through the following process:

1. Choose a mixture of topics p(z|d) for the document
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose word from p(w|z)

Each topic is a distribution of words.

LDA tries to find these two sets of distributions.

"Documents"

In these generated documents, each word belongs to a single topic.

"Topics"

LDA models a topic as a distribution over all the words in the corpus.

In each topic, some words are more likely, some are less likely.

LDA Plate Notation

[Plate diagram: concentration parameter a → topics in doc; a topic is drawn for each word in the doc; concentration parameter b → words in topics; the word itself is observed]

Computing LDA

Inputs:
    word[d][i]   document words
    k            # topics
    a            doc-topic concentration
    b            topic-word concentration

Also:
    n            # docs
    len[d]       # words in document d
    v            vocabulary size

Computing LDA

Outputs:
    topic_words[k][v]   topic-word distributions
    doc_topics[n][k]    document-topic distributions

topics -> topic_words

topic_words[*][*] = b
for d=1..n
    for i=1..len[d]
        topic_words[topics[d][i]][word[d][i]] += 1
for j=1..k
    normalize topic_words[j]

topics -> doc_topics

doc_topics[*][*] = a
for d=1..n
    for i=1..len[d]
        doc_topics[d][topics[d][i]] += 1
for d=1..n
    normalize doc_topics[d]

Update topics

// for each word in document, sample a new topic
for d=1..n
    for i=1..len[d]
        w = word[d][i]
        for t=1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]
        topics[d][i] = sample from p (after normalizing)
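The three pseudocode steps assemble into a tiny runnable sketch. Variable names follow the slides; the toy word-id corpus, hyperparameters a and b, and iteration count are illustrative.

```python
import random

# Rebuild topic_words and doc_topics from the current topic
# assignments (smoothed by b and a), then resample each word's
# topic from p(t) proportional to doc_topics[d][t] * topic_words[t][w].

def normalize(row):
    s = sum(row)
    return [x / s for x in row]

def lda(docs, k, v, a=0.1, b=0.01, iters=50, seed=0):
    rng = random.Random(seed)
    topics = [[rng.randrange(k) for _ in doc] for doc in docs]
    for _ in range(iters):
        # topics -> topic_words
        topic_words = [[b] * v for _ in range(k)]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                topic_words[topics[d][i]][w] += 1
        topic_words = [normalize(r) for r in topic_words]
        # topics -> doc_topics
        doc_topics = [[a] * k for _ in docs]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                doc_topics[d][topics[d][i]] += 1
        doc_topics = [normalize(r) for r in doc_topics]
        # update topics: sample a new topic for each word
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                p = [doc_topics[d][t] * topic_words[t][w] for t in range(k)]
                topics[d][i] = rng.choices(range(k), weights=p)[0]
    return topic_words, doc_topics

# word ids 0-1 and 2-3 form two natural topics
docs = [[0, 1, 0, 1], [0, 0, 1], [2, 3, 2, 3], [3, 2, 2]]
topic_words, doc_topics = lda(docs, k=2, v=4)
```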

Dimensionality reduction

Output of NMF and LDA is a vector of much lower dimension for each document ("document coordinates in topic space"): one dimension per topic rather than one per word. Documents about similar topics end up close together in this space.

Comment Ranking

Filtering Comments

Comment voting: simple vote counts don't quite work. Why?

Reddit Comment Ranking (old)

Up votes minus down votes, plus time decay.

Reddit Comment Ranking (new)

Imagine we could ask every user who will ever see the comment: v
out of N up-voted. Then we could sort by the proportion p = v/N of
upvotes.

Example: N = 16, v = 11, p = 11/16 = 0.6875

Reddit Comment Ranking

In reality, only n users have voted so far, v' of them up. We only
observe the approximate proportion p' = v'/n.

Example: n = 3, v' = 1, p' = 1/3 = 0.333

Reddit Comment Ranking

With few votes, the observed proportion can be badly wrong: a
comment with observed p' = 0.75 may have true p = 0.1875, while one
with observed p' = 0.333 may have true p = 0.6875. We can't trust
p' until we have enough data.

Random error in sampling

If we observe a proportion p' of upvotes from n random users, what
is the distribution of the true proportion p?

Confidence interval

1-𝛼 probability that the true value p will lie within the

central region (when sampled assuming p=p’)

Rank comments by lower bound

of confidence interval

Analytic solution for confidence interval, known as “Wilson score”

n = how many people voted

z_α = how certain we want to be before we assume that p' is
"close" to the true p

User-item Recommendation

User-item matrix

Each entry can be a rating, or a binary variable that says whether
the user clicked, liked, starred, shared, purchased...

User-item matrix

• No content analysis. We know nothing about what is “in” each

item.

• Typically very sparse – a user hasn’t watched even 1% of all

movies.

• Filtering problem is guessing “unknown” entry in matrix. High

guessed values are things user would want to see.

Filtering process

How to guess an unknown rating? Use the behavior of other users.

o “Users who bought A also bought B...”

o “Users who clicked A also clicked B...”

o “Users who shared A also shared B...”

Similar items

Item similarity

Cosine similarity!

Other distance measures

"Adjusted cosine similarity" subtracts each user's mean rating
first, correcting for differences in baseline enthusiasm ("most
movies suck" vs. "most movies are great").
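A small sketch on an invented ratings matrix shows the adjustment at work: each user's mean is removed before items are compared, so a grumpy rater and an enthusiastic rater contribute comparably.

```python
import math

# Item-item adjusted cosine similarity on toy data (invented).

ratings = {  # user -> {item: rating}
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 4, "B": 5, "C": 2},
    "carol": {"A": 1, "B": 2, "C": 5},
}

def adjusted_cosine(i, j):
    num = den_i = den_j = 0.0
    for r in ratings.values():
        if i in r and j in r:
            mean = sum(r.values()) / len(r)  # this user's mean rating
            num += (r[i] - mean) * (r[j] - mean)
            den_i += (r[i] - mean) ** 2
            den_j += (r[j] - mean) ** 2
    return num / math.sqrt(den_i * den_j) if den_i and den_j else 0.0

# A and B are rated alike; C draws the opposite reaction
print(adjusted_cosine("A", "B") > 0)  # True
print(adjusted_cosine("A", "C") < 0)  # True
```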

Generating a recommendation

Matrix factorization recommender

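A minimal sketch of a matrix-factorization recommender, trained with stochastic gradient descent (one standard way to fit it; the slides don't prescribe a training method). All hyperparameters below are illustrative.

```python
import random

# Learn user vectors U[i] and item vectors V[j] so that each
# observed rating r ~ U[i] . V[j]; unknown matrix cells are then
# predicted by the same dot product.

def factorize(obs, n_users, n_items, k=2, steps=3000,
              lr=0.01, reg=0.02, seed=0):
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        i, j, r = obs[rng.randrange(len(obs))]
        err = r - sum(a * b for a, b in zip(U[i], V[j]))
        for f in range(k):  # gradient step with L2 regularization
            uif, vjf = U[i][f], V[j][f]
            U[i][f] += lr * (err * vjf - reg * uif)
            V[j][f] += lr * (err * uif - reg * vjf)
    return U, V

# (user, item, rating) observations; cell (user 1, item 0) is unknown
obs = [(0, 0, 5), (0, 1, 1), (1, 1, 1), (2, 0, 5), (2, 1, 1)]
U, V = factorize(obs, n_users=3, n_items=2)

def predict(i, j):
    return sum(a * b for a, b in zip(U[i], V[j]))
```

The learned factors play the same role as "topics": each item is a point in a low-dimensional space, and a user's vector says how much they like each dimension.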

Matrix factorization plate model

[Plate diagram: λv controls variation in item topic vectors v (j items); λu controls variation in user topic vectors u (i users); together u and v generate the user's rating r of an item]

New York Times recommender

Combining collaborative filtering

and topic modeling

Content modeling - LDA

[Plate diagram: concentration parameter a → topics in doc; a topic is drawn for each word in the doc; concentration parameter b → words in topics; the word itself is observed]

Collaborative Topic Modeling

[Plate diagram: the LDA content side (topic concentration, topics in doc, topic for each word, word in doc, K topics) joined with the collaborative side (weight of doc topics in each user's selections, variation in per-user topics, the user's rating of the doc)]

[Chart: recommendation performance of the content-only model vs. the content + social (collaborative) model]
