
Computational Journalism

Columbia Journalism School

Week 2: Filtering Algorithms

September 15, 2017

Journalism as a cycle

[Diagram: journalism as a cycle: Effects/Data → Reporting → Filtering → User, with computer science (CS) applicable at each stage. A second diagram marks stories (x) that are never covered or are filtered out before reaching the user.]

Each day, the Associated Press publishes:

~10,000 text stories

~3,000 photographs

~500 videos

+ radio, interactive…

More video is uploaded to YouTube than was produced by TV networks during the entire 20th century.

Google now indexes more web pages than there are people in the world.

400,000,000 tweets per day

An estimated 130,000,000 books ever published

10,000 legally-required reports filed by U.S. public companies every day

All New York Times articles ever = 0.06 terabytes (13 million stories, ~5 KB per story)

It’s not information overload, it’s filter failure

- Clay Shirky

This class

Filtering by content

Newsblaster

Topic models

Filtering by user interaction

Reddit comment ranking

User-item recommendation

Both together:

Collaborative topic models

Newsblaster

System Description

[Pipeline diagram: Scrape → Cluster events → Cluster topics → Summarize]

Scrape

Handcrafted list of source URLs (news front pages), with links followed to depth 4.

Then extract the text of each article.

Text extraction from HTML

Ideal world: HTML5 “article” tags

“The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication.”

- W3C Specification

Text extraction from HTML

Newsblaster paper:

“For each page examined, if the amount of text in the largest cell of the page (after stripping tags and links) is greater than some particular constant (currently 512 characters), it is assumed to be a news article, and this text is extracted.”

(At least it’s simple. This was 2002. How often does this work now?)

Text extraction from HTML

There are now multiple services/APIs to do this, e.g. readability.com.
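The largest-cell heuristic from the Newsblaster quote can be sketched with Python's stdlib `html.parser`. The 512-character threshold is from the paper; the choice of which tags count as "cells", and the omission of link-stripping, are simplifications of mine:

```python
from html.parser import HTMLParser

class BlockTextExtractor(HTMLParser):
    """Collects the text contained in each table cell / block element."""
    BLOCK_TAGS = ("td", "div", "article")

    def __init__(self):
        super().__init__()
        self.blocks = []     # text of each closed block
        self._stack = []     # buffers for currently open blocks

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._stack:
            self.blocks.append(" ".join(self._stack.pop()))

    def handle_data(self, data):
        text = data.strip()
        if text:
            for buf in self._stack:  # text counts toward every enclosing block
                buf.append(text)

def extract_article(html, min_chars=512):
    """Largest-cell heuristic: return the biggest text block if it is
    long enough to look like an article, else None."""
    parser = BlockTextExtractor()
    parser.feed(html)
    if not parser.blocks:
        return None
    best = max(parser.blocks, key=len)
    return best if len(best) >= min_chars else None
```

On a modern, div-heavy page this rarely isolates the article cleanly, which is exactly why dedicated extraction services exist.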

Cluster Events

Cluster Events

Surprise!

• encode articles into feature vectors

• cosine distance function

• hierarchical clustering algorithm

Different clustering algorithms

• Partitioning

o keep adjusting clusters until convergence

o e.g. K-means

• Agglomerative hierarchical

o start with leaves, repeatedly merge clusters

o e.g. MIN and MAX approaches

• Divisive hierarchical

o start with root, repeatedly split clusters

o e.g. binary split

But news is an on-line problem...

Articles arrive one at a time, and must be clustered immediately.

Can’t look forward in time, can’t go back and reassign.

Greedy algorithm.

Single pass clustering

put first story in its own cluster
repeat:
    get next story S
    look for a cluster C with distance(S, C) < T
    if found:
        put S in C
    else:
        put S in a new cluster
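The single-pass pseudocode might be sketched like this, assuming numpy feature vectors, cosine distance, and a mean-of-members centroid update (the pseudocode doesn't specify how the cluster is represented):

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def single_pass_cluster(stories, threshold=0.5):
    """Greedy on-line clustering: each arriving story joins the nearest
    cluster whose centroid is within `threshold`, else starts a new one."""
    centroids, clusters = [], []
    for idx, story in enumerate(stories):
        if centroids:
            dists = [cosine_distance(story, c) for c in centroids]
            best = int(np.argmin(dists))
            if dists[best] < threshold:
                clusters[best].append(idx)
                # recompute the centroid as the mean of its members
                centroids[best] = np.mean([stories[j] for j in clusters[best]], axis=0)
                continue
        clusters.append([idx])
        centroids.append(np.asarray(story, dtype=float))
    return clusters
```

Because it is greedy, the result depends on arrival order: an early borderline story can seed a cluster that later stories get pulled into.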

Now sort events into categories

Categories:

U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again

Each category has a pre-assigned TF-IDF coordinate.

Story category = closest point.

[Diagram: the latest story’s vector lies between the “world” category point and the “finance” category point; it is assigned to the closer one.]
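The closest-point assignment is a few lines of code; a sketch using cosine similarity, with illustrative category names and vectors (the real system's category coordinates come from training data):

```python
import numpy as np

def nearest_category(story_vec, category_vecs):
    """Assign the story to the category whose TF-IDF point has the
    highest cosine similarity to the story vector."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = {name: cosine(story_vec, vec) for name, vec in category_vecs.items()}
    return max(sims, key=sims.get)
```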

Cluster summarization

Problem: given a set of documents, write a sentence summarizing them.

Active research area. See references in the Newsblaster paper, and more recent techniques.

Is Newsblaster really a filter?

After all, it shows “all” the news...

Differences with Google News?

Topic Modeling – Matrix Techniques

Problem Statement

Can the computer tell us the “topics” in a document set?

Can the computer organize the documents by “topic”?

Note: TF-IDF tells us the topics of a single document, but here we want topics of an entire document set.

Simplest possible technique

Sum TF-IDF scores for each word across the entire document set, then choose the top-ranking words.

This is how Overview generates cluster descriptions.
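A minimal sketch of this summed-TF-IDF technique, using raw counts for tf and log(N/df) for idf (other weighting schemes are common):

```python
import math
from collections import Counter

def top_words(docs, k=5):
    """Score each word by its TF-IDF summed over the whole document
    set, and return the k top-ranking words."""
    n = len(docs)
    df = Counter()                    # document frequency of each word
    for d in docs:
        df.update(set(d))
    scores = Counter()
    for d in docs:
        for w, c in Counter(d).items():
            # words appearing in every doc get idf = log(1) = 0
            scores[w] += c * math.log(n / df[w])
    return [w for w, _ in scores.most_common(k)]
```

Note that a word occurring in every document scores zero, so this naturally suppresses corpus-wide stopwords.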

Topic Modeling Algorithms

Basic idea: reduce the dimensionality of the document vector space, so each dimension is a topic.

Each document is then a vector of topic weights. We want to figure out what dimensions and weights give a good approximation of the full set of words in each document.

Many variants: LSI, PLSI, LDA, NMF.

Matrix Factorization

Approximate the term-document matrix V as the product of two lower-rank matrices:

V ≈ W H

where V is m docs × n terms, W is m docs × r “topics”, and H is r “topics” × n terms.

Matrix Factorization

A "topic" is a group of words that occur together.

[Diagram: a row of H is the words in this topic; a row of W is the topics in this document.]

Non-negative Matrix Factorization

All elements of the document coordinate matrix W and the topic matrix H must be ≥ 0.

Simple iterative algorithm to compute.

Still have to choose the number of topics r.
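The standard simple iterative algorithm here is the Lee and Seung multiplicative update rule; a minimal numpy sketch (random initialization and the fixed iteration count are my assumptions):

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9):
    """Lee and Seung multiplicative updates for V ≈ W @ H,
    keeping every entry of W and H non-negative."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update topic/word matrix
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update doc coordinates
    return W, H
```

Because the updates only ever multiply by non-negative ratios, W and H stay non-negative without any explicit projection step.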

Probabilistic Topic Modeling

Latent Dirichlet Allocation

Imagine that each document is written by someone going through the following process:

1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

A document has a distribution over topics.
Each topic is a distribution over words.
LDA tries to find these two sets of distributions.
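The generative story can be simulated directly; a sketch for one document, where `topic_mix` is p(z|d) and each row of `topic_word` is p(w|z) (the function name and arguments are mine):

```python
import numpy as np

def generate_doc(topic_mix, topic_word, length, rng):
    """Run LDA's generative story for one document: for each word
    position, draw a topic z from p(z|d), then a word from p(w|z)."""
    words = []
    for _ in range(length):
        z = rng.choice(len(topic_mix), p=topic_mix)
        w = rng.choice(topic_word.shape[1], p=topic_word[z])
        words.append(int(w))
    return words
```

Inference runs this story in reverse: given only the words, recover plausible topic mixtures and topic-word distributions.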

"Documents"

LDA models each document as a distribution over topics. Each word belongs to a single topic.

"Topics"

LDA models a topic as a distribution over all the words in the corpus. In each topic, some words are more likely, some are less likely.

LDA Plate Notation

[Plate diagram: a concentration parameter governs the topic mixture of each of the D docs; for each of the N words in a doc there is a topic assignment and an observed word; a second concentration parameter governs the word distribution of each of the K topics.]

Computing LDA

Inputs:

word[d][i]  document words
k           number of topics
a           doc-topic concentration
b           topic-word concentration

Also:

n           number of docs
len[d]      number of words in document d
v           vocabulary size

Computing LDA

Outputs:

topics[d][i]        topic assignment for each word
topic_words[k][v]   topic word distributions
doc_topics[n][k]    document topic distributions

topics -> topic_words

topic_words[*][*] = b
for d = 1..n
    for i = 1..len[d]
        topic_words[topics[d][i]][word[d][i]] += 1
for j = 1..k
    normalize topic_words[j]

topics -> doc_topics

doc_topics[*][*] = a
for d = 1..n
    for i = 1..len[d]
        doc_topics[d][topics[d][i]] += 1
for d = 1..n
    normalize doc_topics[d]

Update topics

// for each word in each document, sample a new topic
for d = 1..n
    for i = 1..len[d]
        w = word[d][i]
        for t = 1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]
        topics[d][i] = sample from normalized p
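The three pseudocode steps can be combined into one loop. A compact Python sketch; note this simplified sampler rebuilds the count matrices once per sweep and does not subtract the current word's own count, as true collapsed Gibbs sampling would:

```python
import numpy as np

def lda_sample(word, k, v, a=0.1, b=0.01, iters=50, seed=0):
    """Compact sketch of the sampler in the pseudocode: alternately
    (1) rebuild topic_words and doc_topics from the current topic
    assignments, then (2) resample a topic for every word."""
    rng = np.random.default_rng(seed)
    n = len(word)
    # random initial topic assignment for every word in every doc
    topics = [rng.integers(0, k, size=len(d)) for d in word]
    for _ in range(iters):
        topic_words = np.full((k, v), b, dtype=float)   # smoothed counts
        doc_topics = np.full((n, k), a, dtype=float)
        for d in range(n):
            for i, w in enumerate(word[d]):
                topic_words[topics[d][i], w] += 1
                doc_topics[d, topics[d][i]] += 1
        topic_words /= topic_words.sum(axis=1, keepdims=True)
        doc_topics /= doc_topics.sum(axis=1, keepdims=True)
        for d in range(n):
            for i, w in enumerate(word[d]):
                p = doc_topics[d] * topic_words[:, w]
                topics[d][i] = rng.choice(k, p=p / p.sum())
    return doc_topics, topic_words
```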

Dimensionality reduction

Output of NMF and LDA is a vector of much lower dimension for each document (“document coordinates in topic space”).

Dimensions are “concepts” or “topics” instead of words.

Can measure cosine distance, cluster, etc. in this new space.

Comment Ranking

Filtering Comments

Thousands of comments, what are the “good” ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn’t work. Why?

Reddit Comment Ranking (old)

Upvotes minus downvotes, plus time decay.
Reddit Comment Ranking (new)

N = 16
v = 11
p = 11/16 = 0.6875

Hypothetically, suppose all N users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.
Reddit Comment Ranking

n = 3
v′ = 1
p′ = 1/3 = 0.333

Actually, only n users out of N vote, giving an observed approximate proportion p′ = v′/n.

Reddit Comment Ranking

Comment 1: p′ = 0.75, true p = 0.1875
Comment 2: p′ = 0.333, true p = 0.6875

Limited sampling can rank comments wrong when we don’t have enough data.

Random error in sampling

If we observe proportion p′ of upvotes from n random users, what is the distribution of the true proportion p?

[Plot: distribution of p′ when p = 0.5]

Confidence interval

With probability 1−𝛼, the true value p will lie within the central region (when sampled assuming p = p′).

Rank comments by the lower bound of the confidence interval

Analytic solution for the confidence interval, known as the “Wilson score”:

p′ = observed proportion of upvotes
n = how many people voted
z_α = how certain we want to be before we assume that p′ is “close” to the true p
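The Wilson lower bound has a closed form: with z chosen for the desired confidence (1.96 for 95%), it is (p' + z^2/(2n) - z*sqrt(p'(1-p')/n + z^2/(4n^2))) / (1 + z^2/n). A sketch:

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    proportion, given `upvotes` out of `n` total votes."""
    if n == 0:
        return 0.0
    p = upvotes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - spread) / denom
```

Ranking by this bound keeps a comment with a high observed proportion but very few votes below a comment with a slightly lower proportion and many votes, which is exactly the behavior the sampling-error examples call for.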

User-item Recommendation

User-item matrix

Stores the “rating” of each user for each item. Could also be a binary variable that says whether the user clicked, liked, starred, shared, purchased...

User-item matrix

• No content analysis. We know nothing about what is “in” each item.

• Typically very sparse: a user hasn’t watched even 1% of all movies.

• The filtering problem is guessing an “unknown” entry in the matrix. High guessed values are things the user would want to see.

Filtering process

How to guess unknown rating?

Basic idea: suggest “similar” items.

Similar items are rated in a similar way by many different users.

Remember, a “rating” could be a click, a like, a purchase:

o “Users who bought A also bought B...”

o “Users who clicked A also clicked B...”

o “Users who shared A also shared B...”

Similar items

Item similarity

Cosine similarity!

Other distance measures

“adjusted cosine similarity”

Subtracts each user’s average rating, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”).

Generating a recommendation

Predicted rating: average of the user’s ratings of similar items, weighted by item similarity.
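Item similarity and the weighted-average prediction can be sketched together. Here `R` is a dense user-item array with a boolean `mask` of observed ratings; a real system would use sparse matrices and precomputed similarities:

```python
import numpy as np

def adjusted_cosine(R, mask, i, j):
    """Adjusted cosine similarity between items i and j: subtract each
    user's mean rating, then take cosine over users who rated both."""
    means = np.array([R[u, mask[u]].mean() for u in range(R.shape[0])])
    both = mask[:, i] & mask[:, j]
    if not both.any():
        return 0.0
    a = R[both, i] - means[both]
    b = R[both, j] - means[both]
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def predict(R, mask, u, i):
    """Predict user u's rating of item i: average of u's known ratings,
    weighted by each rated item's similarity to item i."""
    num = den = 0.0
    for j in range(R.shape[1]):
        if j != i and mask[u, j]:
            s = adjusted_cosine(R, mask, i, j)
            num += s * R[u, j]
            den += abs(s)
    return num / den if den else 0.0
```

Dividing by the sum of absolute similarities keeps the prediction on the rating scale even when some similarities are negative.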

Matrix factorization recommender

Matrix factorization recommender

Note: only sum over the observed ratings r_ij.
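A common way to fit such a factorization over observed ratings only is stochastic gradient descent on the squared error plus an L2 penalty; a sketch (the rank, learning rate, and regularization values are illustrative, not from the slides):

```python
import numpy as np

def factorize(ratings, n_users, n_items, r=2, lam=0.01, lr=0.05, epochs=200, seed=0):
    """Fit user vectors U and item vectors V by SGD, summing squared
    error over the observed (user, item, rating) triples only."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, r))
    V = 0.1 * rng.standard_normal((n_items, r))
    for _ in range(epochs):
        for u, i, val in ratings:
            err = val - U[u] @ V[i]              # error on this observed cell
            U[u] += lr * (err * V[i] - lam * U[u])
            V[i] += lr * (err * U[u] - lam * V[i])
    return U, V
```

After fitting, the predicted rating for any (user, item) pair, observed or not, is just the dot product of their vectors.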

Matrix factorization plate model

[Plate diagram: item topic vectors v for j items (with variation parameter λv), user topic vectors u for i users (with variation parameter λu), and the observed rating r of each item by each user.]

New York Times recommender

Combining collaborative filtering and topic modeling

Content modeling - LDA

[Plate diagram: a concentration parameter governs the topic mixture of each of the D docs; for each of the N words in a doc there is a topic assignment and an observed word; a second concentration parameter governs the word distribution of each of the K topics.]

Collaborative Topic Modeling

[Plate diagram: collaborative topic modeling. The content side is the LDA model (topic concentration, topics in doc, topic for each word, words in doc, K topics); the collaborative side adds per-user topic vectors (with their own variation parameter) and the user’s observed selections/ratings of docs. The doc’s topic weights can be fit by content only, or by content + social signals together.]


