
Frontiers of Computational Journalism
Columbia Journalism School Week 5: Hybrid Filters October 2, 2013

Week 5: Hybrid Filtering


Filtering Comments by Voting
User-item recommendation systems
General Hybrid Filters

Filtering Comments

Thousands of comments, what are the good ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn't work. Why?

Reddit Comment Ranking


N = 16, v = 11, p = 11/16 = 0.6875

Hypothetically, suppose all N users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.

Reddit Comment Ranking


n = 3, v = 1, p̂ = 1/3 = 0.333

Actually, only n users out of N vote, giving an observed approximate proportion p̂ = v/n.

Reddit Comment Ranking


Observed p̂ = 0.75, true p = 0.1875

Observed p̂ = 0.333, true p = 0.6875

Limited sampling can rank comments in the wrong order when we don't have enough data.
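How often does a small sample get the order wrong? A minimal sketch, assuming each comment is shown to the same number n of independent random voters and reusing the two true proportions from the slide (0.6875 and 0.1875). The function names are illustrative, not from the lecture.

```python
from math import comb

def binomial_pmf(n, p):
    """Probability of each upvote count 0..n when n voters each upvote with probability p."""
    return [comb(n, v) * p**v * (1 - p)**(n - v) for v in range(n + 1)]

def misranking_probability(p_good, p_bad, n):
    """Chance that the worse comment collects strictly more upvotes than the
    better one when each is shown to n independent random voters."""
    good = binomial_pmf(n, p_good)
    bad = binomial_pmf(n, p_bad)
    return sum(good[vg] * bad[vb]
               for vg in range(n + 1)
               for vb in range(vg + 1, n + 1))

# The chance of a wrong ordering shrinks as more people vote.
for n in (3, 10, 50):
    print(n, round(misranking_probability(0.6875, 0.1875, n), 4))
```

With only three voters the wrong order shows up a few percent of the time; with fifty it is vanishingly rare.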

Random error in sampling


If we observe a proportion p̂ of upvotes from n random users, what is the distribution of the true proportion p?

Distribution of p̂ when p = 0.5
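The plot for this slide did not survive extraction. As a stand-in, here is a rough sketch of the sampling distribution of the observed proportion, assuming votes are independent and each voter upvotes with probability 0.5. The choice of n = 10 voters is arbitrary and only for illustration.

```python
from math import comb

def observed_proportion_distribution(n, p):
    """Map each possible observed proportion v/n to its probability,
    assuming n independent voters who each upvote with probability p."""
    return {v / n: comb(n, v) * p**v * (1 - p)**(n - v) for v in range(n + 1)}

dist = observed_proportion_distribution(n=10, p=0.5)

# Crude text histogram: one '#' per percentage point of probability.
for proportion, prob in dist.items():
    print(f"{proportion:.1f}  {prob:.3f}  {'#' * round(prob * 100)}")
```

The observed proportion is spread around 0.5 rather than sitting exactly on it, which is why a single small sample can mislead.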

Confidence interval
Given the observed p̂, an interval that the true p lies inside with a chosen probability.

Rank comments by the lower bound of the confidence interval


Analytic solution for the confidence interval, known as the Wilson score interval

p̂ = observed proportion of upvotes
n = how many people voted
z = how certain we want to be before we assume that p̂ is close to the true p
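The formula image is missing from this extraction. The standard Wilson score lower bound can be sketched as follows, using the three quantities defined above. The default z = 1.96 (a 95% two-sided interval) and the vote counts for comment C are assumptions made for illustration. The counts for A and B echo the numbers on the earlier slides.

```python
from math import sqrt

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote proportion.

    lower = (p_hat + z^2/(2n) - z*sqrt(p_hat*(1-p_hat)/n + z^2/(4n^2))) / (1 + z^2/n)
    """
    if n == 0:
        return 0.0  # no votes yet: rank at the bottom
    p_hat = upvotes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)

# (upvotes, total votes) per comment; C is a made-up example.
comments = {"A": (11, 16), "B": (1, 3), "C": (3, 4)}
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]), reverse=True)
print(ranked)
```

Comment C has the highest raw proportion (0.75) but only four votes, so its lower bound falls below that of A, which has a lower proportion backed by more votes.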

Week 5: Hybrid Filtering


Filtering Comments by Voting
User-item recommendation systems
General Hybrid Filters

User-item matrix

Stores the rating of each user for each item. Could also be a binary variable that says whether the user clicked, liked, starred, shared, purchased...

User-item matrix
No content analysis. We know nothing about what is in each item.
Typically very sparse: a user hasn't watched even 1% of all movies.
The filtering problem is guessing an unknown entry in the matrix.
High guessed values are things the user would want to see.
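One way to picture the sparse matrix is a dictionary that stores only the ratings that exist. This is a minimal sketch with made-up users, items, and ratings. The helper names are hypothetical.

```python
# Hypothetical ratings: only observed (user, item) pairs are stored.
ratings = {
    "alice": {"movie_a": 5, "movie_b": 3},
    "bob": {"movie_a": 4, "movie_c": 1},
    "carol": {"movie_b": 2},
}
items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

def unknown_entries(ratings, items):
    """List the (user, item) pairs with no rating: the entries a filter must guess."""
    return [(user, item)
            for user in sorted(ratings)
            for item in items
            if item not in ratings[user]]

def density(ratings, items):
    """Fraction of the user-item matrix that is actually filled in."""
    known = sum(len(user_ratings) for user_ratings in ratings.values())
    return known / (len(ratings) * len(items))

print(unknown_entries(ratings, items))
print(density(ratings, items))
```

In a real system the density is far lower than in this toy example, which is why the matrix is stored sparsely.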

Filtering process

How to guess an unknown rating?


Basic idea: suggest similar items.
Similar items are rated in a similar way by many different users.
Remember, a rating could be a click, a like, a purchase.
Users who bought A also bought B...
Users who clicked A also clicked B...
Users who shared A also shared B...

Similar items

Item similarity
Cosine similarity!
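Here the vectors being compared are item columns of the user-item matrix, with one component per user. A minimal sketch, assuming a missing rating counts as 0. The data and function name are illustrative.

```python
from math import sqrt

def cosine_similarity(item_a, item_b):
    """Cosine of the angle between two item columns.

    Each item is a dict mapping user -> rating; users who have not rated
    the item are treated as 0 (an assumption of this sketch).
    """
    shared_users = item_a.keys() & item_b.keys()
    dot = sum(item_a[u] * item_b[u] for u in shared_users)
    norm_a = sqrt(sum(r * r for r in item_a.values()))
    norm_b = sqrt(sum(r * r for r in item_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item with no ratings is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical item columns of a user-item matrix.
movie_a = {"alice": 5, "bob": 4}
movie_b = {"alice": 4, "bob": 5, "carol": 1}
movie_c = {"carol": 5}

print(cosine_similarity(movie_a, movie_b))  # rated alike by the same users
print(cosine_similarity(movie_a, movie_c))  # no shared raters
```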

Other distance measures


adjusted cosine similarity

Subtracts the average rating for each user, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great")
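A rough sketch of the adjusted variant, assuming the sum runs over users who rated both items and that each user's mean is taken over all of that user's ratings. The two users below are invented to show a harsh critic and an enthusiast with the same relative preferences.

```python
from math import sqrt

def adjusted_cosine(ratings, item_i, item_j):
    """Cosine similarity of two items after subtracting each user's mean rating.

    ratings maps user -> {item: rating}. Only users who rated both items
    contribute (an assumption of this sketch).
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dev_i = user_ratings[item_i] - mean
            dev_j = user_ratings[item_j] - mean
            num += dev_i * dev_j
            den_i += dev_i * dev_i
            den_j += dev_j * dev_j
    if den_i == 0 or den_j == 0:
        return 0.0  # no co-raters, or no variation around the users' means
    return num / sqrt(den_i * den_j)

# Hypothetical data: both users like c best and a least, on different scales.
ratings = {
    "harsh": {"a": 1, "b": 2, "c": 3},
    "keen": {"a": 3, "b": 4, "c": 5},
}
print(adjusted_cosine(ratings, "a", "c"))
```

After mean-centering, both users place a below their average and c above it, so the two items come out as opposites even though the raw rating vectors point in nearly the same direction.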

Generating a recommendation

Weighted average of the user's item ratings, weighted by each item's similarity to the target item.
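The weighted average can be sketched like this, assuming item similarities have already been computed and normalising by the sum of absolute similarities. The item names and numbers are made up.

```python
def predict_rating(user_ratings, similarities):
    """Guess a user's rating for a target item.

    user_ratings: item -> rating the user has already given.
    similarities: item -> similarity between that item and the target item.
    Returns None when the user has rated nothing comparable.
    """
    rated = [item for item in user_ratings if item in similarities]
    weight_total = sum(abs(similarities[item]) for item in rated)
    if weight_total == 0:
        return None
    weighted_sum = sum(similarities[item] * user_ratings[item] for item in rated)
    return weighted_sum / weight_total

# Hypothetical: the target item is very similar to a and barely similar to b.
user_ratings = {"a": 5, "b": 1}
similarities = {"a": 0.9, "b": 0.1}
print(predict_rating(user_ratings, similarities))
```

The guess lands close to the user's rating for the most similar item, and items with high guesses are the ones to recommend.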

Week 5: Hybrid Filtering


Filtering Comments by Voting
User-item recommendation systems
General Hybrid Filters

Different Filtering Systems


Pure algorithmic: Newsblaster analyzes the topics in the documents. No concept of users.
Pure social: What I see on Twitter is determined by who I follow. No content analysis.
Hybrid: Reddit comments are filtered by an algorithm that takes votes as input.
Hybrid: Items are recommended based on co-consumption by all users.
What else is possible?

Item Content

My Data

Other Users' Data

what I've read/liked

Text analysis, topic modeling, clustering...

who I follow

social network structure, other users' likes

How to evaluate/optimize the filter?


Netflix: try to predict the rating that the user gives a movie after watching it.
Amazon: sell more stuff.
Google web search: human raters, A/B test every change.
