Frontiers of Computational Journalism
Columbia Journalism School
Week 5: Hybrid Filters
October 2, 2013
Filtering Comments
Comment voting
Problem: putting comments with the most votes at the top doesn't work. Why?
Hypothetically, suppose all N users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.
Actually, only n users out of N vote, giving an observed, approximate proportion p = v/n, where v now counts upvotes among those n voters.
[Figure: two comments with observed proportions p = 0.333 and p = 0.6875]
Limited sampling can rank comments wrongly when we don't have enough data.
Confidence interval
Given the observed p, an interval that the true p has a given probability of lying inside.
p = observed proportion of upvotes
n = how many people voted
z = how certain we want to be before we assume that the observed p is close to the true p
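The variables p, n and z fit the Wilson score interval, one common choice for ranking by votes. The sketch below assumes that interval; the function name `wilson_lower_bound` and the default z = 1.96 (roughly 95% confidence) are illustrative choices, not taken from the slide. The idea is to sort comments by the lower end of the interval instead of by the observed p, so a comment with few votes is not trusted too early.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower end of the Wilson score interval for the true upvote proportion.

    upvotes: number of upvotes observed (v)
    n:       number of people who voted
    z:       normal quantile for the desired confidence (assumed default 1.96)
    """
    if n == 0:
        return 0.0  # no votes: nothing supports a high ranking yet
    p = upvotes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)

# Hypothetical comments as (upvotes, total votes)
comments = {"one vote, one upvote": (1, 1), "90 of 100 upvotes": (90, 100)}
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]), reverse=True)
```

Sorting by raw p would put the single-vote comment first (p = 1.0); sorting by the lower bound puts the comment with 90 of 100 upvotes first, because its interval is much narrower.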
User-item matrix
Stores the rating of each user for each item. Could also be a binary variable that says whether the user clicked, liked, starred, shared, purchased...
User-item matrix
No content analysis. We know nothing about what is in each item.
Typically very sparse: a user hasn't watched even 1% of all movies.
The filtering problem is guessing an unknown entry in the matrix. High guessed values are things the user would want to see.
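A minimal sketch of a sparse user-item matrix, assuming it is stored as a dictionary of dictionaries so that only known ratings take up space. The user names, movie titles, and helper names (`all_items`, `density`, `unknown_entries`) are hypothetical.

```python
# Sparse user-item matrix: only ratings that exist are stored.
ratings = {
    "ann": {"Alien": 5, "Brazil": 3},
    "bob": {"Alien": 4},
    "cat": {"Casablanca": 5},
}

def all_items(ratings):
    """Every item rated by at least one user."""
    return sorted({item for user_ratings in ratings.values() for item in user_ratings})

def density(ratings):
    """Fraction of matrix cells that hold a known rating."""
    cells = len(ratings) * len(all_items(ratings))
    known = sum(len(user_ratings) for user_ratings in ratings.values())
    return known / cells if cells else 0.0

def unknown_entries(ratings):
    """The (user, item) cells that a filter would have to guess."""
    items = all_items(ratings)
    return [(user, item) for user in ratings for item in items
            if item not in ratings[user]]
```

Here 4 of the 9 cells are known; the other 5, such as ("bob", "Brazil"), are the entries the filtering step tries to predict.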
Filtering process
Similar items
Item similarity
Cosine similarity!
Subtracts the average rating for each user, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great").
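A minimal sketch of this mean-centred ("adjusted") cosine similarity between two items, assuming ratings are stored as `{user: {item: rating}}` and that only users who rated both items are counted. The function name `adjusted_cosine` and the sample data are hypothetical.

```python
import math

def adjusted_cosine(ratings, item_a, item_b):
    """Cosine similarity of two items after subtracting each user's mean rating."""
    num = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for user_ratings in ratings.values():
        if item_a not in user_ratings or item_b not in user_ratings:
            continue  # only users who rated both items contribute
        mean = sum(user_ratings.values()) / len(user_ratings)
        da = user_ratings[item_a] - mean  # how far above/below this user's norm
        db = user_ratings[item_b] - mean
        num += da * db
        norm_a += da * da
        norm_b += db * db
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no co-raters, or no variation: treat as "no evidence"
    return num / math.sqrt(norm_a * norm_b)

# Hypothetical data: bob rates everything low, but his *relative*
# preferences still agree with ann's and cat's.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 2, "B": 2, "C": 1},
    "cat": {"A": 4, "B": 5, "C": 2},
}
sim_ab = adjusted_cosine(ratings, "A", "B")
sim_ac = adjusted_cosine(ratings, "A", "C")
```

With this data, items A and B come out positively similar while A and C come out negatively similar, even though bob's raw scores are low across the board.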
Generating a recommendation
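A minimal sketch of one common way to generate the guess, assuming item-based filtering: predict the user's rating for an unseen item as the similarity-weighted average of the ratings that user gave to other items. The function name `predict_rating`, the layout of the `similarities` dictionary, and the choice to ignore non-positive similarities are all assumptions made for illustration.

```python
def predict_rating(user_ratings, similarities, target):
    """Guess a user's rating for `target` from items they already rated.

    user_ratings: {item: rating} for one user
    similarities: {(target_item, other_item): similarity}, precomputed
    Returns None when no positively similar rated item exists.
    """
    num = 0.0
    den = 0.0
    for item, rating in user_ratings.items():
        sim = similarities.get((target, item), 0.0)
        if sim <= 0:
            continue  # assumption: dissimilar items carry no vote
        num += sim * rating
        den += sim
    return num / den if den else None

# Hypothetical values: item C is very similar to A and only slightly to B.
user_ratings = {"A": 5, "B": 2}
similarities = {("C", "A"): 0.9, ("C", "B"): 0.1}
guess = predict_rating(user_ratings, similarities, "C")
```

The guess for C lands at about 4.7, close to the user's rating of A, because A dominates the weights. Items with the highest guessed values would be recommended first.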
[Diagram labels: item content; my data; who I follow]