By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular URLs) is biasing towards the common. When is this property desired? What are its potentially negative side effects?
We would want to favor the most popular URLs when looking for sites that everyone has recently visited or used. By adding the simple notion of giving higher weight to newer sites among these popular URLs, we can surface interesting articles such as recent hot news.
The problem with this approach is that certain sites are very common and always get a high score. These include common portal sites like Yahoo, Amazon, etc. These URLs are popular, i.e. bookmarked by many users, but are of less interest to everyone as recommendations. A common solution to this problem is tf-idf-style scoring on the bookmarks.
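A minimal sketch of how recency weighting and a tf-idf-style penalty could be combined. The function name `score_urls`, the `(user, url, day)` tuple layout, the half-life decay, and the plain `log(N / df)` form of the idf term are all assumptions chosen for illustration, not a prescribed method.

```python
import math
from collections import defaultdict


def score_urls(bookmarks, now, half_life_days=7.0):
    """Score URLs by recency-weighted popularity damped by an idf term.

    bookmarks: iterable of (user, url, day) tuples, where day is a number
    measured in days (hypothetical layout for this sketch).
    """
    bookmarks = list(bookmarks)
    all_users = {user for user, _, _ in bookmarks}
    users_per_url = defaultdict(set)
    recency_weight = defaultdict(float)

    for user, url, day in bookmarks:
        users_per_url[url].add(user)
        # Each bookmark counts less the older it is (exponential decay).
        recency_weight[url] += 0.5 ** ((now - day) / half_life_days)

    scores = {}
    for url, who in users_per_url.items():
        # idf is 0 for a URL that every user has bookmarked,
        # so ubiquitous portal sites drop to the bottom.
        idf = math.log(len(all_users) / len(who))
        scores[url] = recency_weight[url] * idf
    return scores


bookmarks = [
    ("a", "yahoo.com", 0), ("b", "yahoo.com", 0), ("c", "yahoo.com", 0),
    ("a", "news.example/story", 29), ("b", "news.example/story", 30),
]
scores = score_urls(bookmarks, now=30)
```

In this toy data the portal bookmarked by all three users scores zero, while the recent story bookmarked by two of them scores higher. A smoothed idf would be a reasonable alternative if a hard zero is too aggressive.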
To create an algorithm for more bold recommendations, biased towards uniqueness, how would you find the most "interesting" users having tagged weigend.com? And how would you derive recommendations from their most "interesting" bookmarks?
The following are some ways of finding “interesting” users.
Firstly, we could find users on the basis of the URLs they have tagged. This means we give a higher score to users who have marked the same URLs. This may be feasible when the users are restricted to a particular domain, but as the user base grows the overlap of URLs decreases. Hence we can extend the above idea by clustering the URLs on the basis of each URL’s topic. We can then give a weight to each cluster our URLs fall into and find the overlap with other users.
Another way to find “interesting” users would be on the basis of tags. A similar approach can be adopted as in the case of URLs, wherein we take users with highly overlapping tags and give them higher scores than others with less overlap.
One final approach to finding “interesting” users is on the basis of comments. These comments can give insight into what the user was thinking when tagging a certain page. One way to use comments is to extract keywords, which we can then use in much the same way as tags. New advances in NLP can also help us infer what the user likes or dislikes, and may help match users with similar likes and dislikes.
By deriving “interesting” users we are giving their bookmarked items a higher score. Hence we can set weights for each URL-user combination and, by taking a simple weighted average, get a unique score for every URL. (This is similar to the approach followed in Chapter 2.4 of “Programming Collective Intelligence”.)
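The tag-overlap idea and the weighted average can be sketched as follows. This is only an illustration under stated assumptions: Jaccard overlap stands in for the user similarity, bookmarks are treated as binary, and the names `jaccard`, `recommend`, `user_tags` and `user_urls` are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two tag sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, user_tags, user_urls):
    """Rank URLs the target has not bookmarked by similarity-weighted votes."""
    sims = {
        user: jaccard(user_tags[target], tags)
        for user, tags in user_tags.items()
        if user != target
    }
    total = sum(sims.values())
    if total == 0:
        return []  # nobody shares a tag with the target

    already_seen = user_urls.get(target, set())
    scores = {}
    for user, sim in sims.items():
        for url in user_urls.get(user, set()):
            if url not in already_seen:
                scores[url] = scores.get(url, 0.0) + sim

    # Normalise by total similarity so each score is a weighted average in [0, 1].
    ranked = sorted(((s / total, url) for url, s in scores.items()), reverse=True)
    return [(url, score) for score, url in ranked]


user_tags = {
    "me": {"data", "mining"},
    "u1": {"data", "mining", "stats"},
    "u2": {"cooking"},
}
user_urls = {
    "me": {"weigend.com"},
    "u1": {"weigend.com", "a.example"},
    "u2": {"b.example"},
}
recommendations = recommend("me", user_tags, user_urls)
```

The same structure would work with URL overlap or comment keywords in place of tags. Only the similarity function changes.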
Compare these two different approaches and explain the differences. When would you use the first algorithm, biased towards more frequently tagged URLs, and when would you use the second algorithm, biased towards individuality?
Using the individuality algorithm would be more appropriate when dealing with datasets which do not change very often. We can then update the user similarity scores (i.e. the scores of “interesting” users) only periodically and derive our recommendations from them. This also favors scenarios in which the dataset being dealt with is comparatively small.
Using the frequently tagged URL algorithm would be more appropriate in scenarios where we have a larger dataset, since it is simple to keep a count of frequent URLs as and when they arrive. This would be particularly helpful in today's world of streaming data.
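To illustrate why the frequency-based approach suits a stream, here is a small sketch of an incremental counter. The class name `StreamingUrlCounter` and its methods are hypothetical. The point is only that each arriving bookmark costs one cheap update, whereas user similarity scores would have to be recomputed in batches.

```python
from collections import Counter


class StreamingUrlCounter:
    """Keeps running bookmark counts as URLs arrive one at a time."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, url):
        # One constant-time update per arriving bookmark.
        self.counts[url] += 1

    def top(self, k):
        # The k most frequently bookmarked URLs seen so far.
        return self.counts.most_common(k)


counter = StreamingUrlCounter()
for url in ["a.example", "b.example", "a.example", "c.example", "a.example"]:
    counter.observe(url)
top_url = counter.top(1)
```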
So far, you have only considered the fact that others have tagged the URL, and ignored the specific tags as well as the comments. How would you use the number of tags (maybe relative to how many the person usually uses, maybe relative to an overall distribution), and how would you use the uniqueness of the tags? And how about the comments they leave?