 
STATS 252 - ASSIGNMENT 4
Roshan Sumbaly
The first part of the homework asks us to analyze the distribution of how many URLs are shared by varying numbers of users (who have tagged ‘weigend.com’). After some simple URL manipulation (i.e. replacing ‘http://www.’ with ‘http://’) and using the results of 100 users (the maximum that can be drawn at once from del.icio.us), we get the distribution shown below. The X-axis is the number of users, while the Y-axis is the count of URLs (in log scale). For example, a value of 2 : 72 (X:Y) means there were 72 links which were bookmarked by 2 users.
48 - http://weigend.com/
5 - http://stanford2008.wikispaces.com/
5 - http://go2web20.net/
3 - http://hotjobs.yahoo.com/
3 - http://proquest.safaribooksonline.com/9780596529321?tocview=true
3 - http://datawrangling.com/
3 - http://facebook.com/
3 - http://slashdot.org/
2 - http://danga.com/memcached/
2 - http://bus.wisc.edu/nielsencenter/blog/student-blog/
2 - http://geekaction.com/node/16
2 - http://ning.com/
2 - http://google.com/
2 - http://stanford2009.wikispaces.com/
2 - http://economist.com/
2 - http://de.amiando.com/
2 - http://pipes.yahoo.com/pipes/
2 - http://googlewatchblog.de/
2 - http://youtube.com/watch?v=NQwSRM57K5s
2 - http://captology.stanford.edu/
2 - http://money.cnn.com/magazines/fortune/bestcompanies/2008/
2 - http://zazzle.com/
2 - http://technologyreview.com/index.aspx
2 - http://42topics.com/dumps/appengine/doc.html
2 - http://research.microsoft.com/mlas/
[Figure: count of URLs (Y-axis, log scale) against number of users (X-axis). Data labels: 1 user: 1062 URLs; 2 users: 72; 3 users: 5; 5 users: 2; 48 users: 1.]
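The counting step described above can be sketched roughly as follows. This is a minimal illustration that assumes each user's bookmarks are already available as a list of URL strings; the function names and the sample data are hypothetical and are not the del.icio.us results.

```python
from collections import Counter


def normalize_url(url):
    """Replace a leading 'http://www.' with 'http://' so both forms match."""
    prefix = "http://www."
    if url.startswith(prefix):
        return "http://" + url[len(prefix):]
    return url


def users_per_url(bookmarks_by_user):
    """Map each normalized URL to the number of distinct users who tagged it."""
    counts = Counter()
    for urls in bookmarks_by_user.values():
        # A set ensures a user is counted at most once per URL.
        counts.update({normalize_url(u) for u in urls})
    return counts


def distribution(url_counts):
    """Map 'number of users' (X) to 'number of URLs with that many users' (Y)."""
    return Counter(url_counts.values())


# Hypothetical sample data for illustration only.
sample = {
    "alice": ["http://www.weigend.com/", "http://slashdot.org/"],
    "bob": ["http://weigend.com/", "http://slashdot.org/"],
    "carol": ["http://weigend.com/", "http://ning.com/"],
}
url_counts = users_per_url(sample)
dist = distribution(url_counts)
```

Here `url_counts.most_common()` would give a listing in the same "count - URL" form as above, and `dist` holds the X:Y pairs that the figure plots.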
 
By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular URLs) is biasing towards the common. When is this property desired? What are its potentially negative side effects?
We would want to favor the most popular URLs when searching for sites recently visited or used by everyone. By also giving higher weight to newer sites among these popular URLs, we can surface interesting items such as recent hot news.
The problem with this approach is that certain sites are very common and always get a high score. These include common portal sites like Yahoo, Amazon, etc. Such URLs are popular, i.e. bookmarked by many users, but are of little interest as recommendations. A common solution to this problem is tf-idf scoring on the bookmarks.
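One way the tf-idf idea could look is sketched below. It assumes we also know, for each URL, how many users in the overall population have bookmarked it; that background count, the smoothing, and all the numbers are assumptions made for illustration.

```python
import math


def idf_weighted_scores(local_counts, global_counts, total_users):
    """Score each URL by local popularity damped by global popularity.

    local_counts: URL -> number of users in the group of interest who tagged it.
    global_counts: URL -> number of users overall who tagged it (assumed known).
    total_users: size of the overall user population (assumed known).
    """
    scores = {}
    for url, tf in local_counts.items():
        df = global_counts.get(url, tf)
        # Smoothed idf: a URL bookmarked by nearly everyone gets a weight near 0.
        idf = math.log((1 + total_users) / (1 + df))
        scores[url] = tf * idf
    return scores


# Hypothetical numbers for illustration only.
local = {"http://google.com/": 2, "http://datawrangling.com/": 3}
overall = {"http://google.com/": 9000, "http://datawrangling.com/": 40}
scores = idf_weighted_scores(local, overall, total_users=10000)
```

With these made-up counts the portal-style URL ends up with a much lower score than the niche one, even though both are bookmarked by a similar number of users in the group.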
To create an algorithm for more bold recommendations, biased towards uniqueness, how would you find the most "interesting" users having tagged weigend.com? And how would you derive recommendations from their most "interesting" bookmarks? 
Following are some ways of finding “interesting” users.
Firstly, we could find users on the basis of the URLs tagged. This means we give a higher score to users who have marked the same URLs. This may seem feasible when the users are restricted to a particular domain, but as the user base grows the overlap of URLs decreases. Hence we can extend the above idea by clustering the URLs on the basis of each URL’s topic. We can then give a weight to each cluster our URLs fall into and find the overlap with other users.
Another way to find “interesting” users would be on the basis of “tags.” A similar approach can be adopted as in the case of URLs, wherein we take users having highly overlapping tags and give them higher scores compared to others with less overlap.
One final approach that we can take to find “interesting” users is on the basis of comments. These comments can give insight into what the user was thinking of when tagging a certain page. One way to use comments is to extract “keywords.” We can then use these keywords much like the tags. New advances in NLP can also help us infer what the user likes / dislikes and maybe help match users with similar likes / dislikes.
By deriving “interesting” users we are giving their bookmarked items a higher score. Hence we can set weights using the URL-User combination and, by taking a simple weighted average, get a unique score for every URL. (Similar to the approach followed in Chapter 2.4 of “Programming Collective Intelligence”)
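The weighted-average step can be sketched as below. This is only an illustration: bookmarks are treated as binary (bookmarked or not), the similarity function is passed in as a parameter, and the `overlap` helper and sample data are hypothetical stand-ins rather than the method used in the report.

```python
def recommend(target, bookmarks_by_user, similarity):
    """Rank URLs the target has not bookmarked by a similarity-weighted average.

    similarity: function taking two sets of URLs and returning a weight >= 0.
    """
    mine = bookmarks_by_user[target]
    weighted = {}
    total_weight = 0.0
    for user, urls in bookmarks_by_user.items():
        if user == target:
            continue
        w = similarity(mine, urls)
        if w <= 0:
            continue  # users with nothing in common contribute nothing
        total_weight += w
        for url in urls - mine:
            weighted[url] = weighted.get(url, 0.0) + w
    if total_weight == 0:
        return []
    ranked = [(score / total_weight, url) for url, score in weighted.items()]
    return sorted(ranked, reverse=True)


def overlap(a, b):
    """Placeholder similarity: number of shared URLs."""
    return len(a & b)


# Hypothetical users and URL identifiers.
data = {
    "me": {"u1", "u2"},
    "x": {"u1", "u2", "u3"},
    "y": {"u1", "u4"},
    "z": {"u5"},
}
ranking = recommend("me", data, overlap)
```

In this toy data, "u3" ranks above "u4" because it comes from the user who shares more bookmarks with the target, and "u5" is never recommended because its only owner has no overlap at all.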
Compare these two different approaches, explain the differences. When would you use the first algorithm, biased towards more frequently tagged URLs, when would you use the second algorithm, biased towards individuality?
Using the individuality algorithm would be more appropriate when dealing with datasets which do not change very often. We can then update the user similarity scores (i.e. scores of “interesting” users) only periodically and get our recommendations. This also favors scenarios in which the dataset being dealt with is comparatively small.
Using the frequently tagged URL algorithm would be more appropriate in scenarios where we have a larger dataset, since it is simple to keep a count of frequent URLs as and when they arrive. This would be particularly helpful in the current streaming world.
So far, you have only considered the fact that others have tagged the URL, and ignored the specific tags as well as the comments. How would you use the number of tags (maybe relative to how many the person usually uses, maybe relative to an overall distribution), how the uniqueness of tags? And how about the comments they leave?
 
As already mentioned above, we can use the tags to find similar users. We can also use “tags” directly to find URLs in the form of a URL-tag matrix. A higher overlap of my tags with the overall tags on a URL can then help me determine URLs of interest independently of the users. Similarly, comments can be used to find similar users and then to give weighted bookmark recommendations. Alternatively, we can extract keywords (or semantic groups, i.e. tuples of [Subject, Object, Predicate]) out of the comments to get the important URLs directly.
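A rough sketch of the URL-tag matrix idea follows. It assumes bookmarks arrive as (user, url, tags) records; the scoring rule (fraction of a URL's tag uses that match my tags) is just one possible choice, and the tags attached to the sample URLs are made up.

```python
from collections import defaultdict


def build_url_tag_matrix(bookmarks):
    """Build URL -> {tag: count} from (user, url, tags) records."""
    matrix = defaultdict(lambda: defaultdict(int))
    for _user, url, tags in bookmarks:
        for tag in tags:
            matrix[url][tag] += 1
    return matrix


def score_urls_by_tags(matrix, my_tags):
    """Score each URL by the fraction of its tag uses that match my tags."""
    scores = {}
    for url, tag_counts in matrix.items():
        total = sum(tag_counts.values())
        matched = sum(c for t, c in tag_counts.items() if t in my_tags)
        if matched:
            scores[url] = matched / total
    return scores


# Hypothetical records; the tags are invented for illustration.
records = [
    ("a", "http://datawrangling.com/", ["data", "python"]),
    ("b", "http://datawrangling.com/", ["data"]),
    ("c", "http://somafm.com/", ["music", "radio"]),
    ("d", "http://statmethods.net/", ["stats", "data"]),
]
matrix = build_url_tag_matrix(records)
scores = score_urls_by_tags(matrix, {"data", "stats"})
```

Because the score depends only on the tags attached to each URL, it does not require any user-to-user similarity. Weighting rarer tags more heavily would be a natural extension.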
Implementation
The first module implemented was a simple recommendation system based on URL frequency, which simply gives results on the basis of popularity. The code is implemented in “FrequencyCount.py.” The top recommended sites are given on Page 1 of the report. As expected, the target site, i.e. weigend.com, gets the highest score. Also, as explained before, this algorithm gives preference to commonly used sites like Facebook, Google, etc.
The next approach tried was to find “interesting users.” The code for this part was implemented in UserBased.py. “Similar users” were derived on the basis of URLs, i.e. a higher overlap of the same URLs corresponds to a higher similarity score. The score we used was Jaccard’s similarity, i.e. J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of URLs bookmarked by the two users being compared. This score captures the fact that when there is a perfect overlap the score will be 1, while no overlap will result in a score of 0. The top few sites recommended for me (i.e. the target user) are as follows:
http://woodgears.ca/eyeball/
http://wiki.python.org/moin/
http://socialdatarevolution.com/
http://indianfoodforever.com/bachelor-cooking/quick-vegetable-pulao.html
http://a.parsons.edu/~joseph/k2/gameoflife/
https://www.gsb.stanford.edu/library/articles/databases/db_alpha.html
http://facebook.com/home.php
http://canon.com.cn/specialsite/ds_abcbook/
http://stanford2009.wikispaces.com/
http://youtube.com/watch?v=dtFroEJN1nI
http://positivityblog.com/index.php/2008/05/09/gandhis-top-10-fundamentals-for-changing-the-world/
http://economist.com/displayStory.cfm?STORY_ID=11326549
http://wiki.python.org/moin/
http://statmodel.com/
http://statmethods.net/
http://somafm.com/
http://sfgate.com/
http://feedparser.org/docs/
http://facebook.com/socialdatarevolution
https://coursework.stanford.edu/portal
http://stat.stanford.edu/~tibs/ElemStatLearn/
http://stat.stanford.edu/~cgates/PERSI/
http://news.bbc.co.uk/
http://douban.com/
http://americanselfstorage.us/
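For reference, the Jaccard score between two users' URL sets can be written in a few lines. This is a minimal sketch and not the code from UserBased.py; returning 0 for two empty sets is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)


# Hypothetical URL sets for two users.
mine = {"http://weigend.com/", "http://slashdot.org/", "http://ning.com/"}
theirs = {"http://weigend.com/", "http://slashdot.org/", "http://zazzle.com/"}
score = jaccard(mine, theirs)  # 2 shared URLs out of 4 distinct ones
```

A user with very few bookmarks can still reach a high score if most of them coincide with the target's. The measure looks only at the proportion of overlap, not at the absolute number of shared URLs.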
As is evident from the recommended links, the first one is an outlier which gains a lot of weight due to my high similarity with the user who tagged it. Also, on observing the dataset, it was noticed that the user who was causing this problem had marked only 2 tags: the first one shown above, dating a year back, and 2nd
 