Personalising Alerts with Bayesian Feed Filtering
Author: Phil Barker, Heriot Watt University; http://www.icbl.hw.ac.uk/~philb/
Contributors: Lisa Rogers, Santiago Chumbe
Date: 22 April 2009
Summary
This is the proposal for a project funded under the JISC Rapid Innovation programme of June-Dec 2009; see http://www.icbl.hw.ac.uk/bayesff/ . We propose to develop and investigate the performance of a tool that will aggregate and filter a range of RSS and ATOM feeds selected by a user. The algorithm used for the filtering is similar to that used to identify spam in many email filters, only in this case it will be "trained" to identify items that are interesting and should be highlighted, not those that should be junked. An important element of the project is investigating whether the filtering is effective enough to be helpful to users (specifically, in this case, researchers looking at journal tables of contents for interesting newly-published papers) and disseminating information about the potential of this approach within the JISC community.
Licence
This work is licensed under the Creative Commons Attribution-Non-Commercial-No Derivative Works 2.5 UK: Scotland License. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/2.5/scotland/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California 94105, USA.
 
1 Fit to Programme Objectives and Value to Community
1. We propose a pilot implementation and evaluation of a personalisable feed aggregator with the aim of demonstrating how the technique of Bayesian filtering can be applied to RSS/ATOM feeds.
2. This proposal addresses the potential information overload created by the explosion of alerting services based on RSS/ATOM feeds [1]. The problem is quite general: it applies to any area where a user might want to monitor alerts from a wide range of sources in the knowledge that many of the items in the feeds will be irrelevant. Anyone who has subscribed to dozens of seemingly relevant feeds only to find that they are presented with more items than they can scan is familiar with this problem. Examples include:
• library staff who want to monitor new books from a wide range of publishers, but who are only interested in those related to specific disciplines;
• students who want to monitor feeds from a range of resource providers, but who are only interested in resources that relate to the course they are taking;
• research staff who want to monitor research findings and opportunities from a wide range of sources but who are only interested in a specific research field.
3. We propose an approach which we believe is as general as the problem: it should be applicable to all the examples above and any others that occur. However, in order to test the solution we focus on a specific instance of this problem, that of researchers who monitor table of contents feeds from a range of academic journals but who are only interested in items that relate to their specific research topic. The advantages of choosing this particular manifestation of the problem are that 1) the ticTOCs service provides a ready source of feeds; 2) the affected community is readily at hand; and 3) the underlying problem is well known to those in research oriented universities. Thus source data, a community to engage with, and the attention and understanding of that community should all be available.
4. The approach to solving this problem that the project will take is first to aggregate all the source feeds and then to use naïve Bayesian filtering to select those items that match the user's interests. Naïve Bayesian filtering is the technique that underlies most modern spam filters [2]. It employs a combination of term extraction, machine learning and term matching to achieve automatic classification, typically into two categories [3], of which one is presented to the user and the other is marked as not interesting. To start using Bayesian filtering it is first necessary to train the system, that is, to manually classify a representative selection of text as interesting / not interesting. The frequencies of occurrence of words in the text of the two categories are then calculated. The assumption is that some words will occur more in one category than the other: for spam filtering, words like "Viagra" and "replica" will occur more frequently in spam messages; for our purposes, words that are characteristic of the research interest of the user will occur more in the items in which they are interested than in those in which they are not interested. It is then possible, when given a new text, to calculate the probability that it is interesting (or not) based on the observed frequency of occurrence of these words in the new text (the mathematical basis for this calculation is Bayes' Theorem, hence "Bayesian Filtering").
5. The use of Bayesian filtering has the advantage that the basic algorithm is available in many implementations, albeit mostly tied to email spam filtering. Should the pilot investigation proposed here prove successful, we envisage that it will be possible to adapt implementations of Bayesian filters to enhance many JISC services and tools.
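The train-then-classify procedure described in paragraph 4 can be illustrated with a minimal sketch in Python. This is not the filter used by sux0r or any other implementation named in this proposal; the class name NaiveBayesFilter, the tokenizer, the add-one smoothing and the sample titles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical two-category naive Bayes filter with add-one smoothing."""

    def __init__(self, categories=("interesting", "not interesting")):
        self.categories = categories
        self.word_counts = {c: Counter() for c in categories}
        self.doc_counts = Counter()

    def train(self, text, category):
        """Record the words of a manually classified item."""
        self.word_counts[category].update(tokenize(text))
        self.doc_counts[category] += 1

    def probabilities(self, text):
        """Return the estimated probability of each category for a new item."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        vocab_size = len(vocabulary) + 1  # one extra slot for unseen words
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.categories:
            total_words = sum(self.word_counts[c].values())
            # Smoothed prior: how often items fall into this category.
            score = math.log((self.doc_counts[c] + 1) / (total_docs + len(self.categories)))
            # Naive assumption: words contribute independently given the category.
            for word in tokenize(text):
                score += math.log((self.word_counts[c][word] + 1) / (total_words + vocab_size))
            log_scores[c] = score
        # Bayes' Theorem: normalise so the category probabilities sum to one.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


if __name__ == "__main__":
    feed_filter = NaiveBayesFilter()
    # Invented training items standing in for journal article titles.
    feed_filter.train("Bayesian filtering of RSS feeds for personalised alerts", "interesting")
    feed_filter.train("Machine learning for text classification", "interesting")
    feed_filter.train("Crystal structure of copper oxide", "not interesting")
    feed_filter.train("Thermal properties of copper alloys", "not interesting")
    print(feed_filter.probabilities("Bayesian text classification of feeds"))
```

Working in log probabilities avoids numerical underflow when an item contains many words, and the smoothing keeps a word that has never been seen in one category from forcing that category's probability to zero.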
[1] See for example Hardesty & Sugarman (2007) "Academic Librarians, Professional Literature and New technologies: A survey", Journal of Academic Librarianship 22 (2) p196-205. See also Clay Shirky "It's not information overload, it's filter failure", http://web2expo.blip.tv/file/1277460/ for the need for information filters as a result.
[2] Widespread adoption dates from Paul Graham's "A plan for spam" (2002); Bayesian filtering is the basis for spam filtering by Firefox Junk mail controls, Spam Assassin, DSPAM and many others. See http://en.wikipedia.org/wiki/Bayesian_spam_filtering
[3] It is possible to have more than one category, for example "interesting" could be broken down into different areas of interest; however, more training is required to achieve satisfactory results as the number of categories increases.
6. The idea of applying Bayesian algorithms to feed filtering is not new: a Perl implementation was described in 2004 [4], and there exist commercial web start-ups implementing the idea, e.g. FilteredRSS, Feedscrub and Feedzero [5] (though the first of these appears to be defunct, the second is in invite-only beta testing, and the website for the third occasionally doesn't work). There are also a handful of open source projects implementing Bayesian filtering for RSS feeds, e.g. AmphetaRate, Feedisto and sux0r [6]. The commercial offerings don't seem to fit well with the JISC information environment: we would like to see a service that from the outset can be used remotely via an API, whereas they will want to drive traffic to their site; we would want to allow users to access the data used to filter their feeds (i.e. the information on which terms characterised the items they were interested in) so that the same information could potentially be used by other feed filters, whereas they are likely to guard this information as part of their commercial interest. For these reasons we shall use one of the open source implementations, probably sux0r, since an initial evaluation indicates that it provides the functionality we would need, is under active development, and is written in a language our developer understands.
2 Workplan
2.1 Aims
7. (1) To test the potential of Bayesian filtering of RSS and ATOM feeds for providing a personalised alerting service; and (2) should the filtering be shown to work, to raise awareness of the potential of this approach among the JISC community (developers, service managers, policy makers).
2.2 Work package 1: Technical development
8. Objective. To develop a demonstrator service that can be used by an individual to aggregate selected RSS and ATOM feeds and which, when provided with sufficient information concerning the user's interests, will use a naïve Bayesian filtering algorithm to indicate which new items from the feeds being aggregated are likely to be of interest to the user.
9. Deliverable. Open source software and a demonstrator service for aggregating and filtering feeds, with an open API, and the ability for users to import and export information about the feeds being aggregated (i.e. OPML files) and the information inferred about their interests (i.e. the information used for the filtering, perhaps as an APML file). This software and service will be available to any user who wishes to try it.
10. Details. The demonstrator service will be built, as far as is practicable, out of existing open source software modules, for example the Bayesian filtering routine used by sux0r, and the RSS aggregator and the user interfaces from sux0r and ticTOCs. All software will be developed as open source software, i.e. using open source applications such as Apache, MySQL and PHP, with code hosted on SourceForge or Google Code, and available through an open source licence. The API is intended to allow users to interact remotely with the filtering mechanism, i.e. by indicating which items are and are not relevant to their interests. A typical use for the API would be a widget to display those items that the system suggested as of interest on a site such as iGoogle or Netvibes, and through this widget to be able to indicate any items which actually weren't of interest. Santiago Chumbe will be responsible for executing this work package.
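The import and export of feed subscriptions as OPML, mentioned under the deliverable above, might look roughly like the following Python sketch. It is illustrative only and is not code from sux0r or ticTOCs; the function names, the list title and the example.org feed URLs are hypothetical, and only the common OPML outline attributes (type, text, xmlUrl) are assumed.

```python
import xml.etree.ElementTree as ET


def feeds_to_opml(title, feeds):
    """Serialise (feed title, feed URL) pairs as an OPML 2.0 document string."""
    opml = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for feed_title, feed_url in feeds:
        # Each subscribed feed becomes one <outline> element.
        ET.SubElement(body, "outline",
                      {"type": "rss", "text": feed_title, "xmlUrl": feed_url})
    return ET.tostring(opml, encoding="unicode")


def opml_to_feeds(opml_text):
    """Read back the (feed title, feed URL) pairs from an OPML document string."""
    root = ET.fromstring(opml_text)
    return [(outline.get("text"), outline.get("xmlUrl"))
            for outline in root.iter("outline")
            if outline.get("xmlUrl")]  # skip folder outlines that carry no feed


if __name__ == "__main__":
    # Hypothetical journal table-of-contents feeds.
    subscriptions = [
        ("Journal A table of contents", "http://example.org/journal-a/toc.rss"),
        ("Journal B table of contents", "http://example.org/journal-b/toc.atom"),
    ]
    exported = feeds_to_opml("My journal feeds", subscriptions)
    print(exported)
    print(opml_to_feeds(exported))
```

Keeping the subscription list in a standard interchange format is what would let a user move the same set of feeds between this service and another aggregator.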
2.3 Work package 2: Trialling
11. Objective. To test the ability of the recommender service to identify new journal papers of interest to researchers based on a knowledge of the papers which they have recently read.
12. Deliverable. Documented trials of the recommender service with a group of researchers.
13. Details. We will guide a group of approximately 20 researchers through the use of the system, training the Bayesian filter with information about their interests. RSS feeds for the tables of contents of journals
[4] See Simon Cozens (2004) "Bayesian Analysis for RSS Reading", Doctor Dobb's. URL: http://www.ddj.com/web-development/184416095
[5] http://www.filteredrss.com/ , http://www.feedscrub.com/ and http://www.feedzero.com
[6] http://sourceforge.net/projects/amphetarate/ , http://feedisto.berlios.de/ and http://sourceforge.net/projects/sux0r/