
Active Learning for Multi-Task Adaptive Filtering

Abhay Harpale    Yiming Yang
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

aharpale@cs.cmu.edu yiming@cs.cmu.edu

Abstract
In this paper, we propose an Active Learning (AL) framework for the Multi-Task Adaptive Filtering (MTAF) problem. Specifically, we explore AL approaches to rapidly improve an MTAF system, based on Dirichlet Process priors, with minimal user/task-level feedback. The proposed AL approaches select instances for delivery with a two-fold objective: 1) improve future task-specific system performance based on feedback received on delivered instances for that task; 2) improve future overall system performance, thereby benefiting other tasks in the system, based on feedback received on delivered instances for a particular task. Current AL approaches focus only on the first objective. To satisfy both goals, we define a new scoring function called Utility Gain to estimate the perceived improvements in the task-specific and global models. In our experiments on standard benchmark datasets, we observed that global AL approaches, which additionally take into account the potential benefit of feedback to other tasks in the system, performed better than the task-specific approach that focuses only on the benefit of the current task.

1. Introduction

Adaptive Filtering (AF) (Robertson et al., 2001) systems monitor a stream of documents to filter out those that are relevant to a particular task or user. For example, a stock analyst would like to filter out all the news about stocks in his portfolio, while a technology entrepreneur might be interested in news about the latest technology startups. Each of these users can be considered a task from the AF system's perspective. AF systems model user interests based on the initial information request (e.g., a query) presented by the user and the subsequent relevance feedback provided by the user for the results presented to the user. For example, one popular AF approach involves the use of a Logistic Regression classifier (Jaakkola & Jordan, 1996) to classify documents into relevant and non-relevant categories for that particular user. The relevance feedback received from the user is used to supplement the training data to retrain the classifier for future predictions. Most research in AF systems has focused on learning each task independently of other tasks. In this paper, we will refer to such approaches as Single-Task AF (STAF) approaches. In the initial stages of learning, when the feedback from each user is limited, these approaches suffer from data sparsity, leading to weaker models and consequently poorer performance. Multi-task learning methods have shown significant success in mitigating this per-task data sparsity problem by sharing information across multiple tasks. For example, to learn the interest model of a particular stock analyst, it could be useful to identify common important features for portfolio tracking based on the feedback received from other stock analysts. Irrespective of their portfolios, all stock analysts are usually interested in common news regarding corporate announcements, balance sheets, government regulations, and day-to-day stock market indicators pertinent to their stocks. In this paper, we will refer to such approaches, which leverage information from multiple tasks, as Multi-Task Adaptive Filtering (MTAF) approaches. The future performance of an AF system relies on the feedback received on delivered items. If the system myopically focuses on delivering only (perceived) relevant documents (better immediate performance), the feedback received on these documents may not lead to the best learnt task models (in the future), thereby limiting the usefulness of such feedback for future predictions. In this paper, we present an Active Learning (AL) framework for MTAF that additionally takes into account the perceived benefit of feedback on items before making a delivery.


In the MTAF setting, the AL system has a three-fold objective: 1) deliver relevant documents; 2) feedback provided on a delivered document should maximally improve the task-specific performance in the future; 3) feedback provided on a delivered document should maximally improve the overall system (multiple tasks) in the future. Current AL approaches focus only on the second objective, thereby narrowly targeting task-specific performance improvements. In this paper, to satisfy goal 1, we choose Multi-Task Logistic Regression with Dirichlet Process priors (Blei & Jordan, 2006; Xue et al., 2007) to facilitate information sharing across tasks. For goals 2 and 3, we propose a novel AL framework based on a new scoring function called Utility Gain. This scoring function is inspired by the popular empirical risk minimization approaches in Active Learning (Melville et al., 2005). However, current empirical risk minimization approaches focus only on minimizing the risk of one task, while we develop variants that selectively focus on one task, the global model, or all tasks, as required during the various phases of learning. Consequently, our framework chooses instances that might lead to the maximal (expected) gain in performance of the system (task-specific or global) for future predictions.

Table 1. Important notation used in this paper.

Symbol                 Description
M                      Total number of tasks
m ∈ {1, ..., M}        A task
N_m                    Total number of instances for task m
d                      Dimensionality of the feature space
x_{m,n} ∈ R^d          The nth data instance for task m
y_{m,n} ∈ {0, 1}       The label of instance x_{m,n}
w_m ∈ R^d              LR parameters for task m
k ∈ {1, ..., ∞}        A group or cluster
w*_k ∈ R^d             LR parameters of the kth group
π_{m,k} ∈ [0, 1]       Group k mixing proportion for task m

2. Adaptive Filtering Preliminaries

2.1. Single-Task Adaptive Filtering

We briefly describe a popular representative STAF approach based on the Logistic Regression (LR) classifier. The LR classifier estimates the relevance of a document using the sigmoid function σ(x) = 1/(1 + e^{-x}) as follows:

    P(y = 1 | x, w) = σ(w^T x) = 1 / (1 + e^{-w^T x})    (1)

In the above equation, x is a feature vector representing a document, and w is the weight vector of regression coefficients. w is usually fit by maximum likelihood estimation on the available relevant and non-relevant training documents D for the task. In this paper, for consistency among approaches, we focus on the Bayesian Logistic Regression variant. In the Bayesian setting, the parameters w are usually drawn from a diffuse Gaussian prior distribution w ~ N(μ, Σ), to ensure regularization. For a fully Bayesian treatment, instead of using a point estimate of the parameters w, one may integrate over the posterior distribution of w to obtain a Bayesian estimate of P(y = 1 | x, D):

    P(y = 1 | x, D) = ∫_w P(y = 1 | x, w) P(w | D) dw    (2)

2.2. Utility of Adaptive Filtering

In the Adaptive Filtering setting, the performance of a system is usually measured in terms of the utility of the documents delivered by that system. For example, in the TREC Filtering track, one popular utility metric is T9U = λ_1 R + λ_0 N, where R and N are the numbers of delivered results the user considered relevant and non-relevant, respectively, and λ_1 and λ_0 are the benefit achieved and the loss incurred by the user from reading relevant and non-relevant documents, respectively. For TREC-9, λ_1 = 2 and λ_0 = -1. From the system perspective, the expected utility of delivering a document can be computed as:

    U(x | D) = Σ_{y∈{0,1}} λ_y P(y | x, D)    (3)

Instances with U(x) > t are delivered, where t is a dissemination threshold learnt via cross-validation or set to 0 (Zhang et al., 2003). In this paper, we focus on the T9U utility metric to gauge system performance, but the ideas are general and may be extended to other evaluation metrics as well.
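To make Equations 1-3 concrete, the following is a minimal sketch of the delivery decision in Python/NumPy (the authors' own implementation was in MATLAB; see Section 4.2.2). It assumes the posterior samples of w are produced elsewhere, and the function names are illustrative rather than taken from the original system.

import numpy as np

# TREC-9 T9U coefficients: benefit of reading a relevant document,
# loss incurred for a non-relevant one
LAMBDA = {1: 2.0, 0: -1.0}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_relevant(x, w_samples):
    # Eq. 2 by Monte Carlo: average sigmoid(w^T x) over S posterior samples
    # of w, where w_samples has shape (S, d) and x has shape (d,)
    return np.mean(sigmoid(w_samples @ x))

def expected_utility(x, w_samples):
    # Eq. 3: U(x|D) = sum_y lambda_y * P(y|x, D)
    p1 = p_relevant(x, w_samples)
    return LAMBDA[1] * p1 + LAMBDA[0] * (1.0 - p1)

def deliver(x, w_samples, t=0.0):
    # Deliver the document iff its expected utility exceeds the threshold t
    return expected_utility(x, w_samples) > t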

3. Our Approach
3.1. Multi-Task Adaptive Filtering

Owing to the success of Logistic Regression (LR) classifiers in Adaptive Filtering, we choose a multi-task approach based on LR classifiers. Our chosen approach is depicted graphically in Figure 1, and Table 1 lists the notation used in this paper.


Algorithm 1 MTAF Generative Process
 1: Fixed diffuse hyperpriors: τ_10, τ_20, γ_0, α_10, α_20
 2: Draw λ_j ~ Gamma(τ_10, τ_20), j = 1, ..., d
 3: Let Λ be a diagonal matrix with elements λ_j, ∀j
 4: Draw μ ~ N(0, (γ_0 Λ)^{-1})
 5: Let Σ = Λ^{-1}
 6: Draw α ~ Gamma(α_10, α_20)
 7: Draw v_k ~ Beta(1, α), k = 1, ..., ∞
 8: Draw w*_k ~ N(μ, Σ), k = 1, ..., ∞
 9: Let π_k = v_k ∏_{i=1}^{k-1} (1 - v_i), ∀k
10: for m = 1 to M do
11:   Draw z_m ~ Multinomial(1; π_1, π_2, ...)
12:   Let w_m = ∏_{k=1}^{∞} (w*_k)^{z_{m,k}}
13:   Draw y_{m,n} ~ Binomial(1, σ(w_m^T x_{m,n})), n = 1, ..., N_m
14: end for
Figure 1. Graphical model representation of multi-task adaptive filtering based on Dirichlet Processes.

This approach is based on the multi-task classification approach developed by (Xue et al., 2007). A similar approach, using Support Vector Machines for the individual tasks, has been used for content-enhanced collaborative filtering by (Yu et al., 2004). We summarize the generative process of the model in Algorithm 1. The approach clusters/groups related tasks by drawing the parameters for the related tasks from a mixture of Gaussians (as evident from line 12 of Algorithm 1). Intuitively, related tasks share information by getting grouped into the same cluster(s). For example, features that are indicators of the interests of a football fanatic (e.g., teams, scoreboards, win/loss decisions) may be inferred from the relevance judgments available from other sports enthusiasts. The grouping of related tasks ensures that unrelated tasks (e.g., stock-portfolio tracking vs. sports-news filtering) do not contaminate each other, as such contamination might lead to a poorer understanding of the individual tasks. As the optimal number of groups is unknown a priori, a Dirichlet Process model is inferred over the tasks to discover the number of groups pertinent to the tasks. The Dirichlet Process is essentially a mixture model with potentially infinite components. The actual number of components participating in the formation of the tasks depends on the parameter α, also known as the innovation parameter: larger values of α lead to more participating components, and vice versa. The optimal value of α (and of the other parameters) can be inferred from the data using variational inference (Xue et al., 2007). A truncated simulation of this generative process is sketched below.
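As a rough illustration of Algorithm 1, the sketch below simulates the stick-breaking construction with a finite truncation level K. The diffuse hyperpriors are collapsed into fixed illustrative values of μ, Σ, and α, so this is a simplified rendering of the generative process, not of the inference procedure.

import numpy as np

def generate_mtaf_tasks(M, d, K=50, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu, Sigma = np.zeros(d), np.eye(d)   # illustrative stand-ins for the hyperprior draws
    v = rng.beta(1.0, alpha, size=K)     # v_k ~ Beta(1, alpha)
    sticks = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    pi = v * sticks                      # pi_k = v_k * prod_{i<k} (1 - v_i)
    pi /= pi.sum()                       # renormalize the mass lost to truncation
    w_star = rng.multivariate_normal(mu, Sigma, size=K)  # group weights w*_k ~ N(mu, Sigma)
    z = rng.choice(K, size=M, p=pi)      # group indicator z_m for each task
    return w_star[z], z, pi              # task m inherits its group's LR weights

Larger values of α make the Beta(1, α) draws smaller, spreading probability mass over more sticks and hence over more participating groups, matching the role of the innovation parameter described above.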

With the knowledge of the inferred parameters for a task m, the decision function for a new instance x_{m,*} follows the distribution:

    P(y_{m,*} = 1 | x_{m,*}, w_m) = Σ_{k=1}^{K} π_{m,k} σ(w*_k^T x_{m,*})    (4)

Again, as in Equation 2, for a fully Bayesian treatment we integrate over the posterior distribution of w*_k instead of using a point estimate:

    P(y_{m,*} = 1 | x_{m,*}, w_m) = Σ_{k=1}^{K} π_{m,k} ∫ σ(w*_k^T x_{m,*}) P(w*_k | μ, Σ, D) dw*_k    (5)

The above integral does not have an analytical solution. (Xue et al., 2007) suggest the use of an approximate form of the integral based on (MacKay, 1992). Combining Equation 5 and Equation 3, one can compute the expected utility U(x_{m,*}) of delivering the document x_{m,*}. A passive MTAF approach will deliver the document if U(x_{m,*}) > t, where t is a dissemination threshold, usually set to zero or learned via cross-validation. We call this approach passive because, in choosing to deliver x_{m,*}, the system does not foresee the benefit of getting feedback on that document. Consequently, the user effort in providing feedback on this item may be wasted, as it may not result in better results (for the user) in the future. We remedy this situation in the next section by proposing our Active Learning framework for the MTAF approach.
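The predictive probability of Equation 5 can be sketched by sampling, with each group contributing its posterior-averaged sigmoid weighted by the task's mixing proportion (the paper itself uses the MacKay-style approximation rather than sampling; the names below are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_relevant_mtaf(x, pi_m, wk_samples):
    # Eq. 5 by Monte Carlo: pi_m[k] is task m's mixing proportion for group k,
    # and wk_samples[k] is an (S, d) array of posterior samples of w*_k
    return sum(pi_m[k] * np.mean(sigmoid(w_s @ x))
               for k, w_s in enumerate(wk_samples))

Plugging this probability into Equation 3 yields U(x_{m,*}); the passive system then delivers whenever that value exceeds the threshold t.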


3.2. Active Learning for Adaptive Filtering

In a classification setting, an AL approach typically selects instances that (if labeled) are expected to improve the classification performance the most. Along similar lines, in the MTAF setting, an AL approach could estimate the perceived benefit of delivering an item in three different ways. Firstly, does the delivered item x_{m,*} lead to improved performance on that task m in the future (based on feedback received on x_{m,*})? Secondly, will the feedback on the delivered item x_{m,*} lead to improvements in the global model (e.g., μ, Σ, α)? Finally, does feedback on the delivered item x_{m,*} lead to improvements in the other tasks in the system? We propose an Active Learning solution for each of these objectives in the following sections.

3.2.1. Local Active Learning

In this section, we discuss an AL approach that scores instances based on the perceived future benefit of delivering them to the current task m. Consequently, we first define our notion of the future benefit LUG_m(x_{m,*}) of delivering an instance to task m:

    LUG_m(x_{m,*} | D) = LSP_m(D ∪ (x_{m,*}, y_{m,*})) - LSP_m(D)    (6)

In the above equation, LSP_m(D') denotes the expected system performance on task m when trained on any data D'. Intuitively, the benefit of delivering an instance x_{m,*} is estimated as the improvement in the task-specific performance (for task m) if the feedback on that instance, (x_{m,*}, y_{m,*}), were added to the training set D. In Equation 6, the system does not know the true label y_{m,*} of the instance x_{m,*}. As a result, we rephrase the equation as an expectation over the possible labels:

    LUG_m(x_{m,*}) = Σ_{y∈{0,1}} P(y | x_{m,*}, D) LSP_m(D ∪ (x_{m,*}, y)) - LSP_m(D)    (7)

For the case of Adaptive Filtering, we define the system performance in terms of the expected utility over all potential instances the system will monitor. Consequently, LUG_m is the localized utility gain of task m.

    LSP_m(D') = ∫_{LU_m(x_{m,*})} p(x_{m,*}) LU_m(x_{m,*} | D') dx_{m,*}    (8)

In Equation 8, there are two important points to note. Firstly, the integral is over p(x_{m,*}), i.e., the task-specific population. It should be noted that each task in the system may have its own sample population from which documents are being monitored/filtered. For example, a stock analyst may subscribe to news feeds from business magazines, while a sports enthusiast may subscribe to news feeds from sports magazines, leading to different p(x_{m,*}) for these two tasks. (Of course, the sports enthusiast is not interested in all the sports news being served, but only in the items that match her interests. It is the goal of the AF system to identify those interests and filter the relevant news from the subscribed feed. Hence the system performance is evaluated over the sample population of the subscribed feed.) In this equation, the domain of the integral is defined as LU_m(x_{m,*}) = {x_{m,*} : LU_m(x_{m,*} | D') > 0}. This means that the future system performance is calculated based only on the instances that will be delivered; undelivered instances are ignored in the system performance calculation, owing to the nature of the T9U utility function. The second subtle point in Equation 8 is the localized nature of the utility function LU_m(x_{m,*} | D'). For computing the expected future system performance LSP_m(D ∪ (x_{m,*}, y_{m,*})), we compute the utility LU_m(x_{m,*} | D ∪ (x_{m,*}, y_{m,*})) based only on the updated local model w_m of task m. This means we do not consider the effect of updating the global parameters μ, Σ, α after observing the new training instance (x_{m,*}, y_{m,*}). If w_m is the current local model (trained on D), we can use Bayes' rule to obtain the posterior distribution P(w_m | x_{m,*}, y_{m,*}). We use the variational approximation method suggested by (Jaakkola & Jordan, 1996) to obtain this posterior. With the knowledge of the posterior distribution P(w_m | x_{m,*}, y_{m,*}), the localized expected utility can then be defined as:

    LU_m(x_{m,*} | D ∪ (x_{m,*}, y_{m,*})) = ∫_{w_m} Σ_{y∈{0,1}} λ_y P(y | x_{m,*}, w_m) P(w_m | x_{m,*}, y_{m,*}) dw_m    (9)

In the Adaptive Filtering setting, the system will then choose to deliver items that provide immediate utility U_m(x_{m,*} | D), or that are potentially beneficial in improving the system in future iterations, as estimated by the expected future utility LUG_m(x_{m,*} | D). We express the joint objective as a linear combination:

    LAL_m(x_{m,*} | D) = U_m(x_{m,*} | D) + β_L LUG_m(x_{m,*} | D)    (10)

Here, β_L is the weight given to the AL component of the system; we discuss the nature of this weight in Section 3.2.5.
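The LAL score of Equations 6-10 can be sketched as follows, reusing p_relevant and expected_utility from the earlier sketch. Here, pool is a sample of the task's incoming documents standing in for p(x_{m,*}), and update_posterior is an assumed helper returning posterior weight samples after hypothetically adding (x, y) to the training data; both are illustrative stand-ins for the variational machinery used in the paper.

import numpy as np

def local_system_performance(pool, w_samples):
    # Eq. 8: Monte Carlo estimate of LSP_m, averaging the utility over the
    # task's sample population, counting only deliverable instances (U > 0)
    utils = np.array([expected_utility(x, w_samples) for x in pool])
    return utils[utils > 0.0].sum() / len(pool)

def local_utility_gain(x_new, pool, w_samples, update_posterior):
    # Eq. 7: expectation over the unknown label y of the gain in LSP_m
    p1 = p_relevant(x_new, w_samples)
    gain = 0.0
    for y, p_y in ((1, p1), (0, 1.0 - p1)):
        w_post = update_posterior(w_samples, x_new, y)  # hypothetically retrained local model
        gain += p_y * local_system_performance(pool, w_post)
    return gain - local_system_performance(pool, w_samples)

def lal_score(x_new, pool, w_samples, update_posterior, beta_L=1.0):
    # Eq. 10: immediate utility plus the weighted expected future utility gain
    return (expected_utility(x_new, w_samples)
            + beta_L * local_utility_gain(x_new, pool, w_samples, update_posterior))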


Owing to the localized focus of this approach on the perceived improvement of the current task, we call this approach Local Active Learning (LAL). We expect LAL to perform well if the global model, consisting of μ, Σ, α, is already strong and thus does not need to be updated. Thus, LAL is expected to be effective on new tasks that are added to an already well-performing MTAF system. We deal with the case of improving the global model in the next section.

3.2.2. Global Active Learning

As mentioned in the previous section, LAL does not foresee the updates/improvements in the global model, consisting of Θ = {μ, Σ, α}. Global AL (GAL) fixes this by updating the global model when computing the expected utility gain in Equation 7. It should be noted that, just like LAL, the goal of GAL is still to improve the future performance of task m (the task to which the system plans to deliver the instance x_{m,*}), albeit based on improvements in the global model. Consequently, we define the global utility function GU_m(x_{m,*} | D ∪ (x_{m,*}, y_{m,*})), a modification of the localized utility LU_m(x_{m,*} | D ∪ (x_{m,*}, y_{m,*})) of Equation 9:

    GU_m(x_{m,*} | D ∪ (x_{m,*}, y_{m,*}))
        = ∫_Θ Σ_{y∈{0,1}} λ_y P(y | x_{m,*}, Θ) P(Θ | x_{m,*}, y_{m,*}) dΘ
        = ∫_Θ ∫_{w_m} Σ_{y∈{0,1}} λ_y P(y | x_{m,*}, w_m) P(w_m | Θ) P(Θ | x_{m,*}, y_{m,*}) dw_m dΘ    (11)

It is important to note that the posterior global model P(Θ | x_{m,*}, y_{m,*}) has been trained on the additional available feedback from task m only. This is because we are computing the expected global future utility if the instance is delivered to task m, that is, if the feedback is received from task m. Replacing the local utility LU in Equation 8 with the expected future global utility GU, we get the globalized system performance GSP_m of task m:

    GSP_m(D') = ∫_{GU_m(x_{m,*})} p(x_{m,*}) GU_m(x_{m,*} | D') dx_{m,*}    (12)

Substituting the globalized system performance GSP_m from Equation 12 into Equation 7, we get the globalized utility gain:

    GUG_m(x_{m,*}) = Σ_{y∈{0,1}} P(y | x_{m,*}, D) GSP_m(D ∪ (x_{m,*}, y)) - GSP_m(D)    (13)

Finally, substituting the globalized utility gain GUG_m into Equation 10, we get the scoring function for the Global Active Learning (GAL) approach:

    GAL_m(x_{m,*} | D) = U_m(x_{m,*} | D) + β_G GUG_m(x_{m,*} | D)    (14)

3.2.3. Benevolent Active Learning

So far, LAL and GAL have focused on the improved performance of the task m to which the system plans to deliver the instance x_{m,*}. But it is not clear whether the feedback on that instance will lead to improvements in the other tasks. To achieve this goal, we devise another AL score, called Benevolent AL (BAL), in which the system tries to deliver instances to task m that might lead to improvements in other tasks. In this context, we revise the definition of GSP_m to BSP_m, which sums over the performance of each task in the system, based on the feedback received on the instance x_{m,*} delivered to task m:

    BSP_m(D') = Σ_{m'=1}^{M} ∫_{GU_{m'}(x_{m',*})} p(x_{m',*}) GU_{m'}(x_{m',*} | D') dx_{m',*}    (15)

It is important to note that, inside the summation, the instances x_{m',*} for each task m' are sampled from the distribution p(x_{m',*}) of incoming documents for that particular task m'. This stems from the fact that the distribution of incoming instances may be different for different tasks, as discussed earlier. Based on BSP_m, for clarity and completeness, we provide the equations of the corresponding utility gain BUG_m and AL score BAL_m:

    BUG_m(x_{m,*}) = Σ_{y∈{0,1}} P(y | x_{m,*}, D) BSP_m(D ∪ (x_{m,*}, y)) - BSP_m(D)    (16)

    BAL_m(x_{m,*} | D) = U_m(x_{m,*} | D) + β_B BUG_m(x_{m,*} | D)    (17)
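Under the same sampling conventions, Equation 15 can be sketched as a sum of per-task performance estimates, each task evaluated over its own incoming-document pool. Here task_models would hold, for each task, posterior weight samples reflecting the hypothetically updated global parameters; as before, the names are illustrative.

import numpy as np

def benevolent_system_performance(pools, task_models):
    # Eq. 15: sum over tasks m' of the expected utility of deliverable
    # instances, with each task's integral taken over its own population p(x_{m',*})
    total = 0.0
    for pool, w_samples in zip(pools, task_models):
        utils = np.array([expected_utility(x, w_samples) for x in pool])
        total += utils[utils > 0.0].sum() / len(pool)
    return total

Substituting this quantity for GSP_m in Equation 13 yields BUG_m, and hence the BAL score of Equation 17.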


3.2.4. Analysis of LAL, GAL, and BAL

In Table 2, we summarize the major ideas of the proposed AL approaches. Each approach focuses on improvements in different parameters of the model, thereby leading to different consequences. The LAL approach focuses on improving its own task. Such an approach can be expected to perform well when the global model is already strong and less susceptible to drastic change based on feedback on a new instance. The GAL approach re-estimates the global model, but in the process computes the future utility based only on the current task. This selfish strategy of modifying the global model without foreseeing the effect on other tasks in the system can be detrimental to the overall performance of the model, making this approach undesirable. The BAL approach updates the global model, but at the same time studies the effect of the update on the utility gain of the other tasks in the system, thereby ensuring that the global model does not get biased towards a particular task. Consequently, we expect BAL to perform better than the other methods in the initial stages of AF, when the global model is weak.
Table 2. A comparative summary of the AL approaches. The first column lists the AL score used to decide delivery of instance x_{m,*} to task m. The second column lists the parameters that will be (potentially) improved (for better future utility) if the system retrains on the feedback received on the instance delivered based on the corresponding AL score.

Delivery criterion    Potentially improved parameters based on feedback
LAL_m                 w_m
GAL_m                 w_m, μ, Σ, α
BAL_m                 μ, Σ, α, w_{m'}, ∀m' ∈ {1, ..., M}

3.2.5. Combined AL for MTAF

Based on the discussion in the previous section, the above methods can be combined into a meta-AL system for AF that tries to satisfy multiple objectives at different phases of learning. We use a weighted linear combination of the individual utility gain scores to obtain the Meta AL score:

    MAL_m(x_{m,*} | D) = U_m(x_{m,*} | D) + β_L LUG_m(x_{m,*} | D) + β_G GUG_m(x_{m,*} | D) + β_B BUG_m(x_{m,*} | D)    (18)

The weight parameters β_L, β_G, and β_B should vary with the quality of the model, local as well as global. A stronger global model means a lower β_G. Similarly, β_B should decrease so that the tasks can then focus on improving locally, thereby maintaining a higher value of β_L. We test these hypotheses empirically in Section 4.3. An illustrative weighting scheme is sketched below.
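For instance, here is a minimal sketch of Equation 18, paired with one possible weighting schedule. The schedule is purely illustrative (the paper does not prescribe one): it shifts weight from the global and benevolent components toward the local one as feedback accumulates and the global model strengthens.

def mal_score(u_now, lug, gug, bug, beta_L, beta_G, beta_B):
    # Eq. 18: immediate utility plus the weighted utility-gain components
    return u_now + beta_L * lug + beta_G * gug + beta_B * bug

def phase_weights(n_feedback, k=100.0):
    # Hypothetical schedule: early on, favor beta_G/beta_B (weak global model);
    # later, favor beta_L. The constant k controls the crossover point.
    g = k / (k + n_feedback)
    return 1.0 - g, g, g   # beta_L, beta_G, beta_B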

4. Experiments
4.1. Datasets

We chose two datasets that are popular in the TREC filtering community, namely RCV1 (84 categories, 810,000 documents) (Lewis et al., 2004) and 20 Newsgroups (20 categories, 18,846 documents) (Joachims, 1997). For both datasets, we modeled each category as a task, and we start AF with only one known relevant document per task. It should be noted that RCV1 is a multi-labeled dataset, meaning each document may belong to multiple classes/tasks, thereby indicating some level of overlap/relatedness among tasks (e.g., tasks sharing the same documents). 20 Newsgroups is not multi-labeled, and each document belongs to only one class. However, it is known that the categories can be grouped into 6 groups based on subject matter, namely comp.*, rec.*, sci.*, talk.*, misc.*, and others. Thus, the relatedness of tasks is in the feature space, as documents are not shared among tasks. For example, the feature "computer" is a strong indicator of the tasks in the comp.* group. We chose this dataset to see if the MTAF approaches are able to discover these hidden groups of related categories even when documents are not shared among tasks.

4.2. Methods

Primarily, we are interested in comparing the four Active Learning approaches LAL, GAL, BAL, and MAL in the MTAF setting, to see which approach improves utility most rapidly and how they perform in the various phases of filtering. We also compare these four approaches to the passive version that decides to deliver an instance without taking any future utility gain into account (i.e., the delivery criterion is U_m(x_{m,*} | D) > 0).

4.2.1. STAF baselines

(Zhang et al., 2003) devised a strategy called exploration and exploitation to perform active Adaptive Filtering, and our AL framework is inspired by their work. The exploitation component is a passive AF strategy that delivers an item if U_m(x_{m,*} | D) > 0. The exploration component is an Active Learning component based on a metric called Utility Divergence, similar to the Utility Gain score developed in this paper. Utility Divergence, however, does not compute the actual gain in the utility of the system based on feedback on the instance.



It instead computes the difference between the expected utility of a hypothetical true model and that of the model updated with the expected feedback, Θ | (x_{m,*}, y_{m,*}); instances are scored based on their potential to reduce the gap between the utility of the true model and the utility of the current model. In our results, we refer to these approaches as STAF-passive (only exploitation) and STAF-active-UD (exploitation, and exploration using the Utility Divergence score).

4.2.2. A note on implementation

The various methods were implemented in MATLAB. For integrating over the posterior distribution of a variable (e.g., w*_k), we used the Metropolis-Hastings algorithm. Specifically, we draw several (S) samples from the posterior distribution of the variable and average over the corresponding outputs to obtain a probabilistic integral, similar to the approach described in (Zhang et al., 2003). For example, to implement Equation 3, we draw S samples from the posterior distribution P(w|D) and the integration is implemented as:

    U(x | D) = (1/S) Σ_{s=1}^{S} Σ_{y∈{0,1}} λ_y P(y | x, w^{(s)})    (19)

In this specific case, the posterior distribution was derived using the Laplace approximation method (Xue et al., 2007). Other posterior distributions for the MTAF case were based on the derivations available in (Xue et al., 2007). For integrating over P(x), samples were drawn from the data stream observed so far, consisting of delivered and undelivered items, to obtain a non-parametric estimate of P(x).
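For completeness, here is a minimal random-walk Metropolis-Hastings sketch for drawing the S posterior samples used in Equation 19, assuming a zero-mean Gaussian prior on w; the proposal scale and prior variance are illustrative choices, not values reported in the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(w, X, y, prior_var=10.0):
    # Bayesian LR: Bernoulli log-likelihood plus log N(w; 0, prior_var * I), up to a constant
    p = sigmoid(X @ w)
    loglik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loglik - 0.5 * np.dot(w, w) / prior_var

def mh_posterior_samples(X, y, S=200, step=0.1, seed=0):
    # Random-walk Metropolis-Hastings over w; returns S (correlated) samples
    # approximately distributed as P(w|D)
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    lp = log_posterior(w, X, y)
    samples = []
    for _ in range(S):
        w_prop = w + step * rng.standard_normal(w.shape)
        lp_prop = log_posterior(w_prop, X, y)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            w, lp = w_prop, lp_prop
        samples.append(w.copy())
    return np.array(samples)

These samples can be passed directly to the expected_utility function sketched earlier to realize Equation 19.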

4.3. Experimental Results and Discussion

First, we compare the MTAF approaches (MTAF-Passive, LAL, GAL, BAL, and MAL) to test the hypotheses we made in Section 3.2.4. Figure 2 shows the trends (in terms of T9U utility) on the RCV1 dataset. AL approaches are typically most effective in the initial phases of filtering, so Figure 2 shows performance up to the filtering of the first 5000 instances on the RCV1 dataset. The results validate our hypothesis that the LAL approach performs quite poorly in the initial stages of filtering, when the global model itself is quite weak. It can also be observed that the BAL approach, which tries to improve the overall utility of all the tasks in the system, performs the best initially. This is because the improved utility of all tasks reflects in a stronger global model. LAL improves at a faster pace once the learnt global model is stronger (due to learning from more feedback). The GAL approach, which tries to improve the global model without foreseeing improvements (or degradation) in other tasks, performs worse than the LAL approach. We believe that the GAL approach sometimes makes wrong decisions about instance selection, selecting instances that lead to improvement in one task but mostly degrade other tasks (because the utility gain in the other tasks is not taken into account). We found that setting β_G = 0 in Equation 18, i.e., ignoring the GUG_m component corresponding to GAL, led to better performance of the MAL approach. Consequently, the MAL approach, which combines the strengths of the LAL and BAL approaches, outperforms all the other approaches.

Figure 2. Comparison of the MTAF AL approaches on the RCV1 dataset (similar trends for 20 Newsgroups)

Next, we compare representative MTAF approaches (MTAF-passive and MTAF-MAL) to the STAF approaches to study the benefit of using a multi-task setup. We also compare the AL scoring functions: our Utility Gain (UG) criterion and the Utility Divergence (UD) criterion proposed in (Zhang et al., 2003). We call these variants of the STAF-active approach STAF-active-UG and STAF-active-UD, respectively. In Figure 3, we observe that the MTAF approaches, active as well as passive, outperform their STAF counterparts. Our best performing approach, MAL (T9U = 423), achieves a performance improvement of more than 20 percent over STAF-active-UD (T9U = 348). A paired t-test shows strong statistically significant evidence (p-value = 0.0002) for the superiority of our approach over the current state-of-the-art STAF-active-UD. We also observe that the starting utility of the MTAF approaches is higher, owing to the information sharing between tasks, which mitigates the per-task data sparsity problem of the STAF approaches.


Regarding the AL score, we observe that the performance of UG and UD is quite similar for STAF-active. Empirically, it seems that the goal of UD (to reduce the gap between the utility of a hypothetical true model and that of the learnt model) is quite similar to the goal of UG (to increase the future utility of the learnt model the most).



Figure 3. Comparison of MTAF and STAF approaches on the 20 Newsgroups dataset (similar trends for RCV1)

5. Conclusion and Future Work


In this paper, we have explored various Active Learning approaches to Multi-Task Adaptive Filtering. To score the benefit of delivering an instance, we developed a new metric called Utility Gain that estimates the improvement in system performance (in terms of utility) if the system is re-trained on the feedback received for the delivered instance. In the MTAF setting, we compare the effect of selfish (local) AL approaches (which focus on improvements in one task) to a benevolent AL approach (which evaluates the benefit of labeling an instance across many tasks). Our empirical analysis demonstrates the superior performance of the benevolent approach in the initial phase of filtering, when the global model of the MTAF system is weak, and a more rapid improvement in the performance of the selfish approaches in later stages, when the global model is strong (i.e., when the global model may not benefit much from AL). We also demonstrate that a combined approach, called meta-AL, which combines the strengths of the local and benevolent AL approaches, is superior to the individual approaches.

There are several areas for exploration in the future. The sampling approaches presented in this work can be infeasible in a large-scale adaptive filtering system with millions of tasks. In such an environment, it is necessary to first segregate the tasks so that each of the smaller AF systems can handle its respective tasks. It is also desirable to derive analytic solutions to the sampling strategies described here, to arrive at closed-form or faster schemes for evaluating the expected future utility. Another challenge is the problem of spam: in a practical MTAF system, how does the system protect its genuine users/tasks from malicious users? In a benevolent AL approach, the feedback from malicious users is potentially harmful to the system, and consequently to the genuine users. So how does a Benevolent AL approach safeguard against malicious use?

References

Blei, David M. and Jordan, Michael I. Variational inference for Dirichlet process mixtures, 2006.

Jaakkola, Tommi S. and Jordan, Michael I. A variational approach to Bayesian logistic regression models and their extensions, 1996.

Joachims, Thorsten. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. pp. 143-151, 1997.

Lewis, David, Yang, Yiming, Rose, T., and Li, Fan. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

MacKay, David J. C. The evidence framework applied to classification networks. Neural Computation, 4:720-736, 1992.

Melville, P., Saar-Tsechansky, M., Provost, F., and Mooney, R. An expected utility approach to active feature-value acquisition. In Proceedings of the Fifth IEEE International Conference on Data Mining, 2005.

Robertson, Stephen and Soboroff, Ian. The TREC 2002 filtering track report. In TREC-11, 2001.

Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35-63, 2007.

Yu, Kai, Tresp, Volker, and Yu, Shipeng. A nonparametric hierarchical Bayesian framework for information filtering, 2004.

Zhang, Yi, Xu, Wei, and Callan, Jamie. Exploration and exploitation in adaptive filtering based on Bayesian active learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 896-903, 2003.
