Professional Documents
Culture Documents
net/publication/4275794
CITATIONS READS
8 59
3 authors:
Shengli Wu Yaxin Bi
Ulster University Ulster University
115 PUBLICATIONS 977 CITATIONS 91 PUBLICATIONS 1,237 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Shengli Wu on 04 June 2014.
functions
Group a2 b2 Significance .8
that the cubic model is better than the logistic model -.2
-1 0 1 2 3 4 5 6 7
Logistic
.2
component results of that year and those results are
0.0
Obs erved quite different from each other.
Cubic For each run in each year group, a given number of
-.2 Logistic component results (3, 4, 5, 6, 7, 8, 9, 10) were selected
-1 0 1 2 3 4 5 6 7
randomly and then all 50 or 249 queries (depending on
Logarithm of rank which year group) were run using all five data fusion
Figure 1. Original and estimated curves for a group of methods. 200 runs were executed for any given number
TREC 9 results of systems. Two measures, which were mean average
precision (MAP) and recall-level precision (RP), were
used to evaluate the performance of these data fusion
methods. These two measures have been widely used
for retrieval evaluation [2]. 0.3
Recall-level precision
Cubic
Figures 3-8 present the experimental results. 0.29
Logistic
Figures 3-5 present mean average precision of all 0.28 Borda
methods over a total of 200 runs for every given 0.27 CombSum
CombMNZ
number of results, while Figures 6-8 present recall- 0.26
level precision of these data fusion methods. In Figures 3 4 5 6 7 8 9 10
3-4 and 6-7, each data point is the average of 50 Number of results
0.28
Cubic
Recall-level precision
0.27 0.38
Logistic
0.37
0.26 Borda Cubic
0.36
CombSum Logistic
0.25 0.35
CombMNZ Borda
0.24 0.34
CombSum
3 4 5 6 7 8 9 10 0.33
3 4 5 6 7 8 9 10 CombMNZ
Number of results
Number of results
0.28
Cubic
0.27
Logistic
From these figures we can see that Borda fusion is
0.26 Borda
CombSum
almost always the worst among all the methods. A
0.25
CombMNZ two-tailed t test is carried out to compare their means
0.24
and it shows that the differences between Borda fusion
3 4 5 6 7 8 9 10
Number of results
and all other methods are significant at a level of .000,
or the probability is over 99.95% that Borda fusion is
worse than all other methods. On average, the cubic
Figure 4. Mean average precision of several data fusion
method, the logistic method, CombSum, and
methods in TREC 2001
CombMNZ outperform Borda by 4.67%, 4.33%,
4.20%, and 3.64% respectively on MAP; they
Mean average precision
0.36
0.35
Cubic
outperform Borda by 3.94%, 3.46%, 3.68%, and 3.29%
0.34
Logistic respectively on RP. This demonstrates that the linear
0.33
0.32
Borda relevance model used by Borda fusion can be
0.31
CombSum
improved by using non-linear functions such as cubic
CombMNZ
3 4 5 6 7 8 9 10
or logistic functions. On the other hand, it
Number of results
demonstrates that the fused results using rank
information (the cubic method and the logistic method)
Figure 5. Mean average precision of several data fusion can be as good as those using scoring information
methods in TREC 2004 (the curve of Cubic is covered (CombSum and CombMNZ).
by the curve of Logistic) Comparing the cubic method with CombSum, the
former is better than the latter at a significant level of
0.3
.000 on MAP and a significant level of .006 on RP. For
Recall-level precision
Cubic
0.29
Logistic the cubic method, MAP is 0.2937 and RP is 0.3116;
0.28 Borda
for CombSum, MAP is 0.2924 and RP is 0.3109.
CombSum
0.27
CombMNZ
Comparing the cubic method with CombMNZ, the
0.26 former is better than the latter at a significant level of
3 4 5 6 7 8 9 10
.000 on both MAP and RP. For CombSum, MAP is
Number of results
0.2908 and RP is 0.3097. The difference between the
logistic method and CombSum is not significant (0.155
Figure 6. Recall-level precision of several data fusion on MAP and 0.290 on RP). The difference between the
methods in TREC 9
logistic method and CombMNZ is significant on MAP italicized and centered below their respective names.
(0.008) but not significant on RP (0.645). Include e-mail addresses if possible. Author
Finally let us compare the cubic method and the information should be followed by two 12-point blank
logistic method. Over three year groups, their average lines.
performances are very close. For the cubic method,
MAP is 0.2937 and RP is 0.3116; for the logistic 5. Conclusion
method, MAP is 0.2927 and RP is 0.3102.
Interestingly, these small differences (0.33% on MAP In this paper we have investigated applying
and 0.46% on RP) are still statistically significant at a regression relevance models for data fusion, in which
significance level of .002 (t test), which suggests that only ranking information is used. We find that the
the cubic method is slightly better than the logistic cubic models and the logistic models are two good
method. According to the observations in Section 3, models for this. Compared with Borda fusion, the
which demonstrate that the cubic function can fit much cubic method and the logistic method are more
better than the logistic function with TREC data, we effective by 4%. Compared with CombSum and
expect that the difference between the cubic method CombMNZ, the cubic method is slightly better than
and the logistic method should be bigger than this. them and the logistic method is as good as them in
Therefore, we take a more careful look at the logistic performance but no score information is needed for the
curves. Actually, they do not fit well with TREC data cubic method and the logistic method.
on the top 30-50 ranked documents, which are the most Comparing the logistic models with the cubic
important documents for data fusion. However, we find models, the latter can fit TREC data more precisely
that the logistic curves always overestimate the than the former. However, when used for data fusion,
relevance probability of the document in those ranks, the cubic method is slightly better than the logistic
and in a consistent way! That is to say, for those top- method.
ranked documents, the rates of overestimation decrease
with ranks. Such an overestimation does not affect the 6. References
effectiveness of data fusion very much because of two
reasons. First, those documents which are top-ranked [1] Aslam, J. and Montague, M. Models for Metasearch. In
in any component result are very likely top-ranked in Proceedings of of 24th Annual ACM SIGIR Conference, 2001,
the fused results whether their relevance probabilities pages 275-284, New Orleans, USA.
are overestimated or not. Second, the relative rankings [2] Buckley, C. and Voorhees, E. M. Evaluating Evaluation
of those top-ranked documents are not very much Measure Stability. In Proceedings of the 23rd Annual ACM
affected because of the pattern of overestimation. A SIGIR Conference, 2000, pages 33-40, Athens, Greece.
linear relationship exists between the estimated logistic [3] Calve, A. L. and Savoy, J. Database Merging Strategy
curve and the original data curve for those top-ranked Based on Logistic Regression. Information Processing &
Management, 2000, 36(3): 341-359.
documents. In such a situation, if we only consider the [4] Fox, E. A. and Shaw, J. A. Combining of Multiple
top 30-50 ranked documents, the estimated logistic Searches. In Proceedings of the 2nd Annual Text Retrieval
curve will bring the same fused result as the original Conference (TREC-2), 1994, NIST, Gaithersburg, USA,
curve by using Equation 1 for data fusion. This can November.
explain why the logistic curve is almost as good as the [5] Wu, S., & McClean, S. Performance prediction of data
cubic curve for data fusion even though it is not as fusion for information retrieval. Information Processing &
good as the cubic curve for the estimation of the Management, 2006, 42(4): 899-915.
relationship of rank-probability of relevance. [6] Wu, S., & McClean, S. Improving High Accuracy
Author names and affiliations are to be centered Retrieval by Eliminating the Uneven Correlation Effect in
beneath the title and printed in Times 12-point, non- Data Fusion. Journal of the American Society for
boldface type. Multiple authors may be shown in a Information Science and Technology, 57(14): 1962-1973,
2006.
two- or three-column format, with their affiliations