You are on page 1of 4

Session 8: Users and Evaluation II ICTIR '19, October 2-5, 2019, Santa Clara, CA, USA

Generalising Kendall’s Tau


for Noisy and Incomplete Preference Judgements
Riku Togashi Tetsuya Sakai
Mercari Inc., Waseda University Waseda University
Tokyo, Japan Tokyo, Japan
riktor@mercari.com tetsuyasakai@acm.org

ABSTRACT 1 INTRODUCTION
We propose a new ranking evaluation measure for situations where
Collecting relevance assessments is a crucial factor in information
multiple preference judgements are given for each item pair but
retrieval evaluation. However, as it is widely known, manual rel-
they may be noisy (i.e., some judgements are unreliable) and/or
evance assessments are costly and do not scale well; the burden
incomplete (i.e., some judgements are missing). While it is generally
on the labellers is especially high if graded relevance [6] is utilised.
easier for assessors to conduct preference judgements than absolute
There are many more innovations that may help researchers over-
judgements, it is often not practical to obtain preference judgements
come this problem: the first is the use of crowdsourcing [4], and
for all combinations of documents. However, this problem can
the second is the use of preference judgements [2].
be overcome if we can effectively utilise noisy and incomplete
Crowdsourced relevance assessments are cheap and fast, and
preference judgements such as those that can be obtained from
therefore it is not difficult to obtain labels from multiple crowd
crowdsourcing. Our measure, η, is based on a simple probabilistic
workers for the same target item to judge. However, the downside
user model of the labellers which assumes that each document is
is that crowdsourced assessments are generally noisy, and even
associated with a graded relevance score for a given query. We
after applying a few tricks for filtering out low-quality labels, the
also consider situations where multiple preference probabilities,
labels from multiple labellers may disagree with one another. One
rather than preference labels, are given for each document pair.
simple approach to handle the disagreements would be to employ
For example, in the absence of manual preference judgements, one
majority voting, but this means losing information about the origi-
might want to employ an ensemble of machine learning techniques
nal distribution of the labels. Approaches to preserving the original
to obtain such estimated probabilities. For this scenario, we propose
distribution for the evaluation (See [5][6]) would be useful.
another ranking evaluation measure called ηp . Through simulated
On the other hand, collecting preference judgements, rather than
experiments, we demonstrate that our proposed measures η and
absolute relevance judgements is advantageous especially from the
ηp can evaluate rankings more reliably than τ -b, a popular rank
reliability point of view [2]. Generally speaking, it is easier for a
correlation measure.
labeller to decide which of the two documents presented side-by-
side is more relevant than to decide whether a single document is
CCS CONCEPTS relevant or not relevant. However, due to combinatorial explosion,
• Information systems → Retrieval effectiveness; Relevance obtaining preference judgements for all possible pairs of documents
assessment; is not practical. Hence, if we are to leverage preference judgements
to evaluate a ranked list of documents, the measure should be robust
KEYWORDS to incompleteness. That is, it should be able to evaluate rankings
crowdsourcing, evaluation measures, graded relevance, preference even if some preference judgements are missing.
judgements Based on the above considerations, the present study proposes
a new ranking evaluation measure akin to Kendall’s τ , which can
ACM Reference format: handle noise and incompleteness when multiple preference judge-
Riku Togashi and Tetsuya Sakai. 2019. Generalising Kendall’s Tau for Noisy ments are available per document pair. Our proposed measure, η,
and Incomplete Preference Judgements. In Proceedings of The 2019 ACM is based on a simple probabilistic user model of the labellers, and
SIGIR International Conference on the Theory of Information Retrieval, Santa assumes that each document is associated with a graded relevance
Clara, CA, USA, October 2–5, 2019 (ICTIR ’19), 4 pages. score for a given query. We also consider situations where multiple
https://doi.org/10.1145/3341981.3344246
preference probabilities, rather than preference labels, are given
for each document pair. For example, in the absence of manual
Permission to make digital or hard copies of all or part of this work for personal or preference judgements, one might want to employ an ensemble of
classroom use is granted without fee provided that copies are not made or distributed machine learning techniques to obtain such estimated probabilities.
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
For this scenario, we propose another ranking evaluation measure
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, called ηp . Through simulated experiments, we demonstrate that our
to post on servers or to redistribute to lists, requires prior specific permission and/or a proposed measures η and ηp can evaluate rankings more reliably
fee. Request permissions from permissions@acm.org.
ICTIR ’19, October 2–5, 2019, Santa Clara, CA, USA
than τ -b, a popular rank correlation measure.
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6881-0/19/10. . . $15.00
https://doi.org/10.1145/3341981.3344246

193
Session 8: Users and Evaluation II ICTIR '19, October 2-5, 2019, Santa Clara, CA, USA

2 RELATED WORK By letting δi j = 1i j − 1ji , δi j can be written as follows.


The original Kendall’s τ is computed simply by subtracting the num- (
1 i is ranked higher than j
ber of discordant pairs from the number of concordant pairs and δi j = (8)
then applying normalisation so that its range becomes [−1, 1]. In −1 i is ranked lower than j
the present study, we use a popular variant, τ -b, which can handle We obtain the following measure η̂ which can be regarded as a
ties in the ranking, as the baseline measure. There are several vari- generalisation of Kendall’s τ with probabilistic preference labels.
ants of Kendall’s τ . Yilmaz et al. [8] developed τap measure based N −1 ÕN
on Kendall’s τ , which has a probabilistic interpretation and a top- 2 Õ
η̂ = (2pi j − 1)δi j (9)
weighted property. Webber et al. [7] proposed RBO (Rank-Biased N (N − 1) i=1 j=i+1
Overlap), a top-weighted measure that can handle nonconjoint
and incomplete rankings. However, these measures cannot handle When pi j for each documents pair (i, j) is fixed, η̂ has the max-
distributions of multiple judgements for a documents pair. Mad- imum value which can be obtained when the ranking satisfies
dalena et al. [5] proposed AANDCG (Agreement Aware NDCG) δi j = siдn(2pi j − 1) for all combinations (i, j). However, when the
measure, which considers the agreement between labellers, espe- preferences given by comparisons are not transitive, there is no
cially in low-quality data collection environments such as those ranking that actually satisfies this.
based on crowdsourcing. Although AANDCG takes into considera- Here, by substituting δi j to ∆i j = siдn(2pi j − 1) in Eq.(9), we
tion the distribution of graded relevance given by multiple labellers, obtain the ideal value of η:
N N
−1 Õ
to the best of our knowledge, there is no measure that leverages 2 Õ
the distribution of preference judgements. η∗ = (2pi j − 1)∆i j (10)
N (N − 1) i=1 j=i+1
3 PROPOSED MEASURES Our proposed measure η is obtained by normalising η̂ by η ∗ .
3.1 η measure η̂
η= ∗ (11)
For a given query, we hire n labellers and let them independently η
make a preference judgement on each document pair (i, j): 3.2 Interpretation of η
η is based on a simple user model of labellers to handle graded
(
(k ) 1 labeller k prefers i over j
vi j = (1) relevance. Let µ i∗ denote the true relevance grade for document
0 labeller k prefers j over i i (i = 1, . . . , N ). In our labeller model, a labeller compares two
for k = 1, . . . , n. As a result, the probability pi j of “labellers prefer documents (i, j) in the process of creating a preference judgement
i over j” can be computed as the average of the observed votes as and chooses document i that they prefers over document j with the
follows. following probability.
n µ∗
pi j =
1 Õ (k ) p(i ≻ j) = ∗ i ∗ (12)
n
vi j (2) µi + µ j
k =1
Thus, the winning rate p(i ≻ j) is generated by the Bradley-Terry
We regard this probability as a preference label that means “i should model [1] under the assumption that preferences are transitive, as
be ranked higher than j in a ranking”. The number of concordant in the case with graded relevance. In this model, the winning rate
pairs nc and the number of disconcordant pairs nd in the ranking varies depending on the magnitude of the difference of relevance
can be expressed using these probability labels. between a document pair. This generalises the preference label
N
Õ −1 N
Õ which traditional Kendall’s τ and τ -b rely on. In a situation where
nc = (pi j 1i j + (1 − pi j )1ji ) (3) multiple preference judgements per item are available, τ may be
i=1 j=i+1 used by applying a majority vote to produce the final verdict, and
N
Õ −1 N
Õ then use this as a single preference label. However, this approach
nd = (pi j 1ji + (1 − pi j )1i j ) (4) ignores the original distribution of votes, even if they are nearly
i=1 j=i+1 tied. In such cases, majority voting will have strong adverse effects
( on the reliability of τ , as we shall demonstrate later.
1 i is ranked higher than j
1i j = (5) 3.3 ηp measure
0 i is ranked lower than j
In the absence of manual preference judgements for a query, semi-
where N is the number of documents to be ranked. automatic evaluation by pseudo-labeling is a promising approach
Then, the difference of nc and nd is given by: to low-cost evaluation in large datasets. In this situation, multiple
N
Õ −1 N
Õ preference probabilities, rather than preference labels, are given
= for each document pair as predicted values of machine learning

nc − nd (2pi j − 1)1i j − (2pi j − 1)1ji (6)
i=1 j=i+1 techniques. In general, although the evaluation utilising predicted
N
Õ −1 N
Õ preference labels are less accurate than human-in-the-loop evalua-
= (2pi j − 1)(1i j − 1ji ) (7) tion, it is possible to make a safe decision by computing uncertainty
i=1 j=i+1 with techniques such as Bayesian inference or ensemble of multiple

194
Session 8: Users and Evaluation II ICTIR '19, October 2-5, 2019, Santa Clara, CA, USA

models. For this scenario, we also propose a measure for reliably The above trial is repeated 100 times to obtain 100 Pearson cor-
evaluating a given ranking. relation values for τ -b and for η.
(t )
Suppose that the prediction probability of the t-th model pi j (t =
4.2 Number of Labellers
1, . . . , n) is given for each documents pair (i, j). In Eq.(9), we now
treat the expectation E[pi j ] as the final probabilistic label for each First, we examine the effect of the number of labellers n on the
document pair, instead of pi j . Hence, the aforementioned η̂ measure robustness of η, as well as that of τ -b. Figure 1 compares the Pearson
can be rewritten as follows. correlation between η and the true τ -b and that between τ -b and the
N −1 ÕN
true τ -b, while varying n. It can be observed that τ -b is unreliable
2 Õ (2E[pi j ] − 1)δi j when the number of votes n is a small even number, because ties
η̂p = (13)
N (N − 1) i=1 j=i+1 1 + V [pi j ] can occur often and therefore majority voting introduces a lot of
noise in such cases. While we should avoid hiring an even number
(t )
where V [pi j ] is the variance of the predicted probabilities pi j , of assessors if majority votes will be taken, this cannot always be
which can be used as the uncertainty of prediction. Note that ensured due to (say) exclusions of data from low quality labellers.
2E[pi j ] − 1, the gain for (i, j), also handles the uncertainty of pref- Hence, this is a problem for τ -b. In contrast, it can be observed that
erence, because the closer E[pi j ] is to 0.5 (tie), the closer this term η consistently outperforms τ -b.
is to 0. Thus, the gain for a documents pair (i, j) is discounted by
the risk V [pi j ] while avoiding division by zero.
η̂p also has the maximum value when a ranking satisfies δi j =
siдn(2E[pi j ] − 1). Hence, we can define ηp , by normalising η̂p by
ηp as follows.
η̂p
ηp = ∗ (14)
ηp

4 EVALUATING THE PROPOSED MEASURES


This section evaluates the reliability of our proposed measures
through simulation.
4.1 Artificial Data
Figure 1: Effect of the Number of labellers on the Pearson
In each trial of our simulation, we consider д = 4 relevance grades Correlation with true τ -b. The error bars represent 95% CIs.
and N = 50 documents, and for each document we assign a true
relevance grade µ k∗ (k = 1, . . . , N ) as follows. 4.3 Abused Preferences
In crowdsourcing, there might be abusers in labellers. This may be
r ∼ U ni f orm(0, д) , µ k∗ = ⌊r ⌋ (15)
a challenge for η as it does not ignore minority votes. Hence, we
Since τ -b for any document ranking can be computed based on investigate how η is affected by such abused votes. More specifically,
the number of concordant pairs and the number of discordant pairs we define abused preferences to be those that oppose the true
relative to the gold pairwise preferences, a true τ -b score for any preferences, and examine the effect of the amount of abuser votes
ranking can be obtained based on the gold preferences derived from on the reliability of η.
the above true relevance grades. Figure 2 shows the effect of abused preferences. The x axis is the
(t )
On the other hand, we obtain artificial human votes {vi j |t = number of abused preferences we intentionally injected among the
1, . . . , n} as: ! 50 ∗ 49/2 = 1, 225 preferences. The y axis is the correlation with
(t ) µ i∗ the true τ -b as before. It can be observed that, while η is at least as
vi j ∼ Bern ∗ (16) accurate as τ -b when the number of abused preferences is relatively
µ i + µ ∗j
small, it degrades more rapidly than τ -b as this number increases, as
Given this noisy data, a τ -b score for any ranking can be computed was expected. However, it can be observed that η performs at least
by first taking a majority vote of the above votes to obtain a prefer- as well as τb when the number of abused preferences is less than
ence label for each document pair. Compared to the true τ -b, this 700 (out of 1,225). Since abusers should not dominate in a practical
τ -b reflects the noise in the labels. In contrast, from the above votes, crowdsourcing experiment, the above property of η probably does
we can compute an η score for any ranking directly. not pose any practical threat.
In each trial, we generated 500 different document rankings by
degrading an ideal ranked list (as defined according to the true rele- 4.4 Inferring Preferences
vance grades) to different degrees: we randomly applied S pairwise This section discusses the robustness of η when the number of
document swaps (S = 1, . . . , 20) to the original ideal list, and 25 available preference judgements is limited. To do this, we set up
different rankings were obtained for each S. We then computed a simulation environment similar to the one described earlier, but
the above three measures for each of the 20 ∗ 25 = 500 rankings, gradually reduce the number of observed preferences from 1,225,
and then the Pearson correlation between the true τ -b and the τ -b while holding the number of labellers to a constant (n = 3). Given
(based on noisy labels), and that between the true τ -b and η. Hence such a situation, it is possible to estimate the relevance grades of
the Pearson correlations reflect the robustness of τ -b and that of η unobserved documents from observed preferences; we apply this
when noise is introduced to labels. relevance estimation approach to both η and τ -b. Given the set of

195
Session 8: Users and Evaluation II ICTIR '19, October 2-5, 2019, Santa Clara, CA, USA

4.5 Using Predicted Preference Probabilities


We finally compare ηp with η when predicted preference proba-
bilities from multiple models are available: the basic simulation
conditions are similar to before, but here we assume that each
model introduces uncertainty to the preference probabilities, as
follows. (t ) (t ) (t )
νi j ∼ U ni f orm(0, 1.0) , ϵi j ∼ N (νi j , 0.1) (20)
µ∗
pi j = ∗ i ∗ + ϵi j
(t ) (t )
(21)
µi + µ j
Then, for each document ranking, ηp is computed as defined in
Figure 2: Effect of the number of abused preferences on the Section 3.3. As for η, since it requires an exact preference probability,
Pearson correlation with true τ -b, with 95% CIs. we let pi j = E[pi j ]. As before, we measure the Pearson correlation
with the true τ -b for 100 trials, each with 500 document rankings.
sum of observed votes W = {w i j |i = 1, . . . , N , j = 1, . . . , N , j ,
Figure 4 compares ηp with η while varying the number of models
i} where w i j = nt=1 vi j , we maximise the following likelihood
Í (t )
n; the y axis is the correlation for ηp minus that for η. Hence, the
function that is also based on the Bradley-Terry model: figure shows that ηp consistently outperforms η even though the
N −1 Ö N   wi j   w ji differences are very small. The results demonstrate that ηp can
Ö µi µj
P(W |µ 1 , . . . , µ N ) = (17) handle the uncertainty introduced by each model.
i=1 j=i+1
µi + µ j µi + µ j

To this end, we estimate each graded relevance score µ i using the


MM algorithm [3] through iteration as follows.
ÍN
(k +1) j=1 w i j
µi = ÍN n
(18)
j=1 µ (k ) +µ (k )
i j

The τ -b for a given ranking can then be computed based on prefer-


ences derived from the estimated graded relevance µˆi ; we denote
this as τ -b-MM. As for η, each probability label can be obtained
from µˆi as: µˆi Figure 4: Effect of the number of models on the Pearson cor-
pˆi j = (19) relation with true τ -b, with 95% CIs.
µˆi + µˆj
4.6 Conclusion
Then η can be computed for any given ranking based on the above;
We proposed a ranking evaluation measure η for situations where
we refer to this as η-MM. As before, the straight η, η-MM, and
multiple preference judgements are given for each item pair, but
τ -b-MM for 500 document rankings can be compared with the true
they may be noisy and/or incomplete. We also proposed ηp for sit-
τ -b in terms of Pearson correlation, for 100 trials, each with 500
uations where multiple preference probabilities are given. Through
document rankings.
simulation, we have demonstrated the reliability of our measures
Figure 3 shows the effect of the number of judged document
in comparison to the popular τ -b.
pairs on the correlation with the the true τ -b for η, η-MM, and
The code for computing our proposed measures can be obtained
τ -b-MM. It can be observed that all correlation values are quite
from a github repository (github.com/mercari/eta).
high even when the number of judged pairs is small, and more
importantly, that η-MM outperforms η, which in turn outperforms REFERENCES
τ -b-MM. That is, even without relevance estimation, η can provide [1] Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block
rankings that are more reliable than τ -b (with relevance estimation). designs: I. The method of paired comparisons. Biometrika 39, 3/4 (1952), 324–345.
[2] Ben Carterette, Paul N Bennett, David Maxwell Chickering, and Susan T Dumais.
2008. Here or there. In European Conference on Information Retrieval. Springer,
16–27.
[3] David R Hunter et al. 2004. MM algorithms for generalized Bradley-Terry models.
The annals of statistics 32, 1 (2004), 384–406.
[4] Matthew Lease and Emine Yilmaz. 2013. Crowdsourcing for information retrieval:
introduction to the special issue. Information retrieval 16, 2 (2013), 91–100.
[5] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017.
Considering assessor agreement in ir evaluation. In Proceedings of the ACM SIGIR
International Conference on Theory of Information Retrieval. ACM, 75–82.
[6] Tetsuya Sakai. 2019. Graded Relevance Assessments and Graded Relevance Mea-
sures of NTCIR: A Survey of the First Twenty Years. CoRR abs/1903.11272 (2019).
arXiv:1903.11272 http://arxiv.org/abs/1903.11272
[7] William Webber, Alistair Moffat, and Justin Zobel. 2010. A similarity measure for
indefinite rankings. ACM TOIS 28, 4 (2010), 20.
[8] Emine Yilmaz, Javed A Aslam, and Stephen Robertson. 2008. A new rank cor-
relation coefficient for information retrieval. In Proceedings of the 31st annual
Figure 3: Effect of the number of judged documents pairs on international ACM SIGIR conference on Research and development in information
the correlation with true τ -b, with 95% CIs. retrieval. ACM, 587–594.

196

You might also like