ABSTRACT

We propose a new ranking evaluation measure for situations where multiple preference judgements are given for each item pair but they may be noisy (i.e., some judgements are unreliable) and/or incomplete (i.e., some judgements are missing). While it is generally easier for assessors to conduct preference judgements than absolute judgements, it is often not practical to obtain preference judgements for all combinations of documents. However, this problem can be overcome if we can effectively utilise noisy and incomplete preference judgements such as those that can be obtained from crowdsourcing. Our measure, η, is based on a simple probabilistic user model of the labellers which assumes that each document is associated with a graded relevance score for a given query. We also consider situations where multiple preference probabilities, rather than preference labels, are given for each document pair. For example, in the absence of manual preference judgements, one might want to employ an ensemble of machine learning techniques to obtain such estimated probabilities. For this scenario, we propose another ranking evaluation measure called η_p. Through simulated experiments, we demonstrate that our proposed measures η and η_p can evaluate rankings more reliably than τ-b, a popular rank correlation measure.

CCS CONCEPTS

• Information systems → Retrieval effectiveness; Relevance assessment;

KEYWORDS

crowdsourcing, evaluation measures, graded relevance, preference judgements

ACM Reference format:
Riku Togashi and Tetsuya Sakai. 2019. Generalising Kendall's Tau for Noisy and Incomplete Preference Judgements. In Proceedings of The 2019 ACM SIGIR International Conference on the Theory of Information Retrieval, Santa Clara, CA, USA, October 2–5, 2019 (ICTIR '19), 4 pages.
https://doi.org/10.1145/3341981.3344246

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICTIR '19, October 2–5, 2019, Santa Clara, CA, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6881-0/19/10...$15.00
https://doi.org/10.1145/3341981.3344246

1 INTRODUCTION

Collecting relevance assessments is a crucial factor in information retrieval evaluation. However, as is widely known, manual relevance assessments are costly and do not scale well; the burden on the labellers is especially high if graded relevance [6] is utilised. There are at least two innovations that may help researchers overcome this problem: the first is the use of crowdsourcing [4], and the second is the use of preference judgements [2].

Crowdsourced relevance assessments are cheap and fast, and therefore it is not difficult to obtain labels from multiple crowd workers for the same target item to judge. However, the downside is that crowdsourced assessments are generally noisy, and even after applying a few tricks for filtering out low-quality labels, the labels from multiple labellers may disagree with one another. One simple approach to handling the disagreements would be to employ majority voting, but this means losing information about the original distribution of the labels. Approaches to preserving the original distribution for the evaluation (see [5, 6]) would be useful.

On the other hand, collecting preference judgements, rather than absolute relevance judgements, is advantageous, especially from the reliability point of view [2]. Generally speaking, it is easier for a labeller to decide which of the two documents presented side-by-side is more relevant than to decide whether a single document is relevant or not relevant. However, due to combinatorial explosion, obtaining preference judgements for all possible pairs of documents is not practical. Hence, if we are to leverage preference judgements to evaluate a ranked list of documents, the measure should be robust to incompleteness. That is, it should be able to evaluate rankings even if some preference judgements are missing.

Based on the above considerations, the present study proposes a new ranking evaluation measure akin to Kendall's τ, which can handle noise and incompleteness when multiple preference judgements are available per document pair. Our proposed measure, η, is based on a simple probabilistic user model of the labellers, and assumes that each document is associated with a graded relevance score for a given query. We also consider situations where multiple preference probabilities, rather than preference labels, are given for each document pair. For example, in the absence of manual preference judgements, one might want to employ an ensemble of machine learning techniques to obtain such estimated probabilities. For this scenario, we propose another ranking evaluation measure called η_p. Through simulated experiments, we demonstrate that our proposed measures η and η_p can evaluate rankings more reliably than τ-b, a popular rank correlation measure.
Session 8: Users and Evaluation II ICTIR '19, October 2-5, 2019, Santa Clara, CA, USA
models. For this scenario, we also propose a measure for reliably evaluating a given ranking.

Suppose that the prediction probability of the t-th model, p_ij^(t) (t = 1, ..., n), is given for each document pair (i, j). In Eq. (9), we now treat the expectation E[p_ij] as the final probabilistic label for each document pair, instead of p_ij. Hence, the aforementioned η̂ measure can be rewritten as follows:

\[
\hat{\eta}_p = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{(2E[p_{ij}] - 1)\,\delta_{ij}}{1 + V[p_{ij}]} \qquad (13)
\]

where V[p_ij] is the variance of the predicted probabilities p_ij^(t), which can be used as the uncertainty of the prediction. Note that 2E[p_ij] − 1, the gain for (i, j), also handles the uncertainty of the preference, because the closer E[p_ij] is to 0.5 (a tie), the closer this term is to 0. Thus, the gain for a document pair (i, j) is discounted by the risk V[p_ij] while avoiding division by zero.

η̂_p also attains its maximum value when a ranking satisfies δ_ij = sign(2E[p_ij] − 1). Hence, we can define η_p by normalising η̂_p by this maximum value η_p^* as follows:

\[
\eta_p = \frac{\hat{\eta}_p}{\eta_p^*} \qquad (14)
\]

The above trial is repeated 100 times to obtain 100 Pearson correlation values for τ-b and for η.

4.2 Number of Labellers

First, we examine the effect of the number of labellers n on the robustness of η, as well as that of τ-b. Figure 1 compares the Pearson correlation between η and the true τ-b with that between τ-b and the true τ-b, while varying n. It can be observed that τ-b is unreliable when the number of votes n is a small even number, because ties can occur often and majority voting therefore introduces a lot of noise in such cases. While we should avoid hiring an even number of assessors if majority votes are to be taken, this cannot always be ensured, due to (say) the exclusion of data from low-quality labellers. Hence, this is a problem for τ-b. In contrast, it can be observed that η consistently outperforms τ-b.
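The fragility of majority voting for small even n can be illustrated with a quick simulation. This is an illustrative sketch, not the paper's experimental setup: `tie_rate` and its parameters are hypothetical names, and each labeller is simply assumed to vote for the truly preferred document of a pair independently with probability p.

```python
# Illustrative sketch (not the paper's setup): estimate how often a
# majority vote over n labellers is tied, when each labeller votes for
# the truly preferred document of a pair independently with probability p.
import random

def tie_rate(n, p=0.7, trials=100_000, seed=0):
    rng = random.Random(seed)
    ties = 0
    for _ in range(trials):
        votes_for_i = sum(rng.random() < p for _ in range(n))
        if 2 * votes_for_i == n:  # exactly half the votes: a tie
            ties += 1
    return ties / trials

# An odd n can never tie; a small even n ties often, e.g.
# tie_rate(2) is about 2 * 0.7 * 0.3 = 0.42, while tie_rate(3) is 0.0.
```

With ties this frequent, the majority-vote label for a pair is essentially arbitrary, which is why a τ-b computed from such labels becomes noisy.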
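As a minimal sketch, Eqs. (13) and (14) above can be computed as follows. The function and variable names are illustrative rather than the authors' own, and δ_ij is assumed to be +1 when the evaluated ranking places document i above document j and −1 otherwise (its exact definition accompanies Eq. (9), which is not reproduced in this excerpt).

```python
# Sketch of eta_p (Eqs. (13)-(14)); names are illustrative, not the authors'.
from statistics import mean, pvariance

def eta_p(ranking, probs):
    """ranking: document ids, best first.
    probs: maps a document pair (i, j) to the list of predicted
    probabilities p_ij^(t) that i is preferred over j; pairs with no
    predictions may simply be omitted (incomplete judgements)."""
    pos = {doc: r for r, doc in enumerate(ranking)}
    N = len(ranking)
    coeff = 2.0 / (N * (N - 1))
    num = 0.0    # eta_p hat, Eq. (13)
    best = 0.0   # its maximum, taking delta_ij = sign(2E[p_ij] - 1)
    for (i, j), ps in probs.items():
        e, v = mean(ps), pvariance(ps)          # E[p_ij], V[p_ij]
        gain = (2.0 * e - 1.0) / (1.0 + v)      # uncertainty-discounted gain
        delta = 1.0 if pos[i] < pos[j] else -1.0
        num += gain * delta
        best += abs(gain)
    return (coeff * num) / (coeff * best)       # Eq. (14)
```

A ranking that agrees with every mean preference scores 1, and reversing it scores −1; note that the normaliser assumes at least one pair has E[p_ij] ≠ 0.5.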