
Meta-Analysis for Retrieval Experiments Involving Multiple Test Collections
Ian Soboroff
National Institute of Standards and Technology
Gaithersburg, Maryland
ian.soboroff@nist.gov

ABSTRACT
Traditional practice recommends that information retrieval experiments be run over multiple test collections, to support, if not prove, that gains in performance are likely to generalize to other collections or tasks. However, because of the pooling assumptions, evaluation scores are not directly comparable across different test collections. We present a widely-used statistical tool, meta-analysis, as a framework for reporting results from IR experiments using multiple test collections. We demonstrate the meta-analytic approach through two standard experiments on stemming and pseudo-relevance feedback, and compare the results to those obtained from score standardization. Meta-analysis incorporates several recent recommendations in the literature, including score standardization, reporting effect sizes rather than score differences, and avoiding a reliance on null-hypothesis statistical testing, in a unified approach. It therefore represents an important methodological improvement over using these techniques in isolation.

CCS CONCEPTS
• Information systems → Evaluation of retrieval results;

KEYWORDS
meta-analysis, score standardization, effect sizes

ACM Reference Format:
Ian Soboroff. 2018. Meta-Analysis for Retrieval Experiments Involving Multiple Test Collections. In The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3269206.3271719

1 INTRODUCTION
Laboratory experiments in information retrieval commonly use test collections. A test collection comprises a set of documents, a set of search needs, and a set of relevance judgments that indicate the "correct" documents that should be retrieved for each search need. Test collections are intended to model an abstract search task, such as the well-known ad hoc search task from the Text Retrieval Conference (TREC) test collections [24].

Because test collections have a fixed set of documents and search needs, they necessarily represent some kind of sample of the universe of document collections and search needs. Hence, experimentalists in information retrieval have long recommended that tests of retrieval effectiveness be repeated and reported over multiple test collections, lest a result be an artifact of one particular collection's quirks.

Voorhees reinforces this recommendation with respect to how modern test collections violate the assumptions of the original Cranfield-II experiments [23]. Because modern test collections are known to have incomplete relevance judgments, evaluation metric values are only meaningful for relative comparisons between systems as measured on the same test collection.¹ This introduces a degree of noise into any retrieval experiment. Just as we reduce the noise from variance across topics by using many topics in a single test collection, we can control for the noise of the collection by using many test collections. However, now we have a problem: because of the noise and sampling error inherent in a single test collection, we should use more than one test collection, but because of incomplete judgments and (again) the sampling error inherent in test collections, we can't compare system scores between collections directly.

¹ One might wonder, given subsequent observations of variance in relevance assessment, even by the same user, whether any set of relevance judgments can be considered to be complete.

This paper proposes using a statistical method, meta-analysis, to perform comparisons across multiple test collections in an information retrieval experiment. We compare meta-analysis with score standardization [25] through several demonstration experiments that highlight the advantages of meta-analysis over score standardization. The goal of this paper is not to claim that meta-analysis is new, or to present new empirical evidence for well-known IR phenomena, but to make a methodological contribution that solves a real problem in IR experimentation.

2 RELATED WORK
Information retrieval experiments through the 1970s and 1980s were necessarily performed with the data at hand. There were a few test collections held by multiple research groups, including Cranfield (CRAN), CISI, and CACM, and in the mid-to-late 1980s experiments conducted across more than one test collection began to appear. Two particularly interesting examples appeared in SIGIR 1987: one study of phrase indexing [8], and another on stemming [10]. Everyone at the time expected retrieval performance to vary across different types of document collections, and these papers clearly demonstrated it experimentally.


With the emergence of TREC and other annual evaluation forums in the 1990s, along with cheaper computer and storage technology, it became feasible to expect researchers to investigate results in multiple test collections. Nevertheless, there was (and remains) no accepted method for rigorously drawing a general conclusion from results gleaned from multiple test collections.

Webber et al. [25] proposed score standardization as a solution to this problem: each test collection should come with a standardization factor that scales scores into a comparable interval. The standardization approach replaces raw scores with z-scores: scores normalized to zero mean and unit variance based on the sample of scores from a designated set of system runs, such as those used to pool the collection. This scales the differences in score between experimental conditions as some number of standard deviations from the mean, which is comparable across test collections with identically standardized scores. Sakai proposed adding a linear transformation to the z-score which improves sensitivity to the topic set while maintaining discriminative power [18]. Webber et al. give an excellent summary of the information retrieval literature on normalization of evaluation scores, to which we refer the interested reader.

Score standardization is a reasonable approach. We see three issues which motivate us to propose meta-analysis as an improvement. First, z-scores are in units of standard deviations, which can be hard to interpret. Second, the z-score parameters are fixed at what might be an early point in the improvement of IR technology for that test collection, and so may need recalibration over time. Third, all scores must be standardized using the identical approach, for example either that of Webber et al. [25] or Sakai [18], with identical parameters, for comparability to be maintained.

One can argue that standardization assumes that metric scores are normally distributed, or alternatively that the mean and standard deviation are meaningful characterizations of the distribution of scores, which is plainly not the case when scores are bounded to the range [0,1]. In practice, however, z-score normalization is common in many statistical methods, and is even inherent in common techniques such as the t-test, which are known to be robust to violations of distributional assumptions.

What is missing in the score standardization approach is a procedure for inference about the scaled scores. Ideally, score standardization would generate a summary score over multiple test collections, and predict whether outcomes in similar test collections will behave similarly.

3 META-ANALYSIS
Meta-analysis is a statistical technique designed to integrate information from multiple separate experiments in order to draw a single overall conclusion [3]. This is a long-standing problem in many fields, including medicine, public health, and agriculture, where researchers are rarely fortunate enough to experiment with the same data [13]. Consider the question of whether iocaine powder² is effective for treating persistent headaches. Different studies would conduct slightly different experiments around this question, differing in dosage, how the powder is delivered to the subject, whether the subject is a human (the target population) or an animal (a model of the population), the number of subjects surviving the study, etc. All of these differences could affect the outcome of the study. Some of the studies could be statistically underpowered or neglect to describe an essential part of the procedure clearly. Meta-analysis was developed to allow a researcher to select an appropriate, comparable subset of studies and integrate them to derive an overall "treatment effect". We refer the reader to Borenstein et al. [3] for an excellent and readable gateway to the basics of meta-analysis beyond those presented here.

² A fictional substance featured in the film "The Princess Bride" (1987).

The problem of comparing IR experiments across multiple test collections is similar to the problem of integrating separate scientific studies. The documents in a test collection are samples of an undefined wider population: ClueWeb12 is a sample of the entire web as it existed between February and May 2012, during which time the web certainly changed; at the same time, it is a sample of the written output of the human species. TREC CDs 4 and 5, used in many ad hoc TREC tasks over the years, are a set of news articles from the Wall Street Journal and the Los Angeles Times alongside government publications from the US Federal Register and Foreign Broadcast Information Service. Each of these is a sample of its source, and is a sample of the larger population of newspaper articles. Similarly, the topics in a test collection are an opportunity sample: in TREC they are developed by assessors or mined from query logs. There is often a noisy selection process, since topics with large numbers of relevant documents present problems for evaluating ad hoc search. These sampling and selection processes are difficult if not impossible to characterize completely, so we certainly can't say how similar the samples might be to each other. Numerous papers on experimental methods in IR have examined subsets of test collection topic sets (for example [9]) and observed that some topics have a larger effect on the experimental outcome, or that sometimes different topics are equally predictive and thus having multiple such topics in a test collection increases bias towards topics of that type, whatever that type might be.

Even without being able to fully understand the relationship of the components of test collections to the populations they are drawn from, we want to be able to integrate results across multiple test collections, and take away from that some sort of statistical prediction about future datasets. Meta-analysis scales the results of the so-called "primary" experiments so that they are directly comparable, and summarizes the experiments with a mean treatment effect in a confidence interval. If the interval crosses the point of no difference, then this is equivalent to a result that is not statistically significant. Otherwise the interval contains the true treatment effect in similar studies 95% of the time.

We propose here to use meta-analysis as a framework for presenting the results of a series of experiments conducted by a single team using multiple test collections. To accomplish this we need to give a brief tutorial on meta-analysis, with specific recommendations on choices that are sensible for IR experiments.

3.1 Effect size
The first step in meta-analysis is to quantify the effect size we observe in each experiment. In an experiment we keep a control or a baseline, and apply a treatment, for example stemming query terms. We observe some difference, and the true value of that change is due to the treatment but also possibly other spurious causes.


The goal of meta-analysis is to set up a comparison within which we can estimate the true treatment effect with explicit assumptions about other effects.

There is a wide array of types of effect sizes in the biomedical and social sciences, for results reported as tables of outcomes, or relative risks, or categorical measures. For measures generically, the chief questions are whether the measure is continuous or not, whether it is ratio-valued with a meaningful zero point, and how that value is distributed. In common IR experiments we are lucky in that most of our system effectiveness measures are continuous ratios with a natural zero. Even measures like average precision, which are not normally distributed, are well characterized by their mean and standard deviation.

In a meta-analysis, we are not concerned with the score of a single retrieval run; rather, we want to quantify the effect of the "treatment". A treatment in information retrieval is whether we apply stemming or not, or the difference between a ranking function and a baseline, or the effect we see from adding relevance feedback. We will consider each experiment as computing the effect size between the treatment and the control, between the experimental condition and the baseline. This already offers a significant methodological advantage: the baseline is always explicit. It is also possible to look at covariates, as in multiple regression or analysis of variance, to compare multiple treatment variables, but in this paper we restrict ourselves to the common case: I, the experimenter, propose a novel retrieval algorithm and wish to demonstrate its improvement over a reasonable baseline.

For each test collection, we have a measurement of the baseline and the treatment for each topic. The simplest effect size is the difference in the raw mean scores, y = T̄ − B̄. We also need the variance of the effect size, which can be pooled (the sample variance of the per-topic differences) or not. Since we don't expect the per-topic scores between the baseline and the treatment to be perfectly correlated, we use the unpooled variance:

    s^2 = s_T^2 + s_B^2 − 2 s_T s_B

where s_T^2 and s_B^2 are the sample variances over the topics in the treatment and the baseline, respectively.

An alternative effect size we might choose for IR experiments is the "response ratio", the logarithm of the ratio of the mean of the treatment to the mean of the baseline:

    y = ln(T̄ / B̄)

The variance of the log response ratio is:

    s^2 = s_pooled^2 (1 / (n_T T̄^2) + 1 / (n_B B̄^2))

The response ratio is equivalent to reporting a percent improvement in the score, rather than a raw difference.

It is also possible to use the standardized mean difference as the effect size. In meta-analysis this is typically done when the measures in each study are different or are on different scales. The meta-analysis process will weight the observed effects by dividing by the variance, so we effectively standardize anyway. Since in a typical IR experiment we have full control of the experimental setup, we will use a single metric across all test collections, and standardization is not necessary.
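As a concrete illustration (a minimal sketch of ours, not code from the paper), the raw mean difference and the unpooled variance above can be computed per collection directly from per-topic scores; the data frame scores and its columns coll, baseline, and treatment are hypothetical names:

# Minimal sketch: per-collection raw mean difference effect and its unpooled
# variance, computed exactly as in the formula above. Assumes a hypothetical
# data frame 'scores' with one row per topic and columns coll (collection id),
# baseline (AP without the treatment), and treatment (AP with the treatment).
library(dplyr)

effects <- scores %>%
  group_by(coll) %>%
  summarise(n  = n(),
            y  = mean(treatment) - mean(baseline),
            s2 = var(treatment) + var(baseline) - 2 * sd(treatment) * sd(baseline))

The escalc function used in Section 4 computes a comparable raw mean difference effect size and variance from the same kind of summary statistics.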
3.2 Model
There are two primary meta-analysis models: the fixed-effect model and the random-effects model. The fixed-effect model reports the summary estimate of the true effect as the weighted average of the study effects, with the weight w_i = 1/s_i^2, one over the within-experiment variance. By using a fixed-effect model, we assume that the true effect size is constant across all the collections. A random-effects model allows us to say that the true effect size varies within an interval.

Of course, once we assume that the effect size varies across the collections, that amount of variance is something we need to be concerned about. In meta-analysis this is called "heterogeneity". Within each test collection the effect varies, and that amount of variance varies across collections. Heterogeneity is the excess variation we see from collection to collection beyond what we would expect to see if the true effect were the same in all collections.

In a random-effects model we add an estimate of the between-studies variance to the study variance when we compute the weight. There are a number of methods for estimating the between-studies variance T^2, of which the basic one is the DerSimonian and Laird ("DL") estimate [7]:

    T^2 = (Q − df) / C
    Q = Σ_{i=1}^{k} w_i y_i^2 − (Σ_{i=1}^{k} w_i y_i)^2 / Σ_{i=1}^{k} w_i
    C = Σ w_i − (Σ w_i^2) / (Σ w_i)
    df = k − 1
    k = number of studies

Then the new study weight is w_i' = 1 / (s_i^2 + T^2).

There are of course more complicated estimates, but the DL estimate is quick to compute and is accurate. For the meta-analyses in this paper, we use the R package metafor, which supports many models from the literature.
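As a sketch of the arithmetic (ours, not from the paper), the DL estimate can be computed directly from vectors of per-collection effects yi and within-collection variances vi; metafor's rma(yi, vi, method='DL') performs the same calculation:

# Minimal sketch of the DerSimonian and Laird estimate of the between-studies
# variance T^2, following the formulas above. yi and vi are hypothetical vectors
# of per-collection effect sizes and within-collection variances.
dl_tau2 <- function(yi, vi) {
  wi <- 1 / vi                                    # fixed-effect weights
  k  <- length(yi)
  Q  <- sum(wi * yi^2) - sum(wi * yi)^2 / sum(wi)
  C  <- sum(wi) - sum(wi^2) / sum(wi)
  max(0, (Q - (k - 1)) / C)                       # T^2, truncated at zero
}

# Random-effects weights and the weighted-average summary effect
# wi_star <- 1 / (vi + dl_tau2(yi, vi))
# summary_effect <- sum(wi_star * yi) / sum(wi_star)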
3.3 Summary and inference
The meta-analysis process scales the results of the primary studies according to their variance, and computes a summary effect as a weighted average of the primary studies. A common form for reporting these results is a forest plot, as in Figure 2 for the stemming experiment below. Each primary study effect is shown with a normal-model confidence interval. The summary effect is shown as a diamond extending over its confidence interval.

An n% confidence interval indicates that the true value in similar experiments will lie within that interval n% of the time. In contrast to a p-value from a hypothesis test, which reports the probability of correctly rejecting a null hypothesis despite sampling error, a retrospective statement, confidence intervals are predictive about future outcomes in similar experiments. The distinction is subtle, and in fact confidence intervals can be interpreted as illustrating hypothesis test outcomes, but this flexibility is welcome here, because it allows us to distinguish the magnitude of the effect from the probability that we will see it again.


4 STEMMING
Here we present a demonstrator experiment to determine whether stemming is effective at improving search. Stemming is the process of conflating words to a "stem" term, for example combining "compute", "computer", "computers" and "computing" into a single term "comput", analogous to a morphological root except that the algorithms are usually based on heuristic rules or corpus statistics rather than linguistics. Stemming can be as simple as dropping the letter "s" from word endings. Porter [14] and Krovetz [12] proposed more complicated algorithms. Stemming can be done at index time, storing stemmed terms only, or at search time by expanding the query to include all terms with the common stem. Stemming ideally should improve recall at the expense of precision, and was widely used in information retrieval for many years, but nowadays it has a mixed reputation and many systems do not stem in their default configuration. Harman [10] identified excessive term "expansion" by stemming algorithms through a failure analysis. Singh and Gupta [20] present a lengthy survey of stemming, not only for information retrieval but also in text classification and morphological analysis. Silvello et al. [19] examined modern statistical stemming methods and compared them across a variety of test collections, with a focus on reproducing published results in CLEF; stemming can have a stronger effect in languages that are more morphologically rich than English.

The purpose of this experiment is to illustrate the meta-analysis method, and contrast it to score standardization, in an experimental context familiar to all IR researchers. We use ten test collections from TREC: the ad hoc test collections from TREC-6 through TREC-8, with Disks 4 and 5 (leaving aside the Congressional Record subcollection) as the document collection;³ the terabyte track ad hoc collections from TREC 2004 through 2006, with the Gov2 crawl of websites in the .gov domain;⁴ the blog track opinion finding ad hoc collections from TREC 2006 through 2008, with the Blogs06 blog crawl for data;⁵ and the dynamic domain collection from TREC 2017, built on the LDC New York Times collection.⁶ For the blog track topics, we treated all relevant posts of any opinion categorization as relevant. For the dynamic domain collection, we only used the titles of the topics and collapsed all subtopics such that a document relevant to any subtopic was relevant to the topic. For all other topics we use the title or short query field as the query. Table 1 lists the test collections.

³ https://www.nist.gov/srd/nist-special-database-22 and -23
⁴ http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
⁵ ibid.
⁶ https://catalog.ldc.upenn.edu/ldc2008t19

Document set         Genre        Topic set    Year
Disks 4 and 5        news, gov't  301-350      TREC-6
                                  351-400      TREC-7
                                  401-450      TREC-8
Gov2                 web          701-750      TREC 2004
                                  751-800      TREC 2005
                                  801-850      TREC 2006
Blogs06              blog         851-900      TREC 2006
                                  901-950      TREC 2007
                                  1001-1050    TREC 2008
LDC New York Times   news         1-60         TREC 2017

Table 1: Test collections used in the stemming experiment.

One question that arises is how best to deal with the fact that the document collection is a factor linking nine of the test collections. Meta-analysis calls these factors "moderators" and we can include them in our model, such that the between-studies variance takes this into account; the resulting model is called a "mixed-effects model". However, this makes it impossible to compute a pooled, summary effect. We take a different approach, and show summary effects for the subset of experiments using each test collection. As an alternative method we collapsed the three ad hoc collections, the three terabyte collections, and the three blog collections each into a single collection using all topics created for the same task and document collection. Combining the collections gives 150 topics each, giving higher statistical power [17] and stability [5]. We show each approach below.
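For reference, here is one way a moderator could be added in metafor (a sketch of ours; the docset column, marking the shared document collection, is a hypothetical addition to the escalc output shown later in this section):

# Sketch of a mixed-effects model with the document collection as a moderator.
# Assumes stems.md (the escalc output from the listing below) has been augmented
# with a hypothetical 'docset' column naming the document collection of each topic set.
model.mods <- rma(yi, vi, mods = ~ docset, data = stems.md, method = 'DL')
summary(model.mods)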
We use the ATIRE system for all of our experiments.⁷ Except as specified in the experiments below, we used the default settings, including using BM25 as the ranking model. Retrieval was limited to the top 1000 (-k1000) and evaluation measures were computed with trec_eval.⁸ The specific stemmer employed is the Krovetz stemmer as provided by ATIRE. The document collections were indexed following instructions on the ATIRE website and examples from the website of the SIGIR 2015 RIGOR workshop on reproducibility [1]. Scripts for reproducing the results here, tested under MacOS and Linux, can be found at https://github.com/isoboroff/meta-analysis/.

⁷ http://www.atire.org/, Mercurial revision 9a105a28370b
⁸ https://github.com/usnistgov/trec_eval

As we mentioned above, we use the metafor package in R. The effect size is calculated by the escalc function, which takes the number of topics in each test collection and the means and standard deviations in the two experimental conditions: no stemming, or stemmed with the Krovetz stemmer. The 'MD' parameter indicates that the effect we are calculating is the raw mean difference. The 'DL' parameter for the model indicates the DerSimonian and Laird estimate for between-studies variance:

library(tidyverse)
library(metafor)

# Per-topic scores in long format: condition, collection, measure, topic, value
dat <- read.table('all', col.names=c('treat', 'coll', 'meas', 'topic', 'value'))
maps <- subset(dat, meas == 'map')

# One row per collection: topic count, mean and standard deviation per condition
stems <- maps %>%
  spread(treat, value) %>%
  group_by(coll) %>%
  summarise(n = n(),
            m_nostem = mean(no_stem), sd_nostem = sd(no_stem),
            m_kro = mean(krovetz), sd_kro = sd(krovetz))

# Raw mean difference effect sizes, random-effects model with the DL estimate,
# and the forest plot
stems.md <- escalc('MD', m2i = m_nostem, m1i = m_kro, sd2i = sd_nostem, sd1i = sd_kro,
                   n2i = n, n1i = n, data = stems)
model.md <- rma(stems.md, method = 'DL')
forest(model.md, slab = stems.md$coll)

Figure 1 shows the complete experiment using a forest plot. The leftmost plot uses the raw difference in mean average precision as the effect. Each row shows a test collection.


Figure 1: Forest plots for the stemming experiment. The left figure shows the mean difference in AP as the effect, while the right shows the log ratio of means.
[Figure 1 data. Mean difference in AP (weight, effect [95% CI]): blogs06 32.05%, −0.01 [−0.06, 0.03]; gov2 33.27%, 0.02 [−0.02, 0.06]; nyt 6.62%, 0.03 [−0.06, 0.12]; cd45 28.06%, 0.04 [−0.01, 0.08]; summary effect, RE model, 100.00%, 0.01 [−0.01, 0.04]. Ratio of means: blogs06 37.81%, 0.96 [0.84, 1.09]; nyt 16.42%, 1.07 [0.87, 1.32]; gov2 30.08%, 1.08 [0.93, 1.25]; cd45 15.69%, 1.19 [0.96, 1.47]; summary effect, RE model, 100.00%, 1.05 [0.96, 1.14].]

The numbers in the right column are the effect, shown as the raw difference with a 95% confidence interval. The percentages give the weights of the respective test collection sub-experiments; recall that the weight is one over the total variance for that test collection (including our estimate of between-collection variance). The lines and polygons give a graphical illustration of the effect in each test collection and its confidence interval. The vertical line at zero means zero difference from using stemming, that is, no effect. Finally, the bottom line shows the summary effect with its interval.

The results here show that we cannot conclude that stemming improves effectiveness. The summary effect line indicates that we would expect stemming to change the mean average precision between -0.01 and 0.04 95% of the time. Because that interval includes zero, we are essentially saying we can't detect this effect size (0.01 MAP) at this confidence level. From the hypothesis testing perspective, this is equivalent to saying that the results are not statistically significant: we would not reject the null hypothesis of no effect due to stemming. We will return to the question of significance below in the section on power analysis.

The right-hand plot in Figure 1 shows the same experiment, but using the log ratio of means as the effect ('ROM' instead of 'MD' in the escalc function). While the effect and model work with log ratios, in the plot the results are transformed to the plain ratio of average precision: now, the 1 point on the x-axis indicates no difference, ratios less than 1 are where stemming hurts performance, and ratios greater than 1 are where it helps. We can see that the results are the same; we have just expressed the effect of using stemming in different units. We feel that the raw difference in the score is easier to understand, and so going forward we will only report raw differences.
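Concretely, the only change from the earlier listing is the measure passed to escalc; this sketch (ours) reuses the per-collection summary data frame stems from above and back-transforms the log ratios for display:

# Log ratio of means ('ROM') instead of the raw mean difference ('MD')
stems.rom <- escalc('ROM', m2i = m_nostem, m1i = m_kro, sd2i = sd_nostem, sd1i = sd_kro,
                    n2i = n, n1i = n, data = stems)
model.rom <- rma(stems.rom, method = 'DL')
forest(model.rom, slab = stems.rom$coll, transf = exp)  # plot plain ratios of means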
Figure 2 shows the analysis when we don't combine the test collections with common document sets. The overall summary of the experiment is the same, but the plot now lets us see some interesting differences among the test collections. We add a plot of the summary effect for the studies within each test collection; these intervals are the same as those on the left side of Figure 1.

Figure 2: Stemming experiment broken out by individual test collection. Comparative models within each test collection are summarized below.
[Figure 2 data. Mean difference in AP with 95% CI: blogs06.2006 −0.01 [−0.08, 0.06]; blogs06.2007 −0.01 [−0.09, 0.06]; blogs06.2008 −0.01 [−0.09, 0.06]; cd45.TREC6 0.06 [−0.02, 0.14]; cd45.TREC7 0.01 [−0.05, 0.08]; cd45.TREC8 0.04 [−0.05, 0.12]; gov2.2004 0.01 [−0.06, 0.08]; gov2.2005 0.05 [−0.02, 0.11]; gov2.2006 0.00 [−0.07, 0.08]; nyt 0.03 [−0.06, 0.12]; RE model, all document sets 0.01 [−0.01, 0.04]; RE model, blogs06 only −0.01 [−0.06, 0.03]; RE model, cd45 only 0.03 [−0.01, 0.08]; RE model, gov2 only 0.02 [−0.02, 0.06].]

4.1 Comparison with score standardization
How does the meta-analysis method compare to score standardization? Here we analyze the data from the stemming experiment, following [25]. To standardize the scores in each test collection, we need to have standardization factors: the mean µ and standard deviation σ of the performance of a set of systems on each topic in the collection.


Webber et al. [25] advise taking the set of runs that contributed to the pools for that collection, and investigate the impact of improved performance over time, showing that for the TREC Robust track data, the standardization factors computed from the TREC 6-8 pools were suitable. For those collections and for the Gov2 collections, we re-use their standardization factors as published online.⁹

⁹ https://people.eng.unimelb.edu.au/ammoffat/ir_eval/

There are no published standardization factors for the Blogs06 collections. We compute them from all the runs that were submitted to the TREC Blog track in each year. In 2008, Blog track participants ran their systems over the 2006 and 2007 topics as well, and those systems were probably improved over those that were originally pooled for those topics in 2006 and 2007, so these factors might reflect some level of improvement and bug fixing that factors from pooled runs would not. In any event, it is not a problem to use a slightly different method to compute the standardization factors in different collections, as long as all experiments using a particular collection are normalized using the same factors.

The TREC 2017 Dynamic Domain track data was not pooled; rather, an active learning system based on relevance models was used to attempt to find and judge as many relevant documents as possible during topic development. Moreover, the measure for that track, the Cube Test, is quite different from average precision, and only a few teams participated in the track. As a result, we were wary of computing standardization factors for that dataset, and so we drop it from the analysis here.

To standardize an AP score for a topic, the score is normalized to zero mean and unit variance:

    s_std = (s_AP − µ) / σ

As stated above, this score is in units of standard deviations from the mean used to compute the standardization factors, so a z-score of 0 equals the mean average precision µ, and a z-score of -2 would indicate performance two standard deviations below that mean. Since the z-score can range over (−∞, ∞), we follow [25] and convert it into a probability using the cumulative density function of the standard normal distribution. After this scaling a z-score of 0 converts to a probability of 0.5 (performing better than half the systems used to compute the standardization factors), and a z-score of 1.96 gives a probability of 0.975.
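For illustration, the standardization and the conversion to a probability are a few lines in R (a sketch of ours; ap, mu, and sigma stand for a topic's AP score and that topic's standardization factors):

# Standardize a per-topic AP score with that topic's standardization factors,
# then map the z-score to a probability with the standard normal CDF.
standardize_ap <- function(ap, mu, sigma) {
  z <- (ap - mu) / sigma
  pnorm(z)
}

standardize_ap(0.25, mu = 0.20, sigma = 0.10)  # z = 0.5, probability about 0.69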
Table 2 shows the results using standardized scores. The column δ shows the difference in mean standardized AP obtained by applying stemming. The p and 95% confidence interval columns show the outcome of a paired, two-tailed t-test for the difference in mean standardized AP. Stemming provides a statistically significant advantage for three of the nine collections; the most hopeful end of the confidence intervals indicates a best-case advantage of around 2%. Those three collections are the three collections with the highest mean effect size in the meta-analysis, so in that sense, the two analyses agree. One should keep in mind that the standardized scores and deltas are not plain average precision figures, but rather represent the fraction of standardizing systems that this system improves upon.

               None    Krovetz    δ        p     95% CI δ
blogs06.2006   0.062   0.060    -0.001   0.05   [-0.004, 0.000]
blogs06.2007   0.039   0.038    -0.000   0.17   [-0.002, 0.000]
blogs06.2008   0.037   0.036    -0.001   0.08   [-0.003, 0.000]
cd45.TREC6     0.186   0.199     0.013   0.02   [0.002, 0.024]
cd45.TREC7     0.120   0.122     0.001   0.57   [-0.004, 0.008]
cd45.TREC8     0.220   0.231     0.011   0.00   [0.004, 0.019]
gov2.2004      0.153   0.156     0.003   0.36   [-0.004, 0.009]
gov2.2005      0.058   0.062     0.003   0.00   [0.002, 0.006]
gov2.2006      0.068   0.067    -0.001   0.56   [-0.005, 0.003]

Table 2: Stemming experiment results, reported using score standardization. "None" indicates the run without stemming.

Unfortunately, the score-standardization approach cannot provide any kind of summary effect or prediction about performance on future collections. While we might decide on the basis of standardized scores that stemming might or might not help a little, the meta-analysis computes a confidence interval on the overall effect size. Whereas the confidence intervals in the score standardization approach are predictive with respect to those individual collections, the meta-analysis summary interval is predictive for all collections similar to those analyzed. Thus, while the figures are similar, meta-analysis provides a more useful result than score standardization.

Analysis of variance is an alternative approach that can be taken with standardized scores. For ANOVA we would model the collection and the presence/absence of stemming as factors. This would allow us to compare the effect of the collection to the effect of stemming, but would still not provide inference about future collections.

5 PSEUDO-RELEVANCE FEEDBACK
For a second experiment, we looked at the improvement given by pseudo-relevance feedback over a baseline. Pseudo-relevance feedback (PRF) is a method of query expansion. The user's initial query is run and a small number of top-n documents are retrieved. Pretending that these documents are actually relevant, they are fed to a query-expansion model. The results of the expanded query are shown to the user. There are a number of variations, and each can be applied to different query expansion models. Historically (see, for example, Singhal [21]), PRF is felt to provide 5-10% relative improvement in MAP, but as with stemming, results are mixed in different collections. The Reliable Information Access (RIA) workshop in 2003 performed an in-depth study of PRF effectiveness using multiple systems and per-topic failure analysis, finding that PRF can often improve average precision but can frequently be fooled into abysmal performance by quirks in query processing or the expansion model [4].

The baseline here is the same as the unstemmed baseline in the stemming experiment. The query expansion model is Rocchio's algorithm [16], as implemented in the ATIRE system. Rocchio's algorithm is a vector-space model approach: relevant document vectors are added to the query vector, and nonrelevant ones subtracted, and in the limit this should produce an optimal query. In practice we don't have complete feedback and don't use all terms in the documents for expansion, so we operate quite a step back from that limit.


Here, we start with the title or short query fields of the topic (as in the stemming experiment), and expand the query using the top 50 terms from the top 20 retrieved documents. We tuned effectiveness using the TREC 6-8 topics, using a variety of ranking models and parameter settings, before settling on these settings as representative of the upper range of performance. ATIRE's implementation of the Rocchio model selects terms based on the KL-divergence of the term in the collection. We then re-rank the entire collection using the expanded query.

Figure 3 shows the experimental results in a forest plot. There is a small improvement in MAP for almost all collections, except in two of the Blogs06 collections. However, that is just the average, and the variance across topics indicates that some topics' average precision falls below the baseline in every collection, consistent with the result found in the RIA experiments [4]. Performance on the dynamic domain ("NYT") collection has the largest variance. The summary effect interval again crosses zero, indicating that no significant change from using PRF should be expected, on average, from test collections like these.

Figure 3: Pseudo-relevance feedback using Rocchio's algorithm.
[Figure 3 data. Mean difference in AP with 95% CI: blogs06.2006 0.00 [−0.07, 0.07]; blogs06.2007 0.00 [−0.08, 0.08]; blogs06.2008 0.01 [−0.07, 0.08]; cd45.TREC6 0.01 [−0.07, 0.08]; cd45.TREC7 0.02 [−0.05, 0.08]; cd45.TREC8 0.01 [−0.07, 0.09]; gov2.2004 0.02 [−0.05, 0.09]; gov2.2005 0.03 [−0.04, 0.10]; gov2.2006 0.02 [−0.05, 0.10]; nyt 0.02 [−0.07, 0.12]; RE model, all studies 0.01 [−0.01, 0.04]; RE model, blogs06 only 0.00 [−0.04, 0.05]; RE model, cd45 only 0.01 [−0.03, 0.05]; RE model, gov2 only 0.02 [−0.02, 0.07].]

One might find these results surprising, given that historically PRF was seen as a critical component for ad hoc search. One explanation could be that early work on PRF may not have had other parameters of the models well tuned. For example, the standard ATIRE BM25 parameters are (k1: 0.9, b: 0.4) in contrast to (k1: 1.2, b: 0.75) reported by the Okapi team in TREC-8 [15].

We also present the parallel analysis using score standardization, following the same procedures as in the stemming experiment and using the same nine collections that we could standardize. Table 3 shows the results. The TREC-7, TREC-8, and all the Gov2 collections show a statistically significant improvement, and again these are the collections with the largest positive effect size in the meta-analysis. The confidence intervals for these collections are negative at both ends, despite a positive delta in the experiment, which would seem to reflect numerical error due to the tiny difference in performance.

               None    PRF       δ        p     95% CI δ
blogs06.2006   0.062   0.062    0.000   0.77   [-0.001, 0.001]
blogs06.2007   0.039   0.038   -0.001   0.44   [-0.001, 0.002]
blogs06.2008   0.037   0.037   -0.000   0.95   [-0.001, 0.001]
cd45.TREC6     0.186   0.188    0.002   0.07   [-0.004, 0.000]
cd45.TREC7     0.120   0.122    0.002   0.00   [-0.003, -0.001]
cd45.TREC8     0.220   0.224    0.004   0.00   [-0.006, -0.001]
gov2.2004      0.153   0.157    0.004   0.01   [-0.008, -0.001]
gov2.2005      0.058   0.061    0.003   0.00   [-0.004, -0.001]
gov2.2006      0.068   0.071    0.003   0.02   [-0.006, -0.000]

Table 3: Pseudo-relevance feedback results presented using score standardization.

Because the differences in the systems are small in the standardized probability interval, the analysis from standardization indicates little or no effect from pseudo-relevance feedback in these collections. Both analyses support the hypothesis that pseudo-relevance feedback is slightly beneficial on average, but not all topics improve. The meta-analysis further supports an inferential conclusion, that we would see little or no effect in future collections. There are of course many parameters to the PRF setup which could be tweaked here: the number of feedback documents, the number of feedback terms, the feedback model, the starting query, and more. We might see different results by exploring those parameters. Meta-analysis with multiple covariates is called meta-regression, and we plan to explore the use of meta-regression in IR experiments in future work.

6 POWER ANALYSIS
Both of these experiments failed to achieve statistical significance. Why is that? This is a question of statistical power. Statistical power is the probability of rejecting a null hypothesis when the alternative hypothesis is true. Higher values of power indicate a smaller probability of Type II error, incorrectly accepting the null hypothesis. Power in meta-analysis is a function of the number of test collection studies we use, the variance we see in each collection, the variance between collections, and the effect size we expect to see. Just as individual IR experiments can be underpowered by having too few topics to detect a small effect, we can fall into the same trap in meta-analysis [22].

Prospective power analysis is complicated by the need to estimate these parameters, but since we've already done the meta-analysis, we can compute the power retrospectively, and use those results to help guide us on what we might have done differently. Note that we strongly recommend against retrospective power analysis as a tool for understanding the p-value or confidence interval of the experiment at hand. Rather, we can explore the components that make up statistical power, starting from the data we have, to imagine the experiment we need to design next.


Figure 4: Statistical power as we increase the number of test collections, or alternatively the effect size with only 10 test collections, given the effect variances we see in the stemming experiment.
[Figure 4 shows two panels: power as a function of the number of test collections (left), and power with 10 test collections as a function of effect size (right).]

The power of the random-effects test that the mean effect is different from zero is

    p = 1 − Φ(c_α − λ)

where Φ is the standard normal cumulative distribution function and c_α is the critical value. For α = 0.05, c_α = 1.96 for the two-tailed test we are doing here. λ is the extent to which the null hypothesis is false, which we compute from the observed effects and variances:

    λ = Y / sqrt(v* / k)
    v* = v̄ + var(y_i)

Y is the mean effect, and k is the number of test collections we've used. v* is the variance component; because we are using a random-effects model, it is the sum of the average variance within each test collection and the variance in the effect size across test collections.
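A small sketch of this calculation (ours; effects and variances are hypothetical vectors holding the per-collection effect sizes and within-collection variances):

# Minimal sketch of random-effects power for the test that the mean effect is zero,
# following the formulas above. Y is taken as the simple mean of the observed effects.
re_power <- function(effects, variances, alpha = 0.05) {
  k       <- length(effects)
  vstar   <- mean(variances) + var(effects)  # v* = mean within-collection variance
                                             #      plus between-collection variance
  lambda  <- mean(effects) / sqrt(vstar / k)
  c_alpha <- qnorm(1 - alpha / 2)            # 1.96 for a two-tailed test at alpha = 0.05
  1 - pnorm(c_alpha - lambda)
}

Curves like those in Figure 4 can be traced by varying k or the assumed mean effect while holding the variance components fixed.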
The observed power of the stemming experiment with four test collections is 16.5%, and 18.9% with ten test collections. If we work with the effects and variances we see in the ten-collection experiment, we can ask about the design of a follow-on experiment with more studies or a different target effect size.

Figure 4 illustrates how we might design the experiments differently, expecting that the individual test collections we would use would have similar properties. The left-hand plot shows how power would increase given more test collections, and the right-hand plot shows how power would increase if we tried to detect a larger effect size. We can see that we would have to use 60 or more test collections to get 80% power for a 0.016 mean effect size, but if we could observe effect sizes of 0.05 or greater, we could achieve 80% power with only these ten test collections.

7 DISCUSSION
Meta-analysis is essentially performing the same function as score standardization: scores are transformed by dividing them by the variance in the test collection. However, there are no fixed standardization factors; rather, the standardization is done for the data at hand, specifically in order to show the effect size of the experimental condition on the same scale in each test collection. Additionally, the variance in each collection and between the collections is used to compute a weighted average effect summarizing the results of all the individual test collections.

Because meta-analysis and score standardization are equivalent as far as computing the effect size within each experiment, what is there to be gained here? Meta-analysis provides a statistical framework for asking about the overall effect given many test collections. The mean effect is understandable either from the perspective of null-hypothesis testing, or by interpreting confidence intervals.

We were forced to drop the TREC 2017 Dynamic Domain collection from the score standardization analysis because of concerns regarding the existing runs we would use to compute the standardization factors: the collection was built with active learning instead of pooling, and only a few systems participated in the track, so the set of runs to standardize from is limited. This is not a problem for meta-analysis, because as long as the collection is sufficiently complete to measure effectiveness on the baseline and experimental condition runs, the effect size is still meaningful. If the collection were missing judgments for many highly-ranked retrieved documents in either the baseline or experimental condition, the effect size would be in doubt and we would not advise using the collection in any analysis.


In meta-analysis generally there are a great number of choices to make regarding effect sizes, models, and so on [3]. In this paper we concentrate on the raw mean difference as the effect size, do not assume that the variance of the effect size is constant among test collections but employ a random-effects model to allow our estimate of the true effect size to lie within an interval, look at a single experimental variable, and employ traditional estimators where we need them. We recommend that IR experiments generally follow this approach, except that multiple variables could be of interest. Industry studies which might consider a binary response like search success would compute the effect size differently.

Traditional meta-analysis is done as part of a systematic review of past literature. This is complicated in information retrieval, where many studies make use of the same test collections, but may have other differences in the systems used or the parameters set. A future step for meta-analysis in IR would be to show how it can be used to understand variations in performance from multiple systems that claim to implement the "same" algorithm.

Another question which confronts IR is that of additivity: should we expect to see such-and-such an effect size when applied to a different baseline? Research using poor baselines has historically led to exaggerated claims of improved effectiveness. Armstrong et al. [2] surveyed more than 100 articles in SIGIR 1998-2008 and CIKM 2003-2008 to compare choices in baselines, and found that authors often employ weak baselines, often not even competitive with official TREC results. (Unfortunately, their survey did not standardize scores, although some of their other experiments do.) This work was extended by Kharazmi et al. [11] to explore more recent IR methods like result diversification. Their analyses of additivity are complicated by the lack of strong comparisons across multiple test collections. Meta-analysis can't solve the problem of choosing a weak baseline, but by trying to derive a general effect size of an IR feature over multiple test collections, meta-analysis could bring us a step closer to understanding additivity.

Cormack and Lynam [6] also suggested meta-analysis as an alternative to averaging per-topic scores within a single test collection, but we have not investigated that here. We could consider the measure of a single topic to be drawn from the set of judged documents retrieved between the baseline and the treatment run. Within-topic variance could come from observations of differing relevance judgments or from models of assessor disagreement. Topics would not need to be scored by a common metric; instead each topic could use the measure best suited to that information need. Meta-analysis would scale the per-topic scores and then output a weighted average as the summary treatment effect.

8 CONCLUSION
Information retrieval experiments are best done using multiple test collections, but the community has lacked a framework for making comparisons across multiple test collections in a statistically sound way. Score standardization as proposed by Webber et al. [25] was an important step, but didn't say anything about how inferences should subsequently be drawn from the data. If we accept score standardization as an approach, then meta-analysis completes that step. Meta-analysis provides a method to obtain a summary effect and to bound it in a confidence interval.

We have shown this with two demonstrator experiments, one on stemming and one on pseudo-relevance feedback. The meta-analysis approach agrees with the analysis by score standardization, but goes further in providing a summary overall effect that we might expect from these algorithms on similar sets of test collections.

We have also demonstrated using meta-analysis that experimenters do very much need to use multiple test collections, and that results from two or three very similar collections should be suspect. Existing methodologies did not support drawing a conclusion from the suite of test collection experiments, but meta-analysis does.

A feature of the meta-analysis method is that baselines are an explicit component of the analysis; the effect size is always with respect to a specified baseline. The summary effect is only with respect to that baseline, and there is no implication that a result generalizes beyond the application analyzed. If a researcher wants to claim that an improvement is orthogonal to other system components and would yield comparable effects when added to any system, the onus is on the researcher to demonstrate that. The meta-analysis makes it clear whether additivity is the subject of the experiment.

This paper has only presented the simplest meta-analysis models. Meta-analysis as a field of statistics has been around since at least the 1980s and has been very active since the 2000s, as computational methods have made systematic reviews easier to manage. There are new methods being invented all the time. We feel that the control we as experimenters bring to the IR experiment means that older models and estimation methods are robust, but exploring the space of meta-analytic methods is part of our planned future work.

Like any subfield of statistics, meta-analysis is not without its detractors. When meta-analysis is used to summarize existing research studies, it can be difficult to anticipate issues of bias in results, and there is much work in the meta-analysis literature trying to solve that problem. We feel that in information retrieval, where we use meta-analysis to integrate multiple test collection experiments over which we have total control and can completely observe effects, variances, and design decisions, meta-analysis is robust and provides distinct advantages over alternative approaches.

9 ACKNOWLEDGEMENTS
We deeply thank Sauparna Palchowdhury for the initial work on this study, and Andrew Trotman, Matt Crane, and the RIGOR workshop for helpful advice on using the ATIRE system. We also thank the reviewers of several venues for their comments on earlier versions of this paper.

REFERENCES
[1] Jaime Arguello, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2015. SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR). In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 1147–1148. https://doi.org/10.1145/2766462.2767858
[2] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements That Don't Add Up: Ad-hoc Retrieval Results Since 1998. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09). ACM, New York, NY, USA, 601–610. https://doi.org/10.1145/1645953.1646031
[3] Michael Borenstein, Larry V. Hedges, Julian P. T. Higgins, and Hannah R. Rothstein. 2009. Introduction to Meta-Analysis. John Wiley & Sons, Ltd.
[4] Chris Buckley. 2009. Why Current IR Engines Fail. Inf. Retr. 12, 6 (Dec. 2009), 652–665. https://doi.org/10.1007/s10791-009-9103-2
[5] Chris Buckley and Ellen M. Voorhees. 2000. Evaluating Evaluation Measure Stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00). ACM, New York, NY, USA, 33–40. https://doi.org/10.1145/345508.345543


[6] Gordon V. Cormack and Thomas R. Lynam. 2006. Statistical Precision of Information Retrieval Evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06). ACM, New York, NY, USA, 533–540. https://doi.org/10.1145/1148170.1148262
[7] Rebecca DerSimonian and Nan Laird. 2015. Meta-analysis in clinical trials revisited. Contemporary Clinical Trials 45 (2015), 139–145. https://doi.org/10.1016/j.cct.2015.09.002
[8] Joel L. Fagan. 1987. Automatic Phrase Indexing for Document Retrieval: An Examination of Syntactic and Non-Syntactic Methods. In Proceedings of the 10th Annual Conference on Research and Development in Information Retrieval (SIGIR 1987). New Orleans, LA, 91–101.
[9] John Guiver, Stefano Mizzaro, and Stephen Robertson. 2009. A Few Good Topics: Experiments in Topic Set Reduction for Retrieval Evaluation. ACM Trans. Inf. Syst. 27, 4, Article 21 (Nov. 2009), 26 pages. https://doi.org/10.1145/1629096.1629099
[10] Donna Harman. 1987. A Failure Analysis on the Limitations of Suffixing in an Online Environment. In Proceedings of the 10th Annual Conference on Research and Development in Information Retrieval (SIGIR 1987). New Orleans, LA, 102–108.
[11] Sadegh Kharazmi, Falk Scholer, David Vallet, and Mark Sanderson. 2016. Examining Additivity and Weak Baselines. ACM Trans. Inf. Syst. 34, 4, Article 23 (June 2016), 18 pages. https://doi.org/10.1145/2882782
[12] Robert Krovetz. 1993. Viewing morphology as an inference process. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 191–202.
[13] Keith O'Rourke. 2007. An historical perspective on meta-analysis: dealing quantitatively with varying study results. Journal of the Royal Society of Medicine 100, 12 (Dec 2007), 579–582. https://dx.doi.org/10.1258/jrsm.100.12.579
[14] Martin F. Porter. 1980. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137.
[15] Stephen E. Robertson and Stephen Walker. 1999. Okapi/Keenbow at TREC-8. In Proceedings of the Eighth Text Retrieval Conference (TREC-8). 151–162. https://trec.nist.gov/pubs/trec8/papers/okapi.pdf
[16] Joseph J. Rocchio. 1965. Relevance Feedback in Information Retrieval. In Information Storage and Retrieval (ISR-9), Gerard Salton (Ed.). NSF. http://sigir.org/files/museum/pub-08/XXIII-1.pdf
[17] Tetsuya Sakai. 2006. Evaluating Evaluation Metrics Based on the Bootstrap. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06). ACM, New York, NY, USA, 525–532. https://doi.org/10.1145/1148170.1148261
[18] Tetsuya Sakai. 2016. A Simple and Effective Approach to Score Standardisation. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16). ACM, New York, NY, USA, 95–104. https://doi.org/10.1145/2970398.2970399
[19] Gianmaria Silvello, Riccardo Bucco, Giulio Busato, Giacomo Fornari, Andrea Langeli, Alberto Purpura, Giacomo Rocco, Alessandro Tezza, and Maristella Agosti. 2018. Statistical Stemmers: A Reproducibility Study. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings. 385–397. https://doi.org/10.1007/978-3-319-76941-7_29
[20] Jasmeet Singh and Vishal Gupta. 2016. Text Stemming: Approaches, Applications, and Challenges. ACM Comput. Surv. 49, 3, Article 45 (Sept. 2016), 46 pages. https://doi.org/10.1145/2975608
[21] Amit Singhal. 1997. AT&T at TREC-6. In Proceedings of the Sixth Text Retrieval Conference (TREC-6). 215–226. http://trec.nist.gov/pubs/trec6/papers/att.ps.gz
[22] Jeffrey C. Valentine, Therese D. Pigott, and Hannah R. Rothstein. 2010. How Many Studies Do You Need? A Primer on Statistical Power for Meta-Analysis. Journal of Educational and Behavioral Statistics 35, 2 (2010), 215–247. https://doi.org/10.3102/1076998609346961
[23] Ellen M. Voorhees. 2002. The Philosophy of Information Retrieval Evaluation. In Evaluation of Cross-Language Information Retrieval Systems, Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 355–370.
[24] Ellen M. Voorhees and Donna K. Harman (Eds.). 2005. TREC: Experiment and Evaluation in Information Retrieval. MIT Press.
[25] William Webber, Alistair Moffat, and Justin Zobel. 2008. Score Standardization for Inter-collection Comparison of Retrieval Systems. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08). ACM, New York, NY, USA, 51–58. https://doi.org/10.1145/1390334.1390346
