document in this column. Columns a1 and a2 show the results and ideal orderings up to position 5, along with judgment values. Note that the results ordering does not change, but the judgment value for some of the documents changes depending on the category of the query and the document. In the Results column under a1, d8, d9, and d10 receive a 0 judgment value since they belong to c2. Similarly, in the Results column under a2, d1 and d2 receive 0 judgment scores. At the bottom of the table, we show the resulting DCG scores for each intent, followed by a row for the NDCG values. Finally, we show the NDCG-IA score below the combined a1 and a2 column set. Recall from Section 3.7 that P(c1|q) = 0.7 and P(c2|q) = 0.3. So the final calculation is:

NDCG-IA(Q, 5) = (0.7 × 0.7396) + (0.3 × 0.6612) = 0.7161

4.2.2 MRR-IA and MAP-IA

Using the same idea of averaging over user intents, we define intent-aware MRR to be

MRR-IA(Q, k) = Σ_c P(c|q) MRR(Q, k|c).

Similarly, intent-aware MAP is defined to be

MAP-IA(Q, k) = Σ_c P(c|q) MAP(Q, k|c).

5. EMPIRICAL EVALUATION

We evaluate our approach for result diversification against three commercial search engines. To preserve anonymity, we refer to these as Engines 1, 2, and 3.

We conduct three sets of experiments, differing in how the distributions of intents and the relevance of the documents are obtained. In the first set of experiments, we obtain the distributions of intents for both queries and documents via standard classifiers, and the relevance of documents from a proprietary repository of human judgments to which we have been granted access. This allows us to conduct a large-scale study of how IA-Select performs compared to the other engines, but the results must be taken with a grain of salt, as the input to IA-Select and the evaluation of the metrics use the same intent distributions.

In the second set of experiments, we obtain the distributions of intents for queries and the document relevance using the Amazon Mechanical Turk platform (www.mturk.com). This experiment is intended as a sanity check against the results obtained in the first one, especially with respect to bias introduced due to the sharing of intent distributions between IA-Select and the metrics. More specifically, while IA-Select still uses the intent distribution from the aforementioned classifier, the metrics are now evaluated with respect to the distributions of intents and the relevance scores obtained from the human subjects.

In the third set of experiments, we use a hybrid of data sources for evaluation: the distribution of intents for queries is from the Mechanical Turk platform, whereas document relevance is from the repository used in the first experiment. This allows us to focus on how the performance of IA-Select depends on having exact knowledge of the query intent distribution.

In all of the experiments, intents are classified according to the ODP taxonomy (www.dmoz.org). Documents are scored for relevance on a 4-point scale as described in Section 4. The results used in the evaluation of the commercial search engines are based on the actual documents returned for the queries issued. The results generated by IA-Select are based on the following input: the intent distribution obtained from a query classifier; the set of documents given by the top 50 results from one of the search engines; and the intent-specific quality value given by the relevance score obtained from the same search engine, multiplied by the probability that the document belongs to the corresponding category, as determined by a document classifier. The ordering generated by IA-Select can be interpreted as an attempt to re-order the results from that search engine to take diversification into account under our problem formulation. One can also view the results as a comparison between a version of IA-Select that is ignorant of intents and one that takes intents into account.

For evaluation, we employ the three intent-aware metrics explained in Section 4. An alternative is to conduct side-by-side studies that try to capture user satisfaction more directly. Our choice is dictated by the fact that side-by-side comparisons are usually highly noisy and hence require large sample sizes, which would limit the scope of our study. On the other hand, the metrics we use can leverage human judgments already present in the repository, and give us the flexibility to use different distributions of intents in our evaluation.

In these studies, we have chosen to use proprietary datasets …
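The intent-aware averaging in Section 4.2.2 can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the function and argument names are ours, and the base per-intent metric values (NDCG, MRR, or MAP restricted to an intent c) are assumed to be computed elsewhere.

```python
def intent_aware(p_intent, per_intent):
    """Combine per-intent metric values M(Q, k | c) into M-IA(Q, k).

    p_intent:   dict mapping intent c -> P(c|q)
    per_intent: dict mapping intent c -> base metric value given c
                (e.g., NDCG, MRR, or MAP evaluated w.r.t. intent c)
    """
    # M-IA(Q, k) = sum_c P(c|q) * M(Q, k | c)
    return sum(p * per_intent[c] for c, p in p_intent.items())

# The worked NDCG-IA example from Section 4.2.1:
ndcg_ia = intent_aware({"c1": 0.7, "c2": 0.3},
                       {"c1": 0.7396, "c2": 0.6612})
print(round(ndcg_ia, 4))  # 0.7161
```

The same helper serves for MRR-IA and MAP-IA, since all three metrics differ only in the base quantity being averaged over intents.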
[Figure and table content lost in extraction. Recoverable elements: a histogram of query counts (0–600) by category count (2–10); the distribution of judgment labels (Excellent, Good, Fair, Bad) as a percentage of judgments (0–50%); and bar charts comparing Diverse (IA-Select) against Engines 1–3 on NDCG-IA@{3,5,10} (values ≈ 0.2–0.3), MAP-IA@{3,5,10} (≈ 0.5–0.58), and MRR-IA@{3,5,10} (≈ 0.52–0.58). A small table lists two unlabeled columns of values at cutoffs @2–@5: @2 .8501/.8720, @3 .8577/.8676, @4 .8614/.8663, @5 .8603/.8690.]
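The re-ranking evaluated in this section can be sketched as a greedy maximization of the objective P(S|q) = Σ_c P(c|q)(1 − ∏_{d∈S}(1 − V(d|q,c))), where, per the experimental setup, V(d|q,c) is the engine's relevance score times the document classifier's category probability. IA-Select itself is defined in Section 3, which is not reproduced in this excerpt; the sketch below is ours, consistent with the marginal-gain quantities used in Appendix A, and all names are illustrative.

```python
def ia_select(k, p_intent, docs, V):
    """Greedily pick k documents maximizing P(S|q).

    p_intent: dict c -> P(c|q)
    docs:     candidate document ids (e.g., an engine's top-50 results)
    V:        dict (d, c) -> quality value in [0, 1]
    """
    U = dict(p_intent)             # U(c|q,S): prob. c is still unsatisfied; starts at P(c|q)
    S, remaining = [], set(docs)
    while remaining and len(S) < k:
        # Marginal gain of d: sum_c U(c|q,S) * V(d|q,c)
        best = max(remaining,
                   key=lambda d: sum(U[c] * V.get((d, c), 0.0) for c in U))
        S.append(best)
        remaining.discard(best)
        for c in U:                # U(c|q, S + d) = (1 - V(d|q,c)) * U(c|q,S)
            U[c] *= 1.0 - V.get((best, c), 0.0)
    return S

# Toy example: with intents split 0.7/0.3, a strong c2 document is promoted
# over a second, now-redundant c1 document.
p = {"c1": 0.7, "c2": 0.3}
V = {("d1", "c1"): 0.9, ("d2", "c1"): 0.8, ("d3", "c2"): 0.9}
print(ia_select(2, p, ["d1", "d2", "d3"], V))  # ['d1', 'd3']
```

The discounting of U(c|q,S) after each pick is what produces diversification: once an intent is likely satisfied, further documents for that intent contribute little marginal gain.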
We thus now have a P(c|q) for computing the performance of IA-Select which is different from the one it uses for diversifying results.

Figure 7 shows the results. We include the results only for the NDCG-IA metric; the trends are similar for the other two metrics. We use only those queries for which we could obtain P(c|q) values from the user study. We observe that IA-Select again outperforms the three search engines, showing the importance of diversifying search results from the point of view of users.

It is tempting to compare Figure 7 with Figure 3. However, several cautions are in order. First, Figure 7 includes computation for queries that are a subset of those used for computing Figure 3; they are the ones that had at least three intents. Furthermore, we retained only the top three intents for each query when we obtained user judgments.

Subject to the above caveats, we can analyze why the NDCG-IA values are relatively lower in Figure 7. We are penalizing the scores of those queries that might have had more than three intents. Additionally, the NDCG-IA values for IA-Select are expected to be lower due to the mismatch between the estimates for P(c|q) obtained from two different sources. IA-Select still emerges as the winner.

6. CONCLUSIONS

We studied the question of how best to diversify results in the presence of ambiguous queries. This is a problem that most search engines face, as users often underspecify their true information needs. We took into consideration both the relevance of the documents and the diversity of the search results, and presented an objective that directly optimizes for the two. We provided a greedy algorithm for the objective with good approximation guarantees. To evaluate the effectiveness of our approach, we proposed generalizations of well-studied metrics to take into account the intentions of the users. Over a large set of experiments, we demonstrated that our approach consistently outperforms results produced by commercial search engines over all of the metrics.

The objective in Diversify can be viewed as a conservative metric that aims to maximize the probability that the average user will find some useful information among the search results. There are other reasonable objectives as well. For example, when users are looking for opinions on a product, a set of results is useful only when there are multiple results regarding the product. The objective function will need to be changed accordingly to take this into account. … of how useful the results are. We plan to explore these alternatives in the future.

7. ACKNOWLEDGMENTS

We thank the anonymous reviewers for their detailed review and thoughtful suggestions.

8. REFERENCES

[1] Kannan Achan, Ariel Fuxman, Panayiotis Tsaparas, and Rakesh Agrawal. Using the wisdom of the crowds for keyword generation. In WWW, pages 1–8, 2008.
[2] Aris Anagnostopoulos, Andrei Z. Broder, and David Carmel. Sampling search-engine results. In WWW, pages 245–256, 2005.
[3] A. Bookstein. Information retrieval: A sequential learning process. Journal of the American Society for Information Sciences (ASIS), 34(5):331–342, 1983.
[4] B. Boyce. Beyond topicality: A two stage view of relevance and the retrieval process. Info. Processing and Management, 18(3):105–109, 1982.
[5] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335–336, 1998.
[6] Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In SIGIR, pages 429–436, 2006.
[7] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In SIGIR, pages 659–666, 2008.
[8] W. Goffman. A searching procedure for information retrieval. Info. Storage and Retrieval, 2:73–78, 1964.
[9] Dorit Hochbaum, editor. Approximation Algorithms for NP-Hard Problems. Springer, 1999.
[10] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, pages 41–48, 2000.
[11] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge Univ Press, 2008.
[12] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Math. Programming, 14:265–294, 1978.
[13] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. First presented at the NIPS 2007 Workshop on Machine Learning for Web Search.
[14] Filip Radlinski and Susan T. Dumais. Improving personalized web search using result diversification. In SIGIR, pages 691–692, 2006.
[15] Erik Vee, Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, and Sihem Amer-Yahia. Efficient computation of diverse query results. In ICDE, pages 228–236, 2008.
[16] E. M. Voorhees. Overview of the TREC 2004 robust retrieval track. In TREC, 2004.
[17] Yunjie Xu and Hainan Yin. Novelty and topicality in interactive information retrieval. J. Am. Soc. Inf. Sci. Technol., 59(2):201–215, 2008.
[18] ChengXiang Zhai. Risk Minimization and Language Modeling in Information Retrieval. PhD thesis, Carnegie Mellon University, 2002.
[19] ChengXiang Zhai, William W. Cohen, and John D. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In SIGIR, pages 10–17, 2003.
[20] ChengXiang Zhai and John D. Lafferty. A risk minimization framework for information retrieval. Info. Processing and Management, 42(1):31–55, 2006.
[21] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. In WWW, pages 22–32, 2005.

APPENDIX

A. PROOFS FOR SECTION 3

Lemma 1. Diversify(k) is NP-hard.

Proof. This follows from a reduction from Max Coverage, an NP-hard problem related to Set Cover [9]. In the Max Coverage problem, one is given a universe of elements U, a collection C of subsets of U, and an integer k. The objective is to find a set of subsets S ⊆ C, |S| ≤ k, that maximizes the number of covered elements.

Given an instance of the Max Coverage problem, we map U to the set of categories C. The collection of subsets C is mapped to the set of documents D, where if C ∈ C contains elements {u1, u2, ..., uk}, the corresponding document d ∈ D can satisfy categories {c1, c2, ..., ck}, i.e., V(d|q, ci) = 1 for c1 through ck. One can easily verify that the optimal solution to Diversify(k) is optimal for Max Coverage. In fact, Diversify(k) is a soft generalization of Max Coverage, in the sense that the documents can partially satisfy a category.

Lemma 2. P(S|q) is a submodular function.

Proof. Intuitively, the function is submodular because a larger set of documents would have already satisfied more users, and therefore the incremental gain from an additional document is smaller. Let us now capture this intuition mathematically. Let S, T be two arbitrary sets of documents related by S ⊆ T. Let e be a document not in T. Denote S ∪ {e} by S' and similarly T ∪ {e} by T'. Then:

P(S'|q) − P(S|q)
  = Σ_c P(c|q) [ (1 − ∏_{d∈S'} (1 − V(d|q,c))) − (1 − ∏_{d∈S} (1 − V(d|q,c))) ]
  = Σ_c P(c|q) [ ∏_{d∈S} (1 − V(d|q,c)) − ∏_{d∈S'} (1 − V(d|q,c)) ]
  = Σ_c P(c|q) ∏_{d∈S} (1 − V(d|q,c)) V(e|q,c)

Similarly, we can establish that

P(T'|q) − P(T|q) = Σ_c P(c|q) ∏_{d∈T} (1 − V(d|q,c)) V(e|q,c).

However, note that for all c, since S ⊆ T and each factor lies in [0, 1],

∏_{d∈T} (1 − V(d|q,c)) ≤ ∏_{d∈S} (1 − V(d|q,c)).

Therefore, we conclude that

P(S'|q) − P(S|q) ≥ P(T'|q) − P(T|q)

as desired, i.e., the function P(S|q) is submodular.

Theorem 1. IA-Select is optimal when |C(d)| = 1 for all d ∈ R(q).

Proof. First, note that if each document can be of only one category, i.e., |C(d)| = 1, then g(d|q, c, S) is solely determined by U(C(d)|q, S) V(d|q, C(d)). Therefore, for a given category c, the ordering of the documents R(q) with respect to g(d|q, c, S) is the same for all S. For a set of documents S, by replacing the documents in each category with the ones with the highest V(d|q, c), P(S|q) can only improve. Let us call this a local improvement. Note that the results returned by IA-Select cannot be locally improved, since it selects only the documents with the highest V(d|q, c) in each category.

Suppose, contrary to our claim, that there exists a solution T that is strictly better than the result S returned by IA-Select. We can first carry out local improvements of T as specified above. If S and T have the same number of documents in a category, then these documents must be identical. Therefore, for T to be better, it must include a different number of documents for some categories.

Given that |S| = |T|, if T has more results than S in category c1, then there must be some other category c2 for which S has more results. Consider the worst document, d1, that T returned for category c1, and the best document, d2, that S returned for category c2 that T does not include. Let S' = S \ {d2} and T' = T \ {d1}.

Since the algorithm is greedy, the marginal benefit of adding d2 to S' must be at least as high as that of adding d1, or d1 would have been chosen instead of d2. By submodularity, compared to S', the marginal benefit of adding d2 can only be higher for T', since T' has only a subset of the documents in category c2, and the marginal benefit of adding d1 can only be lower, since T' has a superset of the documents in category c1. Therefore,

P(T' ∪ {d2}|q) − P(T'|q) ≥ P(S' ∪ {d2}|q) − P(S'|q)
                         ≥ P(S' ∪ {d1}|q) − P(S'|q)
                         ≥ P(T' ∪ {d1}|q) − P(T'|q)

Hence, P(T' ∪ {d2}|q) ≥ P(T' ∪ {d1}|q) = P(T|q). Note that (T' ∪ {d2}) is more similar to the set S than T was, i.e., |S ∩ T| < |S ∩ (T' ∪ {d2})|. By repeating this argument, we eventually obtain a chain of inequalities that leads to S, which yields P(S|q) ≥ P(T|q), contradicting the supposition that T is better.
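The submodularity established in Lemma 2 can be checked numerically. The sketch below (ours, with made-up quality values) implements P(S|q) directly from its definition and verifies that the marginal gain of a new document with respect to a smaller set S is at least its marginal gain with respect to a superset T.

```python
import random

def P(S, p_intent, V):
    """P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c)))."""
    total = 0.0
    for c, p in p_intent.items():
        prod = 1.0
        for d in S:
            prod *= 1.0 - V[(d, c)]
        total += p * (1.0 - prod)
    return total

random.seed(0)
docs, intents = ["d1", "d2", "d3", "d4"], ["c1", "c2"]
p_intent = {"c1": 0.6, "c2": 0.4}
V = {(d, c): random.random() for d in docs for c in intents}

# S subset of T, and e not in T, as in the statement of Lemma 2.
S, T, e = {"d1"}, {"d1", "d2", "d3"}, "d4"
gain_S = P(S | {e}, p_intent, V) - P(S, p_intent, V)
gain_T = P(T | {e}, p_intent, V) - P(T, p_intent, V)
assert gain_S >= gain_T  # diminishing marginal gain (submodularity)
```

Since every factor (1 − V(d|q,c)) lies in [0, 1], the assertion holds for any choice of quality values, mirroring the inequality over products in the proof.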