Professional Documents
Culture Documents
Results
ABSTRACT 1. INTRODUCTION
We propose a novel conception language for exploring the re- One main deficiency with current services for searching
sults retrieved by several internet search services (like search over the Internet is the inadequacy of the list of ranked doc-
engines) that cluster retrieved documents. The goal is to uments which they offer to access documents retrieved by a
offer users a tool to discover relevant hidden relationships query. It is well known that users generally do not analyze
between clustered documents. documents ranked in the low positions of the ordered list,
The proposal is motivated by the observation that visualiza- even when the search engine has retrieved only hundreds of
tion paradigms, based on either the ranked list or clustered documents. Thus, if users do not find what they are looking
results, do not allow users to fully exploit the combined use for in the first one or two result pages, they are more keen to
of several search services to answer a request. reformulate a new query than to analyze successive pages,
When the same query is submitted to distinct search ser- or they prefer to try with another search service.
vices, they may produce partially overlapped clustered re- The motivation of using meta search engines, such as mamma,
sults, where clusters identified by distinct labels collect some dogpile, Metacrawler etc., is the claimed or assumed greater
common documents. Moreover, clusters with similar labels, effectiveness of these systems, with respect to single search
but containing distinct documents, may be produced as well. engines, in retrieving relevant web documents in the first
In such a situation, it may be useful to compare, combine results pages.
and rank the cluster contents, to filter out relevant docu- Nevertheless, the side effect of the merging lists is to
ments. In the proposed language, we define several oper- augment the number of retrieved documents, which will be
ators (inspired by relational algebra) that work on groups hardly analyzed by users; thus, this makes useless much of
of clusters. New clusters (and groups) can be generated by the meta search engine’s effort.
combining (i.e., overlapping, refining and intersecting) clus- To overcome this problem, some search services such as
ters (and groups), in a set oriented fashion. Furthermore, vivisimo, clusty, Snaket, Ask.com 1 , MS AdCenter Labs Search
several ranking functions are also proposed, to model dis- Result Clustering etc., have shifted from the usual ranked
tinct semantics of the combination. list to the clustered results paradigm. This consists in or-
ganizing the documents retrieved by a query into containers
(i.e., clusters), possibly semantically homogeneous with re-
Categories and Subject Descriptors spect to their contents, and in presenting them labeled, so
H.2.3 [DATABASE MANAGEMENT]: Languages; H.2.1 as to synthesize their main content focus.
[DATABASE MANAGEMENT]: Logical Design - Data Clustering is often proposed as a viable way of narrowing
Models a search into a more specific query, like in Ask.com. Besides
search engines, also several news services provide overviews
General Terms of incoming news streams clustered into news stories, as an
alternative to the ranking of news depending on their date.
Query Languages, Semantic Searches
W.r.t. the ranked list, the clustered results paradigm has
the advantage of offering users an overview of the main top-
Keywords ics dealt with in the whole set of retrieved documents, that
Search service, web documents, Exploratory search would be missed in the first few pages of the traditional
ranked list [2, 15, 6].
On the other side, one problem users encounter with such
clustered results, is the inability of fully understanding and
Permission to make digital or hard copies of all or part of this work for appreciating the contents of the clusters. This is mainly
personal or classroom use is granted without fee provided that copies are due to the short and sometimes bad quality of the labels
not made or distributed for profit or commercial advantage and that copies of the clusters, which generally consist of a few terms, or
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
individual short phrases, which are automatically extracted
permission and/or a fee. from the documents of the cluster based on statistic and co-
CIKM’08, October 26 – 30, 2008, Napa Valley, California, USA. 1
Copyright 2008 ACM 978-1-59593-991-3/08/10 ...$5.00. At www.ask.com/reference/dictionary/49514/cluster
occurrence analysis. Often, several clusters have similar la- of possible use of this language is discussed in the context of
bels which differ just for a single term. To effectively explore tourism. In Section 4, the basic operations between clusters
the cluster contents, users have no other means than clicking are introduced, while in Section 5 and 6 the group operators
on the cluster labels and browsing the clusters themselves. and the functions on groups are formalized, respectively. In
This problem is much more apparent when submitting the Section 7 the related works are reviewed and, the conclusions
same request to distinct search engines, each one producing summarize the main contents.
a group of clustered results reflecting distinct criteria. For
example, the Gigabits search engine clusters retrieved docu-
ments by their freshness dating (Last Day, Last Week, Last 2. THE DATA MODEL
Month, Last Year, etc.), the Dmoz search service presents In this section we define the data model of our language.
classified documents. In such a situation, one may want Consider a query q submitted to a search engine; the query
to explore if a given category contains documents that are result is a ranked list of documents, that we call ranked
fresh or not; this necessity may occur quite frequently in items.
analyzing news streams (RSS) to find out the frequency of Definition 1: Ranked Item A ranked item r represents
a given news story reported by media as a function of time. a document retrieved by a web search. It is defined as a
When the groups of clusters are generated by distinct search tuple
services, users may be faced with distinct clusters possibly r : (uri, title, snippet, irank )
with same labels. In this situation, it becomes necessary
to explore the relationships between the contents of these where: uri is the Uniform Resource Identifier of the ranked
clusters to identify common and distinct documents. web document; title and snippet are, respectively, the docu-
But the task so far described is, in effect, an exploratory ment title and snippet2 ; irank is a score (in the range [0, 1])
task, that may last for a long time, and may require to that expresses the estimated relevance of the retrieved doc-
reuse the intermediate results several times. For this rea- ument w.r.t. the query. 2
son, we can conceive the storing of the intermediate results The same document is represented by distinct items in
into a database as an essential phase for successive manipu- distinct results lists. In facts, we assume that a document is
lation. Furthermore, the local manipulation of results avoids uniquely identified by its uri [4] (near duplicates are not de-
useless network and search services overloading; in fact, in tected), while it may have distinct snippets and irank when
current practices, several modified queries are submitted to retrieved by different search services. We assume that irank
the search engines, trying to capture relevant documents in is a function of the position of the item in the query result
the first positions of the ranked list, documents that were list.
already retrieved by the previous queries, although hidden
Definition 2: Cluster A cluster representative c is a set
to the user since they did not occurred in the first position.
of ranked items, having itself a rank. It is defined as a tuple
In this contribution we propose a novel conception lan- c : (label , crank , items)
guage for aiding users in flexibly exploring groups of clus- where: label is a set of terms that semantically synthesizes
tered results retrieved by one or several search services (ba- the main content of the cluster; crank is a score (in the range
sically search engines) over the internet, so as to make them [0, 1]) depending on the ranks of the items belonging to the
able to reveal implicit relationships.In our view, the usual cluster; items is the set of ranked items that constitute the
ranked list, produced by search engines is regarded as a cluster.
group consisting of a single cluster that has the query it- With |c| ≡ |c.items| we denote the cardinality of the cluster.
self as label. Thus, our language can be used to compare 2
the results of any search service producing ranked lists too.
The language is novel for several reasons. First of all, for A cluster label c.label is generally automatically generated
the scope for which it is defined; second, for the objects on by a function SynthLabel (c.items) (see the end of Section 4).
which it operates, i.e., groups of clusters of web documents; A cluster has initially a natural rank. When a cluster be-
third, for the types of operators defined in it. longs to a group generated by one operators of the language,
In the literature, no similar approach has been proposed its rank can have a different semantics corrisponding to the
till now. We think that users, by manipulating groups of ranking method that the operator applied.
clustered results through the operators of this language, can
fully benefit of the results of search engines. The language Finally, we define the main element of the data model.
can be particularly useful when one needs to explore and
Definition 3: Group A group g is a non empty and
analyze the results produced by several search services to
ordered set of clusters representatives. It is defined as a
identify and then select relevant documents. Furthermore,
pair:
our language can be used to realize a database support
g : (l, hc1 , . . . , cn i)
for a multi-service document retrieval system: its operators
in which l is the label of the group and hc1 , . . . , cn i is the
are closed w.r.t. the data model, and allow to write com-
list of clusters rapresentatives.
plex analysis and transformation expressions, whose results
A special kind of group is the empty group g0 := (l, hi).
are stored and retrieved for further needs. In this paper
This group can be explicitly generated by the user through
we introduce and formalize a first set of operators whose
the function CreateQuery (see Section 6).
completeness, actual utility and effectiveness for exploratory
analysis, will be the subject of future users’ evaluations. 2
Figure 1: Clusters examples. The first rows report the groups identifiers, the search service name and the
query text; the other rows report a cluster identifier within the group, and the label of the cluster.
not exists r2 ∈ R2 with c0 .uri = c2 .uri, then r0 = r1 (and where G is the set of groups, S is the set of names of available
r0 = r2 in the dual case). services, s is the service that evaluates the query q = g.l,
2) If there exist r1 ∈ R1 and r2 ∈ R2 such that: while g 0 is the resulting group of clusters whose label g 0 .l =
r0 .uri = r1 .uri = r2 .uri; g.l. When the user wants to submit a query to a service for
r0 .title = Comb(r1 .title, r2 .title, r1 .irank, r2 .irank, ∪); the first time (when no groups are available), the input group
r0 .snippet = is an empty group generated by the function CreateQuery.
= Comb(r1 .snippet, r2 .snippet, r1 .irank, r2 .irank, ∪); We assume that a search service retrieves a maximum of N
r0 .irank = max (r1 .irank , r2 .irank ). 2 documents. On this basis, for each retrieved document, the
Differently from the case of ∩R , the irank of a ranked operator builds a ranked item r, whose irank value depends
item r0 in the result of ∪R is the maximum of irank values on the position of the document in the result list: r.irank =
of items r1 ∈ R1 and items r2 ∈ R2 , because it represents the (N − P os(d) + 1)/N (where P os(d) is the position of the
best level of relevance obtained by the retrieved document document in the query result list). In this way, a document
in both the web searches. This is also consistent with the in the first positions has a rank r.irank very close to 1.
definition of union of fuzzy sets. In order to allow the user to submit a query to a search ser-
The operations so far described can be implemented in vice without clustering the ranked list of document, he/she
a a very efficient way. In fact, if we maintain the list of can specifies the value 0 in input. In this case the resulting
documents in a cluster ordered by document uri, ranked group g contains one single cluster, i.e., the trivial cluster
intersection and ranked union can be implemented on the that contains an item for each document retrieved by the
basis of a merge operation, whose complexity is O(|c1 |+|c2 |). search service.
Definition 6: Natural Rank Each set of items R (and
Properties. The associative property and the commutative consequently, each cluster c) has a Natural Rank (denoted
property hold for ranked intersection and union, as the reader as NRank (R)) that is the average of ranks of items in the
can easily see. set. Formally P
Labels. In the remainder of the paper, we will use opera- NRank (R) = ( r∈R r.irank)/|R|
tions ∩R and ∪R to generate a new cluster by moving from If we refer to a cluster c, the natural rank of c, that we
two input clusters. When a new cluster is generated, it is denote simply by NRank (c), is the natural rank of its items
necessary to define its label. Function SynthLabel (R) gen- (i.e., NRank (c.items)) 2
erates a representative label from the set of ranked items This ranking method is the default one that reflects the
R, by extracting the most meaningful non-stopwords terms ordering of the items retrived by the search services.
from within titles and snippets associated with items in R.
Several simple manipulation operators on groups are ba-
The significance of a term is determined based on the oc-
sically useful. The Cluster Selection operator σ allows to
currences of the terms in titles and snippets (an interesting
select the clusters in a group.
algorithm can be found in [10]).
It is defined as σ : g, P → g 0 where g is the group
whose clusters must be selected, and P is a predicate on
5. OPERATORS ON GROUPS positions of clusters in the group, or on cluster labels; the
In this section we define an algebra for groups of clusters, selected clusters maintain the original order.
by defining a set of operators. Similarly, the Cluster Deletion operator δ : g, P → g 0
deletes clusters that satisfies predicate P . (thus, g 0 contains
5.1 Basic Operators all clusters in g that do not satisfy P ).
A first operator we consider is the CQuery operator, that Example 2: By observing the labels of the clusters in Fig-
allows to submit a query to a given search service and cluster ure 1, we can easily identify which of them are most closely
the results. It is responsible for the start up of the process related to our needs for planning a tour. To this aim, we
supported by the proposed language. It is defined as may want to select a subset of clusters. Assuming that we
are planning a wine-tour, the clusters about gastronomy,
CQuery : G × S × 0, 1 → G CQuery(g, s) → g 0 wine and touristic topics are the most interesting ones; the
reduced set of clusters on which we focus is depicted in Fig- Definition 9: Weighted Minimum Rank Given a set of
ure 2. In Figure 3 we show the selected clusters with their items R0 , obtained combining sets R1 (R2 ) of ranked items
contents. 2 belonging to the same cluster c1 (c2 ) with a crank C1 (C2 ),
Since a group is an ordered list of clusters, group sorting its Weighted Minimum Rank (denoted as WMinRank ) is the
operators must be provided. average of ranks of items in R0 , weighted w.r.t. the crank of
Shortly, operator S : g, L → g 0 sorts clusters in g based R1 and R2 . Formally:
on the ordered list of positions L; operator S : g → g 0 sorts WMinRank (R1 , C1 , R2 , C2 , R0 ) =
clusters in g w.r.t. their crank in decreasing order6 . min((C1 · r1 .irank), (C2 · r2 .irank))
)/|R0 |
P
= ( r∈R0
max (C1 , C2 )
5.2 Group Intersection where r1 ∈ R1 and r2 ∈ R2 are the original items describ-
The first complex operator we introduce is the Group In- ing the same document represented by r0 (r0 .url = r1 .url =
tersection. Intuitively, it is a quite straightforward wish of r2 .url ). We assume r1 .irank = 0 (r2 .irank = 0) if there is not
users to intersect clusters in two groups, to find more specific an item r1 ∈ D1 (r2 ∈ D2 ) with r0 .url = r1 .url (r0 .url =
clusters. The assumption is that the more search services re- r2 .url). 2
trieve a document, the more the document content is worth By choosing a given ranking method, it is possible to rep-
analyzing. resent different properties of the intersection. Natural ap-
Definition 7: The Group Intersection operator ∩ is defined plies the natural rank, so that the ranking of a resulting
as: cluster is uniquely determined on the ranks of its items (see
Definition 6). On the contrary, Weighted determines the
∩ : G × G × T∩ → G ∩(g1 , g2 , t) → g 0 rank not only based on the ranks of items, but also on the
basis of the ranks of the incoming clusters.
where g1 and g2 are the groups of clusters to intersect, g 0 The reader can notice that the irank value of the two
is the resulting group, t is the ranking method adopted to input items is weighted with the crank value of the cluster.
evaluate the crank of clusters in g 0 . Consequently, the irank of items in the lowest ranked cluster
t ∈ T∩ = {Natural, Cardinality, Weighted}. (cluster in the last position of the group) are more likely to
For each pair of clusters c1 ∈ g1 , c2 ∈ g2 ,such that their contribute to the weighted rank of the intersection. This
intersection is not empty (i.e., |c1 .items∩R c2 . items )| 6= 0), definition is in accordance with the goal of being cautious
there is a cluster c0 ∈ g 0 . c0 is defined as follows: in determining the rank of an item common to the original
c0 .items = c1 .items ∩R c2 .items, clusters.
c0 .label =SynthLabel (c0 .items). As far as the cardinality rank is concerned, it determines
If t is Natural, the crank value is obtained as: the rank of the cluster in the group based on its cardinality,
c0 crank =NRank (c0 .items). i.e., the number of items it contains. This can be useful when
If t is Cardinality, the crank value of each resulting cluster one is interested in analyzing first big sets of documents
is defined as: c0 crank =CardRank (c0 , g 0 ). about a relevant topics, giving higher importance to clusters
If t is Weighted, the crank value is obtained as: that are larger than the others in the same group.
c0 crank = Properties. The associative property holds for the Group In-
WMinRank (c1 .items, c1 .crank , c2 .items, c2 .crank , c0 .items). tersection Operator, provided that the same ranking method
2 is chosen for all the occurrences of the group join operator
The operator provides three distinct methods to compute in the expression.
the ranking of resulting clusters. If t is Natural, the crank The commutative property holds as well, since, it holds for
value of a lcuster is obtained by computing its Natural Rank. ranked intersection.
In this way, the relevance of a cluster uniquely depends on Example 3: In order to filter the most reliable documents,
items common to both intersecdte4d clusters. we could identify the common documents retrieved by all the
Instead, if t is Cardinality, the crank value of each result- three search engines. In the headings of groups depicted in
ing cluster obtained by means of function CardRank, defined Figure 4, the expressions applied to obtain the groups are
by the following definition. reported. Consider groups g4 and g5 : first of all, we intersect
Definition 8: Cardinality Rank Given the group g 0 g1 and g2 , obtaining group g4 ; then, we further intersect g4
obtained by intersecting two groups, the Cardinality Rank with g3 , obtaining group g5 . The obtained clusters in g5 are
of each cluster c0 is the ratio between the cardinality of c0 the intersection of c1 =Wine Wineries and c4 =Travel Guide
and the maximum cardinality of the other cluster in g 0 : from g1 , of c1 =Wine Tasting and Wineries and c3 =Plan
CardRank (c0 , g 0 ) = |c0 .items|/max c∈g0 |c|. 2 your Travel Vacation from g2 , of c1 =Wine Wineries and
c4 =Napa Valley Hotel from g3 .
The cardinality rank determines the relevance of clusters Since the intersection is an associative operation, we can
locally within the group: the largest cluster has crank equals write the expression to obtain g5 in a different way. This
to 1, while the others have a smaller value. is done to obtain groups gt6 and g7 , depicted in Figure 4.
Finally, if t is Weighted, the crank value is obtained by Looking at groups g5 and g7 , the reader can see that they
means of function WMinRank, defined in the following def- are identical, apart fro the expressions that generate them.
inition. For this reason, cluster cl.1 in G − 5 and cluster cl.1 in g7
have the same label label5,1 . 2
6
The list of simple operators might be longer; however, they
are not essential in this paper, and for the sake of space we
do not further discuss this topic. 5.3 Group Join
g1 : Google -"Napa Valley" g2 : Yahoo! -"Napa Valley" g3 : MSN Live Search -"Napa Valley"
cl.1: Wine Wineries cl.1: Wine Tasting and Wineries cl.1: Wine Wineries
cl.4: Travel Guide cl.3: Plan your Travel Vacation cl.4: Napa Valley Hotel
g1 : Google -"Napa Valley" g2 : Yahoo! -"Napa Valley" g3 : MSN Live Search -"Napa Valley"
cl.1: Wine Wineries (0.9575) cl.1: Wine Tasting and Wineries (0.975) cl.1: Wine Wineries (0.94)
u.1 (r: 1 - posQ: 1) u.1 (r: 1 - posQ: 1) u.3 (r: 0.99 - posQ: 2)
u.3 (r: 0.99 - posQ: 2) u.3 (r: 0.99 - posQ: 2) u.1 (r: 0.98 - posQ: 3)
u.2 (r: 0.94 - posQ: 7) u.10 (r: 0.96 - posQ: 5) u.2 (r: 0.94 - posQ: 7)
u.4 (r: 0.9 - posQ: 11) u.11 (r: 0.95 - posQ: 6) u.6 (r: 0.91 - posQ: 10)
cl.4: Travel Guide (0.945) cl.3: Plan your Travel Vacation (0.925) u.7 (r: 0.88 - posQ: 13)
u.5 (r: 0.98 - posQ: 3) u.4 (r: 0.97 - posQ: 4) cl.4: Napa Valley Hotel (0.92)
u.6 (r: 0.97 - posQ: 4) u.15 (r: 0.94 - posQ: 7) u.5 (r: 0.97 - posQ: 4)
u.2 (r: 0.94 - posQ: 7) u.13 (r: 0.91 - posQ: 10) u.8 (r: 0.93 - posQ: 8)
u.7 (r: 0.89 - posQ: 12) u.14 (r: 0.88 - posQ: 13) u.10 (r: 0.86 - posQ: 15)
Figure 3: Expanded clusters. Each cluster is expanded with the items in it; for each item, we report its uri,
its rank r and its position in the original ranked list retrieved by the query posQ : n through the search service
in the corresponding column.
The second complex operator we introduce is the Group content of the new cluster which subsumes as more specific
Join operator. topics those of the original clusters.
Definition 10: The Group Join operator ./ is defined as: As for group intersection, the natural rank is the basic
rank value of a cluster. An alternative is to compute the rank
./: G × G × T./ → G ./ (g1 , g2 , t) → g 0 in a weighted way; in this case we define the WMaxRank,
since we want to give more chance in determining the final
rank to the items belonging to the highest weighted cluster.
where g1 and g2 are the groups of clusters to join, g 0 is the
resulting group. t is the ranking method adopted to evaluate Definition 11: Weighted Maximum Rank Given a
the crank of clusters in g 0 . cluster c0 , obtained combining clusters c1 and c2 , its Weighted
t ∈ T./ = {Natural, Cardinality, Weighted, Correlation, Maximum Rank is defined as
Expansion, Weighted-Correlation, Weighted-Expansion}. WMaxRank (R1 , C1 , R2 , C2 , R0 ) =
For each pair of clusters c1 ∈ g1 , c2 ∈ g2 , such that their max (C1 r1 .irank, C2 r2 .irank)
)/|R0 |
P
intersection is not empty (i.e., |c1 .items∩R c2 . items )| 6= 0), = ( r∈R0
max (C1 , C2 )
there is a cluster c0 ∈ g 0 defined as follows:
c0 .items = c1 .items ∪R c2 .items, (see Definition 9 for the meaning of symbols). 2
c0 .label =SynthLabel (c0 .items). A third alternative to compute the ranking of cluster after
If t is Natural, or Weighted, or Correlation, or Expansion, a join is the Correlation Rank, that estimates the degree of
or Weighted-Correlation, or Weighted-Expansion, it is resp. correlation of the two incoming clusters. Indeed, this degree
obtained as: of correlation is computed by a measure of overlapping of
c0 crank =NRank (c0 .items), the clusters. The greater they are overlapped, the more
c0 crank = their topics are related one another.
WMaxRank (c1 .items, c1 .crank , c2 .items, c2 .crank , c0 .items) Definition 12: Correlation Rank Given two sets of doc-
0 uments R1 and R2 , their Correlation Rank (shortly, CRank )
c crank = CRank (c1 .items, c2 .items)
c0 crank = ERank (c1 .items, c2 .items) is the overlapping value between R1 and R2 . Formally:
c0 crank = |R1 ∩R R2 |
CRank (R1 , R2 ) =
WCRank (c1 .items, c1 .crank , c2 .items, c2 .crank ) |R1 ∪R R2 |
c0 crank = 2
WERank (c1 .items, c1 .crank , c2 .items, c2 .crank ). 2 A fourth alternative to compute the ranking of a cluster
The Group Join operator can be used to explicit indirect after a join, is the Expansion Rank, that evaluates the degree
correlations between the topics represented by the clusters of novelty w.r.t. the overlapped parts of the two incoming
in the two groups. The basic idea underlying its definition is clusters. The greater the novelty, the less the two topics
that if two clusters overlap, i.e., have some common items, represented by the original clusters are related. This hints to
it means that the contents of these items are related with the fact that the cluster with high novelty rank can represent
both topics represented by the clusters. This may hint the a very general topic.
existence of an implicit relationship between the two topics. Definition 13: Expansion Rank Given two sets of items
By assuming that topics can be organized into a hierarchy, R1 and R2 , their Expansion Rank (shortly, ERank ) is the
by grouping the two overlapping clusters into a single one complementary value of overlapping between R1 and R2 .
we may reveal the more general topic representing the whole Formally:
g4 : ∩(g1 , g2 , t) g5 : ∩(g4 , g3 , t) = (∩(∩(g1 , g2 , t), g3 , t)
cl.1: label4,1 cl.1: label5,1
(NRank = 0.995; WMinRank = 0.977; CardRank = 1) (NRank = 0.985; WMinRank = 0.930; CardRank = 1)
u.1 (r: 1); u.3 (r: 0.99) u.1 (r: 0.98); u.3 (r: 0.99)
cl.2: label4,2
(NRank = 0.9; WMinRank = 0.872; CardRank = 0.5)
u.4 (r: 0.9)