You are on page 1of 11

A Language for Manipulating Clustered Web Documents

Results

Gloria Bordogna Alessandro Campi Giuseppe Psaila Stefania Ronchi .


CNR-IDPA Politecnico di Milano Università di Bergamo .
Via Pasubio 3 DEI Facoltà di Ingegneria .
24044 Dalmine (BG) Piazza L. da Vinci 32 Viale Marconi 5 .
Italy 20133 Milano 24044 Dalmine (BG) .
gloria.bordogna@idpa.cnr.it Italy Italy .
. campi@elet.polimi.it psaila@unibg.it stefania.ronchi@gmail.com

ABSTRACT 1. INTRODUCTION
We propose a novel conception language for exploring the re- One main deficiency with current services for searching
sults retrieved by several internet search services (like search over the Internet is the inadequacy of the list of ranked doc-
engines) that cluster retrieved documents. The goal is to uments which they offer to access documents retrieved by a
offer users a tool to discover relevant hidden relationships query. It is well known that users generally do not analyze
between clustered documents. documents ranked in the low positions of the ordered list,
The proposal is motivated by the observation that visualiza- even when the search engine has retrieved only hundreds of
tion paradigms, based on either the ranked list or clustered documents. Thus, if users do not find what they are looking
results, do not allow users to fully exploit the combined use for in the first one or two result pages, they are more keen to
of several search services to answer a request. reformulate a new query than to analyze successive pages,
When the same query is submitted to distinct search ser- or they prefer to try with another search service.
vices, they may produce partially overlapped clustered re- The motivation of using meta search engines, such as mamma,
sults, where clusters identified by distinct labels collect some dogpile, Metacrawler etc., is the claimed or assumed greater
common documents. Moreover, clusters with similar labels, effectiveness of these systems, with respect to single search
but containing distinct documents, may be produced as well. engines, in retrieving relevant web documents in the first
In such a situation, it may be useful to compare, combine results pages.
and rank the cluster contents, to filter out relevant docu- Nevertheless, the side effect of the merging lists is to
ments. In the proposed language, we define several oper- augment the number of retrieved documents, which will be
ators (inspired by relational algebra) that work on groups hardly analyzed by users; thus, this makes useless much of
of clusters. New clusters (and groups) can be generated by the meta search engine’s effort.
combining (i.e., overlapping, refining and intersecting) clus- To overcome this problem, some search services such as
ters (and groups), in a set oriented fashion. Furthermore, vivisimo, clusty, Snaket, Ask.com 1 , MS AdCenter Labs Search
several ranking functions are also proposed, to model dis- Result Clustering etc., have shifted from the usual ranked
tinct semantics of the combination. list to the clustered results paradigm. This consists in or-
ganizing the documents retrieved by a query into containers
(i.e., clusters), possibly semantically homogeneous with re-
Categories and Subject Descriptors spect to their contents, and in presenting them labeled, so
H.2.3 [DATABASE MANAGEMENT]: Languages; H.2.1 as to synthesize their main content focus.
[DATABASE MANAGEMENT]: Logical Design - Data Clustering is often proposed as a viable way of narrowing
Models a search into a more specific query, like in Ask.com. Besides
search engines, also several news services provide overviews
General Terms of incoming news streams clustered into news stories, as an
alternative to the ranking of news depending on their date.
Query Languages, Semantic Searches
W.r.t. the ranked list, the clustered results paradigm has
the advantage of offering users an overview of the main top-
Keywords ics dealt with in the whole set of retrieved documents, that
Search service, web documents, Exploratory search would be missed in the first few pages of the traditional
ranked list [2, 15, 6].
On the other side, one problem users encounter with such
clustered results, is the inability of fully understanding and
Permission to make digital or hard copies of all or part of this work for appreciating the contents of the clusters. This is mainly
personal or classroom use is granted without fee provided that copies are due to the short and sometimes bad quality of the labels
not made or distributed for profit or commercial advantage and that copies of the clusters, which generally consist of a few terms, or
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
individual short phrases, which are automatically extracted
permission and/or a fee. from the documents of the cluster based on statistic and co-
CIKM’08, October 26 – 30, 2008, Napa Valley, California, USA. 1
Copyright 2008 ACM 978-1-59593-991-3/08/10 ...$5.00. At www.ask.com/reference/dictionary/49514/cluster
occurrence analysis. Often, several clusters have similar la- of possible use of this language is discussed in the context of
bels which differ just for a single term. To effectively explore tourism. In Section 4, the basic operations between clusters
the cluster contents, users have no other means than clicking are introduced, while in Section 5 and 6 the group operators
on the cluster labels and browsing the clusters themselves. and the functions on groups are formalized, respectively. In
This problem is much more apparent when submitting the Section 7 the related works are reviewed and, the conclusions
same request to distinct search engines, each one producing summarize the main contents.
a group of clustered results reflecting distinct criteria. For
example, the Gigabits search engine clusters retrieved docu-
ments by their freshness dating (Last Day, Last Week, Last 2. THE DATA MODEL
Month, Last Year, etc.), the Dmoz search service presents In this section we define the data model of our language.
classified documents. In such a situation, one may want Consider a query q submitted to a search engine; the query
to explore if a given category contains documents that are result is a ranked list of documents, that we call ranked
fresh or not; this necessity may occur quite frequently in items.
analyzing news streams (RSS) to find out the frequency of Definition 1: Ranked Item A ranked item r represents
a given news story reported by media as a function of time. a document retrieved by a web search. It is defined as a
When the groups of clusters are generated by distinct search tuple
services, users may be faced with distinct clusters possibly r : (uri, title, snippet, irank )
with same labels. In this situation, it becomes necessary
to explore the relationships between the contents of these where: uri is the Uniform Resource Identifier of the ranked
clusters to identify common and distinct documents. web document; title and snippet are, respectively, the docu-
But the task so far described is, in effect, an exploratory ment title and snippet2 ; irank is a score (in the range [0, 1])
task, that may last for a long time, and may require to that expresses the estimated relevance of the retrieved doc-
reuse the intermediate results several times. For this rea- ument w.r.t. the query. 2
son, we can conceive the storing of the intermediate results The same document is represented by distinct items in
into a database as an essential phase for successive manipu- distinct results lists. In facts, we assume that a document is
lation. Furthermore, the local manipulation of results avoids uniquely identified by its uri [4] (near duplicates are not de-
useless network and search services overloading; in fact, in tected), while it may have distinct snippets and irank when
current practices, several modified queries are submitted to retrieved by different search services. We assume that irank
the search engines, trying to capture relevant documents in is a function of the position of the item in the query result
the first positions of the ranked list, documents that were list.
already retrieved by the previous queries, although hidden
Definition 2: Cluster A cluster representative c is a set
to the user since they did not occurred in the first position.
of ranked items, having itself a rank. It is defined as a tuple
In this contribution we propose a novel conception lan- c : (label , crank , items)
guage for aiding users in flexibly exploring groups of clus- where: label is a set of terms that semantically synthesizes
tered results retrieved by one or several search services (ba- the main content of the cluster; crank is a score (in the range
sically search engines) over the internet, so as to make them [0, 1]) depending on the ranks of the items belonging to the
able to reveal implicit relationships.In our view, the usual cluster; items is the set of ranked items that constitute the
ranked list, produced by search engines is regarded as a cluster.
group consisting of a single cluster that has the query it- With |c| ≡ |c.items| we denote the cardinality of the cluster.
self as label. Thus, our language can be used to compare 2
the results of any search service producing ranked lists too.
The language is novel for several reasons. First of all, for A cluster label c.label is generally automatically generated
the scope for which it is defined; second, for the objects on by a function SynthLabel (c.items) (see the end of Section 4).
which it operates, i.e., groups of clusters of web documents; A cluster has initially a natural rank. When a cluster be-
third, for the types of operators defined in it. longs to a group generated by one operators of the language,
In the literature, no similar approach has been proposed its rank can have a different semantics corrisponding to the
till now. We think that users, by manipulating groups of ranking method that the operator applied.
clustered results through the operators of this language, can
fully benefit of the results of search engines. The language Finally, we define the main element of the data model.
can be particularly useful when one needs to explore and
Definition 3: Group A group g is a non empty and
analyze the results produced by several search services to
ordered set of clusters representatives. It is defined as a
identify and then select relevant documents. Furthermore,
pair:
our language can be used to realize a database support
g : (l, hc1 , . . . , cn i)
for a multi-service document retrieval system: its operators
in which l is the label of the group and hc1 , . . . , cn i is the
are closed w.r.t. the data model, and allow to write com-
list of clusters rapresentatives.
plex analysis and transformation expressions, whose results
A special kind of group is the empty group g0 := (l, h i).
are stored and retrieved for further needs. In this paper
This group can be explicitly generated by the user through
we introduce and formalize a first set of operators whose
the function CreateQuery (see Section 6).
completeness, actual utility and effectiveness for exploratory
analysis, will be the subject of future users’ evaluations. 2

The paper is organized as follows. In Section 2 the con- 2


The snippet is an excerpt of the document, made by a set
cepts of the data model are defined. In Section 3, an example of sentences that may contain the keywords of the query
A group g can contain a single cluster. Observe that a thus we want to obtain highly specific clusters containing
particular kind of cluster representative is the one that rep- documents recommended by all the search services. In the
resents the ranked list obtained as the result of a query q; second case, we want to expand the resulting clusters by
in this case, c.label = q. grouping more documents concerning the desired and re-
For the sake of simplicity, in the remainder of the paper, lated topics. Finally, in the last case, we want to obtain
we use the term cluster to intend cluster representative. filtered sets of documents based on a more specific request.
Notice that these manipulations should be carried out re-
3. RUNNING EXAMPLE gardless of the positions of the items in the retrieved list. It
may happen that a cluster is generated with items retrieved
Let us suppose we want to go on a tour in Napa valley;
in high and low rank positions, just because they are about
to plan the trip, we need to collect information concerning
the same topic.
wineries, sites to visit, close cities to reach, e.g., by car,
We may also be interested to have a re-ordering of the new
as well as hotels and restaurants. The search services pro-
clusters in the resulting group, based on some characteristics
vide a large set of documents concerning Napa Valley, so it
that may differ from the default one (initial ranking), such
becomes a hard task to find, among them, the most rele-
as the cardinality of the resulting clusters, or their degree of
vant ones for our goal. Consequently, it can be convenient
expansion w.r.t. the clusters from which they originated.
to semantically characterize them, by organizing them in
groups of semantically homogeneous documents (clusters), In the following sections, we will discuss the application
and then to perform a kind of exploratory task, in which we of operators in the context of the running example.
try to combine the results of queries submitted to search
services, in order to filter out the useful documents. This 4. BASIC OPERATIONS
novel practice can be carried out locally, thus minimizing
the need of new remote searches, as it generally happens In order to define the operators and functions that con-
with current search services. The results obtained by ana- stitute the proposed language (introduced in Sections 5 and
lyzing and combining previously submitted queries can also 6), it is necessary to define some basic operations on sets of
inspire new more focused queries. ranked items and on cluster labels.
Hereafter, we start a running example that we will use The basic operations that we are going to define work on
throughout the paper to clarify our approach and to explain two input sets of ranked items (R1 and R2 ) and generate a
the proposed language. new set of ranked items (R0 ).
Example 1: To start our search of information to organize Definition 4: Ranked Intersection: The operation
the tour to Napa valley, we submit the query "Napa Valley" RIntersect, denoted by ∩R , performs the intersection of
to the search services Google, Yahoo! and MSN Live Search. two sets of ranked items.
To have a rapid glance at the main topics retrieved, the first R0 contains all ranked items r0 such that there are two
N 3 top documents returned by each service are clustered. ranked items r1 ∈ R1 and r2 ∈ R2 such that r1 and r2
This is done on the basis of the documents’ snippets (brief refer to the same uri. The irank of r0 is defined as the min-
piece of content) shown in the results pages. The labels of imum irank value of r1 and r2 .
the obtained clusters are represented in Figure 1.4 . 2 Formally: r0 ∈ R0 , if and only if there exists r1 ∈ R1 and
r2 ∈ R2 such that:
On the basis of these results, it could be interesting to
r0 .uri = r1 .uri = r2 .uri;
apply some manipulations on the different groups, in order
r0 .title = Comb(r1 .title, r2 .title, r1 .irank, r2 .irank, ∩)5 ;
to filter out results more and more satisfying our needs.
r0 .snippet =
For example, it may be interesting to keep in the groups
= Comb(r1 .snippet, r2 .snippet, r1 .irank, r2 .irank, ∩);
only the most interesting clusters concerning some particular
r0 .irank = min(r1 .irank , r2 .irank ). 2
contents (identified through the clusters’ label): in this way
we reduce the whole set of documents to only those that Given two ranked items r1 ∈ R1 and r2 ∈ R2 , and the
really cover the desired aspects, thus saving time for their item r0 ∈ R that represents both, we defined r0 .irank =
inspection. min(r1 .irank , r2 .irank ) for the following reason: it is the
Alternatively, it may be interesting to obtain new groups common level of relevance obtained by the retrieved docu-
in which clusters are composed of: ment in both the web searches from which R1 and R2 are
obtained. This definition is consistent with the definition of
• only the most reliable documents concerning the de- the intersection operation between fuzzy sets [14].
sired topics, such as Hotel or Hotel and Travel Guide;
Definition 5: Ranked Union: The operation RUnion,
• all the retrieved documents which concern more gen- denoted by ∪R , performs the union of two sets of ranked
eral topics than the retrieved clusters; items.
R0 contains all ranked items r0 such that there is a ranked
• only the documents regarding specific topics, even not item r1 ∈ c1 (resp. r2 ∈ c2 ) such that r1 (resp. r2 ) refers to
explicitly expressed by the labels of the clusters. the same uri. If there are both r1 and r2 , the irank of r0 is
the maximum irank value of r1 and r2 .
In the first case we can assume that documents recom-
Formally: r0 ∈ R0 , if and only if one of the following situa-
mended by all the search services are the most reliable ones.;
tions occurs.
3
N is a parameter specific for each search service 1) If there exists r1 ∈ R1 with c0 .uri = c1 .uri, and there
4
For the sake of readability, we show only a subset of the
5
results useful to better explain our work. To cluster the Function Comb in case of ∩ select among the two input
results of each search service and to label the clusters, we sting, the one with lower irank value; in case of ∪ the one
applied the Lingo algorithm [10] with the higher irank value.
g1 : Google -"Napa Valley" g2 : Yahoo! -"Napa Valley" g3 : MSN Live Search -"Napa Valley"
cl.1: Wine Wineries cl.1: Wine Tasting and Wineries cl.1: Wine Wineries
cl.2: Hotels Amp Spa cl.2: Tours in Napa cl.2: Napa Valley Email
cl.3: Opera House cl.3: Plan your Travel Vacation cl.3: Napa Valley Clickable Map
cl.4: Travel Guide cl.4: Napa Valley Spas Amp cl.4: Napa Valley Hotel
cl.5: Napa Valley Reservations cl.5: Wine Country cl.5: Beautiful
cl.6: Wine Tasting cl.6: Map cl.6: Napa Valley Regions
cl.7: Country Club cl.7: Recreational Sports in Napa cl.7: Napa Valley Symphony
cl.8: Destinations cl.8: Napa Valley California cl.8: Visitors
cl.9: Napa Valley Redwood Inn cl.9: Bed and Breakfasts
cl.10: Napa Valley Marathon cl.10: Wikipedia the Free Encyclopedia
cl.11: Symphony cl.11: Wine Train

Figure 1: Clusters examples. The first rows report the groups identifiers, the search service name and the
query text; the other rows report a cluster identifier within the group, and the label of the cluster.

not exists r2 ∈ R2 with c0 .uri = c2 .uri, then r0 = r1 (and where G is the set of groups, S is the set of names of available
r0 = r2 in the dual case). services, s is the service that evaluates the query q = g.l,
2) If there exist r1 ∈ R1 and r2 ∈ R2 such that: while g 0 is the resulting group of clusters whose label g 0 .l =
r0 .uri = r1 .uri = r2 .uri; g.l. When the user wants to submit a query to a service for
r0 .title = Comb(r1 .title, r2 .title, r1 .irank, r2 .irank, ∪); the first time (when no groups are available), the input group
r0 .snippet = is an empty group generated by the function CreateQuery.
= Comb(r1 .snippet, r2 .snippet, r1 .irank, r2 .irank, ∪); We assume that a search service retrieves a maximum of N
r0 .irank = max (r1 .irank , r2 .irank ). 2 documents. On this basis, for each retrieved document, the
Differently from the case of ∩R , the irank of a ranked operator builds a ranked item r, whose irank value depends
item r0 in the result of ∪R is the maximum of irank values on the position of the document in the result list: r.irank =
of items r1 ∈ R1 and items r2 ∈ R2 , because it represents the (N − P os(d) + 1)/N (where P os(d) is the position of the
best level of relevance obtained by the retrieved document document in the query result list). In this way, a document
in both the web searches. This is also consistent with the in the first positions has a rank r.irank very close to 1.
definition of union of fuzzy sets. In order to allow the user to submit a query to a search ser-
The operations so far described can be implemented in vice without clustering the ranked list of document, he/she
a a very efficient way. In fact, if we maintain the list of can specifies the value 0 in input. In this case the resulting
documents in a cluster ordered by document uri, ranked group g contains one single cluster, i.e., the trivial cluster
intersection and ranked union can be implemented on the that contains an item for each document retrieved by the
basis of a merge operation, whose complexity is O(|c1 |+|c2 |). search service.
Definition 6: Natural Rank Each set of items R (and
Properties. The associative property and the commutative consequently, each cluster c) has a Natural Rank (denoted
property hold for ranked intersection and union, as the reader as NRank (R)) that is the average of ranks of items in the
can easily see. set. Formally P
Labels. In the remainder of the paper, we will use opera- NRank (R) = ( r∈R r.irank)/|R|
tions ∩R and ∪R to generate a new cluster by moving from If we refer to a cluster c, the natural rank of c, that we
two input clusters. When a new cluster is generated, it is denote simply by NRank (c), is the natural rank of its items
necessary to define its label. Function SynthLabel (R) gen- (i.e., NRank (c.items)) 2
erates a representative label from the set of ranked items This ranking method is the default one that reflects the
R, by extracting the most meaningful non-stopwords terms ordering of the items retrived by the search services.
from within titles and snippets associated with items in R.
Several simple manipulation operators on groups are ba-
The significance of a term is determined based on the oc-
sically useful. The Cluster Selection operator σ allows to
currences of the terms in titles and snippets (an interesting
select the clusters in a group.
algorithm can be found in [10]).
It is defined as σ : g, P → g 0 where g is the group
whose clusters must be selected, and P is a predicate on
5. OPERATORS ON GROUPS positions of clusters in the group, or on cluster labels; the
In this section we define an algebra for groups of clusters, selected clusters maintain the original order.
by defining a set of operators. Similarly, the Cluster Deletion operator δ : g, P → g 0
deletes clusters that satisfies predicate P . (thus, g 0 contains
5.1 Basic Operators all clusters in g that do not satisfy P ).
A first operator we consider is the CQuery operator, that Example 2: By observing the labels of the clusters in Fig-
allows to submit a query to a given search service and cluster ure 1, we can easily identify which of them are most closely
the results. It is responsible for the start up of the process related to our needs for planning a tour. To this aim, we
supported by the proposed language. It is defined as may want to select a subset of clusters. Assuming that we
are planning a wine-tour, the clusters about gastronomy,
CQuery : G × S × 0, 1 → G CQuery(g, s) → g 0 wine and touristic topics are the most interesting ones; the
reduced set of clusters on which we focus is depicted in Fig- Definition 9: Weighted Minimum Rank Given a set of
ure 2. In Figure 3 we show the selected clusters with their items R0 , obtained combining sets R1 (R2 ) of ranked items
contents. 2 belonging to the same cluster c1 (c2 ) with a crank C1 (C2 ),
Since a group is an ordered list of clusters, group sorting its Weighted Minimum Rank (denoted as WMinRank ) is the
operators must be provided. average of ranks of items in R0 , weighted w.r.t. the crank of
Shortly, operator S : g, L → g 0 sorts clusters in g based R1 and R2 . Formally:
on the ordered list of positions L; operator S : g → g 0 sorts WMinRank (R1 , C1 , R2 , C2 , R0 ) =
clusters in g w.r.t. their crank in decreasing order6 . min((C1 · r1 .irank), (C2 · r2 .irank))
)/|R0 |
P
= ( r∈R0
max (C1 , C2 )
5.2 Group Intersection where r1 ∈ R1 and r2 ∈ R2 are the original items describ-
The first complex operator we introduce is the Group In- ing the same document represented by r0 (r0 .url = r1 .url =
tersection. Intuitively, it is a quite straightforward wish of r2 .url ). We assume r1 .irank = 0 (r2 .irank = 0) if there is not
users to intersect clusters in two groups, to find more specific an item r1 ∈ D1 (r2 ∈ D2 ) with r0 .url = r1 .url (r0 .url =
clusters. The assumption is that the more search services re- r2 .url). 2
trieve a document, the more the document content is worth By choosing a given ranking method, it is possible to rep-
analyzing. resent different properties of the intersection. Natural ap-
Definition 7: The Group Intersection operator ∩ is defined plies the natural rank, so that the ranking of a resulting
as: cluster is uniquely determined on the ranks of its items (see
Definition 6). On the contrary, Weighted determines the
∩ : G × G × T∩ → G ∩(g1 , g2 , t) → g 0 rank not only based on the ranks of items, but also on the
basis of the ranks of the incoming clusters.
where g1 and g2 are the groups of clusters to intersect, g 0 The reader can notice that the irank value of the two
is the resulting group, t is the ranking method adopted to input items is weighted with the crank value of the cluster.
evaluate the crank of clusters in g 0 . Consequently, the irank of items in the lowest ranked cluster
t ∈ T∩ = {Natural, Cardinality, Weighted}. (cluster in the last position of the group) are more likely to
For each pair of clusters c1 ∈ g1 , c2 ∈ g2 ,such that their contribute to the weighted rank of the intersection. This
intersection is not empty (i.e., |c1 .items∩R c2 . items )| 6= 0), definition is in accordance with the goal of being cautious
there is a cluster c0 ∈ g 0 . c0 is defined as follows: in determining the rank of an item common to the original
c0 .items = c1 .items ∩R c2 .items, clusters.
c0 .label =SynthLabel (c0 .items). As far as the cardinality rank is concerned, it determines
If t is Natural, the crank value is obtained as: the rank of the cluster in the group based on its cardinality,
c0 crank =NRank (c0 .items). i.e., the number of items it contains. This can be useful when
If t is Cardinality, the crank value of each resulting cluster one is interested in analyzing first big sets of documents
is defined as: c0 crank =CardRank (c0 , g 0 ). about a relevant topics, giving higher importance to clusters
If t is Weighted, the crank value is obtained as: that are larger than the others in the same group.
c0 crank = Properties. The associative property holds for the Group In-
WMinRank (c1 .items, c1 .crank , c2 .items, c2 .crank , c0 .items). tersection Operator, provided that the same ranking method
2 is chosen for all the occurrences of the group join operator
The operator provides three distinct methods to compute in the expression.
the ranking of resulting clusters. If t is Natural, the crank The commutative property holds as well, since, it holds for
value of a lcuster is obtained by computing its Natural Rank. ranked intersection.
In this way, the relevance of a cluster uniquely depends on Example 3: In order to filter the most reliable documents,
items common to both intersecdte4d clusters. we could identify the common documents retrieved by all the
Instead, if t is Cardinality, the crank value of each result- three search engines. In the headings of groups depicted in
ing cluster obtained by means of function CardRank, defined Figure 4, the expressions applied to obtain the groups are
by the following definition. reported. Consider groups g4 and g5 : first of all, we intersect
Definition 8: Cardinality Rank Given the group g 0 g1 and g2 , obtaining group g4 ; then, we further intersect g4
obtained by intersecting two groups, the Cardinality Rank with g3 , obtaining group g5 . The obtained clusters in g5 are
of each cluster c0 is the ratio between the cardinality of c0 the intersection of c1 =Wine Wineries and c4 =Travel Guide
and the maximum cardinality of the other cluster in g 0 : from g1 , of c1 =Wine Tasting and Wineries and c3 =Plan
CardRank (c0 , g 0 ) = |c0 .items|/max c∈g0 |c|. 2 your Travel Vacation from g2 , of c1 =Wine Wineries and
c4 =Napa Valley Hotel from g3 .
The cardinality rank determines the relevance of clusters Since the intersection is an associative operation, we can
locally within the group: the largest cluster has crank equals write the expression to obtain g5 in a different way. This
to 1, while the others have a smaller value. is done to obtain groups gt6 and g7 , depicted in Figure 4.
Finally, if t is Weighted, the crank value is obtained by Looking at groups g5 and g7 , the reader can see that they
means of function WMinRank, defined in the following def- are identical, apart fro the expressions that generate them.
inition. For this reason, cluster cl.1 in G − 5 and cluster cl.1 in g7
have the same label label5,1 . 2
6
The list of simple operators might be longer; however, they
are not essential in this paper, and for the sake of space we
do not further discuss this topic. 5.3 Group Join
g1 : Google -"Napa Valley" g2 : Yahoo! -"Napa Valley" g3 : MSN Live Search -"Napa Valley"
cl.1: Wine Wineries cl.1: Wine Tasting and Wineries cl.1: Wine Wineries
cl.4: Travel Guide cl.3: Plan your Travel Vacation cl.4: Napa Valley Hotel

Figure 2: Clusters selection

g1 : Google -"Napa Valley" g2 : Yahoo! -"Napa Valley" g3 : MSN Live Search -"Napa Valley"
cl.1: Wine Wineries (0.9575) cl.1: Wine Tasting and Wineries (0.975) cl.1: Wine Wineries (0.94)
u.1 (r: 1 - posQ: 1) u.1 (r: 1 - posQ: 1) u.3 (r: 0.99 - posQ: 2)
u.3 (r: 0.99 - posQ: 2) u.3 (r: 0.99 - posQ: 2) u.1 (r: 0.98 - posQ: 3)
u.2 (r: 0.94 - posQ: 7) u.10 (r: 0.96 - posQ: 5) u.2 (r: 0.94 - posQ: 7)
u.4 (r: 0.9 - posQ: 11) u.11 (r: 0.95 - posQ: 6) u.6 (r: 0.91 - posQ: 10)
cl.4: Travel Guide (0.945) cl.3: Plan your Travel Vacation (0.925) u.7 (r: 0.88 - posQ: 13)
u.5 (r: 0.98 - posQ: 3) u.4 (r: 0.97 - posQ: 4) cl.4: Napa Valley Hotel (0.92)
u.6 (r: 0.97 - posQ: 4) u.15 (r: 0.94 - posQ: 7) u.5 (r: 0.97 - posQ: 4)
u.2 (r: 0.94 - posQ: 7) u.13 (r: 0.91 - posQ: 10) u.8 (r: 0.93 - posQ: 8)
u.7 (r: 0.89 - posQ: 12) u.14 (r: 0.88 - posQ: 13) u.10 (r: 0.86 - posQ: 15)

Figure 3: Expanded clusters. Each cluster is expanded with the items in it; for each item, we report its uri,
its rank r and its position in the original ranked list retrieved by the query posQ : n through the search service
in the corresponding column.

The second complex operator we introduce is the Group content of the new cluster which subsumes as more specific
Join operator. topics those of the original clusters.
Definition 10: The Group Join operator ./ is defined as: As for group intersection, the natural rank is the basic
rank value of a cluster. An alternative is to compute the rank
./: G × G × T./ → G ./ (g1 , g2 , t) → g 0 in a weighted way; in this case we define the WMaxRank,
since we want to give more chance in determining the final
rank to the items belonging to the highest weighted cluster.
where g1 and g2 are the groups of clusters to join, g 0 is the
resulting group. t is the ranking method adopted to evaluate Definition 11: Weighted Maximum Rank Given a
the crank of clusters in g 0 . cluster c0 , obtained combining clusters c1 and c2 , its Weighted
t ∈ T./ = {Natural, Cardinality, Weighted, Correlation, Maximum Rank is defined as
Expansion, Weighted-Correlation, Weighted-Expansion}. WMaxRank (R1 , C1 , R2 , C2 , R0 ) =
For each pair of clusters c1 ∈ g1 , c2 ∈ g2 , such that their max (C1 r1 .irank, C2 r2 .irank)
)/|R0 |
P
intersection is not empty (i.e., |c1 .items∩R c2 . items )| 6= 0), = ( r∈R0
max (C1 , C2 )
there is a cluster c0 ∈ g 0 defined as follows:
c0 .items = c1 .items ∪R c2 .items, (see Definition 9 for the meaning of symbols). 2
c0 .label =SynthLabel (c0 .items). A third alternative to compute the ranking of cluster after
If t is Natural, or Weighted, or Correlation, or Expansion, a join is the Correlation Rank, that estimates the degree of
or Weighted-Correlation, or Weighted-Expansion, it is resp. correlation of the two incoming clusters. Indeed, this degree
obtained as: of correlation is computed by a measure of overlapping of
c0 crank =NRank (c0 .items), the clusters. The greater they are overlapped, the more
c0 crank = their topics are related one another.
WMaxRank (c1 .items, c1 .crank , c2 .items, c2 .crank , c0 .items) Definition 12: Correlation Rank Given two sets of doc-
0 uments R1 and R2 , their Correlation Rank (shortly, CRank )
c crank = CRank (c1 .items, c2 .items)
c0 crank = ERank (c1 .items, c2 .items) is the overlapping value between R1 and R2 . Formally:
c0 crank = |R1 ∩R R2 |
CRank (R1 , R2 ) =
WCRank (c1 .items, c1 .crank , c2 .items, c2 .crank ) |R1 ∪R R2 |
c0 crank = 2
WERank (c1 .items, c1 .crank , c2 .items, c2 .crank ). 2 A fourth alternative to compute the ranking of a cluster
The Group Join operator can be used to explicit indirect after a join, is the Expansion Rank, that evaluates the degree
correlations between the topics represented by the clusters of novelty w.r.t. the overlapped parts of the two incoming
in the two groups. The basic idea underlying its definition is clusters. The greater the novelty, the less the two topics
that if two clusters overlap, i.e., have some common items, represented by the original clusters are related. This hints to
it means that the contents of these items are related with the fact that the cluster with high novelty rank can represent
both topics represented by the clusters. This may hint the a very general topic.
existence of an implicit relationship between the two topics. Definition 13: Expansion Rank Given two sets of items
By assuming that topics can be organized into a hierarchy, R1 and R2 , their Expansion Rank (shortly, ERank ) is the
by grouping the two overlapping clusters into a single one complementary value of overlapping between R1 and R2 .
we may reveal the more general topic representing the whole Formally:
g4 : ∩(g1 , g2 , t) g5 : ∩(g4 , g3 , t) = (∩(∩(g1 , g2 , t), g3 , t)
cl.1: label4,1 cl.1: label5,1
(NRank = 0.995; WMinRank = 0.977; CardRank = 1) (NRank = 0.985; WMinRank = 0.930; CardRank = 1)
u.1 (r: 1); u.3 (r: 0.99) u.1 (r: 0.98); u.3 (r: 0.99)
cl.2: label4,2
(NRank = 0.9; WMinRank = 0.872; CardRank = 0.5)
u.4 (r: 0.9)

g6 : ∩(g2 , g3 , t) g7 : ∩(g1 , g6 , t) = ∩(g1 , ∩(g2 , g3 , t), t)


cl.1: label6,1 cl.1: label5,1
(NRank = 0.985;WMinRank = 0.950; CardRank = 1) (NRank = 0.985 ;WMinRank = 0.930; CardRank = 1)
u.1 (r: 0.98); u.3 (r: 0.99) u.1 (r: 0.98); u.3 (r: 0.99)
cl.2: label6,2
(NRank = 0.86;WMinRank = 0.811; CardRank = 0.5)
u.10 (r: 0.86)

Figure 4: Group Intersection. In the expressions, t denotes a generic ranking method.

|R1 ∩R R2 | The commutative property holds as well, since, it holds for


ERank (R1 , R2 ) = 1−
|R1 ∪R R2 | ranked union.
2
Example 4: The application of this operator on our run-
Another alternative to compute the ranking of clusters ning example is shown in Figure 5. The unified clusters
after a join is to weight the correlation rank. With this that group documents common to the original clusters are
ranking criterion we want to enhance the contribution of about both topics (such as Wine Wineries and Wine Tasting
the items belonging to the lowest ranked cluster so as to and Wineries), and at the same time, include also non com-
model a cautious attitude. mon documents, which are apparently unrelated. This is the
Definition 14: Weighted Correlation Rank Given two case of clusters Plan your Travel vacation and Wine Winer-
sets of items R1 and R2 , their Weighted-Correlation Rank ies which both contain some documents, such as Featured
(shortly, WCRank ) is their weighted overlapping value. For- Wineries in Napa Valley - Plan your Wine Tasting Room
mally: Tour. By joining these two clusters together, we generate a
more populous cluster in which information about wineries
WCRank (R1 , C1 , R2P
, C2 ) =
and travel vacations are included. At this point, we could
min(C1 r1 .irank , C2 r2 .irank )
= CRank (R1 , R2 ) P u∈I also order the resulting clusters w.r.t. the degree of corre-
u∈U max (C1 r1 .irank , C2 r2 .irank ) lation (i.e., overlapping) between the two original clusters,
where I = R1 ∩R R2 and U = R1 ∪R R2 ; r1 ∈ R1 and r2 ∈ R2 to identify the most correlated topics. In this example, in
are the original items describing the document having uri the same result cluster there are documents concerning only
u (u = r1 .uri= r2 .uri), while C1 and C2 we denote the Wineries or Travel Vacation, but also both. We can so ex-
crank of R1 and R2 , respectively. We assume r1 .irank = 0 pand the intersection between the two original clusters with
(r2 .irank = 0) if there is not an item r1 ∈ R1 (r2 ∈ R2 ) with documents correlated with it. 2
u = r1 .uri (u = r2 .uri). 2 Discussion. Observing the various ranks reported in Fig-
Note that the weight of the intersection R1 ∩R R2 is the ure 5 for clusters, it is possible to see that different rank-
minimum rank associated with the document and the clus- ing methods give a different relevance to clusters. For ex-
ter of belonging, while that of the union R1 ∪R R2 is the ample, in group g8 , the weighted rank is similar for both
maximum. the clusters, but the correlation rank is quite different: in
Finally, it is possible to choose the Weighted Expansion fact, cluster cl.1 has CRank = 0.333, while cluster cl.2 has
Rank. CRank = 0.143; this means that clusters joined together to
obtain cl.2 were less correlated than the ones joined to form
Definition 15: Weighted Expansion Rank Given two cl.1. This result is coherent with the expansion rank: ER-
sets of items R1 and R2 , their Weighted Expansion Rank ank = 0.857 for cluster cl.2, that is much higher than the
(shortly WERank ) is the complement of the weighted cor- expansion rank for cluster cl.1.
relation rank. If we observe the basic weighted rank (WMaxRank ), it is
WERank (R1 , C1 , R2 , C2 ) = evident that its values are coherent w.r.t. the correlation
= 1− WCRank (R1 , C1 , R2 , C2 ) = rank; however, the distance between the two values is much
= 1− P smaller than the distance between the values of the corre-
u∈I min(C1 r1 .irank , C2 r1 .irank ) lation rank; this is due to the fact that the crank values of
CRank (R1 , R2 ) P
u∈U max (C1 r1 .irank , C2 r1 .irank ) original clusters influence the final rank.
(see Definition 14 for the meaning of symbols). 2 Finally, we can notice that sorting clusters in g8 based
on cardinality ranking and (weighted) expansion ranking,
Properties. The associative property holds for the Group clusters are sorted in a different order than the one depicted
Join Operator, provided that the same ranking method is in the figure.
chosen for all the occurrences of the group join operator in
the expression. 5.4 Group Refinement
as generating clusters specializing the topics of the clusters
g8 :./ (g1 , g2 , t)
in the first group on the basis of the topics of any cluster
cl.1: label8,1
in the second group. The idea underlying this operator is
(NRank = 0.968; WMaxRank = 0.965; CRank = 0.333;
that we want to collect, in a unique cluster, the items which
ERank = 0.667; WCRank = 0.112; WERank = 0.887;
belong to both a cluster c1 of the first group g1 and any of
CardRank = 0.857)
the clusters in the second group g2 . This way, by eliminat-
u.1 (r: 1); u.2 (r: 0.94); u.3 (r: 0.99); u.4 (r: 0.97);
ing some items from c1 we generate a cluster representing
u.10 (r: 0.96); u.11 (r: 0.95)
a more specific topic with respect to c1 but not necessarily
cl.2: label8,2
more specific w.r.t. the clusters of the second group.
(NRank = 0.947; WMaxRank = 0.951; CRank = 0.143;
ERank = 0.857; WCRank = 0.019; WERank = 0.981; Example 5: Suppose that by analyzing the results in
CardRank = 1) Figure 5, we discover that no cluster has been retrieved con-
u.1 (r: 1); u.2 (r: 0.94); u.3 (r: 0.99); u.4 (r: 0.97); cerning restaurants (i.e., with the word restaurant in the
u.13 (r: 0.91); u.14 (r: 0.88); u.15 (r: 0.94) label). We could take a remedy by submitting the new
query "Napa Valley Restaurants" to Yahoo! ; the resulting
g9 :./ (g8 , g3 , t) clusters shown in Figure 6 (strongly focused on restaurants)
cl.1: label9,1 are used to filter out subclusters of documents concerning
(NRank = 0,96; WMaxRank = 0.941; CRank = 0.37; restaurants from within clusters in the previous lists (we
ERank = 0.62; WCRank = 0.141; WERank = 0.859; refine clusters in the first list). 2
CardRank = 1)
u.1 (r: 1); u.2 (r: 0.94); u.3 (r: 0.99); u.4 (r: 0.97);
5.5 Other Operators
u.6 (r: 0.97); u.7 (r: 0.94); u.10 (r: 0.96); u.11 (r: 0.95) The operators so far introduced constitute the core of our
cl.2: label9,2 proposal, but other less critical operators can be figured out,
NRank = 0,96; WMaxRank = 0.958; CRank = 0.125; to complete the proposed language. Here, we briefly sketch
ERank = 0.875; WCRank = 0.013; WERank = 0.987; the ones we think necessary.
CardRank = 1) Group Union. The user should be provided with an opera-
u.1 (r: 1); u.2 (r: 0.94); u.3 (r: 0.99); u.4 (r: 0.97); tor to unite together two groups. The group union operator
u.5 (r: 0.98); u.8 (r: 0.93); u.10 (r: 0.96); u.11 (r: 0.95) c1 ∪ c2 = c0 generates c0 in such a way it contains all clusters
in c1 and all clusters in c2 .
Figure 5: Group Join. In the expressions, t denotes Group Coalescing. Complex processing of retrieved doc-
a generic ranking method. uments may need to be performed by fusing all clusters in a
group into one global cluster. The group coalescing operator
⊕(c) = c0 generates c0 in such a way that c0 contains only one
cluster, obtained by applying the ranked union operation to
After group intersection and group join, it is possible to all clusters in c.
introduce the group refinement operator: it is aimed at re-
fining clusters in a group, based on clusters in another group. Reclustering. After complex transformations, it might be
necessary to reapply the clustering method to a group. In
Definition 16: The Group Refinement operator  is de- fact, reclustering documents in a group may let new and un-
fined as: expected semantic information emerge.
The Reclustering Operator Cluster (c) = c0 performs the
 : G × G × T → G (g1 , g2 , t) → g 0 tanked union of all clusters in c, and generates c0 in such
a way that it contains all the clusters obtained by clustering
where g1 is the group to refine on the basis of g2 , g 0 is the all ranked documents.
resulting group. t is the ranking method adopted to evaluate Closure Property of Group Operators The data model and
the crank of clusters in g 0 . the group operators were designed in such a way the Clo-
t ∈ T = {Natural, Cardinality, Refinement}. sure Property holds: operators are defined on groups and
Consider a cluster c1 ∈ g1 ; for each cluster ci ∈ g2 , Ii = generate groups.
c1 .items∩R ci .items. We defined a rich set of ranking methods, each one high-
If at least one Ii 6= ∅, there is a cluster c0 ∈ g 0 defined as lighting a characteristics of either the operator itseft (e.g.
follows: Natural and Refinement), the input groups (e.g. Correlation
c0 .items = I1 ∪R · · · ∪R In , and Weighted ranking methods) or the output group (e.g.
c0 .label =SynthLabel (c0 .items). Cardinality). User evaluation will provide the evidence of
If t is Natural, the crank value is obtained as: the most useful ranking method for each operator.
c0 crank =NRank (c0 .items).
If t is Refinement, the crank value (called (Refinement 6. FUNCTIONS ON GROUPS
Ranking) is obtained as : The group operators so far described, allow to conduct
c0 crank = |c0 .items|/|c1 .items|. a powerful exploratory activity: by combining groups, the
If t is Cardinality, the crank value of each resulting cluster user can discover useful information and may be inspired
is defined as c0 crank = |c0 .items|/max c∈g0 |c| (called Cardi- for new searches; the results of these new searches might be
nality Rank ). 2 combined with previously computed groups, and so on.
While the group join operator generates a cluster repre- The first function that we need to define is the Create-
senting a more general topic than the topics in both the Query function that makes it possible to generate an empty
original clusters, the refinement operator can be regarded group with a desired label l. It is defined as:
where g1 and g2 are the groups of clusters to intersect (resp.
g10 : Yahoo! -"Napa Valley Restaurants"
join or refine). t is the ranking method adopted to evaluate
cl.1: Restaurants Visitors Info (0.955)
the crank of clusters that would be produced: for ∩E , it is
u.16 (r: 1 - posQ: 1)
chosen among the methods Natural, Weighted and Cardinal-
u.17 (r: 0.97 - posQ: 4)
ity (see Definition 7); for ./E , it is chosen among the meth-
u.18 (r: 0.95 - posQ: 6)
ods Natural, Weighted, Correlation, Expansion, Weighted-
u.4 (r: 0.9 - posQ: 11)
Correlation, Weighted-Expansion and Cardinality (see Defi-
cl.2: California Wine Country (0.95)
nition 10); for E , it is chosen among the methods Natural,
u.19 (r: 0.98 - posQ: 3)
Refinement and Cardinality (see Definition 16).
u.1 (r: 0.96 - posQ: 5)
As for selection evaluation, the functions produces a 5-
u.7 (r: 0.91 - posQ: 10)
tuple with the previously defined fields.
cl.3: Seafood Restaurants in Napa Valley (0.913)
u.21 (r: 0.99 - posQ: 2) Example 6: An example of application of these func-
u.22 (r: 0.93 - posQ: 8) tions could be proposed on each of the operators previously
u.23 (r: 0.89 - posQ: 12) described. For the group intersection for example, whose
u.24 (r: 0.84 - posQ: 17) results are represented in Figure 4, we may want to know
if it is convenient (in terms of obtained results) to execute
g11 : g9  g10
the operation g4 : g1 ∩ g2 . To this aim we can apply the
cl.1: label11,1 function ∩E (g1 , g2 , W eighted) that will return, as a result,
(NRank = 0.97; RefRank = 0,375; CaRank = 1) (nc = 2, mincard = 1, maxcard = 2, mincrank = 0.8723,
u.1 (r: 1); u.4 (r: 0.97); u.7 (r: 0.94) maxcrank = 0.9771) = (2, 1, 2, 0.8723, 0.9771)
cl.3: label11,2 On the basis of this information, the user can see that
(NRank = 0.985; RefRank = 0,25; CaRank = 0.667) only two clusters are retrieved containing a total of three
u.1 (r: 1); u.4 (r: 0.97) documents with high minimum and maximum rank (in the
range [0,1]). Thus it can be worth executing the intersection
Figure 6: Group Refinement. In the expression la- operator. 2
beling the group on the right, t denotes a generic
ranking method.
7. RELATED WORK
In this section we review works that have some relation
CreateQuery(l) → g0 where g0 = (l, h i) to our proposal, although they have been conceived either
This function is necessary to generate the input group for with different scopes than the analysis of web documents
the CQuery operator when the user wants to submit a query retrieved by search services, or with distinct functionalities.
for the first time. This allows to achive the closure of the all In fact, we did not find any language similar to our proposal.
the set of group operators. A motivation of the utility of our proposal can be found in
[8]. In this work, the authors advocate the need of tools for
However, being an exploratory activity, it might be useful giving the user more immediate control over the clusters of
to evaluate the results of group operations without actually retrieved web documents; such tools should serve as means
building and storing a new group. If the user were pro- for exploring the similarity among documents and clusters.
vided with functions that returns a quantitative summary They also consider giving the user some means to correct,
of what would be obtained by applying an operator on al- or even completely change, the classification structure. To
ready computed groups, the user could decide whether to support the manipulation of clusters, they suggest the de-
actually apply a group operator to obtain a new group. velopment of graphic user interfaces. Indeed, the literature
For this reason, the proposed language provides some use- on visual paradigms for the presentation of textual search
ful evaluation functions, that we introduce in this section. results is too extensive to review; for a survey, the reader
Selection. The first function we introduce is σE , that eval- can see [1] and [12]. One goal of these approaches is to per-
uates the effect of a cluster selection. It is defined as: form some kind of text mining based on conceptual maps
visualization [3, 7]. Nevertheless, our proposal is different,
σE (g, P ) → (nc,mincard ,maxcard , mincrank ,maxcrank ) since we do not exploit a graphical representation of rela-
where g is the group which apply the selection to, and P is tionships between documents at this level, but we provide
the selection predicate (see Section 5.1). The function pro- a language for flexibly exploring the hidden relationships.
duces a 5-tuple with the following fields: nc is the number In the NIRVE prototype, Cugini and Laskowski [11] evalu-
of clusters that would be selected, min and max are, respec- ate and compare several graphical interfaces for showing the
tively, the minimum and maximum cardinality of clusters retrieved results of NIST ’s PRISE search engine. In their
that would be selected, while mincrank and maxcrank are, conclusions, they state that ”a good visualization of search
respectively, their minimum and maximum values. results depends on a good structure and what often happens
is that developers perform a deeper analysis of results in or-
Intersection, Join and Refinement. Three functions der to generate that structure”. In this respect, we envisage
are defined, corresponding to the main group operators: ∩E that our proposed language could be employed to the pur-
evaluates intersection, ./E evaluates join, E evaluates re- pose of exploring and finding a good structure of results that
finement. can then be presented by taking advantages of the proposed
∩E (g1 , g2 , t) → (nc,mincard ,maxcard , mincrank ,maxcrank ) graphic visualization techniques.
./E (g1 , g2 , t) → (nc,mincard ,maxcard , mincrank ,maxcrank ) An approach that shares some similarity of intent to our pro-
E (g1 , g2 , t) → (nc,mincard ,maxcard , mincrank ,maxcrank ) posal, in that it allows to do dynamic clustering and refine-
ment of search results, is the Scatter/Gather algorithm [6]. interface, that will be exploited to develop multi-channel ap-
Its distinctive feature is the way it allows clusters to be se- plications. We envisage that this manipulation language can
lected, recombined (gathered) and re-clustered (scattered). be employed for exploratory analysis in other contexts than
However, the user has to decide which clusters have a rele- the one presented in this paper, ex. in the databases.
vant theme based solely on keywords and titles. No function-
ality is available to detect the degree of share between clus- 9. REFERENCES
ters. Furthermore, since new clusters are generated based [1] S. K. Card, J. D. Mackinlay, and B. Shneiderman.
on re-clustering, the generation criteria remain implicit and Readings in information visualization: Using vision to
unknown to the user. On the contrary, in our approach,
think. Morgan Kaufmann Publishers Inc. San
the user is perfectly aware of the criteria that generated the
Francisco, CA, 1999.
new clusters, since they depend on the applied group opera-
[2] H. Chen and S. Dumais. Bringing order to the web:
tor.Moreover, the intersection and union operations between
Automatically categorizing search results. Proceedings
cluster representatives generate the label of the resulting
of the SIGCHI conference on Human factors in
cluster through the processing of its items (titles and snip-
computing systems, IR-76:145–152, 2000.
pets), so as to reveal its hidden semantics. In facts, the label
gives new information previously unknown on the common [3] W. Chung, H. Chen, and J. J. Nunamaker. Business
documents’ contents in the cluster. We can also find some intelligence explorer: a knowledge map framework for
discovering business intelligence on the web. System
similarity of our approach w.r.t clustering ensemble tech- Sciences, Proceedings of the 36th Annual Hawaii
niques, that have been defined to compare either the results International Conference on System Sciences:10, 2003.
obtained by the application of distinct clustering algorithms [4] T. Coates, D. Connolly, D. Dack, L. Daigle,
on the same set of items, or to compare distinct partitions of R. Denenberg, M. Durst, P. Grosso, S. Hawke,
the same set of items obtained based on distinct views (rep- R. Iannella, G. Klyne, L. Masinter, M. Mealling,
resentations) of the items [5, 13, 9]. The main goal of these M. Needleman, and N. Walsh. URIs, URLs, and
techniques is to achieve a robust and consensual clustering URNs: Clarifications and recommendations 1.0.
of the items. Robust clustering is achieved by combining Technical report, World Wide Web Consortium, URI
data partitions (forming a clustering ensemble) produced Planning Interest Group W3C/IETF,
by multiple clusterings. The approaches are defined within http://www.w3.org/TR/2001/NOTE-uri-clarification-
an information-theoretical framework; in fact, mutual infor- 20010921/,
mation is the underlying concept used in the definition of 2001.
quantitative measures of agreement or consistency between
[5] A. L. N. Fred and A. K. Jain. Robust data clustering.
data partitions.The group intersection operator of our lan-
Proceedings of the IEEE Computer Society Conference
guage takes inspiration from these ideas, since its goal is,
on Computer Vision and Pattern Recognition (CVPR
given two distinct partitions, (i.e., groups of clusters), to
’03), 2:128, 2003.
identify the common partitions, i.e., those sharing the same
[6] M. A. Hearst and J. O. Pederson. Reexamining the
sets of documents. If we iteratively apply the intersection
cluster hypothesis: Scatter/gather on retrieval results.
operator to a set of groups, we thus find the consensual par-
Proceedings of the Conference on Research and
titions among these groups.
Development in Information Retrieval, 1996.
As far as the join operator is concerned, it can be regarded [7] N. Kampanya, R. Shen, S. Kim, C. North, and E. A.
as the generation of a new partition (group) containing only Fox. Citiviz: A visual user interface to the citidel
the unions of the original clusters which have a non empty system. LNCS, Springer Verlag, 3232:122–133, 2004.
intersection. Its meaning is that of expanding the result of [8] A. V. Leouski and W. B. Croft. An evaluation of
the intersection operator between groups, so as to consider techniques for clustering search results. Technical
indirect correlations among the items of the original clus- Report of the Department of Computer Science f
ters. University of Massachusetts at Amherst,
IR-76:122–133, 1996.
8. CONCLUSIONS [9] M. V. M. Pagani, G. Bordogna. Mining
In this paper, we addressed the problem of defining a multidimensional data using clustering techniques.
language for manipulating the results provided by internet Proceedings of DEXA Workshop FLEXDBIST-07,
search services. The work is motivated by the need to better 2007.
exploit, in an integrated way, the results obtained by differ- [10] S. Osinski. An algorithm for clustering of web search
ent search services like, e.g., web search engines. The large results. Master’s thesis, Department of Computing
number of documents retrieved by such services constitute a Science, Poznan’ University of Technology,
serious obstacle for users, that are not able to extract a se- http://project.carrot2.org/publications/osinski-2003-
mantic summarization of the results. The proposed language lingo.pdf,
provides operators to manipulate, in a complex and con- 2003.
trolled way, groups of clustered retrieved documents. The [11] M. M. Sebrechts, J. Vasilakis, M. S. Miller, and
richness of the proposed language allows users to integrate S. J. L. J. V. Cugini. Visualization of search results: A
the results of different search services in several ways, then comparative evaluation of text, 2d, and 3d interfaces.
revealing more general or more specific topics than those car- Proceedings of SIGIR ’99, 1999.
ried by the single documents. We are developing a software [12] E. Staley and M. Twidale. Graphical interfaces to
prototype that supports the proposed language. Based on a support information search. Technical report,
Service Oriented Architecture, it will provide a web service University of Illinois,
http://people.lis.uiuc.edu/ twidale/irinterfaces/bib-
main.html,
2000.
[13] A. Strehl and J. Ghosh. Cluster ensembles Ű a
knowledge reuse framework for combining
partitionings. Proceedings of AAAI, 2002, 2002.
[14] L. Zadeh. Fuzzy sets. Information and control,
8:338–353, 1965.
[15] O. Zamir and O. Etzioni. Grouper: a dynamic
clustering interface to web search results. Proceedings
of the 8th International World Wide Web Conference,
1999.

You might also like