
Efficient Ranked Queries on Sources with Boolean Query Interfaces

Vagelis Hristidis, Yuheng Hu, and Panagiotis G. Ipeirotis

Abstract—Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance,
PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a
user would typically prefer a ranking by relevance, measured by an Information Retrieval (IR) ranking function. The naive approach
would be to submit a disjunctive query with all query keywords, retrieve the returned documents, and then re-rank them. Unfortunately,
such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper we present
algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with
a Boolean query interface with no ranking capabilities (or with a ranking capability of no interest to the end user). The algorithms generate
a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric.
Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the
source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database
and a TREC dataset shows that we achieve an order of magnitude improvement compared to the current baseline approaches.

Index Terms—Hidden database, Keyword Search, Top-k ranking

1 INTRODUCTION

Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed^1 allows users to submit Boolean keyword queries on the biomedical publications database, but ranks the query results by publication date only. Similarly, the US Patent and Trademark Office (USPTO)^2 allows Boolean keyword queries for searching patents but only ranks by patent date. Furthermore, job search databases, like the job search of LinkedIn,^3 allow users to sort job listings by date or title (alphabetically), but not by IR relevance of the job posting to the submitted query. As a more recent example, the micro-blogging service Twitter^4 offers a highly expressive Boolean search interface but ranks the results by date only. In most cases, these sources do not allow downloading and indexing of data, or the size of the underlying database makes any comprehensive download [16], [17] an expensive operation.

Often, the user prefers a ranking other than the default (e.g., by date) provided by the source. For instance, a user of the PubMed or USPTO Web sites would typically prefer a ranking by relevance, measured by an Information Retrieval (IR) ranking function. Given that traditional IR ranking functions [19] like Okapi [20] and BM25 [18] implicitly assume disjunctive (OR) semantics, the naive approach would be to submit a disjunctive query with all query keywords, retrieve all the returned documents, and then rank them according to the relevance metric of choice. However, this would be very expensive due to the large number of results returned by disjunctive queries. For example, consider the query "immunodeficiency virus structure", an example query used to teach information specialists how to search the PubMed database [8]. Executing the corresponding disjunctive query "immunodeficiency OR virus OR structure" on PubMed returns 1,451,446 publication results. Downloading and ranking them is infeasible for an interactive query system, even if the source is on the local network. The problem becomes even more critical if we use the public web services provided by PubMed for programmatic (API) access over the web. Given the large overhead incurred when retrieving publications, PubMed imposes quotas on the amount of data an application can retrieve per minute, rendering infeasible any attempt to download a large number of documents.

To overcome such problems, in this paper we present algorithms to compute the top results for an IR-ranked query, over a source with a Boolean query interface but without any ranking capabilities (or with a ranking function that is generally uncorrelated to the user's ranking, e.g., by date). A key idea behind our technique is to use a probabilistic modeling approach and estimate the distribution of document scores that are expected to be returned by the database. Hence, we can estimate the minimum cutoff scores for including a document in the list of highly ranked documents.

• Vagelis Hristidis and Yuheng Hu are with the School of Computing and Information Sciences, Florida International University, Miami, FL. E-mail: {vagelis,yhu002}@cis.fiu.edu
• Panagiotis G. Ipeirotis is with the Department of Information, Operations, and Management Sciences, New York University, New York, NY. E-mail: panos@stern.nyu.edu

1. http://www.ncbi.nlm.nih.gov/pubmed/
2. http://patft.uspto.gov/
3. http://www.linkedin.com/
4. http://www.twitter.com/
To achieve this result over a database that allows only query-based access of documents, we generate a querying strategy that submits a minimal sequence of conjunctive queries to the source. (Note that conjunctive queries are cheaper since they return significantly fewer results than disjunctive ones.) After every submitted conjunctive query we update the estimated probability distributions of the query keywords in the database and decide whether the algorithm should terminate given the user's results confidence requirement or whether further querying is necessary; in the latter case, our algorithm also decides which is the best query to submit next. For instance, for the above query "immunodeficiency virus structure", the algorithm may first execute "immunodeficiency AND virus AND structure", then "immunodeficiency AND structure" and then terminate, after estimating that the returned documents contain all the documents that would be highly ranked under an IR-style ranking mechanism. As we will see, our work fits into the "exploration vs. exploitation" paradigm [2], [14], [15], since we iteratively explore the source by submitting conjunctive queries to learn the probability distributions of the keywords, and at the same time we exploit the returned "document samples" to retrieve results for the user query.

Our approach can also be extended and applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. For instance, consider a database of products with Boolean attributes, like cars for sale that have attributes such as "used," "new," "two-doors," "four-doors," "convertible," and so on. Suppose that the query interface only allows specifying attribute values (e.g., "used AND convertible AND two-doors"). Suppose the source ranks cars always by price. If the user wants to rank by a weighted sum of the attribute values (e.g., 0.2 · used + 0.4 · convertible + 0.4 · two-doors), then we can apply an adaptation of our approach.

Our work has the following contributions:
1) We define the novel problem of applying ranking on top of sources with no ranking capabilities by exploiting their query interface.
2) We describe sampling strategies for estimating the relevance of the documents retrieved by different keyword queries. We present a static sampling approach and a dynamic sampling approach that simultaneously executes the query and estimates the parameters required for efficient query execution.
3) We present algorithms that, given a user confidence input, retrieve a minimal number of results from the source through submitting high-selectivity (conjunctive) queries, so that the user's confidence requirement is satisfied.
4) We experimentally evaluate our algorithms using the PubMed database and examine two settings: (i) the remote setting, where we use web services to query the database, and (ii) the local setting, where we query a locally installed subset of PubMed. Our results show an order of magnitude improvement compared to the naive query evaluation approach.

The rest of the paper is organized as follows. In Section 2 we describe related work and place our work in the context of the existing literature. In Section 3 we give the framework, problem definition, and notation, while in Section 4 we outline the basic ideas of our approach. Then, in Section 5 we describe in detail our algorithms, and in Section 6 we present the results of our experiments. Finally, Section 7 concludes the paper.

2 RELATED WORK

A preliminary version of this work has been published as a short paper [9].

Top-k queries  A significant amount of work has been devoted to the evaluation of top-k queries in databases. This line of work typically handles the aggregation of attribute values of objects in the case where the attribute values lie in different sources [7], [3] or in a single source [10]. For example, Bruno et al. [3] consider the problem of ordering a set of restaurants by distance and price. They present an optimal sequence of random or sequential accesses on the sources (e.g., Zagat for price and Mapquest for distance) in order to compute the top-k restaurants. Since the concept of top-k is typically a heuristic itself for locating the most interesting items in the database, Theobald et al. [21] describe a framework for generating an approximate top-k answer, with some probabilistic guarantees. In our work, we use the same idea; the main difference is that we only have "random access" to the underlying database (i.e., through querying), and no "sorted access." Ilyas et al. [11] provide a survey of the research on top-k queries on relational databases.

Exploration vs. exploitation  The idea of the exploitation/exploration tradeoff [2], [14], [15] (also called the "multi-armed bandit problem") is to determine a strategy of sequential execution of actions, each of which has a stochastic payoff. While executing an action we get back some (uncertain) payoff, and at the same time we get some information that allows us to decrease the uncertainty of the payoff of future actions. The problem was first posed in the 1930s and has been used to model problems in a wide variety of areas, ranging from medicine and economics to ad placement in web pages. In our work, we are trying to maximize the payoff/exploitation of each query (which is the number of new, relevant top-k documents that the query retrieves) while minimizing the expense/exploration (number of queries sent, and documents retrieved).

Deep Web  Our work bears some similarities to the problem of extracting data from Deep Web [1] databases. For example, Ntoulas et al. [17] attempt to download the contents of a Deep Web database by issuing queries through a web form interface.
The goal of Ntoulas et al. is to download and index the contents of databases with limited query capabilities, whereas in our case the focus is on achieving on-the-fly ranking of query results, on top of sources with no (or non-useful) ranking capabilities. An alternative approach is to characterize databases by extracting a small sample of documents that is then used to describe the contents of the database. For example, it is possible to use query-based sampling [4], [13] to extract such a document sample, generate estimates for the distribution of each term, and then use the estimates to guide the choice of queries that should be submitted to the database. In the experimental section, we compare against this "static sampling" alternative and demonstrate the superiority of the dynamic sampling technique, which dynamically generates estimates tailored to the query at hand.

3 PROBLEM DEFINITION

Query Model  Consider a text database D with documents d1, ..., dm. The user submits a keyword query Q = {t1, ..., tn} containing the terms t1, ..., tn. The answer to the query is a list of the top k documents; the documents are ranked according to a relevance score score(Q, d), which estimates the relevance of a document d to the query Q.

The score of a document can be computed using any of the well-studied tf.idf scoring functions like BM25 and Okapi [18], [19], [20]. The key arguments of a tf.idf function are the term frequency (tf), the document frequency (df) and the document length (dl). The term frequency tf(t, d) is the number of times that the word t appears in document d. The document frequency df(t, D) is the number of documents in D that contain t. Hence, score(Q, d) = F(tf, df, dl). In its basic form, the tf.idf ranking function is:

  score(Q, d) = Σ_{t ∈ Q,d} tf(t, d) · ln[(|D| + 1) / df(t, D)]    (1)

where |D| = m is the size of the database D. In our experiments, we use the Okapi scoring function, although any other tf.idf function could be used. For simplicity though, we use the basic tf.idf scoring function as the running example.

Data Source Model  We assume that database D is only accessible through a Boolean query interface and we do not have direct access to the underlying documents. The query interface evaluates the Boolean query Q and returns the documents ranked using a non-desirable ranking function, e.g., by date (as is the case for PubMed and USPTO).

For instance, if the user query is Q = [anemia, diabetes, sclerosis], then we can submit to the data source queries q1 = [anemia AND diabetes AND sclerosis], q2 = [anemia AND diabetes AND NOT sclerosis], q3 = [diabetes OR sclerosis], and so on. The returned results are guaranteed to match the Boolean conditions but the documents are not expected to be ranked in any useful manner.

Objective  We want to devise a scheme for retrieving from D the top-k documents, ranked according to F(tf, df, dl). The trivial solution is to send an extremely broad disjunctive query, returning all documents that have a non-zero F(tf, df, dl) score. Then, we can retrieve the documents, examine their contents, and rerank them locally before presenting the results to the user. Unfortunately, this is a very time-consuming solution. Therefore, our objective is to construct a query sequence q1, q2, ..., qv of Boolean queries that can be submitted to the database, retrieve as few documents as possible, and still contain all the documents that would be in the top-k results.

4 OVERVIEW OF APPROACH

As mentioned above, our approach is based on choosing the best sequence q1, q2, ..., qv of Boolean queries to submit to the data source, such that we retrieve the top-k ranked documents for Q. Of course, to select the best sequence of queries, we need to know some statistics about the type of documents retrieved by each query qi. To get these statistics we need to sample the database through query-based sampling. So, through querying we are both retrieving documents to generate the necessary statistics and at the same time aiming to retrieve documents that are in the top-k relevant documents. So, we can consider our approach as a case of "exploration vs. exploitation."

Even though we can use any Boolean query in our strategy, we only consider conjunctive Boolean queries as candidates, given that a disjunctive query can be split into a set of conjunctive queries. Conjunctive queries provide a good query granularity and simplify the analysis below. Note that in practice we add negation conditions to the issued conjunctive queries in order to avoid retrieving the same results multiple times. For instance, if Q = {a, b}, after submitting q1 = a AND b, we submit q2 = a AND NOT b instead of q2 = a.

So, what are the goals of our querying strategy? Following Equation 1, we need to know the tf and df values for the terms in the database, to estimate the similarity score of a query to a document. Using these values, we can then estimate the overall similarity score distribution for all the documents in the database. Given the score distribution, we can compute how many documents in the database have score higher than the documents that we have seen so far.

The relatively easy part is the estimation of the df values. We can estimate these values in two ways: (a) We can send n queries to the database, one for each query term ti, and compute the df value for each term. Note that the PubMed eUtils, which we use in our experiments, have a method to directly return the number of results (df) for a query. (b) We can use estimates of the idf (inverse df) values by using some other database with similar content (for example, using the Google Web 1T 5-gram collection^5).
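To make Equation 1 concrete, the following short Python sketch computes the basic tf.idf score of one document, assuming the document is available as a token list and that the df values and |D| are known; the statistics in the toy example are made up for illustration and are not taken from the paper.

import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, df, num_docs):
    """Basic tf.idf score of Equation 1:
    score(Q, d) = sum over query terms t appearing in d of
                  tf(t, d) * ln((|D| + 1) / df(t, D))."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        if tf[t] > 0 and df.get(t, 0) > 0:
            score += tf[t] * math.log((num_docs + 1) / df[t])
    return score

# Toy example with hypothetical df statistics.
df = {"immunodeficiency": 12_000, "virus": 250_000, "structure": 400_000}
doc = "the structure of the immunodeficiency virus virus envelope".split()
print(tfidf_score(["immunodeficiency", "virus", "structure"], doc, df, num_docs=17_000_000))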
The more challenging part is the estimation of the tf values. We need to estimate the value of tf for each query term and for each document, that is, a total of n × |D| values. This is rather unrealistic without having direct access to the underlying database. So, we adopt a query-based probabilistic approach and we use the fact that term frequencies (tf) tend to follow a Poisson distribution within the documents of a database [21]. The more accurately we know the parameters of the distribution, the better we can estimate the document score distribution, and the better we can estimate how many documents should be in the top-k results but are still not retrieved.

One strategy for estimating the distribution parameter values is to generate a static document sample from the database and use this sample for our estimations. As we will see, this strategy suffers from some shortcomings, so we present an alternative strategy as well, which relies on the exploitation-exploration framework, and combines sampling and query execution. We provide further details on our sampling strategy in Section 5 and compare the performance of the two approaches in Section 6.

Now, assuming that we know the score distribution for Q of the documents, we can estimate the benefit that each issued query will generate: we can estimate the distribution of document scores (with respect to Q) for the documents retrieved by a conjunctive query q. Therefore, we can estimate the benefit of a query q, defined as the probability that a randomly selected document from the answer of q will have score higher than the k-th ranked score for Q among the documents retrieved so far.

To achieve that, we create a priority queue with all candidate queries q, ordered by expected benefit. We select the query at the top of the priority queue, retrieve documents, and based on the results we update the expected benefits of the other queries. Then, we pick the query with the next-highest expected benefit and so on. The algorithm terminates when the benefit (i.e., probability of retrieving a top-k document) drops below a user-specified probability constant P. That is, the algorithm terminates when every unseen result has probability less than P to be in the top-k answer. Note that P is provided by a domain expert to balance response time and accuracy, and hence users do not have to worry about it in practice. In the next sections we describe our approach in detail.

5. http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt

TABLE 1: Key Notation
  λ_t      tf parameter of Poisson for word t
  λ^s_q    score (tf.idf) parameter of Poisson for query q
  d        document
  t        term in query Q
  q        conjunctive query, subset of Q
  P        user-specified benefit threshold
  PQ       priority queue
  Z_q      set of fetched results by query q
  S_q      size of Z_q
  S        total number of documents retrieved so far
  pr(q)    benefit of query q

5 EXPLORATION AND EXPLOITATION OF THE DATABASE CONTENTS THROUGH QUERYING

In this section, we describe the core of our approach. We show how we can use selective querying to: (a) Explore the database: get the necessary statistics to estimate the parameters that our algorithms need; and (b) Exploit the database: retrieve documents that are promising to be in the top-k results of user query Q. We will see how our scheme achieves both (a) and (b) in parallel.

5.1 Initial Probabilistic Modeling of the Source

Our overall goal is to figure out, during our querying process, how many of the top-k relevant documents we have retrieved and how many are still unretrieved in the database. Unfortunately, we cannot be absolutely certain about these numbers unless we retrieve and score all documents: an expensive operation. Alternatively, we can build a probabilistic model of score distributions and examine, probabilistically, how many good documents are still not retrieved. We describe our approach here.

It is generally accepted that the term frequencies of the terms in a database tend to follow a Poisson distribution. In other words, for a word t, the probability that a randomly chosen document d from database D has term frequency r is:

  Pr{tf(t, d) = r} = e^(−λ_t) · (λ_t)^r / r!    (2)

where λ_t is a word-specific parameter. Now, instead of knowing the n × |D| tf values in the database, we only need to estimate n values: the λ_t values for each of the n words in the query Q.

Following this, the estimation of the score distribution is reduced to the problem of estimating the distribution of a sum (see Equation 1) of Poisson-distributed variables. We know that if X and Y are two independent random variables following a Poisson distribution with parameters λ_x and λ_y respectively, then the sum X + Y follows a Poisson distribution with parameter λ_x + λ_y. Therefore, in our case, the score distribution will also be,
approximately,^6 a Poisson distribution with parameter:

  λ^s_Q = Σ_{t ∈ Q} λ_t · idf(t)    (3)

where we note that idf(t) = ln[(|D| + 1) / df(t, D)] according to Equation 1. We use the s superscript to denote the tf.idf parameter as opposed to just the tf parameter.

This model implicitly assumes independence across terms. Of course, the assumption of independence across terms is admittedly not realistic, but it is used by many algorithms that deal with text, including many IR relevance models, e.g., tf.idf and LM models, and many text classification algorithms. The independence assumption also tends to work in practice (see the work of Domingos and Pazzani [5] for a theoretical justification) and contributes to the tractability of our algorithms.

Now, given that we have a functional form for the score distribution, we can estimate the number of documents in the database that are expected to have scores higher than the currently retrieved documents. Suppose that our currently retrieved top-k documents have a cutoff score τ. (We can compute the exact similarity value for these documents since we have retrieved them locally.) We are trying to estimate how many documents in the database have score higher than τ:

  |Docs(score > τ)| = |D| · Pr{score > τ}    (4)
                    = |D| · Pr{Poi(λ^s_Q) > τ}    (5)
                    = |D| · (1 − Γ(⌊τ + 1⌋, λ^s_Q) / ⌊τ⌋!)    (6)

where Γ(a, x) = (a − 1)! · e^(−x) · Σ_{r=0}^{a−1} x^r / r! is the incomplete Gamma function.

We now have an estimate for the number of documents that have similarity above a given threshold τ. Of course, before proceeding further, we need to estimate the λ_t parameters required to compute the λ^s_Q parameter for the score distribution. Next, in Section 5.2, we describe an approach that relies on "static" query-based sampling [4]. Then, in Sections 5.3 and 5.4, we describe our approach that simultaneously explores the database and retrieves as many good documents as possible at the same time.

5.2 Summary-based Estimation of Poisson Parameters

One strategy for generating estimates for the λ values is to generate a static document sample from the database, and then use the retrieved documents to generate the estimates. For example, Callan et al. [4] generate a summary from the database by sending random keyword queries and retrieving 300 documents. (By random, we mean queries with any word, not only queries with words from the issued user query Q.)

Given such a document sample, we can measure the tf(t, d) values for each term t and document d, and use the maximum likelihood estimate (MLE) to compute estimates for the λ_t values. To avoid zero estimates for terms that do not appear in the sample, we use Laplace smoothing:^7

  λ̂_t^MLE = (1 + Σ_{i=1}^{S} tf(t, d_i)) / (S + 1)    (7)

This strategy tends to have a few shortcomings. First, the estimates assume that query-based sampling is equivalent to random sampling, an assumption that does not necessarily hold [12]: there is a bias towards retrieving longer documents more often, or documents with higher priority in the underlying ranking function (e.g., more recent documents in date-based ranking). Second, the estimates for the terms that were sent to the database as query probes are significant overestimates of the real values, since by definition the retrieved documents contain only documents with the submitted terms. Third, and more importantly, there is a very significant data sparseness issue: many terms do not appear in the retrieved document sample and their estimates are simply the Laplacean-corrected values.

Next, we describe an alternative approach that compensates for the data sparseness by retrieving document samples through a sampling process customized to the issued query Q (Section 5.3). Then, we show how to compensate for the overestimates introduced by the very nature of the query-based sampling (Section 5.4). As we will see, this exploitation-biased strategy tends to be slightly more expensive than the summary-based strategy (as it generates customized document samples on the fly, instead of having a static summary shared by all queries) but generates results of superior quality.

5.3 Exploitation-biased Query-based Estimation of Poisson Parameters

In the previous section, we described a query strategy in which we were sending random queries for sampling. Now, we describe an approach in which the sampling queries involve only the actual query words, biasing the retrieval of documents towards beneficial documents. In parallel, this query strategy avoids the issue of data sparseness by generating estimates specifically for the query at hand. (We describe later the exact query formulation strategies.) However, such a query strategy generates biases in the sampling, which affect the basic MLE estimation. So, we show now how to compensate for these biases.

6. Strictly speaking, the weighted sum of two Poisson random variables is a quasi-discrete distribution, which technically cannot be called Poisson. However, in practice, it behaves like a continuous version of the discrete Poisson distribution.
7. We can also use the Bayesian estimator instead of the MLE one; due to space restrictions we do not present this variant here, since the differences are small and restricted to the very first stages of the estimation.
Suppose that we submit as query the word t that appears in the user query Q. We need to estimate properly the parameter λ_t. In this case, the results that we receive back are not a random sample, so we cannot directly use the method described above. Instead, now all the returned documents are guaranteed to contain the word t. That is, the returned document results are a "conditional" random sample, with the condition that tf(t, d) > 0. So, our retrieved sample misses all the documents that do not contain t. Therefore, our calculations need to account for this fact. Hence, we estimate, given the retrieved documents for query t, how many empty documents we would have seen if we were performing random sampling.

Suppose that after submitting the query t, we retrieve and process a set of S documents. For the word t we also know its document frequency df(t) in D. So there are df(t) documents in the database that contain t and |D| − df(t) that do not contain t. Therefore, for every document with t that we retrieve from D, we expect to have (|D| − df(t)) / df(t) documents without t (i.e., with tf(t, d) = 0). So, we modify the estimator from the previous section to account for the unseen documents that do not contain t, and the sample size changes from S to S' = S + [(|D| − df(t)) / df(t)] · S = (|D| / df(t)) · S. So, by changing the normalizing factor 1/S to 1/S' in Equation 7, we have:

  λ̂_t^MLE = (1/S) · Σ_{i=1}^{S} [df(t) / |D|] · tf(t, d_i)    (8)

The key observation in Equation 8 is that when we update the MLE estimates for the term t using results from a query that contains t, we should scale the estimates using the df(t)/|D| factor.

Below, in Procedure updateLambdaByMLE, we present the algorithm to update the current λ_t estimations after a conjunctive query q is submitted, which produces a set of S_q results. S is the number of results retrieved so far by all submitted conjunctive queries. Note that for simplicity we use the notation λ_t instead of λ_t^MLE in the rest of the analysis.

Procedure updateLambdaByMLE(q)
 1  Let Z_q be the result set for q. It is S_q = |Z_q|.
 2  foreach t ∈ Q do
 3    if t ∈ q then
 4      λ_t^prev ← λ_t^prev + [df(t)/|D|] · tf(t, Z_q)
 5      // λ_t^prev is an extra variable we need to keep for each t, given that λ_t stores the averaged value.
 6      // tf(t, Z_q) = Σ_{d ∈ Z_q} tf(t, d)
 7    end
 8    else if t ∈ Q \ q then
 9      λ_t^prev ← λ_t^prev + tf(t, Z_q)
10    end
11    λ_t ← λ_t^prev / S   // S is the total number of results retrieved so far by all queries
12  end

5.4 Query-based Document Score Distribution

In the previous section, we described how to adjust the λ_t estimates to compensate for the bias introduced through query-based sampling. Given these estimates, we can generate the distribution of similarity scores for the user-submitted query Q in the database. However, we are not going to retrieve documents randomly from the database. Instead, we submit a sequence of conjunctive queries trying to retrieve the most highly similar documents. So, to identify which queries will retrieve the most similar documents, we need to estimate the score distribution for the query results for a given query q, which is not necessarily the same as the original user query Q. We now describe how to compute the score distribution for any conjunctive query q.

In Section 5.1, we gave a functional form for the score distribution, assuming that we get a randomized sample from the database. However, when the documents that we examine are retrieved by using a query q that contains some of the terms in the original user-issued query Q, the retrieved document sample is biased: a conjunctive query guarantees that the returned documents have tf(t, d) > 0 when t ∈ q. In this case we have:

  Pr{tf(t, d) = r | tf(t, d) > 0} = Pr{tf(t, d) = r, r > 0} / Pr{tf(t, d) > 0}
                                  = [e^(−λ_t) · (λ_t)^r / r!] / (1 − Pr{tf(t, d) = 0})
                                  = [e^(−λ_t) · (λ_t)^r / r!] / (1 − e^(−λ_t))    (9)

In other words, the new tf distribution is a Poisson distribution with a normalizing factor 1 / (1 − e^(−λ_t)). Therefore, when we send a query q to the database, the document score distribution (the score is always defined with respect to the user query Q) follows the Poisson distribution with a configuring parameter λ^s_q:

  λ^s_q = Σ_{t ∈ q} [λ_t / (1 − e^(−λ_t))] · idf(t) + Σ_{t ∈ Q\q} λ_t · idf(t)    (10)

which is different from the functional form depicted in Equation 3. Following the analysis from Section 5.1, the number of documents in the results of a conjunctive query q with score above a threshold τ is:

  |Docs(score > τ)| = S_q · Pr{score > τ}
                    = S_q · Pr{Poi(λ^s_q) > τ}
                    = S_q · (1 − Γ(⌊τ + 1⌋, λ^s_q) / ⌊τ⌋!)    (11)

This analysis gives us the basis for formulating our querying strategy, which we describe next.
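As a rough illustration of Equations 8-11, the sketch below updates the λ_t estimates from a new result set, computes the query-conditioned parameter λ^s_q, and estimates how many of a query's S_q results are expected to score above τ. The helper names and the toy inputs are ours, not the paper's, and the result sets are assumed to be token lists.

import math

def truncated_mean(lam):
    # Mean contribution of a zero-truncated Poisson term: lambda_t / (1 - exp(-lambda_t)),
    # the normalizing factor derived in Equation 9.
    return lam / (1.0 - math.exp(-lam))

def update_lambda_by_mle(q, Q, Z_q, lambda_prev, S, df, num_docs):
    """Sketch of Procedure updateLambdaByMLE: accumulate (scaled) term frequencies
    from the new result set Z_q and re-average over all S retrieved documents."""
    lambdas = {}
    for t in Q:
        tf_in_results = sum(doc.count(t) for doc in Z_q)
        if t in q:
            # Equation 8: results of a query containing t over-represent t,
            # so scale by df(t)/|D| before adding to the running sum.
            lambda_prev[t] = lambda_prev.get(t, 0.0) + df[t] / num_docs * tf_in_results
        else:
            lambda_prev[t] = lambda_prev.get(t, 0.0) + tf_in_results
        lambdas[t] = lambda_prev[t] / S
    return lambdas

def lambda_s_q(q, Q, lambdas, idf):
    # Equation 10: terms forced to appear by the conjunctive query q contribute
    # their zero-truncated mean; the remaining terms of Q contribute lambda_t.
    return (sum(truncated_mean(lambdas[t]) * idf[t] for t in q)
            + sum(lambdas[t] * idf[t] for t in Q if t not in q))

def expected_docs_above(tau, S_q, lam_s_q):
    # Equation 11: S_q * Pr{Poi(lambda^s_q) > tau}.
    cdf = sum(math.exp(-lam_s_q) * lam_s_q**r / math.factorial(r)
              for r in range(math.floor(tau) + 1))
    return S_q * (1.0 - cdf)

# Toy run with hypothetical values.
prev = {}
lams = update_lambda_by_mle(q={"virus"}, Q={"virus", "structure"},
                            Z_q=[["virus", "virus", "capsid"], ["virus", "structure"]],
                            lambda_prev=prev, S=2,
                            df={"virus": 5_000, "structure": 9_000}, num_docs=117_860)
print(lams)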
5.5 Top-k Querying Algorithm

Above we presented the estimation for a general query-based approach, without specifying how to select queries to send. However, we now know the expected score distribution for each conjunctive query, and how these estimates are updated every time that we retrieve a new document (Procedure updateLambdaByMLE above).

Our querying strategy is as follows. We start by sending all the terms of the query as a conjunctive query to the database. This query is expected to retrieve the documents with the highest scores. Obviously, if the query matches fewer than k documents, we need to submit relaxed versions of the query (e.g., remove one keyword; we describe below how to select which query to submit). If we have more than k results, we still need to compute the confidence that the retrieved documents contain the correct top-k results. (Note that a document with fewer query keywords may achieve a higher ranking according to an IR function.)

Since we do not have access to the complete database, we cannot be absolutely certain that we retrieved all the "real" top-k documents. Instead, we adopt a probabilistic approach, and we use an input parameter P: the probability that any unseen document belongs to the top-k results must be less than P. That is, for all the (unseen) documents d with relevance score(d, Q), we have:

  Pr{score(d, Q) > τ} < P    (12)

where as τ we set the relevance score of the k-th highest scoring document retrieved so far. (See Equations 11 and 10 to see how to compute the value Pr{score(d, Q) > τ}. Notice that we are trying to estimate the distribution of scores for the user-submitted query Q, and we retrieve the documents by sending a set of conjunctive queries qi that contain only a subset of the terms from Q.)

Hence, at every step we compute the benefit of every candidate query, which is Pr{score > τ}, computed as shown in Section 5.4. If for all candidate queries the benefit is less than P, the algorithm terminates. Else, the query q with the maximum benefit is submitted. Then the λ_t parameter estimations are updated and the process repeats. We maintain a priority queue with the expected benefit of each query so we can select which query to issue next. The main algorithm is shown in Procedure QueryExecution.

Procedure QueryExecution(k, Q, P)
 1  Initialize:
 2  - Add to priority queue PQ all combinations q of terms of Q, that is, PQ has all candidate conjunctive queries; PQ is ordered by the benefit pr(q) of q
 3  - Default λ_t parameters are assigned to each t ∈ Q, and accordingly initial benefits pr(q) for each q ∈ PQ are computed;
 4  - Create results array R with size k, where results are ordered by score;
 5  while PQ ≠ ∅ do
 6    q ← PQ.pop()
 7    if pr(q) < P and R contains k results then
 8      break
 9    end
10    else
11      Z_q ← Fetch(q)
12      S ← S + S_q   // S is the number of documents retrieved so far. S_q = |Z_q|
13      Insert Z_q into R   // Z_q and R are merged into R
14      UpdateLambdaByMLE(q)
15      UpdatePQ(q)
16    end
17  end
18  return R

As mentioned in Section 4, a practical issue that we face is that we may retrieve the same documents many times as we issue the queries. From an estimation point of view, we should always include such documents in the updating of the estimates. Practically though, we do not want to retrieve the same documents multiple times, so that we can save the retrieval and processing cost. We can achieve this by either adding negations of the previously issued queries to the submitted query or by simply not retrieving documents with document ids identical to previously retrieved ones. In our querying technique we use the negation trick. The process is as follows. We save the set of documents Z_q for each past query q, and then for a new query q_new, we do the following:
• Let L = q1, ..., ql be the set of past queries for which q_new ⊂ qi.
• Submit q'_new = q_new − q1, ..., ql.
• The result of q_new for our probabilistic analysis purposes is Z_{q_new} = Z_{q'_new} ∪ Z_{q1} ∪ ... ∪ Z_{ql}.

Fig. 1: Lattice of candidate conjunctive queries.

Candidate Queries Lattice  The sequence q1, q2, ..., qv of queries that the algorithm submits can be viewed as a prefix of the queries lattice, shown in Figure 1. Except for the most selective, vanilla conjunctive query that returns the AND of all terms (the top of the lattice), it is not clear which query is going to be the "next most beneficial" (see Figure 1). Identifying which are the "most beneficial" (or "most selective") queries is part of our algorithm, as described above. Our algorithm always executes a prefix of the lattice (the upper queries of a cut are executed), because the parent always has higher benefit pr(q) than a child: the benefit increases monotonically with the number of keywords.
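The candidate lattice and the negation trick described above are straightforward to express. The following sketch enumerates the candidate conjunctive queries (all non-empty subsets of Q, largest first) and builds the negated query string for a new query; the string syntax is the generic AND/NOT form used in the examples, not any particular source's dialect.

from itertools import combinations

def candidate_queries(Q):
    """All non-empty conjunctions of the user query terms: the lattice of
    Figure 1, from the AND of all terms down to single-term queries."""
    terms = list(Q)
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            yield set(combo)

def negated_query_string(q_new, past_queries):
    """Negation trick: AND the new conjunction with NOT(p) for every previously
    issued query p that is a superset of q_new, so documents already retrieved
    by p are not fetched again."""
    clause = " AND ".join(sorted(q_new))
    for p in past_queries:
        if q_new < p:  # q_new is a proper subset of the past query p
            clause += " AND NOT (" + " AND ".join(sorted(p)) + ")"
    return clause

Q = {"immunodeficiency", "virus", "structure"}
past = [{"immunodeficiency", "virus", "structure"}]
print(negated_query_string({"immunodeficiency", "structure"}, past))
# "immunodeficiency AND structure AND NOT (immunodeficiency AND structure AND virus)"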
Two possible executions of a user query are shown in Figure 1. The execution of the algorithm can be viewed as a cut going down the lattice. Hence, in the priority queue we only need to keep the candidate queries that are just below the current cut.

Procedure UpdatePQ(q)
 1  foreach (pr, q) in PQ do
 2    λ^s_q = Σ_{t ∈ q} F(λ_t / (1 − e^(−λ_t)), df(t), avdl) + Σ_{t ∈ Q\q} F(λ_t, df(t), avdl)   // according to Equation 10
 3    pr(q) ← 1 − IGF(τ, λ^s_q)   // IGF stands for incomplete gamma function
 4  end

Total vs. Block Variants of the Algorithm  In the above description, the algorithm submits a query q at a time and retrieves all its results. That is, the granularity is a whole query. We refer to this variant as Total. Alternatively, we could refine the granularity by only retrieving B results at a time from the results of a query q if B < |Z_q| (like line 11 in Procedure QueryExecution). Then, the probabilities are updated as usual and the rest of the query is placed back into the priority queue. We refer to this variant as Block. These variants are compared experimentally in Section 6 for various block sizes B.

6 EXPERIMENTS

We experimentally evaluate the performance and quality of the retrieval algorithms. We compare the Query-based probability estimation strategy described in Section 5.4 to the Summary-based estimation strategy of Section 5.2, and also consider the Total vs. Block variants of the top-k querying algorithm of Section 5.5. For that, we compare the following algorithm variants:
• Baseline: This algorithm submits the disjunction of all query keywords to the database, gets all results, computes their IR score, and finally returns the top-k to the user.
• Summary-based: Our "Total" algorithm with summary-based probability estimation of the λ's.
• Query-based: Our "Total" algorithm with query-based probability estimation of the λ's.
• Block-based: Our "Block" algorithm with query-based probability estimation of the λ's.
Note that we do not show results for a Block variant with summary-based estimation, because our experiments show that the Block variant is worse than the Total variant, and also that Summary-based is worse than Query-based estimation.

6.1 Experimental Setup

Configuration: All experiments were run on a PC with a 2.5 GHz Intel quad-core processor and 4 GB RAM running Windows XP SP2. The algorithms were implemented in Java.

Datasets: We ran our algorithm on three real datasets shown in Table 2, two "Local" and one "Remote." The first Local dataset is the "PMC Open Access Subset" (LocalPubMed)^8, which is a subset of PubMed comprising 117,860 open-access articles, with the full text available for download. All of the documents are XML files. The second Local dataset is the TREC Disks 1-5 dataset (LocalTREC)^9, which comprises over three million articles from newspapers and government agencies. We used Lucene^10 to index every article in the two Local datasets. Note that Lucene allows IR ranking of the documents, but we assumed this feature is not available in this experiment. Instead, we set Lucene to return the documents ordered by date.

The Remote dataset, which is more appropriate for this paper's motivation, is the whole PubMed, which can only be remotely accessed through the PubMed Web access utility services (RemotePubMed).^11 We only retrieve the abstracts of the articles since the body of many articles is missing from PubMed. Note that PubMed does not offer any form of relevance-based ranking. All results are ranked by date.

For LocalPubMed, we picked 60 queries that have been used as exercises to train bioinformatics information specialists [8]. Then, we separated the queries into two sets of 30 queries each, "frequent" and "infrequent", based on the number of results that they generated when evaluated on the web interface of PubMed. Due to restrictions imposed by the web interface of PubMed, we could only use the "infrequent" queries with the Remote dataset. This is the result of the restrictions imposed by PubMed, which does not allow massive downloads of documents over the web service interface. Therefore, we could not fully evaluate and retrieve all the returned documents for the "frequent" queries and, hence, we could not generate the baseline against which to evaluate the quality of the results. (We could use our algorithms, which retrieve significantly fewer documents, but we would not be able to evaluate the results.) For our LocalPubMed dataset we use both frequent and infrequent queries. For LocalTREC, we used the English Test Questions from the TREC website^12 as our queries. For these queries our baseline is the relevance ranking provided by the TREC relevance judgments. Tables 3 and 4 show a sample of the queries submitted to the Local and Remote datasets, respectively.

8. http://www.pubmedcentral.nih.gov/about/openftlist.html
9. http://trec.nist.gov/data/test_coll.html
10. http://lucene.apache.org/
11. http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
12. http://trec.nist.gov/data/testq_eng.html
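Before turning to the experiments, a compact Python rendering of the Total variant ties Procedures QueryExecution and UpdatePQ together. The callables fetch, benefit, update_lambdas and score are assumptions supplied by the caller (the paper leaves their implementation to the source wrapper); benefit is meant to follow Equation 11, and documents are assumed to be hashable identifiers.

from itertools import combinations

def query_execution(k, Q, P, fetch, benefit, update_lambdas, score):
    """Simplified "Total" variant of Procedure QueryExecution.

    Assumed callables (not defined in the paper):
      fetch(q)                -> iterable of document ids matching conjunction q
                                 (with the negation trick applied, so no duplicates)
      benefit(q, tau)         -> Pr{score > tau} for q's results, per Equation 11
      update_lambdas(q, docs) -> refresh the lambda_t estimates (updateLambdaByMLE)
      score(d)                -> IR score of a retrieved document with respect to Q
    """
    # All candidate conjunctive queries: the lattice of Figure 1.
    candidates = {frozenset(c) for n in range(len(Q), 0, -1)
                  for c in combinations(Q, n)}
    results = []                                   # retrieved documents, best first
    while candidates:
        tau = score(results[k - 1]) if len(results) >= k else 0.0
        q = max(candidates, key=lambda c: benefit(c, tau))   # head of the priority queue
        if len(results) >= k and benefit(q, tau) < P:
            break          # no unseen document is likely enough to enter the top-k
        candidates.remove(q)
        docs = list(fetch(q))
        update_lambdas(q, docs)
        results = sorted(set(results) | set(docs), key=score, reverse=True)
    return results[:k]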
TABLE 2: Dataset detail
  Dataset                  Type     Total # of doc.
  PMC Open Access          Local    117,860
  TREC text collections    Local    3,000,000
  PubMed Portal            Remote   17,000,000

TABLE 3: Sample queries from the LocalPubMed dataset
  Query                                            df
  cancer promoter                                  6280
  cystic fibrosis                                  5285
  immunodeficiency virus structure                 8341
  flanking SNPs gene                               13276
  alveolar carcinoma cell TDs                      10051
  PCR DMD BMD exon                                 2546
  climate Control Prevention change impacts        7001
  DNA mRNA overlapping isolated cDNA               19787

TABLE 4: Sample queries from the Remote dataset
  Query                                                  df
  Herniotomies orchidopexy                               645
  NTBC Fah                                               363
  NTBC fumarylacetoacetate Fah                           435
  tetrahydropyridines thiazolidinones carboxaldehydes    242
  FAH NTBC tyrosinemia FAA                               2032
  MAAI DCA dichloroacetate NTBC                          2364
  MPI MAAI polyaromatic Ralstonia gentisate              5447
  pEGFP SSBs CCCs fuma Sakashita                         1953

TABLE 5: System parameters
  Parameter                                      Range
  Probability threshold (P)                      0.01, 0.1, 0.2, ..., 0.5
  Result cardinality (k)                         1, 10, 50, 100
  Keyword cardinality (#keywords)                2, 3, 4, 5
  Block size (infinity for Query-based)          100, 500, 1000, 2000

Quality Measure: We measure the quality of the algorithms as follows: we first execute the Baseline algorithm to compute the optimal top-k results. Then, we measure the quality of the Query-based and Block-based algorithms by comparing their top-k search results to this optimal list generated by the Baseline algorithm. We compare two top-k lists using the top-k Spearman's Footrule metric [6]:

  F(σ, τ) = Σ_{i=1}^{k} |σ(i) − τ(i)|    (13)

where σ and τ are two top-k lists, and σ(i) and τ(i) are the ranks of the i-th search result. If a result is missing from a list, we assume it appears at position k+1. We normalize the Footrule by dividing by its maximum score:

  Fr(σ, τ) = 2 · F(σ, τ) / (k(k + 1))    (14)

Table 5 summarizes the parameters varied in our experiments, along with their ranges.

6.2 Experiments on Local Datasets

Varying P: First, we examine the effect of P on the performance of our algorithms. P is the parameter that defines the confidence that the returned results are close to the optimal. Smaller values of P mean that the algorithm tries harder to approximate the optimal list, while larger values of P mean that the algorithm can stop earlier, returning rougher approximations of the optimal list.

In Figures 2 and 3 we set the number of keywords to 3 and fix k = 50. For the Block-based algorithm, we set the Block size to 2000. We vary P from 0.01 to 0.5. Figures 2(a) and 3(a) show that Summary-based, Query-based and Block-based fetch fewer documents as P grows. We observe that Block-based retrieves slightly fewer documents but submits more conjunctive queries compared with Query-based (called fetches in Figures 2(b) and 3(b)). As expected, Summary-based retrieves the fewest documents in most cases. (As discussed in Section 5.2, the summary-based algorithm retrieves 300 documents for the initial document summary to generate the estimates, but we do not include this one-time cost in the reported results.) Moreover, in Figure 2(b) we see that for P ≥ 0.2, Query-based and Block-based coincide, because the number of documents Block-based fetches is less than the Block size B. The same phenomenon also happens in Figure 3(b) for P ≥ 0.25.

As shown in Figures 2(c) and 3(c), the time of all three algorithms is reduced with P, since we fetch fewer documents when P increases.

Overall, in terms of efficiency, all algorithms perform better than Baseline because Baseline fetches all the documents in the results before reranking them. For P = 0.01, Query-based is slightly faster than Block-based because they both retrieve the same number of documents (see Figure 2(a)) but Block-based needs more fetches, which incur an overhead. Summary-based is the fastest because it performs the fewest fetches (queries) and also the lambda estimation is performed off-line.

Although the Summary-based algorithm is the most efficient, we observed that the speed comes at the expense of the quality of the results. In terms of quality, Figures 4(a) and 4(b) show that both Query-based and Block-based achieve excellent Footrule values for P up to 0.3 (for LocalPubMed) or 0.2 (for LocalTREC), while Summary-based is the worst in all cases, as expected: this is the result of the rough probability estimates.

In the rest of this section, due to space constraints, we only report the results for LocalPubMed, given that the results of LocalTREC follow similar trends.
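A small sketch of the quality measure, assuming the top-k lists are given as ranked lists of document ids. Since Equation 13 does not spell out the summation set, this version sums the rank displacements over the positions of the optimal list, which gives the maximum value k(k+1)/2 used in the normalization of Equation 14.

def normalized_footrule(optimal, retrieved, k):
    """Top-k Spearman's Footrule of Equations 13-14: for each result in the
    optimal top-k list, take the absolute difference between its rank there and
    its rank in the retrieved list (missing results count as rank k+1), then
    divide by the maximum possible value k(k+1)/2."""
    rank_retrieved = {doc: i + 1 for i, doc in enumerate(retrieved[:k])}
    footrule = sum(abs((i + 1) - rank_retrieved.get(doc, k + 1))
                   for i, doc in enumerate(optimal[:k]))
    return 2.0 * footrule / (k * (k + 1))

# Toy example: swapping the 2nd and 3rd results gives 2*2/(3*4) = 0.33...
print(normalized_footrule(["d1", "d2", "d3"], ["d1", "d3", "d2"], k=3))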

Fig. 2: LocalPubMed: Varying P. (a) number of doc. vs. P; (b) number of fetches vs. P; (c) time vs. P.

Fig. 3: LocalTREC: Varying P. (a) number of doc. vs. P; (b) number of fetches vs. P; (c) time vs. P.

Fig. 4: Footrule vs. P. (a) LocalPubMed; (b) LocalTREC; (c) Remote.

Fig. 5: LocalPubMed: Varying k. (a) number of doc. vs. k; (b) number of fetches vs. k; (c) time vs. k.

Varying k: Next, we set the number of keywords to 3, P = 0.1, and vary k from 1 to 100, as shown in Figure 5. As displayed in Figure 5(a), the number of fetched documents increases with k for Summary-based, Block-based and Query-based, as expected: with small k we can easily retrieve "a few good documents" but when k increases the task of locating all similar documents becomes increasingly harder. Furthermore, observe that the number of documents grows slowly from k = 10 to k = 100 but fast from k = 1 to k = 10. The reason is that very few documents have a very high relevance score, as expected from the Poisson distribution of the similarity scores, but after that the similarity threshold does not change as drastically with k. The observations for the number of documents naturally carry over to the number of fetches in Figure 5(b). As shown in Figure 5(c), the execution time of the four algorithms increases with k. For Baseline, this is because it has to compute the top-k results from all retrieved results. Query-based is slightly faster than Block-based when k = 10, because both algorithms fetch the same number of documents when k = 10 (Figure 5(a)) but Query-based needs fewer fetches. Summary-based is faster than the other three algorithms, because it performs fewer fetches and retrieves fewer documents, as we explained above in the "varying P" paragraph. The quality results are also similar: as shown in Figure 6(a), Query-based has perfect accuracy, whereas Block-based's accuracy decreases slightly as k increases. Summary-based is the most efficient but again has the worst accuracy as measured by the Footrule distance.

Varying the number of keywords: Figure 7 depicts the results for different numbers of keywords for the two Local datasets. In this experiment we fix k = 50 and P = 0.1. As shown in Figure 7(a), Query-based fetches slightly more documents in most cases. An exception is #keywords = 5, where both Query-based and Block-based retrieve about the same number of documents because each executed conjunctive query has fewer results than the Block size. (Note that conjunctive queries with more keywords return fewer results.) As shown in Figure 7(b), the number of fetches for all methods increases fast because the number of keyword combinations grows exponentially with the number of keywords. In Figure 7(c), Summary-based, Query-based and Block-based perform consistently better than Baseline.

Fig. 6: Footrule vs. k and #kwd. (a) LocalPubMed (fr vs. k); (b) Remote (fr vs. k); (c) LocalPubMed (fr vs. #kwd); (d) Remote (fr vs. #kwd).

Fig. 7: LocalPubMed: Varying number of keywords. (a) #document vs. #keywords; (b) #fetch vs. #keywords; (c) time vs. #keywords.

Fig. 8: Estimation of tf vs. #fetch (LocalPubMed). (a) #fetch vs. tf: disease; (b) #fetch vs. tf: genetic; (c) #fetch vs. tf: treatment.

Fig. 9: Remote Dataset: Varying P. (a) number of doc. vs. P; (b) number of fetches vs. P; (c) time vs. P.

Block-based is slightly faster than Query-based, since Query-based fetches more documents than Block-based in most cases.

In terms of quality, interestingly, as we see in Figure 6(c), the performance of Summary-based degrades as the number of keywords increases. The reason is that for more keywords, the number of candidate conjunctive queries explodes, and hence the inaccurate parameter estimation of Summary-based often leads to bad query choices. In contrast, we see that the performance of the Block-based algorithm increases with the number of keywords, because the submitted conjunctive queries become more selective and the correct top results often appear in these very focused queries. Finally, we see that Query-based has perfect quality in this figure. The reason is that Query-based retrieves slightly more documents, especially for few keywords, and these are the correct top-k, given the focused nature of the conjunctive queries with a large number of keywords.

Varying Block Size: In Table 6, we set the number of keywords to 3, k = 50, P = 0.1, and we measure the performance of the Block-based algorithm by varying the Block size from B = 100 to B = 2000. Note that the Query-based algorithm is equivalent to Block-based when the Block size is infinity. The number of fetched documents increases with Block size, since the Block-based algorithm can stop earlier if the Block size is smaller. As expected, the number of fetches decreases as Block size increases. The time of the Block-based algorithm decreases with increasing Block size, even though the number of retrieved documents slightly increases.

Fig. 10: Remote Dataset: Varying k. (a) number of documents vs. k; (b) number of fetches vs. k; (c) time vs. k.

Fig. 11: Remote Dataset: Varying #keywords. (a) #doc vs. #keywords; (b) #fetch vs. #keywords; (c) time vs. #keywords.

This is because of the overhead incurred by each fetch, which includes the query overhead and the additional tasks for each fetch, like updating the estimated frequencies. In terms of quality, as the Block size increases, the Footrule of Block-based drops, because more results are retrieved, as expected.

TABLE 6: Varying Block size
             Block size
             100     500     1000    2000    ∞
  #doc       8340    8440    8500    8870    9540
  #fetch     89      20      18      15      12
  time(ms)   4352    1512    1351    899     878
  Footrule   0.034   0.031   0.029   0.026   0.00

Compare tf estimations of Summary-based vs. Query-based: In the above experiments we showed that the quality of the Summary-based variant is consistently worse than the Query-based variants. The main reason is that Summary-based does not estimate the λ parameters as accurately, which intuitively means that it does not estimate as accurately the expected tfs of the query words. Given the fact that the LocalPubMed dataset shares many aspects with the LocalTREC dataset, here we just use LocalPubMed as our testbed to verify this fact.

Figure 8 shows a qualitative depiction for k = 50 and P = 0.1, for the 3-keyword query genetic, disease, treatment on the Local dataset. We compare the tf estimations of the two methods to the correct values, which are calculated off-line by scanning the complete dataset and using the Maximum Likelihood Estimation (MLE) formula (Eq. 7). We see that Query-based is consistently better than Summary-based for the reasons explained in Section 5.2.

6.3 Remote Dataset

Given the graphs of Section 6.2, we conclude that the Query-based algorithm is generally better than Block-based because, for a slightly higher execution time, it leads to considerable quality improvement. Hence, we only consider Query-based and Summary-based in this section.

In Figures 9, 10 and 11 we repeat the above experiments on the Remote database for varying P, k and number of keywords, respectively. Due to the characteristics of the Remote dataset, which are the much larger size and the slow query response times, we observe the following key differences from the results on the Local dataset.

As shown in Figure 4(c), the Footrule has much less variation than in Figure 4(a) because the number of retrieved documents has much smaller variation with P. The reason for the latter is that the queries we used in the Remote dataset have more infrequent keywords, as we explain in Section 6.1.

Also, in Figure 6(b) we see that the Footrule decreases with k, in contrast to Figure 6(a) where it was 0 for Query-based and increasing for Block-based. The reason is that, as we see in Figure 10(a), the number of retrieved documents increases dramatically with k, which was not the case for the Local dataset (Figure 5(a)). The reason for the latter fact is that the Remote dataset has many more documents. When increasing the #keywords, in Figures 6(d) and 6(c), we observe that all algorithms are stable or degrade, since the search space increases. The only exception is Block-based for LocalPubMed, which improves because it reads too few documents for small #keywords (Figure 7(a)).

6.4 Discussion

Generally, as we have seen in the previous experiments, the Summary-based variant is slightly faster than the

Query-based variant. On the other hand, Query-based is more accurate since its estimation strategy is better.

Fig. 12: Query cost vs. Quality

In this section, we combine previous results to give a general picture of the two methods and show the time vs. quality tradeoffs. We use the results of 3-keyword queries for this analysis.

Figure 12 illustrates the time and quality improvement of the Query-based and Summary-based algorithms for the LocalPubMed and Remote datasets. We see that Summary-based has a very slight advantage in terms of execution time at the expense of a considerable disadvantage in terms of quality.

7 CONCLUSIONS

We presented a framework and efficient algorithms to build a ranking wrapper on top of a document data source that only serves Boolean keyword queries. This setting is common in various major databases today, including PubMed and USPTO. Our algorithm submits a minimal sequence of conjunctive queries instead of a very expensive disjunctive one. The query score distributions of the candidate conjunctive queries are learned as documents are retrieved from the source. Our comprehensive experimental evaluation on the PubMed database shows that we achieve an order of magnitude improvement compared to the baseline approach.

8 ACKNOWLEDGEMENT

Vagelis Hristidis was partly supported by NSF grant IIS-0811922 and DHS grant 2009-ST-062-000016.

REFERENCES

[1] M. K. Bergman. The Deep Web: Surfacing hidden value. Journal of Electronic Publishing, 7(1), Aug. 2001.
[2] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Springer, 1985.
[3] N. Bruno, L. Gravano, and A. Marian. Evaluating top-k queries over web-accessible databases. In ICDE '02: Proceedings of the 18th International Conference on Data Engineering, page 369, Washington, DC, USA, 2002. IEEE Computer Society.
[4] J. P. Callan and M. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97-130, 2001.
[5] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[6] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. In SODA '03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 28-36, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.
[7] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 102-113, New York, NY, USA, 2001. ACM.
[8] R. C. Geer, D. J. Messersmith, K. Alpi, M. Bhagwat, A. Chattopadhyay, N. Gaedeke, J. Lyon, M. E. Minie, R. C. Morris, J. A. Ohles, D. L. Osterbur, and M. R. Tennant. NCBI advanced workshop for bioinformatics information specialists: Sample user questions and answers. Accessible at http://www.ncbi.nlm.nih.gov/Class/NAWBIS/index.html, 2002. Last revised on August 6th 2007.
[9] V. Hristidis, Y. Hu, and P. G. Ipeirotis. Ranked queries over sources with boolean query interfaces without ranking support. In 26th IEEE International Conference on Data Engineering (ICDE 2010), 2010.
[10] V. Hristidis and Y. Papakonstantinou. Algorithms and applications for answering ranked queries using ranked views. The VLDB Journal, 13(1):49-70, 2004.
[11] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4):1-58, 2008.
[12] P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In SIGMOD Conference, pages 265-276, 2006.
[13] P. G. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB 2002), pages 394-405, 2002.
[14] J. Lee, J. Lee, and H. Lee. Exploration and exploitation in the presence of network externalities. Management Science, 49(4):553-570, Apr. 2003.
[15] W. G. Macready and D. H. Wolpert. Bandit problems and the exploration/exploitation tradeoff. IEEE Transactions on Evolutionary Computation, 2(1):2-22, Apr. 1998.
[16] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Y. Halevy. Google's deep web crawl. PVLDB, 1(2):1241-1252, 2008.
[17] A. Ntoulas, P. Zerfos, and J. Cho. Downloading textual hidden web content by keyword queries. In Proceedings of the Fifth ACM+IEEE Joint Conference on Digital Libraries (JCDL 2005), 2005.
[18] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC, 1994.
[19] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986.
[20] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-42, 2001.
[21] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with probabilistic guarantees. In VLDB, 2004.
