
581257 Information Retrieval Methods

Autumn 2010

Exercise 1, solutions

1. What kinds of information retrieval tasks have you completed? What kinds of retrieval methods have
you used to complete them?

Solution:
The answer is individual. Methods mentioned may include searching for information on a certain topic
with a search engine such as Google, searching for scientific articles with a library search system
or in digital libraries, searching for available flights and accommodation with the search engine
of a travel agency, browsing a web site, etc.

2. (Course book's exercise 1.2) Consider these documents:


Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients

a) Draw the term-document incidence matrix for this document collection.

Solution:
Term-document matrix:

               D1  D2  D3  D4
approach        0   0   1   0
breakthrough    1   0   0   0
drug            1   1   0   0
for             1   0   1   1
hopes           0   0   0   1
new             0   1   1   1
of              0   0   1   0
patients        0   0   0   1
schizophrenia   1   1   1   1
treatment       0   0   1   0
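
The matrix above can be reproduced mechanically. The following is a minimal Python sketch, assuming plain whitespace tokenization and no further normalization; the names `docs` and `incidence_matrix` are illustrative and do not come from the course book.

```python
# The four documents from the exercise, keyed by document number.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

def incidence_matrix(docs):
    """Return (sorted terms, rows), where rows[term] holds one 0/1 entry per document."""
    doc_ids = sorted(docs)
    # Assumption: tokens are separated by whitespace only.
    tokens = {d: set(docs[d].split()) for d in doc_ids}
    terms = sorted(set().union(*tokens.values()))
    rows = {t: [1 if t in tokens[d] else 0 for d in doc_ids] for t in terms}
    return terms, rows

terms, rows = incidence_matrix(docs)
for t in terms:
    print(f"{t:<14}", *rows[t])
```

Each printed row corresponds to one term and each column to one document, in the same order as in the solution.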

b) Draw the inverted index representation for this collection, as in Figure 1.3 (page 6).

Solution:
Inverted index:

approach 3
breakthrough 1
drug 1 2
for 1 3 4
hopes 4
new 2 3 4
of 3
patients 4
schizophrenia 1 2 3 4
treatment 3
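
As a rough illustration, the inverted index can be built by visiting the documents in increasing order of document number and appending each number to the postings list of every distinct term it contains. This is only a sketch under the assumption of whitespace tokenization; `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# The four documents from the exercise, keyed by document number.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    # Visiting documents in increasing order keeps every postings list sorted.
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    # Sort the dictionary of terms alphabetically, as in the solution.
    return dict(sorted(index.items()))

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, *postings)
```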

3. (Course book's exercise 1.3) For the document collection shown in Exercise 2 (Course book's
exercise 1.2), what are the returned results for these queries:

a) schizophrenia AND drug

Solution:
doc1, doc2

b) for AND NOT (drug OR approach)

Solution:
doc4
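
Both answers can be checked with ordinary set operations on the postings of the terms involved. The sketch below is one possible way to do this in Python; the variable names are illustrative.

```python
# Postings (as sets of document numbers) for the terms used in the two queries.
postings = {
    "approach": {3},
    "drug": {1, 2},
    "for": {1, 3, 4},
    "schizophrenia": {1, 2, 3, 4},
}

# a) schizophrenia AND drug: intersection of the two postings sets.
query_a = postings["schizophrenia"] & postings["drug"]

# b) for AND NOT (drug OR approach): remove the union from the postings of "for".
query_b = postings["for"] - (postings["drug"] | postings["approach"])

print(sorted(query_a))
print(sorted(query_b))
```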

4. (Course book's exercise 1.4) For the queries below, can we still run through the intersection in time
O(x+y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can
we achieve?

a) Brutus AND NOT Caesar

Solution:
Time is O(x+y). Instead of collecting documents that occur in both postings lists, collect
those that occur in the first one and not in the second.
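
A minimal sketch of such a merge is shown below, assuming both postings lists are sorted by document ID. The function name `and_not` and the example lists are hypothetical, not the postings from the course book.

```python
def and_not(p1, p2):
    """Return the doc IDs of sorted list p1 that do not occur in sorted list p2."""
    result = []
    i = j = 0
    # Each step advances at least one pointer, so the loop runs at most x + y times.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document is in both lists: skip it.
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Document cannot occur in p2 any more: keep it.
            result.append(p1[i])
            i += 1
        else:
            j += 1
    # Whatever is left in p1 cannot be in p2.
    result.extend(p1[i:])
    return result

brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # hypothetical postings
caesar = [2, 31, 54, 101]                  # hypothetical postings
print(and_not(brutus, caesar))
```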

b) Brutus OR NOT Caesar

Solution:
Time is O(N) (where N is the total number of documents in the collection) assuming we need
to return a complete list of all documents satisfying the query. This is because the length of
the results list is only bounded by N, not by the length of the postings lists.
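
To make the dependence on N concrete, the sketch below enumerates the whole collection. It assumes, purely for illustration, that the documents are numbered 1..N; the function name `or_not` and the example lists are hypothetical.

```python
def or_not(p1, p2, n_docs):
    """Return all doc IDs in 1..n_docs that are in p1 or are not in p2."""
    in_p1 = set(p1)
    in_p2 = set(p2)
    # Every document of the collection is inspected once, hence O(N) time.
    return [d for d in range(1, n_docs + 1) if d in in_p1 or d not in in_p2]

brutus = [2, 5]         # hypothetical postings
caesar = [2, 3, 5, 7]   # hypothetical postings
print(or_not(brutus, caesar, n_docs=8))
```

Only documents that contain Caesar but not Brutus are left out, so for a rare term such as Caesar the result covers almost the entire collection.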

5. (Course book's exercise 1.7) Recommend a query processing order for


(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:
Term           Postings size
eyes           213312
kaleidoscope   87009
marmalade      107913
skies          271658
tangerine      46653
trees          316812

Solution:
Using the conservative estimate of the length of each union (the sum of the lengths of its two
postings lists), the recommended order is:

(kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571)

However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer
than (marmalade OR skies), because the two components of the former are more asymmetric and
therefore tend to overlap less. For example, if terms occur independently of each other, a list of
11 postings and a list of 9990 postings are expected to share far fewer documents than two lists of
5000 postings each, so their union stays closer to the conservative estimate.
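
The ordering can be sketched in a few lines of Python. The first part sorts the three OR groups by the conservative estimate (the sum of the postings list sizes). The second part is only an illustration of the caveat: it assumes that terms occur independently and uses a hypothetical collection size `N_DOCS`, which is not given in the exercise.

```python
postings_size = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}
conjuncts = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

def conservative_estimate(group):
    """Upper bound on the size of an OR: the sum of the postings list sizes."""
    return sum(postings_size[t] for t in group)

def expected_union(group, n_docs):
    """Expected union size if the two terms occurred independently (an assumption)."""
    x, y = (postings_size[t] for t in group)
    return x + y - x * y / n_docs

# Process the smallest estimated intermediate result first.
order = sorted(conjuncts, key=conservative_estimate)
for group in order:
    print(" OR ".join(group), conservative_estimate(group))

N_DOCS = 500_000  # hypothetical collection size, not part of the exercise
for group in conjuncts:
    print(" OR ".join(group), round(expected_union(group, N_DOCS)))
```

With this hypothetical collection size, the expected union of (tangerine OR trees) is roughly 334,000 postings and that of (marmalade OR skies) roughly 321,000, so the order of the last two groups would be reversed. For a much larger collection the overlap becomes negligible and the conservative order holds.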

6. Try the search feature at http://www.rhymezone.com/shakespeare/. Write down five search features you
think it could do better.

Solution:
Such features may include excluding search terms (NOT), other Boolean operators, ranking by
relevance, searching for a combination of single words and phrases, wildcard queries,
proximity queries, etc.

7. (Course book's exercise 1.13) Try using the Boolean search features on a couple of major web
search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii)
burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and top
hits. Do they make sense in terms of Boolean logic? Often they haven't for major search engines.
Can you make sense of what is going on?

What about if you try different words? For example, query for (i) knight, (ii) conquer, and then (iii)
knight OR conquer. What bound should the number of results from the first two queries place on the
third query? Is this bound observed?

Example solution:
Observations on queries with some popular search engines:

                      Google      Yahoo                   Altavista    Bing
burglar               8,230,000   2,660,000 - 2,990,000   36,800,000   2,410,000
burglar AND burglar   5,350,000   689,000 - 880,000       36,800,000   4,990,000
burglar OR burglar    4,980,000   3,960,000               36,900,000   2,460,000

The number of hits may slightly change if the same query is run several times. The numeric results
depend on which search engine version you are using (e.g., google.com or google.fi).

Inferences:

When the operator AND is used in a query, Google advises against it because it already requires all
terms of a query by default. Yahoo and Bing seem to treat AND as a normal term and not as a Boolean
operator, as shown by the top results and the total number of hits. Altavista seems to follow
Boolean logic most closely. In the case of Google, the top documents returned for burglar AND
burglar contained the term burglar more than once, which suggests that query processing does not
remove redundant terms. Surprisingly, Bing returns about twice as many documents when AND is added;
it would be interesting to know the reason for that. OR is treated as a Boolean operator by
Altavista, Google and Yahoo, but the number of hits is still somewhat puzzling in the case of Google.

Observations:

                     Google        Yahoo        Altavista     Bing
knight               83,700,00     6,930,000    495,000,000   4,800,000
conquer              19,800,000    8,300,000    144,000,000   4,620,000
knight and conquer   1,830,000     2,950,000    27,800,000    2,940,000
knight AND conquer   1,830,000     3,170,000    27,800,000    3,030,000
knight or conquer    1,830,000     2,700,000    30,100,000    2,930,000
knight OR conquer    212,000,000   12,600,000   617,000,000   8,860,000

Inferences:

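The number of hits for knight AND conquer should be at most the number of hits for knight or for
conquer alone, which is observed in all four search engines. The number of hits for knight OR
conquer should be at least the larger of the two single-term counts, which is also observed in all
four. It should also be at most the sum of the two single-term counts; Yahoo, Altavista and Bing
respect this upper bound, but Google's 212,000,000 hits for knight OR conquer exceed it.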
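
In Boolean logic the count for knight OR conquer must lie between the larger of the two single-term counts and their sum. The sketch below checks this against the observed counts. Note one assumption: the Google count for knight is printed as 83,700,00 in the table and is read here as 83,700,000.

```python
# (knight, conquer, knight OR conquer) hit counts per search engine.
counts = {
    "Google": (83_700_000, 19_800_000, 212_000_000),  # knight value is an assumed reading
    "Yahoo": (6_930_000, 8_300_000, 12_600_000),
    "Altavista": (495_000_000, 144_000_000, 617_000_000),
    "Bing": (4_800_000, 4_620_000, 8_860_000),
}

def within_or_bounds(a, b, a_or_b):
    """True if max(a, b) <= |A OR B| <= a + b, as Boolean logic requires."""
    return max(a, b) <= a_or_b <= a + b

results = {engine: within_or_bounds(*c) for engine, c in counts.items()}
for engine, ok in results.items():
    print(engine, "within bounds" if ok else "outside bounds")
```

Only the Google counts fall outside the bounds, because the OR count is larger than the sum of the two single-term counts.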
