
Semantic Web & Web Mining Assignment

Q1 Why is tf·idf a good weighting scheme? Why are inverse document frequencies (idf weights)
expected to improve IR performance when added to term frequencies (tf)?

Q2 Explain the differences amongst URI, URL, and URN using Venn diagrams and provide
examples for each intersection amongst these sets.

Q3 Consider the vector-space representation of documents and compare the cosine distance to the
ordinary Euclidean distance. Show that for vectors of unit length the ranking induced by the two
distances is the same.

Q4 What is the bag-of-words representation of the sentence “to be or not to be”? Suppose we search
for the above sentence via the keyword “be”. What is the bag-of-words representation for this query,
and what is the Euclidean distance from the sentence? Describe a simple text search that could not
be carried out effectively using a bag-of-words representation (no matter what distance measure is
used).
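
A minimal sketch for checking the arithmetic in Q4, assuming lowercase whitespace tokenization and raw term counts as the vector components (the question does not fix either choice). The helper names bag_of_words and euclidean_distance are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map each whitespace-separated, lowercased token to its count."""
    return Counter(text.lower().split())


def euclidean_distance(a, b):
    """Euclidean distance between two bags over their combined vocabulary."""
    vocabulary = set(a) | set(b)
    # A Counter returns 0 for a term it has never seen.
    return math.sqrt(sum((a[term] - b[term]) ** 2 for term in vocabulary))


sentence = bag_of_words("to be or not to be")
query = bag_of_words("be")
distance = euclidean_distance(sentence, query)

print(dict(sentence))  # counts: to=2, be=2, or=1, not=1
print(dict(query))     # counts: be=1
print(distance)        # square root of 7, about 2.6458
```

Word order is discarded by this representation, so "to be or not to be" and "be to not or be to" map to the same bag.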

Q5 What is the Euclidean distance between each of the vectors (1, 0, 0), (1, 4, 5), and (10, 0, 0)?
Divide each vector by its sum. How do the relative distances change? Divide each vector by its
Euclidean length. How do the relative distances change? Suppose we’re using the bag-of-words
representation for similarity searching with a Euclidean metric. Describe how the previous parts of
the question illustrate a potential problem if we do not normalize for document length.
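
A short sketch for checking the distances in Q5 under the two normalizations. The labels a, b, c and the helper names are assumptions introduced here for illustration; math.dist and multi-argument math.hypot require Python 3.8 or later.

```python
import math
from itertools import combinations

# Labels a, b, c are hypothetical names for the three vectors in the question.
vectors = {"a": (1, 0, 0), "b": (1, 4, 5), "c": (10, 0, 0)}


def scale_by_sum(v):
    """Divide each component by the sum of the components."""
    total = sum(v)
    return tuple(x / total for x in v)


def scale_by_length(v):
    """Divide each component by the Euclidean length of the vector."""
    length = math.hypot(*v)
    return tuple(x / length for x in v)


def pairwise_distances(named_vectors):
    """Euclidean distance for every unordered pair of named vectors."""
    return {
        (p, q): math.dist(named_vectors[p], named_vectors[q])
        for p, q in combinations(named_vectors, 2)
    }


raw = pairwise_distances(vectors)
by_sum = pairwise_distances({k: scale_by_sum(v) for k, v in vectors.items()})
by_length = pairwise_distances({k: scale_by_length(v) for k, v in vectors.items()})

for label, table in (("raw", raw), ("by sum", by_sum), ("by length", by_length)):
    print(label, {pair: round(d, 4) for pair, d in table.items()})
```

Without normalization, (1, 0, 0) is closer to (1, 4, 5) than to (10, 0, 0), even though the first and third vectors point in the same direction; after either normalization their distance is zero.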

Q6 Compute precision, recall and F1 for this result set:

                 Relevant    Not Relevant
Retrieved              20              40
Not Retrieved          60         1000000

Is accuracy a useful measure in IR? Demonstrate with an example.
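
A hedged sketch for checking the Q6 figures, assuming the table reads as rows Retrieved / Not Retrieved against columns Relevant / Not Relevant, so that 20 is the true-positive count, 40 false positives, 60 false negatives, and 1000000 true negatives. The variable names are illustrative.

```python
# Counts taken from the table, under the reading stated above.
true_positives = 20        # retrieved and relevant
false_positives = 40       # retrieved but not relevant
false_negatives = 60       # relevant but not retrieved
true_negatives = 1_000_000 # neither retrieved nor relevant

total = true_positives + false_positives + false_negatives + true_negatives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / total

# A system that retrieves nothing at all: every document is "not retrieved",
# so all non-relevant documents count as correct decisions.
empty_accuracy = (true_negatives + false_positives) / total

print(f"precision = {precision:.4f}")  # 1/3
print(f"recall    = {recall:.4f}")     # 1/4
print(f"F1        = {f1:.4f}")         # 2/7
print(f"accuracy  = {accuracy:.6f}")
print(f"accuracy when nothing is retrieved = {empty_accuracy:.6f}")
```

Under these counts the system that retrieves nothing scores a slightly higher accuracy than the actual result set, which is one way to illustrate why accuracy says little when almost all documents are non-relevant.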
