Semantic
Q1 Why is tf·idf a good weighting scheme? Why are inverse document frequencies (idf weights)
expected to improve IR performance when added to term frequencies (tf)?
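As a hedged illustration for Q1 (not a model answer), the sketch below computes tf·idf weights on a tiny made-up corpus. The three example documents, the helper name `tf_idf`, and the choice of idf = log(N / df) are assumptions for illustration; other idf variants exist. It shows that a term occurring in every document receives weight zero, while a term confined to one document is weighted most heavily.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for this illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * log(N / df)} for one tokenized document."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

weights = tf_idf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:4s} {w:.3f}")
```

Here "the" has the highest raw term frequency in the first document, yet its weight is zero because it does not discriminate between documents.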
Q2 Explain the differences among URIs, URLs, and URNs using Venn diagrams, and provide
examples for each intersection of these sets.
Q3 Consider the vector-space representation of documents and compare the cosine distance to the
ordinary Euclidean distance. Show that for vectors of unit length the ranking induced by the two
distances is the same.
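One possible route for Q3, given as a sketch: expand the squared Euclidean distance between two unit vectors. The symbols x, y, and θ (the angle between them) are notation introduced here, and cosine distance is assumed to mean 1 − cos θ.

```latex
\[
\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x \cdot y
            = 2 - 2\cos\theta
            = 2\,(1 - \cos\theta)
\qquad \text{when } \|x\| = \|y\| = 1 .
\]
```

Under that assumption the Euclidean distance equals the square root of twice the cosine distance, which is a monotonically increasing function, so sorting documents by either distance gives the same order.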
Q4 What is the bag-of-words representation of the sentence “to be or not to be”? Suppose we search
for the above sentence via the keyword “be”. What is the bag-of-words representation for this query,
and what is the Euclidean distance from the sentence? Describe a simple text search that could not
be carried out effectively using a bag-of-words representation (no matter what distance measure is
used).
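The first two parts of Q4 can be checked with a short sketch. The helper names `bag_of_words` and `euclidean` are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count word occurrences; word order is discarded."""
    return Counter(text.lower().split())

def euclidean(a, b):
    """Euclidean distance between two bags over their joint vocabulary."""
    vocab = set(a) | set(b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))

sentence = bag_of_words("to be or not to be")
query = bag_of_words("be")
distance = euclidean(sentence, query)

print(dict(sentence))  # counts: to=2, be=2, or=1, not=1
print(dict(query))     # counts: be=1
print(distance)        # sqrt(7), about 2.646
```

Because the bag keeps only counts, any sentence that uses the same words in a different order has an identical representation, which is relevant to the last part of the question.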
Q5 What is the Euclidean distance between each pair of the vectors (1, 0, 0), (1, 4, 5), and (10, 0, 0)?
Divide each vector by its sum. How do the relative distances change? Divide each vector by its
Euclidean length. How do the relative distances change? Suppose we’re using the bag-of-words
representation for similarity searching with a Euclidean metric. Describe how the previous parts of
the question illustrate a potential problem if we do not normalize for document length.
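The arithmetic in Q5 can be verified with a minimal sketch. The labels a, b, c and the helpers `scale` and `pairwise` are names introduced here; `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

# The three vectors from the question, with labels assumed for readability.
vectors = {"a": (1, 0, 0), "b": (1, 4, 5), "c": (10, 0, 0)}

def scale(v, factor):
    """Divide every component of v by factor."""
    return tuple(x / factor for x in v)

def pairwise(vs):
    """Euclidean distance for every unordered pair of labelled vectors."""
    return {(p, q): math.dist(vs[p], vs[q])
            for p, q in combinations(sorted(vs), 2)}

raw = pairwise(vectors)
by_sum = pairwise({k: scale(v, sum(v)) for k, v in vectors.items()})
by_length = pairwise(
    {k: scale(v, math.sqrt(sum(x * x for x in v))) for k, v in vectors.items()}
)

for name, table in [("raw", raw), ("sum-normalized", by_sum),
                    ("length-normalized", by_length)]:
    print(name, {pair: round(d, 3) for pair, d in table.items()})
```

Unnormalized, a is closer to b (about 6.4) than to c (9) even though a and c point in the same direction; after either normalization a and c coincide, which is the length effect the question asks about.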
Retrieved 20 40