UNIVERSITY OF OTTAWA
FACULTY OF ENGINEERING
SCHOOL OF IT AND ENGINEERING
CSI4107
Midterm
March 2, 2006, 4-5:30 pm
Examiner: Diana Inkpen
Name
Student Number
Total marks: 47
Duration: 80 minutes
Total Number of pages: 9
Important Regulations:
1. Students are allowed to bring in a page of notes (written on one side).
2. Calculators are allowed.
3. A student identification card (or another photo ID and signature) is required.
4. An attendance sheet shall be circulated and should be signed by each student.
5. Please answer all questions on this paper, in the provided spaces.
Marks
A /13
B /4
C /10
D /10
E /10
Total /47
Part A [13 marks]
Short answers and explanations.
1. (2 marks) Explain the difference between an information retrieval system and a search
engine.
- a search engine contains a crawler to collect webpages
- the scale is much larger (collection size, efficiency issues)
- the collection is dynamic: new pages appear, some pages disappear
- HTML format can be used in weighting (headings, large font, etc.).
2. (2 marks) Why is tf-idf a good weighting scheme? Why are inverse document
frequencies (idf weights) expected to improve IR performance when added to term
frequencies (tf)? (Remember that the idf value for a term is given by the number of
documents where it appears).
- idf gives higher weight to terms that appear in few documents and therefore are
likely to be important in those documents.
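The effect described in this answer can be checked with a small sketch. The exam does not give a formula, so the sketch assumes one common formulation, idf = log(N / df), with raw term counts as tf; the function name `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: weight = raw tf * log(N / df), one common tf-idf variant.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the crawler indexes pages",
]
w = tf_idf(docs)
print(w[0]["the"], w[0]["cat"], w[2]["crawler"])
```

Under this formulation "the" appears in every document and gets weight 0 despite its high tf, while "crawler", which appears in a single document, gets the largest weight of the three terms printed.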
3. (2 marks) Explain the difference between relevance feedback and pseudo-relevance
feedback. Which one do you think would achieve better retrieval performance?
Why?
- relevance feedback asks a user to judge the first N answers to a query in
order to revise the query for a better search. Pseudo-relevance feedback blindly
assumes that the first N documents are relevant.
- relevance feedback is likely to achieve higher performance because the
judgements for the N documents won't be incorrect.
4. (2 marks) In IR systems, a possible pre-processing step is stemming the words.
Do you think the performance of the system (the average precision) would be higher with
or without stemming? Why?
- Usually the performance is higher with stemming.
- stemming allows for higher recall by retrieving inflected forms (plurals, verb forms, etc.)
without much loss of precision.
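The recall gain from stemming can be illustrated with a toy example. The sketch below is not the Porter algorithm: `naive_stem` is a hypothetical suffix stripper written only for this illustration, and `matches` is an assumed Boolean AND match over whitespace tokens.

```python
def naive_stem(word):
    """Toy stemmer: strip one of a few common suffixes (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def matches(query, doc, stem=False):
    """True if every query term occurs in the document (Boolean AND)."""
    norm = naive_stem if stem else (lambda w: w)
    doc_terms = {norm(w) for w in doc.lower().split()}
    return all(norm(w) in doc_terms for w in query.lower().split())

doc = "search engines index crawled pages"
print(matches("engine page", doc))             # exact terms only
print(matches("engine page", doc, stem=True))  # inflected forms conflated
```

Without stemming the query misses the document because "engines" and "pages" are plurals; with stemming both sides reduce to the same stems and the document is retrieved. A real stemmer handles many more cases, and crude rules like these can also conflate unrelated words, which is where the small loss of precision comes from.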
5. (3 marks) Compute the edit distance between the following strings. Remember that the
edit distance is the minimum number of deletions, insertions and substitutions needed to
transform the first string into the second.
How would you normalize the score? Why is the normalization needed?
String 1: abracadabra
String 2: nabucodor
Edit distance = 7
Insert n, Substitute r -> u, Delete a, Substitute a -> o, Substitute a -> o, Delete b, Delete a
Normalize by dividing by the length of the longer string.
Why: to make the score fair when two pairs of strings need the same number of deletions,
insertions and substitutions but differ in length. If the strings are short, the normalized
distance should be higher.
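The answer can be checked with the standard dynamic-programming computation of the edit distance. This is a minimal sketch assuming unit cost for each insertion, deletion and substitution; the function names are chosen here for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of unit-cost edits turning s into t."""
    # prev[j] = distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]

def normalized_distance(s, t):
    """Edit distance divided by the length of the longer string (0.0 for two empty strings)."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0

d = edit_distance("abracadabra", "nabucodor")
nd = normalized_distance("abracadabra", "nabucodor")
print(d, round(nd, 3))
```

For the two strings in the question the distance is 7, and dividing by 11 (the length of "abracadabra") gives about 0.636.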
6. (2 marks) Below is a sample robot META tag in the HEAD section of an HTML
document. Explain what this tag means.
- spiders are allowed to index the webpage but not to follow the links in it
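A tag with the meaning given in this answer is conventionally written with the content value "index, nofollow". The sketch below shows how a crawler might read such a tag using Python's standard `html.parser` module. The sample page and the names `RobotsMetaParser` and `robots_policy` are assumptions made for illustration, not the tag printed on the exam.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())

def robots_policy(html):
    """Return whether a spider may index the page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    d = parser.directives
    return {"index": "noindex" not in d, "follow": "nofollow" not in d}

# Hypothetical page: the META tag below is an assumed example.
page = '<html><head><META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW"></head><body></body></html>'
policy = robots_policy(page)
print(policy)
```

For this sample page the policy allows indexing and forbids following links, which matches the answer above.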