UNIVERSITY OF OTTAWA
FACULTY OF ENGINEERING
SCHOOL OF IT AND ENGINEERING

CSI4107 Midterm
March 2, 2006, 4-5:30 pm
Examiner: Diana Inkpen

Name:
Student Number:

Total marks: 47
Duration: 80 minutes
Total number of pages: 9

Important Regulations:
1. Students are allowed to bring in a page of notes (written on one side).
2. Calculators are allowed.
3. A student identification card (or another photo ID with signature) is required.
4. An attendance sheet shall be circulated and should be signed by each student.
5. Please answer all questions on this paper, in the allotted spaces.

Marks
A      /13
B      /4
C      /10
D      /10
E      /10
Total  /47

Part A [13 marks] Short answers and explanations.

1. (2 marks) Explain the difference between an information retrieval system and a search engine.
- A search engine contains a crawler to collect webpages.
- The scale is much larger (collection, efficiency issues).
- The collection is dynamic: new pages appear, some pages disappear.
- HTML format can be used in weighting (headings, large font, etc.).

2. (2 marks) Why is tf-idf a good weighting scheme? Why are inverse document frequencies (idf weights) expected to improve IR performance when added to term frequencies (tf)? (Remember that the idf value for a term is derived from the number of documents in which it appears.)
- idf gives higher weight to terms that appear in few documents and are therefore likely to be important in those documents.

3. (2 marks) Explain the difference between relevance feedback and pseudo-relevance feedback. Which one do you think would achieve better retrieval performance? Why?
- Relevance feedback asks a user to judge the first N answers to a query in order to revise the query for a better search. Pseudo-relevance feedback blindly assumes that the first N documents are relevant.
- Relevance feedback is likely to achieve higher performance because the user's judgements for the N documents are reliable, whereas the blind assumption can be wrong.

4. (2 marks) In IR systems, a possible pre-processing step is stemming the words. Do you think the performance of the system (the average precision) would be higher with or without stemming? Why?
- Usually the performance is higher with stemming.
- Stemming allows for higher recall by retrieving inflected forms (plurals, verb forms, etc.) without much loss of precision.

5. (marks) Compute the edit distance between the following strings. Remember that the edit distance is the minimum number of deletions, insertions and substitutions needed to transform the first string into the second. How would you normalize the score? Why is the normalization needed?
String 1: abracadabra
String 2: nabucodor
Edit distance = 7
  Insert n
  Substitute r -> u
  Delete a
  Substitute a -> o
  Substitute a -> o
  Delete b
  Delete a
Normalize by dividing by the length of the longer string.
Why: to make scores comparable across string lengths. The same number of deletions, insertions and substitutions matters more for short strings than for long ones, so the normalized distance for short strings should be higher.

6. (2 marks) Below is a sample robot META tag in the HEAD section of an HTML document. Explain what this tag means.
- Spiders are allowed to index the webpage but not to follow the links in it.
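
The answer to question 5 can be checked mechanically. The following is a minimal sketch, not part of the original exam, of the standard dynamic-programming (Levenshtein) computation with unit costs for insertion, deletion and substitution, followed by the normalization by the longer string's length described in the answer. The function names `edit_distance` and `normalized_edit_distance` are chosen here for illustration only.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of unit-cost insertions, deletions and substitutions
    needed to turn `source` into `target` (hypothetical helper)."""
    # previous[j] = distance between the source prefix processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_edit_distance(source: str, target: str) -> float:
    """Edit distance divided by the length of the longer string,
    giving a score between 0 and 1 (hypothetical helper)."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return edit_distance(source, target) / longest


if __name__ == "__main__":
    print(edit_distance("abracadabra", "nabucodor"))  # 7
    print(round(normalized_edit_distance("abracadabra", "nabucodor"), 3))  # 0.636
```

With this normalization, the 7 edits between the two exam strings give 7/11, while the same 7 edits between two strings of length 7 would give the maximum score of 1, which reflects the point that a fixed number of edits counts for more on short strings.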
