# Introduction to Information Retrieval

Introduction to

Information Retrieval

Introduction to Information Retrieval

Information Retrieval
 Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

2

Introduction to Information Retrieval

Sec. 1.1

Boolean Retrieval
 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?  One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?  Why is that not the answer?
 Slow (for large corpora)  Ranked retrieval (best documents to return)  Later lectures

• The way to avoid linearly scanning the text for each query is to index the document in advance.
3

1 Binary Term-document incidence Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony Brutus Caesar Calpurnia Cleopatra mercy worser 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 Brutus AND Caesar BUT NOT Calpurnia 1 if play contains word.Introduction to Information Retrieval Sec. 0 otherwise . 1.

complement the last. 5 . 1.1 Incidence vectors  So we have a 0/1 vector for each term.  To answer query: take the vectors for Brutus. and then do a bitwise AND.Introduction to Information Retrieval Sec. Caesar and Calpurnia.  110100 AND 110111 AND 101111 = 100100.

When Antony found Julius Caesar dead. Act III. Act III.1 Answers to query  Antony and Cleopatra. He cried almost to roaring.  Hamlet. Enobarbus. Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why.Introduction to Information Retrieval Sec. 1. Brutus killed me. Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol. 6 . and he wept When at Philippi he found Brutus slain.

1. OR. in which terms are combined with the operators AND. 7 .  The model views each document as just a set of words.Introduction to Information Retrieval Sec.1 Boolean retrieval model  The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms. and NOT. that is.

Introduction to Information Retrieval Sec.1 Basic assumptions of Information Retrieval  Collection: Fixed set of documents  Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task 8 . 1.

Introduction to Information Retrieval The classic search model TASK Get rid of mice in a politically correct way Info about removing mice without killing them How do I trap mice alive? Info Need Verbal form Query mouse trap SEARCH ENGINE Query Refinement Results Corpus .

1 How good are the retrieved docs?  Precision : Fraction of retrieved docs that are relevant to user’s information need Precision = relevant documents retrieved/ total documents  Recall : Fraction of relevant docs in collection that are retrieved recall = relevant documents retrieved / total relevant documents 10 .Introduction to Information Retrieval Sec. 1.

11 .Introduction to Information Retrieval Sec. each with about 1000 words.  Say there are M = 500K distinct terms among these. 1.1 Bigger collections  Consider N = 1 million documents.  Avg 6 bytes/word including spaces/punctuation  6GB of data in the documents.

Introduction to Information Retrieval Can’t build the matrix  500K x 1M matrix has half-a-trillion 0’s and 1’s.8% of the cells are zero. the inverted index.  This idea is central to the first major concept in information retrieval.(1M x 1000) so a minimum of 99.  matrix is extremely sparse. 12 . Why?  What’s a better representation?  We only record the 1 positions.  But it has no more than one billion 1’s.

a document serial number  Can we used fixed-size arrays for this? Brutus Caesar Calpurnia 1 1 2 2 4 4 11 5 31 45 173 174 6 16 57 132 2 31 54 101 What happens if the word Caesar is added to document 14? 13 .  Identify each by a docID. we must store a list of all documents that contain t.Introduction to Information Retrieval Inverted index  For each term t.

can use linked lists or variable length arrays  Some tradeoffs in size/ease of insertion Posting Brutus Caesar Calpurnia 1 1 2 2 4 4 11 5 31 45 173 174 6 16 57 132 2 31 54 101 Dictionary Postings Sorted by docID 14 .Introduction to Information Retrieval Inverted index  We need variable-size postings lists  On disk. a continuous run of postings is normal and best  In memory.

Friends Romans Linguistic modules Countrymen Modified tokens. Romans.Introduction to Information Retrieval Sec. countrymen…. friend Indexer friend roman countryman 2 1 4 2 16 Inverted index. Friends..2 Inverted index construction Documents to be indexed. Tokenizer Token stream. 1. roman countryman 13 .

Brutus killed me.2 Indexer steps: Token sequence  Sequence of (Modified token. Doc 2 So let it be with Caesar. Document ID) pairs. Doc 1 I did enact Julius Caesar I was killed i' the Capitol.Introduction to Information Retrieval Sec. The noble Brutus hath told you Caesar was ambitious . 1.

2 Indexer steps: Sort  Sort by terms  And then docID Core indexing step . 1.Introduction to Information Retrieval Sec.

2 Indexer steps: Dictionary & Postings  Multiple term entries in a single document are merged. frequency information is added. .Introduction to Information Retrieval Sec.  Split into Dictionary and Postings  Doc. 1.

1.2 Where do we pay in storage? Lists of docIDs Terms and counts Pointers 19 .Introduction to Information Retrieval Sec.

1.3 Processing the Boolean Queries  How do we process a query using an inverted index and basic Boolean retrieval model? 20 .Introduction to Information Retrieval Sec.

 Retrieve its postings.3 Query processing: AND  Consider processing the query: Brutus AND Caesar  Locate Brutus in the Dictionary.  Retrieve its postings.Introduction to Information Retrieval Sec.  Locate Caesar in the Dictionary.  “Merge” the two postings: 2 1 4 2 8 3 16 5 32 8 13 64 21 128 Brutus 34 Caesar 21 . 1.

22 .3 The merge  Walk through the two postings simultaneously. Crucial: postings sorted by docID. the merge takes O(x+y) operations. in time linear in the total number of postings entries 2 8 2 1 4 2 8 3 16 5 32 8 13 64 21 128 Brutus 34 Caesar If the list lengths are x and y.Introduction to Information Retrieval Sec. 1.

Introduction to Information Retrieval Intersecting two postings lists (a “merge” algorithm) 23 .

 Perhaps the simplest model to build an IR system on  Primary commercial retrieval tool for 3 decades. 1. OR and NOT to join query terms  Views each document as a set of words  Is precise: document matches condition or not.3 Boolean queries: Exact match  The Boolean retrieval model is being able to ask a query that is a Boolean expression:  Boolean Queries are queries using AND.  Many search systems you still use are Boolean:  Email. Mac OS X Spotlight 24 . library catalog.Introduction to Information Retrieval Sec.

1.000 users  Majority of users still use boolean queries  Example query:  What is the statute of limitations in cases involving the federal tort claims act?  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM  /3 = within 3 words.4 Example: WestLaw http://www. 700.com/  Largest commercial (paying subscribers) legal search service (started 1975. ranking added 1992)  Tens of terabytes of data.Introduction to Information Retrieval Sec. /S = in same sentence 25 .westlaw.

not conjunction!  Long.westlaw.4 Example: WestLaw  Another example query: http://www. not like web search  Many professional searchers still like Boolean search  You know exactly what you are getting  But that doesn’t mean it actually works better….com/  Requirements for disabled people to be able to access a workplace  disabl! /p access! /s work-site work-place (employment /3 place  Note that SPACE is disjunction. precise queries. 1. proximity operators. incrementally developed. .Introduction to Information Retrieval Sec.

 For each of the n terms.Introduction to Information Retrieval Sec.3 Query optimization  What is the best order for query processing?  Consider a query that is an AND of n terms. Brutus 2 4 8 16 32 64 128 Caesar Calpurnia 1 2 3 5 8 16 21 34 13 16 27 Query: Brutus AND Calpurnia AND Caesar . then AND them together. 1. get its postings.

then keep cutting further. in dictionary Brutus 2 4 8 16 32 64 128 Caesar Calpurnia 1 2 3 5 8 16 21 34 13 16 Execute the query as (Calpurnia AND Brutus) AND Caesar. This is why we kept document freq.3 Query optimization example  Process in order of increasing freq:  start with smallest set. 1. 28 .Introduction to Information Retrieval Sec.

freq.  Process in increasing order of OR sizes. 29 .3 More general optimization  e.. freq.  Estimate the size of each OR by the sum of its doc.’s for all terms. (madding OR crowd) AND (ignoble OR strife)  Get doc.g. 1.Introduction to Information Retrieval Sec.’s (conservative).

Introduction to Information Retrieval Exercise  Recommend a query processing order for Term Freq 213312 87009 107913 271658 46653 316812 (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) eyes kaleidoscope marmalade skies tangerine trees 30 .

each query term appears only once in the query.Introduction to Information Retrieval Query processing exercises  Exercise: If the query is friends AND romans AND (NOT countrymen). how could we use the freq of countrymen?  Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?  Hint: Begin with the case of a Boolean formula query: in this. 31 .

 Often we want to rank/group results  Need to measure proximity from query to each doc. 32 . or a group of docs covering various aspects of the query.  Need to decide whether docs presented to user are singletons.Introduction to Information Retrieval Ranking search results  Boolean queries give inclusion or exclusion of docs.

exploit ideas from social networks  link analysis. clickstreams . queries. information needs  Beyond terms..  How do search engines work? And how can we make them better? 33 ..Introduction to Information Retrieval The web and its challenges  Unusual and diverse documents  Unusual and diverse users.

Introduction to Information Retrieval More sophisticated information retrieval      Cross-language information retrieval Question answering Summarization Text mining … 34 .

9394548938898 .° n°°¯f°  f 390780. 2070 .4792  .

fnO½  .° n°°¯f°  f \$0.  ¾f   ¾¾°-  f° -@  ©°  ¯¾ I ¾ fn n¯ °f¾f¾   ¾ ¾½ n¾  n¯ °¯fn ¾n° °° 9 f½¾ ¾¯½ ¾¯   f°¾¾ ¯° 9¯fn¯¯ nf  f nf ¾ .f°¾ fn¾¾ ¯¾¾¾ f   f° ¯f  fnff .    f°  ¾ fn¯fn @   f°   f¯  ¾ °f  f¾f  f¾f  f° ½ ¾¾°  f°.

  f¯½ J ¾f995.° n°°¯f°  f \$0.

.

42. 089. .

9:904129.4.36:0708 .790/7..7.94383.0 89.9 %.989089.42207.7089.9479.28.2506:07 . 80.7-078 0..-90841/. 5...47941:807889:80-440.33.//0/  %03841907. :8078 .807.38:-8.8083. ..9.3 9010/07.

\$%%&%% .

\$#.

% #% .

 .

9347/8 .

\$38.20803903.0  .

° n°°¯f°  f \$0.   f¯½ J ¾f995.

.

42. 089. .

°  f¯½     ¯ °¾ ¾f  ½ ½  f  fnn ¾¾f ½fn ¾f "\$½fnn ¾¾"\$¾ ¾  ½fn % ¯½¯ °\$ ½fn - f9.

f°½ ¾¾°f¾ fn ¾¾   f°¾ fn ° fnff  ° f  ¾°#¯ f°fnf¾   .¾ ¾©°n° °n°©°n°" ° ½ n¾   ¾ ½¯½ f¾  °n ¯ °f  ½ °  ¾ fn .

° n°°¯f°  f \$0. ½¯f° Jf¾  ¾  ½n ¾¾°" .   .

°¾ f f¾f°- °  ¯¾  fn °  ¯¾  ¾½¾°¾  ° -  ¯   ¾ .

f ¾f .

f½°f                   .  ¾ - .

f½°f - .

f ¾f .

° n°°¯f°  f \$0.7 ¾ .   .:20391706 3/. ½¯f° f¯½ 9n ¾¾° °n f¾°  ¾f¾¯f ¾¾   ° ½ n°  %880059 /4.943.

f ¾f .

f½°f                   n   f¾%.

f½°f - ¾% -.

f ¾f  .

  ° f½¯f°  %¯f ° n %-%°   ¾ %   n   #¾f ¯¾ ¾¯f  ¾  fn  ¾¯¾ n   #¾%n°¾ f % 9n ¾¾°°n f¾°  ¾ ¾  .° n°°¯f°  f \$0.   .

72.450 # 008 008 .0/48../0 808 9.30730 97008  .450 2.0/48.30730 # 97008  2.72../0 # 808  .° n°°¯f°  f  n¾  n¯¯ ° f  ½n ¾¾°  %072 706         9.

° n°°¯f°  f . ½n ¾¾°  n¾ ¾  n¾   ¾ ° ¾ -¯f°¾ - %-@n°¯ °% n  ¾    n°¯ °"  n¾  °  ¯  f°f f  f°   .

f° ff¾ff°   n°°¯  ° f° f½¾°¾¾ " °  ° nf¾ f  f°¯f   °¾  fn  ¯f½½ f¾°°n °     .

° n°°¯f°  f f°°¾ fn ¾¾  f°  ¾ °n¾° n¾° n¾  ° f°f°\$½ ¾¾ - ¯ f¾ ½¯¯  fn n -  n    n¾½ ¾ ° ¾ f  ¾° °¾ f½ n¾n °f¾f¾½ n¾     .

° n°°¯f°  f @  f° ¾nf ° ¾ D°¾ff°   ¾  n¯ °¾ D°¾ff°   ¾ ¾ ¾   ¾ °¯f° ° ¾ °  ¯¾  ½ f¾¯¾nf° ¾ °f°f¾¾ nn¾ f¯¾  ¾ fn °° ¾"° nf°  ¯f  ¯  "  .

° n°°¯f°  f . ¾½¾nf °¯f°   f .

¾¾ f°f °¯f°  f . ¾°f°¾ ° ¯¯ff° @ ¯°°  .