Advanced Topics in Information Systems: Information Retrieval

Advanced Topics in Information Systems:

Information Retrieval
Jun.-Prof. Alexander Markowetz (slides modified from Christopher Manning and Prabhakar Raghavan)

Advanced Topics in Information Systems: Information Retrieval

DICTIONARY DATA STRUCTURES
2

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.1

Hashes
 Each vocabulary term is hashed to an integer
 (We assume you’ve seen hashtables before)

 Pros:
 Lookup is faster than for a tree: O(1)

 Cons:
 No easy way to find minor variants:
 judgment/judgement

 No prefix search [tolerant retrieval]
 If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

3

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.1

Tree: binary tree
[Figure: binary tree over the dictionary. The root splits a-m / n-z; the next level splits a-hu / hy-m and n-sh / si-z; leaves hold terms such as aardvark, huygens, sickle, zygot.]

4

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.1

Tree: B-tree
[Figure: B-tree whose root partitions the dictionary into the ranges a-hu, hy-m, n-z.]
 Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g. [2,4].

5

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.1

Trees
 Simplest: binary tree
 More usual: B-trees
 Trees require a standard ordering of characters and hence strings … but we standardly have one
 Pros:
 Solves the prefix problem (terms starting with hyp)
 Cons:
 Slower: O(log M) [and this requires balanced tree]
 Rebalancing binary trees is expensive
 But B-trees mitigate the rebalancing problem

6

Advanced Topics in Information Systems: Information Retrieval

WILD-CARD QUERIES

7

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2

Wild-card queries: *
 mon*: find all docs containing any word beginning with “mon”.
 Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
 *mon: find words ending in “mon”: harder
 Maintain an additional B-tree for terms written backwards. Can retrieve all words in range: nom ≤ w < non.
 Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

8

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2

Query processing
 At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
 We still have to look up the postings for each enumerated term.
 E.g., consider the query: se*ate AND fil*er. This may result in the execution of many Boolean AND queries.

9

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2

B-trees handle *’s at the end of a query term
 How can we handle *’s in the middle of a query term?
 co*tion
 We could look up co* AND *tion in a B-tree and intersect the two term sets
 Expensive
 The solution: transform wild-card queries so that the *’s occur at the end
 This gives rise to the Permuterm Index.

10

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2.1

Permuterm index
 For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, where $ is a special symbol.
 Queries:
 X → lookup on X$
 X* → lookup on $X*
 *X → lookup on X$*
 *X* → lookup on X*
 X*Y → lookup on Y$X*
 X*Y*Z → ??? Exercise!
 Query = hel*o: X = hel, Y = o. Lookup o$hel*

11
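
The rules above can be made concrete with a small Python sketch (ours, not from the deck; a real implementation would keep the rotations in a B-tree rather than a sorted list):

```python
from bisect import bisect_left, bisect_right

class PermutermIndex:
    """Index every rotation of term+'$'; wildcard queries become prefix lookups."""
    def __init__(self, vocabulary):
        self.entries = sorted(
            (aug[i:] + aug[:i], term)
            for term in vocabulary
            for aug in [term + "$"]
            for i in range(len(aug)))
        self.keys = [rotation for rotation, _ in self.entries]

    def _prefix_lookup(self, prefix):
        lo = bisect_left(self.keys, prefix)
        hi = bisect_right(self.keys, prefix + "\uffff")
        return {term for _, term in self.entries[lo:hi]}

    def lookup(self, query):
        """Single-'*' queries only: rotate the wildcard to the end."""
        if "*" not in query:
            return self._prefix_lookup(query + "$")   # X  -> lookup X$
        x, y = query.split("*")                       # X*Y -> lookup Y$X*
        return self._prefix_lookup(y + "$" + x)

idx = PermutermIndex(["hello", "help", "hollow", "halo"])
print(idx.lookup("hel*o"))  # rotated to o$hel* -> {'hello'}
print(idx.lookup("h*o"))    # o$h*   -> {'hello', 'halo'}
print(idx.lookup("*ello"))  # ello$* -> {'hello'}
```

Note that the X* and *X rules fall out of the general Y$X rotation with Y or X empty.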

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2.1

Permuterm query processing
 Rotate query wild-card to the right
 Now use B-tree lookup as before.
 Permuterm problem: ≈ quadruples lexicon size (empirical observation for English).

12

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2.2

Bigram (k-gram) indexes
 Enumerate all k-grams (sequences of k chars) occurring in any term
 e.g., from text “April is the cruelest month” we get the 2-grams (bigrams): $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
 $ is a special word boundary symbol
 Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

13

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2.2

Bigram index example
 The k-gram index finds terms based on a query consisting of k-grams (here k=2).
[Figure: postings in the bigram index, e.g. $m → mace, madden; mo → among, amortize; on → among, around]

14

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2.2

Processing wild-cards
 Query mon* can now be run as: $m AND mo AND on
 Gets terms that match the AND version of our wildcard query.
 But we’d enumerate moon.
 Must post-filter these terms against the query.
 Surviving enumerated terms are then looked up in the term-document inverted index.
 Fast, space efficient (compared to permuterm).

15
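
A compact sketch of this pipeline (an illustration with a made-up five-term vocabulary, not from the deck):

```python
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of a term, with '$' marking the word boundaries."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

vocab = ["moon", "month", "mood", "salmon", "demon"]
index = defaultdict(set)                    # bigram -> terms containing it
for term in vocab:
    for gram in kgrams(term):
        index[gram].add(term)

def wildcard_prefix(prefix):
    """Run a query like mon* as $m AND mo AND on, then post-filter."""
    grams = kgrams(prefix)
    grams.discard(prefix[-1] + "$")         # a prefix query has no right boundary
    candidates = set.intersection(*(index[g] for g in grams))
    survivors = {t for t in candidates if t.startswith(prefix)}
    return candidates, survivors

candidates, survivors = wildcard_prefix("mon")
print(candidates)   # {'moon', 'month'} -- moon matches $m AND mo AND on
print(survivors)    # {'month'}         -- post-filtering removes the false hit
```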

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.2.2

Processing wild-card queries
 As before, we must execute a Boolean query for each enumerated, filtered term.
 Wild-cards can result in expensive query execution (very large disjunctions…)
 pyth* AND prog*
 If you encourage “laziness” people will respond!
[Search box mock-up: “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”]
 Which web search engines allow wildcard queries?

16

Advanced Topics in Information Systems: Information Retrieval

SPELLING CORRECTION

17

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3

Spell correction
 Two principal uses
 Correcting document(s) being indexed
 Correcting user queries to retrieve “right” answers
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words, e.g. from → form
 Context-sensitive
 Look at surrounding words, e.g. I flew form Heathrow to Narita.

18

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3

Document correction
 Especially needed for OCR’ed documents
 Correction algorithms are tuned for this: rn/m
 Can use domain-specific knowledge
 E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).
 But also: web pages and even printed material have typos
 Goal: the dictionary contains fewer misspellings
 But often we don’t change the documents but aim to fix the query-document mapping

19

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3

Query mis-spellings
 Our principal focus here
 E.g., the query Alanis Morisett
 We can either
 Retrieve documents indexed by the correct spelling, OR
 Return several suggested alternative queries with the correct spelling
 Did you mean … ?

20

3.  (Including the mis-spellings) 21 .g. 3..Advanced Topics in Information Systems: Information Retrieval Sec. acronyms etc.2 Isolated word correction  Fundamental premise – there is a lexicon from which the correct spellings come  Two basic choices for this  A standard lexicon such as  Webster’s English Dictionary  An “industry-specific” lexicon – hand-maintained  The lexicon of the indexed corpus  E. all words on the web  All names.

3. 3. return the words in the lexicon closest to Q  What’s “closest”?  We’ll study several alternatives  Edit distance (Levenshtein distance)  Weighted edit distance  n-gram overlap 22 .2 Isolated word correction  Given a lexicon and a character sequence Q.Advanced Topics in Information Systems: Information Retrieval Sec.

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.3

Edit distance
 Given two strings S1 and S2, the minimum number of operations to convert one to the other
 Operations are typically character-level
 Insert, Delete, Replace, (Transposition)
 E.g., the edit distance from dof to dog is 1
 From cat to act is 2 (just 1 with transpose)
 From cat to dog is 3.
 Generally found by dynamic programming.
 See http://www.merriampark.com/ld.htm for a nice example plus an applet.

23
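
The dynamic program can be sketched in a few lines of Python (a plain, unweighted version without the transposition operation):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming (insert/delete/replace)."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edits needed to turn s1[:i] into s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                              # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # replace / match
    return dist[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2 (would be 1 with transposition)
print(edit_distance("cat", "dog"))  # 3
```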

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.3

Weighted edit distance
 As above, but the weight of an operation depends on the character(s) involved
 Meant to capture OCR or keyboard errors, e.g. m more likely to be mis-typed as n than as q
 Therefore, replacing m by n is a smaller edit distance than by q
 This may be formulated as a probability model

 Requires weight matrix as input
 Modify dynamic programming to handle weights

24

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

Using edit distances
 Given query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
 Intersect this set with list of “correct” words
 Show terms you found to user as suggestions
 Alternatively,
 We can look up all possible corrections in our inverted index and return all docs … slow
 We can run with a single most likely correction

 The alternatives disempower the user, but save a round of interaction with the user
25

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

Edit distance to all dictionary terms?
 Given a (mis-spelled) query – do we compute its edit distance to every dictionary term?
 Expensive and slow
 Alternative?

 How do we cut the set of candidate dictionary terms?
 One possibility is to use n-gram overlap for this
 This can also be used by itself for spelling correction.

26

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

n-gram overlap
 Enumerate all the n-grams in the query string as well as in the lexicon
 Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
 Threshold by number of matching n-grams
 Variants – weight by keyboard layout, etc.

27

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

Example with trigrams
 Suppose the text is november
 Trigrams are nov, ove, vem, emb, mbe, ber.
 The query is december
 Trigrams are dec, ece, cem, emb, mbe, ber.
 So 3 trigrams overlap (of 6 in each term)
 How can we turn this into a normalized measure of overlap?

28

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

One option – Jaccard coefficient
 A commonly-used measure of overlap
 Let X and Y be two sets; the J.C. is |X ∩ Y| / |X ∪ Y|
 Equals 1 when X and Y have the same elements and zero when they are disjoint
 X and Y don’t have to be of the same size
 Always assigns a number between 0 and 1
 Now threshold to decide if you have a match
 E.g., if J.C. > 0.8, declare a match

29
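
A minimal sketch tying the trigram idea to the Jaccard threshold (toy lexicon and threshold chosen for illustration):

```python
def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

q = trigrams("december")        # {dec, ece, cem, emb, mbe, ber}
t = trigrams("november")        # {nov, ove, vem, emb, mbe, ber}
print(round(jaccard(q, t), 2))  # 3 shared grams, 9 distinct in total -> 0.33

lexicon = ["november", "december", "remember", "nevermore"]
threshold = 0.3                 # illustrative; the slide suggests e.g. 0.8
print([w for w in lexicon if jaccard(q, trigrams(w)) > threshold])
# ['november', 'december', 'remember'] -- nevermore shares no trigram
```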

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

Matching trigrams
 Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: postings — lo → alone, lord, sloth; or → border, lord, morbid; rd → ardent, border, card]
 Standard postings “merge” will enumerate …
 Adapt this to using Jaccard (or another) measure.

30

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.5

Context-sensitive spell correction
 Text: I flew from Heathrow to Narita.
 Consider the phrase query “flew form Heathrow”
 We’d like to respond Did you mean “flew from Heathrow”? because no docs matched the query phrase.

31

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.5

Context-sensitive correction
 Need surrounding context to catch this.
 First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
 Now try all possible resulting phrases with one word “fixed” at a time
 flew from heathrow
 fled form heathrow
 flea form heathrow
 Hit-based spelling correction: Suggest the alternative that has lots of hits.

32

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.5

Exercise
 Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow. How many “corrected” phrases will we enumerate in this scheme?

33

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.5

Another approach
 Break phrase query into a conjunction of biwords (Lecture 2).
 Look for biwords that need only one term corrected.
 Enumerate phrase matches and … rank them!

34

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.5

General issues in spell correction
 We enumerate multiple alternatives for “Did you mean?”
 Need to figure out which to present to the user
 Use heuristics
 The alternative hitting most docs
 Query log analysis + tweaking
 For especially popular, topical queries
 Spell-correction is computationally expensive
 Avoid running routinely on every query?
 Run only on queries that matched few docs

35

Advanced Topics in Information Systems: Information Retrieval

SOUNDEX

36

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.4

Soundex
 Class of heuristics to expand a query into phonetic equivalents
 Language specific – mainly for names
 E.g., chebyshev → tchebycheff
 Invented for the U.S. census … in 1918

37

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.4

Soundex – typical algorithm
 Turn every token to be indexed into a 4-character reduced form
 Do the same with query terms
 Build and search an index on the reduced forms (when the query calls for a soundex match)
 http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

38

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.4

Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
 B, F, P, V → 1
 C, G, J, K, Q, S, X, Z → 2
 D, T → 3
 L → 4
 M, N → 5
 R → 6

39

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.4

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
 E.g., Herman becomes H655.
 Will hermann generate the same code?

40
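
A direct transcription of steps 1-5 into Python (our sketch; it follows the slide's algorithm, not any particular database's SOUNDEX variant):

```python
def soundex(word):
    """4-character Soundex code, following the steps on the slides."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    first = word[0]
    # Steps 2-3: vowels and H, W, Y become '0'; consonants become digits.
    digits = [codes.get(ch, "0") for ch in word]
    # Step 4: collapse runs of the same digit into one.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Step 5: drop zeros, keep the first letter, pad with zeros to length 4.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (first + tail + "000")[:4]

print(soundex("Herman"))    # H655
print(soundex("Hermann"))   # H655 -- the doubled n collapses away
```

This answers the slide's question: hermann generates the same code as Herman.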

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.4

Soundex
 Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)
 How useful is soundex?
 Not very – for information retrieval
 Okay for “high recall” tasks (e.g., Interpol), though biased to names of certain nationalities
 Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR

41

Advanced Topics in Information Systems: Information Retrieval

What queries can we process?
 We have
 Positional inverted index with skip pointers
 Wild-card index
 Spell-correction
 Soundex
 Queries such as (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

42

Advanced Topics in Information Systems: Information Retrieval

INDEX GENERATION

43

Advanced Topics in Information Systems: Information Retrieval

Ch. 4

Index construction
 How do we construct an index?
 What strategies can we use with limited main memory?

44

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.1

Hardware basics
 Many design decisions in information retrieval are based on the characteristics of hardware
 We begin by reviewing hardware basics

45

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.1

Hardware basics
 Access to data in memory is much faster than access to data on disk.
 Disk seeks: No data is transferred from disk while the disk head is being positioned.
 Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
 Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).
 Block sizes: 8KB to 256 KB.

46

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.1

Hardware basics
 Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
 Available disk space is several (2–3) orders of magnitude larger.
 Fault tolerance is very expensive: It’s much cheaper to use many regular machines rather than one fault-tolerant machine.
 Google is particularly famous for combining standard hardware … in shipping containers.

47

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.1

Hardware assumptions
symbol  statistic                                           value
s       average seek time                                   5 ms = 5 × 10^−3 s
b       transfer time per byte                              0.02 μs = 2 × 10^−8 s
        processor’s clock rate                              10^9 s^−1
p       low-level operation (e.g., compare & swap a word)   0.01 μs = 10^−8 s
        size of main memory                                 several GB
        size of disk space                                  1 TB or more

48

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

RCV1: Our collection for this lecture
 Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
 The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
 As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
 This is one year of Reuters newswire (part of 1995 and 1996)

49

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

A Reuters RCV1 document
[Figure: a sample RCV1 newswire document]

50

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Reuters RCV1 statistics
symbol  statistic                                       value
N       documents                                       800,000
L       avg. # tokens per doc                           200
M       terms (= word types)                            400,000
        avg. # bytes per token (incl. spaces/punct.)    6
        avg. # bytes per token (without spaces/punct.)  4.5
        avg. # bytes per term                           7.5
        non-positional postings                         100,000,000

51

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Recall IIR 1 index construction
 Documents are parsed to extract words and these are saved with the Document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
[Table: the resulting term–docID pairs in order of occurrence, e.g. (I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), …, (ambitious, 2)]

52

4. We have 100M items to sort. We focus on this sort step.2 Key step  After all documents have been parsed. the inverted file is sorted by terms.Advanced Topics in Information Systems: Information Retrieval Sec. Term I did enact julius caesar I was killed i' the capitol brutus killed m e so let it be with caesar the noble brutus hath told you caesar was am bitious Doc # 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Term ambitious be brutus brutus capitol caesar caesar caesar did enact hath I I i' it julius killed killed let me noble so the the told you was was with Doc # 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 1 2 2 53 .

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Scaling index construction
 In-memory index construction does not scale.
 How can we construct an index for very large collections?
 Taking into account the hardware constraints we just learned about …
 Memory, disk, speed, etc.

54

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Sort-based index construction
 As we build the index, we parse docs one at a time.
 The final postings for any term are incomplete until the end.
 At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
 T = 100,000,000 in the case of RCV1
 So … we can do this in memory in 2009, but typical collections are much larger.
 E.g. the New York Times provides an index of >150 years of newswire
 Thus: We need to store intermediate results on disk.
 While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex)

55

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Use the same algorithm for disk?
 Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
 No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
 We need an external sorting algorithm.

56

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Bottleneck
 Parse and build postings entries one doc at a time
 Now sort postings entries by term (then by doc within each term)
 Doing this with random disk seeks would be too slow – must sort T = 100M records
 If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?

57

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)
 12-byte (4+4+4) records (term, doc, freq).
 These are generated as we parse docs.
 Must now sort 100M such 12-byte records by term.
 Define a Block ≈ 10M such records
 Can easily fit a couple into memory.
 Will have 10 such blocks to start with.
 Basic idea of algorithm:
 Accumulate postings for each block, sort, write to disk.
 Then merge the blocks into one long sorted order.

58

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

Sorting 10 blocks of 10M records
 First, read each block and sort within:
 Quicksort takes 2N ln N expected steps
 In our case 2 × (10M ln 10M) steps
 Exercise: estimate total time to read each block from disk and quicksort it.
 10 times this estimate gives us 10 sorted runs of 10M records each.
 Done straightforwardly, need 2 copies of data on disk
 But can optimize this

60

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

How to merge the sorted runs?
 Can do binary merges, with a merge tree of log2 10 ≈ 4 layers.
 During each layer, read into memory runs in blocks of 10M, merge, write back.
[Figure: two sorted runs on disk being merged into one merged run]

62

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.2

How to merge the sorted runs?
 But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously
 Providing you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, then you’re not killed by disk seeks

63
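
In Python, the n-way merge can be sketched with the standard heap-based merge; real BSBI would stream fixed-size chunks from disk files, for which these in-memory lists stand in:

```python
import heapq

# Each sorted run is a list of (termID, docID) pairs; heapq.merge streams
# lazily from all runs at once -- the n-way merge from the slide.
run1 = [(1, 7), (2, 3), (5, 1)]
run2 = [(1, 2), (2, 9), (4, 4)]
run3 = [(3, 5), (5, 8)]

merged = heapq.merge(run1, run2, run3)   # single pass over all runs

# Collapse equal termIDs into postings lists while streaming.
postings = {}
for term_id, doc_id in merged:
    postings.setdefault(term_id, []).append(doc_id)
print(postings)   # {1: [2, 7], 2: [3, 9], 3: [5], 4: [4], 5: [1, 8]}
```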

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.3

Remaining problem with sort-based algorithm
 Our assumption was: we can keep the dictionary in memory.
 We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
 Actually, we could work with term,docID postings instead of termID,docID postings …
 … but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

64

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.3

SPIMI: Single-pass in-memory indexing
 Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks.
 Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
 With these two ideas we can generate a complete inverted index for each block.
 These separate indexes can then be merged into one big index.

65

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.3

SPIMI-Invert
 Merging of blocks is analogous to BSBI.

66
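
A toy SPIMI block inverter (our sketch; max_postings stands in for the real "memory exhausted" test):

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1000):
    """One SPIMI block: grow a dictionary of postings lists until 'memory'
    is exhausted, then sort the terms once and write the block."""
    index = defaultdict(list)   # term -> postings list, grown as tokens arrive
    count = 0
    for term, doc_id in token_stream:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)   # postings need no sorting:
            count += 1                   # docIDs arrive in increasing order
        if count >= max_postings:
            break                        # block full -> write it out
    return dict(sorted(index.items()))   # sort terms only at block-write time

tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("noble", 2)]
print(spimi_invert(tokens))
# {'brutus': [1], 'caesar': [1, 2], 'noble': [2]}
```

Note how both key ideas appear: terms (not termIDs) key the block's own dictionary, and postings are accumulated in document order rather than sorted.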

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.3

SPIMI: Compression
 Compression makes SPIMI even more efficient.
 Compression of terms
 Compression of postings

67

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Distributed indexing
 For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
 Individual machines are fault-prone
 Can unpredictably slow down or fail
 How do we exploit such a pool of machines?

68

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Google data centers
 Google data centers mainly contain commodity machines.
 Data centers are distributed around the world.
 Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
 Estimate: Google installs 100,000 servers each quarter.
 Based on expenditures of 200–250 million dollars per year
 This would be 10% of the computing capacity of the world!?!

69

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Google data centers
 If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
 Answer: 63%
 Calculate the number of servers failing per minute for an installation of 1 million servers.

70

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Distributed indexing
 Maintain a master machine directing the indexing job – considered “safe”.
 Break up indexing into sets of (parallel) tasks.
 Master machine assigns each task to an idle machine from a pool.

71

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Parallel tasks
 We will use two sets of parallel tasks
 Parsers
 Inverters
 Break the input document collection into splits
 Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)

72

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Parsers
 Master assigns a split to an idle parser machine
 Parser reads a document at a time and emits (term, doc) pairs
 Parser writes pairs into j partitions
 Each partition is for a range of terms’ first letters
 (e.g., a-f, g-p, q-z) – here j = 3.
 Now to complete the index inversion

73

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Inverters
 An inverter collects all (term, doc) pairs (= postings) for one term-partition.
 Sorts and writes to postings lists

74

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Data flow
[Figure: the master assigns splits to parsers; each parser writes segment files for the term partitions a-f, g-p, q-z (map phase); one inverter per partition reads those segment files and writes the postings (reduce phase).]

75

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

MapReduce
 The index construction algorithm we just described is an instance of MapReduce.
 MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …
 … without having to write code for the distribution part.
 They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.

76

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

MapReduce
 Index construction was just one phase.
 Another phase: transforming a term-partitioned index into a document-partitioned index.
 Term-partitioned: one machine handles a subrange of terms
 Document-partitioned: one machine handles a subrange of documents
 As we discuss later in the course, most search engines use a document-partitioned index … better load balancing, etc.

77

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.4

Schema for index construction in MapReduce
 Schema of map and reduce functions
 map: input → list(k, v)
 reduce: (k, list(v)) → output
 Instantiation of the schema for index construction
 map: web collection → list(termID, docID)
 reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)
 Example for index construction
 map: d1 : C came, C c’ed / d2 : C died. → (<C, d1>, <came, d1>, <C, d1>, <c’ed, d1>, <C, d2>, <died, d2>)
 reduce: (<C, (d1, d1, d2)>, <came, (d1)>, <c’ed, (d1)>, <died, (d2)>) → (<C, (d1:2, d2:1)>, <came, (d1:1)>, <c’ed, (d1:1)>, <died, (d2:1)>)

78
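
A single-process toy rendition of this schema (the point of MapReduce is that the map and reduce calls actually run on many machines; all names here are illustrative):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    return [(term, doc_id) for term in text.replace(",", "").split()]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> postings list with term frequencies."""
    postings = {}
    for d in sorted(doc_ids):
        postings[d] = postings.get(d, 0) + 1           # docID -> tf
    return term, sorted(postings.items())

docs = {"d1": "C came, C c'ed", "d2": "C died"}
pairs = []                                   # map phase
for doc_id, text in docs.items():
    pairs.extend(map_fn(doc_id, text))
grouped = defaultdict(list)                  # shuffle: group values by key
for term, doc_id in pairs:
    grouped[term].append(doc_id)
index = dict(reduce_fn(t, ids) for t, ids in grouped.items())
print(index)
# {'C': [('d1', 2), ('d2', 1)], 'came': [('d1', 1)],
#  "c'ed": [('d1', 1)], 'died': [('d2', 1)]}
```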

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Dynamic indexing
 Up to now, we have assumed that collections are static.
 They rarely are:
 Documents come in over time and need to be inserted.
 Documents are deleted and modified.
 This means that the dictionary and postings lists have to be modified:
 Postings updates for terms already in dictionary
 New terms added to dictionary

79

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Simplest approach
 Maintain “big” main index
 New docs go into “small” auxiliary index
 Search across both, merge results
 Deletions
 Invalidation bit-vector for deleted docs
 Filter docs output on a search result by this invalidation bit-vector
 Periodically, re-index into one main index

80

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Issues with main and auxiliary indexes
 Problem of frequent merges – you touch stuff a lot
 Poor performance during merge
 Actually:
 Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
 Merge is the same as a simple append.
 But then we would need a lot of files – inefficient for O/S.
 Assumption for the rest of the lecture: The index is one big file.
 In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file etc.)

81

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Logarithmic merge
 Maintain a series of indexes, each twice as large as the previous one.
 Keep smallest (Z0) in memory
 Larger ones (I0, I1, …) on disk
 If Z0 gets too big (> n), write to disk as I0
 or merge with I0 (if I0 already exists) as Z1
 Either write merge Z1 to disk as I1 (if no I1)
 Or merge with I1 to form Z2
 etc.

82
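
A sketch of the "binary counter" behavior of logarithmic merging (sorted Python lists stand in for on-disk indexes; n is the Z0 capacity):

```python
def log_merge_add(indexes, new_block):
    """indexes[i] holds a run of size n * 2**i, or None if the slot is free.
    Like binary addition: carry (merge) upward until an empty slot is found."""
    i = 0
    carry = sorted(new_block)                 # Z_0, flushed from memory
    while True:
        if i == len(indexes):
            indexes.append(None)
        if indexes[i] is None:
            indexes[i] = carry                # write Z_i to disk as I_i
            return
        carry = sorted(indexes[i] + carry)    # merge with I_i to form Z_{i+1}
        indexes[i] = None
        i += 1

disk = []                                     # slots I_0, I_1, ...
for block in ([3, 1], [7, 2], [9, 4], [8, 5]):
    log_merge_add(disk, block)
print(disk)   # [None, None, [1, 2, 3, 4, 5, 7, 8, 9]]
```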


Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Logarithmic merge
 Auxiliary and main index: index construction time is O(T^2) as each posting is touched in each merge.
 Logarithmic merge: Each posting is merged O(log T) times, so complexity is O(T log T)
 So logarithmic merge is much more efficient for index construction
 But query processing now requires the merging of O(log T) indexes
 Whereas it is O(1) if you just have a main and auxiliary index

84

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Further issues with multiple indexes
 Collection-wide statistics are hard to maintain
 E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?
 We said, pick the one with the most hits
 How do we maintain the top ones with multiple indexes and invalidation bit vectors?
 One possibility: ignore everything but the main index for such ordering
 Will see more such statistics used in results ranking

85

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Dynamic indexing at search engines
 All the large search engines now do dynamic indexing
 Their indices have frequent incremental changes
 News items, blogs, new topical web pages
 Sarah Palin, …
 But (sometimes/typically) they also periodically reconstruct the index from scratch
 Query processing is then switched to the new index, and the old index is then deleted

86

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Other sorts of indexes
 Positional indexes
 Same sort of sorting problem … just larger

 Building character n-gram indexes:
 As text is parsed, enumerate n-grams.
 For each n-gram, need pointers to all dictionary terms containing it – the “postings”.
 Note that the same “postings entry” will arise repeatedly in parsing the docs – need efficient hashing to keep track of this.

Why?

 E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous
 Only need to process each term once
88

Advanced Topics in Information Systems: Information Retrieval

INDEX COMPRESSION

89

Advanced Topics in Information Systems: Information Retrieval

Ch. 5

Compressing Indexes

 Collection statistics in more detail (with RCV1)
 How big will the dictionary and postings be?

 Dictionary compression  Postings compression
90

Advanced Topics in Information Systems: Information Retrieval

Ch. 5

Why compression (in general)?
 Use less disk space
 Saves a little money
 Keep more stuff in memory
 Increases speed
 Increase speed of data transfer from disk to memory
 [read compressed data | decompress] is faster than [read uncompressed data]
 Premise: Decompression algorithms are fast
 True of the decompression algorithms we use

91

Advanced Topics in Information Systems: Information Retrieval

Ch. 5

Why compression for inverted indexes?
 Dictionary
 Make it small enough to keep in main memory
 Make it so small that you can keep some postings lists in main memory too
 Postings file(s)
 Reduce disk space needed
 Decrease time needed to read postings lists from disk
 Large search engines keep a significant part of the postings in memory.
 Compression lets you keep more in memory
 We will devise various IR-specific compression schemes

92

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Recall Reuters RCV1
symbol  statistic                                       value
N       documents                                       800,000
L       avg. # tokens per doc                           200
M       terms (= word types)                            ~400,000
        avg. # bytes per token (incl. spaces/punct.)    6
        avg. # bytes per token (without spaces/punct.)  4.5
        avg. # bytes per term                           7.5
        non-positional postings                         100,000,000

93

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Index parameters vs. what we index (details IIR Table 5.1, p. 80)
                dictionary (terms)      non-positional postings   positional postings
                size (K)  ∆%  cumul%    size (K)  ∆%  cumul%      size (K)  ∆%  cumul%
Unfiltered      484       –   –         109,971   –   –           197,879   –   –
No numbers      474       −2  −2        100,680   −8  −8          179,158   −9  −9
Case folding    392       −17 −19       96,969    −3  −12         179,158   −0  −9
30 stopwords    391       −0  −19       83,390    −14 −24         121,858   −31 −38
150 stopwords   391       −0  −19       67,002    −30 −39         94,517    −47 −52
stemming        322       −17 −33       63,812    −4  −42         94,517    −0  −52

 Exercise: give intuitions for all the ‘0’ entries.
 Why do some zero entries correspond to big deltas in other columns?

94

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Lossless vs. lossy compression
 Lossless compression: All information is preserved.
 What we mostly do in IR.
 Lossy compression: Discard some information
 Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
 Chap/Lecture 7: Prune postings entries that are unlikely to turn up in the top k list for any query.
 Almost no loss of quality for top k list.

95

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Vocabulary vs. collection size
 How big is the term vocabulary?
 That is, how many distinct words are there?
 Can we assume an upper bound?
 Not really: At least 70^20 ≈ 10^37 different words of length 20
 In practice, the vocabulary will keep growing with the collection size
 Especially with Unicode

96

5. T is the number of tokens in the collection  Typical values: 30 ≤ k ≤ 100 and b ≈ 0.1 Vocabulary vs. collection size  Heaps’ law: M = kTb  M is the size of the vocabulary. T.Advanced Topics in Information Systems: Information Retrieval Sec.5  In a log-log plot of vocabulary size M vs. Heaps’ law predicts a line with slope about ½  It is the simplest possible relationship between the two in log-log space  An empirical finding (“empirical law”) 97 .

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Heaps’ Law
For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least squares fit.
Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For first 1,020,000 tokens, law predicts 38,323 terms; actually, 38,365 terms.
[Figure: log-log plot of vocabulary size M vs. collection size T with the fitted line; Fig 5.1, p. 81]

98

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Exercises
 What is the effect of including spelling errors, vs. automatically correcting spelling errors on Heaps’ law?
 Compute the vocabulary size M for this scenario:
 Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
 Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average
 What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?

99
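
One way to work this exercise in code, under Heaps' law (the numbers follow from the law and the figures on the slide):

```python
import math

# Heaps' law: M = k * T^b. Two observations pin down k and b:
# M1 = k*T1^b and M2 = k*T2^b  =>  b = log(M2/M1)/log(T2/T1), k = M1/T1^b
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000
b = math.log(M2 / M1) / math.log(T2 / T1)   # = 0.5
k = M1 / T1 ** b                            # = 30

T = 20_000_000_000 * 200      # 2e10 pages x 200 tokens/page
M = k * T ** b
print(b, k, f"{M:,.0f}")      # 0.5 30.0 60,000,000 predicted vocabulary terms

# The same law with the RCV1 fit from the previous slide (k=10^1.64, b=0.49):
print(round(10 ** 1.64 * 1_020_000 ** 0.49))   # ~38,000 terms predicted
```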

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Zipf’s law
 Heaps’ law gives the vocabulary size in collections.
 We also study the relative frequencies of terms.
 In natural language, there are a few very frequent terms and very many very rare terms.
 Zipf’s law: The ith most frequent term has frequency proportional to 1/i.
 cfi ∝ 1/i = K/i where K is a normalizing constant
 cfi is collection frequency: the number of occurrences of the term ti in the collection.

100

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Zipf consequences
 If the most frequent term (the) occurs cf1 times
 then the second most frequent term (of) occurs cf1/2 times
 the third most frequent term (and) occurs cf1/3 times …
 Equivalent: cfi = K/i where K is a normalizing factor, so
 log cfi = log K − log i
 Linear relationship between log cfi and log i
 Another power law relationship

101

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.1

Zipf’s law for Reuters RCV1
[Figure: log-log plot of collection frequency vs. frequency rank for RCV1]

102

Advanced Topics in Information Systems: Information Retrieval

Ch. 5

Compression
 Now, we will consider compressing the space for the dictionary and postings
 Basic Boolean index only
 No study of positional indexes, etc.
 We will consider compression schemes

103

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

DICTIONARY COMPRESSION

104

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Why compress the dictionary?
 Search begins with the dictionary
 We want to keep it in memory
 Memory footprint competition with other applications
 Embedded/mobile devices may have very little memory
 Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time
 So, compressing the dictionary is important

105

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Dictionary storage – first cut
 Array of fixed-width entries
 ~400,000 terms; 28 bytes/term = 11.2 MB.

Terms (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                  656,265           →
aachen             65                →
….                 ….                ….
zulu               221               →

Dictionary search structure

106

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Fixed-width terms are wasteful
 Most of the bytes in the Term column are wasted – we allot 20 bytes for 1-letter terms.
 And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
 Written English averages ~4.5 characters/word.
 Exercise: Why is/isn’t this the number to use for estimating the dictionary size?
 Avg. dictionary word in English: ~8 characters
 How do we use ~8 characters per dictionary term?
 Short words dominate token counts but not type average.

107

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Compressing the term list: Dictionary-as-a-String
 Store dictionary as a (long) string of characters:
 Pointer to next word shows end of current word
 Hope to save up to 60% of dictionary space.

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
(each entry keeps Freq. — e.g. 33, 29, 44, 126 —, a Postings ptr., and a Term ptr. into the string)

Total string length = 400K × 8B = 3.2 MB
Pointers resolve 3.2M positions: log2 3.2M = 22 bits = 3 bytes

108

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Space for dictionary as a string
 4 bytes per term for Freq.
 4 bytes per term for pointer to Postings.
 3 bytes per term pointer
 Avg. 8 bytes per term in term string
 Now avg. 11 bytes/term, not 20.
 400K terms × 19 bytes/term ⇒ 7.6 MB (against 11.2 MB for fixed width)

109

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Blocking
 Store pointers to every kth term string.
 Example below: k = 4.
 Need to store term lengths (1 extra byte)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

 Save 9 bytes on 3 pointers.
 Lose 4 bytes on term lengths.

110

5. We can save more with larger k.6 MB to 7. now we use 3 + 4 = 7 bytes. This reduces the size of the dictionary from 7.2 Net  Example for block size k = 4  Where we used 3 bytes/pointer without blocking  3 x 4 = 12 bytes. Shaved another ~0. Why not go with larger k? 111 .1 MB.5MB.Advanced Topics in Information Systems: Information Retrieval Sec.

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Exercise
 Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.

112

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Dictionary search without blocking
 Assuming each dictionary term is equally likely in query (not really so in practice!), average number of comparisons = (1+2∙2+4∙3+4)/8 ≈ 2.6
 Exercise: what if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?

113

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Dictionary search with blocking
 Binary search down to 4-term block.
 Then linear search through terms in block.
 Blocks of 4 (binary tree), avg. = (1+2∙2+2∙3+2∙4+5)/8 = 3 compares

114

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Exercise
 Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.

115

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

Front coding
 Front-coding:
 Sorted words commonly have long common prefix – store differences only
 (for last k-1 in a block of k)
 8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion
 (Encodes automat; the ◊-numbers give the extra length beyond automat.)
 Begins to resemble general string compression.

116
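
The slide's example can be reproduced by a small front-coding routine (our sketch):

```python
import os.path

def front_code_block(terms):
    """Front-code one dictionary block: the common prefix is written once
    (terminated by '*'); later terms store only extra length + suffix."""
    prefix = os.path.commonprefix(terms)
    encoded = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    for term in terms[1:]:
        extra = term[len(prefix):]
        encoded += f"{len(extra)}\u25ca{extra}"
    return encoded

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion -- the example from the slide
```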

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.2

RCV1 dictionary compression summary
Technique                                          Size in MB
Fixed width                                        11.2
Dictionary-as-String with pointers to every term    7.6
Also, blocking k = 4                                7.1
Also, Blocking + front coding                       5.9

117

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

POSTINGS COMPRESSION

118

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Postings compression
 The postings file is much larger than the dictionary, factor of at least 10.
 Key desideratum: store each posting compactly.
 A posting for our purposes is a docID.
 For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
 Alternatively, we can use log2 800,000 ≈ 20 bits per docID.
 Our goal: use a lot less than 20 bits per docID.

119

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Postings: two conflicting forces
 A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ~ 20 bits.
 A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.
 Prefer 0/1 bitmap vector in this case

120

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Postings file entry
 We store the list of docs containing a term in increasing order of docID.
 computer: 33, 47, 154, 159, 202 …
 Consequence: it suffices to store gaps.
 33, 14, 107, 5, 43 …
 Hope: most gaps can be encoded/stored with far fewer than 20 bits.

121

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Three postings entries

122

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Variable length encoding
 Aim:
 For arachnocentric, we will use ~20 bits/gap entry.
 For the, we will use ~1 bit/gap entry.

 If the average gap for a term is G, we want to use ~log2 G bits/gap entry.
 Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
 This requires a variable length encoding
 Variable length codes achieve this by using short codes for small numbers
123

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Variable Byte (VB) codes
 For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits
 Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c
 If G ≤ 127, binary-encode it in the 7 available bits and set c = 1
 Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm
 At the end set the continuation bit of the last byte to 1 (c = 1) – and for the other bytes c = 0.
124
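
An encoder/decoder pair for VB codes matching the example on the next slide (our sketch):

```python
def vb_encode(gap):
    """Variable Byte: 7 data bits per byte; last byte has its high bit set."""
    bytes_out = [gap & 0x7F]            # low-order 7 bits
    gap >>= 7
    while gap:
        bytes_out.append(gap & 0x7F)    # higher-order 7-bit groups
        gap >>= 7
    bytes_out.reverse()
    bytes_out[-1] |= 0x80               # continuation bit c=1 on the last byte
    return bytes(bytes_out)

def vb_decode(data):
    gaps, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:                 # last byte of this gap
            gaps.append(n)
            n = 0
    return gaps

stream = b"".join(vb_encode(g) for g in [824, 5, 214577])
print(stream.hex())      # 06b8850d0cb1 -- the bit pattern from the slide
print(vb_decode(stream)) # [824, 5, 214577] -- uniquely prefix-decodable
```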

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Example
docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000   10000101   00001101 00001100 10110001

Postings stored as the byte concatenation 000001101011100010000101000011010000110010110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

125

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Other variable unit codes
 Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
 Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.
 Variable byte codes:
 Used by many commercial/research systems
 Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next).
 There is also recent work on word-aligned codes that pack a variable number of gaps into one word

126

Advanced Topics in Information Systems: Information Retrieval

Unary code
 Represent n as n 1s with a final 0.
 Unary code for 3 is 1110.
 Unary code for 40 is 11111111111111111111111111111111111111110.
 Unary code for 80 is: 111111111111111111111111111111111111111111111111111111111111111111111111111111110
 This doesn’t look promising, but….

127

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Gamma codes
 We can compress better with bit-level codes
 The Gamma code is the best known of these.
 Represent a gap G as a pair length and offset
 offset is G in binary, with the leading bit cut off
 For example 13 → 1101 → 101
 length is the length of offset
 For 13 (offset 101), this is 3.
 We encode length with unary code: 1110.
 Gamma code of 13 is the concatenation of length and offset: 1110101

128

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Gamma code examples
number   length        offset        γ-code
0        –             –             none
1        0                           0
2        10            0             10,0
3        10            1             10,1
4        110           00            110,00
9        1110          001           1110,001
13       1110          101           1110,101
24       11110         1000          11110,1000
511      111111110     11111111      111111110,11111111
1025     11111111110   0000000001    11111111110,0000000001

129
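
The same table can be generated by a few lines of Python (our sketch; defined for gaps ≥ 1):

```python
def gamma_encode(gap):
    """Gamma code for gap >= 1: unary(length of offset) then offset,
    where offset is the binary representation with its leading 1 cut off."""
    offset = bin(gap)[3:]                # e.g. 13 -> '1101' -> '101'
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    gaps, i = [], 0
    while i < len(bits):
        length = bits.index("0", i) - i  # count the leading 1s
        i += length + 1                  # skip them and the terminating 0
        gaps.append(int("1" + bits[i:i + length], 2))  # restore leading bit
        i += length
    return gaps

print(gamma_encode(13))                                  # 1110101
print(gamma_decode(gamma_encode(9) + gamma_encode(13)))  # [9, 13]
```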

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Gamma code properties
 G is encoded using 2 log2 G + 1 bits
 Length of offset is log2 G bits
 Length of length is log2 G + 1 bits
 All gamma codes have an odd number of bits
 Almost within a factor of 2 of best possible, log2 G
 Gamma code is uniquely prefix-decodable, like VB
 Gamma code can be used for any distribution
 Gamma code is parameter-free

130

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Gamma seldom used in practice
 Machines have word boundaries – 8, 16, 32, 64 bits
 Operations that cross word boundaries are slower
 Compressing and manipulating at the granularity of bits can be slow
 Variable byte encoding is aligned and thus potentially more efficient
 Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

131

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

RCV1 compression
Data structure                                Size in MB
dictionary, fixed-width                       11.2
dictionary, term pointers into string          7.6
∼, with blocking, k = 4                        7.1
∼, with blocking & front coding                5.9
collection (text, xml markup etc)          3,600.0
collection (text)                            960.0
Term-doc incidence matrix                 40,000.0
postings, uncompressed (32-bit words)        400.0
postings, uncompressed (20 bits)             250.0
postings, variable byte encoded              116.0
postings, γ-encoded                          101.0

132

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Index compression summary
 We can now create an index for highly efficient Boolean retrieval that is very space efficient
 Only 4% of the total size of the collection
 Only 10-15% of the total size of the text in the collection
 However, we’ve ignored positional information
 Hence, space savings are less for indexes used in practice
 But techniques substantially the same.

133

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL
134

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Ranked retrieval
 Thus far, our queries have all been Boolean.
 Documents either match or don’t.

 Good for expert users with precise understanding of their needs and the collection.
 Also good for applications: Applications can easily consume 1000s of results.

 Not good for the majority of users.
 Most users incapable of writing Boolean queries (or they are, but they think it’s too much work).
 Most users don’t want to wade through 1000s of results.
 This is particularly true of web search.

135

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Problem with Boolean search: feast or famine
 Boolean queries often result in either too few (=0) or too many (1000s) results.
 Query 1: “standard user dlink 650” → 200,000 hits
 Query 2: “standard user dlink 650 no card found”: 0 hits
 It takes a lot of skill to come up with a query that produces a manageable number of hits.
 AND gives too few; OR gives too many

136

Advanced Topics in Information Systems: Information Retrieval

Ranked retrieval models
 Rather than a set of documents satisfying a query expression, in ranked retrieval models, the system returns an ordering over the (top) documents in the collection with respect to a query
 Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
 In principle, there are two separate choices here, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa

137

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Feast or famine: not a problem in ranked retrieval
 When a system produces a ranked result set, large result sets are not an issue
 Indeed, the size of the result set is not an issue
 We just show the top k (≈ 10) results
 We don’t overwhelm the user
 Premise: the ranking algorithm works

138

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Scoring as the basis of ranked retrieval
 We wish to return in order the documents most likely to be useful to the searcher
 How can we rank-order the documents in the collection with respect to a query?
 Assign a score – say in [0, 1] – to each document
 This score measures how well document and query “match”.

139

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Query-document matching scores
 We need a way of assigning a score to a query/document pair
 Let’s start with a one-term query
 If the query term does not occur in the document: score should be 0
 The more frequent the query term in the document, the higher the score (should be)
 We will look at a number of alternatives for this.

140

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Take 1: Jaccard coefficient
 Recall from Lecture 3: A commonly used measure of overlap of two sets A and B
 jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1
 jaccard(A,B) = 0 if A ∩ B = 0
 A and B don’t have to be the same size.
 Always assigns a number between 0 and 1.

141

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Jaccard coefficient: Scoring example
 What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
 Query: ides of march
 Document 1: caesar died in march
 Document 2: the long march

142

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Issues with Jaccard for scoring
 It doesn’t consider term frequency (how many times a term occurs in a document)
 Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information
 We need a more sophisticated way of normalizing for length
 Later in this lecture, we’ll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

143

6.Advanced Topics in Information Systems: Information Retrieval Sec.2 Recall (Lecture 1): Binary termdocument incidence matrix Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony Brutus Caesar Calpurnia Cleopatra mercy worser 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 Each document is represented by a binary vector ∈ {0.1}|V| 144 .

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2

Term-document count matrices
 Consider the number of occurrences of a term in a document:
 Each document is a count vector in ℕ|V|: a column below

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0

145

Advanced Topics in Information Systems: Information Retrieval

Bag of words model
 Vector representation doesn’t consider the ordering of words in a document
 John is quicker than Mary and Mary is quicker than John have the same vectors
 This is called the bag of words model.
 In a sense, this is a step back: The positional index was able to distinguish these two documents.
 We will look at “recovering” positional information later in this course.
 For now: bag of words model

146

Advanced Topics in Information Systems: Information Retrieval

Term frequency tf
 The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d. (NB: frequency = count in IR)
 We want to use tf when computing query-document match scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with term frequency.

147

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2

Log-frequency weighting
 The log frequency weight of term t in d is
 wt,d = 1 + log10 tft,d  if tft,d > 0;  0 otherwise
 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: sum over terms t in both q and d:
 score = Σt∈q∩d (1 + log tft,d)
 The score is 0 if none of the query terms is present in the document.

148

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.1

Document frequency
 Rare terms are more informative than frequent terms
 Recall stop words
 Consider a term in the query that is rare in the collection (e.g., arachnocentric)
 A document containing this term is very likely to be relevant to the query arachnocentric
 → We want a high weight for rare terms like arachnocentric.

149

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.1

Document frequency, continued
 Frequent terms are less informative than rare terms
 Consider a query term that is frequent in the collection (e.g., high, increase, line)
 A document containing such a term is more likely to be relevant than a document that doesn’t
 But it’s not a sure indicator of relevance.
 → For frequent terms, we want high positive weights for words like high, increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this.

150

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.1

idf weight
 dft is the document frequency of t: the number of documents that contain t
 dft is an inverse measure of the informativeness of t
 dft ≤ N
 We define the idf (inverse document frequency) of t by
 idft = log10 (N/dft)
 We use log (N/dft) instead of N/dft to “dampen” the effect of idf.
 Will turn out the base of the log is immaterial.

151

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.1

idf example, suppose N = 1 million
term        dft         idft
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idft = log10 (N/dft)
There is one idf value for each term t in a collection.

152

Advanced Topics in Information Systems: Information Retrieval

Effect of idf on ranking
 Does idf have an effect on ranking for one-term queries, like
 iPhone
 idf has no effect on ranking one-term queries
 idf affects the ranking of documents for queries with at least two terms
 For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

153

Sec. 6.2.1

Collection vs. Document frequency
 The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
 Example:

  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760

 Which word is a better search term (and should get a higher weight)?

Sec. 6.2.2

tf-idf weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight:
   w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
 Best known weighting scheme in information retrieval
 Note: the “-” in tf-idf is a hyphen, not a minus sign!
 Alternative names: tf.idf, tf x idf
 Increases with the number of occurrences within a document
 Increases with the rarity of the term in the collection
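A sketch combining both components, together with the query-document score from the next slide (the dictionary-based document representation here is our assumption, for illustration):

    import math

    def tf_idf(tf, df, N):
        """(1 + log10 tf) * log10(N / df); zero if the term is absent."""
        if tf == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    def score(query_terms, doc_tf, df, N):
        """Score(q, d): sum tf-idf over terms occurring in both q and d."""
        return sum(tf_idf(doc_tf[t], df[t], N) for t in query_terms if t in doc_tf)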

Sec. 6.2.2

Final ranking of documents for a query

  Score(q, d) = Σ_{t ∈ q∩d} tf-idf_{t,d}

Sec. 6.3

Binary → count → weight matrix

  term        Antony and  Julius   The      Hamlet  Othello  Macbeth
              Cleopatra   Caesar   Tempest
  Antony      5.25        3.18     0        0       0        0.35
  Brutus      1.21        6.10     0        1.00    0        0
  Caesar      8.59        2.54     0        1.51    0.25     0
  Calpurnia   0           1.54     0        0       0        0
  Cleopatra   2.85        0        0        0       0        0
  mercy       1.51        0        1.90     0.12    5.25     0.88
  worser      1.37        0        0.11     4.15    0.25     1.95

 Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Sec. 6.3

Documents as vectors
 So we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
 These are very sparse vectors – most entries are zero.

Sec. 6.3

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space
 Key idea 2: Rank documents according to their proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant documents

Sec. 6.3

Formalizing vector space proximity
 First cut: distance between two points
 (= distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea …
 … because Euclidean distance is large for vectors of different lengths.

Sec. 6.3

Why distance is a bad idea
 The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3

Use angle instead of distance
 Thought experiment: take a document d and append it to itself. Call this document d′.
 “Semantically” d and d′ have the same content
 The Euclidean distance between the two documents can be quite large
 The angle between the two documents is 0, corresponding to maximal similarity.
 Key idea: Rank documents according to angle with query.

Sec. 6.3

From angles to cosines
 The following two notions are equivalent:
 Rank documents in decreasing order of the angle between query and document
 Rank documents in increasing order of cosine(query, document)
 Cosine is a monotonically decreasing function for the interval [0°, 180°]

Sec. 6.3

From angles to cosines
 But how – and why – should we be computing cosines?

Sec. 6.3

Length normalization
 A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:
   ‖x‖_2 = √( Σ_i x_i² )
 Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
 Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
 Long and short documents now have comparable weights
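A sketch of L2 normalization (plain lists stand in for term-weight vectors):

    import math

    def l2_normalize(vec):
        """Divide each component by the L2 norm, yielding a unit vector."""
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm > 0 else vec

    # d' (d appended to itself) doubles every raw count; both documents
    # normalize to the same unit vector:
    print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
    print(l2_normalize([6.0, 8.0]))  # [0.6, 0.8]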

Sec. 6.3

cosine(query, document)

  cos(q, d) = (q • d) / (‖q‖ ‖d‖) = Σ_{i=1..|V|} q_i d_i / ( √(Σ_{i=1..|V|} q_i²) · √(Σ_{i=1..|V|} d_i²) )

 The numerator is the dot product of q and d; equivalently, cos(q, d) is the dot product of the unit vectors q/‖q‖ and d/‖d‖.
 q_i is the tf-idf weight of term i in the query
 d_i is the tf-idf weight of term i in the document
 cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors
 For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
   cos(q, d) = q • d = Σ_{i=1..|V|} q_i d_i
 for q, d length-normalized.
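The general formula, as a sketch:

    import math

    def cosine(q, d):
        """cos(q, d) = (q . d) / (|q| |d|) for arbitrary weight vectors."""
        dot = sum(qi * di for qi, di in zip(q, d))
        nq = math.sqrt(sum(qi * qi for qi in q))
        nd = math.sqrt(sum(di * di for di in d))
        return dot / (nq * nd) if nq > 0 and nd > 0 else 0.0

    # For vectors that are already length-normalized, the denominator is 1,
    # so cosine similarity reduces to the plain dot product.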

Cosine similarity illustrated
 [Figure not reproduced in this extract.]

Sec. 6.3

Cosine similarity amongst 3 documents
How similar are the novels
 SaS: Sense and Sensibility
 PaP: Pride and Prejudice, and
 WH: Wuthering Heights?

  term        SaS   PaP   WH
  affection   115   58    20
  jealous     10    7     11
  gossip      2     0     6
  wuthering   0     0     38

 Term frequencies (counts)
 Note: To simplify this example, we don’t do idf weighting.

Sec. 6.3

3 documents example contd.

 Log frequency weighting:

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

 After length normalization:

  term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

 cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
 cos(SaS, WH) ≈ 0.79
 cos(PaP, WH) ≈ 0.69
 Why do we have cos(SaS, PaP) > cos(SaS, WH)?
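These numbers can be reproduced directly (term counts as on the previous slide):

    import math

    counts = {
        "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
        "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
        "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
    }
    terms = ["affection", "jealous", "gossip", "wuthering"]

    def log_tf_vec(doc):
        return [1 + math.log10(doc[t]) if doc[t] > 0 else 0.0 for t in terms]

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    vec = {name: normalize(log_tf_vec(doc)) for name, doc in counts.items()}

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    print(round(dot(vec["SaS"], vec["PaP"]), 2))  # 0.94
    print(round(dot(vec["SaS"], vec["WH"]), 2))   # 0.79
    print(round(dot(vec["PaP"], vec["WH"]), 2))   # 0.69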

Sec. 6.3

Computing cosine scores
 [The pseudocode figure is not reproduced in this extract.]
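In its place, a hedged sketch of the usual term-at-a-time scheme (the data-structure names are ours): each query term’s postings contribute partial scores, the accumulated scores are length-normalized, and the top K documents are returned.

    import heapq

    def cosine_scores(query_weights, postings, doc_length, K=10):
        """Term-at-a-time cosine scoring.

        query_weights: term -> w_{t,q}
        postings:      term -> list of (doc_id, w_{t,d}) pairs from the index
        doc_length:    doc_id -> precomputed vector length for normalization
        """
        scores = {}
        for term, wq in query_weights.items():
            for doc_id, wd in postings.get(term, []):
                scores[doc_id] = scores.get(doc_id, 0.0) + wq * wd
        for doc_id in scores:
            scores[doc_id] /= doc_length[doc_id]
        return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])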

Sec. 6.4

tf-idf weighting has many variants
 [The SMART weighting-scheme table is not reproduced in this extract.]
 Columns headed ‘n’ are acronyms for weight schemes.
 Why is the base of the log in idf immaterial?

Sec. 6.4

Weighting may differ in queries vs documents
 Many search engines allow for different weightings for queries vs. documents
 SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
 A very standard weighting scheme is: lnc.ltc
 Document: logarithmic tf (l as first character), no idf, and cosine normalization (no idf for documents: a bad idea?)
 Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization …

Sec. 6.4

tf-idf example: lnc.ltc
 Document: “car insurance auto insurance”
 Query: “best car insurance”

              Query                                      Document                       Prod
  term        tf-raw  tf-wt  df      idf  wt    n’lize   tf-raw  tf-wt  wt    n’lize
  auto        0       0      5000    2.3  0     0        1       1      1     0.52     0
  best        1       1      50000   1.3  1.3   0.34     0       0      0     0        0
  car         1       1      10000   2.0  2.0   0.52     1       1      1     0.52     0.27
  insurance   1       1      1000    3.0  3.0   0.78     2       1.3    1.3   0.68     0.53

 Exercise: what is N, the number of docs?
 Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
 Score = 0 + 0 + 0.27 + 0.53 = 0.8
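A sketch verifying the worked example end to end (the slide’s idf column is consistent with N = 10^6, which also answers the exercise):

    import math

    N = 1_000_000  # e.g. log10(N / 5000) ≈ 2.3, matching the idf column
    df   = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
    q_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}  # "best car insurance"
    d_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}  # "car insurance auto insurance"

    def log_tf(tf):
        return 1 + math.log10(tf) if tf > 0 else 0.0

    # Query side, ltc: log tf times idf, then cosine-normalize
    q_wt = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in q_tf.items()}
    q_len = math.sqrt(sum(w * w for w in q_wt.values()))
    # Document side, lnc: log tf only (no idf), then cosine-normalize
    d_wt = {t: log_tf(tf) for t, tf in d_tf.items()}
    d_len = math.sqrt(sum(w * w for w in d_wt.values()))     # ≈ 1.92

    score = sum((q_wt[t] / q_len) * (d_wt[t] / d_len) for t in q_tf)
    print(round(score, 2))  # ≈ 0.8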

Summary – vector space ranking
 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user
