Advanced Topics in Information Systems: Information Retrieval

Advanced Topics in Information Systems:

Information Retrieval
Jun.Prof. Alexander Markowetz Slides modified from Christopher Manning and Prabhakar Raghavan

Advanced Topics in Information Systems: Information Retrieval

DICTIONARY DATA STRUCTURES
2

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.1

Hashes
 Each vocabulary term is hashed to an integer
 (We assume you’ve seen hashtables before)

 Pros:
 Lookup is faster than for a tree: O(1)

 Cons:
 No easy way to find minor variants:
 judgment/judgement

 No prefix search [tolerant retrieval]  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

3

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.1
Tree: binary tree
[Figure: a binary tree over the dictionary – root splits a-m / n-z, then a-hu / hy-m and n-sh / si-z, down to leaf terms such as aardvark, huygens, sickle, zygot]
4

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.1
Tree: B-tree
[Figure: B-tree with internal ranges a-hu / hy-m / n-z]
 Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g. [2,4].
5

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.1
Trees
 Simplest: binary tree
 More usual: B-trees
 Trees require a standard ordering of characters and hence strings … but we standardly have one
 Pros:
 Solves the prefix problem (terms starting with hyp)
 Cons:
 Slower: O(log M) [and this requires balanced tree]
 Rebalancing binary trees is expensive
 But B-trees mitigate the rebalancing problem
6

Advanced Topics in Information Systems: Information Retrieval
WILD-CARD QUERIES
7

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2
Wild-card queries: *
 mon*: find all docs containing any word beginning “mon”.
 Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
 *mon: find words ending in “mon”: harder
 Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom ≤ w < non.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
8

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2
Query processing
 At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
 We still have to look up the postings for each enumerated term.
 E.g., consider the query: se*ate AND fil*er. This may result in the execution of many Boolean AND queries.
9

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2
B-trees handle *’s at the end of a query term
 How can we handle *’s in the middle of query term?
 co*tion
 We could look up co* AND *tion in a B-tree and intersect the two term sets
 Expensive
 The solution: transform wild-card queries so that the *’s occur at the end
 This gives rise to the Permuterm Index.
10

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2.1
Permuterm index
 For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, where $ is a special symbol.
 Queries:
 X lookup on X$
 X* lookup on $X*
 *X lookup on X$*
 *X* lookup on X*
 X*Y lookup on Y$X*
 X*Y*Z ??? Exercise!
Query = hel*o: X=hel, Y=o, lookup o$hel*
11
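A tiny Python sketch of the rotation machinery (illustrative, not from the slides; a real permuterm index stores the rotations in a B-tree rather than scanning a list):

    def permuterm_rotations(term):
        """All rotations of term + '$': hello -> hello$, ello$h, llo$he, lo$hel, o$hell, $hello."""
        s = term + "$"
        return [s[i:] + s[:i] for i in range(len(s))]

    def rotate_query(query):
        """Rotate a single-'*' wildcard query so the '*' trails: hel*o -> o$hel*."""
        x, y = query.split("*")              # query = X*Y (X or Y may be empty)
        return y + "$" + x + "*"

    def permuterm_lookup(query, index):
        prefix = rotate_query(query).rstrip("*")
        return {term for rot, term in index if rot.startswith(prefix)}

    index = [(rot, t) for t in ["hello", "help", "halo"]
             for rot in permuterm_rotations(t)]
    print(sorted(permuterm_lookup("hel*o", index)))   # ['hello']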

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2.1
Permuterm query processing
 Rotate query wild-card to the right
 Now use B-tree lookup as before.
 Permuterm problem: ≈ quadruples lexicon size (empirical observation for English).
12

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2.2
Bigram (k-gram) indexes
 Enumerate all k-grams (sequence of k chars) occurring in any term
 e.g., from text “April is the cruelest month” we get the 2-grams (bigrams):
$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
 $ is a special word boundary symbol
 Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
13

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2.2
Bigram index example
 The k-gram index finds terms based on a query consisting of k-grams (here k=2).
[Figure: bigram postings – $m → mace, madden; mo → among, amortize; on → among, around]
14

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2.2
Processing wild-cards
 Query mon* can now be run as: $m AND mo AND on
 Gets terms that match AND version of our wildcard query.
 But we’d enumerate moon.
 Must post-filter these terms against query.
 Surviving enumerated terms are then looked up in the term-document inverted index.
 Fast, space efficient (compared to permuterm).
15
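The post-filtering step is easy to see in code. A minimal sketch (names are ours, not from the slides), using Python’s fnmatch for the final check against the original wildcard:

    from collections import defaultdict
    import fnmatch

    def bigrams(term):
        s = "$" + term + "$"
        return {s[i:i + 2] for i in range(len(s) - 1)}

    lexicon = ["month", "moon", "mace", "among", "amortize"]
    kgram_index = defaultdict(set)
    for t in lexicon:
        for g in bigrams(t):
            kgram_index[g].add(t)

    grams = {"$m", "mo", "on"}                  # bigrams of the literal part of mon*
    candidates = set.intersection(*(kgram_index[g] for g in grams))
    print(sorted(candidates))                   # ['month', 'moon'] -- moon is a false match
    print([t for t in candidates if fnmatch.fnmatchcase(t, "mon*")])   # ['month']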

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.2.2
Processing wild-card queries
 As before, we must execute a Boolean query for each enumerated, filtered term.
 Wild-cards can result in expensive query execution (very large disjunctions…)
 pyth* AND prog*
 If you encourage “laziness” people will respond!
[Search box: “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”]
 Which web search engines allow wildcard queries?
16

Advanced Topics in Information Systems: Information Retrieval
SPELLING CORRECTION
17

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3
Spell correction
 Two principal uses
 Correcting document(s) being indexed
 Correcting user queries to retrieve “right” answers
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words, e.g., from → form
 Context-sensitive
 Look at surrounding words, e.g., I flew form Heathrow to Narita.
18

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3
Document correction
 Especially needed for OCR’ed documents
 Correction algorithms are tuned for this: rn/m
 Can use domain-specific knowledge
 E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).
 But also: web pages and even printed material have typos
 Goal: the dictionary contains fewer misspellings
 But often we don’t change the documents but aim to fix the query-document mapping
19

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3
Query mis-spellings
 Our principal focus here
 E.g., the query Alanis Morisett
 We can either
 Retrieve documents indexed by the correct spelling, OR
 Return several suggested alternative queries with the correct spelling
 Did you mean … ?
20

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.2
Isolated word correction
 Fundamental premise – there is a lexicon from which the correct spellings come
 Two basic choices for this
 A standard lexicon such as
 Webster’s English Dictionary
 An “industry-specific” lexicon – hand-maintained
 The lexicon of the indexed corpus
 E.g., all words on the web
 All names, acronyms etc.
 (Including the mis-spellings)
21

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.2
Isolated word correction
 Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
 What’s “closest”?
 We’ll study several alternatives
 Edit distance (Levenshtein distance)
 Weighted edit distance
 n-gram overlap
22

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.3
Edit distance
 Given two strings S1 and S2, the minimum number of operations to convert one to the other
 Operations are typically character-level
 Insert, Replace, Delete, (Transposition)
 E.g., the edit distance from dof to dog is 1; from cat to act is 2 (just 1 with transpose); from cat to dog is 3.
 Generally found by dynamic programming.
 See http://www.merriampark.com/ld.htm for a nice example plus an applet.
23
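The dynamic program is short enough to show in full. A minimal sketch of unweighted Levenshtein distance (insert/delete/replace, each cost 1; the transposition operation is left out):

    def edit_distance(s1, s2):
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                       # delete all of s1[:i]
        for j in range(n + 1):
            d[0][j] = j                       # insert all of s2[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # delete
                              d[i][j - 1] + 1,         # insert
                              d[i - 1][j - 1] + cost)  # replace or match
        return d[m][n]

    print(edit_distance("dof", "dog"))   # 1
    print(edit_distance("cat", "act"))   # 2 (1 if transposition were allowed)
    print(edit_distance("cat", "dog"))   # 3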

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.3

Weighted edit distance
 As above, but the weight of an operation depends on the character(s) involved
 Meant to capture OCR or keyboard errors, e.g. m more likely to be mis-typed as n than as q  Therefore, replacing m by n is a smaller edit distance than by q  This may be formulated as a probability model

 Requires weight matrix as input  Modify dynamic programming to handle weights

24

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

Using edit distances
 Given query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)  Intersect this set with list of “correct” words  Show terms you found to user as suggestions  Alternatively,
 We can look up all possible corrections in our inverted index and return all docs … slow  We can run with a single most likely correction

 The alternatives disempower the user, but save a round of interaction with the user
25

Advanced Topics in Information Systems: Information Retrieval

Sec. 3.3.4

Edit distance to all dictionary terms?
 Given a (mis-spelled) query – do we compute its edit distance to every dictionary term?
 Expensive and slow  Alternative?

 How do we cut the set of candidate dictionary terms?  One possibility is to use n-gram overlap for this  This can also be used by itself for spelling correction.

26

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.4
n-gram overlap
 Enumerate all the n-grams in the query string as well as in the lexicon
 Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
 Threshold by number of matching n-grams
 Variants – weight by keyboard layout, etc.
27

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.4
Example with trigrams
 Suppose the text is november
 Trigrams are nov, ove, vem, emb, mbe, ber.
 The query is december
 Trigrams are dec, ece, cem, emb, mbe, ber.
 So 3 trigrams overlap (of 6 in each term)
 How can we turn this into a normalized measure of overlap?
28

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.4
One option – Jaccard coefficient
 A commonly-used measure of overlap
 Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|
 Equals 1 when X and Y have the same elements and zero when they are disjoint
 X and Y don’t have to be of the same size
 Always assigns a number between 0 and 1
 Now threshold to decide if you have a match
 E.g., if J.C. > 0.8, declare a match
29
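Applied to the november/december trigram example, the computation looks like this (a sketch; trigrams here are taken without boundary symbols, as on the previous slide):

    def trigrams(term):
        return {term[i:i + 3] for i in range(len(term) - 2)}

    def jaccard(x, y):
        return len(x & y) / len(x | y)

    a, b = trigrams("november"), trigrams("december")
    print(sorted(a & b))              # ['ber', 'emb', 'mbe'] -- the 3 shared trigrams
    print(round(jaccard(a, b), 3))    # 3/9 = 0.333, well below a 0.8 threshold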

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.4
Matching trigrams
 Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: postings – lo → alone, lord, sloth; or → border, lord, morbid; rd → ardent, border, card]
Standard postings “merge” will enumerate … Adapt this to using Jaccard (or another) measure.
30

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.5
Context-sensitive spell correction
 Text: I flew from Heathrow to Narita.
 Consider the phrase query “flew form Heathrow”
 We’d like to respond: Did you mean “flew from Heathrow”? because no docs matched the query phrase.
31

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.5
Context-sensitive correction
 Need surrounding context to catch this.
 First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
 Now try all possible resulting phrases with one word “fixed” at a time
 flew from heathrow
 fled form heathrow
 flea form heathrow
 Hit-based spelling correction: Suggest the alternative that has lots of hits.
32

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.5
Exercise
 Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow. How many “corrected” phrases will we enumerate in this scheme?
33

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.5
Another approach
 Break phrase query into a conjunction of biwords (Lecture 2).
 Look for biwords that need only one term corrected.
 Enumerate phrase matches and … rank them!
34

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.3.5
General issues in spell correction
 We enumerate multiple alternatives for “Did you mean?”
 Need to figure out which to present to the user
 Use heuristics
 The alternative hitting most docs
 Query log analysis + tweaking
 For especially popular, topical queries
 Spell-correction is computationally expensive
 Avoid running routinely on every query?
 Run only on queries that matched few docs
35

Advanced Topics in Information Systems: Information Retrieval
SOUNDEX
36

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.4
Soundex
 Class of heuristics to expand a query into phonetic equivalents
 Language specific – mainly for names
 E.g., chebyshev → tchebycheff
 Invented for the U.S. census … in 1918
37

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.4
Soundex – typical algorithm
 Turn every token to be indexed into a 4-character reduced form
 Do the same with query terms
 Build and search an index on the reduced forms
 (when the query calls for a soundex match)
 http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
38

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.4
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
 B, F, P, V → 1
 C, G, J, K, Q, S, X, Z → 2
 D, T → 3
 L → 4
 M, N → 5
 R → 6
39

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.4
Soundex continued
4. Remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
 E.g., Herman becomes H655.
 Will hermann generate the same code?
40
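A direct transcription of steps 1–6 (a minimal sketch assuming alphabetic ASCII input; the letter-to-digit table is from the previous slide):

    CODES = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"), ("AEIOUHWY", "0")]:
        for ch in letters:
            CODES[ch] = digit

    def soundex(word):
        word = word.upper()
        digits = word[0] + "".join(CODES[c] for c in word[1:])
        dedup = digits[0]
        for c in digits[1:]:              # step 4: drop one of each identical pair
            if c != dedup[-1]:
                dedup += c
        dedup = dedup[0] + dedup[1:].replace("0", "")   # step 5: remove zeros
        return (dedup + "000")[:4]                      # step 6: pad and truncate

    print(soundex("herman"))    # H655
    print(soundex("hermann"))   # H655 -- yes, the same code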

Advanced Topics in Information Systems: Information Retrieval
Sec. 3.4
Soundex
 Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)
 How useful is soundex?
 Not very – for information retrieval
 Okay for “high recall” tasks (e.g., Interpol), though biased to names of certain nationalities
 Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR
41

Advanced Topics in Information Systems: Information Retrieval
What queries can we process?
 We have
 Positional inverted index with skip pointers
 Wild-card index
 Spell-correction
 Soundex
 Queries such as (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)
42

Advanced Topics in Information Systems: Information Retrieval
INDEX GENERATION
43

Advanced Topics in Information Systems: Information Retrieval
Ch. 4
Index construction
 How do we construct an index?
 What strategies can we use with limited main memory?
44

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.1
Hardware basics
 Many design decisions in information retrieval are based on the characteristics of hardware
 We begin by reviewing hardware basics
45

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.1
Hardware basics
 Access to data in memory is much faster than access to data on disk.
 Disk seeks: No data is transferred from disk while the disk head is being positioned.
 Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
 Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).
 Block sizes: 8 KB to 256 KB.
46

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.1
Hardware basics
 Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
 Available disk space is several (2–3) orders of magnitude larger.
 Fault tolerance is very expensive: It’s much cheaper to use many regular machines rather than one fault tolerant machine.
 Google is particularly famous for combining standard hardware … in shipping containers.
47

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.1
Hardware assumptions
symbol | statistic | value
s | average seek time | 5 ms = 5 × 10⁻³ s
b | transfer time per byte | 0.02 μs = 2 × 10⁻⁸ s
  | processor’s clock rate | 10⁹ s⁻¹
p | low-level operation (e.g., compare & swap a word) | 0.01 μs = 10⁻⁸ s
  | size of main memory | several GB
  | size of disk space | 1 TB or more
48

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
RCV1: Our collection for this lecture
 Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
 As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
 This is one year of Reuters newswire (part of 1995 and 1996)
 The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
49

Introduction to Information Retrieval
Sec. 4.2
A Reuters RCV1 document
[Figure: a sample RCV1 news story]
50

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Reuters RCV1 statistics
symbol | statistic | value
N | documents | 800,000
L | avg. # tokens per doc | 200
M | terms (= word types) | 400,000
  | avg. # bytes per token (incl. spaces/punct.) | 6
  | avg. # bytes per token (without spaces/punct.) | 4.5
  | avg. # bytes per term | 7.5
  | non-positional postings | 100,000,000
51

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Recall IIR 1 index construction
 Documents are parsed to extract words and these are saved with the Document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
[Table: the resulting (term, docID) pairs in document order, from (I, 1) through (ambitious, 2)]
52

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Key step
 After all documents have been parsed, the inverted file is sorted by terms. We focus on this sort step. We have 100M items to sort.
[Table: (term, docID) pairs in document order on the left, and sorted by term on the right – ambitious 2, be 2, brutus 1, brutus 2, capitol 1, …]
53

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Scaling index construction
 In-memory index construction does not scale.
 How can we construct an index for very large collections?
 Taking into account the hardware constraints we just learned about …
 Memory, disk, speed, etc.
54

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Sort-based index construction
 As we build the index, we parse docs one at a time.
 While building the index, we cannot easily exploit compression tricks (you can, but much more complex)
 The final postings for any term are incomplete until the end.
 At 12 bytes per non-positional postings entry (term, doc, freq), demands a lot of space for large collections.
 T = 100,000,000 in the case of RCV1
 So … we can do this in memory in 2009, but typical collections are much larger. E.g. the New York Times provides an index of >150 years of newswire
 Thus: We need to store intermediate results on disk.
55

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Use the same algorithm for disk?
 Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
 No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
 We need an external sorting algorithm.
56

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Bottleneck
 Parse and build postings entries one doc at a time
 Now sort postings entries by term (then by doc within each term)
 Doing this with random disk seeks would be too slow – must sort T=100M records
 If every comparison took 2 disk seeks, and N items could be sorted with N log₂N comparisons, how long would this take?
57

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)
 12-byte (4+4+4) records (term, doc, freq).
 These are generated as we parse docs.
 Must now sort 100M such 12-byte records by term.
 Define a Block ~ 10M such records
 Can easily fit a couple into memory.
 Will have 10 such blocks to start with.
 Basic idea of algorithm:
 Accumulate postings for each block, sort, write to disk.
 Then merge the blocks into one long sorted order.
58
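An in-memory toy version of the algorithm (an illustrative sketch: real BSBI writes each sorted block to disk and streams the merge; here records are (term, docID) pairs and blocks are plain lists):

    import heapq

    def bsbi(records, block_size):
        blocks = []
        for start in range(0, len(records), block_size):
            blocks.append(sorted(records[start:start + block_size]))  # sort one block
        index = {}
        for term, doc in heapq.merge(*blocks):       # n-way merge of the sorted runs
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc:  # skip duplicate docIDs
                postings.append(doc)
        return index

    records = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("noble", 2), ("brutus", 2)]
    print(bsbi(records, block_size=2))
    # {'brutus': [1, 2], 'caesar': [1, 2], 'noble': [2]}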

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
[Figure]
59

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
Sorting 10 blocks of 10M records
 First, read each block and sort within:
 Quicksort takes 2N ln N expected steps
 In our case 2 × (10M ln 10M) steps
 Exercise: estimate total time to read each block from disk and quicksort it.
 10 times this estimate – gives us 10 sorted runs of 10M records each.
 Done straightforwardly, need 2 copies of data on disk
 But can optimize this
60

Sec. 4.2 61 .

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
How to merge the sorted runs?
 Can do binary merges, with a merge tree of log₂10 = 4 layers.
 During each layer, read into memory runs in blocks of 10M, merge, write back.
[Figure: runs 1–4 on disk being merged pairwise into a merged run]
62

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.2
How to merge the sorted runs?
 But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously
 Providing you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, then you’re not killed by disk seeks
63
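One way to realize this in Python (a sketch under stated assumptions: the block file names are hypothetical, and each file holds text lines already in sorted order; heapq.merge keeps one cursor per run while the large buffers keep seeks rare):

    import heapq

    def read_run(path):
        with open(path, buffering=1 << 20) as f:   # ~1 MB read buffer per run
            yield from f

    runs = [read_run(f"block{i}.txt") for i in range(10)]
    with open("merged.txt", "w", buffering=1 << 20) as out:  # big output buffer
        out.writelines(heapq.merge(*runs))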

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.3
Remaining problem with sort-based algorithm
 Our assumption was: we can keep the dictionary in memory.
 We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
 Actually, we could work with term,docID postings instead of termID,docID postings …
 … but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)
64

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.3
SPIMI: Single-pass in-memory indexing
 Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks.
 Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
 With these two ideas we can generate a complete inverted index for each block.
 These separate indexes can then be merged into one big index.
65

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.3
SPIMI-Invert
[Figure: SPIMI-Invert pseudocode]
 Merging of blocks is analogous to BSBI.
66
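A compact sketch of SPIMI-Invert (illustrative; tokens are (term, docID) pairs and "writing a block to disk" is modeled by yielding it):

    from collections import defaultdict
    from itertools import islice

    def spimi_invert(token_stream, block_size):
        it = iter(token_stream)
        while True:
            block = list(islice(it, block_size))
            if not block:
                return
            dictionary = defaultdict(list)        # per-block dictionary, no termIDs
            for term, doc_id in block:
                dictionary[term].append(doc_id)   # accumulate postings, no sorting
            yield sorted(dictionary.items())      # sort terms once per block

    tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("noble", 2)]
    for block in spimi_invert(tokens, block_size=2):
        print(block)
    # [('brutus', [1]), ('caesar', [1])]
    # [('caesar', [2]), ('noble', [2])]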

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.3
SPIMI: Compression
 Compression makes SPIMI even more efficient.
 Compression of terms
 Compression of postings
67

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Distributed indexing
 For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
 Individual machines are fault-prone
 Can unpredictably slow down or fail
 How do we exploit such a pool of machines?
68

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Google data centers
 Google data centers mainly contain commodity machines.
 Data centers are distributed around the world.
 Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
 Estimate: Google installs 100,000 servers each quarter.
 Based on expenditures of 200–250 million dollars per year
 This would be 10% of the computing capacity of the world!?!
69

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Google data centers
 If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
 Answer: 63%
 Calculate the number of servers failing per minute for an installation of 1 million servers.
70

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Distributed indexing
 Maintain a master machine directing the indexing job – considered “safe”.
 Break up indexing into sets of (parallel) tasks.
 Master machine assigns each task to an idle machine from a pool.
71

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Parallel tasks
 We will use two sets of parallel tasks
 Parsers
 Inverters
 Break the input document collection into splits
 Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)
72

 Now to complete the index inversion 73 .. q-z) – here j = 3. doc) pairs  Parser writes pairs into j partitions  Each partition is for a range of terms’ first letters  (e.4 Parsers  Master assigns a split to an idle parser machine  Parser reads a document at a time and emits (term. g-p.Advanced Topics in Information Systems: Information Retrieval Sec.g. 4. a-f.

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Inverters
 An inverter collects all (term, doc) pairs (= postings) for one term-partition.
 Sorts and writes to postings lists
74

Introduction to Information Retrieval
Sec. 4.4
Data flow
[Figure: splits → Master assigns → Parsers write segment files partitioned a-f / g-p / q-z (map phase) → Inverters, one per partition, produce postings (reduce phase)]
75

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
MapReduce
 The index construction algorithm we just described is an instance of MapReduce.
 MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …
 … without having to write code for the distribution part.
 They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
76

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
MapReduce
 Index construction was just one phase.
 Another phase: transforming a term-partitioned index into a document-partitioned index.
 Term-partitioned: one machine handles a subrange of terms
 Document-partitioned: one machine handles a subrange of documents
 As we discuss later in the course, most search engines use a document-partitioned index … better load balancing, etc.
77

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.4
Schema for index construction in MapReduce
 Schema of map and reduce functions
 map: input → list(k, v)
 reduce: (k, list(v)) → output
 Instantiation of the schema for index construction
 map: web collection → list(termID, docID)
 reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)
 Example for index construction
 map: d1 : C came, C c'ed. d2 : C died. → (<C,d1>, <came,d1>, <C,d1>, <c'ed,d1>, <C,d2>, <died,d2>)
 reduce: (<C,(d1,d1,d2)>, <came,(d1)>, <c'ed,(d1)>, <died,(d2)>) → (<C,(d1:2,d2:1)>, <came,(d1:1)>, <c'ed,(d1:1)>, <died,(d2:1)>)
78
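The schema can be run in-process to check the example (a sketch, not a MapReduce framework; the group-by-key dict stands in for the shuffle phase):

    from collections import Counter, defaultdict

    def map_fn(doc_id, text):
        tokens = text.replace(",", " ").replace(".", " ").split()
        return [(term, doc_id) for term in tokens]

    def reduce_fn(term, doc_ids):
        return term, sorted(Counter(doc_ids).items())   # docID -> tf pairs

    docs = {"d1": "C came, C c'ed.", "d2": "C died."}
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_fn(doc_id, text):
            groups[term].append(d)

    for term in sorted(groups):
        print(reduce_fn(term, groups[term]))
    # ('C', [('d1', 2), ('d2', 1)])
    # ("c'ed", [('d1', 1)])
    # ('came', [('d1', 1)])
    # ('died', [('d2', 1)])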

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Dynamic indexing
 Up to now, we have assumed that collections are static.
 They rarely are:
 Documents come in over time and need to be inserted.
 Documents are deleted and modified.
 This means that the dictionary and postings lists have to be modified:
 Postings updates for terms already in dictionary
 New terms added to dictionary
79

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Simplest approach
 Maintain “big” main index
 New docs go into “small” auxiliary index
 Search across both, merge results
 Deletions
 Invalidation bit-vector for deleted docs
 Filter docs output on a search result by this invalidation bit-vector
 Periodically, re-index into one main index
80

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Issues with main and auxiliary indexes
 Problem of frequent merges – you touch stuff a lot
 Poor performance during merge
 Actually:
 Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
 Merge is the same as a simple append.
 But then we would need a lot of files – inefficient for O/S.
 Assumption for the rest of the lecture: The index is one big file.
 In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file etc.)
81

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Logarithmic merge
 Maintain a series of indexes, each twice as large as the previous one.
 Keep smallest (Z0) in memory
 Larger ones (I0, I1, …) on disk
 If Z0 gets too big (> n), write to disk as I0
 or merge with I0 (if I0 already exists) as Z1
 Either write merge Z1 to disk as I1 (if no I1)
 Or merge with I1 to form Z2
 etc.
82
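The bookkeeping in a few lines (a sketch: "indexes" are sorted lists, "disk" is a dict keyed by generation, and Z0 spills once it holds n postings):

    def log_merge(disk, z0, n, posting):
        z0.append(posting)
        if len(z0) < n:
            return z0
        z, gen = sorted(z0), 0
        while gen in disk:                 # I_gen exists: merge and carry upward
            z = sorted(disk.pop(gen) + z)  # Z_{gen+1} = merge(I_gen, Z_gen)
            gen += 1
        disk[gen] = z                      # write as I_gen, size n * 2**gen
        return []                          # start a fresh in-memory Z0

    disk, z0 = {}, []
    for p in range(1, 10):
        z0 = log_merge(disk, z0, n=2, posting=p)
    print(disk, z0)   # {2: [1, 2, 3, 4, 5, 6, 7, 8]} [9]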

Sec. 4.5
[Figure: logarithmic merge pseudocode]
83

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Logarithmic merge
 Auxiliary and main index: index construction time is O(T²) as each posting is touched in each merge.
 Logarithmic merge: Each posting is merged O(log T) times, so complexity is O(T log T)
 So logarithmic merge is much more efficient for index construction
 But query processing now requires the merging of O(log T) indexes
 Whereas it is O(1) if you just have a main and auxiliary index
84

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Further issues with multiple indexes
 Collection-wide statistics are hard to maintain
 E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?
 We said, pick the one with the most hits
 How do we maintain the top ones with multiple indexes and invalidation bit vectors?
 One possibility: ignore everything but the main index for such ordering
 Will see more such statistics used in results ranking
85

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
Dynamic indexing at search engines
 All the large search engines now do dynamic indexing
 Their indices have frequent incremental changes
 News items, blogs, new topical web pages
 Sarah Palin, …
 But (sometimes/typically) they also periodically reconstruct the index from scratch
 Query processing is then switched to the new index, and the old index is then deleted
86

Advanced Topics in Information Systems: Information Retrieval
Sec. 4.5
[Figure]
87

Advanced Topics in Information Systems: Information Retrieval

Sec. 4.5

Other sorts of indexes
 Positional indexes
 Same sort of sorting problem … just larger

 Building character n-gram indexes:
 As text is parsed, enumerate n-grams.  For each n-gram, need pointers to all dictionary terms containing it – the “postings”.  Note that the same “postings entry” will arise repeatedly in parsing the docs – need efficient hashing to keep track of this.

Why ?

 E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous  Only need to process each term once
88

Advanced Topics in Information Systems: Information Retrieval

INDEX COMPRESSION

89

Advanced Topics in Information Systems: Information Retrieval

Ch. 5

Compressing Indexes

 Collection statistics in more detail (with RCV1)
 How big will the dictionary and postings be?

 Dictionary compression  Postings compression
90

Advanced Topics in Information Systems: Information Retrieval
Ch. 5
Why compression (in general)?
 Use less disk space
 Saves a little money
 Keep more stuff in memory
 Increases speed
 Increase speed of data transfer from disk to memory
 [read compressed data | decompress] is faster than [read uncompressed data]
 Premise: Decompression algorithms are fast
 True of the decompression algorithms we use
91

Advanced Topics in Information Systems: Information Retrieval
Ch. 5
Why compression for inverted indexes?
 Dictionary
 Make it small enough to keep in main memory
 Make it so small that you can keep some postings lists in main memory too
 Postings file(s)
 Reduce disk space needed
 Decrease time needed to read postings lists from disk
 Large search engines keep a significant part of the postings in memory.
 Compression lets you keep more in memory
 We will devise various IR-specific compression schemes
92

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Recall Reuters RCV1
symbol | statistic | value
N | documents | 800,000
L | avg. # tokens per doc | 200
M | terms (= word types) | ~400,000
  | avg. # bytes per token (incl. spaces/punct.) | 6
  | avg. # bytes per token (without spaces/punct.) | 4.5
  | avg. # bytes per term | 7.5
  | non-positional postings | 100,000,000
93

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Index parameters vs. what we index (details IIR Table 5.1, p. 80)
size of: | word types (terms) – dictionary, Size (K), ∆ %, cumul % | non-positional postings – non-positional index, Size (K), ∆ %, cumul % | positional postings – positional index, Size (K), ∆ %, cumul %
Unfiltered | 484 | 109,971 | 197,879
No numbers | 474, −2, −2 | 100,680, −8, −8 | 179,158, −9, −9
Case folding | 392, −17, −19 | 96,969, −3, −12 | 179,158, −0, −9
30 stopwords | 391, −0, −19 | 83,390, −14, −24 | 121,858, −31, −38
150 stopwords | 391, −0, −19 | 67,002, −30, −39 | 94,517, −47, −52
stemming | 322, −17, −33 | 63,812, −4, −42 | 94,517, −0, −52
Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?
94

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Lossless vs. lossy compression
 Lossless compression: All information is preserved.
 What we mostly do in IR.
 Lossy compression: Discard some information
 Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
 Chap/Lecture 7: Prune postings entries that are unlikely to turn up in the top k list for any query.
 Almost no loss of quality for top k list.
95

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Vocabulary vs. collection size
 How big is the term vocabulary?
 That is, how many distinct words are there?
 Can we assume an upper bound?
 Not really: At least 70^20 ≈ 10^37 different words of length 20
 In practice, the vocabulary will keep growing with the collection size
 Especially with Unicode
96

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Vocabulary vs. collection size
 Heaps’ law: M = kT^b
 M is the size of the vocabulary, T is the number of tokens in the collection
 Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
 In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½
 It is the simplest possible relationship between the two in log-log space
 An empirical finding (“empirical law”)
97

Sec. 5.1
Heaps’ Law
For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least squares fit.
Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For first 1,000,020 tokens, law predicts 38,323 terms; actually, 38,365 terms.
[Fig 5.1, p. 81]
98

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Exercises
 What is the effect of including spelling errors, vs. automatically correcting spelling errors on Heaps’ law?
 Compute the vocabulary size M for this scenario:
 Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
 Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average
 What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?
99
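One way to work the second exercise (a sketch: the two measurements pin down k and b, which then extrapolate to the full collection):

    import math

    t1, m1 = 10_000, 3_000
    t2, m2 = 1_000_000, 30_000
    b = math.log(m2 / m1) / math.log(t2 / t1)   # 10 = 100**b, so b = 0.5
    k = m1 / t1 ** b                            # 3000 / 100 = 30

    T = 20_000_000_000 * 200                    # 2 x 10^10 pages x 200 tokens each
    M = k * T ** b
    print(b, k, f"{M:.3e}")                     # 0.5 30.0 6.000e+07 terms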

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Zipf’s law
 Heaps’ law gives the vocabulary size in collections.
 We also study the relative frequencies of terms.
 In natural language, there are a few very frequent terms and very many very rare terms.
 Zipf’s law: The ith most frequent term has frequency proportional to 1/i.
 cfi ∝ 1/i = K/i where K is a normalizing constant
 cfi is collection frequency: the number of occurrences of the term ti in the collection.
100

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.1
Zipf consequences
 If the most frequent term (the) occurs cf1 times
 then the second most frequent term (of) occurs cf1/2 times
 the third most frequent term (and) occurs cf1/3 times …
 Equivalent: cfi = K/i where K is a normalizing factor, so log cfi = log K − log i
 Linear relationship between log cfi and log i
 Another power law relationship
101

Introduction to Information Retrieval
Sec. 5.1
Zipf’s law for Reuters RCV1
[Figure: log-log plot of collection frequency vs. rank]
102

Advanced Topics in Information Systems: Information Retrieval
Ch. 5
Compression
 Now, we will consider compressing the space for the dictionary and postings
 Basic Boolean index only
 No study of positional indexes, etc.
 We will consider compression schemes
103

Sec. 5.2
DICTIONARY COMPRESSION
104

compressing the dictionary is important 105 .2 Why compress the dictionary?  Search begins with the dictionary  We want to keep it in memory  Memory footprint competition with other applications  Embedded/mobile devices may have very little memory  Even if the dictionary isn’t in memory. 5. we want it to be small for a fast search startup time  So.Advanced Topics in Information Systems: Information Retrieval Sec.

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Dictionary storage – first cut
 Array of fixed-width entries
 ~400,000 terms; 28 bytes/term = 11.2 MB.
[Table: Terms (20 bytes each): a, aachen, …, zulu | Freq. (4 bytes): 656,265; 65; …; 221 | Postings ptr. (4 bytes each) | plus a dictionary search structure]
106

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Fixed-width terms are wasteful
 Most of the bytes in the Term column are wasted – we allot 20 bytes for 1-letter terms.
 And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
 Written English averages ~4.5 characters/word.
 Exercise: Why is/isn’t this the number to use for estimating the dictionary size?
 Avg. dictionary word in English: ~8 characters
 How do we use ~8 characters per dictionary term?
 Short words dominate token counts but not type average.
107

Introduction to Information Retrieval
Sec. 5.2
Compressing the term list: Dictionary-as-a-String
 Store dictionary as a (long) string of characters:
 Pointer to next word shows end of current word
 Hope to save up to 60% of dictionary space.
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
[Table: Freq. | Postings ptr. | Term ptr. – e.g. 33, 29, 44, 126]
 Total string length = 400K × 8B = 3.2 MB
 Pointers resolve 3.2M positions: log₂ 3.2M = 22 bits = 3 bytes
108

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Space for dictionary as a string
 4 bytes per term for Freq.
 4 bytes per term for pointer to Postings.
 3 bytes per term pointer
 Avg. 8 bytes per term in term string
 400K terms × 19 ⇒ 7.6 MB (against 11.2 MB for fixed width)
 Now avg. 11 bytes/term, not 20.
109

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Blocking
 Store pointers to every kth term string.
 Example below: k=4.
 Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
[Table: Freq. | Postings ptr. | Term ptr. – e.g. 33, 29, 44, 126]
 Save 9 bytes on 3 pointers.
 Lose 4 bytes on term lengths.
110

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Net
 Example for block size k = 4
 Where we used 3 bytes/pointer without blocking: 3 × 4 = 12 bytes; now we use 3 + 4 = 7 bytes.
 Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
 We can save more with larger k. Why not go with larger k?
111

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Exercise
 Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.
112

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Dictionary search without blocking
 Assuming each dictionary term equally likely in query (not really so in practice!), average number of comparisons = (1+2·2+4·3+4)/8 ≈ 2.6
[Figure: binary search tree over 8 dictionary terms]
 Exercise: what if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
113

Introduction to Information Retrieval
Sec. 5.2
Dictionary search with blocking
 Binary search down to 4-term block;
 Then linear search through terms in block.
 Blocks of 4 (binary tree), avg. = (1+2·2+2·3+2·4+5)/8 = 3 compares
114

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Exercise
 Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.
115

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
Front coding
 Front-coding:
 Sorted words commonly have long common prefix – store differences only
 (for last k-1 in a block of k)
8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion
(Encodes automat; the numbers give the extra length beyond automat.)
 Begins to resemble general string compression.
116
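A small sketch that reproduces the encoding above (the exact on-disk byte layout varies; this produces the textual form shown, with the shared prefix written once):

    import os.path

    def front_code(block):
        prefix = os.path.commonprefix(block)
        out = f"{len(block[0])}{prefix}*{block[0][len(prefix):]}"
        for term in block[1:]:
            out += f"{len(term) - len(prefix)}◊{term[len(prefix):]}"
        return out

    print(front_code(["automata", "automate", "automatic", "automation"]))
    # 8automat*a1◊e2◊ic3◊ion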

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.2
RCV1 dictionary compression summary
Technique | Size in MB
Fixed width | 11.2
Dictionary-as-String with pointers to every term | 7.6
Also, blocking k = 4 | 7.1
Also, blocking + front coding | 5.9
117

Sec. 5.3
POSTINGS COMPRESSION
118

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Postings compression
 The postings file is much larger than the dictionary, factor of at least 10.
 Key desideratum: store each posting compactly.
 A posting for our purposes is a docID.
 For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
 Alternatively, we can use log₂ 800,000 ≈ 20 bits per docID.
 Our goal: use a lot less than 20 bits per docID.
119

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Postings: two conflicting forces
 A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log₂ 1M ~ 20 bits.
 A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.
 Prefer 0/1 bitmap vector in this case
120

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Postings file entry
 We store the list of docs containing a term in increasing order of docID.
 computer: 33, 47, 154, 159, 202 …
 Consequence: it suffices to store gaps.
 33, 14, 107, 5, 43 …
 Hope: most gaps can be encoded/stored with far fewer than 20 bits.
121

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Three postings entries

122

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Variable length encoding
 Aim:
 For arachnocentric, we will use ~20 bits/gap entry.  For the, we will use ~1 bit/gap entry.

 If the average gap for a term is G, we want to use ~log2G bits/gap entry.  Key challenge: encode every integer (gap) with about as few bits as needed for that integer.  This requires a variable length encoding  Variable length codes achieve this by using short codes for small numbers
123

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Variable Byte (VB) codes
 For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits  Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c  If G ≤127, binary-encode it in the 7 available bits and set c =1  Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm  At the end set the continuation bit of the last byte to 1 (c =1) – and for the other bytes c = 0.
124

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Example
docIDs: 824, 829, 215406
gaps: –, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001
Postings stored as the byte concatenation 000001101011100010000101000011010000110010110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
125
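Encode/decode in Python, reproducing the example above (a sketch; bytes are kept as ints for readability):

    def vb_encode_number(n):
        out = []
        while True:
            out.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        out[-1] += 128                 # continuation bit set on the last byte only
        return out

    def vb_decode(bytestream):
        numbers, n = [], 0
        for b in bytestream:
            if b < 128:
                n = 128 * n + b        # payload continues in the next byte
            else:
                numbers.append(128 * n + (b - 128))
                n = 0
        return numbers

    gaps = [824, 5, 214577]
    stream = [b for g in gaps for b in vb_encode_number(g)]
    print(" ".join(f"{b:08b}" for b in stream))
    # 00000110 10111000 10000101 00001101 00001100 10110001
    print(vb_decode(stream))           # [824, 5, 214577]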

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Other variable unit codes
 Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
 Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.
 Variable byte codes:
 Used by many commercial/research systems
 Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next).
 There is also recent work on word-aligned codes that pack a variable number of gaps into one word
126

Advanced Topics in Information Systems: Information Retrieval
Unary code
 Represent n as n 1s with a final 0.
 Unary code for 3 is 1110.
 Unary code for 40 is 11111111111111111111111111111111111111110.
 Unary code for 80 is: 1111…10 (eighty 1s followed by a 0)
 This doesn’t look promising, but…
127

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Gamma codes
 We can compress better with bit-level codes
 The Gamma code is the best known of these.
 Represent a gap G as a pair length and offset
 offset is G in binary, with the leading bit cut off
 For example 13 → 1101 → 101
 length is the length of offset
 For 13 (offset 101), this is 3.
 We encode length with unary code: 1110.
 Gamma code of 13 is the concatenation of length and offset: 1110101
128

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Gamma code examples
number | length | offset | γ-code
0 | | | none
1 | 0 | | 0
2 | 10 | 0 | 10,0
3 | 10 | 1 | 10,1
4 | 110 | 00 | 110,00
9 | 1110 | 001 | 1110,001
13 | 1110 | 101 | 1110,101
24 | 11110 | 1000 | 11110,1000
511 | 111111110 | 11111111 | 111111110,11111111
1025 | 11111111110 | 0000000001 | 11111111110,0000000001
129
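Encode/decode matching the table (a sketch; codes are handled as bit strings, so this shows the logic rather than a space-efficient implementation):

    def gamma_encode(g):                   # g >= 1
        offset = bin(g)[3:]                # binary form minus '0b' and the leading 1
        return "1" * len(offset) + "0" + offset

    def gamma_decode(bits):
        numbers, i = [], 0
        while i < len(bits):
            length = 0
            while bits[i] == "1":          # unary length part
                length += 1
                i += 1
            i += 1                         # skip the terminating 0
            numbers.append(int("1" + bits[i:i + length], 2))
            i += length
        return numbers

    for g in (1, 2, 3, 4, 9, 13, 24, 511, 1025):
        print(g, gamma_encode(g))          # e.g. 13 -> 1110101
    print(gamma_decode(gamma_encode(13) + gamma_encode(24)))   # [13, 24]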

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Gamma code properties
 G is encoded using 2 ⌊log G⌋ + 1 bits
 Length of offset is ⌊log G⌋ bits
 Length of length is ⌊log G⌋ + 1 bits
 All gamma codes have an odd number of bits
 Almost within a factor of 2 of best possible, log₂ G
 Gamma code is uniquely prefix-decodable, like VB
 Gamma code can be used for any distribution
 Gamma code is parameter-free
130

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
Gamma seldom used in practice
 Machines have word boundaries – 8, 16, 32, 64 bits
 Operations that cross word boundaries are slower
 Compressing and manipulating at the granularity of bits can be slow
 Variable byte encoding is aligned and thus potentially more efficient
 Regardless of efficiency, variable byte is conceptually simpler at little additional space cost
131

Advanced Topics in Information Systems: Information Retrieval
Sec. 5.3
RCV1 compression
Data structure | Size in MB
dictionary, fixed-width | 11.2
dictionary, term pointers into string | 7.6
with blocking, k = 4 | 7.1
with blocking & front coding | 5.9
collection (text, xml markup etc) | 3,600.0
collection (text) | 960.0
Term-doc incidence matrix | 40,000.0
postings, uncompressed (32-bit words) | 400.0
postings, uncompressed (20 bits) | 250.0
postings, variable byte encoded | 116.0
postings, γ−encoded | 101.0
132

Advanced Topics in Information Systems: Information Retrieval

Sec. 5.3

Index compression summary
 We can now create an index for highly efficient Boolean retrieval that is very space efficient  Only 4% of the total size of the collection  Only 10-15% of the total size of the text in the collection  However, we’ve ignored positional information  Hence, space savings are less for indexes used in practice
 But techniques substantially the same.

133

Ch. 6

SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL
134

Advanced Topics in Information Systems: Information Retrieval

Ch. 6

Ranked retrieval
 Thus far, our queries have all been Boolean.
 Documents either match or don’t.

 Good for expert users with precise understanding of their needs and the collection.
 Also good for applications: Applications can easily consume 1000s of results.

 Not good for the majority of users.
 Most users incapable of writing Boolean queries (or they are, but they think it’s too much work).  Most users don’t want to wade through 1000s of results.
 This is particularly true of web search.

135

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Problem with Boolean search: feast or famine
 Boolean queries often result in either too few (=0) or too many (1000s) results.
 Query 1: “standard user dlink 650” → 200,000 hits
 Query 2: “standard user dlink 650 no card found”: 0 hits
 It takes a lot of skill to come up with a query that produces a manageable number of hits.
 AND gives too few; OR gives too many
136

Advanced Topics in Information Systems: Information Retrieval
Ranked retrieval models
 Rather than a set of documents satisfying a query expression, in ranked retrieval models, the system returns an ordering over the (top) documents in the collection with respect to a query
 Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
 In principle, there are two separate choices here, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa
137

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Feast or famine: not a problem in ranked retrieval
 When a system produces a ranked result set, large result sets are not an issue
 Indeed, the size of the result set is not an issue
 We just show the top k (≈ 10) results
 We don’t overwhelm the user
 Premise: the ranking algorithm works
138

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Scoring as the basis of ranked retrieval
 We wish to return in order the documents most likely to be useful to the searcher
 How can we rank-order the documents in the collection with respect to a query?
 Assign a score – say in [0, 1] – to each document
 This score measures how well document and query “match”.
139

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Query-document matching scores
 We need a way of assigning a score to a query/document pair
 Let’s start with a one-term query
 If the query term does not occur in the document: score should be 0
 The more frequent the query term in the document, the higher the score (should be)
 We will look at a number of alternatives for this.
140

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Take 1: Jaccard coefficient
 Recall from Lecture 3: A commonly used measure of overlap of two sets A and B
 jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1
 jaccard(A,B) = 0 if A ∩ B = 0
 A and B don’t have to be the same size.
 Always assigns a number between 0 and 1.
141

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Jaccard coefficient: Scoring example
 What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
 Query: ides of march
 Document 1: caesar died in march
 Document 2: the long march
142

Advanced Topics in Information Systems: Information Retrieval
Ch. 6
Issues with Jaccard for scoring
 It doesn’t consider term frequency (how many times a term occurs in a document)
 Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information
 We need a more sophisticated way of normalizing for length
 Later in this lecture, instead of |A ∩ B| / |A ∪ B| (Jaccard), we’ll use |A ∩ B| / √|A ∪ B| for length normalization.
143

Advanced Topics in Information Systems: Information Retrieval
Sec. 6.2
Recall (Lecture 1): Binary term-document incidence matrix
term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 1 | 1 | 0 | 0 | 0 | 1
Brutus | 1 | 1 | 0 | 1 | 0 | 0
Caesar | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy | 1 | 0 | 1 | 1 | 1 | 1
worser | 1 | 0 | 1 | 1 | 1 | 0
Each document is represented by a binary vector ∈ {0,1}|V|
144

Advanced Topics in Information Systems: Information Retrieval
Sec. 6.2
Term-document count matrices
 Consider the number of occurrences of a term in a document:
 Each document is a count vector in ℕ|V|: a column below
term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 157 | 73 | 0 | 0 | 0 | 0
Brutus | 4 | 157 | 0 | 1 | 0 | 0
Caesar | 232 | 227 | 0 | 2 | 1 | 1
Calpurnia | 0 | 10 | 0 | 0 | 0 | 0
Cleopatra | 57 | 0 | 0 | 0 | 0 | 0
mercy | 2 | 0 | 3 | 5 | 5 | 1
worser | 2 | 0 | 1 | 1 | 1 | 0
145

Advanced Topics in Information Systems: Information Retrieval
Bag of words model
 Vector representation doesn’t consider the ordering of words in a document
 John is quicker than Mary and Mary is quicker than John have the same vectors
 This is called the bag of words model.
 In a sense, this is a step back: The positional index was able to distinguish these two documents.
 We will look at “recovering” positional information later in this course.
 For now: bag of words model
146

= 0.Advanced Topics in Information Systems: Information Retrieval Sec.d .d >0 otherw ise  0 → 0. 148 .2 Log-frequency weighting  The log frequency weight of term t in d is wt. 10 → 2.  if tf t. 1 → 1. 6. 1000 → 4.d )  The score is 0 if none of the query terms is present in the document. etc.3. 2 → 1.  Score for a document-query pair: sum over terms t in both q and d:  score = ∑t∈q∩d (1 + log tf t .d 1  +log 10 tf t.

149 .g. arachnocentric)  A document containing this term is very likely to be relevant to the query arachnocentric  → We want a high weight for rare terms like arachnocentric.1 Document frequency  Rare terms are more informative than frequent terms  Recall stop words  Consider a term in the query that is rare in the collection (e.Advanced Topics in Information Systems: Information Retrieval Sec. 6..2.

 We will use document frequency (df) to capture this. 150 . and line  But lower weights than for rare terms. increase. 6.Advanced Topics in Information Systems: Information Retrieval Sec.. line)  A document containing such a term is more likely to be relevant than a document that doesn’t  But it’s not a sure indicator of relevance.1 Document frequency.g. high. continued  Frequent terms are less informative than rare terms  Consider a query term that is frequent in the collection (e. we want high positive weights for words like high.  → For frequent terms. increase.2.

151 .Advanced Topics in Information Systems: Information Retrieval Sec. 6. Will turn out the base of the log is immaterial.2.1 idf weight  dft is the document frequency of t: the number of documents that contain t  dft is an inverse measure of the informativeness of t  dft ≤ N  We define the idf (inverse document frequency) of t by idf t = log10 ( N/df t )  We use log (N/dft) instead of N/dft to “dampen” the effect of idf.

6. suppose N = 1 million dft 1 100 1.000 100.Advanced Topics in Information Systems: Information Retrieval Sec.000.000 idft term calpurnia animal sunday fly under the idf t = log10 ( N/df t ) There is one idf value for each term t in a collection.1 idf example.000 10.000 1. 152 .2.

idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person. 153 . like  iPhone  idf has no effect on ranking one term queries  idf affects the ranking of documents for queries with at least two terms  For the query capricious person.Advanced Topics in Information Systems: Information Retrieval Effect of idf on ranking  Does idf have an effect on ranking for one-term queries.

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.1

Collection vs. Document frequency
 The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
 Example:

  Word       Collection frequency  Document frequency
  insurance  10440                 3997
  try        10422                 8760

 Which word is a better search term (and should get a higher weight)?

154

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.2

tf-idf weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight.

   wt,d = (1 + log10 tft,d) × log10 ( N/dft )

 Best known weighting scheme in information retrieval
   Note: the “-” in tf-idf is a hyphen, not a minus sign!
   Alternative names: tf.idf, tf x idf
 Increases with the number of occurrences within a document
 Increases with the rarity of the term in the collection

155
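Combining the two weights in Python (a sketch; the function name and sample numbers are my own):

```python
import math

def tf_idf(tf: int, df: int, N: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# e.g. a term occurring 3 times in a document, present in 1,000 of 1,000,000 docs:
print(round(tf_idf(tf=3, df=1_000, N=1_000_000), 2))  # (1 + log10 3) * 3 ~ 4.43
```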

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.2.2

Final ranking of documents for a query

   Score(q,d) = ∑t∈q∩d tf-idft,d

156

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Binary → count → weight matrix

             Antony and  Julius  The      Hamlet  Othello  Macbeth
             Cleopatra   Caesar  Tempest
  Antony     5.25        3.18    0        0       0        0.35
  Brutus     1.21        6.1     0        1       0        0
  Caesar     8.59        2.54    0        1.51    0.25     0
  Calpurnia  0           1.54    0        0       0        0
  Cleopatra  2.85        0       0        0       0        0
  mercy      1.51        0       1.9      0.12    5.25     0.88
  worser     1.37        0       0.11     4.15    0.25     1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

157

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Documents as vectors
 So we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
 These are very sparse vectors – most entries are zero.

158

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space
 Key idea 2: Rank documents according to their proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant documents

159

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Formalizing vector space proximity
 First cut: distance between two points
   ( = distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea . . .
 . . . because Euclidean distance is large for vectors of different lengths.

160

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Why distance is a bad idea
[Figure] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

161

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Use angle instead of distance
 Thought experiment: take a document d and append it to itself. Call this document d′.
 “Semantically” d and d′ have the same content
 The Euclidean distance between the two documents can be quite large
 The angle between the two documents is 0, corresponding to maximal similarity.
 Key idea: Rank documents according to angle with query.

162
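The thought experiment can be checked directly. A minimal sketch (treating the components as raw term counts, which double when d is appended to itself):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def cosine(u, v):
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = lambda x: math.sqrt(sum(xi * xi for xi in x))
    return dot / (norm(u) * norm(v))

d = [3.0, 1.0, 2.0]            # some term-count vector (made-up numbers)
d_prime = [2 * x for x in d]   # d appended to itself: every count doubles

print(euclidean(d, d_prime))   # large: ~3.74
print(cosine(d, d_prime))      # 1.0 -> angle 0, maximal similarity
```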

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

From angles to cosines
 The following two notions are equivalent:
   Rank documents in increasing order of the angle between query and document
   Rank documents in decreasing order of cosine(query,document)
 Cosine is a monotonically decreasing function for the interval [0°, 180°]

163

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

From angles to cosines
 But how – and why – should we be computing cosines?

164

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Length normalization
 A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:

   ‖x‖2 = √( ∑i xi² )

 Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
 Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
 Long and short documents now have comparable weights

165
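A sketch of L2 normalization in Python, reusing d and d′ from the thought experiment:

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm, yielding a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

d = [3.0, 1.0, 2.0]
d_prime = [2 * x for x in d]  # d appended to itself

print(l2_normalize(d))        # the two unit vectors are identical:
print(l2_normalize(d_prime))  # normalization removes the length difference
```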

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

cosine(query,document)

   cos(q,d) = (q • d) / (‖q‖ ‖d‖) = ∑i=1..|V| qi di / ( √(∑i=1..|V| qi²) · √(∑i=1..|V| di²) )

 q • d is the dot product; q/‖q‖ and d/‖d‖ are unit vectors
 qi is the tf-idf weight of term i in the query
 di is the tf-idf weight of term i in the document
 cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

166
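A sketch of this formula for the sparse term → weight dictionaries typical of IR (the function name is my own; the sample weights anticipate the lnc.ltc example on slide 174):

```python
import math

def cosine_sparse(q, d):
    """Cosine similarity for sparse term -> weight dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"best": 1.3, "car": 2.0, "insurance": 3.0}   # idf-weighted query
d = {"car": 1.0, "insurance": 1.3, "auto": 1.0}   # log-tf-weighted document
print(round(cosine_sparse(q, d), 1))              # 0.8
```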

Advanced Topics in Information Systems: Information Retrieval

Cosine for length-normalized vectors
 For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

   cos(q,d) = q • d = ∑i=1..|V| qi di

 for q, d length-normalized.

167

Advanced Topics in Information Systems: Information Retrieval

Cosine similarity illustrated
[Figure]

168

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Cosine similarity amongst 3 documents
How similar are the novels
  SaS: Sense and Sensibility
  PaP: Pride and Prejudice, and
  WH: Wuthering Heights?

Term frequencies (counts):

  term       SaS  PaP  WH
  affection  115  58   20
  jealous    10   7    11
  gossip     2    0    6
  wuthering  0    0    38

Note: To simplify this example, we don’t do idf weighting.

169

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

3 documents example contd.

Log frequency weighting:

  term       SaS   PaP   WH
  affection  3.06  2.76  2.30
  jealous    2.00  1.85  2.04
  gossip     1.30  0     1.78
  wuthering  0     0     2.58

After length normalization:

  term       SaS    PaP    WH
  affection  0.789  0.832  0.524
  jealous    0.515  0.555  0.465
  gossip     0.335  0      0.405
  wuthering  0      0      0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?

170
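A short Python check of this example (helper names are my own; the counts are from the previous slide):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

vecs = {doc: normalize({t: log_tf(tf) for t, tf in tfs.items()})
        for doc, tfs in counts.items()}

def cos(a, b):
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # ~0.94
print(round(cos("SaS", "WH"), 2))   # ~0.79
print(round(cos("PaP", "WH"), 2))   # ~0.69
```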

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.3

Computing cosine scores
[Figure: the CosineScore algorithm]

171
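The algorithm figure did not survive extraction; the following Python sketch of term-at-a-time scoring follows the same idea (accumulate partial dot products per document, normalize by document length, return the top K). The data layout and names here are illustrative assumptions, not from the course:

```python
import heapq

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """Term-at-a-time cosine scoring over an inverted index.

    query_weights: term -> weight of the term in the query
    postings:      term -> list of (doc_id, doc_term_weight) pairs
    doc_lengths:   doc_id -> L2 length of the document's weight vector
    """
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

# Tiny usage example with made-up postings:
postings = {"car": [(1, 1.0), (2, 0.5)], "insurance": [(1, 1.3)]}
doc_lengths = {1: 1.92, 2: 1.0}
print(cosine_scores({"car": 2.0, "insurance": 3.0}, postings, doc_lengths))
```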

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.4

tf-idf weighting has many variants

  Term frequency:
    n (natural)    tft,d
    l (logarithm)  1 + log(tft,d)
    a (augmented)  0.5 + 0.5 · tft,d / maxt(tft,d)
    b (boolean)    1 if tft,d > 0, else 0
    L (log ave)    (1 + log(tft,d)) / (1 + log(avet∈d tft,d))

  Document frequency:
    n (no)        1
    t (idf)       log(N/dft)
    p (prob idf)  max{0, log((N − dft)/dft)}

  Normalization:
    n (none)            1
    c (cosine)          1/√(w1² + w2² + … + wM²)
    u (pivoted unique)  1/u
    b (byte size)       1/CharLength^α, α < 1

Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?

172

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.4

Weighting may differ in queries vs documents
 Many search engines allow for different weightings for queries vs. documents
 SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
 A very standard weighting scheme is: lnc.ltc
 Document: logarithmic tf (l as first character), no idf, and cosine normalization
   A bad idea?
 Query: logarithmic tf (l in leftmost column), idf (t in second column), and cosine normalization

173

Advanced Topics in Information Systems: Information Retrieval

Sec. 6.4

tf-idf example: lnc.ltc
Document: car insurance auto insurance
Query: best car insurance

             Query                                     Document                     Prod
  Term       tf-raw  tf-wt  df     idf  wt   n’lize   tf-raw  tf-wt  wt   n’lize
  auto       0       0      5000   2.3  0    0        1       1      1    0.52    0
  best       1       1      50000  1.3  1.3  0.34     0       0      0    0       0
  car        1       1      10000  2.0  2.0  0.52     1       1      1    0.52    0.27
  insurance  1       1      1000   3.0  3.0  0.78     2       1.3    1.3  0.68    0.53

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8

Exercise: what is N, the number of docs?

174
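The whole lnc.ltc computation fits in a few lines of Python (a sketch; N is an assumption chosen to be consistent with the idf column of the table):

```python
import math

N = 1_000_000  # assumed: e.g. log10(N/1000) = 3.0 matches the idf column

df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# ltc query: log tf x idf, then cosine-normalized
q = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}
# lnc document: log tf, no idf, cosine-normalized
d = {t: log_tf(tf) for t, tf in doc_tf.items()}

norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
score = sum(q[t] * d.get(t, 0.0) for t in q) / (norm(q) * norm(d))
print(round(score, 1))  # 0.8, as on the slide
```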

Advanced Topics in Information Systems: Information Retrieval

Summary – vector space ranking
 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user

175
