
Advanced Topics in Information Systems: Information Retrieval
Jun.-Prof. Alexander Markowetz
Slides modified from Christopher Manning and Prabhakar Raghavan


DICTIONARY DATA STRUCTURES


Sec. 3.1

Hashes
Each vocabulary term is hashed to an integer
(We assume you've seen hashtables before)

Pros:
Lookup is faster than for a tree: O(1)

Cons:
No easy way to find minor variants:
judgment/judgement
No prefix search [tolerant retrieval]
If the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything

Sec. 3.1

Tree: binary tree

(Figure: a binary tree over the dictionary. The root splits the terms into the ranges a-m and n-z, the next level into a-hu, hy-m, n-sh, and si-z, and the leaves hold terms such as aardvark, huygens, sickle, and zygot.)

Sec. 3.1

Tree: B-tree

(Figure: a B-tree whose root partitions the terms into the ranges a-hu, hy-m, and n-z.)

Definition: every internal node has a number of children in the interval [a,b], where a and b are appropriate natural numbers, e.g., [2,4].

Sec. 3.1

Trees
Simplest: binary tree. More usual: B-trees.
Trees require a standard ordering of characters and hence strings - but we standardly have one.
Pros: solves the prefix problem (find terms starting with hyp)
Cons: slower: O(log M) [and this requires a balanced tree]
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem

WILD-CARD QUERIES

Sec. 3.2

Wild-card queries: *
mon*: find all docs containing any word beginning with mon.
Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon <= w < moo.
*mon: find words ending in mon - harder.
Maintain an additional B-tree for terms written backwards.
Can retrieve all words w in the range nom <= w < non.

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Sec. 3.2

Query processing
At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
We still have to look up the postings for each enumerated term.
E.g., consider the query: se*ate AND fil*er
This may result in the execution of many Boolean AND queries.

Sec. 3.2

B-trees handle *'s at the end of a query term

How can we handle *'s in the middle of a query term?
co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets
Expensive

The solution: transform wild-card queries so that the *'s occur at the end
This gives rise to the Permuterm Index.

Sec. 3.2.1

Permuterm index
For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.

Queries:
X      -> lookup on X$
X*     -> lookup on $X*
*X     -> lookup on X$*
*X*    -> lookup on X*
X*Y    -> lookup on Y$X*
X*Y*Z  -> ??? Exercise!

Query = hel*o: X = hel, Y = o -> lookup o$hel*
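To make the rotations concrete, here is a minimal Python sketch (the helper names are my own, not from the slides). Note that it generates all rotations of term+'$', including $hello, which the X* lookup needs:

def permuterm_rotations(term):
    # All rotations of term + '$'; each rotation is indexed and points back to term.
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def rotate_query(query):
    # Rotate a query containing one '*' so that the wildcard ends up at the end.
    x, y = query.split("*")        # query has the form X*Y (X or Y may be empty)
    return y + "$" + x + "*"       # lookup key: Y$X*

print(permuterm_rotations("hello"))  # ['hello$', 'ello$h', ..., '$hello']
print(rotate_query("hel*o"))         # 'o$hel*'
print(rotate_query("mon*"))          # '$mon*'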

Sec. 3.2.1

Permuterm query processing

Rotate the query wild-card to the right
Now use B-tree lookup as before.
Permuterm problem: roughly quadruples the lexicon size
Empirical observation for English.

Sec. 3.2.2

Bigram (k-gram) indexes

Enumerate all k-grams (sequences of k chars) occurring in any term
e.g., from the text "April is the cruelest month" we get the 2-grams (bigrams)
$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$

$ is a special word boundary symbol

Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
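A minimal sketch of building such an index in Python (the helper names are my own; assumes lowercased terms):

from collections import defaultdict

def kgrams(term, k=2):
    # All k-grams of a term, with '$' as the word boundary symbol.
    augmented = "$" + term + "$"
    return {augmented[i:i + k] for i in range(len(augmented) - k + 1)}

def build_kgram_index(dictionary, k=2):
    # Map each k-gram to the sorted list of dictionary terms containing it.
    index = defaultdict(set)
    for term in dictionary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return {gram: sorted(terms) for gram, terms in index.items()}

index = build_kgram_index(["moon", "month", "monday"])
print(index["on"])   # ['monday', 'month', 'moon']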

Sec. 3.2.2

Bigram index example

The k-gram index finds terms based on a query consisting of k-grams (here k=2):
$m -> mace, madden
mo -> among, amortize
on -> among, around

Sec. 3.2.2

Processing wild-cards
Query mon* can now be run as
$m AND mo AND on

Gets terms that match the AND version of our wildcard query.
But we'd enumerate moon.
Must post-filter these terms against the query.
Surviving enumerated terms are then looked up in the term-document inverted index.
Fast, space efficient (compared to permuterm).

Sec. 3.2.2

Processing wild-card queries

As before, we must execute a Boolean query for each enumerated, filtered term.
Wild-cards can result in expensive query execution (very large disjunctions)
pyth* AND prog*

If you encourage laziness, people will respond!
[Mock search box: "Type your search terms, use * if you need to. E.g., Alex* will match Alexander."]

Which web search engines allow wildcard queries?

SPELLING CORRECTION

Sec. 3.3

Spell correction
Two principal uses
Correcting document(s) being indexed
Correcting user queries to retrieve "right" answers

Two main flavors:
Isolated word
Check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words, e.g., form instead of from

Context-sensitive
Look at surrounding words, e.g., I flew form Heathrow to Narita.

Sec. 3.3

Document correction
Especially needed for OCR'ed documents
Correction algorithms are tuned for this: e.g., rn vs. m
Can use domain-specific knowledge
E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).

But also: web pages and even printed material have typos
Goal: the dictionary contains fewer misspellings
But often we don't change the documents; we aim to fix the query-document mapping

Sec. 3.3

Query mis-spellings
Our principal focus here
E.g., the query Alanis Morisett

We can either
Retrieve documents indexed by the correct spelling, OR
Return several suggested alternative queries with the correct spelling
Did you mean ... ?

Sec. 3.3.2

Isolated word correction

Fundamental premise: there is a lexicon from which the correct spellings come
Two basic choices for this
A standard lexicon such as
Webster's English Dictionary
An industry-specific lexicon, hand-maintained

The lexicon of the indexed corpus
E.g., all words on the web
All names, acronyms etc.
(Including the mis-spellings)

Sec. 3.3.2

Isolated word correction

Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
What's "closest"?
We'll study several alternatives
Edit distance (Levenshtein distance)
Weighted edit distance
n-gram overlap

Sec. 3.3.3

Edit distance
Given two strings S1 and S2, the minimum number of operations to convert one to the other
Operations are typically character-level
Insert, Delete, Replace, (Transposition)

E.g., the edit distance from dof to dog is 1
From cat to act it is 2 (just 1 with transpose)
From cat to dog it is 3.

Generally found by dynamic programming; a sketch follows below.
See http://www.merriampark.com/ld.htm for a nice example plus an applet.
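A minimal dynamic-programming sketch of (unweighted) Levenshtein distance, assuming unit cost for insert, delete, and replace:

def edit_distance(s1, s2):
    # d[i][j] = edit distance between s1[:i] and s2[:j]
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                 # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # replace (or match)
    return d[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "dog"))  # 3

Weighted edit distance (next slide) replaces the unit costs with entries from a character-pair weight matrix in the same recurrence.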

Sec. 3.3.3

Weighted edit distance

As above, but the weight of an operation depends on the character(s) involved
Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q
Therefore, replacing m by n is a smaller edit distance than replacing it by q
This may be formulated as a probability model

Requires a weight matrix as input
Modify the dynamic programming to handle weights

Sec. 3.3.4

Using edit distances

Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
Intersect this set with the list of correct words
Show the terms you found to the user as suggestions
Alternatively,
We can look up all possible corrections in our inverted index and return all docs ... slow
We can run with a single most likely correction

The alternatives disempower the user, but save a round of interaction with the user

Sec. 3.3.4

Edit distance to all dictionary terms?

Given a (mis-spelled) query, do we compute its edit distance to every dictionary term?
Expensive and slow
Alternative?

How do we cut the set of candidate dictionary terms?
One possibility is to use n-gram overlap for this
This can also be used by itself for spelling correction.

Sec. 3.3.4

n-gram overlap
Enumerate all the n-grams in the query string as well as in the lexicon
Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
Threshold by the number of matching n-grams
Variants: weight by keyboard layout, etc.

Sec. 3.3.4

Example with trigrams

Suppose the text is november
Trigrams are nov, ove, vem, emb, mbe, ber.

The query is december
Trigrams are dec, ece, cem, emb, mbe, ber.

So 3 trigrams overlap (of 6 in each term)
How can we turn this into a normalized measure of overlap?

Sec. 3.3.4

One option - Jaccard coefficient

A commonly-used measure of overlap
Let X and Y be two sets; then the J.C. is
|X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements and zero when they are disjoint
X and Y don't have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if J.C. > 0.8, declare a match
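A minimal sketch combining the trigram example with the Jaccard coefficient (the helper names are my own; no boundary markers used here):

def trigrams(term):
    # The set of trigrams of a term.
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x, y):
    # |X intersect Y| / |X union Y|, always between 0 and 1.
    return len(x & y) / len(x | y)

print(jaccard(trigrams("november"), trigrams("december")))  # 3/9 = 0.33...

With the 0.8 threshold above, december (J.C. ≈ 0.33) would not be declared a match for november.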

Sec. 3.3.4

Matching trigrams
Consider the query lord - we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo -> alone, lord, sloth
or -> border, lord, morbid
rd -> ardent, border, card

A standard postings merge will enumerate the terms appearing in at least 2 of the lists (here lord and border).
Adapt this to using the Jaccard (or another) measure.

Sec. 3.3.5

Context-sensitive spell correction

Text: I flew from Heathrow to Narita.
Consider the phrase query "flew form Heathrow"
We'd like to respond
Did you mean "flew from Heathrow"?
because no docs matched the query phrase.

Sec. 3.3.5

Context-sensitive correction
Need surrounding context to catch this.
First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
Now try all possible resulting phrases with one word "fixed" at a time
flew from heathrow
fled form heathrow
flea form heathrow

Hit-based spelling correction: suggest the alternative that has lots of hits.

Sec. 3.3.5

Exercise
Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form and 3 for heathrow.
How many "corrected" phrases will we enumerate in this scheme?

Sec. 3.3.5

Another approach
Break the phrase query into a conjunction of biwords (Lecture 2).
Look for biwords that need only one term corrected.
Enumerate phrase matches and ... rank them!

Sec. 3.3.5

General issues in spell correction

We enumerate multiple alternatives for "Did you mean?"
Need to figure out which to present to the user
Use heuristics
The alternative hitting most docs
Query log analysis + tweaking
For especially popular, topical queries

Spell-correction is computationally expensive
Avoid running it routinely on every query?
Run it only on queries that matched few docs

SOUNDEX

Sec. 3.4

Soundex
Class of heuristics to expand a query into phonetic equivalents
Language specific - mainly for names
E.g., chebyshev -> tchebycheff

Invented for the U.S. census in 1918

Sec. 3.4

Soundex - typical algorithm

Turn every token to be indexed into a 4-character reduced form
Do the same with query terms
Build and search an index on the reduced forms
(when the query calls for a soundex match)

http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

Sec. 3.4

Soundex - typical algorithm

1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   B, F, P, V -> 1
   C, G, J, K, Q, S, X, Z -> 2
   D, T -> 3
   L -> 4
   M, N -> 5
   R -> 6

Sec. 3.4

Soundex continued
4. Remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.
Will hermann generate the same code?
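A minimal Python transcription of the steps above (a sketch, not a production soundex; assumes ASCII input):

def soundex(word):
    # Build the letter-to-digit table from steps 2 and 3.
    codes = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"),
                           ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    digits = word[0] + "".join(codes[ch] for ch in word[1:] if ch in codes)
    # Step 4: remove one of each pair of consecutive identical digits.
    deduped = digits[0]
    for ch in digits[1:]:
        if ch != deduped[-1]:
            deduped += ch
    # Step 5: drop zeros; step 6: pad with zeros and keep 4 positions.
    deduped = deduped[0] + deduped[1:].replace("0", "")
    return (deduped + "000")[:4]

print(soundex("Herman"), soundex("hermann"))  # H655 H655 - yes, the same code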

Sec. 3.4

Soundex
Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
How useful is soundex?
Not very - for information retrieval
Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities
Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR

What queries can we process?

We have
Positional inverted index with skip pointers
Wild-card index
Spell-correction
Soundex

Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

INDEX GENERATION

Ch. 4

Index construction
How do we construct an index?
What strategies can we use with limited main memory?

Sec. 4.1

Hardware basics
Many design decisions in information retrieval are based on the characteristics of hardware
We begin by reviewing hardware basics

Sec. 4.1

Hardware basics
Access to data in memory is much faster than access to data on disk.
Disk seeks: no data is transferred from disk while the disk head is being positioned.
Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
Block sizes: 8 KB to 256 KB.

Sec. 4.1

Hardware basics
Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
Available disk space is several (2-3) orders of magnitude larger.
Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.
Google is particularly famous for combining standard hardware in shipping containers.

Sec. 4.1

Hardware assumptions

symbol | statistic                                         | value
s      | average seek time                                 | 5 ms = 5 x 10^-3 s
b      | transfer time per byte                            | 0.02 μs = 2 x 10^-8 s
       | processor's clock rate                            | 10^9 s^-1
p      | low-level operation (e.g., compare & swap a word) | 0.01 μs = 10^-8 s
       | size of main memory                               | several GB
       | size of disk space                                | 1 TB or more

Sec. 4.2

RCV1: Our collection for this lecture

Shakespeare's collected works definitely aren't large enough for demonstrating many of the points in this course.
The collection we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
This is one year of Reuters newswire (part of 1996 and 1997).

Sec. 4.2

A Reuters RCV1 document

(Figure: a sample RCV1 newswire article.)

Sec. 4.2

Reuters RCV1 statistics

symbol | statistic                                      | value
N      | documents                                      | 800,000
L      | avg. # tokens per doc                          | 200
M      | terms (= word types)                           | ~400,000
       | avg. # bytes per token (incl. spaces/punct.)   | 6
       | avg. # bytes per token (without spaces/punct.) | 4.5
       | avg. # bytes per term                          | 7.5
       | non-positional postings                        | 100,000,000

Sec. 4.2

Recall IIR 1 index construction

Documents are parsed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

This yields the sequence of (term, docID) pairs:
(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
(so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

Sec. 4.2

Key step
After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step. We have 100M items to sort.

Sorted by term (then docID):
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2)
(julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)

Sec. 4.2

Scaling index construction

In-memory index construction does not scale.
How can we construct an index for very large collections?
Taking into account the hardware constraints we just learned about: memory, disk, speed, etc.

Sec. 4.2

Sort-based index construction

As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex)
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
T = 100,000,000 in the case of RCV1
So we can do this in memory in 2009, but typical collections are much larger.
E.g., the New York Times provides an index of >150 years of newswire
Thus: we need to store intermediate results on disk.

Sec. 4.2

Use the same algorithm for disk?

Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
No: sorting T = 100,000,000 records on disk is too slow - too many disk seeks.
We need an external sorting algorithm.

Sec. 4.2

Bottleneck
Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc within each term)
Doing this with random disk seeks would be too slow - must sort T = 100M records

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?

Sec. 4.2

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)

12-byte (4+4+4) records (term, doc, freq).
These are generated as we parse docs.
Must now sort 100M such 12-byte records by term.
Define a Block ~ 10M such records
Can easily fit a couple into memory.
Will have 10 such blocks to start with.

Basic idea of the algorithm (sketched below):
Accumulate postings for each block, sort, write to disk.
Then merge the blocks into one long sorted order.
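A minimal in-memory simulation of the BSBI idea, assuming a list of (term, docID) pairs (file handling and the termID mapping are omitted):

import heapq

def bsbi(pairs, block_size):
    # Blocked sort-based indexing, simulated in memory:
    # sort fixed-size blocks (the "runs written to disk"), then merge them.
    runs = []
    for start in range(0, len(pairs), block_size):
        runs.append(sorted(pairs[start:start + block_size]))
    index = {}
    for term, doc in heapq.merge(*runs):   # n-way merge of the sorted runs
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc:
            postings.append(doc)           # docIDs stay sorted and unique
    return index

pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("noble", 2)]
print(bsbi(pairs, block_size=2))
# {'brutus': [1], 'caesar': [1, 2], 'noble': [2]}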


Sec. 4.2

Sorting 10 blocks of 10M records

First, read each block and sort within:
Quicksort takes 2N ln N expected steps
In our case 2 x (10M ln 10M) steps

Exercise: estimate the total time to read each block from disk and quicksort it.
10 times this estimate gives us 10 sorted runs of 10M records each.
Done straightforwardly, we need 2 copies of the data on disk
But we can optimize this


Sec. 4.2

How to merge the sorted runs?

Can do binary merges, with a merge tree of log2(10) ≈ 4 layers.
During each layer, read into memory runs in blocks of 10M, merge, write back.

(Figure: two sorted runs on disk being merged into a single merged run.)

Sec. 4.2

How to merge the sorted runs?

But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously
Providing you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you're not killed by disk seeks

Sec. 4.3

Remaining problem with sort-based algorithm

Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
Actually, we could work with (term, docID) postings instead of (termID, docID) postings . . .
. . . but then intermediate files become very large.
(We would end up with a scalable, but very slow index construction method.)

Sec. 4.3

SPIMI: Single-pass in-memory indexing

Key idea 1: generate separate dictionaries for each block - no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.

Sec. 4.3

SPIMI-Invert

(Figure: the SPIMI-Invert pseudocode; a sketch follows below.)

Merging of blocks is analogous to BSBI.
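A minimal sketch of the SPIMI-Invert idea for one block (my own simplification of the book's pseudocode; the real algorithm writes the block to disk when memory is exhausted):

def spimi_invert(token_stream, max_postings):
    # Build one block's index: no global termID map, no sorting of postings.
    index = {}
    count = 0
    for term, doc_id in token_stream:
        postings = index.setdefault(term, [])  # new terms added on the fly
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)            # docIDs arrive in order
            count += 1
        if count >= max_postings:
            break                              # "memory full": flush this block
    # Sort terms only when writing out the block, so merging blocks is easy.
    return dict(sorted(index.items()))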

Sec. 4.3

SPIMI: Compression
Compression makes SPIMI even more efficient.
Compression of terms
Compression of postings

Sec. 4.4

Distributed indexing
For web-scale indexing (don't try this at home!): must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail

How do we exploit such a pool of machines?

Sec. 4.4

Google data centers

Google data centers mainly contain commodity machines.
Data centers are distributed around the world.
Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
Estimate: Google installs 100,000 servers each quarter.
Based on expenditures of 200-250 million dollars per year

This would be 10% of the computing capacity of the world!?!

Sec. 4.4

Google data centers

If in a non-fault-tolerant system with 1000 nodes each node has 99.9% uptime, what is the uptime of the system (all nodes up)?
Answer: 0.999^1000 ≈ 37% (i.e., with probability ≈ 63%, at least one node is down at any given time)
Calculate the number of servers failing per minute for an installation of 1 million servers.

Sec. 4.4

Distributed indexing
Maintain a master machine directing the indexing job - considered "safe".
Break up indexing into sets of (parallel) tasks.
The master machine assigns each task to an idle machine from a pool.

Sec. 4.4

Parallel tasks
We will use two sets of parallel tasks
Parsers
Inverters

Break the input document collection into splits
Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)

Sec. 4.4

Parsers
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term, doc) pairs
Parser writes the pairs into j partitions
Each partition is for a range of terms' first letters
(e.g., a-f, g-p, q-z) - here j = 3.

Now to complete the index inversion

Sec. 4.4

Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts them and writes to postings lists

Sec. 4.4

Data flow

(Figure: the input is broken into splits, which the Master assigns to Parsers. In the map phase, each Parser writes its (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z). In the reduce phase, one Inverter per term range reads all segment files for its range and writes out the postings.)

Sec. 4.4

MapReduce
The index construction algorithm we just described is an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing without having to write code for the distribution part.
They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.

Sec. 4.4

MapReduce
Index construction was just one phase.
Another phase: transforming a term-partitioned index into a document-partitioned index.
Term-partitioned: one machine handles a subrange of terms
Document-partitioned: one machine handles a subrange of documents

As we discuss later in the course, most search engines use a document-partitioned index - better load balancing, etc.

Sec. 4.4

Schema for index construction in MapReduce

Schema of map and reduce functions:
map: input -> list(k, v)
reduce: (k, list(v)) -> output

Instantiation of the schema for index construction:
map: web collection -> list(termID, docID)
reduce: (<termID1, list(docID)>, <termID2, list(docID)>, ...) -> (postings list1, postings list2, ...)

Example for index construction (C abbreviates Caesar):
map: d2: "C died." d1: "C came, C ced." -> (<C,d2>, <died,d2>, <C,d1>, <came,d1>, <C,d1>, <ced,d1>)
reduce: (<C,(d2,d1,d1)>, <died,(d2)>, <came,(d1)>, <ced,(d1)>) -> (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <ced,(d1:1)>)
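A toy, single-process rendering of that schema in Python, just to make the shapes of map and reduce concrete (no actual distribution; function names are my own):

from collections import defaultdict

def map_fn(doc_id, text):
    # map: document -> list of (term, docID) pairs.
    return [(term, doc_id) for term in text.split()]

def reduce_fn(term, doc_ids):
    # reduce: (term, list of docIDs) -> posting list with frequencies.
    counts = defaultdict(int)
    for d in doc_ids:
        counts[d] += 1
    return term, sorted(counts.items())

# "Shuffle" phase: group the mapped pairs by key (term).
grouped = defaultdict(list)
for doc_id, text in [("d1", "C came C ced"), ("d2", "C died")]:
    for term, d in map_fn(doc_id, text):
        grouped[term].append(d)

print([reduce_fn(term, ds) for term, ds in grouped.items()])
# [('C', [('d1', 2), ('d2', 1)]), ('came', [('d1', 1)]), ...]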

Sec. 4.5

Dynamic indexing
Up to now, we have assumed that collections are static.
They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.

This means that the dictionary and postings lists have to be modified:
Postings updates for terms already in the dictionary
New terms added to the dictionary

Sec. 4.5

Simplest approach
Maintain a big main index
New docs go into a small auxiliary index
Search across both, merge results
Deletions:
Invalidation bit-vector for deleted docs
Filter docs output on a search result by this invalidation bit-vector

Periodically, re-index into one main index

Sec. 4.5

Issues with main and auxiliary indexes

Problem of frequent merges - you touch stuff a lot
Poor performance during merge
Actually:
Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
The merge is then the same as a simple append.
But then we would need a lot of files - inefficient for the O/S.

Assumption for the rest of the lecture: the index is one big file.
In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)

Sec. 4.5

Logarithmic merge
Maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory
Larger ones (I0, I1, ...) on disk
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) to form Z1
Then either write Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, and so on (a sketch follows below).
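A minimal sketch of the cascade, with each index represented as a sorted list of entries just to show the merge discipline (my own simplification; real indexes merge postings lists on disk):

class LogarithmicMerger:
    # Z0 in memory; I0, I1, ... "on disk", each twice the previous size.
    def __init__(self, n):
        self.n = n            # capacity of the in-memory index Z0
        self.z0 = []          # in-memory index
        self.disk = []        # disk[i] is I_i (or None if absent)

    def add(self, entry):
        self.z0.append(entry)
        if len(self.z0) > self.n:
            self._flush()

    def _flush(self):
        z = sorted(self.z0)   # current Z_i being pushed down the cascade
        self.z0 = []
        i = 0
        while i < len(self.disk) and self.disk[i] is not None:
            z = sorted(z + self.disk[i])   # merge Z_i with I_i to form Z_{i+1}
            self.disk[i] = None
            i += 1
        if i == len(self.disk):
            self.disk.append(None)
        self.disk[i] = z      # write Z_i to disk as I_i

    def search_order(self):
        # A query must consult Z0 and every existing I_i.
        return [self.z0] + [ix for ix in self.disk if ix is not None]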


Sec. 4.5

Logarithmic merge
Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T)
So logarithmic merge is much more efficient for index construction
But query processing now requires the merging of O(log T) indexes
Whereas it is O(1) if you just have a main and auxiliary index

Sec. 4.5

Further issues with multiple indexes

Collection-wide statistics are hard to maintain
E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?
We said, pick the one with the most hits

How do we maintain the top ones with multiple indexes and invalidation bit vectors?
One possibility: ignore everything but the main index for such ordering

We will see more such statistics used in results ranking

Sec. 4.5

Dynamic indexing at search engines

All the large search engines now do dynamic indexing
Their indices have frequent incremental changes
News items, blogs, new topical web pages
Sarah Palin, ...

But (sometimes/typically) they also periodically reconstruct the index from scratch
Query processing is then switched to the new index, and the old index is deleted


Sec. 4.5

Other sorts of indexes

Positional indexes
Same sort of sorting problem ... just larger

Building character n-gram indexes:
As text is parsed, enumerate n-grams.
For each n-gram, we need pointers to all dictionary terms containing it - the "postings".
Note that the same postings entry will arise repeatedly in parsing the docs - we need efficient hashing to keep track of this.
Why? E.g., the trigram uou in the term deciduous will be discovered on each text occurrence of deciduous, but we only need to process each term once.

INDEX COMPRESSION

Ch. 5

Compressing Indexes

Collection statistics in more detail (with RCV1)
How big will the dictionary and postings be?

Dictionary compression
Postings compression

Ch. 5

Why compression (in general)?

Use less disk space
Saves a little money

Keep more stuff in memory
Increases speed

Increase the speed of data transfer from disk to memory
[read compressed data | decompress] is faster than [read uncompressed data]
Premise: decompression algorithms are fast
True of the decompression algorithms we use

Ch. 5

Why compression for inverted indexes?

Dictionary
Make it small enough to keep in main memory
Make it so small that you can keep some postings lists in main memory too

Postings file(s)
Reduce the disk space needed
Decrease the time needed to read postings lists from disk
Large search engines keep a significant part of the postings in memory.
Compression lets you keep more in memory

We will devise various IR-specific compression schemes

Sec. 5.1

Recall Reuters RCV1

symbol | statistic                                      | value
N      | documents                                      | 800,000
L      | avg. # tokens per doc                          | 200
M      | terms (= word types)                           | ~400,000
       | avg. # bytes per token (incl. spaces/punct.)   | 6
       | avg. # bytes per token (without spaces/punct.) | 4.5
       | avg. # bytes per term                          | 7.5
       | non-positional postings                        | 100,000,000

Sec. 5.1

Index parameters vs. what we index (details: IIR Table 5.1, p.80)

                 dictionary             non-positional postings   positional postings
                 size (K)  Δ%  cumul%   size (K)  Δ%  cumul%      size (K)  Δ%  cumul%
Unfiltered       484                    109,971                   197,879
No numbers       474       -2   -2      100,680   -8   -8         179,158   -9   -9
Case folding     392      -17  -19       96,969   -3  -12         179,158    0   -9
30 stopwords     391       -0  -19       83,390  -14  -24         121,858  -31  -38
150 stopwords    391       -0  -19       67,002  -30  -39          94,517  -47  -52
stemming         322      -17  -33       63,812   -4  -42          94,517    0  -52

Exercise: give intuitions for all the 0 entries. Why do some zero entries correspond to big deltas in other columns?

Sec. 5.1

Lossless vs. lossy compression

Lossless compression: all information is preserved.
What we mostly do in IR.

Lossy compression: discard some information
Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top k list for any query.
Almost no loss of quality for the top k list.

Sec. 5.1

Vocabulary vs. collection size

How big is the term vocabulary?
That is, how many distinct words are there?

Can we assume an upper bound?
Not really: at least 70^20 ≈ 10^37 different words of length 20

In practice, the vocabulary will keep growing with the collection size
Especially with Unicode

Sec. 5.1

Vocabulary vs. collection size

Heaps' law: M = k T^b
M is the size of the vocabulary, T is the number of tokens in the collection
Typical values: 30 <= k <= 100 and b ≈ 0.5
In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about 1/2
It is the simplest possible relationship between the two in log-log space
An empirical finding ("empirical law")

Sec. 5.1

Heaps' Law
For RCV1, the dashed line
log10 M = 0.49 log10 T + 1.64
is the best least-squares fit.
Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms

(Fig 5.1, p.81)
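A quick arithmetic check of that prediction (nothing assumed beyond k ≈ 44 and b = 0.49):

k, b = 44, 0.49
T = 1_000_020
print(round(k * T ** b))  # ≈ 38,322 predicted terms (slide: 38,323; observed: 38,365)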

Sec. 5.1

Exercises
What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps' law?
Compute the vocabulary size M for this scenario:
Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
Assume a search engine indexes a total of 20,000,000,000 (2 x 10^10) pages, containing 200 tokens on average
What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?

Sec. 5.1

Zipf's law
Heaps' law gives the vocabulary size in collections.
We also study the relative frequencies of terms.
In natural language, there are a few very frequent terms and very many very rare terms.
Zipf's law: the ith most frequent term has frequency proportional to 1/i.
cf_i ∝ 1/i = K/i, where K is a normalizing constant
cf_i is collection frequency: the number of occurrences of the term t_i in the collection.

Sec. 5.1

Zipf consequences
If the most frequent term (the) occurs cf1 times
then the second most frequent term (of) occurs cf1/2 times
and the third most frequent term (and) occurs cf1/3 times ...

Equivalently: cf_i = K/i where K is a normalizing factor, so
log cf_i = log K - log i
Linear relationship between log cf_i and log i

Another power law relationship

Sec. 5.1

Zipf's law for Reuters RCV1

(Figure: log-log plot of collection frequency vs. rank for RCV1.)

Ch. 5

Compression
Now, we will consider compressing the space for the dictionary and postings
Basic Boolean index only
No study of positional indexes, etc.
We will consider compression schemes

Sec. 5.2

DICTIONARY COMPRESSION

Sec. 5.2

Why compress the dictionary?

Search begins with the dictionary
We want to keep it in memory
Memory footprint competition with other applications
Embedded/mobile devices may have very little memory
Even if the dictionary isn't in memory, we want it to be small for a fast search startup time
So, compressing the dictionary is important

Sec. 5.2

Dictionary storage - first cut

Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.

Term (20 bytes) | Freq. (4 bytes) | Postings ptr. (4 bytes)
a               | 656,265         | ->
aachen          | 65              | ->
....            | ....            | ....
zulu            | 221             | ->

(plus a dictionary search structure on top)

Sec. 5.2

Fixed-width terms are wasteful

Most of the bytes in the Term column are wasted - we allot 20 bytes even for 1-letter terms.
And we still can't handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.

Written English averages ~4.5 characters/word.
Exercise: why is/isn't this the number to use for estimating the dictionary size?

The average dictionary word in English is ~8 characters
Short words dominate token counts but not type average.

How do we use ~8 characters per dictionary term?

Sec. 5.2

Compressing the term list: Dictionary-as-a-String

Store the dictionary as a (long) string of characters:
The pointer to the next word shows the end of the current word
Hope to save up to 60% of dictionary space.

....systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo....

Each entry now holds Freq. (e.g., 33, 29, 44, 126, ...), a postings ptr., and a term ptr. into the string.

Total string length = 400K x 8B = 3.2 MB
Pointers resolve 3.2M positions: log2(3.2M) = 22 bits = 3 bytes

Sec. 5.2

Space for dictionary as a string

4 bytes per term for Freq.
4 bytes per term for the pointer to Postings.
3 bytes per term pointer (into the string)
Avg. 8 bytes per term in the term string
Now avg. 11 bytes/term for the term itself, not 20.
400K terms x 19 bytes = 7.6 MB (against 11.2 MB for fixed width)

Sec. 5.2

Blocking
Store pointers to every kth term string.
Example below: k=4.
Need to store term lengths (1 extra byte each)

....7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo....

Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.

Sec. 5.2

Net
Example for block size k = 4
Where we used 3 bytes/pointer without blocking
3 x 4 = 12 bytes,
now we use 3 + 4 = 7 bytes.
Shaved another ~0.5 MB: this reduces the size of the dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.
Why not go with larger k?

Sec. 5.2

Exercise
Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.

Sec. 5.2

Dictionary search without blocking

Assuming each dictionary term is equally likely in queries (not really so in practice!), the average number of comparisons is (1 + 2*2 + 4*3 + 4)/8 ≈ 2.6

Exercise: what if the frequencies of query terms were non-uniform but known - how would you structure the dictionary search tree?

Sec. 5.2

Dictionary search with blocking

Binary search down to a 4-term block;
then linear search through the terms in the block.

Blocks of 4 (binary tree), avg. = (1 + 2*2 + 2*3 + 2*4 + 5)/8 = 3 compares

Sec. 5.2

Exercise
Estimate the impact on search performance (and the slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.

Sec. 5.2

Front coding
Sorted words commonly have a long common prefix - store differences only (for the last k-1 in a block of k):

8automata 8automate 9automatic 10automation
-> 8automat*a 1⋄e 2⋄ic 3⋄ion

"automat" is encoded once (* marks the end of the common prefix); each following entry stores only its extra length beyond "automat" and the differing suffix (⋄ marks where the omitted prefix goes).

Begins to resemble general string compression.

Sec. 5.2

RCV1 dictionary compression summary

Technique                                        | Size in MB
Fixed width                                      | 11.2
Dictionary-as-String with pointers to every term | 7.6
Also, blocking k = 4                             | 7.1
Also, blocking + front coding                    | 5.9

Sec. 5.3

POSTINGS COMPRESSION

Sec. 5.3

Postings compression
The postings file is much larger than the dictionary - by a factor of at least 10.
Key desideratum: store each posting compactly.
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
Alternatively, we can use log2(800,000) ≈ 20 bits per docID.
Our goal: use a lot less than 20 bits per docID.

Sec. 5.3

Postings: two conflicting forces

A term like arachnocentric occurs in maybe one doc out of a million - we would like to store this posting using log2(1M) ≈ 20 bits.
A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.
Prefer a 0/1 bitmap vector in this case

Sec. 5.3

Postings file entry

We store the list of docs containing a term in increasing order of docID.
computer: 33, 47, 154, 159, 202 ...

Consequence: it suffices to store gaps.
33, 14, 107, 5, 43 ...

Hope: most gaps can be encoded/stored with far fewer than 20 bits.
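The gap transformation in two small Python helpers (a direct transcription of the example above):

def to_gaps(doc_ids):
    # Store the first docID, then differences between consecutive docIDs.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    # Rebuild the docIDs by summing the gaps back up.
    doc_ids = [gaps[0]]
    for g in gaps[1:]:
        doc_ids.append(doc_ids[-1] + g)
    return doc_ids

print(to_gaps([33, 47, 154, 159, 202]))  # [33, 14, 107, 5, 43]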

Sec. 5.3

Three postings entries

(Figure: example postings lists for three terms and their gap encodings.)

Sec. 5.3

Variable length encoding

Aim:
For arachnocentric, we will use ~20 bits/gap entry.
For the, we will use ~1 bit/gap entry.

If the average gap for a term is G, we want to use ~log2(G) bits/gap entry.
Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
This requires a variable length encoding
Variable length codes achieve this by using short codes for small numbers

Sec. 5.3

Variable Byte (VB) codes

For a gap value G, we want to use close to the fewest bytes needed to hold log2(G) bits
Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c
If G <= 127, binary-encode it in the 7 available bits and set c = 1
Else encode G's lower-order 7 bits and then use additional bytes to encode the higher-order bits using the same algorithm
At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0).

Sec. 5.3

Example

docIDs:  824                  829        215406
gaps:    -                    5          214577
VB code: 00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation
000001101011100010000101000011010000110010110001

Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
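A minimal VB encoder/decoder in Python (gap lists in, byte strings out; bit-fiddling kept simple):

def vb_encode_number(g):
    # Chunks of 7 bits; the continuation bit is set on the last byte.
    bytes_out = []
    while True:
        bytes_out.insert(0, g % 128)
        if g < 128:
            break
        g //= 128
    bytes_out[-1] += 128          # set continuation bit c=1 on the last byte
    return bytes(bytes_out)

def vb_decode(stream):
    # Decode a concatenation of VB-encoded numbers.
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

encoded = b"".join(vb_encode_number(g) for g in [824, 5, 214577])
print(encoded.hex())       # 06b8850d0cb1 - matches the bit strings above
print(vb_decode(encoded))  # [824, 5, 214577]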

Sec. 5.3

Other variable unit codes

Instead of bytes, we can also use a different "unit of alignment": 32 bits (words), 16 bits, 4 bits (nibbles).
Variable byte alignment wastes space if you have many small gaps - nibbles do better in such cases.
Variable byte codes:
Used by many commercial/research systems
Good low-tech blend of variable-length coding and sensitivity to computer memory alignment (vs. bit-level codes, which we look at next).

There is also recent work on word-aligned codes that pack a variable number of gaps into one word

Unary code
Represent n as n 1s followed by a final 0.
The unary code for 3 is 1110.
The unary code for 40 is 11111111111111111111111111111111111111110.
The unary code for 80 is 80 ones followed by a 0.
This doesn't look promising, but ...

Sec. 5.3

Gamma codes
We can compress better with bit-level codes
The Gamma code is the best known of these.

Represent a gap G as a pair: length and offset
offset is G in binary, with the leading bit cut off
For example 13 -> 1101 -> 101

length is the length of offset
For 13 (offset 101), this is 3.

We encode length with unary code: 1110.
The Gamma code of 13 is the concatenation of length and offset: 1110101

Sec. 5.3

Gamma code examples

number | length      | offset     | γ-code
0      |             |            | none
1      | 0           |            | 0
2      | 10          | 0          | 10,0
3      | 10          | 1          | 10,1
4      | 110         | 00         | 110,00
9      | 1110        | 001        | 1110,001
13     | 1110        | 101        | 1110,101
24     | 11110       | 1000       | 11110,1000
511    | 111111110   | 11111111   | 111111110,11111111
1025   | 11111111110 | 0000000001 | 11111111110,0000000001
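A minimal gamma encoder in Python (emits a string of bits rather than packed bits, to keep the sketch readable):

def gamma_encode(g):
    # Gamma code: unary length of the offset, then the offset itself.
    assert g >= 1, "gamma cannot encode 0"
    offset = bin(g)[3:]                 # binary with the leading 1 cut off
    length = "1" * len(offset) + "0"    # unary code for len(offset)
    return length + offset

for g in [1, 2, 13, 1025]:
    print(g, gamma_encode(g))
# 1 0
# 2 100
# 13 1110101
# 1025 111111111100000000001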

Sec. 5.3

Gamma code properties

G is encoded using 2 floor(log2 G) + 1 bits
The length of offset is floor(log2 G) bits
The length of length is floor(log2 G) + 1 bits

All gamma codes have an odd number of bits
Almost within a factor of 2 of the best possible, log2 G
Gamma code is uniquely prefix-decodable, like VB
Gamma code can be used for any distribution
Gamma code is parameter-free

Sec. 5.3

Gamma seldom used in practice

Machines have word boundaries - 8, 16, 32, 64 bits
Operations that cross word boundaries are slower

Compressing and manipulating at the granularity of bits can be slow
Variable byte encoding is aligned and thus potentially more efficient
Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

Sec. 5.3

RCV1 compression

Data structure                          | Size in MB
dictionary, fixed-width                 | 11.2
dictionary, term pointers into string   | 7.6
with blocking, k = 4                    | 7.1
with blocking & front coding            | 5.9
collection (text, xml markup etc)       | 3,600.0
collection (text)                       | 960.0
Term-doc incidence matrix               | 40,000.0
postings, uncompressed (32-bit words)   | 400.0
postings, uncompressed (20 bits)        | 250.0
postings, variable byte encoded         | 116.0
postings, γ-encoded                     | 101.0

Sec. 5.3

Index compression summary

We can now create an index for highly efficient Boolean retrieval that is very space efficient
Only 4% of the total size of the collection
Only 10-15% of the total size of the text in the collection
However, we've ignored positional information
Hence, space savings are less for indexes used in practice
But the techniques are substantially the same.

SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL

Ch. 6

Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don't.

Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.

Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
Most users don't want to wade through 1000s of results.
This is particularly true of web search.

Ch. 6

Problem with Boolean search: feast or famine

Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" -> 200,000 hits
Query 2: "standard user dlink 650 no card found" -> 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
AND gives too few; OR gives too many

Ranked retrieval models

Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries and vice versa

Ch. 6

Feast or famine: not a problem in ranked retrieval

When a system produces a ranked result set, large result sets are not an issue
Indeed, the size of the result set is not an issue
We just show the top k (≈ 10) results
We don't overwhelm the user
Premise: the ranking algorithm works

Ch. 6

Scoring as the basis of ranked retrieval

We wish to return, in order, the documents most likely to be useful to the searcher
How can we rank-order the documents in the collection with respect to a query?
Assign a score - say in [0, 1] - to each document
This score measures how well the document and query "match".

Ch. 6

Query-document matching scores

We need a way of assigning a score to a query/document pair
Let's start with a one-term query
If the query term does not occur in the document: the score should be 0
The more frequent the query term in the document, the higher the score (should be)
We will look at a number of alternatives for this.

Ch. 6

Take 1: Jaccard coefficient

Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
Always assigns a number between 0 and 1.

Ch. 6

Jaccard coefficient: scoring example

What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: ides of march
Document 1: caesar died in march
Document 2: the long march

Ch. 6

Issues with Jaccard for scoring

It doesn't consider term frequency (how many times a term occurs in a document)
Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information
We need a more sophisticated way of normalizing for length
Later in this lecture, we'll use |A ∩ B| / sqrt(|A| · |B|) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Sec. 6.2

Recall (Lecture 1): binary term-document incidence matrix

term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 1 | 1 | 0 | 0 | 0 | 1
Brutus    | 1 | 1 | 0 | 1 | 0 | 0
Caesar    | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy     | 1 | 0 | 1 | 1 | 1 | 1
worser    | 1 | 0 | 1 | 1 | 1 | 0

Each document is represented by a binary vector in {0,1}^|V|

Sec. 6.2

Term-document count matrices

Consider the number of occurrences of a term in a document:
Each document is a count vector in N^|V|: a column below

term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 157 |  73 | 0 | 0 | 0 | 0
Brutus    |   4 | 157 | 0 | 1 | 0 | 0
Caesar    | 232 | 227 | 0 | 2 | 1 | 1
Calpurnia |   0 |  10 | 0 | 0 | 0 | 0
Cleopatra |  57 |   0 | 0 | 0 | 0 | 0
mercy     |   2 |   0 | 3 | 5 | 5 | 1
worser    |   2 |   0 | 1 | 1 | 1 | 0

Bag of words model

The vector representation doesn't consider the ordering of words in a document
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at "recovering" positional information later in this course.
For now: bag of words model

Term frequency tf
The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.

Relevance does not increase proportionally with term frequency.
NB: frequency = count in IR.

Sec. 6.2

Log-frequency weighting
The log frequency weight of term t in d is

w_t,d = 1 + log10(tf_t,d)  if tf_t,d > 0
w_t,d = 0                  otherwise

0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4, etc.
Score for a document-query pair: sum over terms t in both q and d:

score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

The score is 0 if none of the query terms is present in the document.

Sec. 6.2.1

Document frequency
Rare terms are more informative than frequent terms
Recall stop words
Consider a term in the query that is rare in the collection (e.g., arachnocentric)
A document containing this term is very likely to be relevant to the query arachnocentric
We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1

Document frequency, continued

Frequent terms are less informative than rare terms
Consider a query term that is frequent in the collection (e.g., high, increase, line)
A document containing such a term is more likely to be relevant than a document that doesn't
But it's not a sure indicator of relevance.
For frequent terms, we want high positive weights for words like high, increase, and line
But lower weights than for rare terms.
We will use document frequency (df) to capture this.

Sec. 6.2.1

idf weight
df_t is the document frequency of t: the number of documents that contain t
df_t is an inverse measure of the informativeness of t
df_t <= N

We define the idf (inverse document frequency) of t by
idf_t = log10(N / df_t)

We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
It will turn out that the base of the log is immaterial.

Sec. 6.2.1

idf example, suppose N = 1 million

idf_t = log10(N / df_t)

term      | df_t      | idf_t
calpurnia | 1         | 6
animal    | 100       | 4
sunday    | 1,000     | 3
fly       | 10,000    | 2
under     | 100,000   | 1
the       | 1,000,000 | 0

There is one idf value for each term t in a collection.

Effect of idf on ranking

Does idf have an effect on ranking for one-term queries, like
iPhone

idf has no effect on ranking one-term queries
idf affects the ranking of documents for queries with at least two terms
For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Sec. 6.2.1

Collection vs. Document frequency

The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example:

Word      | Collection frequency | Document frequency
insurance | 10440                | 3997
try       | 10422                | 8760

Which word is a better search term (and should get a higher weight)?

Sec. 6.2.2

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Best known weighting scheme in information retrieval
Note: the "-" in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
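The two weighting pieces in a few lines of Python (a direct transcription of the formulas above; function names are my own):

import math

def tf_weight(tf):
    # Log-frequency weight: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(N, df):
    # Inverse document frequency: log10(N/df).
    return math.log10(N / df)

def tf_idf(tf, N, df):
    return tf_weight(tf) * idf_weight(N, df)

print(idf_weight(1_000_000, 100))  # 4.0, as in the idf example table
print(tf_idf(10, 1_000_000, 100))  # (1 + 1) * 4 = 8.0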

Sec. 6.2.2

Final ranking of documents for a query

Score(q,d) = Σ_{t ∈ q ∩ d} tf-idf_t,d

Sec. 6.3

Binary -> count -> weight matrix

term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 5.25 | 3.18 | 0    | 0    | 0    | 0.35
Brutus    | 1.21 | 6.1  | 0    | 1    | 0    | 0
Caesar    | 8.59 | 2.54 | 0    | 1.51 | 0.25 | 0
Calpurnia | 0    | 1.54 | 0    | 0    | 0    | 0
Cleopatra | 2.85 | 0    | 0    | 0    | 0    | 0
mercy     | 1.51 | 0    | 1.9  | 0.12 | 5.25 | 0.88
worser    | 1.37 | 0    | 0.11 | 4.15 | 0.25 | 1.95

Each document is now represented by a real-valued vector of tf-idf weights in R^|V|

Sec. 6.3

Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
These are very sparse vectors - most entries are zero.

Sec. 6.3

Queries as vectors
Key idea 1: do the same for queries - represent them as vectors in the space
Key idea 2: rank documents according to their proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents

Sec. 6.3

Formalizing vector space proximity

First cut: distance between two points
(= distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.

Sec. 6.3

Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3

Use angle instead of distance

Thought experiment: take a document d and append it to itself. Call this document d'.
"Semantically" d and d' have the same content
The Euclidean distance between the two documents can be quite large
The angle between the two documents is 0, corresponding to maximal similarity.
Key idea: rank documents according to their angle with the query.

Sec. 6.3

From angles to cosines

The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document
Rank documents in increasing order of cosine(query, document)

Cosine is a monotonically decreasing function on the interval [0°, 180°]

Sec. 6.3

From angles to cosines

But how - and why - should we be computing cosines?

Sec. 6.3

Length normalization
A vector can be (length-) normalized by dividing each of its components by its length - for this we use the L2 norm:

||x||_2 = sqrt( Σ_i x_i^2 )

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights

Sec. 6.3

cosine(query, document)

cos(q,d) = (q · d) / (||q|| ||d||) = (q/||q||) · (d/||d||) = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i^2) · sqrt(Σ_{i=1..|V|} d_i^2) )

q · d is the dot product; q/||q|| and d/||d|| are unit vectors.
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
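A direct transcription in Python (dense vectors for clarity; real systems work with sparse postings):

import math

def cosine(q, d):
    # Cosine similarity of two equal-length weight vectors.
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0: same direction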

Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q,d) = q · d = Σ_{i=1..|V|} q_i d_i

for q, d length-normalized.

Cosine similarity illustrated

(Figure: unit vectors for documents and a query on the unit circle.)

Sec. 6.3

Cosine similarity amongst 3 documents

How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts):
term      | SaS | PaP | WH
affection | 115 | 58  | 20
jealous   | 10  | 7   | 11
gossip    | 2   | 0   | 6
wuthering | 0   | 0   | 38

Note: to simplify this example, we don't do idf weighting.

Sec. 6.3

3 documents example contd.

Log frequency weighting:
term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 | 0    | 1.78
wuthering | 0    | 0    | 2.58

After length normalization:
term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 | 0     | 0.405
wuthering | 0     | 0     | 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?

Sec. 6.3

Computing cosine scores

(Figure: the CosineScore pseudocode; a term-at-a-time sketch follows below.)
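The book's CosineScore algorithm accumulates scores term-at-a-time over the postings. A minimal sketch, assuming postings carry precomputed (docID, weight) pairs and document lengths are known from index time (data layout is my own simplification):

def cosine_score(query_weights, postings, doc_length, k=10):
    # query_weights: {term: w_t,q}
    # postings: {term: [(docID, w_t,d), ...]}
    # doc_length: {docID: ||d||}, precomputed at index time
    scores = {}
    for term, w_tq in query_weights.items():
        for doc, w_td in postings.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + w_tq * w_td
    for doc in scores:
        scores[doc] /= doc_length[doc]   # length-normalize
    return sorted(scores.items(), key=lambda x: -x[1])[:k]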

Sec. 6.4

tf-idf weighting has many variants

(Figure: the SMART notation table of tf, df, and normalization variants; columns headed "n" etc. are acronyms for weight schemes.)
Why is the base of the log in idf immaterial?

Sec. 6.4

Weighting may differ in queries vs documents

Many search engines allow for different weightings for queries vs. documents
SMART notation: denotes the combination in use in an engine with the notation ddd.qqq, using the acronyms from the previous table
A very standard weighting scheme is: lnc.ltc
Document: logarithmic tf (l as first character), no idf, and cosine normalization
(No idf for documents - a bad idea?)
Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization

Sec. 6.4

tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

           Query                                         Document
Term      | tf-raw | tf-wt | df    | idf | wt  | n'lize | tf-raw | tf-wt | wt  | n'lize | Prod
auto      | 0      | 0     | 5000  | 2.3 | 0   | 0      | 1      | 1     | 1   | 0.52   | 0
best      | 1      | 1     | 50000 | 1.3 | 1.3 | 0.34   | 0      | 0     | 0   | 0      | 0
car       | 1      | 1     | 10000 | 2.0 | 2.0 | 0.52   | 1      | 1     | 1   | 0.52   | 0.27
insurance | 1      | 1     | 1000  | 3.0 | 3.0 | 0.78   | 2      | 1.3   | 1.3 | 0.68   | 0.53

Exercise: what is N, the number of docs?

Doc length = sqrt(1^2 + 0^2 + 1^2 + 1.3^2) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8

Summary - vector space ranking

Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each document vector
Rank documents with respect to the query by score
Return the top K (e.g., K = 10) to the user