Inverted Indexing For Text Retrieval: Chapter 4 Lin and Dyer
INTRODUCTION
4 WEB CRAWLING
• An inverted index consists of postings lists, one associated with each term that appears in the corpus.
• <t, posting>n
• <t, <docid, tf> >n
• <t, <docid, tf, other info>>n
• Each entry is a key-value pair: the key is the term (word) and the value is the docid, followed by a “payload”
• The payload can be empty for a simple index
• The payload can be complex: it provides such details as co-occurrences, additional linguistic processing, page rank of the doc, etc.
• <t2, <d1, d4, d67, d89>>
• <t3, <d4, d6, d7, d9, d22>>
• Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on “page ranks” (Hmm, what is that?).
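A minimal sketch of the <t, <docid, tf>> structure described above, in Python. The two-document corpus and the whitespace tokenizer are assumptions made for illustration; they are not from the slides.

```python
from collections import defaultdict

# Hypothetical toy corpus: docid -> text (illustrative only).
docs = {
    1: "one fish two fish",
    2: "red fish blue fish",
}

# term -> postings list of (docid, tf) pairs, kept in docid order.
index = defaultdict(list)
for docid in sorted(docs):
    counts = defaultdict(int)
    for term in docs[docid].split():   # naive whitespace tokenizer
        counts[term] += 1
    for term, tf in counts.items():
        index[term].append((docid, tf))

print(index["fish"])  # [(1, 2), (2, 2)]
print(index["red"])   # [(2, 1)]
```

Here the payload is just the term frequency; a richer payload would replace the integer tf with a small record.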
RETRIEVAL
[Figure: index terms term1 through term10]
IMPLEMENTATION
• From Baseline to an improved version
• Observe the sort done by the Reducer. Is there any way to push this into the MR runtime?
• Instead of (term t, posting <docid, f>)
• Emit (tuple <t, docid>, tf f)
• This is our previously studied value-to-key conversion design pattern
• This switch ensures the postings arrive at the reducer already sorted by docid, because the MR runtime sorts the keys
• Small memory footprint; less buffer space needed at the reducer
• See Fig. 4.4
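The difference between the two key designs can be sketched as follows. This is an illustration only: Python's sorted() stands in for the MR runtime's shuffle-and-sort, and the emitted pairs are made-up values.

```python
# Baseline: key = term, value = (docid, tf). Values can arrive in any
# order, so the reducer must buffer them all and sort in memory.
def baseline_reduce(term, postings):
    return term, sorted(postings)

# Value-to-key conversion: key = (term, docid), value = tf.
# Sorting the keys already puts each term's postings in docid order,
# so the reducer never has to sort anything itself.
emitted = [
    (("roof", 3), 2),
    (("can", 2), 1),
    (("roof", 2), 3),
    (("can", 3), 1),
]
in_order = sorted(emitted)  # stand-in for the runtime's sort

print(baseline_reduce("roof", [(3, 2), (2, 3)]))
# ('roof', [(2, 3), (3, 2)])
print(in_order)
# [(('can', 2), 1), (('can', 3), 1), (('roof', 2), 3), (('roof', 3), 2)]
```

In a real job, the partitioner would also have to route all keys with the same term to the same reducer, since the key is now the (term, docid) tuple rather than the term alone.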
MODIFIED REDUCER
method Reduce(tuple <t, n>, tf [f1, ..])
    if (t ≠ tprev) ∧ (tprev ≠ ∅)      (this means a new term)
    { emit(term tprev, postings P);
      reset P; }
    P.add(<n, f>)
    tprev ← t
• What are Initialize and Close?
• Observe that the reducer's emit(term t, postings P) gets you the terms in the format you need for answering queries.
IMPROVED MR FOR II 10/27/2021
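class Mapper
    method Map(docid n; doc d)
        H ← new AssociativeArray
        for all term t ∈ doc d: H{t} ← H{t} + 1
        for all term t ∈ H: emit(tuple <t, n>, tf H{t})
class Reducer
    method Initialize
        tprev ← ∅; P ← new PostingsList
    method Reduce(tuple <t, n>; tf [f])
        if (t ≠ tprev) ∧ (tprev ≠ ∅) { emit(term tprev, postings P); reset P; }
        P.add(<n, f>)
        tprev ← t
    method Close
        emit(term tprev, postings P)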
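The mapper and reducer above can be simulated in plain Python to see what Initialize and Close are for. This is a sketch under assumptions, not a Hadoop job: the names map_doc, Reducer, and run are made up for illustration, sorted() stands in for shuffle-and-sort, and a single reducer sees all keys.

```python
from collections import Counter

def map_doc(docid, text):
    """Emit ((term, docid), tf) pairs for one document."""
    counts = Counter(text.lower().split())  # H in the pseudocode
    for term, tf in counts.items():
        yield (term, docid), tf

class Reducer:
    def initialize(self):
        # State that survives across reduce() calls.
        self.t_prev = None
        self.postings = []
        self.output = []  # stands in for emit()

    def reduce(self, key, tfs):
        term, docid = key
        # A change of term means the previous postings list is complete.
        if term != self.t_prev and self.t_prev is not None:
            self.output.append((self.t_prev, self.postings))
            self.postings = []
        self.postings.append((docid, tfs[0]))
        self.t_prev = term

    def close(self):
        # Flush the postings of the last term, which no later key triggers.
        if self.t_prev is not None:
            self.output.append((self.t_prev, self.postings))

def run(docs):
    pairs = [kv for docid, text in docs.items() for kv in map_doc(docid, text)]
    reducer = Reducer()
    reducer.initialize()
    for key, tf in sorted(pairs):  # stand-in for shuffle-and-sort
        reducer.reduce(key, [tf])
    reducer.close()
    return reducer.output

index = run({1: "a b a", 2: "b c"})
print(index)
# [('a', [(1, 2)]), ('b', [(1, 1), (2, 1)]), ('c', [(2, 1)])]
```

Without close(), the postings for the last term ('c' here) would never be emitted, because reduce() only emits when it sees the next term.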
LET’S WALK THROUGH A SAMPLE TEXT
<d2, The roof is a tin roof. There is a tin can on the roof.>
<d3, Cat kicks the can. It rolls on the roof and falls on the next roof.>
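A sketch of the walk-through on the two sample documents shown above. It assumes lowercasing, a letters-only tokenizer, and no stop-word removal; those choices are not specified in the slides. itertools.groupby plays the role of the reducer, and sorting the pairs stands in for the MR runtime.

```python
import re
from collections import Counter
from itertools import groupby

docs = {
    "d2": "The roof is a tin roof. There is a tin can on the roof.",
    "d3": "Cat kicks the can. It rolls on the roof and falls on the next roof.",
}

# Map: emit ((term, docid), tf) for every distinct term in each document.
pairs = []
for docid, text in docs.items():
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenizer
    for term, tf in Counter(tokens).items():
        pairs.append(((term, docid), tf))

# Shuffle and sort: ordering by (term, docid) stands in for the MR runtime.
pairs.sort()

# Reduce: consecutive keys with the same term form one postings list.
index = {
    term: [(docid, tf) for (_, docid), tf in group]
    for term, group in groupby(pairs, key=lambda kv: kv[0][0])
}

print(index["roof"])  # [('d2', 3), ('d3', 2)]
print(index["tin"])   # [('d2', 2)]
print(index["the"])   # [('d2', 2), ('d3', 3)]
```

Each postings list comes out in docid order without any explicit sort in the reduce step, which is the point of moving the docid into the key.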