
CONTINUING WITH

INVERTED INDEX
Chapter 4
LET’S WALK THROUGH A SAMPLE TEXT

<d1, This is a cat. Cat sits on a roof.>

<d2, This roof is a tin roof. There is a tin can on the roof.>

<d3, Cat kicks the can. It rolls on the roof and falls on the next roof.>

<d4, This cat rolls too. It sits on the can.>


LET’S WALK THROUGH A SAMPLE TEXT: MAPPER OUTPUT

<d1, This is a cat. Cat sits on a roof.>
<this, <d1,1>> <is, <d1,1>> <cat, <d1,1>> <cat, <d1,1>> <sits, <d1,1>> <on, <d1,1>> <roof, <d1,1>>

<d2, This roof is a tin roof. There is a tin can on the roof.>
<this, <d2,1>> <roof, <d2,1>> <is, <d2,1>> <tin, <d2,1>> <roof, <d2,1>> <there, <d2,1>> <is, <d2,1>> <tin, <d2,1>> <can, <d2,1>> <on, <d2,1>> <roof, <d2,1>>

<d3, Cat kicks the can. It rolls on the roof and falls on the next roof.>
<cat, <d3,1>> <kicks, <d3,1>> <can, <d3,1>> <it, <d3,1>> <rolls, <d3,1>> <on, <d3,1>> <roof, <d3,1>> <and, <d3,1>> <falls, <d3,1>> <on, <d3,1>> <next, <d3,1>> <roof, <d3,1>>

<d4, This cat rolls too. It sits on the can.>
<this, <d4,1>> <cat, <d4,1>> <rolls, <d4,1>> <too, <d4,1>> <it, <d4,1>> <sits, <d4,1>> <on, <d4,1>> <can, <d4,1>>

Note that “a” and “the” do not appear in the output (treated here as stop words).
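
The baseline mapper in this walk-through can be sketched in Python as follows. This is only an illustration: the names `STOP_WORDS`, `tokenize` and `baseline_map` are made up for this sketch, and dropping “a” and “the” is an assumption inferred from the tuples shown on the slide.

```python
import re

# Assumption: "a" and "the" are stop words, since they never show up
# in the mapper output of the walk-through.
STOP_WORDS = {"a", "the"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, drop the stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def baseline_map(docid, text):
    """Emit one (term, (docid, 1)) pair per term occurrence."""
    return [(term, (docid, 1)) for term in tokenize(text)]

pairs = baseline_map("d1", "This is a cat. Cat sits on a roof.")
for term, posting in pairs:
    print(term, posting)
```

Because a pair is emitted per occurrence, “cat” appears twice for d1, each time with a count of 1.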
Let me give you one more document: d5

<d5, Cat and the roof fall down.>

Send it through the mapper and tell me what comes out.

What is your observation when you carry out this operation?


Assume that all pairs with the same key are delivered to the same reducer.
This is now the responsibility of the “custom” partitioner.

Still, the input to the reducer is sorted not by the value but by the key, so the postings for a term can arrive in any document order.

<this, <d1,1>> <is, <d1,1>> <cat, <d1,1>> <cat, <d1,1>> <sits, <d1,1>> <on, <d1,1>> <roof, <d1,1>>

<this, <d2,1>> <roof, <d2,1>> <is, <d2,1>> <tin, <d2,1>> <roof, <d2,1>> <there, <d2,1>> <is, <d2,1>> <tin, <d2,1>> <can, <d2,1>> <on, <d2,1>> <roof, <d2,1>>

<cat, <d3,1>> <kicks, <d3,1>> <can, <d3,1>> <it, <d3,1>> <rolls, <d3,1>> <on, <d3,1>> <roof, <d3,1>> <and, <d3,1>> <falls, <d3,1>> <on, <d3,1>> <next, <d3,1>> <roof, <d3,1>>

<this, <d4,1>> <cat, <d4,1>> <rolls, <d4,1>> <too, <d4,1>> <it, <d4,1>> <sits, <d4,1>> <on, <d4,1>> <can, <d4,1>>

REDUCER OUTPUT FOR THE INVERTED INDEX WALK THROUGH

<this, [<d1,1>, <d2,1>, <d4,1>]>

<roof, [<d1,1>, <d2,3>, <d3,2>]>

This is with the “sort” inside the reducer!
If the sort is taken out, the document numbers are not guaranteed to be in that order.
Also, this form of MR puts a heavy load on the in-memory shuffle!
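
A minimal sketch of the baseline reducer, assuming it receives every (docid, 1) value for one term in arbitrary order. The function name `baseline_reduce` and the sample arrival order are hypothetical. The point is that the whole postings list has to be held and sorted in memory.

```python
from collections import defaultdict

def baseline_reduce(term, postings):
    """Sum the counts per document, then sort the postings by docid in memory."""
    counts = defaultdict(int)
    for docid, tf in postings:
        counts[docid] += tf
    # Docids are compared as strings in this sketch.
    return term, sorted(counts.items())

# Values for the key "roof" as they might arrive, in no particular order.
roof_values = [("d3", 1), ("d2", 1), ("d1", 1), ("d2", 1), ("d3", 1), ("d2", 1)]
result = baseline_reduce("roof", roof_values)
print(result)  # ('roof', [('d1', 1), ('d2', 3), ('d3', 2)])
```

Without the call to `sorted`, the postings would come out in whatever order the values happened to arrive.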
INVERTED INDEX: REVISED IMPLEMENTATION
10/27/2021
• From baseline to an improved version
• Observe the sort done by the reducer. Is there any way to push this into the MR runtime?
• Instead of emitting (term t, posting <docid, f>), emit (tuple <t, docid>, tf f)
• This is our previously studied value-to-key conversion design pattern
• This switch ensures that the keys arrive in order at the reducer
• Small memory footprint; less buffer space needed at the reducer

• See Fig. 4.4
MODIFIED MAPPER

Map(docid n, doc d)
    H ← new AssociativeArray
    for all terms t in doc d
        H{t} ← H{t} + 1
    for all terms t in H
        emit(tuple <t, n>, H{t})

There is only one roof tuple from a document!
For example, from the mapper of d3, <<roof, d3>, 2> will be output to the reducer.
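
The modified mapper could be sketched in Python like this. It is an illustration only: `modified_map` and `STOP_WORDS` are hypothetical names, the stop-word list is an assumption taken from the walk-through, and `Counter` stands in for the associative array H.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the"}  # assumption, inferred from the walk-through

def modified_map(docid, text):
    """Emit one ((term, docid), tf) pair per distinct term in the document."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    H = Counter(words)  # plays the role of the associative array H
    return [((term, docid), tf) for term, tf in H.items()]

out = modified_map("d2", "This roof is a tin roof. There is a tin can on the roof.")
for key, tf in out:
    print(key, tf)
```

For d2 this yields a single tuple for “roof” with a term frequency of 3, instead of three separate tuples.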
MODIFIED REDUCER
Initialize
    tprev ← ∅
    P ← new PostingsList

method reduce(tuple <t, n>, tf [f])
    if (t ≠ tprev) ∧ (tprev ≠ ∅)        // this means a new term
        emit(term tprev, postings P)    // P holds the postings of tprev
        reset P
    P.add(<n, f>)
    tprev ← t

Close
    emit(term tprev, postings P)

What are Initialize and Close?

Observe that the reducer's emit(term, postings P) gets you the terms in the format you need for answering queries.
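
A minimal Python sketch of this reducer, assuming the framework delivers the composite keys sorted by term and then by docid. The class name `IndexReducer`, the `output` list (a stand-in for emit) and the sample data are hypothetical. `sorted()` is used here only to simulate the shuffle-and-sort phase.

```python
class IndexReducer:
    """Reducer that relies on keys arriving sorted by (term, docid)."""

    def __init__(self):
        # Initialize: runs once, before the first reduce call.
        self.t_prev = None
        self.postings = []
        self.output = []  # stands in for emit()

    def reduce(self, key, tfs):
        term, docid = key
        # A different term means the previous term's postings are complete.
        if self.t_prev is not None and term != self.t_prev:
            self.output.append((self.t_prev, self.postings))
            self.postings = []
        self.postings.append((docid, sum(tfs)))
        self.t_prev = term

    def close(self):
        # Close: runs once, after the last reduce call, to flush the last term.
        if self.t_prev is not None:
            self.output.append((self.t_prev, self.postings))

mapper_output = [(("roof", "d3"), 2), (("cat", "d1"), 2), (("roof", "d1"), 1),
                 (("cat", "d4"), 1), (("roof", "d2"), 3), (("cat", "d3"), 1)]

reducer = IndexReducer()
for key, tf in sorted(mapper_output):  # simulates the framework's sort
    reducer.reduce(key, [tf])
reducer.close()
print(reducer.output)
```

Only one term's postings are buffered at a time, and no sort is performed inside the reducer.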
IMPROVED MR FOR II

class Mapper
    method Map(docid n, doc d)
        H = new AssociativeArray
        for all term t in doc d do
            H[t] = H[t] + 1
        for all term t in H do
            Emit(tuple <t, n>, tf H[t])

class Reducer
    method Initialize
        tprev = ∅; P = new PostingsList
    method Reduce(tuple <t, n>, tf [f])
        if t <> tprev ^ tprev <> ∅ then
            Emit(term tprev, postings P)
            P.Reset()
        P.Add(<n, f>)
        tprev = t
    method Close
        Emit(term tprev, postings P)
ADVANTAGE OF MODIFIED INVERTED INDEX
• Pseudo-code of a scalable inverted indexing algorithm in MapReduce.
• By applying the value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they arrive sorted by document id in the reducer.
• Also, removing the sort from the reducer makes the reducer more efficient.
OTHER MODIFICATIONS

• Partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer
• Custom partitioner so that all tuples for a term t go to the same reducer.
• That is, you cannot split the postings of a term across reducers!
• Let's go through an example
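
One way such a custom partitioner could look, sketched in Python. The function name `partition` and the choice of `zlib.crc32` are assumptions for illustration; any hash that is stable across runs would do. The essential point is that only the term, not the docid, decides the reducer.

```python
import zlib

def partition(key, num_reducers):
    """Choose a reducer from the term alone, ignoring the docid in the key."""
    term, _docid = key
    # crc32 is used only because it is deterministic across runs.
    return zlib.crc32(term.encode("utf-8")) % num_reducers

keys = [("roof", "d1"), ("roof", "d2"), ("roof", "d3"),
        ("cat", "d1"), ("cat", "d4")]
for key in keys:
    print(key, "-> reducer", partition(key, 3))
```

Hashing the full composite key <t, docid> instead would scatter the tuples of one term over several reducers, and no single reducer could build that term's complete postings list.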

WHAT ABOUT RETRIEVAL?

• While MR is great for indexing, it is not great for retrieval.
