
INVERTED INDEXING FOR TEXT RETRIEVAL
CHAPTER 4, LIN AND DYER
INTRODUCTION

• Web search is a quintessential large-data problem.
• So are any number of problems in genomics.
• Google, Amazon (AWS), and many others are involved in research and discovery in this area.
• Web search (full-text search) depends on a data structure called an inverted index.
• The web search problem breaks down into three major components:
• Gathering the web content (crawling) (pre-project 1; we did not do it)
• Construction of the inverted index (indexing)
• Ranking the documents given a query (retrieval)
ISSUES WITH THESE COMPONENTS

• Crawling and indexing have similar characteristics: resource consumption is high.
• Typically offline batch processing, except of course on the Twitter model.
• There are many requirements for a web crawler, or in general a data aggregator:
• Etiquette, bandwidth resources, multilingual content, duplicate content, frequency of changes…
• How often to collect: too infrequently may miss important updates; too often may gather too much information.
10/27/2021

WEB CRAWLING

• Start with a “seed” URL, say a Wikipedia page, and collect content by following the links in the seed page; the depth of traversal is also specified by the input.
• What are the issues?
• See page 67.
RETRIEVAL

• Retrieval is an online problem that demands stringent timings: sub-second response times.
• Concurrent queries
• Query latency
• Load on the servers
• Other circumstances: time of day
• Resource consumption can be spiky or highly variable.
• Resource requirements for indexing are more predictable.

INDEXES

• Regular index: document → terms
• Inverted index: term → documents
• Example:
term1 → {d1, p}, {d2, p}, {d23, p}
term2 → {d2, p}, {d34, p}
term3 → {d6, p}, {d56, p}, {d345, p}
where d is the doc id and p is the payload (an example payload: term frequency… this can be blank too)
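The term-to-postings mapping above can be sketched as a plain Python dictionary; the doc ids and term-frequency payloads here are the illustrative values from the example, not real data:

```python
# A toy inverted index: each term maps to a postings list of
# (docid, payload) pairs; here the payload is the term frequency.
inverted_index = {
    "term1": [("d1", 2), ("d2", 1), ("d23", 1)],
    "term2": [("d2", 3), ("d34", 1)],
    "term3": [("d6", 1), ("d56", 2), ("d345", 1)],
}

# Looking up a term returns its full postings list.
postings = inverted_index["term2"]
print(postings)  # [('d2', 3), ('d34', 1)]
```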
INVERTED INDEX

• An inverted index consists of postings lists, one associated with each term that appears in the corpus.
• <t, posting>n
• <t, <docid, tf>>n
• <t, <docid, tf, other info>>n

• A key-value pair where the key is the term (word) and the value is the docid, followed by the “payload”
• The payload can be empty for a simple index
• The payload can be complex: it provides such details as co-occurrences, additional linguistic processing, page rank of the doc, etc.
• <t2, <d1, d4, d67, d89>>
• <t3, <d4, d6, d7, d9, d22>>
• Document numbers typically carry no semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on “page ranks” (hmm, what is that?).
RETRIEVAL

• Once the inverted index is built, when a query comes in, retrieval involves fetching the appropriate docs.
• The docs are ranked, and the top k docs are listed.
• It is good to have the inverted index in memory.
• If not, some queries may involve random disk access for decoding of postings.
• Solution: organize the disk accesses so that random seeks are minimized.
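A minimal sketch of the fetch-and-rank step, assuming a conjunctive query and using summed term frequency as a stand-in for a real ranking model (the index contents and the `retrieve` helper are hypothetical):

```python
# Toy in-memory inverted index: term -> [(docid, tf), ...]
inverted_index = {
    "cat":  [("d1", 2), ("d3", 1), ("d4", 1)],
    "roof": [("d1", 1), ("d2", 3), ("d3", 1)],
}

def retrieve(index, query_terms, k=2):
    # Intersect docid sets so every result contains all query terms.
    doc_sets = [set(d for d, _ in index[t]) for t in query_terms]
    candidates = set.intersection(*doc_sets)
    # Score by total term frequency (a crude stand-in for real ranking).
    scores = {d: sum(tf for t in query_terms
                     for doc, tf in index[t] if doc == d)
              for d in candidates}
    # Return the top-k docids by score.
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(retrieve(inverted_index, ["cat", "roof"]))  # ['d1', 'd3']
```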
PSEUDO CODE

Pseudo code → baseline implementation → value-to-key conversion pattern implementation…

INVERTED INDEX: BASELINE IMPLEMENTATION USING MR

• Input to the mapper consists of the docid and the actual content.
• Each document is analyzed and broken down into terms.
• Processing pipeline, assuming HTML docs:
• Strip HTML tags
• Strip JavaScript code
• Tokenize using a set of delimiters
• Case fold
• Remove stop words (a, an, the…)
• Remove domain-specific stop words
• Stem different forms (…ing, …ed…, dogs → dog)
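The pipeline steps above can be sketched in Python; the stemmer here is a crude suffix-stripping stand-in for a real one (e.g. Porter), and the stop-word list is only a tiny illustrative sample:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "on"}

def strip_html(text):
    # Remove HTML tags (a real pipeline would also strip scripts).
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # Tokenize on non-letter delimiters and case-fold in one step.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Toy stemmer: strip a few common suffixes (dogs -> dog).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(doc):
    tokens = tokenize(strip_html(doc))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("<p>The cat sits on the roof.</p>"))  # ['cat', 'sit', 'roof']
```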
BASELINE IMPLEMENTATION

procedure Map(docid n, doc d)
  H ← new AssociativeArray
  for all terms t in doc d
    H{t} ← H{t} + 1
  for all terms t in H
    emit(term t, posting <n, H{t}>)
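A direct Python rendering of the baseline mapper, using `Counter` for the associative array and a generator for the emits (the function name is mine):

```python
from collections import Counter

def map_document(docid, terms):
    # H{t} <- H{t} + 1 for every term occurrence in the document.
    h = Counter(terms)
    # emit(term t, posting <n, H{t}>) for every distinct term.
    for term, tf in h.items():
        yield term, (docid, tf)

postings = sorted(map_document("d1", ["cat", "sits", "cat", "roof"]))
print(postings)  # [('cat', ('d1', 2)), ('roof', ('d1', 1)), ('sits', ('d1', 1))]
```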
VISUALIZE THE OUTPUT FROM MAPPERS

[Figure: each mapper emits, for every term (term1, term2, …, term10), a posting of the form <docid n, #occurrences, other info>.]
REDUCER FOR BASELINE IMPLEMENTATION

procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
  P ← new List
  for all postings <a, f> in postings
    Append(P, <a, f>)
  Sort(P)                     // sorted by docid
  emit(term t, postings P)

A term t may appear in many documents; create and sort that list, then output it.
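The same reducer in Python: gather all postings for a term and sort them by docid before emitting (function name is mine; input order mimics an unsorted shuffle):

```python
def reduce_term(term, postings):
    # Append each <docid, tf> posting to a fresh list P.
    p = list(postings)
    # Sort by docid, since MR does not order the values for a key.
    p.sort(key=lambda posting: posting[0])
    return term, p

term, plist = reduce_term("cat", [("d3", 1), ("d1", 2), ("d4", 1)])
print(term, plist)  # cat [('d1', 2), ('d3', 1), ('d4', 1)]
```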
SHUFFLE AND SORT PHASE

• This is a very large “group by term” of the postings.
• Let’s look at a toy example.
• Fig. 4.3: let’s explore.
REVISED IMPLEMENTATION

• Issue: MR does not guarantee the sorting order of the values, only of the keys.
• A lot of data has to be held in memory in the baseline, naïve implementation.
• So the sort in the reducer is an expensive operation, especially if the docs cannot be held in memory.
• Let’s check a revised solution.
• Instead of (term t, posting <docid, f>), emit (tuple <t, docid>, tf f).
INVERTED INDEX: REVISED IMPLEMENTATION

• From the baseline to an improved version
• Observe the sort done by the reducer. Is there any way to push this into the MR runtime?
• Instead of
• (term t, posting <docid, f>)
• emit
• (tuple <t, docid>, tf f)
• This is our previously studied value-to-key conversion design pattern.
• This switching ensures the keys arrive in order at the reducer.
• Small memory footprint; less buffer space needed at the reducer.

• See Fig. 4.4
MODIFIED MAPPER

procedure Map(docid n, doc d)
  H ← new AssociativeArray
  for all terms t in doc d
    H{t} ← H{t} + 1
  for all terms t in H
    emit(tuple <t, n>, tf H{t})
MODIFIED REDUCER

procedure Initialize
  tprev ← 0
  P ← new PostingsList

procedure Reduce(tuple <t, n>, tf [f1, …])
  if (t ≠ tprev) ∧ (tprev ≠ 0)     // a new term has arrived
    emit(term tprev, postings P)
    Reset(P)
  P.Add(<n, f>)
  tprev ← t

procedure Close
  emit(term tprev, postings P)

• What are Initialize and Close? They run once per reducer task, before the first and after the last Reduce call; Close flushes the final postings list.
• Observe that the reducer gets you the terms in the format you need for answering queries.
IMPROVED MR FOR II
class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray
    for all terms t in doc d do
      H[t] ← H[t] + 1
    for all terms t in H do
      Emit(tuple <t, n>, tf H[t])

class Reducer
  method Initialize
    tprev ← 0; P ← new PostingsList
  method Reduce(tuple <t, n>, tf [f])
    if t ≠ tprev ∧ tprev ≠ 0 then
      Emit(term tprev, postings P)
      P.Reset()
    P.Add(<n, f>)
    tprev ← t
  method Close
    Emit(term tprev, postings P)
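The whole improved job can be simulated in a few lines of Python: the mapper emits composite (term, docid) keys, a plain `sort()` stands in for the MR shuffle-and-sort, and the reducer streams postings, flushing whenever the term changes (all names and the two tiny documents are mine):

```python
from collections import Counter

def map_document(docid, terms):
    # Value-to-key conversion: the docid moves into the key, so the
    # runtime (not the reducer) delivers postings sorted by docid.
    for term, tf in Counter(terms).items():
        yield (term, docid), tf

def reduce_stream(sorted_pairs):
    # Streaming reducer: flush the postings list on a term change.
    t_prev, postings = None, []
    for (term, docid), tf in sorted_pairs:
        if term != t_prev and t_prev is not None:
            yield t_prev, postings
            postings = []
        postings.append((docid, tf))
        t_prev = term
    if t_prev is not None:            # the Close() step
        yield t_prev, postings

docs = {"d1": ["cat", "roof"], "d2": ["roof", "tin", "roof"]}
pairs = [kv for d, ts in sorted(docs.items()) for kv in map_document(d, ts)]
pairs.sort()                          # stands in for shuffle-and-sort
index = dict(reduce_stream(pairs))
print(index)  # {'cat': [('d1', 1)], 'roof': [('d1', 1), ('d2', 2)], 'tin': [('d2', 1)]}
```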
LET’S WALK THROUGH A SAMPLE TEXT

<d1, This is a cat. Cat sits on a roof.>

<d2, The roof is a tin roof. There is a tin can on the roof.>

<d3, Cat kicks the can. It rolls on the roof and falls on the next roof.>

<d4, The cat rolls too. It sits on the can.>
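One way to check the walkthrough: build the inverted index for the four sample documents with simple lowercase tokenization (no stop-word removal or stemming, so every token is indexed):

```python
import re
from collections import Counter, defaultdict

docs = {
    "d1": "This is a cat. Cat sits on a roof.",
    "d2": "The roof is a tin roof. There is a tin can on the roof.",
    "d3": "Cat kicks the can. It rolls on the roof and falls on the next roof.",
    "d4": "The cat rolls too. It sits on the can.",
}

# term -> [(docid, tf), ...], with docids appended in sorted order.
index = defaultdict(list)
for docid in sorted(docs):
    tf = Counter(re.findall(r"[a-z]+", docs[docid].lower()))
    for term, f in tf.items():
        index[term].append((docid, f))

print(index["cat"])   # [('d1', 2), ('d3', 1), ('d4', 1)]
print(index["roof"])  # [('d1', 1), ('d2', 3), ('d3', 2)]
```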


OTHER MODIFICATIONS

• The partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer.
• A custom partitioner is needed so that all tuples with the same term t go to the same reducer.
• That is, you cannot split a term’s postings across reducers!
• Let’s go through an example.
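A minimal sketch of such a partitioner (the function is hypothetical): it hashes only the term component of the composite (term, docid) key, so every posting for a given term lands in the same partition even though the full keys differ:

```python
def partition(key, num_reducers):
    # key is the composite (term, docid) tuple emitted by the mapper.
    term, _docid = key
    # Hash on the term alone, ignoring the docid.
    return hash(term) % num_reducers
```

With the default hash-the-whole-key partitioner, ("cat", "d1") and ("cat", "d99") could land on different reducers, splitting cat's postings list; this version keeps them together while still letting the runtime sort by (term, docid).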
WHAT ABOUT RETRIEVAL?

• While MR is great for indexing, it is not great for retrieval.

You might also like