
Indexing structure

Abdo Ababor
2020/21

July, 2021 1
Indexing: Basic Concepts
 Indexing is an arrangement of index terms that permits fast searching and reduces memory space requirements.
 It is used to speed up access to desired information from a document collection in response to a user's query, such that:
    it enhances efficiency in terms of retrieval time;
    relevant documents are searched and retrieved quickly.

 Index file usually has index terms in a sorted order.

 Which list is easier to search?

fox pig zebra hen ant cat dog lion ox


ant cat dog fox hen lion ox pig zebra
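The difference between the two lists can be sketched in code. This is a minimal illustration, assuming the terms are held in a Python list; the function names are hypothetical and not part of the slides.

```python
from bisect import bisect_left

def contains_sorted(sorted_terms, term):
    """Binary search: about log2(n) comparisons, valid only on a sorted list."""
    i = bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term

def contains_unsorted(terms, term):
    """Linear scan: up to n comparisons, the only option on an unsorted list."""
    for t in terms:
        if t == term:
            return True
    return False

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)
```

With nine terms the gap is small, but for a vocabulary of a million terms binary search needs about 20 comparisons where a scan may need a million.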

Indexing: Basic Concepts
 An index file consists of records, called index entries.
 Index files are much smaller than the original file.
 Remember Heaps' Law: in a 1 GB text collection the vocabulary has a size of only 5 MB. This size may be further reduced by linguistic pre-processing (or text operations).
 The usual unit for indexing is the word.
 Index terms are used to look up records in a file.

Major Steps in Index Construction
 Source file: collection of text documents
    A document can be described by a set of representative keywords called index terms.
 Index terms selection: apply text operations or preprocessing
    Tokenize
    Stop words removal
    Word stemming
    Term relevance weight: assignment of numerical weights to each index term of a document (TF, IDF, TF*IDF)
 Indexing structure: the set of index terms (vocabulary) is organized in an index file so that the documents in which each term occurs are easy to identify.
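The term-weighting step can be sketched as follows. This is one common variant, assumed here for illustration: TF is the raw count of a term in a document and IDF is log(N / DF). The slides do not fix a particular formula, and the term lists below are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc_id -> list of index terms. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    tf = {d: Counter(terms) for d, terms in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    idf = {t: math.log(n_docs / df[t]) for t in df}
    return {d: {t: c * idf[t] for t, c in counts.items()}
            for d, counts in tf.items()}

# Hypothetical index terms for two documents.
docs = {
    1: ["negative", "affect", "make", "hard", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "difficult", "task"],
}
weights = tf_idf(docs)
```

Under this variant a term that occurs in every document (such as "affect" here) gets weight 0, because it does not help to discriminate between documents.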
Basic Indexing Process
Documents to be indexed:   Friends, Romans, countrymen.
   ↓ Tokenizer
Token stream:              Friends   Romans   countrymen
   ↓ Linguistic preprocessor
Modified tokens:           friend   roman   countryman
   ↓ Indexer
Inverted file (index file):
   friend       2    4
   roman        1    2
   countryman   13   16
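The tokenizer and linguistic preprocessor stages can be sketched as below. The stop list, the plural-stripping rule, and the irregular-form table are deliberately tiny assumptions made for this example; a real system would use a proper stemmer or lemmatizer.

```python
import re

STOP_WORDS = {"the", "of", "and"}           # assumed, minimal stop list
IRREGULAR = {"countrymen": "countryman"}    # assumed lookup for irregular plurals

def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Lowercase the token and reduce simple plurals to the singular form."""
    t = token.lower()
    if t in IRREGULAR:
        return IRREGULAR[t]
    if t.endswith("s") and len(t) > 3:
        return t[:-1]
    return t

def index_terms(text):
    """Tokenize, remove stop words, and normalize what remains."""
    return [normalize(t) for t in tokenize(text) if t.lower() not in STOP_WORDS]

tokens = tokenize("Friends, Romans, countrymen.")
terms = index_terms("Friends, Romans, countrymen.")
```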
Index file Evaluation Metrics
 Running time of the main operations
 Access/search time
    How long does it take to find the required search key in the list?
 Update time (insertion time, deletion time)
    How much time does it take to update existing records in order to add new terms or delete existing unnecessary terms?
    Does the indexing structure allow incremental update, or does it require re-indexing?
 Space overhead
 Computer storage space consumed for keeping the list.

Building Index file
 An index file of a document collection is a file consisting of a list of index terms, each with a link to one or more documents that contain the term.
 An index file is a list of search terms organized for associative look-up, i.e., to answer user queries such as:
    In which documents does a specified search term appear?
    Where within each document does each term appear? (There may be several occurrences.)
 For organizing the index file of a collection of documents, various options are available:
    Decide what data structure and/or file structure to use: sequential file, inverted file, suffix tree, etc.


1. Sequential File

 The sequential file is the most primitive file structure.
 It has no vocabulary and no linking pointers.
 The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field, i.e.:
    a particular attribute is chosen as the primary key, whose value determines the order of the records;
    when the first key fails to discriminate among records, a second key is chosen to give an order.

Example:
 Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.

Doc 1: Negative affect can make it harder to do even easy tasks. So make it easy.
Doc 2: Positive affect can make it easier to do difficult tasks.
Sorting the Vocabulary
 After all documents have been tokenized, stop words are removed and normalization and stemming are applied to generate index terms.
 These index terms in the sequential file are sorted in alphabetical order.
[Figure: Sequential file]

Sequential File

 To access records, search serially:
    starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
 Update options: does the index need to be rebuilt, or is incremental update supported?
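A sequential file can be sketched as a list of records sorted on a primary key (the term), with ties broken by a secondary key (the document ID), and searched by a serial scan. The record layout and the sample records are assumptions made for this example.

```python
# Each record is (term, doc_id); the records arrive in no particular order.
records = [("make", 2), ("affect", 2), ("make", 1), ("easy", 1), ("affect", 1)]

# Primary key: term. Secondary key: doc_id, used when terms are equal.
sequential_file = sorted(records, key=lambda r: (r[0], r[1]))

def serial_search(file_records, term):
    """Read records one after another from the start of the file."""
    hits = []
    for rec_term, doc_id in file_records:
        if rec_term == term:
            hits.append(doc_id)
        elif rec_term > term:
            break  # the file is sorted, so no later record can match
    return hits

make_docs = serial_search(sequential_file, "make")
```

Inserting a new record while keeping the order means shifting or rewriting the records after it, which is why updates to a sequential file are costly.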

Sequential File …

 Its main advantages:
    easy to implement;
    provides fast access to the next record using lexicographic order;
    can be searched quickly using binary search, O(log n).
 Its disadvantages:
    no weights are attached to terms;
    random access is slow: since similar terms are indexed individually, we need to find all terms that match the query.
2. Inverted file
 A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
 Building and maintaining an inverted index is a relatively low-cost task. On a text of n words an inverted index can be built in O(n) time.
 This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
[Figure: Inverted Files. Original documents → word extraction → word IDs, each linked to document IDs]
•W1: d1, d2, d3
•W2: d2, d4, d7, d9
•Wn: di, …dn
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all the documents dj that contain at least one term in the query.
2. Merge the inverted lists for each term ti in the query, with a logical OR, to establish the set R.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.
4. Return the elements of R in ranked order.
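The four steps above can be sketched as follows, assuming the similarity is a plain dot product of query and document weights (no length normalization) and that the inverted file maps each term to a dictionary of document weights. The weights below are made up for illustration.

```python
from collections import defaultdict

def rank(query_weights, inverted, top_k=None):
    """query_weights: term -> weight; inverted: term -> {doc_id: weight}."""
    scores = defaultdict(float)
    # Steps 1-2: visiting the list of every query term is the logical OR;
    # only documents that share at least one term with the query get a score.
    for term, q_w in query_weights.items():
        for doc_id, d_w in inverted.get(term, {}).items():
            # Step 3: accumulate the dot product term by term.
            scores[doc_id] += q_w * d_w
    # Step 4: highest score first, ties broken by document ID.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k] if top_k else ranked

inverted = {
    "comput": {"d1": 1.0, "d2": 1.0, "d3": 1.0},
    "science": {"d1": 2.0, "d3": 2.0},
    "phd": {"d4": 3.0},
}
result = rank({"comput": 1.0, "science": 1.0}, inverted)
```

Document d4 never appears in the result because none of its terms occur in the query, so its inverted list is never read.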

Inverted file
Data to be held in the inverted file includes
 The vocabulary (list of terms):
    is the set of all distinct words (index terms) in the text collection;
    having information about the vocabulary speeds up searching for relevant documents.
 For each term, the inverted file contains information related to:
    Location: all the text locations/positions where the word occurs
    Frequency: the frequency of occurrence of the term in the document collection

Enhancements to Inverted Files -- Concept
 Location: each posting holds information about the location of each term within the document.
   Uses:
    user interface design -- highlight location of search term
    adjacency and near operators (in Boolean searching)
 Frequency: each inverted list includes the number of postings for each term.
   Uses:
    term weighting
    query processing optimization

Inverted file
 Having information about the location of each term within the document helps with:
    user interface design: highlight location of search term
    proximity-based ranking: adjacency and near operators (in Boolean searching)
 Having information about frequency is used for:
    calculating term weighting (like TF, TF*IDF, …)
    optimizing query processing

Inverted File
Documents are organized by the terms/words they contain. This is called an index file.
Text operations are performed before building the index.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

CF: total frequency of term tj in the corpus.
Is it possible to keep all this information during searching?
Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
 A vocabulary file (word list):
    stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order (i.e., like a dictionary), and
    for each word, a pointer to the posting file.
 The record kept for each term j in the vocabulary (word list) contains the following:
    term j
    number of documents in which term j occurs (DFj)
    collection frequency of term j (CFj)
    pointer to the inverted (postings) list for term j

Postings File (Inverted List)
 For each distinct term in the vocabulary, the posting file
stores a list of pointers to the documents that contain that
term.
 Each element in an inverted list is called a posting, i.e.,
the occurrence of a term in a document
 Each list consists of one or many individual postings

Advantage of dividing inverted file into vocabulary and posting
 Keeping a pointer in the vocabulary to the list in the posting file allows:
    the vocabulary to be kept in memory at search time even for a large text collection, while the posting file is kept on disk for accessing the pointers to documents.

General structure of Inverted File
 The following figure shows the general structure of inverted
index file.

Organization of Index File
[Figure: Vocabulary (word list) → Postings (inverted list) → Documents]

Term     DF   CF   Pointer to posting
term 1   3    3    → inverted list
term 2   3    4    → inverted list
term 3   1    1    → inverted list
term 4   2    3    → inverted list

Example:
 Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.

Doc 1: Negative affect can make it harder to do even easy tasks. So make it easy.
Doc 2: Positive affect can make it easier to do difficult tasks.
Sorting the Vocabulary
 After all documents have been tokenized, the inverted file is sorted by terms.
 Steps:
    Extract the terms in each doc
    Sort the terms
    Compile the terms, i.e., collect the frequencies for each term
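The extract, sort, and compile steps can be sketched as below. The input is assumed to be the two example documents after stop-word removal and stemming; the exact term lists are an assumption made for this example.

```python
from itertools import groupby

def build_inverted_file(docs):
    """docs: doc_id -> list of index terms. Returns term -> [(doc_id, tf), ...]."""
    # Step 1: extract (term, doc_id) pairs in location order.
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: compile; identical pairs are merged into one posting with a TF.
    index = {}
    for (term, doc_id), group in groupby(pairs):
        index.setdefault(term, []).append((doc_id, len(list(group))))
    return index

docs = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}
index = build_inverted_file(docs)
```

From each postings list, DF is the number of postings and CF is the sum of the TF values, for example DF = 2 and CF = 3 for "easy".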
Remove stop words and compute frequency
 Multiple term entries in a single document are merged and frequency information is added.

Stemming & compute frequency

 Multiple term entries in a single document are merged and frequency information is added.

Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file; each vocabulary entry holds a pointer to its postings.

Vocabulary              Posting
Term        DF   CF     Doc #   TF
affect      2    2      1       1
                        2       1
difficult   1    1      2       1
do          2    2      1       1
                        2       1
easy        2    3      1       2
                        2       1
hard        1    1      1       1
make        2    3      1       2
                        2       1
negative    1    1      1       1
positive    1    1      2       1
task        2    2      1       1
                        2       1
Searching on Inverted File
 Since the whole index file is divided into two, searching can be done faster by loading the vocabulary list, which takes less memory even for a large document collection.
 Using binary search, searching takes logarithmic time.
    The search is done in the vocabulary list.
 Updating an inverted file is complex.
    We need to update both the vocabulary and the posting files.
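A lookup on the split index can be sketched as a binary search over the in-memory vocabulary followed by one read of the postings the pointer refers to. In this sketch the posting file is simulated by a Python list and the pointers are list offsets; both are assumptions made for illustration.

```python
from bisect import bisect_left

# Vocabulary (kept in memory): sorted terms and, in parallel, pointers
# giving the offset of each term's first posting.
vocab_terms = ["affect", "difficult", "do", "easy", "hard", "make"]
vocab_ptrs = [0, 2, 3, 5, 7, 8]

# Posting file (on disk in a real system): (doc_id, tf) entries,
# stored term after term in vocabulary order.
postings = [(1, 1), (2, 1),   # affect
            (2, 1),           # difficult
            (1, 1), (2, 1),   # do
            (1, 2), (2, 1),   # easy
            (1, 1),           # hard
            (1, 2), (2, 1)]   # make

def lookup(term):
    """Binary search in the vocabulary, then follow the pointer."""
    i = bisect_left(vocab_terms, term)
    if i == len(vocab_terms) or vocab_terms[i] != term:
        return []
    start = vocab_ptrs[i]
    end = vocab_ptrs[i + 1] if i + 1 < len(vocab_ptrs) else len(postings)
    return postings[start:end]
```

The sketch also shows why updates are complex: adding one posting for "do" shifts the offsets of every term after it, so both structures must change together.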

Example: Create Inverted file
 Create an inverted file (both the vocabulary list and
the posting file) for the following document collection

D1 The Department of Computer Science was established in
1984.
D2 The Department launched its first BSc in Computer
Studies in 1987.
D3 Followed by the MSc in Computer Science which was
started in 1991.
D4 The Department also produced its first PhD graduate in
1994.
D5 Our staff have contributed intellectually and
professionally to the advancements in these fields.
Example: Create Inverted file
 After text operations, the terms shown in red remain as index terms
D1 The Department of Computer Science was established in
1984.
D2 The Department launched its first BSc in Computer
Studies in 1987.
D3 Followed by the MSc in Computer Science which was
started in 1991.
D4 The Department also produced its first PhD graduate in
1994.
D5 Our staff have contributed intellectually and
professionally to the advancements in these fields.
…. Example: Create Inverted file
After text operations are performed:
 D1= department comput science establish
 D2= department launch bsc comput study
 D3= follow msc comput science start
 D4= department produce phd graduat
 D5= staff contribut intellect profession advance field
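From these processed documents the vocabulary (with DF and CF) and the postings can be derived mechanically. The sketch below is one possible way to do it; the function name and the record layout are hypothetical.

```python
from collections import Counter

docs = {
    1: "department comput science establish".split(),
    2: "department launch bsc comput study".split(),
    3: "follow msc comput science start".split(),
    4: "department produce phd graduat".split(),
    5: "staff contribut intellect profession advance field".split(),
}

def vocabulary(docs):
    """Return term -> (DF, CF, postings), with terms in alphabetical order."""
    df, cf = Counter(), Counter()
    postings = {}
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            df[term] += 1      # one more document contains the term
            cf[term] += tf     # add its occurrences in this document
            postings.setdefault(term, []).append((doc_id, tf))
    return {t: (df[t], cf[t], postings[t]) for t in sorted(df)}

vocab = vocabulary(docs)
```

For example, `vocab["comput"]` is `(3, 3, [(1, 1), (2, 1), (3, 1)])`: the term occurs once in each of D1, D2, and D3.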

All term-specific information (max tf, tf, tf-idf, location, etc.) is stored in the posting file; the vocabulary holds a pointer to each term's postings.

Vocabulary                         Posting
word         DF   CF   WID         Doc#   TF   mTF   loc
advance      1    1    W1          5      1    1     5
bsc          1    1    W2          2      1    1     3
comput       3    3    W3          1      1    1     2
                                   2      1    1     4
                                   3      1    1     2
contribut    1    1    W4          5      1    1     4
department   3    3    W5          1      1    1     1
                                   2      1    1     1
                                   4      1    1     1
establish    1    1    W6          (continue)
field        1    1    W7
follow       1    1
graduat      1    1
intellect    1    1
launch       1    1
msc          1    1
phd          1    1
produce      1    1
profession   1    1
science      2    2
staff        1    1
start        1    1
study        1    1

Document file:
•W1: d5
•W2: d2
•W3: d1, d2, d3
•Wn: di, …dn
3. Forward Index
 It is a data structure that stores a mapping from documents to words, i.e., it directs you from document to word.
 Steps to build a forward index:
    Fetch the document and gather all the keywords.
    Append all the keywords to the index entry for this document.
    Repeat the above steps for all documents.
 Indexing is quite fast, as it only appends keywords as it moves forward.
 Searching is quite difficult, as it has to look at the entire contents of the index just to retrieve all pages related to a word.
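A forward index can be sketched as a dictionary from document ID to its keyword list. The small three-document collection and the function names are assumptions made for illustration.

```python
def build_forward_index(documents):
    """documents: doc_id -> text. Keywords are appended in the order they occur."""
    forward = {}
    for doc_id, text in documents.items():
        forward[doc_id] = text.split()
    return forward

def search_forward(forward, word):
    """Every entry of the index has to be examined to find a word."""
    return [doc_id for doc_id, words in forward.items() if word in words]

documents = {
    "doc1": "hello sky morning",
    "doc2": "tea coffee hi",
    "doc3": "greetings sky",
}
forward = build_forward_index(documents)
sky_docs = search_forward(forward, "sky")
```

Building touches each document once, but a search costs time proportional to the size of the whole index.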
Example of forward index

Document Keywords
doc1 hello, sky, morning
doc2 tea, coffee, hi
doc3 greetings, sky

 It stores duplicate keywords in the index, e.g., the word “sky” is stored multiple times.
 Real-life examples of a forward index:
    Table of contents in a book
    DNS lookup

Inverted Index
 It is a data structure that stores a mapping from words to a document or set of documents, i.e., it directs you from word to document.
 Steps to build an inverted index:
    Fetch the document and gather all the words.
    Check each word: if it is already present, add a reference to the document to its index entry; otherwise create a new entry in the index for that word.
    Repeat the above steps for all documents and sort the words.
 Indexing is slow, as it first checks whether the word is already present.
 Searching is very fast.
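The build steps above can be sketched as follows, assuming whitespace-separated words and a small hypothetical collection; the function name is not from the slides.

```python
def build_inverted_index(documents):
    """documents: doc_id -> text. Returns word -> list of doc_ids, words sorted."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            if word in index:
                # Word already present: add a reference to this document.
                if doc_id not in index[word]:
                    index[word].append(doc_id)
            else:
                # New word: create its entry.
                index[word] = [doc_id]
    return dict(sorted(index.items()))

documents = {
    "doc1": "hello sky morning",
    "doc2": "tea coffee hi",
    "doc3": "greetings sky",
}
inverted = build_inverted_index(documents)
```

A search is then a single dictionary access, for example `inverted["sky"]`, instead of a scan over every document.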
Example of Inverted index

Word Documents
hello doc1
sky doc1, doc3
coffee doc2
hi doc2
greetings doc3

 It does not store duplicate keywords in index.

4. Partitioning
 Document-partitioned inverted index
    each compute node indexes a subset of the document collection
    each query is processed by every compute node
    perfect load balance, embarrassingly scalable, easy maintenance

4. Partitioning
 Term-partitioned inverted index
    each compute node holds posting lists for a subset of terms
    queries are routed to compute nodes with relevant terms
    lower resource consumption, susceptible to imbalance (because of skew in the data or query workload), index maintenance non-trivial
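The routing difference between document partitioning and term partitioning can be sketched as below. The assignment rules (modulo on the document ID, a simple byte-sum hash on the term) are assumptions chosen to keep the example deterministic; real systems use other schemes.

```python
def doc_partition(doc_id, n_nodes):
    """Document-partitioned: each document lives on exactly one node."""
    return doc_id % n_nodes

def term_partition(term, n_nodes):
    """Term-partitioned: each term's posting list lives on exactly one node."""
    return sum(term.encode("utf-8")) % n_nodes

def nodes_for_query_doc_partitioned(query_terms, n_nodes):
    # Any node may hold matching documents, so every node processes the query.
    return set(range(n_nodes))

def nodes_for_query_term_partitioned(query_terms, n_nodes):
    # Only the nodes that own the query's terms are contacted.
    return {term_partition(t, n_nodes) for t in query_terms}

query = ["comput", "science"]
doc_nodes = nodes_for_query_doc_partitioned(query, 4)
term_nodes = nodes_for_query_term_partitioned(query, 4)
```

With term partitioning a node that owns a very frequent term receives a disproportionate share of queries, which is the imbalance mentioned above.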

5. Caching
 What is cached?
 Query results

 Posting lists

 Posting-list intersections

 Documents

 Snippets

 Where is it cached?
 in RAM of responsible compute node

 in dedicated front-end accelerators or proxy nodes

 in RAM of all (many) compute nodes

Caching Strategies
 Least recently used (LRU)
    when space is needed, evict the item that was least recently used.
 Least frequently used (LFU)
    when space is needed, evict the item that was least frequently used.
 Cost-aware (Landlord algorithm)
    estimate for each item: temperature = access-rate / cost
    when space is needed, evict the item with the lowest temperature
    re-fetch an item if its predicted temperature is higher than the temperature of the corresponding replacement
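The LRU strategy can be sketched with an ordered dictionary. This is a minimal illustration of the eviction rule only, assuming a cache of query results keyed by query string; the class and its methods are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item

cache = LRUCache(2)
cache.put("q1", "results for q1")
cache.put("q2", "results for q2")
cache.get("q1")                      # q1 is now more recent than q2
cache.put("q3", "results for q3")    # cache is full, so q2 is evicted
```

LFU would instead keep a use counter per item and evict the smallest count, and a cost-aware policy would weigh the access rate against the cost of recomputing the item.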


Caching Effectiveness

 Query frequencies follow a Zipf distribution (s ≈ 1).
 [Baeza-Yates et al. 07] analyzed a one-year query log of Yahoo!:
    88% of unique queries are issued only once;
    these account for 44% of overall query volume.

Thank you
