Indexing 2021

Indexing structure
Abdo Ababor
2020/21
July, 2021 1
Indexing: Basic Concepts
 Indexing is an arrangement of index terms that permits fast searching and reduces memory space requirements.
 It is used to speed up access to the desired information in a document collection in response to a user's query, such that:
   It enhances efficiency in terms of retrieval time.
   Relevant documents are searched and retrieved quickly.
 An index file usually keeps its index terms in sorted order.
 Which list is easier to search?
  Unsorted: fox pig zebra hen ant cat dog lion ox
  Sorted:   ant cat dog fox hen lion ox pig zebra
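To make the "which list is easier to search?" question concrete, here is a minimal sketch that searches the unsorted list serially and the sorted list with binary search. The function names are illustrative and not part of the slides.

```python
from bisect import bisect_left

def linear_search(terms, key):
    """Scan entries in order; works on any list. Returns (index, comparisons)."""
    for i, term in enumerate(terms):
        if term == key:
            return i, i + 1
    return -1, len(terms)

def binary_search(sorted_terms, key):
    """Binary search; only valid when the list is sorted. Returns index or -1."""
    i = bisect_left(sorted_terms, key)
    if i < len(sorted_terms) and sorted_terms[i] == key:
        return i
    return -1

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)

print(linear_search(unsorted_terms, "lion"))  # (position, number of comparisons)
print(binary_search(sorted_terms, "lion"))    # position in the sorted list
```

On the sorted list, each comparison halves the remaining range, which is why sorted index files support O(log n) lookups.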
Indexing: Basic Concepts
 An index file consists of records, called index entries.
 Index files are much smaller than the original file.
   Remember Heaps' law: for a 1 GB text collection, the vocabulary has a size of only about 5 MB. This size may be further reduced by linguistic pre-processing (text operations).
 The usual unit for indexing is the word.
 Index terms are used to look up records in a file.
Major Steps in Index Construction
 Source file: a collection of text documents.
   A document can be described by a set of representative keywords called index terms.
 Index term selection: apply text operations (preprocessing).
   Tokenization
   Stop word removal
   Stemming
   Term relevance weighting: assignment of a numerical weight to each index term of a document, e.g. TF, IDF, TF*IDF.
 Indexing structure: the set of index terms (the vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.
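The index term selection steps can be sketched as follows. This is only an illustration under stated assumptions: the stop list is a tiny made-up set, `crude_stem` is a hypothetical stand-in for a real stemmer, and the weight is the common TF * log2(N / DF) form of TF*IDF, which the slides do not spell out.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "in", "to", "a", "is", "and"}  # assumed tiny stop list

def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, remove stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """Return {doc_id: {term: TF * log2(N / DF)}} for a dict of raw texts."""
    term_lists = {doc_id: index_terms(text) for doc_id, text in docs.items()}
    n_docs = len(term_lists)
    df = Counter()
    for terms in term_lists.values():
        df.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in term_lists.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
    return weights

docs = {"d1": "The indexing of documents", "d2": "Searching the documents in the index"}
print(index_terms(docs["d1"]))
print(tf_idf(docs))
```

Terms that occur in every document get weight 0 under this formula, which reflects the idea that they do not help discriminate between documents.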
Basic Indexing Process
[Figure: pipeline from documents to inverted file]
Documents to be indexed:   Friends, Romans, countrymen.
  -> Tokenizer
Token stream:              Friends  Romans  countrymen
  -> Linguistic preprocessor
Modified tokens:           friend  roman  countryman
  -> Indexer
Inverted file:             friend     -> 2, 4
                           roman      -> 1, 2
                           countryman -> 13, 16
Index File Evaluation Metrics
 Running time of the main operations
   Access/search time
     How long does it take to find the required search key in the list?
   Update time (insertion time, deletion time)
     How long does it take to update existing records, to add new terms, or to delete unnecessary terms?
     Does the indexing structure allow incremental updates, or does it require re-indexing?
 Space overhead
   Computer storage space consumed for keeping the list.
Building Index file
 An index file of a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain it.
 An index file is a list of search terms organized for associative look-up, i.e., to answer a user's query:
   In which documents does a specified search term appear?
   Where within each document does each term appear? (There may be several occurrences.)
 For organizing the index file for a collection of documents, various options are available:
   Decide what data structure and/or file structure to use: sequential file, inverted file, suffix tree, etc.
1. Sequential File
 The sequential file is the most primitive file structure.
 It has neither a vocabulary nor linking pointers.
 The records are generally arranged serially, one after another, in lexicographic order on the value of some key field, i.e.:
   A particular attribute is chosen as the primary key, whose value determines the order of the records.
   When the primary key fails to discriminate among records, a second key is chosen to give an order.
Example:
 Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
  Doc 1: Negative affect can make it harder to do even easy tasks, so make it easy.
  Doc 2: Positive affect can make it easier to do difficult tasks.
Sorting the Vocabulary: Sequential file
 After all documents have been tokenized, stop words are removed, and normalization and stemming are applied to generate the index terms.
 These index terms are stored in the sequential file sorted in alphabetical order.
Sequential File
 To access records, search serially:
   Starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
 Update options: does the index need to be rebuilt, or is incremental update supported?
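A serial search over a sequential file might look like the sketch below. The `records` list is an illustrative excerpt of sorted (term, document ID) pairs, and `serial_search` is a hypothetical helper, not something defined in the slides.

```python
# Sorted (term, doc_id) records, as a sequential file would hold them.
records = [
    ("affect", 1), ("affect", 2), ("difficult", 2), ("do", 1), ("do", 2),
    ("easy", 1), ("easy", 2), ("hard", 1), ("make", 1), ("make", 2),
]

def serial_search(records, key):
    """Read records from the start; return (matching doc IDs, records read)."""
    matches, reads = [], 0
    for term, doc_id in records:
        reads += 1
        if term == key:
            matches.append(doc_id)
        elif term > key:  # sorted order: no later record can match
            break
    return matches, reads

print(serial_search(records, "do"))   # documents containing "do", records read
print(serial_search(records, "zoo"))  # absent term: the whole file is read
```

Because the records are kept in lexicographic order, the scan can stop as soon as it passes the place where the key would be; without sorting, every lookup would read the whole file.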
Sequential File …
 Its main advantages:
   Easy to implement.
   Provides fast access to the next record in lexicographic order.
   Because the records are sorted, it can be searched quickly using binary search, in O(log n) time.
 Its disadvantages:
   No weights are attached to terms.
   Random access is slow: since similar terms are indexed individually, we need to find all the terms that match the query.
2. Inverted file
 A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
 Building and maintaining an inverted index is a relatively low-cost task. On a text of n words, an inverted index can be built in O(n) time.
 The list is inverted from a list of terms in location order to a list of terms in alphabetical order.
[Figure: word extraction from the original documents produces word IDs, each linked to a list of document IDs]
  W1: d1, d2, d3
  W2: d2, d4, d7, d9
  Wn: di, …dn
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all the documents dj that contain at least one term in the query.
2. Merge the inverted lists for each term ti in the query, with a logical OR, to establish the set R.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.
4. Return the elements of R in ranked order.
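The four steps can be sketched as below. This is a simplified illustration: the inverted lists and their weights are made up, and the similarity used is the plain dot product q · dj rather than any particular normalized measure.

```python
from collections import defaultdict

# Hypothetical weighted inverted lists: term -> {doc_id: term weight in that doc}
inverted = {
    "index": {"d1": 0.5, "d2": 0.75},
    "file": {"d2": 0.25, "d3": 0.5},
    "cache": {"d4": 1.0},
}

def rank(query_weights, inverted):
    """OR-merge the inverted lists of the query terms and score by dot product."""
    scores = defaultdict(float)  # only documents in R ever get an entry
    for term, q_w in query_weights.items():
        for doc_id, d_w in inverted.get(term, {}).items():
            scores[doc_id] += q_w * d_w
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query = {"index": 1.0, "file": 1.0}
print(rank(query, inverted))
```

Documents that share no term with the query (here `d4`) are never touched, which is the practical benefit of working from the inverted lists instead of comparing the query with every document.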
Inverted file
Data to be held in the inverted file includes:
 The vocabulary (list of terms):
   It is the set of all distinct words (index terms) in the text collection.
   Having information about the vocabulary speeds up searching for relevant documents.
 For each term, the inverted file contains information related to:
   Location: all the text locations/positions where the word occurs.
   Frequency: the frequency of occurrence of the term in the document collection.
Enhancements to Inverted Files: Concept
 Location: each posting holds information about the location of each term within the document.
  Uses:
   user interface design: highlight location of search term
   adjacency and near operators (in Boolean searching)
 Frequency: each inverted list includes the number of postings for each term.
  Uses:
   term weighting
   query processing optimization
Inverted file
 Having information about the location of each term within the document helps with:
   user interface design: highlighting the location of the search term
   proximity-based ranking: adjacency and near operators (in Boolean searching)
 Having information about frequency is used for:
   calculating term weights (like TF, TF*IDF, …)
   optimizing query processing
Inverted File
Documents are organized by the terms/words they contain.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

 This is called an index file.
 Text operations are performed before building the index.
 CF: total frequency of term tj in the corpus.
 Is it possible to keep all this information during searching?
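A minimal sketch of how such per-term information (CF, document ID, TF, location) could be collected is given below. The toy documents and the function names are assumptions for illustration, and locations are taken to be 1-based word positions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; positions are 1-based word offsets."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def term_stats(index, term):
    """Derive DF, CF and per-document TF from the stored locations."""
    postings = index.get(term, {})
    tf = {doc_id: len(positions) for doc_id, positions in postings.items()}
    return {"DF": len(postings), "CF": sum(tf.values()), "TF": tf}

docs = {1: ["index", "file", "index"], 2: ["file", "search"]}
index = build_positional_index(docs)
print(index["index"])             # locations of "index" per document
print(term_stats(index, "file"))  # DF, CF and TF for "file"
```

Storing locations makes TF and CF derivable, at the price of a larger posting file; that trade-off is what the question on the slide is pointing at.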
Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
 A vocabulary file (word list) stores:
   all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order (i.e., like a dictionary), and
   for each word, a pointer to the posting file.
 The record kept for each term j in the vocabulary (word list) contains the following:
   term j
   number of documents in which term j occurs (DFj)
   collection frequency of term j (CFj)
   pointer to the inverted (postings) list for term j
Postings File (Inverted List)
 For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term.
 Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
 Each list consists of one or many individual postings.
Advantage of dividing inverted file into vocabulary and posting
 Keeping a pointer in the vocabulary to the list in the posting file allows:
   the vocabulary to be kept in memory at search time, even for a large text collection, while the posting file is kept on disk and read to access the pointers to documents.
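The split can be sketched as follows, assuming postings are stored as fixed-width 4-byte document IDs and using an in-memory `io.BytesIO` buffer as a stand-in for the posting file on disk. The layout and the function names are illustrative choices, not a format prescribed by the slides.

```python
import io
import struct

def write_index(postings_by_term, out):
    """Write postings sequentially; return the in-memory vocabulary."""
    vocabulary = {}
    for term in sorted(postings_by_term):
        doc_ids = postings_by_term[term]
        vocabulary[term] = (out.tell(), len(doc_ids))  # (pointer, DF)
        out.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return vocabulary

def read_postings(vocabulary, term, inp):
    """Follow the pointer for one term and read only its postings."""
    if term not in vocabulary:
        return []
    offset, df = vocabulary[term]
    inp.seek(offset)
    return list(struct.unpack(f"<{df}I", inp.read(4 * df)))

postings_file = io.BytesIO()  # stands in for a file on disk
vocabulary = write_index(
    {"comput": [1, 2, 3], "science": [1, 3], "bsc": [2]}, postings_file
)
print(vocabulary)
print(read_postings(vocabulary, "science", postings_file))
```

Only the small `vocabulary` dictionary has to stay in memory; a lookup costs one seek and one read in the posting file.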
General structure of Inverted File
 The following figure shows the general structure of an inverted index file.
Organization of Index File
[Figure: each vocabulary (word list) entry holds a pointer to its inverted list in the postings file, and the postings in turn point to the documents]

Vocabulary (word list)
Term     DF   CF   Pointer to posting
term 1   3    3    -> inverted list
term 2   3    4    -> inverted list
term 3   1    1    -> inverted list
term 4   2    3    -> inverted list
Example:
 Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
  Doc 1: Negative affect can make it harder to do even easy tasks, so make it easy.
  Doc 2: Positive affect can make it easier to do difficult tasks.
Sorting the Vocabulary
 After all documents have been tokenized, the inverted file is sorted by terms.
 Steps:
   Extract the terms in each document.
   Sort the terms.
   Compile the terms, i.e., collect the frequencies for each term.
Remove stop words and compute frequency
 Multiple term entries in a single document are merged and frequency information is added.
Stemming & compute frequency
 After stemming, multiple term entries in a single document are again merged and frequency information is added.
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) file and a posting file; each vocabulary entry holds a pointer to its postings.

Vocabulary             Postings
Term        DF   CF    (Doc #, TF)
affect      2    2     (1, 1), (2, 1)
difficult   1    1     (2, 1)
do          2    2     (1, 1), (2, 1)
easy        2    3     (1, 2), (2, 1)
hard        1    1     (1, 1)
make        2    3     (1, 2), (2, 1)
negative    1    1     (1, 1)
positive    1    1     (2, 1)
task        2    2     (1, 1), (2, 1)
Searching on Inverted File
 Since the whole index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes less memory even for a large document collection.
 Using binary search, searching takes logarithmic time.
   The search is done in the vocabulary list.
 Updating an inverted file is complex.
   We need to update both the vocabulary and the posting files.
Example: Create Inverted file
 Create an inverted file (both the vocabulary list and the posting file) for the following document collection:
  D1 The Department of Computer Science was established in 1984.
  D2 The Department launched its first BSc in Computer Studies in 1987.
  D3 Followed by the MSc in Computer Science which was started in 1991.
  D4 The Department also produced its first PhD graduate in 1994.
  D5 Our staff have contributed intellectually and professionally to the advancements in these fields.
Example: Create Inverted file
 After the text operations, only the content-bearing terms of D1 to D5 (shown in red on the slide) remain as index terms.
Example: Create Inverted file (continued)
After the text operations are performed:
 D1 = department comput science establish
 D2 = department launch bsc comput study
 D3 = follow msc comput science start
 D4 = department produce phd graduat
 D5 = staff contribut intellect profession advance field
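Starting from the processed term lists above, the vocabulary (with DF and CF) and the postings (document number and TF) can be derived mechanically. The sketch below is one way to do it; `build_inverted_file` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

docs = {
    1: "department comput science establish".split(),
    2: "department launch bsc comput study".split(),
    3: "follow msc comput science start".split(),
    4: "department produce phd graduat".split(),
    5: "staff contribut intellect profession advance field".split(),
}

def build_inverted_file(docs):
    """Return (vocabulary, postings) for already-preprocessed documents."""
    postings = defaultdict(list)  # term -> [(doc_id, TF), ...]
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))
    vocabulary = {
        term: {"DF": len(plist), "CF": sum(tf for _, tf in plist)}
        for term, plist in sorted(postings.items())  # lexicographic order
    }
    return vocabulary, dict(postings)

vocabulary, postings = build_inverted_file(docs)
for term, stats in vocabulary.items():
    print(term, stats["DF"], stats["CF"], postings[term])
```

Each document is visited once and each term occurrence is counted once, in line with the claim that an inverted index can be built in time linear in the text size (the final sort is over the much smaller vocabulary).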
Vocabulary                      Postings
Word         DF   CF   WID      (Doc #, TF)
advance      1    1    W1       (5, 1)
bsc          1    1    W2       (2, 1)
comput       3    3    W3       (1, 1), (2, 1), (3, 1)
contribut    1    1    W4       (5, 1)
department   3    3    W5       (1, 1), (2, 1), (4, 1)
establish    1    1    W6       (1, 1)
field        1    1    W7       (5, 1)
follow       1    1    W8       (3, 1)
graduat      1    1    W9       (4, 1)
intellect    1    1    W10      (5, 1)
launch       1    1    W11      (2, 1)
msc          1    1    W12      (3, 1)
phd          1    1    W13      (4, 1)
produce      1    1    W14      (4, 1)
profession   1    1    W15      (5, 1)
science      2    2    W16      (1, 1), (3, 1)
staff        1    1    W17      (5, 1)
start        1    1    W18      (3, 1)
study        1    1    W19      (2, 1)

 All term-specific information (max TF, TF, TF-IDF, location, etc.) is stored in the posting file.
 Pointers lead from the postings to the document file, e.g. W1: d5; W2: d2; W3: d1, d2, d3; Wn: di, …dn.
3. Forward Index
 It is a data structure that stores a mapping from documents to words, i.e., it directs you from a document to its words.
 Steps to build a forward index:
   Fetch the document and gather all its keywords.
   Append all the keywords to the index entry for this document.
   Repeat the above steps for all documents.
 Indexing is quite fast, as it only appends keywords as it moves forward.
 Searching is quite difficult, as it has to look at the entire contents of the index just to retrieve all the pages related to a word.
Example of forward index
Document   Keywords
doc1       hello, sky, morning
doc2       tea, coffee, hi
doc3       greetings, sky
 It stores duplicate keywords in the index. E.g., the word “sky” is stored multiple times.
 Real-life examples of a forward index:
   Table of contents in a book
   DNS lookup
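A minimal sketch of the forward index from this example is shown below, with hypothetical helper names. It illustrates why indexing is cheap (one append per document) while searching has to scan every entry.

```python
forward_index = {}

def add_document(forward_index, doc_id, keywords):
    """Indexing just stores the document's keywords under its ID."""
    forward_index[doc_id] = list(keywords)

def search(forward_index, word):
    """Searching must look at every entry to find documents containing word."""
    return [doc_id for doc_id, kws in forward_index.items() if word in kws]

add_document(forward_index, "doc1", ["hello", "sky", "morning"])
add_document(forward_index, "doc2", ["tea", "coffee", "hi"])
add_document(forward_index, "doc3", ["greetings", "sky"])

print(search(forward_index, "sky"))
```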
Inverted Index
 It is a data structure that stores a mapping from words to a document or set of documents, i.e., it directs you from a word to its documents.
 Steps to build an inverted index:
   Fetch the document and gather all its words.
   For each word, check whether it is already present in the index: if it is, add a reference to the document to its entry; otherwise, create a new entry in the index for that word.
   Repeat the above steps for all documents, and sort the words.
 Indexing is slow, as it first checks whether each word is already present.
 Searching is very fast.
Example of Inverted index
Word        Documents
hello       doc1
sky         doc1, doc3
morning     doc1
tea         doc2
coffee      doc2
hi          doc2
greetings   doc3
 It does not store duplicate keywords in the index.
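The build steps for an inverted index can be sketched as follows, using the same three example documents as the forward index. `build_inverted_index` is an illustrative name, and a plain dict is assumed as the index structure.

```python
def build_inverted_index(documents):
    """Map each word to the list of documents containing it, sorted by word."""
    inverted = {}
    for doc_id, words in documents.items():
        for word in words:
            if word in inverted:
                if doc_id not in inverted[word]:  # one reference per document
                    inverted[word].append(doc_id)
            else:
                inverted[word] = [doc_id]         # first time this word is seen
    return dict(sorted(inverted.items()))

documents = {
    "doc1": ["hello", "sky", "morning"],
    "doc2": ["tea", "coffee", "hi"],
    "doc3": ["greetings", "sky"],
}
inverted = build_inverted_index(documents)
print(inverted["sky"])  # one look-up instead of scanning every document
```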
4. Partitioning
 Document-partitioned inverted index
   Each compute node indexes a subset of the document collection.
   Each query is processed by every compute node.
   Perfect load balance, embarrassingly scalable, easy maintenance.
 Term-partitioned inverted index
   Each compute node holds posting lists for a subset of terms.
   Queries are routed to the compute nodes with relevant terms.
   Lower resource consumption, susceptible to imbalance (because of skew in the data or query workload), index maintenance non-trivial.
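The difference between the two schemes can be sketched with an in-process simulation, where each "compute node" is just a dictionary. The number of nodes, the modulo assignment of documents, and the first-letter routing of terms are all assumptions made for the illustration.

```python
from collections import defaultdict

NUM_NODES = 2  # assumed cluster size

def term_node(term, num_nodes=NUM_NODES):
    """Deterministic routing of a term to a node (by first letter, for the demo)."""
    return ord(term[0]) % num_nodes

def document_partition(docs, num_nodes=NUM_NODES):
    """Each node builds an inverted index over its own subset of documents."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in sorted(docs.items()):
        node = nodes[doc_id % num_nodes]
        for term in set(terms):
            node[term].append(doc_id)
    return nodes

def term_partition(docs, num_nodes=NUM_NODES):
    """Each node holds the complete posting lists for its subset of terms."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in sorted(docs.items()):
        for term in set(terms):
            nodes[term_node(term, num_nodes)][term].append(doc_id)
    return nodes

def query_document_partitioned(nodes, term):
    """Every node processes the query; partial results are merged."""
    return sorted(d for node in nodes for d in node.get(term, []))

def query_term_partitioned(nodes, term):
    """Only the node responsible for the term is contacted."""
    return nodes[term_node(term, len(nodes))].get(term, [])

docs = {1: ["comput", "science"], 2: ["comput", "bsc"], 3: ["science", "msc"]}
doc_nodes = document_partition(docs)
term_nodes = term_partition(docs)
print(query_document_partitioned(doc_nodes, "comput"))
print(query_term_partitioned(term_nodes, "comput"))
```

Both schemes return the same postings; they differ in how many nodes a query touches and in how evenly the work is spread.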
5. Caching
 What is cached?
   Query results
   Posting lists
   Posting-list intersections
   Documents
   Snippets
 Where is it cached?
   in RAM of the responsible compute node
   in dedicated front-end accelerators or proxy nodes
   in RAM of all (or many) compute nodes
Caching Strategies
 Least recently used (LRU)
   When space is needed, evict the item that was least recently used.
 Least frequently used (LFU)
   When space is needed, evict the item that was least frequently used.
 Cost-aware (Landlord algorithm)
   Estimate for each item: temperature = access-rate / cost
   When space is needed, evict the item with the lowest temperature.
   Re-fetch an item if its predicted temperature is higher than the temperature of the item it would replace.
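A minimal sketch of LRU eviction, together with the temperature rule from the slide, is given below. `LRUCache` and `coldest` are illustrative names; `coldest` only applies the slide's formula (temperature = access-rate / cost) and is not a full implementation of the Landlord algorithm.

```python
from collections import OrderedDict

class LRUCache:
    """Query-result cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

def coldest(stats):
    """stats: key -> (access_rate, cost). Return the key with lowest temperature."""
    return min(stats, key=lambda k: stats[k][0] / stats[k][1])

cache = LRUCache(capacity=2)
cache.put("q1", "results for q1")
cache.put("q2", "results for q2")
cache.get("q1")                    # q1 becomes the most recently used
cache.put("q3", "results for q3")  # q2 is evicted
print(list(cache.items))
print(coldest({"a": (10, 2), "b": (3, 1), "c": (8, 4)}))
```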
Caching Effectiveness
 Query frequencies follow a Zipf distribution (s ≈ 1).
 [Baeza-Yates et al. 07] analyzed a one-year query log of Yahoo!:
   88% of distinct queries are issued only once;
   these single-occurrence queries account for 44% of the overall query volume.
Thank you
