Abdo Ababor
2020/21
July, 2021 1
Indexing: Basic Concepts
Indexing is an arrangement of index terms that permits fast
searching and reduces memory space requirements.
It is used to speed up access to the desired information in documents.
Indexing: Basic Concepts
An index file consists of records, called index entries.
Index files are much smaller than the original file.
Remember Heaps' Law: in 1 GB of text collection the
Major Steps in Index Construction
Source file: Collection of text document
A document can be described by a set of representative keywords
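[Figure: the documents pass through a tokenizer, which produces a token
stream (Friends, Romans, countrymen); the tokens are reduced to word stems
(roman, countryman) and stored in the inverted file with their document
IDs (roman: 1, 2; countryman: 13, 16).]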
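The tokenize and stem steps can be sketched as below. This is only a minimal illustration: `tokenize` and `naive_stem` are hypothetical helper names, and the stemmer is a toy rule set written for this example, not a real stemming algorithm such as Porter's.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters only)."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Toy stemmer (assumption): maps '-men' to '-man' and strips a plural 's'."""
    if token.endswith("men"):
        return token[:-3] + "man"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

tokens = tokenize("Friends, Romans, countrymen.")
stems = [naive_stem(t) for t in tokens]
print(tokens)  # token stream
print(stems)   # word stems that would go into the index
```

A real system would normally also remove stop words before stemming; that step is left out here.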
Index file Evaluation Metrics
Running time of the main operations
Access/search time
How long does it take to find the required search key in the list?
Update time (Insertion time, Deletion time)
How long does it take to update existing records when adding new
terms or deleting unnecessary ones?
Does the indexing structure allow incremental updates, or does it
require re-indexing?
Space overhead
Computer storage space consumed for keeping the list.
Building Index file
An index file for a document collection is a file consisting of a
list of index terms, each with a link to one or more documents
that contain the index term
An index file is a list of search terms organized for
associative look-up, i.e., to answer the user's query:
In which documents does a specified search term appear?
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.
Doc 1: Negative affect can make it harder to do even easy tasks.
so make it easy
Sequential File
Sequential File …
order.
Can be searched quickly, using binary search, O(log n)
Its disadvantages:
Inverted Files
[Figure: each word in the inverted file points to the IDs of the original
documents that contain it.]
W1: d1, d2, d3
W2: d2, d4, d7, d9
Wn: di, …, dn
Use of Inverted Files for Calculating
Similarities
In the term vector space, if q is a query and dj a document,
then q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all
the documents, dj, that contain at least one term in the query:
2. Merge the inverted lists for each term ti in the query, with a
logical or, to establish the set, R.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate
weights.
4. Return the elements of R in ranked order.
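The four steps above can be sketched as follows. The index contents, the weights, and the choice of cosine similarity are assumptions made for this example; the slide only says "appropriate weights". The function name `rank` is hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical weighted inverted file: term -> {doc_id: weight}
inverted = {
    "index":  {"d1": 0.5, "d3": 0.8},
    "search": {"d1": 0.4, "d2": 0.9},
    "cache":  {"d4": 1.0},
}

def rank(query_weights, inverted):
    # Document vector lengths, computed once from all postings.
    norms = defaultdict(float)
    for postings in inverted.values():
        for doc, w in postings.items():
            norms[doc] += w * w
    # Steps 1-2: OR-merge the inverted lists of the query terms to get R.
    candidates = set()
    for term in query_weights:
        candidates.update(inverted.get(term, {}))
    # Step 3: cosine similarity for each document in R only.
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    scores = {}
    for doc in candidates:
        dot = sum(qw * inverted.get(term, {}).get(doc, 0.0)
                  for term, qw in query_weights.items())
        scores[doc] = dot / (q_norm * math.sqrt(norms[doc]))
    # Step 4: return R in ranked order.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank({"index": 1.0, "search": 1.0}, inverted)
print(ranked)
```

Document d4 shares no term with the query, so it never enters R and its similarity is never computed.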
Inverted file
Data to be held in the inverted file includes
The vocabulary (List of terms):
collection.
having information about vocabulary (list of terms) speeds
Enhancements to Inverted Files -- Concept
Location: Each posting holds information about the location of
each term within the document.
Uses
user interface design -- highlight location of search term
adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number of postings
for each term.
Uses
term weighting
query processing optimization
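A small sketch of both enhancements, assuming whitespace tokenization and word positions counted from 1. The helper names `build_positional_index` and `adjacent` are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; each (doc_id, positions) pair is one posting."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def adjacent(index, first, second):
    """Doc IDs in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {1: "inverted file index", 2: "file of inverted lists", 3: "an inverted file"}
index = build_positional_index(docs)
print(adjacent(index, "inverted", "file"))  # adjacency operator
print(len(index["inverted"]))               # number of postings for the term
```

The stored locations answer the adjacency query without re-reading the documents, and the posting count per term is the frequency value used for term weighting.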
Inverted file
Having information about the location of each term within
the document helps for:
Inverted File
Documents are organized by the terms/words they contain.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

This is called an index file. Text operations are performed before
building the index.
CF: total frequency of term tj in the corpus.
Is it possible to keep all this information during searching?
Construction of Inverted file
An inverted index consists of two files: a vocabulary file and a posting file
A vocabulary file (Word list):
Postings File (Inverted List)
For each distinct term in the vocabulary, the posting file
stores a list of pointers to the documents that contain that
term.
Each element in an inverted list is called a posting, i.e.,
the occurrence of a term in a document
Each list consists of one or many individual postings
Advantage of dividing inverted file into vocabulary and posting
Keeping a pointer in the vocabulary to the list in the posting
file allows:
the vocabulary to be kept in memory at search time even
for a large text collection, while the posting file is kept on
disk and read only to fetch the pointers to documents
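One way to picture this split is sketched below. It is an assumption-laden illustration: the posting lists are hypothetical, JSON is used as the on-disk encoding only for readability, and the "pointer" is a (byte offset, length) pair chosen for this example.

```python
import json
import os
import tempfile

# Hypothetical posting lists: term -> list of document IDs
postings = {"affect": [1, 2], "easy": [1, 2], "hard": [1]}

def write_index(postings, path):
    """Write posting lists to disk; return the in-memory vocabulary of pointers."""
    vocabulary = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            data = json.dumps(postings[term]).encode("utf-8")
            vocabulary[term] = (f.tell(), len(data))  # pointer into the posting file
            f.write(data)
    return vocabulary

def lookup(term, vocabulary, path):
    """Find the term in memory, then read only its posting list from disk."""
    if term not in vocabulary:
        return []
    offset, length = vocabulary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length).decode("utf-8"))

path = os.path.join(tempfile.mkdtemp(), "postings.bin")
vocabulary = write_index(postings, path)
print(lookup("easy", vocabulary, path))
```

Only the small `vocabulary` dictionary stays in memory; a query touches the disk once, for the one list it needs.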
General structure of Inverted File
The following figure shows the general structure of inverted
index file.
Organization of Index File
Vocabulary (word list) -> Postings (inverted lists) -> Documents

Term     DF   CF   Pointer to posting
term 1   3    3    -> inverted list
term 2   3    4    -> inverted list
term 3   1    1    -> inverted list
term 4   2    3    -> inverted list
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.
Doc 1: Negative affect can make it harder to do even easy tasks.
so make it easy
Stemming & compute frequency
Multiple entries for a term in a single document are merged, and
frequency information is added.
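The merge step can be sketched like this, assuming the parser has already produced stemmed (term, document ID) pairs. The pairs below are hypothetical and are not the slide's example documents.

```python
from collections import Counter, defaultdict

# Hypothetical (term, doc_id) pairs after parsing and stemming.
pairs = [("make", 1), ("easy", 1), ("task", 1),
         ("make", 2), ("easy", 2), ("make", 2)]

# Merge repeated (term, doc_id) entries; the count is the term frequency (TF).
tf = Counter(pairs)

# Group the merged entries into one posting list per term.
postings = defaultdict(list)
for (term, doc_id), freq in sorted(tf.items()):
    postings[term].append((doc_id, freq))

# DF = number of documents containing the term, CF = total occurrences.
vocabulary = {
    term: {"DF": len(plist), "CF": sum(f for _, f in plist)}
    for term, plist in postings.items()
}
print(dict(postings))
print(vocabulary)
```

Here "make" occurs twice in document 2, so its two entries collapse into the single posting (2, 2).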
Vocabulary and postings file
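The file is commonly split into a dictionary (vocabulary) and a posting
file, with pointers from each vocabulary entry to its postings.

Vocabulary:
Term        DF   CF
affect      2    2
difficult   1    1
do          2    2
easy        2    3
hard        1    1
make        2    3
negative    1    1
positive    2    1
task        2    2

Posting file entries (Doc #, TF): (1, 1), (2, 1), (1, 1), (1, 1), (2, 1),
(1, 2), (2, 1), (1, 2), (2, 1), (2, 1), (1, 1), (2, 1), (1, 1)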
Searching on Inverted File
Since the index file is divided into two parts, searching can
be done faster by loading only the vocabulary list, which takes
little memory even for a large document collection
The search is done in the vocabulary list
Using binary search on the sorted vocabulary, the search takes logarithmic time
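A minimal sketch of the vocabulary search, using the terms of the earlier example in sorted order. The posting lists are hypothetical placeholders, and the "pointer" is simply the position of the term in the list.

```python
from bisect import bisect_left

# Sorted vocabulary (kept in memory).
vocabulary = ["affect", "difficult", "do", "easy", "hard",
              "make", "negative", "positive", "task"]
# posting_lists[i] belongs to vocabulary[i]; the document IDs are hypothetical.
posting_lists = [[1, 2], [2], [1, 2], [1, 2], [1], [1, 2], [1], [2], [1, 2]]

def search(term):
    """Binary search over the vocabulary, O(log n) comparisons."""
    i = bisect_left(vocabulary, term)
    if i < len(vocabulary) and vocabulary[i] == term:
        return posting_lists[i]
    return []  # term is not in the vocabulary

print(search("hard"))
print(search("zebra"))
```

Binary search only works because the vocabulary is kept sorted; inserting a new term means placing it at its sorted position.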
Example: Create Inverted file
Create an inverted file (both the vocabulary list and
the posting file) for the following document collection
Vocabulary                          Posting file
word         DF   CF   WID         Doc#   TF   mTF   loc
advance      1    1    W1          5      1    1     5
bsc          1    1    W2          2      1    1     3
comput       3    3    W3          1      1    1     2
                                   2      1    1     4
                                   3      1    1     2
contribut    1    1    W4          5      1    1     4
department   3    3    W5          1      1    1     1
                                   2      1    1     1
                                   4      1    1     1
establish    1    1    W6          (continued)
field        1    1    W7
follow       1    1
graduat      1    1
intellect    1    1
launch       1    1
msc          1    1
phd          1    1
produce      1    1
profession   1    1
science      2    2
staff        1    1
start        1    1
study        1    1

All term-specific information (max TF, TF, TF-IDF, location, etc.) is
stored in the posting file.
Pointers link each word ID to its list in the document file:
W1: d5
W2: d2
W3: d1, d2, d3
Wn: di, …, dn
3. Forward Index
It is a data structure that stores a mapping from documents to
words, i.e., it directs you from a document to its words.
Steps to build a forward index:
Fetch the document and gather all the keywords.
document.
Repeat above steps for all documents
Document Keywords
doc1 hello, sky, morning
doc2 tea, coffee, hi
doc3 greetings, sky
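The table above can be produced with a short sketch such as the following. The document texts are invented to match the table, and `extract_keywords` is a hypothetical helper that simply treats every distinct word as a keyword.

```python
def extract_keywords(text):
    """Toy keyword extractor (assumption): distinct lowercase words, order kept."""
    seen = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word and word not in seen:
            seen.append(word)
    return seen

def build_forward_index(documents):
    """Map each document ID to the list of keywords found in that document."""
    return {doc_id: extract_keywords(text) for doc_id, text in documents.items()}

documents = {
    "doc1": "hello sky morning",
    "doc2": "tea coffee hi",
    "doc3": "greetings sky",
}
forward_index = build_forward_index(documents)
print(forward_index)
```

Finding which documents contain "sky" with this structure means scanning every entry, which is the weakness the inverted index addresses.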
DNS lookup
Inverted Index
It is a data structure that stores a mapping from words to a
document or set of documents, i.e., it directs you from a word
to its documents.
Steps to build an inverted index:
Fetch the document and gather all the words.
Word Documents
hello doc1
sky doc1, doc3
coffee doc2
hi doc2
greetings doc3
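As a sketch, the inverted index can be derived by turning a forward index around. The forward index below reuses the doc1 to doc3 keywords from the forward index example, and the function name `invert` is hypothetical.

```python
from collections import defaultdict

forward_index = {
    "doc1": ["hello", "sky", "morning"],
    "doc2": ["tea", "coffee", "hi"],
    "doc3": ["greetings", "sky"],
}

def invert(forward_index):
    """Map each word to the sorted list of documents that contain it."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word in forward_index[doc_id]:
            if doc_id not in inverted[word]:
                inverted[word].append(doc_id)
    return dict(inverted)

inverted_index = invert(forward_index)
print(inverted_index["sky"])
```

Because the sketch indexes every keyword, it also produces entries for "morning" and "tea", which the table above does not list.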
4. Partitioning
Document-partitioned inverted index
each compute node indexes a subset of the document
collection
each query is processed by every compute node
maintenance
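A single-process sketch of document partitioning, where each "node" is just an in-memory index. The round-robin assignment and the function names are assumptions made for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index (term -> set of doc IDs) over one node's documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def partition_documents(docs, num_nodes):
    """Assign documents to nodes round-robin (assumed policy), index each subset."""
    shards = [{} for _ in range(num_nodes)]
    for i, doc_id in enumerate(sorted(docs)):
        shards[i % num_nodes][doc_id] = docs[doc_id]
    return [build_index(shard) for shard in shards]

def search(node_indexes, term):
    """Send the query to every node and merge the partial results."""
    results = set()
    for index in node_indexes:
        results |= index.get(term, set())
    return sorted(results)

docs = {1: "hello sky morning", 2: "tea coffee hi",
        3: "greetings sky", 4: "morning coffee"}
nodes = partition_documents(docs, 2)
print(search(nodes, "morning"))
```

Every node is consulted for every query, since any node may hold a matching document.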
4. Partitioning
Term-partitioned inverted index
each compute node holds posting lists for a subset of
terms
queries are routed to compute nodes with relevant
terms
lower resource consumption, susceptible to imbalance
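A comparable sketch for term partitioning. Routing by a CRC32 hash of the term is an assumed rule chosen only because it is deterministic; the index contents and function names are hypothetical.

```python
import zlib

def node_for(term, num_nodes):
    """Assumed routing rule: stable hash of the term modulo the node count."""
    return zlib.crc32(term.encode("utf-8")) % num_nodes

def partition_terms(index, num_nodes):
    """Place each term's whole posting list on exactly one node."""
    nodes = [dict() for _ in range(num_nodes)]
    for term, postings in index.items():
        nodes[node_for(term, num_nodes)][term] = postings
    return nodes

def search(nodes, query_terms):
    """Contact only the nodes responsible for the query terms."""
    contacted = set()
    results = {}
    for term in query_terms:
        n = node_for(term, len(nodes))
        contacted.add(n)
        results[term] = nodes[n].get(term, [])
    return results, contacted

index = {"hello": [1], "sky": [1, 3], "coffee": [2], "hi": [2], "greetings": [3]}
nodes = partition_terms(index, 3)
results, contacted = search(nodes, ["sky", "coffee"])
print(results, contacted)
```

A two-term query contacts at most two nodes, which is the lower resource consumption; a node that happens to hold very frequent terms receives more traffic, which is the imbalance risk.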
5. Caching
What is cached?
Query results
Posting lists
Posting-list intersections
Documents
Snippets
Where is it cached?
in RAM of responsible compute node
Caching Strategies
Least recently used (LRU)
when space is needed, evict the item that was least
recently used.
Least frequently used (LFU)
when space is needed, evict the item that was least
frequently used.
Cost-aware (Landlord algorithm)
estimate for each item: temperature = access-rate / cost
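A minimal sketch of two of these strategies. `LRUCache` is a plain LRU built on `OrderedDict`. `coldest` only illustrates the temperature formula; it assumes that the item with the lowest temperature is the eviction candidate, which the slide does not state, and it is not a full implementation of the Landlord algorithm.

```python
from collections import OrderedDict

class LRUCache:
    """Least recently used cache, e.g. for query results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item

def coldest(stats):
    """stats: key -> (access_rate, cost). Lowest temperature = access_rate / cost."""
    return min(stats, key=lambda k: stats[k][0] / stats[k][1])

cache = LRUCache(2)
cache.put("q1", "result 1")
cache.put("q2", "result 2")
cache.get("q1")              # q1 becomes the most recently used item
cache.put("q3", "result 3")  # capacity exceeded: q2 is evicted
victim = coldest({"q1": (10.0, 2.0), "q2": (3.0, 3.0)})
print(list(cache.items), victim)
```

An LFU variant would keep an access counter per item and evict the smallest count instead of the oldest access.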
Thank you