Retrieval
The meaning
• What is retrieval?
– It is the process of searching for and accessing
information that satisfies the information need of
users
• What is information?
– Data made meaningful through contextualization
– It is the step where meaning is attached to data
to support decision making and problem solving
The Need for Information Retrieval
• The information explosion is the rapid increase in
the amount of published information and the effects
of this abundance.
– As the amount of available information grows, the
problem of managing the information becomes more
difficult, which can lead to information overload.
Typical IR System Architecture
[Figure: the IR system as a black box between the user and the
documents. A query string and a document corpus are fed to the IR
system, which returns a ranked list of relevant documents
(1. Doc1, 2. Doc2, 3. Doc3, ...).]
IR System vs. Web Search System
[Figure: the same architecture as above, except that the document
corpus is gathered by a web spider (crawler); the system returns a
ranked list of relevant pages (1. Page1, 2. Page2, 3. Page3, ...).]
The Retrieval Process
[Figure: the user expresses a need through the user interface; text
operations produce a logical view of both the documents and the query.
Documents from the text database are indexed into an inverted file
(term -> DocID); the formulated query is searched against the index
file, retrieved documents are ranked, and user feedback can be used to
reformulate the query.]
Course outline
Topic(s): Details
– Overview of IR: Define IR; The retrieval process; Basic structure of an IR system
– Text Document Operations: Tokenization; Stopword detection; Stemming; Normalization; Term weighting; Similarity measures
– Indexing Structures: The need for indexing; Inverted files; Suffix trees and suffix arrays; Signature files
– IR Models: A formal characterization of IR models; Boolean model, vector space model & probabilistic model
– Retrieval Evaluation: Evaluation of IR systems; Relevance judgement; Retrieval effectiveness measures (Recall, Precision, F-measure, etc.)
– Types of Query Languages and Operations: Query formulation; Keyword-based queries; Natural language queries; Query reformulation; Relevance feedback; Query expansion; Reweighting query terms
– Current Research Issues in IR: IR in local languages; Information extraction; Information filtering; Text summarization; Cross-language retrieval...
Presentation Assignment
Review the literature on the given topic and submit your report via Google
Classroom. You will also present a summary of your report to the class
within 5-7 minutes. Your report should provide: (i) an overview (including
a definition) of the concept; (ii) pros & cons of the concept, with
application areas; (iii) the architecture (including procedures) followed
during implementation; (iv) concluding remarks; (v) references
[Figure: free text is reduced to content-bearing index terms.]
Tokenization of Text
• Tokenization (also called Lexical Analysis) is the process
of converting a text document into a sequence of words,
w1, w2, ... wn.
– It is the process of demarcating and possibly classifying sections
of a string of input characters into words.
– How do we identify the set of words that exist in a text document?
Consider: The quick brown fox jumps over the lazy dog
• Objective - identify words in the text
– Tokenization greatly depends on how the concept of a word is
defined
• Is it a sequence of characters, numbers, or alpha-numerics?
A word is a sequence of letters terminated by a separator (period,
comma, space, etc.).
• The definition of letter and separator is flexible; e.g., a hyphen could
be defined as a letter or as a separator.
• Usually, common words (such as “a”, “the”, “of”, …) are ignored.
• Tokenization Issues
– numbers, hyphens, punctuation marks, apostrophes …
Issues in Tokenization
• One word or multiple: How to handle special cases involving hyphens,
apostrophes, punctuation marks, etc.? C++, C#, URLs, e-mail, …
– Sometimes punctuation (e-mail), numbers (1999), & case (Republican
vs. republican) can be a meaningful part of a token.
– However, frequently they are not.
• Two words may be connected by hyphens.
– Should two words connected by a hyphen be taken as one word or
two words? Break up the hyphenated sequence into two tokens?
• In most cases a hyphen is broken up (e.g. state-of-the-art →
state of the art), but some words, e.g. MS-DOS, B-49, are unique
words which require hyphens
• Two words may be connected by punctuation marks.
– Punctuation marks: remove totally unless significant, e.g. program
code: x.exe vs. xexe. What about Kebede’s, www.command.com?
• Two words (a phrase) may be separated by a space.
– E.g. Addis Ababa, San Francisco, Los Angeles
• Two words may be written in different ways
– lowercase, lower-case, lower case? data base, database, data-base?
Issues in Tokenization
• Numbers: are numbers/digits words, and should they be used as
index terms?
– For instance: dates (3/12/91 vs. Mar. 12, 1991); phone numbers
(+251923415005); IP addresses (100.2.86.144)
– Numbers alone are not good index terms (like 1910, 1999); but 510 B.C.
is unique. Generally, don’t index numbers as text, though they can be
very useful.
• What about the case of letters (e.g. Data or data or DATA)?
– Case is usually not important, so there is a need to convert all letters
to upper or lower case. Which one is mostly followed by human beings?
• The simplest approach is to ignore all numbers and
punctuation marks (period, colon, comma, brackets,
semi-colon, apostrophe, …) & use only case-insensitive
unbroken strings of alphabetic characters as words.
– “Meta-data”, including creation date, format, etc., is often indexed
separately.
• Issues of tokenization are language specific
– This requires the language to be known
Tokenization
• Analyze text into a sequence of discrete tokens
(words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters
that are grouped together as a useful semantic unit for
processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index
entry, after further processing
– But what are valid tokens to emit?
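The tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, assuming letters and digits form tokens, everything else separates them, and case is folded:

```python
import re

def tokenize(text):
    # A token is a maximal run of letters/digits; punctuation,
    # hyphens and whitespace all act as separators here.
    return re.findall(r"[a-z0-9]+", text.lower())

# "Friends, Romans and Countrymen" -> ['friends', 'romans', 'and', 'countrymen']
print(tokenize("Friends, Romans and Countrymen"))
```

Note that this policy silently splits C++, e-mail, and URLs, which is exactly the kind of decision the slides flag as language- and application-specific.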
Exercise: Tokenization
• The cat slept peacefully in the living room.
It’s a very old cat.
• Disadvantages
– Elimination of stopwords might reduce recall
• e.g. “To be or not to be” – everything is eliminated except
“be” – leading to no retrieval, or irrelevant retrieval
How to detect a stopword?
• Method One: Sort terms (in decreasing order) by
document frequency (DF) and take the most
frequent ones based on the cutoff point
–In a collection about insurance practices,
“insurance” would be a stop word
–Searching
• An online process that scans the document corpus to find
relevant documents that match a user's query
Indexing Subsystem
[Figure: the indexing pipeline]
documents → assign document identifier → (document IDs)
→ tokenization → (tokens)
→ stopword removal → (non-stoplist tokens)
→ stemming & normalization → (stemmed terms)
→ term weighting → (weighted index terms)
→ index file
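The pipeline above can be sketched in Python. The stoplist and the deliberately crude "s"-stripping stemmer are illustrative stand-ins, not a real stoplist or stemmer:

```python
def build_index(docs, stopwords=("a", "an", "the", "of", "in", "to")):
    """documents -> doc IDs -> tokens -> stopword removal ->
    stemming/normalization -> weighted index terms in an index file."""
    index = {}  # term -> {doc_id: term frequency}
    for doc_id, text in enumerate(docs, start=1):      # assign document identifier
        for tok in text.lower().split():               # tokenization + case folding
            tok = tok.strip(".,;:!?")                  # normalization
            if not tok or tok in stopwords:            # stopword removal
                continue
            stem = tok[:-1] if tok.endswith("s") else tok   # crude "stemmer"
            index.setdefault(stem, {}).setdefault(doc_id, 0)
            index[stem][doc_id] += 1                   # term weighting (raw TF)
    return index

idx = build_index(["New home sales rise.", "Home sales rise in July."])
# idx["sale"] == {1: 1, 2: 1}; "in" never reaches the index
```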
Searching Subsystem
[Figure: the searching pipeline]
query → parse query → (query tokens)
→ stopword removal → (non-stoplist tokens)
→ stemming & normalization → (stemmed terms)
→ term weighting → (query terms)
→ similarity measure against the index terms in the index file
→ ranking → ranked relevant document set
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search that which was not first indexed
in some manner or other
– indexing of documents or objects is done in
order to make them searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing
language
Building Index file
• An index file of a document collection is a file consisting of a list of
index terms & a link to one or more documents that contain each term
– A good index file maps each keyword Ki to the set of documents Di that
contain the keyword
• Space overhead
– Computer storage space consumed.
• Access types supported efficiently.
– Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File
• The sequential file is the most primitive file structure.
It has no vocabulary and no linking pointers.
• The records are generally arranged serially, one after
another, in lexicographic order on the value of some
key field.
– A particular attribute is chosen as the primary key, whose value
determines the order of the records.
– When the first key fails to discriminate among records, a
second key is chosen to give an order.
Example:
• Given a collection of documents, they are parsed
to extract words and these are saved with the
Document ID.
Doc 1: I did enact Julius Caesar. I was killed
in the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are
removed, and normalization and stemming are applied, to
generate index terms
• These index terms in the sequential file are sorted in
alphabetical order

Sequential file:
No.  Term       Doc #
1    ambition   2
2    brutus     1
3    brutus     2
4    capitol    1
5    caesar     1
6    caesar     2
7    caesar     2
8    enact      1
9    julius     1
10   kill       1
11   kill       1
12   noble      2
Complexity Analysis
• Creating a sequential file requires O(n log n)
time, where n is the total number of content-bearing
words identified from the corpus.
• Since terms in the sequential file are sorted, the
search time is logarithmic using binary search.
• Updating the index file requires re-indexing;
that means incremental indexing is not
possible
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic
order;
– instead of linear-time search, one can search in logarithmic
time using binary search.
• Its disadvantages:
– difficult to update. The index must be rebuilt if a new term is
added. Inserting a new record may require moving a large
proportion of the file;
– random access is extremely slow.
• The problem of update can be solved:
– by ordering records by date of acquisition, rather than by key
value; hence, the newest entries are added at the end of the
file & therefore pose no difficulty to updating. But searching
becomes very tough; it requires linear time
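A sketch of a sorted sequential file of (term, document) records, searched in logarithmic time with binary search via Python's standard bisect module; the records themselves are illustrative:

```python
import bisect

# Sequential file: (key, doc_id) records kept in lexicographic order.
records = sorted([("caesar", 1), ("brutus", 1), ("caesar", 2),
                  ("noble", 2), ("brutus", 2), ("kill", 1)])

def lookup(term):
    # Binary search for the first and last record with this key: O(log n).
    lo = bisect.bisect_left(records, (term,))
    hi = bisect.bisect_right(records, (term, float("inf")))
    return [doc for _, doc in records[lo:hi]]

# lookup("caesar") -> [1, 2]. Inserting a new term means re-sorting the
# whole file, which is exactly the update problem described above.
```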
Inverted file
• A technique that indexes based on a sorted list of terms, with each
term having links to the documents containing it
– Building and maintaining an inverted index is a relatively low-cost,
low-risk task. On a text of n words an inverted index can be built in
O(n) time
• Content of the inverted file: Data to be held in the inverted file
includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in the
document collection)
• The occurrences: contain one record per term, listing
– Frequency of each term in a document
• TFij, the number of occurrences of term tj in document di
• DFj, the number of documents containing tj
• CFj, the collection frequency of tj (its total occurrences in the
collection)
– Locations/positions of words in the text
Inverted file
•Why vocabulary?
–Having information about vocabulary (list of terms) speeds
searching for relevant documents
•Why location?
–Having information about the location of each term
within the document helps for:
• user interface design: highlight location of search term
• proximity based ranking: adjacency and near operators (in
Boolean searching)
•Why frequencies?
• Having information about frequency is used for:
–calculating term weighting (like IDF, TF*IDF, …)
–optimizing query processing
Inverted File
Documents are organized by the terms/words they contain.
This is called an index file. Text operations are performed
before building the index.

Term    CF   Doc ID   TF   Location
auto     3      2      1   66
               19      1   213
               29      1   45
bus      4      3      1   94
               19      2   7, 212
               22      1   56
taxi     1      5      1   43
train    3     11      2   3, 70
               34      1   40
Organization of Index File
• An inverted index consists of two files:
• vocabulary file
• posting file
[Figure: a vocabulary file (Act 3 3; Bus 3 4; pen 1 1; total 2 3)
with pointers into the inverted lists of the posting file.]
Inverted File
• Vocabulary file
– A vocabulary file (word list):
• stores all of the distinct terms (keywords) that appear in any of
the documents (in lexicographical order) and
• for each word, a pointer to the posting file
– The record kept for each term j in the word list contains the
following: term j, DFj, CFj and a pointer to the posting file
• Postings File (Inverted List)
– For each distinct term in the vocabulary, stores a list of pointers to
the documents that contain that term.
– Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document
– It is stored as a separate inverted list for each term, i.e., a list
corresponding to each term in the index file.
• Each list consists of one or many individual postings holding the
Document ID, TF and location information for a given term i
Construction of Inverted file
Advantage of dividing inverted file:
• Keeping a pointer in the vocabulary to the list in the
posting file allows:
– the vocabulary to be kept in memory at search time even
for large text collection, and
– the posting file to be kept on disk for access to the
documents
• Exercise:
– For a terabyte text collection, if 1 page is 100KB and
each page contains 250 words on average, calculate
the memory space required for the vocabulary words.
Assume 1 word contains 10 characters.
Inverted index storage
• Separation of the inverted file into a vocabulary and a posting
file is a good idea.
– Vocabulary: for searching purposes we need only the word list.
This allows the vocabulary to be kept in memory at search
time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000
distinct words. Hence, the size of the index is 100 MBs, which can easily
be held in the memory of a dedicated computer.
– The posting file requires much more space.
• For each word appearing in the text we keep statistical
information related to its occurrences in documents.
• The postings pointers to the documents require an extra space
of O(n).
• How to speed up access to the inverted file?
Example:
• Given a collection of documents, they are parsed
to extract words and these are saved with the
Document ID.
Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized,
the inverted file is sorted by terms

Unsorted (term, doc #): I 1; did 1; enact 1; julius 1; caesar 1;
I 1; was 1; killed 1; I 1; the 1; capitol 1; brutus 1; killed 1;
me 1; so 2; let 2; it 2; be 2; with 2; caesar 2; the 2; noble 2;
brutus 2; hath 2; told 2; you 2; caesar 2; was 2; ambitious 2

Sorted (term, doc #): ambitious 2; be 2; brutus 1; brutus 2;
capitol 1; caesar 1; caesar 2; caesar 2; did 1; enact 1; hath 2;
I 1; I 1; I 1; it 2; julius 1; killed 1; killed 1; let 2; me 1;
noble 2; so 2; the 1; the 2; told 2; was 1; was 2; with 2; you 2

Next steps: remove stopwords, apply stemming & compute term frequency.
Complexity Analysis
• The inverted index can be built in O(n) + O(n log n)
time, where n is the number of vocabulary terms.
• Since terms in the vocabulary file are sorted,
searching takes logarithmic time.
• To update the inverted index it is possible to
apply incremental indexing, which requires
O(k) time, where k is the number of new index terms
Exercise
• Construct the inverted index for the following
document collections.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
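A sketch of the construction for this exercise (no stopword removal or stemming, so the result can be checked by hand):

```python
docs = {
    1: "new home to home sales forecasts",
    2: "rise in home sales in july",
    3: "home sales rise in july for new homes",
    4: "july new home sales rise",
}

inverted = {}  # vocabulary term -> postings {doc_id: term frequency}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, {}).setdefault(doc_id, 0)
        inverted[term][doc_id] += 1

# e.g. the postings for "home" are {1: 2, 2: 1, 3: 1, 4: 1}
# (note TF = 2 in Doc 1); "homes" remains a separate, unstemmed term.
```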
Suffix Trie and Tree
Suffix trie
• What is a suffix? A suffix is a substring that extends from some
position to the end of the given string.
– Each position in the text defines a text suffix
– If txt = t1 t2 ... ti ... tn is a string, then Ti = ti ti+1 ... tn is the suffix
of txt that starts at position i.
• Example: txt = mississippi txt = GOOGOL
T1 = mississippi; T1 = GOOGOL
T2 = ississippi; T2 = OOGOL
T3 = ssissippi; T3 = OGOL
T4 = sissippi; T4 = GOL
T5 = issippi; T5 = OL
T6 = ssippi; T6 = L
T7 = sippi;
T8 = ippi;
T9 = ppi;
T10 = pi;
T11 = i;
Suffix trie
• A suffix trie is an ordinary trie in which the input
strings are all possible suffixes.
– Principle: the idea behind the suffix trie is to assign to each
symbol in a text an index corresponding to its position in the
text (i.e. the first symbol has index 1, the last symbol has index n,
the number of symbols in the text).
– To build the suffix trie we use these indices instead of the
actual objects.
• The structure has several advantages:
– We do not have to store the same object twice (no
duplicates).
– Whatever the size of the text, the search time is
linear in the length of the query string S.
Suffix Trie
Construct a suffix trie for the following string: GOOGOL
We begin by giving a position to every suffix in the text, starting
from left to right as per the characters' occurrence in the string.
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
Build a suffix trie for all n suffixes of the text.
Note: the resulting tree has n leaves and height n.
This structure is particularly useful for any application requiring
prefix-based ("starts with") pattern matching.
Suffix tree
• A suffix tree is a member of the trie family. It is a trie
of all the proper suffixes of S.
– The suffix tree is created by compacting unary nodes of
the suffix trie.
• We store pointers rather than words in the leaves.
– It is also possible to replace the string on every edge by a
pair (a, b), where a & b are the beginning and end index
of the string, i.e.
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
Let s = abab. A suffix tree of s is a compressed trie
of all suffixes of s$ = abab$:
1 abab$
2 bab$
3 ab$
4 b$
5 $
[Figure: the compressed trie. Each leaf stores the starting position
of its suffix: the edge "ab" branches to "ab$" (position 1 via abab$,
position 3 via ab$), the edge "b" branches likewise (positions 2 and
4), and "$" alone leads to position 5. The figure also sketches a
generalised suffix tree over the pair of strings abab$ and aab#.]
Search in suffix tree
• Searching for all instances of a substring S in a suffix
tree is easy, since any substring of S is the prefix of
some suffix.
• Pseudo-code for searching in a suffix tree:
– Start at the root
– Go down the tree by taking each time the corresponding path
– If S corresponds to a node x, then return all leaves in the sub-tree
• The places where S can be found are given by the pointers
in all the leaves of the sub-tree rooted at x.
– If S encounters a NIL pointer before reaching the end, then
S is not in the tree
Example:
• If S = "GO" we take the GO path and return:
GOOGOL$, GOL$.
• If S = "OR" we take the O path and then we hit a NIL
pointer so "OR" is not in the tree.
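The same prefix-of-a-suffix idea can be sketched with a suffix array (the sorted leaves of a suffix tree, storing start positions rather than strings); positions here are 0-based:

```python
def suffix_array(text):
    # Start positions of all suffixes, sorted by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, pattern):
    # Every occurrence of pattern is a prefix of some suffix, so keep
    # the start positions whose suffix begins with the pattern.
    return sorted(i for i in suffix_array(text)
                  if text[i:i + len(pattern)] == pattern)

# occurrences("GOOGOL", "GO") -> [0, 3]
# occurrences("GOOGOL", "OR") -> []
```

A real suffix tree or array would binary-search the sorted suffixes instead of scanning them all; the linear scan just keeps the sketch short.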
Drawbacks
• Suffix trees consume a lot of space
– Even if word beginnings are indexed, a space
overhead of 120% - 240% over the text size is
produced, because depending on the
implementation each node of the suffix tree
takes space (in bytes) proportional to the
number of symbols used.
– How much space is required at each node for
English word indexing based on the alphabet a to z?
• How many bytes are required to store
MISSISSIPPI?
Signature file
• A word-oriented index structure based on hashing
• How to build a signature file:
– Hash each term to a fixed-size F-bit vector
(word signature)
– Divide the text into blocks of N words each
– Assign an F-bit mask to each text block of size N
(document signature)
• This is obtained by bitwise ORing the signatures of all the
words in the text block.
• Efficient for searching phrases
• Hence the signature file is no more than the sequence
of bit masks of all blocks (plus a pointer to each block).
Structure of Signature File
[Figure: the signature file holds one F-bit document signature per
text block (e.g. 0 1 ... 0 1), plus a pointer from each signature to
its block in the text file; N blocks in total.]
Example
• Given a text: “A text has many words. Words are made from letters”
Text block signatures:
1110101   0111100   1011111
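The ORing scheme can be sketched as follows. The hash function, F = 16 bits, and 3 bits per word are arbitrary choices made for illustration only:

```python
import hashlib

F_BITS = 16          # signature width
BITS_PER_WORD = 3    # bits set per word signature

def word_signature(word):
    # Hash the word to an F-bit vector with a few bits set.
    sig = 0
    for k in range(BITS_PER_WORD):
        h = int(hashlib.md5(f"{word}:{k}".encode()).hexdigest(), 16)
        sig |= 1 << (h % F_BITS)
    return sig

def block_signature(words):
    # Bitwise-OR the signatures of all words in the text block.
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, word):
    # All bits of the word signature must be set in the block signature.
    # False positives are possible; false negatives are not.
    ws = word_signature(word)
    return block_sig & ws == ws

block = block_signature("a text has many words".split())
# may_contain(block, "text") is guaranteed True; a block that passes the
# test must still be scanned to rule out a false positive.
```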
General Procedures Followed
To find relevant documents for a given query:
• First, map documents and queries into a term-document
vector space.
Note that queries are treated as short documents.
• Second, represent queries and documents as
weighted vectors, wij.
There are binary weighting & non-binary weighting
techniques.
• Third, rank documents by the closeness of their vectors to
the query.
Closeness is determined by a similarity score calculation.
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows the
occurrence of terms in the document collection or query:

  dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.

       T1   T2   ....  TN
  D1   w11  w12  ...   w1N
  D2   w21  w22  ...   w2N
  :    :    :          :
  DM   wM1  wM2  ...   wMN
  Qi   wi1  wi2  ...   wiN

– Document collection is mapped to a term-by-document matrix
– View each row as a vector in multidimensional space
• Nearby vectors are related
– Normalize for vector length to avoid the effect of document length
How to evaluate Models?
Criteria for selecting an IR model consider the procedures
followed and the techniques used by the IR model:
• What is the weighting technique used by the IR Models
for measuring importance of terms in documents?
– Are they using binary or non-binary weight?
• What is the matching technique used by the IR models?
– Are they measuring similarity or dissimilarity?
• Are they applying exact matching or partial matching in
the course of finding relevant documents for a given
query?
• Are they applying best matching principle to measure
the degree of relevance of documents to display in
ranked-order?
– Is there any Ranking mechanism applied before
displaying relevant documents for the users?
The Boolean Model
• The Boolean model is a simple model based on set theory.
The Boolean model imposes a binary criterion
for deciding relevance.
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1, if the document satisfies the boolean query;
0 otherwise.
– Note that no weights are assigned in-between 0 and 1,
just the values 0 or 1.

       T1   T2   ....  TN
  D1   w11  w12  ...   w1N
  D2   w21  w22  ...   w2N
  :    :    :          :
  DM   wM1  wM2  ...   wMN
The Boolean Model: Example
Given the following three documents, Construct Term – document
matrix and find the relevant documents retrieved by the
Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
Table below shows the document-term (ti) incidence matrix:

        arrive  damage  deliver  fire  gold  silver  ship  truck
D1        0       1        0      1     1      0      1     0
D2        1       0        1      0     0      1      0     1
D3        1       0        0      0     1      0      1     1
query     0       0        0      0     1      1      0     1
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
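A sketch of the Boolean model on the three documents above, using binary incidence only. Under a strict AND reading of “gold silver truck” no document qualifies, while an OR reading returns all three; that exact-match rigidity is the model's well-known weakness:

```python
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
incidence = {d: set(text.split()) for d, text in docs.items()}  # w_ij in {0,1}

def boolean_and(terms):
    # Documents containing ALL query terms.
    return sorted(d for d, words in incidence.items()
                  if all(t in words for t in terms))

def boolean_or(terms):
    # Documents containing ANY query term.
    return sorted(d for d, words in incidence.items()
                  if any(t in words for t in terms))

# boolean_and(["gold", "silver", "truck"]) -> []
# boolean_or(["gold", "silver", "truck"])  -> ["D1", "D2", "D3"]
# boolean_and(["gold", "truck"])           -> ["D3"]
```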
Exercise
Given the following four documents with the
following contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
       T1   T2   ....  TN      • How to compute weights
  D1   w11  w12  ...   w1N       for term i in document j and
  D2   w21  w22  ...   w2N       in query q; wij and wiq?
  :    :    :          :
  DM   wM1  wM2  ...   wMN
Computing weights
• The vector space model with TF*IDF weights is a good
ranking strategy for general collections.
• For index terms, a normalized TF*IDF weight is given by:

  wij = ( freq(i,j) / max_k freq(k,j) ) * log(N/ni)

• A user's query is typically treated as a short document and
is also TF-IDF weighted.
For the query term weights, a suggestion is:

  wiq = ( 0.5 + 0.5 * freq(i,q) / max_k freq(k,q) ) * log(N/ni)

• The vector space model is usually as good as the known
ranking alternatives. It is also simple and fast to compute.
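The two weighting formulas can be written down directly. Base-10 logarithms are assumed here because they reproduce the weights in the worked examples that follow (e.g. log10(3) ≈ 0.477):

```python
import math

def doc_weight(freq_ij, max_freq_j, N, n_i):
    # w_ij = (freq(i,j) / max_k freq(k,j)) * log(N / n_i)
    return (freq_ij / max_freq_j) * math.log10(N / n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    # w_iq = (0.5 + 0.5 * freq(i,q) / max_k freq(k,q)) * log(N / n_i)
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log10(N / n_i)

# Slide's numbers: 10,000 documents; term A occurs 20 times in document j,
# the most frequent term in j occurs 50 times, and A occurs in 2,000
# of the collection's documents:
w_A = doc_weight(20, 50, 10_000, 2_000)   # (20/50) * log10(5) ~ 0.28
```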
Example: Computing weights
• A collection includes 10,000 documents.
The term A appears 20 times in a particular document j.
The maximum appearance of any term in document j is 50.
The term A appears in 2,000 of the collection's documents.

Similarity is then measured with the cosine:

  sim(dj, q) = ( Σi wi,j * wi,q ) / ( sqrt(Σi wi,j²) * sqrt(Σi wi,q²) )

Resulting TF*IDF weights (Wi = TF*IDF) for the query
“gold silver truck” and documents D1-D3:

Terms    Q      D1     D2     D3
arrive   0      0      0.176  0.176
damage   0      0.477  0      0
deliver  0      0      0.477  0
fire     0      0.477  0      0
gold     0.176  0.176  0      0.176
silver   0.477  0      0.954  0
ship     0      0.176  0      0.176
truck    0.176  0      0.176  0.176
Vector-Space Model: Example
• Compute similarity using the cosine, Sim(q,di)
• First, for each document and query, compute all vector
lengths (zero terms ignored):

|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517)  = 0.719
|d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124)  = 0.352
• Disadvantages:
• Assumes independence of index terms; it doesn’t relate
one term to another
• Computationally expensive, since it measures the similarity
between each document and the query
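The cosine computation can be sketched on sparse vectors (dicts of term → weight); the weights below are the ones from the gold/silver/truck table above:

```python
import math

def cosine(u, v):
    # sim = (sum of products) / (|u| * |v|); zero terms are ignored.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q  = {"gold": 0.176, "silver": 0.477, "truck": 0.176}
d2 = {"arrive": 0.176, "deliver": 0.477, "silver": 0.954, "truck": 0.176}
# cosine(q, d2) ~ 0.82, making D2 the top-ranked document for this query
```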
Assignment
Given short descriptions of four popular books on database systems:
• Document A: “Fundamentals of Database Systems”
This book discusses the most popular database topics, including SQL, security,
and data mining along with an introduction to UML modeling and an entirely
new chapter on XML and Internet databases.
• Document B: “An Introduction to Database Systems”
This book provides a comprehensive introduction to the now very large field of
database systems by providing a solid ground in the foundations of database
technology.
• Document C: “Databases & Transaction Processing: An Application-Oriented”
The book presents the engineering principles underlying the implementation of
database and transaction processing. This text presents the theory underlying
relational databases and relational query languages applications.
• Document D: “Database Management Systems”
This document provides comprehensive and up-to-date coverage of the
fundamentals of database systems. Coherent explanations and practical
examples have made this one of the leading texts in the field.
• Problem:
– Construct an inverted index file
– Calculate the relevance of each document for the query “Data base system”
using the vector model & the TFIDF method.
– Identify relevant documents in ranked order
Exercise
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
–Draw the term-document incidence matrix for this document
collection.
–Draw the inverted index representation for this collection.
The probabilistic model ranks documents by the log odds of relevance:

  log O(R|D) = log [ p(R|D) / p(¬R|D) ] = log [ p(R) p(D|R) / ( p(¬R) p(D|¬R) ) ]

where p(¬R|D) = 1 - p(R|D).
Probabilistic Models
• Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
• These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (rsv)
D is a vector of binary term
occurrences
We assume that terms occur
independently of each other
Principles surrounding weights
• Independence Assumptions
– I1: The distribution of terms in relevant documents is
independent and their distribution in all documents is
independent.
– I2: The distribution of terms in relevant documents is
independent and their distribution in non-relevant documents is
independent.
• Ordering Principles
– O1: Probable relevance is based only on the presence of
search terms in the documents.
– O2: Probable relevance is based on both the presence of
search terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
– R is completely unknown
– Assume P(ti|R) is constant (usually 0.5)
– Assume P(ti|NR) is approximated by the distribution
of ti across the collection – IDF
• Predictive formulation
– To guarantee that the denominator is never zero, a
minor 0.5 is added to all numerators and denominators:

  w(1) = log [ (r + 0.5)(N - n - R + r + 0.5) / ( (n - r + 0.5)(R - r + 0.5) ) ]

where r is the number of relevant documents containing the term,
n the number of documents containing the term, R the number of
relevant documents, and N the total number of documents.
Relevance weighted Example
[Table: binary document vectors <tf d,t> for six documents over the
terms col, day, eat, hot, lot, nin, old, pea, por, pot, each judged
R or NR; only document 2 is relevant. Applying the weighting formula
gives query-term weights wqt of 0.00 for most terms and values such
as 0.62, 0.62, 0.95, 0.33 and 0.33 for the terms that distinguish the
relevant document.]
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)
• D2: “Cost of jellybeans is up.” (not relevant)
• D3: “Salaries of CEO’s are up.” (not relevant)
• D4: “Paper: CEO’s labor cost up.” (????)
Probabilistic Retrieval Example
       cost   paper  jellybean  salary  CEO     labor   up
D1      1      1        0         0      0       0       1
D2      1      0        1         0      0       0       1
D3      0      0        0         1      1       0       1
D4      1      1        0         0      1       1       1
Wij   0.477  1.176   -0.477    -0.477  -0.477  0.222  -0.222

• D1 = 0.477 + 1.176 - 0.222 = 1.431
• D2 = 0.477 - 0.477 - 0.222 = -0.222
• D3 = -0.477 - 0.477 - 0.222 = -1.176
• D4 = 0.477 + 1.176 - 0.477 + 0.222 - 0.222 = 1.176
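The weights in the Wij row can be reproduced with the smoothed relevance-weight formula from the earlier slide, taking D1-D3 as the judged set (N = 3, R = 1, base-10 log):

```python
import math

def rsj_weight(r, n, R, N):
    # w = log[ (r+0.5)(N-n-R+r+0.5) / ((n-r+0.5)(R-r+0.5)) ]
    # r: relevant docs containing the term, n: docs containing the term,
    # R: relevant docs, N: judged docs.
    return math.log10((r + 0.5) * (N - n - R + r + 0.5)
                      / ((n - r + 0.5) * (R - r + 0.5)))

# "cost":      in D1 (relevant) and D2 -> r=1, n=2: log10(3)  ~  0.477
# "paper":     in D1 only             -> r=1, n=1: log10(15) ~  1.176
# "jellybean": in D2 only             -> r=0, n=1:           ~ -0.477
```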
Exercise
• Consider the collection below. The collection has 5 documents
and each document is described by two terms. The initial
guess of relevance to a particular query Q is as given in the
table below. Assuming the query Q has a total of 2 relevant
documents in this collection solve the following questions
Document T1 T2 Relevance
D1 1 1 R
D2 0 1 NR
D3 1 0 NR
D4 1 0 R
D5 0 1 NR
• Using the probabilistic term weighting formula, calculate the
new weight for each of the query terms in Q
• Rank the documents according to their probability of relevance
with the new query
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving
better term probability estimates
• Advantages of the probabilistic model over vector-space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors
Retrieval Effectiveness
• Evaluation of IR systems
• Relevance judgement
• Performance measures
– Recall,
– Precision,
– Single-valued measures
Why System Evaluation?
• Any system needs validation and verification
– Check whether the system is right or not
– Check whether it is the right system or not
• It provides the ability to measure the difference between IR
systems
– How well do our search engines work?
– Is system A better than B?
– Under what conditions?
• Evaluation drives what to study
– Identify techniques that work well and those that do not
– There are many retrieval models/algorithms
• which one is the best?
– What is the best component for:
• Similarity measures (dot-product, cosine, …)
• Index term selection (tokenization, stop-word removal,
stemming…)
• Term weighting (TF, TF-IDF,…)
Types of Evaluation Strategies
• User-centered evaluation
– Given several users, and at least two retrieval
systems
• Have each user try the same task on both systems
• Measure which system works the “best” for the user's
information need
• How do we measure user satisfaction?
• System-centered evaluation
– Given documents, queries, and relevance
judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
The Notion of Relevance Judgment
• Relevance is the measure of the correspondence existing
between a document and a query.
– Construct a document-query matrix as determined by:
(i) the user who posed the retrieval problem;
(ii) an external judge;
(iii) an information specialist

       Q1  Q2  ....  QN
  D1   R   N   ....  R
  D2   R   N   ....  R
  :    :   :         :
  DM   R   N   ....  R

– Is the relevance judgment made by users and by an external
person the same?
• Relevance judgment is usually:
– Subjective: Depends upon a specific user’s judgment.
– Situational: Relates to the user’s current needs.
– Cognitive: Depends on human perception and behavior.
– Dynamic: Changes over time.
Measuring Retrieval Effectiveness
Metrics often used to evaluate the effectiveness of the system:

                 Relevant          Irrelevant
Retrieved        True Positive     False Positive
Not retrieved    False Negative    True Negative

  Recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
  Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

[Figure: Venn diagram of the Relevant and Retrieved sets; both
measures are computed from their intersection.]
Example
Assume that there are a total of 10 relevant documents.
Ranking Relevance Recall Precision
1. Doc. 50 R 0.10 1.00
2. Doc. 34 NR 0.10 0.50
3. Doc. 45 R 0.20 0.67
4. Doc. 8 NR 0.20 0.50
5. Doc. 23 NR 0.20 0.40
6. Doc. 16 NR 0.20 0.33
7. Doc. 63 R 0.30 0.43
8. Doc 119 R 0.40 0.50
9. Doc 21 NR 0.40 0.44
10. Doc 80 R 0.50 0.50
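The table can be recomputed mechanically; the flags below encode the R/NR column of the ranking above:

```python
def precision_recall(flags, total_relevant):
    # flags[k] is True when the document at rank k+1 is relevant.
    rows, hits = [], 0
    for rank, rel in enumerate(flags, start=1):
        hits += rel
        rows.append((round(hits / total_relevant, 2),  # recall
                     round(hits / rank, 2)))           # precision
    return rows

rows = precision_recall(
    [True, False, True, False, False, False, True, True, False, True],
    total_relevant=10)
# rows[0] == (0.1, 1.0), rows[6] == (0.3, 0.43), rows[9] == (0.5, 0.5)
```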
Graphing Precision and Recall
• Plot each (recall, precision) point on a graph
– Recall is a non-decreasing function of the number of documents
retrieved,
–Precision usually decreases (in a good system)
• Precision/Recall tradeoff
–Can increase recall by retrieving many documents (down to
a low level of relevance ranking), but many irrelevant
documents would be fetched, reducing precision
–Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision (y-axis, 0–1) against recall (x-axis, 0–1).
The ideal system keeps precision high at all recall levels. A
high-precision/low-recall system returns relevant documents but
misses many useful ones; a high-recall/low-precision system
returns most relevant documents but includes lots of junk.]
Need for Interpolation
• Two issues:
–How do you compare performance across queries?
–Is the sawtooth shape intuitive for understanding and
interpreting the performance result?
[Figure: a typical sawtooth precision/recall curve, precision
(0–1) on the y-axis against recall (0–1) on the x-axis.]
Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
• It is a general form of precision/recall calculation
• Precision change w.r.t. Recall (not a fixed point)
– It is an empirical fact that on average as recall increases,
precision decreases
• Interpolate precision at 11 standard recall levels:
– rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
where j = 0 … 10
• The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall
level between the jth and (j + 1)th level:
P(rj) = max P(r), for rj ≤ r ≤ rj+1
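A sketch of 11-point interpolation, using the common convention of taking the maximum precision at any recall ≥ the standard level; note that under this convention, levels beyond the highest observed recall fall to 0 rather than carrying the last precision forward:

```python
def interpolated_precision(pr_points, levels=None):
    """Interpolate precision at 11 standard recall levels.
    pr_points: (recall, precision) pairs observed on the ranked list.
    Interpolated P at level r = max precision at any recall >= r."""
    if levels is None:
        levels = [round(0.1 * j, 1) for j in range(11)]
    result = []
    for r in levels:
        candidates = [p for rec, p in pr_points if rec >= r]
        result.append((r, max(candidates) if candidates else 0.0))
    return result
```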
Example: Interpolation
Assume that there are a total of 10 relevant documents.

Recall   Precision
0.00     1.00
0.10     1.00
0.20     0.67
0.30     0.50
0.40     0.50
0.50     0.50
0.60     0.50
0.70     0.50
0.80     0.50
0.90     0.50
1.00     0.50
Result of Interpolation
[Figure: the interpolated precision/recall curve, now a step
function, precision on the y-axis against recall on the x-axis.]
Interpolating across queries
• For each query, calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
F-Measure
• Harmonic mean of recall and precision:

F = 2PR / (P + R) = 2 / (1/R + 1/P)

• Compared to the arithmetic mean, both precision and recall
need to be high for the harmonic mean to be high.
• What if no relevant documents exist?
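The F-measure (harmonic mean of precision and recall) can be sketched directly, with a guard for the degenerate case where no relevant documents are retrieved:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:   # guard: nothing relevant retrieved
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.50, 0.50), 2))   # 0.5
```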
Example
Recall  Precision  F-Measure
0.10    1.00       0.18
0.10    0.50       0.17
0.20    0.67       0.31
0.20    0.50       0.29
0.20    0.40       0.27
0.20    0.33       0.25
0.30    0.43       0.35
0.40    0.50       0.44
0.40    0.44       0.42
0.50    0.50       0.50
E-Measure
• Associated with Van Rijsbergen
• Allows the user to specify the relative importance of recall
and precision
• It is a parameterized F-measure, a variant that allows
weighting emphasis on precision over recall:

E = 1 − (1 + b²)PR / (b²P + R) = 1 − (1 + b²) / (b²/R + 1/P)
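A minimal sketch of the parameterized measure (b is the weighting parameter; lower E is better, and b = 1 gives E = 1 − F1):

```python
def e_measure(precision, recall, b=1.0):
    """Van Rijsbergen's E = 1 - F_b; b controls the
    precision/recall balance, lower E is better."""
    num = (1 + b * b) * precision * recall
    den = b * b * precision + recall
    return 1.0 - (num / den if den else 0.0)

print(round(e_measure(0.5, 0.5), 2))   # b = 1: 1 - F1 = 0.5
```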
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall

Miss = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|

Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
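These complements can be computed from the same sets used for precision and recall; a sketch (the total collection size is needed to count the not-relevant documents for fallout):

```python
def noise_silence_fallout(relevant, retrieved, collection_size):
    """Noise = 1 - precision, silence/miss = 1 - recall,
    fallout = retrieved-but-not-relevant / all not-relevant."""
    hits = len(relevant & retrieved)
    noise = (len(retrieved) - hits) / len(retrieved)
    silence = (len(relevant) - hits) / len(relevant)
    not_relevant = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / not_relevant
    return noise, silence, fallout

# hypothetical judgments over a 20-document collection
print(noise_silence_fallout({1, 3, 5, 7, 9}, {1, 2, 3, 4}, 20))
```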
Exercise
• Define precision and recall
• What are the disadvantages of using precision/
recall as performance measures in IR?
• Give example cases of getting
– High precision but low recall
– High recall, but low precision
• Describe the 11-point interpolated average
precision method
• What are the single measures of performance
that can be used?
• What are the subjective measures used in IR?
• In commercial WWW search engines what are
the most important factors in terms of
performance by which they are evaluated by the
users?
Query Languages
Keyword-based querying
• A query is an expression of the user's information
need.
– IR queries are distinctive in that they are unstructured
and often ambiguous; they differ from standard query
languages which are governed by strict syntax rules.
• Queries are combinations of words.
– The document collection is searched for documents that
contain these words.
• Word queries are intuitive, easy to express and
provide fast ranking.
Single-word queries
• A query is a single word
– Usually used for searching in document images
Natural language
• Using natural language for querying is very
attractive.
• Example: Find all the documents that discuss
“campaign finance reforms, including documents that
discuss violations of campaign financing regulations. Do
not include documents that discuss campaign
contributions by the gun and the tobacco industries”.
• Natural language queries are converted to a formal
language for processing against a set of
documents.
• Such translation requires intelligence and is still a
challenge
Natural language
• Pseudo NL processing: System scans the text and
extracts recognized terms and Boolean connectors.
The grammaticality of the text is not important.
– Often used by search engines.
• Problem: Recognizing the negation in the search
statement (“Do not include...”).
• Compromise: Users enter natural language clauses
connected with Boolean operators.
• In the above example: “campaign finance reforms”
or “violations of campaign financing regulations" and
not “campaign contributions by the gun and the
tobacco industries”.
Exercise
• What are the main types of queries?
• List the main “query languages” used in IR
• Describe the Levenshtein distance and its
use in IR
• Describe what is the longest common subsequence
(LCS) and its use in IR
• What is regular expression and how can it be used
in IR?
• What are structural queries?
Query Operations
• Relevance Feedback
• Query Reformulation
• Query Expansion
• Query term reweighting
Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
◦ “restaurant” vs. “café”
◦ “Abyssinia” vs. “Ethiopia” vs. “Walia”
Relevance Feedback Architecture
[Figure: the query string is run by the IR system against the
document corpus, returning ranked relevant documents (1. Doc1,
2. Doc2, 3. Doc3, …). The user's feedback on these documents
drives query reformulation; the revised query is resubmitted to
the IR system, which returns a re-ranked list (1. Doc2, 2. Doc4,
3. Doc5, …).]
Pseudo Relevance Feedback
• Use relevance feedback methods without explicit user
input.
• Steps
–Obtain relevance feedback automatically.
• Just assume the top K retrieved documents are relevant, &
use them to reformulate the query.
–Allow for query expansion that includes terms from
documents that are correlated with the query terms.
• Identify terms related to query terms (such as synonyms
and/or similar terms that are close to query terms in the
text)
• Two strategies
–Local strategies
–Global strategies
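The pseudo-feedback steps can be sketched as a Rocchio-style update over vector representations (alpha, beta, and k are illustrative choices, not values from the slides):

```python
import numpy as np

def pseudo_feedback(query_vec, doc_vecs, scores, k=3, alpha=1.0, beta=0.75):
    """Rocchio-style pseudo feedback: assume the top-k ranked
    documents are relevant and move the query toward their centroid."""
    top_k = np.argsort(scores)[::-1][:k]       # indices of top-k scores
    centroid = doc_vecs[top_k].mean(axis=0)    # mean of top-k doc vectors
    return alpha * query_vec + beta * centroid

q = np.array([1.0, 0.0, 0.0])        # toy query vector
docs = np.eye(3)                     # toy document vectors
print(pseudo_feedback(q, docs, np.array([0.9, 0.1, 0.0]), k=1))
```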
Pseudo Feedback Architecture
Query Document
String corpus
Revised Rankings
IR ReRanked
Query System Documents
1. Doc2
2. Doc4
Query 3. Doc5
Ranked .
Reformulation 1. Doc1 .
Documents 2. Doc2
3. Doc3
.
Pseudo 1. Doc1
.
2. Doc2
Feedbac 3. Doc3
k .
195
.
Local Analysis
• Examine only the documents automatically retrieved for the
query to determine query expansion terms
– Base correlation analysis on only the “local” set of
retrieved documents for a specific query.
• At query time, dynamically determine similar terms
based on analysis of K top-ranked retrieved
documents.
• Avoids ambiguity by determining similar (related)
terms only within relevant documents.
– “Apple computer” → “Apple computer Powerbook
laptop”
Global analysis
• Expand query using information from whole set of
documents in collection
– Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
• Thesaurus-like structure using all documents
– An approach to automatically build a thesaurus;
for example: a similarity thesaurus based on co-occurrence
frequency
• A thesaurus provides information on synonyms and
semantically related words and phrases.
• Example:
physician
similar/synonymous: doctor, medical, MD
related: general practitioner, surgeon
Query Expansion: Thesaurus-based
• For each term, t, in a query, expand the query with
synonyms (similar) and related words of t from the
thesaurus.
• Added terms may be weighted less than the original query
terms.
– Generally increases recall.
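A minimal sketch of thesaurus-based expansion, using a hypothetical toy thesaurus (built from the physician example above) and illustrative weights:

```python
# hypothetical toy thesaurus; a real one would be far larger
thesaurus = {
    "physician": {"synonyms": ["doctor", "md"],
                  "related": ["surgeon", "general practitioner"]},
}

def expand_query(terms, thesaurus, syn_weight=0.5, rel_weight=0.25):
    """Expand each query term with weighted synonyms and related
    words; original terms keep weight 1.0."""
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        entry = thesaurus.get(t, {})
        for syn in entry.get("synonyms", []):
            weighted.setdefault(syn, syn_weight)
        for rel in entry.get("related", []):
            weighted.setdefault(rel, rel_weight)
    return weighted

print(expand_query(["physician"], thesaurus))
```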
Query Expansion: Term Co-occurrence-based
• Compute association matrices which quantify term correlations
in terms of how frequently they co-occur.
– Expand queries with statistically most similar terms.
– Synonymy association: terms that frequently co-occur
inside local set of documents
Association Matrix

        w1   w2   w3  ……… wn
w1     c11  c12  c13 ……… c1n
w2     c21  c22  c23 ……… c2n
:        :
wn     cn1  cn2  cn3 ……… cnn

cij: correlation factor between term i and term j:

cij = Σ (d ∈ Dl) tf(ti, d) × tf(tj, d)

where tf(t, d) is the frequency of term t in document d and
Dl is the local set of documents.
Normalized Association Matrix
• The frequency-based correlation factor favors more
frequent terms.
• Normalize association scores:

sij = cij / (cii + cjj − cij)

• The normalized score satisfies 0 ≤ sij ≤ 1:
– 1 if the two terms have the same frequency in all documents.
– 0 if the two terms have no correlation
• Given a Query q
– Find clusters of terms tj for the |q| query terms based on
term association
• N terms that register the largest value Si,j ≥ β
– Keep clusters small
– Expand original query
Example
• For the query “information extraction”, suppose the
following documents are retrieved (containing the
words listed below):
– Doc 1: data, data, information, budget, retrieval,
information, budget, retrieval
– Doc 2: extraction, retrieval, extraction, information,
information, data
– Doc 3: data, retrieving, budget, budget, data,
information, budget, retrieval, information
– Doc 4: information, Internet
• Using term association, expand the given query with
frequently co-occurring terms with Si,j ≥ 0.80
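The exercise can be worked through in code with cij = Σd tf(ti, d)·tf(tj, d) and sij = cij / (cii + cjj − cij):

```python
from collections import Counter

# Term frequencies for the four retrieved documents above
docs = [
    "data data information budget retrieval information budget retrieval",
    "extraction retrieval extraction information information data",
    "data retrieving budget budget data information budget retrieval information",
    "information internet",
]
tfs = [Counter(d.split()) for d in docs]

def c(ti, tj):
    # co-occurrence correlation: sum of tf products over the local docs
    return sum(tf[ti] * tf[tj] for tf in tfs)

def s(ti, tj):
    # normalized association score in [0, 1]
    return c(ti, tj) / (c(ti, ti) + c(tj, tj) - c(ti, tj))

vocab = set().union(*(tf.keys() for tf in tfs))
for q in ("information", "extraction"):
    expansions = sorted(t for t in vocab if t != q and s(q, t) >= 0.8)
    print(q, "->", expansions)
```

Only “data” passes the 0.80 threshold for “information” (s = 10/12 ≈ 0.83); no term qualifies for “extraction”.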
Global vs. Local Analysis
• Global analysis requires intensive term correlation
computation only once at system development
time.
– Local analysis requires intensive term correlation
computation for every query at run time (although
number of terms and documents is less than in global
analysis).
Exercise
• List some of the features of an intelligent IR system.
• What is the purpose of relevance feedback in IR?
• Describe Rocchio's relevance feedback method
• Why is user feedback not widely used in IR?
• What is pseudo relevance feedback in IR?
• Describe the Thesaurus based query expansion
method and what are its advantages and
disadvantages
• Describe the statistical thesaurus method for query
expansion
• Describe the process of query expansion using
– global analysis
– local analysis?
• What are the similarities and differences between global
and local analysis methods?
IR Project
• Form a group with 3 members. Implement --------- for at least 10
documents written in one of the selected local languages.
1. Tokenization: identify word separators and use them to split
sentences into tokens (with no numeric and special characters)
2. Stop word detection: use stop list and DF to identify and show list of
non-stop words
3. Stemming: identify suffix and prefix used in a given language and
identify stem or root words
4. Normalization: correct any artificial difference between words,
including any difference in writing and speaking, like numbers and
special characters. Make your system write what we speak (‘1’ should
be written as ‘one’)
5. Weighting terms using TF, DF, TFIDF and rank them
6. Document clustering: Identify similar documents
7. Word clustering: Identify similar and related words
Report: Write a publishable report with sections:
• Abstract -- ½ page
• Introduce problem, objective, scope & methodology -- 2 pages
• Review related works -- 4 pages
• Description of architecture of IR system -- 3 pages
• Discussion of evaluation result, with findings --- 3 pages
• Concluding remarks, with major recommendation --- 1 page
• Reference (use IEEE referencing style).
• Contribution of each member for the success of the project
Important Dates:
• Assignment:
• Project:
• Final exam:
THANK YOU
# Indexing pipeline (the slide's two columns merged into reading order)
import sys, glob
import numpy
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords
import l3  # HornMorpho, used below for stemming

# Structures holding documents and terms
document_list = []
document_ids = {}
token_list = []
token_ids = {}

# Repeat for every document in the corpus
for filename in glob.glob('corpus/*'):
    f = open(filename, 'r', encoding='utf-8')
    tokens = f.read()
    tokens = tokens.lower()
    tokens = tokens.split()
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if len(word) > 2]
    stop_words = set(stopwords.words('oromo'))  # assumes an Oromo stop list is installed
    tokens = [w for w in tokens if w not in stop_words]
    f.close()

    # Get the document name as last part of path
    article_name = filename[filename.rfind('/')+1:]
    doc_id = len(document_list)
    # Insert ID in inverse list; map tokens to IDs (completion assumed
    # from the later use of doc['tokens'])
    document_ids[article_name] = doc_id
    tids = []
    for w in tokens:
        if w not in token_ids:
            token_ids[w] = len(token_list)
            token_list.append(w)
        tids.append(token_ids[w])
    document_list.append({'name': article_name, 'tokens': tids})

# Indexing: dump the token list
tokens = open('token_list.txt', 'w', encoding='utf-8')
tokens.write(str(token_list))
tokens.close()

# Stemming using the HornMorpho stemmer
l3.anal_file('om', 'token_list.txt', 'stemmed.txt', citation=True, nbest=1)
stemmed_tokens = open('stemmed.txt', 'r', encoding='utf-8')
stem_list = stemmed_tokens.read()

number_of_documents = len(document_list)
number_of_tokens = len(token_list)  # fixed: len(stem_list) counted characters

# Building the TF-IDF matrix
sys.stderr.write('Building the TF matrix and counting term occurrences\n')
token_count = [0] * number_of_tokens  # number of documents containing each word
TF = numpy.empty((number_of_tokens, number_of_documents), dtype=float)

# Scan the document list
for i, doc in enumerate(document_list):
    # Initialize with zeros
    n_dt = [0] * number_of_tokens
    # For all token IDs in the document
    for tid in doc['tokens']:
        # if first occurrence, increase global count for IDF
        if n_dt[tid] == 0:
            token_count[tid] += 1
        # increase local count
        n_dt[tid] += 1
    # Normalize the local count by document length, obtaining the TF vector;
    # store it as the i-th column of the TF matrix
    TF[:, i] = numpy.array(n_dt, dtype=float) / len(doc['tokens'])

IDF = numpy.log10(number_of_documents / numpy.array(token_count, dtype=float))
TFIDF = numpy.diag(IDF).dot(TF)
TFIDF = pd.DataFrame(TFIDF)
# SVD for rank reduction
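The trailing comment points at SVD-based rank reduction (as in latent semantic indexing); a minimal, self-contained sketch with NumPy, independent of the corpus code above:

```python
import numpy as np

def reduce_rank(tfidf, k):
    """Keep only the k largest singular values of the
    term-document matrix, yielding its best rank-k approximation."""
    U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).random((6, 4))   # stand-in TF-IDF matrix
A2 = reduce_rank(A, 2)
print(np.linalg.matrix_rank(A2))   # 2
```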