
Modern Information Retrieval
The meaning
• What is retrieval?
– It is the process of searching for and accessing
information according to the information need of
users
• What is information?
– Making sense of data via contextualization
– It is the step where meaning is attached to data
so that it can support decision making and
problem solving
The Need for Information Retrieval
• The information explosion is the rapid increase in
the amount of published information and the effects
of this abundance.
– As the amount of available information grows, the
problem of managing the information becomes more
difficult, which can lead to information overload.
• Large collections of documents come from various
sources: books, journal articles, conference papers,
newspapers, magazines, Web pages, etc.
• Think of the effect of global information systems and
digital libraries
Text Collections and IR
• Sample Statistics of Text Collections
– Google, www.google.com, a search engine that offers
access to over 130 trillion Web documents.
• Google handles over 3.5 billion searches per day and has a 92%
share of the global search market
– Yahoo, www.yahoo.com, indexes more than 20 billion
web documents and images
– AltaVista, www.altavista.com, covered over 250
million Web pages.
• It performed more than 40 million search queries each
day in more than 25 languages.
• Can you explore the number of web pages indexed by
other search engines, such as Excite and Ask Jeeves?
Classification of IR systems
• Data types (text, image, video & audio)
– Text-based (Lexis-Nexis, Google, FAST): search by
keywords. Limited search using queries in natural language.
– Multimedia (audio, video & image retrieval): search engines
like QBIC, WebSeek and SaFe search by visual appearance
(shapes, colors, texture…).
– Solution: multimodal IR (cross-modal, bimodal IR)
• Languages
– Support for languages with their own writing systems
– Solution: multilingual IR (cross-language, bilingual IR)
• Fields
– General-purpose vs. specialized search engines (e.g. medical
search engines)
• Performance
– Efficiency and effectiveness of IR systems
– Solution: question answering (ask.com, Answerbus)
What is Information Retrieval?
• Information retrieval is the process of searching a large
unstructured corpus for relevant documents that satisfy a
user's information need.
– It is a tool that finds and selects from a collection of
items a subset that serves the user's purpose
• Much IR research focuses more specifically on text
retrieval.
Information Retrieval serves as a Bridge
• An Information Retrieval System serves as a
bridge between the world of authors/producers
and the world of readers/users.
– That is, writers present a set of ideas in a document
using a set of concepts; users then query the IR system
for relevant documents that satisfy their information
need.

[Diagram: User → IR system (black box) → Documents]
Typical IR System Architecture

[Diagram: a query string enters the IR System, which searches the
document corpus and returns ranked relevant documents
(1. Doc1, 2. Doc2, 3. Doc3, …)]
IR System vs. Web Search System

[Diagram: the same architecture, except that a Web spider crawls the
Web to build the document corpus; the IR System returns ranked
relevant pages (1. Page1, 2. Page2, 3. Page3, …)]
The Retrieval Process

[Diagram: the user expresses an information need through the user
interface; text operations transform both the documents and the
query into their logical views; indexing builds an inverted file
over the text database; query formulation produces a query (with
user feedback) that the searching module runs against the index
file; retrieved documents are ranked and shown to the user]
Course outline
• Overview of IR: Define IR; the retrieval process; basic structure of an IR system
• Text Document Operations: Tokenization; stopword detection; stemming; normalization; term weighting; similarity measures
• Indexing Structures: The need for indexing; inverted files; suffix trees and suffix arrays; signature files
• IR Models: A formal characterization of IR models; Boolean model, vector space model & probabilistic model
• Retrieval Evaluation: Evaluation of IR systems; relevance judgement; retrieval effectiveness measures (recall, precision, F-measure, etc.)
• Types of Query Languages and Operations: Query formulation; keyword-based queries; natural language queries; query reformulation; relevance feedback; query expansion; reweighting query terms
• Current Research Issues in IR: IR in local languages; information extraction; information filtering; text summarization; cross-language retrieval…
Presentation Assignment
Review the literature on the given topic and submit your report via Google
Classroom. You will also present a 5-7 minute summary of your report to the
class. Your report should provide: (i) an overview (including a definition) of
the concept; (ii) pros & cons of the concept, with application areas; (iii) the
architecture (including procedures) followed during implementation; (iv)
concluding remarks; (v) references
• Web IR (or Web search engines)
• Query Expansion
• Multilingual IR
• Multimodal IR
• Multimedia IR
• Question Answering systems
• Information Extraction
• Information Filtering
• Recommender systems
• Document summarization
• Probabilistic IR
• Cross-language IR
• Document Classification
• Text Mining
• Intelligent IR
• Document provenance
• Document image retrieval
• Semantic compression
• Compound term processing
• Document clustering
• Document categorization
Assignment
Review the literature on the given topic & submit your report via Google Classroom.
Your report should provide: (i) an overview (including a definition) of the concept;
(ii) pros & cons of the concept, with application areas; (iii) the architecture
(including procedures) followed during implementation; (iv) concluding remarks;
(v) references
Title Name
1 Multimodal IR Fanuel Zegeye Tesfaye
2 Information Extraction Bniam Worku Yilma
3 Web search engine Abdi Shanbel Kiso
4 Query Expansion Kirubel Kebede Demissie
5 Question Answering system Samuel Nigussie Haileselassie
6 Information Filtering Mulugeta Nirea Kahissay
7 Content based Video retrieval Omer Tegegn Mersha
8 Probabilistic IR Filagot Meshesha Ayano
9 Multilingual IR Dereje Wegayehu Bayesa
10 Document summarization Selahadin Nurga Babeta
11 Semantic based IR Anteneh Berhanu Leykun
12 Recommender system Mesay Girma Megistu
13 Content based Audio retrieval Lemi Rattu Jeldu
14 Document Classification Etsegenet Gebregiorgies Asrat
15 Cross language IR Mihretu Genetie Mekonnen
16 Text Mining Bikila Boja Dessalegn
17 Content based Image retrieval Betelihem Demissie
18 Document clustering Samson Debela Ayana
19 Intelligent IR Mikiyas Teressa Merga
20 Document categorization Barnabas Dereje Tilahun
Text Operations
• Index term selection

Are all words in a document important?
• Not all words in a document are equally significant for
representing the contents/meaning of a document
– Some words carry more meaning than others
– Nouns are the most representative of a document's
content
• Therefore, we need to preprocess the text of the documents
in a collection to select the terms to be used as index terms
• Using the set of all words in a collection to index
documents creates too much noise for the retrieval
task
– Reducing noise means reducing the number of words that
can be used to refer to a document
• Text operations is the process of transforming text
documents into their logical representations that can
be used as index terms
Main Text Operations
Main operations for selecting index/query terms, i.e.
for choosing the words/stems (or groups of words) to be
used in indexing/searching:
• Tokenization of the text: generate a set of words from the
text collection
• Elimination of stop words: filter out words which are not
important in the retrieval process
• Normalization: resolve artificial differences among
words
• Stemming of words: remove affixes (prefixes and suffixes)
and group together word variants with similar meaning
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title
– Output – a document representative adequate for use in
an automatic retrieval system
• The document representative consists of a list of class
names, each name representing a class of words
occurring in the total input text. A document will be
indexed by a name if one of its significant words occurs as a
member of that class.
[Pipeline: Document corpus → Tokenization → Stop word removal →
Normalization → Stemming → Content-bearing index terms]
Tokenization of Text
• Tokenization (also called lexical analysis) is the process
of converting text documents into a sequence of words,
w1, w2, … wn.
– It is the process of demarcating and possibly classifying sections
of a string of input characters into words.
– How do we identify the set of words that exist in a text document?
Consider: The quick brown fox jumps over the lazy dog
• Objective: identify words in the text
– Tokenization greatly depends on how the concept of a word is
defined
• Is a word any sequence of characters, numbers, or alpha-numeric ones?
A word is a sequence of letters terminated by a separator (period,
comma, space, etc.).
• The definition of letter and separator is flexible; e.g., a hyphen
could be defined as a letter or as a separator.
• Usually, common words (such as "a", "the", "of", …) are ignored.
• Tokenization Issues
– numbers, hyphens, punctuation marks, apostrophes …
Issues in Tokenization
• One word or multiple: how to handle special cases involving hyphens,
apostrophes, punctuation marks etc.? C++, C#, URLs, e-mail, …
– Sometimes punctuation (e-mail), numbers (1999), & case (Republican
vs. republican) can be a meaningful part of a token.
– However, frequently they are not.
• Two words may be connected by hyphens.
– Should two words connected by hyphens be taken as one word or two
words? Break up the hyphenated sequence into two tokens?
• In most cases, break up the hyphenated words (e.g. state-of-the-art →
state of the art), but some words, e.g. MS-DOS, B-49, are unique
words which require hyphens
• Two words may be connected by punctuation marks.
– Punctuation marks: remove totally unless significant, e.g. program
code: x.exe vs. xexe. What about Kebede's, www.command.com?
• Two words (a phrase) may be separated by a space.
– E.g. Addis Ababa, San Francisco, Los Angeles
• Two words may be written in different ways
– lowercase, lower-case, lower case? data base, database, data-base?
Issues in Tokenization
• Numbers: are numbers/digits words to be used as index
terms?
– For instance: dates (3/12/91 vs. Mar. 12, 1991); phone numbers
(+251923415005); IP addresses (100.2.86.144)
– Numbers (like 1910, 1999) are not good index terms; but 510 B.C.
is unique. Generally, don't index numbers as text, though they can
be very useful.
• What about the case of letters (e.g. Data or data or DATA)?
– Case is usually not important, and there is a need to convert all
letters to upper or lower case. Which one is mostly used by
human beings?
• The simplest approach is to ignore all numbers and
punctuation marks (period, colon, comma, brackets,
semi-colon, apostrophe, …) & use only case-insensitive
unbroken strings of alphabetic characters as words.
– "Meta-data", including creation date, format, etc., is often
indexed separately
• Issues of tokenization are language specific
– They require the language to be known
Tokenization
• Analyze text into a sequence of discrete tokens
(words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters
that are grouped together as a useful semantic unit for
processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index
entry, after further processing
– But what are valid tokens to emit?
Exercise: Tokenization
• The cat slept peacefully in the living room.
It's a very old cat.

• The instructor (Dr. O'Neill) thinks that the
boys' stories about Chile's capital aren't
amusing.
Elimination of Stopwords
• Stopwords are extremely common words across
document collections that have no discriminatory power
– They may occur in 80% of the documents in a collection.
– They appear to be of little value in helping select
documents matching a user need and need to be filtered
out as potential index terms
• Examples of stopwords are articles, prepositions,
conjunctions, etc.:
– articles (a, an, the); pronouns (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
conjunctions/connectors (and, but, for, nor, or, so, yet),
verbs (is, are, was, were), adverbs (here, there, out,
because, soon, after) and adjectives (all, any, each, every,
few, many, some) can also be treated as stopwords
Why Stopword Removal?
• Intuition:
– Stopwords have little semantic content
• It is typical to remove such high-frequency words
– Stopwords take up about 50% of the text; hence, removing
them reduces the document size by about 30-50%
• Stopwords are language dependent.
Why Stopword Removal?
• Advantages
– Smaller indices for information retrieval
• Good compression techniques for indices: the 30 most
common words account for 30% of the tokens in written
text
– With the removal of stopwords, we can better measure the
importance of terms for text classification,
text categorization, text summarization, etc.

• Disadvantages
– Elimination of stopwords might reduce recall
• e.g. "To be or not to be" – all terms eliminated except "be" –
leads to no retrieval or irrelevant retrieval
How to detect a stopword?
• Method One: Sort terms (in decreasing order) by
document frequency (DF) and take the most
frequent ones based on a cutoff point
– In a collection about insurance practices,
"insurance" would be a stop word

• Method Two: Build a stop word list that contains
a set of articles, pronouns, etc.
– Why do we need stop lists? With a stop list, we
can compare and exclude the commonest words from
the index terms entirely.
– Can you identify common words in Amharic and
build a stop list?
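Method One above can be sketched in a few lines of Python (a minimal illustration; the three-document collection and the cutoff value are made up for the example):

```python
from collections import Counter

def stopwords_by_df(docs, cutoff):
    """Sort terms by document frequency (DF) and return the
    `cutoff` most frequent ones as candidate stopwords."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # set(): count each term once per document
    return [term for term, _ in df.most_common(cutoff)]

docs = ["Insurance claims are filed yearly",
        "Insurance premiums keep rising",
        "The insurance market is growing"]
print(stopwords_by_df(docs, 1))   # in this collection, "insurance" tops the DF list
```

In a collection about insurance practices, "insurance" has the highest DF and falls above any reasonable cutoff, exactly as the slide suggests.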
Stop words
• Stop word elimination used to be standard in older
IR systems. But the trend is away from doing this
nowadays.
• Most web search engines index stop words:
– Good query optimization techniques mean you pay little at
query time for including stop words.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
Normalization
• Normalization is canonicalizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
– We need to "normalize" terms in the indexed text as well as
query terms into the same form
– Example: we want to match the words Ethiopia and
ethiopia; IT and Information Technology with
information technology
• Case folding: often best to lowercase everything,
since users will use lowercase regardless of 'correct'
capitalization…
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. antidiscriminatory
– U.S.A vs. USA vs. United States of America
– 1239 vs. twelve thirty nine vs. one thousand two hundred
thirty nine
– Car vs. Automobile?
Normalization issues
• Good for:
– Allowing instances of Automobile at the beginning of
a sentence to match a query for automobile
– Helping a search engine when most users type
ferrari while they are interested in a Ferrari car
• Not advisable for:
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, Kebede…
• Solution:
– lowercase only words at the beginning of a sentence
• In IR, lowercasing everything is most practical because of
the way users issue their queries
• Normalization is language dependent
Python code for tokenization, normalization
and stop word removal
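The slide above refers to code that is not reproduced here; a minimal stdlib-only sketch of the three operations might look like the following (the tiny stopword list is an assumption for illustration — real lists contain a few hundred words):

```python
import re

# Illustrative stopword list (an assumption; real lists are much larger)
STOPWORDS = {"a", "an", "the", "of", "in", "on", "over", "and", "or",
             "is", "are", "it", "to"}

def tokenize(text):
    """Lexical analysis: extract unbroken alphabetic strings,
    lowercasing them as a simple form of normalization (case folding)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Filter out tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
print(remove_stopwords(tokens))   # -> ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Note how the regular expression implements the "case-insensitive unbroken strings of alphabetic characters" policy from the earlier slide: punctuation and numbers are silently treated as separators.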
Stemming/Morphological analysis
• Stemming reduces tokens to their root form to
recognize morphological variation.
– The process involves removal of affixes (i.e. prefixes & suffixes)
with the aim of reducing variants to the same stem
– Often removes inflectional & derivational morphology of a word
• Inflectional morphology: varies the form of words in order to
express grammatical features, such as singular/plural or
past/present tense. E.g. boy → boys, cut → cutting.
• Derivational morphology: makes new words from old ones. E.g.
creation is formed from create, but they are two separate words.
Likewise, destruction → destroy.
• Stemming is language dependent
– Correct stemming is language specific and can be complex.
E.g. the sentence "compressed and compression are both accepted"
stems to "compress and compress are both accept".
Stemming
• The final output from a conflation algorithm is a set of
classes, one for each stem detected.
– A stem: the portion of a word which is left after the
removal of its affixes (i.e., prefixes and/or suffixes).
– Example: 'connect' is the stem for {connected,
connecting, connection, connections}
– Thus, [automate, automatic, automation] all reduce to
automat
• A stem is used as an index term/keyword for document
representation
• Queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
–The first approach is to create a big dictionary that maps
words to their stems.
• The advantage of this approach is that it works perfectly (if
the stem of a word is defined perfectly); the disadvantages
are the space required by the dictionary and cost of
maintaining the dictionary as new words appear.
–The second approach is to use a set of rules that extract
stems from words.
• Techniques widely used include: rule-based, statistical,
machine learning or hybrid
• The advantages of this approach are that the code is
typically small, & it can gracefully handle new words; the
disadvantage is that it occasionally makes mistakes.
– But, since stemming is imperfectly defined, anyway, occasional
mistakes are tolerable, & the rule-based approach is the one
that is generally chosen.
Porter Stemmer
• Stemming is the operation of stripping the suffixes
from a word, leaving its stem.
– Google, for instance, uses stemming to search for web
pages containing the words connected, connecting,
connection and connections when users ask for a web
page that contains the word connect.

• In 1980, Martin Porter published a stemming
algorithm that uses a set of rules to extract stems
from words, and though it makes some mistakes,
most common words seem to work out right.
– Porter describes his algorithm & provides a reference
implementation at:
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• The most common algorithm for stemming English
words to their common grammatical root

• It is a simple procedure for removing known
affixes in English without using a dictionary.

• To get rid of plurals, the following rules are
used:
– SSES → SS    caresses → caress
– IES → I      ponies → poni
– SS → SS      caress → caress
– S → (null)   cats → cat
Porter stemmer
• The Porter stemmer works in steps.
– While the first step gets rid of plurals -s and -es, the
second step removes -ed or -ing.
– e.g.
agreed → agree       disabled → disable
matting → mat        mating → mate
meeting → meet       milling → mill
messing → mess       meetings → meet
feed → feed

– EMENT → (null) (delete final -ement if what remains is
longer than 1 character)
replacement → replac
cement → cement
Stemming: challenges
• May produce unusual stems that are not
English words:
– Removing 'UAL' from FACTUAL and EQUAL

• May conflate/reduce to the same token/stem
words that are actually distinct.
– "computer", "computational", "computation" all
reduced to the same token "comput"

• May not recognize all morphological derivations.

Stemming using Hornmorph
Stemming using Hornmorph
Index Term Selection
• An index language is the language used to describe
documents and requests
• The elements of the index language are index terms,
which may be derived from the text of the document
to be described, or may be arrived at independently.
– If a full-text representation of the text is adopted, then
all words in the text are used as index terms (full-text
indexing)
– Otherwise, we need to select the words to be used as
index terms in order to reduce the size of the index file,
which is essential for designing an efficient IR system
Exercise
• List the main operations that are performed on text in an IR
system
• Describe what stemming and lemmatization are. Discuss their
similarities and differences.
• What is case folding? Which text operation method is most
suitable for undertaking case folding?
• What are stop words?
• List some of the possible errors generated by Porter's
stemmer.
• What is the effect of stop word removal, normalization and
stemming on the recall and precision of an IR system?
• What is a thesaurus and how can it be used in IR?
• What are lexical and phrasal concepts?
• List two methods of compression and mention their
applications in IR
• What is metadata?
• What is the semantic web and how is it related to the problem
that IR tries to solve?
Designing an IR System
Our focus during IR system design:
• Improving the Effectiveness of the system
– The concern here is retrieving more relevant documents
for a user's query
– Effectiveness of the system is measured in terms of
precision, recall, …
– Main emphasis: stemming, stopword removal, weighting
schemes, matching algorithms

• Improving the Efficiency of the system
– The concern here is reducing storage space requirements,
enhancing searching time, indexing time, access time…
– Main emphasis: compression, indexing structures,
space-time tradeoffs
Subsystems of IR system
The two subsystems of an IR system: indexing and
searching
– Indexing:
• is an offline process of organizing documents using
keywords extracted from the collection
• Indexing is used to speed up access to desired information
from the document collection as per the user's query

– Searching
• is an online process that scans the document corpus to find
relevant documents that match the user's query
Indexing Subsystem

[Pipeline: documents → assign document identifiers →
tokenization (document IDs, tokens) → stopword removal
(non-stoplist tokens) → stemming & normalization (stemmed
terms) → term weighting (weighted index terms) → index file]
Searching Subsystem

[Pipeline: query → parse query (query tokens) → stopword
removal (non-stoplist tokens) → stemming & normalization
(stemmed terms) → query term weighting (query terms) →
similarity measure against the index terms in the index file →
ranking (relevant document set) → ranked document set]
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search that which was not first indexed
in some manner or other
– indexing of documents or objects is done in
order to make them searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing
language

Knowing searching is knowing indexing
Indexing structure

• How to select an indexing structure:
– Sequential file
– Inverted file
– Signature file
Indexing: Basic Concepts
• Indexing is used to speed up access to desired
information from the document collection as per the user's
query, such that
– it enhances efficiency in terms of retrieval time. Relevant
documents are searched and retrieved quickly
• An index file consists of records, called index entries.
– The usual unit for indexing is the word
• Index terms are used to look up records in a file.
• Index files are much smaller than the original file. Do
you agree?
– Remember Heaps' law: in a 1 GB text collection, the size of the
vocabulary is only about 5 MB (Baeza-Yates and Ribeiro-Neto,
2005)
– This size can be further reduced by linguistic pre-processing
(like stemming & other normalization methods).
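Heaps' law, mentioned above, predicts vocabulary growth as V = k·nᵝ for a collection of n tokens. A small sketch (the parameter values k = 44 and β = 0.49 are typical figures for English, an assumption not taken from these slides):

```python
def heaps_vocabulary(n_tokens, k=44, beta=0.49):
    """Heaps' law: estimated number of distinct terms V = k * n**beta.
    k and beta are corpus-dependent; these defaults are typical for English."""
    return int(k * n_tokens ** beta)

# Vocabulary grows sublinearly: 100x more text yields far less
# than 100x more distinct terms, which is why index vocabularies
# stay small enough to hold in memory.
print(heaps_vocabulary(1_000_000))
print(heaps_vocabulary(100_000_000))
```

This sublinear growth is exactly why the vocabulary of even a 1 GB collection fits comfortably in main memory.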
Major Steps in Index Construction
• Source file: collection of text documents
– A document can be described by a set of representative keywords called
index terms.
• Index term selection:
– Tokenize: identify words in a document, so that each
document is represented by a list of keywords or attributes
– Stop words: removal of high-frequency words
• A stop list of words is used for comparison against the input text
– Stemming and normalization: reduce words with similar
meaning to their stem/root word
• Suffix stripping is the common method
– Term weighting: different index terms have varying
importance when used to describe document contents.
• This effect is captured through the assignment of numerical weights
to each index term of a document.
• There are different index term weighting statistics (TF, DF, CF) based
on which the TF*IDF weight can be calculated during searching
• Output: a set of index terms (vocabulary) used for
indexing the documents in which each term occurs.
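The term-weighting step above can be illustrated with a minimal TF*IDF computation (the documents here are already-tokenized toy lists, and the base-10 logarithm is one common convention — implementations vary):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF*IDF weights: TF from each document, IDF = log10(N/DF)."""
    n_docs = len(docs)
    term_freqs = [Counter(doc) for doc in docs]   # TF per document
    doc_freq = Counter()                          # DF across the collection
    for tf in term_freqs:
        doc_freq.update(tf.keys())
    return [{t: tf[t] * math.log10(n_docs / doc_freq[t]) for t in tf}
            for tf in term_freqs]

docs = [["caesar", "brutus", "caesar"], ["caesar", "noble"]]
weights = tf_idf(docs)
print(weights[0])   # "caesar" gets weight 0: it occurs in every document
```

A term that appears in every document (like "caesar" here) receives IDF = log(1) = 0 and hence zero weight, which is the formal version of the stopword intuition from the earlier slides.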
Basic Indexing Process

[Pipeline:
Documents to be indexed:  Friends, Romans, countrymen.
→ Tokenizer → token stream:  Friends  Romans  countrymen
→ Linguistic preprocessing → modified tokens:  friend  roman  countryman
→ Indexer → index file (inverted file):
    friend → 2, 4
    roman → 1, 2
    countryman → 13, 16]
Building the Index file
• An index file of a document collection is a file consisting of a list
of index terms & a link to one or more documents that contain each
index term
– A good index file maps each keyword Ki to the set of documents Di that
contain the keyword

• An index file usually keeps the index terms in sorted order.
– The sort order of the terms in the index file provides an order on the
physical file
• An index file is a list of search terms that are organized for
associative look-up, i.e., to answer the user's queries:
– In which documents does a specified search term appear?
– Where within each document does each term appear? (There may be
several occurrences.)
• For organizing the index file for a collection of documents, there
are various options available:
– Decide what data structure and/or file structure to use: a sequential file,
inverted file, suffix array, signature file, etc.
Index file Evaluation Metrics
• Running time
– Indexing time
– Access/search time: does it allow sequential or random
searching/access?
– Update time (insertion time, deletion time, modification
time…): can the indexing structure support re-indexing or
incremental indexing?

• Space overhead
– Computer storage space consumed.
• Access types supported efficiently.
– Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File
• The sequential file is the most primitive file structure.
It has no vocabulary or linking pointers.
• The records are generally arranged serially, one after
another, in lexicographic order on the value of some
key field.
– A particular attribute is chosen as the primary key, whose value
determines the order of the records.
– When the first key fails to discriminate among records, a
second key is chosen to give an order.
Example:
• Given a collection of documents, they are parsed
to extract words and these are saved with the
Document ID.

Doc 1: I did enact Julius
Caesar I was killed
I the Capitol;
Brutus killed me.

Doc 2: So let it be with
Caesar. The noble
Brutus has told you
Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are
removed, and normalization and stemming are applied, to
generate index terms (a raw list of term/Doc # pairs)
• These index terms in the sequential file are then sorted in
alphabetical order

Sequential file:
No.  Term      Doc No.
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
Complexity Analysis
• Creating a sequential file requires O(n log n)
time, where n is the total number of content-bearing
words identified in the corpus.
• Since the terms in the sequential file are sorted, the
search time is logarithmic using binary search.
• Updating the index file requires re-indexing;
that means incremental indexing is not
possible
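The logarithmic search over the sorted records can be sketched with the stdlib `bisect` module (the records mirror the sequential-file example above):

```python
import bisect

# Sorted sequential file of (term, doc_id) records, as in the example above
records = [("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
           ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
           ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2)]
terms = [term for term, _ in records]   # parallel key list for binary search

def lookup(term):
    """Binary search: all doc IDs whose record matches `term`, in O(log n)."""
    lo = bisect.bisect_left(terms, term)
    hi = bisect.bisect_right(terms, term)
    return [doc for _, doc in records[lo:hi]]

print(lookup("caesar"))   # -> [1, 2, 2]
```

Note the update problem the slide describes: inserting a new term means shifting list elements to keep `records` sorted, which is exactly why incremental indexing is hard with this structure.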
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using the lexicographic
order;
– instead of linear-time search, one can search in logarithmic
time using binary search
• Its disadvantages:
– difficult to update: the index must be rebuilt if a new term is
added. Inserting a new record may require moving a large
proportion of the file;
– random access is extremely slow.
• The update problem can be solved:
– by ordering records by date of acquisition, rather than by key
value; hence, the newest entries are added at the end of the
file & therefore pose no difficulty to updating. But searching
becomes very tough; it requires linear time
Inverted file
• A technique that indexes based on a sorted list of terms, with each
term having links to the documents containing it
– Building and maintaining an inverted index is a relatively low-cost,
low-risk task. For a text of n words, an inverted index can be built in
O(n) time
• Content of the inverted file: data to be held in the inverted file
includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in the
document collection)
• The occurrences contain one record per term, listing
– Frequency of each term in a document
• TFij, number of occurrences of term tj in document di
• DFj, number of documents containing tj
• CFj, total number of occurrences of tj in the collection
– Locations/positions of words in the text
Inverted file
•Why vocabulary?
–Having information about vocabulary (list of terms) speeds
searching for relevant documents
•Why location?
–Having information about the location of each term
within the document helps for:
• user interface design: highlight location of search term
• proximity based ranking: adjacency and near operators (in
Boolean searching)
•Why frequencies?
• Having information about frequency is used for:
–calculating term weighting (like IDF, TF*IDF, …)
–optimizing query processing
Inverted File
Documents are organized by the terms/words they contain.
This is called an index file. Text operations are performed
before building the index.

Term   CF  Doc ID  TF  Location
auto   3   2       1   66
           19      1   213
           29      1   45
bus    4   3       1   94
           19      2   7, 212
           22      1   56
taxi   1   5       1   43
train  3   11      2   3, 70
           34      1   40
Organization of the Index File
• An inverted index consists of two files:
• a vocabulary file
• a postings file

Vocabulary (word list): each entry holds the term, the number of
documents containing it, its total frequency, and a pointer to its
inverted list in the postings file, which in turn points to the
actual documents. E.g.:

Term   No of Doc  Tot freq  Pointer to posting
Act    3          3         →
Bus    3          4         →
pen    1          1         →
total  2          3         →
Inverted File
• Vocabulary file
– A vocabulary file (word list):
• stores all of the distinct terms (keywords) that appear in any of
the documents (in lexicographical order) and,
• for each word, a pointer into the posting file
– The record kept for each term j in the word list contains the
following: term j, DFj, CFj and a pointer into the posting file
• Postings file (inverted list)
– For each distinct term in the vocabulary, stores a list of pointers to
the documents that contain that term.
– Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document
– It is stored as a separate inverted list for each term, i.e., a list
corresponding to each term in the index file.
• Each list consists of one or many individual postings holding the
document ID, TF and location information for the given term
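A minimal sketch of building the two files just described (the doc IDs and tokens are toy data; token positions stand in for the location information):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build vocabulary {term: (DF, CF)} and postings
    {term: [(doc_id, TF, positions)]} from {doc_id: token list}."""
    postings = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)       # record term locations
        for term, pos_list in sorted(positions.items()):
            postings[term].append((doc_id, len(pos_list), pos_list))
    vocabulary = {term: (len(plist),                       # DF
                         sum(tf for _, tf, _ in plist))    # CF
                  for term, plist in postings.items()}
    return vocabulary, dict(postings)

docs = {1: ["caesar", "brutus", "caesar"], 2: ["caesar", "noble"]}
vocab, post = build_inverted_index(docs)
print(vocab["caesar"])   # -> (2, 3): in 2 documents, 3 occurrences overall
print(post["brutus"])    # -> [(1, 1, [1])]
```

In a real system the vocabulary dictionary stays in memory while each postings list lives on disk behind its pointer, which is exactly the split motivated on the next slides.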
Construction of the Inverted file
Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in the
posting file allows:
– the vocabulary to be kept in memory at search time, even
for a large text collection, and
– the posting file to be kept on disk for accessing the
documents

• Exercise:
– For a terabyte text collection, if 1 page is 100 KB and
each page contains 250 words on average, calculate
the memory space requirement of the vocabulary.
Assume 1 word contains 10 characters.
Inverted index storage
• Separation of the inverted file into a vocabulary and a posting
file is a good idea.
– Vocabulary: for searching purposes we need only the word list.
This allows the vocabulary to be kept in memory at search
time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: for 1,000,000,000 documents, there may be 1,000,000
distinct words. Hence, the size of the index is 100 MB, which can easily
be held in the memory of a dedicated computer.
– The posting file requires much more space.
• For each word appearing in the text we keep statistical
information related to its occurrences in documents.
• The postings pointers to the documents require an extra space
of O(n).
• How do we speed up access to the inverted file?
Example:
• Given a collection of documents, they are parsed
to extract words and these are saved with the
Document ID.

Doc 1: I did enact Julius Caesar; I was killed
i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, the
inverted file is sorted by terms.

Before sorting (Term, Doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1,
killed 1, I 1, the 1, capitol 1, brutus 1, killed 1,
me 1, so 2, let 2, it 2, be 2, with 2, caesar 2,
the 2, noble 2, brutus 2, hath 2, told 2, you 2,
caesar 2, was 2, ambitious 2

After sorting (Term, Doc #):
ambitious 2, be 2, brutus 1, brutus 2, caesar 1,
caesar 2, caesar 2, capitol 1, did 1, enact 1,
hath 2, I 1, I 1, I 1, it 2, julius 1, killed 1,
killed 1, let 2, me 1, noble 2, so 2, the 1,
the 2, told 2, was 1, was 2, with 2, you 2
Remove stopwords, apply stemming & compute term frequency
• Multiple entries of a term in a single document are
merged and frequency information is added
• Counting the number of occurrences of terms in the
collection helps to compute TF

Term      Doc #  TF
ambition    2    1
brutus      1    1
brutus      2    1
caesar      1    1
caesar      2    2
capitol     1    1
enact       1    1
julius      1    1
kill        1    2
noble       2    1
Vocabulary and postings file
The file is commonly split into a vocabulary (dictionary) and a postings file.

Vocabulary (Term, DF, CF)      Postings (Doc #, TF)
ambition   1  1          →     (2, 1)
brutus     2  2          →     (1, 1), (2, 1)
caesar     2  3          →     (1, 1), (2, 2)
capitol    1  1          →     (1, 1)
enact      1  1          →     (1, 1)
julius     1  1          →     (1, 1)
kill       1  2          →     (1, 2)
noble      1  1          →     (2, 1)

Each vocabulary entry keeps a pointer to the term's inverted list in the postings file.
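The vocabulary/postings split above can be sketched in a few lines of Python; the token lists are the stemmed, stopword-free terms of the two example documents, and the in-memory dictionaries stand in for the on-disk vocabulary and postings files.

```python
# A sketch of the vocabulary/postings split; the token lists are the
# stemmed, stopword-free terms of the two example documents.
from collections import defaultdict

docs = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

postings = defaultdict(dict)          # postings file: term -> {doc: TF}
for doc_id, terms in docs.items():
    for t in terms:
        postings[t][doc_id] = postings[t].get(doc_id, 0) + 1

# Vocabulary file: DF and CF per term; the dict key acts as the
# pointer into the postings file.
vocabulary = {
    t: {"DF": len(plist), "CF": sum(plist.values())}
    for t, plist in sorted(postings.items())
}
print(vocabulary["caesar"], dict(postings["caesar"]))
# {'DF': 2, 'CF': 3} {1: 1, 2: 2}
```

Keeping the small `vocabulary` dictionary in memory while the larger `postings` lives on disk is exactly the division of labor the slides describe.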
Complexity Analysis
• The inverted index can be built in O(n) + O(n
log n) time.
– n is the number of index terms
• Since the terms in the vocabulary file are sorted,
searching for a term takes logarithmic time.
• To update the inverted index, incremental indexing
can be applied, which requires O(k) time, where k is
the number of new index terms
Exercise
• Construct the inverted index for the following
document collections.
Doc 1  :  New home to home sales forecasts
Doc 2  :  Rise in home sales in July
Doc 3  :  Home sales rise in July for new homes
Doc 4  :  July new home sales rise
Suffix Trie and Tree
Suffix trie
• What is a suffix? A suffix is a substring that extends to the end of
the given string.
– Each position in the text is considered as a text suffix
– If txt = t1t2...ti...tn is a string, then Ti = ti ti+1...tn is the suffix of txt
that starts at position i
• Example:
txt = mississippi:
T1 = mississippi; T2 = ississippi; T3 = ssissippi; T4 = sissippi;
T5 = issippi; T6 = ssippi; T7 = sippi; T8 = ippi; T9 = ppi;
T10 = pi; T11 = i
txt = GOOGOL:
T1 = GOOGOL; T2 = OOGOL; T3 = OGOL; T4 = GOL; T5 = OL; T6 = L
Suffix trie
• A suffix trie is an ordinary trie in which the input
strings are all possible suffixes.
–Principle: the idea behind the suffix trie is to assign to each
symbol in a text an index corresponding to its position in the
text (i.e. the first symbol has index 1, the last symbol has
index n, the number of symbols in the text).
To build the suffix trie we use these indices instead of the
actual objects.
• The structure has several advantages:
–We do not have to store the same object twice (no
duplicates).
–Whatever the size of the text, the search time is linear
in the length of the query string S.
Suffix Trie
Construct a suffix trie for the following string: GOOGOL
We begin by giving a position to every suffix in the text, starting
from left to right as the characters occur in the string.
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
Build a suffix trie for all n suffixes of the text.
Note: the resulting tree has n leaves and height n.

This structure is particularly useful for any application
requiring prefix-based ("starts with") pattern matching.
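A minimal sketch of such a suffix trie in Python, assuming a nested-dictionary representation (an illustrative choice, not the only one): each suffix of GOOGOL$ is inserted character by character, and every node records the 1-based start positions of the suffixes passing through it, so prefix-based ("starts with") queries fall out directly.

```python
# Sketch: a suffix trie as nested dictionaries; every node stores the
# 1-based start positions of the suffixes that pass through it.
def build_suffix_trie(text):
    text += "$"                              # end-of-string marker
    root = {}
    for i in range(len(text)):               # insert suffix text[i:]
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
            node.setdefault("pos", []).append(i + 1)
    return root

def starts_with(trie, pattern):
    """Start positions of suffixes beginning with pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:                   # hit a NIL pointer
            return []
        node = node[ch]
    return node["pos"]

trie = build_suffix_trie("GOOGOL")
print(starts_with(trie, "GO"))   # [1, 4]  (GOOGOL$ and GOL$)
print(starts_with(trie, "OR"))   # []      ("OR" does not occur)
```

The two queries reproduce the search example given later: "GO" matches the suffixes GOOGOL$ and GOL$, while "OR" hits a NIL pointer.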
Suffix tree
• A suffix tree is a member of the trie family: it is a
trie of all the proper suffixes of S.
–The suffix tree is created by compacting unary nodes of
the suffix trie.
• We store pointers rather than words in the leaves.
–It is also possible to replace the string on every edge by a
pair (a,b), where a & b are the beginning and end positions
of the string in the text, e.g. for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
Let s = abab. A suffix tree of s is a compressed
trie of all suffixes of s = abab$:
1 abab$
2 bab$
3 ab$
4 b$
5 $
We label each leaf with the starting position of the
corresponding suffix.
(Figure: the compressed trie, with edges labeled ab, b, $
and leaves labeled 1–5.)
Complexity Analysis
• The suffix tree for a string of length n can be built in
O(n²) time.
• The search time is proportional to the length of the
query string S, i.e. O(|S|).
• Searching for a substring[1..m] in a string[1..n]
can therefore be solved in O(m) time.
• Updating the index file can be done
incrementally without affecting the existing
index
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a
compressed trie of all suffixes of each s ∈ S.
To make suffixes prefix-free we add a special char, $, at the end of
s. To associate each suffix with a unique string in S, add a different
special symbol to each s.
Build a suffix tree for the string s1$s2#, where '$' and '#'
are special terminators for s1 and s2.
Ex.: Let s1 = abab & s2 = aab. A generalized suffix tree for s1 & s2
is built from the suffixes:
s1 = abab$: 1. abab$  2. bab$  3. ab$  4. b$  5. $
s2 = aab#:  1. aab#   2. ab#   3. b#   4. #
(Figure: the generalized suffix tree, with each leaf labeled by the
starting position of the corresponding suffix.)
Search in suffix tree
• Searching for all instances of a pattern S in a suffix
tree is easy, since every occurrence of S in the text is
the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
–Start at the root
–Go down the tree, each time taking the edge matching the
next characters of S
–If S corresponds to a node x, then return all leaves in the
sub-tree rooted at x
• The places where S can be found are given by the pointers
in all the leaves in the sub-tree rooted at x.
–If we hit a NIL pointer before reaching the end of S, then
S is not in the tree
Example:
• If S = "GO" we take the GO path and return:
GOOGOL$, GOL$.
• If S = "OR" we take the O path and then we hit a NIL
pointer so "OR" is not in the tree.
Drawbacks
• Suffix trees consume a lot of space
– Even if only word beginnings are indexed, a space
overhead of 120% - 240% over the text size is
produced, because, depending on the
implementation, each node of the suffix tree
takes space (in bytes) proportional to the
number of symbols used.
– How much space is required at each node for
English word indexing based on the alphabet a to z?
• How many bytes are required to store
MISSISSIPPI?
Signature file
• Word-oriented index structures based on hashing
• How to build signature file
– Hash each term to allocate fixed sized F-bits vector
(word signature)
– Divide the text in blocks of N words each
– Assign F-bits masks for each text block of size N
(document signature)
• This is obtained by bitwise ORing the signatures of all the
words in the text block.
• Efficient to search for phrases
• Hence the signature file is no more than the sequence
of bit masks of all blocks (plus a pointer to each block).
Structure of Signature File
(Figure: for each of the N text blocks, the signature file stores an
F-bit document signature together with a pointer to the
corresponding block in the text file.)
Example
• Given the text: "A text has many words. Words are made from
letters", divided into three blocks:
[A text has many] [words. Words are] [made from letters]
Signature (hash) function:
h(text)   = 1000101
h(many)   = 0110101
h(word)   = 0111100
h(made)   = 0010111
h(letter) = 1001011
Block signatures (bitwise OR of the word signatures in the block):
Block 1: 1000101 OR 0110101 = 1110101
Block 2: 0111100
Block 3: 0010111 OR 1001011 = 1011111
Text signature: 1110101 0111100 1011111
Searching
• During query processing:
–Hash the query to a F-bit mask Q
–Compare query signature with document signature of each
block, that is
• Bit-wise ANDing all the bits set in the query with bit masks Bi of
all the text block
–If all corresponding 1-bits are "on" in the document signature,
the document probably contains that term, that is
• If Q & Bi = Q, all the bits set in Q are also set in Bi and therefore
the text block may contain the word

• The main idea of signature file is that if a word is


present in a text block, then all the bits set in its
signature are also set in the bit mask of the text block
–Hence if a bit is set in the mask of the query word and not in
the mask of the text block, then the word is not present in the
text block
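The build-and-search procedure can be sketched in Python, reusing the word signatures from the earlier example as given data (the `WORD_SIG` table stands in for a real hash function):

```python
# Sketch of a signature file: word signatures are OR-ed into F-bit
# block masks; a query matches a block if all its bits are set there.
# The 7-bit signatures below are the ones given in the example.
WORD_SIG = {
    "text":   0b1000101,
    "many":   0b0110101,
    "word":   0b0111100,
    "made":   0b0010111,
    "letter": 0b1001011,
}

def block_signature(words):
    sig = 0
    for w in words:          # bitwise OR of all word signatures
        sig |= WORD_SIG[w]
    return sig

blocks = [["text", "many"], ["word", "word"], ["made", "letter"]]
sigs = [block_signature(b) for b in blocks]

def may_contain(query_sig, block_sig):
    """True if every bit set in the query is also set in the block mask."""
    return query_sig & block_sig == query_sig

print([may_contain(WORD_SIG["word"], s) for s in sigs])
# [False, True, False]
```

Only the middle block passes the AND test for "word"; a block that passes may still be a false drop and must be scanned to confirm.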
Signature file trivia
• Signature files lead to possible mismatches.
–It is possible that all the corresponding bits are set
even though the word is not there. This is called a
false drop.

• False drop or false positive


–Document that is retrieved by a search but is not
relevant to the searcher’s needs
–False drops occur because of words that are
written the same but have different meanings.
–Example: ‘squash’ refer to a game, a vegetable or
an action
Exercise
• What is an index? Why we need it?
• What is full text indexing? Discuss the advantage and
disadvantage.
• List the main components of an indexing data
structure
• Which data structures can be used to implement an
index file? On what basis we can select index file
structure?
• What are the advantages and disadvantages of using
inverted index file?
• List the main steps to create an index in IR systems
• List the main steps to construct an inverted index
IR models

• Procedure followed in IR model


• Criteria in IR model selection
• Boolean IR model
• Vector space model
IR Models - Basic Concepts
• Word evidence:
 IR systems usually adopt index terms to index and retrieve
documents
 Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
• An index term is a word useful for remembering the
document main themes
• Not all terms are equally useful for representing the
document contents:
 less frequent terms allow identifying a narrower set of
documents
• But no ordering information is attached to the Bag of
Words identified from the document collection.
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
 Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
 Documents appearing at the top of this ordering
are considered to be more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
IR models
(Diagram: taxonomy of IR models, including the classic Boolean,
vector space, and probabilistic relevance models.)
General Procedures Followed
To find relevant documents for a given query,
• First, map documents and queries into term-document
vector space.
 Note that queries are considered as short document
• Second, queries and documents are represented as
weighted vectors, wij
 There are binary weights & non-binary weighting
technique
• Third, rank documents by the closeness of their vectors to
the query.
 Documents are ranked by closeness to the query.
Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-
dimensional vectors in a term-document matrix, which shows
the occurrence of terms in the document collection or query:

dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

• An entry in the matrix corresponds to the "weight" of a
term in the document; zero means the term doesn't exist in
the document.

      T1   T2   ....  TN
D1    w11  w12  ...   w1N
D2    w21  w22  ...   w2N
:     :    :          :
DM    wM1  wM2  ...   wMN
Qi    wi1  wi2  ...   wiN

– The document collection is mapped to a
term-by-document matrix
– View each document as a vector in a multidimensional
space
• Nearby vectors are related
– Normalize for vector length to avoid the effect of
document length
How to evaluate Models?
Criteria for selecting IR model considers procedures
the IR Model followed and techniques used:
• What is the weighting technique used by the IR Models
for measuring importance of terms in documents?
– Are they using binary or non-binary weight?
• What is the matching technique used by the IR models?
– Are they measuring similarity or dissimilarity?
• Are they applying exact matching or partial matching in
the course of finding relevant documents for a given
query?
• Are they applying best matching principle to measure
the degree of relevance of documents to display in
ranked-order?
– Is there any Ranking mechanism applied before
displaying relevant documents for the users?
The Boolean Model
• Boolean model is a simple model based on set theory
 The Boolean model imposes a binary criterion
for deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1, if the document satisfies the Boolean query;
             0 otherwise
– Note that no weights are assigned in-between 0 and 1;
the only values are 0 or 1

      T1   T2   ....  TN
D1    w11  w12  ...   w1N
D2    w21  w22  ...   w2N
:     :    :          :
DM    wM1  wM2  ...   wMN
The Boolean Model: Example
Given the following three documents, Construct Term – document
matrix and find the relevant documents retrieved by the
Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
Table below shows document –term (ti) matrix
arrive damage deliver fire gold silver ship truck
D1
D2
D3
query
• Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
The Boolean Model: Further Example
• Given the following determine documents retrieved
by the Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)

• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
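The set evaluation can be checked in Python, reading the query as K1 ∧ (K2 ∨ ¬K3), which is the reading consistent with the answer sets shown:

```python
# Sketch: Boolean retrieval as set operations over the index terms,
# with the query read as K1 AND (K2 OR NOT K3).
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def having(term):
    """Set of documents whose index-term set contains the term."""
    return {d for d, terms in docs.items() if term in terms}

result = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(result))   # ['D1', 'D2', 'D6']
```

AND, OR and NOT map directly to set intersection, union and complement, which is why the Boolean model is called set-theoretic.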
Exercise
Given the following four documents with the
following contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents retrieved for the
queries:
– Q1 = "information ∧ retrieval"
– Q2 = "information ∧ ¬computer"
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
 As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
 Ranked set of documents provides for better
matching
• The idea behind VSM is that
 the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, map documents and queries into term-document
vector space.
Note that queries are considered as short document
• Second, in the vector space, queries and documents are
represented as weighted vectors, wij
There are different weighting technique; the most widely used
one is computing TF*IDF weight for each term

• Third, similarity measurement is used to rank documents


by the closeness of their vectors to the query.
To measure closeness of documents to the query cosine
similarity score is used by most search engines
Term-document matrix
• A collection of n documents and a query can be represented
in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the "weight" of a term in
the document;
– zero means the term has no significance in the document or
it simply doesn't exist in the document. Otherwise, wij > 0
whenever ki ∈ dj

      T1   T2   ....  TN
D1    w11  w12  ...   w1N
D2    w21  w22  ...   w2N
:     :    :          :
DM    wM1  wM2  ...   wMN

• How to compute the weights wij and wiq for term i in
document j and in query q?
Computing weights
• The vector space model with TF*IDF weights is a good
ranking strategy with general collections
• For index terms a normalized TF*IDF weight is given by:

wij = ( freq(i,j) / max_k freq(k,j) ) * log(N/ni)

• The user's query is typically treated as a short document and
also TF*IDF weighted.
 For the query term weights, a suggestion is

wiq = ( 0.5 + 0.5 * freq(i,q) / max_k freq(k,q) ) * log(N/ni)
• The vector space model is usually as good as the known
ranking alternatives. It is also simple and fast to compute.
Example: Computing weights
• A collection includes 10,000 documents
 The term A appears 20 times in a particular document j
 The maximum appearance of any term in document j is
50
 The term A appears in 2,000 of the collection
documents.

• Compute TF*IDF weight of term A?


 tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(A) = log(N/DFA) = log (10,000/2,000) = log(5) = 2.32
 wAj = tf(A,j) * log(N/DFA) = 0.4 * 2.32 = 0.928
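A quick check of this arithmetic (the slide's log is base 2, since log(10,000/2,000) = log(5) ≈ 2.32 only in base 2; the slide's 0.928 comes from rounding the IDF to 2.32 before multiplying):

```python
# Check of the TF*IDF example; the slide's 0.928 uses IDF rounded
# to 2.32, while the unrounded product is 0.929.
import math

N, df_A = 10_000, 2_000     # collection size, document frequency of A
freq_A, max_freq = 20, 50   # raw counts within document j

tf = freq_A / max_freq      # 20/50 = 0.4
idf = math.log2(N / df_A)   # log2(5) ≈ 2.3219
w = tf * idf
print(round(idf, 2), round(w, 3))   # 2.32 0.929
```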
Similarity Measure
• A similarity measure is a function that computes the
degree of similarity between document dj and the user's
query q:

sim(dj, q) = (dj · q) / (|dj| |q|)
           = Σi=1..n wi,j wi,q
             / ( sqrt(Σi=1..n wi,j²) * sqrt(Σi=1..n wi,q²) )

• Using a similarity score between the query and each


document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that we
can control the size of the retrieved set of documents.
Vector Space with Term Weights
and Cosine Matching
Di = (d1i, w1di; d2i, w2di; ...; dti, wtdi)
Q = (q1i, w1qi; q2i, w2qi; ...; qti, wtqi)

sim(Q, Di) = Σj=1..t wjq wjdi
             / sqrt( Σj=1..t (wjq)² * Σj=1..t (wjdi)² )

Example over two terms A and B:
Q = (0.4, 0.8); D1 = (0.8, 0.3); D2 = (0.2, 0.7)

sim(Q, D2) = (0.4×0.2 + 0.8×0.7)
           / sqrt[ (0.4² + 0.8²) × (0.2² + 0.7²) ]
           = 0.64 / sqrt(0.80 × 0.53) = 0.64 / 0.65 = 0.98

sim(Q, D1) = (0.4×0.8 + 0.8×0.3)
           / sqrt[ (0.4² + 0.8²) × (0.8² + 0.3²) ]
           = 0.56 / sqrt(0.80 × 0.73) = 0.56 / 0.76 = 0.74

(Figure: Q, D1 and D2 plotted in the two-dimensional space of
Term A and Term B.)
Vector-Space Model: Example
• Suppose the user query is Q = “gold silver truck”. The
database collection consists of three documents with the
following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show retrieval results in ranked order?
1.Assume that full text terms are used during indexing,
without removing common terms, stop words, & also no
terms are stemmed.
2.Assume that content-bearing terms are selected during
indexing
3.Also compare your result with or without normalizing
term frequency
Vector-Space Model: Example
Term counts and weights Wi = TF*IDF (N = 3, IDF = log10(N/DF)):

          Counts (TF)              Wi = TF*IDF
Terms     Q  D1  D2  D3  DF  IDF   Q      D1     D2     D3
arrive    0  0   1   1   2   0.176 0      0      0.176  0.176
damage    0  1   0   0   1   0.477 0      0.477  0      0
deliver   0  0   1   0   1   0.477 0      0      0.477  0
fire      0  1   0   0   1   0.477 0      0.477  0      0
gold      1  1   0   1   2   0.176 0.176  0.176  0      0.176
silver    1  0   2   0   1   0.477 0.477  0      0.954  0
ship      0  1   0   1   2   0.176 0      0.176  0      0.176
truck     1  0   1   1   2   0.176 0.176  0      0.176  0.176
Vector-Space Model

Terms Q D1 D2 D3
arrive 0 0 0.176 0.176
damage 0 0.477 0 0
deliver 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
silver 0.477 0 0.954 0
ship 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using cosine sim(q,di)
• First, for each document and query, compute all vector
lengths (zero terms ignored):
|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
|d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
|q|  = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute dot products (zero products ignored):
q·d1 = 0.176×0.176 = 0.0310
q·d2 = 0.954×0.477 + 0.176×0.176 = 0.4862
q·d3 = 0.176×0.176 + 0.176×0.176 = 0.0620
Vector-Space Model: Example
Now, compute similarity score
Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801
Sim(q,d2) = (0.4862 ) / (0.538*1.095)= 0.8246
Sim(q,d3) = (0.0620) / (0.538*0.352)= 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
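The whole calculation can be reproduced in a few lines using the TF*IDF weight vectors from the table; tiny differences in the last decimals against the slide come from its rounded intermediate values:

```python
# Recompute the cosine scores for Q = "gold silver truck" vs D1-D3,
# using the TF*IDF weight vectors over the terms
# (arrive, damage, deliver, fire, gold, silver, ship, truck).
import math

q = [0, 0, 0, 0, 0.176, 0.477, 0, 0.176]
docs = {
    "D1": [0, 0.477, 0, 0.477, 0.176, 0, 0.176, 0],
    "D2": [0.176, 0, 0.477, 0, 0, 0.954, 0, 0.176],
    "D3": [0.176, 0, 0, 0, 0.176, 0, 0.176, 0.176],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

scores = {name: cosine(q, d) for name, d in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['D2', 'D3', 'D1']
```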
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set since it
helps to display relevant documents in ranked order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query

• Disadvantages:
• Assumes independence of index terms. It doesn’t relate
one term with another term
• Computationally expensive since it measures the similarity
between each document and the query
Assignment
Given short descriptions of four popular books on database systems:
• Document A: “Fundamentals of Database Systems”
This book discusses the most popular database topics, including SQL, security,
and data mining along with an introduction to UML modeling and an entirely
new chapter on XML and Internet databases.
• Document B: “An Introduction to Database Systems”
This book provides a comprehensive introduction to the now very large field of
database systems by providing a solid ground in the foundations of database
technology.
• Document C: “Databases & Transaction Processing: An Application-Oriented”
The book presents the engineering principles underlying the implementation of
database and transaction processing. This text presents the theory underlying
relational databases and relational query languages applications.
• Document D: “Database Management Systems”
This document provides comprehensive and up-to-date coverage of the
fundamentals of database systems. Coherent explanations and practical
examples have made this one of the leading texts in the field.
• Problem:
– Construct an inverted index file
– Calculate the relevance of each document for the query “Data base system”
using the vector model & the TFIDF method.
– Identify relevant documents in ranked order
Exercise
• Consider these documents:
Doc 1    breakthrough drug for schizophrenia
Doc 2    new schizophrenia drug
Doc 3    new approach for treatment of schizophrenia
Doc 4    new hopes for schizophrenia patients
–Draw the term-document incidence matrix for this document
collection.
–Draw the inverted index representation for this collection.
• For the document collection shown above, what are
the returned results for the queries:
–schizophrenia AND drug
–for AND NOT(drug OR approach)
Exercise
1.What is a model for IR and list the main models used in ad hoc IR?
2.Give some of the features of the Boolean model. What are the advantages
& disadvantages of the Boolean model?
3.Translate the following information need in a query for the Boolean model?
– Give all documents containing term ‘information’ but excluding those containing
either ‘retrieval’ or ‘Boolean’
4.Describe the vector model representation of documents and queries
5.What is usually considered a term in vector space model and give some
other possible terms that can be used
6.What is a similarity measure? Assume the query vector Q={0.2,0.3,0.5} and
documents D1={0.1,0.2,0.4} and D2={0.3,0.8,0.1}. Calculate the cosine
similarity measure of each document with respect to the query. Give the
relevance ranking of the documents.
7.Calculate the similarity of two documents using Jaccard & dice similarity; (i)
D1={1,0,1,1}, D2={0,1,1,0}; (ii) D1={0.3,0.5,0.1},D2={0.1,0.8,0.4}
8.What is the term weight TFIDF & what is the intuition behind this measure?
9.How can we apply TFIDF measure to calculate the similarity between two
documents?
10.What are the advantages & disadvantages of the vector space model?
11.What is document ranking?
12.What is the view of the probabilistic model in IR? Give three advantages/
disadvantages of the probabilistic model in IR
13.Describe the Probability Ranking Principle? What are the main
assumptions of the PRP ?
14.Describe the Binary Independence Retrieval model? What is the Retrieval
Status Value in the BIR model?
Probabilistic Model
• IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch

• This situation leads to several statistical approaches:


probability theory, fuzzy logic, theory of evidence, etc.
• The probabilistic retrieval model is a rigorous formal model
that attempts to predict the probability that a given
document will be relevant to a given query, P(R|q,di)
–It uses probability to estimate the "odds" of relevance of a
query to a document.
–It relies on accurate estimates of probabilities
Probabilistic model
• Asks the question: what is the probability that user
will see relevant information if they read this
document.
– P(rel | di ): probability of relevance after reading di
– How likely is the user to get relevance information from
reading this document
– high probability means more likely to get relevant info.

• A Probabilistic retrieval model


– Rank documents in decreasing order of probability of
relevance to users information need
– Calculate P(rel|di) for each document and rank
Probability Ranking Principle
• You have a collection of Documents
– User issues a query
– A Set of documents needs to be returned
– Intuitively, want the “best” document to be first, second best -
second, etc…
– We need a formal way to judge the “goodness” of documents
with respect to queries.
• Probability ranking principle: if a reference retrieval
system's response to each request is a ranking of the
documents in the collection in order of decreasing
probability of relevance… the overall effectiveness of
the system to its user will be the best that is obtainable.
Difficulties
• Evidence is based on a lossy representation
– Evaluate probability of relevance based on occurrence
of terms in query and documents
– Start with an initial estimate , refine through relevance
feedback

• Computing the probabilities exactly according to


the model is intractable
– Make some simplifying assumptions
Probabilistic Model definitions
• Let D be a document in the collection.
– dj = (t1,j, t2,j, ..., tt,j), ti,j ∈ {0,1}
• term occurrences are Boolean (not counts)
• the query q is represented similarly
• Let R represent the set of relevant documents with respect to a given
query and let NR represent the set of non-relevant documents.
– P(R | dj) is the probability that dj is relevant,
– P(NR | dj) is the probability that dj is non-relevant
• Need to find p(R | D) - the probability that a retrieved document D is
relevant.
• Similarity function (by Bayes' rule):
p(R | D)  = p(D | R) p(R) / p(D)
p(NR | D) = p(D | NR) p(NR) / p(D)
– If p(R|D) > p(NR|D) then D is relevant,
otherwise D is not relevant
Bayes' Theorem: Application in IR
• Goal: we want to estimate the probability that a document D
is relevant to a given query.

p(R | D) = p(R) p(D | R) / p(D)
         = p(R) p(D | R) / [ p(R) p(D | R) + p(NR) p(D | NR) ]

• It is easier to estimate the log odds of the probability of
relevance:

log O(R | D) = log [ p(R | D) / p(NR | D) ]
             = log [ p(R) p(D | R) / ( p(NR) p(D | NR) ) ]

where p(NR | D) = 1 - p(R | D)
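Numerically, the odds form is convenient because p(D) cancels out. A small sketch with invented probabilities, chosen purely to exercise the formula (not taken from the slides):

```python
# Sketch: posterior odds and probability of relevance via Bayes' rule.
# The three probabilities below are invented for illustration.
import math

p_R = 0.01            # prior probability that a document is relevant
p_D_given_R = 0.30    # P(D | R)
p_D_given_NR = 0.02   # P(D | NR), with P(NR) = 1 - P(R)

odds = (p_R * p_D_given_R) / ((1 - p_R) * p_D_given_NR)
log_odds = math.log10(odds)         # log O(R | D), base-10 assumed
p_R_given_D = odds / (1 + odds)     # recover p(R | D) from the odds
print(round(log_odds, 2), round(p_R_given_D, 3))
```

Even with a low prior, a likelihood ratio p(D|R)/p(D|NR) of 15 raises the posterior well above the prior, which is exactly the effect the ranking exploits.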
Probabilistic Models
• Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
• These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (RSV)
– D is a vector of binary term occurrences
– We assume that terms occur independently of each other
Principles surrounding weights
• Independence Assumptions
– I1: The distribution of terms in relevant documents is
independent and their distribution in all documents is
independent.
– I2: The distribution of terms in relevant documents is
independent and their distribution in non-relevant documents is
independent.

• Ordering Principles
– O1: Probable relevance is based only on the presence of
search terms in the documents.
– O2: Probable relevance is based on both the presence of
search terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
– R is completely unknown
– Assume P(ti|R) is constant (usually 0.5)
– Assume P(ti|NR) approximated by distribution
of ti across collection – IDF

• This can be used to compute an initial rank


using IDF as the basic term weight
Probabilistic Model Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot
1 1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1
6 1 1
wt 0.26 0.56 0.56 0.26 0.56 0.56 0.56 0.0 0.0 0.26
• q1 = eat
• q2 = porridge
• q3 = hot porridge
• q4 = eat nine day old porridge
Improving the Ranking
• Now, suppose
– we have shown the initial ranking to the user
– the user has labeled some of the documents
as relevant ("relevance feedback")
• We now have
– N documents in coll, R are known relevant
– ni documents containing ti, ri are relevant
Improving Term Weight Estimates
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

Document relevance for term ti:
                          Relevant     Non-relevant       Total
Docs containing ti          r            n - r              n
Docs not containing ti    R - r      N - R - (n - r)      N - n
Total                       R            N - R              N
Compute Term Weight: Robertson-Sparck
Jones Weights
• Retrospective formulation
–Ratio of the odds of a relevant document having the term
(i.e., ratio of relevant documents having the term to not
having the term) to the odds of a non-relevant document
having the term (i.e., ratio of non-relevant documents
having the term to not having the term):

w = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]

• Predictive formulation
–To guarantee that the denominator is never zero, add
0.5 to all numerators and denominators:

w(1) = log [ (r + 0.5)(N - n - R + r + 0.5)
             / ( (n - r + 0.5)(R - r + 0.5) ) ]
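The predictive formulation translates directly into code; a base-10 logarithm is assumed here, and the counts in the demo call are made-up values:

```python
# Sketch of the predictive Robertson-Sparck Jones term weight,
# using a base-10 logarithm; the demo counts are hypothetical.
import math

def rsj_weight(r, n, R, N):
    """r: relevant docs containing the term; n: docs containing it;
    R: relevant docs retrieved; N: documents in the collection."""
    num = (r + 0.5) * (N - n - R + r + 0.5)
    den = (n - r + 0.5) * (R - r + 0.5)
    return math.log10(num / den)

# A term occurring in 3 of 5 docs, 2 of which are the 2 relevant ones:
print(round(rsj_weight(r=2, n=3, R=2, N=5), 2))   # 0.92
```

A term concentrated in the relevant documents gets a positive weight; a term that is no more common in relevant than non-relevant documents gets a weight near zero.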
Relevance weighted Example
d   Document vectors <tfd,t>
    col day eat hot lot nin old pea por pot     Relevance
1 1 1 1 1 NR
2 1 1 1 R
3 1 1 1 NR
4 1 1 1 NR
5 1 1 NR
6 1 1 NR
• q3 = hot porridge; Document 2 is relevant
• wqt: col -0.33, day 0.00, eat 0.00, hot -0.33, lot 0.00,
nin 0.00, old 0.00, pea 0.62, por 0.62, pot 0.95
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)
• D2: “Cost of jellybeans is up.” (not relevant)
• D3: “Salaries of CEO’s are up.” (not relevant)
• D4: “Paper: CEO’s labor cost up.” (????)
Probabilistic Retrieval Example
cost paper Jellybean salary CEO labor up

D1 1 1 0 0 0 0 1
D2 1 0 1 0 0 0 1
D3 0 0 0 1 1 0 1
D4 1 1 0 0 1 1 1
Wij 0.477 1.176 -0.477 -0.477 -0.477 0.222 -0.222
• D1=0.477 +1.176+ -0.222
• D2=0.477 + -0.477+ -0.222
• D3= -0.477 + -0.477+ -0.222
• D4=0.477 +1.176 + -0.477 + 0.222 + -0.222
Exercise
• Consider the collection below. The collection has 5 documents
and each document is described by two terms. The initial
guess of relevance to a particular query Q is as given in the
table below. Assuming the query Q has a total of 2 relevant
documents in this collection solve the following questions
Document T1 T2 Relevance
D1 1 1 R
D2 0 1 NR
D3 1 0 NR
D4 1 0 R
D5 0 1 NR
• Using the probabilistic term weighting formula, calculate the
new weight for each of the query in Q
• Rank the documents according to their probability of relevance
with the new query
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving
better term probability estimates
• Advantages of probabilistic model over vector-space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors
Retrieval Effectiveness
• Evaluation of IR systems
• Relevance judgement
• Performance measures
– Recall,
– Precision,
– Single-valued measures
Why System Evaluation?
• Any system needs validation and verification
–Check whether the system is right or not
–Check whether it is the right system or not
• It provides the ability to measure the difference between IR
systems
–How well do our search engines work?
–Is system A better than B?
–Under what conditions?
• Evaluation drives what to study
–Identify techniques that work well and do not work
–There are many retrieval models/algorithms
• which one is the best?
–What is the best component for:
• Similarity measures (dot-product, cosine, …)
• Index term selection (tokenization, stop-word removal,
stemming…)
• Term weighting (TF, TF-IDF,…)
Types of Evaluation Strategies
• User-centered evaluation
– Given several users, and at least two retrieval
systems
• Have each user try the same task on both systems
• Measure which system works “best” for the user's
information need
• How to measure user satisfaction?
• System-centered evaluation
– Given documents, queries, and relevance
judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
The Notion of Relevance Judgment
• Relevance is the measure of a correspondence existing
between a document and a query.
– Construct a document–query matrix:

          Q1  Q2  ….  QN
     D1   R   N   .…  R
     D2   R   N   .…  R
     :    :   :       :
     DM   R   N   .…  R

  as determined by:
  (i) the user who posed the retrieval problem;
  (ii) an external judge;
  (iii) an information specialist
– Is the relevance judgment made by users and an external
  person the same?
•Relevance judgment is usually:
– Subjective: Depends upon a specific user’s judgment.
– Situational: Relates to user’s current needs.
– Cognitive: Depends on human perception and behavior.
– Dynamic: Changes over time.
Measuring Retrieval Effectiveness
Metrics often used to evaluate effectiveness of the system:

                   Relevant          Irrelevant
  Retrieved        True Positive     False Positive
  Not retrieved    False Negative    True Negative

• Retrieval of documents may result in:
  – False positives (errors of commission): some irrelevant
    documents may be retrieved by the system as relevant.
  – False negatives (false drops, or errors of omission): some
    relevant documents may not be retrieved by the system, as if
    irrelevant.
  – For many applications a good index should not permit any false
    drops, but may permit a few false positives.
Measuring Retrieval Effectiveness

                   Relevant   Not relevant
  Retrieved          A            B          Collection size = A+B+C+D
  Not retrieved      C            D          Relevant = A+C; Retrieved = A+B

  Recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|  = A / (A + C)
  Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| = A / (A + B)
Example
Assume that there are a total of 10 relevant documents
Ranking Relevance Recall Precision
1. Doc. 50 R 0.10 1.00
2. Doc. 34 NR 0.10 0.50
3. Doc. 45 R 0.20 0.67
4. Doc. 8 NR 0.20 0.50
5. Doc. 23 NR 0.20 0.40
6. Doc. 16 NR 0.20 0.33
7. Doc. 63 R 0.30 0.43
8. Doc 119 R 0.40 0.50
9. Doc 21 NR 0.40 0.44
10. Doc 80 R 0.50 0.50
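The recall and precision columns can be generated programmatically. A minimal sketch (the relevance flags are taken from the ranking above; 10 relevant documents in total, as stated):

```python
# Relevance flags of the top-10 ranked documents (True = relevant);
# the collection contains 10 relevant documents in total.
ranking = [True, False, True, False, False, False, True, True, False, True]
total_relevant = 10

rows, hits = [], 0
for k, rel in enumerate(ranking, start=1):
    hits += rel
    # recall = relevant seen so far / all relevant; precision = relevant / retrieved
    rows.append((round(hits / total_relevant, 2), round(hits / k, 2)))

for recall, precision in rows:
    print(recall, precision)
```

The printed pairs match the table, e.g. (0.20, 0.67) at rank 3 and (0.30, 0.43) at rank 7.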
Graphing Precision and Recall
• Plot each (recall, precision) point on a graph
– Recall is a non-decreasing function of the number of documents
retrieved,
–Precision usually decreases (in a good system)
• Precision/Recall tradeoff
–Can increase recall by retrieving many documents (down to
a low level of relevance ranking), but many irrelevant
documents would be fetched, reducing precision
–Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision–recall tradeoff]
– Precision 1, recall near 0: returns relevant documents but misses
  many useful ones too
– Recall 1, precision near 0: returns most relevant documents but
  includes lots of junk
– Precision 1, recall 1: the ideal
Need for Interpolation
• Two issues:
–How do you compare performance across queries?
–Is the sawtooth shape intuitive to understand and
interpret the performance result?
[Figure: sawtooth precision–recall curve for a single query]

Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
• It is a general form of precision/recall calculation
• Precision change w.r.t. Recall (not a fixed point)
– It is an empirical fact that on average as recall increases,
precision decreases
• Interpolate precision at 11 standard recall levels:
– rj {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
where j = 0 …. 10
• The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall
level between the jth and (j + 1)th level:
      P(r_j) = max P(r),   for r_j ≤ r ≤ r_(j+1)
Example: Interpolation
Assume that there are a total of 10 relevant documents.

  Recall   Precision
  0.00     1.00
  0.10     1.00
  0.20     0.67
  0.30     0.50
  0.40     0.50
  0.50     0.50
  0.60     0.50
  0.70     0.50
  0.80     0.50
  0.90     0.50
  1.00     0.50
Result of Interpolation
[Figure: interpolated precision–recall curve]
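The 11-level interpolation can be sketched as below. This uses the common "maximum precision at any recall level ≥ r_j" variant; note that recall levels beyond the last observed point get precision 0 here, whereas the worked example in these slides carries the last precision (0.50) forward:

```python
# Observed (recall, precision) points from the earlier ranking example
points = [(0.1, 1.0), (0.1, 0.5), (0.2, 0.67), (0.2, 0.5), (0.2, 0.4),
          (0.2, 0.33), (0.3, 0.43), (0.4, 0.5), (0.4, 0.44), (0.5, 0.5)]

def interpolated(points, levels=11):
    # Interpolated precision at standard level r_j = max precision at
    # any observed recall >= r_j (0.0 when no observed point reaches it)
    out = []
    for j in range(levels):
        rj = j / 10
        ps = [p for r, p in points if r >= rj]
        out.append(max(ps) if ps else 0.0)
    return out

print(interpolated(points))  # starts [1.0, 1.0, 0.67, 0.5, 0.5, 0.5, ...]
```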
Interpolating across queries
• For each query, calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
• Average precision favors systems which produce relevant
  documents high in the rankings
Single-valued measures
• Single value measures: calculate a single
value for each query to evaluate
performance
– Mean Average Precision at seen relevant
documents
• Typically average performance over a large
set of queries.
– R-Precision
  • Precision at rank R, where R is the total number of
    relevant documents
MAP (Mean Average Precision)
• Computing mean average for more than one query
      MAP = (1/n) Σ_i [ (1/|R_i|) Σ_j ( j / r_ij ) ]
– rij = rank of the jth relevant document for Qi
– |Ri| = number of relevant documents for Qi
– n = number of test queries
• E.g. Assume there are 3 relevant documents for Query 1 and 2
for query 2. Calculate MAP?
Relevant Docs. retrieved Query 1 Query 2
1st rel. doc. 1 4
2nd rel. doc. 5 8
3rd rel. doc. 10 -
      MAP = (1/2) [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ] ≈ 0.41
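The MAP example above can be sketched as follows (the ranks of the relevant documents are the ones from the table; the helper name is illustrative):

```python
def average_precision(relevant_ranks, num_relevant):
    # Precision at the rank of the j-th relevant document is j / rank_j
    return sum(j / r for j, r in enumerate(relevant_ranks, start=1)) / num_relevant

ap1 = average_precision([1, 5, 10], 3)   # query 1: relevant docs at ranks 1, 5, 10
ap2 = average_precision([4, 8], 2)       # query 2: relevant docs at ranks 4, 8
map_score = (ap1 + ap2) / 2
print(round(map_score, 3))  # 0.408
```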
F-Measure
• One measure of performance that takes into
account both recall and precision.
• Harmonic mean of recall and precision:
      F = 2PR / (P + R) = 2 / (1/R + 1/P)
• Compared to arithmetic mean, both need to be high
for harmonic mean to be high.
• What if no relevant documents exist?
Example
Recall  Precision  F-Measure
0.10    1.00       0.18
0.10    0.50       0.17
0.20    0.67       0.31
0.20    0.50       0.29
0.20    0.40       0.27
0.20    0.33       0.25
0.30    0.43       0.35
0.40    0.50       0.44
0.40    0.44       0.42
0.50    0.50       0.50
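A minimal F-measure helper (the harmonic mean of precision and recall; defining F = 0 when both are 0 is one common answer to the question above):

```python
def f_measure(p, r):
    # Harmonic mean of precision and recall; 0 when no relevant
    # document exists and none is retrieved (p = r = 0)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(round(f_measure(1.0, 0.1), 2))   # 0.18
print(round(f_measure(0.5, 0.5), 2))   # 0.5
```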
E-Measure
• Associated with Van Rijsbergen
• Allows user to specify importance of recall and
precision
• It is parameterized F Measure. A variant of F
measure that allows weighting emphasis on precision
over recall:
(1   ) PR (1   )
2 2
E  2 1
 PR
2

R P

• Value of β controls the trade-off:
  – β = 1: Equal weight for precision and recall (E = F).
  – β > 1: Weight recall more. It emphasizes recall.
  – β < 1: Weight precision more. It emphasizes precision.
Example
Recall  Precision  F-Measure  E-Measure (β=2)  E-Measure (β=1/2)
0.10    1.00       0.18       0.12             0.36
0.10    0.50       0.17       0.12             0.28
0.20    0.67       0.31       0.23             0.46
0.20    0.50       0.29       0.23             0.38
0.20    0.40       0.27       0.22             0.33
0.20    0.33       0.25       0.22             0.29
0.30    0.43       0.35       0.32             0.40
0.40    0.50       0.44       0.42             0.48
0.40    0.44       0.42       0.41             0.43
0.50    0.50       0.50       0.50             0.50
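The parameterized measure can be sketched as below, using E = (1 + β²)PR / (β²P + R) as defined above; with β = 1 it reduces to the harmonic-mean F:

```python
def e_measure(p, r, beta):
    # Parameterized F measure: beta > 1 emphasizes recall,
    # beta < 1 emphasizes precision, beta = 1 gives E = F
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if b2 * p + r > 0 else 0.0

print(round(e_measure(1.0, 0.1, beta=2), 2))   # 0.12 (recall-heavy, low recall hurts)
print(round(e_measure(0.5, 0.5, beta=2), 2))   # 0.5
```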
Problems with both precision and recall
• Number of irrelevant documents in the collection is not taken
into account.
• Recall is undefined when there is no relevant document in
the collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
  – Noise = 1 – Precision;  Silence = 1 – Recall

      Miss    = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|
      Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
Exercise
• Define what is Precision and Recall
• What are the disadvantages of using precision/
recall as performance measures in IR
• Give example cases of getting
– High precision but low recall
– High recall, but low precision
• Describe the 11-point interpolated average
precision method
• What are the single measures of performance
that can be used?
• What are the subjective measures used in IR?
• In commercial WWW search engines what are
the most important factors in terms of
performance by which they are evaluated by the
users?
Query Languages
Keyword-based querying
• A query is an expression of the user's information
need.
– IR queries are distinctive in that they are unstructured
and often ambiguous; they differ from standard query
languages which are governed by strict syntax rules.
• Queries are combinations of words.
– The document collection is searched for documents that
contain these words.
• Word queries are intuitive, easy to express and
provide fast ranking.
Single-word queries
• A query is a single word
  – Usually used for searching in document images
• Simplest form of query.
• What are the possible documents retrieved as relevant?
  – All documents that include the query word are retrieved.
• On what basis are documents ranked?
  – Documents may be ranked by the frequency count of the query
    word in the document.
  – Documents containing more occurrences of the query word are
    given the highest priority
Phrase queries
• A query is a sequence of words treated as a single unit.
Also called “literal string” or “exact phrase” query.
–Phrase is usually surrounded by quotation marks.
–All documents that include this phrase are retrieved.
• Usually, separators (commas, colons, ...) & common words
(“a”, “the”, “of”, “for”…) in the phrase are ignored
• In effect, this query is for a set of words that must appear as
per the given sequence.
–Allows users to specify a context and thus gain precision.
–Ex.: “Information Processing for Document Retrieval”.
• What are the possible documents retrieved as relevant?
–All documents that include phrase query are retrieved.
• On what basis are documents ranked?
  – Documents may be ranked by the frequency of appearance of
    the phrase query in the document.
Multiple-word queries
• A query is a set of words (or a set of phrases).
– Ex.: what is the result for the query “Data Mining and Intelligent
Database Design”?
• What are the possible documents retrieved as relevant?
  – A document is retrieved if it includes one or more of the
    query words.
• On what basis are documents ranked, according to the
  best-matching principle?
  – Documents are ranked by the number of query words they
    contain. A document containing n query terms is ranked
    higher than a document containing m < n query words.
  – Documents are ranked in decreasing order: those containing
    all the query words are ranked at the top, those containing
    only one query word at the bottom.
  – Frequency counts may be used to break ties among documents
    that contain the same query words.
Boolean queries
• Queries are formulated based on concepts from logic:
AND, OR, NOT
–It describes the information needed by relating multiple words
with Boolean operators.
• Semantics: For each query word w a corresponding set Dw
is constructed that includes the documents that contain w.
• The Boolean expression is then interpreted as an
expression on the corresponding document sets with
corresponding set operators:
–AND: Finds only documents containing all of the specified
words or phrases.
–OR: Finds documents containing at least one of the specified
words or phrases.
–NOT: Excludes documents containing the specified word or
phrase.
Examples: Boolean queries
1.computer OR server
–Finds documents containing either computer, server or both
2. (computer OR server) NOT mainframe
–Select all documents that discuss computers or servers, do
not select any documents that discuss mainframes.
3. computer OR server NOT mainframe
–Select all documents that discuss computers, or documents
that discuss servers but do not discuss mainframes.
4. computer NOT (server OR mainframe)
–Select all documents that discuss computers, and do not
discuss either servers or mainframes.
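The set semantics described above can be sketched with a toy postings index (the document IDs and vocabulary below are illustrative, not from the slides):

```python
# Toy postings: for each word, the set of document IDs containing it
postings = {
    "computer":  {1, 2, 5},
    "server":    {2, 3},
    "mainframe": {3, 4},
}
all_docs = {1, 2, 3, 4, 5}

# AND -> intersection, OR -> union, NOT -> complement (difference)
q1 = postings["computer"] | postings["server"]                     # computer OR server
q2 = (postings["computer"] | postings["server"]) \
     & (all_docs - postings["mainframe"])                          # (computer OR server) NOT mainframe
q4 = postings["computer"] - (postings["server"] | postings["mainframe"])
print(sorted(q2))  # [1, 2, 5]
```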
Natural language
• Using natural language for querying is very
attractive.
• Example: Find all the documents that discuss
“campaign finance reforms, including documents that
discuss violations of campaign financing regulations. Do
not include documents that discuss campaign
contributions by the gun and the tobacco industries”.
• Natural language queries are converted to a formal
language for processing against a set of
documents.
• Such translation requires intelligence and is still a
challenge
Natural language
• Pseudo NL processing: System scans the text and
extracts recognized terms and Boolean connectors.
The grammaticality of the text is not important.
– Often used by search engines.
• Problem: Recognizing the negation in the search
statement (“Do not include...”).
• Compromise: Users enter natural language clauses
connected with Boolean operators.
• In the above example: “campaign finance reforms”
or “violations of campaign financing regulations" and
not “campaign contributions by the gun and the
tobacco industries”.
Exercise
• What are the main types of queries?
• List the main “query languages” used in IR
• Describe what is the Levenshtein distance and its
use in IR
• Describe what is the longest common subsequence
(LCS) and its use in IR
• What is regular expression and how can it be used
in IR?
• What are structural queries?
Query Operations

• Relevance Feedback
• Query Reformulation
• Query Expansion
• Query term reweighting
Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
◦ “restaurant” vs. “café”
◦ “Abyssinia” vs. “Ethiopia” vs. “Walia”
May retrieve irrelevant documents that
include polysemous terms.
◦ “Apple” (company vs. fruit vs. sport club)
◦ “bit” (unit of data vs. past tense of “bite”)
◦ “bat” (baseball vs. mammal)
Research area: Intelligent IR
• Take into account the meaning of the words used
– Solution: ontology
• Take into account the order of words in the query
– Solution: n-gram
• Adapt to the user need based on automatic or
semi-automatic feedback
– Solution: relevance feedback & query reformulation
• Extend search with related terms
– Solution: thesaurus & statistical co-occurrence
• Perform automatic spell checking
– Solution: NLP – spell checker
Introduction
• No detailed knowledge of the collection and searching
  environment
  – difficult to formulate queries well designed for searching
  – need many formulations of queries for effective searching
• First formulation: often a naïve attempt to retrieve
  relevant information
• Documents initially retrieved:
  – can be examined for relevance information (by the user or
    automatically by the system) to provide relevance feedback
  – improve query formulations for retrieving additional
    relevant documents (using query reformulation techniques)
Query Reformulation
• Identify terms related to query terms
• Revise query to account for feedback using two
basic techniques:
– Query Expansion: Add new terms related to query terms
from relevant documents.
– Term Reweighting: modify term weights based on document
  relevance for the user's query
• Increase the weight of terms in relevant documents and
  decrease the weight of terms in irrelevant documents
  such that:
  – The reformulated query is closer to the term weight vectors
    of relevant documents
• There are several algorithms for query reformulation.
Term Reweighting for Query
Reformulation: Rocchio formula
For query q
–Dr: set of relevant documents among retrieved
documents
–Dn: set of non-relevant documents among retrieved
documents
–Cr: set of relevant documents among all documents in
collection

      q_(i+1) = α·q_i + (β/|Dr|) Σ_{dj∈Dr} dj − (γ/|Dn|) Σ_{dj∈Dn} dj

• α, β, γ: tuning constants.
• Initial formulation: α = 1
• Usually information in relevant documents is more important
  than in non-relevant documents (γ < β)
Term Reweighting for Query
Reformulation: Ide formula
• Initial formulation: α = β = γ = 1

      q_(i+1) = α·q_i + β Σ_{dj∈Dr} dj − γ Σ_{dj∈Dn} dj

• E.g. Given query vector q = (2,3,1,2,5), the documents with the
  following vectors are retrieved: d1(3,3,2,0,9); d2(2,2,1,0,12);
  d3(3,2,1,0,9); d4(1,0,0,7,2); d5(0,1,0,8,5). Compute the modified
  query using the Rocchio approach with α = β = γ = 1.
  – Assume that the first three documents (d1–d3) are identified as
    relevant by a user & 2 documents (d4–d5) as irrelevant.
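The exercise above can be checked numerically. A minimal sketch of the Rocchio update with α = β = γ = 1, averaging over the relevant and non-relevant sets (the vectors are the ones from the example):

```python
import numpy as np

q = np.array([2, 3, 1, 2, 5], dtype=float)
relevant   = np.array([[3, 3, 2, 0, 9],
                       [2, 2, 1, 0, 12],
                       [3, 2, 1, 0, 9]], dtype=float)   # d1-d3
irrelevant = np.array([[1, 0, 0, 7, 2],
                       [0, 1, 0, 8, 5]], dtype=float)   # d4-d5

alpha = beta = gamma = 1.0
# q' = alpha*q + beta*centroid(relevant) - gamma*centroid(irrelevant)
q_new = alpha * q + beta * relevant.mean(axis=0) - gamma * irrelevant.mean(axis=0)
print(np.round(q_new, 2))  # [ 4.17  4.83  2.33 -5.5  11.5 ]
```

In practice negative components of the modified query are often clipped to zero.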
Approaches for Relevance Feedback
• Users relevance feedback
  – Most popular query reformulation strategy.
  – Approaches based on feedback from users about relevance of
    documents retrieved
• Pseudo-relevance feedback (Explicit vs. Implicit Relevance
  Feedback)
  – Approaches based on information derived from the set of
    initially retrieved documents (local set of documents),
    which is called Local Analysis
  – Approaches based on global information derived from the
    document collection, which is called Global Analysis
Users Relevance Feedback
• After initial searching results are presented, allow users to
provide feedback on the relevance of one or more of the
retrieved documents.
• Cycle:
–User presented with list of retrieved documents
–User marks those which are relevant
• In practice: top 10-20 ranked documents are examined
–Use this feedback information to reformulate the query, i.e.
• To select important terms from documents assessed
relevant by the users & add these terms to the reformulated
new query
–Produce new results based on reformulated query.
–Allows more interactive, multi-pass process.
• Expected: New query moves towards relevant documents
and away from non-relevant documents
Relevance Feedback Architecture
[Diagram] Query string → IR system (over the document corpus) →
ranked documents → the user marks documents as relevant/non-relevant
(feedback) → query reformulation → revised query → IR system →
re-ranked documents
Pseudo Relevance Feedback
• Use relevance feedback methods without explicit user
input.
• Steps
–Obtain relevance feedback automatically.
• Just assume the top K retrieved documents are relevant, &
use them to reformulate the query.
–Allow for query expansion that includes terms from
documents that are correlated with the query terms.
• Identify terms related to query terms (such as synonyms
terms and/or similar terms that are close to query terms in
text)
• Two strategies
–Local strategies
–Global strategies
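The pseudo-feedback steps above can be sketched as follows: the top-k documents are simply assumed relevant and their centroid is folded back into the query (a Rocchio-style update with γ = 0; the matrix, query, and parameter values below are illustrative):

```python
import numpy as np

def pseudo_feedback(query_vec, doc_matrix, k=3, beta=0.5):
    """Assume the top-k scored documents are relevant and add their
    centroid to the query (Rocchio with gamma = 0)."""
    scores = doc_matrix @ query_vec                # one score per document row
    top_k = np.argsort(scores)[::-1][:k]           # indices of the k best documents
    centroid = doc_matrix[top_k].mean(axis=0)
    return query_vec + beta * centroid

# Tiny illustrative term-document data (documents as rows)
docs = np.array([[1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 0.0])
q2 = pseudo_feedback(q, docs, k=2)
print(np.round(q2, 2))  # [1.5  0.25 0.25]
```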
Pseudo Feedback Architecture
[Diagram] Query string → IR system (over the document corpus) →
ranked documents → top-ranked documents assumed relevant (pseudo
feedback) → query reformulation → revised query → IR system →
re-ranked documents
Local Analysis
• Examine only documents retrieved automatically for
query to determine query expansion
– Base correlation analysis on only the “local” set of
retrieved documents for a specific query.
• At query time, dynamically determine similar terms
based on analysis of K top-ranked retrieved
documents.
• Avoids ambiguity by determining similar (related)
terms only within relevant documents.
– “Apple computer” → “Apple computer Powerbook laptop”
Global analysis
• Expand query using information from whole set of
documents in collection
– Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
• Thesaurus-like structure using all documents
  – An approach to automatically build a thesaurus.
    For example: a similarity thesaurus based on co-occurrence
    frequency
• A thesaurus provides information on synonyms and
semantically related words and phrases.
• Example:
physician
similar/synonymous: doctor, medical, MD
related: general practitioner, surgeon
Query Expansion: Thesaurus-based
• For each term, t, in a query, expand the query with
synonyms (similar) and related words of t from the
thesaurus.
• Added terms may be weighted less than the original query terms.
  – Generally increases recall.
• May significantly decrease precision, particularly with
  ambiguous terms.
  – “interest rate” → “interest rate fascinate evaluate”
Query Expansion: Term Co-occurrence-based
• Compute association matrices which quantify term correlations
in terms of how frequently they co-occur.
– Expand queries with statistically most similar terms.
– Synonymy association: terms that frequently co-occur
inside local set of documents
        w1   w2   w3  …………  wn
   w1   c11  c12  c13 …………  c1n
   w2   c21  c22  c23 …………  c2n       Association Matrix
   :     :    :    :          :
   wn   cn1  cn2  cn3 …………  cnn

   c_ij: correlation factor between term i and term j:

      c_ij = Σ_{d ∈ Dl} tf(t_i, d) × tf(t_j, d)

   tf: frequency of a term in document d
Normalized Association Matrix
• The frequency-based correlation factor favors more frequent terms.
• Normalize the association scores:

      s_ij = c_ij / (c_ii + c_jj − c_ij)

• The normalized score is 0 ≤ s_ij ≤ 1
  – 1 if two terms have the same frequency in all documents.
  – 0 if two terms have no correlation
• Given a query q
  – Find clusters of terms tj for the |q| query terms based on
    term association
    • N terms that register the largest value s_ij ≥ β
  – Keep clusters small
  – Expand the original query
Example
• For the query “information extraction”, say the
following documents are retrieved (containing
the words listed below):
– Doc 1: data, data, information, budget, retrieval,
information, budget, retrieval
– Doc 2: extraction, retrieval, extraction, information,
information, data
– Doc 3: data, retrieving, budget, budget, data,
information, budget, retrieval, information
– Doc 4: information, Internet
• Using term association, expand the given query with
frequently co-occurring terms with s_ij ≥ 80%
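The exercise above can be worked through in code. A minimal sketch (the documents are as given; c_ij and s_ij follow the definitions on the preceding slides; the 0.8 threshold is the one from the exercise):

```python
from collections import Counter

docs = [
    "data data information budget retrieval information budget retrieval",
    "extraction retrieval extraction information information data",
    "data retrieving budget budget data information budget retrieval information",
    "information internet",
]
tfs = [Counter(d.split()) for d in docs]
vocab = sorted(set(w for tf in tfs for w in tf))

def c(ti, tj):
    # Co-occurrence correlation: sum over documents of tf(ti,d) * tf(tj,d)
    return sum(tf[ti] * tf[tj] for tf in tfs)

def s(ti, tj):
    # Normalized association score
    return c(ti, tj) / (c(ti, ti) + c(tj, tj) - c(ti, tj))

query = ["information", "extraction"]
expansion = sorted({t for qt in query for t in vocab
                    if t not in query and s(qt, t) >= 0.8})
print(expansion)  # ['data']
```

Only "data" passes the threshold here: s(information, data) = 10/12 ≈ 0.83, while e.g. s(information, retrieval) = 8/11 ≈ 0.73 falls short.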
Global vs. Local Analysis
• Global analysis requires intensive term correlation
computation only once at system development
time.
– Local analysis requires intensive term correlation
computation for every query at run time (although
number of terms and documents is less than in global
analysis).
• But local analysis gives better results.
  – Term ambiguity may introduce irrelevant statistically
    correlated terms during global analysis.
  – “Apple computer” → “Apple red fruit computer”
Exercise
• List some of the features of an intelligent IR system.
• What is the purpose of relevance feedback in IR?
• Describe the Rocchio’s relevance feedback method
• Why user’s feedback is not widely used in IR?
• What is relevance pseudo feedback in IR?
• Describe the Thesaurus based query expansion
method and what are its advantages and
disadvantages
• Describe the statistical thesaurus method for query
expansion
• Describe the process of query expansion using
– global analysis
– local analysis?
• What are the similarity and differences between global
and local analysis methods?
IR Project
• Form a group with 3 members. Implement --------- for at least 10
documents written in one of the selected local languages.
1. Tokenization: identify word separators and use them to split
sentences into tokens (with no numeric and special characters)
2. Stop word detection: use stop list and DF to identify and show list of
non-stop words
3. Stemming: identify suffix and prefix used in a given language and
identify stem or root words
4. Normalization: correct any artificial difference between words,
including any difference in writing and speaking, like numbers and
special characters. Make your system write what we speak (‘1’ should
be written as ‘one’)
5. Weighting terms using TF, DF, TFIDF and rank them
6. Document clustering: Identify similar documents
7. Word clustering: Identify similar and related words
Report: Write a publishable report with sections:
• Abstract -- ½ page
• Introduce problem, objective, scope & methodology -- 2 pages
• Review related works -- 4 pages
• Description of architecture of IR system -- 3 pages
• Discussion of evaluation result, with findings --- 3 pages
• Concluding remarks, with major recommendation --- 1 page
• Reference (use IEEE referencing style).
• Contribution of each member for the success of the project
Important Dates:
• Assignment:
• Project:
• Final exam:
THANK YOU
# ---------------- Indexing code ----------------
import sys, glob, numpy
from string import punctuation
from nltk.corpus import stopwords
import pandas as pd
import l3  # HornMorpho stemmer

# Structures holding documents and terms
document_list = []
document_ids = {}
token_list = []
token_ids = {}
# Repeat for every document in the corpus
for filename in glob.glob('corpus/*'):
    f = open(filename, 'r', encoding='utf-8')
    tokens = f.read()
    f.close()
    tokens = tokens.lower()
    tokens = tokens.split()
    table = str.maketrans(' ', ' ', punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if len(word) > 2]
    stop_words = set(stopwords.words('oromo'))
    tokens = [w for w in tokens if w not in stop_words]

    # Get the document name as the last part of the path
    article_name = filename[filename.rfind('/') + 1:]
    doc_id = len(document_list)
    # Insert the ID in the inverse list
    document_ids[article_name] = doc_id
    # Populate the token structure for all tokens in the document
    for t in tokens:
        if t not in token_ids:
            token_ids[t] = len(token_list)
            token_list.append(t)
    # Transform the document's token list into the corresponding ID list
    tids = [token_ids[t] for t in tokens]
    # Store the document as both its token ID list and the corresponding set
    document_list.append({
        'name': article_name,
        'tokens': tids,
        'set': set(tids)
    })

tokens = open('token_list.txt', 'w', encoding='utf-8')
tokens.write(str(token_list))
tokens.close()
# Stemming using the HornMorpho stemmer
l3.anal_file('om', 'token_list.txt', 'stemmed.txt', citation=True, nbest=1)
stemmed_tokens = open('stemmed.txt', 'r', encoding='utf-8')
stem_list = stemmed_tokens.read()
number_of_documents = len(document_list)
number_of_tokens = len(token_list)

# Building the TF-IDF matrix
sys.stderr.write('Building the TF matrix and counting term occurrences\n')
token_count = [0] * number_of_tokens  # number of documents containing each word
TF = numpy.empty((number_of_tokens, number_of_documents), dtype=float)
# Scan the document list
for i, doc in enumerate(document_list):
    # Initialize with zeros
    n_dt = [0] * number_of_tokens
    # For all token IDs in the document
    for tid in doc['tokens']:
        # If first occurrence, increase the global count for IDF
        if n_dt[tid] == 0:
            token_count[tid] += 1
        # Increase the local count
        n_dt[tid] += 1
    # Normalize the local count by document length obtaining the TF vector;
    # store it as the i-th column of the TF matrix.
    TF[:, i] = numpy.array(n_dt, dtype=float) / len(doc['tokens'])
IDF = numpy.log10(number_of_documents / numpy.array(token_count, dtype=float))
TFIDF = numpy.diag(IDF).dot(TF)

# SVD for rank reduction: Y = T S D^T, with r = reduced rank
To, So, DoT = numpy.linalg.svd(TFIDF, full_matrices=False)
r = 10
T = To[:, :r]
S = numpy.diag(So)[:r, :r]
DT = DoT[:r, :]
Y = numpy.dot(T, numpy.dot(S, DT))

# K-means clustering using the reduced-rank matrix
from sklearn.cluster import KMeans
n_clusters = 8
kmeans = KMeans(n_clusters, max_iter=100, n_init=1, verbose=1)
label = kmeans.fit_predict(Y.T)  # cluster documents (columns of Y)
centroid = kmeans.cluster_centers_             # centroids of the clusters
sorted_centroid = centroid.argsort()[:, ::-1]  # centroid terms sorted by weight
centroid = pd.DataFrame(centroid)

# ---------------- Searching code ----------------
query = input('please enter query/Jechaa keessan galchaa!! ')
query = query.lower()
q_split = query.split()
q_split = [w.translate(table) for w in q_split]
q_split = [word for word in q_split if word.isalpha()]
q_split = [word for word in q_split if not word.isdigit()]
q_split = [word for word in q_split if len(word) > 2]
q_split = [w for w in q_split if w not in stop_words]
q_tokens = open('q_split.txt', 'w', encoding='utf-8')
q_tokens.write(str(q_split))
q_tokens.close()
l3.anal_file('om', 'q_split.txt', 'stemm_q.txt', citation=True, nbest=1)
fq = open('stemm_q.txt', 'r', encoding='utf-8')
q_list = fq.read().split()

# Build the TF-IDF query vector
q_tokens = set()
q_count = [0] * number_of_tokens
q_length = 0
for token in q_list:
    try:
        t_id = token_ids[token]
        q_count[t_id] += 1
        q_length += 1
        q_tokens.add(t_id)
    except KeyError:
        pass  # ignore query terms not in the vocabulary
q_TFIDF = (numpy.array(q_count, dtype=float) / q_length) * IDF

# Retrieve query results using VSM
N = len(document_list)

def print_top(sim, N, smallest=False):
    sorted_sim = sorted(enumerate(sim), key=lambda t: t[1], reverse=not smallest)
    for i, s in sorted_sim[:N]:
        if s > 0:
            print(document_list[i]['name'])

sim = TFIDF.T.dot(q_TFIDF)
print_top(sim, N)

# Retrieve query results using SVD:
# compute the similarity array given the SVD decomposition of the TF-IDF
# matrix (computed above), the TF-IDF query vector q and the desired rank r
def reduced_similarity(r, To, So, DoT, q):
    T = To[:, :r]
    S = numpy.diag(So)[:r, :r]
    DT = DoT[:r, :]
    q_r = S.dot(T.T).dot(q)
    return DT.T.dot(q_r)

sim_r = reduced_similarity(r, To, So, DoT, q_TFIDF)
print_top(sim_r, N)

# Retrieve query results using SVD with K-means:
# similarity between each cluster centroid and the query vector
c_qsim = centroid.dot(q_TFIDF)
my_dict = c_qsim.to_dict()

def get_key(val):
    for key, value in my_dict.items():
        if val == value:
            return key

# w is the cluster with maximum similarity between query and centroid
w = get_key(c_qsim.max())
doc_dict = {i: numpy.where(label == i)[0] for i in range(kmeans.n_clusters)}
for key in doc_dict.keys():
    if w == key:
        print('doc:', list(doc_dict[w]))