
PANIMALAR ENGINEERING COLLEGE

DEPARTMENT OF CSE
CS8080 INFORMATION RETRIEVAL TECHNIQUES

UNIT II MODELING AND RETRIEVAL EVALUATION


(Chapters 3, 4, 5 - Ricardo Baeza-Yates)
Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse
Document Frequency) Weighting - Vector Model – Probabilistic Model –
Latent Semantic Indexing Model – Neural Network Model – Retrieval
Evaluation – Retrieval Metrics – Precision and Recall – Reference Collection –
User-based Evaluation – Relevance Feedback and Query Expansion –
Explicit Relevance Feedback.

2.1 Basic IR Models


Introduction to Modeling
Modeling in IR is a complex process aimed at producing a
ranking function
Ranking function: a function that assigns scores to
documents with regard to a given query.
This process consists of two main tasks:
 The conception of a logical framework for representing
documents and queries
 The definition of a ranking function that allows
quantifying the similarities among documents and
queries
IR systems usually adopt index terms to index and retrieve
documents

IR Model Definition:

An IR model is a quadruple [D, Q, F, R(qi, dj)] where


1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function, which associates a score with a
query qi and a document dj

A Taxonomy of IR Models
Retrieval models are most frequently associated with distinct
combinations of a document logical view and a user task.
The user's task includes retrieval and browsing. Retrieval can be of
two types:

i) Ad Hoc Retrieval:
The documents in the collection remain relatively static while
new queries are submitted to the system.

ii) Filtering
The queries remain relatively static while new documents come
into the system

Classic IR model:
Each document is described by a set of representative
keywords called index terms. Numerical weights are assigned
to index terms to capture their relevance to the documents.
Three classic models: Boolean, vector, probabilistic
Boolean Model :
The Boolean retrieval model is a model for information
retrieval in which we can pose any query which is in
the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND,
OR, and NOT. The model views each document as just
a set of words. Based on a binary decision criterion
without any notion of a grading scale. Boolean
expressions have precise semantics.
Vector Model
Assign non-binary weights to index terms in queries
and in documents. Compute the similarity between
documents and query. More precise than Boolean
model.
Probabilistic Model
The probabilistic model tries to estimate the probability
that the user will find the document dj relevant with
ratio
P(dj relevant to q)/P(dj nonrelevant to q)
Given a user query q, and the ideal answer set R of the
relevant documents, the problem is to specify the
properties for this set. Assumption (probabilistic
principle): the probability of relevance depends on the
query and document representations only; ideal answer
set R should maximize the overall probability of
relevance

Basic Concepts
 Each document is represented by a set of representative
keywords or index terms
 Index term:
In a restricted sense: it is a keyword that has
some meaning on its own; usually plays the role of
a noun
In a more general form: it is any word that appears in
a document
 Let t be the number of index terms in the document
collection and ki be a generic index term. Then,
 the vocabulary V = {k1, . . . , kt} is the set of all
distinct index terms in the collection

The Term-Document Matrix


 The occurrence of a term ti in a document dj
establishes a relation between ti and dj
 A term-document relation between ti and dj can be
quantified by the frequency of the term in the
document
 In matrix form, this can be written as a term-document
matrix in which each element fi,j stands for the frequency
of term ti in document dj

Logical view of a document
It represents the full text of a document by a set of index terms.

• Documents in a collection are frequently represented through a
set of index terms or keywords. Keywords are extracted from the
document, either automatically or by a specialist, and they
provide a logical view of the document.
• This representation can be obtained through the elimination of
stop words (articles and connectives), stemming (which reduces
distinct words to their common grammatical root), and the
identification of noun groups (which eliminates adjectives,
adverbs, and verbs). Further compression might be employed.
These operations are called text operations.
• Stop-words
– Removing stop words reduces the set of representative
keywords extracted from a large collection
– Some examples of stop words are: "a", "and", "but", "how",
"or", and "what."
– For example, if you were to search for "What is a
motherboard?" on Computer Hope, the search engine would
only look for the term "motherboard". The removal of stop
words usually improves IR effectiveness.
• Noun groups
– Identifies the noun groups in the text
– This eliminates adjectives, adverbs, and verbs
• Stemming
– Reduces distinct words to their common grammatical
root
– Removes some endings of words
A stemmer for English, for example, should identify the string "cats"
(and possibly "catlike", "catty", etc.) as based on the root "cat", and
"stems", "stemmer", "stemming", "stemmed" as based on "stem". A
stemming algorithm reduces the words "fishing", "fished", and
"fisher" to the root word, "fish". On the other hand, "argue", "argued",
"argues", "arguing", and "argus" reduce to the stem "argu" but
"argument" and "arguments" reduce to the stem "argument".

• Reason for stemming


– Different word forms may bear similar meaning (e.g.
search, searching): stemming creates a "standard"
representation for them

• Finally, compression may be employed. These "text operations"
are used to extract the index terms. Text operations reduce the
complexity of the document representation and allow moving the
logical view from that of full text to that of a set of index terms
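A minimal sketch of these text operations, using a tiny hand-picked stop-word list and a deliberately crude suffix-stripping rule (a real system would use a full stop-word list and a proper stemmer such as Porter's; every name below is illustrative):

```python
import re

# Toy stop-word list; a production system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "but", "how", "is", "or", "the", "what"}

def crude_stem(word):
    """Very rough suffix stripping, only for illustration (not Porter's algorithm)."""
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Full text -> set of index terms: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}

print(index_terms("What is a motherboard?"))                 # {'motherboard'}
print(index_terms("The stemmer stems fishing and fished"))   # crude stems, e.g. {'stemm', 'stem', 'fish'}
```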

2.2 Boolean Retrieval Models


Definition : The Boolean retrieval model is a model for
information retrieval in which the query is in the form
of a Boolean expression of terms, combined with the
operators AND, OR, and NOT. The model views each
document as just a set of words.
 Simple model based on set theory and Boolean algebra
 The Boolean model predicts that each document is
either relevant or non- relevant

Example :
A fat book which many people own is Shakespeare's Collected
Works.

Problem : To determine which plays of Shakespeare contain


the words Brutus AND Caesar AND NOT Calpurnia.

Method 1: Using Grep


The simplest form of document retrieval is for a computer to
do the linear scan through documents. This process is
commonly referred to as grepping through text, after the
Unix command grep. Grepping through text can be a very
effective process, especially given the speed of modern
computers, and often allows useful possibilities for wildcard
pattern matching through the use of regular expressions.

However, to go beyond simple querying of modest collections, we need:


1. To process large document collections quickly.
2. To allow more flexible matching operations.
For example, it is impractical to perform the query
“Romans NEAR countrymen” with grep , where NEAR
might be defined as “within 5 words” or “within the
same sentence”.
3. To allow ranked retrieval: in many cases we want the
best answer to an information need among many documents
that contain certain words. The way to avoid linearly scanning
the texts for each query is to index documents in advance.

Method 2: Using the Boolean Retrieval Model


 The Boolean retrieval model is a model for information
retrieval in which we can pose any query which is in the
form of a Boolean expression of terms, that is, in which
terms are combined with the operators AND, OR, and
NOT. The model views each document as just a set of
words.
Terms are the indexed units. We have a vector for each term,
which shows the documents it appears in, or a vector for each
document, showing the terms that occur in it. The result is a
binary term-document incidence matrix, as in Figure

              Antony &  Julius   The      Hamlet  Othello  Macbeth
              Cleopatra Caesar   Tempest
Antony            1        1        0        0       0        1
Brutus            1        1        0        1       0        0
Caesar            1        1        0        1       1        1
Calpurnia         0        1        0        0       0        0
Cleopatra         1        0        0        0       0        0
Mercy             1        0        1        1       1        1
Worser            1        0        1        1       1        0

A term-document incidence matrix. Matrix element (t,


d) is 1 if the play in column d contains the word in row
t, and is 0 otherwise.

 To answer the query Brutus AND Caesar AND NOT


Calpurnia, we take the vectors for Brutus, Caesar and
Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100

Solution: Antony and Cleopatra and Hamlet

Results from Shakespeare for the query Brutus AND Caesar


AND NOT Calpurnia.
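A small sketch of this Boolean evaluation over the incidence matrix above, using Python integers as bit vectors; the play order follows the table and everything else is illustrative:

```python
# Incidence rows from the matrix above; leftmost bit = Antony and Cleopatra.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

mask = (1 << len(plays)) - 1   # 0b111111, used for the bitwise NOT

# Brutus AND Caesar AND NOT Calpurnia
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & mask)

# Read the answer bits from left (most significant) to right.
answers = [play for i, play in enumerate(plays)
           if result & (1 << (len(plays) - 1 - i))]
print(answers)   # ['Antony and Cleopatra', 'Hamlet']
```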

Drawbacks of the Boolean Model


 Retrieval based on binary decision criteria with no notion of
partial matching
 No ranking of the documents is provided (absence of a
grading scale)
 Information need has to be translated into a Boolean
expression, which most users find awkward
 The Boolean queries formulated by the users are most often
too simplistic
 The model frequently returns either too few or too many
documents in response to a user query

2.3 Term weighting


A search engine should return, in order, the documents
most likely to be useful to the searcher. To achieve
this, we order documents with respect to a query;
this is called ranking.

Term-Document Incidence Matrix


A Boolean model only records term presence or absence. Instead, we
assign a score, say in [0, 1], to each document; the score measures
how well the document and query "match". For the one-term query
"BRUTUS", the score is 1 if the term is present in the document and
0 otherwise; more appearances of the term in the document should
give a higher score.

              Antony &  Julius   The      Hamlet  Othello  Macbeth
              Cleopatra Caesar   Tempest
Antony            1        1        0        0       0        1
Brutus            1        1        0        1       0        0
Caesar            1        1        0        1       1        1
Calpurnia         0        1        0        0       0        0
Cleopatra         1        0        0        0       0        0
Mercy             1        0        1        1       1        1
Worser            1        0        1        1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|

Term Frequency tf
One of the weighting schemes is term frequency, denoted tft,d,
with the subscripts denoting the term and the document in order.
The term frequency TF(t, d) of term t in document d is the number
of times that t occurs in d.
Ex: Term-Document Count Matrix
We would like to give more weight to documents that contain a
term several times than to ones that contain it only once. To do
this we need term frequency information: the number of times a
term occurs in a document. We assign a score to represent the
number of occurrences.

How to use tf for query-document match scores?

Raw term frequency is not what we want. A document with 10
occurrences of a term is more relevant than a document with 1
occurrence of the term, but not 10 times more relevant. We
therefore use log-frequency weighting.

Log-Frequency Weighting
The log-frequency weight of term t in document d is calculated as

wt,d = 1 + log10(tft,d)   if tft,d > 0;   wt,d = 0 otherwise

tft,d → wt,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
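A minimal sketch of this weighting, assuming base-10 logarithms (which is what the mapping above implies):

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))   # 0->0, 1->1, 2->1.3, 10->2.0, 1000->4.0
```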
Document Frequency & Collection Frequency
Document frequency DF(t): the number of documents
in the collection that contain a term t
Collection frequency CF(t): the total number of
occurrences of a term t in the collection
Example :

Rare terms are more informative than frequent terms; to capture
this we will use the document frequency (df)
Example: rare word ARACHNOCENTRIC

A document containing this term is very likely to be relevant to
the query ARACHNOCENTRIC. We want a high weight for rare
terms like ARACHNOCENTRIC.

Example: common word THE

A document containing this term can be about anything. We want
a very low weight for common terms like THE.
Example: "insurance" vs. "try"

Document frequency is the more meaningful measure: there are
few documents that contain "insurance", so a document containing
"insurance" gets a higher boost for a query on "insurance" than
the many documents containing "try" get for a query on "try".
Inverse Document Frequency (idf Weight)
It estimates the rarity of a term in the whole document
collection. idft is an inverse measure of the informativeness of
t, and dft <= N.
dft is the document frequency of t: the number of documents
that contain t. The informativeness idf (inverse document
frequency) of t is defined as

idft = log10 (N / dft)

log(N/dft) is used instead of N/dft to dampen the effect of
idf.

N is the total number of documents in the collection (for
example, 806,791 documents)
• IDF(t) is high if t is a rare term
• IDF(t) is likely low if t is a frequent term

Effect of idf on Ranking


Does idf have an effect on ranking for one-term queries, like
IPHONE? Answer: No; idf affects the ranking only for queries with
two or more terms. For the query CAPRICIOUS PERSON, idf puts
more weight on CAPRICIOUS than on PERSON, since it is the
rarer term.

TF-IDF Weighting
The tf-idf weight of a term is the product of its tf weight and its
idf weight. It is the best known weighting scheme in information
retrieval. TF(t, d) measures the importance of a term t in document
d, and IDF(t) measures the importance of a term t in the whole
collection of documents.
TF-IDF weighting puts TF and IDF together:

TFIDF(t, d) = TF(t, d) x IDF(t)
            = (1 + log10 tf(t, d)) x log10(N / df(t))   (if log tf is used)

The tf-idf weight of term t is:
 High if t occurs many times in a small number of
documents, i.e., it is highly discriminative in those
documents
 Not high if t appears few times in a document, or is
frequent in many documents, i.e., it is not discriminative
 Low if t occurs in almost all documents, i.e., it provides no
discrimination at all
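A sketch of the combined tf-idf weight under the same base-10, log-tf assumptions; the tiny collection below is hypothetical:

```python
import math
from collections import Counter

# Hypothetical toy collection: each document is a list of terms.
docs = [
    ["car", "insurance", "auto", "insurance"],
    ["best", "car", "best", "deal"],
    ["auto", "repair", "shop"],
]
N = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for d in docs for term in set(d))

def tf_idf(term, doc):
    """(1 + log10 tf) * log10(N / df) -- zero when the term is absent."""
    tf = doc.count(term)
    if tf == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

print(round(tf_idf("insurance", docs[0]), 3))  # rare term, repeated -> higher weight
print(round(tf_idf("car", docs[0]), 3))        # appears in 2 of the 3 docs -> lower idf
```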
Simple Query-Document Score
 Scoring finds whether or not a query term is present in a
zone (zones are document features whose content can be
arbitrary free text; examples: title, abstract) within
a document.
 The score for a document-query pair is the sum, over the
terms t appearing in both q and d, of their tf-idf weights:

Score(q, d) = sum over t in (q ∩ d) of tf-idf(t, d)

The score is 0 if none of the query terms is present in the document.
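A sketch of this overlap score, reusing the tf_idf function and the hypothetical collection from the previous sketch:

```python
def score(query_terms, doc):
    """Sum the tf-idf weights of the query terms that occur in the document."""
    return sum(tf_idf(t, doc) for t in set(query_terms) if t in doc)

print(round(score(["car", "insurance"], docs[0]), 3))  # both query terms present
print(round(score(["best", "deal"], docs[0]), 3))      # 0.0 -- no query term present
```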

Document Length Normalization


 Document sizes might vary widely
 This is a problem because longer documents are more likely to
be retrieved by a given query
 To compensate for this undesired effect, we can divide the
rank of each document by its length
 This procedure consistently leads to better ranking, and it is
called document length normalization

2.4 The Vector Model
 Boolean matching and binary weights is too limiting
 The vector model proposes a framework in which partial
matching is possible
 This is accomplished by assigning non-binary weights to index
terms in queries and in documents
 Term weights are used to compute a degree of similarity
between a query and each document
 The documents are ranked in decreasing order of their degree of
Similarity

 The weight wi,j associated with a pair (ki, dj) is positive and
non-binary
 The index terms are assumed to be all mutually independent
 They are represented as unit vectors of a t-dimensional space (t
is the total number of index terms)
 The representations of document dj and query q are
t-dimensional vectors given by

dj = (w1,j, w2,j, ..., wt,j)        q = (w1,q, w2,q, ..., wt,q)

The similarity between a document dj and a query q is taken as the
cosine of the angle between these two vectors:

sim(dj, q) = (dj · q) / (|dj| |q|)
 Weights in the Vector model are basically tf-idf weights
 These equations should only be applied for values of term
frequency greater than zero
 If the term frequency is zero, the respective weight is also zero.

Cosine Similarity: measure the similarity between the document and
the query using the cosine of the angle between their vector
representations.

Example 1: To find the similarity between 3 novels


Three novels are taken for discussion, namely:
o SaS: Sense and Sensibility
o PaP: Pride and Prejudice
o WH: Wuthering Heights
To find how similar the novels are, the term frequencies tft
of a few selected terms are computed.

Log frequency weights:

After length normalization:

cos(SaS,PaP) ≈ 0.789*0.832 + 0.515*0.555 + 0.335*0 + 0*0 ≈ 0.94
cos(SaS,WH) ≈ 0.79     cos(PaP,WH) ≈ 0.69
cos(SaS,PaP) > cos(*,WH)
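A sketch of length normalization and cosine similarity; the two weight vectors reuse the normalized SaS and PaP values quoted above, everything else is illustrative:

```python
import math

def normalize(vec):
    """Divide each component by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec] if length > 0 else vec

def cosine(u, v):
    """Cosine similarity = dot product of the length-normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

# Length-normalized log-tf vectors for SaS and PaP from the example above
# (components correspond to the same terms, in the same order).
sas = [0.789, 0.515, 0.335, 0.0]
pap = [0.832, 0.555, 0.0, 0.0]

print(round(cosine(sas, pap), 2))  # ≈ 0.94, matching the example
```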

Example 2:

We often use different weightings for queries (q) and
documents (d).
Notation: ddd.qqq
Example: lnc.ltn
Document: l - logarithmic tf,
          n - no idf weighting,
          c - cosine normalization
Query:    l - logarithmic tf,
          t - idf,
          n - no cosine normalization

Example query: "best car insurance"
Example document: "car insurance auto insurance"
N = 10,000,000 and the df column is given in the question itself.

Final similarity score between query and document:
sum of wqi · wdi = 0 + 0 + 1.04 + 2.04 = 3.08

Example 3: standard weighting scheme (lnc.ltc)


Document: l - logarithmic tf,
          n - no idf weighting,
          c - cosine normalization
Query:    l - logarithmic tf,
          t - idf,
          c - cosine normalization

Document: "car insurance auto insurance"
Query: "best car insurance"
N = 1,000,000

                      Query                           Document            Prod
Term        tf-raw tf-wt   df     idf  wt   n'lize   tf-raw tf-wt  wt   n'lize
                   (log)
auto          0      0     5000   2.3  0    0          1     1     1    0.52    0
best          1      1     50000  1.3  1.3  0.34       0     0     0    0       0
car           1      1     10000  2.0  2.0  0.52       1     1     1    0.52    0.27
insurance     1      1     1000   3.0  3.0  0.78       2     1.3   1.3  0.68    0.53

Score = 0 + 0 + 0.27 + 0.53 = 0.8

Since only one document is given in the above example, only one
score is calculated. If n documents were given, n scores would be
calculated and the ranking would be done in decreasing order of
the scores.
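The sketch below reproduces this lnc.ltc computation; N, the df values, the query, and the document are taken from the example, while the code itself is only an illustration:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()} if length else weights

N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

query = ["best", "car", "insurance"]
document = ["car", "insurance", "auto", "insurance"]
terms = ["auto", "best", "car", "insurance"]

# Query side (ltc): logarithmic tf * idf, then cosine normalization.
q_wt = {t: log_tf(query.count(t)) * math.log10(N / df[t]) for t in terms}
q_norm = cosine_normalize(q_wt)

# Document side (lnc): logarithmic tf, no idf, cosine normalization.
d_wt = {t: log_tf(document.count(t)) for t in terms}
d_norm = cosine_normalize(d_wt)

score = sum(q_norm[t] * d_norm[t] for t in terms)
print(round(score, 2))   # ≈ 0.8, as in the worked example
```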

Advantages:
 term-weighting improves quality of the answer set
 partial matching allows retrieval of docs that approximate the
query conditions
 cosine ranking formula sorts documents according to a degree of
similarity to the query
 document length normalization is naturally built-in into the ranking
Disadvantages:
It assumes independence of index terms

2.5 Probabilistic Model

The probabilistic model captures the IR problem using a


probabilistic framework
Given a user query, there is an ideal answer set for this
query. Given a description of this ideal answer set, we
could retrieve the relevant documents. Querying is thus seen
as a specification of the properties of this ideal answer
set.

An initial set of documents is retrieved somehow. The
user inspects these docs looking for the relevant ones (in
practice, only the top 10-20 need to be inspected). The IR
system uses this information to refine the description of
the ideal answer set. By repeating this process, it is
expected that the description of the ideal answer set will
improve.

Probabilistic Ranking Principle

The probabilistic model tries to estimate the probability
that a document will be relevant to a user query. It
assumes that this probability depends on the query and
document representations only. The ideal answer set,
referred to as R, should maximize the probability of
relevance.

Idea: Given a user query q, and the ideal answer set R


of the relevant documents, the problem is to specify
the properties for this set
– Assumption (probabilistic principle): the
probability of relevance depends on the query and
document representations only; ideal answer set R
should maximize the overall probability of
relevance
– The probabilistic model tries to estimate the
probability that the user will find the document
dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)

Given a query q, there exists a subset of the documents R


which are relevant to q
But membership of R is uncertain. A probabilistic
retrieval model ranks documents in decreasing order of
their probability of relevance to the information need:
P(R | q, di)

Why probabilities in IR?


Users come with information needs, which they translate
into query representations. Similarly, there are
documents, which are converted into document
representations. Given only a query, an IR system has
an uncertain understanding of the information need, so
IR is an uncertain process, because of:
 Information need to query representation
 Documents to index terms construction
 Query terms and index terms mismatch
Probability theory provides a principled foundation for
such reasoning under uncertainty. This model estimates
how likely it is that a document is relevant to an information
need.

The Ranking

Documents are ranked in decreasing order of the estimated
probability of relevance P(R | d, q), or equivalently by the odds
O(R | d, q) = P(R = 1 | d, q) / P(R = 0 | d, q).

Probabilistic IR - Need to Estimate

1. Find measurable statistics (tf, df ,document length) that
affect judgments about document relevance
2. Combine these statistics to estimate the probability of
document relevance
3. Order documents by decreasing estimated probability of
relevance P(R|d, q)
4. Assume that the relevance of each document is
independent of the relevance of other documents

Term Incidence Contingency Table

Let pt = P(xt = 1 | R = 1, q) be the probability of a term
appearing in a document relevant to the query, and
let ut = P(xt = 1 | R = 0, q) be the probability of a term
appearing in a non-relevant document.
Rank documents using the log odds ratios ct for the terms in
the query:

ct = log [ pt / (1 - pt) ] - log [ ut / (1 - ut) ]
   = log [ pt (1 - ut) / ( ut (1 - pt) ) ]

pt / (1 - pt) is the odds of the term appearing if the document
is relevant
ut / (1 - ut) is the odds of the term appearing if the document
is irrelevant
ct = 0: the term has equal odds of appearing in relevant and
irrelevant docs
ct positive: higher odds to appear in relevant documents
ct negative: higher odds to appear in irrelevant documents
ct functions as a term weight
The retrieval status value for document d is the sum of the ct
weights of the query terms present in d:

RSVd = Σ ct over terms t appearing in both q and d

Example for the BIR model


• Assume a query q containing two terms, x1 and x2. Table 1
gives the relevance judgements for 20 documents
together with the distribution of the terms (0/1)
within these documents

• Rank documents using the log odds ratios ct for the terms
in the query

Contingency table:

From the contingency table, the parameters of the BIR model
are computed as
– p1 = 8/12 = 2/3 (term 1 present in relevant docs)
– u1 = 3/8 (term 1 present in non-relevant docs)
– p2 = 7/12 (term 2 present in relevant docs)
– u2 = 4/8 (term 2 present in non-relevant docs)

Applying the formula (with natural logarithms):

c1 = log 10/3 = 1.20 (x1 term weight)
c2 = log 7/5 = 0.33 (x2 term weight)

Based on these weights, documents are ranked
according to their corresponding binary vector x in the
order (1, 1), (1, 0), (0, 1), (0, 0).
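A small sketch of the BIR term weights and retrieval status values, using natural logarithms and the p/u estimates from the example; the document vectors at the end are hypothetical:

```python
import math

def term_weight(p, u):
    """BIR log odds-ratio weight c_t = ln[ p(1-u) / (u(1-p)) ]."""
    return math.log(p * (1 - u) / (u * (1 - p)))

c1 = term_weight(2 / 3, 3 / 8)   # term x1
c2 = term_weight(7 / 12, 4 / 8)  # term x2
print(round(c1, 2), round(c2, 2))  # 1.2 0.34 (the notes round the latter to 0.33)

def rsv(doc_vector):
    """Retrieval status value: sum the weights of the terms present."""
    x1, x2 = doc_vector
    return x1 * c1 + x2 * c2

# Hypothetical documents described only by their (x1, x2) incidence vectors.
for vec in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(vec, round(rsv(vec), 2))   # ranking order matches the example
```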

Advantages:
Docs ranked in decreasing order of probability of
relevance
Disadvantages:
the need to guess the initial estimates of the probabilities P(ki | R)
the method does not take into account tf factors
the lack of document length normalization

2.6 Latent Semantic Indexing Model

Classic IR might lead to poor retrieval due to:


Unrelated documents might be included in the answer
set
Relevant documents that do not contain at least one
index term are not retrieved

Reasoning: retrieval based on index terms is vague and


noisy. The user information need is more related to concepts
and ideas than to index terms.

Key Idea :
 The process of matching documents to a given query
could be concept matching instead of index term
matching
 A document that shares concepts with another
document known to be relevant might be of interest
 The key idea is to map documents and queries into a
lower dimensional space (i.e., composed of higher level
concepts which are in fewer number than the index
terms), Retrieval in this reduced concept space might
be superior to retrieval in the space of index terms
 LSI can increase recall but may hurt precision.

Definition : Latent semantic indexing (LSI) is an indexing


and retrieval method that uses a mathematical technique
called singular value decomposition (SVD) to identify
patterns in the relationships between the terms and
concepts contained in an unstructured collection of text.
LSI is based on the principle that words that are used in the
same contexts tend to have similar meanings.

• Latent Semantic Indexing (LSI) means analyzing documents to


find the underlying meaning or concepts of those documents
• Does concept matching instead of index term matching
• map documents and queries into a lower dimensional space
• Solves the problem of Synonymy

Problems with Lexical Semantics

 Ambiguity and association in natural language


 Polysemy: Words often have a multitude of
meanings and different types of usage (more severe
in very heterogeneous collections).
 The vector space model is unable to discriminate
between different meanings of the same word.

 Synonymy: Different terms may have an


identical or a similar meaning (weaker: words
indicating the same topic).
 No associations between words are made in the vector
space representation.

Polysemy and Context.

Example : LSI

Similarity of d2 and d3 in the original space: 0 (without SVD).

Similarity of d2 and d3 in the reduced space:
0.52 * 0.28 + 0.36 * 0.16 + 0.72 * 0.36 + 0.12 * 0.20 + (-0.39) * (-0.08) ≈ 0.52
(after applying SVD)

Documents d2 and d3 contain words with similar meanings, such as
"ship" and "boat", but their similarity score is 0 without SVD; after
applying SVD in latent semantic indexing, the score is 0.52.

How LSI addresses synonymy and semantic relatedness


• The dimensionality reduction forces us to omit a lot of “detail”.
• We have to map different words (= different dimensions of the full
space) to the same dimension in the reduced space.
• The “cost” of mapping synonyms to the same dimension is much less
than the cost of collapsing unrelated words.
• SVD selects the "least costly" mapping; thus, it will map synonyms
to the same dimension, but it will avoid doing that for unrelated words.
SVD (Singular Value Decomposition): the term-document matrix C is
factored as C = U Σ V^T; keeping only the k largest singular values
yields the reduced concept space used by LSI.
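A sketch of this reduction with NumPy's SVD on a tiny hypothetical term-document matrix; the terms, documents, and the choice k = 2 are all illustrative:

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
terms = ["ship", "boat", "ocean", "wood", "tree"]
C = np.array([
    [1, 0, 1, 0, 0, 0],   # ship
    [0, 1, 0, 0, 0, 0],   # boat
    [1, 1, 0, 0, 0, 0],   # ocean
    [1, 0, 0, 1, 1, 0],   # wood
    [0, 0, 0, 1, 0, 1],   # tree
], dtype=float)

# Full SVD, then keep only the k largest singular values (rank-k approximation).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the k-dimensional concept space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# d2 and d3 share no terms, so they are orthogonal in the original space,
# but they can become similar in the reduced concept space.
d2, d3 = 1, 2   # column indices of the second and third documents
print(round(cosine(C[:, d2], C[:, d3]), 2))                 # 0.0 in the original space
print(round(cosine(doc_vectors[d2], doc_vectors[d3]), 2))   # typically > 0 after SVD
```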

2.7 Neural Network Model
Classic IR:
Terms are used to index documents and queries
Retrieval is based on index term matching
Motivation: Neural networks are known to be good pattern matchers

The human brain is composed of billions of neurons. Each neuron
can be viewed as a small processing unit.
A neuron is stimulated by input signals and emits output signals in
reaction
A chain reaction of propagating signals is called a spread activation
process
As a result of spread activation, the brain might command the body
to take physical reactions

A neural network is an oversimplified representation of the neuron
interconnections in the human brain:
 nodes are processing units
 edges are synaptic connections
 the strength of a propagating signal is modeled by a weight
assigned to each edge
 the state of a node is defined by its activation level
 depending on its activation level, a node might issue an output
signal
Neural Network Model
n inputs x1, ..., xn are given with weights w1, ..., wn; w0 is a bias.
The neuron has a threshold to fire. To determine the output, the neuron
applies a non-linear activation function f(a) to the weighted sum
a = w0·x0 + w1·x1 + ... + wn·xn (with x0 = 1).

Example (weights w0 = -1.5, w1 = 1, w2 = 1; output y = 1 when S > 0):

x0   x1   x2     S     y
1    0    0    -1.5    0
1    0    1    -0.5    0
1    1    0    -0.5    0
1    1    1    +0.5    1
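A sketch of this threshold neuron with the weights implied by the S column above (w0 = -1.5, w1 = w2 = 1); it behaves as a logical AND:

```python
def neuron(x1, x2, w0=-1.5, w1=1.0, w2=1.0):
    """Threshold unit: fire (output 1) when the weighted sum exceeds 0."""
    s = w0 * 1 + w1 * x1 + w2 * x2   # x0 is the constant bias input 1
    return 1 if s > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, neuron(x1, x2))   # reproduces the y column of the table
```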
Neural Network Model For Information Retrieval

 Three layers network: one for the query terms, one for the
document terms, and a third one for the documents
 Signals propagate across the network
 First level of propagation:
o Query terms issue the first signals
o These signals propagate across the network to reach the
document nodes
 Second level of propagation:
o Document nodes might themselves generate new signals
which affect the document term nodes
o Document term nodes might respond with new signals of
their own
2.8 Retrieval Evaluation

To evaluate an IR system is to measure how well the system meets


the information needs of the users. This is troublesome, given that
the same result set might be interpreted differently by distinct
users. To deal with this problem, some metrics have been defined
that, on average, have a correlation with the preferences of a group
of users.
Without proper retrieval evaluation, one cannot:
 determine how well the IR system is performing
 compare the performance of the IR system with that of other
systems, objectively
Retrieval evaluation is a critical and integral component of any
modern IR system.
Retrieval performance evaluation consists of associating a quantitative
metric to the results produced by an IR system
 This metric should be directly associated with the relevance of
the results to the user
 Usually, its computation requires comparing the results
produced by the system with results suggested by humans for
the same set of queries

The Cranfield Paradigm


Evaluation of IR systems is the result of early experimentation
initiated in the 50’s by Cyril Cleverdon. The insights derived from
these experiments provide a foundation for the evaluation of IR
systems.

Cleverdon obtained a grant from the National Science Foundation to


compare distinct indexing systems. These experiments provided
interesting insights that culminated in the modern metrics of
precision and recall.

Recall ratio: the fraction of relevant documents retrieved


Precision ratio: the fraction of documents retrieved that are relevant

The next step was to devise a set of experiments that would allow
evaluating each indexing system in isolation more thoroughly. The
result was a test reference collection composed of documents,
queries, and relevance judgements . It became known as the
Cranfield-2 collection. The reference collection allows using the same
set of documents and queries to evaluate different ranking systems.
The uniformity of this setup allows quick evaluation of new ranking
functions

Reference Collections

Reference collections, which are based on the foundations


established by the Cranfield experiments, constitute the most used
evaluation method in IR
A reference collection is composed of:
 a set of documents
 a set of example information requests (queries)
 the relevant documents for each information request
2.9 Retrieval Metrics


2.9.1 Precision and Recall
Consider:
I: an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR system
R ∩ A: the intersection of the sets R and A

Recall is the fraction of the relevant documents that has been
retrieved:
Recall = |R ∩ A| / |R|
Precision is the fraction of the retrieved documents that is
relevant:
Precision = |R ∩ A| / |A|
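A minimal sketch of these two ratios over sets of document identifiers; the identifiers are hypothetical:

```python
def precision_recall(relevant, answer):
    """Precision = |R ∩ A| / |A|, Recall = |R ∩ A| / |R|."""
    hits = len(relevant & answer)
    precision = hits / len(answer) if answer else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgements: 4 relevant documents, 5 retrieved answers.
R = {"d1", "d3", "d5", "d9"}
A = {"d1", "d2", "d3", "d4", "d7"}
print(precision_recall(R, A))   # (0.4, 0.5): 2 of 5 retrieved are relevant, 2 of 4 relevant were retrieved
```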

2.10 Reference Collections


With small collections one can apply the Cranfield evaluation
paradigm to provide relevance assessments. With large collections,
however, not all documents can be evaluated relative to a given
information need. The alternative is to consider only the top k
documents produced by various ranking algorithms for a given
information need. This is called the pooling method. The method
works for reference collections of a few million documents, such as
the TREC collections.

The TREC Collections


TREC is a yearly conference dedicated to experimentation
with large test collections. For each TREC conference, a set of
experiments is designed. The research groups that participate in the
conference use these experiments to compare their retrieval systems.
As with most test collections, a TREC collection is composed of three
parts:
 the documents
 the example information requests (called topics)
 a set of relevant documents for each example information
request

The main TREC collection has been growing steadily over the years.
The TREC-3 collection has roughly 2 gigabytes.
The TREC-6 collection has roughly 5.8 gigabytes
 It is distributed in 5 CD-ROM disks of roughly 1 gigabyte of
compressed text each
 Its 5 disks were also used at the TREC-7 and TREC-8
Conferences.
The Terabyte test collection introduced at TREC-15, also referred to as
GOV2, includes 25 million Web documents crawled from sites in the
“.gov” domain
TREC documents come from the following sources:

Information Requests Topics


Each TREC collection includes a set of example information
requests
Each request is a description of an information need in natural
language
In the TREC nomenclature, each test information request is referred
to as a topic

An example of an information request is the topic


numbered 168 used in TREC-3:

<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
A document will address the role of the Federal Government in
financing the operation of the National Railroad Transportation
Corporation (AMTRAK)
<narr> Narrative: A relevant document must provide information on
the government's responsibility to make AMTRAK an economically
viable entity. It could also discuss the privatization of AMTRAK as
an alternative to continuing government subsidies. Documents
comparing government subsidies given to air and bus transportation
with those provided to AMTRAK would also be relevant
</top>

The Relevant Documents

The set of relevant documents for each topic is obtained from a pool
of possible relevant documents. This pool is created by taking the top
K documents (usually, K = 100) in the rankings generated by various
retrieval systems. The documents in the pool are then shown to
human assessors who ultimately decide on the relevance of each
document.
This technique of assessing relevance is called the pooling method
and is based on two assumptions:
 First, that the vast majority of the relevant documents is
collected in the assembled pool
 Second, that the documents which are not in the pool can be
considered to be not relevant
The Benchmark Tasks

The TREC conferences include two main information retrieval tasks


Ad hoc task: a set of new requests are run against a fixed document
database
routing task: a set of fixed requests are run against a database
whose documents are continually changing
For the ad hoc task, the participant systems execute the topics on a
pre-specified document collection
For the routing task, they receive the test information requests and
two distinct document collections
 The first collection is used for training and allows the tuning
of the retrieval algorithm
 The second is used for testing the tuned retrieval algorithm

At TREC-6, secondary tasks were added as follows:


 Chinese — ad hoc task in which both the documents and the
topics are in Chinese
 Filtering — a routing task in which the retrieval algorithm has only
to decide whether a document is relevant or not
 Interactive — task in which a human searcher interacts with the
retrieval system to determine the relevant documents
 NLP — task aimed at verifying whether retrieval algorithms based
on natural language processing offer advantages when compared
to the more traditional retrieval algorithms based on index terms

Other tasks added in TREC-6:
 Cross languages — ad hoc task in which the documents are in
one language but the topics are in a different language
 High precision — task in which the user of a retrieval system is
asked to retrieve ten documents that answer a given
information request within five minutes
 Spoken document retrieval — intended to stimulate research
on retrieval techniques for spoken documents
 Very large corpus — ad hoc task in which the retrieval systems
have to deal with collections of size 20 gigabytes

Evaluation Measures at TREC

At the TREC conferences, four basic types of evaluation measures are


used:
Summary table statistics — this is a table that summarizes
statistics relative to a given task
Recall-precision averages — these are a table or graph with average
precision (over all topics) at 11 standard recall levels
Document level averages — these are average precision figures
computed at specified document cutoff values
Average precision histogram — this is a graph that includes a
single measure for each separate topic
Other Reference Collections
Inex
Reuters
OHSUMED
NewsGroups
NTCIR
CLEF
Small collections
ADI, CACM, ISI, CRAN, LISA, MED, NLM, NPL, TIME
CF (Cystic Fibrosis)

2.11 User-based Evaluation

User preferences are affected by the characteristics of the user


interface (UI). For instance, the users of search engines look first at
the upper left corner of the results page. Thus, changing the layout is
likely to affect the assessment made by the users and their
behaviour. Proper evaluation of the user interface requires going
beyond the framework of the Cranfield experiments.

Human Experimentation in the Lab

 Evaluating the impact of UIs is better accomplished in


laboratories, with human subjects carefully selected
 The downside is that the experiments are costly to setup and
costly to be repeated
 Further, they are limited to a small set of information needs
executed by a small number of human subjects
 However, human experimentation is of value because it
complements the information produced by evaluation based on
reference collections
Side-by-Side Panels

A form of evaluating two different systems is to evaluate their results


side by side

Typically, the top 10 results produced by the systems for a given


query are displayed in side-by-side panels

Presenting the results side by side allows controlling:

 differences of opinion among subjects


 influences on the user opinion produced by the ordering of the
top results

Side-by-side panels for Yahoo! and Google: top 5 answers produced
by each search engine, with regard to the query "information
retrieval evaluation".

Side-by-side panels can be used for a quick comparison of distinct
search engines. In a side-by-side experiment, the users are aware
that they are participating in an experiment. Further, a side-by-side
experiment cannot be repeated under the same conditions as a
previous execution. Finally, side-by-side panels do not allow
measuring how much better system A is when compared to system
B. Despite these disadvantages, side-by-side panels constitute a
dynamic evaluation method that provides insights that complement
other evaluation methods.

A/B Testing & Crowd sourcing

A/B testing consists of displaying to selected users a modification in
the layout of a page. The group of selected users constitutes a fraction
of all users, for instance 1%. The method works well for sites with
large audiences. By analysing how the users react to the change, it is
possible to determine whether the proposed modification is positive or
not. A/B testing provides a form of human experimentation, even if
the setting is not that of a lab.

Crowdsourcing

There are a number of limitations with current approaches for
relevance evaluation. For instance, the Cranfield paradigm is
expensive and has obvious scalability issues. Recently, crowdsourcing
has emerged as a feasible alternative for relevance evaluation.
Crowdsourcing is a term used to describe tasks that are outsourced
to a large group of people, called "workers". It is an open call to solve
a problem or carry out a task, one which usually involves a monetary
reward in exchange for the service. To illustrate, crowdsourcing has
been used to validate research on the quality of search snippets. One
of the most important aspects of crowdsourcing is to design the
experiment carefully. It is important to ask the right questions and to
use well-known usability techniques. Workers are not information
retrieval experts, so the task designer should provide clear
instructions.

Amazon Mechanical Turk

Amazon Mechanical Turk (AMT) is an example of a crowdsourcing


platform. The participants execute human intelligence tasks, called
HITs, in exchange for small sums of money. The tasks are filed by
requesters who have an evaluation need. While the identity of
participants is not known to requesters, the service produces
evaluation results of high quality.

Evaluation using Clickthrough Data

 Reference collections provide an effective means of evaluating


the relevance of the results set
 However, they can only be applied to a relatively small number
of queries
 On the other side, the query log of a Web search engine is
typically composed of billions of queries. Thus, evaluation of a
Web search engine using reference collections has its
limitations
 One very promising alternative is evaluation based on the
analysis of clickthrough data
 It can be obtained by observing how frequently the users click
on a given document, when it is shown in the answer set for a
given query
 This is particularly attractive because the data can be collected
at a low cost without overhead for the user.
Biased Clickthrough Data
 To compare two search engines A and B , we can measure the
clickthrough rates in rankings RA and RB
 To illustrate, consider that a same query is specified by various
users in distinct moments in time
 We select one of the two search engines randomly and show the
results for this query to the user
 By comparing clickthrough data over millions of queries, we can
infer which search engine is preferable
 However, clickthrough data is difficult to interpret
 To illustrate, consider a query q and assume that the users
have clicked
o on the answers 2, 3, and 4 in the ranking RA, and
o on the answers 1 and 5 in the ranking RB
 In the first case, the average clickthrough rank position is
(2+3+4)/3, which is equal to 3
 In the second case, it is (1+5)/2, which is also equal to 3
 The example shows that clickthrough data is difficult to analyze

Further, clickthrough data is not an absolute indicator of relevance.
That is, a document that is highly clicked is not necessarily relevant;
rather, it is preferable with regard to the other documents in the
answer. Moreover, since the results produced by one search engine
are not relative to the other, it is difficult to use them to compare two
distinct ranking algorithms directly. The alternative is to mix the two
rankings to collect unbiased clickthrough data, as follows.

Unbiased Clickthrough Data


To collect unbiased clickthrough data from the users, we mix the
result sets of the two ranking algorithms. This way we can compare
clickthrough data for the two rankings. To mix the results of the two
rankings, we look at the top results from each ranking and mix them.

Notice that, among any set of top r ranked answers, the number of
answers originating from each ranking differs by no more than 1.
By collecting clickthrough data for the combined ranking, we further
ensure that the data is unbiased and reflects the user preferences

Under mild conditions, it can be shown that


Ranking RA contains more relevant documents than ranking
RB only if the clickthrough rate for RA is higher than the
clickthrough rate for RB.
Most important, under mild assumptions, the comparison of
two ranking algorithms with basis on the combined ranking
clickthrough data is consistent with a comparison of them
based on relevance judgements collected from human
assessors.
This is a striking result that shows the correlation
between clicks and the relevance of results

2.12 Relevance Feedback and Query Expansion

 The process of query modification is commonly referred to as


o relevance feedback, when the user provides information
on relevant documents to a query, or
o query expansion, when information related to the query is
used to expand it
 We refer to both of them as feedback methods
 Two basic approaches of feedback methods:
 explicit feedback, in which the information for query
reformulation is provided directly by the users, and
 implicit feedback, in which the information for query
reformulation is implicitly derived by the system
A Framework for Feedback Methods

 Consider a set of documents Dr that are known to be relevant to


the current query q
 In relevance feedback, the documents in Dr are used to
transform q into a modified query qm
 However, obtaining information on documents relevant to a
query requires the direct interference of the user
o Most users are unwilling to provide this information,
particularly in the Web
 Because of this high cost, the idea of relevance feedback has
been relaxed over the years
 Instead of asking the users for the relevant documents, we
could:
o Look at documents they have clicked on; or
o Look at terms belonging to the top documents in the
result set
 In both cases, it is expected that the feedback cycle will produce
results of higher quality
 A feedback cycle is composed of two basic steps:
o Determine feedback information that is either related or
expected to be related to the original query q and
o Determine how to transform query q to take this
information effectively into account
 The first step can be accomplished in two distinct ways:
o Obtain the feedback information explicitly from the users
o Obtain the feedback information implicitly from the query
results or from external sources such as a thesaurus

2.13 Explicit Relevance Feedback.

 In an explicit relevance feedback cycle, the feedback


information is provided directly by the users
 However, collecting feedback information is expensive and time
consuming
 In the Web, user clicks on search results constitute a new
source of feedback information
 A click indicates a document that is of interest to the user in the
context of the current query
 Notice that a click does not necessarily indicate a document
that is relevant to the query

 In an implicit relevance feedback cycle, the feedback
information is derived implicitly by the system
 There are two basic approaches for compiling implicit feedback
information:
o local analysis, which derives the feedback information
from the top ranked documents in the result set
o global analysis, which derives the feedback information
from external sources such as a thesaurus

Explicit Relevance Feedback
 In a classic relevance feedback cycle, the user is presented with
a list of the retrieved documents
 Then, the user examines them and marks those that are
relevant
 In practice, only the top 10 (or 20) ranked documents need to
be examined
 The main idea consists of
o selecting important terms from the documents that have
been identified as relevant, and
o enhancing the importance of these terms in a new query
formulation
 Expected effect: the new query will be moved towards the
relevant docs and away from the non-relevant ones
 Early experiments have shown good improvements in precision
for small test collections
 Relevance feedback presents the following characteristics:
o it shields the user from the details of the query
reformulation process (all the user has to provide is a
relevance judgement)
o it breaks down the whole searching task into a sequence
of small steps which are easier to grasp

The Rocchio Method

Key concept for relevance feedback: Centroid

The centroid is the center of mass of a set of points.


Recall that we represent documents as points in a
high-dimensional space. Thus: we can compute
centroids of documents.
Definition: the centroid of a set of documents D is

μ(D) = (1 / |D|) Σ_{d ∈ D} v(d)

where D is a set of documents and v(d) is the vector we use to
represent document d.


Example:

 The Rocchio algorithm implements relevance feedback in the
vector space model.
 Rocchio chooses the query qopt that maximizes

qopt = argmax_q [ sim(q, μ(Dr)) - sim(q, μ(Dnr)) ]

 Dr: set of relevant docs; Dnr: set of nonrelevant docs
 Intent: qopt is the vector that separates relevant and
nonrelevant docs maximally.
 Making some additional assumptions, we can rewrite this as
the optimal query vector:

qopt = μ(Dr) + [ μ(Dr) - μ(Dnr) ]

 The problem is that we do not know the truly relevant
docs, so we move the centroid of the known relevant documents
by the difference between the two centroids.

The Rocchio optimal query for separating


relevant and nonrelevant documents.

Rocchio 1971 algorithm

Relevance feedback on the initial query:

 We can modify the query based on relevance
feedback and apply the standard vector space model.
Use only the docs that were marked. Relevance
feedback can improve recall and precision.
Relevance feedback is most useful for increasing
recall in situations where recall is important.
Users can be expected to review results and to
take time to iterate.

qm = α q0 + β (1/|Dr|) Σ_{dj ∈ Dr} dj - γ (1/|Dnr|) Σ_{dj ∈ Dnr} dj

qm: modified query vector; q0: original query vector; Dr and
Dnr: sets of known relevant and nonrelevant documents,
respectively; α, β, and γ: weights
 New query moves towards relevant documents and away
from nonrelevant documents.
 Tradeoff α vs. β/γ: if we have a lot of judged documents,
we want a higher β/γ.
 Set negative term weights to 0: a "negative weight" for a
term does not make sense in the vector space model.
 Positive feedback is more valuable than negative feedback.
For example, set β = 0.75, γ = 0.25 to give higher weight
to positive feedback. Many systems only allow positive
feedback.

To Compute the Rocchio Vector
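A small sketch of the Rocchio update on term-weight vectors represented as Python dictionaries; the weights α = 1, β = 0.75, γ = 0.25 and the toy vectors are illustrative:

```python
from collections import defaultdict

def centroid(doc_vectors):
    """Component-wise average of a list of term-weight vectors (dicts)."""
    acc = defaultdict(float)
    for vec in doc_vectors:
        for term, w in vec.items():
            acc[term] += w
    return {t: w / len(doc_vectors) for t, w in acc.items()} if doc_vectors else {}

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """qm = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr), negatives dropped."""
    mu_r, mu_nr = centroid(relevant), centroid(nonrelevant)
    terms = set(q0) | set(mu_r) | set(mu_nr)
    qm = {t: alpha * q0.get(t, 0.0) + beta * mu_r.get(t, 0.0) - gamma * mu_nr.get(t, 0.0)
          for t in terms}
    return {t: w for t, w in qm.items() if w > 0}   # set negative term weights to 0

# Hypothetical toy vectors (e.g. tf-idf weights).
q0 = {"cheap": 1.0, "flights": 1.0}
relevant = [{"cheap": 0.8, "flights": 0.9, "airfare": 0.7}]
nonrelevant = [{"cheap": 0.9, "hotels": 0.8}]
print(rocchio(q0, relevant, nonrelevant))   # moves towards relevant docs, "hotels" dropped
```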

Disadvantages of Relevance feedback

 Relevance feedback is expensive.


 Relevance feedback creates long modified queries.
 Long queries are expensive to process.
 Users are reluctant to provide explicit feedback.
 It is often hard to understand why a particular
document was retrieved after applying relevance feedback.
 The search engine Excite had full relevance
feedback at one point, but abandoned it later.
