
INFORMATION RETRIEVAL AND SEMANTIC WEB
16B1NCI648
Information System Evaluation

Sec. 8.6
MEASURES FOR A SEARCH ENGINE

• How fast does it index?
  • Number of documents/hour
  • (Average document size)
• How fast does it search?
  • Latency as a function of index size
• Expressiveness of query language
  • Ability to express complex information needs
  • Speed on complex queries
Sec. 8.6
MEASURES FOR A SEARCH ENGINE
• All of the preceding criteria are measurable: we can quantify speed/size
• The key measure: user happiness
  • Speed of response/size of index are factors
  • But blindingly fast, useless answers won’t make a user happy
• Need a way of quantifying user happiness
  • Relevance of results is the most important factor
EVALUATING AN IR SYSTEM

• The information need is translated into a query
• Relevance is assessed relative to the information need, not the query
• Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
• Some work uses more-than-binary judgments, but this is not the standard
EVALUATING AN IR SYSTEM

• E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
• Query: wine red white heart attack effective
• Evaluate whether the doc addresses the information need, not whether it has these words
Standard Relevance Benchmarks
• Cranfield Collection [late 1950s]: contains 1398 abstracts of journal articles, 225 queries, and exhaustive judgments for all query-document pairs.
• Text Retrieval Conference (TREC) [1992]: 1.89 million documents, with relevance judgments for 450 information needs. Judgments are made only for the top-k documents returned by participating systems.
• GOV2: 25 million .gov web pages
• NTCIR and CLEF: cross-language information retrieval collections, with queries in one language over a document collection in multiple languages.
• Reuters-RCV1, 20 Newsgroups, …

Human experts mark, for each query and each doc, Relevant or Nonrelevant, or do so at least for the subset of docs that some system returned for that query.
THE SIGIR MUSEUM
EVALUATION

How to compare search engines? How good is an IR system?

• Various evaluation methods
  • Precision/Recall
  • Mean Average Precision
  • Mean Reciprocal Rank
    • If the first relevant doc is at the kth position, RR = 1/k.
  • NDCG
    • Non-Boolean/graded relevance scores
    • DCG = rel_1 + rel_2/log_2(2) + rel_3/log_2(3) + … + rel_n/log_2(n)
Sec. 8.3

PRECISION AND RECALL

Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

                 Relevant   Nonrelevant
Retrieved        tp         fp
Not Retrieved    fn         tn

• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)

Example:
• In the retrieved list below, R refers to a Relevant document and N refers to a Nonrelevant document.
• The collection has 10,000 documents.
• Assume that there are 8 relevant documents in total in the collection. Calculate precision and recall.
• Retrieved documents:
  RRNNN NNNRN RNNNR NNNNR
PRECISION AND RECALL

• Precision = 6/20
• Recall = 6/8
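
A minimal Python sketch of the computation above (the function name is illustrative, not from any library); it assumes the retrieved ranking is given as the R/N string and that the total number of relevant documents in the collection is known:

```python
# Sketch: precision and recall for the worked example above.
def precision_recall(retrieved, total_relevant):
    """retrieved: string of 'R' (relevant) and 'N' (nonrelevant) results, in rank order."""
    retrieved = retrieved.replace(" ", "")
    tp = retrieved.count("R")           # relevant docs that were retrieved
    precision = tp / len(retrieved)     # tp / (tp + fp)
    recall = tp / total_relevant        # tp / (tp + fn)
    return precision, recall

p, r = precision_recall("RRNNN NNNRN RNNNR NNNNR", total_relevant=8)
print(p, r)  # 0.3 (= 6/20), 0.75 (= 6/8)
```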
EXAMPLE: PRECISION & RECALL
• Suppose a wife asks her husband for the dates of 4 important events: their wedding anniversary, her birthday, and the birthdays of her mother-in-law and father-in-law.
• The husband recalls all 4 dates, but needs 8 attempts in total to do so.
• His recall is 100%, but his precision is only 50%, which is 4 divided by 8.
SHOULD WE INSTEAD USE THE ACCURACY MEASURE FOR EVALUATION?

• Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”
• The accuracy of an engine: the fraction of these classifications that are correct
  Accuracy = (tp + tn) / (tp + fp + fn + tn)
• Accuracy is a commonly used evaluation measure in machine learning classification work
• Why is this not a very useful evaluation measure in IR?
  • In almost all circumstances the data are extremely skewed: normally over 99.9% of the docs are in the nonrelevant category.
  • A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries.
  • Precision and recall instead concentrate the evaluation on the return of true positives: what percentage of the relevant documents have been found, and how many false positives have also been returned.
Sec. 8.3

PRECISION/RECALL

• An advantage of having both precision and recall is that one is more important than the other in many situations
  • Web surfers want high precision, while intelligence analysts need high recall.
• You can get high recall (but low precision) by retrieving all docs for all queries!
• In a good system, precision decreases as either the number of docs retrieved or recall increases
DIFFICULTIES IN USING PRECISION/RECALL

• Should average over large document collection/query ensembles
• Need human relevance assessments
  • People aren’t reliable assessors
• Assessments have to be binary
  • Nuanced assessments?
• Heavily skewed by collection/authorship
  • Results may not translate from one domain to another
A COMBINED MEASURE: F

• The combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),  where β² = (1 − α)/α

• People usually use the balanced F1 measure
  • i.e., with β = 1 (equivalently α = ½)
• Harmonic mean is a conservative average
  • See C. J. van Rijsbergen, Information Retrieval
PRECISION AND RECALL

F1-Score: a mean of precision and recall

  F1 = 2·P·R / (P + R)

People usually use the balanced F1 measure, i.e., with β = 1 (equivalently α = ½).

A more generalized formula:

  Fβ = (β² + 1)·P·R / (β²·P + R)

See “The truth of the F-measure” for a detailed discussion:
https://www.toyota-ti.ac.jp/Lab/Denshi/COIN/people/yutaka.sasaki/F-measure-YS-26Oct07.pdf
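
A short Python sketch of the F measures above (function name illustrative, not from any particular library), reusing P = 6/20 and R = 6/8 from the earlier example:

```python
# Sketch: weighted harmonic mean of precision and recall (F_beta).
def f_beta(precision, recall, beta=1.0):
    """beta > 1 weights recall more heavily; beta < 1 weights precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_beta(6 / 20, 6 / 8))           # balanced F1 ≈ 0.43
print(f_beta(6 / 20, 6 / 8, beta=2))   # recall-heavy F2 ≈ 0.58
```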
P, R AND F ARE SET-BASED MEASURES. THEY ARE COMPUTED ON UNORDERED SETS OF DOCUMENTS.

Can we do better for ranked documents?

RANKED RETRIEVAL EVALUATION
• In a ranked retrieval context, appropriate sets of retrieved documents are given by the top k retrieved documents.
• For each such set, precision and recall values can be plotted to give a precision-recall curve.
• The curve has a sawtooth shape:
  • If the (k+1)th document retrieved is nonrelevant, then recall is the same as for the top k documents but precision drops.
  • If it is relevant, then both precision and recall increase and the curve jags up and to the right.
PRECISION-RECALL CURVE BENEFIT

Example: a model has been developed to predict whether a medicine reduces blood pressure in diabetic patients.

• Consider the confusion matrix:

                  Actual YES   Actual NO
  Predicted YES   30 (TP)      50 (FP)
  Predicted NO    10 (FN)      1000 (TN)

• Here 40 (P) patients are actual positives, of which 30 (TP) are correctly predicted.
• Accuracy = (30 + 1000) / (30 + 50 + 10 + 1000) = 1030/1090 ≈ 95%
• Issue: the number of TN (non-diabetic cases) is very high, which inflates accuracy. The data are skewed.
• In such a scenario the precision-recall curve is helpful, since precision and recall do not consider TN.
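
A quick check of these numbers in Python: accuracy stays high because of the many true negatives, while precision and recall expose the behaviour on the positive class.

```python
# Confusion-matrix counts from the slide above.
tp, fp, fn, tn = 30, 50, 10, 1000

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 1030/1090 ≈ 0.945
precision = tp / (tp + fp)                   # 30/80 = 0.375
recall = tp / (tp + fn)                      # 30/40 = 0.75

print(accuracy, precision, recall)
```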
ROC VS PR CURVE

• ROC curves should be used when there are roughly equal numbers of observations for each class. (Plot TPR vs FPR)
• Precision-recall curves should be used when there is a moderate to large class imbalance. (Plot P vs R)
EVALUATION OF RANKED RETRIEVAL

• Precision@k: precision of the top k results.

Example ranking, where R marks a relevant document (first four positions: N R R N …):

• P@1 = 0
• P@2 = 1/2
• P@3 = 2/3
• P@4 = 2/4

Disadvantage: if there are only 4 relevant documents in the entire collection, and we retrieve 10 documents, the maximum precision achievable is only 0.4.
RECALL@K
• Assume that there are 100 relevant documents in the collection, with the same ranking as above (N R R N …).

• R@1 = 0
• R@2 = 1/100
• R@3 = 2/100
• R@4 = 2/100
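
A minimal sketch of Precision@k and Recall@k, assuming the ranking is represented as a list of 0/1 relevance flags (1 = relevant) and the total number of relevant documents in the collection is known:

```python
# Sketch: precision@k and recall@k over a 0/1 relevance ranking.
def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    return sum(rels[:k]) / total_relevant

ranking = [0, 1, 1, 0]   # N R R N, as in the example above
for k in range(1, 5):
    print(k, precision_at_k(ranking, k), recall_at_k(ranking, k, 100))
# P@1..P@4 = 0, 1/2, 2/3, 2/4 ; R@1..R@4 = 0, 1/100, 2/100, 2/100
```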
Sec. 8.4
A PRECISION-RECALL CURVE

[Figure: a sawtooth precision-recall curve, with precision (0.0–1.0) on the y-axis and recall (0.0–1.0) on the x-axis]

What is happening here when precision dips without an increase in recall?
INTERPOLATED PRECISION

• It is often useful to remove such jiggles, and the standard way to do this is with interpolated precision (IP).
• The interpolated precision at a given recall level is the maximum precision found in the recall-precision table at any recall value greater than or equal to that level; it is typically reported at the standard levels 0.0, 0.1, 0.2, …, 1.0.
INTERPOLATED PRECISION
• We cut off results at the kth relevance level (same ranking as above: N R R N …).

• (Interpolated) P at the 1st relevant document = 0.5
• (Interpolated) P at the 2nd relevant document = 2/3

• Interpolated Average Precision = (0.5 + 0.66) / 2 = 0.58 (if we are only interested in 2 levels of relevance)
• *Interpolated precision at recall 0 is taken to be 1!
WHAT IS THE AVERAGE PRECISION?
• Case 1: a ranking with 5 relevant documents, each retrieved at precision ½.
  • Take the average of the precision at each relevance level:
  • Average Precision = (½ + ½ + ½ + ½ + ½) / 5 = ½
• Case 2: a ranking with 3 relevant documents.
  • Average Precision = ?

For convenience, we refer to Interpolated Average Precision when we say AP.
WHAT IS THE AVERAGE PRECISION?
• Case 1: Average Precision = (½ + ½ + ½ + ½ + ½) / 5 = ½
• Case 2: Average Precision = 1/3

PRECISION-RECALL CURVE PLOT
• Let the total number of relevant documents be 15.
• Recall-precision table: [not reproduced]
• Interpolated recall-precision table: [not reproduced]
• Interpolated recall-precision graph: [not reproduced]

ROC
[Figure: ROC curve, not reproduced]
MEAN AVERAGE PRECISION (MAP)

• Average of the precision values obtained for the top k documents, each time a relevant doc is retrieved
• Avoids interpolation and the use of fixed recall levels
• MAP for a query collection is the arithmetic average of the per-query average precisions
• Macro-averaging: each query counts equally

MAP computes Average Precision over all relevance levels for a set of queries.
COMPUTE MAP
• Query 1: only 5 relevant docs in the corpus; ranking as in Case 1 above, so AP = 1/2
• Query 2: ranking as in Case 2 above, so AP = 1/3
• Query 3: only 3 relevant docs in the corpus; ranking as in Case 2 above, so AP = 1/3

• Compute MAP:

MAP = (1/2 + 1/3 + 1/3) / 3 ≈ 0.39
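
A minimal MAP sketch in Python: average precision per query (precision taken each time a relevant doc is retrieved, divided by the number of relevant docs for that query), then the arithmetic mean over queries. The example rankings are assumptions chosen so the per-query APs match the slide (1/2, 1/3, 1/3); they are not given on the slides themselves.

```python
# Sketch: Average Precision per query, then MAP over a set of queries.
def average_precision(rels, total_relevant):
    hits, precisions = 0, []
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)   # precision at this relevant doc
    return sum(precisions) / total_relevant if total_relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (relevance flags in rank order, #relevant docs in corpus)."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

q1 = ([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], 5)   # AP = 1/2
q2 = ([0, 0, 1, 0, 0, 1, 0, 0, 1], 3)      # AP = 1/3
q3 = ([0, 0, 1, 0, 0, 1, 0, 0, 1], 3)      # AP = 1/3
print(mean_average_precision([q1, q2, q3]))  # (1/2 + 1/3 + 1/3)/3 ≈ 0.39
```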


HOW TO COMPARE TWO SYSTEMS, IF RESULTS ARE RANKED AND GRADED?

(and we do not know the total number of relevant documents)

DISCOUNTED CUMULATIVE GAIN

DCG_k = Σ_{r=1}^{k} rel_r / log_2(r + 1)

• DCG_k = DCG at position k
• r = rank
• rel_r = graded relevance of the result at rank r
DCG EXAMPLE
• Presented with a list of documents in response to a search query, an experiment participant is asked to judge the relevance of each document to the query. Each document is to be judged on a scale of 0-3 with:
  • 0 = not relevant,
  • 3 = highly relevant, and
  • 1 and 2 = "somewhere in between".
• Compute DCG.
WHICH SYSTEM IS BETTER?
• 3,3,3,2,2,2 or 3,2,3,0,1,2?

Results from System 1                         Results from System 2
rel_i   log2(i+1)   rel_i/log2(i+1)           rel_i   log2(i+1)   rel_i/log2(i+1)
3.00    1.00        3.00                      3.00    1.00        3.00
3.00    1.58        1.89                      2.00    1.58        1.26
3.00    2.00        1.50                      3.00    2.00        1.50
2.00    2.32        0.86                      0.00    2.32        0.00
2.00    2.58        0.77                      1.00    2.58        0.39
2.00    2.81        0.71                      2.00    2.81        0.71
                    DCG = 8.74                                    DCG = 6.86
WHICH SYSTEM IS BETTER?
• System 1: 3,2,3,0,1,2  or  System 2: 3,3,3,2,2,2,1,0
• What if there are an unequal number of documents?
• The ideal DCG at rank 6 (the best achievable value) is the DCG of the ordering 3,3,3,2,2,2 = 8.74
• Normalize DCG by the ideal DCG value: NDCG = DCG/IDCG
• NDCG for System 1 = 6.86/8.74 = 0.785
• NDCG for System 2 = 8.74/8.74 = 1

For a set of queries Q, we average the NDCG.
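
A small Python sketch of DCG@k and NDCG@k using the log2(rank + 1) discount from the table above; it reproduces the slide's 8.74, 6.86 and 0.785 values.

```python
import math

# Sketch: DCG@k and NDCG@k with the log2(rank + 1) discount.
def dcg_at_k(rels, k):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, ideal_rels, k):
    return dcg_at_k(rels, k) / dcg_at_k(ideal_rels, k)

ideal = [3, 3, 3, 2, 2, 2]       # best possible ordering at rank 6
system = [3, 2, 3, 0, 1, 2]
print(dcg_at_k(ideal, 6))             # ≈ 8.74 (ideal DCG)
print(dcg_at_k(system, 6))            # ≈ 6.86
print(ndcg_at_k(system, ideal, 6))    # ≈ 0.785
```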


KAPPA STATISTICS

KAPPA MEASURE FOR INTER-JUDGE (DIS)AGREEMENT
• Kappa measure
  • Agreement measure among judges
  • Designed for categorical judgments
  • Corrects for chance agreement
• Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
  • P(A) – proportion of the time the judges agree
  • P(E) – what agreement would be expected by chance
  • Kappa = 0 for chance agreement, 1 for total agreement.

KAPPA STATISTIC
How do we estimate P(A) and P(E)?
KAPPA MEASURE: EXAMPLE

Number of docs   Judge 1       Judge 2
300              Relevant      Relevant
70               Nonrelevant   Nonrelevant
20               Relevant      Nonrelevant
10               Nonrelevant   Relevant
KAPPA EXAMPLE

• P(A) = 370/400 = 0.925
• P(nonrelevant) = (10 + 20 + 70 + 70)/800 = 0.2125
• P(relevant) = (10 + 20 + 300 + 300)/800 = 0.7875
• P(E) = 0.2125² + 0.7875² = 0.665
• Kappa = (0.925 – 0.665)/(1 – 0.665) = 0.776

• Kappa > 0.8: good agreement
• 0.67 < Kappa < 0.8: “tentative conclusions” (Carletta ’96)
• Depends on the purpose of the study
• For > 2 judges: average pairwise kappas
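
A small Python sketch of the two-judge kappa computation, reproducing the example above; the pooled marginals over 2 × total judgments follow the P(relevant)/P(nonrelevant) calculation on the previous slide.

```python
# Sketch: kappa for two judges and binary (Relevant/Nonrelevant) judgments.
def kappa(rel_rel, non_non, rel_non, non_rel):
    total = rel_rel + non_non + rel_non + non_rel
    p_agree = (rel_rel + non_non) / total                    # P(A)
    # Pooled marginals: probability a single judgment is relevant/nonrelevant.
    p_rel = (2 * rel_rel + rel_non + non_rel) / (2 * total)
    p_non = (2 * non_non + rel_non + non_rel) / (2 * total)
    p_chance = p_rel ** 2 + p_non ** 2                       # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(300, 70, 20, 10))  # ≈ 0.776
```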
REFERENCES

• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
