
SIT772

Database and Information Retrieval


Lecture 10

A/Prof. Jianxin Li
School of Information Technology
Deakin University
jianxin.li@Deakin.edu.au
Review of Last Week’s Class

• The Motivation of Vector Model


• TF-IDF Vector Model and Examples
– Basic concepts
– Similarity computation
• Summary of Vector Model

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure

Retrieval Evaluation

• Retrieval Performance
• Evaluations
• Precision
• Recall
• Single Value Summaries

The Importance of Evaluation

• The ability to measure differences underlies


experimental science
– How well do our systems work?
– Is A better than B?
– Is it really?
– Under what conditions?
• Evaluation drives what to research
– Identify techniques that work and don’t work
– Formative vs. summative evaluations

Types of Evaluation
Strategies

• System-centered studies
– Given documents, queries, and relevance judgments
– Try several variations of the system
– Measure which system returns the “best” hit list

• User-centered studies
– Given several users, and at least two retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

Evaluation Criteria

• Effectiveness
– How “good” are the documents that are returned?
– System only, human + system

• Efficiency
– Retrieval time, indexing time, index size

• Usability
– Learnability, frustration
– Novice vs. expert users

Good Effectiveness Measures

• Should capture some aspect of what the user wants


– That is, the measure should be meaningful

• Should have predictive value for other situations


– What happens with different queries on a different document
collection?

• Should be easily replicated by other researchers


• Should be easily comparable
– Optimally, expressed as a single number

The Notion of Relevance

• IR systems essentially facilitate communication


between a user and document collections
• Relevance is a measure of the effectiveness of
communication
– Logic and philosophy present other approaches
What is relevance?

What is relevance?

Relevance is the {measure, degree, dimension, estimate, appraisal, relation}
of {correspondence, utility, connection, satisfaction, fit, bearing, matching}
existing between a {document, article, textual form, reference, information provided, fact}
and a {query, request, information used, point of view, information need statement}
as determined by a {person, judge, user, requester, information specialist}

Does this help?
Tefko Saracevic (1975). Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.
Automatic Evaluation Model

Query + Documents → IR Black Box → Ranked List
Ranked List + Relevance Judgments → Evaluation Module → Measure of Effectiveness

These are the four things we need!


Which is the Best Rank Order?

(Figure: six candidate ranked hit lists, A through F, each a different ordering of relevant and irrelevant documents.)
Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

Set-Based Measures

                Relevant    Not relevant
Retrieved          A             B
Not retrieved      C             D

Collection size = A+B+C+D
Relevant = A+C
Retrieved = A+B

• Precision = A ÷ (A+B)
• Recall = A ÷ (A+C)
• Miss = C ÷ (A+C)
• False alarm (fallout) = B ÷ (B+D)
When is precision important?
When is recall important?
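
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the four set-based measures; the function name and the example counts are hypothetical.

```python
# Minimal sketch: the four set-based measures from the contingency table above.
# A = relevant & retrieved, B = irrelevant & retrieved,
# C = relevant & not retrieved, D = irrelevant & not retrieved.

def set_based_measures(A, B, C, D):
    precision   = A / (A + B) if (A + B) else 0.0
    recall      = A / (A + C) if (A + C) else 0.0
    miss        = C / (A + C) if (A + C) else 0.0
    false_alarm = B / (B + D) if (B + D) else 0.0   # fallout
    return precision, recall, miss, false_alarm

# Hypothetical example: 6 relevant documents retrieved out of 14 retrieved,
# 14 relevant documents in total, collection of 100 documents.
print(set_based_measures(6, 8, 8, 78))
# -> (0.4285..., 0.4285..., 0.5714..., 0.0930...)
```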

Recall and Precision

• Recall
– the fraction of the relevant documents which has been retrieved
– Recall = A ÷ (A+C)

• Precision
– the fraction of the retrieved documents which is relevant
– Precision = A ÷ (A+B)

(Figure: Venn diagram of the collection, showing the answer set (A+B), the relevant documents (A+C), and their overlap A, the relevant documents in the answer set.)
Measuring Precision and
Recall
• Assume there are a total of 14 relevant documents
• The user is not usually presented with all the documents in the
answer set A at once

Hits 1-10
Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20
Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14

(In this example ranking, relevant documents are retrieved at ranks 1, 5, 6, 8, 11, and 16.)
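
A small Python sketch (not from the slides) reproduces the table above; the ranks of the relevant documents are inferred from the precision values and are an assumption of this example.

```python
# Sketch: precision and recall after each of the first 20 hits.
# Relevant documents assumed at ranks 1, 5, 6, 8, 11, 16 (inferred from the slide).
from fractions import Fraction

relevant_ranks = {1, 5, 6, 8, 11, 16}
total_relevant = 14

for k in range(1, 21):
    hits = sum(1 for r in relevant_ranks if r <= k)   # relevant docs in the top k
    print(k, Fraction(hits, k), Fraction(hits, total_relevant))
```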

Graphing Precision and
Recall

• Plot each (recall, precision) point on a graph


• Visually represent the precision/recall tradeoff
(Figure: the (recall, precision) points from the example plotted, with precision on the y-axis (0 to 1) and recall on the x-axis (0 to 0.5).)

Need for Interpolation

• Two issues:
– How do you compare performance across queries?
– Does the sawtooth shape give an intuitive picture of what’s going on?

(Figure: the same sawtooth precision-recall plot, with precision on the y-axis and recall on the x-axis from 0 to 1.)

Solution: Interpolation!
Interpolation

• Why?
– We have no observed data between the data points
– Strange sawtooth shape doesn’t make sense
• It is an empirical fact that on average as recall
increases, precision decreases
• Interpolate at 11 standard recall levels
– 100%, 90%, 80%, … 30%, 20%, 10%, 0% (!)
– How?
P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }

where S is the set of all observed (P,R) points
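
A minimal sketch of this interpolation rule in Python (an illustration, not taken from the lecture); the observed points are those of the running example, and the helper name is made up.

```python
# Sketch: interpolated precision at the 11 standard recall levels, using
# P(R) = max{ P' : R' >= R and (R', P') in S }, and 0 where no point qualifies.

def interpolate_11pt(observed):
    """observed: list of (recall, precision) points S for one query."""
    levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return [max((p for r_obs, p in observed if r_obs >= r), default=0.0)
            for r in levels]

# Observed (recall, precision) points of the running example:
observed = [(1/14, 1/1), (2/14, 2/5), (3/14, 3/6),
            (4/14, 4/8), (5/14, 5/11), (6/14, 6/16)]
print(interpolate_11pt(observed))
# -> [1.0, 0.5, 0.5, 0.4545..., 0.375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```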

Interpolation

• Interpolate at 11 standard recall levels


(Figure: the example precision-recall points with the precision interpolated at the 11 standard recall levels.)

P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }


Result of Interpolation

(Figure: the resulting interpolated precision-recall curve, a non-increasing step function over the recall levels.)

We can also average precision across the 11 standard recall levels

P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }


Standard Recall Levels

• To compare performance over several queries, average the precision figures at each standard recall level:

P̄(r) = ∑_{i=1}^{N_q} P_i(r) / N_q

• P̄(r): the average precision at recall level r
• N_q: the number of queries used
• P_i(r): the precision at recall level r for the i-th query
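
A small sketch (illustrative only) of this averaging step, assuming each query’s curve has already been interpolated to the 11 standard levels; the helper name is hypothetical.

```python
# Sketch: average the interpolated precision over N_q queries, level by level.
# Each input curve could come from the interpolate_11pt sketch shown earlier.

def average_curve(per_query_curves):
    """per_query_curves: one 11-value interpolated precision list per query."""
    n_q = len(per_query_curves)
    return [sum(curve[i] for curve in per_query_curves) / n_q for i in range(11)]
```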

Precision versus Recall
Figures

• Compare the retrieval performance of distinct retrieval algorithms over a set of example queries
(Figure: average precision (%) versus recall (%) curves for the retrieval algorithms being compared.)

Single Value Summaries

• Average precision versus recall figures are useful for comparing the retrieval performance of distinct retrieval algorithms over a set of example queries.
• However, we may also wish to compare the retrieval performance of the algorithms on individual queries, because
– averaging precision over many queries might disguise important anomalies in the retrieval algorithms under study;
– when comparing two algorithms, we might be interested in whether one of them outperforms the other for each query in a given set of example queries.
• In these situations, a single precision value can be used for each query, interpreted as a summary of the precision versus recall curve.

Single Value Summaries
(MAP)

• Mean average precision (MAP)


– Average of precision at each retrieved relevant document
– Relevant documents not retrieved contribute zero to score

Hits 1-10

Precision 1/1 1/2 1/3 1/4 2/5 3/6 3/7 4/8 4/9 4/10

Hits 11-20

Precision 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20

Assume a total of 14 relevant documents: the 8 relevant documents that were not retrieved each contribute zero, so

MAP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.2307
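
A short Python sketch (not part of the slides) of this computation; the relevant ranks are those of the running example and the function name is made up.

```python
# Sketch: average precision for one query. Precision is accumulated at each
# retrieved relevant document; unretrieved relevant documents contribute zero.

def average_precision(relevant_ranks, total_relevant):
    precisions = [i / rank
                  for i, rank in enumerate(sorted(relevant_ranks), start=1)]
    return sum(precisions) / total_relevant

print(round(average_precision([1, 5, 6, 8, 11, 16], 14), 4))  # 0.2307
# MAP is this value averaged over all queries in the evaluation set.
```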

Single Value Summaries
(R-precision)

• R-precision (Recall-precision)
– the precision at the R-th position in the ranking, where R is
the total number of relevant documents for the current query

Hits 1-10

Precision 1/1 1/2 1/3 1/4 2/5 3/6 3/7 4/8 4/9 4/10

Hits 11-20

Precision 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20

Assume a total of 14 relevant documents: 5 relevant documents are retrieved in the first 14 positions, so

R-precision = 5/14
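
A matching sketch for R-precision (again illustrative, using the example’s relevant ranks):

```python
# Sketch: R-precision = precision at cut-off R, where R is the total number
# of relevant documents for the current query.

def r_precision(relevant_ranks, total_relevant):
    retrieved_relevant = sum(1 for rank in relevant_ranks
                             if rank <= total_relevant)
    return retrieved_relevant / total_relevant

print(r_precision([1, 5, 6, 8, 11, 16], 14))  # 5/14 ≈ 0.3571
```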

Problems with Precision and Recall

• Estimating the maximal recall requires knowledge of all the documents in the collection
• Recall and precision capture different aspects of the set of retrieved documents
• Recall and precision measure effectiveness over queries in batch mode
• Recall and precision are defined assuming a linear ordering of the retrieved documents

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

The Harmonic Mean

• The harmonic mean F(j) of recall and precision:

F(j) = 2 / ( 1/R(j) + 1/P(j) )
• R(j): the recall for the j-th document in the ranking
• P(j): the precision for the j-th document in the ranking

The Harmonic Mean

F = 2 × P × R / (P + R)
• F is always in the interval [0,1]
• F is 0 when no relevant documents have been retrieved
– in which case P and R are both 0
• F is 1 when all ranked documents are relevant
– in which case P and R are both 1
• Furthermore, F is high only when both R and P are high; the maximum value of F can therefore be interpreted as the best possible compromise between recall and precision.
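
A one-line sketch of the harmonic mean (illustrative; the rank-10 precision and recall values come from the running example):

```python
# Sketch: harmonic mean of precision and recall, F = 2PR / (P + R).

def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(round(f_measure(4/10, 4/14), 3))  # F at rank 10 of the example: 0.333
```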

E Measure (parameterized F
Measure)
• A variant of the F measure that allows the emphasis on precision versus recall to be weighted:

E = (1 + β²) P R / (β² P + R) = (1 + β²) / ( β²/R + 1/P )
• Value of β controls trade-off:
– β = 1: Equally weight precision and recall (E=F).
– β > 1: Weight recall more.
– β < 1: Weight precision more.
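
A sketch of the parameterized measure exactly as defined on this slide (with β controlling the trade-off); the sample precision and recall values are from the running example.

```python
# Sketch: E = (1 + beta^2) P R / (beta^2 P + R); beta = 1 gives the harmonic mean F.

def e_measure(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 4/10, 4/14
print(round(e_measure(p, r, beta=1.0), 3))  # 0.333 -- equals F
print(round(e_measure(p, r, beta=2.0), 3))  # 0.303 -- recall weighted more
print(round(e_measure(p, r, beta=0.5), 3))  # 0.370 -- precision weighted more
```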

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

Conclusion of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion
