
SIT772

Database and Information Retrieval


Lecture 10

A/Prof. Jianxin Li
School of Information Technology
Deakin University
jianxin.li@Deakin.edu.au
Review of Last Week’s Class

• The Motivation of Vector Model


• TF-IDF Vector Model and Examples
– Basic concepts
– Similarity computation
• Summary of Vector Model

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure

Retrieval Evaluation

• Retrieval Performance
• Evaluations
• Precision
• Recall
• Single Value Summaries

The Importance of Evaluation

• The ability to measure differences underlies


experimental science
– How well do our systems work?
– Is A better than B?
– Is it really?
– Under what conditions?
• Evaluation drives what to research
– Identify techniques that work and don’t work
– Formative vs. summative evaluations

Types of Evaluation
Strategies

• System-centered studies
– Given documents, queries, and relevance judgments
– Try several variations of the system
– Measure which system returns the “best” hit list

• User-centered studies
– Given several users, and at least two retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

Evaluation Criteria

• Effectiveness
– How “good” are the documents that are returned?
– System only, human + system

• Efficiency
– Retrieval time, indexing time, index size

• Usability
– Learnability, frustration
– Novice vs. expert users

Good Effectiveness Measures

• Should capture some aspect of what the user wants


– That is, the measure should be meaningful

• Should have predictive value for other situations


– What happens with different queries on a different document
collection?

• Should be easily replicated by other researchers


• Should be easily comparable
– Optimally, expressed as a single number

The Notion of Relevance

• IR systems essentially facilitate communication


between a user and document collections
• Relevance is a measure of the effectiveness of
communication
– Logic and philosophy present other approaches
What is relevance?

What is relevance?

Relevance is the {measure, degree, dimension, estimate, appraisal, relation}
of {correspondence, utility, connection, satisfaction, fit, bearing, matching}
existing between a {document, article, textual form, reference, information provided, fact}
and a {query, request, information used, point of view, information need statement}
as determined by a {person, judge, user, requester, information specialist}

Does this help?
Tefko Saracevic (1975). Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.
Automatic Evaluation Model

Query + Documents → IR Black Box → Ranked List
Ranked List + Relevance Judgments → Evaluation Module → Measure of Effectiveness

These are the four things we need!


Which is the Best Rank Order?

(Figure: six candidate ranked hit lists, A through F, each a different ordering of relevant and irrelevant documents.)
Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

Set-Based Measures

                Relevant    Not relevant
Retrieved          A             B
Not retrieved      C             D

Collection size = A+B+C+D
Relevant = A+C
Retrieved = A+B

• Precision = A ÷ (A+B)
• Recall = A ÷ (A+C)
• Miss = C ÷ (A+C)
• False alarm (fallout) = B ÷ (B+D)
When is precision important?
When is recall important?
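
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the four set-based measures; the function name and the example counts are hypothetical.

```python
# Minimal sketch: the four set-based measures from the contingency table above.
# A = relevant & retrieved, B = irrelevant & retrieved,
# C = relevant & not retrieved, D = irrelevant & not retrieved.

def set_based_measures(A, B, C, D):
    precision   = A / (A + B) if (A + B) else 0.0
    recall      = A / (A + C) if (A + C) else 0.0
    miss        = C / (A + C) if (A + C) else 0.0
    false_alarm = B / (B + D) if (B + D) else 0.0   # fallout
    return precision, recall, miss, false_alarm

# Hypothetical example: 6 relevant documents retrieved out of 14 retrieved,
# 14 relevant documents in total, collection of 100 documents.
print(set_based_measures(6, 8, 8, 78))
# -> (0.4285..., 0.4285..., 0.5714..., 0.0930...)
```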

Recall and Precision

• Recall
– the fraction of the relevant documents which has been retrieved
– Recall = A ÷ (A+C)

• Precision
– the fraction of the retrieved documents which is relevant
– Precision = A ÷ (A+B)

(Figure: Venn diagram of the collection, showing the answer set (A+B), the relevant documents (A+C), and their overlap A, the relevant documents in the answer set.)
Measuring Precision and
Recall
• Assume there are a total of 14 relevant documents
• The user is not usually presented with all the documents in the
answer set A at once

Hits 1-10
Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20
Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14

(In this example ranking, relevant documents are retrieved at ranks 1, 5, 6, 8, 11, and 16.)
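
A small Python sketch (not from the slides) reproduces the table above; the ranks of the relevant documents are inferred from the precision values and are an assumption of this example.

```python
# Sketch: precision and recall after each of the first 20 hits.
# Relevant documents assumed at ranks 1, 5, 6, 8, 11, 16 (inferred from the slide).
from fractions import Fraction

relevant_ranks = {1, 5, 6, 8, 11, 16}
total_relevant = 14

for k in range(1, 21):
    hits = sum(1 for r in relevant_ranks if r <= k)   # relevant docs in the top k
    print(k, Fraction(hits, k), Fraction(hits, total_relevant))
```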

Graphing Precision and
Recall

• Plot each (recall, precision) point on a graph


• Visually represent the precision/recall tradeoff
(Figure: the (recall, precision) points from the example plotted, with precision on the y-axis (0 to 1) and recall on the x-axis (0 to 0.5).)

Need for Interpolation

• Two issues:
– How do you compare performance across queries?
– Does the sawtooth shape give an intuitive picture of what’s going on?

(Figure: the same sawtooth precision-recall plot, with precision on the y-axis and recall on the x-axis from 0 to 1.)

Solution: Interpolation!
Interpolation

• Why?
– We have no observed data between the data points
– Strange sawtooth shape doesn’t make sense
• It is an empirical fact that on average as recall
increases, precision decreases
• Interpolate at 11 standard recall levels
– 100%, 90%, 80%, … 30%, 20%, 10%, 0% (!)
– How?
P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }

where S is the set of all observed (P,R) points
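
A minimal sketch of this interpolation rule in Python (an illustration, not taken from the lecture); the observed points are those of the running example, and the helper name is made up.

```python
# Sketch: interpolated precision at the 11 standard recall levels, using
# P(R) = max{ P' : R' >= R and (R', P') in S }, and 0 where no point qualifies.

def interpolate_11pt(observed):
    """observed: list of (recall, precision) points S for one query."""
    levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return [max((p for r_obs, p in observed if r_obs >= r), default=0.0)
            for r in levels]

# Observed (recall, precision) points of the running example:
observed = [(1/14, 1/1), (2/14, 2/5), (3/14, 3/6),
            (4/14, 4/8), (5/14, 5/11), (6/14, 6/16)]
print(interpolate_11pt(observed))
# -> [1.0, 0.5, 0.5, 0.4545..., 0.375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```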

Interpolation

• Interpolate at 11 standard recall levels


(Figure: the example precision-recall points with the precision interpolated at the 11 standard recall levels.)

P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }


Result of Interpolation

(Figure: the resulting interpolated precision-recall curve, a non-increasing step function over the recall levels.)

We can also average precision across the 11 standard recall levels

P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }


Standard Recall Levels

• To compare performance over several queries, average the precision figures at each standard recall level:

P̄(r) = ∑_{i=1}^{N_q} P_i(r) / N_q

• P̄(r): the average precision at recall level r
• N_q: the number of queries used
• P_i(r): the precision at recall level r for the i-th query
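
A small sketch (illustrative only) of this averaging step, assuming each query’s curve has already been interpolated to the 11 standard levels; the helper name is hypothetical.

```python
# Sketch: average the interpolated precision over N_q queries, level by level.
# Each input curve could come from the interpolate_11pt sketch shown earlier.

def average_curve(per_query_curves):
    """per_query_curves: one 11-value interpolated precision list per query."""
    n_q = len(per_query_curves)
    return [sum(curve[i] for curve in per_query_curves) / n_q for i in range(11)]
```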

Precision versus Recall
Figures

• Compare the retrieval performance of distinct retrieval algorithms over a set of example queries
(Figure: average precision (%) versus recall (%) curves for the retrieval algorithms being compared.)

Single Value Summaries

• Average precision versus recall figures are useful for comparing the retrieval performance of distinct retrieval algorithms over a set of example queries.
• However, we may also wish to compare the retrieval performance of the algorithms on individual queries, because
– averaging precision over many queries might disguise important anomalies in the retrieval algorithms under study;
– when comparing two algorithms, we might be interested in whether one of them outperforms the other for each query in a given set of example queries.
• In these situations, a single precision value can be used for each query, interpreted as a summary of the precision versus recall curve.

Single Value Summaries
(MAP)

• Mean average precision (MAP)


– Average of precision at each retrieved relevant document
– Relevant documents not retrieved contribute zero to score

Hits 1-10

Precision 1/1 1/2 1/3 1/4 2/5 3/6 3/7 4/8 4/9 4/10

Hits 11-20

Precision 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20

Assume a total of 14 relevant documents: the 8 relevant documents that were not retrieved each contribute zero, so

MAP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.2307
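
A short Python sketch (not part of the slides) of this computation; the relevant ranks are those of the running example and the function name is made up.

```python
# Sketch: average precision for one query. Precision is accumulated at each
# retrieved relevant document; unretrieved relevant documents contribute zero.

def average_precision(relevant_ranks, total_relevant):
    precisions = [i / rank
                  for i, rank in enumerate(sorted(relevant_ranks), start=1)]
    return sum(precisions) / total_relevant

print(round(average_precision([1, 5, 6, 8, 11, 16], 14), 4))  # 0.2307
# MAP is this value averaged over all queries in the evaluation set.
```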

Single Value Summaries
(R-precision)

• R-precision (Recall-precision)
– the precision at the R-th position in the ranking, where R is
the total number of relevant documents for the current query

Hits 1-10

Precision 1/1 1/2 1/3 1/4 2/5 3/6 3/7 4/8 4/9 4/10

Hits 11-20

Precision 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20

Assume a total of 14 relevant documents: 5 relevant documents are retrieved in the first 14 positions, so

R-precision = 5/14
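
A matching sketch for R-precision (again illustrative, using the example’s relevant ranks):

```python
# Sketch: R-precision = precision at cut-off R, where R is the total number
# of relevant documents for the current query.

def r_precision(relevant_ranks, total_relevant):
    retrieved_relevant = sum(1 for rank in relevant_ranks
                             if rank <= total_relevant)
    return retrieved_relevant / total_relevant

print(r_precision([1, 5, 6, 8, 11, 16], 14))  # 5/14 ≈ 0.3571
```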

Problems with Precision and Recall

• Estimating the maximal recall requires knowledge of all the documents in the collection
• Recall and precision capture different aspects of the set of retrieved documents
• Recall and precision measure effectiveness over queries in batch mode
• Recall and precision are defined assuming a linear ordering of the retrieved documents

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

The Harmonic Mean

• The harmonic mean F(j) of recall and precision:

F(j) = 2 / ( 1/R(j) + 1/P(j) )
• R(j): the recall for the j-th document in the ranking
• P(j): the precision for the j-th document in the ranking

The Harmonic Mean

F = 2 × P × R / (P + R)
• F is always in the interval [0,1]
• F is 0 when no relevant documents have been retrieved
– in which case P and R are both 0
• F is 1 when all ranked documents are relevant
– in which case P and R are both 1
• Furthermore, F is high only when both R and P are high; the maximum value of F can therefore be interpreted as the best possible compromise between recall and precision.
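
A one-line sketch of the harmonic mean (illustrative; the rank-10 precision and recall values come from the running example):

```python
# Sketch: harmonic mean of precision and recall, F = 2PR / (P + R).

def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(round(f_measure(4/10, 4/14), 3))  # F at rank 10 of the example: 0.333
```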

E Measure (parameterized F
Measure)
• A variant of the F measure that allows the emphasis on precision versus recall to be weighted:

E = (1 + β²) P R / (β² P + R) = (1 + β²) / ( β²/R + 1/P )
• Value of β controls trade-off:
– β = 1: Equally weight precision and recall (E=F).
– β > 1: Weight recall more.
– β < 1: Weight precision more.
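
A sketch of the parameterized measure exactly as defined on this slide (with β controlling the trade-off); the sample precision and recall values are from the running example.

```python
# Sketch: E = (1 + beta^2) P R / (beta^2 P + R); beta = 1 gives the harmonic mean F.

def e_measure(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 4/10, 4/14
print(round(e_measure(p, r, beta=1.0), 3))  # 0.333 -- equals F
print(round(e_measure(p, r, beta=2.0), 3))  # 0.303 -- recall weighted more
print(round(e_measure(p, r, beta=0.5), 3))  # 0.370 -- precision weighted more
```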

Outline of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion

Conclusion of Today’s Class

• Retrieval Evaluation
• Evaluation Criteria
– Effectiveness
– Efficiency
– Usability
• Metrics
– Precision, recall, miss, false alarm
• Harmonic Mean and the E Measure
• Self-test Question Discussion
