Professional Documents
Culture Documents
Topic classification
NAIVE BAYES TEXT CLASSIFICATION
Simple interpretation:
Each conditional parameter log P̂(tk|c) is a weight that indicates how good an indicator tk is for class c.
The prior log P̂(c) is a weight that indicates the relative frequency of c.
The sum of the log prior and the term weights is then a measure of how much evidence there is for the document being in the class.
We select the class with the most evidence.
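As an illustration (not part of the original slides), here is a minimal sketch of this decision rule in Python, assuming the log prior and the log conditional probabilities have already been estimated; the names below are illustrative:

```python
import math

def nb_classify(doc_tokens, log_prior, log_cond_prob):
    """Pick the class with the largest sum of log prior + log term weights.

    log_prior[c]        -- log P(c) for each class c
    log_cond_prob[c][t] -- log P(t | c) for class c and term t
    Terms unseen in training are simply skipped here; smoothing is
    discussed further below.
    """
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for t in doc_tokens:
            if t in log_cond_prob[c]:
                score += log_cond_prob[c][t]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```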
The following Figure 13.4 represents the multinomial NB model.
P(China|d) ∝ P(China) · P(BEIJING|China) · P(AND|China) · P(TAIPEI|China) · P(JOIN|China) · P(WTO|China)

If WTO never occurs in class China in the training set, the estimate P̂(WTO|China) is zero, so the whole product is zero: a document containing WTO can never be assigned to China, no matter how strong the other evidence is. This is the motivation for smoothing.
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption.
We have introduced two random variables here to make the two different generative models explicit. Xk is the random variable for position k in the document and takes as values terms from the vocabulary. P(Xk = t|c) is the probability that in a document of class c the term t will occur in position k. Ui is the random variable for vocabulary term i and takes as values 0 (absence) and 1 (presence). P̂(Ui = 1|c) is the probability that in a document of class c the term ti will occur – in any position and possibly multiple times.
We illustrate the conditional independence assumption in Figures 13.4 and 13.5. The class China generates values for each of the five term attributes (multinomial) or six binary attributes (Bernoulli) with a certain probability, independent of the values of the other attributes. The fact that a document in the class China contains the term Taipei does not make it more likely or less likely that it also contains Beijing.
In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
MULTINOMIAL MODEL vs. BERNOULLI MODEL
ADD-ONE SMOOTHING:
Before smoothing, the maximum-likelihood estimate P̂(t|c) is zero for any term that does not occur in class c in the training set. Add-one (Laplace) smoothing adds one to each count; B is the number of different words (in this case the size of the vocabulary: |V| = M).
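For reference, the standard unsmoothed and add-one-smoothed estimates from IIR Chapter 13 (using Tct for the number of occurrences of term t in training documents of class c; this is the textbook formulation rather than a verbatim copy of the slide's figure):

```latex
% Unsmoothed (maximum-likelihood) estimate:
\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}

% Add-one (Laplace) smoothing, with B = |V|:
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)}
                  = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}
```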
Exercise
Example: Classification

Generative model

Evaluating classification
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
A combined measure: F
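For reference, the standard F measure combining precision P and recall R (F1 is the balanced harmonic mean):

```latex
F_{\beta} = \frac{(\beta^{2} + 1)\, P R}{\beta^{2} P + R},
\qquad
F_{1} = \frac{2 P R}{P + R}
```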
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
[Figure: documents represented as points in a two-dimensional vector space, grouped into the classes Government, Science, and Arts.]
Is this similarity hypothesis true in general?
[Figure: a test document placed in the vector space among the Government, Science, and Arts classes.]
Definition of centroid
μ(c) = (1/|Dc|) Σd∈Dc v(d)
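As an illustration (not part of the original slides), a minimal Rocchio-style classifier sketch in Python, assuming documents are already represented as tf-idf vectors; names and details are illustrative:

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (prototype) per class.

    X -- array of shape (n_docs, n_terms), e.g. tf-idf vectors
    y -- array-like of class labels, one per document
    """
    y = np.asarray(y)
    return {c: X[y == c].mean(axis=0) for c in set(y.tolist())}

def classify_rocchio(x, centroids):
    """Assign x to the class whose centroid is most cosine-similar."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))
```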
Rocchio Properties
Forms a simple generalization of the examples in
each class (a prototype).
Prototype vector does not need to be averaged or
otherwise normalized for length since cosine
similarity is insensitive to vector length.
Classification is based on similarity to class
prototypes.
Does not guarantee classifications are consistent
with the given training data.
Why not?
Rocchio Anomaly
Prototype models have problems with polymorphic
(disjunctive) categories.
Rocchio classification
Rocchio forms a simple representation for each class:
the centroid/prototype
Classification is based on similarity to / distance from
the prototype/centroid
It does not guarantee that classifications are
consistent with the given training data
It is little used outside text classification
It has been used quite effectively for text classification
But in general worse than Naïve Bayes
Again, cheap to train and test documents
P(Science | test document)?
[Figure: a test document placed among the Government, Science, and Arts classes in the vector space.]
k Nearest Neighbor
Using only the closest example (1NN) to determine
the class is subject to errors due to:
A single atypical example.
Noise (i.e., an error) in the category label of a single
training example.
More robust alternative is to find the k most-similar
examples and return the majority category of these k
examples.
Value of k is typically odd to avoid ties; 3 and 5 are
most common.
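As an illustration (not part of the original slides), a minimal kNN classifier in Python using cosine similarity over tf-idf style vectors; names and the default k are illustrative:

```python
from collections import Counter
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k most cosine-similar
    training documents (tf-idf weighted vectors assumed)."""
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    top_k = np.argsort(sims)[-k:]          # indices of the k nearest neighbours
    votes = Counter(np.asarray(y_train)[top_k])
    return votes.most_common(1)[0][0]
```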
Similarity Metrics
Nearest neighbor method depends on a similarity (or
distance) metric.
Simplest for continuous m-dimensional instance
space is Euclidean distance.
Simplest for m-dimensional binary instance space is
Hamming distance (number of feature values that
differ).
For text, cosine similarity of tf.idf weighted vectors is
typically most effective.
kNN: Discussion
No feature selection necessary
Scales well with large number of classes
Don’t need to train n classifiers for n classes
Classes can influence each other
Small changes to one class can have ripple effect
Scores can be hard to convert to probabilities
No training necessary
Actually: perhaps not true. (Data editing, etc.)
May be expensive at test time
In most cases it’s more accurate than NB or Rocchio
Separation by Hyperplanes
A strong high-bias assumption is linear separability:
in 2 dimensions, can separate classes by a line
in higher dimensions, need hyperplanes
Can find separating hyperplane by linear programming
(or can iteratively fit solution via perceptron):
separator can be expressed as ax + by = c
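As an illustration (not part of the original slides), a minimal perceptron sketch in Python that iteratively fits a separating hyperplane w·x + b = 0, assuming labels in {−1, +1}:

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Fit a separating hyperplane by the perceptron rule.

    X -- array (n_docs, n_terms); y -- labels in {-1, +1}.
    Converges only if the data are linearly separable.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified
                w += lr * yi * xi           # move the hyperplane toward xi
                b += lr * yi
                errors += 1
        if errors == 0:                     # perfect separation reached
            break
    return w, b
```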
Which Hyperplane?
Lots of possible solutions for a,b,c.
Some methods find a separating hyperplane,
but not the optimal one [according to some
criterion of expected goodness]
E.g., perceptron
Most methods find an optimal separating
hyperplane
Which points should influence optimality?
All points
Linear/logistic regression
Naïve Bayes
Only “difficult points” close to decision
boundary
Support vector machines
Linear Classifiers
Many common text classifiers are linear classifiers
Naïve Bayes
Perceptron
Rocchio
Logistic regression
Support vector machines (with linear kernel)
Linear regression with threshold
Despite this similarity, noticeable performance differences
For separable problems, there is an infinite number of separating
hyperplanes. Which one do you choose?
What to do for non-separable problems?
Different training methods pick different hyperplanes
Classifiers more powerful than linear often don’t perform better on
text problems. Why?
The decision hyperplane of a linear classifier is defined by a weight vector and a threshold: Σi=1..M wi di = θ.

[Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don't model the variance (or assume it is spherical).]
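As an illustration (not from the slides), the corresponding two-class decision rule in Python; the Rocchio-specific choice of w and θ given in the comment is the standard construction and is an assumption here:

```python
import numpy as np

def linear_classify(d, w, theta):
    """Two-class linear decision rule: assign to the positive class iff w.d > theta."""
    return float(w @ d) > theta

# For two-class Rocchio one can take (standard construction, assumed here):
#   w     = mu_c1 - mu_c2                       (difference of the class centroids)
#   theta = 0.5 * (np.linalg.norm(mu_c1)**2 - np.linalg.norm(mu_c2)**2)
```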
A nonlinear problem
A linear classifier like Naïve Bayes does badly on this task.
Summary: Representation of Text Categorization Attributes
Representations of text are usually very high
dimensional (one feature for each word)
High-bias algorithms that prevent overfitting in high-
dimensional space should generally work best*
For most text categorization tasks, there are many
relevant features and many irrelevant ones
Methods that combine evidence from many or all
features (e.g. naive Bayes, kNN) often tend to work
better than ones that try to isolate just a few
relevant features*
*Although the results are a bit more mixed than often thought
Overview
1 Recap
2 Clustering: Introduction
3 Clustering in IR
4 K -means
5 Evaluation
Take-away today
What is clustering?
Applications of clustering in information retrieval
K -means algorithm
Evaluation of clustering
How many clusters?
Clustering: Definition
Data set with clear cluster structure
Propose an algorithm for finding the cluster structure in this example.
[Figure: scatter plot of points in the plane showing a clear cluster structure; the vertical axis runs from 0.0 to 2.5.]
Classification vs. Clustering
The cluster hypothesis
Applications of clustering in IR

Search result clustering for better navigation

Scatter-Gather

Global navigation: Yahoo
Global navigation: MeSH (upper level)

Global navigation: MeSH (lower level)
Navigational hierarchies: Manual vs. automatic creation

Global navigation combined with visualization (1)

Global navigation combined with visualization (2)

Global clustering for navigation: Google News
http://news.google.com

Clustering for improving recall
Data set with clear cluster structure
Propose an algorithm for finding the cluster structure in this example.
[Figure: the same scatter plot of points with a clear cluster structure as shown earlier.]
Desiderata for clustering
Flat vs. Hierarchical clustering
Flat algorithms
Usually start with a random (partial) partitioning of docs into
groups
Refine iteratively
Main algorithm: K -means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive
Hard vs. Soft clustering

Flat algorithms
K -means

Document representations in clustering

K -means: Basic idea
K-means pseudocode (μk is the centroid of cluster ωk):

K-means({x1, …, xN}, K)
  (s1, s2, …, sK) ← SelectRandomSeeds({x1, …, xN}, K)
  for k ← 1 to K:
      μk ← sk
  while stopping criterion has not been met:
      for k ← 1 to K:
          ωk ← {}
      for n ← 1 to N:
          j ← arg minj′ |μj′ − xn|
          ωj ← ωj ∪ {xn}                      (reassignment of vectors)
      for k ← 1 to K:
          μk ← (1/|ωk|) Σx∈ωk x               (recomputation of centroids)
  return {μ1, …, μK}
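As a concrete illustration (not part of the original slides), a minimal NumPy version of this pseudocode; the stopping criterion here is centroid convergence or a fixed iteration cap, and the handling of empty clusters is an assumption:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: X is an (N, M) array of document vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random seeds
    for _ in range(n_iter):
        # Reassignment: each vector goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Recomputation: each centroid becomes the mean of its cluster.
        new_centroids = np.array([
            X[assignment == k].mean(axis=0) if np.any(assignment == k)
            else centroids[k]                 # keep the old centroid for empty clusters
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # stopping criterion
            break
        centroids = new_centroids
    return centroids, assignment
```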
Worked Example: Set of points to be clustered
[Figures: a sequence of worked-example plots shows K-means with K = 2 on this point set: two initial centroids are selected at random; each point is then assigned to its closest centroid and the cluster centroids are recomputed; these two steps repeat over several iterations until the assignments stop changing and the centroids and assignments have converged.]
K -means is guaranteed to converge: Proof
Re-computation decreases average distance
RSS = Σk=1..K RSSk – the residual sum of squares (the "goodness" measure)

RSSk(v) = Σx∈ωk ‖v − x‖² = Σx∈ωk Σm=1..M (vm − xm)²

∂RSSk(v)/∂vm = Σx∈ωk 2(vm − xm) = 0

vm = (1/|ωk|) Σx∈ωk xm
K -means is guaranteed to converge
Optimality of K -means
Convergence ̸= optimality
Convergence does not mean that we converge to the optimal
clustering!
This is the great weakness of K -means.
If we start with a bad set of seeds, the resulting clustering can
be horrible.
Exercise: Suboptimal clustering
[Figure: six points d1–d6 on a grid with x from 0 to 4 and y from 0 to 3; d1, d2, d3 lie in an upper row and d4, d5, d6 directly below them in a lower row.]
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds di , dj ?
Initialization of K -means
Time complexity of K -means
What is a good clustering?
Internal criteria
RSS
Modularity in graph
But an internal criterion often does not evaluate the actual
utility of a clustering in the application.
Alternative: External criteria
Evaluate with respect to a human-defined classification
External criteria for clustering quality
External criterion: Purity
purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|
Example for computing purity
[Figure: three clusters containing 17 points labelled x, o, and ⋄; the majority class has 5 members in cluster 1 (x), 4 in cluster 2 (o), and 3 in cluster 3 (⋄).]
To compute purity: maxj |ω1 ∩ cj| = 5 (class x, cluster 1), maxj |ω2 ∩ cj| = 4 (class o, cluster 2), and maxj |ω3 ∩ cj| = 3 (class ⋄, cluster 3).
Purity = (5 + 4 + 3) / 17 ≈ 0.71.
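A small helper (not from the slides) that computes purity from cluster assignments and gold labels, following the formula above:

```python
from collections import Counter

def purity(clusters, labels):
    """clusters, labels: parallel lists; purity = (1/N) * sum of majority counts."""
    n = len(labels)
    majority_total = 0
    for k in set(clusters):
        labels_in_k = [l for c, l in zip(clusters, labels) if c == k]
        majority_total += Counter(labels_in_k).most_common(1)[0][1]
    return majority_total / n

# The example above: clusters of sizes 6, 6, 5 with majority counts 5, 4, 3,
# so purity = (5 + 4 + 3) / 17 ≈ 0.71.
```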
Another external criterion: Rand index
[Figure: the same o/⋄/x clustering as in the purity example above.]
Rand measure for the o/⋄/x example
RI = (20 + 72) / (20 + 20 + 24 + 72) = 92 / 136 ≈ 0.68

(Here 20 pairs are in the same cluster and the same class (TP), 72 pairs are in different clusters and different classes (TN), 20 pairs are in the same cluster but different classes (FP), and 24 pairs are in different clusters but the same class (FN).)

[Figure: the same o/⋄/x clustering as above.]
Normalized mutual information (NMI)
NMI(Ω, C) = I(Ω; C) / [(H(Ω) + H(C)) / 2]

I(Ω; C) = Σk Σj P(ωk ∩ cj) log [ P(ωk ∩ cj) / (P(ωk) P(cj)) ]
        = Σk Σj (|ωk ∩ cj| / N) log [ N |ωk ∩ cj| / (|ωk| |cj|) ]

H(Ω) = −Σk P(ωk) log P(ωk)

H: entropy
I: mutual information
The denominator (H(Ω) + H(C))/2 normalizes the value so that NMI lies between 0 and 1.
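As an illustration (not from the slides), a small helper computing NMI from cluster assignments and gold labels, following the formulas above:

```python
import math
from collections import Counter

def nmi(clusters, labels):
    """NMI(Omega, C) = I(Omega; C) / ((H(Omega) + H(C)) / 2)."""
    n = len(labels)
    p_k = Counter(clusters)                 # cluster sizes
    p_j = Counter(labels)                   # class sizes
    p_kj = Counter(zip(clusters, labels))   # joint counts |omega_k ∩ c_j|

    def H(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    I = sum((c / n) * math.log(n * c / (p_k[k] * p_j[j]))
            for (k, j), c in p_kj.items())
    return I / ((H(p_k) + H(p_j)) / 2)
```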
Evaluation results for the o/⋄/x example
                    purity   NMI    RI     F5
lower bound           0.0    0.0    0.0    0.0
maximum               1.0    1.0    1.0    1.0
value for example     0.71   0.36   0.68   0.46
How many clusters?
Exercise

Simple objective function for K: Basic idea

Simple objective function for K: Formalization
Finding the “knee” in the curve
[Figure: residual sum of squares (y-axis, roughly 1750–1950) plotted against the number of clusters (x-axis, 2–10); the curve falls as K grows, and the "knee" where it flattens indicates a reasonable choice of K.]
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
What is clustering?
Clustering: the process of grouping a set of objects
into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as
opposed to supervised data where a classification of
examples is given
A common and important task that finds many
applications in IR and other places
How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo RF)
For better navigation of search results
Effective “user recall” will be higher
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity
We will use cosine similarity.
Docs as vectors.
For many algorithms, easier to think in
terms of a distance (rather than similarity)
between docs.
We will mostly speak of Euclidean distance
But real implementations use cosine similarity
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
K means clustering
(Model based clustering)
Hierarchical algorithms
Bottom-up, agglomerative
(Top-down, divisive)
Partitioning Algorithms
Partitioning method: Construct a partition of n
documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition of K clusters that optimizes the
chosen partitioning criterion
Globally optimal: intractable for many objective functions (it would require exhaustively enumerating all partitions)
Effective heuristic methods: K-means and K-
medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity
or mean) of points in a cluster, c:
μ(c) = (1/|c|) Σx∈c x
K-Means Algorithm
Select K random docs {s1, s2,… sK} as seeds.
Until clustering converges (or other stopping criterion):
For each doc di:
Assign di to the cluster cj such that dist(xi, sj) is minimal.
(Next, update the seeds to the centroid of each cluster)
For each cluster cj:
sj = μ(cj)
K Means Example (K = 2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.]
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Convergence
Why should the K-means algorithm ever reach a
fixed point?
A state in which clusters don’t change.
K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
EM is known to converge.
Number of iterations could be large.
But in practice usually isn’t
Convergence of K-Means
Define goodness measure of cluster k as sum of
squared distances from cluster centroid:
Gk = Σi (di – ck)²   (sum over all di in cluster k)
G = Σk Gk
Reassignment monotonically decreases G since
each vector is assigned to the closest centroid.
Convergence of K-Means
Recomputation monotonically decreases each Gk
since (mk is number of members in cluster k):
Σ (di – a)² reaches its minimum for:
Σ –2(di – a) = 0
Σ di = Σ a
mk a = Σ di
a = (1/mk) Σ di = ck
K-means typically converges quickly
Time Complexity
Computing distance between two docs is O(M)
where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations,
or O(KNM).
Computing centroids: Each doc gets added once to
some centroid: O(NM).
Assume these two steps are each done once for I
iterations: O(IKNM).
Seed Choice
Results can vary based on random seed selection.
[Figure: example showing sensitivity to seeds.]
Dhillon et al. ICDM 2002 – variation to fix some issues with small
document clusters
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
[Figure: a dendrogram rooted at "animal", splitting into "vertebrate" and "invertebrate" subtrees.]
Hierarchical Agglomerative Clustering
(HAC)
Starts with each doc in a separate cluster
then repeatedly joins the closest pair of
clusters, until there is only one cluster.
The history of merging forms a binary tree
or hierarchy.
Note: the resulting clusters are still “hard” and induce a partition
Complete Link
Use minimum similarity of pairs:
sim(ci, cj) = min x∈ci, y∈cj sim(x, y)
[Figure: three clusters Ci, Cj, Ck.]
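As an illustration (not part of the original slides), a naive complete-link HAC sketch in Python over cosine similarity; the O(N³)-ish brute-force search below is for clarity, not efficiency, and the names are illustrative:

```python
import numpy as np

def hac_complete_link(X, num_clusters=1):
    """Naive complete-link hierarchical agglomerative clustering.

    X -- (N, M) array of document vectors.
    Returns a list of clusters, each a list of document indices.
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T                               # pairwise cosine similarities
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > num_clusters:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: cluster similarity is the MINIMUM
                # similarity over all cross-cluster document pairs.
                s = min(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        clusters[a].extend(clusters[b])           # merge the closest pair
        del clusters[b]
    return clusters
```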
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
In each of the subsequent N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
Often O(N³) if done naively, or O(N² log N) if done more cleverly.
Group Average
Similarity of two clusters = average similarity of all pairs
within merged cluster.
sim(ci, cj) = [1 / (|ci ∪ cj| · (|ci ∪ cj| − 1))] Σx∈ci∪cj Σy∈ci∪cj, y≠x sim(x, y)
Compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
No clear difference in efficacy
sim(ci, cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ] / [ (|ci| + |cj|) · (|ci| + |cj| − 1) ]

where s(ci) = Σx∈ci x is the vector sum of cluster ci (document vectors assumed length-normalized).
Rand index example

                                     Same cluster in clustering   Different clusters in clustering
Same class in ground truth                       20                             24
Different classes in ground truth                20                             72
RI = (A + D) / (A + B + C + D)

Compare with standard Precision and Recall:
P = A / (A + B)        R = A / (A + C)

People also define and use a cluster F-measure, which is probably a better measure.
Resources
IIR 16 except 16.5
IIR 17.1–17.3