Unit 4

Introduction to Information Retrieval
A text classification task: Email spam filtering

From: ‘‘’’ <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for
similar courses
I am 22 years old and I have already purchased 6 properties
using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
How would you write a program that would automatically detect
and delete this type of message?
Formal definition of TC: Training

Given:
 A document set X
 Documents are typically represented in some type of high-dimensional space.
 A fixed set of classes C = {c1, c2, . . . , cJ}
 The classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
 A training set D of labeled documents with each labeled
document <d, c> ∈ X × C
Using a learning method or learning algorithm, we then wish to
learn a classifier ϒ that maps documents to classes:
ϒ:X→C
Formal definition of TC: Application/Testing
Given: a description d ∈ X of a document
Determine: ϒ(d) ∈ C, that is, the class that is most appropriate for d
Topic classification
Examples of how search engines use classification
 Language identification (classes: English vs. French etc.)

 The automatic detection of spam pages (spam vs. nonspam)
 Topic-specific or vertical search – restrict search to a
“vertical” like “related to health” (relevant to vertical vs. not)
Classification methods: Statistical/Probabilistic
 This was our definition of the classification problem: text classification as a learning problem
 (i) Supervised learning of the classification function ϒ and (ii) its application to classifying new documents
 We will look at doing this using Naive Bayes
 Requires hand-classified training data
 But this manual classification can be done by non-experts.
Derivation of Naive Bayes rule

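We want to find the class that is most likely given the document:
$$c_{\text{map}} = \arg\max_{c \in C} P(c \mid d)$$
Apply Bayes rule:
$$c_{\text{map}} = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$$
Drop denominator since P(d) is the same for all classes:
$$c_{\text{map}} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$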
Too many parameters / sparseness
 There are too many parameters $P(d \mid c) = P(\langle t_1, \ldots, t_{n_d} \rangle \mid c)$, one for each unique combination of a class and a sequence of words.
 We would need a very, very large number of training examples to estimate that many parameters.
 This is the problem of data sparseness.
Naive Bayes conditional independence assumption
To reduce the number of parameters to a manageable size, we make the Naive Bayes conditional independence assumption:
$$P(d \mid c) = P(\langle t_1, \ldots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(X_k = t_k \mid c)$$
We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(X_k = t_k \mid c)$.
NAIVE BAYES TEXT CLASSIFICATION
The Naive Bayes classifier
 The Naive Bayes classifier is a probabilistic classifier.
 We compute the probability of a document d being in a class c as follows:
$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$
 $n_d$ is the length of the document (number of tokens).
 $P(t_k \mid c)$ is the conditional probability of term $t_k$ occurring in a document of class c.
 $P(t_k \mid c)$ serves as a measure of how much evidence $t_k$ contributes that c is the correct class.
 P(c) is the prior probability of c.
 If a document’s terms do not provide clear evidence for one class vs. another, we choose the c with highest P(c).
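Maximum a posteriori class
 Our goal in Naive Bayes classification is to find the “best” class.
 The best class is the most likely or maximum a posteriori (MAP) class $c_{\text{map}}$:
$$c_{\text{map}} = \arg\max_{c \in C} \hat{P}(c \mid d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)$$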
Taking the log
 Multiplying lots of small probabilities can result in floating point underflow.
 Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
 Since log is a monotonic function, the class with the highest score does not change.
 So what we usually compute in practice is:
$$c_{\text{map}} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \Big]$$
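The underflow problem and the log-space fix can be seen in a small sketch. The document length (400 tokens) and the per-token probabilities below are made-up values chosen only to trigger underflow; they do not come from the lecture.

```python
import math

def product_score(prior, cond_probs):
    """Multiply the prior by every per-token conditional probability."""
    score = prior
    for p in cond_probs:
        score *= p
    return score

def log_score(prior, cond_probs):
    """Sum of logs: same ranking of classes, but no underflow."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# Hypothetical 400-token document: every token has P(t|A) = 1e-4, P(t|B) = 2e-4.
probs_a = [1e-4] * 400
probs_b = [2e-4] * 400

raw_a = product_score(0.5, probs_a)   # underflows to 0.0 in double precision
raw_b = product_score(0.5, probs_b)   # also underflows to 0.0
log_a = log_score(0.5, probs_a)       # finite, large negative number
log_b = log_score(0.5, probs_b)       # finite, and larger than log_a
```

With the raw products both classes tie at 0.0, so the argmax is meaningless; with log scores class B is clearly preferred.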
Naive Bayes classifier
 Classification rule:
$$c_{\text{map}} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \Big]$$
 Simple interpretation:
 Each conditional parameter $\log \hat{P}(t_k \mid c)$ is a weight that indicates how good an indicator $t_k$ is for c.
 The prior $\log \hat{P}(c)$ is a weight that indicates the relative frequency of c.
 The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.
 We select the class with the most evidence.
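Parameter estimation take 1: Maximum likelihood
 Estimate parameters $\hat{P}(c)$ and $\hat{P}(t_k \mid c)$ from train data: How?
 Prior:
$$\hat{P}(c) = \frac{N_c}{N}$$
 $N_c$: number of docs in class c; N: total number of docs
 Conditional probabilities:
$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
 $T_{ct}$ is the number of tokens of t in training documents from class c (includes multiple occurrences)
 We’ve made a Naive Bayes independence assumption here: $\hat{P}(X_{k_1} = t \mid c) = \hat{P}(X_{k_2} = t \mid c)$, i.e., the estimate does not depend on the position in the document.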
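Figure 13.4 represents the multinomial NB model.
The problem with maximum likelihood estimates: Zeros
$$P(\text{China} \mid d) \propto P(\text{China}) \cdot P(\text{BEIJING} \mid \text{China}) \cdot P(\text{AND} \mid \text{China}) \cdot P(\text{TAIPEI} \mid \text{China}) \cdot P(\text{JOIN} \mid \text{China}) \cdot P(\text{WTO} \mid \text{China})$$
 If WTO never occurs in class China in the train set:
$$\hat{P}(\text{WTO} \mid \text{China}) = \frac{T_{\text{China},\text{WTO}}}{\sum_{t' \in V} T_{\text{China},t'}} = \frac{0}{\sum_{t' \in V} T_{\text{China},t'}} = 0$$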
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class:
$$\text{Multinomial:}\quad P(d \mid c) = P(\langle t_1, \ldots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(X_k = t_k \mid c)$$
$$\text{Bernoulli:}\quad P(d \mid c) = P(\langle e_1, \ldots, e_M \rangle \mid c) = \prod_{1 \le i \le M} P(U_i = e_i \mid c)$$
where $e_i \in \{0, 1\}$ indicates whether term $t_i$ occurs in d.
We have introduced two random variables here to make the two different generative models explicit. $X_k$ is the random variable for position k in the document and takes as values terms from the vocabulary. $P(X_k = t \mid c)$ is the probability that in a document of class c the term t will occur in position k. $U_i$ is the random variable for vocabulary term i and takes as values 0 (absence) and 1 (presence). $\hat{P}(U_i = 1 \mid c)$ is the probability that in a document of class c the term $t_i$ will occur, in any position and possibly multiple times.
We illustrate the conditional independence assumption in Figures 13.4 and 13.5. The class China generates values for each of the five term attributes (multinomial) or six binary attributes (Bernoulli) with a certain probability, independent of the values of the other attributes. The fact that a document in the class China contains the term Taipei does not make it more likely or less likely that it also contains Beijing.
In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
MULTINOMIAL MODEL vs. BERNOULLI MODEL
The problem with maximum likelihood estimates: Zeros (cont)
 If there were no occurrences of WTO in documents in class China, we’d get a zero estimate:
$$\hat{P}(\text{WTO} \mid \text{China}) = \frac{T_{\text{China},\text{WTO}}}{\sum_{t' \in V} T_{\text{China},t'}} = 0$$
 → We will get P(China|d) = 0 for any document that contains WTO!
 Zero probabilities cannot be conditioned away.
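ADD-ONE SMOOTHING:
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count.
To avoid zeros: Add-one smoothing
 Before:
$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
 Now: Add one to each count to avoid zeros:
$$\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\big(\sum_{t' \in V} T_{ct'}\big) + B}$$
 B is the number of different words (in this case the size of the vocabulary: |V| = M)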
To avoid zeros: Add-one smoothing
 Estimate parameters from the training corpus using add-one smoothing
 For a new document, for each class, compute sum of (i) log of prior and (ii) logs of conditional probabilities of the terms
 Assign the document to the class with the largest score
Naive Bayes: Training
Naive Bayes: Testing
Exercise
 Estimate parameters of Naive Bayes classifier

 Classify test document
Example: Parameter estimates
The denominators are (8 + 6) and (3 + 6) because the lengths of $text_c$ and $text_{\bar{c}}$ are 8 and 3, respectively, and because the constant B is 6 as the vocabulary consists of six terms.
Example: Classification
Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.
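The training and testing steps can be sketched end to end. The exercise's training table is not reproduced in these notes, so the sketch assumes the four training documents and the test document d5 of the textbook example this exercise is based on; that data is consistent with the counts quoted above (class lengths 8 and 3, six vocabulary terms). The function names are illustrative only.

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns priors, add-one smoothed conditionals, vocabulary."""
    vocab = {t for tokens, _ in docs for t in tokens}
    labels = {label for _, label in docs}
    prior, cond = {}, {}
    for c in labels:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)                      # N_c / N
        counts = Counter(t for tokens in class_docs for t in tokens)  # T_ct
        total = sum(counts.values())                                # length of text_c
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, cond, vocab

def classify(tokens, prior, cond, vocab):
    """Return the class with the largest log score, plus all scores."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in vocab)
    return max(scores, key=scores.get), scores

# Assumed training data (textbook example, not shown in these notes).
training = [
    ("Chinese Beijing Chinese".split(), "China"),
    ("Chinese Chinese Shanghai".split(), "China"),
    ("Chinese Macao".split(), "China"),
    ("Tokyo Japan Chinese".split(), "not China"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan".split()

prior, cond, vocab = train_multinomial_nb(training)
label, scores = classify(d5, prior, cond, vocab)
```

Under this assumed data the smoothed estimate for CHINESE in class China is (5 + 1)/(8 + 6) = 3/7, and d5 is assigned to China.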
Generative model
 Generate a class with probability P(c)

 Generate each of the words (in their respective positions),
conditional on the class, but independent of each other, with
probability P(tk |c)
 To classify docs, we “reengineer” this process and find the class
that is most likely to have generated the doc.
Evaluating classification
 Evaluation must be done on test data that are independent of

the training data (usually a disjoint set of instances).
 It’s easy to get good performance on a test set that was
available to the learner during training (e.g., just memorize
the test set).
 Measures: Precision, recall, F1, classification accuracy
Constructing Confusion Matrix c
Precision P and recall R
P = TP / ( TP + FP)
R = TP / ( TP + FN)
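A combined measure: F
 F1 allows us to trade off precision against recall.
 This is the harmonic mean of P and R:
$$F_1 = \frac{2PR}{P + R}, \quad \text{equivalently} \quad \frac{1}{F_1} = \frac{1}{2}\Big(\frac{1}{P} + \frac{1}{R}\Big)$$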
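A minimal sketch of these three measures, computed from confusion-matrix counts. The counts in the example are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of P and R
    return p, r, f1

# Hypothetical counts: 10 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=10, fp=10, fn=30)
```

Here P = 0.5 and R = 0.25, and F1 = 1/3 sits closer to the smaller of the two, which is the point of using a harmonic rather than arithmetic mean.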
Averaging: Micro vs. Macro
 We now have an evaluation measure (F1) for one class.
 But we also want a single number that measures the aggregate performance over all classes in the collection.
 Macroaveraging
 Compute F1 for each of the C classes
 Average these C numbers
 Microaveraging
 Compute TP, FP, FN for each of the C classes
 Sum these C numbers (e.g., all TP to get aggregate TP)
 Compute F1 for aggregate TP, FP, FN
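The two averaging schemes can be compared on a toy case. The per-class counts below are hypothetical; the sketch uses the identity F1 = 2TP / (2TP + FP + FN), which is algebraically the same as the harmonic mean of P and R.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_class):
    """Average the per-class F1 values; every class counts equally."""
    return sum(f1_from_counts(*c) for c in per_class) / len(per_class)

def micro_f1(per_class):
    """Pool TP, FP, FN over all classes, then compute one F1; large classes dominate."""
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    return f1_from_counts(tp, fp, fn)

# Hypothetical (TP, FP, FN) for a small class and a large class.
per_class = [(10, 10, 10), (90, 10, 10)]
macro = macro_f1(per_class)   # (0.5 + 0.9) / 2
micro = micro_f1(per_class)   # 200 / 240
```

The microaverage is pulled toward the large class (about 0.83), while the macroaverage (0.7) weights the poorly classified small class equally.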
Micro- vs. Macro-average: Example
Introduction to
Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 11: Text Classification;

Vector space classification
[Borrows slides from Ray Mooney]
Recap: Naïve Bayes classifiers
 Classify based on prior weight of class and conditional parameter for what each word says:
$$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]$$
 Training is done by counting and dividing:
$$P(c_j) \leftarrow \frac{N_{c_j}}{N} \qquad P(x_k \mid c_j) \leftarrow \frac{T_{c_j x_k} + \alpha}{\sum_{x_i \in V} [T_{c_j x_i} + \alpha]}$$
 Don’t forget to smooth
The rest of text classification

 Today:
 Vector space methods for Text Classification
 Vector space classification using centroids (Rocchio)
 K Nearest Neighbors
 Decision boundaries, linear and nonlinear classifiers
 Dealing with more than 2 classes
 Later in the course
 More text classification
 Support Vector Machines
 Text-specific issues in classification
Introduction to Information Retrieval Sec.14.1
Recall: Vector Space Representation

 Each document is a vector, one component for each
term (= word).
 Normally normalize vectors to unit length.
 High-dimensional vector space:
 Terms are axes
 10,000+ dimensions, or even 100,000+
 Docs are vectors in this space
 How can we do classification in this space?
Classification Using Vector Spaces

 As before, the training set is a set of documents,
each labeled with its class (e.g., topic)
 In vector space classification, this set corresponds to
a labeled set of points (or, equivalently, vectors) in
the vector space
 Premise 1: Documents in the same class form a
contiguous region of space
 Premise 2: Documents from different classes don’t
overlap (much)
 We define surfaces to delineate classes in the space
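Documents in a Vector Space
[Figure: documents as points in the vector space, grouped into the classes Government, Science, and Arts]
Test Document of what class?
[Figure: the same space with an unlabeled test document]
Test Document = Government
[Figure: the same space with the test document assigned to Government]
Is this similarity hypothesis true in general?
Our main topic today is how to find good separators
Aside: 2D/3D graphs can be misleading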
Using Rocchio for text classification

 Relevance feedback methods can be adapted for text
categorization
 As noted before, relevance feedback can be viewed as 2-class
classification
 Relevant vs. nonrelevant documents
 Use standard tf-idf weighted vectors to represent text
documents
 For training documents in each category, compute a
prototype vector by summing the vectors of the training
documents in the category.
 Prototype = centroid of members of class
 Assign test documents to the category with the closest
prototype vector based on cosine similarity.
Illustration of Rocchio Text Categorization
Definition of centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$
 Where $D_c$ is the set of all documents that belong to class c and $\vec{v}(d)$ is the vector space representation of d.
 Note that centroid will in general not be a unit vector even when the inputs are unit vectors.
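Rocchio training and classification can be sketched in a few lines of plain Python. The two-dimensional vectors and class names below are toy values made up for illustration; real documents would be high-dimensional tf-idf vectors.

```python
import math

def centroid(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(labeled):
    """labeled: list of (vector, label). Returns one centroid per class."""
    by_class = {}
    for vec, label in labeled:
        by_class.setdefault(label, []).append(vec)
    return {c: centroid(vs) for c, vs in by_class.items()}

def classify_rocchio(centroids, vec):
    """Assign vec to the class whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda c: cosine(centroids[c], vec))

# Toy unit-length training vectors (hypothetical).
training = [
    ([1.0, 0.0], "government"), ([0.8, 0.6], "government"),
    ([0.0, 1.0], "science"), ([0.6, 0.8], "science"),
]
centroids = train_rocchio(training)
label = classify_rocchio(centroids, [0.9, 0.1])
gov_norm = math.sqrt(sum(x * x for x in centroids["government"]))
```

The government centroid comes out as roughly (0.9, 0.3), whose length is below 1 even though both inputs are unit vectors, matching the note above.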
Rocchio Properties
 Forms a simple generalization of the examples in
each class (a prototype).
 Prototype vector does not need to be averaged or
otherwise normalized for length since cosine
similarity is insensitive to vector length.
 Classification is based on similarity to class
prototypes.
 Does not guarantee classifications are consistent
with the given training data.
Why not?
Rocchio Anomaly
 Prototype models have problems with polymorphic
(disjunctive) categories.
Rocchio classification
 Rocchio forms a simple representation for each class:
the centroid/prototype
 Classification is based on similarity to / distance from
the prototype/centroid
 It does not guarantee that classifications are
consistent with the given training data
 It is little used outside text classification
 It has been used quite effectively for text classification
 But in general worse than Naïve Bayes
 Again, cheap to train and test documents
k Nearest Neighbor Classification
 kNN = k Nearest Neighbor
 To classify a document d into class c:
 Define k-neighborhood N as k nearest neighbors of d
 Count number of documents i in N that belong to c
 Estimate P(c|d) as i/k
 Choose as class argmax_c P(c|d) [ = majority class]
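The four steps above can be sketched directly. The sketch assumes cosine similarity as the nearness measure and uses invented two-dimensional vectors; `knn_classify` is an illustrative name, not a library function.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, x, k):
    """train: list of (vector, label). Returns (majority class, estimated P(c|d) per class)."""
    # k-neighborhood: the k training examples most similar to x
    neighbors = sorted(train, key=lambda ex: cosine(ex[0], x), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    probs = {c: n / k for c, n in votes.items()}   # P(c|d) estimated as i/k
    return max(probs, key=probs.get), probs

# Toy training set (hypothetical vectors).
train = [
    ([1.0, 0.0], "government"), ([0.9, 0.1], "government"),
    ([0.0, 1.0], "science"), ([0.1, 0.9], "science"), ([0.2, 0.8], "science"),
]
label3, probs3 = knn_classify(train, [0.3, 0.7], k=3)
label5, probs5 = knn_classify(train, [0.3, 0.7], k=5)
```

With k = 3 all neighbors are science documents; with k = 5 the whole training set votes and the estimate drops to 3/5, which shows how the choice of k changes the probability estimate.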
Example: k=6 (6NN)
P(science | test document)?
[Figure: a test document among training documents of the classes Government, Science, and Arts]
Nearest-Neighbor Learning Algorithm

 Learning is just storing the representations of the training examples
in D.
 Testing instance x (under 1NN):
 Compute similarity between x and all examples in D.
 Assign x the category of the most similar example in D.
 Does not explicitly compute a generalization or category
prototypes.
 Also called:
 Case-based learning
 Memory-based learning
 Lazy learning
 Rationale of kNN: contiguity hypothesis
kNN Is Close to Optimal
 Cover and Hart (1967)
 Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [error rate of classifier knowing model that generated data]
 In particular, asymptotic error rate is 0 if Bayes rate is 0.
 Assume: query point coincides with a training point.
 Both query point and training point contribute error → 2 times Bayes rate
k Nearest Neighbor
 Using only the closest example (1NN) to determine
the class is subject to errors due to:
 A single atypical example.
 Noise (i.e., an error) in the category label of a single
training example.
 More robust alternative is to find the k most-similar
examples and return the majority category of these k
examples.
 Value of k is typically odd to avoid ties; 3 and 5 are
most common.
kNN decision boundaries
Boundaries are in principle arbitrary surfaces, but usually polyhedra.
[Figure: kNN decision boundaries between the classes Government, Science, and Arts]
kNN gives locally defined decision boundaries between classes: far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)
Similarity Metrics
 Nearest neighbor method depends on a similarity (or
distance) metric.
 Simplest for continuous m-dimensional instance
space is Euclidean distance.
 Simplest for m-dimensional binary instance space is
Hamming distance (number of feature values that
differ).
 For text, cosine similarity of tf.idf weighted vectors is
typically most effective.
Illustration of 3 Nearest Neighbor for Text Vector Space
3 Nearest Neighbor vs. Rocchio

 Nearest Neighbor tends to handle polymorphic
categories better than Rocchio/NB.
Nearest Neighbor with Inverted Index

 Naively, finding nearest neighbors requires a linear
search through |D| documents in collection
 But determining k nearest neighbors is the same as
determining the k best retrievals using the test
document as a query to a database of training
documents.
 Use standard vector space inverted index methods to
find the k nearest neighbors.
 Testing Time: O(B|Vt|) where B is the average
number of training documents in which a test-document word
appears.
 Typically B << |D|
kNN: Discussion
 No feature selection necessary
 Scales well with large number of classes
 Don’t need to train n classifiers for n classes
 Classes can influence each other
 Small changes to one class can have ripple effect
 Scores can be hard to convert to probabilities
 No training necessary
 Actually: perhaps not true. (Data editing, etc.)
 May be expensive at test time
 In most cases it’s more accurate than NB or Rocchio
kNN vs. Naive Bayes

 Bias/Variance tradeoff
 Variance ≈ Capacity
 kNN has high variance and low bias.
 Infinite memory
 NB has low variance and high bias.
 Decision surface has to be linear (hyperplane – see later)
 Consider asking a botanist: Is an object a tree?
 Too much capacity/variance, low bias
 Botanist who memorizes
 Will always say “no” to new object (e.g., different # of leaves)
 Not enough capacity/variance, high bias
 Lazy botanist
 Says “yes” if the object is green
 You want the middle ground
(Example due to C. Burges)
Bias vs. variance: Choosing the correct model capacity
Linear classifiers and binary and multiclass classification
 Consider 2 class problems
 Deciding between two classes, perhaps, government and
non-government
 One-versus-rest classification
 How do we define (and find) the separating surface?
 How do we decide which region a test doc is in?
Separation by Hyperplanes
 A strong high-bias assumption is linear separability:
 in 2 dimensions, can separate classes by a line
 in higher dimensions, need hyperplanes
 Can find separating hyperplane by linear programming
(or can iteratively fit solution via perceptron):
 separator can be expressed as ax + by = c
Linear programming / Perceptron
Find a,b,c, such that

ax + by > c for red points
ax + by < c for blue points.
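One way to find such a, b, c iteratively is the perceptron update. The sketch below is a basic perceptron under stated assumptions: the red and blue points are invented, labels are encoded as +1 (red) and −1 (blue), and the epoch limit of 100 is an arbitrary cap that is sufficient for this separable toy set.

```python
def train_perceptron(points, labels, epochs=100):
    """points: list of (x, y); labels: +1 for red, -1 for blue. Returns (a, b, c)."""
    a = b = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), label in zip(points, labels):
            # A point is misclassified (or on the boundary) if label * (ax + by - c) <= 0.
            if label * (a * x + b * y - c) <= 0:
                a += label * x
                b += label * y
                c -= label          # the threshold acts as a bias weight of -c
                errors += 1
        if errors == 0:             # a full pass without mistakes: separator found
            break
    return a, b, c

# Hypothetical linearly separable data.
red = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0)]
blue = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
a, b, c = train_perceptron(red + blue, [1] * len(red) + [-1] * len(blue))
```

The perceptron stops at the first separating line it reaches, so a different point order can give different a, b, c; it makes no attempt to pick a best separator.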
Which Hyperplane?
In general, lots of possible solutions for a,b,c.
Which Hyperplane?
 Lots of possible solutions for a,b,c.
 Some methods find a separating hyperplane,
but not the optimal one [according to some
criterion of expected goodness]
 E.g., perceptron
 Most methods find an optimal separating
hyperplane
 Which points should influence optimality?
 All points
 Linear/logistic regression
 Naïve Bayes
 Only “difficult points” close to decision
boundary
 Support vector machines
Linear classifier: Example
 Class: “interest” (as in interest rate)
 Example features of a linear classifier
   wi      ti              wi      ti
   0.70    prime           −0.71   dlrs
   0.67    rate            −0.35   world
   0.63    interest        −0.33   sees
   0.60    rates           −0.25   year
   0.46    discount        −0.24   group
   0.43    bundesbank      −0.24   dlr
 To classify, find dot product of feature vector and weights
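The dot product with the weights listed above can be sketched as follows. The sketch assumes binary term features (a term counts once if present) and a decision threshold of 0; the two example documents are hypothetical.

```python
# Weights copied from the table for the class "interest".
WEIGHTS = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
    "group": -0.24, "dlr": -0.24,
}

def score(tokens, weights=WEIGHTS):
    """Dot product of the weight vector with a binary term-presence vector."""
    present = set(tokens)
    return sum(w for term, w in weights.items() if term in present)

def is_interest(tokens, threshold=0.0):
    """Assumed decision rule: in class if the score exceeds the threshold."""
    return score(tokens) > threshold

d1 = "rate discount dlrs world".split()   # 0.67 + 0.46 - 0.71 - 0.35 = 0.07
d2 = "prime dlrs".split()                 # 0.70 - 0.71 = -0.01
```

Under these assumptions d1 is assigned to “interest” and d2 is not, even though d2 contains the highest-weighted positive term.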
Linear Classifiers
 Many common text classifiers are linear classifiers
 Naïve Bayes
 Perceptron
 Rocchio
 Logistic regression
 Support vector machines (with linear kernel)
 Linear regression with threshold
 Despite this similarity, noticeable performance differences
 For separable problems, there is an infinite number of separating
hyperplanes. Which one do you choose?
 What to do for non-separable problems?
 Different training methods pick different hyperplanes
 Classifiers more powerful than linear often don’t perform better on
text problems. Why?
Rocchio is a linear classifier
Two-class Rocchio as a linear classifier
 Line or hyperplane defined by:
$$\sum_{i=1}^{M} w_i d_i = \theta$$
 For Rocchio, set:
$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$$
$$\theta = 0.5 \times \big( |\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2 \big)$$
[Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don’t model the variance (or assume it is spherical).]
Naive Bayes is a linear classifier
 Two-class Naive Bayes. We compute:
$$\log \frac{P(C \mid d)}{P(\bar{C} \mid d)} = \log \frac{P(C)}{P(\bar{C})} + \sum_{w \in d} \log \frac{P(w \mid C)}{P(w \mid \bar{C})}$$
 Decide class C if the odds is greater than 1, i.e., if the log odds is greater than 0.
 So decision boundary is hyperplane:
$$\alpha + \sum_{w \in V} \beta_w \times n_w = 0$$
where $\alpha = \log \frac{P(C)}{P(\bar{C})}$; $\beta_w = \log \frac{P(w \mid C)}{P(w \mid \bar{C})}$; $n_w$ = number of occurrences of w in d
A nonlinear problem
 A linear classifier like Naïve Bayes does badly on this task
 kNN will do very well (assuming enough training data)
High Dimensional Data

 Pictures like the one at right are absolutely
misleading!
 Documents are zero along almost all axes
 Most document pairs are very far apart (i.e.,
not strictly orthogonal, but only share very
common words and a few scattered others)
 In classification terms: often document sets
are separable, for most any classification
 This is part of why linear classifiers are quite
successful in this domain
More Than Two Classes

 Any-of or multivalue classification
 Classes are independent of each other.
 A document can belong to 0, 1, or >1 classes.
 Decompose into n binary problems
 Quite common for documents
 One-of or multinomial or polytomous classification
 Classes are mutually exclusive.
 Each document belongs to exactly one class
 E.g., digit recognition is polytomous classification
 Digits are mutually exclusive
Set of Binary Classifiers: Any of
 Build a separator between each class and its complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each class.
 Apply decision criterion of classifiers independently
 Done
 Though maybe you could do better by considering dependencies between categories
Set of Binary Classifiers: One of

 Build a separator between each class and its
complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each
class.
 Assign document to class with:
 maximum score
 maximum confidence ?
 maximum probability ?
?
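The difference between the any-of and one-of decision rules can be sketched on top of per-class scores. The scores below are invented stand-ins for the outputs of the binary classifiers, and the threshold of 0 is an assumption.

```python
def any_of(scores, threshold=0.0):
    """Any-of: apply each binary classifier's decision independently."""
    return sorted(c for c, s in scores.items() if s > threshold)

def one_of(scores):
    """One-of: assign the single class whose classifier gives the maximum score."""
    return max(scores, key=scores.get)

# Hypothetical outputs of three one-versus-rest classifiers for one test document.
scores = {"government": 0.8, "science": 0.3, "arts": -0.5}
any_labels = any_of(scores)
one_label = one_of(scores)
```

The one-of rule only makes sense if the scores of the different classifiers are comparable, which is why the slide asks whether to use score, confidence, or probability.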
Summary: Representation of
Text Categorization Attributes
 Representations of text are usually very high
dimensional (one feature for each word)
 High-bias algorithms that prevent overfitting in high-
dimensional space should generally work best*
 For most text categorization tasks, there are many
relevant features and many irrelevant ones
 Methods that combine evidence from many or all
features (e.g. naive Bayes, kNN) often tend to work
better than ones that try to isolate just a few
relevant features*
*Although the results are a bit more mixed than often thought
Which classifier do I use for a given text classification problem?
 Is there a learning method that is optimal for all text
classification problems?
 No, because there is a tradeoff between bias and
variance.
 Factors to take into account:
 How much training data is available?
 How simple/complex is the problem? (linear vs. nonlinear
decision boundary)
 How noisy is the data?
 How stable is the problem over time?
 For an unstable problem, it’s better to use a simple and robust
classifier.
Flat Clustering
Slides are mostly from Hinrich Schütze
March 27, 2017
Overview
1 Recap
2 Clustering: Introduction
3 Clustering in IR
4 K -means
5 Evaluation
6 How many clusters?
Take-away today
What is clustering?
Applications of clustering in information retrieval
K -means algorithm
Evaluation of clustering
How many clusters?
Clustering: Definition
(Document) clustering is the process of grouping a set of

documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.
Data set with clear cluster structure
Exercise: propose an algorithm for finding the cluster structure in this example.
[Figure: scatter plot of a data set with clear cluster structure]
Classification vs. Clustering
Classification: supervised learning

Clustering: unsupervised learning
Classification: Classes are human-defined and part of the
input to the learning algorithm.
Clustering: Clusters are inferred from the data without human
input.
Many ways of influencing the outcome of clustering:
number of clusters,
similarity measure,
representation of documents,
...
The cluster hypothesis
Cluster hypothesis. Documents in the same cluster behave

similarly with respect to relevance to information needs.
All applications of clustering in IR are based (directly or indirectly)

on the cluster hypothesis.
Van Rijsbergen’s original wording (1979): “closely associated

documents tend to be relevant to the same requests”.
Applications of clustering in IR
application               | what is clustered?      | benefit
search result clustering  | search results          | more effective information presentation to user
Scatter-Gather            | (subsets of) collection | alternative user interface: “search without typing”
collection clustering     | collection              | effective information presentation for exploratory browsing
cluster-based retrieval   | collection              | higher efficiency: faster search
Search result clustering for better navigation
Scatter-Gather
Global navigation: Yahoo
Global navigation: MESH (upper level)
Global navigation: MESH (lower level)
Navigational hierarchies: Manual vs. automatic creation
Note: Yahoo/MESH are not examples of clustering.

But they are well known examples for using a global hierarchy
for navigation.
Some examples for global navigation/exploration based on
clustering:
Cartia
Themescapes
Google News
Global navigation combined with visualization (1)
Global navigation combined with visualization (2)
Global clustering for navigation: Google News
http://news.google.com
Clustering for improving recall
To improve search recall:

Cluster docs in collection a priori
When a query matches a doc d, also return other docs in the
cluster containing d
Hope: if we do this, the query “car” will also return docs containing “automobile”
Because the clustering algorithm groups together docs containing “car” with those containing “automobile”.
Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.
Desiderata for clustering
General goal: put related docs in the same cluster, put

unrelated docs in different clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set
we are clustering.
Initially, we will assume the number of clusters K is given.
Later: Semiautomatic methods for determining K
Secondary goals in clustering
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .
Flat vs. Hierarchical clustering
Flat algorithms
Usually start with a random (partial) partitioning of docs into
groups
Refine iteratively
Main algorithm: K -means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive
Hard vs. Soft clustering
Hard clustering: Each document belongs to exactly one

cluster.
More common and easier to do
Soft clustering: A document can belong to more than one
cluster.
Makes more sense for applications like creating browsable
hierarchies
You may want to put sneakers in two clusters:
sports apparel
shoes
You can only do that with a soft clustering approach.
Flat algorithms
Flat algorithms compute a partition of N documents into a

set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen
partitioning criterion
Global optimization: exhaustively enumerate partitions, pick
optimal one
Not tractable
Effective heuristic method: K -means algorithm
K -means
Perhaps the best known clustering algorithm

Simple, works well in many cases
Use as default / baseline for clustering documents
Document representations in clustering
Vector space model

relatedness between vectors can be measured by Euclidean
distance, cosine similarity, etc.
K -means: Basic idea
Each cluster in K -means is defined by a centroid.

Objective/partitioning criterion: minimize the average squared
difference from the centroid
Recall definition of centroid:
$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$
where we use ω to denote a cluster.

We try to find the minimum average squared difference by
iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the
vectors that were assigned to it in reassignment
K-means pseudocode (µk is centroid of ωk)
K-means({x1, . . . , xN}, K)
  (s1, s2, . . . , sK) ← SelectRandomSeeds({x1, . . . , xN}, K)
  for k ← 1 to K
      do µk ← sk
  while stopping criterion has not been met
      do for k ← 1 to K
             do ωk ← {}
         for n ← 1 to N
             do j ← arg min_j′ |µj′ − xn|
                ωj ← ωj ∪ {xn}                     (reassignment of vectors)
         for k ← 1 to K
             do µk ← (1/|ωk|) Σ_{x ∈ ωk} x          (recomputation of centroids)
  return {µ1, . . . , µK}
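The pseudocode translates into a short Python sketch. Several choices here are assumptions rather than part of the lecture: the stopping criterion is "assignments did not change", an empty cluster keeps its previous centroid, ties go to the lowest cluster index, and the toy points are invented.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(points, k, max_iter=100, seed=0):
    """Returns (centroids, assignment) where assignment[n] is the cluster index of points[n]."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # random seeds
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each vector goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(centroids[j], p))
            for p in points
        ]
        if new_assignment == assignment:    # assumed stopping criterion
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                     # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return centroids, assignment

# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
clusters = {
    frozenset(p for p, a in zip(points, assignment) if a == j) for j in range(2)
}
```

On data this clearly separated the two groups are recovered regardless of which seeds are drawn; on less clean data the result depends on the seeds.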
Worked Example: Set of points to be clustered
Exercise: what are the two clusters? Compute the centroids of the clusters.
[Figures omitted: a sequence of scatter plots showing each step of the worked example]
Worked Example: Random selection of initial centroids
Worked Example: Assign points to closest center
Worked Example: Assignment
Worked Example: Recompute cluster centroids
Worked Example: Assign points to closest centroid
(The assignment and recomputation steps repeat until the assignments no longer change.)
Worked Ex.: Centroids and assignments after convergence
K -means is guaranteed to converge: Proof
RSS = sum of all squared distances between document vector

and closest centroid
RSS decreases during each reassignment step.
because each vector is moved to a closer centroid
RSS decreases during each recomputation step.
see next slide
There is only a finite number of clusterings.
Thus: We must reach a fixed point.
Assumption: Ties are broken consistently.
Finite set & monotonically decreasing → convergence
Re-computation decreases average distance
$RSS = \sum_{k=1}^{K} RSS_k$ is the residual sum of squares (the “goodness” measure)
$$RSS_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2$$
$$\frac{\partial RSS_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2(v_m - x_m) = 0$$
$$v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m$$
The last line is the componentwise definition of the centroid! We minimize $RSS_k$ when the old centroid is replaced with the new centroid. RSS, the sum of the $RSS_k$, must then also decrease during recomputation.
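A quick numeric check of this result, on an invented three-point cluster: the RSS evaluated at the centroid is smaller than at a nearby point.

```python
def rss(center, cluster):
    """Sum of squared distances from every vector in the cluster to the given center."""
    return sum(sum((c - x) ** 2 for c, x in zip(center, p)) for p in cluster)

# Hypothetical cluster of three 2-D vectors; its centroid is (3, 2).
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
centroid = (3.0, 2.0)
shifted = (3.5, 2.0)

rss_at_centroid = rss(centroid, cluster)   # (4+0) + (0+4) + (4+4) = 16
rss_at_shifted = rss(shifted, cluster)     # 16 + 3 * 0.5**2 = 16.75
```

Moving the center by 0.5 along one axis adds |ω| · 0.5² = 0.75 to the RSS, in line with the centroid being the minimizer.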
K -means is guaranteed to converge
But we don’t know how long convergence will take!

If we don’t care about a few docs switching back and forth,
then convergence is usually fast (< 10-20 iterations).
However, complete convergence can take many more
iterations.
Optimality of K -means
Convergence ≠ optimality
Convergence does not mean that we converge to the optimal
clustering!
This is the great weakness of K -means.
If we start with a bad set of seeds, the resulting clustering can
be horrible.
Exercise: Suboptimal clustering
3
d1 d2 d3
2 × × ×
1 × × ×
d4 d5 d6
0
0 1 2 3 4
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds di , dj ?
Initialization of K -means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better ways of computing initial centroids:
Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
Use hierarchical clustering to find good seeds
Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
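The last option (several random seed sets, keep the clustering with lowest RSS) can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function names, the fixed random seed, the empty-cluster handling, and the six sample points are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run from k randomly chosen seed documents.
    Returns (centroids, assignment, rss)."""
    centroids = rng.sample(points, k)
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:          # partition unchanged: converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                   # assumption: keep old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return centroids, assign, rss

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the run with the lowest RSS."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[2])

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assign, rss = best_of_restarts(points, k=2)
print(assign, rss)
```

Restarts multiply the running time by the number of runs, which is the price paid for being less dependent on any single set of seeds.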
Time complexity of K -means
Computing one distance of two vectors is O(M).

Reassignment step: O(KNM) (we need to compute KN
document-centroid distances)
Recomputation step: O(NM) (we need to add each of the
document’s < M values to one of the centroids)
Assume number of iterations bounded by I
Overall complexity: O(IKNM) – linear in all important
dimensions
M: Vector length; N: Number of documents.
Outline
1 Recap
3 Clustering in IR
4 K -means
5 Evaluation
What is a good clustering?
Internal criteria
RSS
Modularity in graph
But an internal criterion often does not evaluate the actual
utility of a clustering in the application.
Alternative: External criteria
Evaluate with respect to a human-defined classification
External criteria for clustering quality
Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
Goal: Clustering should reproduce the classes in the gold standard
(But we only want to reproduce how documents are divided into groups, not the class labels.)
First measure for how well we were able to reproduce the classes: purity
External criterion: Purity
$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|$$
Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes.
For each cluster ωk: find class cj with most members nkj in ωk
Sum all nkj and divide by total number of points
Example for computing purity
To compute purity:
$5 = \max_j |\omega_1 \cap c_j|$ (class x, cluster 1)
$4 = \max_j |\omega_2 \cap c_j|$ (class o, cluster 2)
$3 = \max_j |\omega_3 \cap c_j|$ (class ⋄, cluster 3)
$$\mathrm{Purity} = \frac{5 + 4 + 3}{17} \approx 0.71 \tag{1}$$
Cluster contents (class symbols):
cluster 1: x x x x x o
cluster 2: x o o o o ⋄
cluster 3: x x ⋄ ⋄ ⋄
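The purity computation for the o/⋄/x example can be reproduced with a few lines of Python. This is a sketch: the list-of-lists encoding of the clusters and the letter "d" standing in for ⋄ are choices made here for illustration.

```python
from collections import Counter

# Class label of every point, grouped by cluster ("d" stands for the diamond class).
clusters = [
    ["x"] * 5 + ["o"],            # cluster 1
    ["x"] + ["o"] * 4 + ["d"],    # cluster 2
    ["x"] * 2 + ["d"] * 3,        # cluster 3
]

def purity(clusters):
    """Sum of each cluster's majority-class count, divided by the number of points."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

print(round(purity(clusters), 2))
```

The majority counts are 5, 4 and 3, so the value is 12/17, matching the slide.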
Another external criterion: Rand index
Purity can be increased easily by increasing K – a measure that does not have this problem: Rand index.
Definition: $$\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$
Based on 2x2 contingency table of all pairs of documents:
                    same cluster           different clusters
same class          true positives (TP)    false negatives (FN)
different classes   false positives (FP)   true negatives (TN)
TP+FN+FP+TN is the total number of pairs.
$TP + FN + FP + TN = \binom{N}{2}$ for N documents.
Example: $\binom{17}{2} = 136$ in o/⋄/x example
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) . . .
. . . and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.
Rand Index: Example
The three clusters contain 6, 6, and 5 points.
$$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$$
Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:
$$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$$
Thus, FP = 40 − 20 = 20.
Rand measure for the o/⋄/x example
                    same cluster   different clusters
same class          TP = 20        FN = 24
different classes   FP = 20        TN = 72
$$\mathrm{RI} = \frac{20 + 72}{20 + 20 + 24 + 72} = \frac{92}{136} \approx 0.68$$
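The four pair counts and the Rand index for the o/⋄/x example can be checked by brute force over all 136 pairs. This is a sketch: the flat label lists and the letter "d" for ⋄ are an encoding chosen here, not taken from the slides.

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """Count TP, FP, FN, TN over all unordered pairs of documents."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Cluster id and class symbol of each of the 17 points ("d" = diamond).
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = list("xxxxxo" + "xooood" + "xxddd")

tp, fp, fn, tn = pair_counts(clusters, classes)
rand_index = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, round(rand_index, 2))
```

Enumerating pairs is quadratic in the number of documents; for large collections the counts would normally be derived from the cluster/class contingency table, as the slide does with binomial coefficients.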
Normalized mutual information (NMI)
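$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{(H(\Omega) + H(C))/2}$$
$$I(\Omega; C) = \sum_{k} \sum_{j} P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k) P(c_j)} \tag{2}$$
$$= \sum_{k} \sum_{j} \frac{|\omega_k \cap c_j|}{N} \log \frac{N |\omega_k \cap c_j|}{|\omega_k| |c_j|} \tag{3}$$
$$H(\Omega) = -\sum_{k} P(\omega_k) \log P(\omega_k) \tag{4}$$
H: entropy
I: Mutual Information
the denominator: normalizes the value to be between 0 and 1.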
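A sketch of the NMI computation using the count-based form (3) and the entropy (4), applied to the o/⋄/x example. The function name, the label encoding ("d" for ⋄), and the use of the natural logarithm are choices made here; the base of the logarithm cancels in the ratio.

```python
import math
from collections import Counter

def nmi(clusters, classes):
    """Normalized mutual information between a clustering and a class labeling."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))   # |cluster k ∩ class j|, nonzero cells only
    size_k = Counter(clusters)                # cluster sizes
    size_j = Counter(classes)                 # class sizes
    mi = sum((nkj / n) * math.log(n * nkj / (size_k[k] * size_j[j]))
             for (k, j), nkj in joint.items())

    def entropy(sizes):
        return -sum((s / n) * math.log(s / n) for s in sizes.values())

    return mi / ((entropy(size_k) + entropy(size_j)) / 2)

clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = list("xxxxxo" + "xooood" + "xxddd")
print(round(nmi(clusters, classes), 2))
```

Empty cluster/class intersections contribute nothing to the sum, which is why only the nonzero cells of the contingency table are visited.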
Evaluation results for the o/⋄/x example
                    purity   NMI    RI     F5
lower bound         0.0      0.0    0.0    0.0
maximum             1.0      1.0    1.0    1.0
value for example   0.71     0.36   0.68   0.46
All four measures range from 0 (really bad clustering) to 1 (perfect clustering).
How many clusters?
Number of clusters K is given in many applications.

E.g., there may be an external constraint on K . Example: In
the case of Scatter-Gather, it was hard to show more than
10–20 clusters on a monitor in the 90s.
What if there is no external constraint? Is there a “right”
number of clusters?
One way to go: define an optimization criterion
Given docs, find K for which the optimum is reached.
What optimization criterion can we use?
We can’t use RSS or average squared distance from centroid
as criterion: always chooses K = N clusters.
Exercise
Your job is to develop the clustering algorithms for a competitor to news.google.com
You want to use K-means clustering.
How would you determine K?
Simple objective function for K : Basic idea
Start with 1 cluster (K = 1)

Keep adding clusters (= keep increasing K )
Add a penalty for each new cluster
Then trade off cluster penalties against average squared
distance from centroid
Choose the value of K with the best tradeoff
Simple objective function for K : Formalization
Given a clustering, define the cost for a document as (squared) distance to centroid
Define total distortion RSS(K) as sum of all individual document costs (corresponds to average distance)
Then: penalize each cluster with a cost λ
Thus for a clustering with K clusters, total cluster penalty is Kλ
Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ
Select K that minimizes (RSS(K) + Kλ)
Still need to determine good value for λ . . .
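The selection rule can be sketched in a few lines. The RSS values below are invented for illustration (they are not from any experiment in the slides), and `choose_k` is a hypothetical helper name; in practice RSS(K) would come from running K-means for each K.

```python
def choose_k(rss_by_k, lam):
    """Return the K that minimizes RSS(K) + lam * K."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + lam * k)

# Hypothetical distortion values: RSS shrinks as K grows, with diminishing returns.
rss_by_k = {1: 100.0, 2: 60.0, 3: 35.0, 4: 30.0, 5: 28.0}

print(choose_k(rss_by_k, lam=10.0))  # penalty outweighs the small gains beyond K = 3
print(choose_k(rss_by_k, lam=0.0))   # no penalty: the largest K wins
```

With λ = 0 the rule degenerates to minimizing RSS alone, which always prefers more clusters; the penalty term is what makes a smaller K competitive.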
Finding the “knee” in the curve
[Figure: residual sum of squares (y-axis, about 1750 to 1950) plotted against number of clusters (x-axis, 2 to 10).]
Pick the number of clusters where curve “flattens”. Here: 4 or 9.
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 12: Clustering

Introduction to Information Retrieval Ch. 16
What is clustering?
 Clustering: the process of grouping a set of objects
into classes of similar objects
 Documents within a cluster should be similar.
 Documents from different clusters should be
dissimilar.
 The commonest form of unsupervised learning
 Unsupervised learning = learning from raw data, as
opposed to supervised data where a classification of
examples is given
 A common and important task that finds many
applications in IR and other places
A data set with clear cluster structure
 How would you design an algorithm for finding the three clusters in this case?
Introduction to Information Retrieval Sec. 16.1
Applications of clustering in IR
 Whole corpus analysis/navigation
 Better user interface: search without typing
 For improving recall in search applications
 Better search results (like pseudo RF)
 For better navigation of search results
 Effective “user recall” will be higher
 For speeding up vector space retrieval
 Cluster-based retrieval gives faster search
Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering
www.yahoo.com/Science … (30)
agriculture: dairy, crops, forestry, agronomy, ...
biology: botany, cell, evolution, ...
physics: magnetism, relativity, ...
CS: AI, HCI, courses, ...
space: craft, missions, ...
Google News: automatic clustering gives an effective news presentation metaphor
Scatter/Gather: Cutting, Karger, and Pedersen
For visualizing a document collection and its themes
 Wise et al, “Visualizing the non-visual” PNNL
 ThemeScapes, Cartia
 [Mountain height = cluster size]
For improving search recall

 Cluster hypothesis - Documents in the same cluster behave similarly
with respect to relevance to information needs
 Therefore, to improve search recall:
 Cluster docs in corpus a priori
 When a query matches a doc D, also return other docs in the
cluster containing D
 Hope if we do this: The query “car” will also return docs containing
automobile
 Because clustering grouped together docs containing car with
those containing automobile.
Why might this happen?

yippy.com – grouping search results

Issues for clustering

 Representation for clustering
 Document representation
 Vector space? Normalization?
 Centroids aren’t length normalized
 Need a notion of similarity/distance
 How many clusters?
 Fixed a priori?
 Completely data driven?
 Avoid “trivial” clusters - too large or small
 If a cluster's too large, then for navigation purposes you've
wasted an extra user click without whittling down the set of
documents much.
Notion of similarity/distance
 Ideal: semantic similarity.
 Practical: term-statistical similarity
 We will use cosine similarity.
 Docs as vectors.
 For many algorithms, easier to think in
terms of a distance (rather than similarity)
between docs.
 We will mostly speak of Euclidean distance
 But real implementations use cosine similarity
Clustering Algorithms
 Flat algorithms
 Usually start with a random (partial) partitioning
 Refine it iteratively
 K means clustering
 (Model based clustering)
 Hierarchical algorithms
 Bottom-up, agglomerative
 (Top-down, divisive)
Hard vs. soft clustering

 Hard clustering: Each document belongs to exactly one cluster
 More common and easier to do
 Soft clustering: A document can belong to more than one
cluster.
 Makes more sense for applications like creating browsable
hierarchies
 You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
 You can only do that with a soft clustering approach.
 We won’t do soft clustering today. See IIR 16.5, 18
Partitioning Algorithms
 Partitioning method: Construct a partition of n
documents into a set of K clusters
 Given: a set of documents and the number K
 Find: a partition of K clusters that optimizes the
chosen partitioning criterion
 Globally optimal
 Intractable for many objective functions
 Would require exhaustively enumerating all partitions
 Effective heuristic methods: K-means and K-
medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
K-Means
 Assumes documents are real-valued vectors.
 Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
 Reassignment of instances to clusters is based on distance to the current cluster centroids.
 (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cj such that dist(di, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
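The two steps of the algorithm above can be written as separate functions. This is a minimal sketch under stated assumptions: the seeds are passed in explicitly (the first two documents here, rather than random ones) so the run is reproducible, squared Euclidean distance is used, an empty cluster keeps its old seed, and the one-dimensional sample documents are invented for illustration.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def reassign(docs, seeds):
    """Assign each doc to the index of its closest seed (ties go to the lowest index)."""
    return [min(range(len(seeds)), key=lambda j: sq_dist(d, seeds[j])) for d in docs]

def recompute(docs, assign, seeds):
    """Replace each seed by the centroid (mean) of the docs assigned to it."""
    new_seeds = []
    for j, old in enumerate(seeds):
        members = [d for d, a in zip(docs, assign) if a == j]
        if members:
            new_seeds.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_seeds.append(old)  # assumption: an empty cluster keeps its seed
    return new_seeds

def k_means(docs, seeds, max_iter=100):
    """Alternate the two steps until the partition stops changing."""
    assign = None
    for _ in range(max_iter):
        new_assign = reassign(docs, seeds)
        if new_assign == assign:
            break
        assign = new_assign
        seeds = recompute(docs, assign, seeds)
    return seeds, assign

# Hypothetical 1-D documents; the first two serve as seeds.
docs = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
centroids, assign = k_means(docs, seeds=[docs[0], docs[1]])
print(centroids, assign)
```

On this data the seeds start inside the same group, yet the second centroid is pulled toward the far points after the first recomputation and the partition settles after two reassignments.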
K Means Example
(K=2)
[Figure sequence: Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!]
Termination conditions
 Several possibilities, e.g.,
 A fixed number of iterations.
 Doc partition unchanged.
 Centroid positions don’t change.
Does this mean that the docs in a cluster are unchanged?
Convergence
 Why should the K-means algorithm ever reach a
fixed point?
 A state in which clusters don’t change.
 K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
 EM is known to converge.
 Number of iterations could be large.
 But in practice usually isn’t
Lower case!
Convergence of K-Means
 Define goodness measure of cluster k as sum of
squared distances from cluster centroid:
 Gk = Σi (di – ck)² (sum over all di in cluster k)
 G = Σk Gk
 Reassignment monotonically decreases G since
each vector is assigned to the closest centroid.
Convergence of K-Means
 Recomputation monotonically decreases each Gk
since (mk is number of members in cluster k):
 Σ (di – a)² reaches minimum for:
 Σ –2(di – a) = 0
 Σ di = Σ a
 mk a = Σ di
 a = (1/mk) Σ di = ck
 K-means typically converges quickly
Time Complexity
 Computing distance between two docs is O(M)
where M is the dimensionality of the vectors.
 Reassigning clusters: O(KN) distance computations,
or O(KNM).
 Computing centroids: Each doc gets added once to
some centroid: O(NM).
 Assume these two steps are each done once for I
iterations: O(IKNM).
Seed Choice
 Results can vary based on random seed selection.
 Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
 Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
 Try out multiple starting points
 Initialize with the results of another method.
Example showing sensitivity to seeds (figure with points A–F): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}. If you start with D and F, you converge to {A,B,D,E} and {C,F}.
K-means issues, variations, etc.

 Recomputing the centroid after every assignment
(rather than after all points are re-assigned) can
improve speed of convergence of K-means
 Assumes clusters are spherical in vector space
 Sensitive to coordinate changes, weighting etc.
 Disjoint and exhaustive
 Doesn’t have a notion of “outliers” by default
 But can add outlier filtering
Dhillon et al. ICDM 2002 – variation to fix some issues with small
document clusters
How Many Clusters?

 Number of clusters K is given
 Partition n docs into predetermined number of clusters
 Finding the “right” number of clusters is part of the
problem
 Given docs, partition into an “appropriate” number of
subsets.
 E.g., for query results - ideal value of K not known up front
- though UI may impose limits.
 Can usually take an algorithm for one flavor and
convert to the other.
K not specified in advance

 Say, the results of a query.
 Solve an optimization problem: penalize having
lots of clusters
 application dependent, e.g., compressed summary
of search results list.
 Tradeoff between having more clusters (better
focus within each cluster) and having too many
clusters
K not specified in advance

 Given a clustering, define the Benefit for a
doc to be the cosine similarity to its
centroid
 Define the Total Benefit to be the sum of
the individual doc Benefits.
Why is there always a clustering of Total Benefit n?

Penalize lots of clusters

 For each cluster, we have a Cost C.
 Thus for a clustering with K clusters, the Total Cost is
KC.
 Define the Value of a clustering to be Total Benefit − Total Cost.
 Find the clustering of highest value, over all choices
of K.
 Total benefit increases with increasing K. But can stop
when it doesn’t increase by “much”. The Cost term
enforces this.
Hierarchical Clustering
 Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
animal
vertebrate invertebrate
fish reptile amphib. mammal worm insect crustacean
 One approach: recursive application of a

partitional clustering algorithm.
Dendrogram: Hierarchical Clustering

 Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
Hierarchical Agglomerative Clustering
(HAC)
 Starts with each doc in a separate cluster
 then repeatedly joins the closest pair of
clusters, until there is only one cluster.
 The history of merging forms a binary tree
or hierarchy.
Note: the resulting clusters are still “hard” and induce a partition
Closest pair of clusters

 Many variants to defining closest pair of clusters
 Single-link
 Similarity of the most cosine-similar (single-link)
 Complete-link
 Similarity of the “furthest” points, the least cosine-similar
 Centroid
 Clusters whose centroids (centers of gravity) are the most
cosine-similar
 Average-link
 Average cosine between pairs of elements
Single Link Agglomerative Clustering

 Use maximum similarity of pairs:
$$sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$$
 Can result in “straggly” (long and thin) clusters due to chaining effect.
 After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
$$sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$$

Single Link Example

Complete Link
 Use minimum similarity of pairs:
$$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$$
 Makes “tighter,” spherical clusters that are typically preferable.
 After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
$$sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$$
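Single link and complete link differ only in whether the cluster similarity is the maximum or the minimum over cross-cluster pairs, so one naive routine can cover both. The sketch below is illustrative only: it recomputes all pair similarities at every step (roughly cubic time), stops at a requested number of clusters instead of building the full dendrogram, and uses negative distance on invented one-dimensional points as a stand-in for cosine similarity.

```python
def hac(points, sim, linkage, num_clusters):
    """Naive agglomerative clustering: repeatedly merge the most similar pair of clusters.
    linkage=max gives single link, linkage=min gives complete link."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

def neg_dist(x, y):
    """Stand-in similarity for the example: larger means closer."""
    return -abs(x - y)

points = [0.0, 1.0, 2.5, 4.5, 7.5]   # hypothetical data with growing gaps
single = hac(points, neg_dist, linkage=max, num_clusters=3)
complete = hac(points, neg_dist, linkage=min, num_clusters=3)
print(sorted(sorted(c) for c in single))
print(sorted(sorted(c) for c in complete))
```

On this data single link keeps extending the first cluster along the chain of nearby points, while complete link prefers to pair 2.5 with 4.5, which illustrates the chaining effect mentioned above.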
Complete Link Example

Introduction to Information Retrieval Sec. 17.2.1
Computational Complexity
 In the first iteration, all HAC methods need to compute similarity of all pairs of N initial instances, which is O(N²).
 In each of the subsequent N − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
 In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
 Often O(N³) if done naively or O(N² log N) if done more cleverly
Group Average
 Similarity of two clusters = average similarity of all pairs
within merged cluster.
$$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| \, (|c_i \cup c_j| - 1)} \sum_{\vec{x} \in (c_i \cup c_j)} \; \sum_{\vec{y} \in (c_i \cup c_j):\, \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y})$$
 Compromise between single and complete link.
 Two options:
 Averaged across all ordered pairs in the merged cluster
 Averaged over all pairs between the two original clusters
 No clear difference in efficacy
Computing Group Average Similarity

 Always maintain sum of vectors in each cluster.
$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$
 Compute similarity of clusters in constant time:
$$sim(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}$$
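The constant-time formula can be checked against the brute-force average over all ordered pairs. The sketch assumes, as the subtraction of the cluster sizes in the numerator implies, that every document vector is length-normalized (so each self-similarity is 1) and that similarity is the dot product; the sample vectors and function names are invented for the example.

```python
import math
from itertools import permutations

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return tuple(a / norm for a in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_sum(vectors):
    return tuple(sum(c) for c in zip(*vectors))

def group_avg_brute(ci, cj):
    """Average dot product over all ordered pairs of distinct docs in the merged cluster."""
    merged = ci + cj
    n = len(merged)
    return sum(dot(x, y) for x, y in permutations(merged, 2)) / (n * (n - 1))

def group_avg_fast(s_i, n_i, s_j, n_j):
    """Same value from the maintained sum vectors and cluster sizes only."""
    s = tuple(a + b for a, b in zip(s_i, s_j))
    n = n_i + n_j
    return (dot(s, s) - n) / (n * (n - 1))

# Hypothetical clusters of unit-length 2-D vectors.
ci = [normalize((1.0, 0.0)), normalize((1.0, 1.0))]
cj = [normalize((0.0, 1.0)), normalize((2.0, 1.0)), normalize((1.0, 3.0))]

slow = group_avg_brute(ci, cj)
fast = group_avg_fast(vec_sum(ci), len(ci), vec_sum(cj), len(cj))
print(slow, fast)
```

The fast version works because the squared length of the summed vector equals the sum of all pairwise dot products including the self-pairs, and the self-pairs contribute exactly one each for unit vectors.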
What Is A Good Clustering?

 Internal criterion: A good clustering will produce
high quality clusters in which:
 the intra-class (that is, intra-cluster) similarity is
high
 the inter-class similarity is low
 The measured quality of a clustering depends on
both the document representation and the
similarity measure used
External criteria for clustering quality

 Quality measured by its ability to discover some
or all of the hidden patterns or latent classes in
gold standard data
 Assesses a clustering with respect to ground truth
… requires labeled data
 Assume documents with C gold standard classes,
while our clustering algorithms produce K clusters,
ω1, ω2, …, ωK with ni members.
External Evaluation of Cluster Quality

 Simple measure: purity, the ratio between the
dominant class in the cluster πi and the size of
cluster ωi
$$\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j} (n_{ij}), \quad j \in C$$
 Biased because having n clusters maximizes
purity
 Others are entropy of classes in clusters (or
mutual information between classes and
clusters)
Purity example
[Figure: 17 points from three classes grouped into Cluster I, Cluster II and Cluster III]
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5

Rand Index measures agreement between pair decisions. Here RI = 0.68
Number of pairs of points            Same cluster in clustering   Different clusters in clustering
Same class in ground truth           20                           24
Different classes in ground truth    20                           72
Rand index and Cluster F-measure
$$\mathrm{RI} = \frac{A + D}{A + B + C + D}$$
Compare with standard Precision and Recall:
$$P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}$$
People also define and use a cluster F-
measure, which is probably a better measure.
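A sketch of the pair-based precision, recall and F-measure for the example counts. It reads A, B, C, D as the TP, FP, FN and TN pair counts, which is what the precision and recall formulas imply, and uses the standard weighted form F_β = (β² + 1)PR / (β²P + R); the function name is a placeholder.

```python
def pair_prf(a, b, c, beta=1.0):
    """Precision, recall and F_beta from pair counts a = TP, b = FP, c = FN."""
    p = a / (a + b)
    r = a / (a + c)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

# Pair counts from the o/⋄/x example: TP = 20, FP = 20, FN = 24.
p, r, f1 = pair_prf(20, 20, 24, beta=1.0)
_, _, f5 = pair_prf(20, 20, 24, beta=5.0)
print(round(p, 2), round(r, 2), round(f1, 2), round(f5, 2))
```

Choosing β greater than 1 weights recall more heavily, so separating documents of the same class is penalized more than merging documents of different classes.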
Final word and resources

 In clustering, clusters are inferred from the data without
human input (unsupervised learning)
 However, in practice, it’s a bit less clear: there are many
ways of influencing the outcome of clustering: number of
clusters, similarity measure, representation of documents, ...
 Resources
 IIR 16 except 16.5
 IIR 17.1–17.3
