
Introduction to Information Retrieval

A text classification task: Email spam filtering


From: ‘‘’’ <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
How would you write a program that would automatically detect
and delete this type of message?


Formal definition of TC: Training


Given:
 A document set X
 Documents are typically represented in some type of high-dimensional space.
 A fixed set of classes C = {c1, c2, . . . , cJ}
 The classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
 A training set D of labeled documents, with each labeled document <d, c> ∈ X × C
Using a learning method or learning algorithm, we then wish to learn a classifier ϒ that maps documents to classes:
ϒ: X → C


Formal definition of TC: Application/Testing

Given: a description d ∈ X of a document
Determine: ϒ(d) ∈ C, that is, the class that is most appropriate for d


Topic classification


Examples of how search engines use classification

 Language identification (classes: English vs. French etc.)


 The automatic detection of spam pages (spam vs. nonspam)
 Topic-specific or vertical search – restrict search to a
“vertical” like “related to health” (relevant to vertical vs. not)


Classification methods: Statistical/Probabilistic

 This was our definition of the classification problem: text classification as a learning problem
 (i) Supervised learning of the classification function ϒ and (ii) its application to classifying new documents
 We will look at doing this using Naive Bayes
 Naive Bayes requires hand-classified training data
 But this manual classification can be done by non-experts.


Derivation of Naive Bayes rule


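We want to find the class that is most likely given the document:

$$c_{map} = \arg\max_{c \in C} P(c|d)$$

Apply Bayes rule:

$$c_{map} = \arg\max_{c \in C} \frac{P(d|c)\,P(c)}{P(d)}$$

Drop denominator since P(d) is the same for all classes:

$$c_{map} = \arg\max_{c \in C} P(d|c)\,P(c)$$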


Too many parameters / sparseness

 There are too many parameters $P(d|c) = P(\langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle | c)$, one for each unique combination of a class and a sequence of words.
 We would need a very, very large number of training examples to estimate that many parameters.
 This is the problem of data sparseness.


Naive Bayes conditional independence assumption

To reduce the number of parameters to a manageable size, we make the Naive Bayes conditional independence assumption:

$$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(X_k = t_k | c)$$

We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(Xk = tk | c).

NAIVE BAYES TEXT CLASSIFICATION

The Naive Bayes classifier


 The Naive Bayes classifier is a probabilistic classifier.
 We compute the probability of a document d being in a class c as follows:

$$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$$

 nd is the length of the document (number of tokens).
 P(tk|c) is the conditional probability of term tk occurring in a document of class c.
 We interpret P(tk|c) as a measure of how much evidence tk contributes that c is the correct class.
 P(c) is the prior probability of c.
 If a document’s terms do not provide clear evidence for one class vs. another, we choose the c with highest P(c).


Maximum a posteriori class

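 Our goal in Naive Bayes classification is to find the “best” class.
 The best class is the most likely or maximum a posteriori (MAP) class cmap:

$$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$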


Taking the log


 Multiplying lots of small probabilities can result in floating point underflow.
 Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
 Since log is a monotonic function, the class with the highest score does not change.
 So what we usually compute in practice is:

$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$
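The underflow argument can be checked numerically. The sketch below uses made-up values (400 tokens, each with conditional probability 1e-4, and a prior of 0.5); none of these numbers come from the slides.

```python
import math

# Hypothetical values for illustration only.
token_probs = [1e-4] * 400   # P(t_k|c) for each of 400 token positions
prior = 0.5                  # P(c)

# Direct product of probabilities: too small for a double, collapses to 0.0.
product = prior
for p in token_probs:
    product *= p

# Sum of log probabilities: stays finite, so classes remain comparable.
log_score = math.log(prior) + sum(math.log(p) for p in token_probs)

print(product)
print(log_score)
```

With the product at zero for every class, no class could be preferred; the log score keeps the ranking intact.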


Naive Bayes classifier


 Classification rule:

$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$

 Simple interpretation:
 Each conditional parameter $\log \hat{P}(t_k|c)$ is a weight that indicates how good an indicator tk is for c.
 The prior $\log \hat{P}(c)$ is a weight that indicates the relative frequency of c.
 The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.
 We select the class with the most evidence.


Parameter estimation take 1: Maximum likelihood


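 Estimate parameters $\hat{P}(c)$ and $\hat{P}(t_k|c)$ from training data: How?
 Prior:

$$\hat{P}(c) = \frac{N_c}{N}$$

 Nc: number of docs in class c; N: total number of docs
 Conditional probabilities:

$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$

 Tct is the number of tokens of t in training documents from class c (includes multiple occurrences)
 We’ve made a Naive Bayes independence assumption here: $\hat{P}(X_{k_1} = t|c) = \hat{P}(X_{k_2} = t|c)$, i.e., the estimate does not depend on the position k.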

Figure 13.4 represents the multinomial NB model.

The problem with maximum likelihood estimates: Zeros

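$$P(China|d) \propto P(China) \cdot P(\text{BEIJING}|China) \cdot P(\text{AND}|China) \cdot P(\text{TAIPEI}|China) \cdot P(\text{JOIN}|China) \cdot P(\text{WTO}|China)$$

 If WTO never occurs in class China in the train set:

$$\hat{P}(\text{WTO}|China) = \frac{T_{China,\text{WTO}}}{\sum_{t' \in V} T_{China,t'}} = \frac{0}{\sum_{t' \in V} T_{China,t'}} = 0$$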

To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class:

$$\text{Multinomial:}\quad P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(X_k = t_k | c)$$

$$\text{Bernoulli:}\quad P(d|c) = P(\langle e_1, \ldots, e_M \rangle | c) = \prod_{1 \le i \le M} P(U_i = e_i | c)$$

(Here $e_i \in \{0, 1\}$ indicates absence or presence of term $t_i$ in the document.)

We have introduced two random variables here to make the two different generative models explicit. Xk is the random variable for position k in the document and takes as values terms from the vocabulary. P(Xk = t|c) is the probability that in a document of class c the term t will occur in position k. Ui is the random variable for vocabulary term i and takes as values 0 (absence) and 1 (presence). Pˆ(Ui = 1|c) is the probability that in a document of class c the term ti will occur – in any position and possibly multiple times.

We illustrate the conditional independence assumption in Figures 13.4 and 13.5. The class China generates values for each of the five term attributes (multinomial) or six binary attributes (Bernoulli) with a certain probability, independent of the values of the other attributes. The fact that a document in the class China contains the term Taipei does not make it more likely or less likely that it also contains Beijing.

In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
MULTINOMIAL MODEL vs. BERNOULLI MODEL

The problem with maximum likelihood estimates: Zeros (cont)

 If there were no occurrences of WTO in documents in class China, we’d get a zero estimate:

$$\hat{P}(\text{WTO}|China) = \frac{T_{China,\text{WTO}}}{\sum_{t' \in V} T_{China,t'}} = 0$$

 → We will get P(China|d) = 0 for any document that contains WTO!
 Zero probabilities cannot be conditioned away.

ADD-ONE SMOOTHING: To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count.

To avoid zeros: Add-one smoothing

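 Before:

$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$

 Now: Add one to each count to avoid zeros:

$$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}$$

 B is the number of different words (in this case the size of the vocabulary: |V| = M)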


To avoid zeros: Add-one smoothing

 Estimate parameters from the training corpus using add-one smoothing
 For a new document, for each class, compute sum of (i) log of prior and (ii) logs of conditional probabilities of the terms
 Assign the document to the class with the largest score


Naive Bayes: Training


Naive Bayes: Testing
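
The training and testing procedures named on these two slides can be sketched roughly as follows. This is a minimal illustration of multinomial Naive Bayes with add-one smoothing and log scores, not the slides' own pseudocode; the function names and the toy spam/ham documents are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (tokens, class). Returns (vocab, log_prior, log_cond)."""
    vocab = {t for tokens, _ in labeled_docs for t in tokens}
    n_docs = len(labeled_docs)
    docs_per_class = Counter(c for _, c in labeled_docs)   # N_c
    token_counts = defaultdict(Counter)                    # T_ct
    for tokens, c in labeled_docs:
        token_counts[c].update(tokens)

    log_prior, log_cond = {}, {}
    for c, n_c in docs_per_class.items():
        log_prior[c] = math.log(n_c / n_docs)
        # Denominator: total tokens in class c plus B = |V| (add-one smoothing).
        denom = sum(token_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((token_counts[c][t] + 1) / denom) for t in vocab}
    return vocab, log_prior, log_cond

def apply_multinomial_nb(model, tokens):
    """Returns (best class, per-class log scores). Unknown tokens are skipped."""
    vocab, log_prior, log_cond = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy training set, made up for illustration.
training = [
    ("buy real estate now".split(), "spam"),
    ("order ebook now".split(), "spam"),
    ("meeting notes attached".split(), "ham"),
]
model = train_multinomial_nb(training)
label, scores = apply_multinomial_nb(model, "buy ebook now".split())
print(label, scores)
```

Skipping tokens that never occurred in training is one common design choice; it is an assumption here, since the text above does not say how unseen test tokens are handled.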


Exercise

 Estimate parameters of Naive Bayes classifier


 Classify test document


Example: Parameter estimates

The denominators are (8 + 6) and (3 + 6) because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant B is 6 as the vocabulary consists of six terms.


Example: Classification

Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.
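
The arithmetic behind this decision can be reproduced with exact fractions. The training table did not survive in this text, so the counts below are assumptions: three training documents in class China containing 8 tokens (5 of them CHINESE, none TOKYO or JAPAN), and one document in the other class containing CHINESE, TOKYO and JAPAN once each. Only the denominators (8 + 6) and (3 + 6) and the content of d5 are taken from the text above.

```python
from fractions import Fraction

B = 6  # vocabulary size, as stated in the text

# Assumed priors: 3 of 4 training documents are in class c = China.
prior_c, prior_not_c = Fraction(3, 4), Fraction(1, 4)

# Add-one smoothed estimates; the numerators rest on the assumed counts.
p_chinese_c = Fraction(5 + 1, 8 + B)
p_tokyo_c = p_japan_c = Fraction(0 + 1, 8 + B)
p_chinese_not_c = p_tokyo_not_c = p_japan_not_c = Fraction(1 + 1, 3 + B)

# d5 = CHINESE CHINESE CHINESE TOKYO JAPAN
score_c = prior_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_not_c = prior_not_c * p_chinese_not_c**3 * p_tokyo_not_c * p_japan_not_c

print(float(score_c), float(score_not_c))
```

Under these assumed counts the score for China is the larger of the two, which agrees with the decision described above.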


Generative model

 Generate a class with probability P(c)


 Generate each of the words (in their respective positions),
conditional on the class, but independent of each other, with
probability P(tk |c)
 To classify docs, we “reengineer” this process and find the class
that is most likely to have generated the doc.


Evaluating classification

 Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
 It’s easy to get good performance on a test set that was
available to the learner during training (e.g., just memorize
the test set).
 Measures: Precision, recall, F1, classification accuracy


Constructing Confusion Matrix c


Precision P and recall R

P = TP / ( TP + FP)
R = TP / ( TP + FN)


A combined measure: F

 F1 allows us to trade off precision against recall.

$$F_1 = \frac{1}{\frac{1}{2}\frac{1}{P} + \frac{1}{2}\frac{1}{R}} = \frac{2PR}{P + R}$$

 This is the harmonic mean of P and R.


Averaging: Micro vs. Macro

 We now have an evaluation measure (F1) for one class.


 But we also want a single number that measures the
aggregate performance over all classes in the collection.
 Macroaveraging
 Compute F1 for each of the C classes
 Average these C numbers
 Microaveraging
 Compute TP, FP, FN for each of the C classes
 Sum these C numbers (e.g., all TP to get aggregate TP)
 Compute F1 for aggregate TP, FP, FN
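
The two averaging schemes can be sketched in a few lines. The per-class counts below are made up for illustration, and the helper name `f1` is an assumption of this example.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (TP, FP, FN) counts for two classes.
counts = {"class1": (10, 10, 10), "class2": (90, 10, 10)}

# Macroaveraging: compute F1 per class, then average the per-class values.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Microaveraging: sum TP, FP, FN over classes, then compute one F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(macro_f1, micro_f1)
```

In this made-up case the large class dominates the microaverage, while the macroaverage weights both classes equally.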


Micro- vs. Macro-average: Example


Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

Lecture 11: Text Classification; Vector space classification
[Borrows slides from Ray Mooney]

Recap: Naïve Bayes classifiers


 Classify based on prior weight of class and conditional parameter for what each word says:

$$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(x_i | c_j) \Big]$$

 Training is done by counting and dividing:

$$P(c_j) = \frac{N_{c_j}}{N} \qquad P(x_k | c_j) = \frac{T_{c_j x_k} + \alpha}{\sum_{x_i \in V} [T_{c_j x_i} + \alpha]}$$

 Don’t forget to smooth




The rest of text classification


 Today:
 Vector space methods for Text Classification
 Vector space classification using centroids (Rocchio)
 K Nearest Neighbors
 Decision boundaries, linear and nonlinear classifiers
 Dealing with more than 2 classes
 Later in the course
 More text classification
 Support Vector Machines
 Text-specific issues in classification

Introduction to Information Retrieval Sec.14.1

Recall: Vector Space Representation


 Each document is a vector, one component for each
term (= word).
 Normally normalize vectors to unit length.
 High-dimensional vector space:
 Terms are axes
 10,000+ dimensions, or even 100,000+
 Docs are vectors in this space

 How can we do classification in this space?


Classification Using Vector Spaces


 As before, the training set is a set of documents,
each labeled with its class (e.g., topic)
 In vector space classification, this set corresponds to
a labeled set of points (or, equivalently, vectors) in
the vector space
 Premise 1: Documents in the same class form a
contiguous region of space
 Premise 2: Documents from different classes don’t
overlap (much)
 We define surfaces to delineate classes in the space

Documents in a Vector Space

[Figure: documents in the vector space, with regions labeled Government, Science, and Arts.]

Test Document of what class?

[Figure: a test document placed in the vector space, with regions labeled Government, Science, and Arts.]

Test Document = Government

Is this similarity hypothesis true in general?

[Figure: the same vector space, with regions labeled Government, Science, and Arts.]

Our main topic today is how to find good separators

Aside: 2D/3D graphs can be misleading

Introduction to Information Retrieval Sec.14.2

Using Rocchio for text classification


 Relevance feedback methods can be adapted for text
categorization
 As noted before, relevance feedback can be viewed as 2-class
classification
 Relevant vs. nonrelevant documents
 Use standard tf-idf weighted vectors to represent text
documents
 For training documents in each category, compute a
prototype vector by summing the vectors of the training
documents in the category.
 Prototype = centroid of members of class
 Assign test documents to the category with the closest
prototype vector based on cosine similarity.
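
The procedure in this list can be sketched as follows, assuming documents are given as sparse term-to-weight dictionaries. The weights in the toy data are made up rather than real tf-idf values, and the function names are assumptions of this example.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(labeled_vectors):
    """Sum the training vectors of each class into one prototype per class."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, c in labeled_vectors:
        for term, weight in vec.items():
            prototypes[c][term] += weight
    return prototypes

def classify_rocchio(prototypes, vec):
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))

# Hypothetical weighted vectors, for illustration only.
training = [
    ({"election": 1.0, "senate": 0.5}, "government"),
    ({"vote": 1.0, "election": 0.4}, "government"),
    ({"quantum": 1.0, "lab": 0.6}, "science"),
]
prototypes = train_rocchio(training)
predicted = classify_rocchio(prototypes, {"senate": 1.0, "vote": 0.3})
print(predicted)
```

The prototypes are left as plain sums, since dividing by the class size would not change any cosine similarity.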

Illustration of Rocchio Text Categorization


Definition of centroid


$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

 Where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

 Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.


Rocchio Properties
 Forms a simple generalization of the examples in
each class (a prototype).
 Prototype vector does not need to be averaged or
otherwise normalized for length since cosine
similarity is insensitive to vector length.
 Classification is based on similarity to class
prototypes.
 Does not guarantee classifications are consistent
with the given training data.
Why not?


Rocchio Anomaly
 Prototype models have problems with polymorphic
(disjunctive) categories.


Rocchio classification
 Rocchio forms a simple representation for each class:
the centroid/prototype
 Classification is based on similarity to / distance from
the prototype/centroid
 It does not guarantee that classifications are
consistent with the given training data
 It is little used outside text classification
 It has been used quite effectively for text classification
 But in general worse than Naïve Bayes
 Again, cheap to train and test documents
Introduction to Information Retrieval Sec.14.3

k Nearest Neighbor Classification


 kNN = k Nearest Neighbor

 To classify a document d into class c:


 Define k-neighborhood N as k nearest neighbors of d
 Count number of documents i in N that belong to c
 Estimate P(c|d) as i/k
 Choose as class argmaxc P(c|d) [ = majority class]
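
The four steps above can be sketched as follows, using cosine similarity as the nearness measure. The 2-dimensional vectors and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two dense vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(training, query, k):
    """training: list of (vector, class). Returns (majority class, P(c|d) estimates)."""
    # k-neighborhood: the k training examples most similar to the query.
    neighbors = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter(c for _, c in neighbors)
    probs = {c: n / k for c, n in votes.items()}   # estimate P(c|d) as i/k
    return max(probs, key=probs.get), probs

# Hypothetical training points, for illustration only.
training = [
    ((1.0, 0.1), "government"),
    ((0.9, 0.2), "government"),
    ((0.8, 0.3), "government"),
    ((0.1, 1.0), "science"),
    ((0.2, 0.9), "science"),
]
label, probs = knn_classify(training, (0.7, 0.4), k=5)
print(label, probs)
```

This version scans all training examples for each query, which is the naive linear search; it is meant only to make the decision rule concrete.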


Example: k=6 (6NN)

P(science | test document)?

[Figure: a test document and its 6 nearest neighbors in a vector space with regions labeled Government, Science, and Arts.]

Nearest-Neighbor Learning Algorithm


 Learning is just storing the representations of the training examples
in D.
 Testing instance x (under 1NN):
 Compute similarity between x and all examples in D.
 Assign x the category of the most similar example in D.
 Does not explicitly compute a generalization or category
prototypes.
 Also called:
 Case-based learning
 Memory-based learning
 Lazy learning
 Rationale of kNN: contiguity hypothesis


kNN Is Close to Optimal


 Cover and Hart (1967)
 Asymptotically, the error rate of 1-nearest-neighbor
classification is less than twice the Bayes rate [error rate of
classifier knowing model that generated data]

 In particular, asymptotic error rate is 0 if Bayes rate is 0.
 Assume: query point coincides with a training point.
 Both query point and training point contribute error
→ 2 times Bayes rate


k Nearest Neighbor
 Using only the closest example (1NN) to determine
the class is subject to errors due to:
 A single atypical example.
 Noise (i.e., an error) in the category label of a single
training example.
 More robust alternative is to find the k most-similar
examples and return the majority category of these k
examples.
 Value of k is typically odd to avoid ties; 3 and 5 are
most common.


kNN decision boundaries


Boundaries are in principle arbitrary surfaces, but usually polyhedra.

[Figure: kNN decision boundaries in a vector space with regions labeled Government, Science, and Arts.]

kNN gives locally defined decision boundaries between classes: far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)

Similarity Metrics
 Nearest neighbor method depends on a similarity (or
distance) metric.
 Simplest for continuous m-dimensional instance
space is Euclidean distance.
 Simplest for m-dimensional binary instance space is
Hamming distance (number of feature values that
differ).
 For text, cosine similarity of tf.idf weighted vectors is
typically most effective.


Illustration of 3 Nearest Neighbor for Text Vector Space

3 Nearest Neighbor vs. Rocchio


 Nearest Neighbor tends to handle polymorphic
categories better than Rocchio/NB.


Nearest Neighbor with Inverted Index


 Naively, finding nearest neighbors requires a linear
search through |D| documents in collection
 But determining k nearest neighbors is the same as
determining the k best retrievals using the test
document as a query to a database of training
documents.
 Use standard vector space inverted index methods to
find the k nearest neighbors.
 Testing Time: O(B|Vt|) where B is the average
number of training documents in which a test-document word
appears.
 Typically B << |D|

kNN: Discussion
 No feature selection necessary
 Scales well with large number of classes
 Don’t need to train n classifiers for n classes
 Classes can influence each other
 Small changes to one class can have ripple effect
 Scores can be hard to convert to probabilities
 No training necessary
 Actually: perhaps not true. (Data editing, etc.)
 May be expensive at test time
 In most cases it’s more accurate than NB or Rocchio
Introduction to Information Retrieval Sec.14.6

kNN vs. Naive Bayes


 Bias/Variance tradeoff
 Variance ≈ Capacity
 kNN has high variance and low bias.
 Infinite memory
 NB has low variance and high bias.
 Decision surface has to be linear (hyperplane – see later)
 Consider asking a botanist: Is an object a tree?
 Too much capacity/variance, low bias
 Botanist who memorizes
 Will always say “no” to new object (e.g., different # of leaves)
 Not enough capacity/variance, high bias
 Lazy botanist
 Says “yes” if the object is green
 You want the middle ground
(Example due to C. Burges)

Bias vs. variance: Choosing the correct model capacity

Introduction to Information Retrieval Sec.14.4

Linear classifiers and binary and multiclass classification
 Consider 2 class problems
 Deciding between two classes, perhaps, government and
non-government
 One-versus-rest classification
 How do we define (and find) the separating surface?
 How do we decide which region a test doc is in?


Separation by Hyperplanes
 A strong high-bias assumption is linear separability:
 in 2 dimensions, can separate classes by a line
 in higher dimensions, need hyperplanes
 Can find separating hyperplane by linear programming
(or can iteratively fit solution via perceptron):
 separator can be expressed as ax + by = c


Linear programming / Perceptron

Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points.
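
One way to search for such a, b, c is the perceptron update rule, sketched below. The red and blue points are made up and linearly separable; for non-separable data this loop would simply stop after the assumed `epochs` limit without a valid separator.

```python
def train_perceptron(points, labels, epochs=100):
    """points: list of (x, y); labels: +1 for red, -1 for blue. Returns (a, b, c)."""
    a = b = c = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            # A point is on the correct side when label * (a*x + b*y - c) > 0.
            if label * (a * x + b * y - c) <= 0:
                a += label * x
                b += label * y
                c -= label      # the threshold moves opposite to the label
                mistakes += 1
        if mistakes == 0:       # every point is separated: stop
            break
    return a, b, c

# Hypothetical, linearly separable points.
red = [(2, 3), (3, 3), (3, 4)]
blue = [(0, 0), (1, 0), (0, 1)]
a, b, c = train_perceptron(red + blue, [1] * len(red) + [-1] * len(blue))
print(a, b, c)
```

The separator found depends on the order of the points and the initial values, which is one reason many different solutions for a, b, c exist.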

Which Hyperplane?

In general, lots of possible solutions for a,b,c.

Which Hyperplane?
 Lots of possible solutions for a,b,c.
 Some methods find a separating hyperplane,
but not the optimal one [according to some
criterion of expected goodness]
 E.g., perceptron
 Most methods find an optimal separating
hyperplane
 Which points should influence optimality?
 All points
 Linear/logistic regression
 Naïve Bayes
 Only “difficult points” close to decision
boundary
 Support vector machines


Linear classifier: Example


 Class: “interest” (as in interest rate)
 Example features of a linear classifier
  wi     ti              wi      ti
  0.70   prime          −0.71    dlrs
  0.67   rate           −0.35    world
  0.63   interest       −0.33    sees
  0.60   rates          −0.25    year
  0.46   discount       −0.24    group
  0.43   bundesbank     −0.24    dlr
 To classify, find dot product of feature vector and weights
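
A minimal sketch of this dot product, using the weights listed above for the class “interest”. The two example documents, the binary term features, and the decision threshold of 0 are assumptions made for illustration.

```python
# Weights copied from the table above.
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
    "group": -0.24, "dlr": -0.24,
}

def score(doc_features, weights):
    """Dot product of a sparse feature vector with the weight vector."""
    return sum(weights.get(term, 0.0) * value for term, value in doc_features.items())

# Hypothetical documents with binary term features.
doc1 = {"rate": 1, "discount": 1, "dlrs": 1, "world": 1}
doc2 = {"prime": 1, "dlrs": 1}

threshold = 0.0  # assumed decision threshold
for doc in (doc1, doc2):
    s = score(doc, weights)
    print(round(s, 2), "interest" if s > threshold else "not interest")
```

Terms that do not appear in the weight table contribute nothing to the score in this sketch.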


Linear Classifiers
 Many common text classifiers are linear classifiers
 Naïve Bayes
 Perceptron
 Rocchio
 Logistic regression
 Support vector machines (with linear kernel)
 Linear regression with threshold
 Despite this similarity, noticeable performance differences
 For separable problems, there is an infinite number of separating
hyperplanes. Which one do you choose?
 What to do for non-separable problems?
 Different training methods pick different hyperplanes
 Classifiers more powerful than linear often don’t perform better on
text problems. Why?

Rocchio is a linear classifier


Two-class Rocchio as a linear classifier


 Line or hyperplane defined by:

$$\sum_{i=1}^{M} w_i d_i = \theta$$

 For Rocchio, set:

$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$$

$$\theta = 0.5 \times \left(|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2\right)$$

[Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don’t model the variance (or assume it is spherical).]
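The equivalence between the two-class Rocchio rule and a linear classifier can be checked numerically. The sketch assumes that "closest centroid" means smallest Euclidean distance, and the two centroids are made up; a point is assigned to c1 when w · d > θ with w = µ(c1) − µ(c2) and θ = 0.5 × (|µ(c1)|² − |µ(c2)|²).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical centroids of classes c1 and c2.
mu1, mu2 = (1.0, 3.0), (4.0, 1.0)

w = tuple(a - b for a, b in zip(mu1, mu2))        # w = mu(c1) - mu(c2)
theta = 0.5 * (dot(mu1, mu1) - dot(mu2, mu2))     # threshold

def linear_decision(d):
    return "c1" if dot(w, d) > theta else "c2"

def nearest_centroid(d):
    return "c1" if squared_distance(d, mu1) < squared_distance(d, mu2) else "c2"

# Compare both rules on an integer grid (no grid point lies exactly on the boundary).
grid = [(float(x), float(y)) for x in range(-5, 6) for y in range(-5, 6)]
agree = all(linear_decision(d) == nearest_centroid(d) for d in grid)
print(agree)
```

The check works because |d − µ(c1)|² < |d − µ(c2)|² rearranges to 2 d · (µ(c1) − µ(c2)) > |µ(c1)|² − |µ(c2)|².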

Naive Bayes is a linear classifier


 Two-class Naive Bayes. We compute:

$$\log \frac{P(C|d)}{P(\overline{C}|d)} = \log \frac{P(C)}{P(\overline{C})} + \sum_{w \in d} \log \frac{P(w|C)}{P(w|\overline{C})}$$

 Decide class C if the odds is greater than 1, i.e., if the log odds is greater than 0.
 So decision boundary is hyperplane:

$$\alpha + \sum_{w \in V} \beta_w \times n_w = 0$$

where $\alpha = \log \frac{P(C)}{P(\overline{C})}$; $\beta_w = \log \frac{P(w|C)}{P(w|\overline{C})}$; $n_w$ = # of occurrences of w in d

A nonlinear problem
 A linear classifier like Naïve Bayes does badly on this task
 kNN will do very well (assuming enough training data)


High Dimensional Data


 Pictures like the one at right are absolutely
misleading!
 Documents are zero along almost all axes
 Most document pairs are very far apart (i.e.,
not strictly orthogonal, but only share very
common words and a few scattered others)
 In classification terms: often document sets
are separable, for most any classification
 This is part of why linear classifiers are quite
successful in this domain

Introduction to Information Retrieval Sec.14.5

More Than Two Classes


 Any-of or multivalue classification
 Classes are independent of each other.
 A document can belong to 0, 1, or >1 classes.
 Decompose into n binary problems
 Quite common for documents
 One-of or multinomial or polytomous classification
 Classes are mutually exclusive.
 Each document belongs to exactly one class
 E.g., digit recognition is polytomous classification
 Digits are mutually exclusive


Set of Binary Classifiers: Any of


 Build a separator between each class and its
complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each
class.
 Apply decision criterion of classifiers independently
 Done

 Though maybe you could do better by considering dependencies between categories


Set of Binary Classifiers: One of


 Build a separator between each class and its
complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each
class.
 Assign document to class with:
 maximum score
 maximum confidence ?
 maximum probability ?
?
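
The difference between the any-of and one-of decision rules can be sketched on the outputs of a set of binary classifiers. The scores and the threshold of 0 below are made up for illustration.

```python
# Hypothetical scores of three one-versus-rest classifiers for one test document.
scores = {"government": 0.8, "science": 0.6, "arts": -0.4}
threshold = 0.0  # assumed decision threshold of each binary classifier

# Any-of: apply each classifier's decision independently (0, 1, or more classes).
any_of = [c for c, s in scores.items() if s > threshold]

# One-of: classes are mutually exclusive, so take the single maximum score.
one_of = max(scores, key=scores.get)

print(any_of, one_of)
```

Comparing raw scores across classifiers in the one-of case only makes sense if the scores are on a comparable scale, which is an assumption of this sketch.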


Summary: Representation of Text Categorization Attributes
 Representations of text are usually very high
dimensional (one feature for each word)
 High-bias algorithms that prevent overfitting in high-dimensional space should generally work best*
 For most text categorization tasks, there are many
relevant features and many irrelevant ones
 Methods that combine evidence from many or all
features (e.g. naive Bayes, kNN) often tend to work
better than ones that try to isolate just a few
relevant features*
*Although the results are a bit more mixed than often thought


Which classifier do I use for a given text classification problem?
 Is there a learning method that is optimal for all text
classification problems?
 No, because there is a tradeoff between bias and
variance.
 Factors to take into account:
 How much training data is available?
 How simple/complex is the problem? (linear vs. nonlinear
decision boundary)
 How noisy is the data?
 How stable is the problem over time?
 For an unstable problem, it’s better to use a simple and robust classifier.
Flat Clustering

Slides are mostly from Hinrich Schütze

March 27, 2017

Overview

1 Recap

2 Clustering: Introduction

3 Clustering in IR

4 K-means

5 Evaluation

6 How many clusters?
Take-away today

What is clustering?
Applications of clustering in information retrieval
K-means algorithm
Evaluation of clustering
How many clusters?

Clustering: Definition

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.

Data set with clear cluster structure

[Figure: scatter plot of a data set with clear cluster structure; horizontal axis from 0.0 to 2.0, vertical axis from 0.0 to 2.5.]

Propose algorithm for finding the cluster structure in this example
Classification vs. Clustering

Classification: supervised learning


Clustering: unsupervised learning
Classification: Classes are human-defined and part of the
input to the learning algorithm.
Clustering: Clusters are inferred from the data without human
input.
Many ways of influencing the outcome of clustering:
number of clusters,
similarity measure,
representation of documents,
...

The cluster hypothesis

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.

Van Rijsbergen’s original wording (1979): “closely associated documents tend to be relevant to the same requests”.

Applications of clustering in IR

application               what is clustered?        benefit
search result clustering  search results            more effective information presentation to user
Scatter-Gather            (subsets of) collection   alternative user interface: “search without typing”
collection clustering     collection                effective information presentation for exploratory browsing
cluster-based retrieval   collection                higher efficiency: faster search

Search result clustering for better navigation

Scatter-Gather

Global navigation: Yahoo

Global navigation: MESH (upper level)

Global navigation: MESH (lower level)

Navigational hierarchies: Manual vs. automatic creation

Note: Yahoo/MESH are not examples of clustering.


But they are well known examples for using a global hierarchy
for navigation.
Some examples for global navigation/exploration based on
clustering:
Cartia
Themescapes
Google News

Global navigation combined with visualization (1)

Global navigation combined with visualization (2)

Global clustering for navigation: Google News

http://news.google.com

Clustering for improving recall

To improve search recall:


Cluster docs in collection a priori
When a query matches a doc d, also return other docs in the
cluster containing d
Hope: if we do this, the query “car” will also return docs containing “automobile”
Because the clustering algorithm groups together docs
containing “car” with those containing “automobile”.
Both types of documents contain words like “parts”, “dealer”,
“mercedes”, “road trip”.

Desiderata for clustering

General goal: put related docs in the same cluster, put


unrelated docs in different clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set
we are clustering.
Initially, we will assume the number of clusters K is given.
Later: Semiautomatic methods for determining K
Secondary goals in clustering
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .

Flat vs. Hierarchical clustering

Flat algorithms
Usually start with a random (partial) partitioning of docs into
groups
Refine iteratively
Main algorithm: K-means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive

Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one


cluster.
More common and easier to do
Soft clustering: A document can belong to more than one
cluster.
Makes more sense for applications like creating browsable
hierarchies
You may want to put sneakers in two clusters:
sports apparel
shoes
You can only do that with a soft clustering approach.

Flat algorithms

Flat algorithms compute a partition of N documents into a


set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen
partitioning criterion
Global optimization: exhaustively enumerate partitions, pick
optimal one
Not tractable
Effective heuristic method: K-means algorithm

K-means

Perhaps the best known clustering algorithm


Simple, works well in many cases
Use as default / baseline for clustering documents

Document representations in clustering

Vector space model
Relatedness between vectors can be measured by Euclidean distance, cosine similarity, etc.

29 / 79
K-means: Basic idea

Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid
Recall the definition of the centroid:

\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}

where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

30 / 79
K-means pseudocode (µ_k is the centroid of ω_k)

K-means({x_1, ..., x_N}, K)
  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
  for k ← 1 to K
    do µ_k ← s_k
  while stopping criterion has not been met
    do for k ← 1 to K
         do ω_k ← {}
       for n ← 1 to N
         do j ← arg min_{j'} |µ_{j'} − x_n|
            ω_j ← ω_j ∪ {x_n}                      (reassignment of vectors)
       for k ← 1 to K
         do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x          (recomputation of centroids)
  return {µ_1, ..., µ_K}
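
The pseudocode can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance and uses "partition unchanged" as the stopping criterion (the pseudocode leaves the criterion open). The function names and the sample points are made up for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def kmeans(vectors, k, max_iterations=100, seed=0):
    """Return (centroids, assignment) after running K-means."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # K random documents serve as seeds
    assignment = None
    for _ in range(max_iterations):
        # Reassignment: each vector goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(centroids[j], x))
            for x in vectors
        ]
        if new_assignment == assignment:  # assumed stopping criterion
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [x for x, a in zip(vectors, assignment) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = centroid(members)
    return centroids, assignment


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

Ties between equally close centroids go to the lower cluster index, which is one way to satisfy the "ties are broken consistently" assumption used in the convergence argument.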

31 / 79
Worked Example: Set of points to be clustered

[Figure: scatter plot of the points to be clustered.]

What are the two clusters?
Compute the centroids of the clusters.

32 / 79
Worked Example: Random selection of initial centroids

[Figure: the same points with two randomly selected initial centroids marked ×.]

33 / 79
Worked Example: Assign points to closest center

[Figure: the points and the two current centroids (×), before reassignment.]

34 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

35 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

36 / 79
Worked Example: Assign points to closest centroid

[Figure: the points and the two current centroids (×), before reassignment.]

37 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

38 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

39 / 79
Worked Example: Assign points to closest centroid

[Figure: the points and the two current centroids (×), before reassignment.]

40 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

41 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

42 / 79
Worked Example: Assign points to closest centroid

[Figure: the points and the two current centroids (×), before reassignment.]

43 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

44 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

45 / 79
Worked Example: Assign points to closest centroid

[Figure: the points and the two current centroids (×), before reassignment.]

46 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

47 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

48 / 79
Worked Example: Assign points to closest centroid

[Figure: the points and the two current centroids (×), before reassignment.]

49 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

50 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

51 / 79
Worked Example: Assign points to closest centroid

[Figure: the points and the two current centroids (×), before reassignment.]

52 / 79
Worked Example: Assignment

[Figure: each point labeled 1 or 2 according to its closest centroid (×).]

53 / 79
Worked Example: Recompute cluster centroids

[Figure: points labeled 1 or 2, with the previous and the recomputed centroids marked ×.]

54 / 79
Worked Ex.: Centroids and assignments after convergence

[Figure: final centroids (×) and the points labeled 1 or 2 after convergence.]

55 / 79
K-means is guaranteed to converge: Proof

RSS = sum of all squared distances between each document vector and its closest centroid
RSS decreases during each reassignment step,
because each vector is moved to a closer centroid.
RSS decreases during each recomputation step
(see next slide).
There is only a finite number of clusterings.
Thus: We must reach a fixed point.
Assumption: Ties are broken consistently.
Finite set & monotonically decreasing → convergence

56 / 79
Re-computation decreases average distance

RSS = \sum_{k=1}^{K} RSS_k is the residual sum of squares (the “goodness” measure).

RSS_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2

\frac{\partial RSS_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2 (v_m - x_m) = 0

v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m

The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
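
A small numeric check of this result: for a made-up cluster of three 2-D vectors, moving the center away from the centroid in any direction should only increase RSS_k. The vectors and the step size are arbitrary assumptions chosen for illustration.

```python
def rss(v, cluster):
    """Residual sum of squares of a cluster around a candidate center v."""
    return sum(sum((vm - xm) ** 2 for vm, xm in zip(v, x)) for x in cluster)


cluster = [(1.0, 2.0), (3.0, 0.0), (5.0, 4.0)]  # hypothetical example vectors
mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))  # component-wise centroid

# Evaluate RSS_k at the centroid and at eight nearby perturbed centers.
step = 0.1
candidates = [
    (mean[0] + dx, mean[1] + dy)
    for dx in (-step, 0.0, step)
    for dy in (-step, 0.0, step)
]
best = min(candidates, key=lambda v: rss(v, cluster))
print(mean, rss(mean, cluster), best)
```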

57 / 79
K-means is guaranteed to converge

But we don’t know how long convergence will take!
If we don’t care about a few docs switching back and forth, then convergence is usually fast (fewer than 10–20 iterations).
However, complete convergence can take many more iterations.

58 / 79
Optimality of K-means

Convergence ≠ optimality
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.

59 / 79
Exercise: Suboptimal clustering

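[Figure: six points on a grid with the x-axis running from 0 to 4 and the y-axis from 0 to 3; d1, d2, d3 lie in the upper row (y = 2) and d4, d5, d6 in the lower row (y = 1).]
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds d_i, d_j?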

60 / 79
Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better ways of computing initial centroids:
Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
Use hierarchical clustering to find good seeds
Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
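
The last strategy (several random seed sets, keep the lowest RSS) might look like the sketch below. It assumes squared Euclidean distance; the compact `kmeans` helper, the function names, and the six sample points are illustrative assumptions, not taken from the lecture.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(vectors, k, rng, max_iterations=100):
    """Plain K-means from random seeds; returns (centroids, assignment)."""
    centroids = rng.sample(vectors, k)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda j: sqdist(centroids[j], x)) for x in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [x for x, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


def rss(vectors, centroids, assignment):
    """Sum of squared distances between each vector and its assigned centroid."""
    return sum(sqdist(x, centroids[a]) for x, a in zip(vectors, assignment))


def best_of_restarts(vectors, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the clustering with the lowest RSS."""
    rng = random.Random(seed)
    runs = [kmeans(vectors, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: rss(vectors, *run))


# Hypothetical points: two rows of three, with the third column set apart.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 1.0), (1.0, 2.0), (2.0, 2.0), (4.0, 2.0)]
centroids, assignment = best_of_restarts(points, k=2)
print(assignment, rss(points, centroids, assignment))
```

Restarts do not guarantee the optimum; they only make a bad seed set less likely to decide the final result.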

61 / 79
Time complexity of K-means

Computing the distance between two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
Recomputation step: O(NM) (we need to add each document’s < M values to one of the centroids)
Assume the number of iterations is bounded by I
Overall complexity: O(IKNM), linear in all important dimensions
M: vector length; N: number of documents; K: number of clusters.

62 / 79
What is a good clustering?

Internal criteria
RSS
Modularity in graph
But an internal criterion often does not evaluate the actual
utility of a clustering in the application.
Alternative: External criteria
Evaluate with respect to a human-defined classification

64 / 79
External criteria for clustering quality

Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
Goal: Clustering should reproduce the classes in the gold standard
(But we only want to reproduce how documents are divided into groups, not the class labels.)
First measure for how well we were able to reproduce the classes: purity

65 / 79
External criterion: Purity

purity(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|

Ω = {ω1, ω2, ..., ωK} is the set of clusters and C = {c1, c2, ..., cJ} is the set of classes.
For each cluster ωk: find the class cj with the most members nkj in ωk
Sum all nkj and divide by the total number of points

66 / 79
Example for computing purity

To compute purity:

\max_j |\omega_1 \cap c_j| = 5   (class x, cluster 1)
\max_j |\omega_2 \cap c_j| = 4   (class o, cluster 2)
\max_j |\omega_3 \cap c_j| = 3   (class ⋄, cluster 3)

purity = \frac{5 + 4 + 3}{17} \approx 0.71    (1)

cluster 1: x x x x x o
cluster 2: x o o o o ⋄
cluster 3: x x ⋄ ⋄ ⋄
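
The same computation as a short Python sketch. The cluster contents follow the example (cluster 1: five x and one o; cluster 2: one x, four o, one ⋄; cluster 3: two x and three ⋄); the label "d" for the ⋄ class and the function name are illustrative choices.

```python
from collections import Counter

# Gold-standard class labels of the 17 points, grouped by assigned cluster.
clusters = [
    ["x", "x", "x", "x", "x", "o"],  # cluster 1
    ["x", "o", "o", "o", "o", "d"],  # cluster 2 ("d" stands for the diamond class)
    ["x", "x", "d", "d", "d"],       # cluster 3
]


def purity(clusters):
    """Fraction of points that belong to the majority class of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(max(Counter(cluster).values()) for cluster in clusters)
    return majority / total


print(round(purity(clusters), 2))  # 0.71
```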
67 / 79
Another external criterion: Rand index

Purity can be increased easily by increasing K. A measure that does not have this problem: Rand index.
Definition:

RI = \frac{TP + TN}{TP + FP + FN + TN}

Based on a 2x2 contingency table of all pairs of documents:

                     same cluster            different clusters
same class           true positives (TP)     false negatives (FN)
different classes    false positives (FP)    true negatives (TN)

TP+FN+FP+TN is the total number of pairs.
TP+FN+FP+TN = \binom{N}{2} for N documents.
Example: \binom{17}{2} = 136 in the o/⋄/x example
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) . . .
. . . and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.
68 / 79
Rand Index: Example
The three clusters contain 6, 6, and 5 points.

TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:

TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20

Thus, FP = 40 − 20 = 20.

cluster 1: x x x x x o
cluster 2: x o o o o ⋄
cluster 3: x x ⋄ ⋄ ⋄

69 / 79
Rand measure for the o/⋄/x example

                     same cluster    different clusters
same class           TP = 20         FN = 24
different classes    FP = 20         TN = 72

RI = \frac{20 + 72}{20 + 20 + 24 + 72} = \frac{92}{136} \approx 0.68

cluster 1: x x x x x o
cluster 2: x o o o o ⋄
cluster 3: x x ⋄ ⋄ ⋄
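
The four pair counts can be checked by brute force over all 136 pairs. The sketch below encodes each point of the o/⋄/x example as a (cluster, class) tuple; the encoding, the label "d" for ⋄, and the function name are assumptions made for illustration.

```python
from itertools import combinations

# (cluster id, class label) for each of the 17 points; "d" stands for the diamond class.
points = (
    [(1, "x")] * 5 + [(1, "o")]
    + [(2, "x")] + [(2, "o")] * 4 + [(2, "d")]
    + [(3, "x")] * 2 + [(3, "d")] * 3
)


def pair_counts(points):
    """Count TP, FP, FN, TN over all unordered pairs of points."""
    tp = fp = fn = tn = 0
    for (cluster_a, class_a), (cluster_b, class_b) in combinations(points, 2):
        same_cluster = cluster_a == cluster_b
        same_class = class_a == class_b
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


tp, fp, fn, tn = pair_counts(points)
rand_index = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, round(rand_index, 2))  # 20 20 24 72 0.68
```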

70 / 79
Normalized mutual information (NMI)

NMI(\Omega, C) = \frac{I(\Omega; C)}{(H(\Omega) + H(C)) / 2}

I(\Omega; C) = \sum_k \sum_j P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k) P(c_j)}    (2)
             = \sum_k \sum_j \frac{|\omega_k \cap c_j|}{N} \log \frac{N |\omega_k \cap c_j|}{|\omega_k| |c_j|}    (3)

H(\Omega) = -\sum_k P(\omega_k) \log P(\omega_k)    (4)

H: entropy
I: mutual information
The denominator normalizes the value to lie between 0 and 1.
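
A sketch of the NMI computation on the o/⋄/x example, using the count form of the mutual information. The count matrix layout (rows are clusters, columns are the classes x, o, ⋄) and the function names are illustrative assumptions; the base of the logarithm cancels in the ratio.

```python
from math import log

# counts[k][j] = number of points of class j in cluster k (classes: x, o, diamond).
counts = [
    [5, 1, 0],
    [1, 4, 1],
    [2, 0, 3],
]


def entropy(sizes, n):
    """Entropy of a partition given its group sizes."""
    return -sum(s / n * log(s / n) for s in sizes if s > 0)


def nmi(counts):
    """Normalized mutual information between clusters (rows) and classes (columns)."""
    n = sum(map(sum, counts))
    cluster_sizes = [sum(row) for row in counts]
    class_sizes = [sum(col) for col in zip(*counts)]
    mutual_information = sum(
        n_kj / n * log(n * n_kj / (cluster_sizes[k] * class_sizes[j]))
        for k, row in enumerate(counts)
        for j, n_kj in enumerate(row)
        if n_kj > 0  # empty cells contribute nothing (0 log 0 = 0)
    )
    average_entropy = (entropy(cluster_sizes, n) + entropy(class_sizes, n)) / 2
    return mutual_information / average_entropy


print(round(nmi(counts), 2))  # 0.36
```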
71 / 79
Evaluation results for the o/⋄/x example

                     purity   NMI    RI     F5
lower bound          0.0      0.0    0.0    0.0
maximum              1.0      1.0    1.0    1.0
value for example    0.71     0.36   0.68   0.46

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).

72 / 79
How many clusters?

Number of clusters K is given in many applications.
E.g., there may be an external constraint on K. Example: In the case of Scatter-Gather, it was hard to show more than 10–20 clusters on a monitor in the 90s.
What if there is no external constraint? Is there a “right” number of clusters?
One way to go: define an optimization criterion
Given docs, find K for which the optimum is reached.
What optimization criterion can we use?
We can’t use RSS or average squared distance from centroid as criterion: it would always choose K = N clusters.

74 / 79
Exercise

Your job is to develop the clustering algorithms for a competitor to news.google.com
You want to use K-means clustering.
How would you determine K?

75 / 79
Simple objective function for K: Basic idea

Start with 1 cluster (K = 1)
Keep adding clusters (= keep increasing K)
Add a penalty for each new cluster
Then trade off cluster penalties against average squared distance from centroid
Choose the value of K with the best tradeoff

76 / 79
Simple objective function for K: Formalization

Given a clustering, define the cost for a document as (squared) distance to centroid
Define total distortion RSS(K) as sum of all individual document costs (corresponds to average distance)
Then: penalize each cluster with a cost λ
Thus for a clustering with K clusters, total cluster penalty is Kλ
Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ
Select K that minimizes (RSS(K) + Kλ)
Still need to determine good value for λ . . .
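
A sketch of how the choice of λ drives the selected K. The RSS values below are invented for illustration; in practice each RSS(K) would come from running K-means (ideally with several restarts) for that K.

```python
def choose_k(rss_by_k, lam):
    """Return the K that minimizes RSS(K) + K * lam."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + k * lam)


# Hypothetical RSS values for K = 1..6 (RSS shrinks as K grows).
rss_by_k = {1: 100.0, 2: 60.0, 3: 35.0, 4: 30.0, 5: 27.0, 6: 25.0}

for lam in (1.0, 4.0, 30.0):
    print(lam, choose_k(rss_by_k, lam))
```

With these made-up numbers a small penalty (λ = 1) favors the largest K, while a large penalty (λ = 30) favors K = 2, which is the tradeoff described above.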

77 / 79
Finding the “knee” in the curve

[Figure: residual sum of squares (y-axis, about 1750 to 1950) plotted against the number of clusters (x-axis, 2 to 10).]

Pick the number of clusters where the curve “flattens”. Here: 4 or 9.


78 / 79
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

Lecture 12: Clustering


Introduction to Information Retrieval Ch. 16

What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects
  ◦ Documents within a cluster should be similar.
  ◦ Documents from different clusters should be dissimilar.
• The most common form of unsupervised learning
  ◦ Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
• A common and important task that finds many applications in IR and other places
Introduction to Information Retrieval Ch. 16

A data set with clear cluster structure

• How would you design an algorithm for finding the three clusters in this case?
Introduction to Information Retrieval Sec. 16.1

Applications of clustering in IR
• Whole corpus analysis/navigation
  ◦ Better user interface: search without typing
• For improving recall in search applications
  ◦ Better search results (like pseudo RF)
• For better navigation of search results
  ◦ Effective “user recall” will be higher
• For speeding up vector space retrieval
  ◦ Cluster-based retrieval gives faster search

Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering

www.yahoo.com/Science … (30)
  agriculture: dairy, crops, forestry, agronomy, ...
  biology: botany, cell, evolution, ...
  physics: magnetism, relativity, ...
  CS: AI, HCI, courses, ...
  space: craft, missions, ...

Google News: automatic clustering gives an effective news presentation metaphor
Introduction to Information Retrieval Sec. 16.1

Scatter/Gather: Cutting, Karger, and Pedersen



For visualizing a document collection and its themes
• Wise et al, “Visualizing the non-visual” PNNL
• ThemeScapes, Cartia
  ◦ [Mountain height = cluster size]
Introduction to Information Retrieval Sec. 16.1

For improving search recall

• Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
• Therefore, to improve search recall:
  ◦ Cluster docs in the corpus a priori
  ◦ When a query matches a doc D, also return other docs in the cluster containing D
• The hope if we do this: the query “car” will also return docs containing “automobile”
  ◦ Because clustering grouped together docs containing “car” with those containing “automobile”.

Why might this happen?



yippy.com – grouping search results


Introduction to Information Retrieval Sec. 16.2

Issues for clustering

• Representation for clustering
  ◦ Document representation
  ◦ Vector space? Normalization?
  ◦ Centroids aren’t length normalized
  ◦ Need a notion of similarity/distance
• How many clusters?
  ◦ Fixed a priori?
  ◦ Completely data driven?
  ◦ Avoid “trivial” clusters: too large or too small
  ◦ If a cluster is too large, then for navigation purposes you’ve wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
• Ideal: semantic similarity.
• Practical: term-statistical similarity
  ◦ We will use cosine similarity.
  ◦ Docs as vectors.
  ◦ For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
  ◦ We will mostly speak of Euclidean distance
  ◦ But real implementations use cosine similarity

Clustering Algorithms
• Flat algorithms
  ◦ Usually start with a random (partial) partitioning
  ◦ Refine it iteratively
  ◦ K-means clustering
  ◦ (Model-based clustering)
• Hierarchical algorithms
  ◦ Bottom-up, agglomerative
  ◦ (Top-down, divisive)

Hard vs. soft clustering

• Hard clustering: Each document belongs to exactly one cluster
  ◦ More common and easier to do
• Soft clustering: A document can belong to more than one cluster.
  ◦ Makes more sense for applications like creating browsable hierarchies
  ◦ You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  ◦ You can only do that with a soft clustering approach.
• We won’t do soft clustering today. See IIR 16.5, 18

Partitioning Algorithms
• Partitioning method: Construct a partition of n documents into a set of K clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen partitioning criterion
  ◦ Globally optimal solution
  ◦ Intractable for many objective functions
  ◦ It would require exhaustively enumerating all partitions
  ◦ Effective heuristic methods: K-means and K-medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
Introduction to Information Retrieval Sec. 16.4

K-Means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

  \vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

• Reassignment of instances to clusters is based on distance to the current cluster centroids.
  ◦ (Or one can equivalently phrase it in terms of similarities)
Introduction to Information Retrieval Sec. 16.4

K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cj such that dist(di, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
Introduction to Information Retrieval Sec. 16.4

K Means Example (K=2)
[Figure: animation of K-means on a small 2-D point set; centroids are marked x. The steps shown are:]
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Introduction to Information Retrieval Sec. 16.4

Termination conditions
• Several possibilities, e.g.,
  ◦ A fixed number of iterations.
  ◦ Doc partition unchanged.
  ◦ Centroid positions don’t change.

Does this mean that the docs in a cluster are unchanged?
Introduction to Information Retrieval Sec. 16.4

Convergence
• Why should the K-means algorithm ever reach a fixed point?
  ◦ A state in which clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  ◦ EM is known to converge.
  ◦ Number of iterations could be large.
  ◦ But in practice usually isn’t
Introduction to Information Retrieval Sec. 16.4

Convergence of K-Means
• Define goodness measure of cluster k as sum of squared distances from cluster centroid:
  ◦ Gk = Σi (di – ck)^2 (sum over all di in cluster k)
• G = Σk Gk
• Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Introduction to Information Retrieval Sec. 16.4

Convergence of K-Means
• Recomputation monotonically decreases each Gk since (mk is the number of members in cluster k):
  ◦ Σ (di – a)^2 reaches its minimum for:
  ◦ Σ –2(di – a) = 0
  ◦ Σ di = Σ a
  ◦ mk a = Σ di
  ◦ a = (1/mk) Σ di = ck
• K-means typically converges quickly
Introduction to Information Retrieval Sec. 16.4

Time Complexity
• Computing distance between two docs is O(M) where M is the dimensionality of the vectors.
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Assume these two steps are each done once per iteration, for I iterations: O(IKNM).
Introduction to Information Retrieval Sec. 16.4

Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
  ◦ Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
  ◦ Try out multiple starting points
  ◦ Initialize with the results of another method.

Example showing sensitivity to seeds [Figure: six points labeled A to F]:
In the above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}.
If you start with D and F you converge to {A,B,D,E} and {C,F}.
Introduction to Information Retrieval Sec. 16.4

K-means issues, variations, etc.

• Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve speed of convergence of K-means
• Assumes clusters are spherical in vector space
  ◦ Sensitive to coordinate changes, weighting etc.
• Disjoint and exhaustive
  ◦ Doesn’t have a notion of “outliers” by default
  ◦ But can add outlier filtering

Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters

How Many Clusters?

• Number of clusters K is given
  ◦ Partition n docs into predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
  ◦ Given docs, partition into an “appropriate” number of subsets.
  ◦ E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
• Can usually take an algorithm for one flavor and convert to the other.

K not specified in advance

• Say, the results of a query.
• Solve an optimization problem: penalize having lots of clusters
  ◦ application dependent, e.g., compressed summary of search results list.
• Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance

• Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid
• Define the Total Benefit to be the sum of the individual doc Benefits.

Why is there always a clustering of Total Benefit n?



Penalize lots of clusters

• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total Cost is KC.
• Define the Value of a clustering to be Total Benefit - Total Cost.
• Find the clustering of highest value, over all choices of K.
  ◦ Total benefit increases with increasing K. But can stop when it doesn’t increase by “much”. The Cost term enforces this.
Introduction to Information Retrieval Ch. 17

Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

  animal
    vertebrate: fish, reptile, amphib., mammal
    invertebrate: worm, insect, crustacean

• One approach: recursive application of a partitional clustering algorithm.

Dendrogram: Hierarchical Clustering

• Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
Introduction to Information Retrieval Sec. 17.1

Hierarchical Agglomerative Clustering (HAC)
• Starts with each doc in a separate cluster
  ◦ then repeatedly joins the closest pair of clusters, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still “hard” and induce a partition
Introduction to Information Retrieval Sec. 17.2

Closest pair of clusters

• Many variants to defining closest pair of clusters
• Single-link
  ◦ Similarity of the most cosine-similar pair of points
• Complete-link
  ◦ Similarity of the “furthest” points, the least cosine-similar
• Centroid
  ◦ Clusters whose centroids (centers of gravity) are the most cosine-similar
• Average-link
  ◦ Average cosine between pairs of elements
Introduction to Information Retrieval Sec. 17.2

Single Link Agglomerative Clustering

• Use maximum similarity of pairs:

  sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

• Can result in “straggly” (long and thin) clusters due to chaining effect.
• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

  sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k),\, sim(c_j, c_k))


Introduction to Information Retrieval Sec. 17.2

Single Link Example


Introduction to Information Retrieval Sec. 17.2

Complete Link
• Use minimum similarity of pairs:

  sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)

• Makes “tighter,” spherical clusters that are typically preferable.
• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

  sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k),\, sim(c_j, c_k))

[Figure: three clusters Ci, Cj, and Ck.]
Introduction to Information Retrieval Sec. 17.2

Complete Link Example


Introduction to Information Retrieval Sec. 17.2.1

Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N^2).
• In each of the subsequent N − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(N^2) performance, computing similarity to each other cluster must be done in constant time.
  ◦ Often O(N^3) if done naively or O(N^2 log N) if done more cleverly
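
For illustration, a deliberately naive HAC sketch that recomputes all cluster-pair similarities in every merge step (so it falls in the slow, roughly cubic category). It assumes cosine similarity; passing `max` as the linkage gives single-link and `min` gives complete-link. The function names and the five sample vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def hac(docs, linkage=max, sim=cosine):
    """Naive HAC; returns the merge history as a list of (cluster_a, cluster_b).

    linkage=max gives single-link, linkage=min gives complete-link.
    Clusters are tuples of document indices.
    """
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest linkage similarity.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: linkage(
                sim(docs[i], docs[j]) for i in pair[0] for j in pair[1]
            ),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


# Hypothetical 2-D document vectors.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.9), (0.6, 0.5)]
print(hac(docs, linkage=max))  # single-link merge history
print(hac(docs, linkage=min))  # complete-link merge history
```

The merge history is the binary tree (dendrogram) in list form: N − 1 merges for N documents.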
Introduction to Information Retrieval Sec. 17.3

Group Average
• Similarity of two clusters = average similarity of all pairs within merged cluster.

  sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| (|c_i \cup c_j| - 1)} \sum_{\vec{x} \in (c_i \cup c_j)} \sum_{\vec{y} \in (c_i \cup c_j): \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y})

• Compromise between single and complete link.
• Two options:
  ◦ Averaged across all ordered pairs in the merged cluster
  ◦ Averaged over all pairs between the two original clusters
• No clear difference in efficacy
Introduction to Information Retrieval Sec. 17.3

Computing Group Average Similarity

• Always maintain sum of vectors in each cluster.

  \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}

• Compute similarity of clusters in constant time:

  sim(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}
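
A quick check that the vector-sum shortcut agrees with the brute-force average over all ordered pairs. The sketch assumes length-normalized document vectors with the dot product as similarity (the shortcut subtracts one self-similarity of 1 per document, so it relies on unit length); the sample vectors and function names are made up.

```python
from itertools import permutations
from math import sqrt


def normalize(x):
    """Scale a vector to unit length (the shortcut below relies on this)."""
    norm = sqrt(sum(a * a for a in x))
    return tuple(a / norm for a in x)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def vector_sum(cluster):
    """Component-wise sum of all vectors in a cluster."""
    return tuple(sum(c) for c in zip(*cluster))


def group_average_bruteforce(ci, cj):
    """Average dot product over all ordered pairs of distinct docs in ci ∪ cj."""
    merged = ci + cj
    n = len(merged)
    return sum(dot(x, y) for x, y in permutations(merged, 2)) / (n * (n - 1))


def group_average_from_sums(sum_i, size_i, sum_j, size_j):
    """Same quantity computed only from the per-cluster vector sums and sizes."""
    s = tuple(a + b for a, b in zip(sum_i, sum_j))
    n = size_i + size_j
    return (dot(s, s) - n) / (n * (n - 1))


ci = [normalize(v) for v in [(1.0, 0.0, 1.0), (2.0, 1.0, 0.0)]]  # hypothetical docs
cj = [normalize(v) for v in [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 3.0)]]

slow = group_average_bruteforce(ci, cj)
fast = group_average_from_sums(vector_sum(ci), len(ci), vector_sum(cj), len(cj))
print(slow, fast)
```

"Constant time" here means independent of the cluster sizes; the dot product itself still costs time proportional to the vector length.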
Introduction to Information Retrieval Sec. 16.3

What Is A Good Clustering?

• Internal criterion: A good clustering will produce high quality clusters in which:
  ◦ the intra-class (that is, intra-cluster) similarity is high
  ◦ the inter-class similarity is low
  ◦ The measured quality of a clustering depends on both the document representation and the similarity measure used
Introduction to Information Retrieval Sec. 16.3

External criteria for clustering quality

• Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
• Assesses a clustering with respect to ground truth … requires labeled data
• Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω1, ω2, …, ωK with ni members.
Introduction to Information Retrieval Sec. 16.3

External Evaluation of Cluster Quality

• Simple measure: purity, the ratio between the size of the dominant class in the cluster ωi and the size of cluster ωi

  Purity(\omega_i) = \frac{1}{n_i} \max_j (n_{ij}), \quad j \in C

• Biased because having n clusters maximizes purity
• Others are entropy of classes in clusters (or mutual information between classes and clusters)
Introduction to Information Retrieval Sec. 16.3

Purity example

[Figure: 17 points from three classes grouped into Cluster I (6 points), Cluster II (6 points), and Cluster III (5 points).]

Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6

Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6

Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5


Introduction to Information Retrieval Sec. 16.3

Rand Index measures agreement between pair decisions. Here RI = 0.68

Number of point pairs                Same cluster in clustering    Different clusters in clustering
Same class in ground truth           20                            24
Different classes in ground truth    20                            72
Introduction to Information Retrieval Sec. 16.3

Rand index and Cluster F-measure

RI = \frac{A + D}{A + B + C + D}

Compare with standard Precision and Recall:

P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}

People also define and use a cluster F-measure, which is probably a better measure.
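
As a sketch, the pair-based precision, recall, and F-measure for the running example, where the pair counts are A = 20 (same class, same cluster), B = 20 (different classes, same cluster), and C = 24 (same class, different clusters). The function name and the use of the general F_β formula are assumptions made for illustration.

```python
def pair_f_measure(tp, fp, fn, beta=1.0):
    """F_beta over pair decisions; beta > 1 weights recall more than precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Pair counts for the example: A = TP = 20, B = FP = 20, C = FN = 24.
print(round(pair_f_measure(20, 20, 24, beta=1.0), 2))  # 0.48
print(round(pair_f_measure(20, 20, 24, beta=5.0), 2))  # 0.46
```

Choosing β > 1 penalizes false negatives (separating similar documents) more strongly than false positives.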

Final word and resources

• In clustering, clusters are inferred from the data without human input (unsupervised learning)
• However, in practice, it’s a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

• Resources
  ◦ IIR 16 except 16.5
  ◦ IIR 17.1–17.3
