
Document Clustering

What is Cluster Analysis?

 Clustering: the process of grouping a set of objects into classes of similar objects
 Documents within a cluster should be similar.
 Documents from different clusters should be dissimilar.
 The most common form of unsupervised learning
 Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
 A common and important task that finds many applications
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation

[Figure: three clusters; between-cluster variation = maximize, within-cluster variation = minimize]
A data set with clear cluster structure

How would you design an algorithm for finding the three clusters in this case?
Hard vs. soft clustering

Hard clustering: Each document belongs to exactly one cluster.
• More common and easier to do
Soft clustering: A document can belong to more than one cluster.
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
• You can only do that with a soft clustering approach (illustrated in the sketch below).
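To make the distinction concrete, here is a minimal sketch contrasting the two modes; the library choice (scikit-learn) and the toy points are assumptions for illustration, not part of the slides:

```python
# Minimal sketch of hard vs. soft assignment (library and data are
# illustrative assumptions, not prescribed by the slides).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.1, 0.2], [0.0, 0.1], [5.0, 5.1], [5.2, 4.9], [2.5, 2.6]])

# Hard clustering: each document/point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels)           # one label per point

# Soft clustering: each point gets a membership probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))  # rows sum to 1; the middle point may split
```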
Four Basic Questions In A Cluster Analysis

What are the variables we should use for clustering?

How do we measure similarity?
• We require a method of simultaneously comparing observations on the clustering variables. Several methods are possible, including the correlation between objects or a measure of their proximity in two-dimensional space, such that the distance between observations indicates similarity (see the sketch below).

How do we form clusters?
• No matter how similarity is measured, the procedure must group those observations that are most similar into a cluster, thereby determining the cluster membership of each observation for each set of clusters formed.

How many groups do we form?
• The final task is to select one cluster solution (i.e., set of clusters) as the final solution. In doing so, the researcher faces a trade-off: fewer clusters and less homogeneity within clusters versus a larger number of clusters and more within-group homogeneity.
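As a small illustration of the similarity question, the sketch below compares two observations using both notions mentioned above, correlation and Euclidean distance (the values are invented):

```python
# Two ways of comparing observations on the clustering variables:
# distance-based proximity and correlation between objects.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])   # observation 1 (illustrative)
b = np.array([1.0, 3.0, 5.0, 9.0])   # observation 2 (illustrative)

euclidean = np.linalg.norm(a - b)       # smaller distance => more similar
correlation = np.corrcoef(a, b)[0, 1]   # closer to 1 => more similar profile
print(euclidean, correlation)
```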
Conducting Cluster Analysis

Formulate the Problem and Identify Variables

Select a Distance Measure

Select a Clustering Procedure

Decide on the Number of Clusters

Interpret and Profile Clusters


A Classification of Clustering Procedures

Clustering Procedures
• Hierarchical
  • Agglomerative
    • Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
    • Variance Methods: Ward's Method
    • Centroid Methods
  • Divisive
• Nonhierarchical (e.g., K-Means)
  • Sequential Threshold
  • Parallel Threshold
  • Optimizing Partitioning
• Other
  • Two-Step
Two Most Popular Approaches to Partitioning Observations

Hierarchical
• Most common approach: all objects start as separate clusters and are then joined sequentially, two clusters at a time, until only a single cluster remains.

Non-hierarchical
• The number of clusters is specified by the analyst, and the set of objects is then partitioned into that number of groups (see the sketch below).
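A minimal sketch of the non-hierarchical case, with k-means as a representative method (scikit-learn and the random data are illustrative assumptions):

```python
# Non-hierarchical partitioning: the analyst fixes the number of clusters
# up front; the algorithm assigns every object to one of those groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)  # illustrative objects

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # group membership for the first 10 objects
print(km.cluster_centers_)   # one centroid per requested cluster
```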
Two Types of Hierarchical Clustering Procedures

Agglomerative Methods
• Buildup: all observations start as individual clusters and are joined together sequentially.

Divisive Methods
• Breakdown: initially all observations are in a single cluster, then divided into smaller clusters.
Hierarchical Clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

[Example taxonomy: animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]

Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)
Dendrogram: Hierarchical Clustering

• Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster (see the sketch below).
How Do Agglomerative Hierarchical Approaches Work?

A multi-step process
• Start with all documents as their own cluster.
• Using the selected similarity measure and agglomerative algorithm, combine the two most similar observations into a new cluster, now containing two observations.
• Repeat the clustering procedure using the similarity measure/agglomerative algorithm to combine the two most similar observations or clusters (i.e., combinations of observations) into another new cluster.
• Continue the process until all observations are in a single cluster (see the sketch below).
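The loop above can be written out directly; the sketch below uses single-link (minimum pairwise) Euclidean distance as the similarity measure, which is one choice among the variants discussed next:

```python
# From-scratch sketch of the agglomerative process described above.
import numpy as np

def agglomerate(points):
    clusters = [[i] for i in range(len(points))]   # every point starts alone
    while len(clusters) > 1:                       # repeat until one cluster
        best = None
        for i in range(len(clusters)):             # find the most similar
            for j in range(i + 1, len(clusters)):  # pair of clusters
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print(f"merge {clusters[i]} + {clusters[j]} at distance {d:.3f}")
        clusters[i] = clusters[i] + clusters[j]    # combine the pair
        del clusters[j]
    return clusters

agglomerate(np.random.RandomState(0).rand(6, 2))   # illustrative data
```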
Closest pair of clusters

Many variants to defining the closest pair of clusters

Single-link
• Similarity of the most cosine-similar pair of points
Complete-link
• Similarity of the "furthest" points, the least cosine-similar pair
Centroid
• Clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link
• Average cosine similarity between all pairs of elements
Ward's method
• Minimum variance method: merge the documents or clusters whose combination minimizes the increase in within-group variance.
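These variants map onto SciPy's linkage methods; the sketch below assumes that mapping. SciPy works with distances, so cosine similarity is expressed as cosine distance (1 - cosine similarity); centroid and Ward's method require Euclidean geometry in SciPy and are given the raw vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.random.RandomState(2).rand(8, 3)   # illustrative document vectors
D = pdist(X, metric="cosine")             # pairwise 1 - cosine similarity

Z_single   = linkage(D, method="single")    # most-similar pair
Z_complete = linkage(D, method="complete")  # least-similar ("furthest") pair
Z_average  = linkage(D, method="average")   # mean over all cross-cluster pairs
Z_centroid = linkage(X, method="centroid")  # distance between centroids
Z_ward     = linkage(X, method="ward")      # minimum variance increase
```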
Comparing the Agglomerative Algorithms

Single linkage
• Probably the most versatile algorithm, but poorly delineated cluster structures within the data produce unacceptable snakelike "chains" for clusters.
Complete linkage
• Eliminates the chaining problem, but considers only the outermost observations in a cluster and is thus impacted by outliers.
Average linkage
• Generates clusters with small within-cluster variation and is less affected by outliers.
Centroid linkage
• Like average linkage, is less affected by outliers.
Ward's method
• Most appropriate when the researcher expects somewhat equally sized clusters, but easily distorted by outliers.
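One practical way to compare these variants is to run each on the same data and inspect a quality score; the sketch below uses scikit-learn and the silhouette score, both of which are illustrative choices rather than something the slides prescribe:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.random.RandomState(3).rand(60, 2)   # illustrative data

for link in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, round(silhouette_score(X, labels), 3))  # higher is better
```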
Clustering Algorithms -- Agglomerative

Provides a method for combining clusters which have more than one observation

Most widely used algorithms
• Single Linkage (nearest neighbor) – shortest distance from any object in one cluster to any object in the other.
• Complete Linkage (farthest neighbor) – based on the maximum distance between observations in each cluster.
• Average Linkage – based on the average similarity of all individuals in a cluster.
• Centroid Method – measures the distance between cluster centroids.
• Ward's Method – based on the total sum of squares within clusters.
Hierarchical Clustering–Beer Data Case

Amalgamation or linkage rules: Once several objects have been linked together, how do we determine the distances between those new clusters?

Single linkage (nearest neighbor):
• The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
• This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."
• In single linkage, individuals at opposite ends of the chain may be dissimilar but part of the same cluster.

Complete linkage (furthest neighbor):
• The distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").



Single-Linkage Versus Complete Linkage

Linkage Methods of Clustering

Single linkage: Similarity based only on the two closest observations (minimum distance between Cluster 1 and Cluster 2).

Complete linkage: Similarity based only on the two farthest observations (maximum distance between Cluster 1 and Cluster 2).

Average linkage: Similarity based on the average distance between observations in Cluster 1 and Cluster 2.
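As a concrete sketch, the three inter-cluster distances above can be computed directly with NumPy (the two small clusters are invented for illustration):

```python
# The three inter-cluster distances above, computed directly
# (the two small clusters are invented for illustration).
import numpy as np

cluster1 = np.array([[0.0, 0.0], [1.0, 0.5]])
cluster2 = np.array([[4.0, 4.0], [5.0, 5.5], [4.5, 4.0]])

# all pairwise distances between points in the two clusters
D = np.linalg.norm(cluster1[:, None, :] - cluster2[None, :, :], axis=-1)

print("single   (minimum distance):", D.min())
print("complete (maximum distance):", D.max())
print("average  (average distance):", D.mean())
```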
Other Agglomerative Clustering Methods

[Fig. 20.6: diagrams of Ward's Procedure and the Centroid Method]



Cosine Similarity

The cosine similarity between X1 and X2 is given by


$$\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{X_1 \cdot X_2}{\|X_1\|\,\|X_2\|} = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^2}\;\sqrt{\sum_{i=1}^{n} X_{2i}^2}}$$

 In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure captures the angle between the two vectors (hence the name vector space model).
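A direct NumPy translation of this formula (a minimal sketch; the sample vectors are illustrative):

```python
# Direct translation of the formula above (sample vectors are illustrative).
import numpy as np

def cosine_similarity(x1, x2):
    # dot product of the two n-dimensional vectors divided by the
    # product of their Euclidean norms, i.e. cos(theta)
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([2.0, 4.0, 6.0])))  # parallel vectors -> 1.0
```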
Cosine Similarities
 Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distance at which clusters were joined. The dendrogram is read from left to right.

Basic Pattern: Word Cloud

 The word cloud was generated using R (a Python sketch follows below)

 The Samsung users mentioned other models like iPhone, HTC, etc.

 They also mentioned features like battery, screen, camera, etc., and abstract features like look, design, etc.

 They also mentioned attitudes towards the feature/phone, such as ugly, happy, bad, and so on.
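The slide's cloud was produced in R; for a comparable sketch in Python, the third-party wordcloud package can be used (the package choice and the sample text are assumptions, not from the slides):

```python
from wordcloud import WordCloud   # third-party package, illustrative choice
import matplotlib.pyplot as plt

# stand-in text; the slide's cloud was built from actual Samsung reviews
reviews = "battery screen camera design look iphone HTC samsung ugly happy bad"

cloud = WordCloud(width=400, height=300, background_color="white").generate(reviews)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```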

Cluster Analysis of Documents

 X-means clustering by the automatic identification module of RapidMiner produced interesting results

 Produced two clusters

 Words with high TF-IDF listed in the two clusters (a stand-in sketch follows the table below):

Cluster_0: Interesting, Excellent, Looks, Love, Innovative, Thank, Amazing, Beat, Goodbye, Beautiful, Hope, Boring, Worried, Listen, Wifi, Social, performance, Awesome, Uglier

Cluster_1: android, gb, processor, battery, core, devices, os, ram, memory, card, apple, Iphone, ipad, browser, movie, ios, applications, game, display, drawback
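A rough stand-in for this pipeline: the slide used RapidMiner's X-means, which picks the number of clusters automatically; scikit-learn has no X-means, so the sketch below fixes k = 2, and the review texts are invented placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [                      # placeholder reviews, not the slide's data
    "love this beautiful amazing phone",
    "awesome looks, excellent and innovative",
    "android os with quad core processor and gb of ram",
    "battery, display and memory card beat the iphone",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)            # words weighted by TF-IDF

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# words with the highest TF-IDF weight in each cluster centroid
terms = np.array(tfidf.get_feature_names_out())
for c in range(2):
    top = km.cluster_centers_[c].argsort()[::-1][:5]
    print(f"Cluster_{c}:", ", ".join(terms[top]))
```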
Findings From Two Clusters

 Two Types of Reviewers
 Emotionally expressive (nearly 62%)
 Tech lovers (38%)

 Word Cloud Interpretation
 Comparative assessments (among brands)
 Technical assessments (feature-wise, product-wise)
 Attitude-oriented expressions (love, excellent, boring)
