
Document Clustering

What is Cluster Analysis?

 Clustering: the process of grouping a set of objects into classes of similar objects
 Documents within a cluster should be similar.
 Documents from different clusters should be dissimilar.
 The most common form of unsupervised learning
 Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
 A common and important task that finds many applications
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation

[Figure: three clusters; between-cluster variation = maximize, within-cluster variation = minimize]
A data set with clear cluster structure

How would you design an algorithm for finding the three clusters in this case?
Hard vs. soft clustering

Hard clustering: Each document belongs to exactly one cluster.
• More common and easier to do
Soft clustering: A document can belong to more than one cluster.
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
• You can only do that with a soft clustering approach (illustrated in the sketch below).
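To make the distinction concrete, here is a minimal sketch contrasting the two modes; the library choice (scikit-learn) and the toy points are assumptions for illustration, not part of the slides:

```python
# Minimal sketch of hard vs. soft assignment (library and data are
# illustrative assumptions, not prescribed by the slides).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.1, 0.2], [0.0, 0.1], [5.0, 5.1], [5.2, 4.9], [2.5, 2.6]])

# Hard clustering: each document/point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels)           # one label per point

# Soft clustering: each point gets a membership probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))  # rows sum to 1; the middle point may split
```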
Four Basic Questions In A Cluster Analysis

What are the variables we should use for clustering?

How do we measure similarity?
• We require a method of simultaneously comparing observations on the clustering variables. Several methods are possible, including the correlation between objects or a measure of their proximity in two-dimensional space, such that the distance between observations indicates similarity (see the sketch below).

How do we form clusters?
• No matter how similarity is measured, the procedure must group those observations that are most similar into a cluster, thereby determining the cluster membership of each observation for each set of clusters formed.

How many groups do we form?
• The final task is to select one cluster solution (i.e., set of clusters) as the final solution. In doing so, the researcher faces a trade-off: fewer clusters and less homogeneity within clusters versus a larger number of clusters and more within-group homogeneity.
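As a small illustration of the similarity question, the sketch below compares two observations using both notions mentioned above, correlation and Euclidean distance (the values are invented):

```python
# Two ways of comparing observations on the clustering variables:
# distance-based proximity and correlation between objects.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])   # observation 1 (illustrative)
b = np.array([1.0, 3.0, 5.0, 9.0])   # observation 2 (illustrative)

euclidean = np.linalg.norm(a - b)       # smaller distance => more similar
correlation = np.corrcoef(a, b)[0, 1]   # closer to 1 => more similar profile
print(euclidean, correlation)
```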
Conducting Cluster Analysis

Formulate the Problem and Identify Variables

Select a Distance Measure

Select a Clustering Procedure

Decide on the Number of Clusters

Interpret and Profile Clusters


A Classification of Clustering Procedures

Clustering Procedures
• Hierarchical
  • Agglomerative
    • Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
    • Variance Methods: Ward's Method
    • Centroid Methods
  • Divisive
• Nonhierarchical (e.g., K-Means)
  • Sequential Threshold
  • Parallel Threshold
  • Optimizing Partitioning
• Other
  • Two-Step
Two Most Popular Approaches to Partitioning Observations

Hierarchical
• Most common approach: all objects start as separate clusters and are then joined sequentially, two clusters at a time, until only a single cluster remains.

Non-hierarchical
• The number of clusters is specified by the analyst, and the set of objects is then partitioned into that number of groups (see the sketch below).
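A minimal sketch of the non-hierarchical case, with k-means as a representative method (scikit-learn and the random data are illustrative assumptions):

```python
# Non-hierarchical partitioning: the analyst fixes the number of clusters
# up front; the algorithm assigns every object to one of those groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)  # illustrative objects

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # group membership for the first 10 objects
print(km.cluster_centers_)   # one centroid per requested cluster
```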
Two Types of Hierarchical Clustering Procedures

Agglomerative Methods
• Buildup: all observations start as individual clusters and are joined together sequentially.

Divisive Methods
• Breakdown: initially all observations are in a single cluster, then divided into smaller clusters.
Hierarchical Clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

[Example taxonomy: animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]

Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)
Dendrogram: Hierarchical Clustering

• Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster (see the sketch below).
How Do Agglomerative Hierarchical Approaches Work?

A multi-step process
• Start with all documents as their own cluster.
• Using the selected similarity measure and agglomerative algorithm, combine the two most similar observations into a new cluster, now containing two observations.
• Repeat the clustering procedure using the similarity measure/agglomerative algorithm to combine the two most similar observations or clusters (i.e., combinations of observations) into another new cluster.
• Continue the process until all observations are in a single cluster (see the sketch below).
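The loop above can be written out directly; the sketch below uses single-link (minimum pairwise) Euclidean distance as the similarity measure, which is one choice among the variants discussed next:

```python
# From-scratch sketch of the agglomerative process described above.
import numpy as np

def agglomerate(points):
    clusters = [[i] for i in range(len(points))]   # every point starts alone
    while len(clusters) > 1:                       # repeat until one cluster
        best = None
        for i in range(len(clusters)):             # find the most similar
            for j in range(i + 1, len(clusters)):  # pair of clusters
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print(f"merge {clusters[i]} + {clusters[j]} at distance {d:.3f}")
        clusters[i] = clusters[i] + clusters[j]    # combine the pair
        del clusters[j]
    return clusters

agglomerate(np.random.RandomState(0).rand(6, 2))   # illustrative data
```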
Closest pair of clusters

Many variants to defining the closest pair of clusters

Single-link
• Similarity of the most cosine-similar pair of points
Complete-link
• Similarity of the "furthest" points, the least cosine-similar pair
Centroid
• Clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link
• Average cosine similarity between all pairs of elements
Ward's method
• Minimum variance method: merge the documents or clusters whose combination minimizes the increase in within-group variance.
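These variants map onto SciPy's linkage methods; the sketch below assumes that mapping. SciPy works with distances, so cosine similarity is expressed as cosine distance (1 - cosine similarity); centroid and Ward's method require Euclidean geometry in SciPy and are given the raw vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.random.RandomState(2).rand(8, 3)   # illustrative document vectors
D = pdist(X, metric="cosine")             # pairwise 1 - cosine similarity

Z_single   = linkage(D, method="single")    # most-similar pair
Z_complete = linkage(D, method="complete")  # least-similar ("furthest") pair
Z_average  = linkage(D, method="average")   # mean over all cross-cluster pairs
Z_centroid = linkage(X, method="centroid")  # distance between centroids
Z_ward     = linkage(X, method="ward")      # minimum variance increase
```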
Comparing the Agglomerative Algorithms

Single linkage
• Probably the most versatile algorithm, but poorly delineated cluster structures within the data produce unacceptable snakelike "chains" for clusters.
Complete linkage
• Eliminates the chaining problem, but considers only the outermost observations in a cluster and is thus impacted by outliers.
Average linkage
• Generates clusters with small within-cluster variation and is less affected by outliers.
Centroid linkage
• Like average linkage, is less affected by outliers.
Ward's method
• Most appropriate when the researcher expects somewhat equally sized clusters, but easily distorted by outliers.
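One practical way to compare these variants is to run each on the same data and inspect a quality score; the sketch below uses scikit-learn and the silhouette score, both of which are illustrative choices rather than something the slides prescribe:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.random.RandomState(3).rand(60, 2)   # illustrative data

for link in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, round(silhouette_score(X, labels), 3))  # higher is better
```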
Clustering Algorithms -- Agglomerative

Provides a method for combining clusters which have more than one observation

Most widely used algorithms
• Single Linkage (nearest neighbor) – shortest distance from any object in one cluster to any object in the other.
• Complete Linkage (farthest neighbor) – based on the maximum distance between observations in each cluster.
• Average Linkage – based on the average similarity of all individuals in a cluster.
• Centroid Method – measures the distance between cluster centroids.
• Ward's Method – based on the total sum of squares within clusters.
Hierarchical Clustering–Beer Data Case

Amalgamation or linkage rules: Once several objects have been linked together, how do we determine the distances between those new clusters?

Single linkage (nearest neighbor):
• The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
• This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."
• In single linkage, individuals at opposite ends of the chain may be dissimilar but part of the same cluster.

Complete linkage (furthest neighbor):
• The distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").



Single-Linkage Versus Complete Linkage

Linkage Methods of Clustering

Single linkage: Similarity based only on the two closest observations (minimum distance between Cluster 1 and Cluster 2).

Complete linkage: Similarity based only on the two farthest observations (maximum distance between Cluster 1 and Cluster 2).

Average linkage: Similarity based on the average distance between observations in Cluster 1 and Cluster 2.
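As a concrete sketch, the three inter-cluster distances above can be computed directly with NumPy (the two small clusters are invented for illustration):

```python
# The three inter-cluster distances above, computed directly
# (the two small clusters are invented for illustration).
import numpy as np

cluster1 = np.array([[0.0, 0.0], [1.0, 0.5]])
cluster2 = np.array([[4.0, 4.0], [5.0, 5.5], [4.5, 4.0]])

# all pairwise distances between points in the two clusters
D = np.linalg.norm(cluster1[:, None, :] - cluster2[None, :, :], axis=-1)

print("single   (minimum distance):", D.min())
print("complete (maximum distance):", D.max())
print("average  (average distance):", D.mean())
```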
Other Agglomerative Clustering Methods

[Fig. 20.6: diagrams of Ward's Procedure and the Centroid Method]



Cosine Similarity

The cosine similarity between X1 and X2 is given by


$$\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{X_1 \cdot X_2}{\|X_1\|\,\|X_2\|} = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^2}\;\sqrt{\sum_{i=1}^{n} X_{2i}^2}}$$

 In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure captures the angle between the two vectors (hence the name vector space model).
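A direct NumPy translation of this formula (a minimal sketch; the sample vectors are illustrative):

```python
# Direct translation of the formula above (sample vectors are illustrative).
import numpy as np

def cosine_similarity(x1, x2):
    # dot product of the two n-dimensional vectors divided by the
    # product of their Euclidean norms, i.e. cos(theta)
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([2.0, 4.0, 6.0])))  # parallel vectors -> 1.0
```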
Cosine Similarities
 Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distance at which clusters were joined. The dendrogram is read from left to right.

Basic Pattern: Word Cloud

 The word cloud was generated using R (a Python sketch follows below)

 The Samsung users mentioned other models like iPhone, HTC, etc.

 They also mentioned features like battery, screen, camera, etc., and abstract features like look, design, etc.

 They also mentioned attitudes towards the feature/phone, such as ugly, happy, bad, and so on.
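The slide's cloud was produced in R; for a comparable sketch in Python, the third-party wordcloud package can be used (the package choice and the sample text are assumptions, not from the slides):

```python
from wordcloud import WordCloud   # third-party package, illustrative choice
import matplotlib.pyplot as plt

# stand-in text; the slide's cloud was built from actual Samsung reviews
reviews = "battery screen camera design look iphone HTC samsung ugly happy bad"

cloud = WordCloud(width=400, height=300, background_color="white").generate(reviews)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```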

Cluster Analysis of Documents

 X-means clustering by the automatic identification module of RapidMiner produced interesting results

 Produced two clusters

 Words with high TF-IDF listed in the two clusters (a stand-in sketch follows the table below):

Cluster_0: Interesting, Excellent, Looks, Love, Innovative, Thank, Amazing, Beat, Goodbye, Beautiful, Hope, Boring, Worried, Listen, Wifi, Social, performance, Awesome, Uglier

Cluster_1: android, gb, processor, battery, core, devices, os, ram, memory, card, apple, Iphone, ipad, browser, movie, ios, applications, game, display, drawback
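A rough stand-in for this pipeline: the slide used RapidMiner's X-means, which picks the number of clusters automatically; scikit-learn has no X-means, so the sketch below fixes k = 2, and the review texts are invented placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [                      # placeholder reviews, not the slide's data
    "love this beautiful amazing phone",
    "awesome looks, excellent and innovative",
    "android os with quad core processor and gb of ram",
    "battery, display and memory card beat the iphone",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)            # words weighted by TF-IDF

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# words with the highest TF-IDF weight in each cluster centroid
terms = np.array(tfidf.get_feature_names_out())
for c in range(2):
    top = km.cluster_centers_[c].argsort()[::-1][:5]
    print(f"Cluster_{c}:", ", ".join(terms[top]))
```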
Findings From Two Clusters

 Two Types of Reviewers
 Emotionally expressive (nearly 62%)
 Tech lovers (38%)

 Word Cloud Interpretation
 Comparative assessments (among brands)
 Technical assessments (feature-wise, product-wise)
 Attitude-oriented expressions (love, excellent, boring)
