You are on page 1of 13

FLAT CLUSTERING &

HIERARCHICAL
CLUSTERING in I.R.

By:
Suraj Jogani (20117101)
Pankaj Agarwal (20117068)
Aditya Kumar Dubey(20117901)

Electrical Engineering
6th Semester
What is clustering?
 Grouping set of documents into subsets or clusters.
 The Goal of clustering algorithm is: To create clusters that are coherent internally,
but clearly different from each other
 Documents within a cluster should be as similar as possible; and
 Documents in one cluster should be as dissimilar as possible from documents in
other clusters
Clustering algorithms

Flat algorithm
 Usually start with a random (partial) partitioning
 Refine it iteratively by changing the centroid
 K-Means clustering
 Model based clustering

Hierarchical algorithm
Hierarchical algorithms are algorithms where you also have the explicit notion of a
hierarchy

In hierarchical we can cluster our documents into certain no of clusters and then we can
group together those clusters in turn into larger clusters and so on to finally have a
hierarchy.
Bottom up ,agglomerative
Top down ,Divisive
Hard vs soft clustering

Another way to classify clustering is:

 Hard clustering: Each documents belong to exactly one


cluster.
 Soft clustering: A document can belongs to more than one
cluster.
Ex—You may want to put a pair of sneakers in two clusters (i)
sports apparels (ii) shoes

Soft clustering is not used only hard clustering is used today.


K-Means
 K-Means is the important flat clustering algorithm.
 Its objective is to minimize the average squared Euclidean distance of
documents from their cluster centers .
 where a cluster center is defined as the mean or centroid.
 The first step of K-Means and our goal is to select seeds- is a initial cluster
centers K randomly selected documents.
 The algorithm then moves the cluster centers around in space to minimize
RSS(residual sum of squares)
 The square distance of each vector from its centroid summed over all the
vectors.
A k-means example of k=2
Hierarchy clustering

 Hierarchical clustering outputs a hierarchy ,


a structure that is more informative than the unstructured set
of clusters returned by flat clustering.
 Hierarchical clustering does not require us to prespecify the
number of clusters and most hierarchical algorithm are
deterministic.
 Hierarchical clustering produce better result than flat
clustering
Hierarchical Agglomerative
clustering
 Hierarchical clustering algorithm are either top-down or bottom-up.

 Bottom-up algorithms treat each document as a singleton clusters at the outset and then successively merge pairs of

clusters until all clusters have been merged into a single cluster that contains all documents.

 Bottom up hierarchical clustering is therefore called hierarchical agglomerative clustering(HAC)

 Agglomerative hierarchical clustering presents four different agglomerative algorithms.

 1.Single link

 2.Complete link

 3.Group-average

 4.Centroid similarity.
Single-link and complete-link clustering

Single link clustering: In single link or single linkage clustering ,the similarity of two
clusters is the similarity of their most similar members

Complete link clustering: In complete link clustering or complete linkage clustering, the
similarity of two clusters is the similarity of their most dissimilar members.
Divisive clustering

 This variant of hierarchical clustering is called top-down


clustering or divisive clustering.
 We start at the top with all documents in one cluster. the
cluster is split using a flat clustering algorithm.
 This procedure is applied recursively until each document is
in its own singleton cluster.
Why cluster documents?

• Better user interface


• Better search results
• Effective “user recall”(relevant
documents retrieved) will be
higher
• Faster search
Some application of clustering
in Information Retrieval

Application What is clustered? Benefit

Search result clustering search more effective information


results presentation to user
Scatter-Gather (subsets of) alternative user interface:
collection “search without typing”
Collection clustering collection effective information
presentation
for exploratory
browsing
Language modeling collection increased precision and/or
recall
Cluster-based retrieval collection higher efficiency: faster
search
Thank you

You might also like