Experiment 6: Clustering K-means and HAC

I. INTRODUCTION

In this experiment we use the Reuters-21578 dataset, a collection of news articles. It contains 10,369 documents and a vocabulary of 29,930 words.

The experiment is divided into two parts. In Part A we discarded documents that do not occur in one of the ten classes acquisitions, corn, crude, earn, grain, interest, money-fx, ship, trade, and wheat, and then removed documents that occur in two or more of these ten classes. We applied K-means to this refined dataset and evaluated the clustering using purity, normalized mutual information, F1 score, and a confusion matrix. In Part B we used the same filtering method for the classes crude, interest, and grain, and employed the single-link, complete-link, GAAC, and centroid methods. Cutting the dendrogram at the second branch yields three clusters (K = 3). We computed the Rand index for each clustering to determine the most effective method.
II. UNDERLYING CONCEPTS

• K-means Clustering: Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean, or centroid, of the documents in a cluster (see the equations after this list).

• Evaluation Metrics:

Purity: Measures the extent to which clusters contain a single class; it assesses the homogeneity of clusters.

Normalized Mutual Information (NMI): Measures the mutual information between the true and predicted labels, normalized so that the result lies between 0 and 1.

F1 Score: Computes the weighted average of precision and recall for each cluster.

Rand Index (RI): Measures the similarity between the true and predicted cluster assignments.

• Confusion Matrix: A table used to evaluate the performance of a classification or clustering algorithm. It shows the true and predicted class memberships.

• Hierarchical Agglomerative Clustering (HAC): A popular clustering algorithm used for grouping similar data points into hierarchical structures.

• Dendrogram Cutting: To obtain a specific number of clusters, one can cut the dendrogram at a certain height. The horizontal cut at the chosen height yields the desired number of clusters.

• Single-link: The distance between two clusters is the minimum distance between any two points in the different clusters.

• Complete-link: The distance between two clusters is the maximum distance between any two points in the different clusters.

• GAAC: Group-average agglomerative clustering evaluates cluster quality based on all similarities between documents.
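For reference, the K-means objective and the evaluation metrics above can be written compactly, following Chapter 16 of the textbook [1]; here Omega = {omega_1, ..., omega_K} is the set of clusters, C = {c_1, ..., c_J} the set of classes, and N the number of documents:

% K-means minimizes the residual sum of squares (RSS),
% where \mu(\omega) is the centroid of cluster \omega:
\mathrm{RSS} = \sum_{k=1}^{K} \sum_{\vec{x} \in \omega_k} \lVert \vec{x} - \vec{\mu}(\omega_k) \rVert^2,
\qquad
\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}

% Purity credits each cluster with its majority class; NMI normalizes
% the mutual information I by the mean of the entropies H:
\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|,
\qquad
\mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}

% The Rand index is the fraction of document pairs on which the
% true and predicted assignments agree:
\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}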
III. PROCEDURE

A. Part-A

1) Download the Reuters-21578 dataset. Filter it by discarding the documents that do not occur in one of the ten given classes and the documents that occur in more than one of these classes.

2) Perform TF-IDF vectorization and K-means clustering on this dataset using a pipeline. Fit the pipeline to the filtered data and then obtain cluster labels from the fitted model.

3) Predict labels by applying the predict method of the pipeline to the TF-IDF-transformed data, and encode the true labels using LabelEncoder.

4) Evaluate the clustering performance using purity, normalized mutual information (NMI), F1 score, and the adjusted Rand index (RI).

5) Construct the confusion matrix from the true and predicted labels and identify the classes contributing to false positives and false negatives. (A sketch of these steps is given below.)
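A minimal sketch of these five steps, assuming the NLTK copy of Reuters-21578 (whose category name for acquisitions is "acq"); the variable names are illustrative, not the original code. Purity is not provided by scikit-learn directly, so it is computed from the contingency matrix, and each cluster is mapped to its majority class so that F1 and the confusion matrix are well-defined:

# Part-A sketch (illustrative, not the original code).
# Requires: nltk.download('reuters'), scikit-learn.
from nltk.corpus import reuters
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, f1_score, confusion_matrix)
from sklearn.metrics.cluster import contingency_matrix

classes = ["acq", "corn", "crude", "earn", "grain", "interest",
           "money-fx", "ship", "trade", "wheat"]
docs, labels = [], []
for fid in reuters.fileids():
    cats = [c for c in reuters.categories(fid) if c in classes]
    if len(cats) == 1:                 # step 1: keep single-label documents
        docs.append(reuters.raw(fid))
        labels.append(cats[0])

pipeline = Pipeline([                  # step 2: TF-IDF + K-means pipeline
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("kmeans", KMeans(n_clusters=10, random_state=0)),  # ten target classes
])
pred = pipeline.fit_predict(docs)      # cluster label for each document
true = LabelEncoder().fit_transform(labels)   # step 3: encode true labels

cm = contingency_matrix(true, pred)    # step 4: evaluation metrics
purity = cm.max(axis=0).sum() / cm.sum()      # majority class per cluster
nmi = normalized_mutual_info_score(true, pred)
ari = adjusted_rand_score(true, pred)
majority = cm.argmax(axis=0)           # map each cluster to its majority class
f1 = f1_score(true, majority[pred], average="weighted")

conf = confusion_matrix(true, majority[pred])  # step 5: confusion matrix
print(purity, nmi, ari, f1)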

B. Part-B

1) Download the Reuters-21578 dataset. Filter it by discarding the documents that do not occur in one of the three given classes and the documents that occur in more than one of these classes.

2) Extract the document content and assign the primary target class as the true label for each filtered document.

3) Initialize a TF-IDF vectorizer with English stop words and transform the filtered documents into a TF-IDF matrix representing the importance of words in the selected documents.

4) Use the linkage function from SciPy to cluster the TF-IDF matrix. Use the fcluster method to cut the dendrogram at the desired number of clusters (k = 3), assigning each document to a specific cluster. Compute the Rand index using the rand_score function from sklearn.

5) Repeat step 4 for each clustering method and compute the respective Rand index. (A sketch of these steps is given below.)
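A minimal sketch of the Part-B procedure under the same assumptions as above. SciPy's "average" linkage stands in for GAAC here, and the TF-IDF matrix is made dense because the "centroid" linkage needs raw Euclidean observation vectors:

# Part-B sketch (illustrative, not the original code).
from nltk.corpus import reuters
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import rand_score

classes = ["crude", "interest", "grain"]
docs, labels = [], []
for fid in reuters.fileids():          # steps 1-2: filter and label
    cats = [c for c in reuters.categories(fid) if c in classes]
    if len(cats) == 1:
        docs.append(reuters.raw(fid))
        labels.append(cats[0])

# step 3: TF-IDF matrix (dense, for the centroid linkage)
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
true = LabelEncoder().fit_transform(labels)

# steps 4-5: build each dendrogram, cut it into k = 3 clusters, score it
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    pred = fcluster(Z, t=3, criterion="maxclust")
    print(method, rand_score(true, pred))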

IV. OBSERVATIONS

A. Part-A

We have in total 10788 documents in the Reuters-21578 dataset, and after applying the filtering we have 5399 documents remaining.

Plot of the documents after applying K-means clustering:

Fig. 1.

The confusion matrix: [figure]

FP and FN for each class: [figure]
B. Part-B

We have in total 10788 documents in the Reuters-21578 dataset, and after applying the filtering we have 1632 documents remaining.

Here is the tabular representation of the Rand index for the different clustering methods, in which the complete-link hierarchical clustering method works best for our filtered documents.

Clustering Performance:

Clustering Method    Rand Index
Single               0.336176
Complete             0.530872
GAAC                 0.341107
Centroid             0.337973
V. LINK TO CODE

Part - A
Part - B

VI. CONCLUSIONS

The experiment employed K-means and hierarchical agglomerative clustering techniques on the Reuters-21578 dataset. The K-means results revealed moderate purity, limited normalized mutual information, and a low Rand index. The confusion matrix showed specific classes with significant occurrences of false positives and false negatives. In hierarchical agglomerative clustering, the complete-link method demonstrated superior performance, yielding the highest Rand index among the methods tested. The findings highlight the challenges of clustering diverse datasets, emphasizing the need for careful method selection.

REFERENCES

[1] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008, Chapters 16 and 17.
