F1 Score: The harmonic mean of precision and recall, computed for each cluster.

Rand Index (RI): Measures the similarity between the true and predicted cluster assignments.

• Confusion Matrix: A table used to evaluate the performance of a classification or clustering algorithm. It tabulates the true class memberships against the predicted ones.

• Hierarchical Agglomerative Clustering (HAC): A popular clustering algorithm that groups similar data points into a hierarchy of clusters.

4) Evaluate the clustering using purity, normalized mutual information (NMI), F1 score, and adjusted Rand Index (RI).
5) Construct the confusion matrix from the true and predicted labels and identify the classes contributing to false positives and false negatives.

B. Part-B
1) Download the Reuters-21578 dataset. Filter it by discarding the documents that do not belong to any of the three given classes and the documents that belong to more than one of the given classes.
2) Extract the document content and assign the primary target class as the true label of each filtered document.

IV. OBSERVATIONS
A. Part-A
The confusion matrix:
Fig. 1.
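For reference, the sketch below shows one way the confusion counts, purity, and Rand Index could be computed from true labels and predicted cluster ids. It is a minimal illustration and not the code used in the experiment: the function names and the toy labels are assumptions, and the Rand Index here is the plain (unadjusted) pairwise version.

```python
from collections import Counter
from itertools import combinations


def confusion_counts(true_labels, cluster_ids):
    """Count how many documents of each true class fall into each cluster."""
    return Counter(zip(true_labels, cluster_ids))


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    counts = confusion_counts(true_labels, cluster_ids)
    best_per_cluster = {}
    for (_label, cluster), n in counts.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), n)
    return sum(best_per_cluster.values()) / len(true_labels)


def rand_index(true_labels, cluster_ids):
    """Fraction of document pairs on which the two partitions agree (unadjusted)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_ids[i] == cluster_ids[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy labels, invented for illustration only (not results from the experiment).
true_labels = ["a", "a", "a", "b", "b", "c"]
cluster_ids = [0, 0, 1, 1, 1, 2]

matrix = confusion_counts(true_labels, cluster_ids)
toy_purity = purity(true_labels, cluster_ids)
toy_rand = rand_index(true_labels, cluster_ids)
```

On the toy labels, cluster 1 mixes one "a" document with two "b" documents, which lowers both purity and the Rand Index. The pairwise loop is quadratic in the number of documents, so a library implementation would be preferable for the full dataset.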
B. Part-B
The Reuters-21578 dataset contains 10788 documents in total;
after filtering, 1632 documents remain.
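The filtering rule can be sketched as follows. This is a hedged illustration only: it assumes the category list of each document is already available as a dictionary, the function name and the class names are hypothetical placeholders (the three classes are not named here), and it assumes that a document carrying one target class plus unrelated categories is kept.

```python
def filter_single_label(doc_categories, target_classes):
    """Keep documents that carry exactly one of the target classes.

    doc_categories: dict mapping a document id to its list of category names.
    Returns a dict mapping each kept document id to its single target class,
    which then serves as the true label.
    """
    targets = set(target_classes)
    kept = {}
    for doc_id, categories in doc_categories.items():
        matches = targets.intersection(categories)
        if len(matches) == 1:
            kept[doc_id] = next(iter(matches))
    return kept


# Hypothetical documents and class names, for illustration only.
docs = {
    "d1": ["class_a"],
    "d2": ["class_a", "class_b"],  # in two target classes: discarded
    "d3": ["other"],               # in no target class: discarded
    "d4": ["class_c", "other"],    # one target class plus an unrelated one: kept
}
labels = filter_single_label(docs, ["class_a", "class_b", "class_c"])
```

Here only `d1` and `d4` survive, each with a single unambiguous true label.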
VI. CONCLUSIONS
The experiment applied K-means and hierarchical
agglomerative clustering to the Reuters-21578
dataset. The K-means results showed moderate purity,
limited normalized mutual information, and a low Rand
Index. The confusion matrix showed specific classes with
significant occurrences of false positives and false negatives.
In the hierarchical agglomerative clustering, the complete-link
method demonstrated superior performance, yielding the
highest Rand Index among the methods tested. The findings
highlight the challenges of clustering in diverse datasets,
emphasizing the need for careful method selection.
REFERENCES
[1] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information
Retrieval, Chapters 16 and 17.