
Distance Measures for Clustering

of Documents in a Topic Space

Tomasz Walkowiak¹ and Mateusz Gniewkowski²

¹ Faculty of Electronics, Wroclaw University of Science and Technology, Wroclaw, Poland
tomasz.walkowiak@pwr.edu.pl
² Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wroclaw, Poland
mateusz.gniewkowski@pwr.edu.pl

Abstract. Topic modeling is a method for the discovery of topics (groups of
words). In this paper we focus on clustering documents in a topic space
obtained using the MALLET tool. We tested several distance measures
with two clustering algorithms (spectral clustering and agglomerative
hierarchical clustering) and describe those that performed better
(cosine distance, correlation distance, Bhattacharyya distance) than the
Euclidean metric used with the k-means algorithm. For evaluation purposes we
used the Adjusted Mutual Information (AMI) score. The need for such experiments
comes from the difficulty of choosing appropriate grouping methods
for the given data, which is specific in our case.

Keywords: Topic modeling · Distance · Cosine distance · Correlation distance · Clustering · Grouping · Clustering evaluation · Text analysis

1 Introduction

Topic modeling [5,22] is a method that allows us to find sets of words that
best represent a given collection of documents. Those sets are called "topics".
Thanks to that, thousands of files can be organized into groups and therefore
become easier to understand or classify. There are numerous methods and tools for topic
modeling [3], but they will not be discussed here.
The main problem we focus on in this paper is how to accurately group
documents with already assigned topics (which might be considered fuzzy
sets), and since measuring the distance between two samples is an important part
of clustering, we also analyze different distance functions between two samples
from our data. The need for document clustering follows from the frequent
occurrence of many interpretable topics (sets of words that describe a document)
in a dataset. By increasing the given number of topics, we can find more and more
subtopics (that do not necessarily resemble the previous ones) and still give each of
them a meaning. Grouping of documents seen in the topic space (or topics in the

word space) gives a researcher who deals with language or literature a better
view of the results of topic modelling. It may also automatically find
supersets of the given topics (e.g., "sport" for texts about football and volleyball).
The purpose of clustering is to group together N samples described by M
features (in our case, the number of topics), without any knowledge of the problem
(labels). There are several issues with clustering, such as the unknown number of
groups, the choice of a distance function, and the nature of the input data itself [1].
The input to the clustering algorithms is the list of N² pairwise dissimilarities
between N points representing samples (in our case, texts). As a result,
we get an assignment of documents to X groups. The number of groups is often
an input parameter, but some clustering algorithms are able to predict it
themselves [10].
There are plenty of clustering algorithms [7,8,12] with different properties
and input parameters. In our work we decided to focus on agglomerative
hierarchical clustering [6], because it is a common method used in document
grouping that allows building a hierarchy and therefore drawing a dendrogram, which
helps in interpreting the results. For comparison we also used the spectral
clustering algorithm [17] and the related k-means algorithm [12].
In this paper, we first discuss the input data format for the clustering algorithms
(samples and features). In Sect. 3 we focus on different distance measures. After
that, in Sect. 4, we describe the data sets used for our tests. The last two sections
cover the results and conclusions.

2 Topic Modelling Data

As we want to measure the similarity between documents in the topic space,
we consider the standard output of most topic modeling methods as our input:
two matrices. The first describes the probability distribution of documents
over topics, the second the distribution of topics over words (Fig. 1).

Fig. 1. Input matrices: documents (di) over topics (ti) (on the left) and topics (ti) over words (wi)

Those matrices have some interesting properties. First of all, every row is a
Dirichlet-distributed probability vector, which means that it sums up to one
(the rows are therefore L1-normalized, as illustrated in Fig. 2). Also, both matrices
are rather sparse, especially if all values below some threshold are ignored (which
is recommended). These characteristics are important when choosing the right
distance (or similarity) measure.

Fig. 2. 2D representation of documents in the topic space
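As a brief, hedged illustration of the data format described above, the sketch below loads a document-topic matrix from a MALLET --output-doc-topics file and checks the L1 normalization of its rows. The file name, the number of topics, and the assumption that topic proportions start in the third column are ours, not part of the original experiment; the column layout differs between MALLET versions.

```python
import numpy as np

# Hedged sketch: load a MALLET --output-doc-topics file. We assume the newer
# layout in which each row is: doc-id <TAB> doc-name <TAB> p(t1|d) ... p(tK|d);
# older MALLET versions emit (topic, proportion) pairs instead, so check first.
N_TOPICS = 100                                   # assumed, matches Sect. 5
doc_topics = np.loadtxt("doc_topics.txt",        # hypothetical file name
                        delimiter="\t",
                        usecols=range(2, 2 + N_TOPICS))

# Every row should be a Dirichlet-distributed probability vector (L1-normalized).
assert np.allclose(doc_topics.sum(axis=1), 1.0, atol=1e-6)
print(doc_topics.shape)                          # (number of documents, N_TOPICS)
```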

3 Distance Measures

One of the most crucial tasks in dimensionality reduction or clustering is
selecting a "good" distance function. The goodness of the metric depends on the
data set (its dimensionality and the type of values) and the intentions of the
researcher. Below we discuss a few possible choices that, according to our tests,
work best (the symbols used are summarized in Table 1).

Table 1. Symbols description

Xn, Yn        Two points in an n-dimensional space
xt            t-th element of X
X̄, Ȳ          Mean value of the elements of the given point
DX,Y          Distance between X and Y
Dmax, Dmin    Distance to the farthest/closest point from the origin
n             Dimension
||X||p        Lp length of a vector
||X||         Euclidean length of a vector
E[||X||]      Expected value of the length of any vector
var           Variance
Lp            p-norm
θ             Angle between two vectors
X · Y         Dot product of X and Y

3.1 Euclidean Distance and Other Lp Metrics

$$L_p = \|X\|_p = \left(|x_1|^p + |x_2|^p + \dots + |x_n|^p\right)^{1/p}$$
$$D_{X,Y} = \|X - Y\|$$
Plenty of works based on [4] have shown that all Lp norms fail in high dimensionality. This is because if

$$\lim_{n \to \infty} \mathrm{var}\!\left(\frac{\|X_n\|_p}{E[\|X_n\|_p]}\right) = 0 \quad \text{then} \quad \frac{D_{max} - D_{min}}{D_{min}} \to 0$$

The above result shows that the difference between the farthest and the closest
point does not increase as fast as the minimum distance itself. Therefore all
methods based on the nearest-neighbour problem become ill-defined,
especially for norms of order k greater than three [11]. It is a useful theorem that
can be used to check whether a metric is reasonable.
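A quick numerical check of this effect (our own illustration, not part of the paper's experiments): for points drawn uniformly at random, the relative contrast (Dmax − Dmin)/Dmin of the L2 distances to the origin shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))            # 1000 random points in [0, 1]^dim
    dists = np.linalg.norm(points, axis=1)      # L2 distance to the origin
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  (Dmax - Dmin)/Dmin = {contrast:.3f}")
```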
Moreover, in [13] the authors point out that the L2 norm is very noisy, especially
in the presence of many irrelevant attributes, which is the case in our example due
to the sparsity of the input matrices. Note that if most of a vector's elements are
zeros, the difference may act as their sum.
What is more, and also important, all Lp norms strongly depend on the lengths of
the given vectors, which is not necessarily what we want. Imagine a situation
where two documents (represented by vectors) are quite similar, but the given
set of words (topic) is more frequent in one of them. Using an Lp measure
we would distinguish one sample from the other, although in the linguistic sense
they are similar. These metrics are vulnerable to scaling, as illustrated below.
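The scaling issue can be seen in a toy example (our own, with made-up vectors that are not normalized topic proportions): two vectors with the same topic profile but different magnitudes are far apart under the Euclidean metric, yet identical under the cosine distance introduced next.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine

doc_a = np.array([7.0, 2.0, 1.0])   # made-up topic "profile"
doc_b = 3.0 * doc_a                 # same profile, three times the magnitude

print(euclidean(doc_a, doc_b))      # large, grows with the scaling factor
print(cosine(doc_a, doc_b))         # 0.0 -- the direction is identical
```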

3.2 Cosine Distance

$$D_{X,Y} = 1 - \cos(\theta) = 1 - \frac{X \cdot Y}{\|X\|\,\|Y\|}$$
The cosine measure is widely and successfully used in document clustering
problems because, by definition, it is not vulnerable to any scaling of the
given vectors and it behaves better in high-dimensional space. Despite its simplicity,
it gives results similar to more sophisticated (and more computationally
complex) methods [9,18]. It is also worth noticing that the cosine similarity is much
faster to compute for sparse data, because positions where one of the given vectors is
zero require no calculation.
On the other hand, the cosine distance also has disadvantages.
It is not a proper distance metric, as it does not satisfy the triangle inequality.
It is not fully resistant to high dimensionality, because two randomly
picked vectors on the surface of a unit n-sphere are orthogonal with high probability,
but it still behaves better than any Lp norm.
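A minimal sketch of how the cosine distance can be computed for a document-topic matrix, here with SciPy's pdist; the toy matrix is ours, not taken from the paper's data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# toy document-topic matrix, rows sum to one (made-up values)
doc_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])
cos_dist = squareform(pdist(doc_topics, metric="cosine"))   # N x N distance matrix
print(np.round(cos_dist, 3))
```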

3.3 Correlation Distance

$$D_{X,Y} = 1 - \frac{(X - \bar X) \cdot (Y - \bar Y)}{\|X - \bar X\|\,\|Y - \bar Y\|}$$
The correlation distance, or Pearson's distance, is another measure based on the
cosine similarity. Its value lies between 0 and 2. Thanks to measuring the angle
between mean-centered vectors, the dimensionality of the problem is reduced by one.
There might, however, be a problem with values lying close to the middle point (close to
the point A = (1/n, ..., 1/n)), because after centering the cosine similarity
between them will most likely be close to −1; given our input data
(the result of a topic modeling algorithm), however, this should not be a common issue.
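For illustration (our sketch, not the authors' code), SciPy's correlation distance implements exactly this mean-centered cosine; the toy vectors are made up.

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine

x = np.array([0.5, 0.3, 0.2])       # made-up topic proportions
y = np.array([0.4, 0.35, 0.25])

# correlation() centers both vectors before taking the cosine, i.e.
# 1 - (x - mean(x)) . (y - mean(y)) / (||x - mean(x)|| * ||y - mean(y)||)
print(correlation(x, y))
print(cosine(x, y))                 # plain cosine distance, for comparison
```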

3.4 Bhattacharyya Distance


$$D_{X,Y} = -\ln\Big[\sum_{t \in T} \sqrt{x_t\, y_t}\Big]$$

Since our data are Dirichlet-distributed probability vectors, it seems reasonable to use
a similarity measure designed for probability distributions. The Bhattacharyya distance
[2] is a measure clearly based on the cosine similarity: it is basically the cosine of
two vectors built from the square roots of the original elements. Taking the negative
logarithm is a standard way to turn the coefficient into a distance measure (analogous to 1 − Dcos).
$$X' = (\sqrt{x_1}, \dots, \sqrt{x_n}), \qquad Y' = (\sqrt{y_1}, \dots, \sqrt{y_n})$$
$$B_{coefficient} = \sum_{i=1}^{n} \sqrt{x_i y_i} = X' \cdot Y' = \cos(\theta), \quad \theta = \angle(X', Y')$$
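A short sketch of the measure, assuming both inputs are discrete probability vectors of equal length; the helper name and the toy values are ours.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient of two discrete distributions."""
    bc = np.sum(np.sqrt(p * q))     # cosine of the square-rooted vectors
    return -np.log(bc)

p = np.array([0.6, 0.3, 0.1])       # made-up topic proportions
q = np.array([0.5, 0.4, 0.1])
print(bhattacharyya_distance(p, q))
```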

4 Experiments

4.1 Clustering Algorithms

In order to group data, we decided to use the following algorithms (a short usage sketch follows the list):

1. The k-means algorithm [12] is a classic method that assigns labels to the data
based on the distance to the nearest centroid. The centroids are moved iteratively
until all clusters stabilize.
2. Agglomerative hierarchical clustering [6] is a method that iteratively joins
subgroups based on a linkage criterion. In this paper, we present results for
average-linkage clustering.
3. Spectral clustering [17] is based on the Laplacian matrix of the similarity
graph and its eigenvectors. The eigenvectors corresponding to the smallest eigenvalues
form a new, lower-dimensional space in which the k-means algorithm is applied.
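The sketch below shows one hedged way to run the three algorithms on a precomputed distance matrix with scikit-learn; the randomly generated stand-in data and all parameter values are ours, the paper does not prescribe them.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(100) * 0.1, size=500)    # stand-in data
dist = squareform(pdist(doc_topics, metric="correlation"))  # N x N distances

# 1. k-means works directly on the feature vectors (Euclidean distance)
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(doc_topics)

# 2. average-linkage agglomerative clustering on the precomputed distances
#    (older scikit-learn versions call the `metric` parameter `affinity`)
ac = AgglomerativeClustering(n_clusters=5, metric="precomputed", linkage="average")
ac_labels = ac.fit_predict(dist)

# 3. spectral clustering expects similarities, so distances are converted first
sim = 1.0 - dist / dist.max()
sc = SpectralClustering(n_clusters=5, affinity="precomputed", random_state=0)
sc_labels = sc.fit_predict(sim)
```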
Distance Measures for Clustering of Documents in a Topic Space 549

4.2 Clustering Evaluation

For evaluation purposes, we decided to use one of the information-theoretic
measures: Adjusted Mutual Information (AMI). For a set of "true" labels
(classes) and predicted ones (clusters), it yields a value upper-bounded by 1 (the
best score). Thanks to the "correction for chance" property, the result does
not increase with the number of clusters for randomly chosen labels; it stays
close to zero [19].
It is possible to compare partitions with different numbers of clusters and classes.
Thanks to that, it should be possible to discover hidden groups in already labelled
data [20].
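For reference, scikit-learn provides this score directly; the toy labels below are ours and only illustrate that AMI ignores the particular label ids.

```python
from sklearn.metrics import adjusted_mutual_info_score

true_classes     = [0, 0, 1, 1, 2, 2]
predicted_groups = [1, 1, 0, 0, 2, 2]   # same grouping, different label ids
print(adjusted_mutual_info_score(true_classes, predicted_groups))  # 1.0
```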

4.3 Data Sets

We evaluated the algorithms mentioned above using two collections of text
documents: Wiki and Press.
The first corpus (Wiki) consists of articles extracted from the Polish-language
Wikipedia. This corpus is of good quality, although some of the class
assignments may be doubtful. It is characterized by a significant number of subject
categories: 34. The data set was created by merging two publicly available
collections [15,16]. It includes 9,837 articles in total, which translates into about
300 articles per class (although one class is slightly underrepresented).
The second corpus (Press) consists of Polish press news [21]. It is a good
example of a complete, high-quality and well-defined data set. The texts were
assigned by a press agency to 5 subject categories. All the subject groups are very
well separable from each other and each group contains a reasonably large number
of members. There are 6,564 documents in total in this corpus, which gives an
average of ca. 1,300 documents per class.

5 Results

We performed our experiments as follows: given the matrices of documents
di ∈ D over topics ti ∈ T (the number of topics was set to 100) produced by
MALLET [14] for the two data sets described in the previous section, we performed
several tests evaluating the quality of clustering documents with different
distance/similarity measures (cosine distance, correlation distance, Hellinger
distance, total variation distance, Jensen–Shannon divergence) and different
clustering algorithms (agglomerative clustering, spectral clustering). We
evaluated the results using the AMI score and the known class ci ∈ C of every
document di. The number of clusters to find was set to the number of
original classes. Below, we present the results for the six best combinations, all of
which were better than the k-means algorithm.
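A minimal sketch of this evaluation loop, assuming a document-topic matrix and known class labels (randomly generated stand-ins below); the measures shown are only a subset of those listed above, and the parameters are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering, SpectralClustering
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(100) * 0.1, size=300)   # stand-in for MALLET output
classes = rng.integers(0, 5, size=300)                     # stand-in for true labels
n_groups = len(set(classes))

scores = {}
for measure in ("cosine", "correlation"):                  # extend with other measures
    dist = squareform(pdist(doc_topics, metric=measure))
    ac = AgglomerativeClustering(n_clusters=n_groups, metric="precomputed",
                                 linkage="average")
    scores[("ac", measure)] = adjusted_mutual_info_score(classes, ac.fit_predict(dist))
    sc = SpectralClustering(n_clusters=n_groups, affinity="precomputed", random_state=0)
    scores[("sc", measure)] = adjusted_mutual_info_score(
        classes, sc.fit_predict(1.0 - dist / dist.max()))
print(scores)
```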
The results for the various combinations are given in Fig. 3. We compared the classic
k-means algorithm with the agglomerative clustering and spectral clustering
algorithms. For the Wiki dataset it can be observed that agglomerative clustering
performs better than k-means. Notice that the results for the cosine and correlation
distances with agglomerative clustering are quite similar, which is what we
expected to see. The situation is different for the Press dataset: one of them is
significantly better than the other, and also better than any of the other solutions.
Surprising as it may be, it seems that using the cosine measure on mean-centered
vectors is more resistant to some peculiarities of the data distribution. In general,
the correlation measure combined with agglomerative clustering seems to be the best
solution for grouping texts in a topic space.

Fig. 3. AMI score for different clustering methods and distance measures (ac - agglomerative clustering, sc - spectral clustering)
By increasing the number of clusters in the clustering algorithm and comparing
the results with the original labels, we expected to see a global peak somewhere
beyond the presumed number of groups. That would mean that we could suggest to the
researcher another input parameter (a proper number of groups). It might help to
discover new dependencies in a given, labelled corpus. Figure 4 shows that our
expectations were correct: for the best combination of measure and clustering
method (agglomerative clustering with the correlation distance), the peak occurred
at 8 clusters for the Press data set and at about 40 for the Wiki data set. The very
first point of each line represents the AMI score for the "true" number of labels.

Fig. 4. AMI score for different numbers of clusters
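The sweep over cluster counts can be sketched as follows (again with stand-in data that we generate ourselves); on the real corpora one would look for the peak of the resulting curve, as in Fig. 4.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(100) * 0.1, size=300)   # stand-in for real data
classes = rng.integers(0, 5, size=300)                     # stand-in for true labels

dist = squareform(pdist(doc_topics, metric="correlation"))
for k in range(2, 41):
    labels = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                     linkage="average").fit_predict(dist)
    print(k, round(adjusted_mutual_info_score(classes, labels), 3))
```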

6 Conclusion

In this paper, we discussed several options for choosing a distance/similarity
measure in the context of a given clustering method. We have shown how
significantly this choice affects the results of clustering, a task that is complex
and often unintuitive. Much depends on the input data, and that is why it is
important to characterize its different kinds and then try to find a suitable solution;
after all, there is no algorithm that is appropriate for every type of data. In our
work, we focused on the output of topic modelling tools (MALLET), which is a
Dirichlet-distributed probability vector for every processed document. It is worth
noticing that in our datasets there should not be any vectors close to the middle point
(1/n, ..., 1/n); this feature is probably the main reason why the correlation distance
works so well (two vectors lying near the middle point might be considered completely
different although they are close to each other). This is not a problem in our case,
because its occurrence would mean that the algorithm had not generated the topics
properly. What is more, the correlation distance is bounded from above by two and
from below by zero, which means that its range is larger than that of the cosine
distance. In addition, we have shown that by evaluating the results of clustering
for different numbers of clusters it is possible to find a better-fitting solution
that might (or might not) be useful for the researcher.

Acknowledgements. The work was funded by the Polish Ministry of Science and
Higher Education within CLARIN-PL Research Infrastructure.

References
1. Agarwal, P., Alam, M.A., Biswas, R.: Issues, challenges and tools of clustering
algorithms. arXiv preprint arXiv:1110.2610 (2011)
2. Aherne, F.J., Thacker, N.A., Rockett, P.I.: The Bhattacharyya metric as an abso-
lute similarity measure for frequency coded data. Kybernetika 34(4), 363–368
(1998)
3. Barde, B.V., Bainwad, A.M.: An overview of topic modeling methods and tools. In:
International Conference on Intelligent Computing and Control Systems (ICICCS),
pp. 745–750 (2017)
4. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor”
meaningful? In: ICDT 1999 Proceedings of the 7th International Conference on
Database Theory, pp. 217–235 (1999)
5. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)

6. Day, W.H.E., Edelsbrunner, H.: Efficient algorithms for agglomerative hierarchical
clustering methods. J. Classif. 1(1), 7–24 (1984)
7. Dueck, D., Frey, B.J.: Non-metric affinity propagation for unsupervised image cat-
egorization. In: 2007 IEEE 11th International Conference on Computer Vision,
ICCV 2007, pp. 1–8. IEEE (2007)
8. Fung, G.: A comprehensive overview of basic clustering algorithms (2001)
9. Huang, A.: Similarity measures for text document clustering. In: Proceedings of
the 6th New Zealand Computer Science Research Student Conference (2008)
10. Khan, K., Rehman, S.U., Aziz, K., Fong, S., Sarasvady, S.: DBSCAN: past, present and
future. In: The Fifth International Conference on the Applications of Digital Infor-
mation and Web Technologies (ICADIWT 2014), pp. 232–238 (2014)
11. Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the surprising behavior of distance metrics
in high dimensional space. In: ICDT 2001 Proceedings of the 8th International
Conference on Database Theory, pp. 420–434 (2001)
12. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognit. Lett.
31(8), 651–666 (2009)
13. Kriegel, H.-P., Schubert, E., Zimek, A.: A survey on unsupervised outlier detection.
Stat. Anal. Data Min. 5, 363–387 (2012)
14. Milligan, I., Weingart, S., Graham, S.: Getting started with topic modeling and MALLET.
Technical report (2012). http://hdl.handle.net/10012/11751
15. Mlynarczyk, K., Piasecki, M.: Wiki test - 34 categories (2015). http://hdl.handle.
net/11321/217. CLARIN-PL digital repository
16. Mlynarczyk, K., Piasecki, M.: Wiki train - 34 categories (2015). http://hdl.handle.
net/11321/222. CLARIN-PL digital repository
17. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algo-
rithm. In: Advances in neural information processing systems, pp. 849–856 (2002)
18. Subhashini, R., Kumar, V.J.: Evaluating the performance of similarity measures
used in document clustering and information retrieval. In: Integrated Intelligent
Computing, pp. 27–31 (2010)
19. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings com-
parison: is a correction for chance necessary? In: Proceedings of the 26th Annual
International Conference on Machine Learning, ICML 2009, pp. 1073–1080. ACM,
New York, NY, USA (2009). http://doi.acm.org/10.1145/1553374.1553511
20. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings
comparison: variants, properties, normalization and correction for chance. J. Mach.
Learn. Res. 11(Oct), 2837–2854 (2010)
21. Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceed-
ings of the 10th International Conference on Agents and Artificial Intelligence -
Volume 2: ICAART, pp. 515–522. INSTICC, SciTePress (2018)
22. Wallach, H.M.: Topic modeling: Beyond bag-of-words. In: Proceedings of the 23rd
International Conference on Machine Learning, ICML 2006, pp. 977–984. ACM,
New York, NY, USA (2006). http://doi.acm.org/10.1145/1143844.1143967
