Figure 1: Overview of the unifying coherence framework—its four parts and their intermediate results.
The latter two are variations of the first one that require an ordered word set. They compare a word only to the preceding and succeeding words, respectively, as is done by the UMass coherence.

Douven and Meijs [7] proposed several other segmentations that have been adapted to topic evaluation by [16]. These definitions allow one or both subsets to contain more than a single word.

$$S_{all}^{one} = \{(W', W^*) \mid W' = \{w_i\};\ w_i \in W;\ W^* = W \setminus \{w_i\}\} \qquad (20)$$

$$S_{any}^{one} = \{(W', W^*) \mid W' = \{w_i\};\ w_i \in W;\ W^* \subseteq W \setminus \{w_i\}\} \qquad (21)$$

$$S_{any}^{any} = \{(W', W^*) \mid W', W^* \subset W;\ W' \cap W^* = \emptyset\} \qquad (22)$$

$S_{all}^{one}$ compares every single word to all other words of the word set. $S_{any}^{one}$ extends $S_{all}^{one}$ by using every subset as condition. $S_{any}^{any}$ is a further extension that compares every subset with every other disjoint subset. Figure 2 shows the different sets of subset pairs produced by applying the different segmentations to the running example.

Figure 2: $S_{one}^{one}$, $S_{all}^{one}$, $S_{any}^{one}$ and $S_{any}^{any}$ segmentations of the word set {game, ball, sport, team} and their hierarchy.

The approach in [1] compares words to the total word set W using the words' context vectors. Therefore, we define another segmentation

$$S_{set}^{one} = \{(W', W^*) \mid W' = \{w_i\};\ w_i \in W;\ W^* = W\} \qquad (23)$$

3.2 Probability Estimation

The method of probability estimation defines how the probabilities are derived from the underlying data source. Boolean document ($P_{bd}$) estimates the probability of a single word as the number of documents in which the word occurs divided by the total number of documents. In the same way, the joint probability of two words is estimated by the number of documents containing both words divided by the total number of documents. This estimation method is called boolean because neither the number of occurrences of words in a single document nor the distances between the occurrences are considered. UMass coherence is based on an equivalent kind of estimation [12]. Text documents with some formatting allow simple variations, namely boolean paragraph ($P_{bp}$) and boolean sentence ($P_{bs}$). These estimation methods are similar to boolean document, except that paragraphs or sentences, respectively, are used instead of documents. Boolean sliding window ($P_{sw}$) determines word counts using a sliding window. The window moves over the documents one word token per step. Each step defines a new virtual document by copying the window content. Boolean document is applied to these virtual documents to compute the probabilities.
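The segmentation and probability-estimation parts above can be sketched in a few lines. This is our own illustrative code, not the authors' implementation; the helper names `p_bd`, `s_one_all` and `sliding_windows`, as well as the toy corpus, are ours:

```python
def p_bd(corpus, words):
    """Boolean document P_bd: fraction of documents containing every
    word in `words` (occurrence counts and distances are ignored)."""
    hits = sum(1 for doc in corpus if all(w in doc for w in words))
    return hits / len(corpus)

def s_one_all(W):
    """S^one_all: pair each single word W' = {w_i} with W* = W \\ {w_i}."""
    return [({w}, set(W) - {w}) for w in W]

def sliding_windows(tokens, size):
    """Virtual documents for P_sw: copy the window content at each
    one-token step; applying P_bd to these yields the window-based counts."""
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

# Running example word set and a made-up three-document corpus.
corpus = [{"game", "ball", "team"}, {"sport", "team"}, {"game", "sport"}]
W = ["game", "ball", "sport", "team"]
print(p_bd(corpus, {"game"}))   # 2 of 3 documents contain "game"
print(s_one_all(W)[0])          # first pair: ({'game'}, rest of the word set)
```

Boolean paragraph and boolean sentence would only change what counts as a document in `corpus`; the estimation itself stays the same.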
The coherences defined in [7] and transformed in [16] can be written as:

$$C_{one\text{-}all} = (P_{bd}, S_{all}^{one}, m_d, \sigma_a) \qquad (42)$$
$$C_{one\text{-}any} = (P_{bd}, S_{any}^{one}, m_d, \sigma_a) \qquad (43)$$
$$C_{any\text{-}any} = (P_{bd}, S_{any}^{any}, m_d, \sigma_a) \qquad (44)$$

Shogenji's [17], Olsson's [14] and Fitelson's [8] coherences do not define how the probabilities are computed. Therefore, these measure definitions can be combined with every method of probability estimation⁶:

$$C_S = (\cdot, S_{one}^{all}, m_{ls}, \sigma_a) \qquad (45)$$
$$C_O = (\cdot, S_{set}^{set}, m_o, \sigma_i) \qquad (46)$$
$$C_F = (\cdot, S_{any}^{one}, m_f, \sigma_a) \qquad (47)$$

Using the context-window-based probability estimation $P_{cw}$ described in Section 2, we are able to formulate the context-vector-based coherences defined in [1] within our framework⁷:

$$C_{cos} = (P_{cw(5)}, S_{one}^{one}, \tilde{m}_{cos(nlr,\gamma)}, \sigma_a) \qquad (48)$$
$$C_{dice} = (P_{cw(5)}, S_{one}^{one}, \tilde{m}_{dice(nlr,\gamma)}, \sigma_a) \qquad (49)$$
$$C_{jac} = (P_{cw(5)}, S_{one}^{one}, \tilde{m}_{jac(nlr,\gamma)}, \sigma_a) \qquad (50)$$
$$C_{cen} = (P_{cw(5)}, S_{set}^{one}, \tilde{m}_{cos(nlr,\gamma)}, \sigma_a) \qquad (51)$$

This shows that the framework can cover all existing measures. Moreover, it allows the construction of new measures that combine the ideas of existing measures.

4. EVALUATION AND DATA SETS

The evaluation follows a common scheme that has already been used in [1, 10, 13, 16]. Coherence measures are computed for topics given as word sets that have been rated by humans with respect to understandability. Each measure produces a ranking of the topics that is compared to the ranking induced by the human ratings. Following [10], both rankings are correlated using Pearson's r. Thus, good quality of a coherence measure is indicated by a large correlation to the human ratings.

Word counts and probabilities are derived from Wikipedia. In cases where the corpus that was used as training data for topic learning was available, we computed the coherence measures a second time using counts derived from that corpus.

During evaluation, we tested a wide range of different parameter settings. Window sizes varied between [10, 300] for the sliding window and [5, 150] for the context window. The parameter γ varied in {1, 2, 3}. Overall, our evaluation comprises a total of 237 912 different coherences and parameterizations.

In the literature, other evaluation methods have been used as well, e.g., humans were asked to classify word sets using different given error types [12]. However, since the data is not freely available, we cannot use such methods for our evaluation.

A dataset comprises a corpus, topics and human ratings. Topics had been computed using the corpus and are given by word sets consisting of the topics' top words. Human ratings for the topics had been created by presenting these word sets to human raters. Topics are rated regarding interpretability and understandability using three categories—good, neutral or bad [13].

The generation of such a dataset is expensive due to the manual work necessary to create human topic ratings. Recently, several datasets have been published [1, 6, 10, 16]. Additionally, the creation of such a dataset is separated from the topic model used to compute the topics, since the humans rate just plain word sets without any information about the topic model [1, 6, 10, 15, 16]. This opens the possibility of reusing them for evaluation.

The 20NG dataset contains the well-known 20 Newsgroups corpus, which consists of Usenet messages of 20 different groups. The Genomics corpus comprises scientific articles of 49 MEDLINE journals and is part of the TREC-Genomics Track⁸. Aletras and Stevenson [1] published 100 rated topics for both datasets, each represented by a set of 10 top words. Further, they published 100 rated topics that have been computed using 47 229 New York Times articles (NYT). Unfortunately, this corpus is not available to us.

Chang et al. [6] used two corpora, one comprising New York Times articles (RTL-NYT) and the other a Wikipedia subset (RTL-Wiki). A number of 900 topics were created for each of these corpora. [10] published human ratings for these topics. Human raters evaluated word subsets of size five randomly selected from the top words of each topic. We aggregated the ratings for each word set. Word sets with less than three ratings or words with encoding errors were removed.

The RTL-Wiki corpus is published in bag-of-words format.

[Figure: correlation with the human ratings for the measures $C_V$, $C_P$, $C_{UMass}$, $C_{one\text{-}any}$, $C_{UCI}$, $C_{NPMI}$ and $C_A$]

⁶ $S_{one}^{all}$ is used to formulate Shogenji's coherence and is defined like $S_{all}^{one}$ but with $W'$ and $W^*$ swapped. $S_{set}^{set} = \{(W', W^*) \mid W', W^* = W\}$ is used for Olsson's coherence. Note that $S_{set}^{set}$ contains only one single pair. Therefore, the aggregation function is the identity function $\sigma_i$. These special segmentation schemas are used only for these two coherences.

⁷ We added the window size to the model name, e.g., $P_{cw(5)}$ for the window size s = ±5. The context window is only used for indirect coherence measures. To represent all measures mentioned in [1], instead of $m_{nlr}$ the measures $m_{lr}$ or $m_P = P(W', W^*)$ could be used, respectively.

⁸ http://ir.ohsu.edu/genomics
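The ranking-correlation step of the evaluation can be sketched as follows. This is our own minimal version of Pearson's r, not the evaluation pipeline actually used, and the per-topic scores below are made-up illustration values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values: one coherence score and one averaged human
# rating per topic; a large r indicates a good coherence measure.
coherence_scores = [0.91, 0.40, 0.75, 0.10]
human_ratings = [2.8, 1.5, 2.2, 1.0]
print(round(pearson_r(coherence_scores, human_ratings), 3))  # close to 1
```

Because Pearson's r is invariant under linear rescaling, the coherence values and the human ratings need not live on the same scale.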
scientific philosophy that have not been used in the area of topic evaluation and—in the case of Fitelson's coherence—showed a good performance. Using our framework, we identified measures that clearly outperform all coherences proposed so far.

In the future, a next step will be a more detailed investigation of the sliding window, which could bring up new probability estimation methods. Future work will include the transfer of coherence to other applications beyond topic models, which can be done in a systematic way using the new unifying framework. Hence, our research agenda may lead to completely new insights into the behavior of topic models. Finally, our open framework enables researchers to measure the impact on a comparable basis for fields of application like text mining, information retrieval and the world wide web. Additionally, it will be possible to finally evaluate the reasons for specific behavior of statistics-driven models and to uncover distinguishing properties of benchmarks.

9. ACKNOWLEDGMENTS

We want to thank Jey Han Lau and Jordan Boyd-Graber for making their data available. A special thanks goes to Nikos Aletras for publishing his data and for several discussions about his approach. The work of the first author has been supported by the ESF and the Free State of Saxony. The work of the last author has been partially supported by the Klaus Tschira Foundation.

10. REFERENCES

[1] N. Aletras and M. Stevenson. Evaluating topic coherence using distributional semantics. In Proc. of the 10th Int. Conf. on Computational Semantics (IWCS'13), pages 13–22, 2013.
[2] L. Alsumait, D. Barbará, J. Gentle, and C. Domeniconi. Topic significance ranking of LDA generative models. In Proc. of the Europ. Conf. on Machine Learning and Knowledge Disc. in Databases: Part I, ECML PKDD '09, pages 67–82, 2009.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] G. Bouma. Normalized (pointwise) mutual information in collocation extraction. In From Form to Meaning: Processing Texts Automatically, Proc. of the Biennial GSCL Conference 2009, pages 31–40, Tübingen, 2009.
[5] L. Bovens and S. Hartmann. Bayesian Epistemology. Oxford University Press, 2003.
[6] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-graber, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22, pages 288–296, 2009.
[7] I. Douven and W. Meijs. Measuring coherence. Synthese, 156(3):405–425, 2007.
[8] B. Fitelson. A probabilistic theory of coherence. Analysis, 63(279):194–199, 2003.
[9] A. Hinneburg, R. Preiss, and R. Schröder. TopicExplorer: Exploring document collections with topic models. In ECML/PKDD (2), pages 838–841, 2012.
[10] J. H. Lau, D. Newman, and T. Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proc. of the Europ. Chap. of the Assoc. for Comp. Ling., 2014.
[11] D. Mimno and D. Blei. Bayesian checking for topic models. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, pages 227–237, 2011.
[12] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, pages 262–272, 2011.
[13] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108, 2010.
[14] E. Olsson. What is the problem of coherence and truth? The Jour. of Philosophy, 99(5):246–272, 2002.
[15] M. Röder, M. Speicher, and R. Usbeck. Investigating quality raters' performance using interface evaluation methods. In Informatik 2013, pages 137–139. GI, 2013.
[16] F. Rosner, A. Hinneburg, M. Röder, M. Nettling, and A. Both. Evaluating topic coherence measures. CoRR, abs/1403.6397, 2014.
[17] T. Shogenji. Is coherence truth conducive? Analysis, 59(264):338–345, 1999.
[18] K. Stevens, P. Kegelmeyer, D. Andrzejewski, and D. Buttler. Exploring topic coherence over many models and many topics. In Proc. of the 2012 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 952–961, 2012.