length of 5.9 words. For the "20 Newsgroups" dataset, we have 3285 documents and 13518 different words. The average document length is 49.6 words. The statistics of the datasets are in Table I.

TABLE I
DATASET

Dataset                  "Health Twitter"   "20 Newsgroups"
Documents Number         63326              3285
Different Words          24536              13518
Average Length (words)   5.9                49.6
For model implementation, we employ Gibbs sampling [7]. The number of iterations is 50 in our following experiments. Since the model inference involves random initialization, we repeat each experiment that includes LDA inference ten times to eliminate the effect caused by random initialization. The conclusions of our experiments rely on multiple runs and statistical tests. We also vary the number of topics in our experiments to make the results more reliable. For comparison, we choose three document prior settings: the symmetric priors (noted as "symm.") [7], the previous optimization (noted as "prev.") [2], and our proposed prior setting (noted as "prop."). The priors for topics are symmetric in all three settings. Symmetric priors assign the same value to every entry of the prior vector. The previous optimization uses the inference method of Equation 6, which integrates the priors out and learns an asymmetric prior during inference. For our proposed prior setting, we learn the priors before inference.
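To make the three settings concrete, here is a minimal sketch using gensim [17], which we already cite for topic modeling software. Note the hedges: gensim's LdaModel trains with variational inference rather than the Gibbs sampling [7] used in our experiments, alpha="auto" only approximates the previous optimization of [2], and learned_alpha is a hypothetical placeholder for the prior vector our proposed method would learn before inference.

```python
# A sketch of the three document prior settings with gensim's LdaModel.
# gensim uses variational inference, not Gibbs sampling, so this only
# illustrates how the priors differ, not the paper's exact implementation.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["flu", "vaccine", "health"], ["senate", "vote", "bill"]]  # toy corpus
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
K = 5  # number of topics

# "symm.": the same value for every entry of the document prior vector.
lda_symm = LdaModel(corpus, num_topics=K, id2word=dictionary, alpha="symmetric")

# "prev.": an asymmetric prior learned during inference, as in [2].
lda_prev = LdaModel(corpus, num_topics=K, id2word=dictionary, alpha="auto")

# "prop.": a prior learned *before* inference; learned_alpha is a
# hypothetical stand-in for the vector our method would produce.
learned_alpha = [0.10, 0.30, 0.05, 0.40, 0.15]
lda_prop = LdaModel(corpus, num_topics=K, id2word=dictionary, alpha=learned_alpha)
```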
B. Topic Evaluation

We describe a topic in the topic model as the list of words ranked according to their importance (probability) in the topic. The quality of a topic is how semantically meaningful the topic is. There are two categories of methods for judging the quality of a topic: extrinsic methods and intrinsic methods. Extrinsic methods involve human judgment, which differs from person to person, so they are not well defined (e.g., different people may have different interpretations of a topic). Intrinsic methods measure topic quality according to word co-occurrence in a corpus. In intrinsic measures, we get high scores if the words describing the topic co-occur with high probability in the corpus, and high scores indicate good topic quality. The UMass score [20] is one such intrinsic topic coherence measure, given in Equation 14, which considers the word order as well.

\[
\text{score}_{\text{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + 1}{D(w_i)} \tag{14}
\]

where N is the number of words describing the topic, D(w_i, w_j) is the number of documents containing both words w_i and w_j, and D(w_i) is the number of documents containing w_i.
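For concreteness, Equation 14 can be read off directly into a short Python sketch; the toy documents and the naive document-frequency counting are placeholders for the real corpus, and the code assumes every topic word occurs in at least one document (so D(w_i) > 0).

```python
import math

def umass_score(topic_words, documents):
    """UMass coherence of one topic, a direct reading of Equation 14.

    topic_words: the N highest-probability words describing the topic.
    documents: each document given as a set of its words.
    """
    def d(*words):  # D(.): number of documents containing all given words
        return sum(all(w in doc for w in words) for doc in documents)

    n = len(topic_words)
    total = 0.0
    for i in range(1, n):       # i = 2..N in the paper's 1-based indexing
        for j in range(i):      # j = 1..i-1
            w_i, w_j = topic_words[i], topic_words[j]
            total += math.log((d(w_i, w_j) + 1) / d(w_i))
    return 2.0 * total / (n * (n - 1))

# Toy usage: a two-word topic over three tiny documents.
docs = [{"flu", "vaccine"}, {"flu", "shot"}, {"senate", "vote"}]
print(umass_score(["flu", "vaccine"], docs))
```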
We use the UMass score as the topic description quality measure in our following experiments. The final UMass score is the average over all topics in the model. The models are LDA with the symmetric priors (noted as "symm."), with the previous optimization (noted as "prev."), and with our proposed prior setting (noted as "prop."). Since the number of topics is one possible factor affecting the quality of topics, we use different numbers of topics in each dataset; the numbers differ between the datasets according to their respective scales. We reserve the most frequent 20 words to describe a topic for all models. For "Health Twitter", we use 80, 90, and 100 topics in this experiment; the topic numbers for "20 Newsgroups" are 5, 10, and 15.

The UMass scores achieved are in Table II and Table III. We record the data in the format of the mean ± standard deviation over all trials. Bold text indicates results significantly better than the comparative models. Table II gives the UMass scores of topics extracted from the "Health Twitter" dataset: our proposed prior setting achieves topics with the highest UMass scores for all numbers of topics, while we observe no significant difference in UMass scores between LDA with symmetric priors and LDA with the previous optimal setting. Table III gives the UMass scores of topics extracted from the "20 Newsgroups" dataset: LDA with our proposed prior setting extracts topics with the highest UMass scores when the number of topics is 5 or 10, and when there are 15 topics there is no significant difference among the three parameter settings. Moreover, the UMass scores on "20 Newsgroups" are consistently better than those on "Health Twitter", which indicates that the average document length may affect the UMass scores as well.

TABLE II
UMASS SCORES OF "HEALTH TWITTER"

TABLE III
UMASS SCORES OF "20 NEWSGROUPS"

These experiments demonstrate that LDA with our proposed prior setting extracts topics of higher topic coherence than the other two settings. As in [21], [22], higher topic coherence scores indicate more semantically meaningful topics. We conclude from our experiments that LDA with our proposed prior setting extracts better topics than the current settings.
C. Documents Clustering

LDA provides a fixed-length vector representation of documents in the form of the distribution of topics given a specific document, p(topic|document), which is hard to evaluate directly. We therefore cluster documents using this representation and evaluate the clustering performance; the clustering quality serves as a measure of the document representation quality.

In this document clustering experiment, the document features are their respective topic distributions. The clustering algorithm is k-means. The inputs of k-means are the number of clusters and the initialization of each cluster; we use "k-means++" for initialization in our case.

There is no ground truth for clustering. Therefore, we choose the Calinski-Harabaz Index [23], as in Equation 15, to evaluate the clustering quality. It is the ratio of the mean between-cluster dispersion to the within-cluster dispersion and requires no ground truth.
\[
s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}
\]
\[
W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T \tag{15}
\]
\[
B_k = \sum_{q} n_q (c_q - c)(c_q - c)^T
\]

where N is the number of samples, C_q is the set of samples in cluster q, c_q is the center of cluster q, n_q is the number of samples in cluster q, and c is the center of all samples.
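As a sanity check on Equation 15, the index can be computed directly with NumPy. This is an illustrative sketch; it uses the identity that the trace of a sum of outer products equals the sum of squared distances.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabaz Index, a direct reading of Equation 15.

    Uses Tr(sum of (x - c)(x - c)^T) == sum of squared distances ||x - c||^2.
    """
    N = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    c = X.mean(axis=0)                 # center of all samples
    tr_W = 0.0                         # Tr(W_k), within-cluster dispersion
    tr_B = 0.0                         # Tr(B_k), between-cluster dispersion
    for q in clusters:
        X_q = X[labels == q]           # samples C_q in cluster q
        c_q = X_q.mean(axis=0)         # cluster center c_q
        tr_W += ((X_q - c_q) ** 2).sum()
        tr_B += len(X_q) * ((c_q - c) ** 2).sum()
    return (tr_B / tr_W) * (N - k) / (k - 1)
```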
A higher Calinski-Harabaz Index indicates better clustering quality. Note that the score depends on the number of samples, as in Equation 15, so we can only compare scores computed on the same dataset, not the tendency of scores across datasets.
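In practice the whole clustering evaluation fits in a few lines of scikit-learn [18], which we already cite; sklearn's calinski_harabasz_score computes the same index as the direct implementation above, and the random Dirichlet matrix below is only a stand-in for the real p(topic|document) features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Stand-in features: each row plays the role of a document's topic
# distribution p(topic|document) from a trained LDA model.
rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(alpha=np.ones(80), size=1000)  # 1000 docs, 80 topics

for n_clusters in (5, 10, 15, 20):  # the cluster numbers used in our experiments
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10)
    labels = km.fit_predict(doc_topic)
    print(n_clusters, calinski_harabasz_score(doc_topic, labels))
```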
We use "Health Twitter" in the clustering experiments. Since the cluster number is another factor that affects clustering quality, we also vary the number of clusters in our experiments besides the number of topics. The cluster numbers are 5, 10, 15, and 20.

The clustering results are in Table IV. The scores in the table are also in the format of the mean ± standard deviation over all trials, and bold text indicates significantly better results. We can observe that LDA with our proposed prior setting achieves significantly higher CH scores than LDA with the other two settings. The table reveals that LDA with our proposed prior setting performs significantly better than the other two settings in document clustering, which verifies that our proposed prior setting improves LDA's document representation.

TABLE IV
THE CALINSKI-HARABAZ INDEX IN DOCUMENTS CLUSTERING

D. Documents Classification

For documents with ground truth (i.e., class labels), we evaluate the document representation quality by conducting document classification: a better document representation performs better in the classification task. We use the distribution of topics given a specific document, p(topic|document), as the features in this experiment as well.

The classifier is a Multi-Layer Perceptron with 10 hidden layers, "relu" as the activation function, and "adam" as the solver [24]–[26]. We use precision, recall, and F1-score as the classification metrics. Since our task is a multi-labeled classification one, we use the weighted average score, i.e., the final score is the weighted average of the scores from all categories [27]. Higher scores indicate better classification performance.

The dataset in this experiment is "20 Newsgroups", which contains four categories. We split the dataset into 80% training data and 20% testing data. To avoid the problem of unbalanced training data, we split each category of data separately and then merge the documents from the four categories. The numbers of topics are 5 and 10 in the document classification experiments.
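A hedged sketch of this classification setup with scikit-learn [18] follows. Assumptions to note: the per-category 80%/20% split is emulated with stratified sampling, each hidden layer is given 100 units since the text only fixes the number of layers, and random features stand in for the real topic distributions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_recall_fscore_support

# Stand-ins for the real inputs: topic distributions as features, and the
# four "20 Newsgroups" category labels as ground truth.
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=np.ones(10), size=400)  # 400 docs, 10 topics
y = rng.integers(0, 4, size=400)                # 4 categories

# Stratified 80%/20% split: splitting each category separately and merging,
# as described above, is what stratify=y achieves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# MLP with 10 hidden layers (100 units each is our assumption), "relu"
# activation and the "adam" solver [24].
clf = MLPClassifier(hidden_layer_sizes=(100,) * 10,
                    activation="relu", solver="adam", max_iter=500)
clf.fit(X_tr, y_tr)

# Weighted-average precision, recall, and F1 over the categories [27].
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="weighted")
print(prec, rec, f1)
```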
The document classification results are in Table V and Table VI. The records are in the format of the mean ± standard deviation over all trials, and bold text indicates significantly better results. Table V is the classification result when using five topics in LDA: LDA with the previous prior optimization performs better regarding precision, while our proposed prior setting performs better regarding recall and F1-score. Table VI is the classification result when using 10 topics in LDA: LDA with our proposed prior setting performs better in precision, recall, and F1-score. We can conclude that LDA with our proposed prior setting performs better than LDA with the current prior settings. The experiments in document classification validate, in another aspect, that LDA with our proposed prior represents documents better.

TABLE V
"20 NEWSGROUPS" DOCUMENTS CLASSIFICATION, 5 TOPICS

TABLE VI
"20 NEWSGROUPS" DOCUMENTS CLASSIFICATION, 10 TOPICS
E. Modeling Ability

We compare the modeling ability of LDA with the different prior settings here. The modeling ability of a model is its ability to predict documents not seen in the training data; better modeling ability achieves a higher probability on such documents. We use the log-perplexity per word as the metric, which is the most popular measure in language modeling [28].

In this experiment, we split the dataset into a training set and a testing set of 90% and 10% respectively. As in the previous experiments, we train LDA with the three prior settings respectively and then compute the log-perplexity per word on the testing set. The topic numbers are again 5, 10, and 15 for "20 Newsgroups", while the topic numbers for "Health Twitter" are 80, 90, and 100. We also repeat each process ten times and record the mean and standard deviation.
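As a minimal sketch of this evaluation, gensim's LdaModel exposes a per-word log-likelihood bound on held-out documents that can serve as the log-perplexity-per-word metric; the toy corpus and the fixed split below are placeholders for the real datasets and the repeated trials.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in for a preprocessed corpus ("Health Twitter" or "20 Newsgroups").
docs = [["flu", "vaccine", "health"], ["senate", "vote", "bill"],
        ["flu", "shot", "clinic"], ["vote", "election", "senate"]] * 25
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# 90%/10% train/test split, as in this experiment.
split = int(0.9 * len(corpus))
train, test = corpus[:split], corpus[split:]

lda = LdaModel(train, num_topics=5, id2word=dictionary, alpha="symmetric")

# Per-word log-likelihood bound on the held-out set; the perplexity is
# 2 ** (-bound), so a higher bound means better modeling ability.
print(lda.log_perplexity(test))
```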
TABLE VII
LOG-PERPLEXITY PER WORD, "HEALTH TWITTER"

TABLE VIII
LOG-PERPLEXITY PER WORD, "20 NEWSGROUPS"

The results for "Health Twitter" and "20 Newsgroups" are in Table VII and Table VIII respectively. We observe no significant difference between LDA with the different prior settings, whereas for the same settings and datasets we observed significant differences in the previous experiments evaluating topic quality and document representation quality. The results are consistent with previous research [3]: better modeling ability does not necessarily lead to better models in real applications.

V. CONCLUSION

LDA is a useful tool in document organizing. Its effectiveness depends on the prior setting, especially the document prior. Previous research proposed an asymmetric prior setting that improves modeling ability; however, modeling ability does not necessarily agree with performance in real tasks. In this paper, we proposed a novel prior setting method for LDA and conducted experiments to compare the performance of LDA under different prior settings. The experiments validate that our proposed parameter setting significantly improves the performance of LDA in both topic description and document representation. Moreover, the experiments also show that better performance is not consistent with better modeling ability.

In the future, we will compare different features and different clustering algorithms for learning the priors, to improve the prior setting in multiple aspects.
ACKNOWLEDGMENT

This research is supported by the National Science Foundation award IIS-1739095. The authors would like to thank the anonymous reviewers for their helpful and constructive comments.
REFERENCES

[1] D. M. Blei, "Probabilistic topic models," Communications of the ACM, vol. 55, no. 4, pp. 77–84, 2012.
[2] H. M. Wallach, D. M. Mimno, and A. McCallum, "Rethinking LDA: Why priors matter," in Advances in Neural Information Processing Systems, 2009, pp. 1973–1981.
[3] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei, "Reading tea leaves: How humans interpret topic models," in Advances in Neural Information Processing Systems, 2009, pp. 288–296.
[4] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent semantic indexing: A probabilistic analysis," Journal of Computer and System Sciences, vol. 61, no. 2, pp. 217–235, 2000.
[5] T. Hofmann, "Probabilistic latent semantic analysis," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 289–296.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[7] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, no. suppl 1, pp. 5228–5235, 2004.
[8] S. Z. Selim and M. A. Ismail, "Soft clustering of multidimensional data: A semi-fuzzy approach," Pattern Recognition, vol. 17, no. 5, pp. 559–568, 1984.
[9] A. P. Gasch and M. B. Eisen, "Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering," Genome Biology, vol. 3, no. 11, pp. research0059–1, 2002.
[10] R. Winkler, F. Klawonn, and R. Kruse, "Fuzzy c-means in high dimensional spaces," International Journal of Fuzzy System Applications (IJFSA), vol. 1, no. 1, pp. 1–16, 2011.
[11] A. Mackiewicz and W. Ratajczak, "Principal components analysis (PCA)," Computers and Geosciences, vol. 19, pp. 303–342, 1993.
[12] R. Baeza-Yates, B. d. A. N. Ribeiro et al., Modern Information Retrieval. New York: ACM Press; Harlow, England: Addison-Wesley, 2011.
[13] N. M. Nasrabadi, "Pattern recognition and machine learning," Journal of Electronic Imaging, vol. 16, no. 4, p. 049901, 2007.
[14] A. Karami, A. Gangopadhyay, B. Zhou, and H. Kharrazi, "Fuzzy approach topic discovery in health and medical corpora," International Journal of Fuzzy Systems, vol. 20, no. 4, pp. 1334–1345, 2018.
[15] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, Tech. Rep., 1996.
[16] E. Loper and S. Bird, "NLTK: The Natural Language Toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ser. ETMTNLP '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 63–70. [Online]. Available: https://doi.org/10.3115/1118108.1118117
[17] R. Řehůřek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50, http://is.muni.cz/publication/884893/en.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[20] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 262–272.
[21] K. Stevens, P. Kegelmeyer, D. Andrzejewski, and D. Buttler, "Exploring topic coherence over many models and many topics," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012, pp. 952–961.
[22] M. Röder, A. Both, and A. Hinneburg, "Exploring the space of topic coherence measures," in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. ACM, 2015, pp. 399–408.
[23] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics - Theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[25] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2004, pp. 22–30.
[28] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.