Topic Modeling Bibliography — Quentin Pleple, qpleple@ucsd.edu
SECTION 1
Topic modeling pre-LDA
Landauer97
Thomas K. Landauer and Susan T. Dumais. Solutions to Plato's problem: The latent
semantic analysis theory of acquisition, induction, and representation of knowledge.
Psychological Review, (104), 1997
Deerwester90
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and
Richard Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391–407, 1990
LSA (or LSI) is a linear topic model based on the factorization of the document-word matrix X,
where x_dw is the count of occurrences of word w in document d. The goal is to find a low-rank
approximation X̃ of X by factorizing it into two matrices, one representing the documents and
the other the topics.
Use SVD on X = U Σ V^T. By keeping only the K largest singular values of Σ and
the corresponding vectors in U and V^T, we get the best rank-K approximation of X.
Rows of U represent documents, rows of V^T represent topics, and each document can be
expressed as a linear combination of topics.
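As an illustration (not code from the paper), a minimal NumPy sketch of the truncated-SVD construction described above, using a made-up toy count matrix:

```python
import numpy as np

# Hypothetical toy document-word count matrix X (4 documents, 5 words).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 1, 2, 2, 0],
], dtype=float)

K = 2  # number of latent topics to keep

# Full SVD: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-K truncation: the best rank-K approximation of X in Frobenius norm.
X_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Rows of Vt[:K, :] are the topics; each document is a linear
# combination of topics with coordinates U[:, :K] * s[:K].
doc_vectors = U[:, :K] * s[:K]
topics = Vt[:K, :]
```

Here K = 2 is an arbitrary choice; in practice K is picked by the modeler.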
Hofmann99
Thomas Hofmann. Probabilistic latent semantic analysis. In UAI, 1999
Introduced pLSI, a probabilistic topic model based on the following generative process. There
are no priors on θ_d and ϕ_k.
for document d = 1, ..., D do
    for position i = 1, ..., N in document d do
        Draw a topic z_di ∼ Discrete(θ_d)
        Draw a word w_di ∼ Discrete(ϕ_{z_di})
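The generative process above can be sketched directly in Python (an illustration, not code from the paper; the parameter values are arbitrary, since pLSI treats θ_d and ϕ_k as free parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, K, V = 3, 8, 2, 5  # documents, words per document, topics, vocabulary size

# pLSI places no priors on these; here we just fix arbitrary values
# (sampled once) to be able to run the generative process.
theta = rng.dirichlet(np.ones(K), size=D)  # per-document topic proportions
phi = rng.dirichlet(np.ones(V), size=K)    # per-topic word distributions

corpus = []
for d in range(D):
    doc = []
    for i in range(N):
        z = rng.choice(K, p=theta[d])  # draw topic z_di ~ Discrete(theta_d)
        w = rng.choice(V, p=phi[z])    # draw word w_di ~ Discrete(phi_z)
        doc.append(int(w))
    corpus.append(doc)
```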
Ding08
Chris Ding, Tao Li, and Wei Peng. On the equivalence between non-negative matrix
factorization and probabilistic latent semantic indexing. Computational Statistics and
Data Analysis, 52:3913–3927, 2008
Proved the equivalence between pLSI and NMF by showing that they both optimize the same
objective function. As they are different algorithms, this allows designing a hybrid method
alternating between NMF and pLSI, each switch jumping out of the local optimum reached by
the other method.
SECTION 2
LDA
Chronologically, Blei, Ng, and Jordan first published [Blei02], presenting LDA at NIPS and
treating the topics ϕ_k as free parameters. Shortly after, Griffiths and Steyvers published
[Griffiths02a] and [Griffiths02b], extending this model with a symmetric Dirichlet prior on ϕ_k.
Finally, Blei, Ng, and Jordan published an extended version [Blei03] of their first paper in the
Journal of Machine Learning Research (by far the most cited LDA paper), with a section on
this Dirichlet smoothing of the multinomial parameters ϕ_k.
Blei02
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. In
NIPS, 2002
First paper on LDA, quite short, not the one usually cited. See [Blei03].
Griffiths02a
T. Griffiths and M. Steyvers. A probabilistic approach to semantic representation. In
Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002
Derives a Gibbs sampler for LDA and introduces a Dirichlet prior on the topics ϕ_k.
Griffiths02b
Thomas L. Griffiths and Mark Steyvers. Prediction and semantic association. In NIPS,
pages 11–18, 2002
Almost the same paper as [Griffiths02a].
Blei03
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J.
Mach. Learn. Res., 3:993–1022, March 2003
Most cited paper for LDA, extended version of [Blei02].
Griffiths04
Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS,
101(suppl. 1):5228–5235, 2004
Less technical paper showing application of LDA to several datasets.
Heinrich04
Gregor Heinrich. Parameter estimation for text analysis. Technical report, 2004
Heavily detailed tutorial about LDA, and inference using Gibbs sampling.
Steyvers06
Mark Steyvers and Tom Griffiths. Probabilistic topic models. In T. Landauer, D. Mc-
namara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to
Meaning. Laurence Erlbaum, 2006
LDA had been around for 3 years; they give an in-depth review and analysis of probabilistic
topic models, full of deep insights. They propose measures capturing similarity between topics
(KL, symmetric KL, JS, cosine, L1, L2), between sets of words and documents, and between
words.
Blei12
David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012
A short, high-level review on topic models. Not technical.
Hoffman10
Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet
allocation. In NIPS, 2010
Present an online version of the Variational EM algorithm introduced in [Blei03].
SECTION 3
Evaluation of topic models
Wei06
Xing Wei and Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR,
2006
As an extrinsic evaluation of topic models, uses discovered topics for information retrieval.
Chang09
Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei.
Reading tea leaves: How humans interpret topic models. In NIPS, 2009
Showed that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment
are often not correlated, and even sometimes slightly anti-correlated.
They ran a large-scale experiment on the Amazon Mechanical Turk platform. For each topic,
they took the five top words of that topic and added a random sixth word. Then they presented
these lists of six words to people, asking which was the intruder word.
If everyone asked could tell which word was the intruder, then we can safely conclude that the
topic is good at describing an idea. If, on the other hand, many people identified other words
as the intruder, it means they could not see the logic in the association of words, and we can
conclude the topic was not good enough.
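The construction of one intruder task can be sketched as follows (an illustration with hypothetical words, not the authors' code):

```python
import random

random.seed(0)

# Hypothetical topic: its five top words, plus other vocabulary words
# that have low probability under this topic.
top_words = ["game", "team", "season", "player", "coach"]
low_prob_words = ["election", "protein", "guitar", "senate", "recipe"]

# The intruder is a random word NOT among the topic's top words.
intruder = random.choice(low_prob_words)

# Shuffle the six words before showing the list to a worker,
# who is asked to pick out the intruder.
task = top_words + [intruder]
random.shuffle(task)
```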
Wallach09a
Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation
methods for topic models. In Proceedings of the 26th Annual International Conference
on Machine Learning, ICML ’09, pages 1105–1112, New York, NY, USA, 2009. ACM
Gives many methods to compute approximations of the likelihood p(w_d | Φ, α) of one unseen
document, which is intractable but needed to evaluate topic models.
Buntine09
Wray L. Buntine. Estimating likelihoods for topic models. In Asian Conference on
Machine Learning, 2009
Like [Wallach09a], gives methods to estimate the likelihood of unseen documents.
AlSumait09
Loulwah AlSumait, Daniel Barbará, James Gentle, and Carlotta Domeniconi. Topic
significance ranking of LDA generative models. In ECML, 2009
Defines measures based on three prototypes of junk and insignificant topics, used to rank
discovered topics by a significance score. The three junk prototypes are the uniform word
distribution, the empirical corpus word distribution, and the uniform document distribution:

    p(w|topic) ∝ 1        p(w|topic) ∝ count(w in corpus)        p(d|topic) ∝ 1

The topic significance score is then based on combinations of dissimilarities (KL divergence,
cosine, and correlation) from those three junk prototypes.
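One ingredient of such a score can be sketched as the KL divergence of a topic from the uniform junk prototype (an illustration with made-up numbers, not the paper's exact scoring):

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical topic-word distribution over a 4-word vocabulary,
# and the "uniform word-distribution" junk prototype.
topic = np.array([0.7, 0.1, 0.1, 0.1])
uniform = np.full(4, 0.25)

# A peaked topic is far from the uniform junk prototype (large KL);
# a junk-like flat topic would score near zero and rank as insignificant.
significance = kl(topic, uniform)
```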
Newman10c
David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evalu-
ation of topic coherence. In NAACL, 2010
Tries different coherence measures on different datasets to compare them.
Mimno11b
David Mimno and David Blei. Bayesian checking for topic models. In EMNLP, 2011
Presents a Bayesian method measuring how well a topic model fits a corpus.
SECTION 4
Topic coherence
Newman10a
David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin.
Evaluating topic models for digital libraries. In Proceedings of the 10th annual joint
conference on Digital libraries, pages 215–224, New York, NY, USA, 2010. ACM
Introduced the UCI coherence measure

    ∑_{i<j} log [ p(w_i, w_j) / ( p(w_i) p(w_j) ) ]

for the top words w_1, ..., w_10 (based on PMI). The measure is extrinsic as it uses empirical
probabilities from an external corpus such as Wikipedia.
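As an illustration (not the authors' code), a minimal Python sketch of the UCI measure; the word probabilities here are made-up values standing in for estimates from an external corpus:

```python
import math
from itertools import combinations

def uci_coherence(top_words, p_single, p_joint, eps=1e-12):
    """UCI coherence: sum over word pairs of log p(wi, wj) / (p(wi) p(wj))."""
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        pij = p_joint.get(frozenset((wi, wj)), 0.0)
        # eps avoids log(0) for pairs never observed together.
        score += math.log((pij + eps) / (p_single[wi] * p_single[wj]))
    return score

# Hypothetical empirical probabilities, e.g. estimated from sliding
# windows over Wikipedia.
p_single = {"cat": 0.1, "dog": 0.1, "pet": 0.05}
p_joint = {frozenset(("cat", "dog")): 0.04,
           frozenset(("cat", "pet")): 0.03,
           frozenset(("dog", "pet")): 0.03}

score = uci_coherence(["cat", "dog", "pet"], p_single, p_joint)
```

Words that co-occur more often than chance give positive pair terms, so a coherent topic gets a high score.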
Mimno11a
David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCal-
lum. Optimizing semantic coherence in topic models. In EMNLP, 2011
Introduced the UMass coherence measure

    ∑_{i<j} log [ (1 + D(w_i, w_j)) / D(w_i) ]

for the top words w_1, ..., w_10, where D(w) is the number of documents containing w and
D(w_i, w_j) the number containing both (an intrinsic measure, computed on the training
corpus itself).
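A minimal Python sketch of the UMass measure (an illustration, not the authors' code; the document frequencies are made-up values):

```python
import math
from itertools import combinations

def umass_coherence(top_words, doc_freq, co_doc_freq):
    """UMass coherence: sum over word pairs of log (1 + D(wi, wj)) / D(wi),
    using document frequencies from the training corpus itself."""
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        d_ij = co_doc_freq.get(frozenset((wi, wj)), 0)
        # The +1 smoothing keeps the log finite for pairs that never co-occur.
        score += math.log((1 + d_ij) / doc_freq[wi])
    return score

# Hypothetical document frequencies D(w) and co-document frequencies D(wi, wj).
doc_freq = {"market": 40, "stock": 30, "trade": 20}
co_doc_freq = {frozenset(("market", "stock")): 25,
               frozenset(("market", "trade")): 10,
               frozenset(("stock", "trade")): 8}

score = umass_coherence(["market", "stock", "trade"], doc_freq, co_doc_freq)
```

Each pair term is a (smoothed) log conditional probability, so scores are at most 0; less negative means more coherent.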
Stevens12
Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. Exploring
topic coherence over many models and many topics. In Proceedings of the 2012 Joint
Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, EMNLP-CoNLL ’12, pages 952–961, Stroudsburg, PA,
USA, 2012. Association for Computational Linguistics
Explores computing the two coherence metrics, UCI from [Newman10a] and UMass from
[Mimno11a], on multiple datasets and for different numbers of topics, then aggregates the
results (computing average and entropy). They assume these two are good metrics and use
them to compare different topic models: LDA, LSA+SVD, and LSA+NMF.
SECTION 5
Interactive LDA
Andr09
David Andrzejewski, Xiaojin Zhu, and Mark Craven. Incorporating domain knowledge
into topic modeling via dirichlet forest priors. In ICML, pages 25–32, 2009
Makes the discovery of topics semi-supervised: a user repeatedly gives instructions on the top
words of discovered topics: "those words should be in the same topic", "those words should not
be in the same topic", and "those words should be by themselves". Instructions are encoded into
pair-wise constraints on words: two words must, or cannot, be in the same topic. The model is
then trained again with a complex new prior encoding the constraints, based on Dirichlet Forests.
Hu11
Yuening Hu, Jordan Boyd-Graber, and Brianna Satinoff. Interactive topic modeling.
In Association for Computational Linguistics, 2011
Extends the approach of [Andr09], proposing interactive topic modeling (ITM), where we don't
have to restart the Gibbs sampler after each human action. Instead, the prior is updated in
place to incorporate the new constraints, and the modified model is seen as the starting position
of a new Markov chain. Updating the model is done by state ablation: some topic-word
assignments are invalidated by setting z = −1, and the counts are decremented accordingly.
They explore several invalidation strategies: invalidate all assignments, only those of documents
containing any of the constrained terms, only those of the terms concerned, or none. After each
human action, the Gibbs sampler runs for 30 more iterations before asking for human feedback
again. Experiments were done using Amazon Mechanical Turk.
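The term-level ablation strategy can be sketched as follows (an illustration with a toy sampler state, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gibbs-sampler state: a topic assignment z for each token,
# plus the topic-word counts it implies (2 topics, 3-word vocabulary).
words = np.array([0, 1, 2, 1, 0, 2])     # token word ids
z = rng.integers(0, 2, size=words.size)  # token topic assignments
topic_word = np.zeros((2, 3), dtype=int)
for w, k in zip(words, z):
    topic_word[k, w] += 1

# State ablation for a constrained term (say word id 1): invalidate its
# assignments by setting z = -1, decrementing the counts accordingly.
CONSTRAINED_WORD = 1
for i in range(words.size):
    if words[i] == CONSTRAINED_WORD and z[i] != -1:
        topic_word[z[i], words[i]] -= 1
        z[i] = -1
```

The invalidated tokens are then resampled by the next Gibbs sweep, starting the new Markov chain from this ablated state.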
SECTION 6
Misc topic modeling
Pauca04
V. Paul Pauca, Farial Shahnaz, Michael W. Berry, and Robert J. Plemmons. Text
mining using non-negative matrix factorizations. In SDM, 2004
Reference for successful use of NMF for topic modeling.
Lee99
Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755), 1999
Reference for successful use of NMF for topic modeling.
Doyle09
Gabriel Doyle and Charles Elkan. Accounting for burstiness in topic models. In ICML,
2009
Elkan’s paper about burstiness.
Wallach09b
Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors
matter. In NIPS, 2009
Study the effect of different priors on LDA output.
Andr11
David Andrzejewski, Xiaojin Zhu, Mark Craven, and Ben Recht. A framework for
incorporating general domain knowledge into latent Dirichlet allocation using first-order
logic. In IJCAI, 2011
Uses discovered topics in a search engine, via query expansion (like we do in Squid).
Chang10
Jonathan Chang. Not-so-latent Dirichlet allocation: collapsed Gibbs sampling using
human judgments. In Proceedings of the NAACL HLT 2010 Workshop on Creating
Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages
131–138, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics
Presents a “manual” topic modeling where humans are asked to repeatedly tag documents and
cluster these annotations.
Chuang12
Jason Chuang, Christopher D. Manning, and Jeffrey Heer. Termite: Visualization
techniques for assessing textual topic models. In Advanced Visual Interfaces, 2012
Defines the distinctiveness D(w) of a word as the KL divergence between the topic distribution
p(k|w) given the word w and the empirical topic distribution p(k) of the corpus. The
distinctiveness of a word can also be weighted by its frequency, which defines the saliency S(w)
of the word:

    D(w) = ∑_{topic k} p(k|w) log [ p(k|w) / p(k) ] = KL( p(k|w) ‖ p(k) )        S(w) = p(w) D(w)
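The two quantities can be computed in a few lines of NumPy (an illustration with made-up distributions, not the authors' code):

```python
import numpy as np

# Hypothetical p(k|w): topic distribution given each word (rows: 3 words,
# columns: 2 topics), empirical topic distribution p(k), and word
# frequencies p(w).
p_k_given_w = np.array([[0.9, 0.1],
                        [0.6, 0.4],
                        [0.2, 0.8]])
p_k = np.array([0.6, 0.4])
p_w = np.array([0.5, 0.3, 0.2])

# Distinctiveness: KL( p(k|w) || p(k) ) for each word.
distinctiveness = np.sum(p_k_given_w * np.log(p_k_given_w / p_k), axis=1)

# Saliency: distinctiveness weighted by word frequency.
saliency = p_w * distinctiveness
```

A word whose topic distribution matches the corpus-wide p(k) gets distinctiveness near zero; frequent, topic-specific words get high saliency.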
Also presents a new visualization of topic distributions based on a matrix of circles, and a word
ordering such that topics span contiguous words.
SECTION 7
Misc
Campbell66
L. L. Campbell. Exponential entropy as a measure of extent of a distribution. In
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, volume 5, 1966
Reference for the argument that, given a distribution, the exponential entropy is a measure of
the extent, or spread, of the distribution. For example, it can measure how much a word is
shared across several topics.
Geman84
Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741,
November 1984
Introduced Gibbs sampling.
Bottou98
Léon Bottou. Online algorithms and stochastic approximations. In David Saad, editor,
Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK,
1998. revised, oct 2012
Convergence of online algorithms; gives the condition ∑_{t=0}^∞ ρ_t² < ∞ needed to prove
the convergence of Online Variational LDA [Hoffman10].
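As a quick numerical illustration (not from the paper), step sizes of the form ρ_t = (τ₀ + t)^(−κ), as used in online variational LDA, satisfy the condition when κ > 0.5, since 2κ > 1 makes the squared series convergent; the values τ₀ = 1 and κ = 0.7 below are arbitrary:

```python
# Partial sums of rho_t**2 for rho_t = (tau0 + t)**(-kappa).
tau0, kappa = 1.0, 0.7

# With 2 * kappa = 1.4 > 1, the series sum rho_t**2 converges,
# so the partial sums stay bounded as t grows.
sq_partial = sum((tau0 + t) ** (-2 * kappa) for t in range(100000))
```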
Lee00
Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factoriza-
tion. In In NIPS, pages 556–562. MIT Press, 2000
Reference for NMF.
Bishop06
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Sci-
ence and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006
Heavy book used as a reference on Bayesian methods.
Tzikas08
Dimitris Tzikas, Aristidis Likas, and Nikolaos Galatsanos. The variational approxima-
tion for Bayesian inference. IEEE Signal Processing Magazine, 25(6):131–146, Novem-
ber 2008
A step-by-step tutorial on the EM algorithm, following closely the Bishop book [Bishop06].
They describe MAP as "poor man's Bayesian inference", as it is a way of including prior
knowledge without having to pay the expensive price of computing the normalizer.
Crump13
Matthew J. C. Crump, John V. McDonnell, and Todd M. Gureckis. Evaluating Ama-
zon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE,
8(3):e57410+, March 2013
On using Amazon Mechanical Turk for experiments. One experiment measures the performance
of users when varying the amount of the financial incentive: either $2 plus a bonus of up
to $2.50 based on task performance, or $0.75 with no bonus. Results show that the amount
of the incentive does not affect task performance, but does affect the rate at which workers
sign up for the task.