
Topic Modeling Bibliography
Quentin Pleple, qpleple@ucsd.edu

SECTION 1
Topic modeling pre-LDA

Landauer97

Thomas K. Landauer and Susan T. Dumais. Solutions to Plato's problem: The latent
semantic analysis theory of acquisition, induction, and representation of knowledge.
Psychological Review, (104), 1997

Deerwester90

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and


Richard Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391–407, 1990
LSA (or LSI) is a linear topic model based on the factorization of the document-word matrix X,
where x_dw is the count of occurrences of word w in document d. The goal is to find a low-rank
approximation X̃ of X by factorizing it into two matrices, one representing the documents and
the other the topics.
Use the SVD X = U Σ V^T. By keeping only the K largest singular values of Σ and the
corresponding vectors in U and V^T, we get the best rank-K approximation of the matrix X.
Rows of U represent the documents, and rows of V^T represent the topics. Each document can
be expressed as a linear combination of topics.
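
A minimal sketch of this truncated-SVD construction (the toy matrix and the rank K below are
illustrative, not from the paper):

import numpy as np

# Toy document-word count matrix X (D documents x W words); values are illustrative.
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 3, 1, 2],
    [0, 1, 2, 2, 1],
], dtype=float)

K = 2  # number of latent topics to keep

# Full SVD, then keep only the K largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]

X_approx = U_k @ S_k @ Vt_k   # best rank-K approximation of X
doc_repr = U_k @ S_k          # documents as mixtures of K latent topics
topics = Vt_k                 # rows are topics over the vocabulary

print("reconstruction error:", np.linalg.norm(X - X_approx))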

Hofmann99

Thomas Hofmann. Probabilistic latent semantic analysis. In UAI, 1999


Introduced pLSI, a probabilistic topic model based on the following generative process. No
priors on θ_d and ϕ_k.
for document d = 1, ..., D do
    for position i = 1, ..., N in document d do
        Draw a topic z_di ∼ Discrete(θ_d)
        Draw a word w_di ∼ Discrete(ϕ_{z_di})
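
A hedged numpy sketch of this generative process (the corpus sizes, θ, and ϕ below are made up
for illustration; pLSI itself estimates θ_d and ϕ_k by EM rather than drawing them from priors):

import numpy as np

rng = np.random.default_rng(0)

D, N, K, V = 4, 20, 3, 10   # documents, words per doc, topics, vocabulary size (illustrative)

# In pLSI these are free parameters estimated by EM; here we draw them from a symmetric
# Dirichlet purely to have something to sample from.
theta = rng.dirichlet(np.ones(K), size=D)   # theta[d]: topic mixture of document d
phi = rng.dirichlet(np.ones(V), size=K)     # phi[k]: word distribution of topic k

corpus = []
for d in range(D):
    doc = []
    for i in range(N):
        z = rng.choice(K, p=theta[d])       # draw a topic z_di ~ Discrete(theta_d)
        w = rng.choice(V, p=phi[z])         # draw a word  w_di ~ Discrete(phi_z)
        doc.append(w)
    corpus.append(doc)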

Ding08

Chris Ding, Tao Li, and Wei Peng. On the equivalence between non-negative matrix
factorization and probabilistic latent semantic indexing. Computational Statistics and
Data Analysis, 52:3913–3927, 2008
Proved the equivalence between pLSI and NMF by showing that they both optimize the same
objective function. As they are different algorithms, this allows designing a hybrid method
alternating between NMF and pLSI, each run jumping out of the local optimum reached by the
other method.


SECTION 2
LDA

Chronologically, Blei, Ng, and Jordan first published [Blei02], presenting LDA at NIPS and
treating the topics ϕ_k as free parameters. Shortly after, Griffiths and Steyvers published
[Griffiths02a] and [Griffiths02b], extending this model by adding a symmetric Dirichlet prior
on ϕ_k. Finally, Blei, Ng, and Jordan published an extended version [Blei03] of their first paper
in the Journal of Machine Learning Research (by far the most cited LDA paper), with a section
on this Dirichlet smoothing of the multinomial parameters ϕ_k.

Blei02

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. In
NIPS, 2002
First paper on LDA, quite short, rarely used as a reference. See [Blei03].

Griffiths02a

T. Griffiths and M. Steyvers. A probabilistic approach to semantic representation. In
Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002
Derives a Gibbs sampler for LDA and introduces a Dirichlet prior on the topics ϕ_k.

Griffiths02b

Thomas L. Griffiths and Mark Steyvers. Prediction and semantic association. In NIPS,
pages 11–18, 2002
Almost the same paper as [Griffiths02a].

Blei03

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J.
Mach. Learn. Res., 3:993–1022, March 2003
Most cited paper for LDA, extended version of [Blei02].

Griffiths04

Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS,
101(suppl. 1):5228–5235, 2004
Less technical paper showing applications of LDA to several datasets.


Heinrich04

Gregor Heinrich. Parameter estimation for text analysis. Technical report, 2004
Heavily detailed tutorial on LDA and on inference using Gibbs sampling.
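
A compact sketch of a collapsed Gibbs sampler for LDA of the kind derived in this tutorial; the
corpus, hyperparameters, and iteration count are illustrative. Each assignment z_di is resampled
from p(z_di = k | z_-di, w) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ), with the token's own counts
removed first.

import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids (illustrative).
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 0, 1, 4]]
K, V = 2, 5                 # topics, vocabulary size
alpha, beta = 0.1, 0.01     # symmetric Dirichlet hyperparameters

n_dk = np.zeros((len(docs), K))   # topic counts per document
n_kw = np.zeros((K, V))           # word counts per topic
n_k = np.zeros(K)                 # total counts per topic
z = [[0] * len(d) for d in docs]  # topic assignments

# Random initialization of the assignments.
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = rng.integers(K)
        z[d][i] = k
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1

for it in range(100):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the current assignment from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Conditional p(z_di = k | rest), up to a constant.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1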

Steyvers06

Mark Steyvers and Tom Griffiths. Probabilistic topic models. In T. Landauer, D. Mc-
Namara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to
Meaning. Lawrence Erlbaum, 2006
Written three years after LDA was introduced, they give an in-depth review and analysis of
probabilistic topic models, full of deep insights. They propose measures capturing similarity
between topics (KL, symmetric KL, JS, cosine, L1, L2), between sets of words and documents,
and between words.
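
A small sketch of a few of these similarity measures applied to two topic-word distributions
(the two toy distributions are illustrative):

import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def sym_kl(p, q):
    return kl(p, q) + kl(q, p)

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# Two toy topics over a 4-word vocabulary.
phi1 = np.array([0.6, 0.2, 0.1, 0.1])
phi2 = np.array([0.1, 0.1, 0.3, 0.5])

print(kl(phi1, phi2), sym_kl(phi1, phi2), js(phi1, phi2), cosine(phi1, phi2))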

Blei12

David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012
A short, high-level review on topic models. Not technical.

Hoffman10

Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent dirichlet
allocation. In NIPS, 2010
Presents an online version of the variational EM algorithm introduced in [Blei03].
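
At its core, the online algorithm replaces the batch M-step with a stochastic update of the
global variational parameters λ using a decaying step size ρ_t = (τ_0 + t)^(-κ). A schematic
sketch under made-up dimensions, with a random placeholder standing in for the per-minibatch
estimate λ̂_t that the real variational E-step would compute:

import numpy as np

rng = np.random.default_rng(0)

K, V = 10, 1000           # topics, vocabulary size (illustrative)
tau0, kappa = 1.0, 0.7    # step-size schedule rho_t = (tau0 + t)^(-kappa)

lam = rng.gamma(100.0, 0.01, size=(K, V))   # global variational topic-word parameters

for t in range(1000):
    rho_t = (tau0 + t) ** (-kappa)
    # Placeholder for the minibatch estimate lambda_hat computed by the variational E-step.
    lam_hat = rng.gamma(100.0, 0.01, size=(K, V))
    # Online update: blend the old global parameters with the minibatch estimate.
    lam = (1.0 - rho_t) * lam + rho_t * lam_hat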

SECTION 3
Evaluation of topic models

Wei06

Xing Wei and Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR,
2006
As an extrinsic evaluation of topics, uses the discovered topics for ad-hoc information retrieval.

Chang09

Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei.
Reading tea leaves: How humans interpret topic models. In NIPS, 2009
Shows that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment
are often not correlated, and are even sometimes slightly anti-correlated.
They ran a large-scale experiment on the Amazon Mechanical Turk platform. For each topic,
they took the five top words of that topic and added a random sixth word. Then they presented
these lists of six words to people, asking them which one is the intruder word.


If all the people asked could tell which word is the intruder, then we can safely conclude that
the topic is good at describing a coherent idea. If, on the other hand, many people identified
other words as the intruder, it means they could not see the logic in the association of words,
and we can conclude the topic was not good enough.
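
A rough sketch of how such intruder lists could be assembled from a fitted topic-word matrix
(the vocabulary, toy topics, and the choice of the least probable word as intruder are illustrative
assumptions; the paper additionally requires the intruder to be probable in some other topic):

import numpy as np

rng = np.random.default_rng(0)

vocab = np.array(["game", "team", "player", "season", "coach",
                  "market", "stock", "price", "trade", "bank"])
# Toy topic-word matrix phi (2 topics x 10 words); rows sum to 1.
phi = rng.dirichlet(np.ones(len(vocab)), size=2)

def intruder_list(phi_k, n_top=5):
    """Top n_top words of a topic plus one low-probability 'intruder' word, shuffled."""
    order = np.argsort(phi_k)[::-1]
    top = vocab[order[:n_top]]
    intruder = vocab[order[-1]]          # simplification: the topic's least probable word
    words = np.concatenate([top, [intruder]])
    rng.shuffle(words)
    return list(words), str(intruder)

for k in range(phi.shape[0]):
    words, intruder = intruder_list(phi[k])
    print(f"topic {k}: {words}  (intruder: {intruder})")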

Wallach09a

Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation
methods for topic models. In Proceedings of the 26th Annual International Conference
on Machine Learning, ICML ’09, pages 1105–1112, New York, NY, USA, 2009. ACM
Gives tons of methods to compute approximations of the likelihood p(w_d | Φ, α) of an unseen
document, which is intractable but needed to evaluate topic models.

Buntine09

Wray L. Buntine. Estimating likelihoods for topic models. In Asian Conference on
Machine Learning, 2009
Like [Wallach09a], gives methods to estimate the likelihood of unseen documents.

AlSumait09

Loulwah AlSumait, Daniel Barbará, James Gentle, and Carlotta Domeniconi. Topic
significance ranking of LDA generative models. In ECML, 2009
Defines measures based on three prototypes of junk and insignificant topics to rank the discovered
topics according to their significance score. The three junk prototypes are the uniform word
distribution, the empirical corpus word distribution, and the uniform document distribution:

p(w|topic) ∝ 1        p(w|topic) ∝ count(w in corpus)        p(d|topic) ∝ 1

The topic significance score is then based on combinations of dissimilarities (KL divergence,
cosine, and correlation) from those three junk prototypes.
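
A rough sketch of measuring how far one topic is from the first two junk prototypes (toy counts
and a toy topic; the actual significance score in the paper combines several such dissimilarities):

import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Toy corpus word counts over a 5-word vocabulary and one discovered topic.
corpus_counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])
topic = np.array([0.70, 0.15, 0.10, 0.03, 0.02])

uniform = np.full_like(topic, 1.0 / len(topic))      # junk prototype: uniform over words
empirical = corpus_counts / corpus_counts.sum()      # junk prototype: corpus word distribution

# A larger divergence from the junk prototypes suggests a more significant topic.
print("KL from uniform:  ", kl(topic, uniform))
print("KL from empirical:", kl(topic, empirical))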

Newman10c

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evalu-
ation of topic coherence. In NAACL, 2010
Tries different coherence measures on different datasets in order to compare them.

Mimno11b

David Mimno and David Blei. Bayesian checking for topic models. In EMNLP, 2011
Presents a Bayesian method for measuring how well a topic model fits a corpus.


SECTION 4
Topic coherence

Newman10a

David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin.
Evaluating topic models for digital libraries. In Proceedings of the 10th annual joint
conference on Digital libraries, pages 215–224, New York, NY, USA, 2010. ACM
Introduced the UCI coherence measure Σ_{i<j} log [ p(w_i, w_j) / (p(w_i) p(w_j)) ] over the top
words w_1, ..., w_10 (based on PMI). The measure is extrinsic as it uses empirical probabilities
from an external corpus such as Wikipedia.
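
A sketch of this measure with probabilities estimated as document frequencies in a reference
corpus (the document-frequency estimate and the smoothing constant are simplifying assumptions;
the paper estimates the probabilities from sliding windows over an external corpus such as
Wikipedia):

import numpy as np
from itertools import combinations

def uci_coherence(top_words, docs, eps=1e-12):
    """UCI coherence: sum over word pairs of log p(wi, wj) / (p(wi) p(wj)),
    with probabilities estimated as document frequencies in a reference corpus."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    def p(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets) / n
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        score += np.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
    return score

# Toy reference corpus and a toy topic's top words.
reference_docs = [["cat", "dog", "pet"], ["dog", "bone", "pet"], ["stock", "market", "price"]]
print(uci_coherence(["cat", "dog", "pet"], reference_docs))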

Mimno11a

David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCal-
lum. Optimizing semantic coherence in topic models. In EMNLP, 2011
Introduced the UMass coherence measure Σ_{i<j} log [ (1 + D(w_i, w_j)) / D(w_i) ] over the top
words w_1, ..., w_10, where D counts the training documents containing the given word(s)
(intrinsic measure).
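
A sketch of this measure using document co-occurrence counts from the training corpus itself
(the toy corpus and top words are illustrative):

from itertools import combinations
from math import log

def umass_coherence(top_words, docs):
    """UMass coherence: sum over word pairs of log (1 + D(wi, wj)) / D(wi),
    where D counts documents of the training corpus containing the word(s)."""
    doc_sets = [set(d) for d in docs]
    def D(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets)
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        score += log((1 + D(wi, wj)) / D(wi))
    return score

train_docs = [["cat", "dog", "pet"], ["dog", "bone", "pet"], ["cat", "pet", "food"]]
print(umass_coherence(["cat", "dog", "pet"], train_docs))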

Stevens12

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. Exploring
topic coherence over many models and many topics. In Proceedings of the 2012 Joint
Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, EMNLP-CoNLL ’12, pages 952–961, Stroudsburg, PA,
USA, 2012. Association for Computational Linguistics
Explores computing the two coherence metrics, UCI from [Newman10a] and UMass from [Mimno11a],
on multiple datasets and for different numbers of topics, then aggregates the results (computing
average and entropy). They assume these two are good metrics and use them to compare different
topic models: LDA, LSA+SVD, and LSA+NMF.

SECTION 5
Interactive LDA

Andr09

David Andrzejewski, Xiaojin Zhu, and Mark Craven. Incorporating domain knowledge
into topic modeling via Dirichlet forest priors. In ICML, pages 25–32, 2009
Makes the discovery of topics semi-supervised: a user repeatedly gives orders on the top words
of discovered topics: “those words should be in the same topic”, “those words should not be in
the same topic”, and “those words should be by themselves”. Orders are encoded into pair-wise


constraints on words: two words either have to be or cannot be in the same topic. Then the model
is trained again with a complex new prior encoding the constraints, based on Dirichlet forests.

Hu11

Yuening Hu, Jordan Boyd-Graber, and Brianna Satinoff. Interactive topic modeling.
In Association for Computational Linguistics, 2011
Extends the approach from [Andr09] by proposing interactive topic modeling (ITM), where we
don't have to restart the Gibbs sampler after each human action. Instead, the prior is updated
in place to incorporate the new constraints, and the modified model is treated as the starting
position of a new Markov chain. Updating the model is done by state ablation: some topic-word
assignments are invalidated by setting z = −1, and the counts are decremented accordingly.
They explore several invalidation strategies: invalidate all assignments, only those of documents
containing any of the constrained terms, only those of the constrained terms themselves, or none.
After each human action, the Gibbs sampler runs for 30 more iterations before asking for human
feedback again. Experiments were done using Amazon Mechanical Turk.
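
A minimal sketch of the state-ablation step, as described above, under assumed data structures
(a flat list of token assignments plus the count matrices of a collapsed Gibbs sampler; the
invalidation strategy shown is "only tokens of the constrained terms"):

import numpy as np

# Assumed sampler state: token word ids, their topic assignments, and count matrices.
words = np.array([0, 1, 2, 1, 3])        # word id of each token
z = np.array([0, 0, 1, 1, 1])            # current topic assignment of each token
doc_of = np.array([0, 0, 0, 1, 1])       # document id of each token
n_dk = np.array([[2, 1], [0, 2]])        # document-topic counts
n_kw = np.array([[1, 1, 0, 0],           # topic-word counts
                 [0, 1, 1, 1]])

def ablate(token_ids):
    """Invalidate the given tokens: set z = -1 and decrement the corresponding counts."""
    for t in token_ids:
        k, w, d = z[t], words[t], doc_of[t]
        if k == -1:
            continue                      # already invalidated
        n_dk[d, k] -= 1
        n_kw[k, w] -= 1
        z[t] = -1                         # will be resampled by the next Gibbs sweep

# Example: invalidate every token whose word appears in a new constraint on word ids {1, 2}.
constrained = {1, 2}
ablate([t for t in range(len(words)) if words[t] in constrained])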

SECTION 6
Misc topic modeling

Pauca04

V. Paul Pauca, Farial Shahnaz, Michael W. Berry, and Robert J. Plemmons. Text
mining using non-negative matrix factorizations. In SDM, 2004
Reference for successful use of NMF for topic modeling.

Lee99

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755), 1999
Reference for successful use of NMF for topic modeling.

Doyle09

Gabriel Doyle and Charles Elkan. Accounting for burstiness in topic models. In ICML,
2009
Elkan’s paper about burstiness.

Wallach09b

Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors
matter. In NIPS, 2009
Studies the effect of different priors on the LDA output.


Andr11

David Andrzejewski, Xiaojin Zhu, Mark Craven, and Ben Recht. A framework for
incorporating general domain knowledge into latent Dirichlet allocation using first-
order logic. In IJCAI, 2011
Use discovered topics in a search engine, use query expansion (like we do in Squid).

Chang10

Jonathan Chang. Not-so-latent Dirichlet allocation: Collapsed Gibbs sampling using
human judgments. In Proceedings of the NAACL HLT 2010 Workshop on Creating
Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages
131–138, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics
Presents a “manual” topic modeling approach in which humans are repeatedly asked to tag
documents and to cluster these annotations.

Chuang12

Jason Chuang, Christopher D. Manning, and Jeffrey Heer. Termite: Visualization
techniques for assessing textual topic models. In Advanced Visual Interfaces, 2012
Defines the distinctiveness D(w) of a word as the KL divergence between the topic distribution
p(k|w) given the word w and the empirical topic distribution p(k) of the corpus. The
distinctiveness of a word can also be weighted by its frequency, which defines the saliency S(w)
of the word:

D(w) = Σ_k p(k|w) log [ p(k|w) / p(k) ] = KL( p(k|w) ‖ p(k) )        S(w) = p(w) D(w)

Also presents a new visualization of topic distributions based on a matrix of circles, and a word
ordering such that topics span contiguous words.
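
A short sketch computing distinctiveness and saliency from these definitions (the toy p(k|w),
p(k), and p(w) below are illustrative; in practice they are derived from the fitted model and
corpus counts):

import numpy as np

# Toy quantities for a 3-topic model and a 4-word vocabulary (rows of p_k_given_w sum to 1).
p_k_given_w = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.6, 0.2],
                        [0.3, 0.4, 0.3],
                        [0.1, 0.1, 0.8]])
p_k = np.array([0.4, 0.3, 0.3])          # empirical topic distribution of the corpus
p_w = np.array([0.4, 0.3, 0.2, 0.1])     # empirical word frequencies

# Distinctiveness D(w) = KL( p(k|w) || p(k) ), saliency S(w) = p(w) * D(w).
distinctiveness = np.sum(p_k_given_w * np.log(p_k_given_w / p_k), axis=1)
saliency = p_w * distinctiveness

print("D(w):", distinctiveness)
print("S(w):", saliency)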

SECTION 7
Misc

Campbell66

L. L. Campbell. Exponential entropy as a measure of extent of a distribution. In
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, volume 5, 1966
Reference for the argument that, given a distribution, the exponential entropy is a measure of
the extent, or spread, of the distribution. For example, it can measure how much a word is
shared across several topics.
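
A tiny sketch of exponential entropy exp(H(p)) as such a spread measure, applied here to a
made-up distribution of a word over topics:

import numpy as np

def exp_entropy(p):
    """Exponential entropy exp(H(p)); roughly the effective number of outcomes of p."""
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

# A word concentrated in one topic vs a word shared across four topics.
print(exp_entropy(np.array([0.97, 0.01, 0.01, 0.01])))  # close to 1
print(exp_entropy(np.array([0.25, 0.25, 0.25, 0.25])))  # exactly 4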

Geman84


Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741,
November 1984
Introduced Gibbs sampling.

Bottou98

Léon Bottou. Online algorithms and stochastic approximations. In David Saad, editor,
Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK,
1998. Revised October 2012
Convergence of online algorithms; gives the condition Σ_{t=0}^∞ ρ_t² < ∞ needed to prove the
convergence of Online Variational LDA [Hoffman10].

Lee00

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factoriza-
tion. In NIPS, pages 556–562. MIT Press, 2000
Reference for NMF.
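
A hedged sketch of the multiplicative update rules from this paper for the Frobenius-norm
objective X ≈ WH (the toy matrix, rank, and iteration count are illustrative; a small ε avoids
division by zero):

import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative document-word matrix and target rank.
X = rng.random((6, 8))
K, eps = 3, 1e-9

W = rng.random((X.shape[0], K))
H = rng.random((K, X.shape[1]))

for _ in range(200):
    # Multiplicative updates that keep W and H non-negative.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

print("reconstruction error:", np.linalg.norm(X - W @ H))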

Bishop06

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Sci-
ence and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006
Hefty book used as a reference on Bayesian methods.

Tzikas08

Dimitris Tzikas, Aristidis Likas, and Nikolaos Galatsanos. The variational approxima-
tion for Bayesian inference. IEEE Signal Processing Magazine, 25(6):131–146, Novem-
ber 2008
A step-by-step tutorial on the EM algorithm, closely following the Bishop book [Bishop06]. They
describe MAP estimation as the poor man's Bayesian inference, as it is a way of including prior
knowledge without having to pay the expensive price of computing the normalizer.

Crump13

Matthew J. C. Crump, John V. McDonnell, and Todd M. Gureckis. Evaluating Ama-
zon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE,
8(3):e57410+, March 2013
On using Amazon Mechanical Turk for running experiments. One experiment measures the
performance of workers while varying the amount of the financial incentive: either $2 plus a
bonus of up to $2.50 based on task performance, or $0.75 with no bonus. Results show that the
amount of the incentive does not affect task performance but does affect the rate at which
workers sign up for the task.
