
Topic Modeling using Latent Dirichlet Allocation

Mohit Kothari
Computer Science and Engineering
University of California, San Diego
mkothari@ucsd.edu
Sonali Rahagude
Computer Science and Engineering
University of California, San Diego
srahagud@ucsd.edu
Abstract
Latent Dirichlet Allocation (LDA) is a probabilistic, generative model designed to
discover latent topics in text corpora. The idea behind LDA is to model documents
as arising from multiple topics, where a topic is defined to be a distribution over
a fixed vocabulary of terms. In this report, we train an LDA model on two datasets,
namely Classic400 and BBC News, using the method of collapsed Gibbs sampling.
We discuss issues related to Gibbs sampling, defining goodness-of-fit criteria,
parameter tuning and convergence, and analyze the experimental results. We test
the effectiveness of LDA in modeling and discovering latent topics in the corpus
using the variation of information (VI) distance measure.
1 Introduction
In recent years, the amount of available data such as text and media has increased exponentially,
and people have been continuously trying to extract useful information from it. For example, given
a set of raw text documents, a good way to extract information is to find some keywords that
succinctly describe what each document is about. We can then discover different themes that may span
a given corpus of documents. Hence, the goal is to find short descriptions for documents that
enable efficient processing of large collections while preserving the essential statistical structure of the
documents.
Latent Dirichlet Allocation (LDA)[1] is the simplest topic model that specifically aims to find these
short descriptions for members of a data corpus. LDA is an unsupervised, generative model that
proposes a stochastic procedure for modeling the words in a given collection of documents. LDA
was originally proposed in the context of text mining, but its applications have spread to a variety
of fields, including collaborative filtering, content-based image retrieval and bioinformatics.
Because words carry very strong semantic information, documents that contain similar
content will most likely use a similar set of words. As such, mining an entire corpus of text
documents can expose sets of words that frequently co-occur within documents. These sets of words can
be interpreted as topics, and they act as the building blocks of the short descriptions.
The report is structured as follows. Section 2 covers the theoretical background. Section 3
describes the design choices we make and details the goodness-of-fit of a topic model.
Section 4 discusses the results we obtain and comments on the behavior observed when
changing the hyper-parameters $K$, $\alpha$ and $\beta$. Section 5 concludes with a summary.
2 Background
LDA is a probabilistic model designed to discover latent topics in text corpora. It is a
three-level hierarchical Bayesian model[1] in which each document in a collection is modeled as a finite
mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over
an underlying set of topic probabilities. These topic probabilities provide an explicit representation
of a document.
2.1 Notation and terminology
Formally, we define the following terms[1]:
1. A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed
by $\{1, \ldots, V\}$. For implementation purposes, the set of all words is defined as a $V$-dimensional
vector.
2. A document $m$ is a sequence of $N_m$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_{N_m})$, where $w_n$
is the $n$-th word in the sequence.
3. A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.
Variables in bold represent vectors, and the same notation is followed for the rest of the paper.
We wish to train an LDA model on the training corpus that not only assigns high probability to the
documents in the corpus, but also assigns high probability to other similar documents which are
unseen during the training phase.
2.2 Simplification
The foremost goal of mining a text corpus is to find an apt representation for each
document in the corpus. Intuitively, it makes sense to choose a model that takes into account not only
the content but also the ordering of the words, i.e. one that captures the structure of the document.
However, LDA is based on the bag-of-words assumption: the order of words in a document can be neglected[1].
In the language of probability theory, this is an assumption of exchangeability for the words in a
document[2]. LDA also assumes that documents are exchangeable; the specific ordering of the
documents in a corpus can also be neglected. These assumptions reduce the complexity
of the algorithm without compromising much on quality. A global vocabulary list is built from the
distinct words that appear in the corpus. Most of the time, the data is pre-processed to reduce the
dimensionality of the vocabulary list; common methods include stop-word
removal, stemming and merging of synonyms.
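To make the bag-of-words assumption concrete, the following Python sketch builds a global vocabulary and turns a document into a count vector. It is only an illustration, not the pre-processing used for our datasets: the stop-word list, the whitespace tokenizer and the function names `build_vocabulary` and `to_count_vector` are assumptions introduced here.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def build_vocabulary(documents):
    """Map each distinct non-stop word in the corpus to an index 0..V-1."""
    words = sorted({w for doc in documents for w in doc.lower().split()
                    if w not in STOP_WORDS})
    return {w: j for j, w in enumerate(words)}


def to_count_vector(document, vocabulary):
    """Return the length-V vector of word counts; word order is discarded."""
    counts = Counter(w for w in document.lower().split() if w in vocabulary)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[vocabulary[word]] = count
    return vector
```

Two documents that contain the same words in a different order map to the same count vector, which is exactly the exchangeability assumption described above.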
2.3 Latent Dirichlet allocation (LDA)
Latent Dirichlet allocation is a generative probabilistic model of a corpus. The basic idea is that
documents are represented as random mixtures over latent topics, where each topic is characterized
by a distribution over words. Here, words are modeled as observed random variables, while topics
are modeled as latent random variables. Once the generative procedure is established, we define its
joint distribution and then use statistical inference to compute the probability distribution over the
latent variables, conditioned on the observed variables.
2.4 Multinomial distribution
LDA uses the multinomial distribution to model the training set of documents[1, 3]. Once we have
finalized the parameters of this model, we can evaluate the probability that the model assigns to a
test document. The distribution is represented as follows,

$$p(x; \theta) = \left(\frac{n!}{\prod_{j=1}^{V} x_j!}\right)\left(\prod_{j=1}^{V} \theta_j^{x_j}\right) \tag{1}$$

where the data $x$ is a vector of non-negative integers with $n = \sum_{j} x_j$, and the parameter $\theta$ is a
real-valued vector. Both vectors have the same length $V$. In equation (1), the first factor in parentheses is called a
multinomial coefficient. It is the size of the equivalence class of $x$, that is, the number of different
word sequences that yield the same counts. The second factor in parentheses is the probability of
any individual member of the equivalence class of $x$.
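As a small numerical illustration of equation (1), the sketch below evaluates the multinomial probability in log space, which avoids overflow of the factorials for long documents. The function name `multinomial_log_pmf` is hypothetical, and the sketch assumes $\theta_j > 0$ wherever $x_j > 0$.

```python
import math


def multinomial_log_pmf(x, theta):
    """Log of equation (1): log multinomial coefficient + sum_j x_j * log(theta_j)."""
    n = sum(x)
    # log(n! / prod_j x_j!) computed with log-gamma, since lgamma(k + 1) = log(k!)
    log_coefficient = math.lgamma(n + 1) - sum(math.lgamma(xj + 1) for xj in x)
    # log-probability of one particular word sequence with these counts
    log_sequence = sum(xj * math.log(tj) for xj, tj in zip(x, theta) if xj > 0)
    return log_coefficient + log_sequence
```

For example, with counts $x = (2, 1, 0)$ and $\theta = (0.5, 0.3, 0.2)$ the coefficient is $3!/(2!\,1!\,0!) = 3$ and each sequence has probability $0.5^2 \cdot 0.3 = 0.075$, so the probability of the counts is $0.225$.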
2.5 LDA generative process
The generative process for a document collection $D$ under the LDA model is as follows [1, 4]:
1. For $k = 1, \ldots, K$:
   (a) $\phi^{(k)} \sim \mathrm{Dirichlet}(\beta)$
2. For each document $d \in D$:
   (a) $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
   (b) For each word $w_i \in d$:
       i. $z_i \sim \mathrm{Discrete}(\theta_d)$
       ii. $w_i \sim \mathrm{Discrete}(\phi^{(z_i)})$
where $K$ is the number of latent topics in the collection, $\phi^{(k)}$ is a discrete probability distribution
over a fixed vocabulary that represents the $k$-th topic distribution, $\theta_d$ is a document-specific
distribution over the available topics, $z_i$ is the topic index for word $w_i$, and $\alpha$ and $\beta$ are hyper-parameters
for the symmetric Dirichlet distributions from which the discrete distributions are drawn.
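The generative process above can be sketched in Python as follows. This is an illustrative simulation, not part of our MATLAB implementation: the function names are hypothetical, every document is assumed to have the same length `N` for brevity, and Dirichlet draws are obtained by normalizing Gamma variates.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma variates."""
    g = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(g)
    return [v / total for v in g]


def sample_discrete(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(M, N, K, V, alpha, beta, seed=0):
    """Simulate M documents of N words each from the LDA generative process."""
    rng = random.Random(seed)
    # Step 1: one word distribution phi^(k) per topic.
    phi = [sample_dirichlet(beta, V, rng) for _ in range(K)]
    docs, topics = [], []
    for _ in range(M):
        # Step 2(a): topic proportions theta_d for this document.
        theta_d = sample_dirichlet(alpha, K, rng)
        # Step 2(b): a topic z_i for every position, then a word w_i from that topic.
        z = [sample_discrete(theta_d, rng) for _ in range(N)]
        w = [sample_discrete(phi[zi], rng) for zi in z]
        docs.append(w)
        topics.append(z)
    return docs, topics, phi
```

Training reverses this simulation: only the words in `docs` are observed, and `topics` and `phi` have to be inferred.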
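The generative process described above results in the following joint distribution,

$$p(\mathbf{w}, \mathbf{z}, \theta, \phi \mid \alpha, \beta) = \left[\prod_{k=1}^{K} p(\phi^{(k)} \mid \beta)\right] \left[\prod_{d=1}^{M} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \phi^{(z_{d,n})})\right] \tag{2}$$
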
which can be read directly from the plate notation of LDA shown in figure 1.
Figure 1: Graphical model representation of the LDA model used for this project. The boxes are
plates representing replicates. The outer plate represents documents, while the inner plate
represents the repeated choice of topics and words within a document.
The unobserved (latent) variables $\mathbf{z}$, $\theta$ and $\phi$ are of interest to us. Each $\theta_d$ is a low-dimensional
representation of a document in topic space. Each $z_i$ represents the topic that generates the word
instance $w_i$, and $\phi$ represents a $K \times V$ matrix where $\phi_{j,i} = p(w_i \mid z_j)$[5]. One of the most
interesting aspects of LDA is that it can learn, in an unsupervised manner, words that we would
associate with certain topics. This is expressed through the topic distributions $\phi$.
As equation (2) shows, we use Dirichlet distributions for our priors. The Dirichlet distribution is a
probability density function over the set of all multinomial parameter vectors, given by

$$p(\gamma \mid \alpha) = \frac{1}{D(\alpha)} \prod_{s=1}^{V} \gamma_s^{\alpha_s - 1} \tag{3}$$

where $\gamma$ is any parameter vector of length $V$ over which the Dirichlet distribution is defined, such that
$\gamma_s > 0\ \forall s$ and $\sum_{s=1}^{V} \gamma_s = 1$. $\alpha$ is the parameter of the Dirichlet distribution itself, and

$$D(\alpha) = \int_{\gamma} \prod_{s=1}^{V} \gamma_s^{\alpha_s - 1}\, d\gamma = \frac{\prod_{s=1}^{V} \Gamma(\alpha_s)}{\Gamma\left(\sum_{s=1}^{V} \alpha_s\right)} \tag{4}$$

where $\Gamma$ is the gamma function, which satisfies $\Gamma(k) = (k - 1)!$ for positive integer $k$. We use the Dirichlet
distribution because it is the conjugate prior of the multinomial distribution. This simplifies the probability
expression in equation (5), making the training algorithm more efficient and making inference on
new, unseen documents easier.
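The following Python sketch evaluates the normalizer of equation (4) and the density of equation (3) in log space, and shows what conjugacy means in practice: observing multinomial counts simply adds them to the Dirichlet parameters. The function names are hypothetical and the sketch is only meant to illustrate the formulas.

```python
import math


def log_dirichlet_normalizer(alpha):
    """log D(alpha) = sum_s log Gamma(alpha_s) - log Gamma(sum_s alpha_s), equation (4)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def log_dirichlet_pdf(gamma, alpha):
    """Log of equation (3) for a point gamma in the interior of the simplex."""
    log_kernel = sum((a - 1.0) * math.log(g) for g, a in zip(gamma, alpha))
    return log_kernel - log_dirichlet_normalizer(alpha)


def posterior_alpha(alpha, counts):
    """Conjugacy: a Dirichlet(alpha) prior with multinomial counts x gives Dirichlet(alpha + x)."""
    return [a + c for a, c in zip(alpha, counts)]
```

With $\alpha = (1, 1, 1)$ the density is constant and equal to $\Gamma(3) = 2$ everywhere on the simplex, which is a convenient sanity check.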
2.6 Collapsed Gibbs sampling
The training algorithm that we use in this report is known as collapsed Gibbs sampling. Rather than
inferring the $\theta$ and $\phi^{(k)}$ distributions directly, it infers the latent variable $z$ for each word occurrence in
each document, i.e. the topic from which the word is drawn. Because a word can appear at different
places in a document, each appearance of the word has its own $z$ value. Thus the same word can
come from different topics.
Suppose we have a vector $\mathbf{z} = \{z_1, \ldots, z_n\}$ for a document, with the conditional distributions
$p(z_i \mid z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n; \mathbf{w})$. Gibbs sampling repeatedly draws from the conditionals
$p(z_i \mid \mathbf{z}_{-i}; \mathbf{w})$, where $\mathbf{z}_{-i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n\}$, in order to reach the true
distribution of $\mathbf{z}$ given $\mathbf{w}$. The steps involved in Gibbs sampling are as follows:
1. Select an arbitrary initial guess for $\mathbf{z} = \{z_1, \ldots, z_n\}$.
2. Draw $z_1$ according to $p(z_1 \mid \mathbf{z}_{-1}; \mathbf{w})$, and so on for $z_2$, $z_3$, etc., up to $z_n$.
3. Update $\mathbf{z}$ with the newly drawn values and repeat step (2).
Further, if step (2) is repeated a very large number of times, the distribution of the sampled
vector $\mathbf{z}$ converges to its true distribution given $\mathbf{w}$. Skipping the derivation, the final
conditional probability is given by

$$p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{\left(n^{(-i)}_{d,k} + \alpha_k\right)\left(n^{(-i)}_{k,w} + \beta_w\right)}{\sum_{k'} \left(n^{(-i)}_{d,k'} + \alpha_{k'}\right)\ \sum_{w'} \left(n^{(-i)}_{k,w'} + \beta_{w'}\right)} \tag{5}$$

where $n_{d,k}$ is the number of times words in document $d$ are assigned to topic $k$, and $n_{k,w}$ is
the number of times word $w$ is assigned to topic $k$. The superscript $(-i)$ signifies leaving the $i$-th
token out of the calculation. The pseudocode for the algorithm is presented in Appendix A.
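A minimal Python sketch of equation (5) is given below, assuming symmetric scalar hyper-parameters `alpha` and `beta`. The function name and the count-array layout (`n_dk[d][k]`, `n_kw[k][w]`, and per-topic totals `n_k[k]`) are assumptions made for this illustration. The factor $\sum_{k'}(n^{(-i)}_{d,k'} + \alpha_{k'})$ does not depend on $k$, so the sketch drops it and normalizes explicitly instead.

```python
def topic_conditional(d, w, n_dk, n_kw, n_k, alpha, beta):
    """Normalized p(z_i = k | z_-i, w) for one token of word w in document d.

    The counts must already exclude the token that is being resampled.
    n_k[k] is the total number of tokens assigned to topic k (sum over w of n_kw[k][w]).
    """
    K = len(n_k)
    V = len(n_kw[0])
    weights = [
        (n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
        for k in range(K)
    ]
    total = sum(weights)
    return [x / total for x in weights]
```

For instance, with one document whose remaining tokens are assigned as `n_dk = [[2, 0]]`, topic-word counts `n_kw = [[3, 1], [0, 2]]`, `alpha = 0.5` and `beta = 1.0`, a token of word 0 gets unnormalized weights $2.5 \cdot 4/6$ and $0.5 \cdot 1/4$, so topic 0 is strongly preferred.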
3 Design
We now describe the design choices we make for training the LDA model using collapsed Gibbs
sampling.
3.1 Datasets
We use two datasets to experiment with our implementation of LDA. The Classic400 dataset [6]
contains 400 documents over a total vocabulary of 6205 words. It is already pre-processed, and
the documents are stored in the form of a MATLAB sparse matrix. The dataset also contains class
labels for the documents in the corpus; each document belongs to one of 3 distinct classes,
$C \in \{1, 2, 3\}$.
The other dataset is derived from British Broadcasting Corporation (BBC) news articles (footnote 1). This
dataset consists of 2225 documents over a total vocabulary of 9635 words. The documents
correspond to stories from 2004-2005 in five topical areas, namely business, entertainment,
politics, sports and technology. As with the Classic400 dataset, it also provides class labels for the
documents, i.e. $C \in \{1, 2, 3, 4, 5\}$. The dataset is available in Matrix Market format (footnote 2). Since
MATLAB does not read this format natively, some effort is spent converting it into the MATLAB sparse
matrix format used by the Classic400 dataset, so that our framework can handle it without any changes.
Footnote 1: http://mlg.ucd.ie/datasets/bbc.html
3.2 Pre-processing for BBC dataset
MATLAB does not provide functions to load files in Matrix Market format, so we rely on a
third-party solution to load the BBC dataset. We modify the rdcood function written by R.
Pozo (footnote 3), which converts data in Matrix Market format into sparse matrix format. We use the same
technique to load the word list and class labels of the BBC dataset.
3.3 Choice of hyper-parameters $\alpha$, $\beta$
As shown previously in equation (5), the hyper-parameters $\alpha$ and $\beta$ influence the prior belief about the
document-topic and topic-term distributions respectively. More concretely, the scalar $\alpha$ is the number
of pseudowords that belong to each topic $j$ in each document $d$. Intuitively, when $\alpha$ is bigger, it is
easier for different positions in the same document to be assigned to different topics. On the other
hand, $\beta$ is the pseudocount of prior occurrences of each word in each topic. When $\beta$ is bigger, it is
easier for two appearances of a word to be assigned to different topics.
As explained by Blei et al. [1], these hyper-parameters also have a smoothing effect on the
distributions, and hence the model represented by the plate notation in figure 1 is also known as smoothed LDA.
Lowering their values reduces this smoothing effect and results in more decisive topic associations,
so both $\theta$ and $\phi$ become sparser.
Since the two have a joint effect and there is no algorithmic way of identifying which pair
gives the best possible model, we run our experiments for the following combinations
of $\alpha$ and $\beta$: $\alpha \in \{1/K, 2/K, 5/K, 50/K\}$ and $\beta \in \{1, 0.01, 0.0001\}$ for the Classic400 dataset.
For the BBC dataset, we run a smaller subset, $\alpha \in \{2/K, 50/K\}$ and $\beta \in \{1, 0.01\}$,
because of the computational cost of running these experiments. Another simplification
that we apply is the assumption of symmetric Dirichlet distributions, i.e. every component of $\alpha$
takes the same value, and every component of $\beta$ takes the same value.
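The effect of $\alpha$ on sparsity can be illustrated by sampling directly from the symmetric Dirichlet prior. The Python sketch below is an illustration only, with hypothetical function names; it reports the average size of the largest component of a draw, which is close to 1 for sparse draws and close to $1/K$ for smooth ones.

```python
import random


def sample_symmetric_dirichlet(alpha, K, rng):
    """Draw a K-dimensional topic proportion vector from a symmetric Dirichlet(alpha)."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    return [v / total for v in g]


def mean_largest_component(alpha, K, draws=2000, seed=0):
    """Average of the largest component over many draws from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += max(sample_symmetric_dirichlet(alpha, K, rng))
    return total / draws


if __name__ == "__main__":
    K = 3
    for numerator in (1, 2, 5, 50):  # the alpha grid used for Classic400
        alpha = numerator / K
        print(f"alpha = {alpha:.2f}: mean largest component = "
              f"{mean_largest_component(alpha, K):.3f}")
```

Smaller values of $\alpha$ concentrate each draw on a single topic, while $\alpha = 50/K$ yields nearly uniform topic proportions.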
3.4 Choice of number of topics
There is always a dilemma in choosing the number of topics while training a topic model. As
this is unsupervised learning, we don't know how many topics the underlying corpus contains,
and detecting how many topics would be a good fit is more of a black art. For this project,
both our datasets contain class labels for the documents in the corpus. So, to start with, we
assume the number of topics $K$ to be equal to the number of classes $|C|$. We also
experiment with values $K > |C|$ and $K < |C|$. These configurations show whether two or more
topics can be generalized into a single topic, or whether an underlying
topic contains subtopics. In doing so, there is also a risk of either over-generalization or overly
fine-grained categorization of the topics, i.e. adding noise to the topics.
For the Classic400 dataset, documents are already classified into 3 classes, so we run our
experiments for $K \in \{3, 4, 5\}$. Running with 2 topics does not make much sense; intuitively, it
would lead to very high generalization. For the BBC News dataset, documents are already classified
into 5 classes, and hence we run our experiments for $K \in \{4, 5, 6\}$. In section 4.5 we also report
some observations from training the model with $K = 4$ and $K = 6$ topics on the BBC dataset.
3.5 Principal component analysis
It is useful to visualize how the model is being trained. One way is to plot the per-document
topic distribution $\theta$ for different numbers of epochs. In MATLAB, 3D graphs can
be plotted using scatter3 or plot3, so we can plot the per-document topic distribution for $K = 3$.
Since for a given document $d$ the components of $\theta_d$ sum to 1, we can also plot the distribution
for $K = 4$, as it has only 3 degrees of freedom.
Footnote 2: http://math.nist.gov/MatrixMarket/formats.html
Footnote 3: http://math.nist.gov/pozo
But with a higher number of topics, i.e. $K > 4$, we cannot plot them directly. Hence, we use
Principal Component Analysis (PCA)[7] to perform dimensionality reduction on the per-document
topic distribution, reducing it to its first three principal components. The idea behind using PCA is that
the dataset tends to be distributed along a low-dimensional subspace. Using PCA, we can find patterns in the
data and reduce the number of dimensions without much loss of information. We use an online
tutorial to implement our own version of PCA (footnote 4). The steps involved in PCA are straightforward:
translate the dataset so that its center is at the origin, calculate the covariance matrix, find
the principal components by rotation, and plot the reduced-dimensional dataset.
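The PCA steps listed above can be sketched in pure Python as follows. This is not our MATLAB implementation: for self-containment the sketch finds the principal components by power iteration with deflation rather than by an eigendecomposition routine, and the function name `pca_project` is hypothetical.

```python
import math
import random


def pca_project(data, n_components, iterations=200, seed=0):
    """Project the rows of `data` onto their leading principal components."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Step 1: translate the data so that its center is at the origin.
    mean = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - mean[j] for j in range(dim)] for row in data]
    # Step 2: sample covariance matrix (dim x dim).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    components = []
    for _ in range(n_components):
        # Step 3: power iteration from a random unit vector converges to the
        # eigenvector with the largest remaining eigenvalue.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(dim) for b in range(dim))
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(dim)]
               for a in range(dim)]
    # Step 4: coordinates of every centered point along each component.
    projected = [[sum(r[j] * c[j] for j in range(dim)) for c in components]
                 for r in centered]
    return projected, components
```

Applied to the $M \times K$ matrix of per-document topic distributions with `n_components=3`, the first return value gives coordinates that can be shown in a 3D scatter plot. The sign of each component is arbitrary.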
3.6 Analysing topic models for goodness-of-fit
Topic models like LDA broadly perform soft estimations of the documents over latent topics and
observed entities, i.e. words, documents etc. While learning a model, it is often required to evaluate
the quality of the model and its correctness in discovering the topics in the given corpus. This quality
measure for the LDA model can be defined as the goodness-of-fit of the model. We now describe 3
techniques that can be used to compute goodness-of-fit.
3.6.1 Clustering accuracy
The LDA model already provides a soft clustering of the documents as well as the terms in a corpus
by associating them with topics. It is useful to measure the quality of such a clustering. One way
to evaluate goodness-of-fit for an LDA model is subjective inspection of the topic
assignments of different documents in the corpus. A more concrete method is to use a
classifier that predicts the class of a document based on the $\theta$ vectors. If we know a priori what class
a document belongs to, we can compare it with the class predicted by the classifier and see whether they
agree. Essentially, we treat topic $j$ as a class comprising the documents with the
highest $\theta_j$ and compare it with the true class labels for the documents.
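A minimal Python sketch of this clustering-accuracy measure is shown below. Because topic indices are arbitrary, the sketch tries every one-to-one mapping from topics to classes and reports the best accuracy; this matching step, the 0-based labels, the restriction to $K = |C|$ and the function names are assumptions made for the illustration.

```python
from itertools import permutations


def hard_assignments(theta):
    """Assign each document to the topic with the highest theta_j."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def clustering_accuracy(theta, labels, n_classes):
    """Best accuracy over all one-to-one topic-to-class mappings (assumes K == n_classes)."""
    predicted = hard_assignments(theta)
    best = 0.0
    for mapping in permutations(range(n_classes)):
        correct = sum(1 for p, y in zip(predicted, labels) if mapping[p] == y)
        best = max(best, correct / len(labels))
    return best
```

Enumerating permutations is only practical for the small numbers of classes used in this report (3 and 5).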
3.6.2 Variation of information distance
An alternative measure for goodness-of-fit can be based on the soft clustering produced by LDA. Here, the
goal is to treat the given set of class labels as a deterministic topic distribution for each document, and
compute a distance between this deterministic distribution and the LDA topic distributions.
One such measure, described by Heinrich et al. [8, 9], is the variation of information distance, also
known as VI distance. Assume we have documents $d_1, \ldots, d_n$. A soft clustering $C$ assigns to each
document $d_i$ a distribution $p(c = r \mid d_i)$ for $r = 1, \ldots, k$. If we have a second clustering $C'$, we
have a new distribution $p(c' = r \mid d_i)$ for $r = 1, \ldots, k'$. Note that the two clusterings may have different
numbers of clusters ($k$ and $k'$ can be different).
If the clusterings are very similar, there will be pairs of clusters that often occur together. On the
other hand, if the clusterings are independent, the pair of clusters $c = r$ and $c' = s$ will appear
with probability $p(c = r)\,p(c' = s)$. Therefore, we can determine the Kullback-Leibler divergence
between this independent distribution and the actual joint distribution $p(c = r, c' = s)$. This is just
the mutual information between the random variables induced by the clusterings[10, 8]. The mutual
information is given by

$$I(C, C') = \sum_{r=1}^{k} \sum_{s=1}^{k'} p(c = r, c' = s)\, \log_2 \frac{p(c = r, c' = s)}{p(c = r)\, p(c' = s)} \tag{6}$$

where the required probabilities are computed by averaging over the per-document cluster
distributions,

$$p(c = r) = \frac{1}{M} \sum_{i=1}^{M} p(c = r \mid d_i), \qquad p(c = r, c' = s) = \frac{1}{M} \sum_{i=1}^{M} p(c = r, c' = s \mid d_i) \tag{7}$$

Footnote 4: http://nghiaho.com/?page_id=1030
The mutual information between two random variables becomes 0 for independent variables.
Further, $I(C, C') \le \min\{H(C), H(C')\}$, where $H(C) = -\sum_{r=1}^{k} p(c = r) \log_2 p(c = r)$ is the
entropy of $C$. This inequality becomes an equality, $I(C, C') = H(C) = H(C')$, if and only if the
two clusterings are equal. Meilă [9] uses these properties to define the variation of information
distance measure,

$$D_{VI}(C, C') = H(C) + H(C') - 2\,I(C, C') \tag{8}$$

and shows that $D_{VI}(C, C')$ is a true metric: it is always non-negative, it observes the triangle
inequality $D_{VI}(C, C') + D_{VI}(C', E) \ge D_{VI}(C, E)$, and it becomes zero if and only if $C = C'$.
Further, the VI distance depends only on the proportions of cluster associations with data
items; it is invariant to the absolute number of data items.
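Equations (6) to (8) can be sketched in Python as follows. Each clustering is passed as an $M \times k$ list of rows $p(c = r \mid d_i)$; true class labels are encoded as one-hot rows. The sketch assumes that the two cluster variables are conditionally independent given the document, so that $p(c = r, c' = s \mid d_i) = p(c = r \mid d_i)\, p(c' = s \mid d_i)$; this factorization and the function names are assumptions of the illustration.

```python
import math


def marginal(P):
    """p(c = r): average of the per-document distributions, equation (7)."""
    M = len(P)
    return [sum(row[r] for row in P) / M for r in range(len(P[0]))]


def joint(P, Q):
    """p(c = r, c' = s), assuming independence of c and c' given the document."""
    M = len(P)
    return [[sum(p[r] * q[s] for p, q in zip(P, Q)) / M
             for s in range(len(Q[0]))] for r in range(len(P[0]))]


def entropy(p):
    """H(C) in bits; zero-probability clusters contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def mutual_information(P, Q):
    """I(C, C') in bits, equation (6)."""
    pr, qs, prs = marginal(P), marginal(Q), joint(P, Q)
    return sum(prs[r][s] * math.log2(prs[r][s] / (pr[r] * qs[s]))
               for r in range(len(pr)) for s in range(len(qs)) if prs[r][s] > 0)


def vi_distance(P, Q):
    """D_VI(C, C') = H(C) + H(C') - 2 I(C, C'), equation (8)."""
    return entropy(marginal(P)) + entropy(marginal(Q)) - 2 * mutual_information(P, Q)
```

For two identical hard clusterings the distance is 0, and for two independent, balanced two-way clusterings it is $H(C) + H(C') = 2$ bits.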
3.6.3 Perplexity
The methods described above can be adopted when the class labels for the documents are known
a priori. In the absence of class labels, a common criterion for measuring the goodness-of-fit of
an LDA model is the likelihood of held-out data under the trained model. Perplexity
measures this likelihood. It is defined as the reciprocal of the geometric mean of the word
likelihoods in the test corpus given the model $\mathcal{M}$,

$$P(\tilde{W} \mid \mathcal{M}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\tilde{\mathbf{w}}_d \mid \mathcal{M})}{\sum_{d=1}^{M} N_d}\right) \tag{9}$$

where $\tilde{\mathbf{w}}_d$ denotes the words of held-out document $d$ and $N_d$ is its length.
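Given per-document log-likelihoods, equation (9) is a one-line computation. The Python sketch below takes those log-likelihoods as input and does not show how they are obtained from the trained model; the function name is hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Equation (9): exp of the negative average per-word log-likelihood."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))
```

As a sanity check, a model that assigns every word a probability of $1/V$ has perplexity exactly $V$; lower perplexity indicates a better fit to the held-out data.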
We do not go into the details of how to compute this likelihood. We choose the variation of
information (VI) distance to measure goodness-of-fit, as we already have the true labels for the datasets.
VI distance provides richer information about the topic model than clustering accuracy. We do not
choose perplexity because our initial dataset consists of only 400 documents, and creating a held-out
validation set would leave very few documents for training LDA.
3.7 Convergence for Gibbs sampling and overfitting measure
Gibbs sampling, being an MCMC method, faces the difficulty of determining when the Markov chain
has reached its stationary distribution. In practice, the convergence of some measure of model
quality is used instead. Heinrich[10] proposes the use of perplexity and test-set likelihood for
convergence monitoring of an LDA model.
In many practical cases, in addition to using perplexity and likelihood of held-out data for this
purpose, it is possible to perform intermediate convergence monitoring steps using the likelihood
or perplexity of the training data. Because no additional sampling of held-out data topics has to
be performed, this measurement is more efficient than using held-out data. As long as no
overfitting occurs, the difference between both types of likelihood remains low, a fact that can be
used to monitor overfitting as well.
In our experiments, we use the VI distance to evaluate goodness-of-fit and a fixed number of epochs (1000)
instead of a stopping condition for Gibbs sampling while training the LDA model.
4 Results
We now present all the major results of our experiments. All the experiments are performed in
MATLAB.
4.1 Estimation of hyper-parameters $\alpha$ and $\beta$
Figures 2 and 3 show plots of the variation of information (VI) distance versus the number of epochs
for the Classic400 and BBC News datasets, respectively. It is evident from the figures that the values
$\alpha = 2/K$ and $\beta = 1$ give the least VI distance for both datasets. Here, $K$ is the number of topics;
hence for $K = \{3, 4, 5, 6\}$, we have $\alpha = \{0.67, 0.5, 0.4, 0.33\}$.
We choose the values $\alpha = 2/K$ and $\beta = 1$ to report our results for the remainder of the
experiments, since we do not want to overload the reader with a plethora of graphs of per-topic
distributions.
[Figure 2 plot not reproduced. x-axis: Iterations (0 to 100); y-axis: Variation of Information Distance (1 to 4). Curves, labeled as [$\alpha$, $\beta$, No. of topics]: [16.67, 0.01, 3], [0.4, 0.01, 5], [12.5, 1, 4], [12.5, 0.01, 4], [0.67, 1, 3], [0.4, 1, 5], [0.5, 1, 4], [0.67, 0.01, 3], [0.5, 0.01, 4].]
Figure 2: VI Distance for Classic400 dataset [$\alpha$, $\beta$, No. of topics]
[Figure 3 plot not reproduced. x-axis: Iterations (0 to 50); y-axis: Variation of Information Distance (1.5 to 5). Curves, labeled as [$\alpha$, $\beta$, No. of topics]: [12.5, 0.01, 4], [0.4, 0.01, 5], [0.5, 0.01, 4], [0.33, 1, 6], [0.5, 1, 4], [0.5, 0.01, 4], [0.33, 0.01, 6], [10, 0.01, 5], [8.33, 0.01, 6], [0.4, 1, 5].]
Figure 3: VI Distance for BBC dataset [$\alpha$, $\beta$, No. of topics]
We show fewer epochs in the plots because the VI distances for the different ($\alpha$, $\beta$) pairs diverge
from one another as the number of epochs grows. Hence, it suffices to infer the relevant values of $\alpha$ and $\beta$
from 100 epochs for the Classic400 dataset and from 50 epochs for the BBC dataset.
4.2 Top 10 most probable words for topics
4.2.1 Classic400 dataset
Table 1 lists the 10 most probable words for the Classic400 dataset for K = 3. We have labeled the
topics discovered with our own interpretation of the natural topics contained in the corpus, based
on the words associated with them.
Topic 1            Topic 2            Topic 3
boundary           patients           system
layer              ventricular        research
wing               left               fatty
mach               cases              scientific
supersonic         nickel             retrieval
ratio              aortic             acids
wings              septal             science
velocity           visual             language
shock              defect             methods
effects            pulmonary          glucose
Aerospace-Physics  Medical-Medicine   Research-Science
Table 1: 10 most probable words per topic for Classic400 dataset
4.2.2 BBC dataset
Table 2 lists the 10 most probable words for the topics discovered with K = 5 on the BBC dataset.
We notice that some of the words have been chopped off at their ends. This is because the BBC
dataset has been pre-processed with stemming.
Topic 1   Topic 2   Topic 3     Topic 4   Topic 5
game      govern    peopl       year      film
plai      peopl     game        compani   best
win       labour    technolog   market    award
player    parti     mobil       firm      year
england   elect     phone       bank      music
against   minist    servic      sale      star
first     blair     on          price     show
year      plan      user        share     on
world     tori      comput      growth    includ
Sports    Politics  Technology  Business  Entertainment
Table 2: 10 most probable words per topic for BBC dataset
4.3 Sparsity in per-document topic distribution
4.3.1 Classic400 dataset
For the Classic400 dataset, we observe that the sparsity of the per-document topic
distribution increases with the number of epochs. This is depicted in figure 4. As seen, the per-document topic
distribution is sparser after 1000 epochs of Gibbs sampling than after 10 and 100 epochs. This
suggests that the LDA model converges towards a stable topic assignment as we increase
the number of epochs of Gibbs sampling.
[Figure 4 plots not reproduced: three 3D scatter plots with axes PCA(1), PCA(2) and PCA(3). Panels: (a) 10 epochs, (b) 100 epochs, (c) 1000 epochs.]
Figure 4: Per-document topic distributions for different epochs - Classic400 dataset
4.3.2 BBC dataset
In the case of the BBC dataset, we observe a slightly different trend for the per-document topic distribution
over the number of epochs. Figure 5 shows plots of the per-document topic distribution for different
numbers of training epochs. We notice that although the sparsity of the
distribution increases with the number of epochs, at the end of 1000 epochs the distribution is not as
sparse as the one obtained from the Classic400 dataset. We thus infer that sparsity of the per-document
topic distribution is a characteristic of the dataset on which the LDA model is trained. In the case of
the Classic400 dataset, documents tend to contain words related to a single topic, while in the BBC
dataset, documents tend to contain words that span a number of topics.
[Figure 5 plots not reproduced: three 3D scatter plots with axes PCA(1), PCA(2) and PCA(3). Panels: (a) 10 epochs, (b) 100 epochs, (c) 1000 epochs.]
Figure 5: Per-document topic distributions for different epochs - BBC dataset
4.4 Comparison of different number of topics
As described in section 3.4, we choose the number of topics for the Classic400 dataset as $K \in
\{3, 4, 5\}$ and for the BBC dataset as $K \in \{4, 5, 6\}$. We run our LDA models for each of these
$K$ values. We observe that for the Classic400 dataset, $K = 3$ gives the most natural per-word
topic distributions, where distinct topic definitions can be derived from the 10 most probable words
belonging to a topic. However, for $K = 4$ and $K = 5$, we see some overlap in the
10 most probable words belonging to different topics. Thus, for $K = 4$ or $K = 5$, the
model generates some noise in the per-word topic distribution. Hence, we conclude that the given
corpus consists of 3 topics only. One reason for the model not giving distinct, interpretable topics for
$K = 4, 5$ could be the small size of the corpus.
However, in the case of the BBC dataset, we observe some interesting results when varying the
number of topics. These are described in the next section.
4.5 Number of topics in BBC dataset
By subjective evaluation, we can observe that the BBC dataset consists of five topics. This is also
hinted at by the number of class labels provided with the dataset. Thus, fixing the number of
topics at $K = 5$ gives the per-word topic distribution shown in table 4. We also train the LDA
model for $K = 4$ and $K = 6$ topics. The 10 most probable words for each are listed in tables 3
and 5.
Looking at tables 3, 4 and 5, it is interesting to note that for the model trained with $K = 4$, the topics
Entertainment and Sports have been combined into a single topic. Similarly, for the model
trained with $K = 6$, the topic of Entertainment from the LDA model with $K = 5$ has been further
split into the topics of Movies and Music. We purposefully restate the 10 most probable
words for $K = 5$ below to improve readability.
Topic 1   Topic 2     Topic 3   Topic 4
year      peopl       govern    film
compani   game        labour    year
market    technolog   peopl     plai
firm      mobil       parti     best
bank      music       elect     game
sale      phone       minist    win
share     on          blair     first
price     servic      plan      on
growth    get         tori      award
Business  Technology  Politics  Entertainment-Sports
Table 3: 10 most probable words for BBC dataset with K=4
5 Conclusions
Given a corpus of documents, trying to identify different themes in the corpus is a very interesting
problem. For this report we look at a very simple model for identifying the topics inherent in a
given corpus. We use latent Dirichlet allocation, which is a flexible generative probabilistic model
for collections of discrete data. LDA is based on a simple exchangeability assumption for the words
and topics. The results of training an LDA model on a given document corpus depend on the
hyper-parameters $\alpha$ and $\beta$. We observe that with greater values of $\alpha$, words in a given document
tend to be assigned to different topics. For greater values of $\beta$, different appearances of the same word
can be assigned to different topics. We also notice that sparsity of the per-document distribution is a
characteristic of the dataset on which the LDA model is being trained. Determining the number of topics for a
given corpus can be a tricky issue. We observe that setting $K$ to a number greater than the actual
number of topics leads to fine-grained topics, while setting $K$ to less than the actual number leads to
generalization of some of the topics.
Topic 1   Topic 2   Topic 3     Topic 4   Topic 5
game      govern    peopl       year      film
plai      peopl     game        compani   best
win       labour    technolog   market    award
player    parti     mobil       firm      year
england   elect     phone       bank      music
against   minist    servic      sale      star
first     blair     on          price     show
year      plan      user        share     on
world     tori      comput      growth    includ
Sports    Politics  Technology  Business  Entertainment
Table 4: 10 most probable words for BBC dataset with K=5
Topic 1   Topic 2   Topic 3   Topic 4   Topic 5     Topic 6
govern    film      game      year      peopl       game
labour    award     plai      compani   technolog   music
peopl     best      win       market    mobil       year
parti     star      player    firm      phone       plai
elect     year      england   bank      servic      on
minist    show      against   sale      user        song
blair     actor     first     price     comput      band
plan      director  year      share     on          record
tori      includ    world     growth    firm        album
sai       nomin     time      economi   digit       top
Politics  Movies    Sports    Business  Technology  Music
Table 5: 10 most probable words for BBC dataset with K=6
A Appendix
Algorithm 1 LDA Generative process with collapsed Gibbs Sampling
Input: words w ∈ documents d ∈ [1, D]
randomly initialize z and increment counters
for iteration i ∈ [1, epoch] do
    for document d ∈ [1, D] do
        for word ∈ [1, N_d] do
            topic ← z[word]
            decrement counters according to document d, topic and word
            for k ∈ [1, K] do
                calculate p(z = k | ·) using Gibbs equation (5)
            end for
            newTopic ← sample from p(z | ·)
            z[word] ← newTopic
            increment counters according to document d, newTopic and word
        end for
    end for
end for
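For readers who prefer executable code, Algorithm 1 can be sketched in Python as follows. This is not the MATLAB implementation used for the experiments in this report: the function name, the 0-based word and topic indices, the list-of-lists corpus format and the final point estimates of $\theta$ and $\phi$ from the counts are assumptions made for this illustration, and symmetric scalar `alpha` and `beta` are assumed.

```python
import random


def collapsed_gibbs_lda(docs, K, V, alpha, beta, epochs, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, V)."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]       # tokens in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]   # times word w is assigned to topic k
    n_k = [0] * K                        # total tokens assigned to topic k

    # Randomly initialize z and increment the counters.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(epochs):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized equation (5); rng.choices normalizes the weights.
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights=weights, k=1)[0]
                # Record the new topic and add the token back to the counts.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of the document-topic and topic-word distributions.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
           for k in range(K)]
    return z, theta, phi
```

A call such as `collapsed_gibbs_lda(docs, K=3, V=6205, alpha=2/3, beta=1.0, epochs=1000)` would mirror the Classic400 configuration reported in section 4, given `docs` in the assumed format.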
References
[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993-1022, 2003.
[2] David Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour
XIII - 1983, pages 1-198, 1985.
[3] Charles Elkan. Text mining and topic models. University of California, San Diego, February
2014.
[4] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max
Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the
14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages
569-577. ACM, 2008.
[5] William M. Darling. A theoretical and practical implementation tutorial on topic modeling and
Gibbs sampling. 2011.
[6] Charles Elkan. Clustering documents with an exponential-family approximation of the Dirichlet
compound multinomial distribution. In Proceedings of the 23rd International Conference
on Machine Learning, pages 289-296. ACM, 2006.
[7] Lindsay I. Smith. A tutorial on principal components analysis. Cornell University, USA, 51:52,
2002.
[8] Gregor Heinrich, Jörg Kindermann, Codrina Lauth, Gerhard Paaß, and Javier Sanchez-Monzon.
Investigating word correlation at different scopes: a latent concept approach. In
Workshop Lexical Ontology Learning at Int. Conf. Mach. Learning, 2005.
[9] Marina Meilă. Comparing clusterings: an information based distance. Journal of Multivariate
Analysis, 98(5):873-895, 2007.
[10] Gregor Heinrich. Parameter estimation for text analysis. Technical report, 2004.