
Unit IV - TEXT CLASSIFICATION AND TOPIC MODELING


LU 35 - Topic Modelling Algorithms:
Latent Dirichlet Allocation
• The word ‘Latent’ indicates that the model discovers the ‘yet-to-
be-found’ or hidden topics from the documents.

• ‘Dirichlet’ indicates LDA’s assumption that the distribution of topics in a document and the distribution of words in topics are both Dirichlet distributions.

• ‘Allocation’ indicates the distribution of topics in the document.

• LDA assumes that documents are composed of words that help determine the topics, and it maps documents to a list of topics by assigning each word in the document to different topics.
Latent Dirichlet Allocation
• LSA is prone to overfitting on the training data: after the model has been trained, adding new documents makes it progressively worse until it is retrained. In such situations LDA should be used rather than LSA (LSI).
• LDA is a generative statistical model that allows a set of items to
be sorted into unobserved groups by similarity.
• LDA is applied to items that are each made up of various parts, and the similarity sorting is done using the statistical patterns of those parts.
• LDA can be performed on any collection of things that are made
up of parts, such as employees and their skills, sports teams and
their individual members, documents and their words.
Latent Dirichlet Allocation – cntd.,
• In topic modeling the groups are unobserved. However, we can observe the words in documents, and we can reverse-engineer the statistical patterns of those words to figure out the groups.
• Each document is assumed to have a small number of topics
associated with it.
• For example, in a cat-related document there is a high probability of the words meow, purr and kitty, while in dog-related documents the likely words are bone, bark and wag. This probability is called a prior probability: an assumption we can safely make about something before actually verifying it (a toy sketch of such priors follows below).
• LDA is a way of measuring the statistical patterns of words in
documents such that we can deduce the topics.
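To make the idea of prior word probabilities concrete, here is a toy Python sketch. The topic names, word lists and probabilities are invented for illustration (they are not learned by LDA); the sketch only shows how word distributions make one topic far more likely than another for a given document:

# Toy prior word distributions for two hypothetical topics.
cat_topic = {'meow': 0.4, 'purr': 0.3, 'kitty': 0.3}
dog_topic = {'bone': 0.4, 'bark': 0.3, 'wag': 0.3}

def doc_probability(words, topic, smoothing=1e-6):
    # Probability of the document's words under a topic,
    # assuming words are drawn independently (bag of words).
    p = 1.0
    for w in words:
        p *= topic.get(w, smoothing)
    return p

doc = ['meow', 'purr', 'meow']
print(doc_probability(doc, cat_topic))  # high
print(doc_probability(doc, dog_topic))  # near zero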
Latent Dirichlet Allocation
• LDA produces a topic-word matrix in which the value in each cell indicates the probability of a word wj belonging to topic tk, where ‘j’ and ‘k’ are the word and topic indices respectively.
• It is important to note that LDA ignores the order of occurrence of
words and the syntactic information. It treats documents just as a
collection of words or a bag of words.
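Both the bag-of-words view and the topic-word matrix can be seen in a minimal sketch using the gensim library. The toy corpus and parameter values here are assumptions for illustration, not a reference implementation:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document is treated as a bag of words (order ignored).
texts = [
    ['meow', 'purr', 'kitty', 'purr'],
    ['bone', 'bark', 'wag', 'bark'],
    ['meow', 'kitty', 'bone', 'wag'],
]

dictionary = Dictionary(texts)                    # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words counts

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# Topic-word matrix: entry [k][j] is the probability of word j in topic k.
print(lda.get_topics().shape)   # (num_topics, vocabulary size)

# Topic mix of the first document.
print(lda.get_document_topics(corpus[0]))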
LDA Algorithm
• LDA assumes that each document is generated by a statistical
generative process. That is, each document is a mix of topics, and each
topic is a mix of words. For Example:
LDA Algorithm – cntd.,
• The figure shows a document with ten different words.
• This document could be assumed to be a mix of three topics: tourism, facilities and feedback.
• Each of these topics, in turn, is a mix of different collections of words.
• In the process of generating this document, first a topic is selected from the document-topic distribution, and then a word is selected from the multinomial topic-word distribution of that topic.
• While identifying the topics in the documents, LDA does the opposite
of the generation process.
• LDA begins with random assignment of topics to each word and
iteratively improves the assignment of topics to words through Gibbs
sampling.
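The generative process described above can be sketched in a few lines of numpy. The vocabulary, topic count and Dirichlet parameter values below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
vocab = ['meow', 'purr', 'kitty', 'bone', 'bark', 'wag']
K, alpha, beta = 2, 0.5, 0.1   # number of topics and Dirichlet densities

# Each topic is a distribution over words (topic-word distributions).
phi = rng.dirichlet([beta] * len(vocab), size=K)

# Each document is a distribution over topics (document-topic distribution).
theta = rng.dirichlet([alpha] * K)

# Generate a ten-word document: pick a topic, then a word from that topic.
doc = []
for _ in range(10):
    topic = rng.choice(K, p=theta)
    word = rng.choice(vocab, p=phi[topic])
    doc.append(str(word))
print(doc)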
Steps in LDA
Example in LDA
Hyperparameters in LDA
LDA has three hyperparameters:
• the document-topic density factor ‘α’, shown in step-7,
• the topic-word density factor ‘β’, shown in step-8, and
• the number of topics ‘K’ to be considered.

• The ‘α’ hyperparameter controls the number of topics expected in the document. A low value of ‘α’ implies that fewer topics are expected in the mix, and a higher value implies that one would expect the documents to have a higher number of topics in the mix.

• The ‘β’ hyperparameter controls the distribution of words per topic. At lower values of ‘β’, the topics will likely have fewer words, and at higher values the topics will likely have more words.
• Ideally, we would like to see a few topics in each document and few
words in each of the topics. So, α and β are typically set below one.
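In gensim, for instance, these densities can be set explicitly when training; note that gensim names the topic-word density ‘eta’ rather than ‘β’. The values below are illustrative assumptions, reusing the dictionary and corpus from the earlier sketch:

from gensim.models import LdaModel

# Sparse priors (both below 1): few topics per document,
# few dominant words per topic.
lda = LdaModel(corpus, id2word=dictionary, num_topics=10,
               alpha=0.1,   # document-topic density (α)
               eta=0.01,    # topic-word density (β)
               random_state=0)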
Hyperparameters in LDA
• The ‘K’ hyperparameter specifies the number of topics expected in
the corpus of documents.
• Choosing a value for K is generally based on domain knowledge.
• An alternate way is to train different LDA models with different values of K and compute the ‘Coherence Score’.
• Choose the value of K for which the coherence score is highest.

Coherence Score / Topic Coherence Score
• The topic coherence score is a measure of how good a topic model is at generating coherent topics.
• A coherent topic should be semantically interpretable and not an artifact of statistical inference. A higher coherence score indicates a better topic model.
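A minimal sketch of choosing K by coherence uses gensim’s CoherenceModel. The candidate K values and the ‘c_v’ coherence measure are illustrative choices, and ‘texts’, ‘corpus’ and ‘dictionary’ are reused from the earlier sketch:

from gensim.models import LdaModel, CoherenceModel

scores = {}
for k in [2, 4, 6, 8, 10]:
    lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=10)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # K with the highest coherence
print(scores, best_k)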
Dirichlet Distribution
• The Dirichlet distribution is a multivariate distribution, meaning it
describes the distribution of multiple variables.

• It is used for topic modeling since we are usually concerned with multiple topics.

• It helps us specify our assumptions about the probability distributions before doing the analysis.
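The effect of the Dirichlet concentration parameter can be seen directly with numpy; the parameter values below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Each sample is a distribution over 5 topics (it sums to 1).
# Small alpha -> sparse mixes dominated by one topic;
# large alpha -> near-uniform mixes.
print(rng.dirichlet([0.1] * 5))   # e.g. mostly one topic
print(rng.dirichlet([10.0] * 5))  # e.g. close to uniform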
