
Unit IV - TEXT CLASSIFICATION AND TOPIC MODELING


LU 35 - Topic Modelling Algorithms:
Latent Dirichlet Allocation
• The word ‘Latent’ indicates that the model discovers the ‘yet-to-
be-found’ or hidden topics from the documents.

• ‘Dirichlet’ indicates LDA’s assumption that the distribution of topics in a document and the distribution of words in topics are both Dirichlet distributions.

• ‘Allocation’ indicates the distribution of topics in the document.

• LDA assumes that documents are composed of words that help determine the topics, and it maps documents to a list of topics by assigning each word in the document to different topics.
Latent Dirichlet Allocation
• LSA is prone to overfitting on the training data: after the model has been trained, adding new documents makes it progressively worse until it is retrained. In such situations LDA should be used rather than LSA (LSI).
• LDA is a generative statistical model that allows a set of items to
be sorted into unobserved groups by similarity.
• LDA is applied to items that are each made up of various parts, and the similarity sorting is done using the statistical patterns of those parts.
• LDA can be performed on any collection of things that are made
up of parts, such as employees and their skills, sports teams and
their individual members, documents and their words.
Latent Dirichlet Allocation – cntd.,
• In topic modeling the groups are unobserved. However, we can observe the words in documents, and we can reverse-engineer the statistical patterns of those words to figure out the groups.
• Each document is assumed to have a small number of topics
associated with it.
• For example, in a cat-related document there is a high probability of the words meow, purr and kitty, while in dog-related documents the likely words are bone, bark and wag. This probability is called a prior probability: an assumption we can safely make about something before actually verifying it (a toy sketch of such priors follows below).
• LDA is a way of measuring the statistical patterns of words in
documents such that we can deduce the topics.
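To make the idea of prior word probabilities concrete, here is a toy Python sketch. The topic names, word lists and probabilities are invented for illustration (they are not learned by LDA); the sketch only shows how word distributions make one topic far more likely than another for a given document:

# Toy prior word distributions for two hypothetical topics.
cat_topic = {'meow': 0.4, 'purr': 0.3, 'kitty': 0.3}
dog_topic = {'bone': 0.4, 'bark': 0.3, 'wag': 0.3}

def doc_probability(words, topic, smoothing=1e-6):
    # Probability of the document's words under a topic,
    # assuming words are drawn independently (bag of words).
    p = 1.0
    for w in words:
        p *= topic.get(w, smoothing)
    return p

doc = ['meow', 'purr', 'meow']
print(doc_probability(doc, cat_topic))  # high
print(doc_probability(doc, dog_topic))  # near zero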
Latent Dirichlet Allocation
• LDA produces a topic-word matrix in which the value in each cell indicates the probability of a word wj belonging to topic tk, where ‘j’ and ‘k’ are the word and topic indices respectively.
• It is important to note that LDA ignores the order of occurrence of
words and the syntactic information. It treats documents just as a
collection of words or a bag of words.
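Both the bag-of-words view and the topic-word matrix can be seen in a minimal sketch using the gensim library. The toy corpus and parameter values here are assumptions for illustration, not a reference implementation:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document is treated as a bag of words (order ignored).
texts = [
    ['meow', 'purr', 'kitty', 'purr'],
    ['bone', 'bark', 'wag', 'bark'],
    ['meow', 'kitty', 'bone', 'wag'],
]

dictionary = Dictionary(texts)                    # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words counts

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# Topic-word matrix: entry [k][j] is the probability of word j in topic k.
print(lda.get_topics().shape)   # (num_topics, vocabulary size)

# Topic mix of the first document.
print(lda.get_document_topics(corpus[0]))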
LDA Algorithm
• LDA assumes that each document is generated by a statistical
generative process. That is, each document is a mix of topics, and each
topic is a mix of words. For Example:
LDA Algorithm – cntd.,
• The figure shows a document with ten different words.
• This document could be assumed to be a mix of three topics: tourism, facilities and feedback.
• Each of these topics, in turn, is a mix of different collections of words.
• In the process of generating this document, first a topic is selected from the document-topic distribution, and then a word is selected from the multinomial topic-word distribution of that topic.
• While identifying the topics in the documents, LDA does the opposite
of the generation process.
• LDA begins with random assignment of topics to each word and
iteratively improves the assignment of topics to words through Gibbs
sampling.
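The generative process described above can be sketched in a few lines of numpy. The vocabulary, topic count and Dirichlet parameter values below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
vocab = ['meow', 'purr', 'kitty', 'bone', 'bark', 'wag']
K, alpha, beta = 2, 0.5, 0.1   # number of topics and Dirichlet densities

# Each topic is a distribution over words (topic-word distributions).
phi = rng.dirichlet([beta] * len(vocab), size=K)

# Each document is a distribution over topics (document-topic distribution).
theta = rng.dirichlet([alpha] * K)

# Generate a ten-word document: pick a topic, then a word from that topic.
doc = []
for _ in range(10):
    topic = rng.choice(K, p=theta)
    word = rng.choice(vocab, p=phi[topic])
    doc.append(str(word))
print(doc)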
Steps in LDA
Example in LDA
Hyperparameters in LDA
LDA has three hyperparameters:
• the document-topic density factor ‘α’, shown in step-7,
• the topic-word density factor ‘β’, shown in step-8, and
• the number of topics ‘K’ to be considered.

• The ‘α’ hyperparameter controls the number of topics expected in the document. A low value of ‘α’ implies that fewer topics are expected in the mix, and a higher value implies that one would expect the documents to have a higher number of topics in the mix.

• The ‘β’ hyperparameter controls the distribution of words per topic. At lower values of ‘β’, the topics will likely have fewer words, and at higher values the topics will likely have more words.
• Ideally, we would like to see a few topics in each document and few
words in each of the topics. So, α and β are typically set below one.
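In gensim, for instance, these densities can be set explicitly when training; note that gensim names the topic-word density ‘eta’ rather than ‘β’. The values below are illustrative assumptions, reusing the dictionary and corpus from the earlier sketch:

from gensim.models import LdaModel

# Sparse priors (both below 1): few topics per document,
# few dominant words per topic.
lda = LdaModel(corpus, id2word=dictionary, num_topics=10,
               alpha=0.1,   # document-topic density (α)
               eta=0.01,    # topic-word density (β)
               random_state=0)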
Hyperparameters in LDA
• The ‘K’ hyperparameter specifies the number of topics expected in
the corpus of documents.
• Choosing a value for K is generally based on domain knowledge.
• An alternate way is to train different LDA models with different values of K and compute the ‘Coherence Score’.
• Choose the value of K for which the coherence score is highest.

Coherence Score / Topic Coherence Score
• The topic coherence score is a measure of how good a topic model is at generating coherent topics.
• A coherent topic should be semantically interpretable and not an artifact of statistical inference. A higher coherence score indicates a better topic model.
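A minimal sketch of choosing K by coherence uses gensim’s CoherenceModel. The candidate K values and the ‘c_v’ coherence measure are illustrative choices, and ‘texts’, ‘corpus’ and ‘dictionary’ are reused from the earlier sketch:

from gensim.models import LdaModel, CoherenceModel

scores = {}
for k in [2, 4, 6, 8, 10]:
    lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=10)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # K with the highest coherence
print(scores, best_k)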
Dirichlet Distribution
• The Dirichlet distribution is a multivariate distribution, meaning it
describes the distribution of multiple variables.

• It is used for topic modeling since we are usually concerned with multiple topics.

• It helps us specify our assumptions about the probability distributions before doing the analysis.
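The effect of the Dirichlet concentration parameter can be seen directly with numpy; the parameter values below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Each sample is a distribution over 5 topics (it sums to 1).
# Small alpha -> sparse mixes dominated by one topic;
# large alpha -> near-uniform mixes.
print(rng.dirichlet([0.1] * 5))   # e.g. mostly one topic
print(rng.dirichlet([10.0] * 5))  # e.g. close to uniform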
