
Topic Modeling

Natural Language Processing


Session 4

Madhuri Prabhala
Overview
What is topic modeling?

❑ Unsupervised learning technique – No need to provide a labelled dataset

o What are the main ideas being discussed?

o Do the themes of discussion change over time?

o Given a document, is it possible to find another document that discusses similar ideas?

❑ Some use cases

o Categorizing documents
o Summarizing documents
o Information retrieval – decipher the themes and topics of a large group of documents
o Dimensionality reduction
What are Topics?

❑ Repeating group of statistically significant tokens or words in a corpus

❑ What does statistical significance mean here?

o The words occur together in a document
o They frequently occur together across the corpus
o They fall in a similar range of term frequency and inverse document frequency values
Techniques of topic modeling

❑ LDA - Latent Dirichlet Allocation

❑ NNMF – Non-Negative Matrix Factorization

❑ LSA – Latent Semantic Analysis


Latent Dirichlet Allocation
Introduction to Latent Dirichlet Allocation (LDA)

❑ First proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan

❑ Given a corpus and a chosen number of topics, the aim of LDA is twofold

o What are the important topics of discussion in a corpus?

o Which is the dominant topic discussed in each document?

❑ So, LDA seeks to identify:

o What is the topic-to-term distribution?

o What is the document-to-topic distribution?


The input of LDA – Document corpus (reviews, news articles, medical reports, employee appraisals)

Movie Reviews (example corpus):

o One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked.
o I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy.
o The plot is simplistic, but the dialogue is witty and the characters are likable.
o Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.
o If you like original gut wrenching laughter you will like this movie.
o Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.
o I recall the scariest scene was the big bird eating men dangling helplessly from parachutes right out of the air.
o Even some of the most ridiculous b-movies will still give you some laughs, but this is just too painful to watch!!
o Encouraged by the positive comments about this film on here I was looking forward to watching this film.
o This was the worst movie I saw at WorldFest and it also received the least amount of applause afterwards!
The output of LDA

LDA Output 1 (Topic-to-term distribution)

o Topic 1 – Horror 10%, Laughter 25%, Boys 10%, … (Emotions)
o Topic 2 – Theatre 8%, Weekend 18%, Air conditioner 24%, … (Experience)
o Topic 3 – Plot 30%, Man 15%, Closet 20%, … (Plot)
…

LDA Output 2 (Document-to-topic distribution)

o Rev 1 – Emotions 40%, Experience 30%, Plot 10%, …
o Rev 2 – Emotions 20%, Experience 50%, Plot 15%, …
o Rev 3 – Emotions 15%, Experience 30%, Plot 55%, …
o Rev 4 – Emotions 20%, Experience 25%, Plot 40%, …
…
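To make the two outputs concrete, here is a minimal sketch using gensim's LdaModel (an assumed library choice; the tiny corpus and num_topics=3 are illustrative, not the review data above):

```python
# A minimal sketch, assuming gensim is installed.
from gensim import corpora
from gensim.models import LdaModel

reviews = [
    "the plot is simplistic but the dialogue is witty",
    "a wonderful way to spend a hot summer weekend in the theater",
    "if you like gut wrenching laughter you will like this movie",
]
tokenized = [review.lower().split() for review in reviews]

dictionary = corpora.Dictionary(tokenized)                  # token -> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=3, random_state=42, passes=10)

# Output 1: topic-to-term distribution (top words per topic)
for topic_id, terms in lda.print_topics(num_words=5):
    print(topic_id, terms)

# Output 2: document-to-topic distribution for each review
for bow in bow_corpus:
    print(lda.get_document_topics(bow))
```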
LDA – Notations and Terminology
o LDA is a generative probabilistic algorithm
o It finds topics in a corpus and assigns topics to individual documents
o It is based on 2 assumptions:
• Documents are probabilistic distributions of topics (documents are a mixture of topics)
• Topics are probabilistic distributions of words (topics are a mixture of words)
LDA – How does it work?

Document – Term Matrix:

      W1  W2  W3  W4  …   Wn
  D1   2   0   2   2   4   3
  D2   0   3   5   2   3   2
  D3   1   1   2   3   5   1
  D4   3   2   3   2   3   1
  D5   2   3   1   2   4   0

Document – Topic Matrix:

      T1  T2  T3  T4  …   Tn
  D1   0   1   1   0  …    0
  D2   1   1   0   1  …    1
  D3   0   1   1   1  …    1
  D4   1   0   1   0  …    0
  D5   0   0   1   0  …    0

Topic – Term Matrix:

      W1  W2  W3  W4  …   Wn
  T1   1   1   1   0  …    1
  T2   1   1   0   1  …    1
  T3   0   1   0   0  …    0
  T4   1   0   0   1  …    1
  T5   1   0   1   1  …    0

o The goal of LDA is to find the optimal representation of the Document – Topic matrix and the Topic – Term matrix
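As a concrete illustration, here is a minimal sketch of building a Document – Term matrix like the one above. It assumes scikit-learn as the library (the slides do not name one) and uses three toy documents:

```python
# A minimal sketch, assuming scikit-learn; the toy documents are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the zombie in the closet scared the little boy",
    "a witty comedy for a hot summer weekend",
    "the scariest scene in the movie",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)        # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())   # the term (Wn) column labels
print(dtm.toarray())                        # raw term counts, one row per Dn
```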

LDA – Graphical model

o M = Total number of documents in the corpus
o N = Number of words in a document
o w = A word in a document
o z = Association of a word with a topic (every word is associated with a latent or hidden topic)
o θ = Topic distribution of a document (the per-document mixture over topics)
o α = Hyperparameter of the LDA model (controls the per-document topic distribution)
o β = Hyperparameter of the LDA model (controls the per-topic word distribution)
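Putting the notation together, the generative story can be written compactly. For a single document, the joint distribution over the topic mixture θ, the topic assignments z, and the observed words w, as given in the Blei, Ng, and Jordan paper, is:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$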
LDA – How does it work?

❑ The goal of LDA is to find the optimal representation of the Document – Topic matrix and the Topic – Term matrix

o Assumptions:
1. Documents are a distribution of topics
2. Topics are a distribution of words

o What happens: LDA backtracks from the document level
1. Which topics would have generated these documents?
2. Which words would have generated those topics?
Example – A corpus of M documents

D1 = (w1, w2, w3, …, wn)
D2 = (w'1, w'2, w'3, …, w'n)
D3 = (w''1, w''2, w''3, …, w''n)
…
DM = (w'''1, w'''2, w'''3, …, w'''n)

Iterative process (sketched in code below):
1. Randomly assign a topic to every word
2. Each word is now assigned to some topic from 1 to t
• Each document becomes a mixture of topics (Document – Topic matrix)
• Each topic becomes a mixture of words (Topic – Term matrix)
3. Assume all word-topic assignments except the current word's are correct, then correct the current word's assignment by computing p1 and p2:

p1 -> proportion of words in document d that are assigned to topic t
p2 -> proportion of assignments to topic t, across all documents in the corpus, that come from word w

p1 * p2 -> a new topic 'k' is assigned to the current word based on this probability product

This is repeated a large number of times until a steady state is reached.
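Below is an illustrative sketch of this iterative reassignment in plain Python (a simplified collapsed-Gibbs-style update; the smoothing constant and all names are assumptions for the example, not from the slides):

```python
# An illustrative sketch of the reassignment loop described above.
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=100, smooth=0.1, seed=0):
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Steps 1-2: randomly assign every word occurrence to a topic from 1..t
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [defaultdict(int) for _ in docs]                 # Doc-topic counts
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # Topic-term counts
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            doc_topic[d][z[d][i]] += 1
            topic_word[z[d][i]][word] += 1
            topic_total[z[d][i]] += 1

    # Step 3: repeatedly correct each word's assignment using p1 * p2
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment; assume all others are correct
                doc_topic[d][t] -= 1
                topic_word[t][word] -= 1
                topic_total[t] -= 1

                weights = []
                for k in range(num_topics):
                    # p1: share of words in document d assigned to topic k
                    p1 = (doc_topic[d][k] + smooth) / (len(doc) - 1 + smooth * num_topics)
                    # p2: share of topic k's assignments that come from this word
                    p2 = (topic_word[k][word] + smooth) / (topic_total[k] + smooth * vocab_size)
                    weights.append(p1 * p2)

                # Assign a new topic in proportion to p1 * p2
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][word] += 1
                topic_total[t] += 1
    return z

# Toy usage on a tokenized corpus of 2 documents
print(lda_gibbs([["plot", "closet", "zombie"], ["laughter", "comedy", "plot"]], num_topics=2))
```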
LDA – Evaluating the Topic Models

How can we choose the optimum number of topics?

1. Going through the topic words

2. Human judgement of the words in different topics

3. Topic coherence – to what extent the set of words in a topic support each other; 'c_v' is commonly used, though there are other measures (see the sketch after this list)

4. Perplexity – to what extent the developed model works on a new dataset

5. Hyperparameter tuning (what the data scientist sets for the ML model):
o K – Number of topics
o α – Document – topic density
o β – Topic – term density

6. Model parameters – output of the ML model:
o Weights associated with each word