Madhuri Prabhala
Overview
What is topic modeling?
o Given a document, is it possible to find another document that discusses similar ideas?
o Categorizing documents
o Summarizing documents
o Information retrieval – deciphering the themes and topics of a large group of documents
o Dimensionality reduction
What are Topics?
Documents of many kinds (e.g. news articles, medical reports, employee appraisals) can be grouped by topic. The movie reviews below serve as example documents:
o "Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines. I recall the scariest scene was the big bird eating men dangling helplessly from parachutes right out of the air."
o "Even some of the most ridiculous b-movies will still give you some laughs, but this is just too painful to watch!!"
o "Encouraged by the positive comments about this film on here I was looking forward to watching this film. This was the worst movie I saw at WorldFest and it also received the least amount of applause afterwards!"
LDA Output 1 (Topic to term distribution)
o Topic 1 – Horror 10%, Laughter 25%, Boys 10%...... (Emotions)
o Topic 2 – Theatre 8%, weekend 18%, air conditioner 24%...... (Experience)
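The topic-to-term output above can be sketched as a data structure. A minimal illustration with the invented percentages from this slide (not numbers from a trained model):

```python
# Each topic is a probability distribution over terms.
# The numbers below are the toy weights from the slide, not fitted values.
topic_term = {
    "Topic 1 (Emotions)":   {"horror": 0.10, "laughter": 0.25, "boys": 0.10},
    "Topic 2 (Experience)": {"theatre": 0.08, "weekend": 0.18, "air conditioner": 0.24},
}

def top_terms(topic, n=2):
    """Return the n highest-probability terms for a topic."""
    dist = topic_term[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]

print(top_terms("Topic 1 (Emotions)"))  # highest-weight terms first
```

In practice a topic-modeling library would produce these distributions; inspecting a topic's top terms, as above, is how topics get human-readable labels like "Emotions".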
LDA – Notations and Terminology
o LDA is a generative probabilistic algorithm
o It finds topics in a corpus and assigns topics to individual documents
o It is based on 2 assumptions:
• Documents are probabilistic distributions over topics (each document is a mixture of topics)
• Topics are probabilistic distributions over words (each topic is a mixture of words)
LDA – How does it work?

Document – Term Matrix:
     W1  W2  W3  W4  …..  Wn
D1    2   0   2   2   4    3
D2    0   3   5   2   3    2
D3    1   1   2   3   5    1
D4    3   2   3   2   3    1
D5    2   3   1   2   4    0
Document – Topic Matrix:
     T1  T2  T3  T4  …..  Tn
D1    0   1   1   0  …..   0
D2    1   1   0   1  …..   1
D3    0   1   1   1  …..   1
D4    1   0   1   0  …..   0
D5    0   0   1   0  …..   0

Topic – Term Matrix:
     W1  W2  W3  W4  ….  Wn
T1    1   1   1   0  ….   1
T2    1   1   0   1  ….   1
T3    0   1   0   0  ….   0
T4    1   0   0   1  ….   1
T5    1   0   1   1  ….   0
o The goal of LDA is to find the optimal document – topic matrix and topic – term matrix
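Read as matrices, this goal is a factorization: the observed document – term matrix (documents × vocabulary) is approximated by the product of the document – topic matrix (documents × topics) and the topic – term matrix (topics × vocabulary). A dependency-free sketch with toy numbers, not a fitted model:

```python
def matmul(A, B):
    """Plain-Python matrix product, to keep the sketch dependency-free."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Toy factors: 2 documents x 2 topics, and 2 topics x 3 terms.
doc_topic  = [[0.9, 0.1],
              [0.2, 0.8]]
topic_term = [[0.6, 0.4, 0.0],
              [0.0, 0.5, 0.5]]

# Their product has shape documents x terms: LDA searches for the
# doc_topic and topic_term factors that best explain the observed counts.
approx_doc_term = matmul(doc_topic, topic_term)
print(approx_doc_term)  # 2 x 3 matrix of expected word proportions
```

Because each row of both factors is a probability distribution, each row of the product also sums to 1, which is why this doubles as dimensionality reduction: a long count vector per document is replaced by a short topic-mixture vector.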
LDA – Graphical model space
o M = Total number of documents in the corpus
o N = Number of words in a document
o w = Word in a document
o z = Association of a word with a topic (every word is assigned a latent, or hidden, topic)
o Θ – Per-document topic distribution (the topic assignments 'z' are drawn from Θ)
o α – Hyperparameter of the LDA model (controls the per-document topic distribution)
o β – Hyperparameter of the LDA model (controls the per-topic word distribution)
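The role of α can be made concrete: each document's Θ is drawn from a Dirichlet distribution governed by α, and a small α yields sparse mixtures (a few dominant topics) while a large α yields even mixtures. A minimal sketch using the standard gamma-normalization construction of a Dirichlet draw:

```python
import random

random.seed(1)  # deterministic toy run

def dirichlet(alpha, k):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)
    by normalizing k independent Gamma(alpha, 1) draws."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

# Theta: one document's topic distribution over k = 5 topics.
theta_sparse = dirichlet(0.1, 5)   # small alpha: mass piles on few topics
theta_even   = dirichlet(10.0, 5)  # large alpha: mass spreads out evenly
print(theta_sparse)
print(theta_even)
```

β plays the same role one level down, shaping how concentrated each topic's word distribution is over the vocabulary.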
LDA – Evaluating the Topic Models
o Assumptions:
1. Documents are a distribution of topics
2. Topics are a distribution of words
o Evaluation and tuning:
3. Topic coherence – to what extent the set of top words in a topic support each other; 'c_v' is commonly used, though there are other measures
4. Perplexity – to what extent the developed model works on a new dataset
5. Hyperparameter tuning (what the data scientist sets for the ML model):
o K – number of topics
o α – document – topic density
o β – topic – term density
6. Model parameters – output of the ML model:
o Weights associated with each word
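Perplexity can be computed from the per-word probabilities the trained model assigns to held-out text: it is the exponential of the negative average log-likelihood per word, and lower values mean the model is less "surprised" by unseen documents. A minimal sketch with toy probabilities (not output from a real model):

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-(1/N) * sum(log p(w))) over N held-out words.
    Lower is better: the model predicts the unseen text more confidently."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Toy per-word probabilities a model might assign to a held-out document.
print(perplexity([0.1, 0.2, 0.05, 0.1]))
```

As a sanity check, a model that assigns every word a uniform probability of 1/V has perplexity exactly V, so perplexity is often read as the model's "effective vocabulary size" of uncertainty.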