
Topic Modelling Using Non-Negative Matrix Factorization

Anjusha C
MA18M008

Department of Mathematics
IIT Madras

15/05/2020
What is topic modelling?

- Topic modelling is a method for finding the topics in a collection of documents that best represent the information in the collection.
- Topics are ’repeating patterns of co-occurring terms in a corpus’.
How to mathematically quantify them?
- Consider two documents:
  - Doc A: on ’Biology’
  - Doc B: on ’Linear Algebra’
  Consider the words ’decompose’, ’matrix’, ’mitochondria’ and ’eigenvector’. The word ’decompose’ can occur in both documents, but its relative frequency of occurrence will be higher in Doc A than in Doc B. Similarly, ’matrix’ might occur in both documents, but its relative frequency will be higher in Doc B than in Doc A.
- Consider the set of words occurring in the documents, (w_1, w_2, ..., w_n). Each document can be represented as a vector whose i-th entry is the frequency of the word w_i:
  - Doc A = (ν_{A1}, ν_{A2}, ..., ν_{An})
  - Doc B = (ν_{B1}, ν_{B2}, ..., ν_{Bn})
- These two vectors are the two topics, one corresponding to Biology and the other to Linear Algebra.
Multiple topics in the same document

Each document is a linear combination of the topic vectors (basis vectors).
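
As a worked illustration (the mixing weights 0.7 and 0.3 below are made up), a document touching on both subjects decomposes as

\[ d = 0.7\,(\nu_{A1}, \dots, \nu_{An}) + 0.3\,(\nu_{B1}, \dots, \nu_{Bn}). \]

Stacking all documents as columns of a matrix V, the topic vectors as columns of W, and the mixing weights as columns of H gives the matrix form V ≈ WH used below.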
Term-document matrix and term-weighting

[Figure: a term-document matrix. Source: Introduction to Information Retrieval]

- Stop word removal / Tf-Idf scoring
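
A minimal sketch of this weighting step in scikit-learn (the two toy documents are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents echoing the Doc A / Doc B example above.
docs = ["mitochondria decompose organic matter in the cell",
        "decompose the matrix into eigenvector and eigenvalue parts"]

# stop_words='english' removes stop words; the remaining terms get Tf-Idf scores.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)         # shape: (n_documents, n_terms)
print(vectorizer.get_feature_names_out())  # the vocabulary (w_1, ..., w_n)
print(X.toarray())                         # Tf-Idf weights per document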
Linear Dimensionality Reduction

AIM: extract these topic vectors from the term-document matrix.

[Figure. Source: http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf]
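
The classical instance of this reduction is a truncated SVD (latent semantic analysis). A minimal sketch, assuming X is the Tf-Idf matrix from the previous step:

from sklearn.decomposition import TruncatedSVD

# Project the vectorized documents onto k latent topics.
# Note: the resulting factors can contain negative entries,
# which is exactly what NMF (next slide) rules out.
k = 2
svd = TruncatedSVD(n_components=k, random_state=0)
doc_topic = svd.fit_transform(X)   # document-topic weights
topic_term = svd.components_       # topic vectors over the vocabulary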
NMF

- Previous approach: there can be negative entries in W and H.
- NMF assumption: non-negative entries, i.e. a constrained optimization problem.
- AIM: find W and H such that V ≈ WH, W ≥ 0, H ≥ 0, given the factorization rank k (i.e., the number of topics).
Cost functions

How to check the goodness of the approximation V ≈ WH?

- Frobenius norm:
  \[ \|V - WH\|^2 = \sum_{ij} \bigl(V_{ij} - (WH)_{ij}\bigr)^2 \]
- K-L divergence:
  \[ D(V \,\|\, WH) = \sum_{ij} \Bigl( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \Bigr) \]
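
Both costs are easy to evaluate directly; a short numpy sketch (the function names are illustrative):

import numpy as np

def frobenius_cost(V, W, H):
    """Squared Frobenius norm ||V - WH||^2."""
    return np.sum((V - W @ H) ** 2)

def kl_cost(V, W, H, eps=1e-10):
    """Generalized K-L divergence D(V || WH); eps avoids log(0) and 0/0."""
    WH = W @ H
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)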
Method

- Gradient descent
- Naive GD: no guarantee of convergence.
- The problem is convex in W alone or in H alone, but not in both simultaneously.
- So we optimize W keeping H fixed, then optimize H keeping W fixed, and keep alternating until convergence (the Alternating Least Squares approach); a sketch follows this list.
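
A minimal sketch of the alternating scheme, assuming a dense matrix V and using scipy's nonnegative least squares solver column by column (simple but slow, purely for illustration):

import numpy as np
from scipy.optimize import nnls

def als_nmf(V, k, n_iter=50, seed=0):
    """Alternating nonnegative least squares for V ≈ WH, with V dense and >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    for _ in range(n_iter):
        # Fix W, solve min ||V - WH|| over H >= 0, one column of V at a time.
        H = np.column_stack([nnls(W, V[:, j])[0] for j in range(n)])
        # Fix H, solve min ||V^T - H^T W^T|| over W >= 0, one row of V at a time.
        W = np.vstack([nnls(H.T, V[i, :])[0] for i in range(m)])
    return W, H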
Lee-Seung multiplicative updates
- Put forward by Daniel D. Lee and H. Sebastian Seung
  (Link: Lee-Seung paper)
- These gradient-descent-style updates are guaranteed never to increase the cost, so the objective converges.
- Frobenius norm updates:
  \[ H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad
     W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}} \]
- K-L divergence updates:
  \[ H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu}/(WH)_{i\mu}}{\sum_k W_{ka}}, \qquad
     W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu}/(WH)_{i\mu}}{\sum_\nu H_{a\nu}} \]
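
The Frobenius-norm updates translate directly into numpy; a sketch (the eps guard against division by zero is an addition, not part of the original updates):

import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Lee-Seung multiplicative updates for the cost ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, H fixed
    return W, H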
Implementation
Dataset: the ’20 newsgroups’ dataset is used. Newsgroups are discussion groups on Usenet, which was popular in the 1980s and 1990s before the web really took off. This dataset includes 18,000 newsgroup posts covering 20 topics.

Only 4 topics are chosen: ’alt.atheism’, ’talk.religion.misc’, ’comp.graphics’ and ’sci.space’. The dataset is vectorised (i.e., the document-term matrix is constructed); a sketch follows.
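
A sketch of the loading and vectorization step in scikit-learn (stripping headers, footers and quotes via the remove option is an assumed preprocessing choice):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))

# Build the document-term matrix V with Tf-Idf weighting.
vectorizer = TfidfVectorizer(stop_words='english')
V = vectorizer.fit_transform(newsgroups.data)
vocab = vectorizer.get_feature_names_out()
print(V.shape)   # (n_documents, n_terms)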
NMF in Scikit-learn

The ’show topics’ function picks up the top 8 words of each topic (the words with the highest weights in the topic vector). The highest-weighted words are the ones that are specific to the topic. The rows of the output are the topics.
The time taken is approximately 5.92 s.
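
A sketch of this run, with a possible form of the ’show topics’ helper (the helper shown here is an assumption modelled on the description above, not the original code):

import numpy as np
from sklearn.decomposition import NMF

k = 4  # number of topics
model = NMF(n_components=k, random_state=0)
W = model.fit_transform(V)   # document-topic weights
H = model.components_        # topic-term matrix: rows are the topics

def show_topics(H, vocab, n_words=8):
    """Return the n_words highest-weighted words of each topic vector."""
    return [' '.join(vocab[np.argsort(row)[::-1][:n_words]]) for row in H]

for topic in show_topics(H, vocab):
    print(topic)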
Lee-Seung update: Frobenius norm
The time taken is 21.27 s.
[Error plot for the Frobenius norm]
Lee-Seung update: K-L divergence
The time taken is 54.04 s.
[Error plot for the K-L divergence]
References

- Daniel D. Lee and H. Sebastian Seung, Algorithms for Non-negative Matrix Factorization, Advances in Neural Information Processing Systems, 2001
- Nicolas Gillis, The Why and How of Non-negative Matrix Factorization, arXiv:1401.5226v2, 2014
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2009
- NMF tutorial by Rachel Thomas
- http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
