
CSE 434/535

Information Retrieval
Fall 2019

Latent Semantic Indexing (LSI)


Material from:
Manning, Raghavan, Schutze: Chapter 18
Supplemented by additional material
PLSA: Ata Kaban, The University of Birmingham

Latent Semantic Analysis
- Latent semantic space: illustrating example
[Figure: illustration of a latent semantic space, courtesy of Susan Dumais]

Example - Technical Memo
- Query: human-computer interaction
- Dataset:
c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Example cont
% 12-term by 9-document matrix
>> X=[ 1 0 0 1 0 0 0 0 0;
1 0 1 0 0 0 0 0 0;
1 1 0 0 0 0 0 0 0;
0 1 1 0 1 0 0 0 0;
0 1 1 2 0 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 0 1 1 0 0 0 0 0;
0 1 0 0 0 0 0 0 1;
0 0 0 0 0 1 1 1 0;
0 0 0 0 0 0 1 1 1;
0 0 0 0 0 0 0 1 1;];
Example contd.
T*S*D
 0.1621 0.4006  0.3790  0.4677 0.1760 -0.0527  -0.12 -0.16 -0.09
 0.1406 0.3697  0.3289  0.4004 0.1649 -0.0328  -0.07 -0.10 -0.04
 0.1525 0.5051  0.3580  0.4101 0.2363  0.0242  ...
 0.2581 0.8412  0.6057  0.6973 0.3924  0.0331
 0.4488 1.2344  1.0509  1.2658 0.5564 -0.0738
 0.1595 0.5816  0.3751  0.4168 0.2766  0.0559
 0.1595 0.5816  0.3751  0.4168 0.2766  0.0559
 0.2185 0.5495  0.5109  0.6280 0.2425 -0.0654
 0.0969 0.5320  0.2299  0.2117 0.2665  0.1367
-0.0613 0.2320 -0.1390 -0.2658 0.1449  0.2404
-0.0647 0.3352 -0.1457 -0.3016 0.2028  0.3057
-0.0430 0.2540 -0.0966 -0.2078 0.1520  0.2212   0.50  0.71  0.62
(only the first six of the nine columns are reproduced for most rows)

Observations:
- The result does not exactly match the original term-by-document matrix X; the approximation gets closer as k increases.
- Notice that some of the 0's in X move toward 1, and vice versa.

More exhaustive example available at:


http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

Good source of information on LSA:

lsa.colorado.edu
Linear Algebra Background

Eigenvalues & Eigenvectors
- Eigenvectors (for a square m×m matrix S):

    S v = λ v      (v: a (right) eigenvector, λ: the corresponding eigenvalue)

- How many eigenvalues are there at most?
  (S − λI) v = 0 only has a non-zero solution v if det(S − λI) = 0;
  this is an m-th order equation in λ which can have at most m distinct solutions (roots of the characteristic polynomial) – they can be complex even though S is real.
Matrix-vector multiplication

S = [ 30 0 0; 0 20 0; 0 0 0 ]  has eigenvalues 30, 20, 0 with corresponding eigenvectors

    v1 = (1, 0, 0)^T,  v2 = (0, 1, 0)^T,  v3 = (0, 0, 1)^T

On each eigenvector, S acts as a multiple of the identity matrix: but as a different multiple on each.

Any vector (say x = (2, 4, 6)^T) can be viewed as a combination of the eigenvectors:

    x = 2 v1 + 4 v2 + 6 v3
Matrix vector multiplication
- Thus a matrix-vector multiplication such as Sx (S, x as in the previous slide) can be rewritten in terms of the eigenvalues/vectors:

    Sx = S(2 v1 + 4 v2 + 6 v3)
    Sx = 2 S v1 + 4 S v2 + 6 S v3 = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3
    Sx = 60 v1 + 80 v2 + 0 v3

- Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
- Suggestion: the effect of small eigenvalues is small.
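
A quick numerical check of this (a MATLAB sketch in the style of the example later in the deck; not part of the original slides):

% S and x from the previous slide
S = [30 0 0; 0 20 0; 0 0 0];
x = [2; 4; 6];
v1 = [1; 0; 0];  v2 = [0; 1; 0];  v3 = [0; 0; 1];
S*x                     % = [60; 80; 0]
60*v1 + 80*v2 + 0*v3    % the same vector: each eigen-component is scaled by its eigenvalue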
Eigenvalues & Eigenvectors
For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:

    S v_{1,2} = λ_{1,2} v_{1,2}, and λ1 ≠ λ2  ⇒  v1 · v2 = 0

All eigenvalues of a real symmetric matrix are real:

    for complex λ, if det(S − λI) = 0 and S = S^T  ⇒  λ ∈ ℝ

All eigenvalues of a positive semidefinite matrix S are non-negative:

    if w^T S w ≥ 0 for every non-zero column vector w ∈ ℝ^n, then Sv = λv  ⇒  λ ≥ 0
Example
- Let S = [ 2 1; 1 2 ]   (real, symmetric)
- Then

    S − λI = [ 2−λ 1; 1 2−λ ]  ⇒  (2 − λ)^2 − 1 = 0

- The eigenvalues are 1 and 3 (nonnegative, real).
- The eigenvectors are orthogonal (and real):

    (1, −1)^T and (1, 1)^T

  (Plug in the eigenvalues and solve for the eigenvectors.)
Eigen/diagonal Decomposition
- Let S be a square matrix with m linearly independent eigenvectors (a "non-defective" matrix).
- Theorem: there exists an eigen decomposition

    S = U Λ U^{-1}

  with Λ diagonal (the decomposition is unique for distinct eigenvalues).
- (cf. matrix diagonalization theorem)
- Columns of U are the eigenvectors of S.
- Diagonal elements of Λ are the eigenvalues of S.
Diagonal decomposition: why/how
Let U have the eigenvectors as columns:  U = [ v1 ... vn ]

Then SU can be written

    SU = S [ v1 ... vn ] = [ λ1 v1 ... λn vn ] = [ v1 ... vn ] diag(λ1, ..., λn)

Thus SU = UΛ, or U^{-1} S U = Λ

And S = U Λ U^{-1}.
Diagonal decomposition - example

Recall S = [ 2 1; 1 2 ];  λ1 = 1, λ2 = 3.

The eigenvectors (1, −1)^T and (1, 1)^T form  U = [ 1 1; −1 1 ]

Inverting, we have  U^{-1} = [ 1/2 −1/2; 1/2 1/2 ]      (Recall UU^{-1} = I.)

Then, S = U Λ U^{-1} = [ 1 1; −1 1 ] [ 1 0; 0 3 ] [ 1/2 −1/2; 1/2 1/2 ]
Example continued
Let's divide U (and multiply U^{-1}) by √2

Then, S = [ 1/√2 1/√2; −1/√2 1/√2 ] [ 1 0; 0 3 ] [ 1/√2 −1/√2; 1/√2 1/√2 ]
                 Q                        Λ              Q^T   (with Q^{-1} = Q^T)

Why? Stay tuned ...
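
A numeric sanity check of this example (a MATLAB sketch, not from the original deck):

S = [2 1; 1 2];
U = [1 1; -1 1];  L = [1 0; 0 3];
U * L * inv(U)            % reproduces S
Q = U / sqrt(2);          % orthonormal eigenvector matrix
Q * L * Q'                % also reproduces S, since inv(Q) = Q'
[V, D] = eig(S)           % MATLAB's eig returns unit-length eigenvectors and the eigenvalues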


Symmetric Eigen Decomposition
- If S is a symmetric matrix:
- Theorem: there exists a (unique) eigen decomposition S = Q Λ Q^T
- where Q is orthogonal:
  - Q^{-1} = Q^T
  - Columns of Q are normalized eigenvectors
  - Columns are orthogonal.
  - (everything is real)
Exercise
- Examine the symmetric eigen decomposition, if any, for each of the following matrices:

    [ 0 1; −1 0 ]    [ 0 1; 1 0 ]    [ 1 2; −2 3 ]    [ 2 2; 2 4 ]
Time out!
- I came to this class to learn about text retrieval and mining, not have my linear algebra past dredged up again ...
- But if you want to dredge, Strang's Applied Mathematics is a good place to start.
- What do these matrices have to do with text?
- Recall m×n term-document matrices ...
- But everything so far needs square matrices – so ...
Singular Value Decomposition
For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

    A = U Σ V^T        (U is m×m, Σ is m×n, V is n×n)

The columns of U are orthogonal eigenvectors of AA^T.
The columns of V are orthogonal eigenvectors of A^T A.
The eigenvalues λ1 ... λr of AA^T are also the eigenvalues of A^T A.

    σ_i = √λ_i
    Σ = diag(σ_1, ..., σ_r)      (the singular values)
Singular Value Decomposition
- Illustration of SVD dimensions and sparseness
[Figure: the shapes of U, Σ, V^T in the two cases M > N and M < N]
SVD example

Let A = [ 1 −1; 0 1; 1 0 ].  Thus m = 3, n = 2. Its SVD is A = U Σ V^T with

    U   = [ 0 2/√6 1/√3; 1/√2 −1/√6 1/√3; 1/√2 1/√6 −1/√3 ]
    Σ   = [ 1 0; 0 √3; 0 0 ]
    V^T = [ 1/√2 1/√2; 1/√2 −1/√2 ]

Typically, the singular values are arranged in decreasing order.
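
To check this factorization numerically (a MATLAB sketch, not part of the original slides; svd orders the singular values in decreasing order, so the returned factors may differ from the slide by column order and sign):

A = [1 -1; 0 1; 1 0];
[U, S, V] = svd(A)        % singular values sqrt(3) and 1 on the diagonal of S
U * S * V'                % reproduces A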
Low-rank Approximation
- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: find A_k of rank k such that

    A_k = argmin_{X : rank(X) = k} ||A − X||_F        (Frobenius norm)

  A_k and X are both m×n matrices.
- Typically, we want k << r.

Low-rank Approximation
- Solution via SVD:

    A_k = U diag(σ_1, ..., σ_k, 0, ..., 0) V^T        (set the smallest r − k singular values to zero)

    A_k = Σ_{i=1}^{k} σ_i u_i v_i^T                   (column notation: a sum of k rank-1 matrices)
Approximation error
- How good (bad) is this approximation?
- It's the best possible, as measured by the Frobenius norm of the error:

    min_{X : rank(X) = k} ||A − X||_F = ||A − A_k||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2)

  where the σ_i are ordered such that σ_i ≥ σ_{i+1}.  (In the 2-norm, the corresponding minimum error is σ_{k+1}.)
- Suggests why the Frobenius error drops as k is increased.
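
A quick numeric illustration of this error formula (a MATLAB sketch with my own choice of test matrix, not from the original deck):

A = rand(20, 10);                       % any test matrix
[U, S, V] = svd(A);  sig = diag(S);
k = 3;
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
norm(A - Ak, 'fro')                     % equals sqrt(sum(sig(k+1:end).^2))
norm(A - Ak, 2)                         % equals sig(k+1)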
SVD Low-rank approximation
- Whereas the term-doc matrix A may have m = 50,000, n = 10 million (and rank close to 50,000)
- We can construct an approximation A_100 with rank 100.
- Of all rank-100 matrices, it would have the lowest Frobenius error.
- Great ... but why would we??
- Answer: Latent Semantic Indexing

C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
Latent Semantic Analysis via SVD
Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the document-term matrix (typical rank 100-300)
- General idea:
  - Map documents (and terms) to a low-dimensional representation.
  - Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space.
Goals of LSI
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
Latent Semantic Analysis
- Latent semantic space: illustrating example
[Figure: illustration of a latent semantic space, courtesy of Susan Dumais]
What it is
- From the term-doc matrix A, we compute the approximation Ak.
- There is a row for each term and a column for each doc in Ak.
- Thus docs live in a space of k << r dimensions.
- These dimensions are not the original axes.
- But why?
Method
- Given C, construct its SVD in the form C = U Σ V^T
- Derive from Σ the matrix Σk formed by replacing with zeros the r − k smallest singular values on the diagonal of Σ
- Compute and output Ck = U Σk V^T as the rank-k approximation to C.
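
A minimal MATLAB sketch of this method (the function name lsi_rank_k and the use of svd are my additions; the deck's own example below computes the same factors via eig):

function Ck = lsi_rank_k(C, k)
  % C: term-by-document matrix, k: target rank
  [U, Sigma, V] = svd(C);            % C = U*Sigma*V'
  Sigma_k = Sigma;
  Sigma_k(k+1:end, k+1:end) = 0;     % zero out the r-k smallest singular values
  Ck = U * Sigma_k * V';             % rank-k approximation of C
end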
Example - Technical Memo
- Query: human-computer interaction
- Dataset:
c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Example cont
% 12-term by 9-document matrix
>> X=[ 1 0 0 1 0 0 0 0 0;
1 0 1 0 0 0 0 0 0;
1 1 0 0 0 0 0 0 0;
0 1 1 0 1 0 0 0 0;
0 1 1 2 0 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 0 1 1 0 0 0 0 0;
0 1 0 0 0 0 0 0 1;
0 0 0 0 0 1 1 1 0;
0 0 0 0 0 0 1 1 1;
0 0 0 0 0 0 0 1 1;];
Example cont
% X=T0*S0*D0', T0 and D0 have orthonormal columns and S0 is diagonal
% T0 is the matrix of eigenvectors of the square symmetric matrix XX'
% D0 is the matrix of eigenvectors of the square symmetric matrix X'X
% S0 is the matrix of eigenvalues in both cases
>> [T0, S0] = eig(X*X');
>> T0
T0 =
0.1561 -0.2700 0.1250 -0.4067 -0.0605 -0.5227 -0.3410 -0.1063 -0.4148 0.2890 -0.1132 0.2214
0.1516 0.4921 -0.1586 -0.1089 -0.0099 0.0704 0.4959 0.2818 -0.5522 0.1350 -0.0721 0.1976
-0.3077 -0.2221 0.0336 0.4924 0.0623 0.3022 -0.2550 -0.1068 -0.5950 -0.1644 0.0432 0.2405
0.3123 -0.5400 0.2500 0.0123 -0.0004 -0.0029 0.3848 0.3317 0.0991 -0.3378 0.0571 0.4036
0.3077 0.2221 -0.0336 0.2707 0.0343 0.1658 -0.2065 -0.1590 0.3335 0.3611 -0.1673 0.6445
-0.2602 0.5134 0.5307 -0.0539 -0.0161 -0.2829 -0.1697 0.0803 0.0738 -0.4260 0.1072 0.2650
-0.0521 0.0266 -0.7807 -0.0539 -0.0161 -0.2829 -0.1697 0.0803 0.0738 -0.4260 0.1072 0.2650
-0.7716 -0.1742 -0.0578 -0.1653 -0.0190 -0.0330 0.2722 0.1148 0.1881 0.3303 -0.1413 0.3008
0.0000 0.0000 0.0000 -0.5794 -0.0363 0.4669 0.0809 -0.5372 -0.0324 -0.1776 0.2736 0.2059
0.0000 0.0000 0.0000 -0.2254 0.2546 0.2883 -0.3921 0.5942 0.0248 0.2311 0.4902 0.0127
-0.0000 -0.0000 -0.0000 0.2320 -0.6811 -0.1596 0.1149 -0.0683 0.0007 0.2231 0.6228 0.0361
0.0000 -0.0000 0.0000 0.1825 0.6784 -0.3395 0.2773 -0.3005 -0.0087 0.1411 0.4505 0.0318
Example cont
>> [D0, S0] = eig(X'*X);
>> D0
D0 =
0.0637 0.0144 -0.1773 0.0766 -0.0457 -0.9498 0.1103 -0.0559 0.1974
-0.2428 -0.0493 0.4330 0.2565 0.2063 -0.0286 -0.4973 0.1656 0.6060
-0.0241 -0.0088 0.2369 -0.7244 -0.3783 0.0416 0.2076 -0.1273 0.4629
0.0842 0.0195 -0.2648 0.3689 0.2056 0.2677 0.5699 -0.2318 0.5421
0.2624 0.0583 -0.6723 -0.0348 -0.3272 0.1500 -0.5054 0.1068 0.2795
0.6198 -0.4545 0.3408 0.3002 -0.3948 0.0151 0.0982 0.1928 0.0038
-0.0180 0.7615 0.1522 0.2122 -0.3495 0.0155 0.1930 0.4379 0.0146
-0.5199 -0.4496 -0.2491 -0.0001 -0.1498 0.0102 0.2529 0.6151 0.0241
0.4535 0.0696 -0.0380 -0.3622 0.6020 -0.0246 0.0793 0.5299 0.0820
Example cont
>> S0=eig(X'*X)
>> S0=S0.^0.5
S0 =
0.3637
0.5601
0.8459
1.3064
1.5048
1.6445
2.3539
2.5417
3.3409

% We only keep the largest two singular values


% and the corresponding columns from the T and D
Example cont
>> T=[0.2214 -0.1132;
0.1976 -0.0721;
0.2405 0.0432;
0.4036 0.0571;
0.6445 -0.1673;
0.2650 0.1072;
0.2650 0.1072;
0.3008 -0.1413;
0.2059 0.2736;
0.0127 0.4902;
0.0361 0.6228;
0.0318 0.4505;];
>> S = [ 3.3409 0; 0 2.5417 ];
>> D =[0.1974 0.6060 0.4629 0.5421 0.2795 0.0038 0.0146 0.0241 0.0820;
-0.0559 0.1656 -0.1273 -0.2318 0.1068 0.1928 0.4379 0.6151 0.5299;]
>>
Example contd.
T*S*D
 0.1621 0.4006  0.3790  0.4677 0.1760 -0.0527  -0.12 -0.16 -0.09
 0.1406 0.3697  0.3289  0.4004 0.1649 -0.0328  -0.07 -0.10 -0.04
 0.1525 0.5051  0.3580  0.4101 0.2363  0.0242  ...
 0.2581 0.8412  0.6057  0.6973 0.3924  0.0331
 0.4488 1.2344  1.0509  1.2658 0.5564 -0.0738
 0.1595 0.5816  0.3751  0.4168 0.2766  0.0559
 0.1595 0.5816  0.3751  0.4168 0.2766  0.0559
 0.2185 0.5495  0.5109  0.6280 0.2425 -0.0654
 0.0969 0.5320  0.2299  0.2117 0.2665  0.1367
-0.0613 0.2320 -0.1390 -0.2658 0.1449  0.2404
-0.0647 0.3352 -0.1457 -0.3016 0.2028  0.3057
-0.0430 0.2540 -0.0966 -0.2078 0.1520  0.2212   0.50  0.71  0.62
(only the first six of the nine columns are reproduced for most rows)

Observations:
- The result does not exactly match the original term-by-document matrix X; the approximation gets closer as k increases.
- Notice that some of the 0's in X move toward 1, and vice versa.

More exhaustive example available at:


http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

Good source of information on LSA:

lsa.colorado.edu
Performing the maps
- Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
- Claim: this is not only the mapping with the best (Frobenius-error) approximation to A, but it in fact improves retrieval.
- A query q is also mapped into this space, by

    q_k = q^T U_k Σ_k^{-1}

- The mapped query is NOT a sparse vector.
- Note that q can be any vector, e.g. a new document that needs to be mapped into this space.
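
A MATLAB sketch of this mapping on the running 12x9 example (the variable names and the cosine ranking are my additions, and I assume rows 1 and 3 of X correspond to the query terms "human" and "computer"):

% X is the 12-by-9 term-document matrix defined earlier
[U, Sigma, V] = svd(X);
k = 2;
Uk = U(:, 1:k);  Sk = Sigma(1:k, 1:k);  Dk = V(:, 1:k);   % documents as rows of Dk
q = zeros(12, 1);  q([1 3]) = 1;        % query vector for "human", "computer"
qk = (q' * Uk) / Sk;                    % q_k = q' * U_k * inv(Sigma_k)
scores = (Dk * qk') ./ (sqrt(sum(Dk.^2, 2)) * norm(qk));  % cosine similarity to each doc
[~, ranking] = sort(scores, 'descend')  % the c documents should rank above the m documents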
Vector Space Model: Pros
- Automatic selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions
  - Document clustering
  - Relevance feedback (modifying the query vector)
- Geometric foundation
Problems with Lexical Semantics
- Ambiguity and association in natural language
- Polysemy: words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).
- The vector space model is unable to discriminate between different meanings of the same word.

Problems with Lexical Semantics
- Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic).
- No associations between words are made in the vector space representation.
Polysemy and Context
- Document similarity on the single-word level: polysemy and context
[Figure: two senses illustrated by word clusters – meaning 1: ring, jupiter, space, voyager, planet, saturn, ...; meaning 2: car, company, dodge, ford, ...; a shared word contributes to similarity if used in the 1st meaning, but not if used in the 2nd.]
Empirical evidence
- Experiments on TREC 1/2/3 – Dumais
- Lanczos SVD code (available on netlib) due to Berry used in these experiments
- Running times of ~ one day on tens of thousands of docs
- Dimensions – various values in the range 250-350 reported
  - (Under 200 reported unsatisfactory)
- Generally expect recall to improve – what about precision?
Empirical evidence
- Precision at or above median TREC precision
- Top scorer on almost 20% of TREC topics
- Slightly better on average than straight vector spaces
- Effect of dimensionality:

    Dimensions   Precision
    250          0.367
    300          0.371
    346          0.374
Failure modes
- Negated phrases: TREC topics sometimes negate certain query terms/phrases – automatic conversion of topics to queries loses this.
- Boolean queries: as usual, the freetext/vector space syntax of LSI queries precludes (say) "Find any doc having to do with the following 5 companies".
- See Dumais for more.
But why is this clustering?
- We've talked about docs, queries, retrieval and precision here.
- What does this have to do with clustering?
- Intuition: dimension reduction through LSI brings together related axes in the vector space.
Intuition from block matrices
[Figure: an m-terms by n-documents matrix with k homogeneous non-zero blocks (Block 1, ..., Block k) along the diagonal and 0's elsewhere.]
What's the rank of this matrix?
Intuition from block matrices
[Figure: the same block-diagonal m-terms by n-documents matrix.]
Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.
Intuition from block matrices
[Figure: the same m-by-n matrix, but now with some non-zero entries outside the diagonal blocks.]
What's the best rank-k approximation to this matrix?
Intuition from block matrices
Likely there's a good rank-k approximation to this matrix.
[Figure: the block structure with only a few non-zero entries outside the blocks; Block 1 contains terms such as wiper, tire, V6, and a small off-block fragment shows one doc containing "car" but not "automobile" and another containing "automobile" but not "car".]
Simplistic picture
[Figure: a simplistic picture of the latent space as three separate regions labeled Topic 1, Topic 2, Topic 3.]
Some wild extrapolation
- The "dimensionality" of a corpus is the number of distinct topics represented in it.
- More mathematical wild extrapolation:
  - If A has a rank-k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus.
LSI has many other applications
- In many settings in pattern recognition and retrieval, we have a feature-object matrix.
  - For text, the terms are features and the docs are objects.
  - Could be opinions and users ...
- This matrix may be redundant in dimensionality.
- Can work with a low-rank approximation.
- If entries are missing (e.g., users' opinions), can recover them if the dimensionality is low.
- Powerful general analytical technique
  - Close, principled analog to clustering methods.
Applications of LSI/LSA
- Enterprise search
- Essay scoring
- Data/Text Mining
  - FAA safety incident reports
- Image Processing
  - eigenfaces
Probabilistic Latent Semantic Analysis
- Let us start from what we know
- Remember the random sequence model:

    P(doc) = P(term_1 | doc) P(term_2 | doc) ... P(term_L | doc)
           = Π_{l=1}^{L} P(term_l | doc)
           = Π_{t=1}^{T} P(term_t | doc)^{X(term_t, doc)}

- PLSA improves on the LSA model
- PLSA relies on a likelihood function; it attempts to derive the maximum-likelihood conditional probabilities
[Figure: graphical model with a doc node generating term nodes t1, t2, ..., tT]
Probabilistic Latent Semantic Analysis
- Think: Topic ~ Factor
- Now let us have K topics as well:

    P(term_t | doc) = Σ_{k=1}^{K} P(term_t | topic_k) P(topic_k | doc)

  The same, written using shorthands:

    P(t | doc) = Σ_{k=1}^{K} P(t | k) P(k | doc)

  So, by replacing this, for any doc in the collection:

    P(doc) = Π_{t=1}^{T} { Σ_{k=1}^{K} P(t | k) P(k | doc) }^{X(t, doc)}

- Which are the parameters of this model?
[Figure: graphical model – a doc node generating topic nodes k1, ..., kK, which generate term nodes t1, ..., tT]
Probabilistic Latent Semantic Analysis
- The parameters of this model are:
    P(t|k)
    P(k|doc)
- It is possible to derive the equations for computing these parameters by Maximum Likelihood.
- If we do so, what do we get?
  - P(t|k), for all t and k, is a term-by-topic matrix (gives which terms make up a topic)
  - P(k|doc), for all k and doc, is a topic-by-document matrix (gives which topics are in a document)
Deriving the parameter estimation algorithm
- The log likelihood of this model is the log probability of the entire collection:

    Σ_{d=1}^{N} log P(d) = Σ_{d=1}^{N} Σ_{t=1}^{T} X(t, d) log Σ_{k=1}^{K} P(t | k) P(k | d)

  which is to be maximised w.r.t. the parameters P(t | k) and then also P(k | d),
  subject to the constraints that Σ_{t=1}^{T} P(t | k) = 1 and Σ_{k=1}^{K} P(k | d) = 1.
The PLSA algorithm
- Inputs: term-by-document matrix X(t,d), t = 1:T, d = 1:N, and the number K of topics sought
- Initialise arrays P1 and P2 randomly with numbers in [0,1] and normalise them so that Σ_t P1(t,k) = 1 and Σ_k P2(k,d) = 1
- Iterate until convergence:
  For d = 1 to N, for t = 1 to T, for k = 1:K

    P1(t,k) ← P1(t,k) Σ_{d=1}^{N} [ X(t,d) P2(k,d) / Σ_{k'=1}^{K} P1(t,k') P2(k',d) ];   then   P1(t,k) ← P1(t,k) / Σ_{t=1}^{T} P1(t,k)

    P2(k,d) ← P2(k,d) Σ_{t=1}^{T} [ X(t,d) P1(t,k) / Σ_{k'=1}^{K} P1(t,k') P2(k',d) ];   then   P2(k,d) ← P2(k,d) / Σ_{k=1}^{K} P2(k,d)

- Output: arrays P1 and P2, which hold the estimated parameters P(t|k) and P(k|d) respectively
- The EM algorithm can be used
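
A vectorised MATLAB sketch of these updates (an illustrative implementation with my own function name, variable layout and update order, not the original course code; columns are normalised to satisfy the constraints Σ_t P(t|k) = 1 and Σ_k P(k|d) = 1):

function [P1, P2] = plsa(X, K, n_iter)
  % X: T-by-N term-document count matrix, K: number of topics
  % P1 approximates P(t|k) (term-by-topic), P2 approximates P(k|d) (topic-by-document)
  [T, N] = size(X);
  P1 = rand(T, K);  P1 = P1 ./ sum(P1, 1);   % each topic column sums to 1
  P2 = rand(K, N);  P2 = P2 ./ sum(P2, 1);   % each document column sums to 1
  for it = 1:n_iter
    R  = X ./ max(P1 * P2, eps);    % X(t,d) / sum_k P1(t,k)P2(k,d), guarded against 0
    P1 = P1 .* (R * P2');           % multiplicative update for P(t|k)
    P1 = P1 ./ sum(P1, 1);          % renormalise
    P2 = P2 .* (P1' * R);           % multiplicative update for P(k|d)
    P2 = P2 ./ sum(P2, 1);          % renormalise
  end
end

Convergence can be monitored with the log likelihood from the previous slide.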
Example of topics found from a Science Magazine papers collection
[Table of example topics omitted from this transcript]
The performance of a retrieval system based on this model (PLSI) was found superior to that of both the vector-space-based similarity (cos) and a non-probabilistic latent semantic indexing (LSI) method.
(From Th. Hofmann, 2000)
Summing up
- Documents can be represented as numeric vectors in the space of words.
- The order of words is lost, but the co-occurrences of words may still provide useful insights about the topical content of a collection of documents.
- PLSA is an unsupervised method based on this idea.
- We can use it to find out what topics are present in a collection of documents.
- Better than LSA, but not as good as LDA-based topic models. The latter disregards document boundaries.
Related resources

Thomas Hofmann, Probabilistic Latent Semantic Analysis. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99). http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

Scott Deerwester et al.: Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990. http://citeseer.ist.psu.edu/cache/papers/cs/339/http:zSzzSzsuperbook.bellcore.comzSz~stdzSzpaperszSzJASIS90.pdf/deerwester90indexing.pdf

The BOW toolkit for creating term-by-doc matrices and other text processing and analysis utilities: http://www.cs.cmu.edu/~mccallum/bow
