is the weight associated with token i in document j.
LATENT SEMANTIC INDEXING (LSI)
The vector space model presented in Section 2 suffers from the curse of dimensionality. In other words, as the size of the problem increases, the processing time required to construct the vector space and to query the document space increases as well. In addition, the vector space model measures only term co-occurrence: the inner product between two documents is nonzero if and only if they share at least one term. Latent Semantic Indexing (LSI) is used to overcome these problems.
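The co-occurrence limitation can be seen directly: in the plain vector space model, two documents with no shared terms always have an inner product of exactly zero. A minimal sketch (the vocabulary and documents are invented for illustration):

```python
import numpy as np

# Invented vocabulary and term-frequency vectors for three tiny documents.
vocab = ["car", "engine", "doctor", "nurse"]
doc1 = np.array([1, 1, 0, 0])  # "car engine"
doc2 = np.array([0, 0, 1, 1])  # "doctor nurse"
doc3 = np.array([1, 0, 0, 1])  # "car nurse"

# No shared terms -> inner product is zero, despite any semantic relatedness.
print(doc1 @ doc2)  # 0
# One shared term ("car") -> nonzero inner product.
print(doc1 @ doc3)  # 1
```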
Singular Value Decomposition
LSI is based on a mathematical technique called Singular Value Decomposition (SVD). The SVD decomposes a term-by-document matrix A into three matrices: a term-by-dimension matrix U, a singular-value matrix Σ, and a document-by-dimension matrix V. The purpose of the SVD is to detect semantic relationships in the document collection. The decomposition is performed as follows:
A = U Σ V^T

where:
U - m × r matrix whose columns are the left singular vectors of A
Σ - r × r diagonal matrix whose diagonal contains the singular values of A in descending order
V - n × r matrix whose columns are the right singular vectors of A
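The decomposition can be checked numerically. The following sketch uses NumPy's `svd` on a small invented term-by-document matrix (rows are terms, columns are documents):

```python
import numpy as np

# Invented 4-term x 3-document matrix (rows = terms, columns = documents).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# Reduced SVD: A = U diag(s) Vt, singular values returned in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.all(np.diff(s) <= 0)              # singular values descend
assert np.allclose(A, U @ np.diag(s) @ Vt)  # product reconstructs A exactly
```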
To generate a rank-k approximation A_k of A, where k << r, each matrix factor is truncated to its first k columns. That is, A_k is computed as:

A_k = U_k Σ_k V_k^T

where:
U_k - m × k matrix whose columns are the first k left singular vectors of A
Σ_k - k × k diagonal matrix whose diagonal is formed by the k leading singular values of A
V_k - n × k matrix whose columns are the first k right singular vectors of A

In LSI, this rank-k approximation of A is the important step: it detects latent associations between the terms used in the document collection, and it reduces the bad influence that variation in term usage has on index-based search. Because a k-dimensional LSI space (k << r) is used, differences that do not matter to the "meaning" are removed. Terms that often appear together in documents end up at nearly the same position in the k-dimensional LSI space, even if they never appear simultaneously in the same document.
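This latent-association effect can be illustrated with a small invented example: two terms that never co-occur in the same document ("car" and "automobile") acquire a nonzero similarity in the rank-k approximation, because both co-occur with a shared context term ("engine"):

```python
import numpy as np

# Invented term-document matrix; rows = terms, columns = documents.
terms = ["car", "automobile", "engine", "banana"]
A = np.array([[1., 0., 0., 0.],   # "car" appears only in d1
              [0., 1., 0., 0.],   # "automobile" appears only in d2
              [1., 1., 0., 0.],   # "engine" appears in both d1 and d2
              [0., 0., 1., 1.]])  # unrelated term in d3, d4

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation A_k

# In A, "car" and "automobile" have zero inner product (no shared document);
# in A_k they do not, thanks to the shared context term "engine".
print(A[0] @ A[1])    # 0.0
print(Ak[0] @ Ak[1])  # > 0: latent association detected
```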
Drawbacks of the LSI model
SVD is often treated as a "magical" process.
SVD is computationally expensive.
Initial "huge matrix" step.
RANDOM INDEXING (RI)
Random Indexing is an incremental vector space model that is computationally less demanding (Karlgren and Sahlgren, 2001). The Random Indexing model reduces dimensionality by giving each word a random vector of much lower dimensionality than the total number of words in the text, instead of a whole dimension of its own. Random Indexing thus differs from the basic vector space model in that it does not give each word an orthogonal unit vector. Instead, each word is given a vector of length 1 in a random direction. The dimension of this random vector is chosen to be smaller than the number of words in the collection, with the end result that not all word vectors are orthogonal to each other, since the rank of the matrix is not high enough. This can be formulated as
Ã = A R

where A is the original d × w word-document matrix, as in the basic vector space model, R is a w × k matrix of random vectors representing the mapping between each word and its k-dimensional random vector, and Ã therefore has d × k dimensions. A query is then matched by first multiplying the query vector with R, and then finding the row in Ã that gives the best match. R is constructed by assigning, in each of its rows (each corresponding to a word), the value +1/√(2ε) to ε randomly chosen elements, −1/√(2ε) to another ε elements, and 0 to the rest. This ensures unit length, and that the vectors are distributed evenly on the unit sphere of dimension k (Sahlgren, 2005). An even distribution ensures that every pair of vectors has a high probability of being nearly orthogonal. Information is lost in this process (by the pigeonhole principle, the rank of the reduced matrix is lower). However, when the method is used on a matrix with very few nonzero elements, the induced error decreases, since the likelihood of a conflict within and between documents decreases. Using Random Indexing on a matrix will still introduce a certain error into the results. These errors are introduced by words that match other words, i.e. whose corresponding vectors have a scalar product ≠ 0. In the matrix this shows up as false positive matches, created for every pair of words whose vectors have a nonzero scalar product. False negatives can also be created by words whose corresponding vectors cancel each other out.
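A minimal sketch of this construction, under the assumption that each word's random vector has 2ε nonzero entries of ±1/√(2ε), which yields the unit length the text requires; the dimensions, sparsity, and query below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

d, w, k, eps = 5, 100, 20, 2  # documents, words, reduced dim, half the nonzeros

# Invented sparse d x w document-word matrix A.
A = (rng.random((d, w)) < 0.05).astype(float)

# Random index matrix R (w x k): each row gets eps entries of +1/sqrt(2*eps)
# and eps entries of -1/sqrt(2*eps), the rest 0 -> unit-length rows.
R = np.zeros((w, k))
for row in R:
    idx = rng.choice(k, size=2 * eps, replace=False)
    row[idx[:eps]] = 1.0 / np.sqrt(2 * eps)
    row[idx[eps:]] = -1.0 / np.sqrt(2 * eps)

assert np.allclose(np.linalg.norm(R, axis=1), 1.0)  # unit length holds

A_tilde = A @ R  # reduced d x k document representation

# Match a query (a bag of words) by projecting it with R and comparing it
# against the reduced document vectors.
q = np.zeros(w)
q[:3] = 1.0
scores = A_tilde @ (q @ R)
best = int(np.argmax(scores))
print("best-matching document:", best)
```

Note that the nonzero overlaps between different rows of R are exactly the source of the false positives and false negatives described above.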
Advantages of Random Indexing
Based on Pentti Kanerva's theories on Sparse Distributed Memory.
Uses distributed representations to accumulate context vectors.
Incremental method that avoids the "huge matrix" step.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, October 2010
http://sites.google.com/site/ijcsis/ ISSN 1947-5500