Matrix decompositions and latent semantic indexing
Theorem 1 (matrix diagonalization theorem). Let $S$ be a square real-valued $M \times M$ matrix with $M$ linearly independent eigenvectors. Then there exists an eigen decomposition
$$S = U \Lambda U^{-1},$$
where the columns of $U$ are the eigenvectors of $S$ and $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of $S$ in decreasing order.

Theorem 2 (symmetric diagonalization theorem). Let $S$ in addition be symmetric. Then its eigenvectors can be chosen orthogonal and of unit length, so that
$$S = Q \Lambda Q^{T}, \qquad Q^{-1} = Q^{T}.$$

Theorem 3 (singular value decomposition). Let $r$ be the rank of the $M \times N$ matrix $C$. Then there is a decomposition
$$C = U \Sigma V^{T},$$
where the columns of the $M \times M$ matrix $U$ are the orthogonal eigenvectors of $CC^{T}$, the columns of the $N \times N$ matrix $V$ are the orthogonal eigenvectors of $C^{T}C$, and $\Sigma$ is the $M \times N$ diagonal matrix with $\Sigma_{ii} = \sigma_i = \sqrt{\lambda_i}$ for $1 \le i \le r$ and zeros elsewhere, the $\lambda_i$ being the common eigenvalues of $CC^{T}$ and $C^{T}C$ in decreasing order.
Proof (Theorem 1): Let $U$ be the matrix with the eigenvectors of $S$ as columns:
$$U = \begin{pmatrix} \vec{u}_1 & \cdots & \vec{u}_M \end{pmatrix}.$$
Then we have
$$SU = S \begin{pmatrix} \vec{u}_1 & \cdots & \vec{u}_M \end{pmatrix} = \begin{pmatrix} \lambda_1 \vec{u}_1 & \cdots & \lambda_M \vec{u}_M \end{pmatrix} = U \Lambda.$$
Thus, we have $S = U \Lambda U^{-1}$.
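As a quick numerical check of Theorem 1, here is a minimal numpy sketch (not from the original slides; the matrix $S$ and its values are made up for illustration):

```python
import numpy as np

# A made-up symmetric matrix S; symmetric matrices always have
# M linearly independent eigenvectors, so Theorem 1 applies.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of U are the eigenvectors of S; Lambda holds the eigenvalues.
eigvals, U = np.linalg.eig(S)
Lam = np.diag(eigvals)

# Theorem 1: S = U Lambda U^{-1}
assert np.allclose(S, U @ Lam @ np.linalg.inv(U))
```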
Figure: Illustration of the singular-value decomposition. The top panel shows a matrix $C$ for which $M > N$; the bottom panel shows the case $M < N$.
Low-rank approximations

Given an $M \times N$ matrix $C$ and a positive integer $k$, find an $M \times N$ matrix $C_k$ of rank at most $k$ that minimizes the Frobenius norm of the matrix difference $X = C - C_k$, defined as
$$\|X\|_F = \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} X_{ij}^2}. \qquad (18.15)$$

The Frobenius norm of $X$ measures the discrepancy between $C_k$ and $C$.

The goal is to find a matrix $C_k$ that minimizes this discrepancy, while constraining $C_k$ to have rank at most $k$.
The singular value decomposition can be used to solve the low-rank matrix
approximation problem.

We invoke the following three-step procedure to this end:

1. Given $C$, construct its SVD as above; thus, $C = U \Sigma V^{T}$.

2. Derive from $\Sigma$ the matrix $\Sigma_k$ formed by replacing by zeros the $r - k$ smallest singular values on the diagonal of $\Sigma$.

3. Compute and output $C_k = U \Sigma_k V^{T}$ as the rank-$k$ approximation to $C$.

Because $\Sigma_k$ has at most $k$ non-zero values, $C_k$ has rank at most $k$.

Intuitively, the effect of small singular values on matrix products is small: replacing the smallest singular values by zeros will not substantially alter the product, leaving it “close” to $C$.
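A minimal numpy sketch of this three-step procedure (the matrix C and the choice k = 2 are made-up illustrations, not values from the text):

```python
import numpy as np

# A small made-up term-document-style matrix C (M = 4, N = 3).
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Step 1: construct the SVD, C = U Sigma V^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Step 2: zero out the r - k smallest singular values.
k = 2
s_k = np.where(np.arange(len(s)) < k, s, 0.0)

# Step 3: C_k = U Sigma_k V^T is the rank-k approximation.
C_k = U @ np.diag(s_k) @ Vt
assert np.linalg.matrix_rank(C_k) <= k
```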

Theorem 4. Among all matrices of rank at most $k$, $C_k$ is the best approximation to $C$:
$$\min_{Z :\, \operatorname{rank}(Z) \le k} \|C - Z\|_2 = \|C - C_k\|_2 = \sigma_{k+1},$$
and likewise $C_k$ minimizes the Frobenius norm, with $\|C - C_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$.

Thus the larger $k$ is, the smaller this error. When $k = r$, the error is zero, since $C_r = C$.
Latent semantic indexing

The low-rank approximation to $C$ yields a new representation for each document in the collection. Queries are cast into this low-rank representation as well, so that query-document similarity scores can be computed in the low-rank space. This process is known as latent semantic indexing (LSI).

A query vector $\vec{q}$ is mapped into its representation in the LSI space by the transformation
$$\vec{q}_k = \Sigma_k^{-1} U_k^{T} \vec{q}.$$

Now, we can use cosine similarities as before to compute the similarity between a query and a document.
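A minimal numpy sketch of this mapping (the term-document matrix C, the query q, and k = 2 are all made up for illustration; applying the same transformation to the columns of C shows that document $j$ lands on column $j$ of $V_k^{T}$):

```python
import numpy as np

# Made-up 4-term x 4-document matrix C (rows = terms, columns = docs).
C = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# q_k = Sigma_k^{-1} U_k^T q maps a query into the k-dimensional LSI space;
# under the same map, document j is column j of V_k^T.
q = np.array([1.0, 1.0, 0.0, 0.0])       # query over the 4 terms
q_k = np.linalg.inv(S_k) @ U_k.T @ q
docs_k = Vt_k

# Cosine similarity between the mapped query and every document.
sims = (q_k @ docs_k) / (np.linalg.norm(q_k) * np.linalg.norm(docs_k, axis=0))
print(sims)
```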

Notice that the low-rank approximation, unlike the original matrix C, can have
negative entries.

LSI shares two basic drawbacks of vector space retrieval:

• there is no good way of expressing negations (find documents that contain german but not shepherd),

• there is no way of enforcing Boolean conditions.

LSI has not been established as a significant force in scoring and ranking for information retrieval. It remains, however, an intriguing approach to clustering in a number of domains, including collections of text documents.

END