
Singular Value Decomposition

The Text Mining node in Model Studio enables you to process text data in a document collection.
Often, you might be able to improve the predictive ability of your models that use only numerical data
if you add selected text mining results (clusters or SVD values) to the numerical data. Data are
processed in two phases: text parsing and transformation. Text parsing processes textual data into a
term-by-document frequency matrix. Transformations such as singular value decomposition (SVD)
alter this matrix into a data set that is suitable for data mining purposes. A document collection with
thousands of documents and terms can be represented in a compact and efficient form.

Singular Value Decomposition (SVD)


• Singular value decomposition (SVD) projects the high-dimensional document and term spaces into a lower-dimension space.
• Singular value decomposition is a method of decomposing a matrix into three other matrices:

$$A_{m \times n} = U_{m \times r} \, S_{r \times r} \, V^T_{r \times n}$$

Here A is a rectangular matrix, U is an orthogonal matrix, S is a diagonal matrix, and $V^T$ is an orthogonal matrix.

• The singular values can be thought of as providing a measure of importance used to decide how many dimensions to keep.
Copyright © SAS Institute Inc. All rights reserved.

Consider A to be a term-document matrix with m terms and n documents. (Typically, m > n. That is,
there are more terms than documents.) The SVD theorem states that the term-document matrix
(and, in fact, any rectangular matrix of real or complex values) can always be decomposed into the
product of three matrices. Below are the details:
• A is the real m x n matrix that you want to decompose.
• U is an m x r matrix that contains the left singular vectors and satisfies the orthogonality condition $U^T U = I_{r \times r}$.
• r is the rank of the matrix A.
• $I_{r \times r}$ is the r x r identity matrix.
• S is an r x r diagonal matrix consisting of r positive “singular values” $s_1 \ge s_2 \ge \dots \ge s_r > 0$.
• V is an n x r matrix that contains the right singular vectors and satisfies the orthogonality condition $V^T V = I_{r \times r}$, so that $V^T$ is r x n.
• T signifies the transpose of a matrix.
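
The properties in this list can be checked numerically. The following is a minimal sketch in Python with NumPy (an assumption made for illustration; the course itself uses SAS), applied to a small made-up matrix that is not part of the course material. `numpy.linalg.svd` with `full_matrices=False` returns U, the singular values, and the transposed right singular vectors (named `Vt` here); for a matrix of full column rank this coincides with the reduced form described above.

```python
import numpy as np

# A small illustrative 4 x 3 matrix (hypothetical values, full column rank).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])

# Reduced SVD: U is m x k, s holds k singular values, Vt is k x n, k = min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)                      # r x r diagonal matrix of singular values
r = np.linalg.matrix_rank(A)        # rank of A (3 for this matrix)

reconstructs = np.allclose(U @ S @ Vt, A)            # A equals U S V^T
u_orthonormal = np.allclose(U.T @ U, np.eye(r))      # columns of U are orthonormal
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(r))    # rows of Vt are orthonormal
sorted_desc = bool(np.all(s[:-1] >= s[1:]))          # s1 >= s2 >= ... >= sr

print(reconstructs, u_orthonormal, v_orthonormal, sorted_desc)
```

If the matrix were rank deficient, some singular values returned by NumPy would be (numerically) zero, and the corresponding columns would need to be dropped to obtain the rank-r form.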

Copyright © 2019, SAS Institute Inc., Cary, NC, USA. All rights reserved.
Details: SVD
This brief discussion is based on the very helpful paper “Taming Text with the SVD” (recommended
reading and readily available for downloading from the internet) by Dr. Russ Albright of SAS R&D.
The most relevant aspects of the SVD theorem are presented for dimension reduction, followed by
an example. If you do not understand the abstract explanations, the concrete example at least gives
you the “flavor” of what is happening.
Russ Albright’s example consists of three documents:
Doc 1: Error: invalid message file format
Doc 2: Error: unable to open message file using message path
Doc 3: Error: unable to format variable
These three documents generate the following 11 x 3 term-document matrix A.
                      doc 1   doc 2   doc 3
Term 1    error         1       1       1
Term 2    invalid       1       0       0
Term 3    message       1       2       0
Term 4    file          1       1       0
Term 5    format        1       0       1
Term 6    unable        0       1       1
Term 7    to            0       1       1
Term 8    open          0       1       0
Term 9    using         0       1       0
Term 10   path          0       1       0
Term 11   variable      0       0       1
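
As an illustration of how such a matrix arises from raw text, here is a small Python sketch that counts term frequencies for the three documents. The tokenizer (lowercased alphabetic runs) and the ordering of terms by first appearance are simplifying assumptions chosen to reproduce the table; they are not a description of how the Text Mining node parses text.

```python
import re
from collections import Counter

docs = [
    "Error: invalid message file format",
    "Error: unable to open message file using message path",
    "Error: unable to format variable",
]

# Simplified tokenizer (assumption): lowercase, keep alphabetic runs only.
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Collect terms in order of first appearance across the documents.
terms = []
for tokens in tokenized:
    for t in tokens:
        if t not in terms:
            terms.append(t)

# One Counter per document; a missing term counts as 0.
counts = [Counter(tokens) for tokens in tokenized]

# Rows are terms, columns are documents (term-document matrix A).
A = [[c[t] for c in counts] for t in terms]

for term, row in zip(terms, A):
    print(f"{term:<10}", row)
```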

With the right software (for example, PROC IML), it is very easy to compute the SVD
for this example and obtain the separate matrices U, S, and V.
The product $U^T A$ produces the SVD projections of the original document vectors. These are the
document SVD input values (COL columns) that you will see in the next demonstration produced by
the Text Mining node (except that they are normalized for each document, as explained later).
This amounts to forming linear combinations of the original (possibly weighted) term frequencies for
each document.
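
The projection of all three documents can be sketched outside SAS as well. The example below uses Python with NumPy as a stand-in for PROC IML (an assumption for illustration only). It also checks that the projection equals S times the transposed right singular vectors, which follows from the decomposition and the orthogonality of U. The sign of each SVD dimension is arbitrary, so signs may differ between software packages.

```python
import numpy as np

# Term-document matrix from the example (rows = terms, columns = documents).
A = np.array([
    [1, 1, 1],  # error
    [1, 0, 0],  # invalid
    [1, 2, 0],  # message
    [1, 1, 0],  # file
    [1, 0, 1],  # format
    [0, 1, 1],  # unable
    [0, 1, 1],  # to
    [0, 1, 0],  # open
    [0, 1, 0],  # using
    [0, 1, 0],  # path
    [0, 0, 1],  # variable
], dtype=float)

# Reduced SVD; Vt holds the transposed right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column j of P holds the SVD coordinates of document j.
P = U.T @ A

# Because U has orthonormal columns, U^T A equals diag(s) @ Vt.
same_as_SVt = np.allclose(P, np.diag(s) @ Vt)

print(np.round(P, 2))
print(same_as_SVt)
```

Each row of `P` is a linear combination of the term-frequency rows of `A`, with weights taken from one left singular vector.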
First, project the first document vector $d_1$ into a three-dimensional SVD space by the matrix
multiplication

$$\hat{d}_1 = U^T d_1$$

$U^T$ was obtained using the SVD matrix function in PROC IML applied to matrix A.
$d_1$ is the term-frequency vector for document 1.

The product of the 3 x 11 $U^T$ matrix with the 11 x 1 term-frequency vector $d_1$ for doc 1 gives the
following 3 x 1 vector:

$$U^T d_1 = \hat{d}_1$$

And then write this in transposed form with column labels:

$\hat{d}_1^T =$

The SVD dimensions are ordered by the size of their singular values (their importance). Therefore,
the document vector can simply be truncated to obtain a lower-dimensional projection.

The 2-D representation for doc 1 is (1.63, 0.49).


As a final step, these coordinate values are normalized so that the sums of squares for each
document are 1.0.
Using this document’s 2-D representation, $1.63^2 + 0.49^2 = 2.897$ and $\sqrt{2.897} = 1.70$.

Therefore, the final 2-D representation for doc 1 would be (1.63/1.70, 0.49/1.70) ≈ (0.96, 0.29).


These are the SVD1 and SVD2 values that you would see for this document. A similar calculation is
performed for the other two documents.
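
The truncation and normalization steps for all three documents can be sketched as follows. This uses Python with NumPy rather than PROC IML or the Text Mining node (an assumption for illustration), and the names `k`, `P_k`, and `P_unit` are hypothetical. Because the sign of each SVD dimension is arbitrary, only the magnitudes are expected to agree with the values quoted for doc 1.

```python
import numpy as np

# Term-document matrix from the example (rows = terms, columns = documents).
A = np.array([
    [1, 1, 1],  # error
    [1, 0, 0],  # invalid
    [1, 2, 0],  # message
    [1, 1, 0],  # file
    [1, 0, 1],  # format
    [0, 1, 1],  # unable
    [0, 1, 1],  # to
    [0, 1, 0],  # open
    [0, 1, 0],  # using
    [0, 1, 0],  # path
    [0, 0, 1],  # variable
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # number of SVD dimensions to keep
P_k = U[:, :k].T @ A                   # k x n: truncated projection of each document

# Scale each document (column) so that its sum of squares is 1.0.
norms = np.linalg.norm(P_k, axis=0)
P_unit = P_k / norms

for j, (svd1, svd2) in enumerate(P_unit.T, start=1):
    print(f"doc {j}: SVD1 = {svd1:.2f}, SVD2 = {svd2:.2f}")
```

Keeping the first `k` rows of the full projection is the same as projecting onto the first `k` left singular vectors, which is why truncation is possible after the fact.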

