Singular Value Decomposition Tutorial

Kirk Baker
March 29, 2005
ROUGH DRAFT - USE AT OWN RISK: suggestions kbaker@ling.osu.edu

Contents

1 Introduction
2 Points and Space
3 Vectors
4 Matrices
  4.1 Matrix Notation
5 Vector Terminology
  5.1 Vector Length
  5.2 Vector Addition
  5.3 Scalar Multiplication
  5.4 Inner Product
  5.5 Orthogonality
  5.6 Normal Vector
  5.7 Orthonormal Vectors
  5.8 Gram-Schmidt Orthonormalization Process
6 Matrix Terminology
  6.1 Square Matrix
  6.2 Transpose
  6.3 Matrix Multiplication
  6.4 Identity Matrix
  6.5 Orthogonal Matrix
  6.6 Diagonal Matrix
  6.7 Determinant
  6.8 Eigenvectors and Eigenvalues
7 Singular Value Decomposition
  7.1 Example of Full Singular Value Decomposition
  7.2 Example of Reduced Singular Value Decomposition
8 References
1 Introduction
Most tutorials on complex topics are apparently written by very smart people whose goal is
to use as little space as possible and who assume that their readers already know almost as
much as the author does. This tutorial’s not like that. It’s more a manifestivus for the rest of
us. It’s about the mechanics of singular value decomposition, especially as it relates to some
techniques in natural language processing. It’s written by someone who knew zilch about
singular value decomposition or any of the underlying math before he started writing it,
and knows barely more than that now. Accordingly, it’s a bit long on the background part,
and a bit short on the truly explanatory part, but hopefully it contains all the information
necessary for someone who’s never heard of singular value decomposition before to be able
to do it.
2 Points and Space
A point is just a list of numbers. This list of numbers, or coordinates, specifies the point’s
position in space. How many coordinates there are determines the dimensions of that space.
For example, we can specify the position of a point on the edge of a ruler with a single
coordinate. The positions of the two points 0.5cm and 1.2cm are precisely specified by single
coordinates. Because we’re using a single coordinate to identify a point, we’re dealing with
points in one-dimensional space, or 1-space.
The position of a point anywhere in a plane is specified with a pair of coordinates; it takes
three coordinates to locate points in three dimensions. Nothing stops us from going beyond
points in 3-space. The fourth dimension is often used to indicate time, but the dimensions
can be chosen to represent whatever measurement unit is relevant to the objects we’re trying
to describe.
Generally, space represented by more than three dimensions is called hyperspace. You’ll
also see the term n-space used to talk about spaces of different dimensionality (e.g. 1-space,
2-space, ..., n-space).
For example, if I want a succinct way of describing the amount of food I eat in a given
day, I can use points in n-space to do so. Let the dimensions of this space be the following
food items:
Eggs Grapes Bananas Chickens Cans of Tuna
There are five categories, so we're dealing with points in 5-space. Thus, the interpretation
of the point (3, 18, 2, 0.5, 1) would be "three eggs, eighteen grapes, two bananas, half a
chicken, one can of tuna".
3 Vectors

For most purposes, points and vectors are essentially the same thing[1], that is, a sequence of numbers corresponding to measurements along various dimensions.

Vectors are usually denoted by a lower case letter with an arrow on top, e.g. x. The numbers comprising the vector are now called components, and the number of components equals the dimensionality of the vector. We use a subscript on the vector name to refer to the component in that position. In the example below, x is a 5-dimensional vector, with x_1 = 8, x_2 = 6, etc.

x = [ 8 ]
    [ 6 ]
    [ 7 ]
    [ 5 ]
    [ 3 ]

Vectors can be equivalently represented horizontally to save space, e.g. x = [8, 6, 7, 5, 3] is the same vector as above. More generally, a vector x with n dimensions is a sequence of n numbers, and component x_i represents the value of x on the i-th dimension.
4 Matrices
A matrix is probably most familiar as a table of data, like Table 1, which shows the top 5
scorers on a judge’s scorecard in the 1997 Fitness International competition.
Contestant Round 1 Round 2 Round 3 Round 4 Total Place
Carol Semple-Marzetta 17 18 5 5 45 1
Susan Curry 42 28 30 15 115 3
Monica Brant 10 10 10 21 51 2
Karen Hulse 28 5 65 39 132 5
Dale Tomita 24 26 45 21 116 4
Table 1: 1997 Fitness International Scorecard. Source: Muscle & Fitness July 1997, p.139
[1] Technically, I think, a vector is a function that takes a point as input and returns as its value a point of the same dimensionality.
A table consists of rows (the horizontal list of scores corresponding to a contestant's name) and columns (the vertical list of numbers corresponding to the scores for a given round). What makes this table a matrix is that it's a rectangular array of numbers. Written as a matrix, Table 1 looks like this:

[ 17  18   5   5   45  1 ]
[ 42  28  30  15  115  3 ]
[ 10  10  10  21   51  2 ]
[ 28   5  65  39  132  5 ]
[ 24  26  45  21  116  4 ]

The size, or dimensions, of a matrix is given in terms of the number of rows by the number of columns. This makes the matrix above a "five by six" matrix, written 5 × 6.

We can generalize the descriptions made so far by using variables to stand in for the actual numbers we've been using. Traditionally, a matrix in the abstract is named A. The number of rows is assigned to the variable m, and the number of columns is called n. Matrix entries (also called elements or components) are denoted by a lower-case a, and a particular entry is referenced by its row index (labeled i) and its column index (labeled j). For example, 132 is the entry in row 4 and column 5 in the matrix above, so another way of saying that would be a_45 = 132. More generally, the element in the i-th row and j-th column is labeled a_ij, and called the ij-entry or ij-component.
A little more formally than before, we can denote a matrix like this:
4.1 Matrix Notation

Let m, n be two integers ≥ 1. Let a_ij, i = 1, ..., m, j = 1, ..., n be mn numbers. An array of numbers

A = [ a_11 ... a_1j ... a_1n ]
    [   .        .        .  ]
    [ a_i1 ... a_ij ... a_in ]
    [   .        .        .  ]
    [ a_m1 ... a_mj ... a_mn ]

is an m × n matrix and the numbers a_ij are elements of A. The sequence of numbers

A_(i) = (a_i1, ..., a_in)

is the i-th row of A, and the sequence of numbers

A^(j) = (a_1j, ..., a_mj)

is the j-th column of A.
Just as the distinction between points and vectors can blur in practice, so does the distinction between vectors and matrices. A matrix is basically a collection of vectors. We can talk about row vectors or column vectors. Or a vector with n components can be considered a 1 × n matrix.

For example, the matrix below is a word×document matrix which shows the number of times a particular word occurs in some made-up documents.

           Doc 1  Doc 2  Doc 3
abbey        2      3      5
spinning     1      0      1
soil         3      4      1
stunned      2      1      3
wrath        1      1      4

Table 2: Word×document matrix for some made-up documents.

Typical accompanying descriptions of this kind of matrix might be something like "high dimensional vector space model". The dimensions are the words, if we're talking about the column vectors representing documents, or documents, if we're talking about the row vectors which represent words. High dimensional means we have a lot of them. Thus, "hyperspace document representation" means a document is represented as a vector whose components correspond in some way to the words in it, plus there are a lot of words. This is equivalent to "a document is represented as a point in n-dimensional space."
5 Vector Terminology
5.1 Vector Length
The length of a vector is found by squaring each component, adding them all together, and taking the square root of the sum. If v is a vector, its length is denoted by |v|. More concisely,

|v| = √( Σ_{i=1}^{n} v_i² )

For example, if v = [4, 11, 8, 10], then

|v| = √( 4² + 11² + 8² + 10² ) = √301 ≈ 17.35
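If you want to check computations like this numerically, here is a minimal sketch in Python using NumPy (my addition; the tutorial itself does not assume any particular software):

```python
import numpy as np

v = np.array([4, 11, 8, 10])

# Length as defined above: square the components, sum them, take the square root.
length = np.sqrt(np.sum(v ** 2))
print(length)                  # 17.349...

# NumPy's built-in Euclidean norm gives the same value.
print(np.linalg.norm(v))       # 17.349...
```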
5.2 Vector Addition

Adding two vectors means adding each component in v1 to the component in the corresponding position in v2 to get a new vector. For example,

[3, 2, 1, −2] + [2, −1, 4, 1] = [(3 + 2), (2 − 1), (1 + 4), (−2 + 1)] = [5, 1, 5, −1]

More generally, if A = [a_1, a_2, ..., a_n] and B = [b_1, b_2, ..., b_n], then A + B = [a_1 + b_1, a_2 + b_2, ..., a_n + b_n].
5.3 Scalar Multiplication

Multiplying a scalar (real number) times a vector means multiplying every component by that real number to yield a new vector. For instance, if v = [3, 6, 8, 4], then 1.5 ∗ v = 1.5 ∗ [3, 6, 8, 4] = [4.5, 9, 12, 6]. More generally, scalar multiplication means that if d is a real number and v is a vector [v_1, v_2, ..., v_n], then d ∗ v = [dv_1, dv_2, ..., dv_n].
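As an optional NumPy check (again, the library is my choice, not the tutorial's), addition and scalar multiplication are just componentwise operations on arrays:

```python
import numpy as np

v1 = np.array([3, 2, 1, -2])
v2 = np.array([2, -1, 4, 1])

# Componentwise addition, as in the example above.
print(v1 + v2)            # [ 5  1  5 -1]

# Scalar multiplication scales every component.
v = np.array([3, 6, 8, 4])
print(1.5 * v)            # [ 4.5  9.  12.   6. ]
```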
5.4 Inner Product

The inner product of two vectors (also called the dot product or scalar product) defines multiplication of vectors. It is found by multiplying each component in v1 by the component in v2 in the same position and adding them all together to yield a scalar value. The inner product is only defined for vectors of the same dimension. The inner product of two vectors is denoted (v1, v2) or v1 · v2 (the dot product). Thus,

(x, y) = x · y = Σ_{i=1}^{n} x_i y_i

For example, if x = [1, 6, 7, 4] and y = [3, 2, 8, 3], then

x · y = 1(3) + 6(2) + 7(8) + 4(3) = 83
5.5 Orthogonality

Two vectors are orthogonal to each other if their inner product equals zero. In two-dimensional space this is equivalent to saying that the vectors are perpendicular, or that the angle between them is 90 degrees. For example, the vectors [2, 1, −2, 4] and [3, −6, 4, 2] are orthogonal because

[2, 1, −2, 4] · [3, −6, 4, 2] = 2(3) + 1(−6) − 2(4) + 4(2) = 0
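The dot product and the orthogonality test can be verified the same way (a NumPy aside, not part of the original text):

```python
import numpy as np

x = np.array([1, 6, 7, 4])
y = np.array([3, 2, 8, 3])

# Inner (dot) product: multiply matching components and sum.
print(np.dot(x, y))       # 83

# Two vectors are orthogonal when their inner product is zero.
a = np.array([2, 1, -2, 4])
b = np.array([3, -6, 4, 2])
print(np.dot(a, b))       # 0
```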
5.6 Normal Vector

A normal vector (or unit vector) is a vector of length 1. Any vector with an initial length > 0 can be normalized by dividing each component in it by the vector's length. For example, if v = [2, 4, 1, 2], then

|v| = √( 2² + 4² + 1² + 2² ) = √25 = 5

Then u = [2/5, 4/5, 1/5, 2/5] is a normal vector because

|u| = √( (2/5)² + (4/5)² + (1/5)² + (2/5)² ) = √(25/25) = 1
5.7 Orthonormal Vectors

Vectors of unit length that are orthogonal to each other are said to be orthonormal. For example,

u = [2/5, 1/5, −2/5, 4/5]

and

v = [3/√65, −6/√65, 4/√65, 2/√65]

are orthonormal because

|u| = √( (2/5)² + (1/5)² + (−2/5)² + (4/5)² ) = 1

|v| = √( (3/√65)² + (−6/√65)² + (4/√65)² + (2/√65)² ) = 1

u · v = 6/(5√65) − 6/(5√65) − 8/(5√65) + 8/(5√65) = 0
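Here is a short NumPy sketch of normalization and the orthonormality checks above (again an editorial addition, under the assumption that a numerical check is useful):

```python
import numpy as np

v = np.array([2, 4, 1, 2])

# Normalize by dividing each component by the vector's length.
u = v / np.linalg.norm(v)
print(u)                         # [0.4 0.8 0.2 0.4]
print(np.linalg.norm(u))         # 1.0 (up to rounding)

# The orthonormal pair from this section: unit lengths and a zero dot product.
u1 = np.array([2, 1, -2, 4]) / 5
u2 = np.array([3, -6, 4, 2]) / np.sqrt(65)
print(np.linalg.norm(u1), np.linalg.norm(u2), np.dot(u1, u2))
```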
5.8 Gram-Schmidt Orthonormalization Process

The Gram-Schmidt orthonormalization process is a method for converting a set of vectors into a set of orthonormal vectors. It basically begins by normalizing the first vector under consideration and iteratively rewriting the remaining vectors in terms of themselves minus multiples of the already normalized vectors. For example, to convert the column vectors of

A = [ 1  2  1 ]
    [ 0  2  0 ]
    [ 2  3  1 ]
    [ 1  1  0 ]

into orthonormal column vectors

A = [ √6/6   √2/6    2/3 ]
    [ 0      2√2/3  −1/3 ]
    [ √6/3   0       0   ]
    [ √6/6  −√2/6   −2/3 ],

first normalize v_1 = [1, 0, 2, 1]:

u_1 = [1/√6, 0, 2/√6, 1/√6].

Next, let

w_2 = v_2 − (u_1 · v_2) u_1
    = [2, 2, 3, 1] − ([1/√6, 0, 2/√6, 1/√6] · [2, 2, 3, 1]) [1/√6, 0, 2/√6, 1/√6]
    = [2, 2, 3, 1] − (9/√6) [1/√6, 0, 2/√6, 1/√6]
    = [2, 2, 3, 1] − [3/2, 0, 3, 3/2]
    = [1/2, 2, 0, −1/2]

Normalize w_2 to get

u_2 = [√2/6, 2√2/3, 0, −√2/6]

Now compute u_3 in terms of u_1 and u_2 as follows. Let

w_3 = v_3 − (u_1 · v_3) u_1 − (u_2 · v_3) u_2 = [4/9, −2/9, 0, −4/9]

and normalize w_3 to get

u_3 = [2/3, −1/3, 0, −2/3]

More generally, if we have an orthonormal set of vectors u_1, ..., u_{k−1}, then w_k is expressed as

w_k = v_k − Σ_{i=1}^{k−1} (u_i · v_k) u_i
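The general formula translates almost line for line into code. Below is a minimal sketch of classical Gram-Schmidt in Python/NumPy, applied to the example matrix above; the function name and structure are mine, not something taken from the tutorial:

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: orthonormalize a list of vectors, returned as columns."""
    basis = []
    for v in vectors:
        w = v.astype(float)
        # Subtract from v its projection onto every already-normalized vector.
        for u in basis:
            w = w - np.dot(u, v) * u
        basis.append(w / np.linalg.norm(w))
    return np.column_stack(basis)

# Column vectors of the example matrix A above.
A = np.array([[1, 2, 1],
              [0, 2, 0],
              [2, 3, 1],
              [1, 1, 0]])
Q = gram_schmidt([A[:, j] for j in range(A.shape[1])])
print(np.round(Q, 4))
# The columns agree (up to rounding) with the hand-computed
# u1 = [1, 0, 2, 1]/sqrt(6), u2 = [1/2, 2, 0, -1/2] normalized, and u3 = [2/3, -1/3, 0, -2/3].
```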
6 Matrix Terminology
6.1 Square Matrix
A matrix is said to be square if it has the same number of rows as columns. To designate
the size of a square matrix with n rows and columns, it is called n-square. For example, the
matrix below is 3-square.
A = [ 1  2  3 ]
    [ 4  5  6 ]
    [ 7  8  9 ]
6.2 Transpose
The transpose of a matrix is created by converting its rows into columns; that is, row 1 becomes column 1, row 2 becomes column 2, etc. The transpose of a matrix is indicated with a superscripted T, e.g. the transpose of matrix A is A^T. For example, if

A = [ 1  2  3 ]
    [ 4  5  6 ]

then its transpose is

A^T = [ 1  4 ]
      [ 2  5 ]
      [ 3  6 ]
6.3 Matrix Multiplication
It is possible to multiply two matrices only when the second matrix has the same number of rows as the first matrix has columns. The resulting matrix has as many rows as the first matrix and as many columns as the second matrix. In other words, if A is an m × n matrix and B is an n × s matrix, then the product AB is an m × s matrix.

The entries of AB are determined by taking the inner product of each row of A with each column of B. That is, if A_1, ..., A_m are the row vectors of matrix A, and B_1, ..., B_s are the column vectors of B, then entry ab_ik of AB equals A_i · B_k. The example below illustrates.

A = [ 2  1  4 ]        B = [  3  2 ]
    [ 1  5  2 ]            [ −1  4 ]
                           [  1  2 ]

AB = [ 9  16 ]
     [ 0  26 ]

ab_11 = 2(3) + 1(−1) + 4(1) = 9
ab_12 = 2(2) + 1(4) + 4(2) = 16
ab_21 = 1(3) + 5(−1) + 2(1) = 0
ab_22 = 1(2) + 5(4) + 2(2) = 26
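As a numerical aside (NumPy again, my addition), matrix multiplication is available as the `@` operator, and each entry really is a row-by-column inner product:

```python
import numpy as np

A = np.array([[2, 1, 4],
              [1, 5, 2]])
B = np.array([[ 3, 2],
              [-1, 4],
              [ 1, 2]])

# Full product: a 2x2 matrix, as expected for (2x3)(3x2).
print(A @ B)
# [[ 9 16]
#  [ 0 26]]

# One entry computed by hand: row 1 of A with column 1 of B.
print(np.dot(A[0], B[:, 0]))   # 9
```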
6.4 Identity Matrix
The identity matrix is a square matrix with entries on the diagonal equal to 1 and all other entries equal to zero. The diagonal consists of all the entries a_ij where i = j, i.e., a_11, a_22, ..., a_mm. The n-square identity matrix is denoted variously as I_{n×n}, I_n, or simply I. The identity matrix behaves like the number 1 in ordinary multiplication, which means AI = A, as the example below shows.

A = [ 2  4  6 ]        I = [ 1  0  0 ]
    [ 1  3  5 ]            [ 0  1  0 ]
                           [ 0  0  1 ]

Each entry of AI is the inner product of a row of A with a column of I:

ai_11 = 2(1) + 4(0) + 6(0) = 2
ai_12 = 2(0) + 4(1) + 6(0) = 4
ai_13 = 2(0) + 4(0) + 6(1) = 6
ai_21 = 1(1) + 3(0) + 5(0) = 1
ai_22 = 1(0) + 3(1) + 5(0) = 3
ai_23 = 1(0) + 3(0) + 5(1) = 5

so

AI = [ 2  4  6 ] = A
     [ 1  3  5 ]
6.5 Orthogonal Matrix
A matrix A is orthogonal if AA^T = A^T A = I. For example,

A = [ 1   0     0   ]
    [ 0   3/5  −4/5 ]
    [ 0   4/5   3/5 ]

is orthogonal because

A^T A = [ 1   0     0   ] [ 1   0     0   ]   [ 1  0  0 ]
        [ 0   3/5   4/5 ] [ 0   3/5  −4/5 ] = [ 0  1  0 ]
        [ 0  −4/5   3/5 ] [ 0   4/5   3/5 ]   [ 0  0  1 ]
6.6 Diagonal Matrix
A diagonal matrix A is a matrix where all the entries a_ij are 0 when i ≠ j. In other words, the only nonzero values run along the main diagonal from the upper left corner to the lower right corner:

A = [ a_11   0    ...    0   ]
    [  0    a_22         0   ]
    [  .            .        ]
    [  0     ...       a_mm  ]
6.7 Determinant
A determinant is a function of a square matrix that reduces it to a single number. The determinant of a matrix A is denoted |A| or det(A). If A consists of one element a, then |A| = a; in other words if A = [6] then |A| = 6. If A is a 2 × 2 matrix, then

|A| = | a  b | = ad − bc.
      | c  d |

For example, the determinant of

A = [ 4  1 ]
    [ 1  2 ]

is

|A| = | 4  1 | = 4(2) − 1(1) = 7.
      | 1  2 |

Finding the determinant of an n-square matrix for n > 2 can be done by recursively deleting rows and columns to create successively smaller matrices until they are all 2 × 2, and then applying the previous definition. There are several tricks for doing this efficiently,
but the most basic technique is called expansion by row and is illustrated below for a 3 × 3 matrix. In this case we are expanding by row 1, which means deleting row 1 and successively deleting column 1, column 2, and column 3 to create three 2 × 2 matrices. The determinant of each smaller matrix is multiplied by the entry corresponding to the intersection of the deleted row and column. The expansion alternately adds and subtracts each successive determinant.

| −1   4  3 |
|  2   6  4 |  =  (−1) |  6  4 |  −  (4) | 2  4 |  +  (3) | 2   6 |
|  3  −2  8 |          | −2  8 |         | 3  8 |         | 3  −2 |

= −1(6·8 − 4·(−2)) − 4(2·8 − 4·3) + 3(2·(−2) − 6·3)
= −56 − 16 − 66 = −138

The determinant of a 4 × 4 matrix would be found by expanding across row 1 to alternately add and subtract four 3 × 3 determinants, which would themselves be expanded to produce a series of 2 × 2 determinants that would be reduced as above. This procedure can be applied to find the determinant of an arbitrarily large square matrix.
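The expansion-by-row procedure is naturally recursive, so it is easy to sketch in code. The implementation below is my own illustration of the technique just described (not an efficient method); `np.linalg.det` is used only as an independent check:

```python
import numpy as np

def det_by_expansion(M):
    """Determinant via expansion along the first row, as described above."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    if n == 1:
        return M[0, 0]
    if n == 2:
        return M[0, 0] * M[1, 1] - M[0, 1] * M[1, 0]
    total = 0.0
    for j in range(n):
        # Delete row 0 and column j, multiply the smaller determinant by the
        # entry at their intersection, and alternate signs across the row.
        minor = np.delete(np.delete(M, 0, axis=0), j, axis=1)
        total += (-1) ** j * M[0, j] * det_by_expansion(minor)
    return total

M = [[-1,  4, 3],
     [ 2,  6, 4],
     [ 3, -2, 8]]
print(det_by_expansion(M))     # -138.0
print(np.linalg.det(M))        # -138.0 (up to floating point error)
```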
6.8 Eigenvectors and Eigenvalues
An eigenvector is a nonzero vector that satisfies the equation

Av = λv

where A is a square matrix, λ is a scalar, and v is the eigenvector. λ is called an eigenvalue. Eigenvalues and eigenvectors are also known as, respectively, characteristic roots and characteristic vectors, or latent roots and latent vectors.

You can find eigenvalues and eigenvectors by treating a matrix as a system of linear equations and solving for the values of the variables that make up the components of the eigenvector. For example, finding the eigenvalues and corresponding eigenvectors of the matrix

A = [ 2  1 ]
    [ 1  2 ]

means applying the above formula to get

[ 2  1 ] [ x_1 ]  =  λ [ x_1 ]
[ 1  2 ] [ x_2 ]       [ x_2 ]

in order to solve for λ, x_1 and x_2. This statement is equivalent to the system of equations

2x_1 + x_2 = λx_1
x_1 + 2x_2 = λx_2

which can be rearranged as

(2 − λ)x_1 + x_2 = 0
x_1 + (2 − λ)x_2 = 0

A necessary and sufficient condition for this system to have a nonzero solution [x_1, x_2] is that the determinant of the coefficient matrix

[ (2 − λ)     1     ]
[    1     (2 − λ)  ]

be equal to zero. Accordingly,

| (2 − λ)     1     |
|    1     (2 − λ)  |  =  0

(2 − λ)(2 − λ) − 1 · 1 = 0
λ² − 4λ + 3 = 0
(λ − 3)(λ − 1) = 0

There are two values of λ that satisfy the last equation; thus there are two eigenvalues of the original matrix A and these are λ_1 = 3, λ_2 = 1.

We can find eigenvectors which correspond to these eigenvalues by plugging λ back in to the equations above and solving for x_1 and x_2. To find an eigenvector corresponding to λ = 3, start with

(2 − λ)x_1 + x_2 = 0

and substitute to get

(2 − 3)x_1 + x_2 = 0

which reduces and rearranges to

x_1 = x_2

There are an infinite number of values for x_1 which satisfy this equation; the only restriction is that not all the components in an eigenvector can equal zero. So if x_1 = 1, then x_2 = 1 and an eigenvector corresponding to λ = 3 is [1, 1].

Finding an eigenvector for λ = 1 works the same way:

(2 − 1)x_1 + x_2 = 0
x_1 = −x_2

So an eigenvector for λ = 1 is [1, −1].
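For comparison, here is what a numerical routine reports for the same matrix (a NumPy aside). Note that library routines return unit-length eigenvectors and may order or sign them differently than the hand computation:

```python
import numpy as np

A = np.array([[2, 1],
              [1, 2]])

values, vectors = np.linalg.eig(A)
print(values)    # 3. and 1. (order may vary)
print(vectors)   # columns are unit eigenvectors, proportional to [1, 1] and [1, -1]
# Any nonzero scalar multiple of an eigenvector is also an eigenvector, which is
# why the hand-computed [1, 1] and [1, -1] appear here scaled to unit length.
```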
7 Singular Value Decomposition
Singular value decomposition (SVD) can be looked at from three mutually compatible points
of view. On the one hand, we can see it as a method for transforming correlated variables
into a set of uncorrelated ones that better expose the various relationships among the original
data items. At the same time, SVD is a method for identifying and ordering the dimensions
along which data points exhibit the most variation. This ties in to the third way of viewing
SVD, which is that once we have identified where the most variation is, it’s possible to find
the best approximation of the original data points using fewer dimensions. Hence, SVD can
be seen as a method for data reduction.
As an illustration of these ideas, consider the 2-dimensional data points in Figure 1. The regression line running through them shows the best approximation of the original data with a 1-dimensional object (a line). It is the best approximation in the sense that it is the line that minimizes the distance between each original point and the line. If we drew a perpendicular line from each point to the regression line, and took the intersection of those lines as the approximation of the original datapoint, we would have a reduced representation of the original data that captures as much of the original variation as possible.

[Figure 1: Best-fit regression line reduces data from two dimensions into one.]

Notice that there is a second regression line, perpendicular to the first, shown in Figure 2. This line captures as much of the variation as possible along the second dimension of the original data set. It does a poorer job of approximating the original data because it corresponds to a dimension exhibiting less variation to begin with. It is possible to use these regression lines to generate a set of uncorrelated data points that will show subgroupings in the original data not necessarily visible at first glance.

[Figure 2: Regression line along second dimension captures less variation in original data.]
These are the basic ideas behind SVD: taking a high dimensional, highly variable set of
data points and reducing it to a lower dimensional space that exposes the substructure of the
original data more clearly and orders it from most variation to the least. What makes SVD
practical for NLP applications is that you can simply ignore variation below a particular
threshold to massively reduce your data but be assured that the main relationships of
interest have been preserved.
7.1 Example of Full Singular Value Decomposition
SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The theorem is usually presented something like this:

A_mn = U_mm S_mn V^T_nn

where U^T U = I and V^T V = I; the columns of U are orthonormal eigenvectors of AA^T, the columns of V are orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of the eigenvalues shared by AA^T and A^T A, in descending order.
The following example merely applies this definition to a small matrix in order to compute
its SVD. In the next section, I attempt to interpret the application of SVD to document
classification.
Start with the matrix

A = [  3  1  1 ]
    [ −1  3  1 ]

In order to find U, we have to start with AA^T. The transpose of A is

A^T = [ 3  −1 ]
      [ 1   3 ]
      [ 1   1 ]

so

AA^T = [  3  1  1 ] [ 3  −1 ]   [ 11   1 ]
       [ −1  3  1 ] [ 1   3 ] = [  1  11 ]
                    [ 1   1 ]
Next, we have to find the eigenvalues and corresponding eigenvectors of AA^T. We know that eigenvectors are defined by the equation Av = λv, and applying this to AA^T gives us

[ 11   1 ] [ x_1 ]  =  λ [ x_1 ]
[  1  11 ] [ x_2 ]       [ x_2 ]

We rewrite this as the set of equations

11x_1 + x_2 = λx_1
x_1 + 11x_2 = λx_2

and rearrange to get

(11 − λ)x_1 + x_2 = 0
x_1 + (11 − λ)x_2 = 0

Solve for λ by setting the determinant of the coefficient matrix to zero,

| (11 − λ)     1      |
|     1     (11 − λ)  |  =  0

which works out as

(11 − λ)(11 − λ) − 1 · 1 = 0
(λ − 10)(λ − 12) = 0
λ = 10, λ = 12
to give us our two eigenvalues λ = 10 and λ = 12. Plugging λ back in to the original equations gives us our eigenvectors. For λ = 10 we get

(11 − 10)x_1 + x_2 = 0
x_1 = −x_2

which is true for lots of values, so we'll pick x_1 = 1 and x_2 = −1 since those are small and easy to work with. Thus, we have the eigenvector [1, −1] corresponding to the eigenvalue λ = 10. For λ = 12 we have

(11 − 12)x_1 + x_2 = 0
x_1 = x_2

and for the same reason as before we'll take x_1 = 1 and x_2 = 1. Now, for λ = 12 we have the eigenvector [1, 1]. These eigenvectors become column vectors in a matrix ordered by the size of the corresponding eigenvalue. In other words, the eigenvector of the largest eigenvalue is column one, the eigenvector of the next largest eigenvalue is column two, and so on until we have the eigenvector of the smallest eigenvalue as the last column of our matrix. In the matrix below, the eigenvector for λ = 12 is column one, and the eigenvector for λ = 10 is column two.

[ 1   1 ]
[ 1  −1 ]
Finally, we have to convert this matrix into an orthogonal matrix, which we do by applying the Gram-Schmidt orthonormalization process to the column vectors. Begin by normalizing v_1:

u_1 = v_1 / |v_1| = [1, 1] / √(1² + 1²) = [1, 1] / √2 = [1/√2, 1/√2]

Compute

w_2 = v_2 − (u_1 · v_2) u_1
    = [1, −1] − ([1/√2, 1/√2] · [1, −1]) [1/√2, 1/√2]
    = [1, −1] − 0 ∗ [1/√2, 1/√2]
    = [1, −1] − [0, 0] = [1, −1]

and normalize

u_2 = w_2 / |w_2| = [1/√2, −1/√2]

to give

U = [ 1/√2   1/√2 ]
    [ 1/√2  −1/√2 ]
The calculation of V is similar. V is based on A^T A, so we have

A^T A = [ 3  −1 ] [  3  1  1 ]   [ 10   0  2 ]
        [ 1   3 ] [ −1  3  1 ] = [  0  10  4 ]
        [ 1   1 ]                [  2   4  2 ]

Find the eigenvalues of A^T A by

[ 10   0  2 ] [ x_1 ]       [ x_1 ]
[  0  10  4 ] [ x_2 ]  =  λ [ x_2 ]
[  2   4  2 ] [ x_3 ]       [ x_3 ]

which represents the system of equations

10x_1 + 2x_3 = λx_1
10x_2 + 4x_3 = λx_2
2x_1 + 4x_2 + 2x_3 = λx_3

which we rewrite as

(10 − λ)x_1 + 2x_3 = 0
(10 − λ)x_2 + 4x_3 = 0
2x_1 + 4x_2 + (2 − λ)x_3 = 0

and which are solved by setting

| (10 − λ)      0        2     |
|     0     (10 − λ)     4     |  =  0
|     2         4     (2 − λ)  |

This works out, expanding across row 1, as

(10 − λ) | (10 − λ)     4    |  +  2 | 0   (10 − λ) |  =  0
         |     4     (2 − λ) |       | 2       4    |

(10 − λ)[(10 − λ)(2 − λ) − 16] + 2[0 − (20 − 2λ)] = 0

which, multiplying everything out, simplifies to

λ(λ − 10)(λ − 12) = 0,

so λ = 0, λ = 10, and λ = 12 are the eigenvalues of A^T A.
Substituting λ back into the original equations to find the corresponding eigenvectors yields, for λ = 12,

(10 − 12)x_1 + 2x_3 = −2x_1 + 2x_3 = 0
x_1 = x_3, so take x_1 = 1, x_3 = 1

(10 − 12)x_2 + 4x_3 = −2x_2 + 4x_3 = 0
x_2 = 2x_3, so x_2 = 2

So for λ = 12, v_1 = [1, 2, 1]. For λ = 10 we have

(10 − 10)x_1 + 2x_3 = 2x_3 = 0
x_3 = 0

2x_1 + 4x_2 = 0
x_1 = −2x_2, so take x_1 = 2, x_2 = −1

which means for λ = 10, v_2 = [2, −1, 0]. For λ = 0, taking x_1 = 1, we have

10x_1 + 2x_3 = 0, so x_3 = −5
10x_2 + 4x_3 = 10x_2 − 20 = 0, so x_2 = 2
2x_1 + 4x_2 + 2x_3 = 2 + 8 − 10 = 0 (the third equation checks)

which means for λ = 0, v_3 = [1, 2, −5]. Order v_1, v_2, and v_3 as column vectors in a matrix according to the size of the corresponding eigenvalue to get

[ 1   2   1 ]
[ 2  −1   2 ]
[ 1   0  −5 ]
and use the Gram-Schmidt orthonormalization process to convert that to an orthonormal matrix:

u_1 = v_1 / |v_1| = [1/√6, 2/√6, 1/√6]

w_2 = v_2 − (u_1 · v_2) u_1 = [2, −1, 0]        (since u_1 · v_2 = 0)

u_2 = w_2 / |w_2| = [2/√5, −1/√5, 0]

w_3 = v_3 − (u_1 · v_3) u_1 − (u_2 · v_3) u_2 = [1, 2, −5]        (both dot products are 0)

u_3 = w_3 / |w_3| = [1/√30, 2/√30, −5/√30]

All this to give us

V = [ 1/√6   2/√5   1/√30 ]
    [ 2/√6  −1/√5   2/√30 ]
    [ 1/√6   0     −5/√30 ]

when we really want its transpose

V^T = [ 1/√6    2/√6    1/√6  ]
      [ 2/√5   −1/√5    0     ]
      [ 1/√30   2/√30  −5/√30 ]
For S we take the square roots of the nonzero eigenvalues and populate the diagonal with them, putting the largest in s_11, the next largest in s_22, and so on until the smallest value ends up in s_mm. The nonzero eigenvalues of AA^T and A^T A are always the same, which is why it doesn't matter which one we take them from. Because we are doing full SVD, instead of reduced SVD (next section), we have to add a zero column vector to S so that it is of the proper dimensions to allow the multiplication between U and V^T. The diagonal entries in S are the singular values of A, the columns in U are called left singular vectors, and the columns in V are called right singular vectors.

S = [ √12    0   0 ]
    [  0   √10   0 ]
Now we have all the pieces of the puzzle:

A_mn = U_mm S_mn V^T_nn

     = [ 1/√2   1/√2 ] [ √12    0   0 ] [ 1/√6    2/√6    1/√6  ]
       [ 1/√2  −1/√2 ] [  0   √10   0 ] [ 2/√5   −1/√5    0     ]
                                        [ 1/√30   2/√30  −5/√30 ]

     = [ √12/√2   √10/√2   0 ] [ 1/√6    2/√6    1/√6  ]
       [ √12/√2  −√10/√2   0 ] [ 2/√5   −1/√5    0     ]
                               [ 1/√30   2/√30  −5/√30 ]

     = [  3  1  1 ]
       [ −1  3  1 ]
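As a sanity check on the whole derivation, a numerical SVD of the same matrix gives the same singular values and reproduces A. This is a NumPy sketch of the check (not part of the original tutorial); note that a library routine may flip the signs of matching singular-vector pairs, which does not change the product:

```python
import numpy as np

A = np.array([[ 3, 1, 1],
              [-1, 3, 1]])

# full_matrices=True gives the full SVD: U is 2x2, Vt is 3x3, and the two
# singular values are sqrt(12) and sqrt(10), as computed by hand above.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(np.round(s ** 2, 6))        # [12. 10.] -- the nonzero eigenvalues found above

# Rebuild the 2x3 diagonal matrix S and check that U S V^T reproduces A.
S = np.zeros((2, 3))
S[:2, :2] = np.diag(s)
print(np.round(U @ S @ Vt, 6))    # [[ 3.  1.  1.]
                                  #  [-1.  3.  1.]]
```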
7.2 Example of Reduced Singular Value Decomposition
Reduced singular value decomposition is the mathematical technique underlying a type of
document retrieval and word similarity method variously called Latent Semantic Indexing
or Latent Semantic Analysis. The insight underlying the use of SVD for these tasks is
that it takes the original data, usually consisting of some variant of a word×document
matrix, and breaks it down into linearly independent components. These components are
in some sense an abstraction away from the noisy correlations found in the original data
to sets of values that best approximate the underlying structure of the dataset along each
dimension independently. Because the majority of those components are very small, they
can be ignored, resulting in an approximation of the data that contains substantially fewer
dimensions than the original. SVD has the added benefit that in the process of dimensionality
reduction, the representations of items that share substructure become more similar to each
other, and items that were dissimilar to begin with may become more dissimilar as well. In
practical terms, this means that documents about a particular topic become more similar
even if the exact same words don’t appear in all of them.
As we’ve already seen, SVD starts with a matrix, so we’ll take the following word×
document matrix as the starting point of the next example.
A = [ 2   0  8  6  0 ]
    [ 1   6  0  1  7 ]
    [ 5   0  7  4  0 ]
    [ 7   0  8  5  0 ]
    [ 0  10  0  0  7 ]
Remember that to compute the SVD of a matrix A we want the product of three matrices such that

A = U S V^T

where U and V are orthonormal and S is diagonal. The column vectors of U are taken from the orthonormal eigenvectors of AA^T, ordered left to right from largest corresponding eigenvalue to the least. Notice that
AA^T = [ 2   0  8  6  0 ] [ 2  1  5  7   0 ]   [ 104    8   90  108    0 ]
       [ 1   6  0  1  7 ] [ 0  6  0  0  10 ]   [   8   87    9   12  109 ]
       [ 5   0  7  4  0 ] [ 8  0  7  8   0 ] = [  90    9   90  111    0 ]
       [ 7   0  8  5  0 ] [ 6  1  4  5   0 ]   [ 108   12  111  138    0 ]
       [ 0  10  0  0  7 ] [ 0  7  0  0   7 ]   [   0  109    0    0  149 ]
is a matrix whose entries are the dot products of all pairs of term (row) vectors, so it is a kind of dispersion matrix of terms throughout all the documents. The eigenvalues of AA^T are

λ = 321.07, λ = 230.17, λ = 12.70, λ = 3.94, λ = 0.12
which are used to compute and order the corresponding orthonormal singular vectors of U.

U = [ −0.54   0.07   0.82  −0.11   0.12 ]
    [ −0.10  −0.59  −0.11  −0.79  −0.06 ]
    [ −0.53   0.06  −0.21   0.12  −0.81 ]
    [ −0.65   0.07  −0.51   0.06   0.56 ]
    [ −0.06  −0.80   0.09   0.59   0.04 ]
This essentially gives a matrix in which words are represented as row vectors containing
linearly independent components. Some word cooccurence patterns in these documents are
indicated by the signs of the coefficients in U. For example, the signs in the first column
vector are all negative, indicating the general cooccurence of words and documents. There
are two groups visible in the second column vector of U: car and wheel have negative
coefficients, while doctor, nurse, and hospital are all positive, indicating a grouping in which
wheel only cooccurs with car. The third dimension indicates a grouping in which car, nurse,
and hospital occur only with each other. The fourth dimension points out a pattern in
which nurse and hospital occur in the absence of wheel, and the fifth dimension indicates a
grouping in which doctor and hospital occur in the absence of wheel.
Computing V^T is similar. Since its values come from the orthonormal eigenvectors of A^T A, arranged left to right from largest corresponding eigenvalue to the least, we have

A^T A = [  79    6  107   68    7 ]
        [   6  136    0    6  112 ]
        [ 107    0  177  116    0 ]
        [  68    6  116   78    7 ]
        [   7  112    0    7   98 ]
which contains the dot products of all pairs of document (column) vectors. Finding its eigenvectors, applying the Gram-Schmidt orthonormalization process, and taking the transpose yields

V^T = [ −0.46   0.02  −0.87  −0.00   0.17 ]
      [ −0.07  −0.76   0.06   0.60   0.23 ]
      [ −0.74   0.10   0.28   0.22  −0.56 ]
      [ −0.48   0.03   0.40  −0.33   0.70 ]
      [ −0.07  −0.64  −0.04  −0.69  −0.32 ]
S contains the singular values of A, i.e. the square roots of the eigenvalues computed above, ordered from greatest to least along its diagonal. These values indicate the variance of the linearly independent components along each dimension. In order to illustrate the effect of dimensionality reduction on this data set, we'll restrict S to the three largest singular values to get

S = [ 17.92    0      0   ]
    [   0    15.17    0   ]
    [   0      0     3.56 ]
In order for the matrix multiplication to go through, we have to eliminate the corresponding column vectors of U and the corresponding row vectors of V^T, which gives us an approximation of A using 3 dimensions instead of the original 5. The result looks like this:

Â = [ −0.54   0.07   0.82 ]
    [ −0.10  −0.59  −0.11 ]  [ 17.92    0      0   ]  [ −0.46   0.02  −0.87  −0.00   0.17 ]
    [ −0.53   0.06  −0.21 ]  [   0    15.17    0   ]  [ −0.07  −0.76   0.06   0.60   0.23 ]
    [ −0.65   0.07  −0.51 ]  [   0      0     3.56 ]  [ −0.74   0.10   0.28   0.22  −0.56 ]
    [ −0.06  −0.80   0.09 ]

  = [ 2.29  −0.66  9.33   1.25  −3.09 ]
    [ 1.77   6.76  0.90  −5.50  −2.13 ]
    [ 4.86  −0.96  8.01   0.38  −0.97 ]
    [ 6.62  −1.23  9.58   0.24  −0.71 ]
    [ 1.14   9.19  0.33  −7.19  −3.13 ]
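The whole reduced-SVD recipe, keep only the k largest singular values and the matching singular vectors, is a few lines of NumPy. This sketch is my own illustration, not the computation the author used, so the printed values may differ slightly in rounding (and in the signs of individual singular vectors) from the matrices shown above:

```python
import numpy as np

A = np.array([[2,  0, 8, 6, 0],
              [1,  6, 0, 1, 7],
              [5,  0, 7, 4, 0],
              [7,  0, 8, 5, 0],
              [0, 10, 0, 0, 7]], dtype=float)

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 2))     # singular values, largest first (about 17.92, 15.17, 3.56, ...)

# Keep only the k largest singular values to get the rank-k approximation of A.
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))   # close to the 3-dimensional approximation shown above
```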
In practice, however, the purpose is not to actually reconstruct the original matrix but
to use the reduced dimensionality representation to identify similar words and documents.
Documents are now represented by row vectors in V, and document similarity is obtained by comparing rows in the matrix VS (note that documents are represented as row vectors because we are working with V, not V^T). Words are represented by row vectors in U, and word similarity can be measured by computing row similarity in US.
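Here is one way such comparisons might be computed. Cosine similarity is my choice of measure here (a common one in LSA work); the tutorial itself does not commit to a particular similarity function:

```python
import numpy as np

A = np.array([[2,  0, 8, 6, 0],
              [1,  6, 0, 1, 7],
              [5,  0, 7, 4, 0],
              [7,  0, 8, 5, 0],
              [0, 10, 0, 0, 7]], dtype=float)

U, s, Vt = np.linalg.svd(A)
k = 3

# Rows of VS (i.e. Vt.T scaled by the singular values) are the reduced document
# vectors; rows of US are the reduced word vectors.
doc_vectors = Vt[:k, :].T * s[:k]
word_vectors = U[:, :k] * s[:k]

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare, for example, documents 1 and 3 (columns 1 and 3 of the original matrix).
print(round(cosine(doc_vectors[0], doc_vectors[2]), 3))
```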
Earlier I mentioned that in the process of dimensionality reduction, SVD makes similar
items appear more similar, and unlike items more unlike. This can be explained by looking
at the vectors in the reduced versions of U and V above. We know that the vectors contain
components ordered from most to least amount of variation accounted for in the original
data. By deleting elements representing dimensions which do not exhibit meaningful variation, we effectively eliminate noise in the representation of word vectors. Now the word vectors are shorter, and contain only the elements that account for the most significant correlations among words in the original dataset. The deleted elements had the effect of diluting
these main correlations by introducing potential similarity along dimensions of questionable
significance.
8 References

Deerwester, S., Dumais, S., Landauer, T., Furnas, G. and Harshman, R. (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41(6):391-407.

Ientilucci, E.J. (2003). "Using the Singular Value Decomposition". http://www.cis.rit.edu/~ejipci/research.htm

Jackson, J. E. (1991). A User's Guide to Principal Components Analysis. John Wiley & Sons, NY.

Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

Marcus, M. and Minc, H. (1968). Elementary Linear Algebra. The MacMillan Company, NY.
24

ROUGH DRAFT - USE AT OWN RISK: suggestions kbaker@ling.osu.edu 7 Singular Value Decomposition 7.1 Example of Full Singular Value Decomposition . . . . . . . . . . . . . . . . . 7.2 Example of Reduced Singular Value Decomposition . . . . . . . . . . . . . . 8 References 14 16 21 23

1

Introduction

Most tutorials on complex topics are apparently written by very smart people whose goal is to use as little space as possible and who assume that their readers already know almost as much as the author does. This tutorial’s not like that. It’s more a manifestivus for the rest of us. It’s about the mechanics of singular value decomposition, especially as it relates to some techniques in natural language processing. It’s written by someone who knew zilch about singular value decomposition or any of the underlying math before he started writing it, and knows barely more than that now. Accordingly, it’s a bit long on the background part, and a bit short on the truly explanatory part, but hopefully it contains all the information necessary for someone who’s never heard of singular value decomposition before to be able to do it.

2

Points and Space

A point is just a list of numbers. This list of numbers, or coordinates, specifies the point’s position in space. How many coordinates there are determines the dimensions of that space. For example, we can specify the position of a point on the edge of a ruler with a single coordinate. The position of the two points 0.5cm and 1.2cm are precisely specified by single coordinates. Because we’re using a single coordinate to identify a point, we’re dealing with points in one-dimensional space, or 1-space. The position of a point anywhere in a plane is specified with a pair of coordinates; it takes three coordinates to locate points in three dimensions. Nothing stops us from going beyond points in 3-space. The fourth dimension is often used to indicate time, but the dimensions can be chosen to represent whatever measurement unit is relevant to the objects we’re trying to describe. Generally, space represented by more than three dimensions is called hyperspace. You’ll also see the term n-space used to talk about spaces of different dimensionality (e.g. 1-space, 2-space, ..., n-space). For example, if I want a succinct way of describing the amount of food I eat in a given day, I can use points in n-space to do so. Let the dimensions of this space be the following food items: Eggs Grapes Bananas Chickens Cans of Tuna

2

2.g. half a chicken.ROUGH DRAFT . so we’re dealing with points in 5-space. e. 7. x = [8. which shows the top 5 scorers on a judge’s scorecard in the 1997 Fitness International competition. a vector x with n-dimensions is a sequence of n numbers. the interpretation of the point (3. e. Contestant Carol Semple-Marzetta Susan Curry Monica Brant Karen Hulse Dale Tomita Round 1 17 42 10 28 24 Round 2 18 28 10 5 26 Round 3 Round 4 5 5 30 15 10 21 65 39 45 21 Total 45 115 51 132 116 Place 1 3 2 5 4 Table 1: 1997 Fitness International Scorecard.g. x. 6. 3 Vectors For most purposes. More generally.139 Technically.5. 0. 1. We use a subscript on the vector name to refer to the component in that position. The numbers comprising the vector are now called components. 1 3 . x2 = 5. p. and component xi represents the value of x on the ith dimension. ) would be “three eggs. Thus. x 1 = 8. x is a 5-dimensional vector. that is.edu There are five categories. In the example below. a vector is a function that takes a point as input and returns as its value a point of the same dimensionality. 3] is the same vector as above. like Table 1. 5. one can of tuna”. two bananas. Source: Muscle & Fitness July 1997. and the number of components equals the dimensionality of the vector. etc. eighteen grapes. Vectors are usually denoted by a lower case letter with an arrow on top.USE AT OWN RISK: suggestions kbaker@ling. a sequence of numbers corresponding to measurements along various dimensions. points and vectors are essentially the same thing1 .    x=     8 6 7 5 3         Vectors can be equivalently represented horizontally to save space. I think.osu. 18. 4 Matrices A matrix is probably most familiar as a table of data.

We can generalize the descriptions made so far by using variables to stand in for the actual numbers we’ve been using. 4 . Written as a matrix. The sequence of numbers A(i) = (ai1 . and the number of columns is called n. . .. so another way of saying that would be a45 = 132. we can denote a matrix like this: 4. amj ) is the jth column of A. written 5 × 6 matrix. .osu. .. and columns (the vertical list of numbers corresponding to the scores for a given round). A little more formally than before. j = 1. aij ... This makes the matrix above a “five by six” matrix. The maximum number of rows is assigned to the variable m.. and a particular entry is referenced by its row index (labeled i) and its column index (labeled j)..1 Matrix Notation Let m. a1n .... Matrix entries (also called elements or components) are denoted by a lower-case a. ai1 . am1 ... ... ain ) is the ith row of A. ... Traditionally. . the element in the ith row and jth column is labeled aij . n be two integers ≥ 1. i = 1. m. amn         is an m × n matrix and the numbers aij are elements of A. . and the sequence of numbers A(j) = (a1j . ain ... of a matrix is given in terms of the number of rows by the number of columns. What makes this table a matrix is that it’s a rectangular array of numbers.. Table 1 looks like this:         17 42 10 28 24 18 28 10 5 26 5 30 10 65 45 5 45 1  15 115 3   21 51 2   39 132 5   21 116 4  The size..edu A table consists of rows (the horizontal list of scores corresponding to a contestant’s name). a1j . n be mn numbers. Let aij . a matrix in the abstract is named A...... and called the ij-entry or ij-component.USE AT OWN RISK: suggestions kbaker@ling.ROUGH DRAFT . For example. More generally. 132 is the entry in row 4 and column 5 in the matrix above. amj . or dimensions. An array of numbers    A=     a11 .

then √ √ |v| = 42 + 112 + 82 + 102 = 301 = 17. Or a vector with n components can be considered a 1 × n matrix. if we’re talking about the column vectors representing documents. its length is denoted by |v|. adding them all together. so does the distinction between vectors and matrices. High dimensional means we have a lot of them.ROUGH DRAFT . n |v| = x2 i i=1 For example. We can talk about row vectors or column vectors. or documents.USE AT OWN RISK: suggestions kbaker@ling. 8. and taking the square root of the sum. For example. if v = [4. if we’re talking about the row vectors which represent words.” 5 5. Thus. the matrix below is a word×document matrix which shows the number of times a particular word occurs in some made-up documents. The dimensions are the words. 11. If v is a vector.35 5 .1 Vector Terminology Vector Length The length of a vector is found by squaring each component.edu Just as the distinction between points and vectors can blur in practice. “hyperspace document representation” means a document is represented as a vector whose components correspond in some way to the words in it.osu. A matrix is basically a collection of vectors. More concisely. Typical accompanying descripabbey spinning soil stunned wrath Doc 1 Doc 2 2 3 1 0 3 4 2 1 1 1 Doc 3 5 1 1 3 4 Table 2: Word×document matrix for some made-up documents. tions of this kind of matrix might be something like “high dimensional vector space model”. This is equivalent to “a document is represented as a point in n-dimensional space. plus there are a lot of words. 10].

.5 ∗ v = 1. then d ∗ v = [dv1 .. (−2 + 1)] = [5. the vectors [2. For instance.. Thus.edu 5. a2 . 1.. v2 ) or v1 · v2 (the dot product). 5. The inner product is only defined for vectors of the same dimension. −1] More generally. (1 + 4). then 1.. then A+B = [a1 +b1 .5 ∗ [3. 1.4 Inner Product The inner product of two vectors (also called the dot product or scalar product) defines multiplication of vectors. scalar multiplication means if d is a real number and v is a vector [v1 . 6. . a2 +b2 . 4] and y = [3. In twodimensional space this is equivalent to saying that the vectors are perpendicular.. 2. 8. 4.5. . if A = [a1 . dvn ].5 Orthogonality Two vectors are orthogonal to each other if their inner product equals zero. 5.3 Scalar Multiplication Multiplying a scalar (real number) times a vector means multiplying every component by that real number to yield a new vector. . For example [3. if v = [3. y) = x · y = xi yi i=1 For example. or that the only angle between them is a 90◦ angle. 4]. 4. It is found by multiplying each component in v1 by the component in v2 in the same position and adding them all together to yield a scalar value. if x = [1. More generally. b2 . n (x. dv2 . 1. vn ]. For example.an ] and B = [b1 . −6. then x · y = 1(3) + 6(2) + 7(8) + 3(4) = 83 5. 1] = [(3 + 2). 2.. 7. −6. 3]. 9.. 12. 4.2 Vector Addition Adding two vectors means adding each component in v1 to the component in the corresponding position in v2 to get a new vector. −2.bn ]. . (2 − 1).ROUGH DRAFT . 6]. . 6. 8. 1. 4] · [3. −1. 5. −2... −2] + [2. 2] are orthogonal because [2. v2 . The inner product of two vectors is denoted (v1 .. 6.an + bn ]. 2] = 2(3) + 1(−6) − 2(4) + 4(2) = 0 6 . 4] and [3.. 4] = [4. 8.osu.USE AT OWN RISK: suggestions kbaker@ling.

then √ √ |v| = 22 + 42 + 12 + 22 = 25 = 5 (2/5)2 + (4/5)2 + (1/5)2 + (2/5)2 = 25/25 = 1 5.ROUGH DRAFT .8 Gram-Schmidt Orthonormalization Process The Gram-Schmidt orthonormalization process is a method for converting a set of vectors into a set of orthonormal vectors. 4/ 65.osu. 1/5.6 Normal Vector Then u = [2/5. For example. 1/5.edu 5. 4/5. For example. 2/ 65] are orthonormal because |u| = (2/5)2 + (1/5)2 + (−2/5)2 + (4/5)2 = 1 √ √ √ √ |v| = (3/ 65) + (−6/ 65) + (4/ 65) + (2/ 65) = 1 6 8 6 8 u·v = √ − √ − √ + √ =0 5 65 5 65 5 65 5 65 5. −6/ 65. For example.USE AT OWN RISK: suggestions kbaker@ling. −2/5. 1. 4. 1/5] is a normal vector because |u| = A normal vector (or unit vector ) is a vector of length 1. 2]. 0    7 . Any vector with an initial length > 0 can be normalized by dividing each component in it by the vector’s length. u = [2/5. to convert the column vectors of   1 2 1  0 2 0   A=    2 3 1  1 1 0 into orthonormal column vectors   0  A =  √6   √ 3 6 6  √ 6 6 2 6 √ 2 2 3 √ − 2 6 √ 2 3 −1 3 −2 3 0    . 4/5] √ √ √ √ v = [3/ 65. It basically begins by normalizing the first vector under consideration and iteratively rewriting the remaining vectors in terms of themselves minus a multiplication of the already normalized vectors. if v = [2.7 Orthonormal Vectors and Vectors of unit length that are orthogonal to each other are said to be orthonormal.

0. 0.ROUGH DRAFT . 0. √ ] · [2. 0. 1] − [ . 2. ] u3 = [ . . √ . . 6 6 6 Next. √ ] 6 6 6 6 6 6 9 1 2 1 = [2.   1 2 3   A= 4 5 6  7 8 9 8 . √ .. uk−1 .. 0. ] 6 3 6 Now compute u3 in terms of u1 and u2 as follows. 3. the matrix below is 3-square. Let 4 −2 −4 w3 = v 2 − u 1 · v 3 ∗ u 1 − u 2 · v 3 ∗ u 2 = [ . 2. 3. then wk is expressed k−1 Normalize w2 to get √ as wk = v k − i=1 ut · v k ∗ u t 6 6. 2. √ . 3 3 3 More generally. 2.1 Matrix Terminology Square Matrix A matrix is said to be square if it has the same number of rows as columns. 1]: 1 2 1 u1 = [ √ . let 2 1 1 2 1 1 w2 = v2 − u1 · v2 ∗ u1 = [2. 2. 0. 1] ∗ [ √ .edu first normalize v1 = [1. 1] − ( √ ) ∗ [ √ . 3. √ ]. it is called n-square. if we have an orthonormal set of vectors u1 .osu. 0.USE AT OWN RISK: suggestions kbaker@ling. 3. √ . 0. . ] 9 9 9 and normalize w3 to get −2 2 −1 . 0. ] 2 2 √ √ 2 2 2 − 2 u2 = [ . √ ] 6 6 6 6 3 3 = [2. 0. 1] − [ √ . 2. 3. To designate the size of a square matrix with n rows and columns. ] 2 2 1 −1 = [ . For example.

USE AT OWN RISK: suggestions kbaker@ling.osu. the transpose of matrix A is AT . The transpose of a matrix is indicated with a superscripted T ..g. The example below illustrates.3 Matrix Multiplication It is possible to multiply two matrices only when the second matrix has the same number of rows as the first matrix has columns. if A is a m × n matrix and B is a n × s matrix. if A= then its transpose is 1 4   T A = 2 5  3 6   1 2 3 4 5 6 6. that is. That is. The resulting matrix has as many rows as the first matrix and as many columns as the second matrix. e. and B 1 .edu 6..ROUGH DRAFT . then abik of AB equals Ai · B k .. then the product AB is an m × s matrix.2 Transpose The transpose of a matrix is created by converting its rows into columns. A= 2 1 4 1 5 2 3 2   B =  −1 4  AB = 1 2     2 1 4 1 5 2  3 2    −1 4  = 1 2  9 16 0 26 ab11 = 3  2 1 4  −1  = 2(3) + 1(−1) + 4(1) = 9  1 2 2 1 4  4  = 2(4) + 1(4) + 4(2) = 16   2 3  1 5 2  −1  = 1(3) + 5(−1) + 2(1) = 0  1 2  1 5 2  4  = 1(2) + 5(4) + 2(2) = 26  2 9       ab12 = ab21 = ab22 = . if A1 . . The coordinates of AB are determined by taking the inner product of each row of A and each column in B.. row 2 becomes column 2. row 1 becomes column 1.. In other words. B s are the column vectors of B. etc.. For example. . Am are the row vectors of matrix A.

amm . The n-square identity matrix is denoted variously as In×n . In . a11 . The diagonal is all the entries aij where i =j . The identity matrix behaves like the number 1 in ordinary multiplication. which mean AI = A. or simply I.ROUGH DRAFT ..USE AT OWN RISK: suggestions kbaker@ling. as the example below shows. a22 .4 Identity Matrix The identity matrix is a square matrix with entries on the diagonal equal to 1 and all other entries equal zero..edu 6.. i.e..osu. A= 2 4 6 1 3 5 1 0 0 I =  0 1 0  AI =   0 0 1     2 4 6 1 3 5  1 0 0    0 1 0 = 0 0 1  ai11 = 1 2 4 6  0  = 2(1) + 0(4) + 0(6) = 2   0 0 2 4 6  1  = 2(0) + 4(1) + 6(0) = 4   0 0 2 4 6  0  = 2(0) + 4(0) + 6(1) = 6   1 1  1 3 5  0  = 1(1) + 3(0) + 5(0) = 1  0 0 1 3 5  1  = 1(0) + 3(1) + 5(0) = 3   0 0 1 3 5  0  = 1(0) + 3(0) + 5(1) = 5   1 = 2 4 6 1 3 5           ai12 = ai13 = ai21 = ai22 = ai23 = 10 . .

.. the determinant of A= is |A| = 4 1 1 2 a b = ad − bc. 1 2 Finding the determinant of an n-square matrix for n > 2 can be done by recursively deleting rows and columns to create successively smaller matrices until they are all 2 × 2 dimensions. 11 .edu 6.ROUGH DRAFT ..6 Diagonal Matrix A diagonal matrix A is a matrix where all the entries aiij are 0 when i = j.. The determinant of a matrix A is denoted |A| or det(A). There are several tricks for doing this efficiently.osu. In other words. 0 0 a22 0 . 0 . . . amm      6.7 Determinant A determinant is a function of a square matrix that reduces it to a single number.USE AT OWN RISK: suggestions kbaker@ling. c d 4 1 = 4(2) − 1(1) = 7. and then applying the previous definition. the only nonzero values run along the main dialog from the upper left corner to the lower right corner:    A=  a11 0 . then |A| = a. If A consists of one element a. If A is a 2 × 2 matrix. is orthogonal because 1 0 0 1 0 0 1 0 0     T A A =  0 3/5 −4/5   0 3/5 4/5  =  0 1 0   0 4/5 3/5 0 0 1 0 −4/5 3/5      6.5 Orthogonal Matrix 1 0 0 A =  0 3/5 −4/5    0 4/5 3/5   A matrix A is orthogonal if AAT = AT A = I. For example. then |A| = For example. in other words if A = [6] then |A| = 6.

λ is a scalar. The determinant of each smaller matrix is multiplied by the entry corresponding to the intersection of the deleted row and column. column 2. λ is called an eigenvalue. −56 − 16 − 66 = −138 6. and v is the eigenvector.ROUGH DRAFT .USE AT OWN RISK: suggestions kbaker@ling. x1 and x2 . respectively. The expansion alternately adds and subtracts each successive determinant. Eigenvalues and eigenvectors are also known as. For example. characteristic roots and characteristic vectors. In this case we are expanding by row 1. finding the eigenvalues and corresponding eigenvectors of the matrix 2 1 A= 1 2 means applying the above formula to get Av = λv = 2 1 1 2 x1 x2 =λ x1 x2 in order to solve for λ. −1 4 3 2 6 2 4 6 4 2 6 4 = (−1) = + (3) − (4) 3 −2 3 8 −2 8 3 −2 8 −1(6 · 8 − 4 · −2) − 4(2 · 8 − 4 · 3) + 3(2 · −2 − 3 · 6) = The determinant of a 4 × 4 matrix would be found by expanding acros row 1 to alternately add and subtract 4 3 × 3 determinants. which would themselves be expanded to produce a series of 2 × 2 determinants that would be reduced as above.8 Eigenvectors and Eigenvalues Av = λv An eigenvector is a nonzero vector that satisfies the equation where A is a square matrix. and column 3 to create three 2 × 2 matrices.edu but the most basic technique is called expansion by row and is illustrated below for a 3 × 3 matrix. You can find eigenvalues and eigenvectors by treating a matrix as a system of linear equations and solving for the values of the variables that make up the components of the eigenvector. which means deleting row 1 and successively deleting columns 1.osu. This procedure can be applied to find the determinant of an arbitrarily large square matrix. This statement is equivalent to the system of equations 2x1 + x2 = λx1 12 . or latent roots and latent vectors.

Finding an eigenvector for λ = 1 works the same way.ROUGH DRAFT . the only restriction is that not all the components in an eigenvector can equal zero. (2 − λ) 1 =0 1 (2 − λ) (2 − λ)(2 − λ) − 1 · 1 = 0 λ2 − 4λ + 3 = 0 (λ − 3)(λ − 1) = 0 There are two values of λ that satisfy the last equation. 13 . To find an eigenvector corresponding to λ = 3. thus there are two eigenvalues of the original matrix A and these are λ1 = 3. Accordingly. 1]. So if x1 = 1. start with (2 − λ)x1 + x2 = 0 and substitute to get (2 − 3)x1 + x2 = 0 which reduces and rearranges to x1 = x 2 There are an infinite number of values for x1 which satisfy this equation. We can find eigenvectors which correspond to these eigenvalues by plugging λ back in to the equations above and solving for x1 and x2 . (2 − 1)x1 + x2 = 0 x1 = −x2 So an eigenvector for λ = 1 is [1.edu x1 + 2x2 = λx2 which can be rearranged as (2 − λ)x1 + x2 = 0 x1 + (2 − λ)x2 = 0 A necessary and sufficient condition for this system to have a nonzero vector [x1 . x2 ] is that the determinant of the coefficient matrix (2 − λ) 1 1 (2 − λ) be equal to zero. λ2 = 1. then x2 = 1 and an eigenvector corresponding to λ = 3 is [1.osu. −1].USE AT OWN RISK: suggestions kbaker@ling.

It is the best approximation in the sense that it is the line that minimizes the distance between each original point and the line. At the same time. we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items.USE AT OWN RISK: suggestions kbaker@ling. we would have a reduced representation 14 .edu 7 Singular Value Decomposition Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. which is that once we have identified where the most variation is. it’s possible to find the best approximation of the original data points using fewer dimensions. Hence.ROUGH DRAFT . perpendicular line from each point to the regression line. consider the 2-dimensional data points in Figure 1. This ties in to the third way of viewing SVD. and took the intersection of those lines as the approximation of the original datapoint. If we drew a Figure 1: Best-fit regression line reduces data from two dimensions into one. As an illustration of these ideas. SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. The regression line running through them shows the best approximation of the original data with a 1-dimensional object (a line). On the one hand. SVD can be seen as a method for data reduction.osu.

edu of the original data that captures as much of the original variation as possible. Notice that there is a second regression line. These are the basic ideas behind SVD: taking a high dimensional.osu. captures as much of the variation as possible along the second dimension of the original data set.ROUGH DRAFT . 15 . This line Figure 2: Regression line along second dimension captures less variation in original data.USE AT OWN RISK: suggestions kbaker@ling. It is possible to use these regression lines to generate a set of uncorrelated data points that will show subgroupings in the original data not necessarily visible at first glance. It does a poorer job of approximating the orginal data because it corresponds to a dimension exhibiting less variation to begin with. What makes SVD practical for NLP applications is that you can simply ignore variation below a particular threshhold to massively reduce your data but be assured that the main relationships of interest have been preserved. highly variable set of data points and reducing it to a lower dimensional space that exposes the substructure of the original data more clearly and orders it from most variation to the least. perpendicular to the first. shown in Figure 2.

7.1 Example of Full Singular Value Decomposition

SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The theorem is usually presented something like this:

A_mn = U_mm S_mn V_nn^T

where U^T U = I and V^T V = I; the columns of U are orthonormal eigenvectors of AA^T, the columns of V are orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of the eigenvalues from U or V in descending order.

The following example merely applies this definition to a small matrix in order to compute its SVD. In the next section, I attempt to interpret the application of SVD to document classification.

Start with the matrix

A = [  3  1  1 ]
    [ −1  3  1 ]
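Before stepping through the computation by hand, it may help to see what a numerical library produces for this matrix. The following NumPy sketch (my own check, not part of the tutorial) verifies the shape of the decomposition and the orthonormality conditions stated above; note that np.linalg.svd returns the singular values as a vector rather than as the matrix S.

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Full SVD: U is 2x2, Vt is 3x3, s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the 2x3 diagonal matrix S from the singular value vector.
S = np.zeros_like(A)
S[:len(s), :len(s)] = np.diag(s)

print(s**2)                                # eigenvalues of A A^T: 12 and 10
assert np.allclose(U @ S @ Vt, A)          # A = U S V^T
assert np.allclose(U.T @ U, np.eye(2))     # U^T U = I
assert np.allclose(Vt @ Vt.T, np.eye(3))   # V^T V = I
```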

In order to find U, we have to start with AA^T. The transpose of A is

A^T = [ 3 −1 ]
      [ 1  3 ]
      [ 1  1 ]

so

AA^T = [  3  1  1 ] [ 3 −1 ]   [ 11   1 ]
       [ −1  3  1 ] [ 1  3 ] = [  1  11 ]
                    [ 1  1 ]

Next, we have to find the eigenvalues and corresponding eigenvectors of AA^T. We know that eigenvectors are defined by the equation Av = λv, and applying this to AA^T gives us

[ 11   1 ] [ x1 ]     [ x1 ]
[  1  11 ] [ x2 ] = λ [ x2 ]

We rewrite this as the set of equations

11x1 + x2 = λx1
x1 + 11x2 = λx2

and rearrange to get

(11 − λ)x1 + x2 = 0
x1 + (11 − λ)x2 = 0

Solve for λ by setting the determinant of the coefficient matrix to zero:

| 11 − λ     1    |
|    1    11 − λ  | = 0

which works out as

(11 − λ)(11 − λ) − 1 · 1 = 0
(λ − 10)(λ − 12) = 0

to give us our two eigenvalues λ = 10 and λ = 12. Plugging λ back in to the original equations gives us our eigenvectors. For λ = 10 we get

(11 − 10)x1 + x2 = 0
x1 = −x2

which is true for lots of values, so we'll pick x1 = 1 and x2 = −1 since those are small and easier to work with. Thus we have the eigenvector [1, −1] corresponding to the eigenvalue λ = 10. For λ = 12 we have

(11 − 12)x1 + x2 = 0
x1 = x2

and for the same reason as before we'll take x1 = 1 and x2 = 1, so for λ = 12 we have the eigenvector [1, 1].

These eigenvectors become column vectors in a matrix ordered by the size of the corresponding eigenvalue. In other words, the eigenvector of the largest eigenvalue is column one, the eigenvector of the next largest eigenvalue is column two, and so forth until we have the eigenvector of the smallest eigenvalue as the last column of our matrix. In the matrix below, the eigenvector for λ = 12 is column one, and the eigenvector for λ = 10 is column two:

[ 1   1 ]
[ 1  −1 ]

Finally, we have to convert this matrix into an orthogonal matrix, which we do by applying the Gram-Schmidt orthonormalization process to the column vectors. Begin by normalizing v1:

u1 = v1/|v1| = [1, 1]/√(1² + 1²) = [1/√2, 1/√2]

Then compute

w2 = v2 − (u1 · v2)u1
   = [1, −1] − ([1/√2, 1/√2] · [1, −1]) [1/√2, 1/√2]
   = [1, −1] − 0 · [1/√2, 1/√2]
   = [1, −1] − [0, 0]
   = [1, −1]

Normalizing w2 gives

u2 = w2/|w2| = [1/√2, −1/√2]

and these orthonormal eigenvectors give us

U = [ 1/√2    1/√2 ]
    [ 1/√2   −1/√2 ]
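The Gram-Schmidt step can also be written as a short function. This is a plain, illustrative implementation of classical Gram-Schmidt on the columns of a matrix (assuming the columns are linearly independent), shown here only to mirror the hand computation; it is my own sketch, not code from the tutorial.

```python
import numpy as np

def gram_schmidt_columns(M):
    """Orthonormalize the columns of M (assumed linearly independent)."""
    Q = np.zeros_like(M, dtype=float)
    for j in range(M.shape[1]):
        v = M[:, j].astype(float)
        # Subtract the projections onto the already-orthonormalized columns.
        for i in range(j):
            v = v - (Q[:, i] @ M[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# Columns are the eigenvectors for lambda = 12 and lambda = 10 found above.
eigvecs = np.array([[1.0, 1.0],
                    [1.0, -1.0]])
U = gram_schmidt_columns(eigvecs)
print(U)   # entries are +/- 1/sqrt(2), matching the U derived by hand
assert np.allclose(U.T @ U, np.eye(2))
```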

The calculation of V is similar. V is based on A^T A, so we have

A^T A = [ 3 −1 ] [  3  1  1 ]   [ 10   0   2 ]
        [ 1  3 ] [ −1  3  1 ] = [  0  10   4 ]
        [ 1  1 ]                [  2   4   2 ]

Find the eigenvalues of A^T A by

[ 10   0   2 ] [ x1 ]     [ x1 ]
[  0  10   4 ] [ x2 ] = λ [ x2 ]
[  2   4   2 ] [ x3 ]     [ x3 ]

which represents the system of equations

10x1 + 2x3 = λx1
10x2 + 4x3 = λx2
2x1 + 4x2 + 2x3 = λx3

which we rewrite as

(10 − λ)x1 + 2x3 = 0
(10 − λ)x2 + 4x3 = 0
2x1 + 4x2 + (2 − λ)x3 = 0

which are solved by setting

| 10 − λ     0       2   |
|    0    10 − λ     4   | = 0
|    2       4     2 − λ |

Expanding the determinant along the first row, this works out as

(10 − λ) | 10 − λ    4   |        | 0   10 − λ |
         |    4    2 − λ |  + 2 · | 2      4   |  = 0

(10 − λ)[(10 − λ)(2 − λ) − 16] + 2[0 − (20 − 2λ)] = 0

which reduces to

λ(λ − 10)(λ − 12) = 0

so λ = 0, λ = 10, and λ = 12 are the eigenvalues for A^T A. Substituting λ back into the original equations to find the corresponding eigenvectors yields, for λ = 12,

(10 − 12)x1 + 2x3 = −2x1 + 2x3 = 0
x1 = 1, x3 = 1
(10 − 12)x2 + 4x3 = −2x2 + 4x3 = 0
x2 = 2x3 = 2

which means for λ = 12, v1 = [1, 2, 1]. For λ = 10 we have

(10 − 10)x1 + 2x3 = 2x3 = 0
x3 = 0
2x1 + 4x2 = 0
x1 = −2x2
x1 = 2, x2 = −1

which means for λ = 10, v2 = [2, −1, 0]. For λ = 0 we have

10x1 + 2x3 = 0
x3 = −5x1, so take x1 = 1 and x3 = −5
10x2 + 4(−5) = 10x2 − 20 = 0
x2 = 2
check: 2(1) + 4(2) + 2(−5) = 2 + 8 − 10 = 0

which means for λ = 0, v3 = [1, 2, −5]. Order v1, v2, and v3 as column vectors in a matrix according to the size of the eigenvalue to get

[ 1    2    1 ]
[ 2   −1    2 ]
[ 1    0   −5 ]
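A short check of these hand-derived eigenpairs, again my own verification rather than part of the tutorial:

```python
import numpy as np

AtA = np.array([[10.0, 0.0, 2.0],
                [0.0, 10.0, 4.0],
                [2.0, 4.0, 2.0]])

pairs = [(12.0, np.array([1.0, 2.0, 1.0])),
         (10.0, np.array([2.0, -1.0, 0.0])),
         (0.0,  np.array([1.0, 2.0, -5.0]))]

# Each vector should satisfy (A^T A) v = lambda v.
for lam, v in pairs:
    assert np.allclose(AtA @ v, lam * v)

# The roots of the characteristic polynomial should be approximately 0, 10, 12.
print(sorted(np.round(np.linalg.eigvals(AtA), 6)))
```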

We then use the Gram-Schmidt orthonormalization process to convert this matrix into an orthonormal matrix:

u1 = v1/|v1| = [1/√6, 2/√6, 1/√6]

w2 = v2 − (u1 · v2)u1 = [2, −1, 0]
u2 = w2/|w2| = [2/√5, −1/√5, 0]

w3 = v3 − (u1 · v3)u1 − (u2 · v3)u2 = [1, 2, −5]   (both dot products are zero, since v3 is already orthogonal to u1 and u2)
u3 = w3/|w3| = [1/√30, 2/√30, −5/√30]

All this gives us

V = [ 1/√6     2/√5     1/√30 ]
    [ 2/√6    −1/√5     2/√30 ]
    [ 1/√6      0      −5/√30 ]

when we really want its transpose:

V^T = [ 1/√6     2/√6     1/√6  ]
      [ 2/√5    −1/√5      0    ]
      [ 1/√30    2/√30   −5/√30 ]

For S we take the square roots of the non-zero eigenvalues and populate the diagonal with them, putting the largest in s11, the next largest in s22, and so on until the smallest value ends up in smm. The non-zero eigenvalues of U and V are always the same, so that's why it doesn't matter which one we take them from. The diagonal entries in S are the singular values of A, the columns in U are called left singular vectors, and the columns in V are called right singular vectors. Because we are doing full SVD, instead of reduced SVD (next section), we have to add a zero column vector to S so that it is of the proper dimensions to allow multiplication between U and V:

S = [ √12    0    0 ]
    [  0    √10   0 ]

Now we have all the pieces of the puzzle:

A_mn = U_mm S_mn V_nn^T

  = [ 1/√2    1/√2 ] [ √12    0    0 ] [ 1/√6     2/√6     1/√6  ]
    [ 1/√2   −1/√2 ] [  0    √10   0 ] [ 2/√5    −1/√5      0    ]
                                       [ 1/√30    2/√30   −5/√30 ]

  = [ √12/√2     √10/√2   0 ] [ 1/√6     2/√6     1/√6  ]
    [ √12/√2    −√10/√2   0 ] [ 2/√5    −1/√5      0    ]
                              [ 1/√30    2/√30   −5/√30 ]

  = [  3  1  1 ]
    [ −1  3  1 ]
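As a final check on the full decomposition (my own verification), the hand-derived factors can be multiplied back together numerically:

```python
import numpy as np

r2, r5, r6, r30 = np.sqrt([2.0, 5.0, 6.0, 30.0])

U = np.array([[1/r2,  1/r2],
              [1/r2, -1/r2]])
S = np.array([[np.sqrt(12.0), 0.0, 0.0],
              [0.0, np.sqrt(10.0), 0.0]])
Vt = np.array([[1/r6,   2/r6,   1/r6],
               [2/r5,  -1/r5,   0.0],
               [1/r30,  2/r30, -5/r30]])

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

assert np.allclose(U @ S @ Vt, A)          # U S V^T reproduces A
assert np.allclose(U.T @ U, np.eye(2))     # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3))   # the rows of V^T are orthonormal
```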

7.2 Example of Reduced Singular Value Decomposition

Reduced singular value decomposition is the mathematical technique underlying a type of document retrieval and word similarity method variously called Latent Semantic Indexing or Latent Semantic Analysis. The insight underlying the use of SVD for these tasks is that it takes the original data, usually consisting of some variant of a word×document matrix, and breaks it down into linearly independent components. These components are in some sense an abstraction away from the noisy correlations found in the original data to sets of values that best approximate the underlying structure of the dataset along each dimension independently. Because the majority of those components are very small, they can be ignored, resulting in an approximation of the data that contains substantially fewer dimensions than the original. SVD has the added benefit that in the process of dimensionality reduction, the representations of items that share substructure become more similar to each other, and items that were dissimilar to begin with may become more dissimilar as well. In practical terms, this means that documents about a particular topic become more similar even if the exact same words don't appear in all of them.

As we've already seen, SVD starts with a matrix, so we'll take the following word×document matrix as the starting point of the next example:

A = [ 2    0   8   6   0 ]
    [ 1    6   0   1   7 ]
    [ 5    0   7   4   0 ]
    [ 7    0   8   5   0 ]
    [ 0   10   0   0   7 ]

Remember that to compute the SVD of a matrix A we want the product of three matrices such that A = USV^T, where U and V are orthonormal and S is diagonal. The column vectors of U are taken from the orthonormal eigenvectors of AA^T, ordered left to right from largest corresponding eigenvalue to the least. Notice that

AA^T = [ 104     8    90   108     0 ]
       [   8    87     9    12   109 ]
       [  90     9    90   111     0 ]
       [ 108    12   111   138     0 ]
       [   0   109     0     0   149 ]

is a matrix whose values are the dot products of all the terms, so it is a kind of dispersion matrix of terms throughout all the documents. The eigenvalues of AA^T are

λ = 321.07,  λ = 230.17,  λ = 12.70,  λ = 3.94,  λ = 0.12
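The dispersion matrix and its eigenvalues can be checked numerically. The sketch below (mine, not the author's) builds the word×document matrix and reproduces the eigenvalues quoted above; the singular values of A are their square roots.

```python
import numpy as np

# Word-by-document matrix from the example (rows = words, columns = documents).
A = np.array([[2, 0, 8, 6, 0],
              [1, 6, 0, 1, 7],
              [5, 0, 7, 4, 0],
              [7, 0, 8, 5, 0],
              [0, 10, 0, 0, 7]], dtype=float)

AAt = A @ A.T
print(AAt)   # the term dispersion matrix shown above

# Eigenvalues of A A^T, largest first: approximately 321.07, 230.17, 12.70, 3.94, 0.12
eigenvalues = np.sort(np.linalg.eigvalsh(AAt))[::-1]
print(np.round(eigenvalues, 2))

# The singular values of A are the square roots of these eigenvalues.
print(np.round(np.linalg.svd(A, compute_uv=False), 2))   # approx 17.92, 15.17, 3.56, 1.98, 0.35
```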

These eigenvalues are used to compute and order the corresponding orthonormal singular vectors of U. Arranged as column vectors, ordered left to right from largest corresponding eigenvalue to the least, they form a 5×5 matrix U. This essentially gives a matrix in which words are represented as row vectors containing linearly independent components.

Some word cooccurrence patterns in these documents are indicated by the signs of the coefficients in U. For example, the signs in the first column vector are all negative, indicating the general cooccurrence of words and documents. There are two groups visible in the second column vector of U: car and wheel have negative coefficients, while doctor, nurse, and hospital are all positive, indicating a grouping in which wheel only cooccurs with car. The third dimension indicates a grouping in which car, nurse, and hospital occur only with each other. The fourth dimension points out a pattern in which nurse and hospital occur in the absence of wheel, and the fifth dimension indicates a grouping in which doctor and hospital occur in the absence of wheel.

Computing V^T is similar. Since its values come from the orthonormal singular vectors of A^T A, arranged left to right from largest corresponding singular value to the least, we have

A^T A = [  79     6   107    68     7 ]
        [   6   136     0     6   112 ]
        [ 107     0   177   116     0 ]
        [  68     6   116    78     7 ]
        [   7   112     0     7    98 ]

which contains the dot products of all the documents. Applying the Gram-Schmidt orthonormalization process to these eigenvectors and taking the transpose yields the 5×5 matrix V^T.

S contains the singular values, the square roots of the eigenvalues, ordered from greatest to least along its diagonal. These values indicate the variance of the linearly independent components along each dimension. In order to illustrate the effect of dimensionality reduction on this data set, we'll restrict S to the first three singular values, to get

S = [ 17.92      0       0   ]
    [   0     15.17      0   ]
    [   0       0      3.56  ]
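A sketch of this truncation, and of the word and document similarity computations described in the next paragraphs, could look like the following (my own illustration, not the author's code):

```python
import numpy as np

A = np.array([[2, 0, 8, 6, 0],
              [1, 6, 0, 1, 7],
              [5, 0, 7, 4, 0],
              [7, 0, 8, 5, 0],
              [0, 10, 0, 0, 7]], dtype=float)

k = 3                                     # number of dimensions to keep
U, s, Vt = np.linalg.svd(A)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Rank-3 approximation of the original matrix.
A_hat = U_k @ S_k @ Vt_k

# Words live in the rows of U_k S_k, documents in the rows of V_k S_k.
word_vectors = U_k @ S_k
doc_vectors = Vt_k.T @ S_k

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# For example, similarity between words 0 and 3, and between documents 0 and 3.
print(cosine(word_vectors[0], word_vectors[3]))
print(cosine(doc_vectors[0], doc_vectors[3]))
```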

In order for the matrix multiplication to go through, we also have to eliminate the corresponding column vectors of U and the corresponding row vectors of V^T, which gives us an approximation of A using 3 dimensions instead of the original 5. Multiplying the reduced matrices back together produces a 5×5 matrix Â that approximates the original A.

In practice, however, the purpose is not to actually reconstruct the original matrix but to use the reduced dimensionality representation to identify similar words and documents. Documents are now represented by row vectors in V, and document similarity is obtained by comparing rows in the matrix VS (note that documents are represented as row vectors because we are working with V, not V^T). Words are represented by row vectors in U, and word similarity can be measured by computing row similarity in US.

Earlier I mentioned that in the process of dimensionality reduction, SVD makes similar items appear more similar, and unlike items more unlike. This can be explained by looking at the reduced versions of U and V. We know that the vectors contain components ordered from most to least amount of variation accounted for in the original data. By deleting elements representing dimensions which do not exhibit meaningful variation, we effectively eliminate noise in the representation of word vectors. Now the word vectors are shorter, and contain only the elements that account for the most significant correlations among words in the original dataset. The deleted elements had the effect of diluting these main correlations by introducing potential similarity along dimensions of questionable significance.

8 References

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41(6):391-407.

Ientilucci, E. J. (2003). "Using the Singular Value Decomposition". http://www.cis.rit.edu/~ejipci/research.htm

Jackson, J. E. (1991). A User's Guide to Principal Components Analysis. John Wiley & Sons, NY.

Manning, C., and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

Marcus, M., and Minc, H. (1968). Elementary Linear Algebra. The MacMillan Company, NY.
