
CSE 521: Design and Analysis of Algorithms I Winter 2017

Lecture 9: Low Rank Approximation


Lecturer: Shayan Oveis Gharan February 8th Scribe: Jun Qi

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

In this lecture we discuss theory and applications of singular value decomposition.

1.1 Theory

Recall the singular value decomposition theorem.

Theorem 1.1 (Singular Value Decomposition). For any matrix $M \in \mathbb{R}^{m \times n}$ (with $m \le n$), there exist two sets of $m$ orthonormal (column) vectors $u_1, \dots, u_m \in \mathbb{R}^m$ and $v_1, \dots, v_m \in \mathbb{R}^n$ such that
$$M = \sum_{i=1}^{m} \sigma_i u_i v_i^T, \tag{1.1}$$
where $\sigma_i \ge 0$ for each $i$.

In this lecture we discuss how to compute a low rank approximation of a matrix $M$ with small error.

Definition 1.2 (Operator norm). The operator norm of a matrix $M$ is defined as
$$\|M\|_2 = \max_{x \neq 0} \frac{\|Mx\|_2}{\|x\|_2}. \tag{1.2}$$

In the last lecture we showed that for any matrix $M$, $\|M\|_2$ is equal to the largest singular value of $M$.

Definition 1.3 (Frobenius norm). The Frobenius norm of a matrix $M$ is defined as
$$\|M\|_F^2 = \sum_{i,j} M_{i,j}^2 = \mathrm{Tr}(M^T M). \tag{1.3}$$

In the last lecture we showed that for any matrix $M$, $\|M\|_F^2$ is equal to $\sum_i \sigma_i^2$.
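To make these norms concrete, here is a small numerical sketch (using numpy, which is an assumption of this illustration and not something used in the lecture) that checks, on a random matrix, that the operator norm equals the largest singular value and that the squared Frobenius norm equals the sum of the squared singular values.

```python
import numpy as np

# A small numerical check of Definitions 1.2 and 1.3 on a random matrix.
# (Illustrative sketch only; numpy is assumed, it is not part of the lecture.)
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 8))             # M in R^{m x n} with m <= n

sigma = np.linalg.svd(M, compute_uv=False)  # singular values, largest first

# Operator norm: ||M||_2 equals the largest singular value.
op_norm = np.linalg.norm(M, ord=2)
assert np.isclose(op_norm, sigma[0])

# Frobenius norm squared: sum_{i,j} M_{ij}^2 = Tr(M^T M) = sum_i sigma_i^2.
fro_sq = np.linalg.norm(M, ord='fro') ** 2
assert np.isclose(fro_sq, np.trace(M.T @ M))
assert np.isclose(fro_sq, np.sum(sigma ** 2))
print("operator norm:", op_norm, " Frobenius^2:", fro_sq)
```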

Theorem 1.4. For any matrix $M \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m$,
$$\inf_{\hat{M} : \mathrm{rank}(\hat{M}) = k} \|M - \hat{M}\|_2 = \sigma_{k+1}, \tag{1.4}$$
where the infimum is over all rank $k$ matrices $\hat{M}$.

We did not discuss the proof of this theorem in class. We are including the proof here for the sake of
completeness.


Proof. To prove Theorem 1.4, we need to prove two statements: (i) there exists a rank $k$ matrix $\hat{M}_k$ such that $\|M - \hat{M}_k\|_2 = \sigma_{k+1}$; (ii) for any rank $k$ matrix $\hat{M}_k$, $\|M - \hat{M}_k\|_2 \ge \sigma_{k+1}$.

We start with part (i). We let
$$\hat{M}_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T. \tag{1.5}$$

By the SVD of $M$ and the definition of $\hat{M}_k$,
$$M - \hat{M}_k = \sum_{i=k+1}^{m} \sigma_i u_i v_i^T \quad\Longrightarrow\quad \|M - \hat{M}_k\|_2 = \sigma_{\max}\Big(\sum_{i=k+1}^{m} \sigma_i u_i v_i^T\Big) = \sigma_{k+1}. \tag{1.6}$$
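The construction in part (i) is easy to test numerically. The following sketch (again assuming numpy; the dimensions and the value of $k$ are arbitrary choices) forms $\hat{M}_k$ from the top $k$ singular triples and checks that $\|M - \hat{M}_k\|_2 = \sigma_{k+1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 10, 3
M = rng.standard_normal((m, n))

# Full SVD: M = U diag(sigma) Vt, singular values in decreasing order.
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

# Best rank-k approximation in operator norm: keep the top k singular triples.
M_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# ||M - M_k||_2 should equal sigma_{k+1} (0-indexed: sigma[k]).
err = np.linalg.norm(M - M_k, ord=2)
assert np.isclose(err, sigma[k])
print("rank-%d error %.6f vs sigma_%d = %.6f" % (k, err, k + 1, sigma[k]))
```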

Now, we prove part (ii). Let $\hat{M}_k$ be an arbitrary rank $k$ matrix. The null space of the matrix $M$,
$$\mathrm{NULL}(M) = \{x : Mx = 0\}, \tag{1.7}$$
is the set of vectors that $M$ maps to zero. Let $\mathrm{null}(M)$ denote the dimension of the linear space $\mathrm{NULL}(M)$. It is a well-known fact that for any $M \in \mathbb{R}^{m \times n}$,
$$\mathrm{rank}(M) + \mathrm{null}(M) = n. \tag{1.8}$$

It is easy to see this from the SVD: $\mathrm{null}(M)$ is just $n$ minus the number of nonzero singular values of $M$. Putting it differently, any vector orthogonal to the right singular vectors of $M$ is in $\mathrm{NULL}(M)$, so the above equality follows from the fact that $M$ has $\mathrm{rank}(M)$ right singular vectors (with positive singular values). As an application, since $\hat{M}_k$ has rank $k$, we have
$$\mathrm{null}(\hat{M}_k) = n - \mathrm{rank}(\hat{M}_k) = n - k. \tag{1.9}$$

By the Rayleigh quotient (Courant-Fischer) characterization, we obtain
$$\begin{aligned}
\sigma_{k+1}(M)^2 = \lambda_{k+1}(M^T M) &= \min_{S : \dim(S) = n-k} \; \max_{x \in S} \frac{x^T M^T M x}{x^T x} \\
&\le \max_{x \in \mathrm{NULL}(\hat{M}_k)} \frac{x^T M^T M x}{x^T x} \\
&= \max_{x \in \mathrm{NULL}(\hat{M}_k)} \frac{x^T (M - \hat{M}_k)^T (M - \hat{M}_k) x}{x^T x} \\
&\le \max_{x} \frac{x^T (M - \hat{M}_k)^T (M - \hat{M}_k) x}{x^T x}.
\end{aligned} \tag{1.10}$$

The inequality in the second line uses the fact that $\mathrm{NULL}(\hat{M}_k)$ is an $(n-k)$-dimensional linear space, so $S = \mathrm{NULL}(\hat{M}_k)$ is one particular choice of an $(n-k)$-dimensional space $S$. The equality in the third line uses that $\hat{M}_k x = 0$ for any $x \in \mathrm{NULL}(\hat{M}_k)$.

Now, we are done using another application of the Rayleigh quotient:
$$\max_{x} \frac{x^T (M - \hat{M}_k)^T (M - \hat{M}_k) x}{x^T x} = \lambda_{\max}\big((M - \hat{M}_k)^T (M - \hat{M}_k)\big) = \sigma_{\max}(M - \hat{M}_k)^2. \tag{1.11}$$
Combining (1.10) and (1.11) gives $\|M - \hat{M}_k\|_2 \ge \sigma_{k+1}$, which completes the proof of Theorem 1.4.

Theorem 1.5. For any matrix $M \in \mathbb{R}^{m \times n}$ (with $m \le n$) with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m$,
$$\inf_{\hat{M}_k : \mathrm{rank}(\hat{M}_k) = k} \|M - \hat{M}_k\|_F^2 = \sum_{i=k+1}^{m} \sigma_i^2. \tag{1.12}$$

Proof. Write $\hat{M} = \hat{M}_k$ for brevity. Since $\hat{M}$ has rank $k$, we can assume that the columns of $\hat{M}$ lie in $\mathrm{span}\{w_1, w_2, \dots, w_k\}$, where $\{w_1, \dots, w_k\}$ is a set of orthonormal vectors spanning the column space of $\hat{M}$. First, observe that
$$\|M - \hat{M}\|_F^2 = \sum_{i=1}^{n} \|M_i - \hat{M}_i\|^2,$$
where $M_i$ and $\hat{M}_i$ denote the $i$-th columns of $M$ and $\hat{M}$.

Now, let us answer the following question: which vector $v \in \mathrm{span}\{w_1, \dots, w_k\}$ minimizes $\|M_i - v\|^2$? It is not hard to see that the optimum solution is the projection of $M_i$ onto the span of the $w_j$'s, i.e., we must have
$$\hat{M}_i = \sum_{j=1}^{k} \langle M_i, w_j \rangle w_j. \tag{1.13}$$

To write the above equation in matrix form we need the notion of a projection matrix. Given a set of orthonormal vectors $w_1, \dots, w_k$, the corresponding projection matrix is defined as
$$\Pi_k = \sum_{i=1}^{k} w_i w_i^T.$$

A projection matrix projects any vector onto the linear span of its defining vectors; in the above case we have
$$\Pi_k M_i = \sum_{j=1}^{k} \langle w_j, M_i \rangle w_j.$$

Projection matrices have many nice properties that come in handy. For example, all their singular values are either 0 or 1, and for any projection matrix $\Pi$ we have $\Pi^2 = \Pi$. The only full-rank projection matrix is the identity matrix.
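The following brief sketch (an illustration under the same numpy assumption as before) builds $\Pi_k$ from $k$ orthonormal vectors and checks the properties just mentioned, together with the fact that projecting the columns of a matrix gives a matrix of rank at most $k$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 6, 10, 2

# k orthonormal vectors w_1, ..., w_k in R^m, obtained here via QR of a random matrix.
W, _ = np.linalg.qr(rng.standard_normal((m, k)))
Pi_k = W @ W.T                                   # projection onto span{w_1, ..., w_k}

# Properties: Pi_k is idempotent and its singular values are k ones and m - k zeros.
assert np.allclose(Pi_k @ Pi_k, Pi_k)
svals = np.linalg.svd(Pi_k, compute_uv=False)
assert np.allclose(svals, [1.0] * k + [0.0] * (m - k))

# Projecting the columns of M onto span{w_1, ..., w_k} gives a matrix of rank <= k.
M = rng.standard_normal((m, n))
M_hat = Pi_k @ M
assert np.linalg.matrix_rank(M_hat) <= k
print("all projection checks passed")
```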
Having this in hand, we can write $\hat{M} = \Pi_k M$. Therefore,
$$\|M - \hat{M}\|_F^2 = \|M - \Pi_k M\|_F^2 = \|(I - \Pi_k) M\|_F^2 = \|\Pi_k^{\perp} M\|_F^2.$$

In the last identity, $\Pi_k^{\perp} = I - \Pi_k$ is nothing but the projection matrix onto the space orthogonal to $\mathrm{span}\{w_1, \dots, w_k\}$. More precisely, let us add $w_{k+1}, \dots, w_m$ such that $w_1, \dots, w_m$ form an orthonormal basis of $\mathbb{R}^m$ (recall that the columns of $M$ live in $\mathbb{R}^m$). Then,
$$\Pi_k^{\perp} = \sum_{i=k+1}^{m} w_i w_i^T.$$

Now, we need to show that to minimize $\|\Pi_k^{\perp} M\|_F^2$, the vectors $w_1, \dots, w_k$ must lie entirely in the span of the top singular vectors of $M$. Firstly, by (1.3) and the fact that $\Pi_k^{\perp}$ is symmetric and idempotent,
$$\|\Pi_k^{\perp} M\|_F^2 = \mathrm{Tr}\big(M^T \Pi_k^{\perp} \Pi_k^{\perp} M\big) = \mathrm{Tr}\big(M^T \Pi_k^{\perp} M\big) = \mathrm{Tr}\big(M M^T \Pi_k^{\perp}\big).$$

The last identity follows by the invariance of the trace under cyclic permutation. Now, using $M M^T = \sum_{i=1}^{m} \sigma_i^2 u_i u_i^T$, we can rewrite the above as follows:
$$\begin{aligned}
\mathrm{Tr}(M M^T \Pi_k^{\perp}) &= \mathrm{Tr}\Big(\sum_{i=1}^{m} \sigma_i^2 u_i u_i^T \sum_{j=k+1}^{m} w_j w_j^T\Big) \\
&= \sum_{i=1}^{m} \sum_{j=k+1}^{m} \mathrm{Tr}\big(\sigma_i^2 \, u_i u_i^T w_j w_j^T\big) \\
&= \sum_{i=1}^{m} \sum_{j=k+1}^{m} \sigma_i^2 \langle u_i, w_j \rangle^2 \;\ge\; \sum_{i=k+1}^{m} \sigma_i^2.
\end{aligned} \tag{1.14}$$

The second identity follows by linearity of the trace, and the third identity follows by invariance of the trace under cyclic permutation, since $\mathrm{Tr}(u_i u_i^T w_j w_j^T) = (w_j^T u_i)(u_i^T w_j) = \langle u_i, w_j \rangle^2$. To see the last inequality, observe that for any $k+1 \le j \le m$, $\sum_{i=1}^{m} \langle u_i, w_j \rangle^2 = \|w_j\|^2 = 1$, while for any $i$, $\sum_{j=k+1}^{m} \langle u_i, w_j \rangle^2 \le 1$. So $\sum_{i,j} \sigma_i^2 \langle u_i, w_j \rangle^2$ is minimized when all the $w_j$'s (for $j > k$) lie in the linear space $\mathrm{span}\{u_{k+1}, \dots, u_m\}$, in which case it equals $\sum_{i=k+1}^{m} \sigma_i^2$, as desired.
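As with Theorem 1.4, the guarantee of Theorem 1.5 is easy to check numerically for the truncated SVD; a minimal sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 6, 10, 3
M = rng.standard_normal((m, n))

U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # truncated SVD

# ||M - M_k||_F^2 equals the sum of the squared discarded singular values.
fro_err_sq = np.linalg.norm(M - M_k, ord='fro') ** 2
assert np.isclose(fro_err_sq, np.sum(sigma[k:] ** 2))
print("Frobenius error^2:", fro_err_sq, " sum of discarded sigma_i^2:", np.sum(sigma[k:] ** 2))
```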

The above theorems may not be directly usable in practice because computing the SVD is very costly. However, there are numerous algorithms that efficiently (in almost linear time) approximate the SVD of a given matrix. The following result, which resolves this question, received the best paper award at STOC 2013.

Theorem 1.6 (Clarkson-Woodruff 2013 [1]). There is an algorithm that, for any matrix $M \in \mathbb{R}^{n \times n}$ and input parameters $k$ and $\epsilon > 0$, finds a rank $k$ matrix $\hat{M}$ such that
$$\|M - \hat{M}\|_F^2 \le (1 + \epsilon) \inf_{N : \mathrm{rank}(N) = k} \|M - N\|_F^2.$$

The algorithm runs in time $O\big(\mathrm{nnz}(M) \cdot (k/\epsilon + k \log k) + n \cdot \mathrm{poly}(k/\epsilon)\big)$. Here $\mathrm{nnz}(M)$ denotes the number of nonzero entries of $M$, i.e., the input length of $M$.
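We did not cover the Clarkson-Woodruff algorithm itself. As a rough illustration of the sketch-then-solve idea behind fast approximate low rank approximation, here is a simple Gaussian randomized range-finder in numpy; note that this is not the input-sparsity-time algorithm of [1] (which uses sparse sketches), and the oversampling parameter p below is an arbitrary choice for this sketch.

```python
import numpy as np

def randomized_low_rank(M, k, p=10, seed=0):
    """Rank-k approximation of M via a Gaussian sketch (randomized range-finder).

    This is NOT the Clarkson-Woodruff input-sparsity-time algorithm of [1];
    it is a simpler illustration of the sketch-then-solve idea.
    p is an oversampling parameter (an arbitrary choice here)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    Omega = rng.standard_normal((n, k + p))     # random test matrix
    Y = M @ Omega                               # sample of the column space of M
    Q, _ = np.linalg.qr(Y)                      # orthonormal basis for that sample
    B = Q.T @ M                                 # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Truncate to rank k and map back to the original coordinates.
    return (Q @ Ub[:, :k]) @ np.diag(s[:k]) @ Vt[:k, :]

# Compare against the optimal rank-k Frobenius error given by the exact SVD.
rng = np.random.default_rng(4)
M = rng.standard_normal((200, 300))
k = 10
M_k = randomized_low_rank(M, k)
sigma = np.linalg.svd(M, compute_uv=False)
print("randomized error^2:", np.linalg.norm(M - M_k, 'fro') ** 2,
      " optimal error^2:", np.sum(sigma[k:] ** 2))
```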

1.2 Applications

We conclude this lecture by giving several other applications of low rank approximation.

Hidden Partition. Given a graph, assume that there are two hidden communities $A, B$, each of size $n/2$, such that for any pair of individuals $i, j \in A$ there is an edge between them with probability $p$; similarly, for any pair of individuals $i, j \in B$ there is an edge with probability $p$. For any pair of individuals on the two sides, the probability that an edge exists is $q$. Given a sample of such a graph and assuming $p \gg q$, we want to approximately recover $A, B$. If we reorder the vertices such that the first community consists of nodes $1, \dots, n/2$ and the second one consists of $n/2+1, \dots, n$, the expected adjacency matrix looks like the following block matrix (each block of size $n/2 \times n/2$):
$$\hat{M}_2 = \begin{pmatrix} p J & q J \\ q J & p J \end{pmatrix},$$
where $J$ denotes the all-ones matrix. Observe that the above matrix has rank 2. So, one may expect that by applying a rank 2 approximation to the adjacency matrix of the given graph one can recover the hidden partition. This is indeed possible assuming $p$ is sufficiently larger than $q$; see [2].
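As an illustration (this is the standard spectral heuristic, not the analysis of [2]; the parameters n, p, q below are arbitrary choices), the following sketch samples such a random graph and recovers the two communities from the signs of the second eigenvector of the adjacency matrix, which reflects its approximate rank-2 structure.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 200, 0.5, 0.1
labels = np.array([0] * (n // 2) + [1] * (n // 2))   # hidden partition A | B

# Sample the random graph: within-community edges w.p. p, across w.p. q.
probs = np.where(labels[:, None] == labels[None, :], p, q)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)                  # symmetric 0/1 adjacency matrix

# Rank-2 structure: the top eigenvector is roughly constant, while the second
# one changes sign between the two communities (spectral heuristic, cf. [2]).
eigvals, eigvecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
v2 = eigvecs[:, -2]                                  # eigenvector of 2nd-largest eigenvalue
guess = (v2 > 0).astype(int)

# The sign of v2 is arbitrary, so report the better of the two labelings.
agreement = max(np.mean(guess == labels), np.mean(guess != labels))
print("fraction of correctly classified vertices:", agreement)
```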

Topic Modeling. Our next application is topic modeling. Given a words/documents matrix, we want to discover the hidden topics underlying these words and documents. As shown in Figure 1.1, the process of discovering topics is to decompose the given words/documents matrix $C$, which records word-document pairs, into two matrices $\Phi$ and $\Theta$, which represent word-topic and topic-document pairs respectively. Assuming the number of possible topics is small, one can use low rank approximation to determine such a decomposition.

Figure 1.1: Topic Modeling

Unfortunately, the resulting singular vectors of the low rank approximation may have many negative entries, so the decomposition matrices would have negative entries as well. To avoid these unwanted cases, one may consider a low rank decomposition problem subject to a nonnegativity constraint on the entries. This is also known as nonnegative matrix factorization. Unfortunately, this problem is NP-hard in general, but several algorithms have been discovered when the underlying words/documents matrix has a specific structure.
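As a hedged sketch of the idea, the following code runs Lee-Seung multiplicative updates, one simple heuristic for nonnegative matrix factorization; it only finds a local optimum and is not one of the structured algorithms alluded to above.

```python
import numpy as np

def nmf(C, k, iters=500, seed=0, eps=1e-10):
    """Approximate a nonnegative words x documents matrix C as Phi @ Theta with
    nonnegative factors, using Lee-Seung multiplicative updates.
    This is a simple heuristic sketch; it only finds a local optimum."""
    rng = np.random.default_rng(seed)
    m, n = C.shape
    Phi = rng.random((m, k))      # word-topic weights
    Theta = rng.random((k, n))    # topic-document weights
    for _ in range(iters):
        # Multiplicative updates for minimizing ||C - Phi Theta||_F^2.
        Theta *= (Phi.T @ C) / (Phi.T @ Phi @ Theta + eps)
        Phi *= (C @ Theta.T) / (Phi @ Theta @ Theta.T + eps)
    return Phi, Theta

# Toy word/document matrix generated from 2 hidden topics.
rng = np.random.default_rng(6)
true_Phi = rng.random((30, 2))
true_Theta = rng.random((2, 20))
C = true_Phi @ true_Theta
Phi, Theta = nmf(C, k=2)
print("relative error:", np.linalg.norm(C - Phi @ Theta) / np.linalg.norm(C))
```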

Max-cut. This is our most formal application of low rank approximation: we use this idea to design an optimization algorithm for the maximum cut problem. Given a graph $G = (V, E)$ we want to find a set $S \subseteq V$ which maximizes $|E(S, \bar{S})|$, the number of edges crossing the cut. Although min-cut can be solved optimally in polynomial time, max-cut is an NP-hard problem. The best known approximation algorithm for this problem has an approximation factor of 0.878, by a seminal work of Goemans and Williamson. They showed that there is a polynomial time algorithm which always returns a cut $(T, \bar{T})$ such that
$$|E(T, \bar{T})| \ge 0.878 \cdot \max_{S} |E(S, \bar{S})|. \tag{1.15}$$

Firstly, we formulate the max-cut problem algebraically. Let $A$ be the adjacency matrix of $G$. For a set $S \subseteq V$, let
$$(\mathbf{1}_S)_i = \begin{cases} 1 & \text{if } i \in S, \\ 0 & \text{otherwise,} \end{cases} \tag{1.16}$$
be the indicator vector of the set $S$. We claim that for any set $S \subseteq V$,
$$|E(S, \bar{S})| = \mathbf{1}_S^T A \mathbf{1}_{\bar{S}}. \tag{1.17}$$

In fact, for any two vectors $x$ and $y$,
$$x^T A y = \sum_{i,j} x_i A_{i,j} y_j, \tag{1.18}$$
so $\mathbf{1}_S^T A \mathbf{1}_{\bar{S}} = \sum_{i \in S, j \in \bar{S}} A_{i,j}$ counts exactly the edges with one endpoint in $S$ and the other in $\bar{S}$.

So we can rewrite the max-cut problem as the following algebraic problem:
$$\max_{S} |E(S, \bar{S})| = \max_{x \in \{0,1\}^n} x^T A (\mathbf{1} - x), \tag{1.19}$$
where $\mathbf{1}$ denotes the all-ones vector.
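The identity (1.19) can be sanity-checked by brute force on a tiny graph; a minimal sketch (numpy and the itertools standard library assumed):

```python
import itertools
import numpy as np

# A small graph: a 4-cycle with one chord.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

best_val, best_x = -1.0, None
for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits, dtype=float)
    val = x @ A @ (1.0 - x)        # x^T A (1 - x) counts the cut edges
    if val > best_val:
        best_val, best_x = val, x

print("max cut value:", best_val,
      "achieved by S =", [i for i in range(n) if best_x[i] == 1])
```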

The rest of the proof is a fairly general argument and uses no combinatorial structure of the underlying problem. Namely, we give an algorithm to solve the quadratic optimization problem $\max_{x \in \{0,1\}^n} x^T A (\mathbf{1} - x)$ for a given matrix $A$.

We do this task in two steps. First, we approximate $A$ by a low rank matrix and show that the optimum of the optimization problem for the low rank matrix is close to the optimum for $A$. Then, we design an algorithm to approximately solve the above quadratic problem for a low rank matrix. Roughly speaking, using the low rank property we reduce our $n$-dimensional problem to a $k$-dimensional question, and then we solve the $k$-dimensional question using an $\epsilon$-net.
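The second step can be illustrated roughly as follows (a heuristic sketch of the $\epsilon$-net idea, not the actual algorithm or its analysis; the rank $k$ and the grid resolution below are arbitrary choices): writing the rank-$k$ matrix as $A_k = \sum_{i \le k} \sigma_i u_i v_i^T$, we have $x^T A_k(\mathbf{1} - x) = \sum_i \sigma_i (u_i^T x)\,(v_i^T(\mathbf{1} - x))$, so guessing the $k$ values $u_i^T x$ on a grid makes the objective linear in $x$; each guess is optimized greedily over $\{0,1\}^n$, and the best candidate found is returned.

```python
import itertools
import numpy as np

def low_rank_quadratic_max(A, k=2, grid_steps=9):
    """Heuristic sketch of the epsilon-net idea for max_{x in {0,1}^n} x^T A (1 - x).

    Replace A by its rank-k approximation A_k = sum_i sigma_i u_i v_i^T, guess the
    k inner products u_i^T x on a coarse grid, maximize the resulting linearized
    objective greedily, and keep the best candidate found (evaluated on A itself).
    Grid resolution and k are arbitrary choices; this is not the full algorithm."""
    n = A.shape[0]
    U, sigma, Vt = np.linalg.svd(A)
    U, sigma, Vt = U[:, :k], sigma[:k], Vt[:k, :]

    # Each u_i^T x lies in [-sqrt(n), sqrt(n)]; discretize that range.
    grid = np.linspace(-np.sqrt(n), np.sqrt(n), grid_steps)
    ones = np.ones(n)
    best_val, best_x = -np.inf, None
    for a in itertools.product(grid, repeat=k):
        # Linearized objective: sum_i sigma_i a_i v_i^T (1 - x) = const - c^T x.
        c = Vt.T @ (sigma * np.array(a))
        x = (c < 0).astype(float)          # greedy maximizer of the linearization
        val = x @ A @ (ones - x)           # evaluate the true objective
        if val > best_val:
            best_val, best_x = val, x
    return best_val, best_x

# Example: adjacency matrix of a small random graph.
rng = np.random.default_rng(7)
n = 12
A = np.triu((rng.random((n, n)) < 0.5).astype(float), k=1)
A = A + A.T
val, x = low_rank_quadratic_max(A, k=2)
print("cut value found:", val)
```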

1.3 References

[1] Clarkson, Kenneth L., and David P. Woodruff. "Low rank approximation and regression in input sparsity time." In Proceedings of the forty-fifth annual ACM Symposium on Theory of Computing (STOC), pp. 81-90. ACM, 2013.

[2] McSherry, Frank. "Spectral partitioning of random graphs." In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science (FOCS), pp. 529-537. IEEE, 2001.
