Abstract—We present several new variations on the theme of nonnegative matrix factorization (NMF). Considering factorizations of the form X = F G^T, we focus on algorithms in which G is restricted to contain nonnegative entries, while the data matrix X is allowed to have mixed signs, thus extending the applicable range of NMF methods. We also consider algorithms in which the basis vectors of F are constrained to be convex combinations of the data points. This is used for a kernel extension of NMF. We provide algorithms for computing these new factorizations, and we provide supporting theoretical analysis. We also analyze the relationships between our algorithms and clustering algorithms, and consider the implications for the sparseness of solutions. Finally, we present experimental results that explore the properties of these new methods.
1 INTRODUCTION
MATRIX factorization is a unifying theme in numerical linear algebra. A wide variety of matrix factorization algorithms have been developed over many decades, providing a numerical platform for matrix operations such as solving linear systems, spectral decomposition, and subspace identification. Some of these algorithms have also proven useful in statistical data analysis, most notably the singular value decomposition (SVD), which underlies principal component analysis (PCA).

Recent work in machine learning has focused on matrix factorizations that directly target some of the special features of statistical data analysis. In particular, nonnegative matrix factorization (NMF) [1], [2] focuses on the analysis of data matrices whose elements are nonnegative, a common occurrence in data sets derived from text and images. Moreover, NMF yields nonnegative factors, which can be advantageous from the point of view of interpretability.

The scope of research on NMF has grown rapidly in recent years. NMF has been shown to be useful in a variety of applied settings, including environmetrics [3], chemometrics [4], pattern recognition [5], multimedia data analysis [6], text mining [7], [8], DNA gene expression analysis [9], [10], and protein interaction [11]. Algorithmic extensions of NMF have been developed to accommodate a variety of objective functions [12], [13] and a variety of data analysis problems, including classification [14] and collaborative filtering [15]. A number of studies have focused on further developing computational methodologies for NMF [16], [17], [18], [19]. Finally, researchers have begun to explore some of the relationships between matrix factorizations and K-means clustering [20], making use of the least-squares objective of NMF; as we emphasize in the current paper, this relationship has implications for the interpretability of matrix factors. NMF with the Kullback-Leibler (KL) divergence objective has been shown [21], [13] to be equivalent to probabilistic latent semantic analysis [22], which has been further developed into the fully probabilistic latent Dirichlet allocation model [23], [24].

Our goal in this paper is to expand the repertoire of nonnegative matrix factorization. Our focus is on algorithms that constrain the matrix factors; we do not require the data matrix to be similarly constrained. In particular, we develop NMF-like algorithms that yield nonnegative factors but do not require the data matrix to be nonnegative. This extends the range of application of NMF ideas. Moreover, by focusing on constraints on the matrix factors, we are able to strengthen the connections between NMF and K-means clustering. Note in particular that the result of a K-means clustering run can be written as a matrix factorization X ≈ F G^T, where X is the data matrix, F contains the cluster centroids, and G contains the cluster membership indicators. Although F typically has entries with both positive and negative signs, G is nonnegative. This motivates us to propose general factorizations in which G is restricted to be nonnegative and F is unconstrained. We also consider algorithms that constrain F; in particular, restricting the columns of F to be convex combinations of data points in X, we obtain a matrix factorization that can be interpreted in terms of weighted cluster centroids.

The paper is organized as follows: In Section 2, we present the new matrix factorizations, and in Section 3, we present algorithms for computing these factorizations. Section 4 provides a theoretical analysis, which provides insights into the sparseness of matrix factors for a convex variant of NMF. In Section 5, we consider extensions of Convex-NMF and the relationships of NMF-like factorizations. In Section 5.1, we show that a convex variant of NMF has the advantage that it is

C. Ding is with the Department of Computer Science and Engineering, University of Texas at Arlington, Nedderman Hall, Room 307, 416 Yates Street, Arlington, TX 76019. E-mail: CHQDing@uta.edu.
T. Li is with the School of Computer Science, Florida International University, ECS 318, 11200 SW 8th Street, Miami, FL 33199. E-mail: taoli@cs.fiu.edu.
M.I. Jordan is with the Department of Electrical Engineering and Computer Science (EECS) and the Department of Statistics, University of California, Berkeley, 387 Soda Hall #1776, Berkeley, CA 94720-1776. E-mail: jordan@cs.berkeley.edu.
Manuscript received 24 June 2007; revised 22 Jan. 2008; accepted 16 Oct. 2008; published online 11 Nov. 2008. Recommended for acceptance by Z. Ghahramani. IEEE CS Log Number TPAMI-2007-06-0385. Digital Object Identifier no. 10.1109/TPAMI.2008.277.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, January 2010.
readily kernelized. In Section 6, we present comparative experiments that show that constraining the F factors to be convex combinations of input data enhances their interpretability. We also present experiments that compare the performance of the NMF variants to K-means clustering, where we assess the extent to which the imposition of constraints that aim to enhance interpretability leads to poorer clustering performance. Finally, we present our conclusions in Section 7.

2 SEMI-NMF AND CONVEX-NMF

Let the input data matrix X = (x_1, ..., x_n) contain a collection of n data vectors as columns. We consider factorizations of the form

    X ≈ F G^T,                                                        (1)

where X ∈ R^{p×n}, F ∈ R^{p×k}, and G ∈ R^{n×k}. For example, the SVD can be written in this form. In the case of the SVD, there are no restrictions on the signs of F and G; moreover, the data matrix X is also unconstrained. NMF can also be written in this form, where the data matrix X is assumed to be nonnegative, as are the factors F and G. We now consider some additional examples.

2.1 Semi-NMF
When the data matrix is unconstrained (i.e., it may have mixed signs), we consider a factorization that we refer to as Semi-NMF, in which we restrict G to be nonnegative while placing no restriction on the signs of F.

We can motivate Semi-NMF from the perspective of clustering. Suppose we do a K-means clustering on X and obtain cluster centroids F = (f_1, ..., f_k). Let G denote the cluster indicators, i.e., g_ik = 1 if x_i belongs to cluster c_k, and g_ik = 0 otherwise. We can write the K-means clustering objective function as

    J_K-means = Σ_{i=1}^{n} Σ_{k=1}^{K} g_ik ||x_i − f_k||² = ||X − F G^T||².

In this paper, ||v|| denotes the L2 norm of a vector v, and ||A|| denotes the Frobenius norm of a matrix A. We see that the K-means clustering objective can alternatively be viewed as an objective function for matrix approximation. Moreover, this approximation will generally be tighter if we relax the optimization by allowing g_ij to range over values in (0, 1) or values in (0, ∞). This yields the Semi-NMF matrix factorization.

2.2 Convex-NMF
While, in NMF and Semi-NMF, there are no constraints on the basis vectors F = (f_1, ..., f_k), for reasons of interpretability it may be useful to impose the constraint that the vectors defining F lie within the column space of X:

    f_ℓ = w_1ℓ x_1 + ... + w_nℓ x_n = X w_ℓ,  or  F = X W.            (2)

Moreover, again for reasons of interpretability, we may wish to restrict ourselves to convex combinations of the columns of X. This constraint has the advantage that we can interpret the columns f_ℓ as weighted sums of certain data points; in particular, these columns would capture a notion of centroids. We refer to this restricted form of the F factor as Convex-NMF. Convex-NMF applies to both nonnegative and mixed-sign data matrices. As we will see, Convex-NMF has an interesting property: the factors W and G both tend to be very sparse.

Lee and Seung [25] considered a model in which the F factors were restricted to the unit interval, i.e., 0 ≤ F_ik ≤ 1. This so-called convex coding does not require the f_k to be nonnegative linear combinations of input data vectors and thus does not, in general, capture the notion of cluster centroid. Indeed, the emphasis in [25] and in [1], [2] is the parts-of-whole encoding provided by NMF, not the relationship of nonnegative factorizations to vector quantization.

To summarize our development thus far, let us write the different factorizations as follows:

    SVD:         X ≈ F G^T,                                           (3)
    NMF:         X_+ ≈ F_+ G_+^T,                                     (4)
    Semi-NMF:    X ≈ F G_+^T,                                         (5)
    Convex-NMF:  X ≈ X W_+ G_+^T,                                     (6)

where the subscripts are intended to suggest the constraints imposed by the different factorizations.

Before turning to a presentation of algorithms for computing Semi-NMF and Convex-NMF factorizations and supporting theoretical analysis, we provide an illustrative example.

2.3 An Illustration
Consider the following data matrix:

    X = ( 1.3  1.8  4.8  7.1  5.0  5.2  8.0 )
        ( 1.5  6.9  3.9  5.5  8.5  3.9  5.5 )
        ( 6.5  1.6  8.2  7.2  8.7  7.9  5.2 )
        ( 3.8  8.3  4.7  6.4  7.5  3.2  7.4 )
        ( 7.3  1.8  2.1  2.7  6.8  4.8  6.2 ).

K-means clustering produces two clusters, where the first cluster includes the first three columns and the second cluster includes the last four columns.

We show that the Semi-NMF and Convex-NMF factorizations give clustering solutions that are identical to the K-means clustering results. We run SVD, Semi-NMF, and Convex-NMF. The matrix factors G obtained for the three factorizations are

    G_svd^T  = ( 0.25  0.05  0.22  0.45  0.44  0.46  0.52 )
               ( 0.50  0.60  0.43  0.30  0.12  0.01  0.31 ),          (7)

    G_semi^T = ( 0.61  0.89  0.54  0.77  0.14  0.36  0.84 )
               ( 0.12  0.53  0.11  1.03  0.60  0.77  1.16 ),          (8)

    G_cnvx^T = ( 0.31  0.31  0.29  0.02  0     0     0.02 )
               ( 0     0.06  0     0.31  0.27  0.30  0.36 ).          (9)

Both the Semi-NMF and Convex-NMF results agree with the K-means clustering: for the first three columns, the values in the upper row are larger than those in the lower row, indicating that these columns belong to the same cluster, while, for the last four columns, the values in the upper row are smaller than those in the lower row, indicating that these columns belong to the other cluster. Note, however, that Convex-NMF gives sharper indicators of the clustering.
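The identity used in Section 2.1, J_K-means = Σ_i Σ_k g_ik ||x_i − f_k||² = ||X − F G^T||², is easy to check numerically. The sketch below is our own illustration (not code from the paper): it builds G from a hard assignment and F from per-cluster means, then compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 7))            # p = 5 dimensions, n = 7 data columns
labels = np.array([0, 0, 0, 1, 1, 1, 1])   # a hard K-means-style assignment, k = 2

# Cluster indicators: g_ik = 1 if x_i belongs to cluster k, else 0.
G = np.zeros((7, 2))
G[np.arange(7), labels] = 1.0

# Cluster centroids as the columns of F.
F = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(2)])

# Left side: sum of squared distances to the assigned centroids.
J_kmeans = sum(np.sum((X[:, i] - F[:, labels[i]]) ** 2) for i in range(7))

# Right side: squared Frobenius norm of the factorization residual.
J_frob = np.linalg.norm(X - F @ G.T, "fro") ** 2

assert np.isclose(J_kmeans, J_frob)
```

Relaxing the 0/1 entries of G to nonnegative reals is exactly the step that leads from this objective to Semi-NMF.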
DING ET AL.: CONVEX AND SEMI-NONNEGATIVE MATRIX FACTORIZATIONS

The computed basis vectors F for the different matrix factorizations are as follows:

    F_svd  = ( 0.41  0.50 )     F_semi = ( 0.05  0.27 )     F_cnvx = ( 0.31  0.53 )
             ( 0.35  0.21 )              ( 0.40  0.40 )              ( 0.42  0.30 )
             ( 0.66  0.32 )              ( 0.70  0.72 )              ( 0.56  0.57 )
             ( 0.28  0.72 )              ( 0.30  0.08 )              ( 0.49  0.41 )
             ( 0.43  0.28 ),             ( 0.51  0.49 ),             ( 0.41  0.36 ),

and the cluster centroids obtained from K-means clustering are given by the columns of the following matrix:

    C_Kmeans = ( 0.29  0.52 )
               ( 0.45  0.32 )
               ( 0.59  0.60 )
               ( 0.46  0.36 )
               ( 0.41  0.37 ).

We have rescaled all column vectors to unit L2-norm for purposes of comparison.

One can see that F_cnvx is close to C_Kmeans: ||F_cnvx − C_Kmeans|| = 0.08. F_semi deviates substantially from C_Kmeans: ||F_semi − C_Kmeans|| = 0.53. Two of the elements in F_semi are particularly far from those in C_Kmeans: (F_semi)_{1,1} = 0.05 versus (C_Kmeans)_{1,1} = 0.29, and (F_semi)_{4,2} = 0.08 versus (C_Kmeans)_{4,2} = 0.36. Thus, restrictions on F can have large effects on the subspace factorization. Convex-NMF gives F factors that are closer to the cluster centroids, validating our expectation that this factorization produces centroid-like factors. More examples are given in Fig. 1.

Finally, computing the residual values, we have ||X − F G^T|| = 0.27940, 0.27944, and 0.30877 for SVD, Semi-NMF, and Convex-NMF, respectively. We see that the enhanced interpretability provided by Semi-NMF is not accompanied by a degradation in approximation accuracy relative to the SVD. The more highly constrained Convex-NMF involves a modest degradation in accuracy.

We now turn to a presentation of algorithms for computing the two new factorizations, together with theoretical results establishing the convergence of these algorithms.

3 ALGORITHMS AND ANALYSIS

In this section, we provide algorithms and accompanying analysis for the NMF factorizations that we presented in the previous section.

3.1 Algorithm for Semi-NMF
We compute the Semi-NMF factorization via an iterative updating algorithm that alternately updates F and G:

(S0) Initialize G. Do a K-means clustering. This gives cluster indicators G: G_ik = 1 if x_i belongs to cluster k; otherwise, G_ik = 0. Add a small constant (we use the value 0.2 in practice) to all elements of G. See Section 3.3 for more discussion of initialization.

(S1) Update F (while fixing G).

It is easy to see that the limiting solution of the update rule of (11) satisfies the fixed-point equation: at convergence, G^(∞) = G^(t+1) = G^(t) = G.

Z(H, H') is an auxiliary function for J(H), i.e., it satisfies the requirements J(H) ≤ Z(H, H') and J(H) = Z(H, H). Furthermore, it is a convex function in H, and its global minimum is

    H_ik = arg min_H Z(H, H') = H'_ik √[ ((B^+)_ik + (H' A^-)_ik) / ((B^-)_ik + (H' A^+)_ik) ].   (22)

The second derivative is

    ∂²Z(H, H')/∂H_ik² = Y_ik = 2 ((B^+)_ik + (H' A^-)_ik) H'_ik / H_ik² + 2 ((B^-)_ik + (H' A^+)_ik) / H'_ik > 0.

Thus, Z(H, H') is a convex function of H. Therefore, we obtain the global minimum by setting ∂Z(H, H')/∂H_ik = 0 in (26) and solving for H. Rearranging, we obtain (22). □

The computational complexity for Convex-NMF is of order n²p + m(2n²k + nk²) for (28) and of order m(2n²k + 2nk²) for (29), where m (≈ 100) is the number of iterations to convergence. These are matrix multiplications and can be computed efficiently on most computers.

The correctness and convergence of the algorithm are addressed in the following:

Theorem 6. Fixing G, under the update rule for W of (29), 1) the residual ||X − X W G^T||² decreases monotonically (it is nonincreasing), and 2) the solution converges to a KKT fixed point.

The function

    Z(H, H') = −Σ_ik 2 (B^+)_ik H'_ik (1 + log(H_ik / H'_ik)) + Σ_ik (B^-)_ik (H_ik² + H'_ik²) / H'_ik
               + Σ_ik (A^+ H' C)_ik H_ik² / H'_ik − Σ_ijkℓ (A^-)_ij H'_jk C_kℓ H'_iℓ (1 + log(H_jk H_iℓ / (H'_jk H'_iℓ)))   (34)

is an auxiliary function of J(H); furthermore, Z(H, H') is a convex function of H, and its global minimum is

    H_ik = arg min_H Z(H, H') = H'_ik √[ ((B^+)_ik + (A^- H' C)_ik) / ((B^-)_ik + (A^+ H' C)_ik) ].   (35)
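The Semi-NMF iteration can be sketched as follows in NumPy. This is our sketch, not the authors' code: G is initialized from a hard assignment plus the constant 0.2 as in step (S0) (here from given labels rather than an actual K-means run); the F update F = X G (GᵀG)⁻¹ is an assumption, since the formula for step (S1) is not reproduced in this excerpt; and the G update uses the multiplicative form of (22) with A = FᵀF and B = XᵀF, where pos/neg denote the elementwise positive/negative parts.

```python
import numpy as np

def pos(M):
    """Elementwise positive part: (|M| + M) / 2."""
    return (np.abs(M) + M) / 2.0

def neg(M):
    """Elementwise negative part: (|M| - M) / 2."""
    return (np.abs(M) - M) / 2.0

def semi_nmf(X, labels, k, n_iter=100, eps=1e-12):
    """Semi-NMF X ~ F G^T: G >= 0, F unconstrained, X may have mixed signs."""
    n = X.shape[1]
    G = np.zeros((n, k))
    G[np.arange(n), labels] = 1.0
    G += 0.2                                   # (S0): indicators plus a small constant
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)    # assumed least-squares update for (S1)
        A, B = F.T @ F, X.T @ F
        # Multiplicative update in the form of (22); eps guards the division.
        G *= np.sqrt((pos(B) + G @ neg(A) + eps) / (neg(B) + G @ pos(A) + eps))
    return F, G
```

Under the paper's analysis, the residual ||X − F Gᵀ||² is nonincreasing under updates of this form, so in practice the loop can also be run to a fixed tolerance rather than a fixed iteration count.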
... random initialization. An SVD-based initialization has recently been proposed by Boutsidis and Gallopoulos [27]. See [27] and [17] for additional references on initialization.

4 SPARSITY OF CONVEX-NMF

In the original presentation of NMF, Lee and Seung [1] emphasized the desideratum of sparsity. For example, in the case of image data, it was hoped that NMF factors would correspond to a coherent part of the original image, for example, a nose or an eye; these would be sparse factors in which most of the components would be zero. Further experiments have shown, however, that NMF factors are not necessarily sparse, and sparsification schemes have been developed on top of NMF [16], [5]. Parts-of-whole representations are not necessarily recovered by NMF, but conditions for obtaining parts-of-whole representations have been discussed [28]. See also [29], [30], and [31] for related literature on sparse factorizations in the context of PCA.

Interestingly, the Convex-NMF factors W and G are naturally sparse. We provide theoretical support for this assertion in this section and additional experimental support in Section 6. (Sparseness can also be seen in the illustrative example presented in Section 2.3.)

We first note that Convex-NMF can be reformulated as

    min_{W,G} Σ_k σ_k² ||v_k^T (I − W G^T)||²,  s.t.  W ∈ R_+^{n×k}, G ∈ R_+^{n×k},   (37)

where we use the SVD X = U Σ V^T and thus have X^T X = Σ_k σ_k² v_k v_k^T. Therefore, ||X − X W G^T||² = Tr (I − G W^T) X^T X (I − W G^T) = Σ_k σ_k² ||v_k^T (I − W G^T)||². We now claim that 1) this optimization will produce a sparse solution for W and G, and 2) the more slowly σ_k decreases, the sparser the solution.

The second part of our argument is captured in a lemma.

Lemma 9. The solution of the following optimization problem,

    min_{W,G} ||I − W G^T||²,  s.t.  W, G ∈ R_+^{n×K},

is given by W = G = any K columns of (e_1 ⋯ e_n), where e_k is a basis vector: (e_k)_{i≠k} = 0, (e_k)_{i=k} = 1.

Proof. We first prove the result for a slightly more general case. Let D = diag(d_1, ..., d_n) be a diagonal matrix with d_1 > d_2 > ⋯ > d_n > 0. The lemma holds if we replace I by D and take W = G = (√d_1 e_1 ⋯ √d_K e_K).² The proof follows from the fact that we have a unique spectral expansion D e_k = d_k e_k and D = Σ_{k=1}^n d_k e_k e_k^T. Now we take the limit d_1 = ⋯ = d_n = 1. The spectral expansion is no longer unique: I = Σ_{k=1}^n u_k u_k^T for any orthogonal basis (u_1, ..., u_n) = U. However, due to the nonnegativity constraint, (e_1 ⋯ e_n) is the only viable basis. Thus, W = G = any K columns of (e_1 ⋯ e_n). □

2. In NMF, the degree of freedom of diagonal rescaling is always present. Indeed, let E = (e_1 ⋯ e_K). Our choice of W = G = E√D can be written in different ways: W G^T = (E√D)(E√D)^T = (E D^α)(E D^{1−α})^T, where −∞ < α < ∞.

The main point of Lemma 9 is that the solutions to min_{W,G} ||I − W G^T||² are the sparsest possible rank-K matrices W, G. Now, returning to our characterization of Convex-NMF in (37), we can write

    ||I − W G^T||² = Σ_k ||e_k^T (I − W G^T)||².

Comparing to the Convex-NMF case, we see that the projection of (I − W G^T) onto the principal components has more weight, while the projection of (I − W G^T) onto the nonprincipal components has less weight. Thus, we conclude that sparsity is enforced strongly in the principal component subspace and weakly in the nonprincipal component subspace. Overall, Lemma 9 provides a basis for concluding that Convex-NMF tends to yield sparse solutions.

A more intuitive understanding of the source of the sparsity can be obtained by counting parameters. Note, in particular, that Semi-NMF is based on N_param = kp + kn parameters, whereas Convex-NMF is based on N_param = 2kn parameters. In the usual case n > p (i.e., the number of data points exceeds the data dimension), Convex-NMF has more parameters than Semi-NMF. But we know that Convex-NMF is a special case of Semi-NMF. The resolution of these two seemingly contradictory facts is that some of the parameters in Convex-NMF must be zero.

5 ADDITIONAL REMARKS

Convex-NMF stands out for its interpretability and its sparsity properties. In this section, we consider two additional interesting aspects of Convex-NMF, and we also consider the relationship of all of the NMF-like factorizations that we have developed to K-means clustering.

5.1 Kernel-NMF
Consider a mapping x_i → φ(x_i), or X → φ(X) = (φ(x_1), ..., φ(x_n)). A standard NMF or Semi-NMF factorization φ(X) ≈ F G^T would be difficult to compute, since F and G depend explicitly on the mapping function φ(·). However, Convex-NMF provides an appealing solution to this problem:

    φ(X) ≈ φ(X) W G^T.

Indeed, it is easy to see that the minimization objective

    ||φ(X) − φ(X) W G^T||² = Tr [ φ(X)^T φ(X) − 2 G^T φ(X)^T φ(X) W + W^T φ(X)^T φ(X) W G^T G ]

depends only on the kernel K = φ(X)^T φ(X). In fact, the update rules for Convex-NMF presented in (29) and (28) depend on X^T X only. Thus, it is possible to "kernelize" Convex-NMF in a manner analogous to the kernelization of PCA and K-means.

5.2 Cluster-NMF
In Convex-NMF, we require the columns of F to be convex combinations of the input data. Suppose now that we interpret the entries of G as posterior cluster probabilities. In this case, the cluster centroids can be computed as f_k = X g_k / n_k, or F = X G D_n^{-1}, where D_n = diag(n_1, ..., n_k). The extra degree of freedom for F is not necessary. Therefore, the pair of desiderata, 1) F encodes centroids and 2) G encodes
posterior probabilities, motivates a factorization X ≈ X G D_n^{-1} G^T. We can absorb D_n^{-1/2} into G and solve for

    Cluster-NMF:  X ≈ X G_+ G_+^T.                                    (38)

We call this factorization Cluster-NMF because the degree of freedom in this factorization is the cluster indicator G, as in a standard clustering problem. The objective function is J = ||X − X G G^T||².

5.3 Relation to Relaxed K-Means Clustering
NMF, Semi-NMF, Convex-NMF, Cluster-NMF, and Kernel-NMF all have K-means clustering interpretations when the factor G is orthogonal (G^T G = I). Orthogonality and nonnegativity together imply that each row of G has only one nonzero element, i.e., G is a bona fide cluster indicator. This relationship to clustering is made more precise in the following theorem.

Theorem 10. G-orthogonal NMF, Semi-NMF, Convex-NMF, Cluster-NMF, and Kernel-NMF are all relaxations of K-means clustering.

Proof. For NMF, Semi-NMF, and Convex-NMF, we first eliminate F. The objective is J = ||X − F G^T||² = Tr(X^T X − 2 X^T F G^T + F F^T), using G^T G = I. Setting ∂J/∂F = 0, we obtain F = X G. Thus, we obtain J = Tr(X^T X − G^T X^T X G). For Cluster-NMF, we obtain the same result directly: J = ||X − X G G^T||² = Tr(X^T X − G^T X^T X G). For Kernel-NMF, we have J = ||φ(X) − φ(X) W G^T||² = Tr(K − 2 G^T K W + W^T K W), where K is the kernel. Setting ∂J/∂W = 0, we have K G = K W. Thus, J = Tr(K − G^T K G). In all five of these cases, the first term is constant and does not affect the minimization. The minimization problem thus becomes max_{G^T G = I} Tr(G^T K G), where K is either the linear kernel X^T X or ⟨φ(X), φ(X)⟩. It is known that this is identical to (kernel) K-means clustering [32], [33]. □

In the definitions of NMF, Semi-NMF, Convex-NMF, Cluster-NMF, and Kernel-NMF, G is not restricted to be orthogonal; these NMF variants are therefore soft versions of K-means clustering.

6 EXPERIMENTS

We first present the results of an experiment on synthetic data, which aims to verify that Convex-NMF can yield factorizations that are close to cluster centroids. We then present experimental results for real data comparing K-means clustering and the various factorizations.

6.1 Synthetic Data Set
One main theme of our work is that the Convex-NMF variants may provide subspace factorizations that have more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that, in some cases, the factor F will be interpretable as containing cluster representatives (centroids) and G will be interpretable as encoding cluster indicators. We begin with a simple investigation of this hypothesis. In Fig. 1, we randomly generate four two-dimensional data sets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors (denoted F_semi in the figure) tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors (denoted F_cnvx) almost always lie within the clusters.

6.2 Real-Life Data Sets
We conducted experiments on the following data sets: Ionosphere and Wave from the UCI repository; the document data sets URCS, WebKB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories), and WebAce; and a data set that contains 1,367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into nine categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list, and the top 1,000 words were selected based on frequencies.

Table 1 summarizes the data sets and presents our experimental results. These results are averages over 10 runs for each data set and algorithm.

We compute clustering accuracy using the known class labels. This is done as follows: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. We take this sum as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation.

To measure the sparsity of G in the experiments, we compute the average of each column of G and set all elements below 0.001 times the average to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus, small values of this measure correspond to large sparsity.)

A consequence of the sparsity of G is that the rows of G tend to become close to orthogonal. This indicates a hard clustering (if we view G as encoding posterior probabilities for clustering). We compute the normalized orthogonality, (G^T G)_nm = D^{-1/2} (G^T G) D^{-1/2}, where D = diag(G^T G). Thus, diag[(G^T G)_nm] = I. We report the average of the off-diagonal elements of (G^T G)_nm as the quantity "Deviation from Orthogonality" in the table.

From the experimental results, we observe the following:

1. All of the matrix factorization models are better than K-means on all of the data sets. This is our principal empirical result. It indicates that the NMF family is competitive with K-means for the purposes of clustering.
2. On most of the nonnegative data sets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKB4 the exception). The differences are modest, however, suggesting that the more highly constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of a data analysis.
3. On the data sets containing both positive and negative values (where NMF is not applicable), the Semi-NMF results are better in terms of accuracy than the Convex-NMF results.
TABLE 1
Data Set Descriptions and Results

4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not.
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.

Finally, note that NMF is initialized randomly for the different runs. We investigated the stability of the solution over multiple runs and found that NMF converges to solutions F and G that are very similar across runs; moreover, the resulting discretized cluster indicators were identical.

6.3 Shifting Mixed-Sign Data to Nonnegative
While our algorithms apply directly to mixed-sign data, it is also possible to consider shifting mixed-sign data to be nonnegative by adding the smallest constant such that all entries become nonnegative. We performed experiments on data shifted in this way for the Wave and Ionosphere data. For Wave, the accuracy decreases to 0.503 from 0.590 for Semi-NMF and decreases to 0.5297 from 0.5738 for Convex-NMF. The sparsity increases to 0.586 from 0.498 for Convex-NMF. For Ionosphere, the accuracy decreases to 0.647 from 0.729 for Semi-NMF and decreases to 0.618 from 0.6877 for Convex-NMF. The sparsity increases to 0.829 from 0.498 for Convex-NMF. In short, the shifting approach does not appear to provide a satisfactory alternative.

6.4 Flexibility of NMF
A general conclusion is that NMF almost always performs better than K-means in terms of clustering accuracy while also providing a matrix approximation. We believe this is due to the flexibility of matrix factorization as compared to the rigid spherical clusters that the K-means clustering objective function attempts to capture. When the data distribution is far from spherical clusters, NMF may have advantages. Fig. 2 gives an example. The data set consists of two parallel rods in 3D space containing 200 data points. The two central axes of the rods are 0.3 apart, and the rods have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. Fig. 2c shows the differences in the columns of G (each column is normalized to Σ_i g_k(i) = 1). The misclustered points have small differences.

7 CONCLUSIONS

We have presented a number of new nonnegative matrix factorizations. We have provided algorithms for these factorizations and theoretical analysis of the convergence of these algorithms. The ability of these algorithms to deal with mixed-sign data makes them useful for many applications, particularly given that covariance matrices are often centered.

Semi-NMF offers a low-dimensional representation of data points which lends itself to a convenient clustering interpretation. Convex-NMF further restricts the basis vectors to be convex combinations of data points, providing a notion of cluster centroids for the basis. We also briefly discussed additional NMF algorithms (Kernel-NMF and Cluster-NMF) that are further specializations of Convex-NMF.

We also showed that the NMF variants can be viewed as relaxations of K-means clustering, thus providing a closer tie between NMF and clustering than has been present in the literature to date. Moreover, our empirical results showed that the NMF algorithms all outperform K-means clustering, in terms of clustering accuracy, on all of the data sets that we investigated. We view these results as indicating that the NMF family is worthy of further investigation. We view Semi-NMF and Convex-NMF as particularly worthy of further investigation, given their native capability for handling mixed-sign data and their particularly direct connections to clustering.

Fig. 2. A data set with two clusters in 3D. (a) Clusters obtained using K-means, as indicated by the two plot symbols. (b) Clusters obtained using NMF. (c) The difference g_2(i) − g_1(i), i = 1, ..., 200, with one symbol marking misclustered points and the other marking correctly clustered points.
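The accuracy and sparsity measures described in Section 6.2 can be sketched as follows. This is our own implementation with hypothetical function names, and the permutation search is brute force, so it is only suitable for a small number of clusters:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids, k):
    """Fraction of points correctly clustered under the best cluster-to-class
    permutation (assumes k clusters and k classes, both labeled 0..k-1)."""
    C = np.zeros((k, k), dtype=int)                # confusion matrix
    np.add.at(C, (cluster_ids, true_labels), 1)
    # Reorder rows/columns so as to maximize the diagonal sum.
    best = max(sum(C[i, p[i]] for i in range(k)) for p in permutations(range(k)))
    return best / len(true_labels)

def nonzero_fraction(G, tol=1e-3):
    """Zero out entries below tol times the column average and report the
    fraction of remaining nonzero entries (small values = high sparsity)."""
    thresh = tol * G.mean(axis=0, keepdims=True)
    return float(np.mean(G >= thresh))

true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_ids = np.array([2, 2, 0, 0, 1, 1])         # same grouping, relabeled
acc = clustering_accuracy(true_labels, cluster_ids, 3)   # -> 1.0
```

For large numbers of clusters, the optimal reordering would instead be computed with an assignment-problem solver, but the brute-force version suffices to make the definition concrete.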
[17] M. Berry, M. Browne, A. Langville, P. Pauca, and R. Plemmons, "Algorithms and Applications for Approximate Nonnegative Matrix Factorization," Computational Statistics and Data Analysis, 2006.
[18] T. Li and S. Ma, "IFD: Iterative Feature and Data Clustering," Proc. SIAM Int'l Conf. Data Mining, pp. 472-476, 2004.
[19] T. Li, "A General Model for Clustering Binary Data," Proc. Knowledge Discovery and Data Mining, pp. 188-197, 2005.
[20] C. Ding, X. He, and H. Simon, "On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering," Proc. SIAM Data Mining Conf., 2005.
[21] E. Gaussier and C. Goutte, "Relation between PLSA and NMF and Implications," Proc. ACM Conf. Research and Development in Information Retrieval (SIGIR), pp. 601-602, 2005.
[22] T. Hofmann, "Probabilistic Latent Semantic Indexing," Proc. ACM Conf. Research and Development in Information Retrieval (SIGIR), pp. 50-57, 1999.
[23] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[24] M. Girolami and A. Kaban, "On an Equivalence between PLSI and LDA," Proc. ACM Conf. Research and Development in Information Retrieval (SIGIR), 2003.
[25] D. Lee and H.S. Seung, "Unsupervised Learning by Convex and Conic Coding," Advances in Neural Information Processing Systems 9, MIT Press, 1997.
[26] L. Xu and M. Jordan, "On Convergence Properties of the EM Algorithm for Gaussian Mixtures," Neural Computation, vol. 8, pp. 129-151, 1996.
[27] C. Boutsidis and E. Gallopoulos, "SVD Based Initialization: A Head Start for Nonnegative Matrix Factorization," Pattern Recognition, vol. 41, no. 4, pp. 1350-1362, 2008.
[28] D. Donoho and V. Stodden, "When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?" Advances in Neural Information Processing Systems 16, MIT Press, 2004.
[29] A. d'Aspremont, L. El Ghaoui, M.I. Jordan, and G.R.G. Lanckriet, "A Direct Formulation for Sparse PCA Using Semidefinite Programming," SIAM Rev., 2006.
[30] H. Zou, T. Hastie, and R. Tibshirani, "Sparse Principal Component Analysis," J. Computational and Graphical Statistics, vol. 15, pp. 265-286, 2006.
[31] Z. Zhang, H. Zha, and H. Simon, "Low-Rank Approximations with Sparse Factors II: Penalized Methods with Discrete Newton-Like Iterations," SIAM J. Matrix Analysis and Applications, vol. 25, pp. 901-920, 2004.
[32] H. Zha, C. Ding, M. Gu, X. He, and H. Simon, "Spectral Relaxation for K-Means Clustering," Advances in Neural Information Processing Systems 14, pp. 1057-1064, 2002.
[33] C. Ding and X. He, "K-Means Clustering and Principal Component Analysis," Proc. Int'l Conf. Machine Learning, 2004.

Chris Ding received the PhD degree from Columbia University. He did research at the California Institute of Technology, the Jet Propulsion Laboratory, and Lawrence Berkeley National Laboratory before joining the University of Texas at Arlington in 2007 as a professor of computer science. His main research areas are machine learning and bioinformatics. He serves on many program committees of international conferences and gave tutorials on spectral clustering and matrix models. He is an associate editor of the journal Data Mining and Bioinformatics and is writing a book on spectral clustering to be published by Springer. He is a member of the IEEE.

Tao Li received the PhD degree in 2004 from the University of Rochester. He is currently an associate professor in the School of Computer Science at Florida International University. His research interests are in data mining, machine learning, and information retrieval. He is a recipient of the US National Science Foundation CAREER Award and IBM Faculty Research Awards.

Michael I. Jordan received the master's degree from Arizona State University, and the PhD degree in 1985 from the University of California, San Diego. He was a professor at the Massachusetts Institute of Technology from 1988 to 1998. He is the Pehong Chen Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley. He has published more than 300 research articles on topics in electrical engineering, statistics, computer science, computational biology, and cognitive science. He was named a fellow of the American Association for the Advancement of Science (AAAS) in 2006 and was named a medallion lecturer of the Institute of Mathematical Statistics (IMS) in 2004. He is a fellow of the IEEE, a fellow of the IMS, a fellow of the AAAI, and a fellow of the ASA.