You are on page 1of 11

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO.

1, JANUARY 2010 45

Convex and Semi-Nonnegative


Matrix Factorizations
Chris Ding, Member, IEEE, Tao Li, and Michael I. Jordan, Fellow, IEEE

Abstract—We present several new variations on the theme of nonnegative matrix factorization (NMF). Considering factorizations of
the form X ¼ F GT , we focus on algorithms in which G is restricted to containing nonnegative entries, but allowing the data matrix X to
have mixed signs, thus extending the applicable range of NMF methods. We also consider algorithms in which the basis vectors of F
are constrained to be convex combinations of the data points. This is used for a kernel extension of NMF. We provide algorithms for
computing these new factorizations and we provide supporting theoretical analysis. We also analyze the relationships between our
algorithms and clustering algorithms, and consider the implications for sparseness of solutions. Finally, we present experimental
results that explore the properties of these new methods.

Index Terms—Nonnegative matrix factorization, singular value decomposition, clustering.

1 INTRODUCTION
analysis problems, including classification [14] and colla-
M ATRIX factorization is a unifying theme in numerical
linear algebra. A wide variety of matrix factorization
algorithms have been developed over many decades,
borative filtering [15]. A number of studies have focused on
further developing computational methodologies for NMF
providing a numerical platform for matrix operations such [16], [17], [18], [19]. Finally, researchers have begun to
as solving linear systems, spectral decomposition, and explore some of the relationships between matrix factoriza-
subspace identification. Some of these algorithms have also tions and K-means clustering [20], making use of the least
proven useful in statistical data analysis, most notably the square objectives of NMF; as we emphasize in the current
singular value decomposition (SVD), which underlies paper, this relationship has implications for the interpret-
principal component analysis (PCA). ability of matrix factors. NMF with the Kullback-Leibler
Recent work in machine learning has focused on matrix (KL) divergence objective has been shown [21], [13] to be
factorizations that directly target some of the special features equivalent to probabilistic latent semantic analysis [22],
of statistical data analysis. In particular, nonnegative matrix which has been further developed into the fully probabil-
factorization (NMF) [1], [2] focuses on the analysis of data istic latent Dirichlet allocation model [23], [24].
matrices whose elements are nonnegative, a common Our goal in this paper is to expand the repertoire of
occurrence in data sets derived from text and images. nonnegative matrix factorization. Our focus is on algorithms
Moreover, NMF yields nonnegative factors, which can be that constrain the matrix factors; we do not require the data
matrix to be similarly constrained. In particular, we develop
advantageous from the point of view of interpretability.
NMF-like algorithms that yield nonnegative factors but do
The scope of research on NMF has grown rapidly in
not require the data matrix to be nonnegative. This extends
recent years. NMF has been shown to be useful in a variety
the range of application of NMF ideas. Moreover, by focusing
of applied settings, including environmetrics [3], chemo-
on constraints on the matrix factors, we are able to strengthen
metrics [4], pattern recognition [5], multimedia data
the connections between NMF and K-means clustering. Note
analysis [6], text mining [7], [8], DNA gene expression
in particular that the result of a K-means clustering run can
analysis [9], [10], and protein interaction [11]. Algorithmic
be written as a matrix factorization X ¼ F GT , where X is the
extensions of NMF have been developed to accommodate a
data matrix, F contains the cluster centroids, and G contains
variety of objective functions [12], [13] and a variety of data
the cluster membership indicators. Although F typically has
entries with both positive and negative signs, G is non-
. C. Ding is with the Department of Computer Science and Engineering, negative. This motivates us to propose general factorizations
University of Texas at Arlington, Nedderman Hall, Room 307, 416 Yates in which G is restricted to be nonnegative and F is
Street, Arlington, TX 76019. E-mail: CHQDing@uta.edu. unconstrained. We also consider algorithms that constrain
. T. Li is with the School of Computer Science, Florida International F ; in particular, restricting the columns of F to be convex
University, ECS 318, 11200 SW 8th Street, Miami, FL 33199. combinations of data points in X we obtain a matrix
E-mail: taoli@cs.fiu.edu.
. M.I. Jordan is with the Department of Electrical Engineering and factorization that can be interpreted in terms of weighted
Computer Science (EECS) and the Department of Statistics, University cluster centroids.
of California, Berkeley, 387 Soda Hall #1776, Berkeley, CA 94720-1776. The paper is organized as follows: In Section 2, we present
E-mail: jordan@cs.berkeley.edu. the new matrix factorizations, and in Section 3, we present
Manuscript received 24 June 2007; revised 22 Jan. 2008; accepted 16 Oct. algorithms for computing these factorizations. Section 4
2008; published online 11 Nov. 2008. provides a theoretical analysis, which provides insights into
Recommended for acceptance by Z. Ghahramani. the sparseness of matrix factors for a convex variant of NMF.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number In Section 5, we consider extensions of Convex-NMF and the
TPAMI-2007-06-0385. relationships of NMF-like factorizations. In Section 5.1, we
Digital Object Identifier no. 10.1109/TPAMI.2008.277. show that a convex variant of NMF has the advantage that it is
0162-8828/10/$26.00 ß 2010 IEEE Published by the IEEE Computer Society
46 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 1, JANUARY 2010

readily kernelized. In Section 6, we present comparative Convex-NMF has an interesting property: The factors W
experiments that show that constraining the F factors to be and G both tend to be very sparse.
convex combinations of input data enhances their interpret- Lee and Seung [25] considered a model in which the F
ability. We also present experiments that compare the factors were restricted to the unit interval, i.e., 0  Fik  1.
performance of the NMF variants to K-means clustering, This so-called convex coding does not require the f k to be
where we assess the extent to which the imposition of nonnegative linear combinations of input data vectors and,
constraints that aim to enhance interpretability leads to thus, in general do not capture the notion of cluster centroid.
poorer clustering performance. Finally, we present our Indeed, the emphasis in [25] and in [1], [2] is the parts-of-
conclusions in Section 7. whole encoding provided by NMF, not the relationship of
nonnegative factorizations to vector quantization.
To summarize our development thus far, let us write the
2 SEMI-NMF AND CONVEX-NMF different factorizations as follows:
Let the input data matrix X ¼ ðx1 ; . . . ; xn Þ contain a
collection of n data vectors as columns. We consider SVD: X  F GT ; ð3Þ
factorizations of the form:
NMF: Xþ  Fþ GTþ ; ð4Þ
X  F GT ; ð1Þ
where X 2 IRpn , F 2 IRpk , and G 2 IRnk . For example, Semi-NMF: X  F GTþ ; ð5Þ
the SVD can be written in this form. In the case of the SVD,
there are no restrictions on the signs of F and G; moreover,
the data matrix X is also unconstrained. NMF can also be Convex-NMF: X  X Wþ GTþ ; ð6Þ
written in this form, where the data matrix X is assumed to
where the subscripts are intended to suggest the constraints
be nonnegative, as are the factors F and G. We now
imposed by the different factorizations.
consider some additional examples.
Before turning to a presentation of algorithms for
2.1 Semi-NMF computing Semi-NMF and Convex-NMF factorizations
When the data matrix is unconstrained (i.e., it may have and supporting theoretical analysis, we provide an illus-
mixed signs), we consider a factorization that we refer to as trative example.
Semi-NMF, in which we restrict G to be nonnegative while 2.3 An Illustration
placing no restriction on the signs of F .
Consider the following data matrix:
We can motivate Semi-NMF from the perspective of
clustering. Suppose we do a K-means clustering on X and 0 1
1:3 1:8 4:8 7:1 5:0 5:2 8:0
obtain cluster centroids F ¼ ðf 1 ; . . . ; f k Þ. Let G denote the B 1:5 6:9 3:9 5:5 8:5 3:9 5:5 C
cluster indicators, i.e., gik ¼ 1, if xi belongs to cluster ck , B C
X¼B B 6:5 1:6 8:2 7:2 8:7 7:9 5:2 C
C:
gik ¼ 0, otherwise. We can write the K-means clustering @ 3:8 8:3 4:7 6:4 7:5 3:2 7:4 A
objective function as
7:3 1:8 2:1 2:7 6:8 4:8 6:2
n X
X K
The K-means clustering produces two clusters, where the
JK -means ¼ gik kxi  f k k2 ¼ kX  F GT k2 :
i¼1 k¼1
first cluster includes the first three columns and the second
cluster includes the last four columns.
In this paper, kvk denotes the L2 norm of a vector v and We show that Semi-NMF and Convex-NMF factoriza-
kAk denotes the Frobenius norm of a matrix A. We see that tions gives clustering solutions which are identical to the
the K-means clustering objective can be alternatively K-means clustering results. We run SVD, Semi-NMF, and
viewed as an objective function for matrix approximation. Convex-NMF. The matrix factor G obtained for the three
Moreover, this approximation will generally be tighter if we factorizations are
relax the optimization by allowing gij to range over values  
in ð0; 1Þ or values in ð0; 1Þ. This yields the Semi-NMF 0:25 0:05 0:22 :45 :44 :46 :52
matrix factorization. GTsvd ¼ ;
0:50 0:60 0:43 0:30 0:12 0:01 0:31
2.2 Convex-NMF ð7Þ
While, in NMF and Semi-NMF, there are no constraints on  
the basis vectors F ¼ ðf 1 ; . . . ; f k Þ, for reasons of interpret- 0:61 0:89 0:54 0:77 0:14 0:36 0:84
GTsemi ¼ ; ð8Þ
ability it may be useful to impose the constraint that the 0:12 0:53 0:11 1:03 0:60 0:77 1:16
vectors defining F lie within the column space of X:
 
f ‘ ¼ w1‘ x1 þ    þ wn‘ xn ¼ Xw‘ ; or F ¼ XW : ð2Þ 0:31 0:31 0:29 0:02 0 0 0:02
GTcnvx ¼ : ð9Þ
0 0:06 0 0:31 0:27 0:30 0:36
Moreover, again for reasons of interpretability, we may
wish to restrict ourselves to convex combinations of the Both the Semi-NMF and Convex-NMF results agree with
columns of X. This constraint has the advantage that we the K-means clustering: For the first three columns, the
could interpret the columns f ‘ as weighted sums of certain values in the upper rows are larger than the lower rows,
data points; in particular, these columns would capture a indicating they are in the same cluster, while, for the last
notion of centroids. We refer to this restricted form of the four columns, the upper rows are smaller than the lower
F factor as Convex-NMF. Convex-NMF applies to both rows, indicating they are in another cluster. Note, however,
nonnegative and mixed-sign data matrices. As we will see, that Convex-NMF gives sharper indicators of the clustering.
DING ET AL.: CONVEX AND SEMI-NONNEGATIVE MATRIX FACTORIZATIONS 47

t” are Fcnvx factors.


Fig. 1. Four random data sets, each with three clusters. “/” are Fsemi factors and “u

The computed basis vectors F for the different matrix particularly far from those in CKmeans : ðFsemi Þ1;1 ¼ 0:05
factorizations are as follows: versus ðCKmeans Þ1;1 ¼ 0:29 versus and ðFsemi Þ4;2 ¼ 0:08 versus
0 1 0 1 ðCKmeans Þ4;2 ¼ 0:36. Thus, restrictions on F can have large
0:41 0:50 0:05 0:27 effects on subspace factorization. Convex-NMF gives
B C B C
B 0:35 0:21 C B 0:40 0:40 C F factors that are closer to cluster centroids, validating
B C B C
Fsvd ¼ B
B 0:66 0:32 C B C
C; Fsemi ¼ B 0:70 0:72 C; our expectation that this factorization produces centroid-
B C B C like factors. More examples are given in Fig. 1.
@ 0:28 0:72 A @ 0:30 0:08 A
Finally, computing the residual values, we have kX 
0:43 0:28 0:51 0:49
0 1 F GT k ¼ 0:27940; 0:27944; 0:30877; for SVD, Semi-NMF,
0:31 0:53 and Convex-NMF, respectively. We see that the enhanced
B C
B 0:42 0:30 C interpretability provided by Semi-NMF is not accompanied
B C
Fcnvx ¼ B
B 0:56 0:57 C;
C by a degradation in approximation accuracy relative to the
B C SVD. The more highly constrained Convex-NMF involves a
@ 0:49 0:41 A
modest degradation in accuracy.
0:41 0:36
We now turn to a presentation of algorithms for computing
and the cluster centroids obtained from K-means clustering the two new factorizations, together with theoretical results
are given by the columns of the following matrix: establishing convergence of these algorithms.
0 1
0:29 0:52
B 0:45 0:32 C 3 ALGORITHMS AND ANALYSIS
B C
CKmeans ¼ B
B 0:59 0:60 C:
C
In this section, we provide algorithms and accompanying
@ 0:46 0:36 A analysis for the NMF factorizations that we presented in the
0:41 0:37
previous section.
We have rescaled all column vectors so that their L2 -norm is
one for purposes of comparison. 3.1 Algorithm for Semi-NMF
One can see that Fcnvx is close to CKmeans : kFcnvx  We compute the Semi-NMF factorization via an iterative
CKmeans k ¼ 0:08. Fsemi deviates substantially from CKmeans : updating algorithm that alternatively updates F and G:
kFsemi  CKmeans k ¼ 0:53. Two of the elements in Fsemi are (S0) Initialize G. Do a K-means clustering. This gives cluster
48 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 1, JANUARY 2010

indicators G: Gik ¼ 1, if xi belongs to cluster k. Otherwise, It is easy to see that the limiting solution of the update
Gik ¼ 0. Add a small constant (we use the value 0.2 in rule of (11) satisfies the fixed point equation. At
practice) to all elements of G. See Section 3.3 for more convergence, Gð1Þ ¼ Gðtþ1Þ ¼ GðtÞ ¼ G, i.e.,
discussion on initialization. (S1) Update F (while fixing G) sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi


using the rule ðXT F Þþ ik þ ½GðF F Þ ik


T
Gik ¼ Gik þ : ð16Þ
ðXT F Þ ik þ ½GðF F Þ ik
T

F ¼ XGðGT GÞ1 : ð10Þ


Note
T
Note G G is a k  k positive semidefinite matrix. The
inversion of this small matrix is trivial. In most cases, GT G F T F ¼ ðF T F Þþ  ðF T F Þ ; F T X ¼ ðF T XÞþ  ðF T XÞ :
is nonsingular. When GT G is singular, we take the pseudo- Thus, (16) reduces to
inverse. (S2) Update G (while fixing F ) using
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

ð2X T F þ 2GF T F Þik G2ik ¼ 0: ð17Þ
ðXT F Þþ ik þ ½GðF F Þ ik
T
Gik Gik þ ; ð11Þ Equation (17) is identical to (15). Both the equations
ðXT F Þ ik þ ½GðF F Þ ik
T
require that at least one of the two factors is equal to zero.
where we separate the positive and negative parts of a The first factor in both equations is identical. For the
matrix A as second factor Gik or G2ik , if Gik ¼ 0, then G2ik ¼ 0, and vice
versa. Thus, if (15) holds, (17) also holds and vice versa.t u

ik ¼ ðjAik j þ Aik Þ=2; A
ik ¼ ðjAik j  Aik Þ=2: ð12Þ
Next, we prove the convergence of the iterative update
The computational complexity for Semi-NMF is of order algorithm. We need to state two propositions that are used
mðpnk þ nk2 Þ for Step (S1) and of order mðnpk þ kp2 þ in the proof of convergence.
n2 kÞ for (11), where m 100 is the number of iterations
to convergence. Proposition 3. The residual of (13) is monotonically decreasing
(nonincreasing) under the update given in (11) for fixed F .
Theorem 1. 1) Fixing F , the residual kX  F GT k2 decreases
monotonically (i.e., it is nonincreasing) under the update rule Proof. We write JðHÞ as
for G. 2) Fixing G, the update rule for F gives the optimal
JðHÞ ¼ Trð2H T Bþ þ 2H T B þ HAþ H T  HA H T Þ;
solution to minF kX  F Gk2 .
ð18Þ
Proof. We first prove part 2). The objective function that we
T T
minimize is the following sum of squared residuals: where A ¼ F F , B ¼ X F , and H ¼ G.
We use the auxiliary function approach [2]. A function
J ¼ kX  F GT k2 ¼ Tr ðXT X  2XT F GT þ GF T F GT Þ: ~ is called an auxiliary function of JðHÞ if it satisfies
ZðH; HÞ
ð13Þ ~
JðHÞ;
ZðH; HÞ ZðH; HÞ ¼ JðHÞ; ð19Þ
Fixing G, the solution for F is obtained by computing
~ Define
for any H; H.
dJ=dF ¼ 2XG þ 2F GT G ¼ 0. This gives the solution
F ¼ XGðGT GÞ1 .
H ðtþ1Þ ¼ arg min ZðH; H ðtÞ Þ; ð20Þ
To prove part 1), we now fix F and solve for G while H
imposing the restriction G
0. This is a constrained
where we note that we require the global minimum. By
optimization problem. We present two results: 1) We
show that, at convergence, the limiting solution of the construction, we have JðH ðtÞ Þ ¼ ZðH ðtÞ ; H ðtÞ Þ
ZðH ðtþ1Þ ;
update rule of (11) satisfies the KKT condition. This is H ðtÞ Þ
JðH ðtþ1Þ Þ. Thus, JðH ðtÞ Þ is monotone decreasing
established in Proposition 2 below. This proves the (nonincreasing). The key is to find 1) appropriate ZðH; HÞ ~
correctness of the limiting solution. 2) We show that the and 2) its global minimum. According to Proposition 4
iteration of the update rule of (11) converges. This is (see below), ZðH; H 0 Þ defined in (21) is an auxiliary
established in Proposition 3 below. u
t function of J and its minimum is given by (22). According
Proposition 2. The limiting solution of the update rule in (11) to (20), H ðtþ1Þ H and H ðtÞ H 0 ; substituting A ¼ F T F ,
T
satisfies the KKT condition. B ¼ F X, and H ¼ G, we recover (11). u
t
Proof. We introduce the Lagrangian function
Proposition 4. Given the objective function J defined as in (18),
LðGÞ ¼ Tr ð2XT F GT þ GF T F GT  GT Þ; ð14Þ where all matrices are nonnegative, the following function,
where the Lagrangian multipliers ij enforce nonnega- X  
Hik
tive constraints, Gij
0. The zero gradient condition ZðH; H 0 Þ ¼  2Bþ H 0
ik ik 1 þ log 0
@L Hik
gives @G ¼ 2XT F þ 2GF T F   ¼ 0. From the comple- ik

mentary slackness condition, we obtain X Hik2


þ H 0 2ik X ðH 0 Aþ Þik Hik
2
þ B
ik 0 þ 0 ð21Þ
T T ik
Hik ik
Hik
ð2X F þ 2GF F Þik Gik ¼ ik Gik ¼ 0: ð15Þ  
X Hik Hi‘
This is a fixed point equation that the solution must  A H 0
H
k‘ ik i‘
0
1 þ log 0 H0 ;
ik‘
Hik i‘
satisfy at convergence.
DING ET AL.: CONVEX AND SEMI-NONNEGATIVE MATRIX FACTORIZATIONS 49

is an auxiliary function for JðHÞ, i.e., it satisfies the 4½ðBþ Þik þ ðH 0 A Þik Hik
0
B 0 þ
ik þ ðH A Þik
Yik ¼ 2
þ 2 0 :
requirements JðHÞ  ZðH; H 0 Þ and JðHÞ ¼ ZðH; HÞ. Hik Hik
Furthermore, it is a convex function in H and its global Thus, ZðH; H 0 Þ is a convex function of H. Therefore,
minimum is
we obtain the global minimum by setting @ZðH; H 0 Þ=
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi @Hik ¼ 0 in (26) and solving for H. Rearranging, we
0 0 Bþ ik þ ðH A Þik
0 
Hik ¼ arg min ZðH; H Þ ¼ Hik  : ð22Þ obtain (22). u
t
H Bik þ ðH 0 Aþ Þik

Proof. The function JðHÞ is Proposition 5. For any matrices A 2 IRnn


þ , B 2 IRþ ,
kk
nk 0 nk
S 2IRþ , S 2 IRþ , with A and B symmetric, the following
JðHÞ ¼ Trð2H T Bþ þ 2H T B þ HAþ H T  HA H T Þ: inequality holds:
ð23Þ
X
n X
k ðAS 0 BÞ S 2
ip ip
We find upper bounds for each of the two positive terms, 0
TrðS T ASBÞ: ð27Þ
i¼1 p¼1
Sip
and lower bounds for each of the two negative terms. For
the third term in JðHÞ, using Proposition 5 (see below) 0
and setting A I; B Aþ , we obtain an upper bound Proof. Let Sip ¼ Sip uip . Using an explicit index, the
difference  between the left-hand side and the right-
X ðH 0 Aþ Þ H 2 hand side can be written as
TrðHAþ H T Þ  0
ik ik
:
ik
Hik
X
n X
k  2 
0 0
The second term of JðHÞ is bounded by ¼ Aij Sjq Bqp Sip uip  uip ujq :
i;j¼1 p;q¼1
X X 2
Hik þ H 0 2ik
TrðH T B Þ ¼ Hik B B Because A and B are symmetric, this is equal to
ik  ik 0 ;
ik ik
2Hik !
X n X k u 2
ip þ u 2
jq
0 0
using the inequality a  ða þ b2 Þ=2b, which holds for
2 ¼ Aij Sjq Bqp Sip  uip ujq
i;j¼1 p;q¼1
2
any a; b > 0.
To obtain lower bounds for the two remaining terms, 1X n X k  2
0 0
we use the inequality z
1 þ logz, which holds for any ¼ Aij Sjq Bqp Sip uip  ujq
0:
2 i;j¼1 p;q¼1
z > 0, and obtain
t
u
Hik Hik
0
1 þ log H 0 ;
Hik
ð24Þ In the special case in which B ¼ I and S is a column
ik
vector, this result reduces to a result due to [2].
and
3.2 Algorithm for Convex-NMF
Hik Hi‘ Hik Hi‘
0 0
1 þ log 0 0 : ð25Þ We describe the algorithm for computing the Convex-NMF
Hik Hi‘ Hik Hi‘
factorization when X has mixed sign (denoted as X ). When
From (24), the first term in JðHÞ is bounded by X is nonnegative, the algorithm is simplified in a natural way.
  (C0) Initialize W and G. There are two methods. 1) Fresh
X X Hik
T þ
TrðH B Þ ¼ þ
Bik Hik
þ 0
Bik Hik 1 þ log 0 : start. Do a K-means clustering. Let the obtained cluster
ik ik
Hik indicators be H ¼ ðh1 ; . . . ; hk Þ, Hik ¼ f0; 1g. Then set
From (25), the last term in JðHÞ is bounded by Gð0Þ ¼ H þ 0:2E, where E is a matrix of all 1s. The cluster
  centroids can be computed as f k ¼ Xhk =nk , or F ¼ XHD1 n ,
X Hik Hi‘ where Dn ¼ diagðn1 ; . . . ; nk Þ. Thus, W ¼ HD1 . We smooth
 T  0 0 n
TrðHA H Þ
Ak‘ Hik Hi‘ 1 þ log 0 0 :
ik‘
Hik Hi‘ W and set W ð0Þ ¼ ðH þ 0:2EÞD1 n . 2) Suppose we already
have an NMF or Semi-NMF solution. In this case, G is
Collecting all bounds, we obtain ZðH; H 0 Þ as in (21). known and we set Gð0Þ ¼ G þ 0:2E. We solve X ¼ XW GT
Obviously, JðHÞ  ZðH; H 0 Þ and JðHÞ ¼ ZðH; HÞ.
for W . This leads to W ¼ GðGT GÞ1 . Since W must be
To find the minimum of ZðH; H 0 Þ, we take
nonnegative, we set W ð0Þ ¼ W þ þ 0:2EhW þ i, where hAi ¼
þ
P
@ZðH; H 0 Þ ij jAij j=kAk0 and where kAk0 is the number of nonzero
0
Hik Hik 2ðH 0 A Þik Hik
¼ 2Bþik þ 2Bik 0 þ 0 elements in A.
@Hik Hik Hik Hik
0 
0
ð26Þ Then update Gþ and Wþ alternatively until convergence
ðH A Þik Hik
2 : as follows: (C1) Update Gþ using
Hik
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
u
The Hessian matrix containing the second derivatives u½ðXT XÞþ W  þ ½GW T ðXT XÞ W 
Gik Gik t 
ik
þ
ik
: ð28Þ
@ 2 ZðH; H 0 Þ ½ðXT XÞ W ik þ ½GW T ðXT XÞ W ik
¼ ij k‘ Yik ;
@Hik @Hj‘
This can be derived in a manner similar to (11), replacing
is a diagonal matrix with positive entries F by XW ; (C2) Update Wþ using
50 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 1, JANUARY 2010
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X  
u Hik
u½ðXT XÞþ G þ ½ðXT XÞ W GT G ZðH; H 0 Þ ¼  2Bþ 0
t ik ik ik Hik 1 þ log 0
Wik Wik  þ : ð29Þ ik
Hik
½ðXT XÞ Gik þ ½ðXT XÞ W GT Gik
X 2
Hik þ H 0 2ik X ðA H 0 CÞik Hik

þ B
ik 0 þ 0 ð34Þ
The computational complexity for convex-NMF is of Hik Hik
ik ik
order n2 p þ mð2n2 k þ nk2 Þ for (28) and is of order mð2n2 k þ !
X Hjk Hi‘
 0 0
2nk2 Þ for (29), where m 100 is the number of iterations to  Aij Hjk Ck‘ Hi‘ 1 þ log 0 0 ;
ijk‘
Hjk Hi‘
convergence. These are matrix multiplications and can be
computed efficiently on most computers. is an auxiliary function of JðHÞ, and furthermore,
The correctness and convergence of the algorithm are ZðH; H 0 Þ is a convex function of H and its global
addressed in the following: minimum is
Theorem 6. Fixing G, under the update rule for W of (29), 1) the sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
residual kX  XW GT k2 decreases monotonically (nonincreas- 0 Bþ ik þ ðA H CÞik
 0
Hik ¼ arg min ¼ Hik  þ : ð35Þ
ing), and 2) the solution converges to a KKT fixed point. H Bik þ ðA H 0 CÞik

From its minima and setting H ðtþ1Þ H and H ðtÞ H0,


þ 
The proof of part 2) is given by Proposition 7, which þ T 
we recover (29), letting B ¼ ðX XÞ G, ¼ ðX XÞ G,T

ensures the correctness of the solution. The proof of part 1) A ¼ XT X, C ¼ GT G, and H ¼ W . u


t
is given by Proposition 8, which ensures the convergence of
3.3 Some Generic Properties of NMF Algorithms
the algorithm.
First, we note all these multiplicative updating algorithms
Proposition 7. The limiting solution of update rule of (29) are guaranteed to converge to a local minimum, but not
satisfies the KKT condition. necessarily to a global minimum. This is also true for many
Proof. We minimize other algorithms that have a clustering flavor, including
K-means, EM for mixture models, and spectral clustering.
J2 ¼ kX  XW GT k2 Practical experience suggests that K-means clustering tends
¼ Tr ðX T X  2GT X T XW þ W T XT XW GT GÞ; to converge to a local minimum that is close to the initial
guess, whereas NMF and EM tend to explore a larger range
where X 2 IRpn , W 2 IRnk kn
þ , and G 2 IRþ . The mini- of values.
mization with respect to G is the same as in Semi-NMF. Second, we note that NMF updating rules are invariant
We focus on the minimization with respect to W ; that is, with respect to rescaling of NMF. By rescaling, we mean
we minimize F GT ¼ ðF D1 ÞðGDT ÞT ¼ FeG eT , where D is a k-by-k positive
diagonal matrix. Under this rescaling, (11) becomes
JðW Þ ¼ Tr ð2GT XT XW þ W T XT XW GT GÞ: ð30Þ vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
u
We can easily obtain the KKT complementarity uðX T FeÞþ þ ½Gð e FeT FeÞ 
e
Gik Gik t
e ik ik
: ð36Þ
condition T e  e e T
e þ
ðX F Þ þ ½GðF F Þ 
ik ik
ðXT XG þ XT XW GT GÞik Wik ¼ 0: ð31Þ Since ðXT FeÞik ¼ ðX T F Þik D1 ee e T
e eT 1
kk , ðGF F Þik ¼ ðGF F Þik Dkk ,
Next, we can show that the limiting solution of the and ðGÞe ¼ ðGÞ Dkk , (36) is identical to (11).
ik ik
update rule of (29) satisfies Third, we note that the convergence rate of the NMF
multiplicative updating algorithm is generally of first order.
ðXT XG þ XT XW GT GÞik Wik2 ¼ 0: ð32Þ To see this, we set  ¼ ðG F
Þ and view the updating
ðtþ1Þ
These two equations are identical for the same reasons that algorithms as a mapping  ¼ MððtÞ Þ. At convergence,

(17) is identical to (15). Thus, the limiting solution of the  ¼ Mð Þ. The objective functions have been proved to
update rule satisfies the KKT fixed point condition. u
t be nonincreasing, Jððtþ1Þ Þ  JððtÞ Þ. Following Xu and
Jordan [26], we expand1  ’ Mð Þ þ ð@M=@Þð   Þ.
Therefore,
Proposition 8. The residual, (30), decreases monotonically (it is
 
nonincreasing). Thus, the algorithm converges.  ðtþ1Þ  @M   ðtÞ 
        
 @      ;
Proof. We write JðW Þ as

JðHÞ ¼ Tr ð2H T Bþ þ 2H T B þ H T Aþ HC under an appropriate matrix norm. In general, @M=@ 6¼ 0.


ð33Þ Thus, these updating algorithms have a first-order conver-
 H T A HCÞ; gence rate, which is the same as the EM algorithm [26].
where B ¼ XT XG, A ¼ XT X, C ¼ GT G, and H ¼ W . Fourth, we note that there are many ways to initialize
JðHÞ differs from JðHÞ of (23) in that the last two terms NMF. In our paper, we use the equivalence between NMF
and relaxed K-means clustering to initialize F and G to the
has four matrix factors instead of three. Following the
K-means clustering solution. Lee and Seung [2] suggest
proof of Proposition 4, with the aid of Proposition 5, we
can prove that the following function, 1. Note that a nonnegativity constraint needs to be enforced.
DING ET AL.: CONVEX AND SEMI-NONNEGATIVE MATRIX FACTORIZATIONS 51

random initialization. An SVD-based initialization has The main point of Lemma 9 is that the solutions to
recently been proposed by Boutsidis and Gallopoulos [27]. minW ;G kI  W GT k2 are the sparsest possible rank-K
See more initialization references in [27], [17]. matrices W ; G. Now, returning to our characterization of
Convex-NMF in (37), we can write
X 
4 SPARSITY OF CONVEX-NMF kI  W GT k2 ¼ eT ðI  W GT Þ2 :
k
In the original presentation of NMF, Lee and Seung [1] k

emphasized the desideratum of sparsity. For example, in the Comparing to the Convex-NMF case, we see that the
case of image data, it was hoped that NMF factors would projection of ðI  W GT Þ onto the principal components has
correspond to a coherent part of the original image, for more weight while the projection of ðI  W GT Þ onto the
example, a nose or an eye; these would be sparse factors in nonprincipal components has less weight. Thus, we conclude
which most of the components would be zero. Further that sparsity is enforced strongly in the principal component
experiments have shown, however, that NMF factors are not subspace and weakly in the nonprincipal component sub-
necessarily sparse, and sparsification schemes have been space. Overall, Lemma 9 provides a basis for concluding that
developed on top of NMF [16], [5]. Parts-of-whole repre- Convex-NMF tends to yield sparse solutions.
sentations are not necessarily recovered by NMF, but A more intuitive understanding of the source of the
conditions for obtaining parts-of-whole representations have sparsity can be obtained by counting parameters. Note, in
been discussed [28]. See also [29], [30], and [31] for related particular, that Semi-NMF is based on Nparam ¼ kp þ kn
literature on sparse factorizations in the context of PCA. parameters, whereas Convex-NMF is based on Nparam ¼
Interestingly, the Convex-NMF factors W and G are 2kn parameters. Considering the usual case n > p (i.e., the
naturally sparse. We provide theoretical support for this number of data points is more than the data dimension),
Convex-NMF has more parameters than Semi-NMF. But we
assertion in this section, and provide additional experi-
know that Convex-NMF is a special case of Semi-NMF. The
mental support in Section 6. (Sparseness can also be seen in
resolution of these two contradicting facts is that some of
the illustrative example presented in Section 2.3.)
the parameters in Convex-NMF must be zero.
We first note that Convex-NMF can be reformulated as
X  2
min 2k vTk ðI  W GT Þ ; s:t: W 2 Rnk
þ ; G 2 Rkn
þ ; 5 ADDITIONAL REMARKS
W ;G
0
k
Convex-NMF stands out for its interpretability and its
ð37Þ
sparsity properties. In this section, we consider two
where we use the SVD of X ¼ UV T and thus have additional interesting aspects of Convex-NMF, and we also
P consider the relationship of all of the NMF-like factoriza-
X T X ¼ k 2k vk vTk . Therefore, kX  XW GT k2 ¼ TrðI 
P tions that we have developed to K-means clustering.
GW T ÞXT XðI  W GT Þ ¼ k 2k kvTk ðI  W GT Þk2 . We now
claim that 1) this optimization will produce a sparse 5.1 Kernel-NMF
solution for W and G, and 2) the more slowly k decreases, Consider a mapping xi ! ðxi Þ or X ! ðXÞ ¼ ððx1 Þ; . . . ;
the sparser the solution. ðxn ÞÞ. A standard NMF or Semi-NMF factorization ðXÞ 
This second part of our argument is captured in a lemma. F GT would be difficult to compute since F and G depend
Lemma 9. The solution of the following optimization problem, explicitly on the mapping function ðÞ. However, Convex-
NMF provides an appealing solution of this problem:
min kI  W GT k2 ; s:t: W ; G 2 RnK
þ ;
W ;G
0 ðXÞ  ðXÞW GT :
is given by W ¼ G ¼ any K columns of ðe1    eK Þ, where ek Indeed, it is easy to see that the minimization objective
is a basis vector: ðek Þi6¼k ¼ 0, ðek Þi¼k ¼ 1.
Proof. We first prove the result for a slightly more general kðXÞ  ðXÞW GT k2 ¼ Tr½ðXÞT ðXÞ  2GT T ðXÞðXÞW
case. Let D ¼ diagðd1 ; . . . ; dn Þ be a diagonal matrix and þ W T T ðXÞðXÞW GT G
let d1 > d2 >    > dn > 0. The pffiffiffiffiffi lemma holds
pffiffiffiffi
ffi if we replace
depends only on the kernel K ¼ T ðXÞðXÞ. In fact, the
I by D and W ¼ G ¼ ð d1 e1    d1 eK Þ.2 The proof
update rules for Convex-NMF presented in (29) and (28)
follows from the fact that we have a unique spectral
P depend on XT X only. Thus, it is possible to “kernelize”
expansion Dek ¼ dk ek and D ¼ nk¼1 dk ek eTk . Now we
Convex-NMF in a manner analogous to the kernelization of
take the limit: d1 ¼    ¼ dn ¼ 1. The spectral expansion
P PCA and K-means.
is not unique: I ¼ nk¼1 uk uTk for any orthogonal basis
ðu1 ; . . . ; un Þ ¼ U. However, due the nonnegativity con- 5.2 Cluster-NMF
straint, ðe1    en Þ is the only viable basis. Thus, W ¼ G ¼ In Convex-NMF, we require the columns of F to be convex
for any K columns of ðe1    en Þ. u
t combinations of input data. Suppose now that we interpret
the entries of G as posterior cluster probabilities. In this
case, the cluster centroids can be computed as f k ¼ Xgk =nk ,
2. In NMF, the degree of freedom of diagonal rescalingpis ffiffiffiffi always or F ¼ XGD1 n , where Dn ¼ diagðn1 ; . . . ; nk Þ. The extra
present. Indeed, let E ¼ ðe1    eK Þ. Our p
choice
ffiffiffiffi offfiffiffiffiW ¼ G ¼ E D can be
p
written in different ways W G ¼ ðE DÞðE DÞ ¼ ðED ÞðE T D1 ÞT ,
T T  degree of freedom for F is not necessary. Therefore, the
where 1 <  < 1. pair of desiderata: 1) F encodes centroids, and 2) G encodes
52 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 1, JANUARY 2010

posterior probabilities motivates a factorization X  factorizations, we display the resulting F factors. We see
12
XGD1n G T
. We can absorb Dn into G and solve for that the Semi-NMF factors (denoted Fsemi in the figure) tend
to lie distant from the cluster centroids. On the other hand,
Cluster-NMF: X  XGþ GTþ : ð38Þ the Convex-NMF factors (denoted Fcnvx ) almost always lie
We call this factorization Cluster-NMF because the degree of within the clusters.
freedom in this factorization is the cluster indicator G, as in 6.2 Real-Life Data Sets
a standard clustering problem. The objective function is
We conducted experiments on the following data sets:
J ¼ kX  XGGT k2 .
Ionosphere and Wave from the UCI repository, the document
5.3 Relation to Relaxed K-Means Clustering data sets URCS, WebkB4, Reuters (using a subset of the data
NMF, Semi-NMF, Convex-NMF, Cluster-NMF, and Kernel- collection which includes the 10 most frequent categories),
NMF all have K-means clustering interpretations when the WebAce, and a data set which contains 1,367 log messages
factor G is orthogonal (GT G ¼ I). Orthogonality and collected from several different machines with different
nonnegativity together imply that each row of G has only operating systems at the School of Computer Science at
one nonnegative element, i.e., G is a bona fide cluster Florida International University. The log messages are
indicator. This relationship to clustering is made more grouped into nine categories: configuration, connection, create,
precise in the following theorem. dependency, other, report, request, start, and stop. Stop words
Theorem 10. G-orthogonal NMF, Semi-NMF, Convex-NMF, were removed using a standard stop list. The top 1,000 words
Cluster-NMF, and Kernel-NMF are all relaxations of were selected based on frequencies.
K-means clustering. Table 1 summarizes the data sets and presents our
experimental results. These results are averages over 10 runs
Proof. For NMF, Semi-NMF, and Convex-NMF, we first for each data set and algorithm.
eliminate F . The objective is J ¼ kX  F GT k2 ¼ We compute clustering accuracy using the known class
TrðXT X  2XT F GT þ F F T Þ. Setting @J=@F ¼ 0, we ob- labels. This is done as follows: The confusion matrix is first
tain F ¼ XG. Thus, we obtain J ¼ TrðXT X  GT X T XGÞ.
computed. The columns and rows are then reordered so as to
For Cluster-NMF, we obtain the same result directly:
maximize the sum of the diagonal. We take this sum as a
J ¼ kX  XGGT k2 ¼ TrðXT X  GT XT XGÞ. For Kernel-
measure of the accuracy: It represents the percentage of data
N MF , we h a v e J ¼ kðXÞ  ðXÞW GT k2 ¼ TrðK 
points correctly clustered under the optimized permutation.
GT KW þ W T KW Þ, where K is the kernel. Setting
To measure the sparsity of G in the experiments, we
@J=@W ¼ 0, we have KG ¼ KW . Thus, J ¼ TrðXT X 
compute the average of each column of G and set all
GT KGÞ. In all five of these cases, the first terms are
elements below 0.001 times the average to zero. We report
constant and do not affect the minimization. The mini-
the number of the remaining nonzero elements as a
mization problem thus becomes maxGT G¼I TrðGT KGÞ,
percentage of the total number of elements. (Thus, small
where K is either a linear kernel XT X or hðXÞ; ðXÞi. It
is known that this is identical to (kernel) K-means values of this measure correspond to large sparsity.)
clustering [32], [33]. u
t A consequence of the sparsity of G is that the rows of G
tend to become close to orthogonal. This indicates a hard
In the definitions of NMF, Semi-NMF, Convex-NMF, clustering (if we view G as encoding posterior probabilities
Cluster-NMF, and Kernel-NMF, G is not restricted to be for clustering). We compute the normalized orthogonality,
orthogonal; these NMF variants are soft versions of ðGT GÞnm ¼ D1=2 ðGT GÞD1=2 , where D ¼ diagðGT GÞ. Thus,
K-means clustering. diag½ðGT GÞnm  ¼ I. We report the average of the off-
diagonal elements in ðGT GÞnm as the quantity “Deviation
6 EXPERIMENTS from Orthogonality” in the table.
From the experimental results, we observe the following:
We first present the results of an experiment on synthetic
data, which aims to verify that Convex-NMF can yield 1. All of the matrix factorization models are better
factorizations that are close to cluster centroids. We then than K-means on all of the data sets. This is our
present experimental results for real data comparing principal empirical result. It indicates that the NMF
K-means clustering and the various factorizations. family is competitive with K-means for the pur-
poses of clustering.
6.1 Synthetic Data Set 2. On most of the nonnegative data sets, NMF gives
One main theme of our work is that the Convex-NMF somewhat better accuracy than Semi-NMF and
variants may provide subspace factorizations that have Convex-NMF (with WebKb4 the exception). The
more interpretable factors than those obtained by other differences are modest, however, suggesting that the
NMF variants (or PCA). In particular, we expect that, in more highly constrained Semi-NMF and Convex-
some cases, the factor F will be interpretable as containing NMF may be worthwhile options if interpretability
cluster representatives (centroids) and G will be interpre- is viewed as a goal of a data analysis.
table as encoding cluster indicators. We begin with a simple 3. On the data sets containing both positive and
investigation of this hypothesis. In Fig. 1, we randomly negative values (where NMF is not applicable) the
generate four two-dimensional data sets with three clusters Semi-NMF results are better in terms of accuracy
each. Computing both the Semi-NMF and Convex-NMF than the Convex-NMF results.
DING ET AL.: CONVEX AND SEMI-NONNEGATIVE MATRIX FACTORIZATIONS 53

TABLE 1
Data Set Descriptions and Results

4. In general, Convex-NMF solutions are sparse, while Finally, note that NMF is initialized randomly for the
Semi-NMF solutions are not. different runs. We investigated the stability of the solution
5. Convex-NMF solutions are generally significantly over multiple runs and found that NMF converges to
more orthogonal than Semi-NMF solutions. solutions F and G that are very similar across runs; moreover,
the resulting discretized cluster indicators were identical.
6.3 Shifting Mixed-Sign Data to Nonnegative
While our algorithms apply directly to mixed-sign data, it is
also possible to consider shifting mixed-sign data to be 7 CONCLUSIONS
nonnegative by adding the smallest constant so all entries We have presented a number of new nonnegative matrix
are nonnegative. We performed experiments on data shifted factorizations. We have provided algorithms for these
in this way for the Wave and Ionosphere data. For Wave, factorizations and theoretical analysis of the convergence
the accuracy decreases to 0.503 from 0.590 for Semi-NMF of these algorithms. The ability of these algorithms to deal
and decreases to 0.5297 from 0.5738 for Convex-NMF. The with mixed-sign data makes them useful for many
sparsity increases to 0.586 from 0.498 for Convex-NMF. For applications, particularly given that covariance matrices
Ionosphere, the accuracy decreases to 0.647 from 0.729 for are often centered.
Semi-NMF and decreases to 0.618 from 0.6877 for Convex- Semi-NMF offers a low-dimensional representation of
NMF. The sparsity increases to 0.829 from 0.498 for Convex- data points which lends itself to a convenient clustering
NMF. In short, the shifting approach does not appear to interpretation. Convex-NMF further restricts the basis
provide a satisfactory alternative.
vectors to be convex combinations of data points,
6.4 Flexibility of NMF providing a notion of cluster centroids for the basis. We
A general conclusion is that NMF almost always performs also briefly discussed additional NMF algorithms—Kernel-
NMF and Cluster-NMF—that are further specializations of
better than K-means in terms of clustering accuracy while
Convex-NMF.
providing a matrix approximation. We believe this is due to
We also showed that the NMF variants can be viewed as
the flexibility of matrix factorization as compared to the
rigid spherical clusters that the K-means clustering objec- relaxations of K-means clustering, thus, providing a closer
tive function attempts to capture. When the data distribu- tie between NMF and clustering than has been present in
tion is far from a spherical clustering, NMF may have the literature to date. Moreover, our empirical results
advantages. Fig. 2 gives an example. The data set consists of showed that the NMF algorithms all outperform K-means
two parallel rods in 3D space containing 200 data points. clustering on all of the data sets that we investigated in
The two central axes of the rods are 0.3 apart and they have terms of clustering accuracy. We view these results as
diameter 0.1 and length 1. As seen in the figure, K-means indicating that the NMF family is worthy of further
gives a poor clustering, while NMF yields a good clustering. investigation. We view Semi-NMF and Convex-NMF as
Fig. 2c shows the differences particularly worthy of further investigation, given their
P in the columns of G (each
column is normalized to i gk ðiÞ ¼ 1). The misclustered native capability for handling mixed-sign data and their
points have small differences. particularly direct connections to clustering.
54 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 1, JANUARY 2010

Fig. 2. A data set with two clusters in 3D. (a) Clusters obtained using K-means, as indicated by either “r” or “t u.” (b) Clusters obtained using NMF.
(c) The difference g2 ðiÞ  g1 ðiÞ; i ¼ 1; . . . ; 200, “r” for those misclustered points, and “u
t” for correctly clustered points.

[7] W. Xu, X. Liu, and Y. Gong, “Document Clustering Based on Non-


ACKNOWLEDGMENTS Negative Matrix Factorization,” Proc. ACM Conf. Research and
The authors would like to thank the anonymous reviewers Development in Information Retrieval (SIGIR), pp. 267-273, 2003.
[8] V.P. Pauca, F. Shahnaz, M. Berry, and R. Plemmons, “Text Mining
and the editor for suggesting various changes. Chris Ding is Using Non-Negative Matrix Factorization,” Proc. SIAM Int’l Conf.
supported in part by the University of Texas STARS Award Data Mining, pp. 452-456, 2004.
and US National Science Foundation (NSF) grant DMS- [9] J.-P. Brunet, P. Tamayo, T. Golub, and J. Mesirov, “Metagenes and
Molecular Pattern Discovery Using Matrix Factorization,” Proc.
0844497. Tao Li is partially supported by the NSF CAREER Nat’l Academy of Sciences USA, vol. 102, no. 12, pp. 4164-4169, 2004.
Award IIS-0546280 and NSF grant DMS-0844513. Michael [10] H. Kim and H. Park, “Sparse Non-Negative Matrix Factorizations
Jordan is supported in part by a grant from Microsoft via Alternating Non-Negativity-Constrained Least Squares for
Microarray Data Analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495-
Research and by NSF grant 0509559. 1502, 2007.
[11] D. Greene, G. Cagney, N. Krogan, and P. Cunningham, “Ensemble
Non-Negative Matrix Factorization Methods for Clustering
REFERENCES Protein-Protein Interactions,” Bioinformatics, vol. 24, no. 15,
[1] D. Lee and H.S. Seung, “Learning the Parts of Objects by Non- pp. 1722-1728, 2008.
Negative Matrix Factorization,” Nature, vol. 401, pp. 788-791, 1999. [12] I. Dhillon and S. Sra, “Generalized Nonnegative Matrix Approx-
[2] D. Lee and H.S. Seung, “Algorithms for Non-Negative Matrix imations with Bregman Divergences,” Advances in Neural Informa-
Factorization,” Advances in Neural Information Processing Systems tion Processing Systems 17, MIT Press, 2005.
13, MIT Press, 2001. [13] C. Ding, T. Li, and W. Peng, “Nonnegative Matrix Factorization
[3] P. Paatero and U. Tapper, “Positive Matrix Factorization: A Non- and Probabilistic Latent Semantic Indexing: Equivalence, Chi-
Negative Factor Model with Optimal Utilization of Error Square Statistic, and a Hybrid Method,” Proc. Nat’l Conf. Artificial
Estimates of Data Values,” Environmetrics, vol. 5, pp. 111-126, Intelligence, 2006.
1994. [14] F. Sha, L.K. Saul, and D.D. Lee, “Multiplicative Updates for
[4] Y.-L. Xie, P. Hopke, and P. Paatero, “Positive Matrix Factorization Nonnegative Quadratic Programming in Support Vector Ma-
Applied to a Curve Resolution Problem,” J. Chemometrics, vol. 12, chines,” Advances in Neural Information Processing Systems 15, MIT
no. 6, pp. 357-364, 1999. Press, 2003.
[5] S. Li, X. Hou, H. Zhang, and Q. Cheng, “Learning Spatially [15] N. Srebro, J. Rennie, and T. Jaakkola, “Maximum Margin Matrix
Localized, Parts-Based Representation,” Proc. IEEE Conf. Computer Factorization,” Advances in Neural Information Processing Systems,
Vision and Pattern Recognition, pp. 207-212, 2001. MIT Press, 2005.
[6] M. Cooper and J. Foote, “Summarizing Video Using Non- [16] P.O. Hoyer, “Non-Negative Matrix Factorization with Sparseness
Negative Similarity Matrix Factorization,” Proc. IEEE Workshop Constraints,” J. Machine Learning Research, vol. 5, pp. 1457-1469,
Multimedia Signal Processing, pp. 25-28, 2002. 2004.
DING ET AL.: CONVEX AND SEMI-NONNEGATIVE MATRIX FACTORIZATIONS 55

[17] M. Berry, M. Browne, A. Langville, P. Pauca, and R. Plemmons, Chris Ding received the PhD degree from
“Algorithms and Applications for Approximate Nonnegative Columbia University. He did research at the
Matrix Factorization,” Computational Statistics and Data Analysis, California Institute of Technology, the Jet
2006. Propulsion Laboratory, and Lawrence Berkeley
[18] T. Li and S. Ma, “IFD: Iterative Feature and Data Clustering,” Proc. National Laboratory before joining the Univer-
SIAM Int’l Conf. Data Mining, pp. 472-476, 2004. sity of Texas at Arlington in 2007 as a
[19] T. Li, “A General Model for Clustering Binary Data,” Proc. professor of computer science. His main
Knowledge Discovery and Data Mining, pp. 188-197, 2005. research areas are machine learning and
[20] C. Ding, X. He, and H. Simon, “On the Equivalence of bioinformatics. He serves on many program
Nonnegative Matrix Factorization and Spectral Clustering,” Proc. committees of international conferences and
SIAM Data Mining Conf., 2005. gave tutorials on spectral clustering and matrix models. He is an
[21] E. Gaussier and C. Goutte, “Relation between PLSA and NMF and associate editor of the journal Data Mining and Bioinformatics and is
Implications,” Proc. ACM Conf. Research and Development in writing a book on spectral clustering to be published by Springer. He
Information Retrieval (SIGIR), pp. 601-602, 2005. is a member of the IEEE.
[22] T. Hofmann, “Probabilistic Latent Semantic Indexing,” Proc. ACM
Conf. Research and Development in Information Retrieval (SIGIR), Tao Li received the PhD degree in 2004 from
pp. 50-57, 1999. the University of Rochester. He is currently an
[23] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet Allocation,” associate professor in the School of Computer
J. Machine Learning Research, vol. 3, pp. 993-1022, 2003. Science at Florida International University. His
[24] M. Girolami and K. Kaban, “On an Equivalence between PLSI and research interests are in data mining, machine
LDA,” Proc. ACM Conf. Research and Development in Informational learning, and information retrieval. He is a
Retrieval (SIGIR), 2003. recipient of the US National Science Foundation
[25] D. Lee and H.S. Seung, “Unsupervised Learning by Convex and CAREER Award and the IBM Faculty Research
Conic Coding,” Advances in Neural Information Processing Systems 9, Awards.
MIT Press, 1997.
[26] L. Xu and M. Jordan, “On Convergence Properties of the EM
Algorithm for Gaussian Mixtures,” Neural Computation, vol. 18,
pp. 129-151, 1996.
[27] C. Boutsidis and E. Gallopoulos, “SVD Based Initialization: A
Head Start for Nonnegative Matrix Factorization,” Pattern Michael I. Jordan received the master’s degree
Recognition, vol. 41, no. 4, pp. 1350-1362, 2008. from Arizona State University, and the PhD
[28] D. Donoho and V. Stodden, “When Does Non-Negative Matrix degree in 1985 from the University of California,
Factorization Give a Correct Decomposition into Parts?” Advances San Diego. He was a professor at the Massa-
in Neural Information Processing Systems 16, MIT Press, 2004. chusetts Institute of Technology from 1988 to
[29] A. D’Aspremont, L.E. Ghaoui, M.I. Jordan, and G.R.G. Lanckriet, 1998. He is the Pehong Chen Distinguished
“A Direct Formulation for Sparse PCA Using Semidefinite Professor in the Department of Electrical En-
Programming,” SIAM Rev., 2006. gineering and Computer Science and the De-
[30] H. Zou, T. Hastie, and R. Tibshirani, “Sparse Principal Component partment of Statistics at the University of
Analysis,” J. Computational and Graphical Statistics, vol. 15, pp. 265- California, Berkeley. He has published more
286, 2006. than 300 research articles on topics in electrical engineering, statistics,
[31] Z. Zhang, H. Zha, and H. Simon, “Low-Rank Approximations computer science, computational biology, and cognitive science. He was
with Sparse Factors II: Penalized Methods with Discrete Newton- named a fellow of the American Association for the Advancement of
Like Iterations,” SIAM J. Matrix Analysis Applications, vol. 25, Science (AAAS) in 2006 and was named a medallion lecturer of the
pp. 901-920, 2004. Institute of Mathematical Statistics (IMS) in 2004. He is a fellow of the
[32] H. Zha, C. Ding, M. Gu, X. He, and H. Simon, “Spectral Relaxation IEEE, a fellow of the IMS, a fellow of the AAAI, and a fellow of the ASA.
for K-Means Clustering,” Advances in Neural Information Processing
Systems 14, pp. 1057-1064, 2002.
[33] C. Ding and X. He, “K-Means Clustering and Principal Compo- . For more information on this or any other computing topic,
nent Analysis,” Proc. Int’l Conf. Machine Learning, 2004. please visit our Digital Library at www.computer.org/publications/dlib.

You might also like