Professional Documents
Culture Documents
Abstract—We propose a unified and systematic framework conditions. Another example is the presence of impulse noises
for performing online nonnegative matrix factorization in the in time series, including speech recordings in natural language
presence of outliers. Our framework is particularly suited to processing or temperature recordings in weather forecasting.
large-scale data. We propose two solvers based on projected
gradient descent and the alternating direction method of multi- The outliers, if not handled properly, can significantly corrupt
arXiv:1604.02634v2 [stat.ML] 15 Oct 2016
pliers. We prove that the sequence of objective values converges the learned basis matrix, thus the underlying low-dimensional
almost surely by appealing to the quasi-martingale convergence data structure cannot be learned reliably. Moreover, outlier de-
theorem. We also show the sequence of learned dictionaries tection and pruning become much more difficult in large-scale
converges to the set of stationary points of the expected loss datasets [14]. As such, it is imperative to design algorithms
function almost surely. In addition, we extend our basic problem
formulation to various settings with different constraints and that can learn interpretable parts-based basis representations
regularizers. We also adapt the solvers and analyses to each from large-scale datasets whilst being robust to possible out-
setting. We perform extensive experiments on both synthetic and liers.
real datasets. These experiments demonstrate the computational
efficiency and efficacy of our algorithms on tasks such as (parts-
based) basis learning, image denoising, shadow removal and A. Previous Works
foreground-background separation. Many efforts have been devoted to address each challenge
Index Terms—Nonnegative matrix factorization, Online learn- separately. To handle large datasets, researchers have pursued
ing, Robust learning, Projected gradient descent, Alternating solutions in three main directions. The first class of algo-
direction method of multipliers rithms proposed is know as online NMF algorithms [12],
[15]–[19]. These algorithms aim to refine the basis matrix
I. I NTRODUCTION each time a new data sample is acquired without storing the
past data samples. The second class of algorithms is known
In recent years, Nonnegative Matrix Factorization (NMF) as distributed NMF [20]–[23]. The basic idea behind these
has become a popular dimensionality reduction [2] technique, algorithms is to distribute the data samples over a network of
due to its parts-based, non-subtractive interpretation of the agents so that several small-scale optimization problems can be
learned basis [3]. Given a nonnegative data matrix V, it performed concurrently. The final class of algorithms is called
seeks to approximately decompose V into two nonnegative the compressed NMF algorithms [24], [25]. These algorithms
matrices, W and H, such that V ≈ WH. In the literature, perform structured random compression to project the data
the fidelity of such a approximation is most commonly mea- onto the lower-dimensional manifolds. As such, the size of
2
sured by kV − WHkF [4]–[8]. To obtain this approximation, the dataset can be reduced. These three approaches have suc-
many algorithms have been proposed, including multiplicative cessfully reduced the computation and storage complexities—
updates [4], block principal pivoting [5], projected gradient either provably or through numerical experiments. For the
descent [6], active set method [7], and the alternating direction existence of outliers, a class of algorithms called the (batch)
method of multipliers [8]. These algorithms have promising robust NMF [13], [26]–[35] has been proposed to reliably learn
performances in numerous applications, including document the basis matrix by minimizing the effects of the outliers.
clustering [9], hyperspectral unmixing [10] and audio source The robustness against the outliers is achieved via different
separation [11]. However, there are also many studies [12], approaches. These are detailed in Section II-A. However, to
[13] showing that their performances deteriorate under two the best of our knowledge, so far there are no NMF-based
common and practical scenarios. The first scenario is when the algorithms that are able to systematically handle outliers in
data matrix V has a large number of columns (data samples). large-scale datasets.
This situation arises in today’s data-rich environment. Batch
data processing methods used in the aforementioned algo-
B. Main Contributions
rithms become highly inefficient in terms of the computational
time and storage space. The second scenario is the existence In this paper, we propose an algorithm called the online
of outliers in some of the data samples. For example (e.g.), NMF with outliers that fills this void. Specifically, our algo-
there are glares and shadows in images due to bad illumination rithm aims to learn the basis matrix W in an online manner
whilst being robust to outliers. The development of the pro-
A preliminary work has been published in ICASSP 2016 [1]. posed algorithm involves much more than the straightforward
The authors are with the Department of Electrical and Computer Engineer- combination or adaptation of online NMF and robust NMF
ing and the Department of Mathematics, National University of Singapore.
They are supported in part by the NUS Young Investigator Award (grant algorithms. Indeed, since there are many ways to “robustify”
number R-263-000-B37-133). the NMF algorithms, it is crucial to find an appropriate way
2
algorithms by leveraging the empirical process theory, as was In other words, we aim to solve a (non-convex) stochastic
done in [16], [36], [37], [44]. These methods have exten- program [57]. Note that by the strong law of large numbers,
a.s.
sive applications, including document clustering [18], image given any W ∈ C, we have ft (W) −−→ f (W) as t → ∞.
inpainting [44], face recognition [16], and image annotation Remark 1. We make three remarks here. First we explain
[16]. The second category of algorithms [12], [15], [17], [19], the reasonings behind the choice of the constraint sets C
[38], [45]–[48], with major applications in visual tracking, and R. The set C constrains the columns of W in the unit
assumes that the data generation distribution P is time-varying. (nonnegative) `2 ball. This is to prevent the entries of W from
Although these assumptions are weaker than those in the first being unbounded, following the conventions in [6], [58]. The
class of algorithms, it is very difficult to provide theoretical set R uniformly bounds the entries of r. This is because in
guarantees on the convergence of the online algorithms. practice, both the underlying data (without outliers) and the
observed data are uniformly bounded entrywise. Since we do
C. Online Low-rank and Sparse Modeling not require r to be exactly recovered, this prior information
can often improve the estimation of r. For real data, the bound
Another related line of works [49]–[53] aims to recover M > 0 can often be easily chosen. For example, for gray-scale
the low-rank data matrix L and the sparse outlier matrix S images, M can be chosen as 2m − 1, where m is the number
from their additive mixture M in an online fashion. Among of bits per pixel. In the case of matrices containing ratings
these works, [49]–[51] assume that the sequence of mixture from users with a maximum rating of υ, M can be chosen
vectors (columns of M) arrives in a streaming fashion. The as υ. In some scenarios where M is difficult to estimate,
authors derive the recovery algorithms based on their proposed we simply set M = ∞. As will be shown in Sections IV
models of the sequence of ground-truth data vectors (columns and V, the algorithms and analyses developed for finite M
of L) and outlier vectors (columns of S). The authors of [52], can be easily adapted to infinite M . Second, for the sake of
[53] adopt a different approach. They allow the number of brevity, we omit regularizing W and h. Such regularizations,
mixture vectors to be large but finite and use an alternating together with other possible constraint sets of W and r will be
minimization approach to solve a variant of (4) (with addi- discussed in Section VI. Third, we assume the ambient data
tional Tikhonov regularizers on W and {hi }N i=1 ). In learning dimension F , the latent data dimension K and the penalty
the coefficient vectors {hi }N
i=1 (with fixed W), the authors parameter λ in (5) are time-invariant parameters.1
employ the stochastic gradient descent method, thus ensuring
that their algorithms are scalable to large-scale data.
IV. A LGORITHMS
To tackle the problem proposed in Section III, we lever-
III. P ROBLEM F ORMULATION
age the stochastic majorization-minimization (MM) framework
Following Section II-A, in this work we explicitly model the [60], [61], which has been widely used in previous works on
outlier vectors as {rt }t≥1 . Also, we assume the data generation online matrix factorization [12], [15], [17], [19], [36]–[38],
distribution P is time-invariant, for ease of the convergence [44], [45]. In essence, such framework decomposes the opti-
analysis. First, for a fixed data sample v and a fixed basis mization problem in (8) into two steps, namely nonnegative
matrix W, define the loss function with respect to (w.r.t.) v encoding and dictionary update. Concretely, at a time instant t
and W, `(v, W) as
1 In this work, we do not simultaneously consider the data with high ambient
`(v, W) , min `(v,
e W, h, r), (5) dimensions. An attempt on this problem in the context of dictionary learning
h≥0,r∈R with the squared-`2 loss has been made in [59].
4
(t ≥ 1), we first learn the coefficient vector ht and the outlier where2
vector rt based on the newly acquired data sample vt and the 1 2
previous dictionary matrix Wt−1 . Specifically, we solve the Qη (h0 |h) , q(h) + h∇q(h), h0 − hi + kh0 − hk2 ,
2η
following convex optimization problem 2 2
q(h) , 12 kv − Wh − rk2 and η ∈ (0, 1/L] with L , kWk2 .
(ht , rt ) = arg min `(v
e t , Wt−1 , h, r). (9) The steps (13) and (14) can be interpreted based on the frame-
h≥0,r∈R
work of block MM [63], [64]. Specifically, it is easy to verify
Here the initial basis matrix W0 is randomly chosen in C. that the steps (13) and (14) amount to finding the (unique)
Next, based on the past statistics {vi , hi , ri }i∈[t] , the basis minimizers of the majorant functions3 of h0 7→ `(v,
e W, h0 , r)
matrix is updated to 0 + 0
at h and r 7→ `(v, W, h , r ) at r respectively. Therefore, the
e
Wt = arg min fet (W), (10) convergence analysis in [63] guarantees that such alternating
W∈C minimization procedure converges to a global optimum of (9).
where In addition, we notice that the minimizations in both (13)
t and (14) have closed-form solutions. For (13), the solution is
1 X 1 2
fet (W) , kvi − Whi − ri k2 + λ kri k1 . (11) given by the PGD update step (with constant step size)
t i=1
2
h+ := P+ (h − η∇q(h)). (15)
We note that (10) can be rewritten as
For ease of parameter tuning, we rewrite η = κ/L (0 < κ ≤
1
Wt = arg min tr WT WAt − tr WT Bt ,
(12) 1). Furthermore, we fix η (or κ) throughout all iterations.
W∈C 2
Pt Pt For (14), if M = ∞, the solution is precisely given by
where At , 1/t i=1 hi hTi and Bt , 1/t i=1 (vi − ri )hTi
are the sufficient statistics. From (12), we observe that our r+ := Sλ (v − Wh+ ), (16)
algorithm has a storage complexity independent of t since only where Sλ is the (elementwise) soft-thresholding operator
At , Bt and Wt need to be stored and updated. threshold λ. Otherwise, when M is finite, using [65, Lemma 5]
To solve (9) and (12) (at a fixed time instant t), we propose (see Lemma S-3), we have
two solvers based on PGD and ADMM respectively. For ease
of reference, we refer to the former algorithm as OPGD and r+ := Seλ,M (v − Wh+ ), (17)
the latter as OADMM. We now explain the motivations behind where for any x ∈ RF and i ∈ [F ],
proposing these two solvers. Since both (9) and (12) are
constrained optimization problems, the most straightforward 0,
|xi | < λ
solver would be based on PGD. Although such a solver has Sλ,M (x) := xi − sgn(xi )λ, λ ≤ |xi | ≤ λ + M .
e
i
a linear computational complexity per iteration, it typically
sgn(xi )M, |xi | > λ + M
needs a large number of iterations to converge. Moreover, it is
also easily trapped in bad local minima [62]. Thus, it would be 2) PGD solver for (12): Similar to the procedure for
meaningful to contrast its performance with a solver with very solving (13), we first rewrite (12) as
different properties. This leads us to propose another solver
based on ADMM. Such a solver has a higher computational min pt (W) + IC (W), where
W
complexity per iteration but typically needs fewer number of 1
tr WT WAt − tr WT Bt .
iterations to converge [8]. It is also less susceptible to bad pt (W) = (18)
2
local minima since it solves optimization problems in the First, it is easy to see pt is convex and differentiable, ∇pt is
dual space. In Section VII, we will show that the practical Lipschitz with constant L e t , kAt k . Thus, we can construct
F
performances of these two solvers are comparable despite the a majorant function Pt (W0 |W) for pt (W0 ) at W ∈ C
different properties they possess. Thus either solver can be
used for most practical purposes. Pt (W0 |W) = pt (W) + h∇pt (W), W0 − Wi
In the sequel, we omit the time subscript t to keep notations 1 2
+ kW0 − WkF , (19)
uncluttered. In the iterations, the updated value of a variable is 2e
ηt
denoted with the superscript ‘+’. Pseudo-codes of the entire
algorithm (for N data samples) are provided in Algorithm 1. where ∇pt (W) = WAt − Bt and ηet ∈ (0, 1/Le t ]. Minimizing
Pt (W0 |W) + IC (W0 ) over W0 ∈ C, we have
2
A. Online Algorithm Based on PGD (OPGD) W+ := arg min kW0 − (W − ηet ∇pt (W))kF (20)
W0 ∈C
1) PGD solver for (9): For a fixed W, we solve (9) by := PC (W − ηet ∇pt (W)). (21)
alternating between the following two steps
h+ := arg min Qη (h0 |h), (13)
h0 ≥0 2 At time t, the value of L is computed based on W
t−1 , i.e., Lt ,
1
kWt−1 k22 .
r+ := arg min
v − Wh+ − r0
2 + λ kr0 k ,
2 1 (14) 3 For a function g with domain G, its majorant at κ ∈ G, g
e is the function
r0 ∈R 2 e ≥ g on G and ii) g
that satisfies i) g e(κ) = g(κ).
5
By Lemma S-4, the projection step in (21) is given by Algorithm 1 Online NMF with outliers (ONMFO)
P+ (W − ηet ∇pt (W)):j Input: Data samples {vi }i∈[N ] , penalty parameter λ, initial
+
W:j := , ∀ j ∈ [K]. dictionary matrix W0
max{1, kP+ (W − ηet ∇pt (W)):j k2 }
(22) Initialize sufficient statistics: A0 := 0, B0 := 0
Again, for each iteration given in (21), we use the same step for t = 1 to N do
size ηet = κ
et /L where 0 < κet ≤ 1. 1) Acquire a data sample vt .
2) Learn the coefficient vector ht and the outlier vector rt
Remark 2. In the literature [66], [67], the accelerated prox- based on Wt−1 , using the solvers based on PGD or ADMM
imal gradient descent (APGD) method has been proposed (detailed in Sections IV-A1 and IV-B1)
to accelerate the canonical PGD method. With an additional
extrapolation step in each iteration, the convergence rate can (ht , rt ) := arg min `(v
e t , Wt−1 , h, r). (36)
h≥0,r∈R
be improved from O(1/k) to O(1/k 2 ), where k denotes the
number of iterations. However, since in our implementations, 3) Update the sufficient statistics
both (9) and (12) were only solved to a prescribed accuracy
At := 1/t (t − 1)At−1 + ht hTt ,
(see Section VII-B2), we observed no significant reduction in
Bt := 1/t (t − 1)Bt−1 + (vt − rt )hTt .
running times on the real tasks. Thus for simplicity, we only
employ the canonical PGD method. 4) Learn the dictionary matrix Wt based on At and Bt ,
using the solvers based on PGD or ADMM (detailed in
B. Online Algorithm Based on ADMM (OADMM) Section IV-A2 and IV-B2)
Below we only present the update rules derived via ADMM. 1
Wt := arg min tr WT WAt − tr WT Bt
The detailed derivation steps are shown in Section S-1. (37)
W∈C 2
1) ADMM solver for (9): First we reformulate (9) as
end for
1 2 Output: Final dictionary matrix WN
min kv − Wh − rk2 + λ krk1 + I+ (u) + IR (q)
h,r 2
s. t. h = u, r = q (23)
Q+ := PC W+ + D/ρ3
Thus the augmented Lagrangian is (34)
D+ := D + ρ3 W+ − Q+ .
1 2
(35)
L(h, r, u, q, α, β) = kv − Wh − rk2 + λ krk1
2 Remark 3. A few comments are now in order: First, although
+ I+ (u) + IR (q) + αT (h − u) + β T (r − q) algorithms based on both PGD and ADMM exist in the
ρ1 2 ρ2 2 literature for the canonical NMF problem (see for example,
+ kh − uk2 + kr − qk2 , (24)
2 2 [6], [8], [62]), our problem setting is different from those
where α, β are dual variables and ρ1 , ρ2 are positive penalty in the previous works. Specifically, our problem explicitly
parameters. Then we sequentially update each of h, r, u, q, models the outlier vector r and the constraint set on W
×K
α and β while keeping other variables fixed is more complicated than the non-negative orthant RF + .
Moreover, for the algorithm based on PGD, we use a fixed step
h+ := (WT W + ρ1 I)−1 WT (v − r) + ρ1 u − α
(25)
size though all the iterations. In contrast, the usual projected
r+ := Sλ (ρ2 q + v − β − Wh)/(1 + ρ2 ) (26) gradient method applied to NMF [6] involves computationally
u+ := P+ h+ + α/ρ1
(27) intensive Armijo-type line searches for the step sizes. Second,
q+ := PR r+ + β/ρ2 although updates (25) and (33) involve matrix inversions, the
(28)
+ + + operations do not incur high computational complexity since
α := α + ρ1 (h − u ) (29)
both matrices to be inverted have sizes K ×K where K F .
β + := β + ρ2 (r+ − q+ ). (30) Third, the proposed two solvers can be easily extended to solve
2) ADMM solver for (12): Again, we rewrite (12) as the batch NMF problem with outliers. Thus, we have also
effectively proposed two new solvers for the (batch) robust
1
tr WT WAt − tr WT Bt + IC (Q)
min NMF problem. These extensions are detailed in Section S-2.
2 In the sequel we term these two batch algorithms as BPGD
s. t. W = Q (31) and BADMM respectively. Finally, the stopping criteria of
The augmented Lagrangian in this case is both OPGD and OADMM for solving (9) and (12) will be
described in Section VII-B2.
1
L(W, Q, D) = tr WT WAt − tr WT Bt + IC (Q)
2
ρ3 2 V. C ONVERGENCE A NALYSES
+ hD, W − Qi + kW − QkF , (32)
2
In this section, we analyze the convergence of both the
where D is the dual variable and ρ3 is a positive penalty
sequence of objective values {f (Wt )}t≥1 (Theorem 1) and
parameter. Minimizing W, Q and D sequentially yields
the sequence of dictionary matrices {Wt }t≥1 (Theorem 2)
−1
W+ := (Bt − D + ρ3 Q) (At + ρ3 I) (33) produced by Algorithm 1. It is worth noticing that the analyses
6
are independent of the specific solvers used in steps 2) and 4) Finally, we define the notion of stationary point of a
in Algorithm 1. Indeed, in our analyses, we only leverage the constrained optimization problem.
fact that both (9) and (12) can be exactly solved.
Definition 1. Given a real Hilbert space X , a differentiable
function g : X → R and a set K ⊆ X , x0 ∈ K is a stationary
A. Preliminaries point of minx∈K g(x) if h∇g(x0 ), x − x0 i ≥ 0, for all x ∈ K.
First, given any h0 ≥ 0 and r0 ∈ R, we observe that
(v, W) 7→ `(v, e W, h0 , r0 ) and fet serve as upper-bound B. Main Results and Key Lemmas
functions for ` (on RF+ × C) and ft (on C) respectively. Denote In this section we present our main results (Theorem 1 and
(h∗ , r∗ ) as an optimal solution of (5). (Note that (h∗ , r∗ ) is a 2) and key lemmas to prove these results (Lemma 1 and 2). We
function of v and W and always exists since `(v e 0 , W0 , h, r) only show proof sketches here. Detailed proofs are deferred
is closed and coercive on R+ × R for any v0 ∈ RF
K
+ and to Section S-3 to S-8 in the supplemental material.
W0 ∈ C.) Clearly we have `(v, e W, h∗ , r∗ ) = `(v, W), for
Theorem 1 (Almost sure convergence of the sequence of ob-
any (v, W) ∈ RF + × C. jective values). In Algorithm 1, the nonnegative stochastic pro-
Next, we notice that ` can be equivalently defined as
cess {fet (Wt )}t≥1 converges a.s.. Furthermore, {f (Wt )}t≥1
`(v, W) = min `(v,
e W, h, r), (38) converges to the same almost sure limit as {fet (Wt )}t≥1 .
(h,r)∈H×R
Proof Sketch. The proof proceeds in two major steps. First,
where H is a compact and convex set in RK + . This is because making use of the quasi-martingale convergence theorem [41,
h∗ in (5) is bounded, due to the boundedness of vi and W. Theorem 9.4 & Proposition 9.5] (see Lemma S-8), we prove
Thus it suffices to consider minimizing h over a compact set {fet (Wt )}t≥1 converges a.s. by showing the series of the
in RK
+ . We can further choose H to be convex. If M = ∞, expected positive variations of this process converges. A key
we can similarly show r∗ is bounded thus still consider step in proving the convergence of this series is to bound each
minimizing r over a compact and convex set R in RF . summand of the series by Donsker’s theorem (see Lemma S-
We make three assumptions for our subsequent analyses. 9). To invoke this theorem, we show the class of measurable
Assumptions. functions {`(·, W) : W ∈ C} is P-Donsker [40] using the
1) The data generation distribution P has a compact support Lipschitz continuity of `(v, ·) on C (see Lemma 1) and [40,
set V. Example 19.7] (see Lemma S-11). Second, we show
2) For all (v, W) ∈ V × C, `(v, e W, h, r) is jointly m1 - a.s.
ft (Wt ) − fet (Wt ) −−→ 0. (40)
strongly convex in (h, r) for some constant m1 > 0. P∞ e
3) For all t ≥ 1, fet (W) is m2 -strongly convex on C for by showing t=1 ft (Wtt+1 )−ft (Wt )
converges a.s.. Define bt ,
some constant m2 > 0. fet (Wt ) − ft (Wt ). Using the Lipschitz continuity of both fet
Remark 4. All of the assumptions above are made to sim- and ft on C and Lemma 2, we can show |bt+1 − bt | = O(1/t)
plify the convergence analyses. Among them, Assumption 1 a.s.. Now we invoke [44, Lemma 8] (see Lemma S-12)
a.s.
naturally holds for real data, which are always bounded. to conclude ft (Wt ) − fet (Wt ) −−→ 0. By Lemma 1 and
Assumptions 2 and 3 play roles in proving Lemma 1 and [40, Example 19.7] (see Lemma S-11), we can show the
2 respectively. Apart from this, they have no effects in class of measurable functions {`(·, W) : W ∈ C} is P-
proving our main theorems (i.e., Theorem 1 and 2). These Glivenko-Cantelli [40]. By the Glivenko-Cantelli theorem (see
two assumptions hold by simply adding strongly convex Lemma S-10) we have
regularizers to `(v,
e W, h, r) or fet (W). For example, we can a.s.
2 2
sup |ft (W) − f (W)| −−→ 0. (41)
add Tikhonov regularizer (ν1 khk2 + ν2 krk2 )/2 (for some W∈C
positive ν1 , ν2 ) to `(v,
e W, h, r), then in such case m1 = In particular,
min(ν1 , ν2 ). Adding such regularizers can be regarded as a a.s.
way to promote smoothness and avoid over-fitting on W, h ft (Wt ) − f (Wt ) −−→ 0. (42)
and r (see Section VI). Moreover, including the regularizers Finally, (40) and (42) together imply that {f (Wt )}t≥1
in `(v,
e W, h, r) or fet (W) will not alter any steps in our
converges to the same almost sure limit as {fet (Wt )}t≥1 .
analyses. In terms of our algorithms, for small regularization
weights (e.g., ν1 , ν2 1), such regularizers will only slightly
alter steps (36) and (37) in Algorithm 1. For example, in (36),
Theorem 2 (Almost sure convergence of the sequence of
`(v
e t , Wt−1 , h, r) will become
dictionaries). The stochastic process {Wt }t≥1 converges to
1 2 ν1 2 the set of stationary points of (8) a.s..4
`e0 (vt , Wt−1 , h, r) , kv − Wh − rk2 + khk2
2 2
ν2 2 Proof Sketch. Fix a realization {vt }t≥1 such that
+ λ krk1 + krk2 . (39)
2 ft (Wt ) − ft (Wt ) → 0,5 and generate {Wt }t≥1 according to
e
Thus for simplicity we omit such regularizers in the algorithm. 4 Given a metric space (X , d), a sequence (x ) in X is said to converge
n
However, both of our solvers (OPGD and OADMM) can be to a set A ⊆ X if limn→∞ inf a∈A d(xn , a) = 0.
straightforwardly adapted to the regularized case. 5 The set of all such realizations have probability one by (40).
7
Algorithm 1. Then it suffices to show every subsequential limit (h∗ (v, W), r∗ (v, W)) is unique for any (v, W) ∈ V × C, we
of {Wt }t≥1 is a stationary point of (8). The compactness of can invoke Danskin’s theorem (see Lemma S-5) to guarantee
C enables us to find a convergent subsequence {Wtm }m≥1 . the differentiability of `. Moreover, we invoke the maxi-
Furthermore, it is possible to find a further subsequence of mum theorem (see Lemma S-6) to ensure the continuity of
{tm }m≥1 , {tk }k≥1 such that all {Atk }k≥1 , {Btk }k≥1 and (h∗ (v, W), r∗ (v, W)) on V × C. These results together imply
{fetk (0)}k≥1 converge. We focus on {Wtk }k≥1 hereafter that ` satisfies the desired regularity properties. The regularity
and denote its limit by W. Our
goal is to show for any properties of f on C hinges upon the regularity properties of
W ∈ C, the directional derivative ∇f (W), W − W ≥ 0. `. Indeed, by Leibniz integral rule (see Lemma S-7), we can
First, we show the sequence of differentiable functions show f is differentiable and
{fetk }k≥1 converges uniformly to a differentiable function
∇f (W) = Ev [∇W `(v, W)]. (43)
f . By (41) we also have that {ftk }k≥1 converges uniformly
to f . Denote gt , fet − ft , we have {gtk }k≥1 converges Hence the Lipschitz continuity of ∇f follows naturally from
uniformly to g = f − f . By Lemma 1, f is differentiable the Lipschitz continuity of W 7→ ∇W `(v, W).
on C. Next, we show for any W ∈ C,
g is differentiable
so
∇f (W), W − W ≥ 0 by showing W is a global
minimizer of f . Then we show ∇g(W) = 0 using the
Lemma 2. In Algorithm 1, kWt+1 − Wt kF = O (1/t) a.s..
first-order
Taylor expansion.
These results suffice to imply
∇f (W), W − W ≥ 0 for any W ∈ C. Proof Sketch. The upper bound for kWt+1 − Wt kF results
from the strong convexity and Lipschitz continuity of fet .
Specifically, they together imply an order difference between
Remark 5. Due to the nonconvexity of the canonical NMF the upper and lower bounds of fet (Wt+1 ) − fet (Wt ), i.e.,
problem, finding global optima of (2) is in general out-of- m2 2
kWt+1 − Wt kF ≤ fet (Wt+1 ) − fet (Wt ) (44)
reach. Indeed, [68] shows (2) is NP-hard. Therefore in the 2
literature [58], [69], convergence to the set of stationary points ≤ ct kWt+1 − Wt kF , (45)
(see Definition 1) of (2) has been studied instead. In the
where ct is only dependent on t. Some calculations reveal
online setting, it is even harder to show that the sequence
that ct = O(1/t) a.s..
of dictionaries {Wt }t≥1 converges to the global optima of
(8) (in some probabilistic sense). Thus Theorem 2 only states
the convergence result with respect to the stationary point of
(8), which has been the state-of-the-art in the literature of
online matrix factorization [37], [44]. On the other hand, we C. Discussions
notice that under certain assumptions on the data matrix V, We highlight the distinctions between our convergence
e.g., the separability condition proposed in [70], exact NMF analyses and those in the previous works. Our first main result
algorithms have been proposed in [71]–[73]. The optimization (Theorem 1) is somewhat analogous to the results in [16],
techniques therein are discrete (and combinatorial) in nature, [36], [37], [44]. However, the stochastic optimization of (8) in
and vastly differ from the continuous optimization techniques [16] is based on robust stochastic approximations [75], a very
typically employed in solving the canonical NMF problem. different approach from ours (see Section IV). This leads to
In addition, some heuristics for obtaining exact NMF using significant differences in the analyses. For example, at time
continuous optimization techniques have been proposed in t, their dictionary update step does not necessarily minimize
[74]. Nevertheless, developing exact NMF algorithms in the fet (W), a fact that we heavily leverage. The rest of the works
online setting could be an interesting future research direction. fall under a similar framework as ours. However, in [44],
The following two lemmas are used in the proof of Theo- a general sparse coding problem is considered. Thus, they
rem 1 and 2. In particular, Lemma 1 states that the loss func- assume a sufficient condition ensuring a unique solution in
tions ` and f respectively defined in (5) and (8) satisfy some the Lasso-like sparse coding step holds. This assumption adds
regularity conditions. Lemma 2 then bounds the variations in certain complications in proving ∇f is Lipschitz. Because our
the stochastic process {Wt }t≥1 . problem setting is different, we avoid these issues. Thus we
are able to provide a more succinct proof, despite having some
Lemma 1 (Regularity properties of ` and f ). The loss
additional features, e.g., the presence of the outlier vector
functions ` and f satisfy the following properties
r. The most closely related works are [36] and [37], both
1) `(v, W) is differentiable on V × C and ∇W `(v, W) is
of which consider the online robust PCA problems but with
continuous on V × C. Moreover, for all v ∈ V, ∇W `(v, W)
different loss functions. However, the nonnegativity constraints
is Lipschitz on C with Lipschitz constant independent of v.
on W and h and box constraints on r distinguish our proof
2) The expected loss function f (W) is differentiable on C and
from theirs. For our second main result (Theorem 2), we note
∇f (W) is Lipschitz on C.
that in previous works [36], [37], seemingly stronger results
Proof Sketch. The regularity properties of ` on V × have been shown. Specifically, these works manage to show
C originate from the regularity properties of `e on V × {Wt }t≥1 converges, either to one stationary point [37] or to
C. Since Assumption 2 ensures the minimizer for (38), the global minimizer [36] of f . However, due to different
8
problem settings, their proof techniques cannot be easily B. Regularizers for W, h and r
adapted to our problem setting. Inspired by [60], we instead We use λi ≥ 0 to denote the penalty parameters. For W, the
manage to characterize the subsequential limits of {Wt }t≥1 elastic-net regularization can be employed, i.e., λ1 kWk1,1 +
generated by almost sure realizations of {vt }t≥1 . This result is 2
λ2 /2 kWkF . This includes `1 and `2 regularizers on W as
almost as good as the results in the abovementioned previous
special cases. As pointed out in [19], [82] and [78], `1 and `2
works since it implies that there exists a stationary point of f
regularizers promote sparsity and smoothness on W respec-
such that its neighborhood contains infinitely many points of
tively. Because W ≥ 0, (12) with the elastic-net regularization
the sequence almost surely.
is still quadratic. In terms of the analyses, Assumption 3 can
be removed and the rest remains the same.
For h, several regularizers can be used:
VI. E XTENSIONS
1) The Lasso regularizer λ3 khk1 . It induces sparsity on h,
hence the optimization problem (13) with such regularizer is
In Section III, we considered a basic formulation of the
termed nonnegative sparse coding [38], [80].
online NMF problem with outliers. Here, we extend this 2
2) The Tikhonov regularizer: λ4 /2 khk2 . This regularizer
formulation to a wider class of problem settings, with different
induces smoothness and avoids over-fitting on h. Both the `1
constraints and regularizers on W, h and r. We also discuss
and `2 regularizers preserve the quadratic
P √ nature of (13).
how to adapt our algorithms and analyses to these settings.
3) The group Lasso regularizer λ5 α ξα khα k2 , where
hα ’s are (non-overlapping) subvectors of h with lengths ξα .
This regularizer induces sparsity among groups of entries in
A. Different Constraints on W and r h. Efficient algorithms to solve (13) under this regularization
If the constraint set C changes, in terms of both solvers have been proposed in [82], [83]. P √
in Section IV, only (21) and (34), the updates involving 4) The sparse group Lasso regularizer λ6 (ν α ξα khα k2
projections onto C, need to be changed. In terms of the +(1 − ν) khk1 ), where ν ∈ [0, 1]. Compared with the group
analyses, using a similar argument as in Section V-A, we can lasso regularizer, this one also encourages sparsity within
show the (unique) optimal solution W∗ (At , Bt ) in (12) is each group. Efficient algorithms for solving (13) with this
bounded6 regardless of boundedness of C. Thus, the optimiza- regularization have been discussed in [84], [85].
tion problem (12) can always be restricted to a compact set. Since (13) with each regularizer above is still a convex
The rest of the analyses proceeds as usual. Therefore, in the program, the analyses remain unchanged. Similar regularizers
following, we only discuss the (convex) variants of C and the on h can be applied to r, so for simplicity we omit the
associated (Euclidean) projection methods. discussions here. Compared to h, the only differences are that
1) The nonnegative orthant C1 , RF ×K
. This constraint r may be negative and bounded. However, standard algorithms
+
is the most basic yet the most widely used constraint in in such case are available in the literature [84]–[86].
NMF, although it does not well control the scale of W. Remark 6. Some remarks are in order. First, for certain pairs
The projection onto the nonnegative orthant involves simple of constraints and regularizers above (e.g., the elastic-net ball
entrywise thresholding. constraint of W and the elastic-net regularization on W),
×K
2) The probability simplex C2 , {W ∈ RF + | kW:i k1 = the optimization problems are equivalent (admit the same
1, ∀ i ∈ [K]} [16]. Different from C, C2 also prevents some optimal solutions) for specific value pairs of (γi , λi ). However,
columns of W from having arbitrarily small norms. Efficient the algorithms and analyses for these equivalent problems
projection algorithms onto the probability simplex have been can be much different. Second, all the alternative constraint
well studied. See [76], [77] for details. sets and regularizers discussed above are convex, since our
3) The nonnegative elastic-net ball [78]–[80]: C3 , {W ∈ analyses heavily leverage the (strong) convexity of `e and fe
RF ×K 2
|γ1 kW:i k1 +γ2 /2 kW:i k2 ≤ 1, ∀ i ∈ [K]}, where both (see Section V-A). It would be a meaningful future research
+
γ1 and γ2 are nonnegative. C3 is a general constraint set since it direction to consider nonconvex constraints or regularizers.
subsumes both C (the nonnegative `2 ball) and the nonnegative
`1 ball. Compared to C, this constraint encourages sparsity on VII. E XPERIMENTS
the basis matrix W, thus leading to better interpretability on We present results of numerical experiments to demonstrate
the basis vectors. Efficient algorithms for projection onto the the computational efficiency and efficacy of our online algo-
elastic-net ball have been proposed in [44], [81]. rithms (OPGD and OADMM). We first state the experimental
For the outlier vector r, the nonnegativity constraint can setup. Next we introduce some heuristics used in our experi-
be added to R to model (bounded) nonnegative outliers, so ments and the choices of parameters. We show the efficiency
the new constraint R0 , {r ∈ RF + | krk∞ ≤ M }. In such of our algorithms by examining the convergence speeds of the
case krk1 = 1T r so (14) amounts to a quadratic minimization objective functions. The efficacy of our algorithms is shown
program with box constraints. Both the algorithms and the via the (parts-based) basis representations and three meaning-
analyses can be easily adapted to such a simpler case. ful real-world applications—image denoising, shadow removal
and foreground-background separation. All the experiments
6 From (12), we observe that W is a function of only A and B if
t t t
were run in 64-bit Matlabr (R2015b) on a machine with
Assumption 3 holds. Intelr Core i7-4790 3.6 GHz CPU and 8 GB RAM.
9
the comparison is divided into two parts. The first involves than our online algorithms (the time difference is less than 1s).
comparison between our online algorithms and their batch This is because ORPCA’s formulation does not incorporate
counterparts. The second involves comparison between our nonnegativity constraints (and magnitude constraints for the
online algorithms to other online algorithms. For the first outliers) so the algorithm has fewer projection steps, lead-
part, we use the surrogate loss function fet as a measure of ing to a lower computational complexity. However, we will
convergence because i) fet (Wt ) has the same a.s. limit as show in Section VII-E that it fails to learn the parts-based
f (Wt ) (as t → ∞) and ii) with storage of the past statistics representations due to the lack of nonnegativity constraints.
{vi , hi , ri }i∈[t] , fet is easier to compute than the expected Moreover, we also observe that the ONMF algorithm fails to
loss f or the empirical loss ft . For the second part, since converge because it is unable to handle the outliers. Thus the
different online algorithms have different objective functions, constant perturbation from the outliers in the data samples on
we propose a heuristic and unified measure of convergence the learned basis matrix keeps the algorithm from converging.
fbt : C → R+ , defined as Furthermore, from both subfigures, we observe the objective
t function of OADMM has a smoother decrease that that of
1X1 o 2 OPGD. This is OADMM solves the optimization problems (9)
fbt (W) , kv − Whi k2 , (46)
t i=1 2 i and (12) in the dual space so it avoids the sharp transitions
from plateaus to valleys in the primal space. Apart from this
where vio denotes the clean data sample (without outliers and
difference, the convergence speeds of OADMM and OPGD are
observation noise) at time i. Loosely speaking, fbt (Wt ) can
almost the same.
be interpreted as the averaged regret of reconstructing the
3) Insensitivity to Parameters: We now examine the in-
clean data up to time t. In the following experiments, unless
fluences of various parameters, including τ , K, ρ and κ,
compared with other online algorithms, we always use fet as
on the convergence speeds of our online algorithms on the
the objective function for our online algorithms.
synthetic dataset. To do so, we vary one parameter at a time
1) Generation of Synthetic Dataset: The procedure to gen-
while keeping the other parameters fixed as in the canonical
erate the synthetic dataset is described as follows. First, we
F ×K o iid √ setting. We first vary τ from 5 to 40 in a log-scale, since our
generated Wo ∈ R+ such that (wo )ij ∼ HN (1/ K o ), proposed heuristic rule indicates τ ≤ 40. Figure 2 shows that
for any (i, j) ∈ [F ]×[K o ], where K o denotes the ground-truth
o our rule-of-thumb works well since within its indicated range,
×N
latent dimension. Similarly, we generated Ho ∈ RK + such different values of τ have very little effects on the convergence
iid √
that (h )ij ∼ HN (1/ K ), for any (i, j) ∈ [K ]×[N ]. The
o o o
speeds of both our online algorithms. Next, we consider other
(clean) data matrix Vo = PVe (Wo Ho ), where V e = [0, 1]F ×N . values of the latent dimension K. In particular, we consider
Note that the normalization (projection) PVe preserves the K = 25 and K = 100. Figure 3 shows that the convergence
boundedness of the data. Then we generated the outlier vectors speeds of both our online and batch algorithms are insensitive
{ri }i∈[N ] . First, we uniformly randomly selected a subset of to this parameter. Then we vary ρ (in both BADMM and
[N ] with size bνN c and denoted it as I, where ν ∈ (0, 1) OADMM) from 1 to 1000 on a log-scale. We can make two
denotes the fraction of columns in Vo to be contaminated by observations from Figure 4. First, as ρ increases, both ADMM-
the outliers. For each i ∈ I, we generated the outlier vector based algorithms converge slower. However, the change of the
ri ∈ [−1, 1]F by first uniformly selecting a support set with convergence speed of OADMM is much less drastic compared
cardinality be ν F c, where νe ∈ (0, 1) denotes the outlier density to BADMM. In fact, in less than 10s, all the OADMM
in each data sample. Then the nonzero entries in ri were i.i.d. algorithms (with different values of ρ) converge. This is
generated from the uniform distribution U[−1, 1]. For each because for OADMM, a large ρ only has a significant effect
i 6∈ I, we set ri = 0. Define the outlier matrix Ro such that on its convergence speed in the initial transient phase. After
(Ro ):i = ri , for any i ∈ [N ]. Then the (contaminated) data the algorithm reaches its steady state (i.e., the basis matrix
matrix V = PVe (Vo +Ro +N), where N ∈ RF ×N denotes the becomes stable), the effect of a large ρ becomes minimal
observation noise matrix with i.i.d. standard normal entries. in terms of solving both (9) and (12). Similar effects of the
2) Comparison to Other Online and Batch NMF Algo- different values of κ on the convergence speeds of the PGD-
rithms: To make a fair comparison, all the algorithms were based algorithms are observed in Figure 5. Together, Figure 4
initialized with the same W0 for each parameter setting. and 5 reveal another advantage of our online algorithms. That
Moreover, all online algorithms had the same mini-batch size is, our online algorithms have a significantly larger tolerance
τ . The synthetic data matrix V was generated using the of suboptimal parameters than their batch counterparts.
parameters values F = 400, K o = 49, N = 1 × 105 , ν = 0.7 4) Effects of Proportion of Outliers: We evaluate the
and νe = 0.1. performance of our online (and batch) algorithms (with the
The convergence speeds of the aforementioned algorithms canonical parameter setting) on the synthetic dataset with a
under the canonical parameter setting11 are shown in Figure 1. larger proportion of outliers. Specifically, in generating the
From Figure 1(a), we observe our online algorithms converge synthetic data matrix V, we keep F , K o and N unchanged, but
much faster than their batch counterparts. From Figure 1(b), simultaneously increase ν and νe. Figures 6(a) and 6(b) show
we observe the ORPCA algorithm converges slightly faster the convergence results of our algorithms on the synthetic
11 For BPGD and BADMM, the step sizes and penalty parameters had the
dataset with (ν, νe) equal to (0.8, 0.2) and (0.9, 0.3) respec-
same values as those of the online algorithms. For ORPCA and ONMF, we tively. From Figure 6, we observe that our online algorithms
kept the parameter settings in the original works. still converge fast and steadily (without fluctuations) to a
11
20 40 80 25
OADMM OADMM
OPGD OPGD
Objective Values
Objective Values
20
15 30 60
BADMM BADMM
ONMF
BPGD BPGD 15
ORPCA
10 20 40 ONMF
OADMM
10 ORPCA
OPGD
OADMM
5 10 20
5 OPGD
0 0 0 0
10-1 100 101 10-1 100 101 10-1 100 101 102 10-1 100 101
Time (s) Time (s) Time (s) Time (s)
(a) (b) (a) K = 25 (b) K = 25
Fig. 1. The objective values (as a function of time) of (a) our online
30 40
algorithms and their batch counterparts (b) our online algorithms and other OADMM
online algorithms. The parameters are set according to the canonical setting. OPGD
Objective Values
30
BADMM
20 ONMF
BPGD
ORPCA
20
OADMM
15 15 τ =5
10 OPGD
= 10
Objective Values
τ
10
τ = 20
10 10 τ = 40
0 0
10-1 100 101 102 10-1 100 101
5 5 Time (s) Time (s)
(c) K = 100 (d) K = 100
0 0
10-1 10 0
10 1
10-1 10 0
10 1 Fig. 3. The objective values (as a function of time) of all the algorithms for
Time (s) Time (s) different values of K. In (a) and (b), K = 25. In (c) and (d), K = 100. All
the other parameters are set according to the canonical setting.
(a) OPGD (b) OADMM
Fig. 2. The objective values of (a) OPGD and (b) OADMM for different
values of τ . All the other parameters are set according to the canonical setting. 30
15 ρ =1
= 10
Objective Values
ρ
ρ = 100
20 10 ρ = 1000
stationary point even though both ν and νe have increased.
5) Experiments on the CBCL Dataset: To generate the 10 5
CBCL face dataset with outliers, we first replicated it by a
factor p = 50 (so that the total number of data samples
0 0
is of the order 105 ) and randomly permuted the aggregated 10-1 100 101 102 103 10-1 100 101 102
dataset as described in Section VII-B. We then contaminated Time (s) Time (s)
it with outliers in the same manner as for the synthetic dataset. (a) BADMM (b) OADMM
However, we avoided adding observation noise since it had Fig. 4. The objective values of (a) BADMM and (b) OADMM for different
been already introduced by the image acquisition process. values of ρ. All the other parameters are set according to the canonical setting.
We then conducted experiments on the contaminated face
dataset with the same parameter settings as those on the syn-
thetic data. In the interest of space, we defer the convergence learned by our online (and batch) algorithms are free of noise
results of all the online and batch algorithms to Section S-10. and much more local than those learned by ORPCA. We can
The results show that both our online algorithms demonstrate easily observe facial parts from the learned basis. Also, the
fast and steady convergences on this dataset. Moreover, the basis images learned by our four algorithms are of similar
convergence speeds are insensitive to the key parameters (τ , quality. To further enhance the sparsity of the basis images
K, ρ and κ). Therefore, in the subsequent experiments, we learned, one can add sparsity constraints to each column of
will only focus on the canonical parameter setting unless the basis matrix, as done for example in [43]. However, we
mentioned otherwise. omit such constraints in our discussions. Another parameter
affecting the sparsity of the learned basis images is the latent
dimension K. In general, the larger K is, the sparser the basis
E. Basis Representations images will be. Here we set K = 49, in accordance with the
We now examine the basis matrices learned by all the original setting in [3].
algorithms on the CBCL dataset with outliers as introduced in
Section VII-D5. The parameters controlling the outlier density,
ν and νe are set to 0.7 and 0.1 respectively. Figure 7 shows the F. Application I: Image Denoising
basis representations learned by all the algorithms. From this A natural application that arises from the experiments on
figure, we observe that the basis images learned by ONMF the CBCL dataset would be image denoising, in particular,
have large residues of salt and pepper noise. Also, the basis removing the salt and pepper noise on images. The metric
images learned by ORPCA appear to be non-local and do not commonly used to measure the quality of the reconstructed
yield meaningful interpretations. In contrast, the basis images images is peak signal-to-noise ratio (PSNR) [92]. A larger
12
10 8
κ = 0.1
= 0.2
Objective Values
8 κ
6 κ = 0.4
6 κ = 0.8
4
4
2
2
0 0
10-1 100 101 102 103 10-1 100 101 102
Time (s) Time (s) (a) ONMF (b) ORPCA
(a) BPGD (b) OPGD
Fig. 5. The objective values of (a) BPGD and (b) OPGD for different values
of κ. All the other parameters are set according to the canonical setting.
20 20
OADMM OADMM
OPGD
Objective Values
15 OPGD
BADMM 15
BADMM
BPGD BPGD
10 10
(c) OADMM (d) OPGD
5 5
0 0
10-1 100 101 102 10-1 100 101 102
Time (s) Time (s)
(a) ν = 0.8, νe = 0.2 (b) ν = 0.9, νe = 0.3
Fig. 6. The objective values of our online and batch algorithms on the
synthetic dataset with a larger proportion of outliers.
value of PSNR indicates better quality of image denoising. (e) BADMM (f) BPGD
With a slight abuse of definition, we apply this metric to all Fig. 7. Basis representations learned by all the algorithms (ν = 0.7, νe =
0.1). All the parameters are in the canonical setting (K = 49).
the recovered images. For the batch algorithms, we define
n o
PSNR , −10 log10 kVo − W b 2 /(F N ) ,
c Hk
F (47) TABLE I
AVERAGE PSNR S ( IN D B) AND RUNNING TIMES ( IN SECONDS ) WITH
where Vo , W c and H b denote the matrix of the clean images, STANDARD DEVIATIONS OF ALL THE ALGORITHMS ON THE CBCL FACE
DATASET WITH DIFFERENT NOISE DENSITY (K = 49).
estimated basis matrix and estimated coefficient matrix respec-
tively. For the online algorithms, we instead define PSNR in
terms of WN (the final dictionary output by Algorithm 1) and Setting 1 Setting 2 Setting 3
the past statistics {hi }i∈[N ] OADMM 11.37 ± 0.02 11.35 ± 0.15 11.33 ± 0.11
(N ) OPGD 11.48 ± 0.10 11.47 ± 0.05 11.39 ± 0.03
X BADMM 11.53 ± 0.19 11.51 ± 0.10 11.48 ± 0.03
o 2
PSNR , −10 log10 kvi − WN hi k2 /(F N ) (48) BPGD 11.56 ± 0.07 11.52 ± 0.14 11.48 ± 0.05
i=1
ONMF 5.99 ± 0.12 5.97 ± 0.18 5.95 ± 0.17
c ORPCA 11.25 ± 0.04 11.24 ± 0.19 11.23 ± 0.05
= −10 log10 fbN (WN ). (49)
(a) PSNRs
In other words, a low averaged regret will result in a high
Setting 1 Setting 2 Setting 3
PSNR. Table I shows the image denoising results of all the OADMM 416.68 ± 3.96 422.35 ± 3.28 425.83 ± 4.25
algorithms on the CBCL face dataset. Here the settings 1, OPGD 435.32 ± 4.80 449.25 ± 3.18 458.58 ± 4.67
2 and 3 represent different densities of the salt and pepper BADMM 1000.45 ± 10.18 1190.55 ± 5.88 1245.39 ± 10.59
BPGD 1134.89 ± 11.37 1185.84 ± 9.83 1275.48 ± 9.48
noise. Specifically, these settings correspond to (ν, νe) equal to ONMF 2385.93 ± 11.15 2589.38 ± 15.57 2695.47 ± 14.15
(0.7, 0.1), (0.8, 0.2) and (0.9, 0.3) respectively. All the results ORPCA 368.35 ± 3.23 389.59 ± 3.49 399.85 ± 4.12
were obtained using ten random initializations of W0 . From (b) Running times
Table I(a), we observe that for all the three settings, in terms
of PSNRs, our online algorithms only slightly underperform
their batch counterparts, but slightly outperform ORPCA and
greatly outperform ONMF. From Table I(b), we observe our achieve comparable performances with ORPCA but are much
online algorithms only have slightly longer running times than better than the rest algorithms. Also, in terms of PSNRs
ORPCA, but are significantly faster than the rest ones. Thus and running times, no significant differences can be observed
in terms of the trade-off between the computational efficiency between our online algorithms. Similar results were observed
and the quality of image denoising, our online algorithms when K = 25 and K = 100. See Section S-10 for details.
13
TABLE II
AVERAGE RUNNING TIMES ( IN SECONDS ) OF ALL THE ALGORITHMS ON
SUBJECT N O .2.
TABLE III
AVERAGE RUNNING TIMES ( IN SECONDS ) OF ALL THE ALGORITHMS ON [6] C.-J. Lin, “Projected gradient methods for nonnegative matrix factor-
VIDEO SEQUENCES ( A ) Hall AND ( B ) Escalator. ization,” Neural Comput., vol. 19, no. 10, pp. 2756–2779, Oct. 2007.
[7] H. Kim and H. Park, “Nonnegative matrix factorization based on alter-
nating nonnegativity constrained least squares and active set method,”
SIAM J. Matrix Anal. A., vol. 30, no. 2, pp. 713–730, 2008.
Algorithms Time (s) Algorithms Time (s) [8] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, “An alternating direction
ONMF 1525.57 ± 14.43 ORPCA 166.85 ± 6.09 algorithm for matrix completion with nonnegative factors,” Frontiers
OADMM 172.29 ± 5.64 OPGD 178.46 ± 7.57 Math. China, vol. 7, pp. 365–384, 2012.
BADMM 1280.06 ± 13.57 BPGD 1167.95 ± 13.38 [9] C. Ding, X. He, and H. D. Simon, “On the equivalence of nonnegative
(a) matrix factorization and spectral clustering,” in Proc. SDM, Newport
Beach, California, USA, Apr. 2005, pp. 606–610.
Algorithms Time (s) Algorithms Time (s) [10] Y. Yuan, Y. Feng, and X. Lu, “Projection-based nmf for hyperspectral
ONMF 1324.86 ± 11.45 ORPCA 160.85 ± 5.03 unmixing,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 8,
no. 6, pp. 2632–2643, Jun. 2015.
OADMM 166.83 ± 5.64 OPGD 170.46 ± 5.91
[11] J. L. Durrieu, B. David, and G. Richard, “A musically motivated
BADMM 898.47 ± 8.57 BPGD 867.41 ± 9.35
mid-level representation for pitch estimation and musical audio source
(b) separation,” IEEE J. Sel. Top. Signal Process., vol. 5, no. 6, pp. 1180–
1191, Oct. 2011.
[12] B. Cao, D. Shen, J.-T. Sun, X. Wang, Q. Yang, and Z. Chen, “Detect
and track latent factors with online nonnegative matrix factorization,”
performances. Also, both of our online algorithms have similar in Proc. IJCAI, Hyderabad, India, Jan. 2007, pp. 2689–2694.
performances on the foreground-background separation task. [13] L. Zhang, Z. Chen, M. Zheng, and X. He, “Robust nonnegative matrix
factorization,” Frontiers Elect. Electron. Eng. China, vol. 6, no. 2, pp.
192–200, 2011.
VIII. C ONCLUSION AND F UTURE W ORK [14] Netflix, “Rad - outlier detection on big data,” http://techblog.netflix.
com/2015/02/rad-outlier-detection-on-big-data.html, 2015.
In this paper, we have developed online algorithms for NMF [15] S. S. Bucak and B. Gunsel, “Incremental subspace learning via non-
negative matrix factorization,” Pattern Recogn., vol. 42, no. 5, pp. 788–
where the data samples are contaminated by outliers. We 797, May 2009.
have shown that the proposed class of algorithms is robust [16] N. Guan, D. Tao, Z. Luo, and B. Yuan, “Online nonnegative matrix fac-
to sparsely corrupted data samples and performs efficiently torization with robust stochastic approximation,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 23, no. 7, pp. 1087–1099, Jul. 2012.
on large datasets. Finally, we have proved almost sure con- [17] A. Lefèvre, F. Bach, and C. Févotte, “Online algorithms for nonneg-
vergence guarantees of the objective function as well as the ative matrix factorization with the itakura-saito divergence,” in Proc.
sequence of basis matrices generated by the algorithms. WASPAA, New Paltz, New York, USA, Oct 2011, pp. 313–316.
[18] F. Wang, C. Tan, A. C. Konig, and P. Li, “Efficient document clustering
We hope to pursue the following three directions for further via online nonnegative matrix factorizations,” in Proc. SDM, Mesa,
research. First, it would be interesting to extend the conver- Arizona, USA, Apr. 2011, pp. 908–919.
gence analyses given i.i.d. data sequences to weakly dependent [19] Y. Wu, B. Shen, and H. Ling, “Visual tracking via online nonnegative
matrix factorization,” IEEE Trans. Circuits Syst. Video Technol., vol. 24,
data sequences (e.g., martingale or auto-regressive processes). no. 3, pp. 374–383, Mar. 2014.
This is because real data (e.g., video and audio recordings) [20] C. Liu, H. chih Yang, J. Fan, L.-W. He, and Y.-M. Wang, “Distributed
have correlations among adjacent data samples. Second, it nonnegative matrix factorization for web-scale dyadic data analysis on
mapreduce,” in Proc. WWW, Raleigh, North Carolina, USA, Apr. 2010,
would be meaningful to consider the case where the fit-to-data pp. 681–690.
term between v and Wh+r, d(vkWh+r) is not the squared [21] R. Gemulla, P. J. Haas, Y. Sismanis, C. Teflioudi, and F. Makari,
`2 loss, e.g., the β-divergence [95], [96]. This is because in “Large-scale matrix factorization with distributed stochastic gradient
descent,” in Proc. KDD, San Diego, California, USA, Aug. 2011, pp.
many scenarios, the observation noise is not Gaussian [95] 69–77.
so other types of loss functions need to be used. Finally, as [22] S. S. Du, Y. Liu, B. Chen, and L. Li, “Maxios: Large scale nonnegative
mentioned in Remark 6, it would be meaningful to consider matrix factorization for collaborative filtering,” in Proc. NIPS-DMLMC,
Montréal, Canada, Dec. 2014.
nonconvex constraint sets and/or regularizers on W, h or r. [23] J. Chen, Z. J. Towfic, and A. H. Sayed, “Dictionary learning over
This is because the nonconvex constraints and regularizers distributed models,” IEEE Trans. Signal Process., vol. 63, no. 4, pp.
often appear in real applications [97]. 1001–1016, 2015.
[24] N. Halko, P. G. Martinsson, and J. A. Tropp, “Finding structure
Acknowledgment: The authors are grateful for the helpful with randomness: Probabilistic algorithms for constructing approximate
clarifications and suggestions from Julien Mairal. matrix decompositions,” SIAM Rev., vol. 53, no. 2, pp. 217–288, 2011.
[25] M. Tepper and G. Sapiro, “Compressed nonnegative matrix factoriza-
tion is fast and accurate,” IEEE Trans. Signal Process., vol. 64, no. 9,
R EFERENCES pp. 2269–2283, May 2016.
[26] H. Gao, F. Nie, W. Cai, and H. Huang, “Robust capped norm non-
[1] R. Zhao and V. Y. F. Tan, “Online nonnegative matrix factorization negative matrix factorization: Capped norm NMF,” in Proc. CIKM,
with outliers,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 2662– Melbourne, Australia, Oct. 2015, pp. 871–880.
2666. [27] J. Huang, F. Nie, H. Huang, and C. Ding, “Robust manifold nonnegative
[2] S. Tsuge, M. Shishibori, S. Kuroiwa, and K. Kita, “Dimensionality matrix factorization,” ACM Trans. Knowl. Discov. Data, vol. 8, no. 3,
reduction using non-negative matrix factorization for information re- pp. 1–21, Jun. 2014.
trieval,” in Proc. SMC, vol. 2, Tucson, Arizona, USA, Oct. 2001, pp. [28] S. Yang, C. Hou, C. Zhang, and Y. Wu, “Robust non-negative matrix
960–965. factorization via joint sparse and graph regularization for transfer
[3] D. D. Lee and H. S. Seung, “Learning the parts of objects by non- learning,” Neural Comput. and Appl., vol. 23, no. 2, pp. 541–559,
negative matrix factorization,” Nature, vol. 401, pp. 788–791, October 2013.
1999. [29] L. Du, X. Li, and Y.-D. Shen, “Robust nonnegative matrix factorization
[4] ——, “Algorithms for non-negative matrix factorization,” in Proc. via half-quadratic minimization,” in Proc. ICDM, Brussels, Belgium,
NIPS, Denver, USA, Dec. 2000, pp. 556–562. Dec. 2012, pp. 201–210.
[5] J. Kim and H. Park, “Toward faster nonnegative matrix factorization: [30] C. Ding and D. Kong, “Nonnegative matrix factorization using a robust
A new algorithm and comparisons,” in Proc. ICDM, Pisa, Italy, Dec. error function,” in Proc. ICASSP, Kyoto, Japan, Mar. 2012, pp. 2033–
2008, pp. 353–362. 2036.
15
|←
− (a) −
→|←−−−− (b) −−−−→|←−−−− (c) −−−−→|←−−−− (d) −−−−→|←−−−− (e) −−−−→|←−−−− (f) −−−−→|←−−−− (g) −−−−→|
Fig. 9. Results of background-foreground separation by all the online and batch algorithms on the Hall (upper) and Escalator (lower) video sequences.
The labels (a) to (g) denote the original video frames and results produced by ONMF, ORPCA, OADMM, OPGD, BADMM, BPGD respectively. For each
algorithm, the left column denotes the background and the right column denotes the foreground (moving objects).
[31] D. Kong, C. Ding, and H. Huang, “Robust nonnegative matrix factor- [41] M. Métivier, Semimartingales: A Course on Stochastic Processes. de
ization using `21 -norm,” in Proc. CIKM, Glasgow, Scotland, UK, Oct. Gruyter, 1982.
2011, pp. 673–682. [42] P. Paatero and U. Tapper, “Positive matrix factorization: A non-negative
[32] A. Cichocki, S. Cruces, and S.-i. Amari, “Generalized alpha-beta diver- factor model with optimal utilization of error estimates of data values,”
gences and their application to robust nonnegative matrix factorization,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
Entropy, vol. 13, no. 1, pp. 134–170, 2011. [43] P. O. Hoyer, “Non-negative matrix factorization with sparseness con-
[33] S. P. Kasiviswanathan, H. Wang, A. Banerjee, and P. Melville, “Online straints,” J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Dec. 2004.
`1 -dictionary learning with application to novel document detection,” [44] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix
in Proc. NIPS, Lake Tahoe, USA, Dec. 2012, pp. 2258–2266. factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–
[34] C. Févotte and N. Dobigeon, “Nonlinear hyperspectral unmixing with 60, Mar 2010.
robust nonnegative matrix factorization,” IEEE Trans. Image Process., [45] D. Wang and H. Lu, “On-line learning parts-based representation via
vol. 24, no. 12, pp. 4810–4819, 2015. incremental orthogonal projective non-negative matrix factorization,”
[35] B. Shen, B. Liu, Q. Wang, and R. Ji, “Robust nonnegative matrix Signal Process., vol. 93, no. 6, pp. 1608 – 1623, 2013.
factorization via l1 norm regularization by multiplicative updating [46] J. Xing, J. Gao, B. Li, W. Hu, and S. Yan, “Robust object tracking
rules,” in Proc. ICIP, Paris, France, Oct. 2014, pp. 5282–5286. with online multi-lifespan dictionary learning,” in Proc. ICCV, Sydney,
Australia, Dec. 2013, pp. 665–672.
[36] J. Feng, H. Xu, and S. Yan, “Online robust PCA via stochastic
[47] S. Zhang, S. Kasiviswanathan, P. C. Yuen, and M. Harandi, “Online
optimization,” in Proc. NIPS, Lake Tahoe, USA, Dec. 2013, pp. 404–
dictionary learning on symmetric positive definite manifolds with
412.
vision applications,” in Proc. AAAI, Austin, Texas, Jan. 2015, pp. 3165–
[37] J. Shen, H. Xu, and P. Li, “Online optimization for max-norm regular- 3173.
ization,” in Proc. NIPS, Montréal, Canada, Dec. 2014, pp. 1718–1726. [48] X. Zhang, N. Guan, D. Tao, X. Qiu, and Z. Luo, “Online multi-modal
[38] N. Wang, J. Wang, and D.-Y. Yeung, “Online robust non-negative dic- robust non-negative dictionary learning for visual tracking,” PLoS ONE,
tionary learning for visual tracking,” in Proc. ICCV, Sydney, Australia, vol. 10, no. 5, pp. 1–17, 2015.
Dec. 2013, pp. 657–664. [49] J. Zhan, B. Lois, and N. Vaswani, “Online (and offline) robust PCA:
[39] R. T. Rockafellar, Convex analysis. Princeton University Press, 1970. Novel algorithms and performance guarantees,” arXiv:1601.07985,
[40] A. W. van der Vaart, Asymptotic Statistics. Cambridge Press, 2000. 2016.
16
[50] B. Lois and N. Vaswani, “Online matrix completion and online robust [77] W. Wang and M. A. Carreira-Perpin̈án, “Projection onto the probability
PCA,” arXiv:1503.03525, 2015. simplex: An efficient algorithm with a simple proof, and an applica-
[51] X. Guo, “Online robust low rank matrix recovery,” in Proc. IJCAI, tion,” arXiv:1309.1541, 2013.
Buenos Aires, Argentina, Jul. 2015, pp. 3540–3546. [78] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix
[52] P. Sprechmann, A. Bronstein, and G. Sapiro, “Real-time online singing decomposition, with applications to sparse principal components and
voice separation from monaural recordings using robust low-rank canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–
modeling,” in Proc. ISMIR, Porto, Portugal, Oct. 2012, pp. 67–72. 534, 2009.
[53] P. Sprechmann, A. M. Bronstein, and G. Sapiro, “Learning efficient [79] H. Zou and T. Hastie, “Regularization and variable selection via the
sparse and low rank models,” arXiv:1212.3631, 2012. elastic net,” J. Roy. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320,
[54] Y. Tsypkin, Adaptation and Learning in automatic systems. Academic 2005.
Press, New York, 1971. [80] V. Sindhwani and A. Ghoting, “Large-scale distributed non-negative
[55] L. Bottou, “Online algorithms and stochastic approximations,” in sparse coding and sparse dictionary learning,” in Proc. KDD, Beijing,
Online Learning and Neural Networks. Cambridge University Press, China, Aug. 2012, pp. 489–497.
1998. [81] J. Duchi, “Elastic net projections,” 2009. [Online]. Available:
[56] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in http://stanford.edu/∼jduchi/projects/proj elastic net.pdf
Proc. NIPS, Vancouver, Canada, Dec. 2008, pp. 161–168. [82] J. Kim, R. Monteiro, and H. Park, “Group sparsity in nonnegative
[57] A. Shapiro and A. Philpott, “A tutorial on stochastic programming,” matrix factorization,” in Proc. SDM, Anaheim, California, USA, Apr.
http://stoprog.org/sites/default/files/SPTutorial/TutorialSP.pdf, 2007. 2012, pp. 851–862.
[58] C.-J. Lin, “On the convergence of multiplicative update algorithms for [83] X. Liu, H. Lu, and H. Gu, “Group sparse non-negative matrix fac-
nonnegative matrix factorization,” IEEE Trans. Neural Netw., vol. 18, torization for multi-manifold learning,” in Proc. BMVC, Dundee, UK,
no. 6, pp. 1589–1596, 2007. Aug. 2011, pp. 1–11.
[59] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux, “Dictionary [84] J. Friedman, T. Hastie, and R. Tibshirani, “A note on the group lasso
learning for massive matrix factorization,” in Proc. ICML, New York, and a sparse group lasso,” arXiv:1001.0736, 2010.
USA, Jul. 2016. [85] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, “A sparse-group
[60] J. Mairal, “Stochastic majorization-minimization algorithms for large- lasso,” J. Comput. Graph. Statist., vol. 22, pp. 231–245, 2013.
scale optimization,” in Proc. NIPS, Lake Tahoe, USA, Dec. 2013, pp. [86] M. Yuan and Y. Lin, “Model selection and estimation in regression
2283–2291. with grouped variables,” J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp.
[61] M. Razaviyayn, M. Sanjabi, and Z.-Q. Luo, “A stochastic successive 49–67, 2006.
minimization method for nonsmooth nonconvex optimization with ap- [87] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few
plications to transceiver design in wireless communication networks,” to many: illumination cone models for face recognition under variable
Math. Program., vol. 157, no. 2, pp. 515–545, 2016. lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23,
no. 6, pp. 643–660, Jun. 2001.
[62] D. Sun and C. Fevotte, “Alternating direction method of multipliers for
[88] L. Li, W. Huang, I. Y.-H. Gu, and Q. Tian, “Statistical modeling of
non-negative matrix factorization with the beta-divergence,” in Proc.
complex backgrounds for foreground object detection,” IEEE Trans.
ICASSP, Florence, Italy, May 2014, pp. 6201–6205.
Image Process., vol. 13, no. 11, pp. 1459–1472, Nov. 2004.
[63] M. Razaviyayn, M. Hong, and Z.-Q. Luo, “A unified convergence
[89] V. Y. F. Tan and C. Févotte, “Automatic relevance determination in
analysis of block successive minimization methods for nonsmooth
nonnegative matrix factorization with the β-divergence,” IEEE Trans.
optimization,” SIAM J. Optim., vol. 23, no. 2, pp. 1126–1153, 2013.
Pattern Anal. Mach. Intell., vol. 35, no. 7, pp. 1592–1605, Jul. 2013.
[64] M. Hong, M. Razaviyayn, Z. Q. Luo, and J. S. Pang, “A unified
[90] C. M. Bishop, “Bayesian PCA,” in Proc. NIPS, Denver, USA, Jan.
algorithmic framework for block-structured optimization involving big
1999, pp. 382–388.
data: With applications in machine learning and signal processing,”
[91] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed
IEEE Signal Process. Mag., vol. 33, no. 1, pp. 57–77, Jan. 2016.
optimization and statistical learning via the alternating direction method
[65] H. Zhang and L. Cheng, “Projected shrinkage algorithm for box- of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122,
constrained `1 -minimization,” Optim. Lett., vol. 1, no. 1, pp. 1–16, Jan. 2011.
2015. [92] T. Veldhuizen, “Measures of image quality,” http://homepages.inf.ed.
[66] P. Tseng, “On accelerated proximal gradient methods for convex- ac.uk/rbf/CVonline/LOCAL COPIES/VELDHUIZEN/node18.html,
concave optimization,” Technical report, University of Washington, 1998.
Seattle, 2008. [93] E. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component
[67] Y. Nesterov, “Gradient methods for minimizing composite functions,” analysis?” J. ACM, vol. 58, no. 3, pp. 1–37, Jun. 2011.
Math. Program., vol. 140, no. 1, pp. 125–161, 2013. [94] P. Netrapalli, N. U N, S. Sanghavi, A. Anandkumar, and P. Jain, “Non-
[68] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” convex robust pca,” in Proc. NIPS, Montréal, Canada, Dec. 2014, pp.
SIAM J. Optim., vol. 20, no. 3, pp. 1364–1377, 2009. 1107–1115.
[69] D. Hajinezhad, T. H. Chang, X. Wang, Q. Shi, and M. Hong, “Non- [95] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factor-
negative matrix factorization using admm: Algorithm and convergence ization with the Itakura-Saito divergence. With application to music
analysis,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 4742– analysis,” Neural Comput., vol. 21, no. 3, pp. 793–830, Mar. 2009.
4746. [96] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factoriza-
[70] D. Donoho and V. Stodden, “When does non-negative matrix factoriza- tion with the beta-divergence,” Neural Comput., vol. 23, no. 9, pp.
tion give correct decomposition into parts?” in Proc. NIPS, Vancouver, 2421–2456, 2011.
Canada, Dec. 2004, pp. 1141–1148. [97] C. Bao, H. Ji, Y. Quan, and Z. Shen, “Dictionary learning for sparse
[71] J. E. Cohen and U. G. Rothblum, “Nonnegative ranks, decompositions, coding: Algorithms and analysis,” IEEE Trans. Pattern Anal. Mach.
and factorizations of nonnegative matrices,” Linear Algebra Appl., vol. Intell., vol. 38, no. 7, pp. 1356–1369, July 2015.
190, no. 1, pp. 149 – 168, 1993. [98] J. F. Bonnans and A. Shapiro, “Optimization problems with perturba-
[72] S. Arora, R. Ge, R. Kannan, and A. Moitra, “Computing a nonnegative tions: A guided tour,” SIAM Rev., no. 2, pp. 228–264, 1998.
matrix factorization – provably,” in Proc. STOC, New York, New York, [99] K. Sydsaeter, P. Hammond, A. Seierstad, and A. Strom, Further
USA, May 2012, pp. 145–162. Mathematics for Economic Analysis. Pearson, 2005.
[73] N. Gillis, “Sparse and unique nonnegative matrix factorization through [100] P. Billingsley, Probability and Measure, 2nd ed. John Wiley & Sons,
data preprocessing,” J. Mach. Learn. Res., vol. 13, no. 1, pp. 3349– 1986.
3386, Nov. 2012. [101] D. L. Fisk, “Quasi-martingales,” Trans. Amer. Math. Soc., vol. 120,
[74] A. Vandaele, N. Gillis, F. Glineur, and D. Tuyttens, “Heuristics for no. 3, pp. 369–89, 1965.
exact nonnegative matrix factorization,” J. Glob. Optim., vol. 65, no. 2,
pp. 369–400, 2016.
[75] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic
approximation approach to stochastic programming,” SIAM J. Optimiz.,
vol. 19, no. 4, pp. 1574–1609, Jan. 2009.
[76] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient
projections onto the l1-ball for learning in high dimensions,” in Proc.
ICML, Helsinki, Finland, Jul. 2008, pp. 272–279.
17
Supplemental Material
In this supplemental material, the indices of all the sections, definitions, lemmas and equations are prepended with an ‘S’ to
distinguish those in the main text. The organization of this article is as follows. In section S-1, we drive the ADMM algorithms
presented in Section IV-B. In Section S-2, we extend our two solvers (based on PGD and ADMM) in Section IV to the batch
NMF problems with outliers. Then, we provide detailed proofs of the theorems and lemmas in Section V in Section S-3 to
S-8. The technical lemmas used in the algorithm derivation (Section IV) and the convergence analysis (Section V) are shown
in Section S-9. Finally, we show additional experiment results in Section S-10 to supplement those in Section VII. The finite
constants c, c1 and c2 are used repeatedly in different sections, and their meanings depend on the context.
S-1. D ERIVATION OF THE ADMM A LGORITHMS IN S ECTION IV-B
A. Algorithms for (9)
Minimizing h and r amounts to solving the following two unconstrained problems
1 2 ρ1 2
min kWh + (r − v)k2 + αT (h − u) + kh − uk2
h 2 2
1 T T
⇐⇒ min h (WT W + ρ1 I)h + WT (r − v) − ρ1 u + α h (S-1)
h 2
and
1 2 ρe2 2
min k(v − Wh) − rk2 + kr − qk2 + β T (r − q) + λ krk1
r 2 2
1 + ρe2 T
⇐⇒ min r r + (β − v + Wh − ρe2 q)T r + λ krk1
r 2
2
r − ρe2 q + v − β − Wh
+ λ krk .
1 + ρe2
⇐⇒ min 1 (S-2)
r 2
1 + ρe2
2
We notice that (S-1) is a standard strongly convex quadratic minimization problem and (S-2) is a standard proximal minimization
problem with `1 norm, thus the closed-form optimal solutions for (S-1) and (S-2) are
h∗ = (WT W + ρ1 I)−1 (WT (v − r) + ρ1 u − α)
ρe2 q + v − β − Wh Sλ (e
ρ2 q + v − β − Wh)
r∗ = Sλ/(1+eρ2 ) = .
1 + ρe2 1 + ρe2
Minimizing u and q amounts to solving the following two constrained problems
ρ1 2
min kh − uk2 − αT u (S-3)
u≥0 2
ρe2 2
min kr − qk2 − β T q. (S-4)
kqk∞ ≤M 2
Since both constraints u ≥ 0 and kqk∞ ≤ M are separable across coordinates, we can simply solve the unconstrained quadratic
minimization problems and then project the optimal solutions to the feasible sets.
A. Problem Formulation
As usual, we denote the data matrix (with outliers) as V, basis matrix as W, coefficient matrix as H and outlier matrix as
R. The number of total data samples is denoted as N . Then, the batch counterpart for the online NMF problem with outliers
can be formulated as follows
1 2
min kV − WH − RkF + λ kRk1,1
2
K×N
s. t. H ∈ R+ , R ∈ R,
e W ∈ C, (S-7)
e = {R ∈ RF ×N | |ri,j | ≤ M, ∀ (i, j) ∈ [F ] × [N ]}.
where R
B. Notations
In the sequel, we overload soft-thresholding operators Seλ,M and Sλ . When these two operators are applied to matrices, each
operator denotes entrywise soft-thresholding. The updated variables are denoted with superscripts ‘+’.
where det (W) , fet (W) − fet+1 (W), for all W ∈ C. We aim to show det is Lipschitz on C with Lipschitz constant only
dependent on t. For all W ∈ C, we have
t t+1
1X1 2 1 X1 2
det (W) = kvi − Whi − ri k2 + λ kri k1 − kvi − Whi − ri k2 + λ kri k1
t i=1 2 t + 1 i=1 2
t t+1
!
1 X 1 2
X 1 2
= (t + 1) kvi − Whi − ri k2 + λ kri k1 − t kvi − Whi − ri k2 + λ kri k1
t(t + 1) i=1
2 i=1
2
t
!
c 1 X 1 t 2 2
= kvi − Whi − ri k2
− kvt+1 − Wht+1 − rt+1 k2
t(t + 1) i=1
2 2
t
1 X 1 2 1 2
= kvi − Whi − ri k2 − kvt+1 − Wht+1 − rt+1 k2 ,
t(t + 1) i=1 2 2
c
where = denotes equality up to an additive constant (independent of W). By a reasoning similar to the one for Lemma S-2,
we can show there exist some positive constants c1 and c2 such that
t
1 X
dt (Wt+1 ) − det (Wt ) ≤ (c1 khi − ht+1 k2 + c2 kvi − vt+1 k2 + c2 kri − rt+1 k2 ) kWt+1 − Wt k2
e
t(t + 1) i=1
c3
≤ kWt+1 − Wt kF , (S-29)
t+1
where c3 > 0 is a constant since for all i ≥ 1, vi , hi and ri are bounded a.s.. Combining (S-28) and (S-29), we have with
probability one,
c0
kWt+1 − Wt kF ≤ 3 , (S-30)
t+1
where c03 = c3 /m2 . Hence we complete the proof.
We prove that {fet (Wt )}t≥1 converges a.s. by the quasi-martingale convergence theorem (see Lemma S-8). Let us first define
a filtration {Ft }t≥1 where Ft , σ {vi , Wi , hi , ri }i∈[t] is the minimal σ-algebra such that {vi , Wi , hi , ri }i∈[t] are measurable.
Also define (
1, if E[ut+1 − ut |Ft ] > 0
ut , ft (Wt ) and δt ,
e , (S-31)
0, otherwise
then itPis easy to see {ut }t≥1 is adapted to {Ft }t≥1 . According to the quasi-martingale convergence theorem, it suffices to
∞
show t=1 E[δt (ut+1 − ut )] < ∞. To bound E[δt (ut+1 − ut )], we decompose ut+1 − ut as
ut+1 − ut = fet+1 (Wt+1 ) − fet+1 (Wt ) + fet+1 (Wt ) − fet (Wt )
1 e t e
= fet+1 (Wt+1 ) − fet+1 (Wt ) + `(vt+1 , ht+1 , rt+1 , Wt ) + ft (Wt ) − fet (Wt )
t+1 t+1
`(vt+1 , Wt ) − fet (Wt )
= fet+1 (Wt+1 ) − fet+1 (Wt ) + (S-32)
t+1
`(v t+1 , W t ) − ft (Wt ) ft (Wt ) − fet (Wt )
= fet+1 (Wt+1 ) − fet+1 (Wt ) + + . (S-33)
| {z } t+1 t+1
h1i
| {z }
h2i
By definition, it is easy to see both h1i , h2i ≤ 0. In (S-33), we insert the term ft (Wt ) in order to invoke Donsker’s theorem
(see Lemma S-9). Thus,
E[`(vt+1 , Wt ) − ft (Wt ) | Ft ] f (Wt ) − ft (Wt )
E[ut+1 − ut | Ft ] ≤ =
t+1 t+1
21
By (S-35), it suffices to show that for every realization of {vt }t≥1 such that fet (Wt ) − ft (Wt ) → 0, each subsequential limit
of {Wt }t≥1 is a stationary point of f . We focus on such a realization, then all the variables in the sequel become deterministic.
(With a slight abuse of notations we use the same notations to denote the deterministic variables.) By the compactness of V,
H and R, both sequences {At }t≥1 and {Bt }t≥1 are bounded. Thus there exist compact sets A and B such that {At }t≥1 ⊆ A
and {Bt }t≥1 ⊆ B. Similar reasoning shows that the sequence {fet (0)}t≥1 resides in a compact set F. e By the compactness of
C, there exists a convergent subsequence {Wtm }m≥1 in {Wt }t≥1 . Also, by the compactness of A, B and F, e it is possible
to find convergent subsequences {Atk }k≥1 , {Btk }k≥1 and {ftk (0)}k≥1 such that {tk }k≥1 ⊆ {tm }m≥1 . Thus, we focus on
e
the convergent sequences {Wtk }k≥1 , {Atk }k≥1 , {Btk }k≥1 and {fetk (0)}k≥1 and drop the subscript k to make notations
uncluttered. We denote the limits of the sequences {Wt }t≥1 , {At }t≥1 and {Bt }t≥1 as W, A and B respectively.
First, we show the sequence of differentiable functions {fet }t≥1 converges uniformly to a differentiable function f . Since
{ft (0)}t≥1 converges, it suffices to show the sequence {∇fet }t≥1 converges uniformly to a function h. Since ∇fet (W) =
e
22
= ∇f − ∇g. To
show W is a stationary point
any W ∈ C, the
of f , it suffices to show for directional derivative
∇f (W), W − W ≥ 0. We show this by proving ∇f (W), W − W ≥ 0 and ∇g(W), W − W = 0 for any W ∈ C.
By definition, for any W ∈ C and t ≥ 1, we have fet (Wt ) ≤ fet (W). First consider
ft (Wt ) − f (W) = fet (Wt ) − f (Wt ) + f (Wt ) − f (W)
e
≤
fet − f
+ f (Wt ) − f (W) .
C
u
Since
fet −
→ f and f is continuous, we have fet (Wt ) → f (W) as t → ∞. Thus f (W) ≤ f (W), for any W ∈ C. This implies
∇f (W), W − W ≥ 0.
Next we show ∇g(W), W − W = 0 for any W ∈ C. It suffices to show ∇g(W) = 0. Since both fet and ft are
differentiable, gt is differentiable and ∇gt = ∇fet − ∇f . First it is easy to see ∇gt is Lipschitz on C with constant L > 0
independent of t since both ∇fet and ∇ft are Lipschitz on C with constants independent of t. It is possible to construct
another differentiable function get with domain RF ×K such that get is nonnegative with a L-Lipschitz gradient on RF ×K and
get (W) = gt (W) for all W ∈ C. Thus
1 1
k∇gt (Wt )k2F = k∇egt (Wt )k2F ≤ get (Wt ) − get∗ ≤ gt (Wt ) = fet (Wt ) − ft (Wt ),
2L 2L
where get∗ = inf W∈RF ×K get (W) ≥ 0. Since fet (Wt ) − ft (Wt ) → 0, we have ∇gt (Wt ) → 0 as t → ∞. Now consider the
first-order Taylor expansion of gt at Wt
gt (W) = gt (Wt ) + h∇gt (Wt ), W − Wt i + o (kW − Wt kF ) , ∀ W ∈ C. (S-37)
As t → ∞, we have
g(W) = g(W) + o
W − W
F , ∀ W ∈ C. (S-38)
On the other hand, we have
g(W) = g(W) + ∇g(W), W − W + o
W − W
F , ∀ W ∈ C. (S-39)
Comparing (S-38) and (S-39), we have
∇g(W), W − W + o
W − W
F = 0, ∀ W ∈ C. (S-40)
Therefore we conclude ∇g(W) = 0.
Since v, W, h∗ (v, W) and r∗ (v, W) are all bounded, there exist positive constants M1 , M2 , M3 and M4 that upper bound
kvk, kWk, kh∗ (v, W)k and kr∗ (v, W)k respectively. Here the matrix norm is the one induced by the (general) vector norm.
Take arbitrary W1 and W2 in C, we have
k∇f (W1 ) − ∇f (W2 )k = kEv [∇W `(v, W1 ) − ∇W `(v, W2 )]k
≤ Ev k∇W `(v, W1 ) − ∇W `(v, W2 )k .
Fix an arbitrary v ∈ V we have
k∇W `(v, W1 ) − ∇W `(v, W2 )k ≤
W1 h∗ (v, W1 )h∗ (v, W1 )T − W2 h∗ (v, W2 )h∗ (v, W2 )T
≤ kW1 k kh∗ (v, W1 )k kh∗ (v, W1 ) − h∗ (v, W2 )k + kW1 h∗ (v, W1 ) − W2 h∗ (v, W2 )k kh∗ (v, W2 )k
≤ M2 M3 c1 kW1 − W2 k + (kW1 k kh∗ (v, W1 ) − h∗ (v, W2 )k + kW1 − W2 k kh∗ (v, W2 )k) kh∗ (v, W2 )k
≤ M2 M3 c1 kW1 − W2 k + M3 (M2 c1 kW1 − W2 k + M3 kW1 − W2 k)
= (2c1 M2 M3 + M32 ) kW1 − W2 k ,
∗
r (v, W1 )h∗ (v, W1 )T − r∗ (v, W2 )h∗ (v, W2 )T
≤ kr∗ (v, W1 )k kh∗ (v, W1 ) − h∗ (v, W2 )k + kr∗ (v, W1 ) − r∗ (v, W2 )k kh∗ (v, W2 )k
≤ c1 M4 kW1 − W2 k + c2 M3 kW1 − W2 k
≤ (c1 M4 + c2 M3 ) kW1 − W2 k ,
and
∗
vh (v, W1 )T − vh∗ (v, W2 )T
≤ c1 M1 kW1 − W2 k .
By the compactness of Z, A and B, there exist positive constants M1 , M2 and M3 such that kzk2 ≤ M1 , kAk2 ≤ M2 and
kbk2 ≤ M3 , for any z ∈ Z, A ∈ A and b ∈ B. Thus,
T
h1i ≤ M3 kAT A − AT A0 + AT A0 − A0 A0 k2
≤ M3 kAT (A − A0 )k2 +
(A − A0 )T A0
2
≤ 2M2 M3 kA − A0 k2 .
Similarly for h2i we have
T
h2i =
AT z − AT z0 + AT z0 − A0 z0
2
≤
AT (z − z0 )
2 +
(A − A0 )T z0
2
≤ M2 kz − z0 k2 + M1 kA − A0 k2 .
Hence h1i + h2i ≤ M2 kz − z0 k2 + (M1 + 2M2 M3 ) kA − A0 k2 . We now take c1 = M2 and c2 = M1 + 2M2 M3 to complete
the proof.
In particular, if for some u0 ∈ U, S(u0 ) = {x0 }, then v is differentiable at u = u0 and ∇v(u0 ) = ∇u f (x0 , u0 ).
Lemma S-6 (The Maximum Theorem; [99, Theorem 14.2.1 & Example 2]). Let P and X be two metric spaces. Consider a
maximization problem
max f (p, x), (S-49)
x∈B(p)
Remark 7. This is a simplified version of the Leibniz Integral Rule. See [100, Theorem 16.8] for weaker conditions on f .
Definition S-1 (Quasi-martingale; [101, Definition 1.4]). Let {Xt }t∈T be a stochastic process and let {Ft }t∈T be the filtration
to which {Xt }t∈T is adapted, where T is a subset of the real line. We call {Xt }t∈T a quasi-martingale if there exist two
stochastic processes {Yt }t∈T and {Zt }t∈T such that
i) both {Yt }t∈T and {Zt }t∈T are adapted to {Ft }t∈T ,
ii) {Yt }t∈T is a martingale and {Zt }t∈T has bounded variations on T a.s.,
iii) Xt = Yt + Zt , for all t ∈ T a.s..
Lemma S-8 (The Quasi-martingale Convergence Theorem; [41, Theorem 9.4 & Proposition 9.5]). Let (ut )t≥1 be a nonnegative
discrete-time stochastic process on a probability space (Ω, A, P), i.e., ut ≥ 0 a.s., for all t ≥ 1. Let {Ft }t≥1 be a filtration
25
to which (ut )t≥1 is adapted. Define another binary stochastic process (δt )t≥1 as
(
1, if E[ut+1 − ut |Ft ] > 0
δt = . (S-51)
0, otherwise
P∞ a.s.
If t=1 E[δt (ut+1 − ut )] < ∞, then ut −−→ u, where u is integrable on (Ω, F, P) and nonnegative a.s.. Furthermore, (ut )t≥1
is a quasi-martingale and
∞
X
E[|E[ut+1 − ut |Ft ]|] < ∞. (S-52)
t=1
Lemma S-9 (Donsker’s Theorem; Pn [40, Section 19.2]). Let X1 , . . . , Xn be i.i.d. generated from a distribution P. Define the
empirical distribution Pn , n1 i=1 δXi . For a measurable function f , define Pn f and Pf as the expectations of f under
√
the distributions Pn and P respectively. Define an empirical process Gn (f ) , n(Pn f − Pf ), f ∈ F, where F is a class of
measurable functions. F is P-Donsker if and only if the sequence of empirical processes {Gn }n≥1 converges in distribution
to a zero-mean Gaussian process G tight in `∞ (F), where `∞ (F) is the space of all real-valued and bounded functionals
defined on F equipped with the uniform norm on F, denoted as k·kF . Moreover, in such case, we have E kGn kF → E kGkF .
Lemma S-10 (Glivenko-Cantelli theorem; [40, Section 19.2]). Let Pn and P be the distributions defined as in Lemma S-9. A
a.s.
class of measurable functions F is P-Glivenko-Cantelli if and only if supf ∈F |Pn f − Pf | −−→ 0.
Lemma S-11 (A sufficient condition for P-Glivenko-Cantelli and P-Donsker classes; [40, Example 19.7]). Define a probability
space (X , A, P). Let F = {fθ : X → R | θ ∈ Θ} be a class of measurable functions, where Θ is a bounded subset in Rd . If
there exists a universal constant K > 0 such that
|fθ1 (x) − fθ2 (x)| ≤ K kθ1 − θ2 k , ∀ θ1 , θ2 ∈ Θ, ∀ x ∈ X , (S-53)
d
where k·k is a general vector norm in R , then F is both P-Glivenko-Cantelli and P-Donsker.
P∞ P∞
Lemma S-12 ( [44, Lemma 8]). Let (an ), (bn ) be two nonnegative sequences. Suppose n=1 an = ∞ and n=1 an bn < ∞,
and ∃ N ∈ N and K > 0 such that for all n ≥ N , |bn+1 − bn | ≤ Kan . Then (bn ) converges and limn→∞ bn = 0.
20
OADMM 50
Objective Values OPGD
15 40
BADMM ONMF
BPGD ORPCA
30
10 OADMM
OPGD
20
5
10
0 0
10-1 100 101 102 10-1 100 101
Time (s) Time (s)
(a) (b)
Fig. 10. The objective values (as a function of time) of (a) our online algorithms and their batch counterparts (b) our online algorithms and other online
algorithms on the CBCL face dataset. The parameters are set according to the canonical setting.
25 25
τ =5
= 10
Objective Values
20 20 τ
τ = 20
15 15 τ = 40
10 10
5 5
0 0
10-1 100 101 102 10-1 100 101
Time (s) Time (s)
(a) OPGD (b) OADMM
Fig. 11. The objective values (as a function of time) of (a) OPGD and (b) OADMM for different values of τ on the CBCL face dataset. All the other
parameters are set according to the canonical setting.
TABLE IV
PSNR S ( IN D B) OF ALL THE ALGORITHMS ON THE CBCL FACE DATASET WITH DIFFERENT NOISE DENSITY.
40
OADMM 50
OPGD
Objective Values
30 40
BADMM ONMF
BPGD ORPCA
30
20 OADMM
OPGD
20
10
10
0 0
10-1 100 101 102 10-1 100 101
Time (s) Time (s)
(a) K = 25 (b) K = 25
30 OADMM 50
OPGD
Objective Values
BADMM 40 ONMF
20 BPGD ORPCA
30
OADMM
OPGD
20
10
10
0 0
10-1 100 101 102 10-1 100 101 102
Time (s) Time (s)
(c) K = 100 (d) K = 100
Fig. 12. The objective values (as a function of time) of all the algorithms for different values of K on the CBCL face dataset. In (a) and (b), K = 25. In
(c) and (d), K = 100. All the other parameters are set according to the canonical setting.
20
ρ =1
30
= 10
Objective Values
ρ
15 ρ = 100
20 ρ = 1000
10
10 5
0 0
10-1 100 101 102 103 10-1 100 101 102
Time (s) Time (s)
(a) BADMM (b) OADMM
Fig. 13. The objective values (as a function of time) of (a) BADMM and (b) OADMM for different values of ρ on the CBCL face dataset. All the other
parameters are set according to the canonical setting.
28
10 20
κ = 0.1
Objective Values 8 κ = 0.2
15 κ = 0.4
6 κ = 0.8
10
4
5
2
0 0
10-1 100 101 102 103 10-1 100 101 102 103
Time (s) Time (s)
(a) BPGD (b) OPGD
Fig. 14. The objective values (as a function of time) of (a) BPGD and (b) OPGD for different values of κ on the CBCL face dataset. All the other parameters
are set according to the canonical setting.
15 15
OADMM OADMM
OPGD OPGD
Objective Values
BADMM BADMM
10 10
BPGD BPGD
5 5
0 0
10-1 100 101 102 10-1 100 101 102
Time (s) Time (s)
(a) ν = 0.8, νe = 0.2 (b) ν = 0.9, νe = 0.3
Fig. 15. The objective values (as a function of time) of our online and batch algorithms on the CBCL face dataset with a larger proportion of outliers.
TABLE V
RUNNING TIMES ( IN SECONDS ) OF ALL THE ALGORITHMS ON THE CBCL FACE DATASET WITH DIFFERENT NOISE DENSITY.