Learning For Structured Prediction Using Approximate Subgradient Descent With Working Sets
learning. The use of sampling leads to decreased learning time while still maintaining good levels of performance. We demonstrate the strength of our method on the task of learning CRF models for image segmentation.

The rest of the paper is organized as follows. We discuss the prior work in Section 2 and provide the background on the large-margin framework and learning techniques based on subgradient descent in Section 3. Section 4 describes our working set based algorithm in detail and analyzes its properties. We present the experimental results in Section 5 and conclude in Section 6.

2. Related work

Maximum margin learning of CRFs was first formulated in the max-margin Markov networks (M3N) [26], whose objective is to minimize a margin-sensitive hinge loss between the ground-truth labeling and all other labelings for each training example. This is especially appealing for learning CRFs with loopy structures, due to its more objective-driven nature and its complete bypassing of the partition function, which presents a major challenge to maximum likelihood based approaches. Nevertheless, the number of constraints in the resulting quadratic program (QP) is exponential in the size of the graph, making it a highly non-trivial problem. In M3N this is handled by rewriting the QP dual in terms of a polynomial number of marginal variables, which can then be solved by a coordinate descent method analogous to sequential minimal optimization (SMO) [18]. However, solving such a QP is not tractable for loopy CRFs with the high tree widths often needed in computer vision tasks, and even solving it approximately can become overwhelmingly expensive on large graphs.

Structured SVMs (SSVM) [28] optimize the same kind of objective as M3N, while allowing for a more general class of loss functions. They employ a cutting plane algorithm to iteratively solve a series of increasingly larger QPs, which makes learning more scalable. However, the cutting plane algorithm requires the computation of the most violated constraints, namely the labeling maximizing the hinge loss [28]. This involves performing loss-augmented inference [25], which makes it intractable on loopy CRFs. Though approximate constraints can be used [5], they make the cutting plane algorithm susceptible to premature termination and can lead to catastrophic failure. Moreover, solving the QP can become slow as the set of constraints grows larger after each iteration, especially when the dimensionality of the feature space is also high.

An alternative to solving the quadratic program deterministically is to employ stochastic gradient or subgradient descent. This class of methods has been studied extensively for non-structured prediction problems [21]. In the context of structured prediction, learning can be achieved by finding a convex-concave saddle-point and solving it with a dual extra-gradient method [27]. In [19] max-margin learning is solved as an unconstrained optimization problem and subgradients are used to approximate the gradient in the resulting non-differentiable problem. This method trades optimality for lower complexity, making it more suitable for large-scale problems. The approach of [15] proposes a perceptron-like algorithm based on an update whose expectation is close to the gradient of the true expected loss. However, the soundness of these methods heavily depends on the assumption that a valid subgradient is obtained at each iteration. Hence they become much less reliable when the subgradients are noisy due to inexact inference, as is the case for loopy CRFs.

The recently proposed SampleRank [29] avoids performing inference altogether during learning; instead, it samples labelings at random using Markov chain Monte Carlo (MCMC). At each step, parameters are updated with respect to a pair of sampled labelings. Though achieving a notable speed improvement, the method does not in fact optimize the actual hinge loss but rather a loose upper bound on it. Hence, unlike our method, it solves a problem that is substantially different from the original max-margin formulation.

3. Max-margin Learning of CRFs

Conditional random fields (CRF) [11] are graphical models used to encode relationships between a set of input and output variables. Given its parameters w, a CRF predicts the labeling Y for a given input X by maximizing some score function S_w : X × Y → R, i.e.,

\hat{Y} = \arg\max_{Y \in \mathcal{Y}} S_w(Y) = \arg\max_{Y \in \mathcal{Y}} w^T \Psi(X, Y)   (1)

The score is usually expressed as a linear function of w and can be written as w^T Ψ(X, Y), where the vector Ψ(X, Y) is the feature map corresponding to the input X and the labeling Y. The fundamental properties of random fields imply that the feature map Ψ(X, Y), and hence the score S_w, decompose into sums over individual nodes and edges for any pairwise CRF [2]. For a comprehensive review of CRF models, we refer readers to [11] and [24].

3.1. Discriminative Learning

Discriminative learning uses the labeled training data to learn the CRF parameters so that the inferred labeling of the CRF is "close" to that of the ground truth, defined as yielding a low loss. More specifically, given a set of N training examples D = ((X^1, Y^1), ..., (X^N, Y^N)), where X^i ∈ X is an input example, such as the image or features associated to it, and Y^i ∈ Y is the associated labeling, the learning task consists in finding model parameters w that
achieve low empirical loss subject to some regularization. In other words, we seek

w^* = \arg\min_w L(D, w) = \arg\min_w \sum_{(X^n, Y^n) \in D} l(X^n, Y^n, w) + R(w),   (2)

where l is the surrogate loss function and R(w) is the regularizer (typically the L2 norm). The most common choice of l is the hinge loss, as used in [26, 28], which will be described later in this section. Note that the definition of the surrogate loss l depends on the score function S_w, since the goal of learning is to make the maximizer of S_w a desirable output for the given input.

The max-margin approach is a particular instance of discriminative learning, where parameter learning is formulated as a quadratic program (QP) with soft margin constraints [28]:

\min_{w, \xi \geq 0} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n   (3)

s.t. \forall n : S_w(Y^n) \geq \max_{Y \in \mathcal{Y}_n} (S_w(Y) + \Delta(Y^n, Y)) - \xi_n,

where Y_n is the set of all possible labelings for example n, the constant C controls the trade-off between margin and training error, and the task loss Δ measures the closeness of any inferred labeling Y to the ground-truth labeling Y^n.

The QP can be converted to an unconstrained optimization problem by incorporating the soft constraints directly into the objective function, yielding:

\min_w L(w) = \min_w \frac{1}{2C} \|w\|^2 + \sum_{n=1}^{N} [S_w(Y^*) + \Delta(Y^n, Y^*) - S_w(Y^n)]_+ ,   (4)

where

Y^* = \arg\max_{Y \in \mathcal{Y}_n} (S_w(Y) + \Delta(Y^n, Y)).   (5)

It is easy to see that Eq. 4 has the same form as Eq. 2 where the hinge loss is used as the surrogate loss l, i.e.,

l(Y^n, Y^*, w) = [S_w(Y^*) + \Delta(Y^n, Y^*) - S_w(Y^n)]_+   (6)

For most existing approaches, a key challenge to solving Eq. 4 effectively is the loss-augmented inference, i.e., finding Y^*.

3.3. Stochastic Subgradient Descent

The objective function of Eq. 4 can be minimized via stochastic subgradient descent, similar to [19, 15]. This class of methods iteratively computes and steps in the opposite direction of a subgradient vector with respect to an example X^n chosen by picking an index n ∈ {1 ... N} uniformly at random. We then replace the objective in Eq. 4 with an approximation based on the training example (X^n, Y^n), yielding:

f(w, n) = l(Y^n, Y^*, w) + \frac{1}{2C} \|w\|^2 .   (7)

A subgradient of the convex function f : W → R at w is defined as a vector g such that

\forall w' : g^T (w' - w) \leq f(w') - f(w).   (8)

The set of all subgradients at w is called the subdifferential at w and is denoted ∂f(w). The subdifferential is always a non-empty convex compact set.

A valid subgradient g(Y^n, Y^*, w) with respect to the parameter w can always be computed as the gradient of f(w) at Y^*. Hence, for the hinge loss, it can be computed as:

\frac{\partial f(Y^n, Y^*, w)}{\partial w} = \psi(Y^*) - \psi(Y^n) + \frac{w}{C} .   (9)

This results in a simple algorithm that iteratively computes and steps in the direction of the negative subgradient. In order to guarantee convergence, the step size η^(t) has to satisfy the following conditions:

\lim_{T \to +\infty} \sum_{t=1}^{T} \eta^{(t)} = \infty \quad \text{and} \quad \lim_{T \to +\infty} \sum_{t=1}^{T} (\eta^{(t)})^2 < \infty.   (10)

For loopy CRFs, however, true subgradients of the hinge loss cannot always be obtained due to the intractability of loss-augmented inference. This can lead to erratic behavior due to noisy subgradient estimates and a loss of performance.

4. Estimating Subgradient Using Working Sets

Our algorithm aims at better estimating an approximate subgradient of f(Y^n, Y, w) by using working sets of constraints, denoted A^n, for learning loopy CRFs where exact inference is intractable. The algorithm we propose is outlined in Algorithm 1. It first solves the loss-augmented inference to find a constraint Y^* and adds it to the working set A^n. It then steps in the opposite direction of the approximate subgradient computed as an average over the set of violated constraints belonging to A^n.

Hence, unlike dual averaging methods [17, 30] that aggregate over all previous subgradients, our algorithm only considers the subset of active, namely violated, constraints
when computing the parameter updates. Therefore all subgradients are computed with respect to the parameters at the current iteration, as opposed to using their historical values. This produces more meaningful descent directions, as evidenced by the results in Section 5.

Algorithm 1
1: INPUTS:
2:   D : Training set of N examples.
3:   β : Learning rate parameter.
4:   w^(1) : Arbitrary initial values, e.g., 0.
5: OUTPUT: w^(T+1)
6: Initialize A^n ← ∅ for each n = 1 ... N
7: for t = 1 ... T do
8:   Pick some example (X^n, Y^n) from D
9:   Y^* = \arg\max_{Y \in \mathcal{Y}_n} (S_w(Y) + \Delta(Y^n, Y))
10:  A^n ← A^n ∪ {Y^*}
11:  A^{n'} ← {Y ∈ A^n | l(Y, Y^n, w^(t)) > 0}
12:  η^(t) ← β/t
13:  g^(t) ← \frac{1}{|A^{n'}|} \sum_{Y \in A^{n'}} \frac{\partial f(Y^n, Y, w^{(t)})}{\partial w}
14:  w^(t+1) ← w^(t) − η^(t) g^(t)
15: end for

We now analyze the convergence properties of the algorithm presented in Algorithm 1. Although finding true subgradients as defined in Eq. 8 cannot be guaranteed for loopy CRFs, interesting results can still be obtained even if one can only find an approximate ε-subgradient g, as defined in [22]:

\forall w' : g^T (w - w') \geq f(w) - f(w') - \epsilon   (11)

The convergence properties of ε-subgradient descent methods were studied in [20, 22, 19]. The "regret" (i.e., loss) of the parameter vector w can be bounded as follows (the re-derivation of the proof using our notation is given in the supplementary material). The bound assumes that each approximate ε-subgradient g_i is drawn from some distribution with mean µ_i ∈ ∂L(w) and bounded variance.

Let δ_i = g_i − µ_i be the difference between the approximate ε-subgradient g_i and the true ε-subgradient µ_i, and assume that all δ_i are independent of one another. Note that, by definition, each δ_i has zero expectation, and hence so does their average \bar{\delta} = \frac{1}{m} \sum_i \delta_i = \frac{1}{m} \sum_i g_i - \frac{1}{m} \sum_i \mu_i.

Therefore, using Hoeffding's inequality [6] and the union bound, we can show that the average error \bar{\delta} concentrates around its expectation, i.e., 0 in this case, as the number m of violated constraints in the working set increases:

\Pr\left( \|\bar{\delta}\| \geq r \right) \leq 2d \exp\left( \frac{-m r^2}{2 G^2} \right),   (13)

The convexity of the subdifferential ∂L(w) implies that \bar{\mu} = \frac{1}{m} \sum_i \mu_i \in \partial L(w). Therefore the probability of g^(t) = \frac{1}{m} \sum_i g_i being more than a distance r away from a true subgradient is bounded by Eq. 13 as well.

Algorithm 2
1: INPUTS:
2:   D : Training set of N examples.
3:   Q : MCMC walker.
4:   β : Learning rate parameter.
5:   w^(1) : Arbitrary initial values, e.g., 0.
6: OUTPUT: w^(T+1)
7: Initialize A^n ← ∅ for each n = 1 ... N
8: for t = 1 ... T do
9:   Pick some example (X^n, Y^n) from D
10:  Sample Y^* according to Q(w^(t), Y^n)
11:  A^n ← A^n ∪ {Y^*}
12:  A^{n'} ← {Y ∈ A^n | l(Y, Y^n, w^(t)) > 0}
13:  η^(t) ← β/t
14:  z ← w^(t)
15:  for Y ∈ A^{n'} do
16:    g^(t) ← \frac{1}{|A^{n'}|} \frac{\partial f(Y^n, Y, w^{(t)})}{\partial w}
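To make the update concrete, here is a minimal, self-contained sketch of the working-set update of Algorithm 1 in Python. It is illustrative only: the feature map `psi`, the toy data, and the exhaustive loss-augmented argmax over a 3-node chain are stand-ins for the paper's CRF feature maps and graph-cut/belief-propagation inference; the per-constraint subgradient follows Eq. 9 and the averaged step follows lines 13-14 of Algorithm 1.

```python
import itertools
import random

# Toy instance of Algorithm 1 on a 3-node chain with binary labels.
# psi, the data, and the exhaustive argmax are illustrative stand-ins
# for the paper's CRF and its (approximate) inference.

def psi(x, y):
    """Feature map: [sum of unary correlations, number of agreeing edges]."""
    unary = sum(xi * yi for xi, yi in zip(x, y))
    pairwise = sum(1.0 for i in range(len(y) - 1) if y[i] == y[i + 1])
    return [unary, pairwise]

def score(w, x, y):
    return sum(wi * pi for wi, pi in zip(w, psi(x, y)))

def delta(y_gt, y):
    """Per-node 0-1 task loss (Hamming distance)."""
    return sum(1.0 for a, b in zip(y_gt, y) if a != b)

def hinge(w, x, y_gt, y):
    """Eq. 6 evaluated at a single constraint Y."""
    return max(0.0, score(w, x, y) + delta(y_gt, y) - score(w, x, y_gt))

def subgrad(w, x, y_gt, y, C):
    """Eq. 9: psi(Y) - psi(Y^n) + w/C."""
    return [p - q + wi / C for p, q, wi in zip(psi(x, y), psi(x, y_gt), w)]

def train(data, T=100, beta=0.5, C=10.0):
    labelings = list(itertools.product([0, 1], repeat=3))
    w = [0.0, 0.0]
    working_sets = [set() for _ in data]   # one working set A^n per example
    rng = random.Random(0)
    for t in range(1, T + 1):
        n = rng.randrange(len(data))
        x, y_gt = data[n]
        # Loss-augmented inference (exhaustive on this toy label space).
        y_star = max(labelings, key=lambda y: score(w, x, y) + delta(y_gt, y))
        working_sets[n].add(y_star)
        violated = [y for y in working_sets[n] if hinge(w, x, y_gt, y) > 0]
        if not violated:
            continue
        # Average the subgradients of the violated constraints (step 13),
        # then take a step of size beta/t in the opposite direction (step 14).
        g = [0.0, 0.0]
        for y in violated:
            for i, gi in enumerate(subgrad(w, x, y_gt, y, C)):
                g[i] += gi / len(violated)
        w = [wi - (beta / t) * gi for wi, gi in zip(w, g)]
    return w

data = [([1.0, -1.0, 1.0], (1, 0, 1)), ([-1.0, 1.0, 1.0], (0, 1, 1))]
w = train(data)
```

On this toy problem the learned parameters make both ground-truth labelings the score maximizers; the working set only starts to matter for stability in the loopy, approximate-inference regime that the paper targets.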
5. Experimental Results
We apply our algorithm to image segmentation. We first
briefly describe how a CRF model is used for this task
and then present our experimental results on two distinct
datasets to demonstrate the effectiveness of our method.
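One ingredient of the experimental setup worth making concrete is the class-rebalanced task loss of Section 5.1, which up-weights errors on rare classes. The sketch below is illustrative: the normalization (so that a uniformly distributed class gets weight 1) is our own choice, since the text only states that weights are inversely proportional to class frequency.

```python
from collections import Counter

def class_weights(train_labelings):
    """Inverse-frequency weight per class, normalized so that a uniformly
    distributed class would get weight 1 (illustrative normalization)."""
    counts = Counter(label for labeling in train_labelings for label in labeling)
    total = sum(counts.values())
    return {c: total / (len(counts) * n) for c, n in counts.items()}

def rebalanced_delta(y_gt, y, weights):
    """Per-node 0-1 loss where an error on ground-truth class c costs weights[c]."""
    return sum(weights[a] for a, b in zip(y_gt, y) if a != b)

# A rare class ("mito") is penalized more heavily than a frequent one ("bg").
weights = class_weights([["bg", "bg", "bg", "mito"], ["bg", "bg", "bg", "bg"]])
```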
Figure 1. A nearly isotropic stack of neural tissue acquired using EM microscopy, annotated for training and testing. This stack contains 1065 images of 2048 × 1536 pixels and was used for the task of segmenting mitochondria, indicated by arrows.

5.1. CRF for Image Segmentation

As a standard preprocessing step, we perform a preliminary over-segmentation of our input image into superpixels¹ using SLIC [1]. The CRF G = (V, E) is thus defined so that each node i ∈ V corresponds to a superpixel and there is an edge (i, j) ∈ E between two nodes i and j if the corresponding superpixels are adjacent in the image. Let Y = {y_i}, i ∈ V, denote the labeling of the CRF, which assigns a class label y_i to each node i.

¹ Or supervoxels in the case of volumetric data.

The score function associated with the CRF can then be written as

S_w(Y) = \sum_{i \in V} D_i(y_i) + \sum_{(i,j) \in E} V_{ij}(y_i, y_j),   (14)

where D_i is the unary data term and V_{ij} is the pairwise spatial term. Both D_i and V_{ij} are linear in the CRF parameters w and also depend on the observed data X in addition to the labeling Y. For inference, we use graph cuts when the corresponding energy function (i.e., negated score) is submodular [7] and belief propagation otherwise.

A natural choice for the task loss Δ (Eq. 3) is the per-superpixel 0-1 loss \Delta(Y^n, Y) = \sum_{i \in V} I(y_i \neq y_i^n), which penalizes all errors equally. However, in image segmentation, it is common for certain classes to occur much more frequently than others. To ensure good performance across all classes, we adopt a loss function that weighs errors for a given class inversely proportionally to the frequency with which it appears in the training data.

5.2. Methods

In the following, we compare our learning methods (referred to as Working sets + inference and Working sets + sampling) with the baselines below. We also experimented with averaging all past subgradients [17, 30], which did not produce meaningful results for our task.

• Linear SVM – A linear SVM classifying each sample independently (i.e., without CRF).

• SSVM – The cutting plane algorithm described in [28].

• SampleRank – The method described in [29].

• SGD + inference – Solve the loss-augmented inference using graph cuts or belief propagation. This algorithm is the SGD (subgradient descent) formulation of [19].

• SGD + sampling – Instead of performing inference, use MCMC to sample constraints from a distribution targeting the loss-augmented score. This is equivalent to the method named "SampleRank SVM" described in [29].

In all cases, we used a decreasing step size rule of 1/\sqrt{t} and used cross-validation on the training set to determine the regularization constant. The results reported for the sampling method and SampleRank were averaged over 5 runs.

5.3. MSRC Dataset

The MSRC dataset is a popular multi-class object segmentation dataset containing 591 images with objects from 21 categories. Training and testing are done using the standard split of the dataset [23], and we used the annotated images provided by [14]. We extract feature vectors by first over-segmenting images using SLIC superpixels [1]. We then extract SIFT descriptors and color histograms from image patches surrounding each superpixel centroid. We then create bag-of-words descriptors by running k-means on the extracted features, with a dictionary of 1,000 words for SIFT features and 400 words for color histograms. We also include location information as in [9] and the unary potentials of [8]. The resulting feature vector x_i is used to train the various methods. Similarly to [23], the pairwise term we used was made gradient-adaptive by including parameters for each discretized
Table 1. MSRC segmentation results. We follow the standard reporting procedure. For each category, the pixel-wise classification rate is provided. Global pixel-wise accuracy and average per-category scores provide measures for overall performance. Bold entries are used for each part separately to indicate best performance. Note that all the methods in the top part of the table were optimized for the average score, for which our method achieves the best results. We also include the results of two state-of-the-art algorithms as reported in [31].

Columns (left to right, standard MSRC category order): building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, boat | Average | Global

Linear SVM               63 88 77 90 85 91 89 59 86 88 94 85 80 53 93 85 79 65 50 82  0 | 74.8 | 79.6
SSVM [28]                69 91 82 80 82 92 78 68 85 79 85 90 77 58 95 82 87 74 44 80 44 | 77.1 | 82.3
SampleRank               66 91 80 84 80 92 76 70 82 65 89 91 85 54 93 80 88 69 48 89 40 | 76.7 | 82.2
SGD + sampling           67 93 80 82 75 93 64 67 84 74 81 93 82 43 93 75 88 68 52 73 48 | 75.4 | 82.1
SGD + inference [19]     69 91 84 87 82 92 80 64 86 82 85 83 76 61 95 82 88 73 43 74 35 | 76.7 | 82.2
Working set + sampling   71 90 87 91 84 92 83 74 82 84 85 87 66 46 94 84 88 79 49 74 37 | 77.2 | 83.3
Working set + inference  67 89 85 93 79 93 84 75 79 87 89 92 71 46 96 79 86 76 64 77 50 | 78.9 | 83.7

Ladický et al. [10]      73 93 82 81 91 98 81 83 88 74 85 97 79 38 96 61 90 69 48 67 18 | 75.8 | 85.0
Yao et al. [31]          67 92 80 82 89 97 86 83 86 79 94 96 85 35 98 70 86 78 55 62 23 | 77.4 | 84.4
Figure 2. Example segmentations from the MSRC dataset (original images, ground truth, and results). Best viewed in color.
image gradient level. In a similar fashion, it also considers geometric relationships such as "sky appears above grass".

Table 1 summarizes the segmentation performance of the various approaches, and example segmentations appear in Fig. 2. The quantitative results show that the working set of constraints improves the average score regardless of whether inference or sampling was used during learning. The results obtained by the sampling approach are close to those obtained using inference, but with a significantly lower running time, as shown in Table 3. In addition to the baseline methods described above, we compare our approach to state-of-the-art approaches [10, 31]. We achieve the best results in terms of the average score, for which we optimize our algorithm.

5.4. Electron Microscopy Dataset

Here, we perform mitochondria segmentation in 3D using the large image stack from Fig. 1. This electron microscopy dataset is publicly available at http://cvlab.epfl.ch/data/em. Performance is measured by the Jaccard index commonly used for image segmentation [4]. The Jaccard index is the ratio of the area of the intersection between what has been segmented and the ground truth to the area of their union. It is written as:

\text{Jaccard index} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive} + \text{False Negative}} .

The segmentation process begins by over-segmenting the volume using SLIC supervoxels [1]. For each supervoxel, we extract a feature vector that captures local shape and texture information using Ray descriptors [13] and intensity histograms. These feature vectors are used to train each baseline method, as well as our model. In a second set of experiments, we also transform the features using the kernel method of [12]. The original feature vectors are 120-dimensional and are thus mapped to a higher-dimensional space. The details are described in [12]. Due to the high cost of labeling such large volumes, our experiments are restricted to two subvolumes containing 1024 × 768 × 165 voxels. The first subvolume was used to train the various methods.
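In code, the Jaccard index above reduces to a few counts. This is a minimal sketch over flat per-element label sequences (the paper evaluates it on supervoxel segmentations):

```python
def jaccard_index(pred, gt, cls=1):
    """Jaccard index for class `cls`: TP / (TP + FP + FN), i.e. the ratio of
    the intersection of predicted and ground-truth regions to their union."""
    tp = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    fp = sum(1 for p, g in zip(pred, gt) if p == cls and g != cls)
    fn = sum(1 for p, g in zip(pred, gt) if p != cls and g == cls)
    union = tp + fp + fn
    # Convention: an empty union (class absent from both) scores a perfect 1.0.
    return tp / union if union else 1.0

# Example: one true positive, one false positive, one false negative.
iou = jaccard_index([1, 1, 0, 0], [1, 0, 1, 0])
```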
Table 2. Segmentation performance measured with the Jaccard index for the mitochondria EM dataset. We report results for two different sets of features (see text for full description). Note that the original features were already kernelized with an RBF-SVM in [13].

Method                    Original features
SVM [13]                  73.0%
Lucchi [13]               80.0%
SSVM [28]                 80.5%
SampleRank [29]           81.2%
SGD + sampling            77.5%
SGD + inference [19]      79.9%
Working set + sampling    83.0%
Working set + inference   84.5%
Figure: Example segmentations from the EM dataset (ground truth, Linear SVM, SGD + inference [19], Working set + inference) on (a) the training set and (b) the test set.

Table 3. Training times on the EM and MSRC datasets.

Method                      EM              MSRC
SampleRank [29]             2524s           80s
SGD + sampling              2481s           72s
Working set + sampling      2619s (+5.5%)   76s (+5.2%)
SGD + inference [19]        5315s           546s
Working set + inference     5842s (+9.9%)   583s (+6.8%)