
Computer Vision and Image Understanding 120 (2014) 81–90



journal homepage: www.elsevier.com/locate/cviu

Efficient semantic image segmentation with multi-class ranking prior


Deli Pei a,b,c,d, Zhenguo Li e, Rongrong Ji f,⇑, Fuchun Sun b,c,d,⇑
a Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
c State Key Laboratory of Intelligent Technology and Systems, Beijing 100084, China
d Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
e Huawei Noah’s Ark Lab, Hong Kong, China
f Department of Cognitive Science, Xiamen University, Xiamen 361005, China

Article info

Article history:
Received 3 October 2012
Accepted 6 October 2013
Available online 25 October 2013

Keywords:
Computer vision
Machine learning
Semantic segmentation
Structural SVMs

Abstract

Semantic image segmentation, which aims to decompose an image into semantically consistent regions, is of fundamental importance in a wide variety of computer vision tasks, such as scene understanding, robot navigation and image retrieval. Most existing works address it as a structured prediction problem by combining contextual information with low-level cues based on conditional random fields (CRFs), which are often learned by heuristic search based on maximum likelihood estimation. In this paper, we use a maximum-margin structural support vector machine (S-SVM) model to combine multiple levels of cues to attenuate the ambiguity of appearance similarity, and we propose a novel multi-class ranking based global constraint to confine the object classes to be considered when labeling regions within an image. Compared with existing global cues, our method strikes a better balance between expressive power for heterogeneous regions and the efficiency of searching the exponential space of possible label combinations. We then introduce inter-class co-occurrence statistics as pairwise constraints and combine them with the prediction from local and global cues in the S-SVM framework. This enables joint inference of the labeling within an image for better consistency. We evaluate our algorithm on two challenging datasets that are widely used for semantic segmentation evaluation, the MSRC-21 dataset and the Stanford Background dataset. Experimental results show that we obtain highly competitive performance compared with state-of-the-art methods, even though our model is much simpler and more efficient.
© 2013 Elsevier Inc. All rights reserved.

This paper has been recommended for acceptance by Nicu Sebe.
⇑ Corresponding authors. Addresses: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China (F. Sun). Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, China (R. Ji).
E-mail addresses: derrypei@gmail.com (D. Pei), li.zhenguo@huawei.com (Z. Li), rrji@xmu.edu.cn (R. Ji), fcsun@mail.tsinghua.edu.cn (F. Sun).
1077-3142/$ - see front matter © 2013 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.cviu.2013.10.005

1. Introduction

Semantic segmentation is a fundamental but challenging problem in computer vision, which aims to assign a pre-defined semantic label to each pixel in an image. It can be seen as an extension of traditional object detection, which aims at detecting prominent objects in the foreground of an image, and it is closely related to other fundamental computer vision tasks such as image segmentation and image classification. Semantic segmentation has many practical applications, including scene understanding, robot navigation, and image retrieval.

Early semantic image segmentation algorithms typically approach the problem from a pixel-wise labeling perspective [1,2]. Although using pixels as labeling units is simple and straightforward, a pixel by itself carries limited and ambiguous information that is not always discriminative enough to determine its correct label. On the other hand, the proliferation of unsupervised image segmentation algorithms, such as mean shift [3], graph-based segmentation [4,38], quick shift [5], TurboPixel [6] and SLIC [7], enables higher-order feature representations of regions. Therefore, semantic segmentation approaches based on region-wise labeling [8–13] have more recently also been well investigated. They make use of region-level features that are not only more informative but also more robust to noise, clutter, illumination variance, etc. In such a setting, an initial unsupervised segmentation is commonly adopted as pre-processing. However, image segmentation is still far from perfect despite extensive efforts over the last several decades. From this point of view, how to make the best use of these imperfect unsupervised image segmentation algorithms for the semantic segmentation problem is of fundamental importance yet remains unclear.

Although higher-order features extracted from regions are more expressive and informative than those from pixels, semantic ambiguity still exists because of appearance similarity. A general consensus is that contextual information within an image is a very useful cue to attenuate this ambiguity, which can be used to
suppress or encourage the presence of object classes during labeling. Context refers to any information that is not extracted directly from local appearance and can be summarized into two categories: pairwise constraints and global cues. Pairwise constraints, such as smoothness based on contrast [14,9], relative location [10,11] and co-occurrence [8,11,15], are used to model the pairwise relationship between regions within an image. Global constraints are usually used to enforce higher-level consistency over sets of regions or over the whole image. Several approaches have been proposed to model these cues, such as using image classification results [13], the Potts potential [12], the $P^n$ Potts potential [16] and its improved version, the robust $P^n$ potential [14], $P^n$-based hierarchical CRFs [17], and the harmony potential [9]. These models will be further discussed in Section 2.

In terms of methodology, most of the existing methods [16,12,14,10,11,15,9] use conditional random fields (CRFs) to combine these constraints from different levels and make joint inference of the labeling within an image, which is also known as structured prediction. In contrast to the many sophisticated algorithms for inference, these models [10,11,9,14,15] are usually learned by gradient descent or by heuristic search on a validation set based on maximum likelihood estimation. On the other hand, Zhu et al. [18] showed that max-margin based learning is more robust for structured prediction than maximum likelihood estimation based learning in many machine learning applications.

In this paper, we use a maximum-margin structural support vector machine (S-SVM) model to combine multiple levels of cues to attenuate the ambiguity of appearance similarity, and we propose multi-class ranking based global constraints to confine the object classes to be considered when labeling regions within an image.

For global cues, we first rank all the object classes for an image (a class with a higher probability of being present in the image gets a larger score) using a multi-class ranking algorithm [20] and transform the ranking scores into an image-level soft constraint to confine the possible classes present in the image. The advantages of these global cues can be seen from two aspects. On the one hand, compared with the robust $P^n$ potential [14], which limits the parent node to a single label, our method ranks all the classes for an image and thus is more representative of heterogeneous regions. On the other hand, since we compute the ranking scores for all the classes and transform them into a soft constraint, we do not need to make a hard decision for every class and thus avoid searching the exponential space of possible label combinations, as the harmony potential [9] does. The global cues are integrated with the prediction obtained from region features and logistic regression to encourage more likely classes while suppressing the others.

We then introduce inter-class co-occurrence statistics as pairwise constraints and combine them with the prediction from the previous stage in the S-SVM framework. This enables joint inference of the labeling within an image for better consistency. Moreover, our model can be efficiently learned with the cutting plane algorithm [19] instead of the heuristic search used in CRF learning. Experimental results show that we obtain highly competitive performance with state-of-the-art methods, using a much simpler and more efficient model, on two challenging datasets: MSRC-21 and the Stanford Background Dataset.

Probably the most related work is [21], which discussed the application of structural SVMs to semantic image segmentation and compared them with an alternative maximum likelihood method. However, our model differs from theirs in the design of the pairwise and global constraints as well as in the loss function used for parameter learning. They used the standard contrast-dependent Potts model as the pairwise constraint, in contrast to our co-occurrence property. With regard to global constraints, they used very simple and straightforward K image-level classification results, and the advantage of multi-class ranking over 1-vs-all classifiers is discussed in [20].

The remainder of the paper is organized as follows. In the next section we review the related work. Our model is presented in Section 3, including the problem formulation and model details. Sections 4 and 5 describe the inference and learning methods. Implementation details and performance evaluation are given in Section 6, and conclusions are drawn in Section 7.

2. Related work

Despite the success in inferring pixel labels [1,2], more recent methods tend to infer labels over regions or superpixels for the sake of lower computational complexity and the incorporation of higher-level semantic cues. For these approaches, traditional image segmentation algorithms such as Normalized Cut [8], mean shift [14,17,13], graph-based image segmentation [10] and quick shift [9,22] are adopted to obtain initial segments. More recently, several over-segmentation algorithms [6,7] have been developed to bypass the problems of traditional segmentation algorithms, such as semantic ambiguity (regions spanning multiple object classes) and the difficulty of determining the optimal number of segments. These algorithms seek a trade-off between reducing image complexity through pixel grouping and avoiding under-segmentation [6]. Images are decomposed into regions much smaller than the objects, e.g. 100–300 regions. Many traditional segmentation algorithms can also be adopted to generate superpixels by segmenting at a finer level. Qualitative results of different segmentation algorithms are given in Fig. 1, where each image is decomposed into approximately 150 superpixels. It can be seen that over-segmentation algorithms tend to segment an image into regions of approximately equal size, while the region size of traditional segmentation may vary considerably with the complexity of the content.

Although various powerful features have been proposed recently (e.g. color histogram, texture and SIFT), these features are still not informative enough to achieve high classification performance because of appearance similarity. To attenuate this ambiguity of feature representation, pairwise constraints such as smoothness [14,9,23], relative location [10,11] and co-occurrence [8,11,15] are further introduced. (i) The assumption behind the pairwise smoothing term is that adjacent regions tend to have the same label, so spatially adjacent regions with different labels are penalized. To preserve boundaries, appearance contrast is considered in the smoothing term, so that regions with larger appearance contrast are penalized less for inconsistent labels. However, the dilemma of this smoothing term is that regions with similar appearance will naturally tend to have the same label. This contradicts the objective of the smoothing term, which expects spatially adjacent regions with different appearance to have the same label. (ii) The co-occurrence statistics exploit the property that some classes (e.g. boat, water) are more likely to be present together within an image than others (e.g. car, water). Thus the existence of one class can be used as evidence for the presence of highly related classes and against the presence of unlikely classes. For instance, Rabinovich et al. [8,11] construct context matrices by counting the co-occurrence frequency among object labels in the training set to incorporate semantic contextual information. Ladicky et al. [15] claimed that the co-occurrence cost should depend only on the labels present in an image and should be invariant to the number and location of the pixels that an object occupies. (iii) Gould et al. [10] encoded the inter-class spatial relationship as a local feature in a two-stage classification process. However, because of the 2D projection, relative location in images is usually uninformative and hence degenerates to a co-occurrence constraint.
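
As a rough illustration of how co-occurrence statistics of this kind can be gathered, the sketch below counts, for every pair of classes, the number of training images in which both appear. It is a minimal example under assumptions made here for illustration only: the function names `cooccurrence_counts` and `cooccurrence_frequency`, the toy label sets, and the normalization by the number of images are hypothetical and are not the exact construction used in [8,11,15].

```python
from itertools import combinations


def cooccurrence_counts(image_labels, classes):
    """Count, for each unordered class pair, the images containing both classes."""
    index = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for labels in image_labels:
        present = sorted({index[c] for c in labels})
        for a in present:
            counts[a][a] += 1          # diagonal: images containing the class
        for a, b in combinations(present, 2):
            counts[a][b] += 1
            counts[b][a] += 1          # keep the matrix symmetric
    return counts


def cooccurrence_frequency(counts, num_images):
    """Normalize counts by the number of images (an illustrative choice)."""
    return [[c / num_images for c in row] for row in counts]


# Hypothetical toy training set: the set of classes present in each image.
classes = ["boat", "water", "car", "road"]
image_labels = [
    {"boat", "water"},
    {"boat", "water"},
    {"car", "road"},
    {"car", "road", "water"},
]
counts = cooccurrence_counts(image_labels, classes)
freq = cooccurrence_frequency(counts, len(image_labels))
```

In this toy set, boat and water co-occur in two images while car and water co-occur in only one, which mirrors the (boat, water) versus (car, water) contrast described above.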

Fig. 1. Over-segmentation examples of different segmentation algorithms, with approximately 150 superpixels for each image.

Pairwise constraints can only capture local context information between regions. A more recent trend is to build a hierarchical model by adding an extra global constraint to the pairwise framework in order to incorporate constraints at a higher level, such as groups of segments or the whole image. Plath et al. [12] proposed a Potts potential to model the label consistency of regions in a hierarchical tree structure, which penalizes all nodes whose labels are inconsistent with their parent label. Kohli et al. [14] adapted the $P^n$ Potts potential proposed in [16] into a segment-quality-sensitive higher-order potential, named the robust $P^n$ potential. The cost of inconsistent labeling in a high-contrast region is lower than in a low-contrast region. However, a drawback of both higher-order potentials [12,14] is that they limit the parent node to a single label, which is often not the case and makes them unable to handle heterogeneous regions. Csurka and Perronnin [13] proposed to use image classification results to reduce the number of classes to be considered in an image. But this hard-constraint scheme does not take the classification accuracy into account, so classification errors can propagate to the following stages and affect the overall performance. The work in [17] proposed a hierarchical CRF framework that allows the integration of features computed at different levels, to avoid committing to a single choice of quantization. Gonfaus et al. [9] proposed a more expressive constraint named the harmony potential, which first restricts the power set over all possible labels at the image level and then uses it as a higher-order constraint. However, the exponentially sized power set makes exact inference infeasible, and a heuristic method such as branch-and-bound sampling has to be applied to approximate the best assignment, which results in only a small subset being taken into account.

Besides the context cues directly extracted from the image, priors from various vision tasks have also been introduced to improve performance. Several approaches considered joining object detection and multi-class image segmentation by feeding information from one task to the other [24–26,15,9]. Heitz et al. [24] developed Cascaded Classification Models (CCM) to combine the subtasks of scene categorization, object detection and multi-class image segmentation for holistic scene understanding. However, since these subtasks are only loosely coupled through their input/output variables, each of them is still optimized separately, information sharing is limited, and the representations may be inconsistent. Gould et al. [25] proposed a hierarchical region-based approach that combines object detection with image segmentation to reason simultaneously about pixels, regions and objects in an image. Ladicky et al. [15] integrated the results from sliding window detectors with low-level pixel-based unary and pairwise relations into a conditional random field (CRF) framework for joint reasoning about regions, objects and their attributes; a similar idea is pursued in [9].

3. Model

In this section we specify our model for structured prediction. First, we consider superpixels obtained by an unsupervised image segmentation, and use $x_i$, $i = 1, \ldots, N$ to denote the feature vector of superpixel $i$ and $y_i \in C = \{c_1, c_2, \ldots, c_K\}$ for its corresponding label, where $N$ and $K$ are the numbers of superpixels and classes, respectively. The whole image can then be represented as the collection of superpixel feature vectors, $X = \{x_i \mid i = 1, \ldots, N\}$, and an assignment of labels to the set of superpixels is referred to as a labeling of the image, denoted by $Y = \{y_i \mid i = 1, \ldots, N\}$. Our objective is to learn a function $F(X, Y)$ that captures the compatibility of the prediction $Y$ and the observation $X$, such that the better the prediction $Y$ describes the image content $X$, the higher the value of $F(X, Y)$. Thus, given the observation $X$, the optimal prediction $\hat{Y}$ can be found by maximizing $F(X, Y)$ over all possible labelings:

$$\hat{Y} = \arg\max_{Y} F(X, Y). \tag{1}$$

Following structural SVMs [27], we assume the compatibility function $F$ is linear in a combined feature representation of inputs and outputs $\varphi(X, Y)$ (also known as the joint feature map):

$$F(X, Y) = \langle \omega, \varphi(X, Y) \rangle. \tag{2}$$

The joint feature map $\varphi(X, Y)$ can be designed to capture multi-scale, multi-layer and contextual cues. Given the joint feature map, the task in learning is to train an optimal model parameter $\omega$ using the training set. The local constraint, also called the unary potential, captures the local appearance evidence for labeling superpixels; the mid-level constraint usually exploits pairwise relationships between superpixels, such as smoothness, relative location and co-occurrence. In some approaches, a global constraint is also applied to infer possible labelings from the image level rather than from superpixels. We specify how we define these constraints and combine them in the following sections.

3.1. The unary potential

First we detail our feature representation for superpixels. The raw features of a superpixel consist of two ingredients: appearance-based descriptors and a bag-of-words (BoW) representation. Following [10], the appearance-based descriptors include color and texture features, which compute the mean, standard deviation, skewness and kurtosis of the superpixel's color distribution and filter responses. In addition, we extract the location and geometry features of the superpixel. For more details we refer the reader to [10]. The BoW representation has been shown to be useful in many state-of-the-art vision systems, so we also incorporate it into the superpixel representation. Moreover, as shown in [22,28], BoW features extracted not only inside superpixels but also in their neighborhood describe superpixels more effectively. Thus, for each superpixel we extract BoW features from both the superpixel itself and its adjacent regions and concatenate them. The final representation of the raw features becomes:

$$s_i = (\theta_a s_i^a, \; \theta_b s_i^b)^{\top}, \tag{3}$$

where $s_i^a$ is the appearance descriptor, $s_i^b$ is the concatenated BoW feature, and $\theta_a$, $\theta_b$ are weight parameters to be learned by cross validation.

Instead of using the above raw features directly, we compute an intermediate representation from them via logistic regression, which makes the features more compact. Given the raw feature representation $s_i$ of a superpixel, the probability of taking label $l \in C = \{c_1, c_2, \ldots, c_K\}$ can be computed by the following logistic regression model:

$$P(l \mid s_i) = \begin{cases} \dfrac{\exp(\beta_{l0} + \beta_l^{\top} s_i)}{1 + \sum_{t=c_1}^{c_{K-1}} \exp(\beta_{t0} + \beta_t^{\top} s_i)} & \text{if } l = c_1, \ldots, c_{K-1}, \\[2ex] \dfrac{1}{1 + \sum_{t=c_1}^{c_{K-1}} \exp(\beta_{t0} + \beta_t^{\top} s_i)} & \text{if } l = c_K, \end{cases} \tag{4}$$

where $\beta$ denotes the learned parameters of the logistic regression. We concatenate the class probabilities to form the $K$-dimensional intermediate representation:

$$x_i = (P(c_1 \mid s_i), P(c_2 \mid s_i), \ldots, P(c_K \mid s_i))^{\top}. \tag{5}$$

Moreover, we assign the most probable label to the superpixel as an initial label guess for further joint inference:

$$l_i = \arg\max_{l \in C} P(l \mid s_i). \tag{6}$$

As a baseline, the performance of the raw features under various over-segmentation algorithms is systematically evaluated in Section 6 and compared with that obtained from structured prediction using contextual information.

The unary potential can be written as follows:

$$F_{unary}(X, Y) = \sum_{i} \omega_{y_i}^{\top} x_i, \tag{7}$$

where $\omega_{c_1}, \ldots, \omega_{c_K} \in \mathbb{R}^K$ are the model parameters for the unary potential.

3.2. The pairwise potential

The unary potential part computes not only the intermediate representation of the superpixels but also the initial labeling of each superpixel based on local features. However, the performance of such a labeling may not be satisfactory due to the ambiguity of the low-level representation. To leverage the semantic context between superpixels and attenuate the ambiguity, we introduce a voting strategy that exploits the co-occurrence property of objects within an image. Based on the initial label obtained in Section 3.1, each superpixel casts a vote in support of the class labels of all the other superpixels, given its region size and the confidence of its initial guess, which is defined as follows:

$$\varphi_{s_i, s_j}(y_i, y_j) = \frac{P(y_i \mid s_i) S_i + P(y_j \mid s_j) S_j}{\sum_{k=1}^{N} S_k}, \tag{8}$$

where $P(y_i \mid s_i)$ is the probability of superpixel $s_i$ taking label $y_i$ as defined in (4) and $S_i$ is the size of superpixel $i$.

Thus, each superpixel $i$ receives $N - 1$ votes from all the other superpixels for its label assignment $y_i$:

$$V_{s_i}(y_i) = \sum_{j=1, j \neq i}^{N} \mu_{y_i, y_j} \, \varphi_{s_i, s_j}(y_i, y_j). \tag{9}$$

We define the pairwise potential by aggregating the votes of all superpixels for their label assignments $Y$:

$$F_{pair}(X, Y) = \sum_{i} V_{s_i}(y_i) = \sum_{i} \sum_{j \neq i} \mu_{y_i, y_j} \, \varphi_{s_i, s_j}(y_i, y_j), \tag{10}$$

where $\mu_{c_i, c_j}$ are $K^2$ model parameters for the pairwise potential, describing the preferences for co-occurring class pairs in the data.

3.3. The global constraint

When most of the superpixels are correctly labeled, the co-occurrence property helps to rectify the minority of superpixels that are mislabeled. However, as the proportion of mislabeled superpixels increases, it becomes more likely that errors are propagated to other superpixels by the voting scheme. To resolve this problem, a global constraint at the image level is further introduced to confine the possible classes present in an image. However, the existing global consistency potentials are either too limited in expressive power, allowing regions to have only a single class label, such as the Potts [12] and robust $P^n$-based [14] potentials, or too complicated, having to search an exponential space of likely label combinations, such as the harmony potential [9].

We propose a new efficient global constraint that achieves a better trade-off between expressive power for heterogeneous regions and the efficiency of searching the exponential space of possible label combinations. With the help of the multi-class ranking algorithm in [20], we first rank all the object classes at the image level and then transform the ranking into a soft constraint.

To obtain the multi-class ranking score, each image $I_i$ is represented by a kernel descriptor [29] and its corresponding binary label vector $l_i = \{l_i^1, \ldots, l_i^K\} \in \{-1, +1\}^K$, where $K$ is the total number of object classes, $l_i^j = +1$ denotes the presence of class $j$ in image $I_i$ and $l_i^j = -1$ denotes its absence. We aim to learn $K$ classification functions $f_t(I_i): \mathbb{R}^d \to \mathbb{R}$, $t \in C = \{c_1, c_2, \ldots, c_K\}$, one for each class, such that for any image $I$, $f_{c_i}(I)$ scores higher than $f_{c_j}(I)$ when $I$ is more likely to belong to class $c_i$ than to class $c_j$.¹

¹ We use the code available at http://www.cse.msu.edu/bucakser/software.html.

The ranking score indicates the confidence of assigning a specific label to a given image. Although it is informative, the result is still very rough. Therefore, rather than thresholding this vector to binarize it and obtain a possible label set as in [13,30], we transform the ranking score into a soft constraint using a sigmoid function:

$$h_t(I) = \frac{1}{1 + \alpha \cdot \exp(\beta \cdot f_t(I))} + \rho, \tag{11}$$

where $\alpha$, $\beta$, $\rho$ are parameters to be learned from a validation set.

Now each image $I_j$ can be represented by a $K$-dimensional vector $r_j = \{h_{c_1}(I_j), \ldots, h_{c_K}(I_j)\}$. This soft constraint is then integrated with the unary potential to impose an image label prior
on superpixels within image $I_j$. Thus the intermediate representation of a superpixel defined in (5) can be revised as follows:

$$\tilde{x}_i = (h_{c_1}(I_j) \cdot P(c_1 \mid s_i), \; h_{c_2}(I_j) \cdot P(c_2 \mid s_i), \; \ldots, \; h_{c_K}(I_j) \cdot P(c_K \mid s_i))^{\top}. \tag{12}$$

To illustrate the benefit of our proposed soft constraint strategy, we compare it with two alternative global constraint strategies: the Top-n labels hard constraint and the Threshold-t constraint. The Top-n constraint selects the n most probable labels for each image according to the rank score vector f computed in the above procedure; the other labels are simply discarded. In the Threshold-t strategy, instead of selecting a fixed number of labels for each image, we filter out unlikely labels by applying a threshold to the ranking score vector f. We compare with these strategies to show the efficiency of our proposed approach in Section 6.2.

The advantages of transforming multi-label ranking into global constraints can be seen from two aspects. On the one hand, the multi-label ranking score inferred at the image level is more representative of heterogeneous regions because it encourages multiple labels, compared with the robust $P^n$ model [14], which limits the parent node to a single label. On the other hand, instead of inferring the possible label set of an image from the exponentially sized power set of labels as in [22], which is intractable and can only be solved by a sampling strategy, we can directly compute the ranking scores of every label for an image and integrate them directly with the prediction results obtained from local features and logistic regression.

3.4. Overall compatibility function

Combining all of the above, we propose the compatibility function as follows:

$$F(X, Y) = F_{unary}(X, Y) + F_{pair}(X, Y) = \sum_{i} \omega_{y_i}^{\top} \tilde{x}_i + \sum_{i} \sum_{j \neq i} \mu_{y_i, y_j} \, \varphi_{s_i, s_j}(y_i, y_j). \tag{13}$$

The compatibility function combines local and global cues and contextual information in a unified framework and performs joint labeling inference, which efficiently attenuates the ambiguity of local appearance similarity and makes the labeling more consistent. We systematically evaluate our model on two challenging datasets for semantic segmentation and compare with state-of-the-art methods in Section 6.

4. Inference

The inference process defined in (1) seeks the most compatible labeling $Y$ for a given observation $X$. Typically the maximization of this compatibility function can be formulated as an integer programming problem, which is NP-hard in general except for some special cases (e.g. K = 1) and consequently can only be solved approximately. In this paper, we adopt an iterative greedy search algorithm because of its simplicity. First we rewrite the compatibility function as follows:

$$F(X, Y) = \sum_{i} \omega_{y_i}^{\top} \tilde{x}_i + \sum_{i} \sum_{j \neq i} \mu_{y_i, y_j} \, \varphi_{s_i, s_j}(y_i, y_j) = \sum_{i} \Big\{ \omega_{y_i}^{\top} \tilde{x}_i + \sum_{j \neq i} \mu_{y_i, y_j} \, \varphi_{s_i, s_j}(y_i, y_j) \Big\} = \sum_{i} g(y_i \mid \tilde{x}_i, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_N), \tag{14}$$

where $g(\cdot)$ is the potential of superpixel $i$ being labeled $y_i$ while the rest are labeled $y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_N$.

Thus the inference process is described as follows: (1) In each iteration we randomly choose one superpixel and fix the labels of all the other superpixels. (2) We compute the score function $g$ of all $K$ possible classes for this superpixel. (3) If the label with the largest score differs from the previous label, we update the label. (4) The iteration stops when no label changes or the maximum number of iterations is reached.

Like most greedy search algorithms, the initialization is crucial to performance. In our case, we found that the local prediction obtained from logistic regression serves as a natural and good starting prediction. We initialize our prediction with the logistic regression results in Eq. (12) instead of random values. The pseudocode of the above operations is given as follows:

Algorithm 1. Inference algorithm

Input:
    Image feature $I_j$, superpixels $s_i$, $i = 1, \ldots, N$
Initialization:
    $\hat{y}_i = \arg\max_{y_i \in C} h_{y_i}(I_j) \cdot P(y_i \mid s_i)$
    $\tilde{x}_i = (h_{c_1}(I_j) \cdot P(c_1 \mid s_i), \; h_{c_2}(I_j) \cdot P(c_2 \mid s_i), \; \ldots, \; h_{c_K}(I_j) \cdot P(c_K \mid s_i))$
repeat
    for all superpixels $s_i$ do
        $y_i = \arg\max_{y_i \in C} g(y_i \mid \tilde{x}_i, \hat{y}_1, \ldots, \hat{y}_{i-1}, \hat{y}_{i+1}, \ldots, \hat{y}_N)$
        if $y_i \neq \hat{y}_i$ then
            update $\hat{y}_i \leftarrow y_i$
        end if
    end for
until no label changes OR the maximum number of iterations is reached
return $\hat{Y} = \{\hat{y}_i \mid i = 1, \ldots, N\}$

Because the inference is conducted directly on superpixels instead of pixels, the number of variables is significantly reduced, typically from roughly a hundred thousand (e.g. 120,000 for an image of 400 × 300 pixels) to a few hundred (usually 100–300 superpixels per image). Therefore the inference algorithm converges very fast, typically in fewer than 15 iterations.

5. Learning

In this section, we discuss how to learn the proposed model (14), i.e., the model parameters $\omega$. To find the optimal solution $\omega^{*}$, we follow the idea in [27] for structured output prediction and consider the following maximum-margin optimization problem:

$$\begin{aligned} \min \;\; & \frac{1}{2} \|\omega\|^2 + \frac{C}{n} \sum_{i} \xi_i \\ \text{s.t.} \;\; & \forall i: \; \xi_i \geq 0, \\ & \forall \hat{Y} \in \Omega \setminus Y: \; \langle \omega, \delta\varphi(X, Y, \hat{Y}) \rangle \geq \Delta(\hat{Y}, Y) - \xi_i, \end{aligned} \tag{15}$$

where $\delta\varphi(X, Y, \hat{Y}) = \varphi(X, Y) - \varphi(X, \hat{Y})$, $\xi_i$ is a slack variable that becomes non-zero when the margin is violated, $Y$ is the ground truth labeling of the given image and $\Omega$ is the structured output space. $\Delta(\hat{Y}, Y)$ is the loss function that quantifies how incorrect the prediction $\hat{Y}$ is when $Y$ is the correct output.

One intuitive form of the loss function is the 0–1 loss on each superpixel:

$$\Delta(\hat{Y}, Y) = \sum_{i}^{N} (1 - \delta(\hat{y}_i, y_i)), \tag{16}$$

where $\delta$ takes the value 1 when its two arguments are identical and 0 otherwise.
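
As a small illustration of (16), the sketch below counts mislabeled superpixels for two hypothetical predictions. The label names, the superpixel areas and the helper names `zero_one_loss` and `mislabeled_area` are assumptions made for this example only. Each prediction mislabels exactly one superpixel, so both receive the same loss even though the mislabeled regions differ greatly in area.

```python
def zero_one_loss(y_pred, y_true):
    """Eq. (16): number of superpixels whose predicted label is wrong."""
    if len(y_pred) != len(y_true):
        raise ValueError("labelings must cover the same superpixels")
    return sum(1 for p, t in zip(y_pred, y_true) if p != t)


def mislabeled_area(y_pred, y_true, areas):
    """Total area (in pixels) of the mislabeled superpixels, for comparison."""
    return sum(a for p, t, a in zip(y_pred, y_true, areas) if p != t)


# Hypothetical image with three superpixels of very different sizes.
y_true = ["sky", "grass", "cow"]
areas = [5000, 3000, 40]            # superpixel sizes in pixels (assumed)

pred_a = ["sky", "grass", "grass"]  # the 40-pixel superpixel is wrong
pred_b = ["grass", "grass", "cow"]  # the 5000-pixel superpixel is wrong

loss_a = zero_one_loss(pred_a, y_true)
loss_b = zero_one_loss(pred_b, y_true)
```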

However, the loss function defined in (16) penalizes every incorrectly labeled superpixel equally, without taking the region size into account. Thus the loss of a large mislabeled superpixel is equal to the loss of a very small one. We therefore derive a more appropriate loss function as follows:

\[
\Delta(\hat{Y}, Y) = \frac{\sum_{i}^{N} \eta\, S_i \big(1 - \delta(\hat{y}_i, y_i)\big)}{\sum_i S_i},
\tag{17}
\]

where $S_i$ is the area of superpixel $i$ and $\eta$ is a weight factor to be learned from cross validation.

Because the structured output space $\Omega$ to be searched grows exponentially with the number of superpixels N and the number of object classes K, the number of constraints in (15) is also exponentially large, which makes it impossible to optimize directly. Current state-of-the-art approaches typically use the cutting plane algorithm proposed by Joachims et al. [19] and their implementation, the SVM Struct package.² For better efficiency, we follow a variant implementation of the cutting plane algorithm presented in [31]. The learning algorithm aims at finding a small set of constraints that ensures a sufficiently accurate solution. It starts with an unconstrained optimization problem as a relaxation of the original problem and maintains a working set $W_i$. In each iteration of the training process, the "most violated" constraint is selected and added to the existing working set if a certain condition is satisfied. Once a constraint is added, we optimize the problem again to get a new solution. The iteration stops when no constraint has changed or the required precision of the objective has been reached.

² The code is available at http://svmlight.joachims.org/svm_struct.html.

6. Experimental results

In this section, we evaluate the proposed method on two benchmark datasets, the MSRC-21 Dataset [32] and the Stanford Background Dataset (SBD) [33], which are widely used for semantic image segmentation evaluation. MSRC-21 consists of 591 images in 21 classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, and boat, where the ground truth is provided at pixel level. A void label is included to avoid the membership ambiguity of pixels on object boundaries, and it is typically ignored in training and evaluation. Following [2], we divide MSRC-21 into 45% for training, 10% for validation, and 45% for test. SBD is mostly used for background understanding, where various foreground objects, including car, cow, book, boat, chair, person, etc., are merged into one foreground class. It contains 715 images chosen from the following public datasets: LabelMe [34], MSRC-21 [32], PASCAL VOC [35], and Geometric Context [36]. Eight category labels were obtained using Amazon's Mechanical Turk (AMT): sky, tree, road, grass, water, building, mountain, and foreground.

6.1. Influence of over-segmentation

Though over-segmentation is widely adopted as a key preprocessing step in semantic segmentation, its impact on subsequent learning is rarely evaluated. In this section, we test the influence of four popular over-segmentation techniques: Mean Shift (MS) [3], Felzenszwalb and Huttenlocher's efficient graph-based segmentation (FH) [4], SLIC [5], and TurboPixel (TP) [6]. Note that different methods perform segmentation differently, as shown in Fig. 1, where each image is segmented into about 150 superpixels. MS and FH tend to generate larger superpixels in coherent regions and smaller superpixels in complex regions, while SLIC and TurboPixel appear to produce grid-style balanced superpixels. The question is how such differences in over-segmentation affect later superpixel labeling. To this end, we consider the task of labeling the superpixels from different over-segmentation methods using Logistic Regression. In particular, to stress the role of over-segmentation, no pairwise or global contextual information is incorporated. We use MSRC-21 in this experiment. For the feature representation of superpixels, we combine appearance-based and bag-of-words descriptors (see Section 3.1).

The appearance-based descriptor has 238 features, consisting of (1) color features computing the mean, standard deviation, skewness, and kurtosis statistics of the RGB, Lab, and YCrCb color-space channels and the gray image (4 × 10 dimensions); (2) texture features computing the same statistics of 48 filter responses (4 × 48 dimensions), including first and second derivatives of Gaussian and Laplacian-of-Gaussian with various orientations and scales; (3) shape features (3 dimensions); and (4) location features (3 dimensions). To build a bag-of-words (BoW) representation, we divide an image into 16 × 16 pixel cells with 75% overlap. Each cell is described by a 128-dimensional SIFT descriptor. The dictionary consists of 400 visual words built with K-means clustering, and the descriptors are then quantized using nearest neighbor. To represent a superpixel, we concatenate the BoW representations of the superpixel and the region around it, giving a BoW feature vector of length 2 × 400 = 800. The overall representation of each superpixel thus has 238 + 800 = 1038 dimensions.

A Logistic Regression classifier is trained on the training set obtained by the standard split of the MSRC-21 dataset, where the cost parameter is set to C = 25. As the evaluation metric, we follow [17] and use the global accuracy, which is the proportion of correctly labeled pixels among all the pixels considered (excluding pixels with the void label):

\[
\text{accuracy} = \frac{\sum_i N_{ii}}{\sum_{i,j} N_{ij}},
\tag{18}
\]

where $N_{ij}$ is the number of pixels of label $i$ (ground-truth) being labeled as $j$.

The results are shown in Fig. 2(a), where different numbers of superpixels are tested. We can see that the results associated with FH, SLIC and TP are relatively robust when the number of superpixels is greater than 50, compared to that of MS. Overall, FH performs slightly better than the rest, and is adopted later in our structured prediction model.

Moreover, we compute the performance of different segmentations by assigning the dominant labels to superpixels, as shown in Fig. 2(b). It can be seen that the accuracies increase with the number of superpixels, as expected. This is because the finer the granularity, the better the segmentation coincides with object borders. On the other hand, we can see that the performance of semantic image segmentation (Fig. 2(a)) does not increase monotonically with the number of superpixels.

6.2. Influence of global constraints

In this section, we evaluate the impact of multi-label ranking as global constraints on semantic segmentation, by integrating the global constraints into the experiment of the last section. We compare three ways of using label ranking: the Top n constraint, the Threshold constraint, and our proposed soft constraint (see Section 3.3).

In order to capture complementary image properties, we propose to use features for multi-label ranking that differ from the local features used for superpixels (see Section 3.1). We adopt kernel descriptors [29] for holistic image representation, which construct kernel descriptors from gradient, color, and local binary pattern match kernels using kernel principal component analysis (KPCA). Following the setting in [29], an image is divided into 16 × 16 pixel patches with 50% overlap to extract low level features. We compute image-level features using efficient match kernels (EMK) on 1 × 1, 2 × 2, and 4 × 4 pyramid sub-regions, and perform constrained kernel singular value decomposition (CKSVD) with 1000 visual words learned by K-means. Overall, each image is represented by an 84,000-dimensional feature vector.

[Figure 2: two line plots, (a) and (b), of Global Accuracy (%) against Number of Superpixels (50 to 300), each with curves for SLIC, TurboP, MeanS and GraphB.]

Fig. 2. Influence of initial segmentation on semantic segmentation performance. (a) The performance of semantic segmentation by unary feature and linear classifier. (b) The performance of semantic segmentation by assigning the dominant labels to superpixels.

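As an illustration of how the curves in Fig. 2(b) can be computed, the sketch below assigns every superpixel its dominant ground-truth label and scores the result with the global accuracy of Eq. (18), skipping void pixels. It is a minimal Python sketch: the flattened pixel lists, the VOID marker value and all function names are assumptions made for the example, not part of the original implementation.

```python
from collections import Counter
from typing import Dict, List, Sequence

VOID = -1  # assumed marker for the void label; excluded from evaluation


def dominant_labels(superpixels: Sequence[int], truth: Sequence[int]) -> Dict[int, int]:
    """Map each superpixel id to its most frequent non-void ground-truth label."""
    votes: Dict[int, Counter] = {}
    for sp, y in zip(superpixels, truth):
        if y != VOID:
            votes.setdefault(sp, Counter())[y] += 1
    return {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}


def confusion_matrix(truth: Sequence[int], pred: Sequence[int], num_classes: int) -> List[List[int]]:
    """N[i][j] = number of non-void pixels with ground truth i labeled as j."""
    n = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(truth, pred):
        if y != VOID:
            n[y][p] += 1
    return n


def global_accuracy(n: List[List[int]]) -> float:
    """Eq. (18): correctly labeled pixels divided by all pixels considered."""
    total = sum(sum(row) for row in n)
    return sum(n[i][i] for i in range(len(n))) / total


# Toy image of 8 pixels (flattened) split into two superpixels.
superpixels = [0, 0, 0, 0, 1, 1, 1, 1]
truth = [2, 2, 2, 0, 1, 1, VOID, 0]

dom = dominant_labels(superpixels, truth)
pred = [dom.get(sp, VOID) for sp in superpixels]
acc = global_accuracy(confusion_matrix(truth, pred, num_classes=3))
print(dom, acc)
```

Because each superpixel receives a single label, any superpixel that straddles an object border caps the achievable accuracy, which is why this score rises as superpixels become smaller.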
We adopt the efficient multi-class ranking algorithm [20] to learn K classification functions $f_t(I_i): \mathbb{R}^d \to \mathbb{R}$, $t \in C = \{c_1, c_2, \ldots, c_K\}$, one for each class, with the goal that for any image $I$, $f_{c_i}(I)$ scores higher than $f_{c_j}(I)$ when $I$ is more likely to belong to class $c_i$ than to class $c_j$. We compare the kernel descriptor with the widely used spatial pyramid matching (SPM) representation under similar settings. The results measured by ROC curves are shown in Fig. 3, where the area under curve (AUC) of SPM is 90.3%, while the AUC of the kernel descriptor reaches 94.3%.

[Figure 3: ROC curves for KDES and SPM; both axes range from 0 to 1.]

Fig. 3. The ROC curves of the two feature representations in multi-class ranking. The area under curve (AUC) of the kernel descriptor is 94.3% while the AUC of spatial pyramid matching is 90.3%.

Now we are ready to report results after the integration of multi-label ranking. Recall that the Top n constraint considers only the top n labels for each image according to the ranking scores, while the Threshold-t constraint retains those with scores greater than t. In contrast, our method converts the ranking scores to a soft constraint using the sigmoid function defined in Eq. (11) (here we set a = 3, b = −3, q = −0.4). Either the hard or the soft constraint is combined with Logistic Regression as in Eq. (12), and the labels of superpixels can be inferred by Eq. (6). The results are shown in Table 1. We can see that the proposed soft constraint method outperforms the two hard constraint alternatives under all tested parameter settings.

6.3. Results for MSRC-21

In this section, we report our structured prediction results on MSRC-21. We also report the results obtained by combining local unary features and Logistic Regression, with or without global constraints, where no pairwise co-occurrence information is incorporated. For comparison, we show the results of six state-of-the-art methods, taken from [22,30,9,37,17]. The overall results are summarized in Table 2.

Table 1
Comparison of different global constraint methods.

Method                             Global   Average
Local Feature                      72.8     58.6
Local Feature + Top 3 labels       72.6     59.8
Local Feature + Top 4 labels       76.1     64.5
Local Feature + Top 5 labels       77.5     65.0
Local Feature + Top 6 labels       76.8     64.9
Local Feature + Top 7 labels       76.3     63.1
Local Feature + Threshold = 0      75.9     63.0
Local Feature + Threshold = 0.2    76.7     64.5
Local Feature + Threshold = 0.4    77.8     66.0
Local Feature + Threshold = 0.6    78.0     66.7
Local Feature + Threshold = 0.8    74.3     60.8
Local Feature + Soft const.        79.1     67.7

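The two hard constraints compared in Table 1 can be sketched in a few lines of Python. The sketch below selects the retained label set from image-level ranking scores and then restricts the per-superpixel decision to that set. The way the constraint is applied here (an argmax over the retained labels only) is an assumption standing in for Eq. (12), the scores and probabilities are invented toy values, and the soft constraint of Eq. (11) is deliberately not reproduced.

```python
from typing import Sequence, Set


def top_n_labels(scores: Sequence[float], n: int) -> Set[int]:
    """Top n constraint: keep the n classes with the highest ranking scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return set(order[:n])


def threshold_labels(scores: Sequence[float], t: float) -> Set[int]:
    """Threshold-t constraint: keep classes whose ranking score exceeds t."""
    return {k for k, s in enumerate(scores) if s > t}


def constrained_label(probs: Sequence[float], allowed: Set[int]) -> int:
    """Most probable class among the allowed ones (all classes if none allowed)."""
    candidates = allowed if allowed else set(range(len(probs)))
    return max(candidates, key=lambda k: probs[k])


# Toy example with four classes.
ranking_scores = [0.9, -0.3, 0.5, 0.1]     # image-level ranking scores
superpixel_probs = [0.20, 0.45, 0.25, 0.10]  # local classifier output

unconstrained = constrained_label(superpixel_probs, set())
with_top2 = constrained_label(superpixel_probs, top_n_labels(ranking_scores, 2))
with_thr0 = constrained_label(superpixel_probs, threshold_labels(ranking_scores, 0.0))
print(unconstrained, with_top2, with_thr0)
```

In the toy example the local classifier alone prefers class 1, which the image-level ranking considers unlikely; both hard constraints remove it from consideration and the decision moves to class 2.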
From Table 2, we can see that using local unary features and Logistic Regression yields a baseline of 73% pixel-wise global accuracy and 59% average per-class accuracy. Note that in this baseline, the label of a region (i.e., a superpixel) is decided by its appearance alone. By integrating the multi-label ranking results, we improve the global accuracy by 6 percentage points and the average accuracy by 9 percentage points. This shows that global cues can effectively guide the labeling of local regions by substantially reducing the potential classes to be considered during labeling. This is because region labels are strengthened if they are consistent with the global ranking and suppressed otherwise. By further refining the labeling with pairwise co-occurrence information using the structural SVMs framework, we achieve 84% global accuracy and 76% average accuracy, which are highly competitive compared to the results reported for previous methods, although our model is much simpler and more efficient in that we decouple the global constraint from the pairwise potential in joint inference and instead integrate it with the local prediction from Logistic Regression (Section 3.1).

Considering the per-class accuracy, we obtained very good performance on classes such as grass, sky, and flower, which can be inferred easily from local appearance and whose accuracies are above 95%. For some difficult classes, such as bird and boat, the accuracies are less than 40%, due to similar appearance, various sizes, and complex backgrounds.

Fig. 4 shows example results of our model. For the images shown in Fig. 4(a), the labeling results obtained by applying Logistic Regression on local appearance features are shown in Fig. 4(b), where the label of a region is decided by its appearance feature alone. Taking the first image for example, it can be seen that parts of the bird were mislabeled as dog, cat, sheep or even road because of the ambiguity of local appearance. The multi-label ranking results then give higher confidence to labels like grass, bird and dog and suppress the presence of road, sheep and cat, as in Fig. 4(c). Finally, by introducing the co-occurrence property, the many regions labeled as bird suppress the presence of dog in the image, because these two classes rarely appear together. Fig. 4(d) shows the labeling results obtained by our final structured prediction and post-processing by grouping superpixels into larger groups, and it can be seen that the final results are much cleaner and more consistent.

The proposed method is very efficient. It takes about 800 s to train the structural SVMs on a training set of 335 samples in MSRC-21, and about 1 s to label one test image. These results were obtained in MATLAB 7.10.0 (R2010a) 64 bit on a laptop with a 2.67 GHz i5 CPU and 8 GB RAM.

Table 2
Quantitative results on the MSRC-21 data set. The computation of these scores follows the protocol defined in [17]. The best performance is highlighted in bold.

Method Building Grass Tree Cow Sheep Sky Aeroplane Water Face Car Bicycle Flower Sign Bird Book Chair Road Cat Dog Body Boat Global Average
Gould et al. [10] 72 95 81 66 71 93 74 70 70 69 72 68 55 23 83 40 77 60 50 50 14 77 64
Ladicky et al. [17] 80 96 86 74 87 99 74 87 86 87 82 97 95 30 86 31 95 51 69 66 9 86 75
Gonfaus et al. [9] 60 78 77 91 68 88 87 76 73 77 93 97 73 57 95 81 76 81 46 56 46 77 75
Munoz et al. [37] 63 93 88 84 65 89 69 78 74 81 84 80 51 55 84 80 69 47 59 71 24 78 71
Csurka and Perronnin [30] 75 93 78 70 79 88 66 63 75 76 81 74 44 25 75 24 79 54 55 43 18 77 64
Lucchi et al. [21] 64 94 91 72 87 97 90 76 72 83 86 88 93 62 90 89 85 97 0 83 0 85 77
Boix et al. [22] 66 87 84 81 83 93 81 82 78 86 94 96 87 48 90 81 82 82 75 70 52 83 80
LR w/o global 66 94 83 50 52 93 68 70 64 51 82 73 54 25 69 40 82 39 18 42 19 73 59
LR w/global 76 98 87 68 66 90 77 70 73 60 84 77 57 32 79 58 86 65 41 57 20 79 68
Structural SVMs 70 98 87 76 79 96 81 75 86 74 88 96 72 36 90 79 87 74 60 54 35 84 76

Fig. 4. Example results on MSRC-21 data set by our model. (a) Original images. (b) Logistic regression prediction. (c) Logistic regression prediction with global constraint. (d) Structured prediction results with multi-class labeling prior and contextual information. (e) Ground-truth labeling.

6.4. Results on Stanford Background dataset

In this section, we report our results on SBD. We follow [33] to perform 5-fold cross-validation with the dataset randomly divided into 572 training images and 143 test images for each fold. The results are shown in Table 3. We can see that our structured prediction model performs favorably compared to other state-of-the-art methods. We also observed that the incorporation of the global label ranking, although useful, did not improve the performance significantly. This can probably be explained from two aspects. First, in SBD the foreground class includes a wide range of object classes such as person, car, cow, sheep, bicycle, and their
appearances vary drastically among classes, making it very difficult to model the appearance. Second, the number of classes in SBD is much smaller than in MSRC-21, and therefore the multi-label ranking and co-occurrence statistics may be less informative. Besides foreground, another challenging class is mountain, which has few instances in the dataset, making it very hard to label correctly.

Some example results of our model are shown in Fig. 5, where we can see that local labeling is not sufficient to address the appearance ambiguity (Fig. 5(b)). In the presence of pairwise and global cues, the labeling becomes more robust, as shown in Fig. 5(d).

Table 3
Quantitative results on Stanford Background dataset. The best performance is highlighted in bold.

Method              Sky    Tree   Road   Grass  Water  Building  Mountain  Foreground  Global  Average
Gould et al. [33]   92.6   61.4   89.6   82.4   47.9   82.4      13.8      53.7        76.4    65.5
Munoz et al. [37]   91.6   66.3   86.7   83.0   59.8   78.4      5.0       63.5        76.9    66.2
LR w/o global       91.9   70.0   88.9   78.3   54.9   79.6      6.3       54.2        76.6    65.5
LR w/global         92.4   70.0   89.3   77.8   51.5   79.3      3.0       57.9        77.0    65.2
Structural SVMs     94.9   69.7   90.0   81.4   60.9   79.9      13.5      54.1        77.8    68.0

Fig. 5. Example results on Stanford Background dataset by our model. (a) Original images. (b) Logistic regression prediction. (c) Logistic regression prediction with global constraint. (d) Structured prediction results with multi-class labeling prior and contextual information. (e) Ground-truth labeling. Note that objects such as cars, persons and horses are merged into one foreground class.

7. Conclusion

We have presented a new structured prediction model for semantic segmentation. Traditional structured prediction frameworks using pairwise constraints alone degrade when a notable number of regions within an image are wrongly labeled in the early stage prediction by Logistic Regression, because the wrong contextual information of mislabeled regions may propagate to correct ones. Therefore it is necessary to confine the possible labels at the image level. We utilized the multi-label ranking score and converted it to a soft global constraint, which encourages the presence of likely labels while suppressing the presence of unlikely labels. Compared with other existing global constraint schemes, we decoupled the global constraint from the pairwise constraint and integrated it with the unary potential directly, making the model much simpler while remaining efficient. The proposed model was evaluated on two challenging datasets, and experiments showed that our model obtained highly competitive performance compared with the state-of-the-art results.

In future work, we plan to integrate multi-source cues such as depth into the structural SVMs framework. So far we have only considered extracting multi-scale cues from a single source, that is, the optical image. Features from multiple sources could contain complementary information and be potentially useful for improving performance.

Acknowledgments

This work was supported by the National Key Project for Basic Research of China (2013CB329403), Nature Science Foundation of China (No. 61373076), the Fundamental Research Funds for the Central Universities (No. 2013121026), and the 985 Project of Xiamen University.

References

[1] J. Shotton, J. Winn, C. Rother, A. Criminisi, Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation, in: Proceedings of European Conference on Computer Vision (ECCV), 2006, pp. 1–15.
[2] J. Shotton, M. Johnson, R. Cipolla, Semantic texton forests for image categorization and segmentation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[3] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 603–619.
[4] P. Felzenszwalb, D. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision 59 (2) (2004) 167–181.
[5] A. Vedaldi, S. Soatto, Quick shift and kernel methods for mode seeking, in: Proceedings of European Conference on Computer Vision (ECCV), 2008, pp. 705–718.
[6] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, K. Siddiqi, Turbopixels: fast superpixels using geometric flows, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (12) (2009) 2290–2297.
[7] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels, Technical Report 149300, EPFL (June).

[8] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, S. Belongie, Objects in context, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[9] J. Gonfaus, X. Boix, J. Van De Weijer, A. Bagdanov, J. Serrat, J. Gonzalez, Harmony potentials for joint classification and segmentation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3280–3287.
[10] S. Gould, J. Rodgers, D. Cohen, G. Elidan, D. Koller, Multi-class segmentation with relative location prior, International Journal of Computer Vision 80 (3) (2008) 300–316.
[11] C. Galleguillos, A. Rabinovich, S. Belongie, Object categorization using co-occurrence, location and appearance, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[12] N. Plath, M. Toussaint, S. Nakajima, Multi-class image segmentation using conditional random fields and global classification, in: Proceedings of International Conference on Machine Learning (ICML), 2009, pp. 817–824.
[13] G. Csurka, F. Perronnin, A simple high performance approach to semantic segmentation, in: Proceedings of British Machine Vision Conference (BMVC), 2008.
[14] P. Kohli, L. Ladickỳ, P. Torr, Robust higher order potentials for enforcing label consistency, International Journal of Computer Vision 82 (3) (2009) 302–324.
[15] L. Ladicky, C. Russell, P. Kohli, P. Torr, Graph cut based inference with co-occurrence statistics, in: Proceedings of European Conference on Computer Vision (ECCV), 2010, pp. 239–253.
[16] P. Kohli, M. Kumar, P. Torr, P3 & beyond: solving energies with higher order cliques, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[17] L. Ladicky, C. Russell, P. Kohli, P. Torr, Associative hierarchical CRFs for object class image segmentation, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 739–746.
[18] J. Zhu, E. Xing, B. Zhang, Laplace maximum margin Markov networks, in: Proceedings of International Conference on Machine Learning (ICML), 2008, pp. 1256–1263.
[19] T. Joachims, T. Finley, C. Yu, Cutting-plane training of structural SVMs, Machine Learning 77 (1) (2009) 27–59.
[20] S. Bucak, P. Kumar Mallapragada, R. Jin, A. Jain, Efficient multi-label ranking for multi-class learning: application to object recognition, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2098–2105.
[21] A. Lucchi, Y. Li, X. Boix, K. Smith, P. Fua, Are spatial and global constraints really necessary for segmentation? in: IEEE International Conference on Computer Vision (ICCV), 2011, pp. 9–16.
[22] X. Boix, J. Gonfaus, J. van de Weijer, A. Bagdanov, J. Serrat, J. Gonzàlez, Harmony potentials, International Journal of Computer Vision 96 (1) (2012) 83–102.
[23] S. Nowozin, P. Gehler, C. Lampert, On parameter learning in CRF-based approaches to object class image segmentation, in: Proceedings of European Conference on Computer Vision (ECCV), 2010, pp. 98–111.
[24] G. Heitz, S. Gould, A. Saxena, D. Koller, Cascaded classification models: combining models for holistic scene understanding, in: Proceedings of Neural Information Processing Systems (NIPS), 2008.
[25] S. Gould, T. Gao, D. Koller, Region-based segmentation and object detection, in: Proceedings of Neural Information Processing Systems (NIPS), vol. 1, 2009.
[26] D. Hoiem, A. Efros, M. Hebert, Closing the loop in scene interpretation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[27] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun, Support vector machine learning for interdependent and structured output spaces, in: Proceedings of International Conference on Machine Learning (ICML), 2004, pp. 104–111.
[28] B. Fulkerson, A. Vedaldi, S. Soatto, Class segmentation and object localization with superpixel neighborhoods, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 670–677.
[29] L. Bo, X. Ren, D. Fox, Kernel descriptors for visual recognition, in: Proceedings of Neural Information Processing Systems (NIPS), 2010.
[30] G. Csurka, F. Perronnin, An efficient approach to semantic segmentation, International Journal of Computer Vision 95 (2) (2011) 198–212.
[31] C. Desai, D. Ramanan, C. Fowlkes, Discriminative models for multi-class object layout, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 229–236.
[32] A. Criminisi, Microsoft Research Cambridge Object Recognition Image Database. <http://research.microsoft.com/en-us/projects/objectclassrecognition>.
[33] S. Gould, R. Fulton, D. Koller, Decomposing a scene into geometric and semantically consistent regions, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1–8.
[34] B. Russell, A. Torralba, K. Murphy, W. Freeman, Labelme: a database and web-based tool for image annotation, International Journal of Computer Vision 77 (1–3) (2008) 157–173.
[35] M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, The Pascal Visual Object Classes Challenge 2007 (voc2007) Results (2007).
[36] D. Hoiem, A. Efros, M. Hebert, Recovering surface layout from an image, International Journal of Computer Vision 75 (1) (2007) 151–172.
[37] D. Munoz, J. Bagnell, M. Hebert, Stacked hierarchical labeling, in: Proceedings of European Conference on Computer Vision (ECCV), 2010, pp. 57–70.
[38] Z. Li, X.-M. Wu, S.-F. Chang, Segmentation using superpixels: A bipartite graph partitioning approach, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 789–796.
