
IET Image Processing

Research Article

Semantic image segmentation using an improved hierarchical graphical model

ISSN 1751-9659
Received on 12th July 2017
Revised 3rd February 2018
Accepted on 12th June 2018
E-First on 16th July 2018
doi: 10.1049/iet-ipr.2017.0738
www.ietdl.org

Neda Noormohamadi1, Peyman Adibi1, Sayyed Mohammad Saeed Ehsani1

1Artificial Intelligence Department, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran
E-mail: adibi@eng.ui.ac.ir

Abstract: Hierarchical graphical models can jointly incorporate several tasks in a unified framework, so that the exchange of information among the tasks improves the results. A hierarchical conditional random field (CRF) is proposed here to improve semantic image segmentation. Although the newly proposed model applies the information of several tasks, its run time is comparable with that of contemporary approaches. The method is evaluated on the MSRC dataset and shows similar or better segmentation accuracy in comparison with models in which CRFs or hierarchical models are adopted.

1 Introduction

The objective of computer vision is to provide the capability of understanding the world through intelligent processing. Such a vision is obtained by extracting useful information from digital images and performing complex inferences on the input data [1]. This task is performed by constructing a statistical model that relates image data to high-level information. Probabilistic graphical models, a consolidated group of statistical modelling tools widely applied in computer vision problems such as stereo matching [2] and class segmentation [3], are appropriate candidates for this purpose.

There exist many studies that solve individual machine vision problems in isolation, including tasks like object detection, scene understanding, and pose estimation. Obviously, there exists a meaningful relation among these tasks. For example, if the type of the image scene is known, the objects expected to be seen in it are limited. Thus, if these tasks are solved in a unified framework, in the sense that information is exchanged among them, improved results can justifiably be expected. With this approach, the system can accomplish a holistic perception of the image. Holistic scene understanding is an important objective in machine vision [4]. Different methods have been proposed to reach it. Some apply the outputs of the individual tasks (e.g. scene recognition, depth estimation, and segmentation) in a sequential manner, using the output of one task (e.g. object detection) as features for another (e.g. image segmentation) [5-7]. Another family of methods, followed here, applies several tasks jointly in a unified framework with a hierarchical structure [1, 8, 9]. Many researchers formulate the joint reasoning over multiple tasks as an inference problem in an undirected graphical model. The most popular graphical model adopted in this area is the conditional random field (CRF) [10, 11], where the lower layer often corresponds to the image segmentation task: the ith node of this layer represents the semantic label of the ith pixel or super-pixel (SP) of the image.

In the probabilistic graphical models adopted in such multi-task vision problems, the image information is embodied in the model by unary potentials. Moreover, pairwise potentials are applied to smooth the labels of adjacent nodes. If several tasks are applied in the model, each task is added as a separate layer. For each CRF, a problem-specific energy function is defined, which should be minimised in order to solve the problem.

In most of the related literature, information is extracted from small regions of the image, such as the SPs. Although this information, called local, is effective, it is not sufficient and many ambiguities still remain in the model. To overcome this drawback, the idea of non-local interest regions [12, 13] was suggested to define the pairwise potentials, although unary and pairwise potentials are still not adequate to eliminate all ambiguities in the image. Newer approaches integrate global information into the problem [14-16]; adding global information to the model as a layer in the graph [17] is one suggestion for eliminating ambiguities.

Besides these approaches, deep learning techniques [18, 19] have been proposed to capture more contextual cues, where a multi-layer convolutional neural network (CNN) is applied to generate hierarchical features for segmenting the scenes. Most recently, attempts have been made to combine the strengths of CRFs and CNNs [20, 21]. To this end, the problem is formulated as a CRF, while feature extraction, inference, or training is handled by a CNN. These approaches produce promising segmentation results.

The objective of this study is to assign a semantic label to each pixel of an input image. To this end, the suggested method applies multiple tasks, including object detection, scene recognition, and semantic segmentation, in a simultaneous manner. It is assumed that a basic set of pixel labels, a probabilistic list of scene types for each image, and a set of candidate objects detected in the image are available; accordingly, a method with the following characteristics is proposed:

(i) The proposed model not only improves the segmentation results by utilising several potential functions but also simplifies the model, as explained in the experimental section.
(ii) Taking inspiration from [22], an improved unary potential is applied in the proposed model for each SP of the image. Unlike [22], the local and global information are combined into a single term of the energy function.
(iii) One class of potential functions is the shape prior for the objects; the quality of this shape prior is enhanced in this paper.

The proposed method shares ideas with the work of Yao et al. [17], which presents a holistic approach to scene understanding that jointly reasons about the segmentation, the scene type, the presence of each class in the image, and the presence of objects that are instances of the classes (e.g. an image including three horses has three objects of the class 'horse'). However, there are differences between the method of [17] and the newly proposed approach. One important difference is that they apply the scene type in the model by adding a separate layer to the graph, while here the scene type is applied by combining it with the local unary potential of the SPs.

This approach promotes the model's simplicity. In contrast to their model, here the contextual relations among the candidate objects are also of concern.

The rest of the paper is organised as follows: the proposed approach is introduced in Section 2, the experimental results are discussed in Section 3, and the conclusion is presented in Section 4.

2 Proposed approach

The image semantic segmentation problem is formulated as an inference problem in a CRF model. Since the suggested model extends a basic CRF, we first provide brief background material on the basic CRF and the hierarchical CRF (HCRF), and then explain the proposed method.

2.1 Basic CRF

On a given image, the basic CRF [10] is defined on a set of random variables X = \{x_i\}, i \in V, where V corresponds to image elements like pixels, SPs, and patches. For the segmentation task, each x_i can take any label from a discrete label space L representing the set of object categories, such as bottle, car, and person. In correspondence with each CRF, an energy function E(X) is defined. Most inference methods seek to minimise the following energy function:

E(X) = \sum_{c \in C} \lambda_c \psi_c(X_c)    (1)

where \lambda_c is the weight vector, C is the set of cliques, and \psi_c is the cost function (named the potential function) defined over a clique c of the set of variables X_c involved in that clique (e.g. the set of pixels in a SP). The potential functions assign scores to a clique for its different labellings.

The following refers to the commonly applied potential functions. A unary potential (named the data term) is defined over just one variable; it calculates scores through appearance-based or location-based features. A pairwise potential (named the smoothness term) is defined over each pair of neighbouring variables v_i and v_j; it encourages the neighbouring variables to take similar labels from the label space. The objective of the pairwise potential functions is to produce spatial smoothness in the final labelling.

In addition to these common potential functions, higher-order potential functions may be defined on bigger cliques. Larger connectivity can incorporate complex priors into the model and reduce discretisation artefacts. Although higher-order potentials are useful, they lead to overly complex inference. Thus, in most studies, cliques of size greater than two are not considered, and when they are, each higher-order clique is decomposed into a set of pairwise ones [17]. A clique of the pixels that belong to a SP is an example of a higher-order clique.
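As a concrete illustration of (1), the following minimal sketch (in Python; it is ours rather than the paper's MATLAB implementation, and the names unary, edges, and pairwise are illustrative) evaluates the energy of a labelling with unary and pairwise cliques:

import numpy as np

def crf_energy(labels, unary, edges, pairwise, lam_u=1.0, lam_p=1.0):
    # E(X) = sum_c lambda_c * psi_c(X_c), restricted to unary and pairwise cliques.
    # labels: (n,) label per node; unary: (n, L) per-node label costs;
    # edges: list of neighbouring pairs (i, j); pairwise: (L, L) label-pair costs.
    e = lam_u * unary[np.arange(len(labels)), labels].sum()
    for i, j in edges:
        e += lam_p * pairwise[labels[i], labels[j]]
    return e

# Example: three nodes on a chain, two labels, Potts smoothness term.
unary = np.array([[0.1, 2.0], [1.5, 0.2], [0.3, 1.0]])
potts = 1.0 - np.eye(2)  # cost 1 when neighbours disagree, 0 when they agree
print(crf_energy(np.array([0, 1, 0]), unary, [(0, 1), (1, 2)], potts))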
2.2 Hierarchical CRF

Researchers like Sun et al. [8] have proposed methods in which information about several tasks is applied jointly to a computer vision problem. The HCRF [8, 23] is such a multi-task model, developed on top of a basic CRF, where each layer corresponds to one task. In an HCRF, information is exchanged between the tasks through pairwise potentials defined among the layers.
2.3 Proposed extended HCRF

In order to jointly take advantage of scene classification, object detection, and object segmentation, the problem is modelled through an HCRF. The architecture of the newly proposed model is shown in Fig. 1. Its main components are described in the following.

There are three modules in Fig. 1. Module 1 runs two segmentation procedures in a coarse-to-fine manner over an input image, which results in two sets of image segments, called SPs and super-segments (SSs). In this model, for both segmentation levels, the pixels that lie inside the same segment are assumed to take the same label. Clearly, if the segment-generator algorithm is not sufficiently accurate, a segment may include several objects of the image, which introduces a large error into the model; however, if it is accurate enough, its application decreases the model size. Moreover, the label of a pixel may depend on long-range information; e.g. to determine the label of a green pixel ('grass' or 'tree'), information about the surrounding pixels should be known. In module 2, the candidate objects of the image are detected by a part-based object detector, which assumes an object has several parts [24]. For each detected object, the method of [24] specifies an object class, the location of the detected object, a score, and the index of a root mixture component. Module 3 computes a probability vector for the image, containing the classifier scores for each scene class.

By applying the extracted segments and the detected objects, a multi-level graph is built for the proposed model, as illustrated in Fig. 1, where the low-level nodes correspond to the smaller segments of the image, the SPs. The second-level nodes are associated with the extracted segments of larger size, the SSs. Each SS is connected to the SPs which lie inside it. At the higher level, the model consists of several binary variables b_l \in \{0, 1\}, one assigned to each detected object, which allow the model to accept or reject these detections. For each detected object, its constituent SPs are known, and knowing the class of an object influences the prediction about the classes of its corresponding SPs. Therefore, at the higher level, each object is connected to its constituent SPs in the graph.

By using the outputs of the modules, the energy function over the generated graph, which consists of six parts related to scene classification, object detection, and object segmentation in a simultaneous manner, is presented as

E(x, y, b) = w_1 E_1 + w_2 E_2 + w_3 E_3 + \sum_{cls \in C} w_4^{cls} E_4 + \sum_{r \in R} w_5^{r} E_5 + \sum_{cls \in C} w_6^{cls} E_6    (2)

where x and y are the random variables associated with the image SPs and SSs, respectively, b is the set of candidate objects in the image, C consists of the image semantic labels, R consists of the semantic relations among candidate objects, and each of E_1 to E_6 is one term of the energy function, explained separately in the following subsections. A different weight, w_1, w_2, w_3, w_4^{cls}, w_5^{r}, or w_6^{cls}, is defined for each term of the energy function; these weights are learned in the training step through the structured prediction framework of [25].
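As a schematic sketch of how the six terms are combined (our illustration, not the authors' code; the dictionary keys are assumptions for this example), the energy of (2) is a weighted sum with per-class weights for E_4 and E_6 and per-relation weights for E_5:

def total_energy(E, w):
    # E: {"E1": s, "E2": s, "E3": s, "E4": {cls: s}, "E5": {rel: s}, "E6": {cls: s}}
    # w: learned weights of (2), scalars w1..w3 and dicts w4, w5, w6.
    e = w["w1"] * E["E1"] + w["w2"] * E["E2"] + w["w3"] * E["E3"]
    e += sum(w["w4"][c] * v for c, v in E["E4"].items())  # per semantic class
    e += sum(w["w5"][r] * v for r, v in E["E5"].items())  # per spatial relation
    e += sum(w["w6"][c] * v for c, v in E["E6"].items())  # per semantic class
    return e

The weights themselves come from the structured prediction training of [25]; this sketch only evaluates a fixed assignment.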

2.3.1 Unary potential for each SP and SS: In order to take advantage of the image segmentation task, information is introduced into the model by defining the terms E_1 and E_2 as follows:

E_1 = \sum_{i \in SP} \phi_i(x_i),    (3)

E_2 = \sum_{j \in SS} \phi_j(y_j)    (4)

where \phi_i and \phi_j encode the unary potential functions over each SP and SS, respectively. By employing the SSs, we take advantage of long-range dependencies in the proposed model. To calculate \phi_i, the local unary potential and a global feature extracted from the entire image must cooperate. The local unary potential corresponding to each SP is computed by averaging the TextonBoost [26] pixel potentials inside that SP. Markov random fields are mostly able to model the local interactions between nodes; consequently, they only apply local information. Although such information is useful, it clearly does not contain all the hidden information of the image, making global information necessary for better inference. In order to apply global information in the model, the image label classification is calculated through

\phi_{scene}(\text{scene} = cls) = \sigma(t_{scene})    (5)

where t_{scene} is the classifier score assigned to the scene type cls and \sigma(x) = 1/(1 + \exp(-1.5x)) is a logistic function.

In [22], the local and global information are considered as two separate terms, the data and global terms, respectively. In this paper, instead, the global feature of (5) is combined with the local unary potential of each SP as follows:

\phi_i(x_i = cls) = \phi_i^{l}(x_i = cls) + \phi_{scene}(\text{scene type} = cls)    (6)

where \phi_i^{l} is the local unary potential of the ith SP x_i. In (4), the unary potential \phi_j(y_j) is computed for each SS by averaging the TextonBoost pixel potentials inside that SS. Note that the scene types can in general differ from the pixel labels, but they are very similar; thus, each scene type can be mapped to the pixel label that is semantically most similar to it (e.g. the scene type 'person' can be mapped to the pixel label 'body').
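A minimal sketch of (5) and (6) follows (in Python; texton_pot, sp_mask, and scene_of_cls are illustrative names we introduce here, not identifiers from the released MATLAB code):

import numpy as np

def sigma(t):
    # logistic squashing of the raw scene classifier score, eq. (5)
    return 1.0 / (1.0 + np.exp(-1.5 * t))

def sp_unary(texton_pot, sp_mask, scene_scores, cls, scene_of_cls):
    # texton_pot:   (H, W, L) per-pixel TextonBoost potentials
    # sp_mask:      (H, W) boolean mask of the ith SP
    # scene_scores: {scene type: raw classifier score t_scene}
    # scene_of_cls: mapping from a pixel label to its most similar scene type
    local = texton_pot[sp_mask, cls].mean()               # phi^l_i in (6)
    global_term = sigma(scene_scores[scene_of_cls[cls]])  # phi_scene in (5)
    return local + global_term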
2.3.2 Pair-wise potential between SP and SS: In order to make the labels assigned to an SP and an SS consistent, each SS is connected by a graph edge to the SPs which lie inside it, and the covering energy term E_3 is defined as follows:

E_3 = \sum_{(i, j)} \phi_{i,j}(x_i, y_j)    (7)

where \phi_{i,j} is the pair-wise potential between the ith SP x_i and the jth SS y_j. The P^n potential of [27] is applied in calculating \phi_{i,j}, expressed as

\phi_{i,j}(x_i, y_j) = \begin{cases} -\beta, & \text{if } x_i \neq y_j \\ 0, & \text{otherwise} \end{cases}    (8)

where \beta is learned as the weight of this potential.
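A small sketch of the covering term, under our reading of (7) and (8) (ss_members is an assumed data structure listing the SP indices inside each SS):

def covering_term(sp_labels, ss_labels, ss_members, beta):
    # E_3 of (7) with the potential of (8): a pair contributes -beta when
    # the SP label differs from its covering SS label, and 0 otherwise;
    # beta is learned as the weight of this potential.
    e = 0.0
    for j, members in enumerate(ss_members):
        for i in members:
            if sp_labels[i] != ss_labels[j]:
                e += -beta
    return e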
2.3.3 Unary potential for object detection: As previously stated, a binary variable b_l is assigned to each detected object. The model is allowed to accept or reject these detections by defining the fourth term of the energy function as follows:

E_4 = \sum_{l \in BB} \phi_l(b_l)    (9)

where \phi_l(b_l) is the unary potential defined over the candidate object b_l. A logistic function f(x) = 1/(1 + \exp(-1.5x)) is applied in calculating \phi_l(b_l). If this score is high for the lth candidate object, b_l is encouraged to be visible. The estimated candidate objects are expected to be a strong high-level guide for the segmentation task (e.g. if a 'bicycle' object is detected with a high detection probability, it is more likely that some pixels take the class 'bicycle').

2.3.4 Spatial pairwise potential among the candidate objects: The fifth term of the energy function models the spatial relationship between two objects as follows:

E_5 = \sum_{(l, k) \in BB} \phi_r^{lk}(b_l, b_k)    (10)

Fig. 1  Overview of our proposed method for image semantic segmentation. First, the raw input image is fed to three modules. Module 1 partitions the image
into a set of segments in two levels (SPs and SSs), module 2 extracts the candidate objects of the image. The third module computes a probability vector for the
image containing the classifier scores for each scene class. Next, the graph is established based on the outputs of these modules. Thereafter, the outputs of
these modules and the constructed graph are used to compute the terms of the energy function. Finally, the training and testing procedures are performed
using the produced models. In this figure SupPix, SupSeg, and obj stand for super-pixel, super-segment, and object, respectively. Also, E_1 to E_6 are the
terms of the energy function in (2)

where BB is the set of all detected objects in the image, and \phi_r^{lk} is defined between two candidate objects which have a semantic relationship of type r. This is an effective measure since, e.g., if it is known that the first object is a table located 'below' the second object, the second object is more likely to be a dish than a boot. The following spatial relationships, commonly applied in related studies [8, 28], are of concern here:

R = {next to, above, below, overlap}    (11)

To determine the geometric relationship between two candidate objects, the bounding box of one candidate object is first set as the reference bounding box; next, the spatial relationship of the second object relative to the reference object is identified as follows. First, additional bounding boxes similar to the reference box are drawn for each of the mentioned spatial relationships (above: a box drawn on top of the reference box; next-to: boxes drawn on the left and right sides of it; below: a box drawn below it). Next, it is checked which of the drawn bounding boxes has an overlap of more than 50% with the bounding box of the second object. If a drawn box meets this condition, the model assigns the corresponding relationship to the given pair of boxes; if none of the above relationships is met and the two original boxes have an overlap of more than 50%, then there exists an 'overlap' relationship between them. At the end, a score is computed for the assigned relationship. For example, if the relationship 'above' is of concern, its score is calculated as follows:

score_{above} = \frac{count_{above}}{\sum_r count_r}    (12)

where count_r is the frequency of occurrence of relation r among the objects of the training data. The scores of the other relationships mentioned in (11) are computed similarly to (12).
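The following sketch is one plausible reading of this procedure (ours, not the paper's implementation; boxes are assumed to be (x1, y1, x2, y2) tuples in image coordinates with y growing downwards, and the size of the drawn probe boxes is our assumption):

def overlap_frac(a, b):
    # fraction of box b covered by box a; boxes are (x1, y1, x2, y2)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return ix * iy / area_b if area_b > 0 else 0.0

def spatial_relation(ref, other):
    # Probe boxes are drawn around the reference box for each relation of (11)
    # and tested for >50% overlap with the second object's box (Section 2.3.4).
    w, h = ref[2] - ref[0], ref[3] - ref[1]
    probes = [
        ("above",   (ref[0], ref[1] - h, ref[2], ref[1])),
        ("below",   (ref[0], ref[3], ref[2], ref[3] + h)),
        ("next to", (ref[0] - w, ref[1], ref[0], ref[3])),
        ("next to", (ref[2], ref[1], ref[2] + w, ref[3])),
    ]
    for rel, probe in probes:
        if overlap_frac(probe, other) > 0.5:
            return rel
    if overlap_frac(ref, other) > 0.5:
        return "overlap"
    return None

def relation_score(rel, counts):
    # eq. (12): empirical frequency of the relation in the training data
    return counts[rel] / float(sum(counts.values()))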
2.3.5 Pair-wise potential between segmentation and detection: When several visual processes are run in a combined manner to solve the problem, the objective is to improve the performance and resolve the ambiguities of each visual task by applying information from the other tasks. In this paper, the relation between the detection and segmentation tasks, forming the sixth term of (2), is defined in a more effective manner as follows:

E_6 = \sum_{(i, l)} \phi_{il}(x_i, b_l^{cls})    (13)

where \phi_{il} is the pair-wise potential between the corresponding segmentation and detection. Inspired by Yao et al. [17], the new potential \phi_{il} is defined as follows:

\phi_{il}(x_i, b_l^{cls}) = \begin{cases} \frac{1}{|A_i|} \sum_{p \in A_i} \mu(p, comp_l) \cdot \phi_l(b_l), & \text{if } x_i = cls \text{ and } b_l = 1 \\ \frac{1}{|A_i|} \sum_{p \in A_i} \mu(p, comp_l) \cdot \frac{\phi_l(b_l)}{\gamma}, & \text{if } x_i = cls \text{ and } b_l \neq 1 \\ 0, & \text{otherwise} \end{cases}    (14)

where \gamma \geq 0 encourages b_l to be detected, and A_i is the set of pixels in the ith SP. In this potential, the pose and shape of the mixture component comp_l that detected object b_l are applied: the mean mask \mu(p, comp_l) of component comp_l is calculated for all pixels p which lie inside the detected bounding box.

This new potential encourages the labelling of the support region to correspond to the candidate object and to be consistent with the average object shape. Moreover, this term encourages the candidate object to be visible (take label 1) if a sufficient number of SPs take the class of the detector.
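A minimal sketch of (14) follows (ours; mean_mask is assumed to be indexable by pixel and to hold the mean-mask values mu(p, comp_l)):

import numpy as np

def shape_prior_potential(sp_pixels, mean_mask, det_score, x_i, cls, b_l, gamma):
    # phi_il(x_i, b_l^cls) of (14): sp_pixels is A_i, det_score is phi_l(b_l),
    # and mean_mask[p] is the mean mask mu(p, comp_l) of the mixture component.
    if x_i != cls:
        return 0.0
    avg = np.mean([mean_mask[p] for p in sp_pixels])  # (1/|A_i|) sum_p mu(p, comp_l)
    return avg * det_score if b_l == 1 else avg * det_score / gamma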
3 Experiments

3.1 Experimental setting

The MSRC-21 dataset [29], a popular benchmark for scene labelling, is applied here for evaluation (Table 1). MSRC-21 consists of classes like water and sky, which do not have a specific shape, as well as more shape-defined classes such as car and bird; these classes are commonly referred to as 'stuff' and 'things', respectively. The standard train/test split [29] is applied in the experiments.

In the newly proposed algorithm, three basic modules are applied in calculating the energy terms of (2). In the first module, the algorithm provided by Arbelaez et al. [26] is applied to construct a hierarchical region tree. In this algorithm, the initial regions are first constructed from an oriented contour signal; next, an agglomerative clustering procedure forms these regions into a hierarchical region tree. The advantage of this algorithm is that the generated regions respect the given boundaries. The threshold parameter, which determines the number and size of the regions, is set at 0.08 and 0.16 for the two layers. The second module is an object detector, where the well-known deformable part-based model of [24] is employed to detect the candidate image objects; for each detected object, the method of [24] specifies the object class, location, score, and index of the root mixture component. For this module, a few detectors with different components must be trained and the shape masks comp computed for each class on the MSRC dataset; we used the trained models and the computed shape masks provided by Yao et al. [17]. In the third module, to generate the global image classification information, we apply a standard bag-of-words spatial pyramid and train a linear one-vs-all SVM classifier. The features used in this module consist of SIFT features, colour SIFT, RGB histograms, and colour moment invariants. The output for each image is a probability vector containing the scores for each scene type.

In the training phase, the learning method provided by Hazan and Urtasun [25] is applied. They present an intuitive approximation for structured prediction problems using Fenchel duality based on a local entropy approximation and define a message-passing algorithm for it. In the experiments, a CRF with l2 regularisation is applied. The inference is run by a message-passing algorithm [30] which splits the graph-based optimisation program into several local optimisation problems by introducing additional Lagrange multipliers; this algorithm is guaranteed to converge [30]. The basic parameter ε is set to 1 in this method. In the experiments, scene types similar to those of [17] are applied (Table 2). Although a scene type can differ from a pixel label, there can exist a similarity between them. As observed in Table 2, there exists a one-to-one mapping between 18 scene types and 18 pixel labels (white columns). The scene types with no equivalent pixel label (grey columns) are mapped to the most similar unaligned pixel labels; therefore, the scene types 'Person', 'City', and 'Nature' are associated with the pixel labels 'Body', 'Road', and 'Sky', respectively. The parameter γ is chosen empirically, beginning with a relatively small value and increasing with a constant step size at each iteration, ending at γ = 10. The training and testing processes are run on an Intel i7-4702MQ 2.20 GHz processor.

The labelling models are assessed according to the standard measures of average-based class accuracy (ACA) and global-based pixel accuracy (GPA) [14], presented as follows, respectively:
Table 1 Standard train/test split of MSRC-21 dataset [29]
No. of classes    No. of training images    No. of testing images
21                335                       256

ACA = \frac{1}{c} \sum_m \frac{N_{m,m}}{\sum_n N_{m,n}}    (15)

GPA = \frac{\sum_m N_{m,m}}{\sum_{m,n} N_{m,n}}    (16)

where N_{m,n} is the number of pixels of ground-truth category m labelled as class n, and c is the total number of classes.
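Both measures follow directly from the confusion matrix N; a minimal sketch (ours, in Python, not part of the released MATLAB code):

import numpy as np

def aca_gpa(N):
    # N[m, n]: number of pixels of ground-truth class m labelled as class n.
    per_class = np.diag(N) / N.sum(axis=1)  # per-class recall, inner term of (15)
    aca = per_class.mean()                  # eq. (15): average over the c classes
    gpa = np.diag(N).sum() / N.sum()        # eq. (16): fraction of correct pixels
    return aca, gpa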
The advantages of this approach are revealed when it is compared with some state-of-the-art models as baselines. We ran some of the baselines using the published codes and the default parameter settings given by the authors, while the experimental results of the others were borrowed directly for comparison. The baselines are of three categories: (i) modelling scene labelling through CRFs [17, 31], (ii) modelling scene labelling through CNN-based approaches, like fully convolutional networks [19] and hierarchical deep learning [32], and (iii) modelling based on a fusion model, including a fusion model in which a flexible segmentation graph is applied [33].

Our framework has been implemented in MATLAB, and the source code for this work is available online at https://github.com/neda-noor-m/Seg_hgm.

3.2 Results and analysis

3.2.1 Quantitative results: The performance of this model on the MSRC-21 dataset and its comparison with the considered baselines are presented in Table 3. The results clearly reveal that this method is comparable with the state-of-the-art methods. This approach outperforms its counterparts in terms of the ACA measure and the training time. The training time is an important factor for the application of the algorithm to larger datasets. In this context, the reduction in training and testing times reflects a decrease in the complexity of the newly proposed model (e.g. here 33 components are used, compared to the 52 components applied in [17]). The proposed algorithm is comparable with others of its kind in terms of the GPA and testing-time criteria.

More detailed quantitative results on the MSRC dataset are tabulated in Table 4, where x, y, g, o, c, xy, and xo represent the SP unary potential, SS unary potential, global feature, object detection unary potential, spatial pairwise potential among the candidate objects, pair-wise potential between SP and SS, and pair-wise potential between segmentation and detection, respectively. For example, in the third row, 'x + g' indicates the combination of the SP information with the global information. The contribution of each potential to the performance over the entire MSRC dataset is shown in Table 4; these contributions are also shown for a few images in Fig. 2.

3.2.2 Analysis of the energy terms: In the following, we discuss the impact of the energy terms, which directly affect the performance.

Unary potential for each SP: Calculating the unary potentials of the SPs using local information is usual. Also, many researchers apply the global information to the model by adding a separate layer to the graph. Here, we combine the global information and the local potential of each SP to calculate the unary potentials: the global information involves long-range dependencies in the image, and the local potential of each SP includes the local structure information. As observed in Table 4, the integration of these two kinds of information improves the results, since it is an efficient and easy solution for capturing long-range dependencies; by adopting this new potential in the model, the hidden information of the image can easily be considered. A comparison between the second and third rows indicates that applying the global features in the SP potential improves the performance on the 'thing' classes; the comparison between the fourth and fifth rows confirms this observation.

Pair-wise potential between segmentation and detection: This new potential encourages the labelling of the support region corresponding to the candidate object to be consistent with the average object shape. Moreover, this term encourages the candidate object to be visible if a sufficient number of SPs take the class of the detector. In fact, by incorporating this potential, we aim to exchange information between the segmentation and detection tasks.

Table 2 Mapping between the used scene types and pixel labels
Pixel label: building, grass, tree, cow, sheep, sky, aeroplane, water, face, car, bicycle, flowers, sign, bird, book, chair, road, cat, dog, body, boat
Scene type: building, grass, tree, cow, sheep, nature, aeroplane, water, face, car, bicycle, flowers, sign, bird, book, chair, city, cat, dog, person, boat
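In code, this mapping reduces to a small lookup table (an illustrative sketch; the variable names are ours, and it matches the usage of scene_of_cls in the sketch after (6)):

MSRC_LABELS = ["building", "grass", "tree", "cow", "sheep", "sky", "aeroplane",
               "water", "face", "car", "bicycle", "flowers", "sign", "bird",
               "book", "chair", "road", "cat", "dog", "body", "boat"]
# scene type associated with each pixel label: identity for the 18 aligned
# pairs, plus the three unaligned scene types mapped per Table 2
scene_of_cls = {lbl: lbl for lbl in MSRC_LABELS}
scene_of_cls.update({"sky": "nature", "road": "city", "body": "person"})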

Table 3 Performance on MSRC-21 in terms of recognition accuracy and efficiency


Methods ACA, % GPA, % Training, h Testing, s
this method 80.01 86 0.5 4.68
[33] 79.4 87.3 6.8 0.68
[19] 77.9 91.4 N/A 0.2
[34] N/A 81.7 N/A N/A
[31] 78.2 86.8 5.4 16.8
[32] 74.6 80.4 5.8 0.7
[17] 79.3 86.2 2.5 14.06
The bold values indicate which method performs the best in the given category.
The average training times are for the whole training set (per hour), and the average testing times are per image (per second).

Table 4 Quantitative results on the MSRC dataset


Model Building Grass Tree Cow Sheep Sky Aeroplane Water Face
x 72 97 90 78 85 96 84 84 83
x + y + xy 70 96 90 76 81 93 86 87 83
x + g + y + xy 68 64 90 77 83 90 88 85 86
x + y + o + xy + xo 69 97 90 83 87 96 88 84 90
x + g + y + o + xy + xo 66 96 89 84 87 96 90 82 90
full model 66 96 89 85 87 96 90 82 90
[17] 71 98 90 79 86 93 88 86 90
[33] 88 96 93 87 84 92 73 69 93
[31] 81 96 89 74 84 99 84 92 90
[19] — — — — — — — — —
[32] — — — — — — — — —


Model Car Bicycle Flower Sign Bird Book Chair Road Cat Dog Body Boat ACA
x 81 91 97 69 49 95 59 90 81 53 65 00 76.1
x + y + xy 83 92 97 71 50 95 59 85 83 53 65 00 76.14
x + g + y + xy 83 93 97 72 52 95 63 85 81 53 63 00 76.29
x + y + o + xy + xo 85 94 97 76 51 96 77 89 84 52 70 08 79.25
x + g + y + o + xy + xo 85 95 97 83 42 96 81 84 83 55 74 12 79.57
full model 85 95 97 87 43 97 81 88 84 55 74 12 80.01
[17] 84 94 98 78 54 97 71 89 83 54 68 17 79.18
[33] 80 78 88 92 85 65 84 81 65 67 65 43 79.4
[31] 86 92 98 91 35 95 53 90 62 77 70 12 78.2
[19] — — — — — — — — — — — — 77.9
[32] — — — — — — — — — — — — 77.6
The table shows pixel recall in per cent for different object classes. Full model in the seventh row points to a model with the contribution of all potentials (x + g + y + xy + o + xo + c).
The bold values indicate which method performs the best in the given category.

Fig. 2  Segmentation results on the MSRC-21 dataset. The rightmost panel indicates the mapping between colours and labels
(a) Image, (b) Ground-truth, (c) x + y, (d) Our model

The proposed shape prior presented in (14) is better than the one proposed in [17], because here the object detection probability is applied in the equation. In the experiments, it is observed that applying this proposed shape prior increases the ACA measure from 78.69 to 79.25.

Spatial pairwise potential among the candidate objects: The geometric relationships (next to, above, below, and overlap) between candidate objects are considered in this energy term. Thus, candidate objects whose relationship is unusual are penalised; e.g. the relationship 'above' between 'cow' and 'water' is strange. We apply the method of Section 2.3.4 to extract the geometric relationships between objects and then calculate the score of each relation using (12); the details of how these relations are learned and how the scores are calculated were explained comprehensively in Section 2.3.4. In the MSRC dataset, the most common geometric relationship is 'next to', which constitutes 40% of all the extracted relations, while 'above', 'below', and 'overlap' constitute 20, 10, and 30% of the relations, respectively. Adding this potential to the model (sixth row of Table 4) improves the result compared with the model of the fifth row.

3.2.3 Qualitative results: Some visual examples of our labelling outputs on the MSRC-21 dataset are shown in Fig. 2, and some examples of failure cases are shown in Fig. 3. As observed in Table 4, most of the failure cases occur in the classes 'bird' and 'boat'. One of the main causes of error is the equal contribution of the local and global information in calculating the unary potential of each SP. Assigning unequal weights, learned in the training step, to the local and global contributions could be effective in decreasing this error (Fig. 4).

4 Conclusion

The objective of this study is to improve the segmentation precision of images. To accomplish this, a hierarchical CRF model in which segmentation information, detection information, and the scene type are applied is proposed. The unary potentials of the SPs are improved by combining local and global information extracted from the image, and the spatial relationships between objects are applied to improve the shape prior of the model. The proposed method is evaluated on the MSRC dataset and reveals similar or better

Fig. 3  Examples of failure cases, in which the bird and the boat are misclassified as a dog and a building, respectively
(a) Image, (b) Ground-truth, (c) x + y, (d) Our model

Fig. 4  Quantitative results on several sample images in MSRC

pixel-wise labelling accuracy than the competing models in which CRFs or hierarchical models are applied. As future work on this method, learning more suitable weights in the training step for the contributions of the local and global information can be considered, which may further improve the results.

5 References

[1] Wang, C., Komodakis, N., Paragios, N.: 'Markov random field modeling, inference & learning in computer vision & image understanding: a survey', Comput. Vis. Image Underst., 2013, 117, (11), pp. 1610-1627
[2] Sun, J., Zheng, N.-N., Shum, H.-Y.: 'Stereo matching using belief propagation', IEEE Trans. Pattern Anal. Mach. Intell., 2003, 25, (7), pp. 787-800
[3] Fulkerson, B., Vedaldi, A., Soatto, S., et al.: 'Class segmentation and object localization with superpixel neighborhoods'. Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, Kyoto, Japan, September 2009, pp. 670-677
[4] Mottaghi, R., Fidler, S., Yuille, A., et al.: 'Human-machine CRFs for identifying bottlenecks in holistic scene understanding', arXiv preprint, 2014
[5] Brox, T., Bourdev, L., Maji, S., et al.: 'Object segmentation by alignment of poselet activations to image contours'. Proc. CVPR 2011, Colorado Springs, USA, June 2011, pp. 2225-2232
[6] Gu, C., Lim, J.J., Arbeláez, P., et al.: 'Recognition using regions'. Proc. 2009 IEEE Conf. on Computer Vision and Pattern Recognition, Miami, USA, June 2009, pp. 1030-1037
[7] Dai, J., He, K., Sun, J.: 'Instance-aware semantic segmentation via multi-task network cascades'. Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, USA, June 2016, pp. 3150-3158
[8] Sun, M., Kim, B., Kohli, P., et al.: 'Relating things and stuff via object-property interactions', IEEE Trans. Pattern Anal. Mach. Intell., 2014, 36, (7), pp. 1370-1383
[9] Eigen, D., Fergus, R.: 'Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture'. Proc. 2015 IEEE Int. Conf. on Computer Vision, Santiago, Chile, December 2015, pp. 2650-2658
[10] Lafferty, J., McCallum, A., Pereira, F.: 'Conditional random fields: probabilistic models for segmenting and labeling sequence data'. Proc. 18th Int. Conf. on Machine Learning, Williamstown, USA, June 2001, pp. 282-289
[11] Koltun, V.: 'Efficient inference in fully connected CRFs with Gaussian edge potentials', Adv. Neural Inf. Process. Syst., 2011, 24, pp. 109-117
[12] Torralba, A., Murphy, K.P., Freeman, W.T.: 'Contextual models for object detection using boosted random fields', Adv. Neural Inf. Process. Syst., 2004, 17, (1), pp. 1401-1408
[13] Winn, J., Shotton, J.: 'The layout consistent random field for recognizing and segmenting partially occluded objects'. Proc. 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, New York, USA, June 2006, pp. 37-44
[14] Ladicky, L., Russell, C., Kohli, P., et al.: 'Graph cut based inference with co-occurrence statistics'. Proc. 11th European Conf. on Computer Vision, Heraklion, Crete, September 2010, pp. 239-253
[15] Russell, C., Kohli, P., Torr, P.H.S., et al.: 'Associative hierarchical CRFs for object class image segmentation'. Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, Kyoto, Japan, September 2009, pp. 739-746
[16] Plath, N., Toussaint, M., Nakajima, S.: 'Multi-class image segmentation using conditional random fields and global classification'. Proc. 26th Annual Int. Conf. on Machine Learning, Montreal, Canada, June 2009, pp. 817-824
[17] Yao, J., Fidler, S., Urtasun, R.: 'Describing the scene as a whole: joint object detection, scene classification and semantic segmentation'. Proc. 2012 IEEE Conf. on Computer Vision and Pattern Recognition, Providence, USA, June 2012, pp. 702-709
[18] Shuai, B., Wang, G., Zuo, Z., et al.: 'Integrating parametric and non-parametric models for scene labeling'. Proc. 2015 IEEE Conf. on Computer Vision and Pattern Recognition, Boston, USA, June 2015, pp. 4249-4258
[19] Sharma, A., Tuzel, O., Jacobs, D.W.: 'Deep hierarchical parsing for semantic segmentation'. Proc. 2015 IEEE Conf. on Computer Vision and Pattern Recognition, Boston, USA, June 2015, pp. 530-538
[20] Chen, L.-C., Papandreou, G., Kokkinos, I., et al.: 'Semantic image segmentation with deep convolutional nets and fully connected CRFs', arXiv preprint, 2014

[21] Zheng, S., Jayasumana, S., Romera-Paredes, B., et al.: 'Conditional random fields as recurrent neural networks'. Proc. 2015 IEEE Int. Conf. on Computer Vision, Santiago, Chile, December 2015, pp. 1529-1537
[22] Lucchi, A., Li, Y., Boix, X., et al.: 'Are spatial and global constraints really necessary for segmentation?'. Proc. 2011 Int. Conf. on Computer Vision, Barcelona, Spain, November 2011, pp. 9-16
[23] Wang, H., Koller, D.: 'Multi-level inference by relaxed dual decomposition for human pose segmentation'. Proc. CVPR 2011, Providence, USA, June 2011, pp. 2433-2440
[24] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., et al.: 'Object detection with discriminatively trained part-based models', IEEE Trans. Pattern Anal. Mach. Intell., 2010, 32, (9), pp. 1627-1645
[25] Hazan, T., Urtasun, R.: 'A primal-dual message-passing algorithm for approximated large scale structured prediction', Adv. Neural Inf. Process. Syst., 2010, pp. 838-846
[26] Arbelaez, P., Maire, M., Fowlkes, C., et al.: 'Contour detection and hierarchical image segmentation', IEEE Trans. Pattern Anal. Mach. Intell., 2011, 33, (5), pp. 898-916
[27] Kohli, P., Kumar, M.P., Torr, P.H.S.: 'Solving energies with higher order cliques', 2007
[28] Anand, A., Koppula, H.S., Joachims, T., et al.: 'Contextually guided semantic labeling and search for three-dimensional point clouds', Int. J. Robot. Res., 2012, 32, (1), pp. 19-34
[29] Shotton, J., Johnson, M., Cipolla, R.: 'Semantic texton forests for image categorization and segmentation'. Proc. 2008 IEEE Conf. on Computer Vision and Pattern Recognition, Anchorage, USA, June 2008, pp. 1-8
[30] Schwing, A., Hazan, T., Pollefeys, M., et al.: 'Distributed message passing for large scale graphical models'. Proc. CVPR 2011, Colorado Springs, USA, June 2011, pp. 1833-1840
[31] Ladický, L., Russell, C., Kohli, P., et al.: 'Associative hierarchical random fields', IEEE Trans. Pattern Anal. Mach. Intell., 2014, 36, (6), pp. 1056-1077
[32] Farabet, C., Couprie, C., Najman, L., et al.: 'Learning hierarchical features for scene labeling', IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35, (8), pp. 1915-1929
[33] Zhou, Q., Zheng, B., Zhu, W., et al.: 'Multi-scale context for scene labeling via flexible segmentation graph', Pattern Recognit., 2016, 59, pp. 312-324
[34] Saito, M., Okatani, T.: 'Transformation of Markov random fields for marginal distribution estimation'. Proc. 2015 IEEE Conf. on Computer Vision and Pattern Recognition, Boston, USA, June 2015, pp. 797-805
