Bolei Zhou¹, Hang Zhao¹, Xavier Puig¹, Sanja Fidler², Adela Barriuso¹, Antonio Torralba¹
¹Massachusetts Institute of Technology, USA
²University of Toronto, Canada
Abstract

1. Introduction
…inconsistencies across different annotators. In contrast, our dataset was annotated by a single expert annotator, providing extremely detailed and exhaustive image annotations. On average, our annotator labeled 29 annotation segments per image, compared to the 16 segments per image labeled by external annotators (such as workers from Amazon Mechanical Turk). Furthermore, the data consistency and quality are much higher than those of external annotators. Fig. 1 shows examples from our dataset.

The paper is organized as follows. First, we describe the ADE20K dataset, its collection process, and its statistics. We then introduce a generic network design called the Cascade Segmentation Module, which enables neural networks to segment stuff, objects, and object parts in cascade. Several semantic segmentation networks are evaluated on the scene parsing benchmark of ADE20K as baselines. The proposed Cascade Segmentation Module is shown to improve over those baselines. We further apply the scene parsing networks to the tasks of automatic scene content removal and scene synthesis.

1.1. Related work

Many datasets have been collected for the purpose of semantic understanding of scenes. We review the datasets according to the level of detail of their annotations, then briefly go through previous work on semantic segmentation networks.

Object classification/detection datasets. Most of the existing large-scale datasets typically only contain labels at the image level or provide bounding boxes. Examples include ImageNet [26], Pascal [10], and KITTI [12]. ImageNet has the largest set of classes, but contains relatively simple scenes. Pascal and KITTI are more challenging and have more objects per image; however, their classes as well as their scenes are more constrained.

Semantic segmentation datasets. Existing datasets with pixel-level labels typically provide annotations only for a subset of foreground objects (20 in PASCAL VOC [10] and 91 in Microsoft COCO [17]). Collecting dense annotations where all pixels are labeled is much more challenging. Such efforts include the SIFT Flow dataset [18], Pascal-Context [21], NYU Depth V2 [22], the SUN database [32], the SUN RGB-D dataset [28], the Cityscapes dataset [7], and OpenSurfaces [2, 3].

Datasets with objects, parts and attributes. The Core dataset [6] is one of the earliest works exploring object part annotation across categories. Recently, two datasets were released that go beyond the typical labeling setup by also providing pixel-level annotation for object parts, i.e., the Pascal-Part dataset [5], or for material classes, i.e., OpenSurfaces [2, 3]. We advance this effort by collecting very high-resolution imagery of a much wider selection of scenes, containing a large set of object classes per image. We annotated both stuff and object classes, for which we additionally annotated their parts, and parts of these parts. We believe that our dataset, ADE20K, is one of the most comprehensive datasets of its kind. We provide a comparison between datasets in Sec. 2.5.

Semantic segmentation/parsing models. Many models have been proposed for image parsing. For example, MRF frameworks are proposed to parse images at different levels [30] or to segment rare object classes [33]; detection is combined with segmentation to improve performance [31]; stuff classes are leveraged to localize objects [13]. With the success of convolutional neural networks (CNNs) for image classification [16], there is growing interest in semantic image parsing using CNNs with dense output, such as the multiscale CNN [11], recurrent CNN [25], fully convolutional network [19], deconvolutional neural networks [24], the encoder-decoder SegNet [1], multi-task network cascades [9], and DilatedNet [4, 34]. They are benchmarked on the Pascal dataset with impressive performance on segmenting the 20 object classes. Some of them [19, 1] are evaluated on Pascal-Context [21] or the SUN RGB-D dataset [28] to show their capability to segment more object classes in scenes. Joint stuff and object segmentation is explored in [8], which uses pre-computed superpixels and feature masking to represent stuff. A cascade of instance segmentation and categorization has been explored in [9]. In this paper we introduce a generic network module to segment stuff, objects, and object parts jointly in a cascade, which can be integrated into existing networks.
2. ADE20K: Fully Annotated Image Dataset

In this section, we describe our ADE20K dataset and analyze it through a variety of informative statistics.

2.1. Dataset summary

There are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. All the images are exhaustively annotated with objects. Many objects are also annotated with their parts. For each object there is additional information about whether it is occluded or cropped, along with other attributes. The images in the validation set are exhaustively annotated with parts, while the part annotations are not exhaustive over the images in the training set. The annotations in the dataset are still growing. Sample images and annotations from the ADE20K dataset are shown in Fig. 1.
2.2. Image annotation

For our dataset, we are interested in having a diverse set of scenes with dense annotations of all the objects present. Images come from the LabelMe [27], SUN [32], and Places [35] datasets and were selected to cover the 900 scene categories.
Figure 2. The annotation interface, showing the list of the objects and their associated parts in the image.
Figure 4. a) Object classes sorted by frequency. Only the top 270 classes with more than 100 annotated instances are shown. 68 classes have more than 1000 segmented instances. b) Frequency of parts grouped by objects. There are more than 200 object classes with annotated parts. Only objects with 5 or more parts are shown in this plot (we show at most 7 parts for each object class). c) Objects ranked by the number of scenes they are part of. d) Object parts ranked by the number of objects they are part of. e) Examples of objects with doors. The bottom-right image is an example where the door does not behave as a part.

Figure 5. a) Mode of the object segmentations contains 'sky', 'wall', 'building' and 'floor'. b) Histogram of the number of segmented object instances and classes per image (on average, 19.6 instances and 9.9 object classes per image). c) Histogram of the number of segmented part instances and classes per object (on average, 3.9 part instances and 2.3 part classes per object). d) Number of classes as a function of segmented instances (objects and parts). The squares represent the current state of the dataset. e) Probability of seeing a new object (or part) class as a function of the number of instances.
Figure 6. The framework of the Cascade Segmentation Module for scene parsing. The stuff stream generates the stuff segmentation and the objectness map from the shared feature activation. The object stream then generates the object segmentation by integrating the objectness map from the stuff stream. Finally, the full scene segmentation is generated by merging the object segmentation and the stuff segmentation. Similarly, the part stream takes the object score map from the object stream to further generate the object-part segmentation. Since not all objects have part annotations, the part stream is optional. Feature sizes shown in the diagram (384x384x3 input, 48x48x4096 stream features, 96x96x256) are based on the Cascade-DilatedNet; the Cascade-SegNet has a different but similar structure.
3. Cascade Segmentation Module

…stuff classes. On the other hand, there are spatial layout relations among stuff and objects, and between objects and object parts, which are ignored by the design of previous semantic segmentation networks. For example, a drawing on a wall is a part of the wall (the drawing occludes the wall), and the wheels on a car are also parts of the car.

To handle the long-tail distribution of objects in scenes and the spatial layout relations of scenes, objects, and object parts, we propose a network design called the Cascade Segmentation Module. This module is a generic network design which can potentially be integrated into any previous semantic segmentation network. We first categorize the semantic classes of the scenes into three macro classes: stuff (sky, road, building, etc.), foreground objects (car, tree, sofa, etc.), and object parts (car wheels and doors, person head and torso, etc.). Note that some object classes, such as 'building' or 'door', could belong to either of two macro classes; here we assign each object class to its most likely macro class.
In the network for scene parsing, different streams of high-level layers are used to represent the different macro classes and to recognize their assigned classes. The segmentation results from each stream are then fused to generate the final segmentation. The proposed module is illustrated in Fig. 6.

More specifically, the stuff stream is trained to classify all the stuff classes plus one foreground object class (which includes all the non-stuff classes). After training, the stuff stream generates the stuff segmentation and a dense objectness map indicating the probability that a pixel belongs to the foreground object class. The object stream is trained to classify the discrete objects; all the non-discrete objects are ignored in its training loss. After training, the object stream further segments each discrete object on the objectness map predicted by the stuff stream. The result is merged with the stuff segmentation to generate the scene segmentation. For those discrete objects annotated with parts, the part stream can be jointly trained to segment object parts. Thus the part stream further segments parts on each object score map predicted by the object stream.
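The paper describes the fusion only at this level of detail. As an illustration, the following sketch assumes a hard argmax merge, where pixels that the stuff stream assigns to the foreground class are relabeled by the object stream; it is a minimal PyTorch-style sketch under that assumption, not the authors' Caffe implementation.

import torch

def fuse_scene_segmentation(stuff_logits, object_logits):
    """Merge the two stream outputs into one scene parse.

    stuff_logits:  (N, S+1, H, W), S stuff classes plus one foreground
                   class at channel index S.
    object_logits: (N, O, H, W), O discrete object classes.
    Returns labels in a combined space: 0..S-1 for stuff, S..S+O-1 for objects.
    """
    S = stuff_logits.shape[1] - 1
    stuff_pred = stuff_logits.argmax(dim=1)        # (N, H, W)
    object_pred = object_logits.argmax(dim=1) + S  # offset into the combined label space
    foreground = stuff_pred == S                   # pixels handed off to the object stream
    return torch.where(foreground, object_pred, stuff_pred)

# The dense objectness map mentioned above would then be the softmax
# probability of the foreground channel: stuff_logits.softmax(dim=1)[:, S].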
The network with the two streams (stuff+objects) or three streams (stuff+objects+parts) can be trained end-to-end. The streams share the weights of the lower layers, and each stream has a training loss at the end. For the stuff stream we use the per-pixel cross-entropy loss, where the output classes are all the stuff classes plus the foreground class (all the discrete object classes are included in it). We likewise use the per-pixel cross-entropy loss for the object stream, where the output classes are all the discrete object classes. The objectness map is given as a ground-truth binary mask that indicates whether a pixel belongs to any of the stuff classes or to the foreground object class. This mask is used to exclude the penalty for pixels belonging to any of the stuff classes from the training loss of the object stream. Similarly, we use the cross-entropy loss for the part stream. All part classes are trained together, while non-part pixels are ignored in training. In testing, parts are segmented on their associated object score maps given by the object stream. The training losses for the two-stream and three-stream networks are $L = L_{stuff} + L_{object}$ and $L = L_{stuff} + L_{object} + L_{part}$, respectively.
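In a modern framework the masked per-pixel losses above reduce to cross-entropy with an ignore label; a minimal PyTorch-style sketch is shown below. The paper's actual implementation is in Caffe, and the ignore-label convention used here is an assumption.

import torch.nn.functional as F

def cascade_loss(stuff_logits, object_logits, stuff_target, object_target,
                 part_logits=None, part_target=None, ignore_label=-100):
    """L = L_stuff + L_object (+ L_part), as defined above.

    stuff_target:  per-pixel labels over the stuff classes plus one foreground class.
    object_target: per-pixel discrete-object labels; stuff pixels carry
                   ignore_label so they contribute no penalty (the objectness mask).
    part_target:   per-pixel part labels; non-part pixels carry ignore_label.
    """
    loss = F.cross_entropy(stuff_logits, stuff_target)
    loss = loss + F.cross_entropy(object_logits, object_target,
                                  ignore_index=ignore_label)
    if part_logits is not None:  # the part stream is optional
        loss = loss + F.cross_entropy(part_logits, part_target,
                                      ignore_index=ignore_label)
    return loss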
The configuration of each layer is based on the baseline network being used. We integrate the proposed module into two baseline networks, SegNet [1] and DilatedNet [4, 34]. In the following experiments, we show that the proposed module brings clear improvements for scene parsing.

4. Experiments
To train the networks for scene parsing, we select the top 150 objects ranked by their total pixel ratios from the ADE20K dataset and build a scene parsing benchmark of ADE20K, termed MIT SceneParse150². As the original images in the ADE20K dataset have various sizes, for simplicity we rescale the large images so that their minimum height or width is 512 pixels in the benchmark. Among the 150 objects, there are 35 stuff classes (e.g., wall, sky, road) and 115 discrete objects (e.g., car, person, table). The annotated pixels of the 150 objects occupy 92.75% of all the pixels in the dataset, where the stuff classes occupy 60.92% and the discrete objects occupy 31.83%.

² http://sceneparsing.csail.mit.edu/

4.1. Scene parsing
As baselines for scene parsing on the SceneParse150 benchmark, we train three semantic segmentation networks: SegNet [1], FCN-8s [19], and DilatedNet [4, 34]. SegNet has an encoder-decoder architecture for image segmentation; FCN upsamples the activations of multiple layers in the CNN for pixelwise segmentation; DilatedNet drops pool4 and pool5 from the fully convolutional VGG-16 network and replaces the following convolutions with dilated convolutions (also called atrous convolutions), with a bilinear upsampling layer added at the end.
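For reference, a dilated convolution spaces out the kernel taps to enlarge the receptive field without pooling. The toy example below (not from the paper) shows that with kernel size 3, dilation 2, and padding 2, the spatial resolution is preserved.

import torch

# Dilated (atrous) 3x3 convolution: effective receptive field of 5x5,
# spatial size preserved thanks to padding 2.
conv = torch.nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)
x = torch.randn(1, 512, 48, 48)
print(conv(x).shape)  # torch.Size([1, 512, 48, 48])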
We integrate the proposed cascade segmentation module into the two baseline networks SegNet and DilatedNet. We did not integrate it with FCN because the original FCN requires a large amount of GPU memory and has skip connections across layers. For the Cascade-SegNet, the two streams share a single encoder, from conv1_1 to conv5_3, while each stream has its own decoder, from deconv5_3 to the loss. For the Cascade-DilatedNet, the two streams split after pool3 and keep the spatial dimensions of their feature maps afterwards. For a fair comparison and for benchmarking purposes, the cascade networks only have the stuff stream and the object stream. We train these network models using the Caffe library [15] on NVIDIA Titan X GPUs. Stochastic gradient descent with a 0.001 learning rate and 0.9 momentum is used as the optimizer, and we drop the learning rate every 10k iterations.
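The equivalent optimization setup is easy to state in code. The sketch below is only a stand-in: the paper trains with Caffe, the model here is a dummy module, and the decay factor of the learning-rate drop is an assumption, since only the 10k-iteration schedule is stated.

import torch

model = torch.nn.Conv2d(3, 150, kernel_size=1)  # stand-in for a cascade network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Drop the learning rate every 10k iterations; gamma=0.1 is assumed.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.1)

for iteration in range(30_000):
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 64, 64)).mean()  # dummy batch and loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # learning rate drops at iterations 10k and 20k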
Results are reported with four metrics commonly used for semantic segmentation [19]: Pixel accuracy indicates the proportion of correctly classified pixels. Mean accuracy indicates the proportion of correctly classified pixels averaged over all the classes. Mean IoU indicates the intersection-over-union between the predicted and ground-truth pixels, averaged over all the classes. Weighted IoU indicates the IoU weighted by the total pixel ratio of each class.
Since some classes like 'wall' and 'floor' occupy far more pixels of the images, pixel accuracy is biased toward those few large classes. Instead, mean IoU reflects how accurately the model classifies each discrete class in the benchmark. The scene parsing data and the development toolbox will be made available to the public. We take the average of the pixel accuracy and the mean IoU as the evaluation criterion in the challenge.

Table 2. Performance on the validation set of SceneParse150.

Networks            Pixel Acc.  Mean Acc.  Mean IoU  Weighted IoU
FCN-8s              71.32%      40.32%     0.2939    0.5733
SegNet              71.00%      31.14%     0.2164    0.5384
DilatedNet          73.55%      44.59%     0.3231    0.6014
Cascade-SegNet      71.83%      37.90%     0.2751    0.5805
Cascade-DilatedNet  74.52%      45.38%     0.3490    0.6108

Table 3. Performance of stuff and discrete object segmentation.

                    35 stuff classes       115 discrete objects
Networks            Mean Acc.  Mean IoU    Mean Acc.  Mean IoU
FCN-8s              46.74%     0.3344      38.36%     0.2816
SegNet              43.17%     0.3051      27.48%     0.1894
DilatedNet          49.03%     0.3729      43.24%     0.3080
Cascade-SegNet      40.46%     0.3245      37.12%     0.2600
Cascade-DilatedNet  49.80%     0.3779      44.04%     0.3401

The segmentation results of the baselines and the cascade networks are listed in Table 2. Among the baselines, DilatedNet achieves the best performance on SceneParse150. The cascade networks, Cascade-SegNet and Cascade-DilatedNet, both outperform their original baselines. In terms of mean IoU, the improvement brought by the proposed cascade segmentation module is 6% for SegNet and 2.5% for DilatedNet. We further decompose the performance of the networks over the 35 stuff classes and the 115 discrete object classes in Table 3. We observe that the two cascade networks perform much better than the baselines on the 115 discrete objects. This validates that the cascade module helps improve scene parsing for the discrete objects, which have less training data but more visual complexity than the stuff classes.

Segmentation examples from the validation set are shown in Fig. 7. Compared to the baseline SegNet and DilatedNet, the segmentation results from the Cascade-SegNet and Cascade-DilatedNet are more detailed. Furthermore, the objectness maps from the stuff stream highlight the possible discrete objects in the scenes.

4.2. Part segmentation

For part segmentation, we select the eight object classes most frequently annotated with parts: 'person', 'building', 'car', 'chair', 'table', 'sofa', 'bed', and 'lamp'. After filtering out the part classes of these objects with fewer than 300 instances, 36 part classes remain for training and testing. We train the part stream on the Cascade-DilatedNet used in the scene parsing.

The results of the joint segmentation of stuff, objects, and object parts are shown in Fig. 8. In a single forward pass, the network with the proposed cascade module is able to parse scenes at different levels. We use accuracy instead of IoU as the metric for the part segmentation results, as the parts of object instances in the dataset are not fully annotated. The accuracy over all the parts of the eight objects is plotted in Fig. 8(a); the average accuracy is 55.47%.
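Since parts are scored only where they are annotated, a per-part accuracy can be computed as below. This is a hedged sketch of the evaluation described above, where the convention of marking unannotated pixels with -1 is an assumption.

import numpy as np

def part_accuracy(pred_parts, gt_parts):
    """Pixel accuracy restricted to pixels carrying a part annotation.

    gt_parts uses -1 for pixels without a part label (assumed convention),
    so unannotated regions do not penalize the prediction."""
    valid = gt_parts >= 0
    return (pred_parts[valid] == gt_parts[valid]).mean()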
4.3. Further applications

We show two applications of the scene parsing below:
Figure 7. Ground truths, segmentation results given by the networks, and objectness maps given by the Cascade-DilatedNet. Columns: test image, ground truth, FCN-8s, SegNet, DilatedNet, Cascade-DilatedNet, objectness map.
Figure 8. Joint stuff, object, and object part segmentation (test image, ground truth, and segmentation result), with annotated parts shown for person, lamp, table, chair, sofa, car, building, and bed.