Learning Video Object Segmentation from Static Images

Anna Khoreva³*   Federico Perazzi¹,²*   Rodrigo Benenson³   Bernt Schiele³   Alexander Sorkine-Hornung¹
¹ Disney Research   ² ETH Zurich   ³ Max Planck Institute for Informatics, Saarbrücken, Germany

arXiv:1612.02646v1 [cs.CV] 8 Dec 2016
heterogeneous video segmentation benchmarks, despite using the same model and parameters across all videos. We provide a detailed ablation study and explore the impact of varying the number and type of annotations, and moreover discuss extensions of the proposed model that improve the quality even further.

2. Related work

The idea of performing video object segmentation via tracking at the pixel level is at least a decade old [36]. Recent approaches interweave box tracking with box-driven segmentation (e.g. TRS [48]), or propagate the first frame segmentation via graph labeling approaches.

Local propagation  JOTS [47] builds a graph over neighbouring frames connecting superpixels and (generic) object parts to solve the video labeling task. ObjFlow [44] builds a graph over pixels and superpixels, uses convnet-based appearance terms, and interleaves labeling with optical flow estimation. Instead of using superpixels or proposals, BVS [26] formulates a fully-connected pixel-level graph between frames and efficiently infers the labeling over the vertices of a spatio-temporal bilateral grid [7]. Because these methods propagate information only across neighbouring frames, they have difficulties capturing long-range relationships and ensuring globally consistent segmentations.

Global propagation  In order to overcome these limitations, some methods have proposed to use long-range connections between video frames [14, 23, 50]. In particular, we compare to FCP [31], Z15 [51] and W16 [45], which build a global graph structure over object proposal segments and then infer a consistent segmentation. A limitation of methods utilizing long-range connections is that, for acceptable speed and memory usage, they have to operate on larger image regions such as superpixels or object proposals, compromising their ability to handle fine image detail.

Unsupervised segmentation  Another family of work does general moving object segmentation (over all parts of the image) and selects post-hoc the space-time tube that best matches the annotation, e.g. NLC [14] and [16, 24, 48].

In contrast, our approach side-steps the use of any intermediate tracked box, superpixels, or object proposals and proceeds on a per-frame basis, therefore efficiently handling even long sequences at full detail. We focus on propagating the first frame segmentation forward onto future frames, using an online fine-tuned convnet as an appearance model for segmenting the object of interest in the subsequent frames.

Box tracking  Some previous works have investigated approaches that improve segmentation quality by leveraging object tracking, and vice versa [10, 13, 36, 48]. More recently, state-of-the-art tracking methods are based on discriminative correlation filters over handcrafted features (e.g. HOG) and over frozen deep learned features [11, 12], or are convnet-based trackers in their own right [18, 29]. Our approach is most closely related to the latter group. GOTURN [18] proposes to train a convnet offline so as to directly regress the bounding box in the current frame based on the object position and appearance in the previous frame. MDNet [29] proposes to use online fine-tuning of a convnet to model the object appearance. Our training strategy is inspired by GOTURN for the offline part, and by MDNet for the online stage. Compared to the aforementioned methods, our approach operates on pixel-level masks instead of boxes. Differently from MDNet, we do not replace the domain-specific layers; instead we fine-tune all the layers on the available annotations for each individual video sequence.

Instance segmentation  At each frame, video object segmentation outputs a single instance segmentation. Given an estimate of the object location and size, bottom-up segment proposals [34] or GrabCut [38] variants can be used as shape guesses. Specific convnet architectures have also been proposed for instance segmentation [17, 32, 33, 49]. Our approach outputs per-frame instance segmentations using a convnet architecture inspired by works from other domains like [6, 40, 49]. A concurrent work [5] also exploits convnets for video object segmentation. Differently from our approach, their segmentation is not guided, which might result in performance decay over time. Furthermore, their offline training exploits full video sequence annotations, which are notoriously difficult to obtain.

Interactive video segmentation  Applications such as video editing for movie production often require a level of accuracy beyond the current state of the art. Thus several works have also considered video segmentation with variable annotation effort, enabling human interaction using clicks [20, 43, 46] or strokes [1, 15, 52]. In this work we consider instead box or (full) segment annotations on multiple frames. In §5 we report results when varying the amount of annotation effort (from one frame per video to all frames).

3. MaskTrack method

We approach the video object segmentation problem from a new angle that we refer to as guided instance segmentation. For each new frame we wish to label pixels as object/non-object of interest; for this we build upon the architecture of an existing pixel-labelling convnet and train it to generate per-frame instance segments. We pick DeepLabv2 [8], but our approach is agnostic to the specific architecture selected.

The challenge is then: how to inform the network which instance to segment? We solve this by using two complementary strategies. One is guiding the network towards the object of interest by feeding a rough estimate of its mask in the preceding frame as an extra input channel.
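As an illustration of this guidance strategy, the sketch below shows one way a standard 3-channel segmentation backbone could be extended to accept a fourth input channel carrying the previous-frame mask estimate. This is a minimal sketch, not the authors' implementation; the PyTorch wrapper, layer names and the zero-initialisation of the mask-channel weights are assumptions.

```python
# Minimal sketch (illustrative, not the authors' code): turn a generic
# 3-channel pixel-labelling network into a "guided" one by accepting a
# 4th input channel that carries the previous-frame mask estimate.
import torch
import torch.nn as nn

class GuidedSegNet(nn.Module):
    def __init__(self, rest_of_net: nn.Module, first_conv: nn.Conv2d):
        super().__init__()
        # Replace the first convolution so it accepts RGB + mask (4 channels),
        # copying the pretrained RGB weights and zero-initialising the mask channel.
        new_conv = nn.Conv2d(4, first_conv.out_channels,
                             kernel_size=first_conv.kernel_size,
                             stride=first_conv.stride,
                             padding=first_conv.padding,
                             bias=first_conv.bias is not None)
        with torch.no_grad():
            new_conv.weight[:, :3] = first_conv.weight
            new_conv.weight[:, 3:].zero_()
            if first_conv.bias is not None:
                new_conv.bias.copy_(first_conv.bias)
        self.first_conv = new_conv
        self.rest_of_net = rest_of_net  # remaining layers of the pixel-labelling network

    def forward(self, rgb, prev_mask):
        # prev_mask: (B, 1, H, W) rough object mask estimate from the preceding frame
        x = torch.cat([rgb, prev_mask], dim=1)
        return self.rest_of_net(self.first_conv(x))
```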
For each new video, we fine-tune online the model previously trained offline, on the first frame, for 200 iterations, with training samples generated from the first frame annotation. We augment the first frame by image flipping and rotations, as well as by deforming the annotated mask for the extra channel via affine and non-rigid deformations, with the same parameters as for the offline training. This results in an augmented set of ∼10³ training images. The network is trained with the same learning parameters as for offline training, fine-tuning all convolutional and fully connected layers.
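As a rough illustration of the mask augmentation described above, the sketch below generates deformed training masks from a single annotated mask via a small random affine warp plus a smooth non-rigid displacement field. The deformation ranges and the OpenCV-based implementation are illustrative assumptions; the exact parameters used in the paper are not given in this excerpt.

```python
# Minimal sketch (illustrative parameters): create plausible "preceding frame"
# masks from one annotated mask via a random affine warp plus a smooth
# non-rigid displacement field.
import numpy as np
import cv2

def deform_mask(mask, max_shift=0.05, max_scale=0.1,
                elastic_sigma=15.0, elastic_alpha=10.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    h, w = mask.shape
    # Random affine part: small translation and scaling around the image centre.
    tx, ty = rng.uniform(-max_shift, max_shift, 2) * (w, h)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    M = np.float32([[s, 0, tx + (1 - s) * w / 2],
                    [0, s, ty + (1 - s) * h / 2]])
    warped = cv2.warpAffine(mask.astype(np.uint8), M, (w, h))
    # Non-rigid part: smooth random displacement field applied via remapping.
    dx = cv2.GaussianBlur(rng.uniform(-1, 1, (h, w)).astype(np.float32),
                          (0, 0), elastic_sigma) * elastic_alpha
    dy = cv2.GaussianBlur(rng.uniform(-1, 1, (h, w)).astype(np.float32),
                          (0, 0), elastic_sigma) * elastic_alpha
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    deformed = cv2.remap(warped, xs + dx, ys + dy, interpolation=cv2.INTER_NEAREST)
    return (deformed > 0).astype(np.uint8)
```

Calling such a routine a few hundred times on the (flipped and rotated) first-frame annotation would yield an augmented set on the order of the ∼10³ training images mentioned above.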
At test time, our base MaskTrack system runs at about 12 seconds per frame (averaged over the DAVIS dataset, amortizing the online fine-tuning time over all video frames), which is an order of magnitude faster than ObjFlow [44] (about 2 minutes per frame, averaged over the DAVIS dataset).

Figure 4: By propagating annotation from the 1st frame, either from segment or just bounding box annotations, our system generates results comparable to ground truth. (Panel labels: Image, Box annotation, Segment annotation; 1st frame, 13th frame; Ground truth, MaskTrackBox result, MaskTrack result.)
[Figure 5: attribute-based evaluation on the DAVIS dataset. Per-attribute segmentation quality for MaskTrack+Flow+CRF, MaskTrack, ObjFlow and BVS, over attributes such as background clutter, deformation, occlusion, fast motion, motion blur, camera shake, out-of-view, scale variation, appearance change, edge ambiguity, shape complexity, heterogeneous object, dynamic background, interacting objects and low resolution.]
Figure 6 presents qualitative results of the proposed MaskTrack model across three different datasets.

Attribute-based analysis  Figure 5 presents a more detailed evaluation on DAVIS [30] using video attributes. The attribute-based analysis shows that our generic model, MaskTrack, is robust to the various video challenges present in DAVIS. It compares favourably on any subset of videos sharing the same attribute, except camera shake, where ObjFlow [44] marginally outperforms our approach. We observe that MaskTrack handles fast motion and motion blur well, which are typical failure cases for methods relying on spatio-temporal connections [26, 44].

Due to the online fine-tuning on the first frame annotation of a new video, our system is able to capture the appearance of the specific object of interest. This allows it to better recover from occlusions, out-of-view scenarios and appearance changes, which usually affect methods that strongly rely on propagating segmentations on a per-frame basis.

Incorporating optical flow information into MaskTrack substantially increases robustness in all categories. As one could expect, MaskTrack+Flow+CRF better discriminates cases involving color ambiguity and salient motion. However, we also observed less obvious improvements in cases with scale variation and low-resolution objects.

Conclusion  With our simple, generic system for video object segmentation we are able to achieve results competitive with existing techniques on three different datasets. These results are obtained with fixed parameters, from a forward pass only, using only static images during offline training. We also reach good results even when using only a box annotation as the starting point.

5.4. Multiple frames annotations

In some applications, e.g. video editing for movie production, one may want to consider more than a single frame annotation per video. Figure 7 shows the video object segmentation quality when considering different numbers of annotated frames, on the DAVIS dataset. We show results both for pixel-accurate segment annotations per frame and for bounding box annotations per frame.

Figure 7: Percentage of annotated frames versus video object segmentation quality. We report mean IoU, and quantiles at 5, 10, 20, 30, 70, 80, 90, and 95%. Results on the DAVIS dataset, using (a) segment annotations or (b) box annotations. The baseline simply copies the annotations to adjacent frames. Discussion in §5.4.

For these experiments, we run our method twice, forward and backwards, and for each frame pick the result closest in time to the annotated frame (either from forward or backwards propagation). Here, the online fine-tuning uses all annotated frames instead of only the first one. For the experiments with box annotations (Figure 7b), we use a procedure similar to MaskTrackBox: box annotations are first converted to segments, and then MaskTrack is applied as-is, treating these as the original segment annotations.

The evaluation reports the mean IoU results when annotating one frame only (same as Table 2), and every 40th, 30th, 20th, 10th, 5th, 3rd, and 2nd frame. Since DAVIS videos are ∼100 frames long, 1 annotated frame corresponds to ∼1% of the frames, annotations every 20th frame to 5%, every 10th to 10%, every 5th to 20%, etc. We follow the same DAVIS evaluation protocol as §5.3, ignoring the first and last frame, and choosing to include the annotated frames in the evaluation (this is particularly relevant for the box annotation results).
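The forward/backward merging rule described above can be summarised in a few lines. The sketch below is a minimal illustration with hypothetical names (forward_results, backward_results), not the authors' code: for each frame it keeps whichever propagation result comes from the temporally closer annotated frame.

```python
# Minimal sketch (hypothetical helper names): merge forward and backward
# propagation results by keeping, for each frame, the result whose source
# annotation is closest in time.
def merge_bidirectional(forward_results, backward_results, annotated_frames, num_frames):
    """forward_results[t] / backward_results[t]: masks propagated from the previous /
    next annotated frame; annotated_frames: sorted list of annotated frame indices."""
    merged = []
    for t in range(num_frames):
        prev_anno = max((a for a in annotated_frames if a <= t), default=None)
        next_anno = min((a for a in annotated_frames if a >= t), default=None)
        if next_anno is None or (prev_anno is not None and t - prev_anno <= next_anno - t):
            merged.append(forward_results[t])
        else:
            merged.append(backward_results[t])
    return merged
```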
Figure 6: Qualitative results on three different datasets (rows: DAVIS, SegTrack, YouTube; leftmost column: 1st frame with ground-truth segment; remaining columns: MaskTrack results on frames chosen equally distant based on the video sequence length). Our algorithm is robust to challenging situations such as occlusions, fast motion, multiple instances of the same semantic class, object shape deformation, camera view change and motion blur.
Other than mean IoU, we also show the quantile curves indicating the cutting line for the 5%, 10%, 20%, etc. lowest-quality video frame results. This gives a hint of how much targeted additional annotation might be needed: the higher the mean IoU of these quantiles, the better.

The baseline for these experiments consists in directly copying the ground truth annotations from the nearest annotated neighbour. This baseline indicates the "level zero" for our results. For visual clarity, we only include the mean value for the baseline.

[…]mate from boxes. After 10% of annotated frames, not much additional gain is obtained. Interestingly, the mean IoU and the 30% quantile here both reach the ∼0.8 mIoU range. Additionally, 70% of the frames have IoU above 0.89.

Conclusion  Results indicate that with only 10% of annotated frames we can reach satisfactory quality, even when using only bounding box annotations. We see that when moving from one annotation per video to two or three frames (1% → 3% → 4%), quality increases sharply, showing that our system can adequately leverage a few extra annotations per video.
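For concreteness, the sketch below computes the statistics reported in Figure 7 (mean IoU plus the quantile values) from per-frame IoUs, skipping the first and last frame as in the protocol above. It is a minimal sketch; function names are illustrative.

```python
# Minimal sketch: per-video statistics as described above (mean IoU plus
# low/high quantiles over per-frame IoUs, first and last frame excluded).
import numpy as np

def frame_iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def summarize(per_frame_ious, quantiles=(5, 10, 20, 30, 70, 80, 90, 95)):
    ious = np.asarray(per_frame_ious[1:-1])  # ignore first and last frame
    return {"mean_iou": float(ious.mean()),
            **{f"q{q}": float(np.percentile(ious, q)) for q in quantiles}}
```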
Figure S8: Qualitative results of MaskTrackBox and MaskTrack on DAVIS using 1st frame annotation supervision (box or segment); the leftmost column shows the 1st frame annotation, and the remaining frames are chosen equally distant based on the video sequence length. By propagating the annotation from the 1st frame, either from a segment or just a bounding box annotation, our system generates results comparable to ground truth.
Figure S9: Examples of training mask generation (left: annotated image; right: example training masks). From one annotated image, multiple training masks are generated. The generated masks mimic plausible object shapes on the preceding frame.
Figure S10: Examples of optical flow magnitude images for different datasets (DAVIS and SegTrack-v2; for each, the RGB image and the corresponding optical flow magnitude are shown).
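As a rough illustration of how such flow magnitude images can be produced, the sketch below computes a per-pixel flow magnitude between two consecutive frames. Farnebäck flow is used here purely as a stand-in; the specific flow estimator used in the paper is not stated in this excerpt.

```python
# Minimal sketch: optical flow magnitude image between two consecutive frames.
# Farnebäck flow is a stand-in for whichever flow estimator is preferred.
import numpy as np
import cv2

def flow_magnitude_image(frame_prev_bgr, frame_curr_bgr):
    prev_gray = cv2.cvtColor(frame_prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(frame_curr_bgr, cv2.COLOR_BGR2GRAY)
    # Args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    # Normalise to [0, 255] for visualisation as a grayscale image.
    mag = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return mag.astype(np.uint8)
```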