
Journal Pre-proof

STI-Net: Spatiotemporal Integration Network for Video Saliency Detection

Xiaofei Zhou, Weipeng Cao, Hanxiao Gao, Zhong Ming and Jiyong Zhang

PII: S0020-0255(23)00121-4
DOI: https://doi.org/10.1016/j.ins.2023.01.106
Reference: INS 18637

To appear in: Information Sciences

Received date: 16 May 2021


Revised date: 18 January 2023
Accepted date: 24 January 2023

Please cite this article as: X. Zhou, W. Cao, H. Gao et al., STI-Net: Spatiotemporal Integration Network for Video Saliency Detection, Information Sciences,
doi: https://doi.org/10.1016/j.ins.2023.01.106.


© 2023 Published by Elsevier.



STI-Net: Spatiotemporal Integration Network for Video Saliency Detection

Xiaofei Zhou · Weipeng Cao* · Hanxiao Gao · Zhong Ming · Jiyong Zhang

Abstract Image saliency detection, to which much effort has been devoted in recent years, has advanced significantly. In contrast, the community has paid little attention to video saliency detection. In particular, existing video saliency models are very likely to fail in videos with difficult scenarios such as fast motion, dynamic background, and non-rigid deformation. Furthermore, performing video saliency detection directly with image saliency models that ignore video temporal information is inappropriate. To alleviate this issue, this study proposes a novel end-to-end spatiotemporal integration network (STI-Net) for detecting salient objects in videos. Specifically, our method is made up of three key steps: feature aggregation, saliency prediction, and saliency fusion, which are used sequentially to generate spatiotemporal deep feature maps, coarse saliency predictions, and the final saliency map. The key advantage of our model lies in the comprehensive exploration of spatial and temporal information across the entire network, where the two features interact with each other in the feature aggregation step, are used to construct the boundary cue in the saliency prediction step, and also serve as the original information in the saliency fusion step. As a result, the generated spatiotemporal deep feature maps can precisely and completely characterize the salient objects, and the coarse saliency predictions have well-defined boundaries, effectively improving the final saliency map's quality. Furthermore, "shortcut connections" are introduced into our model to make the proposed network easy to train and to obtain accurate results when the network is deep. Extensive experimental results on two publicly available challenging video datasets demonstrate the effectiveness of the proposed model, which achieves comparable performance to state-of-the-art saliency models.

Keywords spatiotemporal saliency · feature aggregation · saliency prediction · saliency fusion

X. Zhou, H. Gao, and J. Zhang
School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
E-mail: zxforchid@outlook.com; gaohx@hdu.edu.cn; jzhang@hdu.edu.cn

W. Cao and Z. Ming
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, China
E-mail: caoweipeng@gml.ac.cn; mingz@szu.edu.cn
(* Corresponding author)

1 Introduction

Saliency detection is a popular research topic that aims to identify the most visually appealing objects in an image or video. It is especially useful in a variety of applications such as video saliency detection [18, 34], image retrieval [22, 48], and video-based traffic control [43, 8]. In general, there are two settings for saliency detection: one estimates where humans look in an image, while the other attempts to pop out all salient objects in an image or video. This paper is concerned with the latter. Furthermore, saliency models are typically classified as static image saliency models or dynamic video saliency models based on the input. So far, a large number of studies have focused on detecting saliency in static images. In stark contrast, efforts on video salient object detection have received remarkably little attention due to the high complexity of video saliency computation and the scarcity of large-scale pixel-wise annotated video data.

The primary distinction between video and image lies in the temporal information, which is quite appealing to people's attention. Obviously, for video saliency detection, we should pay attention to one critical point: how to adequately integrate the spatial information and the temporal information using efficient utilization methods. Guided by this point, many video saliency models have been developed based on various aspects such as the center-surround mechanism [29], machine learning [25], information fusion [5], and regional saliency assessment [49]. The aforementioned video saliency models can achieve comparable performance, but they deteriorate when handling challenging scenarios such as dynamic background, non-rigid deformation, and fast motion.

Fortunately, the performance of image saliency models has improved significantly since the successful adoption of convolutional neural networks. As a result, an intuitive thought occurs to us: can we directly apply these image saliency models to perform video saliency detection? This path is not viable due to the lack of temporal information in image saliency models, which are unable to maintain visual coherence and temporal correlation in the generated saliency maps. Meanwhile, convolutional neural networks have recently been extended to video saliency models [39, 9].

Motivated by the aforementioned discussion, we present a novel end-to-end spatiotemporal integration network (STI-Net) for video saliency detection, as shown in Fig. 1, which contains three key steps: feature aggregation, saliency prediction, and saliency fusion. The key advantage of our model is that spatial and temporal information can be adequately integrated throughout the overall network. Firstly, in the feature aggregation step, the network is designed to enable an interaction between spatial information and temporal information via concatenation, convolution and recurrent connections. In this way, the generated spatiotemporal deep features can effectively characterize salient objects in videos. Secondly, to ensure that the produced saliency maps have well-defined boundaries, the spatiotemporal boundary cue originating from both the spatial and temporal domains is incorporated into the saliency prediction step. Thirdly, in order to fuse the coarse prediction results effectively, the original or initial information, i.e. the current frame and the optical flow image, is used to provide complementary information for the correction of the prediction results in the fusion step. Furthermore, to aid network optimization and obtain more accurate results as the depth of our network increases, the shortcut connections adopted in [11] are imposed on the convolutional blocks located in the feature aggregation and saliency fusion steps, i.e. Conv-B1 and Conv-B3, as shown in Figs. 2 and 5, respectively. Consequently, the proposed spatiotemporal saliency network (STI-Net) can achieve very promising results in difficult scenarios.

The most related work (denoted as Amulet) to this paper is proposed in [45], which investigates the aggregation of multi-level convolutional feature maps for saliency detection in still images. The differences between our model and Amulet are evident in both the input and the model architecture. Firstly, with the introduction of temporal information, i.e. the optical flow image, we design a more comprehensive architecture for feature aggregation compared to Amulet. In the feature aggregation step of our model, feature interaction occurs not only between features originating from different layers, but also between features originating from spatial and temporal information. In contrast, Amulet only achieves the interaction between features originating from different layers. Furthermore, to make our model easy to train and obtain promising results, we design the "shortcut connections" based convolutional block Conv-B1-i shown in Fig. 2, which is another difference between our model and Amulet. Certainly, inspired by Amulet, this block also introduces recurrent connections to further accelerate the interaction of deep features. Secondly, the generation and utilization of boundary information are not the same. In the saliency prediction step of our model, we generate the spatiotemporal boundary cue from both spatial and temporal information, and the obtained boundary cue is concatenated with the deep features for saliency prediction. In contrast, Amulet only adopts the spatial information as the boundary cue, which is employed in a summation manner. Thirdly, unlike Amulet, our model uses the initial information, namely the current frame and the optical flow image, to correct the predicted results in the saliency fusion step. Further, in the fusion step, we also deploy the "shortcut connections" based convolutional block Conv-B3 shown in Fig. 5 to accelerate model optimization.

Other typical works related to this paper are [5, 27, 36], which are all built on hand-crafted low-level features (e.g., color, edge, etc.) and depend on motion information derived from optical flow. Furthermore, their predicted saliency maps suffer from low accuracy due to the limited representability of low-level features. In contrast, our model makes adequate use of spatial information and temporal information throughout the network, resulting in the effective exploitation of deep features. As a result, the obtained saliency maps are of high quality (see Figs. 6 and 7). Thus, the difference between our model and the aforementioned models lies in the utilization of features.

Fig. 1 Illustration of the proposed spatiotemporal integration network (STI-Net). Given the current frame $I_t$ (256×256×3) and its optical flow image $OP_t$ (256×256×3), feature extraction modules are first employed to generate the initial spatial deep features $\{IF^{S}_{t,i}\}_{i=1}^{5}$ and the initial temporal deep features $\{IF^{T}_{t,i}\}_{i=1}^{5}$. Then, the feature aggregation step is performed on these initial deep features, yielding the spatiotemporal deep feature maps $F^{ST}_{t,i}$ $(i = 1, \cdots, 5)$. After that, by considering the spatiotemporal boundary cue $B^{ST}_{t}$ originating from the spatial and temporal domains, we perform saliency prediction on these spatiotemporal deep feature maps and acquire the coarse saliency predictions $CS_{t,i}$ $(i = 1, \cdots, 5)$. Finally, by incorporating the original information, including $I_t$ and $OP_t$, these prediction results are further fused to obtain the final saliency map $S_t$.

Overall, the main contributions of our paper are summarized as follows:

1. We propose a novel end-to-end spatiotemporal saliency network (STI-Net) for salient object detection in videos, which consists of three key steps: feature aggregation, saliency prediction, and saliency fusion.
2. Spatial and temporal information are fully exploited throughout the network, namely interacting with one another, constructing the boundary cue, and serving as the original information in the feature aggregation step, the saliency prediction step and the saliency fusion step, respectively.
3. To make our model easy to train and obtain promising results, we first equip two types of convolutional blocks, i.e. Conv-B1 and Conv-B3, with "shortcut connections", and then reuse the parameters of Amulet for initialization.

The rest of this paper is organized as follows. The related works are reviewed in Section 2. The proposed model is described in Section 3. The experimental results and the related analyses are presented in Section 4. Finally, we conclude this paper in Section 5.

2 Related Works

This section reviews some works on detecting salient objects in static images and dynamic video sequences.

2.1 Image Saliency Detection

Saliency detection for static images has been studied extensively for many years. The pioneering model was proposed by Itti et al. in [17], in which orientation, color and luminance are used to compute the center-surround difference. Referring to this famous framework, in [7], the center-surround difference is measured using global contrast, which incorporates the spatial distance of each pair

of regions. Besides, in [28], Liu et al. employ a binary tree to model the saliency of each superpixel. Recently, some saliency models have also been constructed on machine learning techniques. For instance, in [26], a conditional random field is utilized to aggregate multiple features and generate saliency maps. In [19], features, which are used to represent regions, are mapped to saliency scores by using a random forest regressor. Furthermore, some saliency models are built on convolutional neural networks. In [46], a multi-context deep model is proposed to handle scenes with low-contrast backgrounds and confusing visual appearance. In [45], multi-level deep features are effectively integrated by the proposed bidirectional learning network. In [47], a simple gated network is proposed to control the information interactions between encoder and decoder. In [31], a two-level nested U-structure based network is designed to acquire contextual information and increase the network's depth. In [40], a decomposition and completion network is employed to integrate edge and skeleton cues, which localize the boundaries and interiors of salient objects, respectively. In [21], a recursive CNN architecture equipped with a contour-saliency blending module is proposed to promote the information exchange between saliency and contour.

Overall, the above works focus on image saliency detection. However, it is inappropriate to directly employ them for video saliency detection, because the spatial and temporal cues should be taken into account simultaneously for videos. In this paper, to perform salient object detection in videos, we adopt an off-the-shelf deep saliency model [45] as the feature extraction module, and fully exploit the spatial information and temporal information of a video sequence by devising specific operations in the following three components: feature aggregation, saliency prediction, and saliency fusion, as shown in Fig. 1.

2.2 Video Saliency Detection

In contrast to image saliency detection, video saliency detection is full of challenges, due to the high computational complexity, complex scenes, and the lack of large-scale pixel-wise annotated video data. Generally speaking, the existing models can be categorized into two classes: conventional models constructed using the center-surround framework, information theory, control theory, traditional machine learning techniques, or information fusion methods, and deep learning based models.

Referring to the classical center-surround framework [17], in [16], multiple features consisting of motion energy, luminance, color and so on are employed to estimate feature differences. In [29], saliency scores are obtained by performing Kullback-Leibler divergence on dynamic features. Traditional machine learning techniques have also been employed by some video saliency models. For example, in [14], the saliency map is constructed based on the diffusion of trajectories, which can be found by using one-class support vector machines. In [50], a random forest based video saliency model is learned on each frame. Besides, some saliency models operate on segmented or superpixel regions, which are often used to constitute a graph. For instance, in [27], saliency is computed by using a propagation method based on a superpixel-level graph. Furthermore, some other theoretical methods such as background priors and quantum cuts are also used to construct saliency models [5], which also achieve encouraging performance.

In recent years, convolutional neural networks have also been adopted by some video saliency models. For instance, in [39], an attention mechanism based two-stream spatiotemporal network is proposed to perform video salient object detection. In [37], static/dynamic saliency models are constructed using fully convolutional neural networks. Besides, in [33, 9], the ConvLSTM [41] is employed to detect salient objects in videos. In [24], a video saliency model is obtained by weakly retraining a pretrained image saliency deep model. In [20], the guidance and teaching network is proposed to conduct implicit guidance and explicit teaching strategies for video salient object detection. In [42], a graph convolutional network based video saliency model is proposed to highlight complete salient objects with well-defined boundaries. In [6], a confidence-guided adaptive gate module and a dual differential enhancement module are used to purify and merge spatial and temporal cues. In [44], the proposed dynamic context-sensitive filtering network employs a bidirectional dynamic fusion strategy to promote the interaction of spatial and temporal information. In addition, some other efforts are also devoted to video research. For example, in [1], an evaluation method is proposed to simulate most of the details in a wireless automated video surveillance framework. In [12], the key frame is extracted by a bio-inspired method, which is applied to optimize the quality of the frames after embedding a watermark. In [8], the spatial and temporal relations are defined by mapping between vocabulary and the object movements.

Compared with all the aforementioned saliency models, our model can sufficiently exploit spatial information and temporal information, where the two modal cues interact with one another, are used to construct boundary information, and serve as the original

information. Compared with the contemporary video saliency models, the proposed model achieves a comparable performance.

3 The Proposed Model

In this section, we give a detailed presentation of constructing and training the proposed spatiotemporal saliency network.

3.1 Overall Architecture

Our model, as illustrated in Fig. 1, consists of three key steps: feature aggregation, saliency prediction, and saliency fusion. These three steps are trained jointly in an end-to-end manner. Concretely, the input of our model is the current frame $I_t$ and the optical flow image $OP_t$. Here, $OP_t$ is obtained by using the large displacement optical flow (LDOF) method [4], and is further converted into an optical flow image [3] coded with 3-channel (R/G/B) color. Firstly, the current frame together with its optical flow image is fed into the feature extraction modules to yield the initial spatial and temporal deep features. Secondly, feature aggregation is performed on the initial deep features, from which we obtain the spatiotemporal deep features. Thirdly, by incorporating the boundary cues, which are acquired from spatial and temporal cues, we perform saliency prediction on these spatiotemporal deep feature maps and obtain the coarse saliency predictions. Lastly, by incorporating the original information, i.e. the current frame and the optical flow image, we deploy saliency fusion on the prediction results, generating the final saliency map $S_t$.

For the feature extraction module, we adopt an off-the-shelf deep saliency model, i.e. Amulet [45], which is built on the VGG-16 model [32]. In our model, the feature extraction module that provides the initial deep features is only a part of Amulet, where we only adopt the output of the resolution-based feature combination structure (i.e. RFC) and the second layer of VGG-16 (i.e. conv1-2) in Amulet to perform feature aggregation and saliency prediction, respectively. Concretely, the current frame $I_t$ and the optical flow image $OP_t$ are sent to the feature extraction module to obtain the corresponding initial spatial deep features $\{IF^{S}_{t,i}\}_{i=1}^{5}$ and initial temporal deep features $\{IF^{T}_{t,i}\}_{i=1}^{5}$, respectively. Notably, the numbers of channels of these initial deep features are all set to 320, and the features have different spatial resolutions, which are obtained by using the five convolutional blocks in VGG-16.
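For intuition, the following is a minimal sketch of how a dense flow field can be turned into the 3-channel color image that the temporal stream consumes. It uses OpenCV's Farneback flow and a simple HSV coding as stand-ins; the paper itself relies on LDOF [4] and the color coding of [3], so the functions and parameters below are illustrative assumptions rather than the authors' pipeline.

```python
import cv2
import numpy as np

def flow_to_color(frame_prev, frame_next):
    """Estimate dense optical flow and encode it as a 3-channel color image.

    Sketch only: Farneback flow plus an HSV coding (direction -> hue,
    magnitude -> value) substitute for LDOF [4] and the coding of [3].
    """
    prev_gray = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame_prev)                    # uint8 HxWx3
    hsv[..., 0] = ang * 180 / np.pi / 2                # direction -> hue
    hsv[..., 1] = 255                                  # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
    rgb = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
    return cv2.resize(rgb, (256, 256))                 # match the 256x256x3 network input
```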
Fig. 2 Illustration of the convolutional block Conv-B1. (a): the thumbnail of Conv-B1, (b): the detailed configuration of Conv-B1, which consists of convolutional layers with 3×3 and 1×1 kernel sizes, a deconvolutional layer with $(256/CW_i)\times(256/CH_i)$ kernel size, batch normalization (BN) layers, and ReLU layers. Besides, in Conv-B1, the "shortcut connections" are also adopted. Here, $CW_i$ and $CH_i$ represent the width and height of the input feature maps.

3.2 Feature Aggregation

To characterize the salient objects in the video effectively, we perform feature aggregation on the initial spatial and temporal deep features $\{IF^{S}_{t,i}\}_{i=1}^{L}$ and $\{IF^{T}_{t,i}\}_{i=1}^{L}$ via concatenation, convolution and recurrent connections, in which spatial information and temporal information can interact with each other. In this way, we acquire the spatiotemporal deep features $\{F^{ST}_{t,i}\}_{i=1}^{L}$.

Firstly, we mix the initial spatial and temporal deep features via a simple concatenation operation. Concretely, since the initial deep features are obtained by using five convolutional blocks, they come in five different resolutions. Therefore, we concatenate the initial spatial deep features and temporal deep features belonging to the same resolution along the channel direction, which is formulated as:

$$IF^{ST}_{t,i} = Cat\left(IF^{S}_{t,i}, IF^{T}_{t,i}\right), \quad i = 1, \cdots, L, \qquad (1)$$

where $Cat$ denotes the cross-channel concatenation, and $IF^{ST}_{t,i}$ refers to the initial spatiotemporal deep features with five different spatial resolutions, namely 256×256×640, 128×128×640, 64×64×640, 32×32×640, and 16×16×640.

Secondly, inspired by the recurrent connections [45] which facilitate information exchange, we mix $IF^{ST}_{t,i}$ by using the convolutional blocks Conv-B1-$i$ $(i = 1, \cdots, 5)$, which are connected via recurrent connections. Here, each convolutional block is followed by a 3×3 convolutional layer, as shown in Fig. 1. Moreover, to make our model easy to train, we also equip the convolutional blocks with "shortcut connections" [11]. For simplicity, we denote each of these convolutional blocks as Conv-B1-$i$ $(i = 1, \cdots, 5)$, as shown in Fig. 2(a).

Specifically, at each level $i$, the convolutional block Conv-B1-$i$ takes $IF^{ST}_{t,i}$ and the high-level aggregation result $F^{ST}_{t,i+1}$ as input, and produces the new aggregation result. According to the two blue boxes located in the aggregation step shown in Fig. 1 and the convolutional block Conv-B1-$i$ shown in Fig. 2, this process can be defined as:

$$F^{ST}_{t,i} = \begin{cases} W^{a}_{i} * Cat\left(f^{B1}_{i}\left(IF^{ST}_{t,i}\right), F^{ST}_{t,i+1}\right) + b^{a}_{i}, & i < L \\ W^{a}_{i} * f^{B1}_{i}\left(IF^{ST}_{t,i}\right) + b^{a}_{i}, & i = L \end{cases} \qquad (2)$$

where $F^{ST}_{t,i}$ denotes the spatiotemporal deep feature map. The kernel weights $W^{a}_{i}$ and bias $b^{a}_{i}$ refer to the parameters of the 3×3 convolutional layers, i.e. the bottom blue boxes in the feature aggregation step shown in Fig. 1. $f^{B1}_{i}$ denotes the convolutional block Conv-B1-$i$ shown in Fig. 2, which includes convolutional layers, batch normalization (BN) layers [2], a ReLU activation function, a concatenation layer, and deconvolutional layers. Specifically, when $i = L$, the convolutional block Conv-B1-$i$ reduces to $f^{B1}_{i}\left(IF^{ST}_{t,i}\right)$, namely the recurrent connection to the high-level aggregation result is removed. When $i < L$, Conv-B1-$i$ is represented by $Cat\left(f^{B1}_{i}\left(IF^{ST}_{t,i}\right), F^{ST}_{t,i+1}\right)$.

According to Eq. (2) and Fig. 1, we can see that the low-resolution spatiotemporal deep features are recurrently utilized to generate the high-resolution deep features. This effectively facilitates the interaction between spatial information and temporal information, and the generated spatiotemporal deep features provide an effective representation of the salient objects in videos.
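To make the aggregation concrete, the sketch below mirrors Eqs. (1)-(2) in PyTorch: same-resolution spatial and temporal features are concatenated, passed through a simplified Conv-B1-style block with a shortcut connection, upsampled to 256×256, recurrently combined with the higher-level result, and projected to two channels by a 3×3 convolution. Channel widths and the exact block layout are simplifying assumptions; Fig. 2 shows the configuration actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvB1(nn.Module):
    """Simplified Conv-B1-style block: channel reduction, two 3x3 conv+BN stages
    wrapped by a shortcut connection [11], then upsampling to 256x256."""
    def __init__(self, in_ch=640, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)
        self.body = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.reduce(x)
        x = self.relu(x + self.body(x))                      # shortcut connection
        return F.interpolate(x, size=(256, 256), mode='bilinear', align_corners=False)

def aggregate(spatial_feats, temporal_feats):
    """Eqs. (1)-(2): concatenate spatial/temporal features per level, then
    aggregate from the deepest level (i = L) down to level 1."""
    L = len(spatial_feats)
    blocks = [ConvB1() for _ in range(L)]                    # Conv-B1-i, i = 1..L
    heads = [nn.Conv2d(64 + 2, 2, 3, padding=1) for _ in range(L - 1)] \
            + [nn.Conv2d(64, 2, 3, padding=1)]               # trailing 3x3 convolutions
    fused, F_high = [None] * L, None
    for i in reversed(range(L)):                             # top-down recurrence
        IF_st = torch.cat([spatial_feats[i], temporal_feats[i]], dim=1)   # Eq. (1)
        f = blocks[i](IF_st)
        f = f if F_high is None else torch.cat([f, F_high], dim=1)        # recurrent connection
        F_high = heads[i](f)                                               # Eq. (2)
        fused[i] = F_high
    return fused                                             # {F_t,i^ST}, each 2 x 256 x 256

# Usage with random initial features at the five resolutions (320 channels each stream).
if __name__ == "__main__":
    sizes = [256, 128, 64, 32, 16]
    sp = [torch.randn(1, 320, s, s) for s in sizes]
    tp = [torch.randn(1, 320, s, s) for s in sizes]
    print([f.shape for f in aggregate(sp, tp)])
```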

Fig. 3 Illustration of the convolutional block Conv-B2. (a): the thumbnail of Conv-B2, (b): the detailed configuration of Conv-B2, which consists of a 3×3 convolutional layer, a concatenation layer and a ReLU layer.

3.3 Saliency Prediction

Based on the obtained spatiotemporal deep feature maps, we perform saliency prediction. Furthermore, inspired by boundary-preserving efforts [45, 21], we first generate the spatiotemporal boundary cue by using the initial spatial and temporal deep features, which are the output of the feature extraction module (i.e. conv1-2). Then, we introduce the boundary cue into the prediction step shown in Fig. 1.

Firstly, we apply two 1×1 convolutional layers to the conv1-2 layers that correspond to the spatial and temporal information, yielding the spatial boundary cue $B^{S}_{t}$ and the temporal boundary cue $B^{T}_{t}$, respectively.

Secondly, we concatenate the two boundary cues, namely:

$$B^{ST}_{t} = Cat\left(B^{S}_{t}, B^{T}_{t}\right), \qquad (3)$$

where $B^{ST}_{t}$ refers to the spatiotemporal boundary cue.

Finally, by incorporating the spatiotemporal boundary cue $B^{ST}_{t}$, the saliency prediction deploys the convolutional block Conv-B2 on each spatiotemporal deep feature map $F^{ST}_{t,i}$, and we obtain the coarse saliency map $CS_{t,i}$ shown in Fig. 4(d-h), namely:

$$CS_{t,i} = W^{p}_{i} * \sigma\left(Cat\left(B^{ST}_{t}, F^{ST}_{t,i}\right)\right) + b^{p}_{i}, \qquad (4)$$

where the convolutional block Conv-B2 consists of Conv-B2-1, Conv-B2-2, Conv-B2-3, Conv-B2-4 and Conv-B2-5, as shown in Fig. 1. For simplicity, each one is denoted as Conv-B2-$i$ $(i = 1, \cdots, 5)$, as shown in Fig. 3(a). According to Fig. 3(b), Conv-B2-$i$ consists of a concatenation layer, a ReLU activation function, and a 3×3 convolutional layer. $W^{p}_{i}$ and $b^{p}_{i}$ denote the kernel weights and the bias of the convolutional layer in Conv-B2-$i$, respectively. Besides, $\sigma$ denotes the ReLU activation function.
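A minimal sketch of Eqs. (3)-(4) is given below, assuming the two boundary cues are single-channel maps produced by 1×1 convolutions on the spatial and temporal conv1-2 features; the channel sizes are illustrative, not the exact ones used in the paper.

```python
import torch
import torch.nn as nn

class ConvB2(nn.Module):
    """Conv-B2-i as described in Fig. 3: concatenation, ReLU, then a 3x3 convolution."""
    def __init__(self, feat_ch=2, boundary_ch=2):
        super().__init__()
        self.conv = nn.Conv2d(feat_ch + boundary_ch, 2, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, F_st, B_st):
        x = torch.cat([B_st, F_st], dim=1)      # Cat(B_t^ST, F_t,i^ST)
        return self.conv(self.relu(x))          # Eq. (4): W_i^p * sigma(.) + b_i^p

def predict_coarse_saliency(conv12_spatial, conv12_temporal, fused_feats):
    # Eq. (3): 1x1 projections of the two conv1-2 features give the boundary cues,
    # which are concatenated into the spatiotemporal boundary cue B_t^ST.
    to_bs = nn.Conv2d(conv12_spatial.shape[1], 1, kernel_size=1)
    to_bt = nn.Conv2d(conv12_temporal.shape[1], 1, kernel_size=1)
    B_st = torch.cat([to_bs(conv12_spatial), to_bt(conv12_temporal)], dim=1)

    heads = [ConvB2() for _ in fused_feats]     # Conv-B2-i, one per level
    return [head(F_st, B_st) for head, F_st in zip(heads, fused_feats)]

# Usage: conv1-2 features (64 channels) and five fused 2-channel maps at 256x256.
if __name__ == "__main__":
    cs, ct = torch.randn(1, 64, 256, 256), torch.randn(1, 64, 256, 256)
    fused = [torch.randn(1, 2, 256, 256) for _ in range(5)]
    print([m.shape for m in predict_coarse_saliency(cs, ct, fused)])
```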


Fig. 4 Intermediate results of our model. (a): Input video frames, (b): binary ground truths, (c): the final saliency maps $S_t$, (d)-(h): coarse saliency maps generated by the saliency prediction step, i.e. (d): $CS_{t,1}$, (e): $CS_{t,2}$, (f): $CS_{t,3}$, (g): $CS_{t,4}$, (h): $CS_{t,5}$.

Fig. 5 Illustration of the convolutional block Conv-B3. (a): the thumbnail of Conv-B3, (b): the detailed configuration of Conv-B3, which consists of 3×3 convolutional layers, batch normalization (BN) layers and ReLU layers. Besides, in Conv-B3, the "shortcut connections" are also adopted.

3.4 Saliency Fusion

The obtained coarse saliency maps $\{CS_{t,i}\}_{i=1}^{L}$, which are computed from five spatiotemporal deep feature maps of different resolutions, complement each other [13]. Furthermore, to obtain high-quality saliency maps and inspired by [35], we incorporate the original (or initial) information, i.e. the current frame $I_t$ and the optical flow image $OP_t$, into the saliency fusion step. In the saliency fusion step, we first concatenate the original information and the obtained coarse saliency maps into a 16-channel image $FOC_t$, which can be written as:

$$FOC_t = Cat\left(I_t, OP_t, \{CS_{t,i}\}_{i=1}^{L}\right). \qquad (5)$$

After that, the 16-channel image $FOC_t$ is fed into the convolutional block Conv-B3 shown in Fig. 5, namely:

$$\left(S^{f}_{t}, S^{b}_{t}\right) = f^{B3}\left(FOC_t\right), \qquad (6)$$

where $f^{B3}$ denotes the convolutional block Conv-B3, which consists of 3×3 convolutional layers, batch normalization (BN) layers, and a ReLU activation function. Here, we should note that the channel number of the 3×3 convolutional layers in Conv-B3 is set to 2. Besides, similar to the convolutional block Conv-B1, we also employ the "shortcut connections" to make our model easy to train. In this way, we can sufficiently exploit the complementary information of these coarse saliency maps in a nonlinear manner to obtain the final high-quality saliency map. $S^{f}_{t}$ and $S^{b}_{t}$ denote the probabilities of each pixel belonging to salient objects and to the background, respectively.

Finally, the final saliency map $S_t$, as illustrated in Fig. 4(c), is computed in a contrast way, namely:

$$S_t = \sigma\left(S^{f}_{t} - S^{b}_{t}\right), \qquad (7)$$

where $\sigma$ denotes the ReLU activation function. Besides, according to Fig. 4, it can be found that the coarse saliency maps highlight the main body of the salient objects, as shown in Fig. 4(d-h). Meanwhile, we can also find that the final saliency map $S_t$ shown in Fig. 4(c) both highlights the salient object regions more uniformly and shows more precise detail, such as the bell around the neck of the cow (top example) and the spoiler on the car (bottom example).
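The fusion step of Eqs. (5)-(7) can be sketched as follows: the frame, the flow image and the five 2-channel coarse maps are stacked into a 16-channel input, a simplified Conv-B3-style residual block produces the foreground/background pair, and the final map is the ReLU of their difference. The internal layer widths are assumptions; see Fig. 5 for the actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvB3(nn.Module):
    """Simplified Conv-B3: 3x3 conv+BN stages wrapped by a shortcut connection,
    producing a 2-channel output (foreground / background evidence)."""
    def __init__(self, in_ch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, 2, 3, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(2, 2, 3, padding=1), nn.BatchNorm2d(2), nn.ReLU(inplace=True),
            nn.Conv2d(2, 2, 3, padding=1), nn.BatchNorm2d(2))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.proj(x)
        return self.relu(x + self.body(x))      # shortcut connection

def fuse(frame, flow_img, coarse_maps):
    foc = torch.cat([frame, flow_img] + coarse_maps, dim=1)   # Eq. (5): 3+3+5*2 = 16 channels
    sf_sb = ConvB3()(foc)                                     # Eq. (6)
    s_f, s_b = sf_sb[:, 0:1], sf_sb[:, 1:2]
    return F.relu(s_f - s_b)                                  # Eq. (7): contrast of the two maps

if __name__ == "__main__":
    frame = torch.randn(1, 3, 256, 256)
    flow_img = torch.randn(1, 3, 256, 256)
    coarse = [torch.randn(1, 2, 256, 256) for _ in range(5)]
    print(fuse(frame, flow_img, coarse).shape)   # -> torch.Size([1, 1, 256, 256])
```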

3.5 Model Learning

In our model, the three key steps, i.e. feature aggregation, saliency prediction and saliency fusion, are trained in an end-to-end manner. Given the training set $D_{train} = \{(I_n, OP_n, G_n)\}_{n=1}^{N}$ with $N$ training samples, in which $I_n = \{I^{j}_{n}, j = 1, \cdots, N_p\}$, $OP_n = \{OP^{j}_{n}, j = 1, \cdots, N_p\}$, and $G_n = \{G^{j}_{n}, j = 1, \cdots, N_p\}$ represent the current frame, the optical flow image and the binary ground truth with $N_p$ pixels, respectively. $G^{j}_{n} = 1$ denotes that the $j$-th pixel belongs to the salient object, and $G^{j}_{n} = 0$ otherwise. Inspired by the effort in [45], a weighted cross-entropy loss is employed to handle the imbalanced data (i.e. salient and non-salient pixels). Here, we drop the subscript $n$ and use $\{I, OP\}$ to represent each frame. Thus, the loss function is presented as:

$$L(W, b) = -\alpha \sum_{j \in G_{+}} \log P\left(G^{j} = 1 \mid I, OP; W, b\right) - (1-\alpha) \sum_{j \in G_{-}} \log P\left(G^{j} = 0 \mid I, OP; W, b\right), \qquad (8)$$

where $b$ and $W$ denote the convolutional layers' biases and kernel weights, respectively. $G_{+}$ and $G_{-}$ are the pixel sets of the salient object regions and the background regions, respectively. The notation $\alpha$ denotes the proportion of salient object pixels, namely $\alpha = |G_{+}|/|G_{-}|$. The expression $P\left(G^{j} = 1 \mid I, OP; W, b\right)$ is the probability of a pixel belonging to salient regions, while the probability of a pixel being background is denoted as $P\left(G^{j} = 0 \mid I, OP; W, b\right)$.
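For reference, a minimal PyTorch version of the weighted cross-entropy in Eq. (8) is sketched below, assuming the network output has already been converted into a per-pixel foreground probability; the variable names are illustrative.

```python
import torch

def weighted_bce_loss(p_fg, gt, eps=1e-6):
    """Eq. (8): weighted cross-entropy over salient (G+) and background (G-) pixels.

    p_fg : predicted probability of each pixel being salient, shape (N, 1, H, W)
    gt   : binary ground truth of the same shape
    alpha is the proportion of salient pixels, alpha = |G+| / |G-| as in the paper.
    """
    pos, neg = gt.sum(), (1 - gt).sum()
    alpha = pos / neg.clamp(min=1.0)
    loss_pos = -(gt * torch.log(p_fg + eps)).sum()            # sum over j in G+
    loss_neg = -((1 - gt) * torch.log(1 - p_fg + eps)).sum()  # sum over j in G-
    return alpha * loss_pos + (1 - alpha) * loss_neg

if __name__ == "__main__":
    pred = torch.rand(1, 1, 256, 256)
    gt = (torch.rand(1, 1, 256, 256) > 0.8).float()
    print(weighted_bce_loss(pred, gt).item())
```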

The entire model is trained in an end-to-end way, and the pre-trained Amulet model [45] is used to initialize the parameters of the feature extraction module. The parameters of the remaining layers are initialized using "msra" [10]. To minimize the loss function in Eq. (8), we adopt the stochastic gradient descent (SGD) algorithm to train our model, namely:

$$\left(W^{*}, b^{*}\right) = \arg\min_{W, b} L(W, b), \qquad (9)$$

where $W^{*}$ and $b^{*}$ denote the optimal kernel weights and biases of the convolutional layers, respectively. Here, the SGD parameters, i.e. momentum, number of iterations, base learning rate, mini-batch size and weight decay, are set to 0.9, 22000, $10^{-8}$, 32, and 0.0001, respectively. Besides, once the training loss converges, the learning rate is decreased by a factor of 0.1. During the training stage of our model, we do not use validation datasets to prevent over-fitting. Meanwhile, after the training loss converges, the performance of the generated models is very close to one another. Therefore, we randomly select a model (12000 training iterations) as the reported model.
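These solver settings map directly onto a standard SGD configuration. The sketch below shows one way to express them in PyTorch; since the original implementation uses Caffe, this is a translation rather than the authors' code, and the learning-rate milestone is an illustrative choice, not a reported one.

```python
import torch

def build_optimizer(model):
    # Hyperparameters reported in Section 3.5: momentum 0.9, base learning rate 1e-8,
    # weight decay 1e-4, mini-batch size 32, 22000 iterations in total.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-8,
                                momentum=0.9, weight_decay=1e-4)
    # Drop the learning rate by a factor of 0.1 once the training loss converges;
    # the milestone used here (iteration 12000) is only an assumption.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12000], gamma=0.1)
    return optimizer, scheduler
```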
4 Experimental Results

In this part, the video datasets and implementation details are first introduced. Then, comprehensive qualitative and quantitative comparison results are provided in Section 4.1. Some validation tests are performed in Section 4.2. After that, we present some failure cases in Section 4.3. Finally, the computation cost of our model is presented in Section 4.4.

For training and testing our model, three video datasets, namely DAVIS [30], UVSD [27] and SegTrackV2 [23], are employed in this paper. Each frame in each video of these datasets is provided with a pixel-wise annotation. Similar to [37], the whole SegTrackV2 dataset and the training set of the DAVIS dataset are collected as training data. For testing, the performance of the proposed model is evaluated on the UVSD dataset and the test set of the DAVIS dataset. Besides, we also adopt some augmentation operations to obtain good performance without over-fitting, where we successively perform rotation (90°, 180°, and 270°) and mirror reflection on each image of the training set. In this way, compared to the original training set, our training data are augmented by 8 times.
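A minimal sketch of this 8x augmentation is given below: each sample is kept in four orientations (0°, 90°, 180°, 270°), each with and without a horizontal mirror, and the same transform is applied to the frame, its optical flow image and the ground truth. The flow image is treated as an ordinary color image here, as the text suggests.

```python
import numpy as np

def augment_sample(frame, flow_img, gt):
    """8x augmentation: the original plus 90/180/270 degree rotations,
    each version also mirrored horizontally."""
    samples = []
    for k in range(4):                                        # 0, 90, 180, 270 degrees
        imgs = [np.rot90(x, k) for x in (frame, flow_img, gt)]
        samples.append(tuple(imgs))
        samples.append(tuple(np.fliplr(x) for x in imgs))     # mirror reflection
    return samples                                            # 8 augmented copies

if __name__ == "__main__":
    f = np.zeros((256, 256, 3))
    of = np.zeros((256, 256, 3))
    g = np.zeros((256, 256))
    print(len(augment_sample(f, of, g)))   # -> 8
```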
4.1 Performance Comparison

To evaluate the performance of the proposed STI-Net, we perform qualitative and quantitative comparisons against state-of-the-art saliency models, including SGSP [27], GD [38], SFCD [5], VSFCN [37], PDB [33], SSAV [9], MC [46] and Amulet [45], on the whole UVSD dataset and the test set of the DAVIS dataset. Among them, SGSP, GD, SFCD, VSFCN, PDB and SSAV are video saliency models, while MC and Amulet are deep learning based image saliency models. For a fair comparison, the source codes of SGSP, GD, SFCD, VSFCN, MC and Amulet are directly provided by their authors, and the saliency maps generated by the different models are normalized to the same resolution as the original videos, with pixel values ranging from 0 to 255. In the following, qualitative and quantitative comparisons are performed successively.

4.1.1 Qualitative Evaluation

Figs. 6 and 7 show qualitative comparisons of our model and the state-of-the-art saliency models on UVSD and DAVIS, respectively. According to Figs. 6(a) and 7(a), it can be seen that these videos exhibit various complex scenarios such as shape complexity, occlusion, non-rigid deformation, motion blur and so on. Therefore, popping out the salient objects in these videos is a challenging task. The results of the video saliency models SGSP, GD and SFCD are shown in Figs. 6(d,e,f) and 7(d,e,f). We can find that they highlight the main parts of the salient objects. However, they mistakenly highlight some background regions around the salient objects. The deep learning based image saliency models MC and Amulet, shown in Figs. 6(h,i) and 7(h,i), exhibit more promising performance compared to the aforementioned video saliency models. Unfortunately, due to the lack of temporal information, these two models often highlight some background regions incorrectly. Further, for the deep learning based video saliency model VSFCN, shown in Figs. 6(g) and 7(g), we can see that some background regions are also popped out in the videos with fast motion and non-rigid deformation, such as the bottom example in Fig. 6 and the middle example in Fig. 7. The two top-level deep learning based models PDB and SSAV achieve satisfactory results, as shown in Figs. 6(j,k) and 7(j,k). Compared to PDB and SSAV, the results of our model shown in Figs. 6(c) and 7(c) achieve a comparable performance on all videos, even for the cluttered background (the top example in Fig. 7) and fast motion with non-rigid deformation (the middle example in Fig. 6). We can find that our saliency maps not only give a complete prediction of the salient objects, but also preserve precise boundary details. Overall, all the results shown in Fig. 6 and Fig. 7 clearly demonstrate the effectiveness and superiority of our model (STI-Net), which generates high-quality saliency maps.

Fig. 6 Visualization comparison of different saliency models on some videos in the UVSD dataset. (a): Input video frames, (b): binary ground truths, (c): Ours, (d): SGSP [27], (e): GD [38], (f): SFCD [5], (g): VSFCN [37], (h): MC [46], (i): Amulet [45], (j): PDB [33], (k): SSAV [9].

Fig. 7 Visualization comparison of different saliency models on some videos in the DAVIS dataset. (a): Input video frames, (b): binary ground truths, (c): Ours, (d): SGSP [27], (e): GD [38], (f): SFCD [5], (g): VSFCN [37], (h): MC [46], (i): Amulet [45], (j): PDB [33], (k): SSAV [9].

4.1.2 Quantitative Evaluation

We employ three standard metrics, the precision-recall (PR) curve, the F-measure curve, and the mean absolute error (MAE) value, to quantitatively evaluate the performance of the different saliency models. Here, the default settings of these three metrics in prior works [7, 37] are adopted.
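For reference, the sketch below shows one common way to compute the three metrics on a single frame (255 binarization thresholds for the PR and F-measure curves, and beta squared set to 0.3 as is standard in salient object detection); the exact protocol follows [7, 37], so the details here should be read as assumptions.

```python
import numpy as np

def evaluate_frame(sal, gt, beta2=0.3):
    """Per-frame PR points, F-measure curve and MAE for a saliency map in [0, 1]
    and a binary ground truth; in practice the values are averaged over a dataset."""
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)   # normalize to [0, 1]
    gt = gt.astype(bool)
    precisions, recalls, fmeasures = [], [], []
    for t in np.linspace(0, 1, 255):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        prec = tp / (pred.sum() + 1e-8)
        rec = tp / (gt.sum() + 1e-8)
        precisions.append(prec)
        recalls.append(rec)
        fmeasures.append((1 + beta2) * prec * rec / (beta2 * prec + rec + 1e-8))
    mae = np.abs(sal - gt.astype(np.float64)).mean()           # mean absolute error
    return np.array(precisions), np.array(recalls), np.array(fmeasures), mae

if __name__ == "__main__":
    sal = np.random.rand(256, 256)
    gt = np.random.rand(256, 256) > 0.7
    p, r, f, mae = evaluate_frame(sal, gt)
    print(f.max(), mae)
```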

Fig. 8 Quantitative results of different saliency models: (a) PR curves, (b) F-measure curves, and (c) MAE values. From top to bottom, each row presents the results of the compared models on the UVSD and DAVIS datasets, respectively. The MAE values reported in (c) are, for UVSD: Ours 0.0461, SGSP 0.1635, GD 0.111, SFCD 0.0595, VSFCN 0.0599, MC 0.0825, Amulet 0.1039, PDB 0.0182, SSAV 0.0251; and for DAVIS: Ours 0.0303, SGSP 0.1345, GD 0.1059, SFCD 0.0601, VSFCN 0.0547, MC 0.0736, Amulet 0.0871, PDB 0.0278, SSAV 0.028.

The PR curves of all saliency models are presented in Fig. 8(a). Except for the top-level models PDB and SSAV, our model achieves the best performance on both datasets. In particular, on the DAVIS dataset, our model can pop out the salient objects in videos precisely. The F-measure curves are shown in Fig. 8(b), in which our model performs best when compared with the other saliency models, including SGSP, GD, SFCD, VSFCN, MC, and Amulet. Similarly, the MAE values presented in Fig. 8(c) lead to the same conclusion: our model attains the lowest MAE value when compared with the six saliency models SGSP, GD, SFCD, VSFCN, MC, and Amulet. Here, we should note that the performance of our model is lower than that of the two top-level models PDB and SSAV in terms of PR curves, F-measure curves, and MAE values, as shown in the top row of Fig. 8. Interestingly, as shown in the bottom row of Fig. 8, our model presents a comparable performance to PDB and SSAV on the DAVIS dataset. This also indicates that our model is well suited to the DAVIS dataset. Overall, the PR curves, F-measure curves and MAE values shown in Fig. 8 convincingly demonstrate the effectiveness of the proposed spatiotemporal integration network (STI-Net).

Thus, according to the qualitative and quantitative comparison results shown in Figs. 6, 7 and 8, we can firmly demonstrate the effectiveness of our model. The reason behind this can be attributed to the sufficient exploitation of spatial and temporal information via feature aggregation, saliency prediction and saliency fusion, which gives a powerful characterization of the salient objects in complex video scenarios. Besides, we also insert the boundary cues into the saliency prediction step, which guarantees the accuracy of the generated saliency maps. Based on this, our model characterizes salient objects in videos effectively, and thus can perform very well on some complex scenes.

4.2 Validation of the Proposed Model

In this subsection, we first study the contribution of each component (namely feature aggregation, boundary cues, original information and "shortcut connections") in our model. Then, we analyze the differences between our model and Amulet [45]. Thirdly, we explore in depth the effects of boundary information and temporal information, respectively. Lastly, we validate the effects of the five-level feature blocks in our network.

Firstly, to validate the contributions of feature aggregation, boundary cues, original information (current frame and optical flow image), and "shortcut connections" to the final performance, we performed the ablation studies shown in Fig. 9. The variants of our model without "shortcut connections", original information, boundary cues and feature aggregation are denoted as "woSC", "woOI", "woBC" and "woAGG", respectively. Here, "woOI" means that we do not incorporate the original information, namely the current frame and its optical flow image, in the saliency fusion step. "woBC" denotes that we do not adopt the boundary cues obtained from conv1-2 in the saliency prediction step. "woAGG" refers to the variant in which our model treats the current frame and the optical flow image separately, and there is no interaction between them. Specifically, firstly, our model acquires the best performance when compared with woAGG, woBC, and woSC in terms of PR curves, F-measure curves, and average MAE values. In particular, our model outperforms woAGG and woBC by a large margin, which clearly demonstrates the effectiveness of the interaction in the feature aggregation step and the usefulness of the boundary cues in the saliency prediction step. Besides, our model also performs better than woSC, which indicates the necessity of the "shortcut connections". Secondly, woOI achieves a performance competitive with our model in terms of F-measure curves and average MAE values, as shown in Figs. 9(b) and (c). Nonetheless, we can also find that, in terms of the PR curves shown in Fig. 9(a), our model outperforms woOI by a small margin. Therefore, in terms of all three metrics, our model performs better than woOI, which indicates the effectiveness of the original information adopted in the saliency fusion step. According to the above studies, we see that each component in our model is beneficial for obtaining satisfactory performance, and thus the rationality of our model has been verified.
Fig. 9 Quantitative comparison for the model analysis on the DAVIS dataset: (a) PR curves, (b) F-measure curves, and (c) MAE values.

Fig. 10 Quantitative comparisons for the differences between our model and Amulet on the DAVIS dataset: (a) PR curves, (b) F-measure curves, and (c) MAE values.

Secondly, our model is constructed based on Amulet [45]; therefore, it is essential to analyze the differences between our model and Amulet. Some comparisons are performed in this part, as shown in Fig. 10. We concatenate the temporal data (i.e. the optical flow image) and the current frame as the input of Amulet. Amulet is retrained with the same training data as ours, and the retrained Amulet is denoted as "Amulet-OF". Besides, for the boundary information, the differences between our model and Amulet lie in two aspects, namely the generation and the usage of boundary information. In our model, the boundary information is generated by using motion information and appearance information, as shown in Fig. 1. In contrast, Amulet only utilizes the appearance information. As for the usage of boundary information, Amulet employs a summation, where the boundary information is added to the saliency prediction, whereas our model adopts concatenation. Thus, we make a modification to our model, namely adopting the same way as Amulet for the usage of boundary information. This new version of the model is denoted as "Ours-B". We can see from Fig. 10 that our model (i.e. "Ours") performs best when compared with Amulet, Ours-B and Amulet-OF in terms of all three metrics. This clearly validates the differences between Amulet and our model, where our model provides a sufficient exploitation of temporal information and an effective design using boundary information.

Thirdly, to further analyze the effects of boundary information, we perform some comparisons among different versions of our model. If the boundary information originates from the current frame only, we denote the variant as "Ours-B-RGB". If the boundary information originates from the optical flow image only, we denote it as "Ours-B-OF". According to the experimental results shown in Fig. 11, it can be seen that our model achieves the best performance when compared with Ours-B-OF and Ours-B-RGB in terms of PR curves, F-measure curves, and average MAE values. Furthermore, by comparing our model with woBC shown in Fig. 9, we can demonstrate our model's effectiveness and the rationality of our boundary design.

Besides, as the temporal information is the key point for video saliency detection, we conduct a performance evaluation of our model and of our model without the optical flow input branch, which is denoted as "Ours-woOF". The corresponding results are shown in Fig. 12. It can be seen that our model performs better than Ours-woOF by a significant margin in terms of PR curves, F-measure curves, and average MAE values. This clearly indicates the effect of temporal information, which is very important for video saliency detection.

Fig. 11 Quantitative results of the boundary information's effect on the DAVIS dataset: (a) PR curves, (b) F-measure curves, and (c) MAE values.

Fig. 12 Quantitative results of the temporal information's effect on the DAVIS dataset: (a) PR curves, (b) F-measure curves, and (c) MAE values.

Fig. 13 Quantitative comparisons for the effects of the five-level blocks in our model on the DAVIS dataset: (a) PR curves, (b) F-measure curves, and (c) MAE values.

Furthermore, we also conduct a qualitative comparison, as shown in Fig. 14, for two situations: 1) the salient object is moving between the two frames, so that it is highlighted by the optical flow visualization used in our implementation; 2) the salient object is not the one that is moving between the two frames, so that it is not highlighted by the optical flow visualization. The top two examples belong to the first situation, i.e. the salient object is highlighted in the optical flow visualization. It can be seen that the obtained saliency maps shown in Fig. 14(d) are very similar to the ground truths shown in Fig. 14(b). The bottom two examples correspond to the second situation, where the salient object moves slowly or is stationary. It can be found that the optical flow images of the bottom two examples do not highlight the goat and the mallard. Fortunately, the corresponding saliency maps shown in Fig. 14(d) are still satisfactory when compared with the ground truths shown in Fig. 14(b). The reason behind this lies in the sufficient usage of spatial information and temporal information in our model, in which they complement each other. Therefore, we can conclude that the proposed model is robust and effective for videos containing both stationary objects and slowly moving objects.

Fig. 14 Qualitative results regarding the effect of temporal information. (a): Input video frames, (b): binary ground truths, (c): optical flow images, (d): saliency maps generated by our model.

Fig. 15 Failure examples. (a): case 1, (b): case 2. Here, the top row shows the input video frames, the second row the ground truths, and the third row our results.

Lastly, in our model, the features originating from the five blocks are all utilized to obtain the saliency map. Thus, in this part, we perform some comparisons to show the performance gain obtained by adding features from each block, as shown in Fig. 13. "Ours-Sn" means that our model takes the integrated features originating from blocks $\{1, 2, \ldots, n-1, n\}$. For example, Ours-S3 denotes that our model takes the integrated features from block 1, block 2 and block 3. According to Fig. 13, our model, which takes the integrated features from all blocks, achieves the best performance in terms of all the aforementioned three metrics. This clearly verifies the effectiveness of our model, and also indicates the rationality of the model design.

4.3 Failure Examples and Analysis

Although our model is able to achieve good results in most cases, it is still difficult for our model to efficiently deal with some challenging scenarios. Two examples are shown in Fig. 15. The first case, presented in Fig. 15(a), shows a gymnast with black trousers (i.e. the salient object) performing on a pommel horse. Note that the audience forms a cluttered background in this video, and the heterogeneous salient object also exhibits non-rigid deformations. Our model only highlights part of the salient object, and some background regions are highlighted falsely. The reason behind this is that the extracted spatial and temporal information is inaccurate in such complicated scenes, e.g. a low-contrast background with confusing visual appearance, a cluttered background, and non-rigid deformation. The second case in Fig. 15(b) presents a situation where the camera has severe jitter, and the cyclist is the salient object to be detected. It can be seen that our model fails to highlight the salient object uniformly, and the background is not effectively suppressed. This can be attributed to two aspects, namely the inaccurate optical flow images and the small salient object, which lead to a poor characterization of the temporal information.

4.4 Computation Cost

We present and analyze the computation cost of our model in this part. Our model is implemented on the Matlab R2014a platform together with the Caffe toolbox. The configuration of our PC includes an i7-4790K CPU (32 GB RAM) and a single NVIDIA Titan XP GPU (with 12 GB memory). The training process costs nearly 48 hours under this configuration. In the test phase, the optical flow computation takes 33.542 seconds for a video frame with a resolution of 480×854. The generation of the saliency map only needs 0.111 seconds. Therefore, it can easily be found that the computation of the optical flow is the bottleneck of our model in terms of computation cost. To accelerate the runtime of our model, we can adopt the recent deep learning based optical flow computation method [15], which takes 5.45 seconds for a 480×854 video frame.

5 Conclusion

This paper presented a novel end-to-end spatiotemporal integration network (STI-Net) for popping out salient objects in videos, which includes three key steps: feature aggregation, saliency prediction, and saliency fusion. The key advantage of our model lies in the sufficient utilization of spatial and temporal information. Firstly, in the feature aggregation step, the interaction between spatial and temporal deep features provides an effective description of salient objects in videos. Secondly, in the saliency prediction step, the incorporation of the boundary cues originating from the spatial and temporal domains is conducive to generating accurate prediction results. Finally, in the saliency fusion step, taking into account the original information, such as the current frame and optical flow image, allows for the correction of errors in the prediction results. Extensive experiments are carried out on two challenging

video datasets, and the experimental results show that 10. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rec-
the proposed spatiotemporal saliency network presents tifiers: Surpassing human-level performance on imagenet
classification. In: International Conference on Computer
a comparable performance when compared with the
Vision, ICCV, pp. 1026–1034. IEEE (2015)
state-of-the-art saliency models, demonstrating the ef- 11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning
fectiveness of our model. In future work, we would de- for image recognition. In: Computer Vision and Pattern
ploy the efficient optical flow computation method into Recognition, CVPR, pp. 770 –778. IEEE (2016)
12. Hossain, M.S., Muhammad, G., Abdul, W., Song, B.,
the entire network in an effective way and promote the Gupta, B.B.: Cloud-assisted secure video transmission
advancement of video saliency detection. and sharing framework for smart cities. Future Gener-
Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61901145, 62106150), in part by the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grants GK229909299001-009, and in part by the Hangzhou Dianzi University (HDU) and the China Electronics Corporation DATA (CEC-DATA) Joint Research Center of Big Data Technologies under Grant KYH063120009.

References

1. Alsmirat, M.A., Jararweh, Y., Obaidat, I., Gupta, B.B.: Automated wireless video surveillance: an evaluation framework. Journal of Real-Time Image Processing 13(3), 527–546 (2017)
2. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (2017)
3. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International Journal of Computer Vision 92(1), 1–31 (2011)
4. Brox, T., Malik, J.: Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(3), 500–513 (2011)
5. Chen, C., Li, S., Wang, Y., Qin, H., Hao, A.: Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion. IEEE Transactions on Image Processing 26(7), 3156–3170 (2017)
6. Chen, P., Lai, J., Wang, G., Zhou, H.: Confidence-guided adaptive gate and dual differential enhancement for video salient object detection. In: IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021)
7. Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), 569–582 (2015)
8. Choi, C., Wang, T., Esposito, C., Gupta, B.B., Lee, K.: Sensored semantic annotation for traffic control based on knowledge inference in video. IEEE Sensors Journal 21(10), 11758–11768 (2021)
9. Fan, D.P., Wang, W., Cheng, M.M., Shen, J.: Shifting more attention to video salient object detection. In: Computer Vision and Pattern Recognition, CVPR, pp. 8554–8564. IEEE (2019)
10. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: International Conference on Computer Vision, ICCV, pp. 1026–1034. IEEE (2015)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, CVPR, pp. 770–778. IEEE (2016)
12. Hossain, M.S., Muhammad, G., Abdul, W., Song, B., Gupta, B.B.: Cloud-assisted secure video transmission and sharing framework for smart cities. Future Generation Computer Systems 83, 596–606 (2018)
13. Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.H.: Deeply supervised salient object detection with short connections. In: Computer Vision and Pattern Recognition, CVPR, pp. 3203–3212. IEEE (2017)
14. Huang, C.R., Chang, Y.J., Yang, Z.X., Lin, Y.Y.: Video saliency map detection by dominant camera motion removal. IEEE Transactions on Circuits and Systems for Video Technology 24(8), 1336–1349 (2014)
15. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Computer Vision and Pattern Recognition, CVPR, pp. 1647–1655. IEEE (2017)
16. Itti, L., Baldi, P.: A principled approach to detecting surprising events in video. In: Computer Vision and Pattern Recognition, CVPR, pp. 631–637. IEEE (2005)
17. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998)
18. Jian, M., Wang, J., Yu, H., Wang, G.G.: Integrating object proposal with attention networks for video saliency detection. Information Sciences 576, 819–830 (2021)
19. Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., Li, S.: Salient object detection: A discriminative regional feature integration approach. In: Computer Vision and Pattern Recognition, CVPR, pp. 2083–2090. IEEE (2013)
20. Jiao, Y., Wang, X., Chou, Y.C., Yang, S., Ji, G.P., Zhu, R., Gao, G.: Guidance and teaching network for video salient object detection. In: IEEE International Conference on Image Processing (ICIP), pp. 2199–2203. IEEE (2021)
21. Ke, Y.Y., Tsubono, T.: Recursive contour-saliency blending network for accurate salient object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2940–2950 (2022)
22. Khan, Z., Latif, B., Kim, J., Kim, H.K., Jeon, M.: DenseBert4Ret: Deep bi-modal for image retrieval. Information Sciences 612, 1171–1186 (2022)
23. Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: International Conference on Computer Vision, ICCV, pp. 2192–2199. IEEE (2013)
24. Li, Y., Li, S., Chen, C., Hao, A., Qin, H.: A plug-and-play scheme to adapt image saliency deep model for video data. IEEE Transactions on Circuits and Systems for Video Technology 31(6), 2315–2327 (2020)
25. Li, Y., Liu, G., Liu, Q., Sun, Y., Chen, S.: Moving object detection via segmentation and saliency constrained RPCA. Neurocomputing 323, 352–362 (2019)
26. Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(2), 353–367 (2011)
27. Liu, Z., Li, J., Ye, L., Sun, G., Shen, L.: Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation. IEEE Transactions on Circuits and Systems for Video Technology 27(12), 2527–2542 (2017)
28. Liu, Z., Zou, W., Le Meur, O.: Saliency tree: A novel saliency detection framework. IEEE Transactions on Image Processing 23(5), 1937–1952 (2014)
29. Mahadevan, V., Vasconcelos, N.: Spatiotemporal saliency in dynamic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 171–177 (2010)
30. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Computer Vision and Pattern Recognition, CVPR, pp. 724–732. IEEE (2016)
31. Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition 106, 107404 (2020)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations, pp. 1–14 (2015)
33. Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.: Pyramid dilated deeper ConvLSTM for video salient object detection. In: European Conference on Computer Vision, ECCV, pp. 715–731. Springer (2018)
34. Song, S., Jia, Z., Yang, J., Kasabov, N.: Salient detection via the fusion of background-based and multiscale frequency-domain features. Information Sciences 618, 53–71 (2022)
35. Tang, Y., Wu, X.: Saliency detection via combining region-level and pixel-level predictions with CNNs. In: European Conference on Computer Vision, ECCV, pp. 809–825. Springer (2016)
36. Wang, W., Shen, J., Shao, L.: Consistent video saliency using local gradient flow optimization and global refinement. IEEE Transactions on Image Processing 24(11), 4185–4196 (2015)
37. Wang, W., Shen, J., Shao, L.: Video salient object detection via fully convolutional networks. IEEE Transactions on Image Processing 27(1), 38–49 (2018)
38. Wang, W., Shen, J., Yang, R., Porikli, F.: Saliency-aware video object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(1), 20–33 (2018)
39. Wen, H., Zhou, X., Sun, Y., Zhang, J., Yan, C.: Deep fusion based video saliency detection. Journal of Visual Communication and Image Representation 62, 279–285 (2019)
40. Wu, Z., Su, L., Huang, Q.: Decomposition and completion network for salient object detection. IEEE Transactions on Image Processing 30, 6226–6239 (2021)
41. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)
42. Xu, M., Fu, P., Liu, B., Li, J.: Multi-stream attention-aware graph convolution network for video salient object detection. IEEE Transactions on Image Processing 30, 4183–4197 (2021)
43. Xu, X., Liu, W., Yu, L.: Trajectory prediction for heterogeneous traffic-agents using knowledge correction data-driven model. Information Sciences 608, 375–391 (2022)
44. Zhang, M., Liu, J., Wang, Y., Piao, Y., Yao, S., Ji, W., Li, J., Lu, H., Luo, Z.: Dynamic context-sensitive filtering network for video salient object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1553–1563. IEEE (2021)
45. Zhang, P., Wang, D., Lu, H., Wang, H., Ruan, X.: Amulet: Aggregating multi-level convolutional features for salient object detection. In: International Conference on Computer Vision, ICCV, pp. 202–211. IEEE (2017)
46. Zhao, R., Ouyang, W., Li, H., Wang, X.: Saliency detection by multi-context deep learning. In: Computer Vision and Pattern Recognition, CVPR, pp. 1265–1274. IEEE (2015)
47. Zhao, X., Pang, Y., Zhang, L., Lu, H., Zhang, L.: Suppress and balance: A simple gated network for salient object detection. In: European Conference on Computer Vision, pp. 35–51 (2020)
48. Zhou, J., Gan, J., Gao, W., Liang, A.: Image retrieval based on aggregated deep features weighted by regional significance and channel sensitivity. Information Sciences 577, 69–80 (2021)
49. Zhou, X., Liu, Z., Gong, C., Wei, L.: Improving video saliency detection via localized estimation and spatiotemporal refinement. IEEE Transactions on Multimedia 20(11), 2993–3007 (2018)
50. Zhou, X., Liu, Z., Li, K., Sun, G.: Video saliency detection via bagging-based prediction and spatiotemporal propagation. Journal of Visual Communication and Image Representation 51, 131–143 (2018)
Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Credit author statement

Xiaofei Zhou: Conceptualization, Writing-Original Draft Preparation.
Weipeng Cao: Investigation, Funding Acquisition, Writing-Reviewing and Editing.
Hanxiao Gao: Data Curation, Validation.
Zhong Ming: Supervision, Formal Analysis, Software.
Jiyong Zhang: Methodology, Writing-Reviewing and Editing.