
Detecting Marine Organisms Via Joint
Attention-Relation Learning for Marine
Video Surveillance

Zhensheng Shi, Member, IEEE, Cheng Guan, Qianqian Li, Ju Liang, Liangjie Cao,
Haiyong Zheng, Member, IEEE, Zhaorui Gu, Member, IEEE, and Bing Zheng, Member, IEEE

Abstract—A better way to understand marine life and ecosystems is to surveil and analyze the activities of marine organisms. Recently, research on marine video surveillance has become increasingly popular. With the rapid development of deep learning (DL), convolutional neural networks (CNNs) have made remarkable progress in image/video understanding tasks. In this article, we explore a visual attention and relation mechanism for marine organism detection, and propose a new way to apply an improved attention-relation (AR) module on an efficient marine organism detector (EMOD), which can well enhance the discrimination of organisms in complex underwater environments. We design our EMOD by integrating current state-of-the-art (SOTA) detection methods in order to detect organisms and surveil marine environments in a real-time and fast fashion for high-resolution marine video surveillance. We implement our EMOD and AR on the annotated video data sets provided by the public data challenges in conjunction with the workshops (CVPR 2018 and 2019), which are supported by the National Oceanic and Atmospheric Administration (NOAA) and their research works (NMFS-PIFSC-83). Experimental results and visualizations demonstrate that our application of the AR module is effective and efficient, and our EMOD equipped with AR modules can outperform SOTA performance on the experimental data sets. For application requirements, we also provide application suggestions for the EMOD framework. Our code is publicly available at https://github.com/zhenglab/EMOD.

Index Terms—Convolutional neural network (CNN), marine organism detection, marine video surveillance, relation model, visual attention.

Manuscript received 16 December 2020; revised 13 July 2021, 12 November 2021, and 19 January 2022; accepted 23 March 2022. Date of publication 1 June 2022; date of current version 13 October 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 62171421 and Grant 61771440, and in part by the Qingdao Postdoctoral Applied Research Project of China. Preliminary work for this paper was presented at the IEEE/MTS Global OCEANS 2020 Singapore - US Gulf Coast, October 5–31, 2020 [DOI: 10.1109/IEEECONF38699.2020.9389458]. (Zhensheng Shi, Cheng Guan, and Qianqian Li contributed equally to this work.) (Corresponding authors: Haiyong Zheng and Zhaorui Gu.)

Associate Editor: B. Thornton.

Zhensheng Shi is with the Frontiers Science Center for Deep Ocean Multispheres and Earth System, Ocean University of China, Qingdao 266100, China, and also with the Underwater Vision Lab, College of Electronic Engineering, Ocean University of China, Qingdao 266100, China (e-mail: shizhensheng@ouc.edu.cn).

Cheng Guan, Qianqian Li, Ju Liang, Liangjie Cao, Haiyong Zheng, and Zhaorui Gu are with the Underwater Vision Lab, College of Electronic Engineering, Ocean University of China, Qingdao 266100, China (e-mail: guancheng@stu.ouc.edu.cn; liqianqian5957@stu.ouc.edu.cn; liangju@stu.ouc.edu.cn; caoliangjie@stu.ouc.edu.cn; zhenghaiyong@ouc.edu.cn; guzhaorui@ouc.edu.cn).

Bing Zheng is with the Underwater Vision Lab, College of Electronic Engineering, Ocean University of China, Qingdao 266100, China, and also with Sanya Oceanographic Institution, Ocean University of China, Sanya 572024, China (e-mail: bingzh@ouc.edu.cn).

Digital Object Identifier 10.1109/JOE.2022.3162864

I. INTRODUCTION

UNDERSTANDING the behavior and abundance distribution of marine organisms plays a very important role in marine ecosystems, environment monitoring, and marine fishery [1]–[3]. Detecting and recognizing marine organisms has been a very challenging and crucial research topic. To surveil the activity and master the distribution of marine organisms, researchers have made great efforts to capture them in situ and on site with large amounts of high-quality images or videos using modern underwater vehicles, such as autonomous underwater vehicles (AUVs) [4], [5] and remotely operated vehicles (ROVs). Then, biologists are required to manually annotate and recognize the species with their positions for further studies from these large-scale images and videos, which is very time-consuming and laborious, as well as inapplicable to in situ and on-site real-time analysis. Thus, it is necessary to devise a computer vision-based method to detect and recognize organisms for automatic marine video surveillance by using cutting-edge artificial intelligence technologies [6]–[8].

Recently, relying on large amounts of annotated data and computational resources [i.e., graphical processing units (GPUs)], convolutional neural networks (CNNs) have first witnessed remarkable progress in image recognition [9]–[12]. Besides, deep CNNs have been widely applied to the more complicated task of object detection for images and videos, where the objects are not only recognized but also marked with locations, achieving amazing performance and even better results than human beings in applications such as video surveillance [13]–[16]. However, video surveillance in the marine environment is quite different from that in the terrestrial environment, and faces several tough challenges. First, underwater images and videos suffer from strong absorption, scattering, color distortion, and noise from artificial light sources as well as marine snow particles, causing image blur, haziness, and a bluish or greenish tone. Second, underwater objects rarely stay static, especially when marine organisms and underwater vehicles are also moving, resulting in motion blur and various poses of organisms. Third, the underwater imaging environment is complicated and changeable


Fig. 1. Visualization of attention maps in AR module on scallop detection from HabCam data set. Our AR module can enhance the representation of organisms
in complex underwater environments. More visualizations are in Fig. 8.

due to the changes of illumination as well as spatio–temporal scales. Although low-level image enhancement can to some extent solve these problems and make the underwater image data clearer, the general image/video understanding algorithms still struggle to deal with such challenging problems in the marine environment [17], [18].

Visual attention and relation are two fascinating facets of human intelligence, which have been modeled in CNNs and proved to be effective for image/video understanding tasks. Given an image or a video clip, humans focus attention selectively on salient parts of the visual space to acquire useful information and also discover relations among them [19], [20]. In marine video understanding tasks, the capture of visual attention and relation becomes more difficult due to complex underwater environments; therefore, how to overcome the effects of underwater degradation and improve the discrimination of marine organisms is crucial for the study of underwater attention and relation modeling. Motivated by this observation, we propose a new way to apply an improved attention–relation (AR) module on an efficient marine organism detector (EMOD), which explores learning joint AR in CNNs. Our AR module can better understand underwater-specific video by enhancing the representation of organisms (objects). As Fig. 1 shows, we visualize the attention maps in the AR module on scallop detection from the HabCam data set. It can be seen that, although the organism exists in an environment with complex noise (e.g., the impurities in the first column, the blurring and color distortion in the second column, and the haziness of organisms in the third and fourth columns), our joint AR method can well enhance attentive features and capture the target organism.

In this work, we use the challenging marine video data sets, which are provided by the public data challenges [21] in conjunction with the workshops hosted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018 [22] and 2019 [23] Workshops). The data sets are supported by the National Oceanic and Atmospheric Administration (NOAA) and their research works (NMFS-PIFSC-83 [24]), and we take the three sub-data sets for our study: MOUSS, MBARI, and HabCam. Our AR module aims to study the visual attention and relation for precisely detecting marine organisms. In addition, for high-resolution marine video surveillance, we integrate current state-of-the-art (SOTA) detectors and design an EMOD framework, which can detect and surveil marine organisms in a real-time and fast fashion. EMOD will be helpful for marine scientists to effectively analyze the vast amount of survey data collected from the marine environment. The experimental results also demonstrate that our application of the AR module is effective and efficient, and our EMOD equipped with the AR module can exceed the performance of SOTA detectors on the experimental data sets.

A preliminary version of this work has been published in Global OCEANS 2020: Singapore – U.S. Gulf Coast [25]. The present work is a developed version of the previously presented work, which significantly adds to the initial version. First, we propose a new way to apply an improved AR module on our EMOD framework for exploring AR jointly, and also conduct an ablation study as well as extensive experiments to validate the efficacy of our AR module. Then, we improve our original EMOD framework by involving both our AR module and the optimal detector for marine video surveillance. Moreover, we provide suggestions on the EMOD framework for different application requirements.¹

II. MARINE VIDEO SURVEILLANCE

In recent years, video surveillance of the marine environment and marine organisms has attracted wide attention and a lot of related work has been carried out. Hu et al. [26] designed a system that is capable of categorizing different underwater habitat types to monitor and analyze the change and damage of the marine environment. By using time-lapse photography and a suite of marine instruments, Fujii et al. [27] constructed a new underwater monitoring system to determine the dynamic relationship between changing environmental conditions, biological activity, and the actual presence of marine infrastructure. Taormina et al. [2] developed and optimized a reliable

¹[Online]. Available: https://github.com/zhenglab/EMOD


image scoring strategy to improve the underwater image for monitoring benthic ecosystems. The purpose of Rose [1] was to describe the methods and results of acoustic surveys of Smith Sound over seven years in several seasons, and to use these results to design an acoustic-monitoring method for these fish. Smale et al. [4] described the application of AUV technology to conduct ongoing monitoring of benthic habitats at two key locations in Western Australia. An effective fish sampling method proposed by Salman et al. [6] was developed by using underwater video and image processing techniques to automatically estimate, and, thus, monitor fish biomass and groups in water bodies.

A. Marine Organism Detection

Acoustic and optical measurements are widely used in the detection of underwater organisms. There are many acoustic techniques, from calculating the echo of a sonar beam [28], [29] to more complex methods [30]. Tiemann et al. [31] developed a passive acoustic localization technique based on acoustic propagation modeling to monitor marine mammal activities. Optical techniques include classical methods, such as edge-based classifiers [32], and a fish detector and tracker [33] that combines an adaptive Gaussian mixture model detector [34] with a continuously adaptive mean shift algorithm tracker [35]. Mizuno et al. [36] developed a towed optical camera array system named speedy sea scanner (SSS) combined with a DL-based estimation method to monitor benthic marine habitats.

Through fish detection and classification, Villon et al. [37] studied the performance of a support vector machine (SVM) classifier trained on HOG features, and compared it with the performance of fine-tuned CNNs, where the results showed that CNNs are indeed better than traditional methods. Marburg et al. [38] used various neural networks to train an underwater detector and classifier for benthic macrofauna. Siddiqui et al. [39] used deep-layer CNNs and cross-layer pooling methods to enhance the discriminative ability to achieve SOTA performance in the classification of fish species, which can solve the problem of limited training data for labeling. Lu et al. [40] proposed the identification of the species of six common tuna and billfish by using machine vision. Rasmussen et al. [41] utilized YOLOv2 [16] for scallop detection that was able to run in a real-time way. Song et al. [42] applied sidescan sonar (SSS) and an onboard GPU to implement automatic real-time object detection. Aznar et al. [43] proposed a swarm behavior that was formalized with a microscopic model to improve a jellyfish detection system. Tseng et al. [44] proposed an automatic approach for prescreening harvested fish in electronic monitoring system (EMS) videos by using CNNs. Li et al. [3] constructed a phytoplankton microscopic image data set, namely, PMID2019, to train advanced artificial intelligence models for phytoplankton detection.

B. Visual Attention

Attention, in a broad sense, can be regarded as a useful tool, which tends to allocate available processing resources to the most informative part of the input signal [45]–[49]. The benefits of the attention mechanism have been demonstrated in a range of tasks, from image recognition and understanding [50], [51], to sequence-based models [52], [53]. It is usually implemented in combination with gating functions (such as softmax or sigmoid) and sequential technology [54], [55]. Recent work has shown its applicability to tasks such as image captioning [56], [57] and lip reading [58]. In these applications, it is often used on top of one or more layers representing higher level abstractions for adaptation between modalities.

Besides, Vaswani et al. [59] proposed a transformer architecture based on the self-attention mechanism for machine translation, which calculates the response of a position in a sequence (such as a sentence) by attending to all positions and taking their weighted average in the embedding space. Wang et al. [60] viewed self-attention as a form of the nonlocal mean [61] and bridged self-attention for machine translation to the more general class of nonlocal filtering operations that are applicable to image and video problems in computer vision.

C. Relation Model

Recently, relation models have been developed in the areas of visual question answering [62], [63], object detection/recognition [64]–[66], and intuitive physics [67], [68]. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [69], [70] or jointly with the convolutional structure of standard convolutional filters [71] and 1 × 1 convolutions. Much of this work has concentrated on the objective of reducing model parameters and computational complexity, indicating an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, Hu et al. [72] proposed a novel "Squeeze-and-Excitation" module and provided the network unit with a mechanism to explicitly model dynamic and nonlinear dependencies between channels using global information, easing the learning process and enhancing the representational power of the network. Moreover, GENet [73] and PSANet [74] performed rescaling on different channels to recalibrate the channel dependency with global context.

D. Object Detection

The mainstream solutions for object detection usually belong to one of two categories: 1) two-stage detectors and 2) one-stage detectors.

1) Two-Stage Detector: Current SOTA object detectors are usually based on a two-stage, proposal-driven mechanism [75]. The first stage generates a sparse set of candidate object locations, which can be performed by using either basic algorithms for region proposals or a dedicated region proposal CNN. The second stage consists of a CNN that classifies each candidate as one of the object classes or a background class. The two-stage methods, such as the R-CNN series [76], have achieved SOTA accuracy scores on many benchmark data sets [77].

Due to the great success of the R-CNN detector [78], two-stage detectors have become popular in recent years [77]. In order to reduce redundant computation, SPP-Net [79] and Fast R-CNN [80] introduced an idea of regional feature extraction


Fig. 2. Structure and details of our attention–relation (AR) module. We simplify the global attention module and then add channel relation to construct a joint
AR module.

that enabled objects to share a large amount of feature computation. Faster R-CNN [13] proposed a region proposal network to improve the efficiency of detectors and allowed end-to-end training of detectors. After this meaningful milestone, some works aimed to enhance faster R-CNN in different details. For example, R-FCN [81] utilized efficient regionwise full convolutions to avoid the heavy computations of faster R-CNN, and Cascade R-CNN [14] proposed a classic and powerful cascade architecture to extend R-CNN to a multistage detector, while mask R-CNN [76] added a mask branch that refined the detection results with the help of multitask learning.

2) One-Stage Detector: A main disadvantage of the two-stage detectors is that they are time-consuming, which limits them in real-time applications and encourages the development of one-stage detectors [75], [77]. One-stage detectors classify all image regions at once, thus are beneficial to computational efficiency and have become popular in some related tasks.

OverFeat [82] was one of the first modern one-stage object detectors based on deep networks. YOLO [83] outputs very sparse detection results and implemented real-time object detection by passing the image once through an efficient backbone network. SSD [84] was one of the first attempts at using a pyramidal feature hierarchy to reuse the multiscale feature maps from different layers computed in the forward pass. RetinaNet [15] proposed a new focal loss to handle the class imbalance problem of dense object detection, achieving results comparable to two-stage detectors. The main limitation of one-stage detectors is that their accuracy is usually lower than that of two-stage detectors.

III. LEARNING JOINT ATTENTION–RELATION

We propose a new way to apply an improved AR module on our EMOD (see Section IV) to learn joint AR in CNNs for marine organism detection, as shown in Fig. 2. Inspired by [60] and [72], our AR module adopts a nonlocal (NL) structure combined with a squeeze-excitation (SE) structure to learn the global attention and relation in marine videos, which includes two submodules [Fig. 2(d)] as follows: First, we model the global context information by utilizing a global attention submodule; then, a bottleneck transform operation is implemented to model the channel relation among different perspectives. The AR module is designed in a residual structure, which can be flexibly incorporated into many existing CNN architectures. Thanks to the bottleneck structure that greatly reduces model parameters, the AR module is lightweight and can be applied to multiple layers for better modeling the joint AR. We employ the good practice of Cao et al. [85] and implement an improved AR module (see Section III-B and Table IV for details), where an AR module is essentially a computational unit with a transformation mapping of an input $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, which denotes the feature map of an input instance (an image)

$$\mathbf{x} = \{\mathbf{x}_i\}_{i=1}^{P} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P] \in \mathbb{R}^{C \times H \times W}, \quad \mathbf{x}_i \in \mathbb{R}^{C \times 1} \tag{1}$$

where $i$ refers to the index of the query position; $C$ is the channel number; $P = H \times W$ is the number of positions in an input feature map; and $W$ and $H$ are the width and height of the feature map.

A. Attention

A traditional CNN only focuses on the area of the receptive field size. Although the receptive field can be increased by stacking convolutional layers, the receptive field of a convolutional kernel in a specific layer on the original image is still very limited. In order to obtain a global receptive field and extract more information from the original image, Wang et al. [60] proposed a self-attention model as the nonlocal (NL) block [Fig. 2(a)] to capture the global attention. The core nonlocal operation can be expressed as

$$\mathbf{o}_i = \frac{1}{C(\mathbf{x})} \sum_{j=1}^{P} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j) \tag{2}$$

where $\mathbf{x}$ and $\mathbf{o}$ represent the input signal and output signal, respectively; $f(\mathbf{x}_i, \mathbf{x}_j)$ is used to compute the similarity between position $i$ and position $j$; $C(\mathbf{x})$ represents the normalization operation; and $g(\mathbf{x}_j)$ denotes the linear transformation of $\mathbf{x}_j$. In this way, we actually obtain the probability ($\mathbf{o}_i$) that represents the effect of all positions on query position $i$, indicating the global attention. Then the nonlocal block [60] can be defined in

a residual structure as

$$\mathbf{z}_i = \mathbf{x}_i + W_z \mathbf{o}_i = \mathbf{x}_i + W_z \sum_{j=1}^{P} \frac{f(\mathbf{x}_i, \mathbf{x}_j)}{C(\mathbf{x})}\, g(\mathbf{x}_j). \tag{3}$$

In particular, there are four different forms [60] to calculate $f(\mathbf{x}_i, \mathbf{x}_j)$; here, we set $f(\mathbf{x}_i, \mathbf{x}_j) = \exp((W_q \mathbf{x}_i)^{T}(W_k \mathbf{x}_j))$ and $C(\mathbf{x}) = \sum_{j=1}^{P} f(\mathbf{x}_i, \mathbf{x}_j)$, which are commonly used in self-attention modules [59], [86], [87]. Similarly, we also employ the 1 × 1 convolutional filter $W_v$ as the linear transformation matrix to construct $g(\mathbf{x}_j) = W_v \cdot \mathbf{x}_j$. So the detailed nonlocal block can be represented as [Fig. 2(a)]

$$\mathbf{z}_i = \mathbf{x}_i + W_z \sum_{j=1}^{P} \frac{\exp((W_q \mathbf{x}_i)^{T}(W_k \mathbf{x}_j))}{\sum_{m=1}^{P} \exp((W_q \mathbf{x}_i)^{T}(W_k \mathbf{x}_m))} (W_v \cdot \mathbf{x}_j). \tag{4}$$

Further, Cao et al. [85] found that the attention maps for different query positions are almost the same, so that the global attention operation can be simplified by computing a global (query-independent) attention map and sharing this global attention map for all query positions. By doing so, $W_z$ can be dropped since the channel number will not be changed.

Here, we also follow these observations and simplify the NL block to [Fig. 2(b)]

$$\mathbf{z}_i = \mathbf{x}_i + \sum_{j=1}^{P} \frac{\exp(W_k \mathbf{x}_j)}{\sum_{m=1}^{P} \exp(W_k \mathbf{x}_m)} (W_v \cdot \mathbf{x}_j) \tag{5}$$

where $W_q$ and $W_z$ are dropped from (4). Then, for further reducing the computation cost, according to the distributive law, the simplified NL block can be modified as [Fig. 2(c)]

$$\mathbf{z}_i = \mathbf{x}_i + W_v \sum_{j=1}^{P} \frac{\exp(W_k \mathbf{x}_j)}{\sum_{m=1}^{P} \exp(W_k \mathbf{x}_m)}\, \mathbf{x}_j. \tag{6}$$

The simplified nonlocal block can be viewed as a residual structure with two parts: 1) attention, which calculates a query-independent global attention map for all query positions; and 2) relation, which captures the relation between channels. We abstract the simplified NL block as an AR residual framework with the following definition [Fig. 2(d)]:

$$\mathbf{z}_i = \mathbf{x}_i + R(\mathbf{y}) \tag{7}$$

where $\mathbf{y}$ represents the output of the attention module, which obtains the global attention by grouping the features of all positions; and $R(\cdot)$ denotes the relation module, which models channelwise dependencies.

For the attention module, based on (6), the output $\mathbf{y}$ can be expressed as

$$\mathbf{y} = \sum_{j=1}^{P} \frac{\exp(W_k \mathbf{x}_j)}{\sum_{m=1}^{P} \exp(W_k \mathbf{x}_m)}\, \mathbf{x}_j. \tag{8}$$

B. Relation

In order to model channelwise feature dependencies, we employ a two-layer MLP to construct the relation module, which utilizes two adjacent fully connected layers to form a bottleneck structure for modeling the relations between channels, and outputs the same number of channels as the input features. We define our relation operation as

$$R(\cdot) = \sigma(W_{c2}(\delta(W_{c1}(\cdot)))) \tag{9}$$

where $W_{c1}$ and $W_{c2}$ represent the two fully connected layers of the MLP, $\delta(\cdot)$ denotes the ReLU activation function, and $\sigma(\cdot)$ denotes the Sigmoid activation function.

Thus, our AR residual framework (7) can be illustrated as

$$\mathbf{z}_i = \mathbf{x}_i + \sigma\!\left(W_{c2}\!\left(\delta\!\left(W_{c1}\!\left(\sum_{j=1}^{P} \frac{\exp(W_k \mathbf{x}_j)}{\sum_{m=1}^{P} \exp(W_k \mathbf{x}_m)}\, \mathbf{x}_j\right)\right)\right)\right). \tag{10}$$

Note that our improved AR module adopts the standard squeeze-excitation (SE) [72] structure (FC-ReLU-FC-Sigmoid) to learn relations across multiple channel perspectives in marine video data sets, while Cao et al. [85] only used Conv layers in their module and directly fused these convolutional features without capturing relations. We show the improved performance of our AR module in Table IV.

IV. EFFICIENT MARINE ORGANISM DETECTOR

A. EMOD Framework

Our new way for the application of the AR mechanism is to employ the AR module (Section III) on our designed EMOD framework. The EMOD can detect and recognize marine organisms from high-resolution marine surveillance videos in an efficient and fast way. The EMOD framework integrates current SOTA detectors and consists of three parts: the backbone (basic network), the SOTA detector, and the AR (joint attention-relation) module. The backbone includes three kinds of basic networks: 1) ResNet-50, 2) ResNeXt-101, and 3) DarkNet-53. The SOTA detector integrates two-stage and single-stage detection methods to complete organism detection based on marine surveillance videos. The two-stage detectors can achieve high-precision detection performance, and are based on faster R-CNN (FR, [13]) and Cascade R-CNN (CR, [14]); the one-stage detectors can achieve high-efficiency real-time detection performance, and are based on YOLO (YL) and RetinaNet (RN). By equipping it with the AR module, the EMOD can learn joint attention-relation from complex marine video contents, so as to further improve the accuracy of marine organism detection. We implement the AR module on the EMOD framework by plugging AR modules into every convolutional block (such as a res block, see Fig. 3) of the backbone network. Experiments show that the EMOD framework can achieve high-efficiency, real-time, and accurate detection of marine organisms. We note that our EMOD is an extensible framework; the experiments of this work are based on the current version. We will enrich the framework in future versions by integrating more backbones, SOTA detectors, and marine video methods (such as the AR module).
ploy a two-layer MLP to construct the relation module, which and marine video methods (such as AR module).


Fig. 3. Framework of our efficient marine organism detector (EMOD).
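To connect Fig. 3 with the description above of plugging AR modules into every convolutional block of the backbone, the sketch below wraps each bottleneck block of a torchvision ResNet-50 with the `ARModule` from the previous sketch. The wrapper class and the torchvision calls (including the `pretrained` flag and the `conv3` attribute of the bottleneck block) are illustrative assumptions about that library; EMOD integrates the module inside its own detector configurations.

```python
import torch.nn as nn
from torchvision.models import resnet50

# Assumes the ARModule class from the previous sketch is in scope.


class BlockWithAR(nn.Module):
    """Wraps a residual block and applies an AR module to its output."""

    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block
        self.ar = ARModule(channels)

    def forward(self, x):
        return self.ar(self.block(x))


def add_ar_to_resnet50(pretrained: bool = True) -> nn.Module:
    """Returns a ResNet-50 whose res blocks are each followed by an AR module."""
    backbone = resnet50(pretrained=pretrained)
    for stage_name in ("layer1", "layer2", "layer3", "layer4"):
        stage = getattr(backbone, stage_name)
        wrapped = []
        for block in stage:
            channels = block.conv3.out_channels  # bottleneck output width
            wrapped.append(BlockWithAR(block, channels))
        setattr(backbone, stage_name, nn.Sequential(*wrapped))
    return backbone
```

A detector head (e.g., FR or CR with an FPN neck) would then consume the multiscale features of this backbone as usual; since each AR module only adds a 1 × 1 convolution and a small two-layer MLP per block, the parameter overhead stays small, consistent with the lightweight design discussed in Section III.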

B. Backbone

1) ResNet-50 or R50 [12]: ResNet has become one of the most popular CNN networks with its simple structure and excellent effect, which effectively solves the degradation problem caused by the deepening of network layers. Due to its moderate network depth and complexity, ResNet-50 is popular and commonly used in research and applications. It contains a total of five convolutional stages, in which the second to fifth stages are residual learning stages that are, respectively, composed of 3, 4, 6, and 3 residual units, and each residual unit adopts a bottleneck structure that contains 1 × 1, 3 × 3, and 1 × 1 convolutional layers.

2) ResNeXt-101 or X101 [88]: ResNeXt is an improvement based on ResNet and inception models [11], [89]–[91], which gains higher accuracy by increasing cardinality (convolutional groups) instead of going deeper or wider, while reducing complexity. ResNeXt-101 (64 × 4d) performs excellently in the ResNeXt series and achieves better accuracy than ResNet-200 with only 50% of the complexity. In the bottleneck convolutional layer, the cardinality is set to 64, which means the channels are divided into 64 groups for the convolutional operation.

3) DarkNet-53 or D53 [83]: DarkNet is a lightweight and efficient convolutional network for the YOLO detector. DarkNet-53 is first proposed in YOLOv3 and consists of 53 basic convolutional units, which are composed of convolutional layers of 1 × 1 and 3 × 3 in a residual structure.

C. SOTA Detector

1) Faster R-CNN or FR [13]: Faster R-CNN (faster region-based CNN) is an improvement of fast R-CNN [80] which first adopts a region proposal network (RPN) to generate high-quality region proposals, and then trains a separate detection network with fast R-CNN by using the proposals generated by the RPN. Furthermore, FR uses the detection network to initialize RPN training and fixes the shared convolutional layers to, respectively, fine-tune the unique layers of the RPN and fast R-CNN. In this way, the two networks form a unified network sharing full-image convolutional features, and the step for region proposal is nearly cost-free, which greatly improves detection speed.

2) Cascade R-CNN or CR [14]: To address the problem that a detector with a low intersection over union (IoU) threshold usually produces noisy detections, while detection performance tends to degrade with increasing IoU thresholds, CR is proposed in a multistage structure, which trains stages sequentially by using the output of one stage to train the next with increasing IoU thresholds, and the same cascade procedure is applied at inference. This architecture is shown to avoid the problems of overfitting at training and quality mismatch at inference.

3) YOLO or YL [92]: Different from FR and CR, YL is a one-stage detector without the stage of generating region proposals. In our EMOD framework we adopt YLv3 as the YL detector. Compared to YLv1 and YLv2, YLv3 uses a new network (DarkNet-53) for feature extraction and adopts independent logistic classifiers instead of a softmax, as well as employing a binary cross-entropy loss for class predictions. For determining bounding box priors, YLv3 still uses k-means clustering and then chooses nine clusters and three scales arbitrarily, dividing the clusters evenly across scales. After these improvements, YLv3 achieves a significant lift for small object detection, which has been a challenge for previous YL versions. However, for medium and larger size object detection, the results are comparatively worse.

4) RetinaNet or RN [15]: RN is a fully convolutional one-stage detector designed to demonstrate the efficacy of the focal loss, which is proposed to solve the extreme foreground–background class imbalance as it is considered the primary obstacle preventing one-stage object detectors from surpassing

top-performing ones. Compared to YLv2, RN achieves a higher detection accuracy, and whether detecting small-, medium-, or large-size objects, it makes a considerable improvement on mean AP.

V. EXPERIMENTS

A. Data Sets

We conduct experiments on our proposed AR module and EMOD framework with annotated video data sets, which show many species of fish, shellfish, underwater plants, and other biota in Fig. 4. Each data set contains different types of images, such as different lighting conditions, camera angles, and wildlife.

We adopt the default "coarse-bbox-only" flavor of the data set as our COCO-style annotations. We use the bounding box annotated data sets (MOUSS, MBARI, HabCam) to evaluate our proposed method, and show the experimental results.

The MOUSS [21] data sets include two sequences, MOUSS seq0 (MOUSS0) and MOUSS seq1 (MOUSS1). MOUSS0 consists of 194 images belonging to the same category (Carcharhiniformes) and the images have a resolution of 968 × 728. MOUSS1 consists of 241 images within one category (Perciformes) and the images have a resolution of 720 × 480.

The MBARI [21] data set contains a single video consisting of 740 RGB frames with six classes. Each image has a 1920 × 1080 resolution.

The HabCam [21] data set contains objects such as scallops, sand dollars, rocks, sand, and the occasional fish. The annotations include scallops and fish. There are 52 344 images with a resolution of 2720 × 1024 covering 11 classes in the data set.

B. Evaluation Metrics

Object detection requires evaluating model performance on both classification and localization, and each image may have different objects of different categories. Therefore, the standard evaluation metrics used in the image classification problem cannot be directly applied to the object detection task. The evaluation metric of COCO [93] is average precision (AP), which is calculated as the area under precision-recall curves across IoU thresholds from 0.5 to 0.95, with an interval of 0.05. All reported results of our work follow standard COCO-style AP metrics that include AP (averaged over IoU thresholds), AP50 (AP for IoU threshold 50%), and AP75 (AP for IoU threshold 75%). We also include APS, APM, and APL, which correspond to the results on small, medium, and large scales, respectively. In order to measure the detection speed of different detectors, we use frames per second (FPS), which means the number of video frames that the model can process per second (on an NVIDIA GTX 1080Ti GPU), and is used as the detector's speed indicator in the testing process.

C. Implementation Details

We perform cross validation on the annotated training data of the data sets. We randomly divide the data sets into three splits, and each split includes a training set and a validation set with a ratio of 4:1. We report the results by averaging over all three splits. For the division of training/validation, we select every ten consecutive frames according to the sequence of video frames, and then randomly assign the ten frames to training and validation sets in a 4:1 ratio. In this way, the target organisms from different temporal frames and spatial locations can be evenly divided into each set, which ensures that the division of the data set can cover the diversity of the original data. For the selection of the splits, we use the FR-R50 detector to conduct experiments on multiple randomly generated splits. According to the principle of high results and a small fluctuation range (standard deviation), we select the optimal three splits to ensure the reliability of the experimental results. Table I shows the results of the three selected splits; we can see that the values of standard deviation for each data set are in a reasonable scope. Therefore, we use these three splits of the four data sets for our experiments in the following sections.

Besides, we further note that we obtain the valuable data from the data challenge [21] in conjunction with the CVPR 2018 [22] and 2019 [23] Workshops; the division of the data sets refers to the division of training and testing in the challenge, and our way of division also accords with the purpose and definition of the data set provider. Meanwhile, due to the data insufficiency and the particularity of marine video surveillance tasks, what we strive to do is to divide the data sets and experiment with the models based on the obtained data in order to realize a useful application. Thus, it remains a challenging yet important future work to collect data sets with more diversity and dynamicity for marine video surveillance.

Moreover, we take MOUSS1 and conduct two experiments to further analyze the effectiveness of our data set division and proposed method: 1) We randomly selected 50 frames from the unlabeled testing set (provided by [21]), and then manually annotated classes and bounding boxes. Our FR-R50 model still worked well with 63.9 mAP (the result is 67.5 mAP on our validation set), which indicates that our model also fits the data of the testing set. 2) We adopted another division, which used the first 4/5 of frames as training and the last 1/5 as validation. We observed that it led to a performance degradation (35.0 mAP for validation and 48.2 mAP for testing by the FR-R50 model); the main reason might be that this division does not fully cover the diversity of contents in the limited data. Furthermore, we also experimented with our AR module on this division, and it still worked well on both the ablation study (35.8 versus 35.0 mAP, FR-R50 model) and the SOTA comparison (38.5 versus 37.3 mAP, CR-X101 model), which also demonstrates the effectiveness of our proposed method.

In this work, we train models on the training set, and report the results of baseline models, ablation studies, and SOTA models on the validation set, and also additionally provide the standard deviation (std) of AP for reference.
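For reference, the COCO-style AP metrics and the FPS measurement described in Section V-B can be reproduced with standard tooling. The sketch below uses pycocotools for the AP summary and a simple timing loop for FPS; the function names and the assumption of a torchvision-style detector interface (a list of image tensors per call) are ours and not part of the EMOD code base.

```python
import time

import torch
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def coco_bbox_ap(gt_json: str, det_json: str) -> None:
    """Prints COCO-style AP, AP50, AP75, APS, APM, APL for bbox detections."""
    coco_gt = COCO(gt_json)              # ground-truth annotations (COCO format)
    coco_dt = coco_gt.loadRes(det_json)  # detections: image_id, category_id, bbox, score
    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()                # AP averaged over IoU 0.50:0.95, etc.


@torch.no_grad()
def measure_fps(model, frames, device: str = "cuda") -> float:
    """Average frames per second of a detector over a list of image tensors."""
    model.eval().to(device)
    start = time.time()
    for frame in frames:
        model([frame.to(device)])        # torchvision-style detector API (assumption)
    if device == "cuda":
        torch.cuda.synchronize()         # make sure GPU work is finished before timing
    return len(frames) / (time.time() - start)
```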


Fig. 4. Examples from three data sets of marine video surveillance.

TABLE I
RESULTS FOR THE DIVISION OF DATA SETS

For the two-stage detectors, 256 anchors are sampled by the RPN per image with a 1:1 ratio of positive to negative anchors, where the IoU thresholds of positive and negative anchors are 0.7 and 0.3, respectively, and the anchors span five scales and three aspect ratios. In the classification branch, region proposals extracted by the region proposal network that have an overlap with ground truth greater than 0.5 are regarded as positive samples. For fair comparison, all experiments are implemented on the PyTorch framework (version 1.3). We use the COCO data set for pretraining models and employ a weight decay of 0.0001, a momentum of 0.9, and standard horizontal flipping with a ratio of 0.5. All the experiments are implemented on two NVIDIA GTX 1080Ti GPUs (two images per GPU) for 12 epochs with an initial learning rate of 0.005, which we decrease by a factor of 0.1 after 8 and 11 epochs, respectively. The code of our EMOD framework is publicly available and can be directly used in marine organism monitoring research.

D. EMOD Performance

We first conduct experiments on three data sets (four subsets) to study the performance of EMOD with four SOTA detectors (FR, CR, YL, and RN) and three backbones (R50, X101, and D53), and the results are shown in Table II, where the detectors FR, CR, and RN use the two backbones R50 and X101, while the detector YL employs the D53 backbone. In order to better present the performance of each detector, we also provide the scatter diagrams with the x-axis and y-axis as FPS and AP, respectively, in Fig. 5. It can be seen that:
1) in Fig. 5, the detection results of two-stage detectors (FR and CR) are higher than those of one-stage detectors (YL and RN), indicating that the performance of two-stage detectors is generally better than that of single-stage detectors. In addition, the detection results of one-stage detectors are on the right side of two-stage detectors, showing a higher efficiency. In particular, the accuracy of CR is the best and the efficiency of YL is the highest. However, there also exists an exception, i.e., in the experiments on MOUSS0, the detection accuracy of the two-stage detector FR is not better than that of the one-stage detectors RN and YL, which might be due to the small data set of MOUSS0 tending to overfitting or underfitting with unstable results of models (especially deep models). Analyzing the same detector with different backbones on these three data sets, we can conclude that the detector with X101 as backbone has higher accuracy than that with R50 as backbone, illustrating the efficacy of deeper models;
2) the CR detector is the first choice with better performance for application scenarios that require high detection accuracy. The two-stage detection framework and the improved cascade structure of the CR detector effectively improve the performance of FR. By comprehensive comparison, CR-X101 is the most accurate detection configuration;
3) the YL detector is the first choice with higher efficiency for application scenarios that require fast-speed or real-time detection. According to the comprehensive comparison of FPS results, the processing speed of the YL-D53 detector is faster than that of the other detectors, which can reach online real-time detection with an acceptable detection accuracy;
4) for application scenarios that require consideration of both accuracy and efficiency, the RN detector is a good choice. As a one-stage detector, RN can achieve a competitive performance compared to two-stage detectors, while being more efficient. Experimental results show that the detector RN can balance accuracy and efficiency well.

E. Ablation Study for AR Module

We conduct an ablation study to validate the efficacy of our AR module, and show comparisons with other AR modules (NL block [60], SE module [72], and GCNet [85]). Considering the tradeoff between accuracy and efficiency, we use the FR-R50 detector as the baseline model to study the efficacy of our AR module as well as its components: Attention (A, simplified NL

TABLE II
EXPERIMENTAL RESULTS OF EMOD PERFORMANCE ON THREE DATA SETS


TABLE III
ABLATION STUDY FOR AR MODULE

TABLE IV
COMPARATIVE RESULTS OF OUR AR MODULE AND CAO ET AL. [85]

block [60]) and Relation (R, SE module [72]). To clearly observe the effect of the AR module, we also present the scatter diagrams with the x-axis as FPS and the y-axis as AP in Fig. 6, and the detailed results are shown in Table III. It can be seen that:
1) both the AR module and its component A or R can improve the detection performance, indicating that the proposed visual attention and relation method is effective for detecting organisms from marine surveillance videos;
2) specifically, a model achieves almost the same performance with the A or R module alone, and neither of them works better than the model with the AR module; also, the improvement of the AR module is almost twice that of the single A or R module. This indicates that, through combining A and R for joint attention–relation, our AR module is an improved implementation of current AR modules (NL block [60] and SE module [72]);
3) the comparative experiments are carried out on the basis of the feature pyramid network (FPN) [94] module, which is an important module in FR that has been proved to improve detection performance significantly, while our


TABLE V
RESULTS OF STATE-OF-THE-ART COMPARISON


Fig. 5. Scatter diagrams of EMOD performance on three data sets.

Fig. 6. Scatter diagrams of ablative results on AR module.

Fig. 7. Scatter diagrams of SOTA Comparison.

AR module and its components can further boost the performance, showing the flexibility and effectiveness of our design.

Furthermore, we show a performance comparison of our AR module and the module designed by Cao et al. [85] in Table IV. We also use FR-R50 as the base detector, and report the AP value on each data set. We can see that our AR module performs better, which indicates that our improvements for learning relations across multiple channel perspectives are effective for marine video data sets (see Section III-B for more details).

F. State-of-the-Art Comparison

We finally conduct experiments to show the SOTA comparison of our EMOD+AR models. We adopt the X101 backbone for detectors FR, CR, and RN, and the D53 backbone for detector YL, due to their outstanding performance in the baseline experiments (Section V-D). Similarly, the results are shown in Fig. 7 and Table V, which illustrate that:
1) our AR module can further improve the performance of EMOD, and outperform the SOTA detectors;
2) our AR module can achieve performance gains with only a slight decrease of FPS, indicating the lightweight design;


Fig. 8. Detailed visualization of scallop detection on HabCam data set. We visualize the input images, backbone w/o. and w/. AR modules, and detection results.
It can be seen that our proposed AR module can enhance the representation of organisms obviously and bring benefits to marine organism detection.

3) the results provide valuable suggestions for model choice considering different requirements, e.g., the EMOD+AR (CR) model is a good choice for situations that focus more on accuracy, while the EMOD+AR (YL) model is better when efficiency is the primary concern.

We note that the SOTA comparison of Table V is based on the current version of the EMOD framework. We will integrate more SOTA detectors and backbones in future versions. Meanwhile, the AR module will be implemented in the updated version, and will be improved to a higher SOTA performance based on stronger SOTA detectors.

G. Visualization of AR Module

Fig. 8 shows detailed visualizations on the HabCam data set, and each row represents an instance of scallop detection. We experiment on the data set with the EMOD+AR (CR-R50) model, and provide attention maps of the last residual layer from EMOD w/o. AR (column 2) and w/. AR (column 3), and also the detection results (column 4) and ground truth (column 5). The input instances are affected by various underwater environments, e.g., lots of impurities in the first instance, blurring and color distortion in the second instance, and the haziness of organisms in the third and fourth instances. It can be clearly seen that our AR module can enhance the representation of organisms and bring benefits to marine organism detection in all cases. Furthermore, even in the case of more target objects and more impurities (the last instance), our AR module is still able to accurately detect all target organisms in this more complex underwater environment.

VI. CONCLUSION

In this work, we propose a new way to apply an improved AR module on an EMOD, which explores learning joint AR in CNNs for detecting marine organisms. In order to better surveil the behaviors and understand the distribution of marine organisms, our designed EMOD framework, which is equipped with our AR modules, can detect and surveil marine organisms for high-resolution marine video surveillance in a real-time and fast fashion. We integrate both one-stage and two-stage detection methods to train the organism detector based on high-resolution data sets of marine videos. Experimental results and visualizations demonstrate that our application of the AR module is effective and efficient, and our EMOD can further boost the performance based on SOTA methods. We hope our proposed AR module and EMOD framework, which are publicly available, can benefit the research of marine video surveillance as well as marine biology.


ACKNOWLEDGMENT

The authors would like to thank NOAA (NMFS-PIFSC-83) and Kitware (viametoolkit.org) for providing the data and annotations of the marine video data sets.

REFERENCES

[1] G. A. Rose, "Monitoring coastal northern cod: Towards an optimal survey of Smith Sound Newfoundland," ICES J. Mar. Sci., vol. 60, no. 3, pp. 453–462, 2003.
[2] B. Taormina et al., "Optimizing image-based protocol to monitor macroepibenthic communities colonizing artificial structures," ICES J. Mar. Sci., vol. 77, no. 2, pp. 835–845, 2020.
[3] Q. Li et al., "Developing a microscopic image dataset in support of intelligent phytoplankton detection using deep learning," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1427–1439, 2020.
[4] D. A. Smale et al., "Regional-scale benthic monitoring for ecosystem-based fisheries management (EBFM) using an autonomous underwater vehicle (AUV)," ICES J. Mar. Sci., vol. 69, no. 6, pp. 1108–1118, 2012.
[5] S. B. Williams, O. Pizarro, M. How, D. Mercer, and R. Hanlon, "Surveying nocturnal cuttlefish camouflage behaviour using an AUV," in Proc. IEEE Int. Conf. Robot. Automat., 2009, pp. 214–219.
[6] A. Salman et al., "Automatic fish detection in underwater videos by a deep neural network-based hybrid motion learning system," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1295–1307, 2020.
[7] A. Mahmood et al., "Automatic detection of western rock lobster using synthetic data," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1308–1317, 2020.
[8] C. S. Tan, P. Y. Lau, P. L. Correia, and A. Campos, "Automatic analysis of deep-water remotely operated vehicle footage for estimation of Norway lobster abundance," Frontiers Inf. Technol. Electron. Eng., vol. 19, no. 8, pp. 1042–1055, 2018.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[11] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[14] Z. Cai and N. Vasconcelos, "Cascade R-CNN: High quality object detection and instance segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1483–1498, May 2020.
[15] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[16] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7263–7271.
[17] C. Beyan and H. I. Browman, "Setting the stage for the machine intelligence era in marine science," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1267–1273, 2020.
[18] K. Malde, N. O. Handegard, L. Eikvil, and A.-B. Salberg, "Machine intelligence and the data-driven future of marine science," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1274–1285, 2020.
[19] C. Kemp and J. B. Tenenbaum, "The discovery of structural form," Proc. Nat. Acad. Sci., vol. 105, no. 31, pp. 10687–10692, 2008.
[20] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2901–2910.
[21] Kitware Inc., "CVPR 2018 workshop data challenge," Accessed: Sep. 15, 2019. [Online]. Available: http://www.viametoolkit.org/cvpr-2018-workshop-data-challenge/
[22] "CVPR Workshop: Automated analysis of marine video for environmental monitoring," 2018. Accessed: Sep. 15, 2019. [Online]. Available: http://www.viametoolkit.org/cvpr-2018-workshop/
[23] "CVPR Workshop: Automated analysis of marine video for environmental monitoring," 2019. Accessed: Sep. 15, 2019. [Online]. Available: https://www.aamvem.com/
[24] B. L. Richards et al., "Automated analysis of underwater imagery: Accomplishments, products, and vision," Nat. Ocean. Atmospheric Admin., Washington, DC, USA, NOAA Tech. Memo NMFS-PIFSC-83, 2019.
[25] Z. Shi et al., "Detecting organisms for marine video surveillance," in Proc. Glob. OCEANS: Singapore-U.S. Gulf Coast, 2020, pp. 1–7, doi: 10.1109/IEEECONF38699.2020.9389458.
[26] J. Hu, H. Zhang, A. Miliou, T. Tsimpidis, H. Thornton, and V. Pavlovic, "Categorization of underwater habitats using dynamic video textures," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2013, pp. 838–843.
[27] T. Fujii and A. J. Jamieson, "Fine-scale monitoring of fish movements and multiple environmental parameters around a decommissioned offshore oil platform: A pilot study in the north sea," Ocean Eng., vol. 126, pp. 481–487, 2016.
[28] H. Balk and T. Lindem, "Improved fish detection in data from split-beam sonar," Aquatic Living Resour., vol. 13, no. 5, pp. 297–303, 2000.
[29] Y. Xie, G. Cronkite, and T. J. Mulligan, A Split-Beam Echosounder Perspective on Migratory Salmon in the Fraser River: A Progress Report on the Split-Beam Experiment At Mission, BC, in 1995, vol. 11. Vancouver, BC, Canada: Pacific Salmon Commission, 1997.
[30] J. A. Holmes, G. M. Cronkite, H. J. Enzenhofer, and T. J. Mulligan, "Accuracy and precision of fish-count data from a 'dual-frequency identification sonar' (DIDSON) imaging system," ICES J. Mar. Sci., vol. 63, no. 3, pp. 543–555, 2006.
[31] C. O. Tiemann, S. W. Martin, and J. Mobley, "Aerial and acoustic marine mammal detection and localization on navy ranges," IEEE J. Ocean. Eng., vol. 31, no. 1, pp. 107–119, Jan. 2006.
[32] G. Shrivakshan and C. Chandrasekar, "A comparison of various edge detection techniques used in image processing," Int. J. Comput. Sci. Issues, vol. 9, no. 5, 2012, Art. no. 269.
[33] C. Spampinato, Y.-H. Chen-Burger, G. Nadarajan, and R. B. Fisher, "Detecting, tracking and counting fish in low quality unconstrained underwater videos," in Proc. 3rd Int. Conf. Comput. Vis. Theory Appl., vol. 2, 2008, pp. 514–519.
[34] Z. Zivkovic, "Improved adaptive Gaussian mixture model for background subtraction," in Proc. Int. Conf. Pattern Recognit., vol. 2, 2004, pp. 28–31.
[35] K. Fukunaga, Introduction to Statistical Pattern Recognition. Amsterdam, The Netherlands: Elsevier, 2013.
[36] K. Mizuno et al., "Development of an efficient coral-coverage estimation method using a towed optical camera array system [speedy sea scanner (SSS)] and deep-learning-based segmentation: A sea trial at the Kujuku-Shima Islands," IEEE J. Ocean. Eng., vol. 45, no. 4, pp. 1386–1395, Oct. 2020.
[37] S. Villon, M. Chaumont, G. Subsol, S. Villéger, T. Claverie, and D. Mouillot, "Coral reef fish detection and recognition in underwater videos by supervised machine learning: Comparison between deep learning and HOG+SVM methods," in Proc. Int. Conf. Adv. Concepts Intell. Vis. Syst., 2016, pp. 160–171.
[38] A. Marburg and K. Bigham, "Deep learning for benthic fauna identification," in Proc. IEEE/MTS OCEANS Conf., Monterey, CA, USA, 2016, pp. 1–5.
[39] S. A. Siddiqui et al., "Automatic fish species classification in underwater videos: Exploiting pre-trained deep neural network models to compensate for limited labelled data," ICES J. Mar. Sci., vol. 75, no. 1, pp. 374–389, 2018.
[40] Y.-C. Lu, C. Tung, and Y.-F. Kuo, "Identifying the species of harvested tuna and billfish using deep convolutional neural networks," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1318–1329, 2020.
[41] C. Rasmussen, J. Zhao, D. Ferraro, and A. Trembanis, "Deep census: AUV-based scallop population monitoring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2017, pp. 2865–2873.
[42] Y. Song, B. He, and P. Liu, "Real-time object detection for AUVs using self-cascaded convolutional neural networks," IEEE J. Ocean. Eng., vol. 46, no. 1, pp. 56–67, Jan. 2021.
[43] F. Aznar, M. Pujol, and R. Rizo, "A swarm behaviour for jellyfish bloom detection," Ocean Eng., vol. 134, pp. 24–34, 2017.
[44] C.-H. Tseng and Y.-F. Kuo, "Detecting and counting harvested fish and identifying fish types in electronic monitoring system videos using deep convolutional neural networks," ICES J. Mar. Sci., vol. 77, no. 4, pp. 1367–1378, 2020.
[45] L. Itti and C. Koch, "Computational modelling of visual attention," Nature Rev. Neurosci., vol. 2, no. 3, pp. 194–203, 2001.

Authorized licensed use limited to: Universidad Nacional de Colombia (UNAL). Downloaded on June 06,2023 at 14:18:36 UTC from IEEE Xplore. Restrictions apply.
SHI et al.: DETECTING MARINE ORGANISMS VIA JOINT AR LEARNING FOR MARINE VIDEO SURVEILLANCE 973

[46] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention [74] H. Zhao et al., “PSANet: Point-wise spatial attention network for scene
for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, parsing,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 267–283.
no. 11, pp. 1254–1259, Nov. 1998. [75] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with deep
[47] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses learning: A review,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11,
with a third-order Boltzmann machine,” in Proc. Adv. Neural Inf. Process. pp. 3212–3232, Nov. 2019.
Syst., 2010, pp. 1243–1251. [76] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc.
[48] V. Mnih et al., “Recurrent models of visual attention,” in Proc. Adv. Neural IEEE/CVF Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
Inf. Process. Syst., 2014, pp. 2204–2212. [77] L. Liu et al., “Deep learning for generic object detection: A survey,” Int.
[49] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen, “A neurobiological J. Comput. Vis., vol. 128, no. 2, pp. 261–318, 2020.
model of visual attention and invariant pattern recognition based on dy- [78] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierar-
namic routing of information,” J. Neurosci., vol. 13, no. 11, pp. 4700–4719, chies for accurate object detection and semantic segmentation,” in Proc.
1993. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
[50] C. Cao et al., “Look and think twice: Capturing top-down visual attention [79] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep
with feedback convolutional neural networks,” in Proc. IEEE/CVF Conf. convolutional networks for visual recognition,” IEEE Trans. Pattern Anal.
Comput. Vis. Pattern Recognit., 2015, pp. 2956–2964. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[51] M. Jaderberg et al., “Spatial transformer networks,” in Proc. Adv. Neural [80] R. Girshick, “Fast R-CNN,” in Proc. IEEE/CVF Int. Conf. Comput. Vis.,
Inf. Process. Syst., 2015, pp. 2017–2025. 2015, pp. 1440–1448.
[52] T. Bluche, “Joint line segmentation and transcription for end-to-end hand- [81] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based
written paragraph recognition,” in Proc. Adv. Neural Inf. Process. Syst., fully convolutional networks,” in Proc. Adv. Neural Inf. Process. Syst.,
2016, pp. 838–846. 2016, pp. 379–387.
[53] A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating [82] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun,
for video classification,” 2017, arXiv:1706.06905. “OverFeat: Integrated recognition, localization and detection using con-
[54] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural volutional networks,” 2013, arXiv:1312.6229.
Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [83] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once:
[55] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber, “Deep networks Unified, real-time object detection,” in Proc. IEEE/CVF Conf. Comput.
with internal selective attention through feedback connections,” in Proc. Vis. Pattern Recognit., 2016, pp. 779–788.
Adv. Neural Inf. Process. Syst., 2014, pp. 3545–3553. [84] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf.
[56] L. Chen et al., “SCA-CNN: Spatial and channel-wise attention in convolu- Comput. Vis., 2016, pp. 21–37.
tional networks for image captioning,” in Proc. IEEE/CVF Conf. Comput. [85] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks
Vis. Pattern Recognit., 2017, pp. 5659–5667. meet squeeze-excitation networks and beyond,” in Proc. IEEE/CVF Int.
[57] K. Xu et al., “Show, attend and tell: Neural image caption generation with Conf. Comput. Vis. Workshops, 2019, pp. 1971–1980.
visual attention,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057. [86] Z. Yang et al., “XlNet: Generalized autoregressive pretraining for language
[58] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading understanding,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5753–
sentences in the wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern 5763.
Recognit., 2017, pp. 3444–3453. [87] J. Fu et al., “Dual attention network for scene segmentation,” in Proc.
[59] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3146–3154.
Process. Syst., 2017, pp. 5998–6008. [88] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual
[60] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” transformations for deep neural networks,” in Proc. IEEE/CVF Conf.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794– Comput. Vis. Pattern Recognit., 2017, pp. 1492–1500.
7803. [89] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
[61] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image training by reducing internal covariate shift,” in Proc. Int. Conf. Mach.
denoising,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Learn., 2015, pp. 448–456.
vol. 2, 2005, pp. 60–65. [90] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
[62] A. Santoro et al., “A simple neural network module for relational reason- the inception architecture for computer vision,” in Proc. IEEE/CVF Conf.
ing,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4967–4976. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[63] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attention [91] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4,
network for visual question answering,” in Proc. IEEE/CVF Int. Conf. inception-resnet and the impact of residual connections on learning,” 2016,
Comput. Vis., 2019, pp. 10313–10322. arXiv:1602.07261.
[64] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object [92] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,”
detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, 2018, arXiv:1804.02767.
pp. 3588–3597. [93] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc.
[65] M. Yatskar, L. Zettlemoyer, and A. Farhadi, “Situation recognition: Visual Eur. Conf. Comput. Vis., 2014, pp. 740–755.
semantic role labeling for image understanding,” in Proc. IEEE/CVF Conf. [94] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
Comput. Vis. Pattern Recognit., 2016, pp. 5534–5542. “Feature pyramid networks for object detection,” in Proc. IEEE/CVF Conf.
[66] G. Gkioxari, R. Girshick, P. Dollár, and K. He, “Detecting and recognizing Comput. Vis. Pattern Recognit., 2017, pp. 2117–2125.
human-object interactions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., 2018, pp. 8359–8367.
[67] P. Battaglia et al., “Interaction networks for learning about objects,
relations and physics,” in Proc. Adv. Neural Inf. Process. Syst., 2016,
pp. 4502–4510.
[68] N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tacchetti,
“Visual interaction networks: Learning a physics simulator from video,”
Zhensheng Shi (Member, IEEE) received the B.Sc. degree in electronic information science and technology from Qufu Normal University, Qufu, China, in 2009, and the M.Eng. degree in electronic and communication engineering and the Ph.D. degree in intelligent information and communication system from the Ocean University of China, Qingdao, China, in 2012 and 2021, respectively.
He is currently a Postdoctoral Researcher with the Frontiers Science Center for Deep Ocean Multispheres and Earth System, Ocean University of China. His research interests include video understanding, multi-modal learning, and underwater vision.

Cheng Guan received the B.Sc. degree in electronic information engineering from Northeastern University, Shenyang, China, in 2018, and the M.Eng. degree in electronic and communication engineering from the Ocean University of China, Qingdao, China, in 2021.
His research interests include deep learning and video understanding.

Haiyong Zheng (Member, IEEE) received the B.Eng. degree in electronic information engineering and the Ph.D. degree in ocean information sensing and processing from the Ocean University of China, Qingdao, China, in 2004 and 2009, respectively.
In 2009, he joined the Department of Electronic Engineering, Ocean University of China, where he is currently a Professor. His research interests include computer vision, underwater vision, and deep learning.

Qianqian Li received the B.Sc. degree in applied physics from Xihua University, Chengdu, China, in 2019. She is currently working toward the master's degree in electronic and communication engineering from the Ocean University of China, Qingdao, China.
Her research interests include action recognition and object detection.

Zhaorui Gu (Member, IEEE) received the B.Eng. degree in electronic information science and technology and the M.Eng. degree in electronic and communication engineering from the Ocean University of China, Qingdao, China, in 2010 and 2013, respectively.
He is currently an Experimentalist with the Department of Electronic Engineering, Ocean University of China. His research interests include image processing and underwater vision.
Liangjie Cao received the B.Sc. degree in communication engineering from Shandong Agricultural University, Taian, China, in 2018, and the M.Eng. degree in electronic and communication engineering from the Ocean University of China, Qingdao, China, in 2021.
His research interests include action recognition and video understanding.

Bing Zheng (Member, IEEE) received the B.Sc. degree in electronics and information system, the M.Sc. degree in marine physics, and the Ph.D. degree in computer application technology from the Ocean University of China, Qingdao, China, in 1991, 1995, and 2013, respectively.
He is currently a Professor with the Department of Electronic Engineering, Ocean University of China. His research interests include ocean optics, underwater imaging, and optical detection.

Ju Liang received the B.Sc. degree in electronic information science and technology from Qufu Normal University, Qufu, China, in 2009. She is currently working toward the master's degree in electronic and communication engineering from the Ocean University of China, Qingdao, China.
Her research interests include action recognition and video understanding.
