Available online at www.sciencedirect.com

ScienceDirect

IFAC PapersOnLine 52-21 (2019) 78–81

Underwater Object Detection and Pose Estimation using Deep Learning

MyungHwan Jeon ∗ Yeongjun Lee ∗∗ Young-Sik Shin, Hyesu Jang, Ayoung Kim ∗∗∗

∗ Department of the Robotics Program, KAIST, Daejeon, S. Korea (myunghwan.jeon@kaist.ac.kr)
∗∗ KRISO, Daejeon, S. Korea (leeyeongjun@kriso.re.kr)
∗∗∗ Department of Civil and Environmental Engineering, KAIST, Daejeon, S. Korea (e-mail: [youngsik.shin, iriter, ayoungk]@kaist.ac.kr)

Abstract:
This paper presents an approach for making a dataset using a 3D CAD model for deep learning based underwater object detection and pose estimation. We also introduce a simple pose estimation network for underwater objects. In the experiment, we show that object detection and pose estimation networks trained via our synthetic dataset present a preliminary potential for deep learning based approaches underwater. Lastly, we show that our synthetic image dataset provides meaningful performance for deep learning models in underwater environments.

Copyright © 2019. The Authors. Published by Elsevier Ltd. All rights reserved.

⋆ This study is a part of the results of the R&D project, Development of Basic Technologies of 3D Object Reconstruction and Robot Manipulator Motion Compensation Control, supported by KRISO.

1. INTRODUCTION

Recently, object detection and 6D pose estimation have utilized deep learning methods. These methods have shown significantly impressive results for general object recognition tasks in unconditional environments and have sufficient accuracy for robotic tasks, including grasping objects (Zhou et al. (2018); Chu and Vela (2018)). However, the results of previous research mostly focus on terrestrial environments. Since datasets constituting the underwater environment are scarce, implementing a deep learning based approach for underwater applications is challenging.

Obtaining and utilizing underwater object data lead to two further issues. Firstly, acquiring underwater object data is challenging compared to acquisition from ground environments. Even if the data are obtained, manual annotation is costly and can be inaccurate due to human error. Secondly, underwater camera images show diverse variations such as intensity degeneration and color distortion (Chen et al. (2017)).

Fig. 1. Illustration of synthetic image generation. The 3D model is projected by a Virtual Camera (V-CAM) to capture synthetic images. By extracting the annotations for the object detection and pose estimation networks from these synthetic images, we utilized these synthetic images with object mask and object class annotations for the training set of the object detection network. Then, we cropped the synthetic images using the truncation annotation. These cropped images and pose annotations were piped into the training set for the pose estimation network.

In this paper, we enhance the existing approach, in which a dataset is created by adopting a 3D computer aided design (CAD) model, to ensure that the dataset involves various optical conditions and underwater environments. To meet the research objective, we present an automatic annotation tool. Also, to verify the effectiveness of our dataset, we show its application to object detection and pose estimation. In addition, we propose a simple pose estimation network. The object detection network is trained with our dataset and presents preliminary potential for deep learning based approaches in underwater applications. We verify that our dataset suits deep learning models in underwater environments.

In summary, this paper presents three things as follows:

• We automatically generate all the necessary annotations for object detection and pose estimation.
• We propose a simple pose estimation network for underwater environments.
• We verify that the synthetic image set using a 3D CAD model is feasible for training in underwater object detection and pose estimation.

2. RELATED WORKS

Securing enough training data is essential for deep learning based approaches. In fact, since acquiring diverse datasets is challenging, many researchers have focused on creating synthetic datasets to enable automatic annotation.

2.1 Synthesizing Images Using a 3D CAD Model

In the literature, authors have utilized synthetic images generated using 3D CAD models to detect objects and estimate poses. Peng et al. (2014) collected 3D CAD models by searching online for the names of 20 categories. In their paper, 25 models per category were coated with a texture, and the authors selected their colors. The viewpoint was changed manually to render virtual images during the model generation step. Su et al. (2015) selected 3D models from PASCAL 3D+ for 12 categories. They randomly adjusted the illumination condition, viewpoint, and background when generating the images. Our work is similar to Su et al. (2015), but we focus on underwater implementation.

2.2 Automatic Annotation Tool

Precise annotation of the target to be learned is essential in supervised learning to ensure the significant performance of deep learning based approaches. Manual annotation is the most common method of addressing this issue. However, this annotation scheme can be inaccurate and is infeasible for large datasets. Even though researchers have used automated annotation tools on real images to overcome this issue, automated generation of instance annotations in images has remained challenging because of occlusion with other objects. A synthetic dataset can alleviate these problems. For instance, Johnson-Roberson et al. (2016) created a training set for deep learning through a highly realistic simulation engine. Through the simulation engine, changes in the weather, daytime, and nighttime were introduced to ensure diversity of the dataset. Alternatively, Su et al. (2015); Hattori et al. (2018); Busto and Gall (2018); Wang et al. (2018) developed automated annotation tools using 3D CAD models. Our approach automatically annotates viewpoints, bounding-boxes, and segmentation labels for use in underwater environments.

3. METHOD

We composed a network comprising two cascade-connected networks, (i) Mask R-CNN and (ii) a pose estimation network, to perform instance-level object detection and pose estimation. The first network, Mask R-CNN, detects the mask, bounding-box, and class for an object. For input to the pose estimation network, an image is truncated by the bounding-box of an object. This input, truncated by the bounding-box, thus allows the pose estimator to focus only on the objects for which the pose is to be estimated. Secondly, the proposed pose estimation network combines DenseNet (Huang et al. (2017)), Dense Blocks, and FCs.
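As a structural illustration of the cascade just described (Mask R-CNN first, a pose network on the bounding-box crop second), the sketch below uses torchvision's Mask R-CNN implementation purely as a stand-in for the detector; the class count, crop size, and reuse of the 0.9 score threshold from Section 4.2 are our assumptions for illustration, not details taken from the authors' code.

import torch
import torchvision
from torchvision.transforms.functional import resized_crop

def detect_then_estimate(image, detector, pose_net, score_thresh=0.9):
    # image: float tensor (3, H, W) in [0, 1]; pose_net: any module mapping a
    # (1, 3, 224, 224) crop to a quaternion.
    with torch.no_grad():
        det = detector([image])[0]  # dict with 'boxes', 'labels', 'scores', 'masks'
        results = []
        for box, label, score in zip(det["boxes"], det["labels"], det["scores"]):
            if score < score_thresh:
                continue  # same detection-confidence threshold as in Section 4.2
            x1, y1, x2, y2 = box.round().int().tolist()
            # Truncate the image by the bounding-box so the pose estimator sees
            # only the detected object, then resize to the pose-net input size.
            crop = resized_crop(image, top=y1, left=x1,
                                height=max(y2 - y1, 1), width=max(x2 - x1, 1),
                                size=[224, 224])
            quat = pose_net(crop.unsqueeze(0))
            results.append({"label": int(label), "score": float(score), "quaternion": quat})
    return results

# Usage sketch (untrained weights; structure only, assumed 4 object classes + background).
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=5)
detector.eval()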
Fig. 2. Illustration of the proposed pose estimator. We exploit Densely Connected Convolutional Networks (DenseNet)-121. The Dense Block consists of 12 pairs of 1×1 conv and 3×3 conv. Through Global Average Pooling and a Fully Connected network (FC), the pose estimator produces four outputs.

3.1 Synthesizing the Image

The entire synthesizing stage was processed automatically without human intervention. As shown in Fig. 1, a 3D model was located at the center of a spherical coordinate system. The 3D model remained stationary; only the pose of the Virtual Camera (V-CAM) was changed to acquire samples of relative poses between the V-CAM and the 3D model (i.e., azimuth, elevation, in-plane rotation, and distance). The V-CAM parameters were tuned manually. At the same time, we obtained a transparent-background image with the model centered using the V-CAM for every pose sample. This transparent-background image was utilized for extracting the truncation parameters and segmentation labels. For the training set of the object detection, we overlaid a background onto the generated transparent-background image. For the training set of the pose estimation, after cropping the transparent-background images using the truncation parameter, we coated the cropped images with the background; these images were employed as the training set of the pose estimation. Finally, we augmented all of the images using various effects to be robust in unconditional environments.
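To make the synthesizing stage above concrete, here is a minimal sketch of how V-CAM pose sampling, automatic annotation from the alpha channel, and background compositing could fit together. It is an illustration under our own assumptions, not the authors' pipeline: render_rgba is a hypothetical placeholder for the actual CAD renderer, and Pillow is assumed for compositing.

import random
from PIL import Image, ImageDraw

def render_rgba(azimuth, elevation, in_plane, distance, size=(640, 480)):
    # Hypothetical stand-in for the real CAD renderer: draws an opaque box whose
    # apparent size shrinks with distance, on a transparent background, so the
    # rest of the pipeline can be exercised end to end.
    img = Image.new("RGBA", size, (0, 0, 0, 0))
    w, h = int(size[0] / (2 * distance)), int(size[1] / (2 * distance))
    cx, cy = size[0] // 2, size[1] // 2
    ImageDraw.Draw(img).rectangle(
        [cx - w // 2, cy - h // 2, cx + w // 2, cy + h // 2], fill=(200, 180, 60, 255))
    return img

def synthesize_sample(size=(640, 480)):
    # Sample a relative V-CAM pose: azimuth, elevation, in-plane rotation, distance.
    pose = dict(azimuth=random.uniform(0, 360), elevation=random.uniform(-90, 90),
                in_plane=random.uniform(0, 360), distance=random.uniform(1.0, 3.0))

    # Transparent-background render of the stationary 3D model.
    fg = render_rgba(**pose, size=size)

    # Truncation box and segmentation label come from the alpha channel,
    # so no manual annotation is needed.
    alpha = fg.split()[-1]
    bbox = alpha.getbbox()                        # (left, upper, right, lower)
    mask = alpha.point(lambda a: 255 if a > 0 else 0)

    # Stand-in background; in practice an underwater background photo is loaded here.
    bg = Image.new("RGBA", size, (15, 60, 90, 255))

    # Detection training image: overlay the render onto the background.
    detection_img = Image.alpha_composite(bg, fg)
    # Pose training image: crop with the truncation box, then coat with the background.
    pose_img = Image.alpha_composite(bg.crop(bbox), fg.crop(bbox))
    return detection_img, pose_img, bbox, mask, pose

det_img, pose_img, bbox, mask, pose = synthesize_sample()
print(bbox, pose)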
3.2 Pose Estimation

In this paper, since pose estimation was used to validate our synthetic dataset, we did not estimate full 6D poses but only the 3D orientation. Thus, we wanted to determine whether our synthetic dataset is appropriate for the pose estimation task. The proposed pose estimation network belongs to the regression task. In the rotation regression task, we used the quaternion rotation representation because it does not suffer from gimbal lock, which occurs frequently in the Euler angle representation, and, by construction, quaternions are unit-norm, ‖q‖₂ = 1. The cost of quaternion regression is shown in (1).

E_Q = 2 acos(|⟨q, q̇⟩|)     (1)

q is the ground truth value, and q̇ is the predicted value. The operator ⟨·, ·⟩ indicates the inner product. The absolute value in this cost handles the case ⟨q, q̇⟩ < 0, preventing quaternion ambiguity.
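As a concrete reading of the cost in (1), the following is a minimal sketch of the quaternion regression loss assuming PyTorch tensors; the normalization and the clamp before acos are our additions for numerical safety, not details stated in the paper.

import torch

def quaternion_geodesic_loss(q_gt: torch.Tensor, q_pred: torch.Tensor) -> torch.Tensor:
    # E_Q = 2 * acos(|<q, q_pred>|), averaged over the batch.
    # q_gt, q_pred: (N, 4) tensors. Both are normalized so the inner product is a
    # valid cosine; the absolute value folds q and -q (the same rotation) together.
    q_gt = q_gt / q_gt.norm(dim=-1, keepdim=True)
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
    dot = (q_gt * q_pred).sum(dim=-1).abs()
    dot = dot.clamp(max=1.0 - 1e-7)  # keep acos differentiable at the boundary
    return (2.0 * torch.acos(dot)).mean()

# Example: loss between a ground-truth quaternion and a slightly perturbed prediction.
q = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
q_hat = torch.tensor([[0.99, 0.01, 0.05, 0.0]])
print(quaternion_geodesic_loss(q, q_hat))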
To approximate the four parameters of the quaternion, we created a new network by combining one DenseNet, four Dense Blocks, and four FCs. DenseNet (Huang et al. (2017)) extracts more complex key points from the learning process than other networks do because almost all of its layers deploy the information of the previous layers through skip connections. All parameters have shared features through the DenseNet. We assign a Dense Block to each of the four parameters; each block extracts key points for one parameter. One FC is allocated for each parameter (Fig. 2).

In fact, object pose estimation from a single RGB image is known to be a substantially challenging task. To solve this challenging task, one needs to build a more specific model and costs focused on orientation information, as in Kehl et al. (2017); Do et al. (2018); Xiang et al. (2017). Our purpose was an effective evaluation of our synthetic dataset for the pose estimation task, by judging whether the dataset has potential for pose estimation even with a simple network.
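The description above fixes only the coarse topology: a shared DenseNet-121 trunk, then one block, Global Average Pooling, and one FC per quaternion parameter. The sketch below is a simplified reading of that topology, not the authors' implementation; the per-parameter head uses a plain two-layer conv stack instead of a full 12-pair Dense Block, and the channel widths are assumptions.

import torch
import torch.nn as nn
import torchvision

class SimplePoseEstimator(nn.Module):
    # Shared DenseNet-121 features -> one small head + FC per quaternion parameter.
    def __init__(self):
        super().__init__()
        # Shared trunk: DenseNet-121 convolutional features (1024 output channels).
        self.trunk = torchvision.models.densenet121(weights=None).features
        # One lightweight head per quaternion component (stand-in for a Dense Block).
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1024, 256, kernel_size=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),   # global average pooling
                nn.Flatten(),
                nn.Linear(256, 1),         # FC producing one of the four outputs
            )
            for _ in range(4)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.trunk(x)
        q = torch.cat([head(feats) for head in self.heads], dim=1)  # (N, 4)
        return q / q.norm(dim=-1, keepdim=True)  # normalize to a unit quaternion

# Quick shape check on a dummy cropped input.
model = SimplePoseEstimator()
print(model(torch.zeros(1, 3, 224, 224)).shape)  # torch.Size([1, 4])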
Fig. 3. Experiment setup. A camera and four objects are prepared in air. All objects and the camera are attached to an external frame, and the external frame is fed into the water tank.

Fig. 4. Results of the object detection. The results for each of the four objects and for all four objects together are shown. The colored region represents the detection mask. The detection score and object class are shown with the bounding-boxes.

4. EXPERIMENTS

4.1 Experiment Setting

We made a training set utilizing four 3D CAD models. For the underwater experiments, we prepared 3D physical models using a 3D printer. Then, we fed these outputs into a water tank to validate the preliminary performance (Fig. 3).

For training, we generated 1000 samples for each model for the object detection and 2000 samples for the pose estimation. Example results of the above process are shown in Fig. 1. When the object detector and pose estimator were trained, we exploited only synthetic images. We used uncropped images for the object detector and cropped images for the pose estimator, as shown in Fig. 1.

For the test, we captured three images per object (i.e., DUCK, RABBIT, FISH, and CHESS), in differing poses, along with images containing all four objects (ALL), as shown in Fig. 4.

4.2 Object Detection

Mask R-CNN, used in the experiment, is a state-of-the-art object detection model that provides the class, mask, and bounding-box for an object. We evaluated whether our synthetic dataset is suitable for underwater object detection using Mask R-CNN. We set the threshold of the detection confidence to 0.9. As the evaluation metrics, we used Average Precision (AP) and the pixel-level overlay ratio between the ground truth mask and the predicted mask to measure the object detection accuracy (Table 1).

Except for the fish, the AP and mask overlay of the objects exceeded 0.9. Also, when we analyzed the images as shown in Fig. 4, the detection scores (detection confidence) were close to 1.0, and the bounding-boxes and masks fitted fairly well. The object detector trained by the proposed method performed well under various illumination conditions. The objects were reliably detected even in the over-exposure situation shown in Fig. 4, without requiring preprocessing such as deblurring or dehazing.

Table 1. Summary of object detection with evaluation metrics.

Object        Chess  Duck   Rabbit Fish   All
AP            0.913  0.936  0.921  0.865  0.908
Mask Overlay  0.916  0.942  0.925  0.892  0.918
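The paper does not define the pixel-level mask overlay ratio precisely; a natural reading is the intersection-over-union between the predicted and ground-truth masks, which the NumPy sketch below computes under that assumption.

import numpy as np

def mask_overlay_ratio(gt_mask: np.ndarray, pred_mask: np.ndarray) -> float:
    # Pixel-level overlap between two boolean masks of the same shape.
    # Assumption: 'overlay ratio' is read here as intersection-over-union (IoU);
    # the paper itself does not spell the metric out.
    gt = gt_mask.astype(bool)
    pred = pred_mask.astype(bool)
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(gt, pred).sum() / union

# Toy example: two partially overlapping square masks on a 100x100 grid.
gt = np.zeros((100, 100), dtype=bool);  gt[20:60, 20:60] = True
pr = np.zeros((100, 100), dtype=bool);  pr[25:65, 25:65] = True
print(round(mask_overlay_ratio(gt, pr), 3))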
4.3 Pose Estimation

Both the objects and the camera were in a water tank during the experiment. To evaluate the pose estimation results, we made a test set using 1000 synthetic images and used the mean absolute error between the ground truth and the predicted value.

As shown in Table 2, the quaternion outputs of the pose estimation network were converted to Euler angles to improve their readability.
As can be seen, the roll and yaw errors were somewhat larger than the pitch error, by nearly an average of 20°. Since the pose error and the deviation of the error between the objects were not large, we can conclude that our synthetic dataset has preliminary potential for pose estimation. On the other hand, the experimental results revealed a prevalent 180° difference between the ground truth and the estimated value on one axis. Our pose estimator can be confused when distinguishing the front and back of an object, which is a common problem in pose estimation with regression. Thus, we need to build a more specific model focused on orientation information and costs that cope with object symmetry.

Table 2. Mean Absolute Error of Pose Estimation

        Roll    Pitch  Yaw     All
Chess   27.45°  5.85°  37.80°  23.70°
Duck    20.83°  1.69°  29.31°  17.27°
Rabbit  19.54°  1.27°  18.29°  13.03°
Fish    26.94°  2.40°  20.64°  16.66°
All     23.69°  2.80°  20.64°  26.51°
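The readability conversion mentioned in Section 4.3 (quaternion outputs to Euler angles) and the per-axis mean absolute error reported in Table 2 can be reproduced along the following lines; this sketch assumes the common roll-pitch-yaw (x-y-z) convention and ignores angle wrap-around, neither of which is specified in the paper.

import numpy as np

def quat_to_euler(q):
    # Convert a unit quaternion (w, x, y, z) to roll, pitch, yaw in degrees.
    # Assumes the common aerospace x-y-z (roll-pitch-yaw) convention; the paper
    # does not specify which Euler convention it used for Table 2.
    w, x, y, z = q
    roll = np.degrees(np.arctan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y)))
    pitch = np.degrees(np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0)))
    yaw = np.degrees(np.arctan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z)))
    return np.array([roll, pitch, yaw])

def mean_abs_euler_error(q_gt_list, q_pred_list):
    # Per-axis mean absolute error (roll, pitch, yaw) over a test set.
    errs = [np.abs(quat_to_euler(qg) - quat_to_euler(qp))
            for qg, qp in zip(q_gt_list, q_pred_list)]
    return np.mean(errs, axis=0)

# Toy example with a single identical pair (all errors are zero).
q = np.array([1.0, 0.0, 0.0, 0.0])
print(mean_abs_euler_error([q], [q]))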
5. CONCLUSIONS

In this paper, we proposed an approach to making a synthetic image dataset with an automatic annotation tool using a 3D CAD model. We applied this dataset to object detection and pose estimation in an underwater environment. The experiments showed that our synthetic dataset presents preliminary potential for underwater object detection and pose estimation. In future work, we will pursue improvements in the dataset for pose estimation. Furthermore, through improved pose estimation quality, we seek to grasp objects in underwater environments.

REFERENCES

Busto, P.P. and Gall, J. (2018). Viewpoint refinement and estimation with adapted synthetic data. Comp. Vis. and Img. Under., 169, 75–89.
Chen, Z., Zhang, Z., Dai, F., Bu, Y., and Wang, H. (2017). Monocular vision-based underwater object detection. Sensors, 17(8), 1784.
Chu, F.J. and Vela, P.A. (2018). Deep grasp: Detection and localization of grasps with deep neural networks. arXiv:1802.00520.
Do, T.T., Cai, M., Pham, T., and Reid, I. (2018). Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint arXiv:1802.10367.
Hattori, H., Lee, N., Boddeti, V.N., Beainy, F., Kitani, K.M., and Kanade, T. (2018). Synthesizing a scene-specific pedestrian detector and pose estimator for static video surveillance. International Journal of Computer Vision, 1–18.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017). Densely connected convolutional networks. In CVPR, volume 1, 3.
Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., and Vasudevan, R. (2016). Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv preprint arXiv:1610.01983.
Kehl, W., Manhardt, F., Tombari, F., Ilic, S., and Navab, N. (2017). Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, 1521–1529.
Peng, X., Sun, B., Ali, K., and Saenko, K. (2014). Exploring invariances in deep convolutional neural networks using synthetic images. CoRR, 2(4).
Su, H., Qi, C.R., Li, Y., and Guibas, L.J. (2015). Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In Proceedings of the IEEE International Conference on Computer Vision, 2686–2694.
Wang, Y., Tan, X., Yang, Y., Liu, X., Ding, E., Zhou, F., and Davis, L.S. (2018). 3D pose estimation for fine-grained object categories. arXiv:1806.04314.
Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2017). Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199.
Zhou, X., Lan, X., Zhang, H., Tian, Z., Zhang, Y., and Zheng, N. (2018). Fully convolutional grasp detection network with oriented anchor box. arXiv:1803.02209.
