

Article
Real-Time Underwater Image Recognition
with FPGA Embedded System for Convolutional
Neural Network
Minghao Zhao, Chengquan Hu, Fenglin Wei, Kai Wang, Chong Wang and Yu Jiang *
College of Computer Science and Technology, Jilin University, Changchun 130012, China;
zhaomh17@mails.jlu.edu.cn (M.Z.); hucq@jlu.edu.cn (C.H.); weifenglin@jlu.edu.cn (F.W.);
wangkai87@jlu.edu.cn (K.W.); wangchonghrb@jlu.edu.cn (C.W.)
* Correspondence: jiangyu2011@jlu.edu.cn; Tel.: +86-431-8516-6269

Received: 28 November 2018; Accepted: 11 January 2019; Published: 16 January 2019 

Abstract: The underwater environment remains largely unknown to humans, so the high-definition camera is an important tool for short-range data acquisition underwater. Due to insufficient power, the image data collected by underwater submersible devices cannot be analyzed in real time. Exploiting the characteristics of the Field-Programmable Gate Array (FPGA), namely low power consumption, strong computing capability, and high flexibility, we design an embedded FPGA image recognition system based on a Convolutional Neural Network (CNN). By using two FPGA techniques, parallelism and pipelining, the parallelization of multi-depth convolution operations is realized. In the experimental phase, we collect and segment images from underwater video recorded by the submersible, then attach labels to the images to build the training set. The test results show that the proposed FPGA system achieves the same accuracy as the workstation, and we obtain a frame rate of 25 FPS at a resolution of 1920 × 1080. This meets our needs for underwater identification tasks.

Keywords: CNN; FPGA; image recognition; underwater smart device

1. Introduction
The underwater environment is very complex. We usually use sensors, such as sonar and cameras, to collect underwater object information [1–3]. Cameras provide a higher degree of recognition than sonar, but analyzing the full image data requires higher computational complexity. The autonomous intelligent underwater vehicle (AUV) is an underwater robot which has the advantages of a large range of activities, safety, and intelligence, and it has become an important tool for the completion of various underwater tasks. It can be used for laying pipelines, undersea inspection, data collection, drilling support, submarine construction, and the maintenance and repair of underwater equipment. The AUV cannot communicate with operators in real time because electromagnetic waves do not propagate over long distances in water. Therefore, the AUV seldom processes or analyzes the images captured by the camera in real time; it usually stores the sampled images on a built-in memory chip [4]. Moreover, the AUV cannot respond to the surrounding environment in time [5]. When the underwater vehicle travels into dangerous areas, it cannot make timely adaptations, such as evasion or bypassing, and the hull is easily damaged. In the traditional solution, the sonar of large AUVs is used to detect objects. However, sonar detectors must be kept a few meters or more away from the objects. Since small AUVs navigate in shallow water, physical damage to the AUVs cannot be prevented, and the AUV's working area is limited. To solve this problem, we plan to implant chips into the underwater vehicle to improve its capability of immediate image processing.

Sensors 2019, 19, 350; doi:10.3390/s19020350 www.mdpi.com/journal/sensors



In this way, the AUV can respond to the image information provided by the camera, make evasive
behaviors to dangerous objects, track marine life, and surround and photograph environmental
resources. However, the computing capability of a traditional low-power CPU is very weak, and it must be paired with a GPU for these applications [6]. The navigation system occupies most of the submersible's power supply, which cannot provide more than 100 watts to a traditional microcomputer [4]. For those reasons, we propose an FPGA chip with low power consumption to perform real-time image recognition. Image recognition, in the context of machine vision,
is the ability of software to identify objects, places, people, writing, and actions in images [7].
With the processing of FPGA, AUV is able to automatically distinguish between driving areas like
seawater, and non-driving areas, such as rocks and algae. When rocks are found, AUV takes steps to
avoid collisions. When confronted with algae, AUV needs to prevent the propeller from being caught.
Therefore, AUV can respond in a timely manner to different environments.
FPGAs have three advantages: (1) High-speed communication interface [8–10]. FPGA can be used
for high-speed signal processing, especially for high sampling rate from sensors. The FPGA can filter
the data and reduce the data rate, which makes the complex signal easy to process, transmit,
and store. (2) Fast digital signal processing. FPGA is suitable for high-real-time data processing,
such as images [9–11], radar signals, and medical signals, more efficiently than CPU [12]. (3) Higher
level of parallelism [7]. FPGA has the two characteristics of parallelism and pipelining [13]. Parallelism means that computing resources can be replicated so that multiple modules compute independently at the same time, while the pipeline makes hardware resources reusable. These two factors combined can significantly improve parallel performance [14].
Convolutional Neural network (CNN) is a mature, fast image classification algorithm [15].
In the 1960s, Hubel and Wiesel proposed the first convolutional neural network. Now, CNN has
become one of the research hot spots in many fields of science, especially in the field of pattern
recognition. Without complicated pre-processing of the image, it can analyze the original image
directly [16]. This algorithm allows multiple data processing units to run in parallel, which suits the parallel nature of the FPGA. To solve the problem of processing images in time, we design and implement an FPGA data processing system based on a lightweight CNN. We expect the FPGA system to process 1920 × 1080 video at 10 fps, meeting the minimum requirement for real-time image processing, while consuming no more than 10 watts. Comparing the well-trained model written
in the FPGA with a workstation platform, our system is able to accurately classify the captured images
in a short time with low power consumption.
The structure of the paper is as follows. In the second section, the concept of CNN and related
theories are introduced. In the third section, the overall architecture of a lightweight FPGA system is
described. In the fourth section, the system implementation is presented, including parallelism design
and pipeline design. In the fifth section, the neural network is written into the FPGA system, and it is
compared with the results of the workstation platform and embedded platform in the same video test
set. In the sixth section, some related work about this paper is demonstrated. The last section shows
the corresponding conclusion and future prospects.

2. Methodology
With the wide application of CNN, it has achieved excellent results in many fields in recent years. The accuracy of CNN-based simultaneous transcription products exceeds 95 percent, higher than the level of human shorthand [15]. In 2012, the neural network method made a significant breakthrough on the large-scale image data set ImageNet, and the accuracy rate reached 84.7% [16–20]. On the LFW face recognition evaluation database, the face recognition method based on the deep neural network DeepID reached accuracy rates of 99.15% and 99.53% in 2014 and 2015, respectively. This is even higher than the accuracy of humans, at 97.53% [20,21]. In the smart games area, in 2016, AlphaGo, of Google's DeepMind division, beat Go world champion Lee Sedol, ranked 9-dan professional [22]. On 18 October 2017, the DeepMind team released the strongest version of AlphaGo, code-named AlphaGo Zero. However, CNN's computation is enormous, and the power consumption of computing equipment has been tremendous in recent years. Therefore, researchers focus on the low-power FPGA chip to implement the algorithm [23–27].
In 2016, Wang et al. designed the deep learning accelerator unit (DLAU), which is a scalable accelerator architecture for large-scale deep learning networks, using a field-programmable gate array (FPGA) as the hardware prototype [23]. The same year, Qiu et al. presented an analysis of a state-of-the-art CNN model and showed that convolutional layers are computation-centric and fully connected layers are memory-centric. They then proposed a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN to improve bandwidth and resource utilization, and a prototype FPGA system was designed and implemented [24]. In 2017, Guo et al. used an optimization approach to find the optimal allocation plan for Digital Signal Processing (DSP) resources. On a Xilinx Virtex-7 FPGA, their design approach achieved performance over the state-of-the-art FPGA-based CNN accelerators from 5.48× to 7.25×, and by 6.21× on average, when they evaluated popular CNNs [25]. We will refer to their prototype to optimize the FPGA system architecture and improve the overall performance of the system by using pipeline and parallel technology.
CNN is a variant of feedforward neural networks [15,28]. CNN uses convolution to reduce the scale of the image while extracting different texture features in different ways, and then feeds the result to the neural network. Convolution not only reduces the amount of image computation in the neural network, but also extracts features effectively. Thus, CNN includes input, several convolutions, pooling, a neural network, and output. We usually describe the various parts of CNN as layers, so the structure of CNN includes the input layer, convolutional layer, pooling layer, fully connected layer, and output layer. Everything other than the input layer and the output layer is usually invisible, and we put those layers together and call them hidden layers. The main process of CNN is shown in Figure 1.

[Figure 1 pipeline: input image → Convolution + ReLU → Pooling → Convolution + ReLU → Pooling → Flatten → Fully Connected → Fully Connected → Output predictions, e.g., water (0.01), rock (0.94), bubble (0.04), algae (0.02)]

Figure 1. A CNN model for image recognition.

You can see the stone on the left of Figure 1. That is the input layer, which the computer understands as several input matrices. Next is the convolutional layer. The depth of the first convolution is 3, and the depth of the second convolution is 6. The depth of the convolution determines how many texture features it can extract. Convolution cannot always produce the values we want, so we usually need to insert an activation function after the convolution to make the appropriate adjustments. The activation function part is also called the activation layer, and the activation function selected in Figure 1 is ReLU. ReLU is an activation function that filters out negative values. Behind the convolutional layer is the pooling layer. The combination of a convolutional layer and a pooling layer can appear in the hidden layers many times; it appears twice in Figure 1. In fact, the number of convolutional and pooling layers is based on the needs of the model. Of course, we also have the flexibility to use a combination of convolutional layer and convolutional layer, or a combination of convolutional layer, convolutional layer, and pooling layer; these choices are not restricted when building the model. The most common CNN is the combination of several convolutional layers and pooling layers, like the structure in Figure 1. The fully connected layer is connected behind a number of convolutional layers and pooling layers, and it is actually a Deep Neural Network (DNN) structure. The output layer classifies the image by the Softmax activation function. To facilitate computer processing, we usually combine the images of different depths into one n × 1 array before the DNN, which we call Flatten.

One DNN layer is fully connected to the next, that is to say, every neuron in layer i is connected to every neuron of layer i + 1. Viewed as a small local model, each neuron is a combination of z = \sum_i w_i x_i + b and an activation function \sigma(z). First, we define the linear relationship coefficient w. Taking the three-layer DNN in Figure 2 as an example, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as w_{24}^{3}.

[Figure 2: a simple fully connected network with inputs X1, X2, X3 plus a bias unit (+1), one hidden layer a1(2), a2(2), a3(2) with another bias unit, and the output h_{W,b}(X); layers are labeled L1 to L3]

Figure 2. This is a simple model for DNN structure.
Bias b definition: Taking the three-layer DNN in Figure 2 as an example, the bias corresponding to the third neuron of the second layer is defined as b_{3}^{2}. Assuming that there are m neurons on layer l − 1, then for the output of neuron j of layer l, we get:

a_j^l = \sigma(z_j^l) = \sigma\left( \sum_{k=1}^{m} w_{jk}^l \, a_k^{l-1} + b_j^l \right).   (1)
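As a concrete illustration of Formula (1), the following minimal NumPy sketch evaluates one fully connected layer for all of its neurons at once (the variable names and sample values are ours, not from the paper; the FP16 dtype mimics the float format used on the FPGA):

```python
import numpy as np

def layer_forward(a_prev, W, b, sigma=lambda z: np.maximum(z, 0.0)):
    """Formula (1): a^l_j = sigma(sum_k w^l_jk * a^(l-1)_k + b^l_j) for every neuron j.

    a_prev : (m,)   outputs of layer l-1
    W      : (n, m) weights, W[j, k] = w^l_jk
    b      : (n,)   biases b^l_j
    sigma  : activation; ReLU here, matching the activation used later on the FPGA
    """
    z = W @ a_prev + b            # z^l_j for all neurons of layer l at once
    return sigma(z)

# Tiny example with illustrative numbers
a_prev = np.array([0.2, 0.5, 0.1], dtype=np.float16)
W = np.array([[0.1, -0.3, 0.8],
              [0.4,  0.2, -0.5]], dtype=np.float16)
b = np.array([0.05, -0.02], dtype=np.float16)
print(layer_forward(a_prev, W, b))
```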

Our goal is to use the FPGA's high bit-width feature to achieve a close link between the neural network layers on the FPGA, while using the high throughput rate to achieve fast matrix-matrix operations.
3. System Structure

In this section, we describe the structure of the FPGA system in detail and how we improve the throughput of the system through parallelism and pipelining. In the convolutional layer, we parallelize the convolution units and use a pipeline scheme to calculate the inner product; in the fully connected layer, we discuss two parallel pipeline scenarios. Taking the actual calculation of the fully connected layer into consideration, we choose the second solution to eliminate the waiting time between multiplication and addition. The two points above effectively accelerate the heavy matrix operations of CNN and improve the overall efficiency of the FPGA-based CNN algorithm.

3.1. Architecture

CNN is equipped with the gift of image recognition, while FPGA is characterized by high speed, high concurrency, and low power consumption. We implement the lightweight convolutional neural network algorithm on the FPGA. We use the Xilinx ZYNQ 7000 series FPGA development board [29]. The board contains a high-performance ARM CPU, providing a Linux operating system for data storage, logic processing, and logging. CNN mainly includes the input layer, hidden layers, and output layer. As the input layer is implemented on the ARM side, the hidden layers and the output layer are implemented on the FPGA side. The hidden layers consist of 8 convolutional units, 8 activators, 8 poolers, and 256 neural modules. The convolutional unit corresponds to the convolution layer of CNN, the activator to the activation layer, the pooler to the pooling layer, and the 256 neural modules make up the fully connected layer. The output layer consists of 10 neural modules and an interrupt module. The structure of the system is shown in Figure 3.

[Figure 3: the ARM CPU loads the kernels and images through a buffer and the AXI bus into memory; on the FPGA side, 8 convolutional units (Conv + Acti) feed poolers, which feed the bank of neural modules (N)]

Figure 3. FPGA system architecture.

In Figure 3, ARM is the onboard ARM CPU. The AXI bus provided by ZYNQ is the data channel between the FPGA and the memory, because the logical units in the FPGA cannot exchange data directly with the memory without the AXI bus. In Figure 3, Conv stands for convolutional unit, Acti stands for activator, Pooler stands for the pooler module, and N stands for a neural module. The neural module has 256 neurons that are directly connected to the AXI bus, and the neuron operations are dispatched through a neural selection register with a bit width of 16.

Considering the low-power operation characteristics of FPGA, FP16 is a better choice for the float format. FP16 multiplication is faster than FP32, which gives the FPGA a higher throughput rate. On the AXI bus, we have designed a total of 4 DDR controllers with 32-bit width that together provide 128 bits of data width. The 128-bit data width can read eight 16-bit float numbers at a time. In the convolution phase, a fast convolution process for images with a depth of 8 layers can be performed. In the fully connected layer, the multiplications of 8 neurons can be calculated at the same time, which greatly accelerates the neural network.
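As a rough software analogy of the 128-bit read described above (a sketch only, not the RTL), the snippet below packs eight FP16 values into one 16-byte word and unpacks them again; the values are illustrative:

```python
import numpy as np

# Eight FP16 numbers are exactly 8 x 16 bit = 128 bit, i.e., one full-width bus word.
vals = np.array([0.5, -1.25, 3.0, 0.0, 2.5, -0.75, 1.0, 4.0], dtype=np.float16)

word = vals.tobytes()                     # one 128-bit (16-byte) memory word
assert len(word) * 8 == 128

unpacked = np.frombuffer(word, dtype=np.float16)   # what one bus read delivers
print(unpacked)
```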
3.2. Convolution Layer Design

In the convolutional layer, we can allow 8 convolution kernels to read the same values and calculate simultaneously, because the input sources of the same depth in different convolution kernels are the same. The process of convolution [30] is shown in Formula (2):

s_l(i, j) = (X * W)(i, j) = b + \sum_d \sum_m \sum_n x(i + m, j + n)\, w(m, n) \quad (1 \le l \le L).   (2)

In Formula (2), s denotes the result of each point after the convolution; l denotes the label of the convolution, that is, the number of the convolution kernel; L denotes the convolution depth of the current layer; i and j denote the horizontal and vertical coordinates, respectively. X denotes the image matrix and W represents the weight matrix required by the calculation unit, while d denotes the depth of the convolution, whose maximum value is the maximum depth (the number of convolution kernels) of the previous layer; m and n denote the coordinates within the convolution kernel, and b refers to the bias, usually a constant obtained during training. The operation in the above formula is implemented in the convolutional unit, and in Figure 3 the 8 convolutional units can run with a maximum depth of 8 in parallel.

Considering that the calculations at different convolution depths are independent, the result of the previous point has no effect on the result of the latter point, so we separate the calculation at different convolution depths to improve the parallel efficiency. This reduces the computational difficulty of the above formula and yields Formula (3):

s_{l,d}(i, j) = (X * W)(i, j) = \sum_m \sum_n x(i + m, j + n)\, w(m, n).   (3)

The parameter description of this formula is the same as for s_l(i, j), and the sum of all the s_{l,d} equals s_l. For this purpose, we designed 8 convolutional kernel units in each convolutional unit.
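The following NumPy sketch mirrors the split of Formulas (2) and (3): each depth is evaluated independently, as one kernel unit would do, and the per-depth results are then accumulated with the bias. Array shapes and values are illustrative assumptions:

```python
import numpy as np

def conv_point_depth(x_d, w_d, i, j):
    """Formula (3): the 3x3 inner product at one point for a single depth d."""
    m, n = w_d.shape
    return float(np.sum(x_d[i:i + m, j:j + n] * w_d))

def conv_point(x, w, b, i, j):
    """Formula (2): accumulate the independent per-depth results and add the bias."""
    return b + sum(conv_point_depth(x[d], w[d], i, j) for d in range(x.shape[0]))

# Depth-8 input (one slice per parallel kernel unit) and one 3x3 kernel per depth
x = np.random.rand(8, 6, 6).astype(np.float16)
w = np.random.rand(8, 3, 3).astype(np.float16)
print(conv_point(x, w, b=0.1, i=2, j=1))
```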
As shown in Figure 4, the left buffer stores N images of different depths. The shift_window is a convolution shift window that holds a 3 × 3 convolution matrix. K represents the convolution kernel unit, and the weights of the convolution matrix are stored in K. After receiving the convolution data from the shift_window, the inner product operation starts immediately. At the end of the inner convolution operation, K sends an end request to the shift_window module, and SUM gets the result at this point, which accumulates the values of all depths and the bias. The result is cached in the buffer, and as the memory read request ends, the cache writes the new point back into memory.

Figure 4. The structure of the convolutional unit.

The maximum inner product of the convolution is 3 × 3, with 9 multipliers and 9 adders, as shown in Figure 5.

Figure 5. The structure of the inner product unit of Convolution Layers.
Since the calculation time of the multiplier and the adder is greater than the data loading time, and the multiplier calculation time is greater than that of the adder, we accelerate the convolution speed by a pipeline strategy.

As shown in Figure 6, each unit has three steps: loading data, multiplication, and addition. Once the data is loaded, the multiplication of each unit starts immediately. After that, the data bus is vacated and the next unit loads data at once. The multiplication starts, and the adder needs to wait until the last addition finishes. Therefore, it takes 9 units of adder time plus a multiplier time and a loading time to calculate a convolution. The pipeline speeds up the convolution.

[Figure 6: pipeline timing of the inner product unit; successive units overlap data loading (DL), multiplication (MUL), and addition (ADD)]

Figure 6. The pipeline of inner product unit of Convolution Layers.

Given the limited computing resources of the FPGA, a relatively simple ReLU [30] function is selected as the activation function. It is shown in Formula (4):

f(x) = \begin{cases} x & (x > 0) \\ 0 & (x < 0) \end{cases}.   (4)

The function can effectively filter out the negative points in the image, and it is easy to implement in FPGA.

The pooling layer selects the Max Pooling method. The max pooling window is 2 × 2, which chooses the maximum value of 4 points as the new point; the 8 poolers can read data simultaneously from memory, reading four values sequentially. The structure of the Pooler is shown in Figure 7. The register starts with a value of zero and stores the current maximum value. When one comparison finishes, the Control Logic saves the result of the Comparator, and the Comparator reads a new point and starts a new comparison. After four comparisons, the Control Logic writes the result back to memory.

[Figure 7: structure of the Pooler; input data feed a Comparator and a Register, and the Control Logic produces the output]

Figure 7. The structure of Pooler.
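A compact NumPy sketch of the two operations just described, the ReLU of Formula (4) and 2 × 2 max pooling, is given below (the sample feature map is illustrative and edge handling is simplified):

```python
import numpy as np

def relu(x):
    """Formula (4): keep positive values, zero out the rest."""
    return np.where(x > 0, x, 0)

def max_pool_2x2(img):
    """2x2 max pooling: each pooler keeps the maximum of every group of 4 points."""
    h, w = img.shape[0] - img.shape[0] % 2, img.shape[1] - img.shape[1] % 2
    img = img[:h, :w]                                   # drop odd edges for simplicity
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [ 4.,  5., -1.,  2.],
                        [ 0., -3.,  7.,  1.],
                        [ 2.,  6., -5.,  8.]], dtype=np.float16)
print(max_pool_2x2(relu(feature_map)))   # [[5. 3.] [6. 8.]]
```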
3.3. Fully-Connected Layers Design

For the fully connected layer, a larger matrix operation has been generated after flattening; however, our memory bus has only 128 bits and can only fetch eight 16-bit float numbers at the same time. Each neuron can record 16 weights. The following are two strategies to consider.

Strategy 1: Load the 8 × 16 weights of neurons 0–7, load 8 × 8 image values, and calculate. While the calculation is carried out, load the next 8 × 8 values at the same time; calculate, and load 8 × 8 image values simultaneously. Load the following 8 × 16 weights of neurons 0–7, load 8 × 8 image values, and calculate. Repeat the process until all points are calculated and the output values are obtained. Then calculate the output values of neurons 8–15, then of neurons 16–23, until the output of all neurons is obtained, as shown in Figure 8.

[Figure 8: pipeline timing for Strategy 1; kernel loads (KL), data loads (DL), and calculations (CAL) alternate for one group of eight neurons, and the whole sequence repeats for the next group]

Figure 8. The first strategy of Full Connected layer.

Strategy 2: Load the 8 × 16 weights of neurons 0–7, load the 8 × 16 weights of neurons 8–15, and so on until all neurons' weights are loaded. Then neurons 0–7 load 8 × 8 image values and calculate. While neurons 0–7 are calculating, neurons 8–15 load their 8 values at the same time, and then neurons 8–15 calculate. When all the neurons finish the work, reload all the weights of all the neurons, then load 8 points, calculate, and so on until all the points are calculated. This is shown in Figure 9.

[Figure 9: pipeline timing for Strategy 2; all kernel loads (KL1–KL16, ...) happen first, then data loads (DL) and calculations (CAL) stream through all neurons]

Figure 9. The second strategy of Full Connected layer.

Assuming that the result of the previous layer is a matrix of N × 1 points, that there are n neurons on this layer, that loading 8 values takes a clock cycles, and that one calculation takes b clock cycles.

For strategy 1, loading weights costs

t_{1k} = \frac{N}{8} \cdot 2a \cdot \frac{n}{8}.   (5)

Calculation costs

t_{1c} = \frac{n}{8} \cdot \frac{N}{8} \cdot b + \frac{a \cdot n}{8}.   (6)

The whole process needs

t_1 = t_{1k} + t_{1c} = \frac{(N(b + 2a) + a)\,n}{64}.   (7)

For strategy 2, loading weights costs

t_{2k} = \frac{N}{8} \cdot 2a \cdot \frac{n}{8}.   (8)

Calculation costs

t_{2c} = \frac{((n/8 - 1)\,a + b) \cdot N}{8}.   (9)

The whole process needs

t_2 = t_{2k} + t_{2c} = \frac{(8(n - 1)a + 8b + 2a \cdot n) \cdot N}{64}.   (10)

The difference between Scenario 1 and Scenario 2 is

\Delta t = t_1 - t_2 = \frac{(b - a)(n - 8)N + 8an}{8}.   (11)
It is known that calculation is slower than data loading, and we note that there are at least 2 neurons in the output layer, so n is greater than or equal to 2. When n is greater than or equal to 8, Δt is positive; and in practice only a few layers have fewer than 8 neurons. This indicates that strategy two is more effective than strategy one in most cases, so we choose strategy two as the pipeline design of the fully connected layer.

4. Experiment Setup
Based on our design in the preceding sections, we wrote the FPGA logic code. In this section, we test our logic. Our experimental data comes from video collected by an underwater robot in the underwater world of the Tianchi Lake of Changbai Mountain, which is a 1920 × 1080, 30 FPS video. We extracted the clearest image from each second of video, divided each picture into 480 × 360 tiles, and labeled the data source. After pre-processing, we trained the dataset on the workstation and exported the best-trained neural network model. Instantiating it on the FPGA, we compared the two strategies. After that, we compared the better one with the training platform and the embedded platform.
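A sketch of the preprocessing just described, assuming OpenCV is available (the file names and output layout are illustrative): one frame is kept per second of the 30 FPS video and split into 4 × 3 = 12 tiles of 480 × 360.

```python
import cv2

cap = cv2.VideoCapture('underwater.mp4')          # hypothetical input file
fps = int(round(cap.get(cv2.CAP_PROP_FPS)))       # 30 for our footage

frame_idx, second = 0, 0
while True:
    ok, frame = cap.read()                        # frame is 1080 x 1920 x 3
    if not ok:
        break
    if frame_idx % fps == 0:                      # keep one frame per second
        for r in range(3):                        # 1080 / 360 = 3 tile rows
            for c in range(4):                    # 1920 / 480 = 4 tile columns
                tile = frame[r * 360:(r + 1) * 360, c * 480:(c + 1) * 480]
                cv2.imwrite('tiles/%05d_%d_%d.png' % (second, r, c), tile)
        second += 1
    frame_idx += 1
cap.release()
```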

4.1. Evaluation Criteria


Because of the complicated training process and the large amount of computation, we trained the model on a high-performance workstation. The evaluation criteria of the experiment were accuracy and categorical cross-entropy loss [15]. For each picture, the output is shown as a matrix: the element in the first row and first column corresponds to the probability of the first category, the element in the second row and second column corresponds to the probability of the second category, and the element in the n-th row and n-th column corresponds to the probability of the n-th category. It is shown in Formula (12):

\mathrm{result} = \begin{pmatrix} 0.13 & 0 & 0 \\ 0 & 0.72 & 0 \\ 0 & 0 & 0.15 \end{pmatrix}.   (12)

From the results in Formula (12), the second row has the highest probability, so we assume that the picture belongs to the second category. Accuracy is the rate of correctly classified pictures over all pictures:

\mathrm{Accuracy} = \frac{n_{pt}}{n}.   (13)
In Formula (13), n is the actual number of samples in a category, and n_{pt} is the number of correctly classified samples in the test.
When calculating multi-classification, the categorical cross-entropy loss [30] measures the difference between the predicted data and the true data. The lower the value, the better the model. The formula is as follows:

\mathrm{loss} = -\sum_{i=1}^{n} \left( y_{i1} \log \hat{y}_{i1} + y_{i2} \log \hat{y}_{i2} + \ldots + y_{im} \log \hat{y}_{im} \right).   (14)


In Formula (14), n is the number of samples and m is the number of categories; y is the expected probability of the sample: if the tag is marked as a certain class, the expected probability is 1, otherwise it is 0. Further, ŷ is the probability of the test result, as shown in the results above. Even if the accuracy rate is 100% but all samples are predicted with a probability of 0.6, the loss for 100 samples is 22.18. So we expect the loss of the model to be as low as possible, and the loss values to be as similar as possible on the test set and on the predicted set.
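The worked example above can be checked with the short sketch below (the sample arrays are ours). Note that the quoted figure of 22.18 corresponds to evaluating Formula (14) with a base-10 logarithm; with the natural logarithm, as used by Keras's categorical cross-entropy, the same case gives about 51.1.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, log=np.log10):
    """Formula (14); the log base is a parameter (see the note above)."""
    return float(-np.sum(y_true * log(y_pred)))

def accuracy(y_true, y_pred):
    """Formula (13): fraction of samples whose highest-probability class is correct."""
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

# 100 samples, every one predicted correctly but only with probability 0.6
y_true = np.tile([1.0, 0.0, 0.0], (100, 1))
y_pred = np.tile([0.6, 0.2, 0.2], (100, 1))

print(accuracy(y_true, y_pred))                                     # 1.0
print(round(categorical_cross_entropy(y_true, y_pred), 2))          # 22.18
print(round(categorical_cross_entropy(y_true, y_pred, np.log), 2))  # 51.08
```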

4.2. Experiment Result


We received three effective videos, with 20 min effective duration. From 20 min of video,
we got 1200 pictures. The videos come from three areas. The first area is about the images that
the submersible collects when it just enters into the water, so we mark it as shallow water. The second
area is about the images that the submersible collects when it dives to a certain depth and more rocks
and algae can be seen, so we mark it as the middle water area. The third area is where the submersible reached 300 m and the light is weak, so we mark it as the weak light area. The captured 1200 images contain 338 shallow water, 403 middle water, and 459 low light photos. By dividing each image into 4 × 3 tiles, we get 14,400 training samples. We divide the images into four categories: water, algae, bubbles,
and rocks, and we attach the label to each picture. Tianchi Lake of Changbai Mountain is a volcanic lake
with no primitive fish habitats. In order to reduce the damage and cost of AUV, Tianchi Lake is selected
as an experimental site for acquiring data, which can avoid the interference of fish on the submersible.
In future experiments, other underwater environments and oceans would be considered. This is
the first AUV travel in the Tianchi Lake, and in the future, we will consider the possible categories.
The data set results are shown in Table 1.

Table 1. Dataset.

                Sum    Water   Algae   Rock   Bubble
shallow water   4056   1661    652     1188   555
middle water    4836   1789    877     1820   350
weak light      5508   1456    1879    2173   -

Note that in weak light areas, the shooting ability of the camera is weak, so it is unable to
obtain valid bubble pictures. Some blurry pictures that humans cannot identify accurately were labeled as rock, since rocks and unknown areas are dangerous. As the three data sets are different,
we trained each data set separately. We sampled 70 percent of the data set randomly as a training
sample and 30 percent as a test sample.
The FPGA test platform used in this article was the Xilinx ZYNQ 7035. The platform contains 275K logic cells, 900 DSP units, and a dual-core ARM CPU, which meets our demand for the number of logic units and float computing capacity. The platform can run Linux, which makes it easy to store data and debug.
The training experiment platform selected in this paper: hardware platform: E5 1230v4, 64 GB DDR4, 1080 Ti SLI; software platform: Ubuntu 18.04, Python 3.6, TensorFlow 1.9.1 and Keras 2.2.4.
The embedded experiment platform selected in this paper: hardware platform: Raspberry Pi 3B; software platform: Raspbian, 2 GB swap, Python 3.6, TensorFlow 1.9.1 and Keras 2.2.4.
The power consumption of the Raspberry Pi is less than 10 watts, and it can work as a personal
computer. It is suitable for computing with low power consumption.
In order to correspond to the structure of our FPGA design, we use two 3 × 3 convolution kernels and take three convolutions. The depths of the convolutions are 8, 4, and 4, and the stride is 1. There are three layers in the fully connected part. The number of epochs is 10, the activation function is ReLU, and the dropout rate is 0.5. The loss function used in training is the cross-entropy cost function, and the optimal results of the training model are shown in Table 2.
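As a rough Keras sketch of the network just described (our reading of the description; the widths of the inner fully connected layers and the optimizer are assumptions, since the text only states that there are three fully connected layers and a cross-entropy loss):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Three 3x3 convolutions with depths 8, 4, 4 and stride 1, each followed by
# 2x2 max pooling, then Flatten and three fully connected layers ending in the
# four classes (water, algae, bubble, rock). Input tiles are 480 x 360 RGB.
model = Sequential([
    Conv2D(8, (3, 3), strides=1, activation='relu', input_shape=(360, 480, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(4, (3, 3), strides=1, activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(4, (3, 3), strides=1, activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),      # width assumed
    Dropout(0.5),
    Dense(64, activation='relu'),       # width assumed
    Dense(4, activation='softmax'),
])
model.compile(optimizer='adam',                     # optimizer assumed
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```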

From Table 2, the accuracy of the training model in the training set is more than 0.85.
As the underwater submersible moves at speed, the camera always captures blurred images,
which human eyes cannot identify accurately. We export the Keras model weights as an h5 file, convert them to the parallel-load file format, and write them into the FPGA flash. The results of the two strategies
are shown in Table 3.

Table 2. Training optimal parameters.

                Loss   Accuracy
Shallow water   0.59   0.887
Middle water    0.44   0.923
Weak light      0.43   0.861

Table 3. Comparison of two strategies.

                Loss1   Acc1    Time1   Loss2   Acc2    Time2
Shallow water   0.61    0.872   8.0 s   0.61    0.872   4.0 s
Middle water    0.48    0.920   8.0 s   0.48    0.920   4.1 s
Weak light      0.44    0.865   9.9 s   0.42    0.865   4.5 s

In Table 3, loss1, acc1, and time1 are the loss, accuracy, and computing time of the first strategy, respectively, and loss2, acc2, and time2 are the corresponding results of the second strategy. They show the same loss and accuracy, while the second strategy takes half the time of the first one. So in our model, the second strategy is the better choice.
The experimental results of the FPGA platform with the second strategy and the comparison platforms are shown in Table 4.
In Table 4, Test1, Test2, and Test3 represent the shallow water, middle water, and weak light test sets, respectively. Lossf, accf, and timef are the loss, accuracy, and computing time of the FPGA, respectively; lossc, accc, and timec are the loss, accuracy, and computing time of the comparison training platform; and losse, acce, and timee are the results of the embedded platform. The experiment shows that the accuracy of the FPGA structure is the same as that of the comparison platform. The loss is slightly higher than on the workstation platform because we use non-standard 16-bit float multiplication and addition. These non-standard 16-bit float operations are fast, but compared with the true values they introduce a certain error. The error only affects the loss value and does not have much impact on the accuracy rate. In terms of test time, the difference between our FPGA test platform and the 1080 Ti is remarkable. The workstation platform needs a 200-watt power supply, a power consumption that underwater submersibles cannot provide, whereas our FPGA platform's average power consumption is only about 9.5 watts. The three test sets contain 1235, 1451, and 1652 pictures, respectively, so the average processing speeds of the FPGA are 309, 353, and 367 images per second. The lowest speed is 309 images of 480 × 360 per second; since each 1920 × 1080 frame is split into 4 × 3 = 12 such tiles, this is equivalent to about 25 frames of 1920 × 1080 per second, that is, 1920 × 1080 at 25 fps. The processing speed is reduced by 50 percent and the power consumption by 30 percent, and the processing requirement of 1920 × 1080 at 10 fps for underwater cameras can almost be met. Although the Raspberry Pi can achieve the same loss and accuracy, its processing time does not meet our needs.

Table 4. Comparison of results.

        Lossf   Accf    Timef   Lossc   Accc    Timec   Losse   Acce    Timee
Test1   0.61    0.872   4.0 s   0.60    0.872   <1 s    0.60    0.872   479.6 s
Test2   0.48    0.920   4.1 s   0.48    0.920   <1 s    0.48    0.920   572.9 s
Test3   0.44    0.865   4.5 s   0.42    0.865   <1 s    0.42    0.865   656.3 s

5. Conclusions
Based on the parallelism and pipeline technology of FPGA, we designed an underwater image recognition system with CNN and instantiated it on an FPGA platform. We exported the trained model from the workstation platform and loaded it into the FPGA system. In the test phase, the final test result is the same as that of the training workstation. The system has low power consumption, and its computing capability satisfies the requirement of the underwater camera, which is 1920 × 1080 at 10 fps, with a peak performance of 1920 × 1080 at 25 fps. Our system can analyze the images collected by the camera of the AUV in time, which enhances its perception of the surroundings and allows it to take timely action in dangerous environments. In future work, we will improve the overall performance of our system: the convolution operation will be expanded to more than eight convolution kernels, while non-standard-size images and convolution operations with depths less than four will be optimized.

Author Contributions: M.Z. designed the system structure. C.H. conceived and designed the experiments.
F.W. wrote FPGA codes. K.W. and C.W. arranged the images and analyzed the data. M.Z. and Y.J. wrote the paper.
Funding: The work was supported in part by the National Natural Science Foundation of China
(51679105, 61872160, 51809112, 51409117).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Marsh, L.; Copley, J.T.; Huvenne, V.A.; Tyler, P.A. Getting the bigger picture: Using precision Remotely
Operated Vehicle (ROV) videography to acquire high-definition mosaic images of newly discovered
hydrothermal vents in the Southern Ocean. Deep Sea Res. Part II Top. Stud. Oceanogr. 2013, 92, 124–135.
[CrossRef]
2. Jiang, Y.; Gou, Y.; Zhang, T.; Wang, K.; Hu, C. A machine learning approach to argo data analysis in
a thermocline. Sensors 2017, 17, 2225. [CrossRef] [PubMed]
3. Jiang, Y.; Zhao, M.; Hu, C.; He, L.; Bai, H.; Wang, J. A parallel FP-growth algorithm on World Ocean Atlas
data with multi-core CPU. The journal of Supercomputing. 2018, 1–14. [CrossRef]
4. Zerr, B.; Mailfert, G.; Bertholom, A.; Ayreault, H. Sidescan sonar image processing for auv navigation.
In Proceedings of the Europe Oceans 2005, Brest, France, 20–23 June 2005; Volume 1, pp. 124–130. [CrossRef]
5. Allotta, B.; Caiti, A.; Costanzi, R.; Fanelli, F.; Fenucci, D.; Meli, E.; Ridolfi, A. A new AUV navigation system
exploiting unscented Kalman filter. Ocean Eng. 2016, 113, 121–132. [CrossRef]
6. Degalahal, V.; Tuan, T. Methodology for high level estimation of FPGA power consumption. In Proceedings of
the ACM 2005 Asia and South Pacific Design Automation Conference, Shanghai, China, 18–21 January 2005;
pp. 657–660. [CrossRef]
7. Guan, Y.; Yuan, Z.; Sun, G.; Cong, J. FPGA-based accelerator for long short-term memory recurrent
neural networks. In Proceedings of the 2017 22nd Asia and South Pacific Design Automation Conference
(ASP-DAC), Chiba, Japan, 16–19 January 2017; pp. 629–634. [CrossRef]
8. Jiang, Y.; Zhang, T.; Gou, Y.; He, L.; Bai, H.; Hu, C. High-resolution temperature and salinity model analysis
using support vector regression. J. Ambient Intell. Hum. Comput. 2018, 1–9. [CrossRef]
9. Nuno-Maganda, M.A.; Arias-Estrada, M.O. Real-time FPGA-based architecture for bicubic interpolation:
An application for digital image scaling. In Proceedings of the International Conference on Reconfigurable
Computing and FPGAs, Puebla City, Mexico, 28–30 September 2005; Volume 1. [CrossRef]
10. Klein, B.; Hochgürtel, S.; Krämer, I.; Bell, A.; Meyer, K.; Güsten, R. High-resolution wide-band fast Fourier
transform spectrometers. Astron. Astrophys. 2012, 542, L3. [CrossRef]
11. Lifeng, W.; Shanqing, H.; Teng, L. Design and implementation of multi-rate data exchange system for radar
signal processing. In Proceedings of the IET International Radar Conference, Guilin, China, 20–22 April 2009.
[CrossRef]
12. Nane, R.; Sima, V.M.; Pilato, C.; Choi, J.; Koen, B. A survey and evaluation of FPGA high-level synthesis
tools. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2016, 35, 1591–1604. [CrossRef]

13. Farabet, C.; LeCun, Y.; Kavukcuoglu, K.; Culurciello, E.; Martini, B.; Akselrod, P.; Talay, S.
Large-scale FPGA-based convolutional networks. In Scaling up Machine Learning: Parallel and Distributed
Approaches; Cambridge University Press: Cambridge, UK, 2011; pp. 399–419.
14. Choi, Y.; Cong, J.; Fang, Z.; Hao, Y.; Reinman, G.; Peng, W. A quantitative analysis on microarchitectures of
modern CPU-FPGA platforms. In Proceedings of the ACM 53rd Annual Design Automation Conference,
Austin, TX, USA, 5–9 June 2016; p. 109. [CrossRef]
15. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [CrossRef] [PubMed]
16. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A.; Gould, S. Dynamic image networks for action recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
27–30 June 2016; pp. 3034–3042. [CrossRef]
17. Zhang, X.; Zou, J.; Ming, X.; He, K.; Sun, J. Efficient and accurate approximations of nonlinear
convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Boston, MA, USA, 7–12 June 2015; pp. 1984–1992. [CrossRef]
18. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up convolutional neural networks with low rank
expansions. arXiv, 2014; arXiv:1405.3866.
19. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural
networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems,
Lake Tahoe, NV, USA, 3–6 December 2012.
20. Tompson, J.J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical
model for human pose estimation. In Proceedings of the NIPS 2014: Neural Information Processing Systems
Conference, Montreal, Canada, 8–13 December 2014; pp. 1799–1807.
21. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.;
Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of
four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [CrossRef]
22. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hurbert, T.; Baker, L.; Lai, M.;
Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354. [CrossRef]
[PubMed]
23. Wang, C.; Gong, L.; Yu, Q.; Xi, L.; Xie, Y.; Zhou, X. DLAU: A scalable deep learning accelerator unit on FPGA.
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2017, 36, 513–517. [CrossRef]
24. Qiu, J.; Wang, J.; Yao, S.; Cuo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al.
Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23
February 2016; pp. 26–35. [CrossRef]
25. Guo, J.; Yin, S.; Ouyang, P.; Liu, L.; Wei, S. Bit-Width Based Resource Partitioning for CNN Acceleration
on FPGA. In Proceedings of the 2017 IEEE 25th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), Napa, CA, USA, 30 April–2 May 2017; p. 31. [CrossRef]
26. Farabet, C.; Poulet, C.; Han, J.Y.; Lecun, Y. Cnp: An fpga-based processor for convolutional networks.
In Proceedings of the 2009 IEEE International Conference on Field Programmable Logic and Applications
(FPL 2009), Prague, Czech Republic, 31 August–2 September 2009; pp. 32–37. [CrossRef]
27. Del Sozzo, E.; Di Tucci, L.; Santambrogio, M.D. A highly scalable and efficient parallel design of N-body
simulation on FPGA. In Proceedings of the 2017 IEEE International Parallel and Distributed Processing
Symposium Workshops (IPDPSW), Lake Buena Vista, FL, USA, 29 May–2 June 2017; pp. 241–246. [CrossRef]
28. Bouvrie, J. Notes on Convolutional Neural Networks. Available online: http://cogprints.org/5869/
(accessed on 15 January 2019).
29. Xilinx, Inc. Xilinx Zynq-7000 All Programmable SoC Accelerates the Development of Trusted Systems at ARM
TechCon 2012; Xilinx Inc.: San Jose, CA, USA, 2012.
30. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
