
Journal of Real-Time Image Processing

https://doi.org/10.1007/s11554-021-01161-4

ORIGINAL RESEARCH PAPER

Efficient binary 3D convolutional neural network and hardware accelerator

Guoqing Li¹ · Meng Zhang¹ · Qianru Zhang¹ · Zhijian Lin²

Received: 23 April 2021 / Accepted: 31 July 2021


© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
Three-dimensional convolutional neural networks have abundant parameters and high computational costs, so it is urgent to compress them. In this paper, an efficient and simple binary three-dimensional
convolutional neural network architecture is proposed, in which the weight and activation are constrained to 0 or 1 instead
of the common + 1 or – 1. Binary weights and activations are applied to three-dimensional convolutional neural networks for the first time. The proposed binary three-dimensional convolutional neural network has less computational complexity and memory
consumption than standard convolution, and it is more appropriate for digital hardware design. Furthermore, an optimized
convolution operation is proposed, in which case one input pixel is only required to be read once. A distributed storage
approach is proposed to support the proposed convolution operation. With the proposed methods, a hardware accelerator
for the binary three-dimensional convolutional neural network on the field programmable gate array platform is designed.
The experimental results show that the presented accelerator is excellent in terms of computational resources and power
efficiency. By jointly optimizing the algorithm and hardware, the accelerator achieves 89.2% accuracy and 384 frames per
second on the KTH dataset.

Keywords  Binary convolutional neural network · Hardware accelerator · Three-dimensional convolution · Action
recognition · FPGA

* Meng Zhang
zmeng@seu.edu.cn

Guoqing Li
li_guoqing@seu.edu.cn

Qianru Zhang
zhangqr@seu.edu.cn

Zhijian Lin
22095094@seu.edu.cn

1 National ASIC Research Center, School of Electronics Science and Engineering, Southeast University, Nanjing 210096, China
2 School of Microelectronics, Southeast University, Nanjing 210096, China

1 Introduction

In recent years, convolutional neural networks (CNNs) have been the state-of-the-art solutions in a good deal of computer vision tasks, such as object detection [6], image classification [7], and image super-resolution [23]. For example, the classification accuracy of ResNet (96.4% [9]) could outperform the human level (94.9%) on ImageNet. Higher accuracy relies on a more complex model, which needs more computational resources and memory. For instance, VGG-19 [27] has 144 M parameters and performs 19.6 billion FLOPs (multiply-adds) for one 224 × 224 image. The three-dimensional convolutional neural network (3D CNN), which is widely used in human action recognition [33], is even more intricate than two-dimensional (2D) CNNs. As shown in Fig. 1, one more temporal dimension is introduced for capturing temporal features in 3D CNNs, which leads to more computational and memory costs than 2D CNNs. C3D [28], a 3D CNN model for human action recognition with only 11 layers, takes more than 77 Giga operations (GOPs). It is difficult to deploy such complex CNNs on low-power devices. Therefore, model compression and dedicated hardware accelerators for embedded devices have been widely studied recently [35]. However, most compression works target 2D CNNs. In this paper, a 3D CNN is compressed by binary quantization, and then an accelerator is proposed for the binary 3D CNN.
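The memory side of this motivation is easy to quantify. The sketch below (plain Python; the 144 M parameter count is the VGG-19 figure quoted above, and the 32× rate is the one quoted for binary networks later in the paper) shows why 1-bit weights compress a 32-bit model by exactly 32×:

```python
# Rough memory-footprint arithmetic for the motivation above (a sketch;
# the 144M-parameter figure for VGG-19 comes from the text).
PARAMS = 144_000_000

fp32_bytes = PARAMS * 4      # 32-bit floats: 4 bytes per weight
binary_bytes = PARAMS / 8    # 1 bit per weight when packed 8 per byte

compression = fp32_bytes / binary_bytes
print(compression)  # 32.0 -> the 32x compression rate quoted for BNNs
```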

Fig. 1  Comparison between 2D CNN and 3D CNN

So far, many methods have been proposed for addressing network redundancy, and they can be separated into two categories: designing new efficient architectures from scratch and compressing existing architectures. For the first category, many efficient CNNs have been proposed, such as MobileNet [25], KSGC [12], and Dense2Net [13]. The second category is usually based on pruning [18] and quantization [14]. Quantization is an excellent method to compress CNNs, as it does not cause large accuracy losses. Courbariaux et al. propose binarized neural networks (BNN) [3], whose weights and activations are constrained to + 1 or – 1. BNN is an extreme quantization solution, which leads to a 32× compression rate. Based on BNN, a more efficient XNOR-Net [24] is proposed, in which XNOR operations replace 32-bit floating-point operations. Although the weights and activations are constrained to + 1 or – 1, when the convolution is implemented, – 1 is replaced by 0. In this paper, it is proposed that the weights and activations can be directly constrained to + 1 or 0. The multiplication operations can then be replaced by simple bitwise AND operations. Moreover, in the complementary metal-oxide-semiconductor (CMOS) process, an AND gate only needs six transistors, whereas an XNOR gate usually needs ten, so the AND gate is more efficient than the XNOR gate. The proposed binary convolutional neural network is referred to as AND-Net. Compared with XNOR-Net, AND-Net is simpler to implement on hardware because the numbers 0 and 1 directly correspond to the logic levels 0 and 1 of digital circuits. To evaluate the performance of the proposed AND-Net, a 3D AND-Net is designed for human action recognition, in which the weights and activations are constrained to 0 or 1.

Field programmable gate arrays (FPGAs) excel at low-precision computation, and their adaptability to new algorithms lends them to supporting rapidly changing CNNs. The usage of FPGAs to accelerate CNNs has attracted significant research attention in recent years [1, 32]. Most accelerators are proposed for 2D CNNs, while very few are for 3D CNNs. Accelerating 3D CNNs is more complex due to the additional temporal dimension. In [20], the FPGA-based accelerator for 3D CNNs runs 43.9× faster than the central processing unit (CPU). Shen et al. propose a uniform architecture design for accelerating 2D and 3D CNNs on FPGAs [26], which achieves 33 frames per second (FPS). To the best of our knowledge, there is no hardware accelerator for binary 3D CNNs. We design an accelerator for the proposed binary 3D CNNs, in which O2M (one input pixel affects many output pixels) is adopted to maximize data reuse.

In this paper, CNNs are accelerated by jointly optimizing the algorithm and the FPGA. For the algorithm, a simple and hardware-friendly 3D AND-Net architecture is proposed, in which the multiplication operation is replaced by a simpler AND operation. For the hardware, an efficient accelerator for 3D AND-Net is designed, which uses O2M and distributed storage to optimize convolution. By jointly optimizing algorithm and hardware, the accelerator achieves 384 FPS at 1.604 watts (W) on the KTH dataset.

The key contributions of this paper are as follows.

• Binary weights and activations are applied to 3D CNNs, which significantly reduces computational complexity and memory cost.
• It is proposed that the XNOR operation can be replaced by the AND operation in binary CNNs, which makes BNNs more suitable for hardware implementation.
• An optimized convolution operation called O2M is introduced, so that each input pixel only needs to be read once.
• A hardware accelerator for 3D AND-Net on the FPGA platform is presented, which consumes 1.604 W and meets the requirements of many real-time applications.

This paper is organized as follows. Section 2 lists related works about 3D CNNs, BNNs and their hardware accelerators. In Sect. 3, AND-Net and the accelerator architecture are explained in detail. The experiments and results are shown in Sect. 4, and the conclusion is presented in Sect. 5.

2 Related work

In this section, we briefly review previous works that inspire the design of the 3D AND-Net and its hardware accelerator.

2.1 Three-dimensional convolutional neural network

A typical 3D CNN is mainly composed of 3D convolution, normalization, activation, 3D pooling, and fully connected


(FC) layers. The 3D convolutional layer is used to extract features, and the 3D pooling layer is used for down-sampling. The differences between 2D and 3D CNNs lie in the convolution and pooling layers. In this paper, we mainly focus on accelerating 3D convolutional layers because they contain the majority of the computational and memory complexity.

The convolutional layers are used to extract temporal and spatial features in convolutional neural networks. The 3D convolution is illustrated in Fig. 2a. The inputs of 3D convolution are cubes, each of which is constructed by stacking multiple consecutive frames. Each cube is convolved with a 3D convolutional kernel, and the convolutional result is obtained by summing all the partial sums. In 3D convolution, each feature map in a cube is connected to multiple adjacent consecutive frames, so temporal information can be extracted. The computation of 3D convolution is described as follows:

O^{(x,y,z)} = \sum_{c=1}^{C} \sum_{k_t=1}^{K_t} \sum_{k_x=1}^{K_x} \sum_{k_y=1}^{K_y} W_m^{(k_t,k_x,k_y)} * I_c^{(z+k_t,\, x+k_x,\, y+k_y)}.  (1)

In Eq. (1), x, y, and z are the spatial coordinates in an output cube. K_x, K_y, and K_t are the kernel sizes in the width, height and temporal dimensions, respectively. C is the total number of channels. From Fig. 2a and Eq. (1), it can be seen that a 3D kernel extracts spatial and temporal features and fuses the channel information. Moreover, the 3D convolution is more complex than the 2D convolution and has more computational cost.

Pooling layers can be seen as a form of dimensionality reduction that significantly reduces complexity while helping to extract features and improve accuracy. Average and max pooling are common operations in the pooling layer. The 3D pooling operation is illustrated in Fig. 2b, where the feature maps are down-sampled in the width, height, and temporal dimensions. In the 2D pooling operation, however, the feature maps are down-sampled only in the width and height dimensions, so 3D pooling is more complex than 2D pooling. The FC layer, nonlinear layer, and batch normalization of 3D CNNs are similar to those of 2D CNNs. A 3D CNN is more complex and has more parameters and computational cost than a 2D CNN, so it is significant to compress 3D CNNs.

2.2 Binary neural networks

Courbariaux et al. first propose the binary weight network (BWN), whose weights are constrained to + 1 or – 1. It leads to a 32× parameter compression rate and efficient computation because one weight only occupies one bit [2]. After that, better binary weight neural networks with higher accuracy are proposed. For example, Zhuang et al. successfully train the BWN-based Group-Net on the ImageNet dataset with an accuracy drop of only 2.6% compared with the full-precision performance [37].

The binary weight and activation network is proposed by Courbariaux et al., which leads to a 32× model compression rate [3] and achieves 42.2% TOP-1 accuracy on ImageNet. Based on BNN, XNOR-Net [24] is proposed, which uses XNOR and popcount operations to replace the floating-point operations and achieves 51.2% TOP-1 accuracy. It not only compresses the memory cost but also reduces the computational complexity. Since then, the image classification accuracy of BNNs has kept improving. PC-BNN [34] replaces the original binary convolution layer in a conventional BNN with two parallel binary convolution layers and achieves 86% accuracy on the CIFAR-10 dataset with only a 2.3 Mb parameter size. ABC-Net [19] proposes a novel binarization scheme and achieves 68.3% TOP-1 accuracy on the ImageNet dataset. ReActNet-C [21] achieves 71.4% TOP-1 accuracy using a novel ReAct operation on the ImageNet dataset.

Previous BNNs constrain weights and activations to + 1 or – 1, and most works focus on 2D CNNs. In this work,

Fig. 2  Illustration of 3D convolution and 3D pooling
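Eq. (1) can be read directly as four nested sums. A minimal NumPy sketch of that reading is given below; the stride-1, no-padding loop bounds, the array layout, and the function name are our assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def conv3d_naive(inp, w):
    """Direct reading of Eq. (1): inp has shape (C, T, H, W) and the
    kernel w has shape (C, Kt, Kx, Ky); stride 1, no padding (assumed)."""
    C, T, H, W = inp.shape
    _, Kt, Kx, Ky = w.shape
    out = np.zeros((T - Kt + 1, H - Kx + 1, W - Ky + 1))
    for z in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                # sum over channels and the three kernel dimensions,
                # exactly as the quadruple sum in Eq. (1)
                out[z, x, y] = np.sum(
                    inp[:, z:z + Kt, x:x + Kx, y:y + Ky] * w)
    return out

# Tiny sanity check: 2 channels, 4 frames of 5x5 pixels, 2x3x3 kernel
rng = np.random.default_rng(0)
a = rng.random((2, 4, 5, 5))
k = rng.random((2, 2, 3, 3))
o = conv3d_naive(a, k)
print(o.shape)  # (3, 3, 3)
```

The cube of partial sums produced this way is what the later O2M optimization reorganizes: the loops above re-read every input pixel up to Kt × Kx × Ky times.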


weights and activations are constrained to + 1 or 0, and the binary weights and activations are used in a 3D CNN.

2.3 Hardware accelerator for BNN

BNNs have a large number of XNOR operations, so a dedicated hardware accelerator can preferably accelerate them. PC-BNN [34], an FPGA-based hardware accelerator, achieves 930 frames per second and only needs on-chip memory. Umuroglu et al. propose FINN [30], a framework for building fast and flexible FPGA accelerators. On a ZC706 embedded FPGA platform, it achieves 21,906 FPS on the CIFAR-10 dataset. Zhao et al. [36] present a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog; it only needs 5.94 ms per image on CIFAR-10. Guo et al. [8] propose a fully binarized convolutional neural network accelerator (FBNA). With hardware-oriented algorithm optimizations, they construct a unified hardware design for BNNs; on CIFAR-10, it achieves 722 GOPS overall performance.

Most accelerators are designed for 2D CNNs, while very few are for 3D CNNs. In [31], the FPGA-based accelerator for 3D CNNs has computational performance 14× faster than that of the CPU. Liu et al. [20] propose a uniform architecture design for accelerating 2D and 3D CNNs on FPGA, which achieves state-of-the-art throughput performance on both 2D and 3D CNNs. In this paper, to accelerate the proposed 3D AND-Net, we propose a dedicated accelerator based on FPGA.

3 3D AND-Net and accelerator

In this section, the proposed AND-Net algorithm, O2M, and the distributed storage approach are first introduced. Then the architecture of 3D AND-Net is introduced. Finally, the FPGA-based hardware accelerator for 3D AND-Net is presented in detail.

3.1 AND-Net algorithm

Inspired by XNOR-Net [24], AND-Net is proposed, which is a more efficient and simpler model than XNOR-Net. As shown in Fig. 3, a typical block in XNOR-Net consists of batch normalization, binary activation, and binary convolution. Compared with a general CNN, XNOR-Net has a binary activation (BinActiv) operation and binary activation values. The BinActiv is implemented before the binary convolution (BinConv) operation, so that the weights and activations are constrained to + 1 or – 1. AND-Net and XNOR-Net are similar in many ways except for the values of weights and activations. Equations (2) and (3) are the activation functions of XNOR-Net and AND-Net, respectively.

Fig. 3  A block in XNOR-Net

y = \mathrm{Sign}(x) = \begin{cases} +1, & \text{if } x \ge 0 \\ -1, & \text{otherwise,} \end{cases}  (2)

y' = \mathrm{Step}(x') = \begin{cases} +1, & \text{if } x' \ge 0 \\ 0, & \text{otherwise.} \end{cases}  (3)

The weights and activations in XNOR-Net are constrained to + 1 or – 1, but those in AND-Net are constrained to + 1 or 0. The convolutional results of AND-Net are greater than or equal to 0. The normalization operation normalizes the feature maps before the binary activation, which leads to the results of binary activation not always being 1. Figure 4 is an example that illustrates the difference between XNOR-Net and AND-Net. Getting a convolutional result with XNOR-Net needs three steps, but getting a convolutional result with AND-Net only needs two steps, so AND-Net is simpler and more efficient than XNOR-Net. Furthermore, the two numbers 0 and 1 directly correspond to logic 0 and 1 of digital circuits, which saves the work of mapping – 1 to 0.

Fig. 4  Differences between XNOR-Net and AND-Net

In addition, the proposed AND-Net uses fewer memory resources due to the smaller value range. For example, we assume the convolutional kernel size to be 3 × 3. The maximum value of the convolutional result is 9 (sum of nine 1s), and the minimum value of the convolutional result is – 9


(sum of nine – 1s) for XNOR-Net. However, the maximum value of the convolutional result is 9 (sum of nine 1s), and the minimum value of the convolutional result is 0 (sum of nine 0s) for AND-Net. Therefore, the ranges of the convolutional results are [– 9, 9] and [0, 9] for XNOR-Net and AND-Net, respectively. The range of XNOR-Net is larger than that of AND-Net, but their results are discrete, and for both XNOR-Net and AND-Net the convolutional result can take ten values. The convolutional result can only be one of the ten candidate values (– 9, – 7, – 5, – 3, – 1, 1, 3, 5, 7, 9) for XNOR-Net, because the sum of nine odd numbers cannot be even. And the convolutional result is one of the ten candidate values (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) for AND-Net. As a result, the representation capacity of AND-Net is similar to that of XNOR-Net.

3.2 Optimized convolution and distributed storage

It is well known that the convolutional areas overlap when the kernel size is larger than the sliding stride. This means that the pixels of the input feature maps are used repeatedly. For this reason, the input data will be read repeatedly, which results in longer latency and more power consumption. The read count of the input data should be as low as possible for an accelerator. To this end, an approach for optimizing the convolution operation is proposed, which is called O2M. For an arbitrary kernel size (k × k), k × k output pixels are affected by one input pixel in a feature map. Previous works [22, 29] considered data reuse from the perspective of producing more output pixels; in this paper, data reuse is instead considered from the perspective of making full use of each input pixel.

To simply explain the proposed O2M approach, a small kernel size (2 × 2) is selected, but the method is suitable for convolutional kernels of any size. As illustrated in Fig. 5, a pixel of the input feature map can affect 4 pixels of the output feature map. In other words, obtaining one output pixel needs 4 input pixels. The contributions of an input pixel to its 4 output pixels can be computed first, and the intermediate results are then stored in the buffer. When the next pixel arrives, the intermediate results of the affected output pixels are taken out and added to the new intermediate results. To access the intermediate results of the 4 affected pixels at the same time, the distributed storage approach is proposed. As shown in Fig. 5, different colors represent different random access memories (RAMs), and one feature map is stored in four different RAMs. As a result, the four affected pixels can be read simultaneously and the sum operations can be completed in one cycle. O2M and distributed storage cooperate with each other: a complete and correct output pixel is obtained after the 4 input pixels that affect it have all been processed. All the input pixels are read only once due to the O2M approach, which reduces both the power consumption and the latency of reading input data.

3.3 3D AND-Net architectures

Firstly, a simple 2D AND-Net for the MNIST dataset is designed to evaluate the performance of the proposed AND-Net; it is presented in Table 1. The 2D AND-Net is very simple because of the simplicity of MNIST. It contains two convolutional layers, two pooling layers and a fully connected layer. Every convolutional layer is followed by a pooling layer for dimensionality reduction. The Softmax function is used for classification, and the ReLU function is used as the non-linearity.

Table 1  Architecture of 2D AND-Net for MNIST

Layer  Output   Operation   Kernel size
1      28 × 28  Conv        3 × 3 × 10
2      14 × 14  Pooling     2 × 2
3      14 × 14  Conv        3 × 3 × 20
4      7 × 7    Pooling     2 × 2
5      10       FC-Softmax  –

Then a 3D AND-Net for the KTH dataset is proposed, and its architecture is presented in Table 2. The 3D AND-Net consists of six convolutional layers and a fully connected layer. The spatial resolution of the input video is 64 × 64 pixels. The 1st, 3rd, and 5th convolutional layers

Table 2  Architecture of 3D AND-Net for KTH

Layer  Kernel size  Output channels  Max pooling
1      2 × 2 × 2    4 × 4            Yes
2      2 × 2 × 2    4 × 4            No
3      2 × 2 × 2    5 × 5            Yes
4      2 × 2 × 2    5 × 5            No
5      2 × 2 × 2    6 × 6            Yes
6      2 × 2 × 2    6 × 6            No
7      FC-Softmax

Fig. 5  Illustration of distributed storage for one feature map
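The two ideas behind Fig. 5 — O2M scattering and parity-banked distributed storage — can be prototyped and checked in a few lines of Python. This is a 2D sketch with a 2 × 2 kernel; all function names, the float-valued data, and the particular parity bank mapping are our illustrative assumptions, not the paper's RTL:

```python
import numpy as np

def conv_gather(inp, w):
    """Conventional 'gather' convolution: each output reads k*k inputs."""
    k = w.shape[0]
    H, W = inp.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(inp[x:x + k, y:y + k] * w)
    return out

def conv_o2m(inp, w):
    """O2M: each input pixel is read once; its partial products are
    scattered into the (up to) k*k output pixels it affects."""
    k = w.shape[0]
    H, W = inp.shape
    out = np.zeros((H - k + 1, W - k + 1))      # intermediate-result buffer
    reads = 0
    for i in range(H):
        for j in range(W):
            pix, reads = inp[i, j], reads + 1   # the single read of (i, j)
            for di in range(k):
                for dj in range(k):
                    x, y = i - di, j - dj       # affected output position
                    if 0 <= x <= H - k and 0 <= y <= W - k:
                        out[x, y] += pix * w[di, dj]
    assert reads == H * W                       # every pixel read exactly once
    return out

# Distributed storage: bank intermediate results by coordinate parity, so
# the 4 outputs touched by one input pixel always sit in 4 different RAMs.
def bank(x, y):
    return (x % 2) * 2 + (y % 2)

rng = np.random.default_rng(0)
img, ker = rng.random((6, 6)), rng.random((2, 2))
assert np.allclose(conv_gather(img, ker), conv_o2m(img, ker))
assert {bank(2, 3), bank(2, 4), bank(3, 3), bank(3, 4)} == {0, 1, 2, 3}
print("O2M matches gather convolution; no RAM bank conflicts")
```

The final two assertions capture the section's two claims: the scattered partial sums reproduce the standard convolution exactly, and any 2 × 2 window of intermediate results lands in four distinct banks, so the four accumulations can complete in one cycle.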


are followed by max-pooling layers for down-sampling with stride 2. One block in 3D AND-Net consists of shuffle, normalization, binary activation, binary group convolution, and a shortcut. Group convolution is used to reduce parameters and computational cost. The shuffle operation is used to improve the information transfer of the group convolution between different groups. Before the binary convolution, the feature maps need to be binarily activated; afterward, the binary weights and activations can implement the AND operation. In a BNN, the information loss is serious, and the group convolution also introduces information loss. For this reason, the shortcut is introduced.

The shuffle operation and group convolution in 3D AND-Net are illustrated in Fig. 6. In the proposed 3D AND-Net, the number of channels is n × n, so the feature maps are divided into n groups where each group contains n channels. The shuffle operation is described as follows:

O_{(j,m)} = I_{(m,j)}.  (4)

In Eq. (4), I and O are the input feature maps and output feature maps, respectively, j is the serial number of the group, and m is the serial number within the jth group. For example, as shown in Fig. 6, I(1,3) (the third feature cube in group 1) changes to O(3,1) (the first feature cube in group 3) after the shuffle operation. Each group contains features from all groups after the shuffle operation, and each group is then convolved independently. The shuffle operation reduces the disadvantage of group convolution, as it is conducive to information exchange between different groups.

Fig. 6  Illustrations of shuffle operation and group convolution in 3D AND-Net

3.4 Accelerator for 3D AND-Net

An accelerator is first designed for the simple 2D AND-Net; then an accelerator for the proposed 3D AND-Net is presented. The 2D AND-Net is an efficient and simple network, which has few parameters and little computational cost. Thus the hardware accelerator for the 2D AND-Net only uses on-chip memory, which leads to higher energy efficiency and shorter inference time. The accelerator receives the configuration information and commands from the CPU via the advanced extensible interface (AXI4). The direct memory access (DMA) interface is designed for transferring the off-chip memory data. A finite state machine (FSM) controls and records the working status of the hardware accelerator. There are two buffers to store the feature maps in the convolution process. As shown in Fig. 7, the two buffers (buffer0 and buffer1) store input and output feature maps in turn. For instance, when the input feature maps are stored in buffer0, the output feature maps are stored in buffer1. Next, the feature maps in buffer1 are the input data, and the output feature maps will be stored in buffer0. The results of all convolutional layers are stored in on-chip memory, which reduces off-chip memory accesses for low power. The computing unit is reused by the different convolution layers and adopts the proposed O2M approach. The parallelism of the computing unit is 40.

The 3D AND-Net is larger and more complex than the 2D AND-Net, such that the on-chip memory is not enough to store the intermediate results of convolution. For our 3D AND-Net, the output feature map size of the first convolutional layer is 64 × 64 × 16 × 16, and one pixel needs 1 Byte of memory, so the output feature maps of the first layer need about 1 MB of memory. If we only used on-chip memory, a ping-pong buffer would be needed and the on-chip memory for feature maps would have to be larger than 2 MB. The weights occupy little memory, which can be ignored. The PYNQ-Z2 (Z-7020) has 0.6 MB of on-chip memory (BRAM), which is not enough to store the intermediate results of convolution for our proposed AND-Net. The external memory DRAM is used to store the

Fig. 7  Illustration of double buffers in 2D AND-Net accelerator
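The buffer0/buffer1 alternation of Fig. 7 can be modelled in a few lines. This is a toy sketch only: the layer function and the string-valued "feature maps" are placeholders standing in for real on-chip buffers:

```python
# Toy model of the double-buffer scheme in Fig. 7: two on-chip buffers
# alternate between holding the input and the output of each layer.
buffers = [["img"], []]    # buffer0 starts with the input feature maps
src = 0                    # index of the buffer currently holding inputs

def run_layer(layer_id, feature_maps):
    """Placeholder for one convolutional layer of the 2D AND-Net."""
    return [f"L{layer_id}({fm})" for fm in feature_maps]

for layer_id in (1, 2):
    dst = 1 - src                      # the other buffer receives outputs
    buffers[dst] = run_layer(layer_id, buffers[src])
    src = dst                          # roles swap for the next layer

print(buffers[src])  # ['L2(L1(img))']: results never left on-chip memory
```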


intermediate results of convolution. Furthermore, our accelerator is programmable: it not only supports the network in this paper but can also support multiple 3D AND-Net networks by configuring the accelerator. For more complex tasks, a wider and deeper 3D AND-Net will be needed for good accuracy. When running such a wider and deeper network on our accelerator, the on-chip memory is not enough to store the intermediate results of convolution, so external DRAM is necessary. Therefore, the 3D AND-Net accelerator uses external DRAM rather than on-chip memory to store the intermediate results of convolution. Generally, external memory is much larger than on-chip memory, and off-chip memory is much cheaper than on-chip memory. In addition to storing the intermediate results of convolution, the DRAM is also used for running the PYNQ system. The off-chip memory for running the PYNQ system is much larger than that for storing intermediate results: 1 MB of off-chip memory is used for the intermediate results when the 3D AND-Net runs on the accelerator.

Double buffers are not used in the 3D AND-Net accelerator, so its workflow differs from that of the 2D AND-Net accelerator. The workflow of the 3D AND-Net accelerator is illustrated in Fig. 8. The input pixels are transferred from the external DRAM to the computing unit of the accelerator and are immediately computed. Each input pixel is used only once to obtain the intermediate results of its affected pixels, due to the O2M approach and distributed storage. The used input pixel can be discarded without storing it because it will not be used again. Therefore, the accelerator only contains a buffer for storing intermediate results and does not contain a dedicated buffer for input pixels. After the tiled feature maps are computed, the intermediate results in the OUT Buffer are transferred to the external DRAM over the AXI4 bus. Ping-pong buffers are not used in the OUT Buffer, to reduce on-chip memory resource consumption. The workflows are the same for all phases.

As illustrated in Fig. 9, the whole system for the proposed 3D AND-Net is mapped onto one PYNQ-Z2 chip. The 3D AND-Net is implemented on the Programmable Logic (PL), and the Processing System (PS) is used to reconfigure the 3D AND-Net through the AXI4. Before starting the accelerator, the hyperparameters of the whole network are set by the PS. In this way, our accelerator can implement various layers with different parameters. The external memory DRAM is used to store the intermediate results of convolution. The MAXI4 (master advanced extensible interface) is used to access the off-chip DRAM, and the SAXI4 (slave advanced extensible interface) communicates with the CPU. AXI4MW and AXI4MR are used to control which module can access (write or read) the DRAM through MAXI4.

Fig. 9  Overview of 3D AND-Net accelerator architecture

The Conv module (green area in Fig. 9) implements the convolutional layer. IFCube reads and shuffles the input feature maps, and then NormBA implements normalization and binary activation. NormBA consumes digital signal processor (DSP) resources to implement the multiplication for normalization. The binary 3D convolution is implemented by InCrement and OFCube according to the optimized convolution and distributed storage of Sect. 3.2. The computing unit of InCrement computes the contributions of an input pixel to 8 output pixels (the kernel size is 2 × 2 × 2), as shown in Fig. 10. All AND gates share the same input pixel and use different weights for computing the 8 contributions to the 8 output

Fig. 8  Illustration of workflow for 3D AND-Net accelerator
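A software sketch of the InCrement computing unit described above: one binary input pixel is AND-ed with the eight weights of a 2 × 2 × 2 kernel, producing its eight partial contributions at once. The function and variable names are ours, not the paper's RTL:

```python
from itertools import product

def increment_unit(pixel, kernel):
    """Contributions of one binary input pixel to its 8 output pixels.
    `pixel` is 0/1; `kernel` maps (dt, dx, dy) in {0,1}^3 to a 0/1 weight.
    In hardware, all eight ANDs evaluate in the same cycle."""
    return {pos: pixel & kernel[pos] for pos in product((0, 1), repeat=3)}

# An example 2x2x2 binary kernel: weight 1 at positions with odd index sum
kernel = {pos: sum(pos) % 2 for pos in product((0, 1), repeat=3)}

# A '1' pixel passes the kernel straight through (AND acts as multiply);
# a '0' pixel contributes nothing to any of its 8 outputs.
ones = increment_unit(1, kernel)
zeros = increment_unit(0, kernel)
print(sum(ones.values()), sum(zeros.values()))  # 4 0
```

Because the weights and activations are both in {0, 1}, the AND here is exactly the multiplication of a binary convolution, which is why InCrement needs LUTs but no DSPs.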


pixels at the same time. The proposed accelerator supports not only the 3D AND-Net presented in this paper but also other 3D AND-Nets. Likewise, the parallelism of our accelerator is designed not only for the 3D AND-Net presented in this manuscript but also for other, wider and deeper 3D AND-Nets. The accelerator supports up to 14 output feature maps of one cube being computed in parallel. For our 3D convolution, one input pixel can affect 8 pixels of the output feature maps, and these 8 output pixels are computed in parallel by using the proposed O2M approach. Therefore, the parallelism of our accelerator is 112. InCrement does not consume DSP resources because the AND operation is implemented by lookup tables (LUTs). OFCube sums the intermediate results and obtains the complete and correct output pixels. After the complete convolutional results are obtained, Pool2DDR implements the pooling operation and writes the results to the external DRAM through MAXI4. In the convolution operation, loop tiling is used to save the limited computing resources.

Fig. 10  Illustration of the computing unit for 3D AND-Net accelerator

4 Experimental results

In this section, the experimental environment is introduced, and the results and analysis are presented.

4.1 Experimental environment

The MATLAB R2015b framework is used to train the 3D AND-Net for KTH. The KTH dataset contains 2391 videos of six types of human actions; 80% of the videos are used for training and 20% for testing. The Verilog hardware description language is used to describe the accelerator architecture for the 3D AND-Net. The execution time is reported by the Verilog compiled simulator (VCS), and the Vivado Design Suite is used to map the design to the PYNQ-Z2 FPGA system. The resource utilization and power consumption are reported by the Vivado Design Suite after the implementation stage.

4.2 Results and analysis

For 2D AND-Net, the AND-Net is compared with the XNOR-Net. They have the same architecture except for the values of weights and activations. The results show that AND-Net and XNOR-Net have similar accuracy (98.91% and 98.98%) on the MNIST dataset, which indicates that constraining the weights and activations to + 1 or 0 is feasible.

The implementation results of the AND-Net hardware accelerator based on FPGA are presented in Table 3. It is shown that the accelerator employs few hardware resources because the AND operations replace the multiplication operations. Another great advantage of the accelerator is its extremely high recognition speed. All operations for one spatial pixel position across all input channels can be completed in one clock cycle due to the optimized convolution approach and the distributed storage approach. Therefore, the accelerator can process 1080 images per second at a 100 MHz working frequency, which is almost 15× faster than the CPU (Intel Core i5-2500K at 3.2 GHz). Compared with FINN and PC-BNN, our accelerator has lower power consumption and higher speed; its FPS/W is about 1.7× higher than that of FINN and PC-BNN.

Table 3  Comparison with other BNN accelerators on FPGA

           FINN [30]  PC-BNN [34]  Ours
LUTs       25,794     23,436       10,429
BRAM       270        135          170
DSP        26         53           20
Power (W)  2.5        2.4          2.2
FPS        620        930          1080
FPS/W      248        287.5        490.9

The accuracy of our proposed 3D AND-Net on the KTH dataset is presented in Table 4, where several networks are selected for comparison. As shown in Table 4, compared with the 32-bit floating-point networks, our proposed 3D AND-Net only loses a little accuracy but eliminates a great number of multiplications and memory costs. The multiplication needs to be implemented by DSPs, while the bit-wise AND operation only consumes very few LUT resources; in FPGAs, LUT resources are far more plentiful than DSP resources. Therefore, AND-Net is more efficient than the 32-bit floating-point networks. There are two quantization methods. The first one is firstly training a


Table 4  Accuracy of 3D AND-Net and other networks on KTH

Networks            Accuracy (%)   Bits
3D AND-Net          89.2           1
3D XNOR-Net         89.3           1
3D CNN [10]         90.2           32
[11]                90.34          32
S-Net-level2 [4]    90.1           32
[15]                91.6           32

There are two quantization methods. The first is training a floating-point network and then quantizing it with binary weights without fine-tuning, which leads to a large accuracy loss. The second is performing the quantization during the training process, in which case the accuracy loss is small. In most BNN works [3, 19, 21, 24, 34], the quantization is done during training for high accuracy. Therefore, the quantization is done during training in this paper, and high accuracy is achieved. Furthermore, 3D AND-Net and 3D XNOR-Net have similar accuracy, which also indicates that AND-Net is effective.

Generally, deeper networks achieve higher accuracy than shallow networks but bring more computational cost [9]. Therefore, it is reasonable to expect even higher accuracy from a deeper and wider 3D AND-Net architecture. The results indicate that binary weights and activations are feasible for 3D CNNs. Furthermore, hardware optimization should not introduce accuracy loss when a quantized network [17, 30, 34, 36] is implemented by a hardware accelerator. Our hardware accelerator avoids data overflow, approximate calculations, and other operations that affect accuracy. The accuracy of the 3D AND-Net implemented on the proposed hardware accelerator was tested, and it remains 89.2%. This result shows that the proposed hardware optimizations do not introduce accuracy loss.

The performance and resource utilization of our proposed 3D AND-Net accelerator are presented in Table 5. Since we cannot find another accelerator for a 3D BNN, some accelerators for 2D BNNs and previous 3D CNN accelerators with fixed-point weights and activations are also listed in Table 5. The proposed 3D AND-Net accelerator runs at 100 MHz. The multiplication operation is replaced by the AND operation in 3D AND-Net, which makes its DSP efficiency about 20× higher than that of the other 3D CNN accelerators. Only 14 DSPs are used in the 3D AND-Net accelerator, and they implement the normalization. The 3D AND-Net accelerator uses the most LUT resources because the binary convolution is implemented by LUTs. Most Block RAMs are used for feature map buffers to achieve high utilization of the memory bandwidth. The total power of the proposed accelerator is only 1.6 W, of which the PL consumes 0.35 W, an order of magnitude lower than the other accelerators for 3D CNNs. The proposed accelerator can process 24 videos per second; each video consists of 16 frames, so the accelerator processes 384 frames per second, which meets real-time requirements. The speed of our accelerator is significantly higher than that of the other 3D CNN accelerators because of the simple 3D AND-Net architecture, the binary convolution, and the O2M approach. Our accelerator achieves 240 frames per second per watt (FPS/W), which is far more efficient than the other 3D CNN accelerators.

Throughput, DSP efficiency, power efficiency, and LUT efficiency are also used as efficiency metrics. DSP efficiency is GOPS per DSP (GOPS/DSP), power efficiency is GOPS per watt (GOPS/W), and LUT efficiency is GOPS per kilo-LUT (GOPS/KLUT). The throughput of our accelerator is the lowest because of its minimal resource consumption and low frequency. However, our 3D AND-Net accelerator is more efficient in terms of power and DSP.
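The derived columns in Table 5 are plain ratios of the raw entries, so they are easy to sanity-check; the sketch below recomputes them for our accelerator and for F-C3D [5], with all input values taken from Table 5.

```python
# Recompute the derived efficiency metrics of Table 5 from its raw entries.
# GOPS/DSP, GOPS/W, GOPS/KLUT and FPS/W are simple ratios.

def efficiency(gops, fps, dsps=None, watts=None, luts=None):
    m = {}
    if dsps:
        m["GOPS/DSP"] = gops / dsps
    if watts:
        m["GOPS/W"] = gops / watts
        m["FPS/W"] = fps / watts
    if luts:
        m["GOPS/KLUT"] = gops / (luts / 1000)
    return m

# Our 3D AND-Net accelerator (total-power figures from Table 5).
ours = efficiency(gops=22.4, fps=384, dsps=14, watts=1.6, luts=21727)
print(round(ours["GOPS/DSP"], 1))   # 1.6
print(round(ours["GOPS/W"]))        # 14
print(round(ours["FPS/W"]))         # 240
print(round(ours["GOPS/KLUT"], 2))  # 1.03

# F-C3D [5] for comparison.
f_c3d = efficiency(gops=142, fps=3, dsps=810, watts=9.7, luts=90281)
print(round(f_c3d["GOPS/DSP"], 3))  # 0.175
```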

Table 5  Comparison with other 3D CNN accelerators on FPGA

                   F-C3D [5]        [26]      [20]        BNN [16]   FBNA [8]   Ours
FPGA               ZC706            VC709     XC7VX690T   Virtex 7   ZC702      PYNQ-Z2
Precision          16-bit fixed     fixed     fixed       1 bit      1 bit      1 bit
CNN model          C3D              C3D       C3D         BNN        BNN        3D AND-Net
Frequency (MHz)    172              150       120         90         –          100
LUTs               90,281           242,000   272,734     34,126     29,600     21,727
Block RAM          472              1508      391         1007       103        140
DSP                810              1536      3595        1096       –          14
Power (W)          9.7 (PL, 8.4)    25        15.8        8.2        3.3        1.6 (PL, 0.35)
GOPS               142              430.7     667.7       7663       722        22.4
GOPS/DSP           0.175            0.280     0.186       7.6        –          1.6
GOPS/W             14.6 (PL, 16.9)  17.2      42.2        935        219        14 (PL, 64)
GOPS/KLUT          1.57             1.78      2.45        22.4       24         1.03
FPS                3                33        8.7         6218       520        384
FPS/W              0.31             1.32      0.55        758        157        240
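As discussed above, accuracy is preserved by binarizing during training rather than afterwards. A common mechanism for making a hard 0/1 threshold trainable, shown here purely as an illustrative assumption and not as the authors' exact MATLAB implementation, is the straight-through estimator: the forward pass uses the binarized weight, while the gradient updates a real-valued latent weight as if the thresholding step were the identity.

```python
# Illustrative straight-through-estimator (STE) update for in-training 0/1
# binarization. This is a generic BNN training trick, not the authors' exact
# code: a real-valued "latent" weight is kept for the optimizer, while the
# forward pass uses its binarized {0, 1} version.

def binarize(w):
    # forward pass: threshold the latent weight to {0, 1}
    return 1 if w >= 0.5 else 0

def ste_update(w_real, grad, lr=0.1):
    # backward pass: the gradient w.r.t. the binary weight is applied to the
    # latent real weight as-is (the threshold is treated as identity), then
    # the latent weight is clipped to keep it in a trainable range.
    w_real = w_real - lr * grad
    return min(1.0, max(0.0, w_real))

w = 0.45                      # latent weight, currently binarizes to 0
print(binarize(w))            # 0
w = ste_update(w, grad=-1.0)  # loss decreases if this weight grows
print(round(w, 2))            # 0.55
print(binarize(w))            # 1 -> the binary weight has flipped
```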


The power efficiencies of all the 3D CNN accelerators are not high, because 3D CNNs are more complicated than 2D CNNs. The power efficiency of our PL is about 3.7× higher than that of F-C3D [5] and [26]. This improvement in power efficiency mainly comes from the simple binary convolution and the efficient O2M approach. The LUT efficiency of our proposed accelerator is a little lower than that of the others, since many LUT resources process AND operations. In summary, compared with the other accelerators, the proposed accelerator and 3D AND-Net achieve similar accuracy but are faster and more efficient for the same task. Moreover, the 2D BNN accelerator [16] is more efficient than our 3D AND-Net accelerator because of its larger FPGA resources and the simpler 2D CNN: large memory resources reduce accesses to DRAM, and a simpler 2D CNN helps to improve computational efficiency and reduce power consumption. FBNA [8] and our 3D CNN accelerator use the same FPGA platform, and the performance of FBNA is higher than that of the proposed accelerator in terms of LUT and power. The reason is that a 2D BNN is simpler than a 3D BNN; the accelerator design for a 3D CNN is more challenging than that for a 2D BNN.

In summary, through algorithm and hardware co-design, the proposed 3D AND-Net and its accelerator achieve high performance for human action recognition. In terms of the algorithm, an efficient and hardware-friendly 3D CNN is designed. In terms of the hardware, several optimizations are adopted for the proposed 3D AND-Net.

5 Conclusion

A 3D CNN is more complex and has a higher computational cost than a 2D CNN, so it is urgent to compress 3D CNNs. In this paper, a simple and efficient binary convolution algorithm is proposed for 3D CNNs, in which the weights and activations are constrained to 0 or 1 rather than –1 or +1. The proposed 3D AND-Net achieves 89.2% accuracy on the KTH dataset, which indicates that AND-Net is suitable for 3D CNNs. AND-Net is also more appropriate for digital hardware design, because the numbers 0 and 1 directly correspond to the digital circuit logic levels 0 and 1. In addition, an efficient convolution operation, O2M, and a distributed storage approach are proposed, which significantly reduce the time spent reading input pixels. An accelerator for 3D AND-Net is presented based on these improvements. The results demonstrate the excellence of our design in resource utilization, power, and DSP efficiency. It should be mentioned that the joint optimization of algorithm and hardware obtains better performance than independent optimization.

In the future, we will apply 3D AND-Net to other, more complicated datasets, such as UCF101, to evaluate it further.

Acknowledgements  This research work was partly supported by the Natural Science Foundation of Jiangsu Province (Project No. BK20201145).

References

1. Arredondo-Velazquez, M., Diaz-Carmona, J., Torres-Huitzil, C., Padilla-Medina, A., Prado-Olivarez, J.: A streaming architecture for convolutional neural networks based on layer operations chaining. J. Real-Time Image Process. 17(5), 1–19 (2020)
2. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 3123–3131 (2015)
3. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or –1. arXiv preprint arXiv:1602.02830 (2016)
4. Cui, Y., Shi, Y., Sun, X., Yin, W.: S-Net: a lightweight convolutional neural network for N-dimensional signals. In: IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–4 (2018)
5. Fan, H., Niu, X., Liu, Q., Luk, W.: F-C3D: FPGA-based 3-dimensional convolutional neural network. In: IEEE International Conference on Multimedia and Expo Workshops, pp. 1–4 (2017)
6. Gagliardi, A., de Gioia, F., Saponara, S.: A real-time video smoke detection algorithm based on Kalman filter and CNN. J. Real-Time Image Process. 1–11 (2021)
7. Gao, S., Cheng, M., Zhao, K., Zhang, X., Yang, M., Torr, P.H.S.: Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43(2), 652–662 (2021)
8. Guo, P., Ma, H., Chen, R., Li, P., Xie, S., Wang, D.: FBNA: a fully binarized neural network accelerator. In: 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 51–54 (2018)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2012)
11. Latah, M.: Human action recognition using support vector machines and 3D convolutional neural networks. Int. J. Adv. Intell. Inform. 3(1), 47–55 (2017)
12. Li, G., Zhang, M., Duan, B., Zhang, Q., Tong, G.: Kernel sharing in the channel dimension to improve parameters efficiency. In: 2019 International Conference on Computing, Electronics & Communications Engineering (iCCECE), pp. 78–82 (2019)
13. Li, G., Zhang, M., Li, J., Lv, F., Tong, G.: Efficient densely connected convolutional neural networks. Pattern Recognit. 109, 107610 (2021)
14. Li, J., Long, X., Hu, S., Hu, Y., Gu, Q., Xu, D.: A novel hardware-oriented ultra-high-speed object detection algorithm based on convolutional neural network. J. Real-Time Image Process. 17(5), 1703–1714 (2020)
15. Li, J., Wang, T., Zhou, Y., Wang, Z., Snoussi, H.: Using Gabor filter in 3D convolutional neural networks for human action recognition. In: Chinese Control Conference (CCC), pp. 11139–11144 (2017). https://doi.org/10.23919/ChiCC.2017.8029134
16. Li, Y., Liu, Z., Xu, K., Yu, H., Ren, F.: A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. 14(2), 18 (2018)
17. Liang, S., Yin, S., Liu, L., Luk, W., Wei, S.: FP-BNN: binarized neural network on FPGA. Neurocomputing 275, 1072–1086 (2018)


18. Lin, S., Ji, R., Li, Y., Deng, C., Li, X.: Toward compact ConvNets via structure-sparsity regularized filter pruning. IEEE Trans. Neural Networks Learn. Syst. 31(2), 574–588 (2020)
19. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 345–353 (2017)
20. Liu, Z., Chow, P., Xu, J., Jiang, J., Dou, Y., Zhou, J.: A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs. Electronics 8(1), 65 (2019)
21. Liu, Z., Shen, Z., Savvides, M., Cheng, K.: ReActNet: towards precise binary neural network with generalized activation functions. In: European Conference on Computer Vision (ECCV), pp. 143–159 (2020)
22. Ma, Y., Cao, Y., Vrudhula, S., Seo, J.: Optimizing the convolution operation to accelerate deep neural networks on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(7), 1354–1367 (2018)
23. Meng, B., Wang, L., He, Z., Jeon, G., Dou, Q., Yang, X.: Gradient information distillation network for real-time single-image super-resolution. J. Real-Time Image Process. 18(2), 333–344 (2021)
24. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision (ECCV), pp. 525–542 (2016)
25. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: MobileNetV2: inverted residuals and linear bottlenecks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)
26. Shen, J., Huang, Y., Wang, Z., Qiao, Y., Wen, M., Zhang, C.: Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 97–106 (2018)
27. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR) (2015)
28. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497 (2015)
29. Tu, F., Yin, S., Ouyang, P., Tang, S., Liu, L., Wei, S.: Deep convolutional neural network architecture with reconfigurable computation patterns. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25(8), 2220–2233 (2017)
30. Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M., Vissers, K.: FINN: a framework for fast, scalable binarized neural network inference. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 65–74 (2017)
31. Wang, H., Shao, M., Liu, Y., Zhao, W.: Enhanced efficiency 3D convolution based on optimal FPGA accelerator. IEEE Access 5, 6909–6916 (2017)
32. Xu, K., Wang, X., Liu, X., Cao, C., Li, H., Peng, H., Wang, D.: A dedicated hardware accelerator for real-time acceleration of YOLOv2. J. Real-Time Image Process. 18(3), 1–12 (2020)
33. Yang, H., Yuan, C., Li, B., Du, Y., Xing, J., Hu, W., Maybank, S.J.: Asymmetric 3D convolutional neural networks for action recognition. Pattern Recognit. 85, 1–12 (2019)
34. Yang, L., He, Z., Fan, D.: A fully onchip binarized convolutional neural network FPGA implementation with accurate inference. In: International Symposium on Low Power Electronics and Design (ISLPED), p. 50 (2018)
35. Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., Yu, B.: Recent advances in convolutional neural network acceleration. Neurocomputing 323, 37–51 (2019)
36. Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.H., Srivastava, M., Gupta, R., Zhang, Z.: Accelerating binarized convolutional neural networks with software-programmable FPGAs. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 15–24 (2017)
37. Zhuang, B., Shen, C., Reid, I.: Training compact neural networks with binary weights and low precision activations. arXiv preprint arXiv:1808.02631 (2018)

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Guoqing Li  received the B.S. degree from Qingdao University, Qingdao, China, in 2014, and the M.S. degree from South China Normal University, Guangzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the National ASIC Engineering Technology Research Center, School of Electronics Science and Engineering, Southeast University, Nanjing, China. His current research interests include computer vision and deep learning hardware accelerators.

Meng Zhang  received the B.S. degree in Electrical Engineering from the China University of Mining and Technology in 1986, the M.S. degree in Bioelectronics from Southeast University, China, in 1993, and the Ph.D. degree in Microelectronic Engineering from Southeast University as an on-the-job postgraduate student. He is currently a professor at the National ASIC System Research Center, College of Electronic Science and Engineering, Southeast University, Nanjing, PR China, and a faculty adviser of Ph.D. students. His research interests include deep learning, machine learning, digital signal processing, digital communication systems, wireless sensor networks, digital integrated circuit design, and information security and assurance. He is an author or coauthor of more than 40 refereed journal and international conference papers and a holder of more than 90 patents, including several PCT and US patents.

Qianru Zhang  is a Ph.D. student at the National ASIC Center in the School of Electronic Science and Engineering, Southeast University, China. She obtained her M.S. in Electrical and Computer Engineering from the University of California, Irvine, in 2016. Her research interests include digital signal processing, big data analysis, and deep learning techniques.

Zhijian Lin  received the B.S. degree from the School of Optics and Electronic Science and Technology, China Jiliang University, Hangzhou, China, in 2018. He is currently pursuing the M.S. degree in the School of Microelectronics, Southeast University, China. His current research interests include accelerating deep neural networks based on FPGA.
