https://doi.org/10.1007/s11554-021-01161-4
Abstract
Three-dimensional convolutional neural networks have a large number of parameters and a high computational cost, so compressing them is an urgent problem. In this paper, an efficient and simple binary three-dimensional convolutional neural network architecture is proposed, in which the weights and activations are constrained to 0 or 1 instead of the common +1 or −1. Binary weights and activations are applied to three-dimensional convolutional neural networks for the first time. The proposed binary three-dimensional convolutional neural network has lower computational complexity and memory consumption than standard convolution, and it is more appropriate for digital hardware design. Furthermore, an optimized convolution operation is proposed, in which each input pixel needs to be read only once. A distributed storage approach is proposed to support the proposed convolution operation. With the proposed methods, a hardware accelerator for the binary three-dimensional convolutional neural network is designed on a field-programmable gate array platform. The experimental results show that the presented accelerator is excellent in terms of computational resources and power efficiency. By jointly optimizing the algorithm and hardware, the accelerator achieves 89.2% accuracy and 384 frames per second on the KTH dataset.
Keywords Binary convolutional neural network · Hardware accelerator · Three-dimensional convolution · Action recognition · FPGA
A 3D CNN generally consists of 3D convolutional layers, 3D pooling layers, and fully connected (FC) layers. The 3D convolutional layer is used to extract features, and the 3D pooling layer is used for down-sampling. The differences between 2D and 3D CNNs are the convolution and pooling layers. In this paper, we mainly focus on accelerating 3D convolutional layers, because 3D convolutional layers contain the majority of the complexity in the form of computation and memory.

The convolutional layers are used to extract temporal and spatial features in convolutional neural networks. The 3D convolution is illustrated in Fig. 2a. The inputs of 3D convolution are cubes, each of which is constructed by stacking multiple consecutive frames. Each cube is convolved with a 3D convolutional kernel, and the convolutional result is obtained by summing all partial sums. In 3D convolution, each feature map in a cube is connected to multiple adjacent consecutive frames, so temporal information can be extracted. The computation of 3D convolution is described as follows:

O^{(x,y,z)} = \sum_{c=1}^{C} \sum_{k_t=1}^{K_t} \sum_{k_x=1}^{K_x} \sum_{k_y=1}^{K_y} W_m^{(k_t, k_x, k_y)} \ast I_c^{(z+k_t,\, x+k_x,\, y+k_y)} .   (1)

In Eq. (1), x, y, and z are the spatial coordinates in an output cube. Kx, Ky, and Kt are the kernel sizes in the width, height, and temporal dimensions, respectively. C is the total number of channels. From Fig. 2a and Eq. (1), it can be seen that a 3D kernel can extract spatial and temporal features and fuse the channel information. Moreover, 3D convolution is more complex than 2D convolution and has a higher computational cost.

Pooling layers can be seen as a form of dimensionality reduction that can significantly reduce complexity while better extracting features and improving accuracy. Average and max-pooling are common operations in the pooling layer. The 3D pooling operation is illustrated in Fig. 2b, where the feature maps are down-sampled in the width, height, and temporal dimensions. In the 2D pooling operation, however, the feature maps are down-sampled only in the width and height dimensions. The 3D pooling is more complex than the 2D pooling. The FC layer, nonlinear layer, and batch normalization of 3D CNNs are similar to those of 2D CNNs. A 3D CNN is more complex and has more parameters and computational cost than a 2D CNN, so it is significant to compress 3D CNNs.

2.2 Binary neural networks

Courbariaux et al. first propose the binary weight neural network (BWN), whose weights are constrained to +1 or −1. It leads to a 32× parameter compression rate and efficient computation because one weight occupies only one bit [2]. After that, better binary weight neural networks with higher accuracy are proposed. For example, Zhuang et al. successfully train the BWN-based Group-Net on the ImageNet dataset with an accuracy drop of only 2.6% compared with the full-precision performance [37].

The binary weight and activation network (BNN) is proposed by Courbariaux et al.; it leads to a 32× model compression rate [3] and achieves 42.2% TOP-1 accuracy on ImageNet. Based on BNN, XNOR-Net [24] is proposed, which uses XNOR and popcount operations to replace the floating-point operations and achieves 51.2% TOP-1 accuracy. It not only compresses the memory cost but also reduces the computational complexity. Since then, the image classification accuracy of BNNs has kept improving. PC-BNN [34] replaces the original binary convolution layer in a conventional BNN with two parallel binary convolution layers and achieves 86% accuracy on the CIFAR-10 dataset with only a 2.3 Mb parameter size. ABC-Net [19] proposes a novel binarization scheme and achieves 68.3% TOP-1 accuracy on the ImageNet dataset. ReActNet-C [21] achieves 71.4% TOP-1 accuracy on the ImageNet dataset using a novel ReAct operation.

Fig. 3 A block in XNOR-Net

Previous BNNs constrain weights and activations to +1 or −1, and most works focus on 2D CNNs. In this work, weights and activations are constrained to +1 or 0, and the binary weights and activations are used in a 3D CNN.
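To make Eq. (1) and the 0/1 binarization concrete, the following NumPy sketch computes one output cube of a binary 3D convolution with a nested loop; when weights and activations are 0/1, the per-element multiplication degenerates into a logical AND followed by a popcount-style sum. The function name, tensor shapes, and random data below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def binary_conv3d(inputs, weights):
    """Naive binary 3D convolution following Eq. (1).

    inputs:  (C, T, X, Y) feature cube with values in {0, 1}
    weights: (C, Kt, Kx, Ky) one binary 3D kernel with values in {0, 1}
    returns: (T - Kt + 1, X - Kx + 1, Y - Ky + 1) integer output cube
    """
    C, T, X, Y = inputs.shape
    _, Kt, Kx, Ky = weights.shape
    out = np.zeros((T - Kt + 1, X - Kx + 1, Y - Ky + 1), dtype=np.int32)
    for z in range(out.shape[0]):          # temporal position
        for x in range(out.shape[1]):      # first spatial position
            for y in range(out.shape[2]):  # second spatial position
                window = inputs[:, z:z + Kt, x:x + Kx, y:y + Ky]
                # With 0/1 values, multiplication reduces to a logical AND,
                # and the accumulation is a popcount of the AND result.
                out[z, x, y] = np.sum(np.logical_and(window, weights))
    return out

# Tiny usage example with random binary data (shapes are illustrative).
rng = np.random.default_rng(0)
acts = rng.integers(0, 2, size=(4, 8, 16, 16))   # C=4, T=8, 16x16 frames
kern = rng.integers(0, 2, size=(4, 3, 3, 3))     # one 3x3x3 kernel
print(binary_conv3d(acts, kern).shape)           # (6, 14, 14)
```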
the minimum value of the convolutional result is −9 (the sum of nine −1s) for XNOR-Net. However, the maximum value of the convolutional result is 9 (the sum of nine 1s), and the minimum value of the convolutional result is 0 (the sum of nine 0s) for AND-Net. Therefore, the ranges of convolutional results are [−9, 9] and [0, 9] for XNOR-Net and AND-Net, respectively. The range of XNOR-Net is larger than that of AND-Net, but their results are discrete. The convolutional results of both XNOR-Net and AND-Net have ten values. The convolutional result can only be one of the ten candidate values (−9, −7, −5, −3, −1, 1, 3, 5, 7, 9) for XNOR-Net, because the sum of nine odd numbers cannot be an even number. And the convolutional result is one of the ten candidate values (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) for AND-Net. As a result, the representation performance of AND-Net is similar to that of XNOR-Net.

3.2 Optimized convolution and distributed storage

It is well known that the convolutional areas overlap when the kernel size is larger than the sliding stride. This means that the pixels of the input feature maps are used repeatedly. For this reason, the input data will be read repeatedly, which results in longer latency and more power consumption. The read count of the input data should be as low as possible for an accelerator. To this end, an approach for optimizing the convolution operation, called O2M, is proposed. For an arbitrary kernel size (k × k), k × k output pixels are affected by one input pixel in a feature map. Previous works [22, 29] considered data reuse from the perspective of producing more output pixels. In this paper, however, data reuse is considered from the perspective of making full use of one input pixel.

To explain the proposed O2M approach simply, a small kernel size (2 × 2) is selected, but this method is suitable for convolutional kernels of any size. As illustrated in Fig. 5, a pixel of the input feature map can affect 4 pixels of the output feature map. In other words, obtaining one output pixel needs 4 input pixels. The contributions of an input pixel to its 4 output pixels can be computed first, and the intermediate results are then stored in the buffer. When the next pixel arrives, the intermediate results of the affected output pixels are taken out and added to the new intermediate results. To get the intermediate results of the 4 affected pixels at the same time, the distributed storage approach is proposed. As shown in Fig. 5, different colors represent different random access memories (RAMs), and one feature map is stored in four different RAMs. As a result, the four affected pixels can be read simultaneously and the sum operations can be completed in one cycle. O2M and distributed storage cooperate with each other. A complete and correct output pixel can be obtained after the 4 input pixels that affect it have all been processed. All the input pixels are read only once thanks to the O2M approach, which reduces the power consumption and the latency of reading input data.

3.3 3D AND-Net architectures

Firstly, a simple 2D AND-Net for the MNIST dataset is designed to evaluate the performance of the proposed AND-Net, which is presented in Table 1. The 2D AND-Net is very simple because of the simplicity of MNIST. The AND-Net contains two convolutional layers, two pooling layers, and a fully connected layer. Every convolutional layer is followed by a pooling layer for dimensionality reduction. The Softmax function is used for classification, and the ReLU function is used as the non-linearity.

Then a 3D AND-Net for the KTH dataset is proposed and its architecture is presented in Table 2. The 3D AND-Net consists of six convolutional layers and a fully connected layer. The spatial resolution of the input video is 64 × 64 pixels. The 1st, 3rd, and 5th convolutional layers

Table 1 Architecture of 2D AND-Net for MNIST

Layer   Output     Operation    Kernel size
1       28 × 28    Conv         3 × 3 × 10
2       14 × 14    Pooling      2 × 2
3       14 × 14    Conv         3 × 3 × 20
4       7 × 7      Pooling      2 × 2
5       10         FC-Softmax   –
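As a quick sanity check on the value ranges discussed in Sect. 3.1, the short sketch below enumerates every 3 × 3 binary window against every 3 × 3 binary kernel and collects the achievable convolution outputs for the XNOR (±1) and AND (0/1) formulations. The variable names are ours and the brute-force enumeration (2^18 pairs) is only for illustration.

```python
from itertools import product

xnor_vals, and_vals = set(), set()
# Enumerate every 3x3 binary window and every 3x3 binary kernel (2^9 each).
for win in product((0, 1), repeat=9):
    for ker in product((0, 1), repeat=9):
        # XNOR-Net style: values are +/-1, product is +1 when the bits match.
        xnor_vals.add(sum(1 if w == k else -1 for w, k in zip(win, ker)))
        # AND-Net style: values are 0/1, product is a logical AND.
        and_vals.add(sum(w & k for w, k in zip(win, ker)))

print(sorted(xnor_vals))  # [-9, -7, -5, -3, -1, 1, 3, 5, 7, 9] -> ten odd values
print(sorted(and_vals))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      -> ten values
```

Both formulations indeed produce exactly ten distinct values, which is the basis for the claim that their representation performance is similar.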
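The O2M scheme and the distributed storage of Sect. 3.2 can be modeled in software as an input-stationary accumulation: each arriving input pixel is read once and its contribution is scattered into the (up to) k × k output pixels it affects, with the partial-sum buffer split into k × k banks so that the affected partial sums could be fetched in parallel. The following sketch is our own functional model for a 2 × 2 kernel with stride 1, under assumed shapes and names; it is not the authors' hardware implementation.

```python
import numpy as np

def conv2x2_o2m(inp, ker):
    """Input-stationary 2D convolution for a 2x2 kernel, stride 1.

    Each input pixel is read exactly once; its contribution is scattered to
    the up-to-4 output pixels it affects. The partial sums are kept in 4
    banks (indexed by output row/column parity) to mimic the distributed
    storage that lets hardware fetch the 4 affected partial sums in one cycle.
    """
    H, W = inp.shape
    OH, OW = H - 1, W - 1
    banks = {(pr, pc): np.zeros((OH, OW), dtype=inp.dtype)
             for pr in (0, 1) for pc in (0, 1)}
    for i in range(H):            # stream the input in raster order,
        for j in range(W):        # reading each pixel exactly once
            px = inp[i, j]
            for di in (0, 1):     # the 2x2 output pixels this pixel affects
                for dj in (0, 1):
                    oi, oj = i - di, j - dj
                    if 0 <= oi < OH and 0 <= oj < OW:
                        banks[(oi % 2, oj % 2)][oi, oj] += px * ker[di, dj]
    out = np.zeros((OH, OW), dtype=inp.dtype)
    for bank in banks.values():   # merge the 4 banks into the final output
        out += bank
    return out

# Check the functional model against a direct sliding-window convolution.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=(6, 6))
w = rng.integers(0, 2, size=(2, 2))
ref = np.array([[np.sum(x[i:i + 2, j:j + 2] * w) for j in range(5)]
                for i in range(5)])
assert np.array_equal(conv2x2_o2m(x, w), ref)
```

Splitting the partial sums by output-coordinate parity means the four output pixels touched by any one input pixel always fall into four different banks, which is what allows the single-cycle read-modify-write described in the paper.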
Table 4 Accuracy of 3D AND-Net and other networks on KTH

Networks           Accuracy (%)   Bits
3D AND-Net         89.2           1
3D XNOR-Net        89.3           1
3D CNN [10]        90.2           32
[11]               90.34          32
S-Net-level2 [4]   90.1           32
[15]               91.6           32

The first one is training a floating-point network and then quantizing it with binary weights without fine-tuning, which leads to a large accuracy loss. The second one is doing the quantization during the training process, in which case the accuracy loss is small. In most BNN works [3, 19, 21, 24, 34], the quantization is done in the training process for high accuracy. Therefore, the quantization is done in the training process in this paper and high accuracy is achieved. Furthermore, 3D AND-Net and 3D XNOR-Net have similar accuracy, which also indicates that AND-Net is effective.

Generally, deeper networks have higher accuracy than shallow networks but bring more computational cost [9]. Therefore, it is reasonable to expect that the accuracy would be higher with a deeper and wider 3D AND-Net architecture. The results indicate that binary weights and activations are feasible for 3D CNNs. Furthermore, hardware optimization should not introduce accuracy loss when a quantized network [17, 30, 34, 36] is implemented by the hardware accelerator. Our hardware accelerator avoids data overflow, approximate calculations, and other operations that affect accuracy. The accuracy of the 3D AND-Net implemented on the proposed hardware accelerator is tested, and it is 89.2%. The result shows that the proposed hardware optimizations do not introduce accuracy loss.

The performance and resource utilization of our proposed 3D AND-Net accelerator are presented in Table 5. Some accelerators for 2D BNNs and previous 3D CNN accelerators with fixed-point weights and activations are also presented in Table 5, because we cannot find another accelerator for a 3D BNN. The proposed 3D AND-Net accelerator runs at 100 MHz. The multiplication operation is replaced by the AND operation in 3D AND-Net, which makes the DSP efficiency about 20× higher than that of other 3D CNN accelerators. In the 3D AND-Net accelerator, only 14 DSPs are used, and they implement normalization. The 3D AND-Net accelerator mainly uses LUT resources because the binary convolution is implemented with LUTs. Most Block RAMs are used for feature map buffers to achieve high utilization of the memory bandwidth. The total power of the proposed accelerator is only 1.6 W and the power of the PL is 0.35 W, which is an order of magnitude lower than other accelerators for 3D CNNs. The proposed accelerator can process 24 videos per second. Each video consists of 16 frames, so the proposed accelerator can process 384 frames per second, which meets real-time requirements. The speed of our accelerator is significantly higher than that of other 3D CNN accelerators because of the simple 3D AND-Net architecture, the binary convolution, and the O2M approach. Our accelerator achieves 240 frames per second per watt (FPS/W), which is far more efficient than other 3D CNN accelerators.

Throughput, DSP efficiency, power efficiency, and LUT efficiency are also used as efficiency metrics. DSP efficiency is GOPS per DSP (GOPS/DSP), whereas power efficiency is GOPS per watt (GOPS/W). LUT efficiency is GOPS per kilo LUTs (GOPS/KLUT). The throughput of our accelerator is the lowest because of its minimal resource consumption and low operating frequency. However, our 3D AND-Net accelerator is more efficient in terms of power and DSP.
Table 5 Comparison with other 3D CNN accelerators on FPGA

                   F-C3D [5]        [26]      [20]        BNN [16]   FBNA [8]   Ours
FPGA               ZC706            VC709     XC7VX690T   Virtex 7   ZC702      PYNQ-Z2
Precision          16-bit fixed     fixed     fixed       1 bit      1 bit      1 bit
CNN Model          C3D              C3D       C3D         BNN        BNN        3D AND-Net
Frequency (MHz)    172              150       120         90         –          100
LUTs               90,281           242,000   272,734     34,126     29,600     21,727
Block RAM          472              1508      391         1007       103        140
DSP                810              1536      3595        1096       –          14
Power (W)          9.7 (PL, 8.4)    25        15.8        8.2        3.3        1.6 (PL, 0.35)
GOPS               142              430.7     667.7       7663       722        22.4
GOPS/DSP           0.175            0.280     0.186       7.6        –          1.6
GOPS/W             14.6 (PL, 16.9)  17.2      42.2        935        219        14 (PL, 64)
GOPS/KLUT          1.57             1.78      2.45        22.4       24         1.03
FPS                3                33        8.7          6218      520        384
FPS/W              0.31             1.32      0.55         758       157        240
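To make the efficiency metrics concrete, the short Python sketch below recomputes the "Ours" column of Table 5 from the raw figures reported in the text (22.4 GOPS, 14 DSPs, 21,727 LUTs, 1.6 W total power, 0.35 W PL power, 24 videos/s with 16 frames each). The variable names are ours and the values are treated as exact for illustration.

```python
# Recompute the efficiency metrics of the "Ours" column in Table 5.
gops = 22.4            # giga-operations per second
dsps = 14              # DSP slices used
klut = 21_727 / 1000   # kilo-LUTs used
power_total_w = 1.6    # total power (W)
power_pl_w = 0.35      # programmable-logic (PL) power (W)
fps = 24 * 16          # 24 videos/s x 16 frames per video -> 384 FPS

print(f"GOPS/DSP   = {gops / dsps:.1f}")           # ~1.6
print(f"GOPS/W     = {gops / power_total_w:.0f}")   # ~14 (total power)
print(f"GOPS/W PL  = {gops / power_pl_w:.0f}")      # ~64 (PL power only)
print(f"GOPS/KLUT  = {gops / klut:.2f}")            # ~1.03
print(f"FPS        = {fps}")                        # 384
print(f"FPS/W      = {fps / power_total_w:.0f}")    # 240
```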
The power efficiencies of all 3D CNN accelerators are not high, because 3D CNNs are more complicated than 2D CNNs. Our power efficiency of the PL is about 3.7× higher than that of F-C3D [5] and [26]. The improvement in power efficiency is mainly because of the simple binary convolution and the efficient O2M approach. The LUT efficiency of our proposed accelerator is a little lower than that of the others, since many LUT resources process AND operations. In summary, compared with other accelerators, the proposed accelerator and 3D AND-Net have similar accuracy but are faster and more efficient for the same task. Moreover, the 2D BNN accelerator [16] is more efficient than our 3D AND-Net accelerator because of its larger FPGA resources and the simpler 2D CNN. Large memory resources can reduce accesses to DRAM, and a simpler 2D CNN helps to improve computational efficiency and reduce power consumption. FBNA [8] and our 3D CNN accelerator use the same FPGA platform. The performance of FBNA is higher than that of the proposed accelerator in terms of LUT and power efficiency. The reason is that a 2D BNN is simpler than a 3D BNN; the accelerator design for a 3D CNN is more challenging than that for a 2D BNN.

In summary, using algorithm and hardware co-design, the proposed 3D AND-Net and accelerator achieve high performance for human action recognition. In terms of the algorithm, an efficient and hardware-friendly 3D CNN is designed. In terms of the hardware, several optimizations are adopted for the proposed 3D AND-Net.

5 Conclusion

3D CNNs are more complex and have a higher computational cost than 2D CNNs, so compressing them is urgent. In this paper, a simple and efficient binary convolution algorithm is proposed for 3D CNNs, in which the weights and activations are constrained to 0 or 1 rather than −1 or +1. The proposed 3D AND-Net achieves 89.2% accuracy on the KTH dataset, which indicates that AND-Net is suitable for 3D CNNs. AND-Net is more appropriate for digital hardware design because the numbers 0 and 1 directly correspond to the digital circuit logic levels 0 and 1. In addition, an efficient convolution operation (O2M) and a distributed storage approach are proposed, which can significantly reduce the time spent reading input pixels. An accelerator for the 3D AND-Net is presented based on these improvements. The results demonstrate the excellence of our design in resource utilization, power, and DSP efficiency. It should be mentioned that joint optimization of algorithms and hardware can obtain better performance than independent optimization.

In the future, we will apply 3D AND-Net to other, more complicated datasets, such as UCF101, to evaluate it further.

Acknowledgements This research work was partly supported by the Natural Science Foundation of Jiangsu Province (Project No. BK20201145).

References

1. Arredondo-Velazquez, M., Diaz-Carmona, J., Torres-Huitzil, C., Padilla-Medina, A., Prado-Olivarez, J.: A streaming architecture for Convolutional Neural Networks based on layer operations chaining. J. Real-Time Image Process. 17(5), 1–19 (2020)
2. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training Deep Neural Networks with binary weights during propagations. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 3123–3131 (2015)
3. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv preprint arXiv:1602.02830 (2016)
4. Cui, Y., Shi, Y., Sun, X., Yin, W.: S-Net: A Lightweight Convolutional Neural Network for N-Dimensional Signals. In: IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–4 (2018)
5. Fan, H., Niu, X., Liu, Q., Luk, W.: F-C3D: FPGA-based 3-dimensional convolutional neural network. In: IEEE International Conference on Multimedia and Expo Workshops, pp. 1–4 (2017)
6. Gagliardi, A., de Gioia, F., Saponara, S.: A real-time video smoke detection algorithm based on Kalman filter and CNN. J. Real-Time Image Process. 1–11 (2021)
7. Gao, S., Cheng, M., Zhao, K., Zhang, X., Yang, M., Torr, P.H.S.: Res2net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43(2), 652–662 (2021)
8. Guo, P., Ma, H., Chen, R., Li, P., Xie, S., Wang, D.: FBNA: A Fully Binarized Neural Network Accelerator. In: 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 51–54 (2018)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2012)
11. Latah, M.: Human action recognition using support vector machines and 3D convolutional neural networks. Int. J. Adv. Intell. Inform. 3(1), 47–55 (2017)
12. Li, G., Zhang, M., Duan, B., Zhang, Q., Tong, G.: Kernel Sharing in the Channel Dimension to Improve Parameters Efficiency. In: 2019 International Conference on Computing, Electronics Communications Engineering (iCCECE), pp. 78–82 (2019)
13. Li, G., Zhang, M., Li, J., Lv, F., Tong, G.: Efficient densely connected convolutional neural networks. Pattern Recognit. 109, 107610 (2021)
14. Li, J., Long, X., Hu, S., Hu, Y., Gu, Q., Xu, D.: A novel hardware-oriented ultra-high-speed object detection algorithm based on convolutional neural network. J. Real-Time Image Process. 17(5), 1703–1714 (2020)
15. Li, J., Wang, T., Zhou, Y., Wang, Z., Snoussi, H.: Using Gabor filter in 3D convolutional neural networks for human action recognition. In: Chinese Control Conference (CCC), pp. 11139–11144 (2017). https://doi.org/10.23919/ChiCC.2017.8029134
16. Li, Y., Liu, Z., Xu, K., Yu, H., Ren, F.: A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. 14(2), 18 (2018)
17. Liang, S., Yin, S., Liu, L., Luk, W., Wei, S.: FP-BNN: Binarized neural network on FPGA. Neurocomputing 275, 1072–1086 (2018)
18. Lin, S., Ji, R., Li, Y., Deng, C., Li, X.: Toward compact ConvNets via structure-sparsity regularized filter pruning. IEEE Trans. Neural Networks Learn. Syst. 31(2), 574–588 (2020)
19. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 345–353 (2017)
20. Liu, Z., Chow, P., Xu, J., Jiang, J., Dou, Y., Zhou, J.: A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs. Electronics 8(1), 65 (2019)
21. Liu, Z., Shen, Z., Savvides, M., Cheng, K.: ReActNet: towards precise binary neural network with generalized activation functions. Comput. Vis. Eur. Conf. (ECCV) 14, 143–159 (2020)
22. Ma, Y., Cao, Y., Vrudhula, S., Seo, J.: Optimizing the convolution operation to accelerate deep neural networks on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(7), 1354–1367 (2018)
23. Meng, B., Wang, L., He, Z., Jeon, G., Dou, Q., Yang, X.: Gradient information distillation network for real-time single-image super-resolution. J. Real-Time Image Process. 18(2), 333–344 (2021)
24. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In: European Conference on Computer Vision (ECCV), pp. 525–542 (2016)
25. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: MobileNetV2: Inverted Residuals and Linear Bottlenecks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)
26. Shen, J., Huang, Y., Wang, Z., Qiao, Y., Wen, M., Zhang, C.: Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 97–106 (2018)
27. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: International Conference on Learning Representations (ICLR) (2015)
28. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning Spatiotemporal Features with 3D Convolutional Networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497 (2015)
29. Tu, F., Yin, S., Ouyang, P., Tang, S., Liu, L., Wei, S.: Deep convolutional neural network architecture with reconfigurable computation patterns. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25(8), 2220–2233 (2017)
30. Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M., Vissers, K.: FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 65–74. ACM (2017)
31. Wang, H., Shao, M., Liu, Y., Zhao, W.: Enhanced efficiency 3D convolution based on optimal FPGA accelerator. IEEE Access 5, 6909–6916 (2017)
32. Xu, K., Wang, X., Liu, X., Cao, C., Li, H., Peng, H., Wang, D.: A dedicated hardware accelerator for real-time acceleration of YOLOv2. J. Real-Time Image Process. 18(3), 1–12 (2020)
33. Yang, H., Yuan, C., Li, B., Du, Y., Xing, J., Hu, W., Maybank, S.J.: Asymmetric 3D convolutional neural networks for action recognition. Pattern Recognit. 85, 1–12 (2019)
34. Yang, L., He, Z., Fan, D.: A Fully Onchip Binarized Convolutional Neural Network FPGA Implementation with Accurate Inference. In: International Symposium on Low Power Electronics and Design (ISLPED), p. 50. ACM (2018)
35. Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., Yu, B.: Recent advances in convolutional neural network acceleration. Neurocomputing 323, 37–51 (2019)
36. Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.H., Srivastava, M., Gupta, R., Zhang, Z.: Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 15–24 (2017)
37. Zhuang, B., Shen, C., Reid, I.: Training Compact Neural Networks with Binary Weights and Low Precision Activations. arXiv preprint arXiv:1808.02631 (2018)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Guoqing Li received the B.S. degree from Qingdao University, Qingdao, China, in 2014, and the M.S. degree from South China Normal University, Guangzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the National ASIC Engineering Technology Research Center, School of Electronics Science and Engineering, Southeast University, Nanjing, China. His current research interests include computer vision and deep learning hardware accelerators.

Meng Zhang received the B.S. degree in Electrical Engineering from the China University of Mining and Technology in 1986, the M.S. degree in Bioelectronics from Southeast University, China, in 1993, and the Ph.D. degree in Microelectronic Engineering from Southeast University as an on-the-job postgraduate student. He is currently a professor at the National ASIC System Research Center, College of Electronic Science and Engineering, Southeast University, Nanjing, PR China. He is a faculty adviser of Ph.D. graduates. His research interests include deep learning, machine learning, digital signal processing, digital communication systems, wireless sensor networks, digital integrated circuit design, and information security and assurance. He is an author or coauthor of more than 40 refereed journal and international conference papers and a holder of more than 90 patents, including some PCT and US patents.

Qianru Zhang is a Ph.D. student at the National ASIC Center in the School of Electronic Science and Engineering, Southeast University, China. She obtained her M.S. in Electrical and Computer Engineering from the University of California, Irvine, in 2016. Her research interests include digital signal processing, big data analysis, and deep learning techniques.

Zhijian Lin received the B.S. degree from the School of Optics and Electronic Science and Technology, China Jiliang University, Hangzhou, China, in 2018. He is currently pursuing the M.S. degree in the School of Microelectronics, Southeast University, China. His current research interests include accelerating deep neural networks based on FPGA.