Towards Reconfigurable CNN Accelerator for FPGA Implementation
Rizwan Tariq Syed, Marko Andjelkovic, Markus Ulbricht, and Milos Krstic
Abstract—Convolutional Neural Networks (CNNs) have revolutionized many applications in recent years, especially in image classification, video processing, and pattern recognition. This success of CNNs has been a motivating factor for solving even more complex problems involving multiple data modalities. Traditionally, a single CNN accelerator has been optimized for just one task or has been used to perform correlated tasks. We leverage the capability of CNNs to learn patterns and use one accelerator to perform multiple uncorrelated tasks from different modalities, achieving an average accuracy above 90%, which would otherwise require three accelerators. Two types of CNN architectures (i.e., fused and branched) are evaluated for three distinct tasks in terms of accuracy, quantization, pruning, hardware resource utilization, power, and latency. Capitalizing on this, we further propose a runtime reconfigurable CNN accelerator supporting fault-tolerant (FT), high-performance (HP), and de-stress (DS) modes.

Index Terms—Multi-task learning, multi-modal learning, FPGAs, convolutional neural network, reliability, reconfigurability.
I. INTRODUCTION

CNNs have revolutionized many applications in recent years, ranging from smart video surveillance and intelligent manufacturing to smart cities and medical imaging. However, running CNN models is a resource-intensive process, and deploying these complex models, with millions of parameters, on low-power edge devices is a growing concern. Using multiple sensors to collect data is becoming common in various applications, e.g., radars and cameras in industrial and medical applications, or lidar, radar, and camera in self-driving cars. This indicates that CNNs are also becoming increasingly complex, moving from processing images from a single image sensor for object detection to processing multiple data streams from numerous sensors to classify diverse tasks. To fulfill high-performance constraints, CNN models are implemented on hardware accelerators (i.e., ASICs, FPGAs). However, due to varying AI requirements and workloads, hardware resource utilization reaches its limit very quickly, and an increase in hardware resources directly impacts power consumption. Many solutions have been proposed to optimize CNN models for edge-device deployment, e.g., pruning, quantization, knowledge distillation, and low-rank factorization. We provide a different perspective on saving hardware resources and power consumption. We leverage the fundamental capability of CNNs to learn to recognize patterns and train multiple distinct tasks from different modalities, thereby forcing one CNN accelerator to learn the features common to the tasks, which would otherwise require three separate accelerators (Fig. 2(a)). Thus, the proposed approach, assisted by pruning and quantization methods, reduces hardware resources and power substantially. We extend this concept further to propose runtime reconfigurable CNNs.

To cope with the challenge of changing AI application requirements, reconfigurable or adaptive accelerators have been proposed by many researchers. The primary idea is that AI accelerators should adapt to changing needs in accuracy, power, latency, reliability, etc. Various concepts of reconfigurable DNNs have thus been presented: [1] dynamically changes the bitstream of the DNN model to trade off accuracy against power, [2] re-programs the weights of the DNN, [3] decomposes large CNN kernel computations into small kernel-sized computations, [4] performs adaptive loading and processing of data in CNN kernels, [5] supports hybrid quantization, and [6] reconfigures the data path to reduce total energy. To the best of our knowledge, one aspect of reconfigurability missing in most studies is hardware resource and power efficiency combined with high reliability for safety-critical applications, aging awareness, and high-performance computation. Major contributions of this brief are:
• A novel approach of shared layers to execute multiple distinct tasks from different modalities on one accelerator.
• A workflow for generating fused and branched CNN architectures.
• An approach for a runtime reconfigurable CNN accelerator for fault-tolerance, de-stress (aging-aware), and high-performance computational needs.

Manuscript received 22 December 2022; accepted 23 January 2023. Date of publication 31 January 2023; date of current version 6 March 2023. This work was supported by the Federal Ministry of Education and Research of Germany through the Project "Open6GHub" under Grant 16KISK009. This brief was recommended by Associate Editor A. L. Zimpeck. (Corresponding author: Rizwan Tariq Syed.) Rizwan Tariq Syed, Marko Andjelkovic, and Markus Ulbricht are with System Architectures, Leibniz-Institut für innovative Mikroelektronik, 15236 Frankfurt (Oder), Germany (e-mail: syed@ihp-microelectronics.com; andjelkovic@ihp-microelectronics.com; ulbricht@ihp-microelectronics.com). Milos Krstic is with System Architectures, Leibniz-Institut für innovative Mikroelektronik, 15236 Frankfurt (Oder), Germany, and also with the Chair of Design and Test Methodology, University of Potsdam, 14469 Potsdam, Germany (e-mail: krstic@ihp-microelectronics.com). Digital Object Identifier 10.1109/TCSII.2023.3241154.

II. MULTIMODAL MULTITASK CNN ACCELERATOR

CNNs, a special type of neural network, have proven to be very effective in solving image classification problems. CNNs have the ability to develop an internal representation of an image.
Fig. 2. (a) Task execution on application-specific accelerators. (b) Runtime reconfigurable CNN accelerators. (c) Task execution on FM. (d) Control element. (e) Workflow. (f) Task execution on BM. (g) CNN architecture with trainable params: 14,926.
The T3 dataset has been converted to three-dimensional RGB images, and noise has been added to make the whole dataset more challenging. Two colors (i.e., red and green) have been introduced in the dataset for sub-task classification. The last step is resizing the inputs of all tasks to the same dimensions, i.e., 32 × 32 × 3, so that the image dimensions match the input dimensions of the CNN layers.
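As a minimal sketch of this preprocessing, the snippet below converts a single-channel T3 sample to a noisy 32 × 32 × 3 RGB image. The noise level and the red/green encoding scheme are illustrative assumptions; the brief does not specify them.

```python
import numpy as np
import tensorflow as tf

def preprocess_t3(sample_2d, subtask_color, noise_std=0.05):
    """Convert a single-channel T3 sample to a noisy 32x32x3 RGB image.

    `subtask_color` ("red" or "green") encodes the sub-task label;
    `noise_std` is an assumed noise level, not the paper's value.
    """
    # Replicate the single channel into three RGB channels.
    x = np.repeat(sample_2d[..., np.newaxis], 3, axis=-1).astype(np.float32)
    # Emphasize the red or green channel to encode the sub-task (assumed scheme).
    channel = 0 if subtask_color == "red" else 1
    x[..., channel] *= 1.5
    # Additive Gaussian noise makes the dataset more challenging.
    x += np.random.normal(0.0, noise_std, size=x.shape)
    # Resize so the input matches the CNN input layer (32 x 32 x 3).
    x = tf.image.resize(x, (32, 32)).numpy()
    return np.clip(x, 0.0, 1.0)
```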
Model Creation, Training, and Testing of FM and BM: The CNN architecture, illustrated in Fig. 2(g), has been created using TensorFlow/Keras and QKeras [10]. Following data preprocessing and model creation, training and testing of the model can be carried out (Fig. 2(e)). FM training can be treated as single-task learning and follows a standard training process, in which the dataset now consists of three different tasks and the output layers have classes representing all three tasks.
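The brief does not spell out how the fused output space is arranged; one plausible reading, shown below as a minimal sketch, is that each task's classes occupy a contiguous block of a single fused output layer. The class counts and offsets are assumptions for illustration only.

```python
# Assumed class counts per task (illustrative; the brief does not list them).
NUM_CLASSES = {"T1": 4, "T2": 6, "T3": 10}

def to_fused_label(task, label):
    """Map a task-local label into the fused model's single output space."""
    offsets, start = {}, 0
    for name, n in NUM_CLASSES.items():
        offsets[name] = start
        start += n
    return offsets[task] + label

# e.g., T3 digit '7' becomes fused class 4 + 6 + 7 = 17.
assert to_fused_label("T3", 7) == 17
```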
BM training can be performed using two methods: 1) the MTL method, and 2) the transfer learning method. The MTL method trains multiple tasks jointly by optimizing multiple loss functions. In this way, the common layers learn the representations shared between related tasks, while each task-specific branch learns to perform well on its own task. Because all tasks are optimized simultaneously, the training may not converge to the desired accuracy for every task. Therefore, an alternative method, transfer learning (TL), can be applied. The TL method is a multi-stage training process: 1) the entire model is first trained as an FM, and the last layer is then removed, leaving behind the layers trained on all three tasks; 2) the weights of the shared layers are frozen; 3) task-specific layers are added to the model, and each task is trained individually; 4) if a task has additional sub-tasks, they can be trained in the same manner.
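A minimal Keras sketch of both training styles follows. The layer sizes, class counts, and loss weights are placeholders rather than the architecture of Fig. 2(g).

```python
from tensorflow.keras import layers, Model

# Shared trunk: learns representations common to all three tasks.
inp = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(8, 3, activation="relu")(inp)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Task-specific branches (class counts are placeholders).
t1 = layers.Dense(4, activation="softmax", name="t1")(x)
t2 = layers.Dense(6, activation="softmax", name="t2")(x)
t3 = layers.Dense(10, activation="softmax", name="t3")(x)
model = Model(inp, [t1, t2, t3])

# MTL method: optimize all task losses jointly in one training run.
model.compile(optimizer="adam",
              loss={"t1": "sparse_categorical_crossentropy",
                    "t2": "sparse_categorical_crossentropy",
                    "t3": "sparse_categorical_crossentropy"},
              loss_weights={"t1": 1.0, "t2": 1.0, "t3": 1.0})

# TL method: after FM training, freeze the shared layers and train each
# task-specific branch individually.
for layer in model.layers:
    if layer.name not in ("t1", "t2", "t3"):
        layer.trainable = False
```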
Additional model optimization methods (i.e., pruning and quantization) are also added to the training loop. Many experiments have concluded that numerous parameters in DNNs are of little significance, and the expected performance can still be achieved in their absence. This situation may occur when the neural network's parameters are zero, close to zero, or replicated.
Thus, pruning is a way to remove unnecessary parameters that do not significantly contribute to the accuracy of the results, thereby making the deep neural network sparse. This sparsity in the network parameters has two advantages: a) it causes a significant reduction in hardware resources, which further reduces computational complexity, and b) it improves the resiliency of the DNN model [11]. Our implementation focuses on the magnitude-based weight pruning method.

DNN quantization refers to approximating a neural network's parameters and activations with low bit-width fixed-point (FxP) numbers because, in many cases, the dynamic range that floating point (FP) provides is not needed. FxP numbers are generally hardware-friendly: FxP computations are not only faster than FP but also incur less area overhead. Quantization also happens to increase the reliability of the DNN model [12]. Hence, significant benefits in terms of model size and reliability can be expected after quantization. Both optimization methods reduce hardware resources and power consumption and have become a de facto step in DNN deployment on hardware. Accuracy results on the test set for FM and BM are reported in Table I.
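The brief names magnitude-based weight pruning and QKeras quantization as the two optimizations added to the training loop. The sketch below combines both, assuming the TensorFlow Model Optimization toolkit for pruning; the 6-bit quantizers and 50% sparsity target are illustrative values, not the settings behind FMQ, FMQP, or BMQP.

```python
import tensorflow_model_optimization as tfmot
from qkeras import QConv2D, QDense, QActivation, quantized_bits
from tensorflow.keras import layers, Model

# QKeras layers quantize weights/activations during training; the 6-bit
# precision here is illustrative, not the exact FxP widths from the brief.
inp = layers.Input(shape=(32, 32, 3))
x = QConv2D(8, 3,
            kernel_quantizer=quantized_bits(6, 0, alpha=1),
            bias_quantizer=quantized_bits(6, 0, alpha=1))(inp)
x = QActivation("quantized_relu(6)")(x)
x = layers.Flatten()(x)
x = QDense(20,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1))(x)
out = layers.Activation("softmax")(x)
model = Model(inp, out)

# Magnitude-based weight pruning: ramp sparsity to an assumed 50% target.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# The UpdatePruningStep callback must run during training:
# model.fit(x_train, y_train,
#           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```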
HLS4ML Framework: hls4ml [13] is an open-source framework designed to deploy machine learning (ML) models on FPGAs, especially for low-latency and low-power processing at the edge. After the desired test accuracy is achieved, the hls4ml framework can be used to convert the trained model into HLS-compatible C/C++. hls4ml takes several parameters into account when generating the synthesizable C/C++: the FPGA part number, interface type, reuse factor (R), and FxP precision. The reuse factor determines the parallelism in the hls4ml-generated DNN model (Fig. 3(b)). Multiplication is the most fundamental operation in neural networks, as inference involves multiplying weights with inputs; after multiplication, a bias is added, and the result is passed to an activation function. A reuse factor equal to 1 means the design is fully parallel and generates the HLS model with the lowest possible latency.
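A minimal hls4ml conversion sketch using the parameters named above; the precision, output directory, and FPGA part number are example values, since the brief does not state the targeted device here.

```python
import hls4ml

# Start from a per-model baseline config, then set the knobs listed above.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")
config["Model"]["Precision"] = "ap_fixed<16,6>"  # FxP precision (example)
config["Model"]["ReuseFactor"] = 1               # R = 1: fully parallel

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls_prj",
    part="xczu7ev-ffvc1156-2-e",  # example FPGA part number, not
)                                 # necessarily the board used in this brief
hls_model.compile()  # builds a C/C++ emulation library for validation
# hls_model.build(synth=True)  # runs HLS synthesis to RTL
```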
TABLE I: EXPERIMENTAL RESULTS
…workaround described in the previous section. BMQP achieved slightly higher accuracy compared to FMQP. In BMQP, multiple branches can be active simultaneously, thereby supporting the classification of multiple sub-tasks. Thus, for task 3, two branches become active to classify the digits (T3) and the color of the digits (T3c).¹ BMQP consumes slightly more hardware resources and power compared to FMQP but has additional benefits from the reliability point of view. The impact of quantization and pruning is the same in BM; therefore, only the results of BMQP are shown. As evident from the FMP, FMQP, and BMQP results, pruning not only helps reduce hardware resources but also improves the model's accuracy.
The latency of all the fused models is approximately the same. The latency of an hls4ml-generated CNN architecture depends on the depth of the network, the reuse factor (R = 1) of the DSPs, and the size of the input (32 × 32 × 3). All three parameters are similar for FM, FMP, FMQ, and FMQP. The execution of the layers is sequential: a layer can only process data once the previous layer finishes its computation. Therefore, for hls4ml it is recommended to use a wider network (more kernels/parameters per layer) rather than a deeper one (more layers), as per-layer computations are more efficient to parallelize on an FPGA. Latency is not affected by the quantization bit widths, kernel size, or number of kernels, which is why the fused models, though they differ in quantization bits and pruning percentage, have the same latency. In BMQP, each task has a different latency, as each task goes through only specific layers; e.g., T3c branches earliest and has the lowest latency, while T2 branches at the last layer and therefore has the highest latency. T3c latency is not added to the total latency, as the T3 and T3c branches execute in parallel. Our hardware results match the analysis presented in [13]. Finding the best combination of hardware resource utilization, accuracy, power, and latency depends on the specific application and on the resources available in the targeted hardware.
IV. RECONFIGURABLE CNN ACCELERATORS

We have discussed numerous interpretations of "reconfigurability" in Section I. We approach reconfigurability from the standpoint of hardware reliability, aging, and computing performance. Execution of multiple tasks in different modes, often associated with microprocessors [14], [15], is possible in an application-specific CNN accelerator at runtime. By default, one accelerator is enough to perform the three tasks, but with triple modular redundancy it is possible to execute tasks in 1) fault-tolerant (FT) mode, 2) de-stress (DS) mode, and 3) high-performance (HP) mode (Fig. 2(b)). FT mode: all tasks execute on all three accelerators, thereby providing maximum reliability. DS mode: an aging-aware mode in which only one accelerator is active at a time and all tasks are executed in a TDM fashion. HP mode: all tasks run in parallel, i.e., each accelerator executes an individual task, which reduces the total latency by 2x. Using traditional approaches to realize these modes (e.g., FT mode) would require nine CNN accelerators.
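To make the three modes concrete, here is a small illustrative software model of the task-to-accelerator mapping a control element (Fig. 2(d)) could implement; the scheduling details are assumptions for illustration, not the brief's hardware design.

```python
from enum import Enum

class Mode(Enum):
    FT = "fault_tolerant"    # all tasks on all three accelerators (TMR)
    DS = "de_stress"         # one accelerator active, tasks time-multiplexed
    HP = "high_performance"  # one task per accelerator, all in parallel

def schedule(mode, tasks=("T1", "T2", "T3"), accelerators=(0, 1, 2)):
    """Return an illustrative task-to-accelerator mapping for each mode."""
    if mode is Mode.FT:
        # Triple modular redundancy: every task runs on every accelerator,
        # and a voter compares the three results.
        return {t: list(accelerators) for t in tasks}
    if mode is Mode.DS:
        # Aging-aware: a single active accelerator runs all tasks in TDM.
        return {t: [accelerators[0]] for t in tasks}
    # HP: each accelerator executes one task in parallel.
    return {t: [a] for t, a in zip(tasks, accelerators)}

print(schedule(Mode.HP))  # {'T1': [0], 'T2': [1], 'T3': [2]}
```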
V. CONCLUSION AND FUTURE WORK

The proposed methodology leverages the fundamental working principles of CNNs to execute three different tasks on one optimized application-specific CNN accelerator, thereby significantly reducing hardware resources and power. The workflow, experimental results, and an extension of the proposed approach into a reconfigurable CNN accelerator have been presented.

Future work will focus on augmenting the existing runtime reconfigurability with dynamic partial reconfiguration (DPR) and on-chip reliability sensors, aiming toward a fully adaptive AI processing system that can adapt to changing real-time and application-specific design requirements. Additionally, a comprehensive fault analysis against different fault models will be performed.
¹Values marked with superscript '1' in Table I are the additional resource utilization when T3c is added.

REFERENCES

[1] O. Eldash, A. Frost, K. Khalil, A. Kumar, and M. Bayoumi, "Dynamically reconfigurable deep learning for efficient video processing in smart IoT systems," in Proc. IEEE 6th World Forum Internet Things (WF-IoT), New Orleans, LA, USA, 2020, pp. 1–6, doi: 10.1109/WF-IoT48130.2020.9221101.
[2] G. D. Guglielmo et al., "A reconfigurable neural network ASIC for detector front-end data compression at the HL-LHC," IEEE Trans. Nucl. Sci., vol. 68, no. 8, pp. 2179–2186, Aug. 2021.
[3] L. Du et al., "A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018, doi: 10.1109/TCSI.2017.2735490.
[4] J. Cho, Y. Jung, S. Lee, and Y. Jung, "Reconfigurable binary neural network accelerator with adaptive parallelism scheme," Electronics, vol. 10, no. 3, p. 230, 2021.
[5] M. P. Véstias, R. P. Duarte, J. T. De Sousa, and H. C. Neto, "A configurable architecture for running hybrid convolutional neural networks in low-density FPGAs," IEEE Access, vol. 8, pp. 107229–107243, 2020, doi: 10.1109/ACCESS.2020.3000444.
[6] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.
[7] Y. Zhang and Q. Yang, "A survey on multi-task learning," 2017, arXiv:1707.08114.
[8] S. Albawi et al., "Understanding of a convolutional neural network," in Proc. Int. Conf. Eng. Technol., 2017, pp. 1–6.
[9] Y. Zhao et al., "Novel approach for gesture recognition using mmWave FMCW radar," in Proc. IEEE 95th Veh. Technol. Conf. (VTC-Spring), Helsinki, Finland, 2022, pp. 1–6.
[10] C. Coelho et al., "Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and HLS4ML," 2020, arXiv:2006.10159.
[11] Z. Gao et al., "Reliability evaluation of pruned neural networks against errors on parameters," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst. (DFT), 2020, pp. 1–6.
[12] B. Goldstein et al., "Reliability evaluation of compressed deep learning models," in Proc. IEEE 11th Latin Amer. Symp. Circuits Syst. (LASCAS), 2020, pp. 1–5.
[13] T. Aarrestad et al., "Fast convolutional neural networks on FPGAs with hls4ml," Mach. Learn. Sci. Technol., vol. 2, no. 4, 2021, Art. no. 045015.
[14] M. Ulbricht, R. T. Syed, and M. Krstic, "Developing a configurable fault tolerant multicore system for optimized sensor processing," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst. (DFT), Noordwijk, The Netherlands, 2019, pp. 1–4.
[15] A. Simevski, O. Schrape, C. Benito, M. Krstic, and M. Andjelkovic, "PISA: Power-robust multiprocessor design for space applications," in Proc. IEEE 26th Int. Symp. On-Line Test. Robust Syst. Design (IOLTS), Napoli, Italy, 2020, pp. 1–6.