
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 70, NO. 3, MARCH 2023

Towards Reconfigurable CNN Accelerator for FPGA Implementation

Rizwan Tariq Syed, Marko Andjelkovic, Markus Ulbricht, and Milos Krstic

Abstract—Convolutional Neural Networks (CNNs) have revolutionized many applications in recent years, especially in image classification, video processing, and pattern recognition. This success of CNNs has been a motivating factor for solving even more complex problems involving multiple data modalities. Traditionally, a single CNN accelerator has been optimized for just one task or has been used to perform correlated tasks. We leverage the capability of CNNs to learn patterns and use one accelerator to perform multiple uncorrelated tasks from different modalities, achieving an average accuracy above 90%, which would otherwise require three accelerators. Two types of CNN architectures (i.e., fused and branched) are evaluated for three distinct tasks based on accuracy, quantization, pruning, hardware resource utilization, power, and latency. Capitalizing on this, we further propose a runtime reconfigurable CNN accelerator supporting fault-tolerant (FT), high-performance (HP), and de-stress (DS) modes.

Index Terms—Multi-task learning, multi-modal learning, FPGAs, convolutional neural network, reliability, reconfigurability.

I. INTRODUCTION

CNNs have revolutionized many applications in recent years, ranging from smart video surveillance and intelligent manufacturing to smart cities and medical imaging. However, running CNN models is a resource-intensive process, and deploying these complex models, with millions of parameters, on low-power edge devices is a growing concern. Using multiple sensors to collect data is becoming common in various applications, e.g., radars and cameras in industrial and medical applications, and lidar, radar, and cameras in self-driving cars. This indicates that CNNs are also becoming increasingly complex, from processing images from a single image sensor for object detection to processing multiple data streams from numerous sensors to classify diverse tasks. To fulfill the high-performance constraints, CNN models are implemented on hardware accelerators (i.e., ASICs, FPGAs). However, due to the varying AI requirements and workloads, hardware resource utilization quickly reaches its limit, and an increase in hardware resources directly impacts power consumption. Many solutions have been proposed to optimize CNN models for edge-device deployment, e.g., pruning, quantization, knowledge distillation, and low-rank factorization. We provide a different perspective on saving hardware resources and power consumption. We leverage the fundamental capability of CNNs to learn to recognize patterns and train multiple distinct tasks from different modalities, thereby forcing one CNN accelerator to learn the common features between the tasks, which would otherwise require three separate accelerators (Fig. 2(a)). Thus, the proposed approach, assisted by pruning and quantization methods, reduces hardware resources and power substantially. We extend this concept further to propose runtime reconfigurable CNNs.

To cope with the challenge of changing AI application requirements, reconfigurable or adaptive accelerators have been proposed by many researchers. The primary idea is that AI accelerators should adapt based on the changing needs of accuracy, power, latency, reliability, etc. Thus, various concepts of reconfigurable DNNs have been presented: [1] dynamically changes the bitstream of the DNN model to trade off accuracy and power, [2] re-programs the weights of the DNN, [3] decomposes large CNN kernel computations into small kernel-sized computations, [4] conducts adaptive loading and processing of data in CNN kernels, [5] supports hybrid quantization, and [6] performs data path reconfiguration to reduce total energy. To the best of our knowledge, one aspect of reconfigurability that is missing in most studies is hardware resource and power efficiency while providing high reliability for safety-critical applications, aging awareness, and high-performance computation.

Major contributions of this brief are:
• A novel approach of shared layers to execute multiple distinct tasks from different modalities on one accelerator.
• A workflow for generating fused and branched CNN architectures.
• An approach for a runtime reconfigurable CNN accelerator for fault-tolerant, de-stress (or aging-aware), and high-performance computational needs.
Manuscript received 22 December 2022; accepted 23 January 2023. Date of publication 31 January 2023; date of current version 6 March 2023. This work was supported by the Federal Ministry of Education and Research of Germany through the Project "Open6GHub" under Grant 16KISK009. This brief was recommended by Associate Editor A. L. Zimpeck. (Corresponding author: Rizwan Tariq Syed.)
Rizwan Tariq Syed, Marko Andjelkovic, and Markus Ulbricht are with the System Architectures, Leibniz-Institut für innovative Mikroelektronik, 15236 Frankfurt (Oder), Germany (e-mail: syed@ihp-microelectronics.com; andjelkovic@ihp-microelectronics.com; ulbricht@ihp-microelectronics.com).
Milos Krstic is with the System Architectures, Leibniz-Institut für innovative Mikroelektronik, 15236 Frankfurt (Oder), Germany, and also with the Chair of Design and Test Methodology, University of Potsdam, 14469 Potsdam, Germany (e-mail: krstic@ihp-microelectronics.com).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2023.3241154.
Digital Object Identifier 10.1109/TCSII.2023.3241154

II. MULTI-MODAL MULTI-TASK CNN ACCELERATOR

CNNs, a special type of neural network, have proven to be very effective in solving image classification problems. CNNs have the ability to develop an internal representation of an image or pattern. This allows the CNN model to learn position- and scale-invariant structures in the image data, which is very important when working with images.

Multi-Modal Learning: We perceive the world in a multi-modal fashion using our sensory organs, i.e., eyes, ears, nose, etc. Similarly, neural networks can also learn features from multiple input sources (i.e., multi-modal data from different sensors) to make better decisions. The use of multiple sensors in various applications has been increasing: consumer devices and industrial, medical, and safety-critical applications are being equipped with more sensors, which results in more data being collected. The data from multiple modalities (e.g., camera, radar, lidar, microphone) can be fused (i.e., sensor fusion) to understand the environment more accurately. To learn the shared representation, a CNN model can be trained on these modalities using various fusion methods (early fusion, late fusion, etc.). The resulting model tends to make better classifications because it balances the strengths of the multiple modalities.

Fig. 1. (a) FMCW radar hand gesture samples (b) SVHN samples (c) Transformed MNIST dataset.
Multi-Task Learning: Depending on the application and the data modality, a CNN model can be used to perform one task (e.g., type of clothing) or multiple tasks (e.g., type of clothing and color of clothing). In machine learning, this is called multi-task learning (MTL). In MTL, a model is trained for multiple tasks jointly by optimizing multiple loss functions [7]. The joint training generally reduces overfitting, and with the shared representations between related tasks, the model can generalize better for the different tasks at hand. MTL can be termed multi-modal multi-task learning (MMMT) when the model can learn a shared representation of inputs across multiple data modalities and tasks.
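As an illustration of jointly optimizing multiple loss functions over a shared representation, the Keras sketch below builds a small shared trunk with two task-specific heads; the layer sizes, names, and loss weights are illustrative assumptions, not the architecture of Fig. 2(g):

import tensorflow as tf
from tensorflow.keras import layers

# Shared trunk: low-level feature extractor reused by every task.
inp = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(8, 3, activation="relu")(inp)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
x = layers.Flatten()(x)

# Task-specific heads (hypothetical: digit class and digit color).
digit_out = layers.Dense(10, activation="softmax", name="digit")(x)
color_out = layers.Dense(2, activation="softmax", name="color")(x)

model = tf.keras.Model(inp, [digit_out, color_out])

# One loss per task; joint training minimizes their weighted sum.
model.compile(optimizer="adam",
              loss={"digit": "sparse_categorical_crossentropy",
                    "color": "sparse_categorical_crossentropy"},
              loss_weights={"digit": 1.0, "color": 0.5},
              metrics=["accuracy"])
# model.fit(x_train, {"digit": y_digit, "color": y_color}, epochs=10)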
Shared Layers for Un-Correlated Tasks for CNNs (Proposed Approach): CNN models consist of a few thousand up to several million parameters. Eventually, these models will be deployed on edge devices comprising ASICs or FPGAs. The existing challenge is that deploying CNN models at the edge is both resource-intensive and energy-intensive due to the large number of weights, activations, and computations involved. This indicates that there is a need to explore efficient methods to minimize hardware resource utilization, power consumption, and latency while trying to fulfill the application requirements. The majority of studies on CNN MTL and MMMT have focused on tasks that are highly correlated. We provide a different perspective on MTL and MMMT, which targets un-correlated tasks. This perspective leverages the fundamental principle of how a CNN learns during training. The lower-level features of most images are the same: lower layers learn low-level features (i.e., edges, curves, blobs, etc.), and the deeper we go in the network, the more the layers learn high-level (or more abstract) features [8]. We leverage this concept to force one model to learn the lower-level features of multiple uncorrelated tasks. These uncorrelated tasks, having similar low-level features, share the layers. When mapped on the hardware accelerator, these shared layers mean shared computing resources, thereby saving significant hardware resources and power consumption. This approach is further capitalized on to reconfigure the FPGA-based CNN accelerator during runtime to execute multiple tasks in FT, HP, and DS modes.

Fused and Branched CNN Model: To validate our approach of shared layers for distinct tasks, we have created two CNN models, i.e., fused and branched. A fused model (FM) is an un-branched model, where all the tasks share all the layers of the neural network (Fig. 2(c)). In comparison, the branched model (BM) consists of task-specific branches and shares only particular layers (Fig. 2(f)). Even though the FM is comparatively easier to train and consumes slightly fewer hardware resources and power, it might not reach the desired accuracy for a specific task. The BM involves a multi-stage training process and consumes slightly more hardware resources and power compared to the FM. Additionally, the BM can be helpful for 1) task isolation in case of faults, 2) task-specific bit-stream reconfiguration in FPGAs, 3) selective replication of only specific layers (e.g., more vulnerable layers or task-specific layers), 4) addition of sub-tasks, and 5) adding extra layers to achieve higher accuracy for specific tasks.

Workflow:
1) Data Preprocessing and Model Creation: Different tasks are emulated as different datasets, i.e., executing a single task means performing a classification task on a single dataset. Thus, for three tasks, we have considered three datasets (i.e., radar samples for hand gesture detection, SVHN, and MNIST) from two different modalities (i.e., radar and camera image), illustrated in Fig. 1(a)(b)(c). These three tasks are executed on one CNN accelerator in a time-division multiplexing (TDM) manner. Before training, the datasets need to be transformed. CNNs perform extremely well on images and patterns. Therefore, in order for the CNN to learn the common representations in the different tasks, the datasets of all the tasks need to be transformed into images. Task 2 (T2) consists of the SVHN dataset, which contains RGB image samples and therefore needs no additional transformation. Task 3 (T3) is a grayscale dataset, and task 1 (T1) is from a different modality; thus, T1 and T3 need to be preprocessed and transformed before CNN training. Data preprocessing of T1 consists of collecting the raw radar data, extracting information by fast Fourier transform (FFT), removing the clutter, and fusing the range, velocity, and angle feature maps. Additional technical details of the radar dataset are available in [9].


Fig. 2. (a) Tasks execution on application-specific accelerators (b) Runtime reconfigurable CNN accelerators (c) Tasks execution on FM (d) Control element (e) Workflow (f) Tasks execution on BM (g) CNN architecture with trainable params: 14926.

T3 is a modified version of the MNIST dataset. The original MNIST dataset consists of 2-dimensional grayscale images. The T3 dataset has been converted to 3-dimensional RGB images, and noise has been added to make the whole dataset more challenging. Two colors (i.e., red and green) have been introduced in the dataset for sub-task classification. The last step is resizing the samples of all tasks to the same dimensions, i.e., 32 x 32 x 3, so that the input dimensions of the images match the CNN input layer.
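A minimal preprocessing sketch for this T3 transformation is shown below; the noise level, the random red/green assignment, and the use of Gaussian noise are our assumptions, since the brief does not give the exact parameters:

import numpy as np
import tensorflow as tf

(x, y), _ = tf.keras.datasets.mnist.load_data()      # 28 x 28 grayscale digits; y holds the T3 labels
x = x.astype("float32") / 255.0

# Assign each sample a color sub-task label: 0 = red, 1 = green (assumed scheme).
color = np.random.randint(0, 2, size=len(x))

# Colorize: place the digit into the red or green channel of an RGB image.
rgb = np.zeros((len(x), 28, 28, 3), dtype="float32")
rgb[color == 0, ..., 0] = x[color == 0]
rgb[color == 1, ..., 1] = x[color == 1]

# Add noise to make the dataset more challenging (std. dev. 0.1 is assumed).
rgb = np.clip(rgb + np.random.normal(0.0, 0.1, rgb.shape).astype("float32"), 0.0, 1.0)

# Resize so the samples match the 32 x 32 x 3 input dimensions of the CNN.
rgb = tf.image.resize(rgb, (32, 32)).numpy()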
Model Creation, Training, and Testing of FM and BM: The CNN architecture, illustrated in Fig. 2(g), has been created using TensorFlow/Keras and QKeras [10]. Following data preprocessing and model creation, training and testing of the model can be accomplished (Fig. 2(e)). FM training can be treated as single-task learning and follows a standard training process in which the dataset now consists of three different tasks, and the output layers have classes representing all three tasks.
BM training can be performed using two methods: 1) the MTL method and 2) the transfer learning method. The MTL method consists of training multiple tasks jointly by optimizing multiple loss functions. In this way, the common layers learn the shared representations between related tasks, and each task-specific branch learns to perform well on its own task. As all the tasks are optimized simultaneously, it may happen that the training of all the tasks does not converge to the desired accuracy. Therefore, an alternative method of transfer learning (TL) can be applied. The TL training method consists of a multi-stage training process (sketched below): 1) First, the entire model is trained as an FM; the last layer of the model is then removed, leaving behind the layers that are trained on all three tasks. 2) Freeze the weights of the shared layers. 3) Add task-specific layers to the model and train each task individually. 4) If a task has additional sub-tasks, they can also be trained similarly.
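A sketch of this multi-stage TL procedure in Keras follows; the tiny trunk, layer names, and class counts are placeholders, and in the real flow the fused model would first be fully trained on the combined dataset:

import tensorflow as tf
from tensorflow.keras import layers

# Stage 1: the fused model (FM) with its combined output layer still attached.
inp = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(8, 3, activation="relu")(inp)
x = layers.Flatten()(x)
shared = layers.Dense(32, activation="relu", name="shared")(x)
fused_out = layers.Dense(17, activation="softmax")(shared)   # combined class count (assumed)
fused_model = tf.keras.Model(inp, fused_out)
# fused_model.compile(...); fused_model.fit(...)             # standard FM training

# Remove the last layer, keeping the layers trained on all three tasks.
trunk = tf.keras.Model(fused_model.input, fused_model.get_layer("shared").output)

# Stage 2: freeze the weights of the shared layers.
trunk.trainable = False

# Stage 3: add task-specific layers and train each task individually.
t1_out = layers.Dense(5, activation="softmax", name="t1")(trunk.output)   # gesture class count (assumed)
t1_model = tf.keras.Model(trunk.input, t1_out)
t1_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# t1_model.fit(x_t1, y_t1, ...)          # only the new branch weights are updated

# Stage 4: additional sub-tasks (e.g., the T3c color branch) are added the same way.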
Additional model optimization methods (i.e., pruning and quantization) are also added to the training loop. Many experiments have concluded that there are many parameters in DNNs that are not of much significance, and it is still feasible to achieve the expected performance in their absence. This situation may occur when the neural network's parameters are zero, close to zero, or replicated. Thus, pruning is a way to remove unnecessary parameters that do not significantly contribute to the accuracy of the results, thereby making the deep neural network sparse. This sparsity in the neural network parameters due to pruning has two advantages: a) it causes a significant reduction in hardware resources, which further helps reduce the computational complexity, and b) it improves the resiliency of the DNN model [11]. The focus of our implementation is on the magnitude-based weight pruning method.
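One common way to realize magnitude-based weight pruning in a Keras training loop is the TensorFlow Model Optimization toolkit; the sketch below is an assumption on our part (the brief does not name the tool), with the 50% target sparsity matching the FMP model described later and a stand-in model in place of the CNN of Fig. 2(g):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in model; any Keras model (or individual layers) can be wrapped the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(3072,)),
    tf.keras.layers.Dense(17, activation="softmax"),
])

# Progressively zero out low-magnitude weights up to 50% sparsity during training.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=2000)
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# The pruning schedule is applied at every step by this callback during fit().
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(x_train, y_train, epochs=5, callbacks=callbacks)

# Strip the pruning wrappers before export; the zeroed weights stay sparse.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)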

DNN quantization refers to approximating a neural network's parameters and activations with low bit-width fixed-point (FxP) numbers because, in many cases, the dynamic range that floating point (FP) provides is not needed. FxP numbers are generally hardware-friendly: FxP computations are not only faster than FP but also incur less area overhead compared to FP computations. Quantization also happens to increase the reliability of the DNN model [12]. Hence, we can expect significant benefits in terms of model size and reliability after model quantization. Both model optimization methods lead to reduced hardware resources and power consumption and have become a de-facto step during DNN deployment on hardware. Accuracy results on the test set for FM and BM are presented in Table I.
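With QKeras [10], the quantization becomes part of training (quantization-aware training, QAT) by swapping Keras layers for quantized equivalents. A minimal sketch follows, where the per-layer bit-widths are illustrative and not the heterogeneous widths actually chosen for FMQ/FMQP:

from tensorflow.keras import layers, Model
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu

inp = layers.Input(shape=(32, 32, 3))
# Heterogeneous QAT: each layer can carry its own weight/bias/activation widths.
x = QConv2D(8, 3,
            kernel_quantizer=quantized_bits(6, 0, alpha=1),
            bias_quantizer=quantized_bits(6, 0, alpha=1))(inp)
x = QActivation(quantized_relu(6))(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = QDense(32,
           kernel_quantizer=quantized_bits(4, 0, alpha=1),
           bias_quantizer=quantized_bits(4, 0, alpha=1))(x)
x = QActivation(quantized_relu(4))(x)
out = layers.Dense(17, activation="softmax")(x)          # combined class count (assumed)

qmodel = Model(inp, out)
qmodel.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
# qmodel.fit(...)   # accuracy lost to the low bit-widths is recovered during training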
HLS4ML Framework: hls4ml [13] is an open-source framework designed to deploy machine learning (ML) models on FPGAs, especially for low-latency and low-power processing at the edge. After achieving the desired test accuracy, the hls4ml framework can be used to convert the trained model into HLS-compatible C/C++. The hls4ml framework takes several parameters into account when generating the synthesizable C/C++. These parameters include the FPGA part number, interface type, reuse factor (R), and FxP precision. The reuse factor determines the parallelism in the hls4ml-generated DNN model (Fig. 3(b)). Multiplication is the most fundamental operation in neural networks, as it involves multiplying the weights with the inputs; after multiplication, a bias is added, and the result is passed to an activation function. A reuse factor of 1 means the design is fully parallel and generates the HLS model with the lowest possible latency. If the reuse factor is increased by a factor N, the HLS compiler will try to reduce DSP usage by ~1/N, which results in an increase in the overall latency of the model by ~N. The 'precision' parameter is used by hls4ml to perform quantization. The default precision in hls4ml is <16, 6>, which means the model is quantized to 16-bit FxP, of which 6 bits are integer bits (including the sign bit) and the remaining 10 bits are fractional bits. The 'precision' parameter is adjusted until the desired accuracy is achieved. hls4ml compiles the model and calculates the model's accuracy (termed the hls4ml accuracy) using bit-accurate fixed-point emulation of the FPGA inference code.
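A typical conversion along these lines is sketched below, assuming a trained Keras/QKeras model `qmodel` (e.g., from the previous snippet); only the model-level precision and reuse factor are set here, whereas a real flow would tune them per layer:

import hls4ml

# Derive an HLS configuration template from the trained model.
config = hls4ml.utils.config_from_keras_model(qmodel, granularity="name")
config["Model"]["Precision"] = "ap_fixed<16,6>"   # <16,6>: 16 bits total, 6 integer bits
config["Model"]["ReuseFactor"] = 1                # R = 1: fully parallel, lowest latency

hls_model = hls4ml.converters.convert_from_keras_model(
    qmodel,
    hls_config=config,
    output_dir="hls4ml_prj",
    part="xcvu9p-flga2104-2L-e",                  # VCU118 FPGA part used in this brief
    clock_period=5)                               # 5 ns, i.e., a 200 MHz target clock

hls_model.compile()                    # bit-accurate fixed-point emulation on the CPU
# y_hls = hls_model.predict(x_test)    # the "hls4ml accuracy" is computed on this output
# hls_model.build(synth=True)          # launch Vivado HLS synthesis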
The FM, being un-branched, can be easily converted to HLS-compatible C/C++. hls4ml, version 0.6.0, does not support HLS code generation for the BM. We propose a workaround to use this framework for the BM as well: 1) dissect the BM into linear branches, 2) using hls4ml, generate HLS code for each branch separately, 3) generate HDL code for all branches using Vivado HLS, and 4) stitch all branches together in HDL using a control element (CE) (Fig. 2(d)). The CE 1) synchronizes the data and control signals between the stitched branches and 2) enables/disables single or multiple branches based on the tasks.
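Step 1 of this workaround can be done directly on the Keras graph before invoking hls4ml; the sketch below (the model `bm` and its layer names are hypothetical) replays the layers of one branch on a fresh input so that every piece handed to hls4ml is a purely linear model:

import tensorflow as tf

# Assumption: `bm` is the branched Keras model, its shared trunk ends at a layer
# named "shared", and the T3 branch is made of the layers "t3_fc" and "t3_out".
trunk = tf.keras.Model(bm.input, bm.get_layer("shared").output)

# Rebuild the branch as a stand-alone linear model by replaying its layers
# on a new Input matching the trunk's output shape.
branch_in = tf.keras.Input(shape=trunk.output_shape[1:])
x = branch_in
for name in ("t3_fc", "t3_out"):                  # branch layer names are assumed
    x = bm.get_layer(name)(x)
branch_t3 = tf.keras.Model(branch_in, x)

# `trunk` and each `branch_*` model are then converted with hls4ml separately,
# and the generated HDL blocks are stitched together by the control element.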
TABLE I
EXPERIMENTAL RESULTS

Fig. 3. (a) FMQP weights distribution (b) Impact of different reuse factor values on DSP48E utilization for the computation between 2 neuron pairs [13].

III. RESULTS AND DISCUSSION

Following the workflow elaborated in the previous section, we have created five CNN models, i.e., fused model (FM), fused model pruned (FMP), fused model quantized (FMQ), fused model quantized and pruned (FMQP), and branched model quantized and pruned (BMQP). Experimental results are presented in Table I. The latency of all the models is obtained from RTL simulation, and power is estimated during post-synthesis functional simulation using a switching activity file of the design (i.e., a SAIF file). All the models are synthesized in Vivado, targeting a Virtex UltraScale+ VCU118 (FPGA part no. xcvu9p-flga2104-2L-e) at a 200 MHz clock frequency. All the models maintain an average accuracy above 90% for all three tasks on one application-specific accelerator, unlike other studies where one accelerator is optimized to perform just one task.

FM is the non-optimized baseline implementation; it has good accuracy for all three tasks and utilizes the most hardware resources and power. FMP is a 50% pruned version of FM. The high accuracy of FM and FMP comes at the cost of increased resource utilization and power consumption, as these two models are post-training homogeneously quantized to <20, 10> bit-widths (BW). This provides FM and FMP with a good dynamic range and helps avoid saturation and wrap-around issues. Decreasing the BW of FM and FMP would decrease their accuracy due to the loss of dynamic range; thus, precision tuning is required, not just in the neural network weights but also in the activation functions. FMQ is trained directly with lower precision (i.e., quantization-aware training (QAT)). Any loss in accuracy due to the loss of precision can be rectified during the training process, and the model learns to perform better at low BW. FMQ performs equally well in terms of accuracy while consuming fewer hardware resources and power. DSPs are a scarce resource, and these DSP hard macros consume more power than LUTs. During the synthesis of FMQ, trained at low BW using QAT, the simpler multiplication operations are mapped onto LUTs instead of DSPs, leading to savings in DSP usage and power consumption.

FMP removes the unnecessary weights, but the remaining weights have high BW due to the homogeneous post-training quantization (PTQ); FMQ has low BW due to heterogeneous QAT, but this model is not pruned. Heterogeneous QAT and pruning are combined in FMQP (see Fig. 3(a)) to achieve the best possible results in terms of accuracy, hardware resources, and power consumption. BMQP is obtained using the multi-stage transfer-learning training method, and its HLS code is generated using the workaround described in the previous section. BMQP achieves slightly higher accuracy compared to FMQP. In BMQP, multiple branches can be active simultaneously, thereby supporting the classification of multiple sub-tasks. Thus, for task 3, two branches become active to classify the digits (T3) and the color of the digits (T3c).¹ BMQP consumes slightly more hardware resources and power compared to FMQP but has additional benefits from the reliability point of view. The impact of quantization and pruning is the same in BM; therefore, only the results of BMQP are shown. As evident from the FMP, FMQP, and BMQP results, pruning not only helps reduce hardware resources but also improves the model's accuracy.

¹Values marked with superscript '1' in Table I are the additional resource utilization when T3c is added.
The latency of all the fused models is approximately the same. The latency of the hls4ml-generated CNN architecture depends on the depth of the network, the reuse factor of the DSPs (R = 1), and the size of the input (32 x 32 x 3). All three parameters are similar for FM, FMP, FMQ, and FMQP. The execution of the layers is sequential: subsequent layers can only process the data when the previous layer finishes its computation. Therefore, for hls4ml, it is recommended to have a wider network (more kernels/parameters per layer) as opposed to more layers (a deeper DNN), as it is more efficient to parallelize per-layer computations in an FPGA. Latency is not affected by the quantization bit widths, kernel size, or the number of kernels. This is why the fused models, although different in terms of quantization bits and pruning percentage, have the same latency. Each task has a different latency in BMQP, as each task goes through only specific layers; i.e., T3c branches earliest and has the lowest latency, while T2 branches at the last layer and therefore has the highest latency in BMQP. T3c latency is not added to the total latency, as the T3 and T3c branches execute in parallel. Our hardware results match the analysis presented in [13]. Finding the best match of hardware resource utilization, accuracy, power, and latency depends on the specific application and the hardware resources available in the targeted hardware.
IV. RECONFIGURABLE CNN ACCELERATORS

We have discussed numerous interpretations of 'reconfigurability' in Section I. We approach reconfigurability from the standpoint of hardware reliability, aging, and computing performance. Execution of multiple tasks in different modes, often associated with microprocessors [14], [15], is also possible in an application-specific CNN accelerator during runtime. By default, one accelerator is enough to perform the three tasks, but with triple modular redundancy it is possible to execute the tasks in 1) fault-tolerant (FT) mode, 2) de-stress (DS) mode, and 3) high-performance (HP) mode (Fig. 2(b)). FT mode: all tasks execute on all three accelerators, thereby providing maximum reliability. DS mode: this is an aging-aware mode; only one accelerator is active at a time, and all tasks are executed in a TDM fashion. HP mode: this mode runs all the tasks in parallel, i.e., each accelerator executes an individual task, which leads to a reduction in total latency by 2x. Using traditional approaches, supporting these modes (e.g., FT mode) would require nine CNN accelerators.
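The intended behavior of the three modes can be summarized with a small behavioral sketch (plain Python with mocked accelerators; this models only the scheduling and voting policy, not the RTL of the control element):

from collections import Counter

def accelerator(task, sample):
    """Stand-in for one CNN accelerator instance returning a class label."""
    return hash((task, sample)) % 10              # mock inference result

def classify(mode, tasks, sample, n_acc=3):
    if mode == "FT":
        # Fault-tolerant: every task runs on all three accelerators; majority vote.
        return {t: Counter(accelerator(t, sample) for _ in range(n_acc))
                      .most_common(1)[0][0] for t in tasks}
    if mode == "DS":
        # De-stress / aging-aware: a single accelerator serves all tasks in TDM.
        return {t: accelerator(t, sample) for t in tasks}
    if mode == "HP":
        # High-performance: each accelerator takes one task; all run in parallel.
        return {t: accelerator(t, sample) for t in tasks}
    raise ValueError(f"unknown mode: {mode}")

print(classify("FT", ["T1", "T2", "T3"], sample=0))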
V. CONCLUSION AND FUTURE WORK

The proposed methodology leverages fundamental working principles of CNNs and assists in executing three different tasks on one optimized application-specific CNN accelerator, thereby significantly reducing hardware resources and power. The workflow, experimental results, and an extension of the proposed approach to a reconfigurable CNN accelerator are presented.

Future work will focus on augmenting the existing runtime reconfigurability with dynamic partial reconfiguration (DPR) and on-chip reliability sensors, thereby aiming toward a fully adaptive AI processing system that can adapt based on changing real-time and application-specific design requirements. Additionally, a comprehensive fault analysis against different fault models will also be performed.

REFERENCES

[1] O. Eldash, A. Frost, K. Khalil, A. Kumar, and M. Bayoumi, "Dynamically reconfigurable deep learning for efficient video processing in smart IoT systems," in Proc. IEEE 6th World Forum Internet Things (WF-IoT), New Orleans, LA, USA, 2020, pp. 1–6, doi: 10.1109/WF-IoT48130.2020.9221101.
[2] G. D. Guglielmo et al., "A reconfigurable neural network ASIC for detector front-end data compression at the HL-LHC," IEEE Trans. Nucl. Sci., vol. 68, no. 8, pp. 2179–2186, Aug. 2021.
[3] L. Du et al., "A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018, doi: 10.1109/TCSI.2017.2735490.
[4] J. Cho, Y. Jung, S. Lee, and Y. Jung, "Reconfigurable binary neural network accelerator with adaptive parallelism scheme," Electronics, vol. 10, no. 3, p. 230, 2021.
[5] M. P. Véstias, R. P. Duarte, J. T. De Sousa, and H. C. Neto, "A configurable architecture for running hybrid convolutional neural networks in low-density FPGAs," IEEE Access, vol. 8, pp. 107229–107243, 2020, doi: 10.1109/ACCESS.2020.3000444.
[6] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.
[7] Y. Zhang and Q. Yang, "A survey on multi-task learning," 2017, arXiv:1707.08114.
[8] S. Albawi et al., "Understanding of a convolutional neural network," in Proc. Int. Conf. Eng. Technol., 2017, pp. 1–6.
[9] Y. Zhao et al., "Novel approach for gesture recognition using mmWave FMCW radar," in Proc. IEEE 95th Veh. Technol. Conf. (VTC-Spring), Helsinki, Finland, 2022, pp. 1–6.
[10] C. Coelho et al., "Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and HLS4ML," 2020, arXiv:2006.10159.
[11] Z. Gao et al., "Reliability evaluation of pruned neural networks against errors on parameters," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst. (DFT), 2020, pp. 1–6.
[12] B. Goldstein et al., "Reliability evaluation of compressed deep learning models," in Proc. IEEE 11th Latin Amer. Symp. Circuits Syst. (LASCAS), 2020, pp. 1–5.
[13] T. Aarrestad et al., "Fast convolutional neural networks on FPGAs with hls4ml," Mach. Learn. Sci. Technol., vol. 2, no. 4, 2021, Art. no. 045015.
[14] M. Ulbricht, R. T. Syed, and M. Krstic, "Developing a configurable fault tolerant multicore system for optimized sensor processing," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst. (DFT), Noordwijk, The Netherlands, 2019, pp. 1–4.
[15] A. Simevski, O. Schrape, C. Benito, M. Krstic, and M. Andjelkovic, "PISA: Power-robust multiprocessor design for space applications," in Proc. IEEE 26th Int. Symp. On-Line Test. Robust Syst. Design (IOLTS), Napoli, Italy, 2020, pp. 1–6.
