
Design and Evaluation of Performance-efficient SoC-on-FPGA for Cloud-based Healthcare Applications

Mayank Kabra, Prashanth H C, and Madhav Rao
International Institute of Information Technology Bangalore, India
Email: {mayank.kabra, prashanth.c, mr}@iiitb.ac.in

Abstract—Cloud computing, with more resources at its disposal, has served heavy computing demands for various applications. Health is one such sector where cloud services have been beneficial, not only for storing and frequently updating subscribers' health records but also for providing useful analytical and predictive information to the users. However, most cloud processing runs are established on general-purpose computing devices that are not customized for accelerating medical inferences. Hence, a special-purpose and power-efficient SoC design for cloud computing targeted towards healthcare applications is desirable. This study examines the design and evaluation of SoCs for today's neural networks trained on healthcare data. The work realizes fully connected neural networks comprising 2 and 3 layers that are trained on two freely available datasets: one meant for detecting the survival status of individuals suffering from prostate cancer, and the other for survival status post blood and marrow transplantation. Additionally, a light-weight image classifier derived from the medical MNIST pathological dataset was also considered. The three neural networks were hardware-designed and characterized with two different processor cores and two extreme cache memory configurations, individually and in combination, to generate multiple SoC designs. Each SoC design is illustrated and evaluated in terms of power and hardware resource utilization. Compared to existing cloud platforms, the SoC design incorporating all three hardware accelerators showcased power reductions of 7.13X, 5.25X, and 15X, and average throughput improvements of 49.37X, 19.78X, and 26.47X, when compared to the corresponding neural network runs on CPU, GPU, and TPU units respectively. The SoC designs in this work, with three different co-processor choices, set an example for achieving customized cloud SoC designs that are power and performance efficient and targeted for healthcare applications.

Index Terms—SoC design, Hardware accelerator, AI on-chip, Cloud computing

I. INTRODUCTION

Cloud platforms have always been at the forefront of serving users' demands [1]–[8], and healthcare services have recently relied on them for either personalized health-record storage or providing insights to individuals based on their constantly supplied medical data [9]–[12]. In the past few years, with the emergence of Artificial Intelligence (AI) and Machine Learning (ML) techniques, multi-modal physiological signals have been shown to carry unique characteristics that are extremely useful for the users' well-being. Modern-day cloud platforms are equipped with high-performance processors, Graphics-Processing-Units (GPUs), Tensor-Processing-Units (TPUs), and other devices to extract and share this information with the subscribers. Cloud-based designs, especially for healthcare services, require customized processing; using general-purpose homogeneous processing units results in sub-optimal utilization of the processing units and high energy consumption. The computing energy and latency costs are directly charged to the subscribers, which eventually limits the cloud user base and hampers large-scale adoption of health services on the run. Although the advantages of cloud-supported healthcare services are enormous, the power-inefficient and non-customized design of the cloud engine fails to attract the needy individuals who would otherwise adopt it on a large scale. Hence, a cloud-targeted customized System-on-Chip (SoC) design with accelerated performance and power-efficient operation for healthcare applications is essential; the overall cost of using cloud services for short durations is then likely to come down, and the benefit could be passed on to the customers. However, bespoke SoC design has its own challenges.

Until recently, SoC design for specific applications was always considered a bottom-up approach, where the design involving the particular operation is developed, validated, and merged with the available processor design [13]. Modern SoC design involves utilizing many Intellectual Property (IP) cores over shared-bus and network-on-chip (NoC) interconnects to obtain the desired features. The difficulty in SoC design is due to several factors, including design complexity, protected IP cores, and tight design constraints.

Several tools have been developed in the past to design processor-based memory subsystems [14]–[18]; however, they lack the integration flexibility in IP cores, interconnects, and parametric optimization needed for customized application-specific SoCs. Several SoC simulators to aid in design evaluation also exist; however, most of them do not extract the silicon footprint information, which is crucial for physically validating the design before tape-out [19]–[22]. With the recent surge in convolutional neural networks (CNNs) and fully-connected networks (FNs), hardware designers are compelled not only to realize and characterize the module itself but also to establish its characteristics when integrated into an SoC design. The ESP [23], [24] and Chipyard [25] frameworks allow designers to build a CNN accelerator and use it to realize the SoC design.
In the past, neural network architectures were either completely or partially realized in hardware blocks [26]. A few techniques, such as quantization [27]–[29] and integerization [30], were applied to fit resource-constrained hardware. Neural network hardware accelerator designs are beneficial, especially for time-critical applications where the output aids in accelerating decision-making [26], [30]. SoC designs with in-built hardware accelerators for healthcare services are essential; however, most attempts in the past focus on realizing the hardware accelerator block designs and showcasing the benefits achieved in terms of latency and resources utilized, instead of characterizing the complete SoC design. Besides, most of the studies target one co-processor design by evaluating the performance and hardware metrics, whereas healthcare services mandate multiple inferencing modules running on single or multi-modal data before arriving at a decision.

This paper focuses on the following novel contributions to engineering the SoC design:

1) Adopting the ESP framework developed by the Columbia University research group [23], [24] to realize SoC designs with a single hardware accelerator and with three hardware accelerators, using two different processor cores and two extreme memory configurations.
2) The SoC design is targeted to offer health-related predictions, and hence the neural network architectures used to build the hardware accelerators are selected on the basis of three different medical datasets.
3) All SoC designs were characterized and evaluated in terms of power, hardware resources such as BRAMs and LUTs utilized, and latency incurred.
4) The throughput and power metrics of the proposed hardware accelerator enabled SoC designs were benchmarked against three other processing units, including CPU, GPU, and TPU, to showcase the efficacy of the proposed hardware design.

This is the first time, to the authors' knowledge, that cloud-targeted and health-related SoC designs and their characteristics are put before the research community. The SoC design with AI accelerators is aimed at extracting inference from the supplied data, which is expected to improve performance efficacy, besides predicting complex medical problems for timely treatment.

Fig. 1. Cloud SoC Architecture design equipped with three neural network hardware accelerator designs.

TABLE I
NEURAL NETWORKS IMPLEMENTED AS ACCELERATORS.

Model                        Dataset                                             Output Classes  Inputs  Parameters  Accuracy (%)
Dense 2 Layers               Stage C Prostate Cancer [31]                        3               7       2563        76
Dense 3 Layers               European Society for Blood and Marrow (EBMT) [32]   3               5       6591        90
Image Classifier (2 layers)  Med-MNIST [33]                                      10              784     5235        81
II. SOC DESIGN

The Cloud-targeted SoC design for healthcare services differs from other domains, considering that multiple health-related inputs are supplied constantly by the users and the derived inference is released to the users and to the doctors treating the individuals whenever required, as depicted in Figure 1. The containers shown in the figure represent three different medically relevant datasets stored in the cloud for the inferences made by the SoC design. The health data is acquired at different sampling rates, and extracting useful features towards a successful diagnosis requires different neural network architectures. The paper establishes three different neural network designs that are configured with fully connected dense layers in the SoC design. Table I shows these 3 neural networks, structured differently with variations in output classes, input feature maps, number of trained parameters, and expected accuracy in the overall SoC design, with a view to establishing an example for SoC design towards healthcare applications.

The first, dense 2-layered, neural network, with 7 input features and 3 output classes, achieved 76% accuracy on the flow-cytometry results dataset [31], predicting the survival time in the form of 3 classes for Stage#C type prostate cancer individuals. An input shape of 7, with output shapes of 32, 64, and 3 in that order, was defined for this neural network, with 2563 parameters overall. The second, dense 3-layered, neural network, with 5 input features and 3 output classes, achieved an accuracy of 90% on the dataset [32], predicting 3 classes of risks suffered by individuals post blood and marrow transplantation. An input shape of 5, with output shapes of 32, 64, 63, and 3 in that order, was defined for this neural network, with 6591 parameters overall. The third neural network is an image classifier with two layers, derived from the medical pathological dataset model [33] for performing classification tasks on light-weight 28×28 images. It takes an input image of 28×28 pixels for 10 output classes, includes dropout and batch normalization, and achieved an accuracy of 81%. An input shape of 784, with output shapes of 5, 64, 64, and 10 in that order, was defined for this neural network, with 5235 parameters overall.
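As a concrete illustration, the dense 2-layered network above can be reconstructed in a few lines of Keras from the shapes and parameter counts reported in Table I. The sketch below is our reconstruction, not the authors' released code: the activation choices and the CSV mirror of the stagec dataset [31] are assumptions.

```python
# Reconstruction (assumed, not from the paper) of the dense 2-layered
# survival model: 7 input features and dense layers of 32, 64, and 3 units.
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

# Rdatasets serves a CSV next to the documentation page cited as [31]
# (assumed mirror path).
url = "https://vincentarelbundock.github.io/Rdatasets/csv/rpart/stagec.csv"
stagec = pd.read_csv(url)

model = keras.Sequential([
    keras.Input(shape=(7,)),                # 7 clinical input features
    layers.Dense(32, activation="relu"),    # 7*32 + 32  = 256 parameters
    layers.Dense(64, activation="relu"),    # 32*64 + 64 = 2112 parameters
    layers.Dense(3, activation="softmax"),  # 64*3 + 3   = 195 parameters
])
model.summary()  # Total params: 2563, matching Table I
```

The same arithmetic accounts for the other two accelerators: the 5-input network with layer widths 32, 64, 63, and 3 gives exactly 6591 parameters, and the 784-input image classifier is consistent with the reported 5235 once batch-normalization parameters are counted.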
All three neural network models are designed as co-processors and are integrated with multiple combinations of open-source processor cores and memory blocks to evaluate their characteristics. The SoCs were designed using the ESP framework [23], [24] and characterized in terms of latency, LUTs used, BRAMs allocated, and power consumption across the different components of the SoC design.

TABLE II
SOC CONFIGURATIONS FOR THE NEURAL NETWORK DESIGNED HARDWARE ACCELERATORS INTEGRATED WITH DIFFERENT PROCESSOR CORES.

Configuration   Accelerator        Processor
A1              Dense 2 layers     Ariane
A2              Dense 2 layers     Leon3
B1              Dense 3 layers     Ariane
B2              Dense 3 layers     Leon3
C1              Image classifier   Ariane
C2              Image classifier   Leon3
D1              All accelerators   Ariane
D2              All accelerators   Leon3
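The eight accelerator/processor pairings in Table II, each evaluated under the minimum and maximum cache configurations described below, give the 16 design points characterized in this work. A small illustrative sketch (the names follow Table II; this is not part of the ESP flow):

```python
# Enumerate the 16 SoC design points implied by Table II:
# 4 accelerator choices x 2 processor cores x 2 cache configurations.
from itertools import product

accelerators = ["Dense 2 layers", "Dense 3 layers",
                "Image classifier", "All accelerators"]
processors = ["Ariane", "Leon3"]  # suffixes 1 and 2 in Table II
caches = ["min cache", "max cache"]

for (i, acc), (j, proc), cache in product(
        enumerate(accelerators), enumerate(processors), caches):
    print(f"{'ABCD'[i]}{j + 1}: {acc} + {proc} ({cache})")
# Prints A1..D2, once per cache setting -> 16 SoC designs.
```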

Fig. 2. Schematic showing the SoC design and analysis flow.

The overall SoC design flow and analysis are pictorially explained in Figure 2. ESP allows a tile-based SoC with the desired mix of accelerator tiles, along with memory and processor tiles. Each of the tiles is equipped with a channel to the main memory. An I/O tile is also integrated to manage various peripherals. ESP is built on a multi-plane network-on-chip (NoC) interconnect structure. The three hardware accelerators were individually incorporated into the SoC design with two different processor cores, namely Ariane and Leon3, with their floating-point unit (FPU) variants. Ariane is a 6-stage, single-issue, in-order 64-bit RISC-V CPU with 64-bit DMA access [34]. Ariane is engineered to handle Integer (I), Multiply (M), Atomic (A), and Compressed (C) instructions, along with specific extensions. Leon3 is a 32-bit processor with 32-bit DMA access that belongs to the SPARC V8 architecture [35]. It is designed with a 7-stage pipeline architecture, with hardware multiply, divide, and MAC units, and operates on a double-precision floating-point format. These two processor cores were opted for due to their popularity in SoC design; at the same time, they belong to different families of processor cores with different features. The evaluation of the SoC design with 32-bit and 64-bit processor cores is expected to influence the power and throughput of the SoC design. The hls4ml tool was utilized to convert the Keras models for the dense layered networks and the image classifier into synthesizable designs. hls4ml allows a maximum of 4096 parameters per layer to synthesize successfully; hence each neural network was designed with fewer than 4096 parameters in every layer. Overall, 8 pairs of SoC designs, as stated in Table II, were investigated and characterized in this work for minimum and maximum cache configurations separately, making a consolidated 16 individual SoC designs. The maximum cache configuration is (8192 sets, 8-way L2) + (4096 sets, 16-way LLC) for the processor and (8192 sets, 8-way L2) for the accelerator. The minimum cache configuration is (32 sets, 2-way L2) + (32 sets, 4-way LLC) and (32 sets, 2-way L2) for the accelerator. All 16 SoC designs were synthesized with Xilinx Vivado, and system-level simulation was performed using the Xcelium tool of the Cadence suite to establish the results.
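Since the accelerators are generated from Keras models via hls4ml, a conversion sketch using hls4ml's public API is given below. The output directory and FPGA part are placeholders, and the exact tool version and options used by the authors are not stated in the paper:

```python
# Hedged sketch of the Keras-to-HLS conversion step described above.
import hls4ml
from tensorflow import keras
from tensorflow.keras import layers

# The dense 2-layered model from the earlier sketch (reconstruction).
model = keras.Sequential([
    keras.Input(shape=(7,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),
])

# The paper notes hls4ml synthesizes successfully only when each layer
# holds at most 4096 parameters, so check that constraint first.
assert all(l.count_params() <= 4096 for l in model.layers), \
    "a layer exceeds the 4096-parameter limit noted in the text"

config = hls4ml.utils.config_from_keras_model(model, granularity="model")
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="dense2_hls_prj",   # placeholder project directory
    part="xc7z020clg484-1",        # placeholder FPGA part; none is named
)
hls_model.compile()  # builds a C-simulation model for functional checks
```

The generated HLS project can then be wrapped as an ESP accelerator tile through ESP's accelerator integration flow, which handles the NoC and DMA interfacing described above.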
III. EVALUATIONS

Fig. 3. Variation in power and LUTs utilized for different SoC designs with accelerator choice.

Figure 3 depicts the LUT utilization and the power conceded by the SoCs with individual hardware accelerators, and by the SoC design involving all three hardware accelerators. As expected, it is evident that an SoC with an individual accelerator design demands fewer LUTs and lower power when compared to the SoC with all three accelerators. The Dense 2 layers network has fewer computation requirements than the Dense 3 layers network, which is reflected in the LUTs-power profile. The Image classifier based SoC design presents a slightly higher power requirement than Dense 2 layers but less than the fully connected dense 3-layered network, and utilizes on average 10^5 more LUTs than Dense 2 layers and 10^5 fewer LUTs than Dense 3 layers. This is attributed to the number of parameters involved in the three neural network architectures. The SoC with all three accelerator designs shows a power consumption of around 13 W, which is almost 45.8% less than putting together all three individual hardware accelerator-designed SoCs.

Figure 4 reports the hardware utilization for all 16 SoC designs, including the minimum and maximum cache configurations. As expected, the minimum cache memory configured SoC designs show an extremely low requirement of BRAMs (Block RAMs) compared to the maximum cache memory configured SoC designs. On average, 75% fewer BRAMs were utilized for the same SoC design when configured with minimum cache memory. Additionally, the total power conceded by the minimum cache memory allocated SoC designs was less than that of the corresponding designs with maximum cache memory. On average, the minimum cache enabled designs exhibited 2.13 W less power than the maximum cache incorporated SoC designs. Between the minimum and maximum cache enabled SoC designs, there is hardly any difference in the LUTs. As expected, the LUT utilization grows across the SoC designs from A1, A2 to D1, D2. The power utilization of the accelerator design component, when charted as a percentage of the total SoC power, shows that more power is conceded by the accelerator designs, especially for the D1, D2 designs when compared with the A1, A2 designs. This indicates that the three combined neural network accelerators extract more computational power than the single accelerator designs, albeit the three-accelerator design power is still less than the sum of the three individual accelerator powers. Among all 16 designs, the D1 SoC design consumes the maximum power, whereas A1 exhibits the least power, which is attributed to the lower number of computations involved. Between the B1, B2 (Dense 3 layers) and C1, C2 (Image classifier) designs, the LUTs, total power, and accelerator power (%) are higher for the B1, B2 SoCs; however, the BRAM requirements are larger for the C1, C2 SoCs, considering the 784 input features to be accommodated in the C1, C2 designs, compared to 5 input features for the B1, B2 designs. Additionally, in an SoC design, apart from the power consumed by the accelerator designs, the CPU, I/O, and memory blocks also concede power. The power conceded by the I/O, CPU, and memory blocks for the 16 SoC designs is categorized into minimum cache and maximum cache enabled designs and reported in Figure 4. The CPU, memory, and I/O blocks reported similar power requirements throughout the 16 SoC designs, including both the maximum and minimum cache enabled configurations.
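The BRAM gap between the two cache settings is largely a matter of raw capacity. A back-of-the-envelope calculation follows, assuming a 64-byte cache line, which the paper does not state:

```python
# Rough capacities implied by the two cache configurations of Section II,
# under an ASSUMED 64-byte line; this illustrates why the minimum-cache
# SoCs need far fewer BRAMs.
LINE_BYTES = 64  # assumption, not stated in the paper

def capacity_bytes(sets, ways, line=LINE_BYTES):
    return sets * ways * line

# processor L2 + LLC + accelerator L2
max_cfg = (capacity_bytes(8192, 8) + capacity_bytes(4096, 16)
           + capacity_bytes(8192, 8))
min_cfg = (capacity_bytes(32, 2) + capacity_bytes(32, 4)
           + capacity_bytes(32, 2))

print(f"max cache: {max_cfg / 2**20:.0f} MiB")  # 12 MiB
print(f"min cache: {min_cfg / 2**10:.0f} KiB")  # 16 KiB
```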
Fig. 4. Hardware characteristics of the SoC designs. The last two plots show the SoC power and do not include the accelerator power.

An average power distribution among the different components for all 16 SoC designs when used as a cloud server is presented in Figure 5. On average, 48% of the power is consumed by the hardware accelerator, and 30% of the total server power is expended by the DDR3 and Ethernet, which are external to the SoC. The memory, CPU, static (leakage) power, and I/O concede 11%, 6%, 5%, and less than 1% respectively.

Fig. 5. Average power distribution among all the 16 SoC designs.
Fig. 6. Comparison of power (a) and latency (b) metrics of the proposed cloud-targeted SoC designs with existing cloud platforms.

The rated power of typical cloud hardware, namely an Intel Xeon single-core CPU running at 2.2 GHz, a Cloud Tensor-Processing-Unit (TPU), and a Tesla T4 GPU, was compared with that of the proposed cloud-targeted SoC designs of 16 configurations, as shown in Figure 6.
TABLE III
SUMMARY OF SPEED ACCELERATION AND POWER REDUCTION FOR THE PROPOSED SOC DESIGNS WHEN COMPARED WITH EXISTING CLOUD PLATFORMS. THE D1 SOC CONSUMES THE MAXIMUM POWER AMONG THE DESIGNED SOCS. THE A1, B1, AND C1 SOC DESIGNS SHOWCASE THE MAXIMUM DELAY OF THE SOC PAIRS INVESTIGATED.

Existing cloud   Power reduction for the D1 SoC      Average accelerated speed for the 3 individual SoC designs
platform         design w.r.t. the cloud platform    Dense 2 (A)    Dense 3 (B)    Image Classifier (C)
GPU              5.25X                               39.23X         16.33X         3.80X
TPU              15X                                 48.76X         26.67X         4X
CPU              7.13X                               72.72X         70X            5.40X
The cloud-targeted SoC platform exhibits extremely low power requirements when compared with the other cloud platforms. The proposed D1 SoC configured with the maximum cache allocation consumes the most power among the proposed 16 SoC configurations. Even so, it shows 5.25X less power consumption than the Tesla T4 GPU unit, 15X less than the TPU unit, and 7.13X less than the CPU. The proposed SoC architecture is thus a power-efficient design when compared with the available cloud resources. For a reliable comparison, the accelerated performance was investigated for the SoC designs running individual accelerators against single neural network runs on the corresponding TPU, GPU, and CPU engines. In terms of latency, the cloud-targeted SoC designs showcased accelerated speeds ranging from 5.4X to 72.72X with respect to the CPU run, 3.8X to 39.23X with respect to the GPU run, and 4X to 48.76X with respect to the TPU run for the individual selected neural networks. The average latencies of the SoC designs with the minimum and maximum cache enabled configurations were considered for the comparisons. A summary of the power reduction and accelerated output is shown in Table III. On average, the SoC designs showed throughput improvements of 49.37X, 26.47X, and 19.78X when compared with the CPU, TPU, and GPU units respectively.
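These average throughput figures follow directly from the per-network speedups in Table III, as the short check below shows (the small differences for GPU and TPU appear to come from rounding in the paper):

```python
# Arithmetic means of the per-network speedups from Table III.
speedups = {
    "CPU": [72.72, 70.00, 5.40],  # Dense 2, Dense 3, Image Classifier
    "GPU": [39.23, 16.33, 3.80],
    "TPU": [48.76, 26.67, 4.00],
}
for platform, s in speedups.items():
    print(f"{platform}: {sum(s) / len(s):.2f}X average speedup")
# CPU: 49.37X; GPU: 19.79X (reported 19.78X); TPU: 26.48X (reported 26.47X)
```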
IV. CONCLUSION

16 SoC designs, including three different neural network hardware accelerators individually and all combined, were evaluated with respect to the total power consumed, LUTs utilized, BRAMs required, accelerator power conceded, and throughput offered. The three neural network accelerators were picked to demonstrate a proof-of-concept in extracting vital categorical information from the supplied health-related data, and to demonstrate the power and performance efficiency of the designed SoC. The aim was to demonstrate the advantage of SoC designs targeted for cloud-based healthcare services. The paper showcases one such SoC design methodology using the freely available ESP framework and open-source IP cores. Throughput enhancements of 49.37X, 19.78X, and 26.47X were reported for the individual hardware accelerator designed SoCs, and power reductions of 7.13X, 5.25X, and 15X showcase a huge advantage with respect to the corresponding individual neural network inferences on CPU, GPU, and TPU respectively. The paper presents this SoC design as a step showcasing the ability to build many more customized, power and throughput efficient SoC designs. Healthcare services through a cloud-based SoC platform with a power and performance efficient design are likely to aid in accelerating the vital health updates of individuals.
REFERENCES

[1] Q. Jia, S. Chen, Z. Yan, and Y. Li, "Optimal incentive strategy in cloud-edge integrated demand response framework for residential air conditioning loads," IEEE Transactions on Cloud Computing, vol. 10, no. 1, pp. 31–42, 2022.
[2] S. Mireslami, L. Rakai, M. Wang, and B. H. Far, "Dynamic cloud resource allocation considering demand uncertainty," IEEE Transactions on Cloud Computing, vol. 9, no. 3, pp. 981–994, 2021.
[3] J. Kumar, A. Malik, S. K. Dhurandher, and P. Nicopolitidis, "Demand-based computation offloading framework for mobile devices," IEEE Systems Journal, vol. 12, no. 4, pp. 3693–3702, 2018.
[4] H.-M. Chung, S. Maharjan, Y. Zhang, F. Eliassen, and K. Strunz, "Optimal energy trading with demand responses in cloud computing enabled virtual power plant in smart grids," IEEE Transactions on Cloud Computing, vol. 10, no. 1, pp. 17–30, 2022.
[5] J. Li, Y. Zhu, J. Yu, C. Long, G. Xue, and S. Qian, "Online auction for IaaS clouds: Towards elastic user demands and weighted heterogeneous VMs," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 9, pp. 2075–2089, 2018.
[6] L. Ruan, Y. Yan, S. Guo, F. Wen, and X. Qiu, "Priority-based residential energy management with collaborative edge and cloud computing," IEEE Transactions on Industrial Informatics, vol. 16, no. 3, pp. 1848–1857, 2020.
[7] X. Chen, W. Li, S. Lu, Z. Zhou, and X. Fu, "Efficient resource allocation for on-demand mobile-edge cloud computing," IEEE Transactions on Vehicular Technology, vol. 67, no. 9, pp. 8769–8780, 2018.
[8] S. Kardani-Moghaddam, R. Buyya, and K. Ramamohanarao, "ADRL: A hybrid anomaly-aware deep reinforcement learning-based resource scaling in clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 3, pp. 514–526, 2021.
[9] R. Ranchal, P. Bastide, X. Wang, A. Gkoulalas-Divanis, M. Mehra, S. Bakthavachalam, H. Lei, and A. Mohindra, "Disrupting healthcare silos: Addressing data volume, velocity and variety with a cloud-native healthcare data ingestion service," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 11, pp. 3182–3188, 2020.
[10] Y. Motai, E. Henderson, N. A. Siddique, and H. Yoshida, "Cloud colonography: Distributed medical testbed over cloud," IEEE Transactions on Cloud Computing, vol. 8, no. 2, pp. 495–507, 2020.
[11] L.-p. Jin and J. Dong, "Intelligent health vessel ABC-DE: An electrocardiogram cloud computing service," IEEE Transactions on Cloud Computing, vol. 8, no. 3, pp. 861–874, 2020.
[12] M. Akter, A. Gani, M. O. Rahman, M. M. Hassan, A. Almogren, and S. Ahmad, "Performance analysis of personal cloud storage services for mobile multimedia health record management," IEEE Access, vol. 6, pp. 52625–52638, 2018.
[13] A. P. Deb Nath, K. Raj, S. Bhunia, and S. Ray, "SoCCom: Automated synthesis of system-on-chip architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 4, pp. 449–462, 2022.
[14] M. A. Kinsy, M. Pellauer, and S. Devadas, "Heracles: A tool for fast RTL-based design space exploration of multicore processors," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 125–134. [Online]. Available: https://doi.org/10.1145/2435264.2435287
[15] K. Asanović, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, "The Rocket chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, Apr. 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html
[16] S. Bandara, A. Ehret, D. Kava, and M. Kinsy, "BRISC-V: An open-source architecture design space exploration toolbox," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 306. [Online]. Available: https://doi.org/10.1145/3289602.3293991
[17] F. Fatollahi-Fard, D. Donofrio, G. Michelogiannakis, and J. Shalf, "OpenSoC Fabric: On-chip network generator," in 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016, pp. 194–203.
[18] S. Wallentowitz, A. Lankes, A. Zaib, T. Wild, and A. Herkersdorf, "A framework for open tiled manycore system-on-chip," in 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012, pp. 535–538.
[19] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 487–498. [Online]. Available: https://doi.org/10.1145/2485922.2485964
[20] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available: https://doi.org/10.1145/2024716.2024718
[21] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, "GARNET: A detailed on-chip network model inside a full-system simulator," in 2009 IEEE International Symposium on Performance Analysis of Systems and Software, 2009, pp. 33–42.
[22] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: Association for Computing Machinery, 2009, pp. 469–480. [Online]. Available: https://doi.org/10.1145/1669112.1669172
[23] P. Mantovani, D. Giri, G. Di Guglielmo, L. Piccolboni, J. Zuckerman, E. G. Cota, M. Petracca, C. Pilato, and L. P. Carloni, "Agile SoC development with open ESP: Invited paper," in 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2020, pp. 1–9.
[24] "ESP: the open-source SoC platform," https://esp.cs.columbia.edu, accessed: 10 January 2022.
[25] "Chipyard documentation," https://chipyard.readthedocs.io/en/stable/, accessed: 10 January 2022.
[26] B. S. Ajay and M. Rao, "Binary neural network based real time emotion detection on an edge computing device to detect passenger anomaly," in 2021 34th International Conference on VLSI Design and 2021 20th International Conference on Embedded Systems (VLSID), 2021, pp. 175–180.
[27] M. Blott, N. J. Fraser, G. Gambardella, L. Halder, J. Kath, Z. Neveu, Y. Umuroglu, A. Vasilciuc, M. Leeser, and L. Doyle, "Evaluation of optimized CNNs on heterogeneous accelerators using a novel benchmarking approach," IEEE Transactions on Computers, vol. 70, no. 10, pp. 1654–1669, 2021.
[28] N. Kim, D. Shin, W. Choi, G. Kim, and J. Park, "Exploiting retraining-based mixed-precision quantization for low-cost DNN accelerator design," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 2925–2938, 2021.
[29] K. Nakata, A. Maki, D. Miyashita, F. Tachibana, T. Suzuki, and J. Deguchi, "Live demonstration: FPGA-based CNN accelerator with filter-wise-optimized bit precision," in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1–1.
[30] N. D. Truong, A. D. Nguyen, L. Kuhlmann, M. R. Bonyadi, J. Yang, S. Ippolito, and O. Kavehei, "Integer convolutional neural network for seizure detection," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 4, pp. 849–857, 2018.
[31] "Stage C prostate cancer," https://vincentarelbundock.github.io/Rdatasets/doc/rpart/stagec.html, accessed: 26 February 2022.
[32] "Data from the European Society for Blood and Marrow Transplantation (EBMT)," https://vincentarelbundock.github.io/Rdatasets/doc/mstate/ebmt1.html, accessed: 27 February 2022.
[33] J. Yang, R. Shi, and B. Ni, "MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis," in IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021, pp. 191–195.
[34] "Ariane processor," https://github.com/lowRISC/ariane, accessed: 30 January 2022.
[35] "Leon3 processor," https://www.gaisler.com/doc/leon3_product_sheet.pdf, accessed: 20 January 2022.
