
Biomedical Signal Processing and Control 85 (2023) 104865
Hardware implementation of 1D-CNN architecture for ECG arrhythmia classification
Viraj Rawal, Priyank Prajapati ∗, Anand Darji
Sardar Vallabhbhai National Institute of Technology, Ichchhanath, Surat, 395007, Gujarat, India
ABSTRACT
Electrocardiography (ECG) has been used as a diagnostic tool for various heart diseases. It is most effective in detecting myocardial infarction and fatal arrhythmias. This work proposes a safety-critical hardware system for early arrhythmia diagnosis, particularly Atrial Fibrillation. This diagnosis is fairly accurate and unbiased. It is attempted using 1D Convolutional Neural Network (CNN) architecture analysis with the PhysioNet/Computing in Cardiology Challenge database, considering the trade-off between accuracy and computational complexity. The state-of-the-art methods do not provide such analysis for CNN hardware implementation. Two software CNN structures are introduced in this work. The foremost, the Supreme CNN Architecture (SCA), gives an accuracy of 99.17%, which is 7.04% more than existing 1D-CNN architectures for Atrial Fibrillation (AF) classification. Further, it has 18.93%, 12.3%, and 15.99% higher precision, recall, and F1 score, respectively, than the state-of-the-art method. It is helpful for software-based arrhythmia classification. The second proposed architecture is the Software-Selected CNN Architecture (SSCA), which has lower computational complexity and provides 98.95% accuracy for AF classification. It is used for hardware realization through various optimization techniques, considering a reasonable trade-off between accuracy and computational complexity. The proposed multiplier-less Hardware CNN Architecture (HCA) achieves 97.34% Atrial Fibrillation classification accuracy for arrhythmia detection. Further, it consumes only 628 mW of power on the ZYNQ Ultrascale (ZCU106) FPGA, which is 66.95% less than the existing 1D-CNN hardware.

Keywords: ECG; Arrhythmia; Atrial Fibrillation; CNN; Hardware architecture; FPGA
1. Introduction

Cardiovascular diseases cause more than 17.3 million deaths per year worldwide, and this number is expected to rise to 23.6 million by 2030 [1]. The heart's electrical activity can be measured using an Electrocardiogram (ECG) to diagnose heart diseases. An abnormal heart condition causes different arrhythmias, which result in an irregular heartbeat or changes in ECG wave morphology. The medical community has identified many types of arrhythmia, including atrial/ventricular fibrillation, premature atrial/ventricular contractions, sinus node dysfunction, AV nodal re-entrant tachycardia, ventricular tachycardia (V-tach), and long QT syndrome. Arrhythmia may be harmless in some cases, or it may be life-threatening [2]. Therefore, early identification of arrhythmia is essential.

In the field of arrhythmia classification in software, many methods are available. Normally, ECG classification is divided into the following subtasks: (1) Preprocessing, (2) Feature Extraction, and (3) Classification. The raw ECG signal may be contaminated by various artifacts [3]. Hence, pre-processing techniques [4,5] and digital filter hardware design techniques [6] were helpful to denoise ECG. For Feature Extraction, morphological methods [7–9], Wavelet Transform-based methods [8,9], a Hermite basis function-based method [10], RR interval-based methods [7–9], and Principal Component Analysis (PCA)-based methods [8] were used by researchers. Moreover, for classification, mainly Support Vector Machine methods were used [8–10]. Further, for ECG classification, Artificial Neural Network (ANN), Convolutional Neural Network (CNN) [11–16], and Long Short-Term Memory (LSTM) [11,12] based methods are preferred by researchers in the modern era. It has been observed that classical methods [7–10] have lower arrhythmia detection accuracy than neural network-based methods [11–16] on different databases. The classical methods have accuracies of 83% [7], 85% [10], 86% [8], and 87% [9], whereas neural network-based methods have accuracies of 98.42% [11], 99.01% [12], 86% [13], 98.2% [14], 91.33% [15], and 91% [16]. Further, the neural network (i.e., CNN) based methods [11–16] have no computational overhead for separate feature extraction and classification, because they generate high-level features from their input internally and provide classification. Moreover, CNN-based methods have excellent classification and prediction capabilities
∗ Corresponding author.
E-mail addresses: virajrawal25@gmail.com (V. Rawal), priyank.prajapati@rediffmail.com (P. Prajapati), add@eced.svnit.ac.in (A. Darji).
https://doi.org/10.1016/j.bspc.2023.104865
Received 2 June 2022; Received in revised form 20 February 2023; Accepted 11 March 2023
Available online 20 March 2023
1746-8094/© 2023 Elsevier Ltd. All rights reserved.
on massively varied databases. Therefore, CNN is used in this work to classify arrhythmia.

Shu Lih Oh et al. [11] and Zheng et al. [12] have combined CNN for high-level feature extraction and LSTM for processing the features to classify different arrhythmias, with the 1D ECG signal treated as a 2D image. Pourbabaee et al. [16] have proposed a CNN-KNN deep learning (DL) based method to classify life-threatening cardiac arrhythmia. However, these methods have higher computational complexity because of the combination of two algorithms.

Plawiak et al. [15] have used 1D-CNN with a kernel size of 128. Its output is flattened after each convolution layer and fed to a fully connected layer for final classification with the softmax probability function. Nguyen [17] had proposed CNN-based high-level feature extraction to feed its stacked SVM architecture for classification. Pyakillya et al. [13] have applied the raw ECG signal to a CNN, extracted features in the internal layer, and classified the ECG with a fully connected layer. Khriji et al. [18] have chosen the 1D-CNN architecture to classify Normal Sinus Rhythm (NSR), Atrial Fibrillation (AF), and noisy ECG signals. They achieved 93% accuracy on CPU with the highest processing time of 2.051 s. To reduce computational time and complexity, Castillo et al. [19] have proposed an Artificial Neural Network (ANN) architecture for binary classification of Atrial Fibrillation, intending to reduce trainable parameters. Its hardware implementation achieves a maximum accuracy of 83.01% on the Intel Core i5 CPU. Rajpurkar et al. [20] have provided a qualitative comparison of the CNN architecture with a cardiologist for analyzing its performance in terms of F1-score.

This work focuses on designing a hardware system to diagnose arrhythmia quickly. CNN's software implementation needs high computation power and time because its multi-layered architecture carries huge numbers of neurons (computing elements). Hence, application-specific dedicated hardware and CNN architecture optimization techniques are required for ECG classification. Jatmiko et al. [21] have designed FPGA-based hardware with a wavelet transform technique. Brito [1] has implemented a biochip architecture on FPGA hardware for cardiac disease detection. Jaramillo et al. [22] proposed a CNN hardware and implemented it on a Xilinx Artix-7 FPGA. However, lower accuracy on hardware than with software-based methods has been observed due to truncation error, bit quantization, or insufficient bits available to represent data. So, hardware-efficient techniques are investigated.

Courbariaux et al. [23] first introduced optimization in a Deep Neural Network (DNN) for software that can be implemented in hardware. They have proposed the BinaryConnect DNN implementation strategy, in which weights are constrained to only two possible values (e.g., −1 or 1). Further, the work was improved by introducing Binarized Neural Networks (BNN) [24] and ternary weight networks [25]. Rastegari et al. [26] have proposed XNOR-Net for CNN that follows the BNN concept of binary weight, input, and activation. Hailesellasie et al. [27] have derived a new variant of CNN architecture (called SqueezeNet), which did not use a fully connected layer to achieve nearly the same classification accuracy.

Miyashita et al. [28] have proposed a new approach of a logarithmic-based multiplier to perform multiplication with less computational complexity. Lu et al. [29] have proposed an efficient architecture for sparse CNN. The sparse architecture creates issues for the control unit as there is no regularity in the design. Zhu et al. [30] have proposed a sparse-wise data flow, wherein the processing of zero-weight multiplication and accumulation is skipped. Khabbazan and Mirzakuchaki [31] have proposed loop unrolling and loop interchange concepts in the architecture. Tummeltshammer et al. [32] have proposed the time-multiplexed multiple-constant multiplication strategy. Faraone et al. [33] have proposed a new reconfigurable constant-coefficient multipliers (RCCMs) based architecture. Jacob et al. [34] have used the quantization technique for efficient integer arithmetic inference.

Song et al. [35] have proposed the compression-based method. Using Huffman coding, they applied weight Pruning, clustering, and index compression to reduce weight storage requirements. Zhaowei Cai et al. [36] have proposed half-wave Gaussian quantization functions for feed-forward and backpropagation networks. Hao Li et al. [37] have proposed two L1-Norm-based Pruning strategies, independent and greedy, which achieve up to 60% weight removal without significant reduction of accuracy. Further, Zhang [38] has proposed new Pruning concepts, i.e., Reversed-Pruning and Peak-Pruning. Finally, Morteza et al. [39] showed a comparative study of different Pruning methods. They concluded that the L1-Norm-based method gave superior results. Ney et al. [40] have proposed a machine learning methodology for an automatic hardware-aware topology search to find the optimized DNN for hardware implementation based on the requirements of the application.

The conventional wavelet/SVM-based methods achieve 83% [7], 85% [10], 86% [8], and 87% [9] classification accuracy with moderate computational complexity. However, their classification accuracy is still lower than that of ML-based methods such as CNN, CNN-LSTM, and CNN-RNN, which could provide high accuracy of up to 98% [11], 99% [12], and 97.1% [41]. Nevertheless, CNN and LSTM individually have high computational complexity. So despite the higher accuracy of the CNN-LSTM architecture, it is not easily feasible to implement it on hardware for a real-time safety-critical system. Existing research works [13,15,18,19] with 1D-CNN architectures are computationally complex and provide accuracy no higher than 93% in software.

The existing 1D-CNN architectures for arrhythmia detection (specifically Atrial Fibrillation) are mostly implemented in software, lack a hardware design, or are not easily implementable on edge computing wearable or portable devices due to their higher computational complexity. The existing works do not address the hardware design analysis for CNN parameter selection and hardware implementation needed to get a better trade-off between accuracy and computational complexity. Methods with software-based approaches have high latency because the ECG samples are sent to a remote computing processor or cloud system. Moreover, this is not feasible in life-threatening conditions, where real-time or continuous ECG monitoring is required on edge devices. Hence, this work focuses on building a hardware architecture for arrhythmia detection and quick diagnosis while maintaining acceptable accuracy on hardware. This work attempts to get a better trade-off between accuracy and computational complexity. The contributions of this work are as follows:

• The novelty of this work is pioneering the research on 1D-CNN hardware design through the proposed quantitative analysis of different CNN architectures by considering the trade-off between accuracy and computational complexity. To the best of our knowledge, the existing literature does not provide such an analysis.
• This work shows the Pruning analysis of 1D-CNN architecture by considering a reasonable trade-off for safety-critical hardware system design using FPGA. It is helpful for continuous ECG signal monitoring on edge computing wearable/portable devices.
• This work further proposes hardware optimization techniques such as hardware folding, function interchanging or reordering, and logarithmic shifter-based multiplier (i.e., multiplier-less) system design to reduce computational complexity and power consumption.

The work is organized as follows: Section 2 describes the software analysis results of the proposed work to identify a high-accuracy and low-hardware-complexity 1D-CNN architecture for ECG arrhythmia classification. Section 3 presents the proposed 1D-CNN hardware system for arrhythmia classification. The experimental and analytical results of the proposed work and their comparison with existing work are shown in Section 4. Finally, Section 5 describes the conclusion.
Fig. 1. Types of ECG waves used for classification.
2. Arrhythmia classification

This work focuses on ECG arrhythmia detection through the classification of different types of ECG rhythms: (1) Normal Sinus Rhythm (NSR), (2) Atrial Fibrillation (AF), (3) Other Irregular Rhythms (OIR or O), and (4) Excessive Noise (EN or E), as shown in Fig. 1. A healthy person's heart rhythm ideally has a regular ECG shape, known as Normal Sinus Rhythm. Atrial Fibrillation is a fatal arrhythmia, which occurs due to blockages in the veins, which supply oxygen to the heart. All other types of arrhythmias are classified into a separate class of Other Irregular Rhythms. Lastly, during ambulatory ECG measurement, the ECG may be highly contaminated by noise such as baseline wander, muscle artifacts, and motion artifacts. Hence, Excessive Noise is used as the fourth class.

2.1. Database and software setup

• Database: This work uses the 2017 PhysioNet/Computing in Cardiology Challenge database [42], which consists of short-term (9 s to just over 60 s) single-lead ECG recordings with a sampling rate of 300 Hz [43]. The ECG recordings were collected from different patients using the AliveCor device, and their rhythms were classified by a team of experts into four classes: Normal, AF, Other, and Noisy signal.
• Pre-processing: The ECG database signals are mean normalized using Eq. (1) to ease the CNN's training process. Further, data duplication (or padding) of ECG samples is carried out at the end of the ECG wave files such that each data file contains the same number of data samples (10100 samples, approximately 34 s of ECG recording at 300 Hz) for reliable classification.

\[ \mathbf{CNN}_{in} = \frac{\mathbf{X}_{in} - \mu_{\mathbf{X}_{in}}}{\sigma_{\mathbf{X}_{in}}} \tag{1} \]

where $\mathbf{CNN}_{in}$ is the mean-normalized ECG signal for the CNN input, $\mu_{\mathbf{X}_{in}}$ is the mean, and $\sigma_{\mathbf{X}_{in}}$ is the standard deviation of the input ECG signal $\mathbf{X}_{in}$. Note that this work does not use an ECG denoising method for experimentation with the PhysioNet database. However, in real practice, hardware-efficient raw ECG signal denoising methods [4,5] are essentially required before normalization, as is data segmentation with the R-peak detection method [44].
• Data Labeling: The rhythm of each wave file of ECG recordings is labeled from the database for experimental work. A total of 8261 ECG files are labeled into four classes: NSR, AF, OIR, and EN, containing 4890, 707, 2391, and 273 ECG wave files, respectively.
• Training–Testing: The CNN architectures are implemented using Keras 2.6 and the scikit-learn library in a Google Colaboratory software environment. The 1D-CNN architectures are trained and tested after partitioning the data into 80% training and 20% testing samples with five-fold cross-validation.

2.1.1. Performance parameters

The cross-entropy loss function (i.e., Loss function) is measured during training using Eq. (2) by computing the softmax probability $p_i$ for each $i$th class of the true label $y_i$ [45]. The Adam optimizer is used in this work to speed up the training process [46] with a loss minimization constraint, a batch size of 276, and a learning rate of 0.001 on the NVIDIA Tesla P100 GPU platform.

\[ \mathit{Loss} = -\sum_{i=1}^{\mathit{Cls}} y_i \log(p_i) \tag{2} \]

True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts are obtained from the confusion matrix after passing the dataset through the pre-trained or inference architectures. The performance metrics [16,47], such as accuracy, sensitivity, specificity, precision, false negative rate (FNR), false positive rate (FPR), and F1-score, are calculated using Eqs. (3)–(9) as follows.

Accuracy:
\[ \mathit{Acc.}\,(\%) = \frac{(TP + TN) \times 100}{TP + TN + FP + FN} \tag{3} \]

Sensitivity or Recall:
\[ \mathit{Sen.} = TPR\,(\%) = \frac{TP \times 100}{TP + FN} \tag{4} \]

Specificity:
\[ \mathit{Spe.} = TNR\,(\%) = \frac{TN \times 100}{TN + FP} \tag{5} \]

Precision or Positive Predictive Rate:
\[ +P = PPR\,(\%) = \frac{TP \times 100}{TP + FP} \tag{6} \]

False Negative Rate:
\[ FNR\,(\%) = \frac{FN \times 100}{FN + TP} \tag{7} \]
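As a minimal sketch, the confusion-matrix metrics of Eqs. (3)–(7) can be computed as below. The function name `confusion_metrics`, the handling of empty denominators, and the example counts are illustrative assumptions; none of them come from the paper.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Percent metrics of Eqs. (3)-(7) from confusion-matrix counts."""
    def pct(num, den):
        # Returning NaN for an empty denominator is an assumption; the
        # text does not say how such cases are handled.
        return 100.0 * num / den if den else float("nan")

    return {
        "accuracy": pct(tp + tn, tp + tn + fp + fn),  # Eq. (3)
        "sensitivity": pct(tp, tp + fn),              # Eq. (4), recall / TPR
        "specificity": pct(tn, tn + fp),              # Eq. (5), TNR
        "precision": pct(tp, tp + fp),                # Eq. (6), PPR
        "fnr": pct(fn, fn + tp),                      # Eq. (7)
    }


# Made-up counts for illustration only (not results from the paper).
example = confusion_metrics(tp=90, tn=880, fp=20, fn=10)
```

For a four-class problem such as this one, these counts would typically be taken one class at a time (for example AF against the rest); that one-vs-rest reading is also an assumption here.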
Fig. 2. Accuracy and loss on the training dataset for different 1D-CNN architectures with one convolution layer.
False Positive Rate:
\[ FPR\,(\%) = \frac{FP \times 100}{FP + TN} \tag{8} \]

F1-score:
\[ F1\,(\%) = \frac{2\,TP \times 100}{2\,TP + FP + FN} \tag{9} \]

2.1.2. Hardware design computation cost

The computational complexity for hardware design is analyzed using the estimated multiplier and adder/subtractor counts of the 1D-CNN architectures. The multipliers of a 1D-CNN architecture are calculated based on Eqs. (10) and (14), and the adders are calculated based on Eqs. (11)–(13) and (15) as follows.

Number of multipliers in the convolution layers:
\[ M_c = \sum_{i=1}^{N} \left( F_{c_{i-1}} \cdot F_{c_i} \cdot K_{c_i} \right) \tag{10} \]

Number of adders in the convolution layers:
\[ A_c = \left( M_c - 1 \right) + \sum_{i=1}^{N} \left( \left( F_{c_{i-1}} - 1 \right) \cdot F_{c_i} \right) \tag{11} \]

Number of adders in the global average adder tree:
\[ \mathit{Out}_i = \left\lfloor \frac{\mathit{Out}_{i-1} - F_{c_{i-1}} + 1}{\mathit{pooling\_size}_i - 1} \right\rfloor \tag{12} \]
\[ A_g = \left( \mathit{Out}_N - 1 \right) \cdot F_{c_N} \tag{13} \]

Number of multipliers in the fully connected layer:
\[ M_d = \left( F_{c_N} \cdot \eta_{d_1} \right) + \sum_{i=2}^{D-1} \left( \eta_{d_{i-1}} \cdot \eta_{d_i} \right) + \left( \eta_{d_D} \cdot \mathit{Cls} \right) \tag{14} \]

Number of adders in the fully connected layer:
\[ A_d = M_d - 1 \tag{15} \]

$F_{c_i}$ is the number of Filters for the $i$th convolution layer. $F_{c_N}$ is the number of Filters for the last ($N$th) convolution layer. $K_{c_i}$ is the filter's Kernel size for the $i$th convolution layer. $\eta_{d_i}$ is the number of neurons for the $i$th dense layer. $D$ is the number of hidden layers of the fully connected network. Moreover, $\mathit{Cls}$ is the number of classes. $\mathit{Out}_i$ is the length of the data samples generated at the $i$th convolution layer. $\mathit{Out}_N$ is the last convolution layer's output data sample length. $\mathit{pooling\_size}_i$ is the pooling size of the maxpooling operation for the $i$th convolution layer.

This work attempts to design a low-power and reliable hardware system for arrhythmia classification. In this work, a 1D-CNN method is used to classify arrhythmia because the raw ECG is a 1D signal and the computational complexity of the 1D-CNN is lower than that of the 2D CNN (which requires an image of the ECG signal). Section 3 discusses the basic structure of the 1D-CNN and the proposed hardware architecture.

2.2. Analysis of 1D-CNN architectures

Existing research on ECG feature extraction and classification methods other than machine learning (ML) lacks classification accuracy. Research on ML algorithms like CNN-LSTM, CNN-RNN, and CNN-SVM-based dual methods achieves higher accuracy at the cost of high computational complexity. On the other hand, existing 1D-CNN methods for arrhythmia classification have lower computational complexity and lower accuracy than dual methods. Further, a CNN's computational complexity and classification accuracy vary widely depending on its internal structure, such as the number of convolution layers, number of dense layers (hidden layers), filters, kernels, and activation functions. Therefore, analysis is required to identify the 1D-CNN architecture that provides better classification accuracy and low computational complexity and is helpful for low-power hardware design applications.

A total of 164 different 1D-CNN architectures are explored and analyzed during training by tuning parameters such as the number of convolution layers ($C$ varied in the range [1–4]), the number of filters ($F$ configured from [32, 64, 128, 256]), the filter's kernel size ($K$ configured from [1 × 5, 1 × 10, 1 × 25, 1 × 55]), and the number of dense layers ($D$ varied in the range [0–4]) in a fully connected layer with the rectified linear activation function (RELU), with classification through the softmax function. This work shows the analysis from lower complexity (single convolution, i.e., Fig. 2) to higher complexity (higher number of
convolutions, i.e., Figs. 3–5) with the available parameters to identify the best-suited CNN architecture considering the trade-off between accuracy and complexity. Further, note that the performance of the CNN architectures in these analysis results is reported on the training dataset to speed up the analysis of the parameter combination trials.

It has been observed from the Fig. 2 analysis results that the accuracy of a single convolution layer is low and the cross-entropy loss [45] is very high. Therefore, the analysis moved to two convolution layers with different configurations. As shown in Fig. 3, only a single configuration, 2-C-256-F-0-D-55-K, gives the highest overall accuracy of 96.48%. However, it has 256 Filters to be convolved in each convolution layer. Hence
Fig. 6. Analysis result of different 1D-CNN architectures considering the accuracy, loss and computational complexity.
the search again moved ahead to find better architectures with the same trend for three and four convolution layers. Fig. 4 shows that the 3-C-128-F-0-D-25-K architecture has 97.68% accuracy, which is higher than that of the two-convolution-layer architectures. Further, it gives better results than the four convolution layers' maximum accuracy of 96.56%, as shown in Fig. 5.

The primary goal is to find the 1D-CNN architecture that reduces the hardware complexity of the arrhythmia detection system. Further, exploring CNN architectures with five convolution layers incurs enormous computational complexity. Therefore, CNN architectures are analyzed up to the fourth convolution layer by considering the trade-off among accuracy, loss, and computational complexity, as shown in Fig. 6. It has been observed that architecture 3-C-128-F-0-D-25-K has the highest accuracy; it is called the Supreme CNN Architecture (SCA). Its training and validation plot for accuracy and the Loss function is shown in Fig. 7. However, Fig. 6 shows that the SCA is at the edge of exponential growth in computational complexity. Therefore, it is considered complex for hardware design, and another architecture is searched for a better trade-off between computational complexity and accuracy.

The CNN architectures in the vicinity of the SCA, which have lower computational complexity and still provide considerably high accuracy, are observed further. These architectures are annotated in Fig. 6. The 3-C-32-F-0-D-55-K CNN architecture provides a reasonably high accuracy of 93.11% and an 87% computational complexity reduction with respect to the SCA at the cost of a 4.57% drop in accuracy. It has a high computational complexity reduction compared to the other architectures above and below the SCA. Further, its 93.11% accuracy during training
Fig. 7. Training and validation plot of accuracy and loss for CNN architecture 3-C-128-F-0-D-25-K.
Fig. 8. Pruning analysis results for trained dataset.
is still higher than that of the existing 1D-CNN architectures [13,15,19]. Furthermore, the 4.57% accuracy drop during training translates to only a 2.51% overall accuracy reduction compared to the SCA when the entire database is passed through its pre-trained or inference architecture (from Table 2). Hence, 3-C-32-F-0-D-55-K is annotated as the Software-Selected CNN Architecture (SSCA) and is used in pruning-based optimization in the hardware implementation process, considering the better trade-off.

2.3. CNN architecture optimization with pruning

Inspired by existing work [35,37–39], part of the SSCA is trimmed by the Pruning method to reduce the computational complexity further. In this work, the Pruning is carried out using the L1-Norm, i.e., the sum of the absolute values of each kernel's weights. The Filters having the least L1-Norm are Pruned from each convolution layer because they do not contribute much to the output. The Pruned CNN Architecture (PRCA) is retrained for 50 epochs on each Pruning iteration to recover the loss of accuracy. Fig. 8 shows the Pruning analysis result on the SSCA. It has been observed that the accuracy reduced significantly in the fifth iteration and settled at 87.15% after retraining. Further, this work attempts to design a continuous ECG monitoring safety-critical system with a better trade-off between accuracy and computational complexity. Hence, the PRCA of the fourth iteration is selected for final hardware implementation. A total of 28 filters are pruned from the SSCA. The hardware design parameters of PRCA are as follows: $N = 3$, $F_{c_0} = 1$, $F_{c_1} = 26$, $F_{c_2} = 23$, $F_{c_3} = 19$, $K_{c_1} = 55$, $K_{c_2} = 55$, $K_{c_3} = 55$, $\mathit{Out}_0 = 10100$, $\mathit{Out}_1 = 1004$, $\mathit{Out}_2 = 190$, $\mathit{Out}_3 = 27$, $\mathit{pooling\_size}_1 = 10$, $\mathit{pooling\_size}_2 = 5$, $\mathit{pooling\_size}_3 = 5$, $\eta_{d_0} = \mathit{Out}_3$, $\eta_{d_1} = 1$, $\mathit{Cls} = 4$, $D = 0$. The proposed PRCA has 14.2× lower computational complexity than the SCA with just a 0.47% accuracy drop in atrial fibrillation classification (from Table 3). Therefore, PRCA is selected for hardware implementation considering the trade-off between accuracy and computational complexity for low-power hardware design. The proposed 1D-CNN hardware implementation of PRCA is discussed next.

3. Proposed CNN hardware

3.1. Hardware system architecture

A 1D-CNN-based ECG arrhythmia classification system hardware is designed as shown in Fig. 9. It takes raw ECG samples and classifies them into four different classes. Each of the three CNN layers has different filters, which are convolved with the input ECG samples. The first convolution layer has 26 different filters operating on a single input channel of the ECG dataset from the main block RAM (BRAM). Each filter is 1 × 55 in size, so its kernel multiplication generates 55 different values. These are applied to the adder tree, which generates a convolution output and feeds the max-pooling unit. The max-pooling unit finds the maximum value over ten convolution outputs, which is then used by the nonlinear activation function RELU. In this work, the order of the MAXPOOL and RELU modules is interchanged to reduce complexity. Ideally, the convolution output is directly given to the nonlinear activation function RELU and then to MAXPOOL. However, in this case, RELU activation must be performed on every convolution output, followed by MAXPOOL to discard smaller values. Hence, computational complexity can be reduced by interchanging
Fig. 9. The Architectural block diagram of proposed 1D-CNN for arrhythmia detection on Embedded-FPGA hardware.
order (from Table 1) without affecting the functionality as discussed below.

RELU is a monotonically non-decreasing function, i.e., if $a > b$, then $RELU(a) \geq RELU(b)$. Moreover, RELU is a logical comparator, as shown in Eq. (16).

\[ RELU(\mathbf{a}) = \begin{cases} 0, & \mathbf{a} < 0 \\ \mathbf{a}, & \text{otherwise} \end{cases} \tag{16} \]

It returns a '0' output for negative input; otherwise it passes the same input to the output port. Hence, reordering RELU with the MAXPOOL operation cannot alter the functionality, as shown in Eq. (17).

\[ \mathit{Maxpool}(RELU(\mathbf{a})) = RELU(\mathit{Maxpool}(\mathbf{a})) \tag{17} \]

Furthermore, Table 1 shows that computing RELU before MAXPOOL requires more operations than computing RELU after MAXPOOL. Hence, the reordering of functions in the proposed hardware architecture reduces the number of comparator operations by a factor of 9.22 (from 285630 to 30987).

Table 1
Complexity reduction of the proposed CNN architecture by interchanging the RELU and Maxpooling operations while processing a single 30 s ECG recording.

Layer  Stage    Number of outputs  Channels  Max(RELU(F))  RELU(Max(F))
                from stage                   No. Op.       No. Op.
1      Conv.    10046              26        261196        –
1      Maxpool  1004               26        –             26104
2      Conv.    950                23        21850         –
2      Maxpool  190                23        –             4370
3      Conv.    136                19        2584          –
3      Maxpool  27                 19        –             513
Total No. Op.                                285630        30987

Conv.: Convolution stage; No. Op.: Number of Operations.

The 26 different outputs generated by the first convolution layer are applied as 26 different input channels to the second convolution layer, as shown in Fig. 9.

Similarly, the second convolution layer has 23 different Filters and 26 different input channels from layer 1. Hence, the convolution of 23 different Filters is performed with 26 different input channels. The output of the second convolution is 26 different bunches of data samples, and each bunch carries 23 different outputs. So a total of 23 × 26 different output samples would be generated. However, the second convolution operation output needs 23 different channels. So the cross-filter adder tree is proposed to convert that data into a 23-channel output, as shown in Fig. 9. Similar to the second convolution layer, the third convolution operation is performed over 23 input channels using the 19 Filters of the third layer. From the cross-filter adder tree, 19 different outputs are generated, as shown in Fig. 9. Finally, 27 different samples are produced for data flattening (global averaging) after RELU for each of the 19 output channels in the third convolution layer. Hence, at the end of the third convolution layer, the flattening layer (global averaging layer) produces 19 different flattened outputs.

As shown in Fig. 9, the third convolution layer output is fed to a fully connected layer. It consists of only input and output layers, with no hidden (dense) layer. The generated output is passed to the Processing System (PS) block, where the softmax operation is performed. Softmax hardware logic is not implemented in the Programmable Logic (PL) blocks because it has higher computational complexity and requires floating-point operations for better results. It may consume more resources and power if implemented on a PL block. Therefore, it is implemented using the floating-point unit of the Processing System (PS). The PL and PS blocks are interfaced via the Advanced Extensible Interface (AXI) protocol.

3.1.1. Proposed multiplier block

The logarithmic shifter-based multiplier hardware is proposed to reduce the hardware complexity and the storage required for the Filters' weights, as shown in Fig. 10. First, the weights are converted into the shift value $B'$ using Algorithm 1 and stored in BRAM for multiplication. In the proposed multiplier block, the multiplication is performed using the shift operation described as follows:

\[ A' = |A| \tag{18} \]
\[ B' = \log_2 |B| \tag{19} \]
Fig. 10. Proposed shifter-based multiplier block diagram.
Fig. 11. Proposed filter block of first convolution layer.

Algorithm 1 Steps to convert weight values into shift values for the proposed logarithmic shifter
1: Store the sign of the weight.
2: Take the absolute value of the weight.
3: Convert the absolute weight value into Q6.10 binary fixed-point representation.
4: Treat the fixed-point number as a plain binary number (neglecting the position of the binary point).
5: Convert that binary number back into an integer value.
6: Take the base-2 logarithm of that integer value using (19).
7: Round the logarithmic value to the nearest integer.
8: Convert the final rounded integer value back to binary.
9: The final binary value takes a 4-bit shift value and 1 sign bit.
10: The final weight has 5 bits, which are stored in internal BRAM for multiplication.
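The conversion steps of Algorithm 1, together with the shift-based multiplication it feeds, can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the authors' RTL: the function names `weight_to_shift` and `shift_multiply` are hypothetical, and the handling of zero weights, the clamp to the 4-bit shift range, and the final right shift that rescales the product back to Q6.10 are assumptions the text does not specify.

```python
import math

FRAC_BITS = 10  # Q6.10: 6 integer bits, 10 fractional bits


def weight_to_shift(w):
    """Return (sign_bit, shift) for a real-valued weight (Algorithm 1)."""
    sign_bit = 1 if w < 0 else 0                      # step 1
    q = int(round(abs(w) * (1 << FRAC_BITS)))         # steps 2-5: Q6.10 as integer
    q = max(q, 1)        # assumption: avoid log2(0) for zero weights
    shift = int(round(math.log2(q)))                  # steps 6-7
    shift = min(shift, 15)  # assumption: clamp to the 4-bit shift range
    return sign_bit, shift                            # steps 8-10: 5 bits total


def shift_multiply(a_q, sign_bit, shift):
    """Multiply a signed Q6.10 integer sample by a log-quantized weight."""
    mag = abs(a_q) << shift           # shift the magnitude of the input
    mag >>= FRAC_BITS                 # assumption: rescale back to Q6.10
    negative = (a_q < 0) ^ bool(sign_bit)  # XOR of the operand signs
    return -mag if negative else mag
```

For example, a weight of −0.25 becomes sign bit 1 and shift 8 (since 0.25 × 2^10 = 256 = 2^8), and multiplying the Q6.10 sample 2048 (i.e., 2.0) by it yields −512 (i.e., −0.5). A weight such as 0.3 is rounded to the nearest power of two (0.25), which is the quantization error the multiplier trades for its simplicity.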
Fig. 12. Multiplier block in second convolution layer.
$$Y' = \mathrm{Bitshift}(A', B') \tag{20}$$
$$Y = \begin{cases} \mathrm{not}(Y') + 1, & \mathrm{sign}(A)\ \mathrm{XOR}\ \mathrm{sign}(B) = 1 \\ Y', & \text{otherwise} \end{cases} \tag{21}$$
Here, input sample A is shifted by the weight value B′, i.e., the shift value derived from Algorithm 1, as shown in (20). Moreover, there is no negative shift in the shifter mechanism. Hence, the modulus (absolute value) of the input is taken and the shift is applied to it. Finally, the multiplier's output sign is decided using the XOR of both operands' sign bits. Opposite sign bits give a negative (2's complemented) output; otherwise, the output is positive, as shown in (21). This type of multiplier may introduce some errors in the output due to the weight quantization and approximation. However, it may recover to some extent due to the relative probability calculated in the final softmax function for the classification result.
The array multiplier is commonly used for hardware realization due to the regular cell array structure [48]. However, this multiplier introduces approximately O(n^2) complexity. Further, the bipolar input data samples and weights with floating-point values increase the hardware complexity and storage elements. Hence, a fixed-point logarithmic shifter-based multiplier is proposed for CNN hardware. It significantly reduces CNN hardware design complexity and storage elements by replacing multiplication with a shifter and a weight-to-shift conversion algorithm. Further, it only introduces approximately O(n) complexity.

3.1.2. Filter data flow architecture
The first convolution layer has a single input channel of ECG samples from the main BRAM, which are multiplied with 55 different Filters' weights, as shown in Fig. 11. A dedicated first-in-first-out (FIFO) (FIFO is a memory module designed using shift registers based on a first-come, first-serve data scheme [49]) based local BRAM contains filter weights attached to the filter's shifter block for multiplication in the first layer. However, multiple channels (26 channels from the previous layer) are multiplied in the second layer with each channel's weight. Hence, more FIFO BRAM is attached to the filter's shifter block of the second layer compared to the first layer, as shown in Fig. 12. Further, BRAM outputs are tied with the corresponding channel enable signal and are logically ORed, i.e., only one BRAM output is delivered to the shifter block. The filter kernel corresponding to the input channel is enabled and clocked simultaneously. All other BRAMs are clock gated and idle. In the proposed architecture, clock gating helps to reduce power. A local BRAM avoids the extra latency of reading the coefficients from the DDR memory each time. Hence, it improves convolution speed in the proposed architecture. The architecture folding for each input channel multiplication with the filter's processing unit (multiplier) reduces excessive hardware usage. The multiplier block of the third convolution layer is similar to that of the second layer, wherein 23 input channels from the previous layer get convolved with 19 different Filters.

3.1.3. Adder tree architecture
The barrel shifter (multiplier) output is passed to the adder tree through a demultiplexer (DEMUX is a combinational digital circuit used to serve its input data over several outputs at different timestamps [49]) of size (1 × 55) as shown in Fig. 13. Once the last (55th) data arrives from the demux, a valid signal is triggered to start the subsequent computation. A standard carry look-ahead adder is used in tree fashion for fast computing in the proposed architecture.

3.1.4. Cross filter adder tree
A cross-filter adder tree is required for the second and third convolution operations because they have more input channels, i.e., 26 and 23, respectively. So the convolution operation for each input channel is performed in the second and third layers with 23 and 19 Filters, respectively. Hence, 23 Filters with 26 different input channels generate 23 × 26 different outputs in the second convolution layer. To feed them one by one into a cross filter adder tree, a bigger Demultiplexer of the size 23 × 598 (i.e., 23 × 26 output) is used as shown in Fig. 14. Once the last (26th) channel is processed, a valid signal is generated to trigger the subsequent operation of the CNN block. These 23 outputs of the adder tree are processed with a dedicated max-pooling and RELU activation block and stored in the CDC FIFO, where they act as 23 channels for the next layer.
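The shift-based multiplication of Eqs. (18)-(21) and the adder-tree accumulation of Section 3.1.3 can be sketched in software as below. This is a behavioural illustration only, not a description of the RTL: the function names are hypothetical, integer inputs are assumed, and the result keeps the scaling implied by the shift codes (any rescaling back from the fixed-point format is left to the caller, which is an assumption of this sketch).

```python
def shift_multiply(sample: int, sign_bit: int, shift: int) -> int:
    """Approximate sample * weight with a left shift, as in Eqs. (18)-(21)."""
    a_abs = abs(sample)                              # Eq. (18): A' = |A|
    y_abs = a_abs << shift                           # Eq. (20): Y' = Bitshift(A', B')
    signs_differ = (sample < 0) != bool(sign_bit)    # XOR of the operand signs
    # Eq. (21): two's complement (not(Y') + 1) when the signs differ.
    return (~y_abs + 1) if signs_differ else y_abs


def adder_tree(values) -> int:
    """Sum values pairwise, level by level, like a binary adder tree."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)                          # pad odd-sized levels with zero
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def filter_output(window, codes) -> int:
    """One convolution output: shift-multiply each tap, then reduce in a tree."""
    products = [shift_multiply(x, s, k) for x, (s, k) in zip(window, codes)]
    return adder_tree(products)
```

In Python, `~y + 1` equals `-y`, so the two's-complement step of Eq. (21) is written explicitly only to mirror the hardware description.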
Fig. 13. Convolution operation: Adder Tree.
Fig. 14. Convolution output of multiple channels.

3.1.5. Global average adder tree
The third convolution layer's flattening (global average) module creates a noise-invariant single-dimensional feature vector to feed the fully connected layer. The proposed flattening adder tree architecture adds the (1 × 27) demultiplexer's output data samples from the RELU output of a single channel, as shown in Fig. 15.

3.1.6. Clock domain crossing FIFO
The output of each CNN layer is dumped into the subsequent layer. Therefore, each top layer of the CNN architecture cannot change its output or run independently unless the subsequent layer completes its task. This kind of data dependency forces the system to perform slower. Hence, this problem is mitigated by pipelining the output of each layer using a Clock-Domain Crossing (CDC) FIFO, as shown in Fig. 16. In this work, the CDC FIFO is a digital synchronizer circuit that allows different layers to operate with different frequencies, considering the constraint of the overwriting issue of the fixed FIFO depth. Hence, each layer of the proposed architecture can run independently with different clocks and speed up the convolution operation.

3.1.7. Fully connected layer
As shown in Fig. 17, the 19 data outputs through the CDC FIFO of the third CNN layer are multiplied with four sets of 19 weights of the fully connected layer in the proposed architecture. The weight values are fixed for each input and are incorporated in the logarithmic fixed shifter, so no extra storage is utilized. The 19 shifter (multiplier) outputs of each of the four neurons are provided to the adder tree, and the bias is added; the result is sent to the RELU unit. The output of RELU (i.e., the output of a fully connected layer) is sent to the softmax function of the Processing System.

3.1.8. Softmax function
The output of a fully connected layer is called logits and is applied to the nonlinear activation function softmax. It returns a relative probability of the target classes using the exponential probability function (22).
$$\mathrm{Softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=0}^{K} \exp(x_j)} \tag{22}$$
A floating-point operation is essential for exponential function computation. Hence, to retain high precision, the softmax function is not implemented in the FPGA Programmable Logic (PL). Instead, the FPGA Processing System (PS), i.e., the processor's floating-point unit, is utilized. In the processor, the softmax function is computed sequentially. The sequential computation does not create any issue because the output data frequency of logits from the third convolution layer is relatively lower than the processor's frequency.

4. Results comparison and discussion

4.1. Software implementation results
Three 1D-CNN architectures are discovered after analysis in this work. Their confusion matrix and performance parameters are obtained using the entire database passing through their pre-trained or inference architectures and tabulated in Table 2. Three types of accuracy are reported, i.e., (1) individual class accuracy measured for a specific class based on Eq. (3), (2) AVG, the average of all classes' accuracy, and (3) the overall accuracy, measured as the ratio of the sum of diagonal elements to the sum of all elements of the confusion matrix. Table 2 shows that the proposed SCA, SSCA, and PRCA achieve high accuracy of 99.17%, 98.95%, and 98.70% for AF classification and overall accuracy of 96.77%, 94.26%, and 90.80%, respectively. It is noted that SCA has the highest performance, followed by SSCA and PRCA. Further, it has been observed that the SCA results are not biased towards a specific class and provide uniform sensitivity compared to the other proposed CNN architectures. However, it has been observed that as the computational complexity decreases towards PRCA, the NSR class FP increases due to the OIR samples being classified in the NSR class. Hence, Fig. 18 shows that going from high complexity SCA to low complexity PRCA, the uniform sensitivity of the architectures slowly deforms due to the reduction of OIR class sensitivity, and the NSR class FPR increases, as shown in Section 4.3.2. PRCA has 48.92% lesser computational complexity, from Table 3. However, it compromises 3.46% overall accuracy compared to the proposed SSCA due to pruning-based optimization. Overall, it has been observed that the proposed 1D-CNN architectures provide better performance parameters, specifically helpful for accurate AF classification, based on the results of Table 2.

4.2. Software results comparison
The comparison of the proposed CNN architectures and the existing work for AF classification is shown in Table 3 with the same database
Fig. 15. Adder tree for data flattening.
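Table 2
Confusion matrix and performance parameters of the proposed architectures over the entire database.

Confusion matrices (rows: true class; columns: predicted class)
| 1D-CNN | True class | Predicted NSR | Predicted AF | Predicted OIR | Predicted EN |
|---|---|---|---|---|---|
| SCA | NSR | 4818 | 5 | 64 | 3 |
| SCA | AF | 5 | 686 | 16 | 0 |
| SCA | OIR | 101 | 38 | 2246 | 6 |
| SCA | EN | 21 | 3 | 5 | 244 |
| SSCA | NSR | 4812 | 10 | 62 | 6 |
| SSCA | AF | 7 | 688 | 12 | 0 |
| SSCA | OIR | 291 | 51 | 2037 | 12 |
| SSCA | EN | 18 | 3 | 2 | 250 |
| PRCA | NSR | 4782 | 3 | 92 | 13 |
| PRCA | AF | 11 | 664 | 26 | 6 |
| PRCA | OIR | 493 | 52 | 1819 | 27 |
| PRCA | EN | 28 | 1 | 8 | 236 |
| HCA/FPGA | NSR | 4776 | 7 | 69 | 38 |
| HCA/FPGA | AF | 18 | 648 | 33 | 8 |
| HCA/FPGA | OIR | 734 | 126 | 1475 | 56 |
| HCA/FPGA | EN | 30 | 3 | 4 | 236 |

Performance parameters
| 1D-CNN | Perfor. Para. | NSR | AF | OIR | EN | AVG |
|---|---|---|---|---|---|---|
| SCA | Acc. (%) | 97.57 | 99.17 | 97.20 | 99.53 | 98.37 |
| SCA | Sen. (%) | 98.53 | 97.03 | 93.94 | 89.38 | 94.72 |
| SCA | Spe. (%) | 96.16 | 99.37 | 98.54 | 99.88 | 98.49 |
| SCA | +P (%) | 97.43 | 93.72 | 96.35 | 96.44 | 95.99 |
| SCA | F1 (%) | 97.98 | 95.34 | 95.13 | 92.78 | 95.31 |
| SSCA | Acc. (%) | 95.18 | 98.95 | 94.77 | 99.48 | 97.09 |
| SSCA | Sen. (%) | 98.40 | 97.31 | 85.19 | 91.58 | 93.12 |
| SSCA | Spe. (%) | 90.40 | 99.11 | 98.70 | 99.76 | 96.99 |
| SSCA | +P (%) | 93.84 | 91.49 | 96.40 | 93.28 | 93.75 |
| SSCA | F1 (%) | 96.07 | 94.31 | 90.45 | 92.42 | 93.31 |
| PRCA | Acc. (%) | 92.14 | 98.70 | 91.49 | 98.91 | 95.31 |
| PRCA | Sen. (%) | 97.79 | 93.92 | 76.08 | 86.45 | 88.56 |
| PRCA | Spe. (%) | 83.64 | 99.19 | 97.83 | 99.37 | 95.01 |
| PRCA | +P (%) | 89.99 | 92.22 | 93.52 | 83.69 | 89.86 |
| PRCA | F1 (%) | 93.73 | 93.06 | 83.90 | 85.05 | 88.93 |
| HCA/FPGA | Acc. (%) | 88.84 | 97.34 | 87.47 | 98.09 | 92.94 |
| HCA/FPGA | Sen. (%) | 97.67 | 91.65 | 61.69 | 86.45 | 84.37 |
| HCA/FPGA | Spe. (%) | 75.10 | 97.95 | 98.16 | 98.54 | 92.44 |
| HCA/FPGA | +P (%) | 85.93 | 82.65 | 93.30 | 69.82 | 82.93 |
| HCA/FPGA | F1 (%) | 91.42 | 86.92 | 74.27 | 77.25 | 82.47 |

| 1D-CNN | SCA | SSCA | PRCA | HCA/FPGA |
|---|---|---|---|---|
| Overall Acc. (%) | 96.77 | 94.26 | 90.80 | 86.37 |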
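The sensitivity, positive predictivity (+P), F1 score, and overall accuracy reported in Table 2 can be recomputed from the confusion matrices. The sketch below does this for the SCA matrix using the standard one-vs-rest definitions, which is an assumption because Eq. (3) and the related definitions are not reproduced in this section; per-class accuracy and specificity are left out for the same reason. The names `class_metrics` and `overall_accuracy` are illustrative.

```python
CLASSES = ["NSR", "AF", "OIR", "EN"]

# SCA confusion matrix from Table 2 (rows: true class, columns: predicted class).
SCA = [
    [4818, 5, 64, 3],
    [5, 686, 16, 0],
    [101, 38, 2246, 6],
    [21, 3, 5, 244],
]


def class_metrics(cm, k):
    """One-vs-rest sensitivity, +P and F1 (in percent) for class index k."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                     # true class k, predicted as another class
    fp = sum(row[k] for row in cm) - tp      # other classes predicted as k
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sen": 100 * sensitivity, "ppv": 100 * precision, "f1": 100 * f1}


def overall_accuracy(cm):
    """Sum of the diagonal over the sum of all elements, in percent."""
    diagonal = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return 100 * diagonal / total


af = class_metrics(SCA, CLASSES.index("AF"))
print({name: round(value, 2) for name, value in af.items()})
print(round(overall_accuracy(SCA), 2))
```

Under these definitions the AF row and the overall accuracy agree with the SCA entries of Table 2 to two decimal places.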
except the binary classification works [16,19]. The proposed architectures have the highest accuracy among all the CNN-based methods. Moreover, the proposed methods have comparatively lower computational complexity than the dual methods, such as CNN-LSTM [11,12], 1D-CNN-KNN [16], 1D-CNN-SVM [17], and CNN-RNN [41]. The other methods combined with 1D-CNN introduce more complexity, which may affect the system's power consumption. Further, CNN-LSTM-based methods [11,12] may achieve higher accuracy than the proposed 1D-CNN architecture due to 2D-image-based signal classification. However, the additional pre-processing required to convert 1D ECG signals into 2D images may increase the computation cost of system design. Hence, the proposed 1D-CNN methods are more suitable for the safety–critical system hardware design due to a better trade-off between accuracy and computational complexity. The proposed SCA shows 2.13%, 8.97%, 1.71%, and 15.31% more accuracy than the existing CNN-based architectures of [41], [16], [19], and [13], respectively. Further, SCA has the highest sensitivity (recall) in all classes compared to existing work. It shows 12.3%, 7.57%, 27.87%, and 0.07% more sensitivity for the AF class compared to the existing work of [41], [16], [17], and [19], respectively.
Pourbabaee et al. [16] showed fatal AF arrhythmia detection using a binary classification system. If AF is not detected earlier, it may result in a heart attack. Hence, similar work of [16,19] is compared in Table 3 with the proposed work. The SCA multiclass classification has slightly higher sensitivity and 3.46% lesser precision than the binary classification method [19]. However, binary classification methods [16,19] may not be sufficient in actual practice because more errors may be introduced in the system when other irregular rhythms or excessive noise are classified into either the Normal or AF class.
Moreover, state-of-the-art methods [17,41] are compared with the same database in Table 3. Better performance has been observed with the proposed CNN architecture while classifying output into four classes with low complexity. The existing method [41] is more complex due to the combination of CNN and RNN, each individually introducing higher computational complexity. Similarly, the method of [17], a 1D-CNN stacked with an SVM architecture, is 3.13× more computationally complex than the proposed SCA, as shown in Table 3. The proposed SCA shows 18.93%, 12.3%, and 15.99% more precision, recall, and F1 score compared to the recent work of [41] for AF detection. Further, it shows 14.78%, 27.87%, and 21.14% more precision, recall, and F1 score compared to the existing work of [17] for AF detection. Furthermore, it has been noted that the proposed methods have better performance in other classes with lower computational complexity compared to existing work [17,41].
The performance of classification methods should not be biased towards specific classes because it gives a higher false-positive rate for those highly sensitive classes to the system. Fig. 19 shows the sensitivity of the proposed SCA compared with the existing work for arrhythmia classification. The sensitivity of the proposed SCA is higher than the existing design, unbiased, and uniformly spread for all classes.
The Receiver-Operating characteristic (ROC) curve in Fig. 20 shows the proposed SCA performance for arrhythmia classification. It further signifies that the proposed CNN can identify the arrhythmia classes correctly and does not generate perturbing results. It has been noted that the proposed CNN has a 0.97 area under the curve (AUC) of ROC, which
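Table 3
Atrial Fibrillation detection performance comparison of proposed CNN architectures with existing work.
| Group | Performance parameter | Wang [41] | Pourb. [16] | Nguyen [17] | Castillo [19] | Pyakillya [13] | SCA | SSCA | PRCA | HCA (FPGA) |
|---|---|---|---|---|---|---|---|---|---|---|
| NSR Class | +P (%) | 89 | 89.87 | 91.25 | 97.88 | – | 97.43 | 93.84 | 89.99 | 85.93 |
| NSR Class | Sen. (%) | 91.8 | 90.48 | 94.87 | 97.87 | – | 98.53 | 98.40 | 97.79 | 97.67 |
| NSR Class | AUC | 0.924 | – | – | – | – | 0.93 | – | – | – |
| NSR Class | F1 (%) | 90.6 | – | 93.04 | 97.88 | – | 97.98 | 96.07 | 93.73 | 91.42 |
| AF Class | +P (%) | 78.8 | 90.79 | 81.65 | 96.97 | – | 93.72 | 91.49 | 92.22 | 82.65 |
| AF Class | Sen. (%) | 86.4 | 90.20 | 75.88 | 96.96 | – | 97.03 | 97.31 | 93.92 | 91.65 |
| AF Class | AUC | 0.953 | – | – | – | – | 0.97 | – | – | – |
| AF Class | F1 (%) | 82.2 | – | 78.7 | 96.97 | – | 95.34 | 94.31 | 93.06 | 86.92 |
| OIR Class | +P (%) | 78 | – | 78.75 | – | – | 96.35 | 96.40 | 93.52 | 93.30 |
| OIR Class | Sen. (%) | 74.6 | – | 80.04 | – | – | 93.94 | 85.19 | 76.08 | 61.69 |
| OIR Class | AUC | 0.884 | – | – | – | – | 0.88 | – | – | – |
| OIR Class | F1 (%) | 76 | – | 79.31 | – | – | 95.13 | 90.45 | 83.90 | 74.27 |
| EN Class | +P (%) | – | – | – | – | – | 96.44 | 93.28 | 83.69 | 69.82 |
| EN Class | Sen. (%) | – | – | – | – | – | 89.38 | 91.58 | 86.45 | 86.45 |
| EN Class | AUC | – | – | – | – | – | 0.94 | – | – | – |
| EN Class | F1 (%) | – | – | – | – | – | 92.78 | 92.42 | 85.05 | 77.25 |
| CNN's accuracy, types, and types of rhythm classes | Acc. (%) for AF | 97.1 | 91 | – | 97.50 | 86 | 99.17 | 98.95 | 98.70 | 97.34 |
| CNN's accuracy, types, and types of rhythm classes | Arch. Type | CNN-RNN | CNN-KNN | 1D-CNN-SVM | 1D-CNN | 1D-CNN | 1D-CNN | 1D-CNN | 1D-CNN | 1D-CNN |
| CNN's accuracy, types, and types of rhythm classes | No. Classes | 3 | 2 | 3 | 2 | 4 | 4 | 4 | 4 | 4 |
| CNN's accuracy, types, and types of rhythm classes | Type of Classes | NSR, AF, O | NSR, AF | NSR, AF, O | NSR, AF | NSR, AF, O, E | NSR, AF, O, E | NSR, AF, O, E | NSR, AF, O, E | NSR, AF, O, E |
| CNN's accuracy, types, and types of rhythm classes | Epochs | 25 | 88 | 10 | 200 | – | 500 | 500 | – | – |
| Comput. Complex. (a) | Multi. | > 1D-CNN | > 1D-CNN | 2570688 | 644 | 538560 | 822536 | 114508 | 58463 | 0 |
| Comput. Complex. (a) | Adders | > 1D-CNN | > 1D-CNN | 2688888 | 703 | 636107 | 859268 | 117320 | 59946 | 60258 |
(a) Indicates the computational complexity is estimated based on the CNN architecture.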
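The percentage gains and complexity ratios quoted in Section 4.2 appear to be relative differences and ratios computed from the entries of Table 3. A small sketch, assuming that interpretation (the helper name `relative_gain` is illustrative and the values are copied from the AF rows and the adder counts of Table 3):

```python
def relative_gain(proposed: float, baseline: float) -> float:
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# AF-class figures (percent) for the proposed SCA and Wang [41].
sca_af = {"precision": 93.72, "recall": 97.03, "f1": 95.34}
wang_af = {"precision": 78.8, "recall": 86.4, "f1": 82.2}

gains = {key: round(relative_gain(sca_af[key], wang_af[key]), 2) for key in sca_af}

# Adder counts used as the complexity measure.
adders_nguyen, adders_sca, adders_prca = 2688888, 859268, 59946
ratio_nguyen_vs_sca = round(adders_nguyen / adders_sca, 2)
ratio_sca_vs_prca = round(adders_sca / adders_prca, 2)

print(gains, ratio_nguyen_vs_sca, ratio_sca_vs_prca)
```

With these inputs the sketch reproduces the 18.93%, 12.3%, and 15.99% gains over [41], the 3.13× complexity ratio of [17] to SCA, and the 14.33× ratio of SCA to PRCA.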
Table 4
F1-Score comparison of Proposed architectures with Cardiologist (Human) [20] for arrhythmia classification.
| CNN architectures | NSR (%) | AF (%) | OIR (%) | EN (%) |
|---|---|---|---|---|
| Cardiologist (Human) [20] | 84.7 | 63.5 | – | 76.8 |
| SCA | 97.98 | 95.34 | 95.13 | 92.78 |
| SSCA | 96.07 | 94.31 | 90.45 | 92.42 |
| PRCA | 93.73 | 93.06 | 83.90 | 85.90 |
| HCA (FPGA) | 91.42 | 86.92 | 74.27 | 77.25 |
is sufficiently high for AF classification performance. It is 4% and 1.7% higher than the existing work of [18,41], respectively. The proposed work has a small FNR (5.03%) and an AUC of 0.97, close to the ideal value of 1 for AF detection, which signifies that it can accurately classify the arrhythmia.
In this work, the F1-scores of the Cardiologist (Human) [20] for different arrhythmia classifications are compared with the proposed architectures in Table 4. It is noted that the performance of the proposed CNN architectures is higher than that of the cardiologist (Human) [20] in all classes. The purpose of this comparison, and of proposing hardware-based diagnostic tools, is to supplement cardiologists' diagnosis reports. Further, no hardware can replace cardiologists. However, it is better to detect early signs and symptoms of fatal diseases by detecting arrhythmias such as Atrial Fibrillation and other arrhythmias to save a life.
Fig. 16. CNN with CDC FIFO.

4.3. FPGA hardware implementation results

4.3.1. Hardware setup and tools
Xilinx VIVADO 2019.1 software environment is used for the hardware development of the proposed 1D-CNN PRCA and implemented on
Fig. 17. Block diagram of fully connected layer.
Fig. 18. Uniform sensitivity plot of proposed (A) SCA (B) SSCA and (C) PRCA to check unbiased performance.
Fig. 19. Sensitivity comparison of (A) Chazal [7] (B) C.Ye [8] (C) Proposed SCA.
Fig. 20. Proposed CNN Receiver-Operating Curve (ROC).
Zynq UltraScale+ ZCU106 FPGA. The direct hardware implementation of 1D-CNN may lead to high resource consumption. Hence, in this work, the folding technique (i.e., reuse of the same hardware) is further utilized in addition to a logarithmic shifter-based multiplier. A custom Intellectual Property (IP) of the proposed design and direct memory access (DMA) IP are used to create a block-based system design as a whole system to implement it in the Programmable Logic (PL) block of FPGA. The rest of the data handling logic and floating-point computation tasks are implemented on the Processing System (PS) block of FPGA using the Xilinx software development kit (SDK) tool. FPGA's PS block is programmed to configure the DMA and to handle other tasks, such as softmax computation, UART data acquisition, and the DDR memory interface. The block design of the proposed arrhythmia classification system is shown in Fig. 21.
Physionet ECG database setup discussed in Section 2.1 is used for hardware testing. The input samples of ECG wave files are stored in the DDR RAM to be processed by the proposed CNN hardware. First, DDR memory data is transferred to Block RAMs of the FPGA using the Direct Memory Access (DMA) module; after that, the computations start on the hardware. Note that DDR memory is filled with an ECG dataset that is already pre-processed in software to validate the hardware architecture implemented on FPGA. However, a hardware-efficient filter system for raw ECG signal pre-processing is essential for the real-life
Fig. 21. The block design of proposed ECG arrhythmia classifier system interface on embedded FPGA platform.
Fig. 22. Hardware sensitivity comparison (A) Brito [1] (B) Proposed HCA for arrhythmia classification.
Table 5
FPGA hardware resource utilization.
| Resource utilization | PL block | Entire system (PS and PL block) |
|---|---|---|
| LUT | 174651 (75.80%) | 184450 (80.06%) |
| LUTRAM | 1 (0.00098%) | 595 (0.58%) |
| FF | 46876 (10.17%) | 57855 (12.55%) |
| BRAM | 249 (79.80%) | 251 (80.44%) |
Fig. 23. Hardware False Positive rate (FPR) comparison (A) Brito [1] (B) Proposed HCA.
test scenario, as shown in work [4,5] with R-peak detection and ECG data segmentation [44].

4.3.2. FPGA results and comparison
Hardware CNN Architecture (HCA) results are obtained after passing the entire dataset through the designed system on FPGA. Its confusion matrix is tabulated in Table 2 to check the hardware version's performance against the software architectures. The overall accuracy of arrhythmia classification is 86.37%, which is 5.13% lesser than its software version, i.e., PRCA. However, it provides 97.34% accuracy of AF classification, which is only 1.4% lesser than PRCA. Further, the hardware performance is reduced compared to the software-based proposed architecture because of the fixed-point signal quantization and the shifter-based approximate multiplier design with weight quantization. It has been noticed that the sensitivity of the OIR class dropped to 61.69% because ECG data of the OIR class are classified into the other classes. However, the sensitivity of all other classes of HCA has a relatively smaller drop than the OIR class. It will be possible to improve the OIR class sensitivity by retraining the HCA on the FPGA device.
The proposed HCA still provides a high accuracy of 97.34% for AF classification despite the performance reduction and gives performance parameters close to those of the existing software-based state-of-the-art method [41] shown in Table 3. Further, its F1-score remains higher than [20] from Table 4, which shows that it can give a reliable performance for AF classification. The results of HCA are compared
Table 6
Comparison of proposed hardware architecture with existing work.
| Papers | Caffeine [50] | Jaramillo [22] | Lu. [46] | Proposed |
|---|---|---|---|---|
| Methods | VGG-16 | 1D-CNN | 1D-CNN | 1D-CNN |
| Classification type | – | Beat | Beat | Rhythm |
| Input data size | 96×96×1 | 320×1 | 500×1 | 10100×1 |
| Number of Convolution Layers | 9 | 4 | 8 | 3 |
| Dense layers | 1 | 3 | 2 | 0 |
| FPGA | Virtex-7 VX690T | Artix-7 | Zynq ZC706 | Zynq ZCU106 |
| LUT (kLUT) | 561.427 | 128.96 | 1.538 | 184.450 |
| DSP | 2833 | 121 | 80 | 0 |
| BRAM | 1248 | 16.5 | 12 | 251 |
| FF | 311904 | 2080 | – | 57855 |
with biochip architecture [1] for ECG arrhythmia classification. The proposed HCA sensitivity is lesser than [1] for three classes. However, the proposed HCA provides more uniform sensitivity, i.e., less biased performance, compared to [1], as shown in Fig. 22. The sensitivity of [1] is biased to three classes and less sensitive to one class. Therefore, it has a higher false-positive rate on average compared to the proposed work, as shown in Fig. 23.
Table 5 shows resource utilization of the proposed 1D-CNN IP and the entire system, including the proposed design and DMA logic in FPGA. The total FPGA power consumption of the entire system is 628 mW at PS f_clk = 100 MHz, PL f_clk = 1.250 MHz, and V_DD = 0.8 V. This work compares the state-of-the-art lightweight CNN architectures' hardware report with the proposed work in Table 6 to show that the proposed hardware architecture can also be considered lightweight. VGG16 CNN architecture requires 2D images for classification, whereas the proposed application is based on a 1D-ECG signal. Therefore, the 1D-ECG dataset is not applicable to VGG16 architecture unless it is converted to a 2D image using additional processing, which is costlier in practice. This pre-processing operation introduces more computational complexity for continuous ECG monitoring safety–critical systems, resulting in more hardware resource utilization, power consumption, and computation time. Hence, VGG16 architecture has not been used for classifying ECG signals for this application. Moreover, in machine learning, each architecture is very specific to one application, so one CNN architecture working for one use case may not work that efficiently for another use case.
Very little work is available for ECG arrhythmia classification using CNN architecture on FPGA to the best of our knowledge. Further, ECG rhythm classification through 1D-CNN on FPGA hardware has not been found, which is attempted in this work with a power minimization perspective. Furthermore, a small number of ECG samples may be insufficient for reliable rhythm classification, unlike the ECG beat classification based on existing work. Hence, the proposed architecture occupies more BRAM and LUTs to store and process the entire rhythm. Using more DSP multipliers in the design increases the device's power consumption. However, the proposed hardware does not utilize any built-in DSP multiplier because of the multiplier-less logarithmic shifter-based hardware design. Hence, the proposed architecture has a lower power consumption of 628 mW, which is 3× (i.e., 66.9%) lower than the existing 1.9 W power consumption of the CNN-based ECG arrhythmia classification method [40].

4.4. Discussion
In this work, the number of parameters chosen for the 1D-CNN architectures is such that it can be easily analyzed from lower to higher computational complexity to speed up the analysis process out of the many possibilities that exist over its parameters selection. However, it is not feasible to explore each parameter of CNN by fine-tuning over all the possible ranges. Hence, there is scope for future designs to come up with a better trade-off compared to the proposed work and analysis. The accuracy of PRCA is reduced due to the deformation of SSCA during pruning. Further, the FPGA hardware implemented PRCA, i.e., HCA, shows accuracy reduction due to hardware optimization using a logarithmic shifter-based multiplier, data quantization, weights quantization, and fixed-point hardware design. Hence, there is scope for improvement by proposing careful novel hardware optimization techniques to minimize accuracy reduction for safety–critical hardware design. Careful pruning and approximation-based hardware optimization techniques with advanced chip design technologies can be helpful for the hardware design of the complex SCA by considering the constraints of maintaining high accuracy and low power design. The resource utilization of the proposed HCA can be reduced by applying the folding technique further across the filters, convolution layers, and the adder tree (i.e., recursive addition), considering the low power constraints. A higher number of input data samples is utilized in the architectures to detect atrial fibrillation, increasing the proposed system's initial latency (i.e., a one-time waiting period while ECG samples are fed into the CNN architecture) and hardware resource utilization. Hence, extensive analysis is required with fewer input data samples to reduce initial latency considering the design's reliable accuracy and computational complexity trade-off. The hardware comparison shows that the proposed 1D-CNN method has higher resource utilization. However, it is affordable and manageable to implement on ASIC with the help of the available modern chip design technology. The power reduction can be further achieved using ASIC design rather than FPGA implementations. However, it is essential to emulate the proposed design on an FPGA device for hardware design verification and fast time to market. The future design should consider better trade-offs among parameters such as accuracy, sensitivity, computational complexity, power consumption, signal sampling rate, and the number of input data samples to be processed while designing a system-on-chip machine learning-based hardware. It will be helpful for diagnosing arrhythmia on low-power wearable or portable devices.

5. Conclusion
The primary focus of this work is to design a safety–critical system for early diagnosis of arrhythmia on hardware with low power consumption and higher accuracy. For that, an extensive analysis of the 1D-CNN-based architecture is carried out on software (GPU), and the SCA (3-C-128-F-0-D-55-K) is proposed, which gives 96.77% overall accuracy. It has higher accuracy than the existing state-of-the-art 1D-CNN-based architectures. However, its computational complexity is high. Hence, it is helpful only for software-based classification. The SSCA (3-C-32-F-0-D-55-K) is proposed to implement the 1D-CNN architecture on FPGA hardware considering the trade-off between accuracy and computational complexity with an overall accuracy of 94.26%. The SSCA is pruned to reduce hardware design complexity further. The PRCA provides 90.80% overall accuracy on software (GPU: NVIDIA TESLA P100), and it has been finally implemented on FPGA. The proposed PRCA shows 14.33× less computational complexity with a negligible 0.47% accuracy drop for atrial fibrillation classification compared to the SCA. It compromises 5.82% overall arrhythmia classification accuracy to get a better trade-off for the continuous ECG monitoring safety–critical system FPGA implementation. The logarithmic shifter-based multiplier
and weight quantization algorithm are proposed to reduce hardware complexity and storage. Furthermore, CDC-FIFO-based data pipelining is introduced between the convolution layers to reduce hardware power consumption and data synchronization. The FPGA-based hardware implementation is carried out using a mixed-precision fixed-point technique. It gives a reliable overall accuracy of 86.37% for arrhythmia classification. The proposed HCA has lower power consumption of 628 mW, which can be helpful to the wearable system-on-chip hardware design for early arrhythmia diagnoses.

CRediT authorship contribution statement
Viraj Rawal: Conceptualization, Methodology, Software/Hardware implementation, Formal analysis, Writing – original draft, Editing. Priyank Prajapati: Investigation, Software/Hardware, Results validation, Formal analysis, Writing – original draft, Visualization, Data curation, Writing – review & editing. Anand Darji: Conceptualization, Resources, Project administration, Supervision, Investigation.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
The authors do not have permission to share data.

Acknowledgments
The authors are thankful for Special Manpower Development Program – Chip to System Designs (SMDP-C2SD) sponsored by the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for providing the necessary support.

References
[1] P.B. de Sá, Biochip architecture for cardiac pathologies detection, 2017, www.it.pt.
[2] Mayo Clinic Staff, Heart arrhythmia, 2011, URL https://www.mayoclinic.org/diseases-conditions/heart-arrhythmia/symptoms-causes/syc-20350668, www.mayoclinic.org.
[3] P.H. Prajapati, A.D. Darji, Two stage step-size scaler adaptive filter design for ECG denoising, in: 2021 IEEE International Symposium on Circuits and Systems, ISCAS, 2021, pp. 1–5.
[4] P.H. Prajapati, A.D. Darji, Hardware efficient low-frequency artifact reduction technique for wearable ECG device, IEEE Trans. Instrum. Meas. 71 (2022) 1–9.
[5] P. Prajapati, A. Darji, Hardware design of two stage reference free adaptive filter for ECG denoising, in: A.P. Shah, S. Dasgupta, A. Darji, J. Tudu (Eds.), VLSI Design and Test, Springer Nature Switzerland, Cham, 2022, pp. 305–319.
[13] B. Pyakillya, N. Kazachenko, N. Mikhailovsky, Deep learning for ECG classification, in: Journal of Physics: Conference Series, Volume 913, IOP Publishing, 2017, 012004.
[14] J. Li, Y. Si, T. Xu, S. Jiang, Deep convolutional neural network based ECG classification system using information fusion and one-hot encoding techniques, Math. Probl. Eng. 2018 (2018).
[15] Ö. Yıldırım, P. Pławiak, R.-S. Tan, U.R. Acharya, Arrhythmia detection using deep convolutional neural network with long duration ECG signals, Comput. Biol. Med. 102 (2018) 411–420.
[16] B. Pourbabaee, M.J. Roshtkhari, K. Khorasani, Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients, IEEE Trans. Syst. Man Cybern. 48 (12) (2018) 2095–2104.
[17] Q.H. Nguyen, B.P. Nguyen, T.B. Nguyen, T.T. Do, J.F. Mbinta, C.R. Simpson, Stacking segment-based CNN with SVM for recognition of atrial fibrillation from single-lead ECG recordings, Biomed. Signal Process. Control 68 (2021) 102672.
[18] L. Khriji, M. Fradi, M. Machhout, A. Hossen, Deep learning-based approach for atrial fibrillation detection, in: International Conference on Smart Homes and Health Telematics, Springer, 2020, pp. 100–113.
[19] J.A. Castillo, Y.C. Granados, C.A. Fajardo, Patient-specific detection of atrial fibrillation in segments of ECG signals using deep neural networks, Cienc. E Ing. Neogranadina 30 (1) (2020) 45–58.
[20] P. Rajpurkar, A.Y. Hannun, M. Haghpanahi, C. Bourn, A.Y. Ng, Cardiologist-level arrhythmia detection with convolutional neural networks, 2017, arXiv preprint arXiv:1707.01836.
[21] W. Jatmiko, P. Mursanto, A. Febrian, M. Fajar, W. Anggoro, R. Rambe, M. Tawakal, F.F. Jovan, et al., Arrhythmia classification from wavelet feature on FGPA, in: 2011 International Symposium on Micro-NanoMechatronics and Human Science, IEEE, 2011, pp. 349–354.
[22] A.F. Jaramillo-Rueda, L.Y. Vargas-Pacheco, C.A. Fajardo, A computational architecture for inference of a quantized-CNN for detecting atrial fibrillation, Ing. Cienc. 16 (32) (2020) 135–149.
[23] M. Courbariaux, Y. Bengio, J.-P. David, Binaryconnect: Training deep neural networks with binary weights during propagations, in: Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[24] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1, 2016, arXiv preprint arXiv:1602.02830.
[25] F. Li, B. Zhang, B. Liu, Ternary weight networks, 2016, arXiv preprint arXiv:1605.04711.
[26] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: Imagenet classification using binary convolutional neural networks, in: European Conference on Computer Vision, Springer, 2016, pp. 525–542.
[27] M. Hailesellasie, S.R. Hasan, F. Khalid, F.A. Wad, M. Shafique, FPGA-based convolutional neural network architecture with reduced parameter requirements, in: 2018 IEEE International Symposium on Circuits and Systems, ISCAS, IEEE, 2018, pp. 1–5.
[28] D. Miyashita, E.H. Lee, B. Murmann, Convolutional neural networks using logarithmic data representation, 2016, arXiv preprint arXiv:1603.01025.
[29] L. Lu, J. Xie, R. Huang, J. Zhang, W. Lin, Y. Liang, An efficient hardware accelerator for sparse convolutional neural networks on FPGAs, in: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM, IEEE, 2019, pp. 17–25.
[30] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, H. Shen, An efficient hardware accelerator for structured sparse convolutional neural networks on FPGAs, 2020, arXiv preprint arXiv:2001.01955.
[31] B. Khabbazan, S. Mirzakuchaki, Design and implementation of a low-power, embedded CNN accelerator on a low-end FPGA, in: 2019 22nd Euromicro Conference on Digital System Design, DSD, IEEE, 2019, pp. 647–650.
[32] P. Tummeltshammer, J.C. Hoe, M. Puschel, Time-multiplexed multiple-constant
[6] K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, multiplication, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 26 (9)
John Wiley & Sons, 2007. (2007) 1551–1563.
[7] P. De Chazal, M. O’Dwyer, R.B. Reilly, Automatic classification of heartbeats [33] J. Faraone, M. Kumm, M. Hardieck, P. Zipf, X. Liu, D. Boland, P.H. Leong,
using ECG morphology and heartbeat interval features, IEEE Trans. Biomed. Eng. AddNet: Deep neural networks using FPGA-optimized multipliers, IEEE Trans.
51 (7) (2004) 1196–1206. Very Large Scale Integr. (VLSI) Syst. (2019).
[8] C. Ye, B.V. Kumar, M.T. Coimbra, Heartbeat classification using morphological [34] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D.
and dynamic features of ECG signals, IEEE Trans. Biomed. Eng. 59 (10) (2012) Kalenichenko, Quantization and training of neural networks for efficient integer-
2930–2941. arithmetic-only inference, in: Proceedings of the IEEE Conference on Computer
[9] Z. Zhang, X. Luo, Heartbeat classification using decision level fusion, Biomed. Vision and Pattern Recognition, 2018, pp. 2704–2713.
Eng. Lett. 4 (4) (2014) 388–395. [35] S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural
[10] K. Park, B. Cho, D. Lee, S. Song, J. Lee, Y. Chee, I.Y. Kim, S. Kim, Hierarchical networks with pruning, trained quantization and huffman coding, 2015, arXiv
support vector machine based heartbeat classification using higher order statistics preprint arXiv:1510.00149.
and hermite basis function, in: 2008 Computers in Cardiology, IEEE, 2008, pp. [36] Z. Cai, X. He, J. Sun, N. Vasconcelos, Deep learning with low precision by half-
229–232. wave gaussian quantization, in: Proceedings of the IEEE Conference on Computer
[11] S.L. Oh, E.Y. Ng, R. San Tan, U.R. Acharya, Automated diagnosis of arrhythmia Vision and Pattern Recognition, 2017, pp. 5918–5926.
using combination of CNN and LSTM techniques with variable length heart beats, [37] H. Li, A. Kadav, I. Durdanovic, H. Samet, H.P. Graf, Pruning filters for efficient
Comput. Biol. Med. 102 (2018) 278–287. convnets, 2016, arXiv preprint arXiv:1608.08710.
[12] Z. Zheng, Z. Chen, F. Hu, J. Zhu, Q. Tang, Y. Liang, An automatic diagnosis of [38] M. Zhang, L. Li, H. Wang, Y. Liu, H. Qin, W. Zhao, Optimized compression for
arrhythmias using a combination of CNN and LSTM technology, Electronics 9 implementing convolutional neural networks on FPGA, Electronics 8 (3) (2019)
(1) (2020) 121. 295.
[39] M.M. Pasandi, M. Hajabdollahi, N. Karimi, S. Samavi, Modeling of pruning techniques for deep neural networks simplification, 2020, arXiv preprint arXiv:2001.04062.
[40] J. Ney, D. Loroch, V. Rybalkin, N. Weber, J. Krüger, N. Wehn, HALF: Holistic auto machine learning for FPGAs, in: 2021 31st International Conference on Field-Programmable Logic and Applications, FPL, IEEE, 2021, pp. 363–368.
[41] M. Wang, S. Rahardja, P. Fränti, S. Rahardja, Single-lead ECG recordings modeling for end-to-end recognition of atrial fibrillation with dual-path RNN, Biomed. Signal Process. Control 79 (2023) 104067.
[42] G.D. Clifford, C. Liu, B. Moody, H.L. Li-wei, I. Silva, Q. Li, A. Johnson, R.G. Mark, AF classification from a short single lead ECG recording: The PhysioNet/computing in cardiology challenge 2017, in: 2017 Computing in Cardiology, CinC, IEEE, 2017, pp. 1–4.
[43] A.L. Goldberger, L.A. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.-K. Peng, H.E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (23) (2000) e215–e220.
[44] J. Pan, W.J. Tompkins, A real-time QRS detection algorithm, IEEE Trans. Biomed. Eng. 32 (3) (1985) 230–236.
[45] Y. Cui, J. Zhai, X. Wang, Extreme learning machine based on cross entropy, in: 2016 International Conference on Machine Learning and Cybernetics, Volume 2, ICMLC, IEEE, 2016, pp. 1066–1071.
[46] J. Lu, D. Liu, Z. Liu, X. Cheng, L. Wei, C. Zhang, X. Zou, B. Liu, Efficient hardware architecture of convolutional neural network for ECG classification in wearable healthcare device, IEEE Trans. Circuits Syst. I. Regul. Pap. 68 (7) (2021) 2976–2985.
[47] ANSI/AAMI-EC57 Standard, ANSI, Testing and reporting performance results of cardiac rhythm and ST segment measurement algorithms, 1998, Standard ANSI/AAMI-EC57, 46.
[48] C.R. Baugh, B.A. Wooley, A two’s complement parallel array multiplication algorithm, IEEE Trans. Comput. 100 (12) (1973) 1045–1047.
[49] A.A. Kumar, Fundamentals of Digital Circuits, PHI Learning Pvt. Ltd., 2016.
[50] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, J. Cong, Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 38 (11) (2018) 2072–2085.