
sensors

Article
A Configurable and Fully Synthesizable RTL-Based
Convolutional Neural Network for Biosensor Applications
Pervesh Kumar 1 , Huo Yingge 1 , Imran Ali 1,2 , Young-Gun Pu 1,2 , Keum-Cheol Hwang 1 , Youngoo Yang 1 ,
Yeon-Jae Jung 1,2 , Hyung-Ki Huh 1,2 , Seok-Kee Kim 1,2 , Joon-Mo Yoo 1,2 and Kang-Yoon Lee 1,2, *

1 Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16416, Korea;
itspervesh@skku.edu (P.K.); yingge@skku.edu (H.Y.); imran.ali@skku.edu (I.A.); hara1015@skku.edu (Y.-G.P.);
khwang@skku.edu (K.-C.H.); yang09@skku.edu (Y.Y.); yjjung@skaichips.co.kr (Y.-J.J.);
gray.huh@skku.edu (H.-K.H.); seokkeekim@skku.edu (S.-K.K.); fiance2@g.skku.edu (J.-M.Y.)
2 SKAIChips, Sungkyunkwan University, Suwon 16419, Korea
* Correspondence: klee@skku.edu; Tel.: +82-31-299-4954

Abstract: This paper presents a register-transfer level (RTL) based convolutional neural network (CNN) for biosensor applications. Disease detection through DNA identification with biosensors is an area of growing need, and we propose a fully synthesizable RTL-based CNN architecture for this purpose. The adopted parallel multiply-and-accumulate (MAC) computation approach optimizes the hardware overhead by significantly reducing the arithmetic calculation and achieves results with low latency, while sharing the multiplier bank between the convolutional and fully connected operations significantly reduces the implementation area. The CNN model is trained in MATLAB® on the MNIST® handwritten-digit dataset. For validation, image pixel arrays from the MNIST® dataset are applied to the proposed RTL-based CNN architecture in ModelSim®. Consistency is checked with multiple test samples, and 92% accuracy is achieved. The proposed design is implemented in 28 nm CMOS technology. It occupies a total area of 9.986 mm2, requires 2.93 W from a 1.8 V supply, and takes a total time of 8.6538 ms.
Citation: Kumar, P.; Yingge, H.; Ali, I.; Pu, Y.-G.; Hwang, K.-C.; Yang, Y.; Jung, Y.-J.; Huh, H.-K.; Kim, S.-K.; Yoo, J.-M.; et al. A Configurable and Fully Synthesizable RTL-Based Convolutional Neural Network for Biosensor Applications. Sensors 2022, 22, 2459. https://doi.org/10.3390/s22072459

Received: 14 December 2021; Accepted: 21 March 2022; Published: 23 March 2022

Keywords: convolutional neural network; biosensor; disease classification; RTL-based design

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

A biosensor is a device that is sensitive to biological substances and converts their concentration into an electrical signal for further processing and analysis [1]. Existing artificial intelligence (AI) biosensors have several limitations: (1) they require large amounts of well-labeled data; (2) they have poor flexibility; and (3) feature extraction depends strongly on logic and accumulation. Because of these restrictions, the potency of traditional biosensors is limited in aspects of performance such as accuracy and timing [2,3]. During the past few years, with the promising research in deep learning, and especially in CNNs, these limitations have become possible to overcome. Unlike traditional biosensor systems, a CNN can meaningfully improve the extraction of different features. A CNN-based biosensor system can therefore exploit unlabeled data, with the features learned automatically by the network architecture. Hence, the CNN is a promising approach for biosensor applications and has been widely studied in existing works [4,5].
While achieving real-time performance, CNN-based techniques demand much more computation and memory resources than conventional methods, so an energy-efficient CNN implementation is essential. Application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) [6–9] based accelerators are promising alternatives. The ASIC-based studies in [10,11] target cost efficiency, energy, and throughput. Similarly, FPGA-based research [12–14] achieves better performance because of parallel computation. Moreover, CNN chip implementations are categorized into two classes: (1) standard or traditional chips, and (2) neuro-chips. Standard chips are further

divided into two subclasses: (I) multi-processor, and (II) sequential accelerator. In a multi-processor, multiple cores are integrated for the CNN operation; the purpose is to perform parallel operations, which increases throughput, improves system performance, and doubles the speed of data loading. With the sequential-accelerator breakthrough, on-chip integration of the CNN algorithm is implementable.
The neuro-chips are built with multi-electrode arrays; a kind of integrated circuit chip
that performs better because of good connectivity and fast and parallel computation. It
needs low power and occupies less memory compared to traditional chips [15–18]. The
research associated with neuro-chips is further classified into three main methods: fully ana-
log, fully digital, and mixed analog/digital methods. For the mixed analog/digital [19,20],
there are two operational modes; one is neuron mode, and another is synapse mode. For
the neuron mode, usually, the asynchronous digital is adopted to perform the spiking.
Regarding the synapse mode, most studies use analog methods to carry out compact
operations such as those discussed in [21–23]. But there are studies that use analog/digital
hybrid circuits to achieve timing synapses [24]. The digital logic foundation is adopted for
on-chip structure configuration and providing corresponding algorithms. Different CNN
architectures are implemented as per requirements. When using only the analog approach, memristors are the key technology: they implement the required functionality by keeping and tuning their resistance state according to supply changes. Together with resistive switching memories, they meaningfully open the way to achieving neural-network performance and functionality in a fully analog manner. In [25], different functionalities of the neural network are implemented, achieving good performance in terms of small area and low power consumption. For the fully digital methods, hardware algorithms are designed to realize the functionality and performance of the neural networks. These studies adopt software for modeling and training the neural network, obtain trained weight and bias values, and use these parameters for the hardware implementation [26,27].
Previously proposed CNN hardware implementation architectures for classification used architectural parallelism and parameter-reuse approaches. As a result, less memory was required, or all memory could be accommodated on-chip, but the drawback was only moderate accuracy. In [28] and other related studies, parallel computation was also adopted to reduce the processing time, but the trade-offs were a large implementation area, nominal accuracy, and a limit on the number of layers.
The motivation of this work is to design a reconfigurable RTL-based CNN architecture
for a biosensor for disease detection applications. The proposed design is fully synthesizable
and technology independent. The parallel computation of MAC operation is used to reduce
the arithmetic and extensive calculations. The multiplier bank is shared among all the
convolutional layers and fully connected layers to reduce the implementation area.
The rest of the paper is organized as follows: Section 2 introduces the proposed top architecture of the CNN for biosensor applications and the overall system structure. The structures and functions of the sub-blocks are introduced in Section 3. Section 4 lists the experimental results, including software modeling results and hardware simulation results; the layout, mathematical results, and a comparison with other works are also summarized. Finally, the paper is concluded in Section 5.

2. Top Architecture
The top structure of the proposed RTL-based CNN hardware implementation is shown in Figure 1. The proposed design comprises three parts: (1) CNN architecture modeling and training in MATLAB®; (2) external on-board memory; and (3) a hardware implementation system based on an RTL compiler. In MATLAB®, two operations are performed: (1) the CNN architecture is modeled, trained, and tested, and the trained weights and bias data are saved in a .txt file; (2) the input feature-map data are converted into binary data, which can be recognized by the hardware tools, and saved into a .txt file for further processing.

Figure 1. Top structure of CNN hardware implementation.

External on-board memory is a kind of multiple-time programmable memory. It is used for storing the trained kernels' weights and bias values, with preloaded instructions for on-chip processing. Its operation is controlled by the top controller through read instructions and enable signals. An interface protocol is adopted to move the data between the external on-board memory and the on-chip CNN system.
In the on-chip system, the same CNN architecture is modeled. Figure 1 shows the corresponding architecture, which is comprised of several building blocks. The dotted lines represent the data path and describe the data-flow direction between the different sub-blocks. The solid lines show the control-signal paths and indicate how the sub-blocks work together. The on-chip block has several sub-blocks, such as the memory system, which is designed to preload and store the kernel weights, biases, and feature-map data. The CNN architecture layers are the convolution + ReLU layers, the pooling layers, and the fully connected (FC) layers. The top controller controls the whole system operation, connecting the different sub-blocks and controlling whether data are saved in the on-chip memory or passed to the next stage of operation. A multiplier bank performs the convolution calculations; it is the main computing resource for reducing the computation and is designed to be shared among all convolution + ReLU and fully connected layers. An output control logic is designed to produce the final results of the system and the final classification results.
The on-chip system operation is described in Figure 2. As the system starts, the feature buffer saves the input data, and the external on-board memory saves the trained weights and biases. Once the CONV enable signal is asserted, the CONV + ReLU block and the multiplier bank get input data from the feature buffers and weight and bias values from the external on-board memory, and the convolutional operation is performed. The convolved results go to the pooling layer for sampling purposes. Once all convolution and pooling iterations are done, the FB enable signal is asserted and the output activation feature maps are saved in the on-chip feature buffers. When the FC enable signal is asserted, the FC layer and the multiplier bank obtain weights and biases from the external on-board memory and feature maps from the feature buffers to perform the FC operation. As the FC operation finishes, the FC done signal is generated; the output control logic finds the label having the maximum computed value and outputs it as the class.

Figure 2. Flow chart of the proposed CNN architecture.

Figure 3 shows the proposed CNN architecture. It has seven layers in total: two convolutional layers, two pooling layers, two fully connected layers, and one softmax output layer. The input feature-map dimension is 32 × 32. In C1, the input feature maps are convolved with six kernels, each of size 5 × 5, with a stride of 1. This generates six feature maps with a size of 28 × 28, which are then handled by the activation function, the rectified linear unit (ReLU). In S2, the feature maps are sub-sampled to half their size by the max-pooling approach, with a window of 2 × 2 and a stride of 2, reducing the input feature maps to 14 × 14. In C3, the 14 × 14 feature maps are convolved with 16 kernels of size 5 × 5 to get 16 feature maps with a size of 10 × 10, again processed with the ReLU operation. The S4 max-pooling operation uses a 2 × 2 window with a stride of 2 to achieve 16 feature maps of size 5 × 5. C5 is also a convolutional layer, with 120 kernels, each of size 5 × 5; F6 is a fully connected layer with 84 feature maps. Softmax is used for classification: the class with the highest probability value is output as the recognized handwritten digit. The CNN model is trained with 32-bit floating-point precision on the MATLAB® platform. Since MATLAB® computes in parallel, the processing time is reduced compared to conventional C or C++ processing approaches.
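The layer dimensions traced above follow directly from the valid-convolution and pooling size formulas; a short behavioral sketch (ours, not part of the RTL) can verify them:

```python
def conv_out(n, k, stride=1):
    # Output width of a valid convolution: (n - k) / stride + 1
    return (n - k) // stride + 1

def pool_out(n, k=2, stride=2):
    # Output width of non-overlapping max pooling
    return (n - k) // stride + 1

n = 32                               # input feature map is 32 x 32
n = conv_out(n, 5); assert n == 28   # C1: six 28 x 28 maps
n = pool_out(n);    assert n == 14   # S2: six 14 x 14 maps
n = conv_out(n, 5); assert n == 10   # C3: sixteen 10 x 10 maps
n = pool_out(n);    assert n == 5    # S4: sixteen 5 x 5 maps
n = conv_out(n, 5); assert n == 1    # C5: 120 kernels of 5 x 5 -> 1 x 1 each
```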

Figure 3. The proposed CNN model structure.

3. Building Blocks

3.1. Top Controller

The top controller is the overall controlling module of the system. Firstly, it is used to control the modules sequentially with enable and done signals. When these blocks need data, the enable signal for the next operation directly starts the next module and also disables the current module's operation. Secondly, the top controller is applied for communication with the external memory for reading weight and bias information. Through an interface protocol, such as the serial peripheral interface (SPI), the data are transferred from the external on-board memory to the on-chip system. Thirdly, the top controller is also used to differentiate the read/write indications between the contiguous blocks and to manage whether calculation results are saved to the memories or passed to the next stage. Fourthly, it is used to control the convolution + ReLU module and share the multiplier bank with the fully connected module. It controls the selection of the data in the memories, which is consumed by the various calculation blocks through the shared multipliers or pooling blocks, and it also routes the multiplier results to either the convolution operation or the FC operation.
3.2. Feature Buffers

The feature buffers, as shown in Figure 1, are used to save the output data of each sub-block. They are integrated as on-chip memories. Each sub-block saves its output activation map into a different on-chip memory, according to the size of the memories. These memories are built from 10 k pf components, meaning that each block has a capacity of 10 kbit. If the depth of the computation memory is bigger than this 10 k maximum, another memory component of the same size is adopted.
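As a rough sanity check of this sizing, the number of blocks a layer's output occupies can be estimated (a sketch under our assumptions: 8-bit activations, as used by the multipliers in Section 3.3, and 10 kbit = 10 × 1024 bits per block):

```python
import math

BLOCK_BITS = 10 * 1024  # assumed capacity of one on-chip memory block

def blocks_needed(maps, width, height, bits=8):
    # Number of 10 kbit blocks required to hold one layer's output activation maps
    return math.ceil(maps * width * height * bits / BLOCK_BITS)

print(blocks_needed(6, 28, 28))   # C1 output: 37,632 bits -> 4 blocks
print(blocks_needed(16, 5, 5))    # S4 output: 3200 bits -> fits in 1 block
```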
3.3. Convolutional Operation

The top architecture of the convolution layer is shown in Figure 4. It consists of a dedicated CONV controller, a window wrapper, a multiplier bank, adder trees, and ReLU modules. The input feature-map data and the kernel and bias data for the convolutional operation are pre-trained and, after preloading, saved in the on-chip memory. The calculation results from the multiplier bank are passed to the adder trees for the next calculation. After computation, the results are transformed and stored in the on-chip feature buffers for the next layer's processing.
The CONV controller of the convolution operation is mainly built from counters. Firstly, it gets the enable signal from the top controller and starts providing read signals to the feature buffers and the external on-board memory when the convolution operation starts. Secondly, according to the kernel window size, it controls the window wrapper to select the input feature-map data for the partial convolution operation and slides this window over all the spatial feature maps with a stride of 1. Finally, it manages the write address to save the output activation-map values, after ReLU processing, to the on-chip memories.
The role of the window wrapper is to select the window, as shown in Figure 5: according to the kernel size, it selects the corresponding pixels from the input feature-map data. It consists of a window shifter and a window selector. The window shifter consists of shift registers, as shown in Figure 6. After obtaining feature-map data from the feature buffer and the shift signal from the Conv controller, it shifts the data serially and provides them to the window selector in parallel. The window selector consists of MUXes; after getting the kernel x and y coordinates from the Conv controller, it performs pixel selection according to the kernel window size, as shown in Figure 7.

Figure 4. Overall architecture of the convolution layer.

Figure 5. Window wrapper structure.
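The select-and-slide behavior of the window wrapper can be sketched as follows (a behavioral model only, with function names of our own choosing, not the RTL):

```python
def window(fmap, x, y, k):
    # Select the k x k pixel window whose top-left corner is (x, y),
    # as the window selector does from the shifter's parallel output.
    return [row[x:x + k] for row in fmap[y:y + k]]

def slide(fmap, k, stride=1):
    # Slide the window over the whole feature map (stride 1 for convolution).
    n = len(fmap)
    for y in range(0, n - k + 1, stride):
        for x in range(0, n - k + 1, stride):
            yield window(fmap, x, y, k)

fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
wins = list(slide(fmap, 3))
print(len(wins))     # 4 windows from a 4 x 4 map with a 3 x 3 kernel
print(wins[0][0])    # [0, 1, 2]
```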

After receiving the selected pixels from the window wrapper and the kernel and bias values from the external on-board memory, the convolution operation is performed by the multiplier bank and adder tree, as shown in Figure 8. The multiplier bank consists of multipliers, and the number of multipliers is decided by the kernel window size. For example, the kernel window size is 5 × 5 (25 products), so the number of multipliers is rounded up to 2^5 = 32. Each multiplier multiplies an 8-bit kernel value with an 8-bit selected pixel in a parallel manner and provides the result to the adder tree. The adder tree accumulates the multiplier-bank results, together with the bias value, within one kernel window. The number of adders is decided by the number of multipliers and can be calculated as follows in (1):

Nadder = ∑_{i=0}^{log2 N} 2^i + 1    (1)
where N is the number of multipliers and Nadder is the number of adders.
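The multiply-then-reduce structure can be sketched behaviorally (our model, not the RTL; the pairwise loop mirrors the levels of the adder tree, with the bias added at the final stage):

```python
def conv_window_mac(pixels, weights, bias):
    # One kernel-window convolution as the hardware performs it:
    # all products computed in parallel by the multiplier bank,
    # then reduced level by level through the adder tree.
    products = [p * w for p, w in zip(pixels, weights)]
    while len(products) > 1:
        if len(products) % 2:
            products.append(0)          # pad an odd level with zero
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0] + bias           # bias added at the final stage

pixels  = [1] * 25                      # a 5 x 5 window of ones
weights = [2] * 25
print(conv_window_mac(pixels, weights, 3))   # 25 * 2 + 3 = 53
```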

Figure 6. Window shift structure.

Figure 7. Window selector structure.

The block diagram of the ReLU operation is given in Figure 9. The ReLU unit consists of a comparator and a mux, which operate on the assigned bits of the pixel value. Basically, it converts negative values to zero, while it leaves positive values unchanged. Mathematically, this can be given as in (2):

f(x) = 0 for x < 0;  f(x) = x for x ≥ 0    (2)

where f(x) represents the output of the ReLU activation function. Its output is given to the max-pooling layer directly.
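Behaviorally, the operation in (2) on a signed 8-bit activation is just a compare-and-select (a sketch, not the RTL):

```python
def relu8(x):
    # Comparator + mux: pass non-negative values, clamp negatives to zero.
    assert -128 <= x <= 127, "signed 8-bit input assumed"
    return x if x >= 0 else 0

print([relu8(v) for v in (-128, -1, 0, 5, 127)])   # [0, 0, 0, 5, 127]
```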

Figure 8. The architecture of the convolutional operation.

Figure 9. The architecture of the ReLU operation.
3.4. Max-Pooling Operation
3.4.
3.4.Max-Pooling
Max-PoolingOperation
Operation
The max-pooling operation is achieved by combining the max-pooling controller with comparators. The block diagram is given in Figure 10. After receiving the enable signal from the top controller and the input values from the previous layer, the max-pooling controller partitions the input feature maps data into a set of rectangular sub-regions with a size of 2 × 2 and a stride of 2. Unlike the window selection of the convolution operation, the window is moved without any overlap. The comparator compares the 2 × 2 sub-region values and outputs the maximum value of each sub-region. The controller also provides a read signal to the last stage to inform it of the start of the operation and supplies a write address to save the output value to the feature buffer for the next layer's processing.
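The partition-and-compare behavior described above can be modeled in software. The following Python sketch is a behavioral illustration only (the function and variable names are ours, not the RTL's); it assumes even input dimensions, as in the 2 × 2, stride-2 configuration described.

```python
def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2: each non-overlapping sub-region
    contributes its maximum value, mirroring the comparator logic."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    out = []
    for r in range(0, rows - 1, 2):          # stride 2, no overlap
        out_row = []
        for c in range(0, cols - 1, 2):
            sub = (feature_map[r][c], feature_map[r][c + 1],
                   feature_map[r + 1][c], feature_map[r + 1][c + 1])
            out_row.append(max(sub))          # comparator output
        out.append(out_row)
    return out

# 4 x 4 input -> 2 x 2 output
fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [0, 1, 3, 2]]
print(max_pool_2x2(fmap))  # [[6, 4], [7, 9]]
```

In hardware the same comparison is done by a comparator tree per sub-region; the software loop only models the resulting values.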
Sensors 2022, 22, 2459 9 of 18

Figure 10. The architecture of the max-pooling operation.

3.5. Fully Connected Operation
The function of the FC layer is matrix multiplication. It is typically built from the FC controller and the multiplication and accumulation operations, which are similar to those of the convolutional layers. A block diagram of the fully connected operation is shown in Figure 11. After receiving the enable signal from the top controller, the FC controller starts sending the read signal to the feature buffer, which holds the last layer's computation results, and to the external on-board memory to obtain the input feature and weight values. The parallel multiplication and accumulation computations share the feature maps data across all the rows, and the rows are calculated together in parallel. By adopting the multiplier bank shared with the convolutional layers, the computation is much reduced. Typically, the FC layer associates the input pitch points with the output pitch points in the current layer. The function is given as in (3):

Yi = ∑_{k=0}^{W×H−1} Xk × Wi + bi,  N ≥ i ≥ 0  (3)

where Xk represents the feature maps, Yi represents the calculation results, Wi is the weight values, and bi represents the bias value. N is the number of output nodes.

Figure 11. The architecture of the fully connected operation.
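As a software reference for (3), the sketch below computes each output node by accumulating the W × H input pixels against that node's weights and adding its bias. It is an illustrative model only; the paper writes the weights of node i as Wi, which is treated here as a per-node weight vector (an assumption of this sketch).

```python
def fc_layer(x, weights, bias):
    """Fully connected layer of (3): y[i] = sum_k x[k] * w[i][k] + b[i].
    x: flattened feature map of length W*H; weights: N x (W*H); bias: N."""
    return [sum(xk * wk for xk, wk in zip(x, w_i)) + b_i
            for w_i, b_i in zip(weights, bias)]

x = [1.0, 2.0, 3.0]            # toy flattened feature map (W*H = 3)
weights = [[0.5, 0.5, 0.5],    # weights of node 0
           [1.0, 0.0, -1.0]]   # weights of node 1
bias = [0.1, -0.2]
print(fc_layer(x, weights, bias))  # [3.1, -2.2]
```

The inner generator corresponds to the parallel multipliers; the running sum corresponds to the accumulation path shared with the convolutional layers.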

4. Experimental Results
4.1. MATLAB® Modeling and Results
The proposed CNN architecture was modeled and verified in MATLAB®. A comprehensive analysis of the model structure is given in Figure 12. It shows the layer-wise execution details, operations, operands and the total number of parameters at each layer after training. The model was trained on 60,000 images of MNIST® handwritten digits [29]; the number of epochs was 10, with a batch size of 50, and the learning rate was kept at 0.5. The proposed CNN model consisted of 3 convolutional layers and 2 fully connected layers; a kernel size of 3 × 3 was used for each convolutional layer, the ReLU activation function was used, and max-pooling with a stride size of 2 × 2 was used. The model was tested on 1000 images from the MNIST® data set. Initially, a digit image is given to the model. After processing, the training error is shown in Figure 13a and the accuracy is shown in Figure 13b. We achieved 92.4977% model training accuracy. The classification results are displayed in Figure 14.

Figure 12. Analysis results of proposed CNN structure in MATLAB®.

Figure 13. Training, testing error and accuracy of the proposed architecture. (a) The training error result. (b) The training accuracy.

Figure 14. The classification result of the proposed CNN model.

4.2. FPGA Implementation


The simulation result of the top convolutional layer is shown in Figure 15. Image data was preloaded into the image array memory; after top window selection, the selected pixels could perform the convolution operation with the kernel and bias values. Figure 16 shows the convolution operation logic, which combines the multiplier bank and the adder tree to perform the operation given in Equation (4):

Out = ∑ in × kernel + bias  (4)

where in is the input feature map data of the convolution module, kernel is the corresponding weights data, and bias represents the system bias data.

Figure 15. The simulation result of the top convolutional layer.

Figure 16. The simulation result of the convolution operation.
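The multiply-accumulate of Equation (4) can be sketched in Python as follows. This is an illustrative software model (the names are assumptions, not RTL signals), in which the list of products stands in for the multiplier bank and the summation for the adder tree.

```python
def conv_mac(window, kernel, bias):
    """Out = sum(in * kernel) + bias for one window position.
    The element-wise products model the multiplier bank; sum() models
    the adder tree that reduces them to a single output."""
    products = [p * k for row_p, row_k in zip(window, kernel)
                      for p, k in zip(row_p, row_k)]
    return sum(products) + bias

window = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[0, 1, 0],
          [1, 1, 1],
          [0, 1, 0]]
print(conv_mac(window, kernel, bias=1))  # 2+4+5+6+8 + 1 = 26
```

In the RTL all products are formed in the same cycle and reduced by the adder tree, which is what makes the result effectively instant compared with a sequential MAC loop.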
Window wrapper simulation results are described in Figure 17, which shows the pixel selection. As Figure 17a shows, according to kernel_x and kernel_y, the kernel window can slide around the whole image data and output the selected values, which have the same size as the kernel window, such as 5 × 5. Figure 17b shows the whole image data array with the selected windows.
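A behavioral sketch of this window selection, assuming kernel_x and kernel_y address the window's top-left pixel (the Python function and its parameters are illustrative stand-ins for the RTL signals):

```python
def select_window(image, kernel_x, kernel_y, k=5):
    """Return the k x k pixel window whose top-left corner is at
    (kernel_y, kernel_x), as the window wrapper outputs it."""
    return [row[kernel_x:kernel_x + k] for row in image[kernel_y:kernel_y + k]]

# 6 x 6 image, 5 x 5 window selected at kernel_x=1, kernel_y=0
image = [[r * 6 + c for c in range(6)] for r in range(6)]
win = select_window(image, kernel_x=1, kernel_y=0)
print(len(win), len(win[0]))  # 5 5
print(win[0])                 # [1, 2, 3, 4, 5]
```

Sweeping kernel_x and kernel_y over the valid range reproduces the sliding of the window across the whole image array.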
Max-pooling operation simulation results: Figure 18 shows the max-pooling operation results. When the enable signal is asserted, the comparator compares the 2 × 2 sub-region values and outputs the maximum value of each sub-region. Its calculation formula is given as in (5):

Out = MAX(input1, input2, input3, input4)  (5)

where Out represents the output value of the comparison, and input1, input2, input3, input4 are the four input values of each sub-region.

Fully connected operation simulation results: Figure 19 shows the FC module operation results. The FC block associates the input values with the output values in the present module. When the enable signal is high, it performs the matrix multiplication shown in (6):

sum_b = ∑ feature_pixels × weight + bias  (6)

where feature_pixels represents the input data of the fully connected module, weight represents the kernel weights value, bias is the bias value, and sum_b presents the output value of the fully connected module.

(a)

(b)

Figure 17. The simulation result of the max-pooling operation, where windows wrapper results are
shown in (a) Simulation result of window wrapper, while data of one selected image is shown
Sensors 2022, 22, x FOR PEER REVIEW 13 of in
18
(b) One image data with selected window.

Figure 18. The simulation result of the max-pooling operation.


Figure 19. The simulation result of the fully connected operation.

The top simulation results of the CNN system are described in Figure 20. Comparing the 10 outputs of the final fully connected layer, the maximum value can be found, which represents the final classification result. It can be achieved by (7):

Y = MAX(a1, a2, a3, a4, a5, a6, a7, a8, a9)  (7)

where Y is the final output of the classification results and a1–a9 are the fully connected operation results. Figure 20a–d shows the classification results when the input digit is 3, 5, 6, and 8, respectively.
separately.

(a) The top simulation


(a) result of digit 3.

(b) The top simulation


(b) result of digit 5.

Figure 20. Cont.


Sensors 2022, 22, 2459 14 of 18
Sensors 2022, 22, x FOR PEER REVIEW 14 of 18

(c) The top simulation result of digit 6.

(c)

(d) The top simulation result of digit 8.


Figure 20. The top simulation results of the proposed CNN system.

Figure 21 shows the timing consumption of different layers of the proposed CNN
system. According to this table, CONV3 has the highest processing time because in this
layer it has the largest number of kernels. So, the timing of loading weights and bias value
to this layer and the feature map (d) data loading time is the most costed. The total processing
timing is 8.6538 ms.
Figure 20. The
Figure 20. The top
top simulation
simulation results
results of
of the proposed CNN
the proposed system. (a)
CNN system. Thetop
(a) The topsimulation
simulation re- of
result
Figure 22 shows the layout results of the proposed CNN on the chip logic part. It is
sult of digit 3. (b) The top simulation result of digit 5. (c) The top simulation result of
digit 3. (b) The top simulation result of digit 5. (c) The top simulation result of digit 6. (d) The top
implemented on 28 nm process technology by design compiler and IC compiler. The syn-
digit 6. (d) The top
simulation simulation result of digit 8.
thesis arearesult of digit
is 3.16 mm 8. × 3.16 mm.
Figure 21 shows the timing consumption of different layers of the proposed CNN
system. According to this table, CONV3 has the highest processing time because in this
Timing Consumption
Execution Time (ms)

layer it has the largest number of kernels. So, the timing of loading weights and bias value
to this layer and the
3.5feature map data3.02683
loading time is the most costed. The total processing
timing is 8.6538 ms.3
2.31135
Figure 22 shows2.5 the layout results of the proposed CNN on the chip logic part. It is
1.67533
implemented on 28 2nm process technology by design compiler and IC compiler. The syn-
1.12616
thesis area is 3.16 1.5
mm × 3.16 mm.
1 0.51361
0.5
0
Execution
Time (ms)
CONV1+Pooling 2.31135
CONV2+Pooling 1.67533
CONV3 3.02683
FC1 1.12616
FC2 0.51361
CONV1+Pooling CONV2+Pooling
CONV3 FC1
FC2

Figure 21. The timing consumption of different layers.

The proposed CNN system is verified on the FPGA. Figure 23 shows the experimental setup for measuring the proposed CNN system: Figure 23a shows the block diagram of the measurement framework, and Figure 23b shows the actual lab verification setup. UART Data Logger is the software used for monitoring the activities of the ports. It monitors the data exchanged between the FPGA and an application via the external UART interface, and the result is analyzed for further research. The FPGA board is connected to the computer by a UART cable. After processing, the result is shown on the 7-segment display of the FPGA board.

Figure 22. The layout of the proposed CNN system.

Figure 23. The proposed CNN system measurement setup. (a) The proposed CNN system measurement setup block diagram. (b) Measurement setup on FPGA.
Table 1 shows the performance summaries. Compared to the other three works [9–11], firstly, we achieve the highest classification accuracy. Secondly, the on-chip memory size is relatively small due to adopting the methods of sharing the multiplier bank and adder tree, especially compared to [10], which has a smaller number of layers but a large on-chip memory size. Thirdly, the power consumption is relatively low compared to the other works that are also fully digital-based designs.

Table 1. Performance comparison.

Parameter          This Work   [17]                 [12]       [16]
Process (nm)       CMOS 28     CMOS 28              CMOS 65    CMOS 40
Architecture       Digital     Digital and Analog   Digital    Digital and Analog
Design Entry       RTL         -                    RTL        -
Frequency (MHz)    100         300                  550        204
CNN Model          6 layers    9 layers (CNN/MLP)   11 layers  -
Datasets           MNIST       CIFAR-10             MNIST      MNIST
V (V)              1.8         0.8                  1          0.55–1.1
Power (W)          2.93        0.000899             0.00012    25
Accuracy (%)       92          86.05                98         98.2
On-Chip Memory     10 Kb       2676 Kb              -          -
Off-Chip Memory    40 Kb       no                   -          -
Throughput (FPS)   5.33 k      -                    8.6 M      1 k
Chip Area (mm2)    9.986       5.76                 15         -

5. Conclusions
In the recent past, DNNs and CNNs have gained significant attention because of their high precision and throughput. In the field of biosensors, there is still a gap in terms of the rapid detection of diseases. In this paper, we presented a synthesizable RTL-based CNN architecture for disease detection by DNA classification. The adopted MAC technique optimizes the hardware system by decreasing the arithmetic calculation and achieves a quick output. Multiplier bank sharing among all the convolutional layers and fully connected layers significantly reduces the implementation area.
We trained and validated the proposed RTL-based CNN model on the MNIST handwritten dataset and achieved 92% accuracy. It is synthesized in 28 nm CMOS process technology and occupies 9.986 mm2 of synthesis area. The drawn power is 2.93 W from a 1.8 V supply. The total computation time is 8.6538 ms. Compared to the reference studies, our proposed design achieved the highest classification accuracy while maintaining less synthesis area and power consumption.

Author Contributions: Conceptualization, P.K. and H.Y.; methodology, P.K. and I.A.; software, P.K. and H.Y.; validation and investigation, P.K. and I.A.; resources, H.Y.; data curation, P.K.; writing—original draft preparation, P.K.; writing—review and editing, P.K. and K.-Y.L.; visualization, P.K.; supervision, K.-Y.L., I.A., Y.-G.P., K.-C.H., Y.Y., Y.-J.J., H.-K.H., S.-K.K. and J.-M.Y.; project administration, K.-Y.L.; funding acquisition, K.-Y.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korean government (MSIT) (No. 2021R1A4A1033424), and the Institute of Infor-
mation and communications Technology Planning and Evaluation (IITP) grant funded by the
Korean government (MSIT) (No.2019-0-00421, Artificial Intelligence Graduate School Program
(Sungkyunkwan University)).
Institutional Review Board Statement: Not applicable.


Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Justino, C.I.L.; Duarte, A.C.; Rocha-Santos, T.A.P. Recent progress in biosensors for environmental monitoring: A review. Sensors
2017, 17, 2918. [CrossRef]
2. Schackart, K.E., III; Yoon, J.-Y. Machine Learning Enhances the Performance of Bioreceptor-Free Biosensors. Sensors 2021, 21, 5519.
[CrossRef]
3. Cui, F.; Yue, Y.; Zhang, Y.; Zhang, Z.; Zhou, H.S. Advancing biosensors with machine learning. ACS Sens. 2020, 5, 3346–3364.
[CrossRef]
4. Nguyen, N.; Tran, V.; Ngo, D.; Phan, D.; Lumbanraja, F.; Faisal, M.; Abapihi, B.; Kubo, M.; Satou, K. DNA Sequence Classification
by Convolutional Neural Network. J. Biomed. Sci. Eng. 2016, 9, 280–286. [CrossRef]
5. Jin, X.; Liu, C.; Xu, T.; Su, L.; Zhang, X. Artificial intelligence biosensors: Challenges and prospects. Biosens. Bioelectron. 2020,
165, 112412. [CrossRef] [PubMed]
6. Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-based accelerators of deep learning networks for learning and classification: A
review. IEEE Access 2019, 7, 7823–7859. [CrossRef]
7. Farrukh, F.U.D.; Xie, T.; Zhang, C.; Wang, Z. Optimization for efficient hardware implementation of CNN on FPGA. In
Proceedings of the 2018 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA), Beijing,
China, 21–23 November 2018; pp. 88–89.
8. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA.
IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1354–1367. [CrossRef]
9. Guo, K.; Zeng, S.; Yu, J.; Wang, Y.; Yang, H. A survey of FPGA-based neural network interface accelerator. ACM Trans. Reconfig.
Technol. Syst. 2018, 12, 2.
10. Jiang, Z.; Yin, S.; Seo, J.S.; Seok, M. C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling
computing mechanism. IEEE J. Solid-State Circuits 2020, 55, 1888–1897. [CrossRef]
11. Ding, C.; Ren, A.; Yuan, G.; Ma, X.; Li, J.; Liu, N.; Yuan, B.; Wang, Y. Structured Weight Matrices-Based Hardware Accelerators in
Deep Neural Networks: FPGAs and ASICs. arXiv 2018, arXiv:1804.11239.
12. Kim, H.; Choi, K. Low Power FPGA-SoC Design Techniques for CNN-based Object Detection Accelerator. In Proceedings
of the 2019 IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference, New York, NY, USA,
10–12 October 2019.
13. Mujawar, S.; Kiran, D.; Ramasangu, H. An Efficient CNN Architecture for Image Classification on FPGA Accelerator. In
Proceedings of the 2018 Second International Conference on Advances in Electronics, Computers and Communications, Bangalore,
India, 9–10 February 2018; pp. 1–4.
14. Ghaffari, A.; Savaria, Y. CNN2Gate: An Implementation of Convolutional Neural Networks Inference on FPGAs with Automated
Design Space Exploration. Electronics 2020, 9, 2200. [CrossRef]
15. Kang, M.; Lee, Y.; Park, M. Energy Efficiency of Machine Learning in Embedded Systems Using Neuromorphic Hardware.
Electronics 2020, 9, 1069. [CrossRef]
16. Schuman, C.D.; Potok, T.E.; Patton, R.M.; Birdwell, J.D.; Dean, M.E.; Rose, G.S.; Plank, J.S. A Survey of Neuromorphic Computing
and Neural Networks in Hardware. arXiv 2017, arXiv:1705.06963.
17. Liu, B.; Li, H.; Chen, Y.; Li, X.; Wu, Q.; Huang, T. Vortex: Variation-aware training for memristor X-bar. In Proceedings of the 52nd
Annual Design Automation Conference, San Francisco, CA, USA, 8–12 June 2015; pp. 1–6.
18. Chang, J.; Sha, J. An Efficient Implementation of 2D convolution in CNN. IEICE Electron. Express 2017, 14, 20161134. [CrossRef]
19. Marukame, T.; Nomura, K.; Matusmoto, M.; Takaya, S.; Nishi, Y. Proposal analysis and demonstration of Analog/Digital-mixed
Neural Networks based on memristive device arrays. In Proceedings of the 2018 IEEE International Symposium on Circuits and
Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5.
20. Bankman, D.; Yang, L.; Moons, B.; Verhelst, M.; Murmann, B. An Always-On 3.8 µJ/86% CIFAR-10 Mixed-Signal Binary CNN
Processor with All Memory on Chip in 28 nm CMOS. IEEE J. Solid-State Circuits 2018, 54, 158–172. [CrossRef]
21. Indiveri, G.; Corradi, F.; Qiao, N. Neuromorphic Architectures for Spiking Deep Neural Networks. In Proceedings of the 2015
IEEE International Electron Devices Meeting (IEDM), Washington, DC, USA, 7–9 December 2015; pp. 421–424.
22. Chen, P.; Gao, L.; Yu, S. Design of Resistive Synaptic Array for Implementing On-Chip Sparse Learning. IEEE Trans. Multi-Scale
Comput. Syst. 2016, 2, 257–264. [CrossRef]
23. Hardware Acceleration of Deep Neural Networks: GPU, FPGA, ASIC, TPU, VPU, IPU, DPU, NPU, RPU, NNP and Other Letters.
Available online: https://itnesweb.com/article/hardware-acceleration-of-deep-neural-networks-gpu-fpga-asic-tpu-vpu-ipu-
dpu-npu-rpu-nnp-and-other-letters (accessed on 12 March 2020).
24. Pedram, M.; Abdollahi, A. Low-power RT-level synthesis techniques: A tutorial. IEE Proc. Comput. Digit. Tech. 2005, 152, 333–343.
[CrossRef]
25. Ahn, M.; Hwang, S.J.; Kim, W.; Jung, S.; Lee, Y.; Chung, M.; Lim, W.; Kim, Y. AIX: A high performance and energy efficient
inference accelerator on FPGA for a DNN-based commercial speech recognition. In Proceedings of the 2019 Design, Automation
& Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019; pp. 1495–1500.
26. Krestinskaya, O.; James, A.P. Binary Weighted Memristive Analog Deep Neural Network for Near-Sensor Edge Processing. In
Proceedings of the 2018 IEEE 18th International Conference on Nanotechnology (IEEE-NANO), Cork, Ireland, 23–26 July 2018;
pp. 1–4.
27. Hasan, R.; Taha, T.M. Enabling Back Propagation Training of Memristor Crossbar Neuromorphic Processors. In Proceedings of
the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 21–28.
28. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing fpga-based accelerator design for deep convolutional neural
networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey,
CA, USA, 22–24 February 2015; pp. 161–170.
29. Yann, L. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on
1 January 2022).
