Design Possibilities and Challenges of DNN Models

Design possibilities and challenges of DNN models: a review on the perspective of end devices
Article in Artificial Intelligence Review · January 2022

DOI: 10.1007/s10462-022-10138-z

Hanan Hussain1 · P. S. Tamizharasan1 · C. S. Rahul1
© The Author(s), under exclusive licence to Springer Nature B.V. 2022
Abstract
Deep Neural Network (DNN) models for both resource-rich environments and resource-constrained devices have become abundant in recent years. As of now, the literature on the different available options for the design, development, and deployment of DNN models on resource-constrained end devices is limited and demands extensive further study. This
paper reviews vital research efforts for the design of DNN models while deploying them
at the end devices such as smart cameras for real-time object detection tasks. The design
ideas include the types of DNN models, hardware and software requirements for the devel-
opment, resource constraints imposed by the computing devices, and the optimization
techniques required for the efficient processing of DNN. The study also aims to conduct
a systematic literature review on current trends in different real-time applications of DNN
models and explores the following four dimensions: (1) DNN model perspective: to asso-
ciate appropriate DNN models with the proper hardware to achieve optimal throughput.
(2) Hardware perspective: to answer different available options in hardware platforms for
achieving on-device intelligence. (3) Resources and optimization perspective: to analyze
the type of resource limitations in hardware platforms and the use of optimization tech-
niques to overcome the performance issues. (4) Application perspective: to understand the
real-time uses of DNN models in different application domains. This work also explores
different performance measures that need to be considered for on-device intelligence and
provides possible future directions for the challenges reviewed.
Keywords Deep learning · DNN · Lightweight models · Optimization · Hardware devices · On-device intelligence · Edge inference
1 Introduction
Artificial intelligence (AI) technologies have evolved to attain state-of-the-art performance in different sectors of society, academia, and industry. As reported on Google Trends, AI technologies involving deep learning algorithms have gained much research interest in the past 10 years. As seen in Fig. 1, the chart reaches 100, which shows the peak popularity of the search.
* Hanan Hussain
p20200005@dubai.bits-pilani.ac.in
1 Department of Computer Science, BITS Pilani, Dubai Campus, Dubai, United Arab Emirates
The breakthrough in deep neural network (DNN) models is due to their ability to solve a myriad of challenging applications, such as (a) natural language processing, involving complex machine translation techniques that convert text from one language to another (Bahdanau et al. 2015); (b) speech and signal processing, including extracting features and transforming signals for training deep neural networks, where the signal data are analyzed for classification and prediction (Amodei et al. 2016); and (c) digital image processing (DIP), including medical imaging (Erickson 2019), which covers three subcategories: (i) object detection (He et al. 2016), (ii) semantic segmentation (Tanzi et al. 2021), and (iii) object classification (Tanzi et al. 2020). Recent results in medical video processing show a very significant advantage from the exploitation of deep learning strategies (Twinanda et al. 2017). Miscellaneous applications include deep learning for encrypting data, signature generation (Zitar et al. 2020), and stock-price prediction (Mohamed et al. 2021). All these challenging tasks that DNN models address are computationally expensive and resource-intensive.
Many kinds of DNN models are available in the literature, including (a) convolutional neural networks (CNN), (b) recurrent neural networks (RNN), (c) generative adversarial networks (GAN), and (d) stacked autoencoders (SAE). The current success of DNN models is due to the following factors: (a) the availability of big datasets to train DNN models, (b) the fast development of DNN algorithms, and (c) the availability of powerful computers to model and train complex neural networks with accelerated performance and results (Goodfellow et al. 2016).
The DNN models which are suitable for one application may not be suitable for
another. For example, CNNs perform better with computer vision applications, whereas
RNNs perform well in natural language processing (NLP) applications. Similarly, the
performance of DNN models varies from one hardware platform to another. CPUs are optimized for sequential computation and are therefore better suited to executing RNNs, whose strong sequential dependencies make them perform poorly on GPUs. GPUs, however, are well optimized for parallel computation, and hence CNNs can be run on them effectively (Chaber and Ławryńczuk 2018).
Fig. 1 Deep learning interest (Google trends)
Optimization is essential not only for the resource-rich environment but also for
resource-constrained environments (Han et al. 2015). Several optimization strategies help to enable on-device inference on hardware devices. Consider a situation of anomaly
detection, in which a surveillance camera must capture the footage of its vicinity continu-
ously. In this case, it is better not to send hundreds of frames to high-performing computers
(e.g., on the cloud) and perform classification tasks due to wastage of resources like storage
and transmission costs. Instead, it is more practical to utilize an on-device inference model,
where the pre-trained model can run on the security camera. Undesirably, the attained opti-
mization comes at the cost of a reduction in model accuracy (Song et al. 2017).
Based on the systematic literature review, this paper explores and analyzes the trends,
techniques, and tools in the above-discussed dimensions like (a) DNN model’s perspective,
(b) the hardware perspective, (c) optimization and resource perspective, and finally (d) the
application perspective.
The contributions of this paper are summarized as follows:
(a) This work provides a systematic literature review (SLR) on the design possibilities of
DNN models in lightweight devices / end devices.
(b) It incorporates different hardware and software options for the development of efficient
DNN models.
(c) The review explores different optimization techniques adopted for the DNN models.
(d) It also discusses the resource limitations (computing speed, memory, and energy
requirements) imposed by the hardware devices.
(e) The work extends over the applications of DNN models on different hardware devices
and their comparative analysis.
(f) Four research questions in four different perspectives are defined and answered after
the detailed analysis.
(g) Top ten challenges and possible solutions found in the on-device intelligence are dis-
cussed along with the future scope of the research in this field.
(h) The review also includes the experimental setup for achieving on-device intelligence
and conclusions.
The overall structure of the review is given in Fig 2. It was also found that the proposed
work adds to the existing works by adding the complete background study.
2 Related works
In this section, we consider survey articles with similar goals published in the last two years (2019 – present). A summary of these surveys is given below.
Chen et al. (2020) discuss different DNN accelerators (on-chip, stand-alone, emerging memories, and applications) and give insight into the development of future DNN accelerators. The study conducted by Li and Liewig (2020) gives a detailed explanation of hardware and software criteria for AI accelerators along with some hardware examples. It also covers the current trends in this domain. Similar work was conducted by Talib et al. (2020) on DNN accelerators built on different hardware platforms such as GPUs, ASICs, and FPGAs, along with their accelerator designs and examples.
Fig. 2 Structure of the review paper
Among the reviews written from the perspective of DNNs on edge devices, Wang et al. (2020a) present a comprehensive survey of different application domains in edge computing and also provide some research challenges and directions. Véstias (2019) studied the features of CNNs, the available reconfigurable hardware options and trends, and the challenges faced during implementation. Véstias et al. (2020b) give a broad overview of the types of DNN models, their optimization techniques, computing devices, and applications, and also discuss how to learn and infer DNN models on the edge.
The resource constraints and their issues while implementing on-device learning along
with the theoretical aspects of heuristic estimation of resource requirements and challenges
are studied by Dhar et al. (2019). The optimization techniques and the building of DNN for
low power devices using pruning, quantization, neural architecture search, and Knowledge
distillation are given in the review paper by Goel et al. (2020). Similar optimization tech-
niques from a hardware perspective are discussed by Capra et al. (2020). The development
of DNNs based on architectural innovations, learning, activation and loss functions, opti-
mization, and regularization methods are examined by Khan et al. (2020), whereas Choud-
hary et al. (2020) examined different methods for compressing and accelerating AI models,
along with the suggestion on challenges and future directions in this field.
The remaining reviews comprehensively discuss the fundamentals of deep learning and edge computing along with their optimization, training, inference, and challenges (Wang et al. 2020c), as well as DNN models with their hardware designs, software design, and optimizations, along with their future trends (Song et al. 2020). Marchisio et al. (2019) explored the current trends of deep learning for edge computing, different cross-layer optimizations, a case study, and other open research challenges.
Each paper has a different scope, covering some of the following: types of DNN models, hardware and software requirements, optimization techniques, resource constraints, etc. The scopes of the review papers are compared with the contribution of our paper in Table 1. Additionally, Table 2 gives the summary of these surveys. The subsequent section discusses the formulated research questions and the methodology of the search strategy for completing the final review process.
Table 1 Scope of the similar surveys compared to our paper
References | Types of DNN models | Hardware requirements | Software requirements | Optimization techniques adopted | Resource constraints | Applications | Challenges faced | Future scope
Chen et al. (2020) | – | ✓ | – | – | – | – | – | ✓
Li and Liewig (2020) | – | ✓ | ✓ | – | – | ✓ | – | ✓
Talib et al. (2020) | ✓ | – | – | – | ✓ | – | ✓
Wang et al. (2020) | – | – | – | – | – | ✓ | ✓ | ✓
Véstias (2019) | ✓ | ✓ | – | – | – | – | ✓ | ✓
Dhar et al. (2019) | – | – | – | – | ✓ | – | ✓ | ✓
Goel et al. (2020) | – | – | – | ✓ | – | – | – | –
Capra et al. (2020) | ✓ | ✓ | – | ✓ | ✓ | ✓ | – | –
Khan et al. (2020) | ✓ | – | – | ✓ | – | ✓ | ✓ | ✓
Choudhary et al. (2020) | ✓ | – | – | ✓ | – | – | ✓ | ✓
Wang et al. (2020) | ✓ | ✓ | ✓ | – | – | – | ✓ | –
Song et al. (2020) | – | ✓ | ✓ | ✓ | – | – | – | ✓
Marchisio et al. (2019) | – | – | – | ✓ | – | – | ✓ | –
Véstias et al. (2020b) | ✓ | – | – | ✓ | – | ✓ | – | ✓
Our Paper | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Table 2 List of recent survey papers and their summaries
Publication year | Survey title | Summary and description
2020 | A survey of accelerator architectures for deep neural networks | The survey paper discusses different DNN accelerators (on-chip, stand-alone, emerging memories, and applications). They also give an insight into future DNN accelerators (Chen et al. 2020).
2020 | A survey of AI accelerators for edge environment | The scope of the paper involves hardware and software criteria for AI accelerators along with some hardware examples and current trends (Li and Liewig 2020).
2020 | A systematic literature review on the hardware implementation of artificial intelligence algorithms | This paper focuses on the DNN models built on different hardware platforms like GPU, ASIC, and FPGAs along with their accelerator designs and examples (Talib et al. 2020).
2020 | Deep learning for edge computing applications: A state-of-the-art survey | The paper presents a comprehensive survey on four domains of application in edge computing (multimedia, transportation, smart city, and industry). It also provides some of the research challenges and directions (Wang et al. 2020).
2019 | A survey of convolutional neural networks on edge with reconfigurable computing | The paper describes the features of CNN, available reconfigurable hardware options and trends, and challenges faced during implementation (Véstias 2019).
2020 | On-device machine learning: An algorithms and learning theory perspective | This paper deals with resource constraint issues while implementing on-device learning, their theoretical aspects, and their challenges (Dhar et al. 2019).
2020 | A survey of methods for low-power deep learning and computer vision | The survey explains different techniques for building DNN for low power devices and their optimization including pruning, quantization, NAS and knowledge distillation (Goel et al. 2020).
2020 | An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks | The paper provides a brief overview of DNN architectures, hardware accelerators, and their architectures. Some of the optimization techniques adopted are also discussed (Capra et al. 2020).
2020 | A survey of the recent architectures of deep convolutional neural networks | This paper discusses the development of DNNs based on architectural innovations, learning, activation and loss functions, optimization and regularization methods (Khan et al. 2020).
2020 | A comprehensive survey on model compression and acceleration. | The paper presents a survey of different methods for compressing and accelerating AI models, along with the challenges and future directions in this field (Choudhary et al. 2020).
2019 | Convergence of edge computing and deep learning: A comprehensive survey | The authors give a comprehensive survey on the fundamentals of deep learning and edge computing along with their optimization, training, inference and challenges (Wang et al. 2020).
2020 | A survey of neural network accelerator with software development environments | The review focuses mainly on DNN and its hardware designs. It also considers software design and its optimizations along with their future trends (Song et al. 2020).
2019 | Deep learning for edge computing: current trends, cross-layer optimizations, and open research challenge | This review gives a current trend of deep learning for edge computing, different cross-layer optimizations and case study, and other open research challenges (Marchisio et al. 2019).
2020 | Moving deep learning to the edge | The authors give a broad overview of types of DNN models, their optimization techniques, computing devices, and applications. They also discuss how to learn and infer DNN models on edge (Véstias et al. 2020b).
3 Methodology
The objective of this survey is to answer the following research questions (RQ) related to
the development of DNN models for different hardware platforms.
The research questions are:
– RQ 1: DNN model perspective: What are the DNN models suitable for different types
of hardware? This question associates appropriate DNN models with the right hardware
to achieve optimal throughput.
– RQ 2: Hardware perspective: What are the suitable hardware platforms for achieving
on-device intelligence? This question gives different available options in hardware
resources for edge-level inference.
– RQ 3: Resources and Optimization perspective: What are the resource trade-offs
encountered while deploying DNN models to hardware platforms and how does the
optimization technique help? This question analyzes the type of resource constraints in
hardware platforms and optimization techniques to overcome the performance issues.
– RQ 4: Application perspective: What are the main application domains that utilize
DNN models in different hardware models? This question is to understand the real-time
uses of DNN models in different domains.
3.1 Search strategy
This review covers empirical research published as peer-reviewed articles written in English during the last few years. A four-step procedure, as shown in Fig. 3, was used to search for relevant literature.
1. Identification: The initial step involves a database search to identify related papers in popular electronic databases and libraries such as SpringerLink, IEEE Xplore, MDPI, ScienceDirect, ACM Digital Library, etc. Relevant keywords, or combinations of the keywords given in Table 3, were used to narrow the results. This step identified 204 research articles from the different libraries.
2. Screening: The second step removes 34 duplicate documents to leave 170 selected papers. A further 16 papers were discarded due to inaccessibility.
3. Eligibility: The remaining 154 articles were studied carefully for eligibility. Of these, 14 articles did not deal with the deep learning domain and were disregarded, leaving 140 papers. Further analysis removed four non-Scopus articles to yield 136 eligible and fully accessible papers.
4. Inclusion: The last step involves a snowball strategy of selecting papers through the references of the eligible papers, which yielded 13 more articles for a final set of 149 articles. They were then sent to quality assessment for further analysis.
Fig. 3 The flow diagram summary of the database search and publications included for systematic reviews. Source: Modified from Moher et al. (2009)
Table 3 Keywords used for the selection of relevant papers from different electronic databases
Electronic databases | Keywords used for the search
SpringerLink | Lightweight DNN models; On-device inference; DNN models on end devices; Edge intelligence; Image processing on light-weighted devices; Hardware-software co-design of DNN.
IEEE Xplore | DNN accelerators; FPGA accelerators; ASIC accelerators; Optimization of DNN models; Tiny ML algorithms; Performance measurements of edge DNN models.
MDPI | Deep learning in the resource-constrained environment; Deep learning in resource-rich environment; Applications of end devices; Benchmarking DNN models.
ScienceDirect | On-device intelligence; Edge-AI; Edge computation; Memory optimization on end devices; Signal processing on edge; Memory optimization on light weight devices.
ACM Digital Library | On-device learning; Edge inference optimization; Benchmarking models for DNN; Profiling DNN resources; Quantization and pruning in neural networks; Neural network compression.
Fig. 4 Collected research papers
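The counts reported in the four-step selection procedure can be cross-checked with simple arithmetic. The following is a minimal sketch; the variable names are illustrative and are not taken from the review, while the numbers are those stated in the steps.

```python
# Paper counts as reported in the four-step selection procedure.
identified = 204                       # step 1: records found in the databases
after_duplicates = identified - 34     # step 2: duplicate documents removed
after_access = after_duplicates - 16   # step 2: inaccessible papers discarded
after_domain = after_access - 14       # step 3: papers outside the deep learning domain
eligible = after_domain - 4            # step 3: non-Scopus articles removed
final_set = eligible + 13              # step 4: papers added through snowballing

stages = {
    "identified": identified,
    "after duplicates": after_duplicates,
    "after access check": after_access,
    "after domain check": after_domain,
    "eligible": eligible,
    "final set": final_set,
}
for stage, count in stages.items():
    print(f"{stage}: {count}")
```

Each intermediate value matches the figure quoted in the corresponding step.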
3.1.1 Quality assessment rules
The quality assessment rules (QARs) are applied to ensure the proper evaluation of each research paper's quality. A set of ten questions is formulated, each worth a maximum of one point. A question that is fully answered receives 1 point, and one that is not answered receives 0 points. Intermediate marks are "above average" = 0.75, "average" = 0.5, and "below average" = 0.25. The sum of the marks obtained for the 10 QARs is the score of each article. If the score is 5 or higher, the article is considered; otherwise it is excluded. The formulated QARs are given below:
1. QAR1: Does the paper relate to any of the four research questions defined?
2. QAR2: Does the paper specify the research objectives?
3. QAR3: Does the paper describe the methodology they adopted?
4. QAR4: Does the methodology include practical experiments of the proposed tech-
nique?
5. QAR5: Are the experiments well designed with detailed explanation?
6. QAR6: Does the paper specify any specific real-time application?
7. QAR7: Does the paper report any performance metrics similar to accuracy, throughput,
memory consumption?
8. QAR8: Is the proposed model design compared to other existing models?
9. QAR9: Are the methods and results obtained analyzed and justified?
10. QAR10: Does the overall study contribute significantly towards the DNN’s area of
research?
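The scoring scheme described for the QARs can be sketched as follows. This is only an illustration of the stated rule (ten marks from the set 0, 0.25, 0.5, 0.75, 1, with inclusion at a total of 5 or higher); the function names and the example marks are hypothetical and do not come from the review.

```python
# Marks allowed per question, as listed in the quality assessment rules.
ALLOWED_MARKS = {0.0, 0.25, 0.5, 0.75, 1.0}
NUM_QARS = 10
THRESHOLD = 5.0  # an article is considered if its total score is 5 or higher


def qar_score(marks):
    """Return the total score of one article from its ten per-question marks."""
    if len(marks) != NUM_QARS:
        raise ValueError(f"expected {NUM_QARS} marks, got {len(marks)}")
    for mark in marks:
        if mark not in ALLOWED_MARKS:
            raise ValueError(f"invalid mark: {mark}")
    return sum(marks)


def is_included(marks):
    """Apply the inclusion rule: keep the article if the score reaches the threshold."""
    return qar_score(marks) >= THRESHOLD


# Hypothetical marks for QAR1..QAR10 of a single article.
example_marks = [1, 0.75, 1, 0.5, 0.5, 0, 1, 0.25, 0.5, 0.75]
print(qar_score(example_marks), is_included(example_marks))
```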
All the 149 selected papers cleared the quality assessment test. Hence, they were then stud-
ied carefully to perform a systematic literature review (SLR).
Among them, a subset of twenty-five papers was taken for the ’comparative analysis’,
to understand the trend of using the different deep learning models on various hardware
platforms.
The timeline of the selected papers and the total number of papers extracted in each year for the review are shown in Fig. 4. Papers published from 2010 to the early months of 2021 are considered in order to analyze recent trends. There is a significant increase in publications since 2016 due to the massive advancement in realizing AI algorithms on different hardware platforms. A total of 149 papers were utilized to compile this review. The bars in Fig. 4 represent journals, conferences, and others, including books. Of the 149 papers, 80 are peer-reviewed papers from high-quality journals, 67 are conference papers, and the remaining 2 are taken from other sources.
3.2 Motivation
The motivation for this work is to lay a strong background for readers who are mainly interested in on-device intelligence (DNN models deployed on lightweight devices). DNN models are usually executed on powerful GPUs in the cloud due to their computation and resource requirements. Even though edge inference models are significantly fewer, there is considerable demand for executing DNN models at the edge rather than in the cloud (Wang et al. 2020). The following reasons favor on-device intelligence.
– Memory constraints: Memory-constrained environments include (i) embedded systems, (ii) edge devices, on which lightweight CNNs perform well, and (iii) hardware computing devices such as FPGAs and ASICs, which have less than 10 MB of on-chip memory and nearly zero off-chip storage.
– Low Latency: Since DNN models are deployed on devices, processing does not require
any additional sending of data; hence they are speedy and preferred for time-critical
applications.
– Reliability: When network connections are lost, there are chances of failure in cloud
computing, whereas no such scenarios occur in on-device inference. Hence it can be
considered more reliable.
– Cost: Training and Inference on the cloud requires a massive amount of data to be
transmitted, thus consuming more network bandwidth. On-device intelligence does not
require such transmission.
– Privacy and security: Compared to the cloud, the edge is more efficient and effective in
protecting users’ data privacy and security. Privacy issues are critical to areas such as
smart homes and cities.
4 Background study
This section explores the major DNN models present in the literature with their exam-
ples, followed by the hardware platforms and software options available for the research.
It also contains different widely used optimization techniques and examples along with the
resource constraint criteria imposed by light-weighted devices. The subsequent sections
discuss the application of DNN models at edge level, followed by the comparative analysis
of the applications of DNN models implemented on light devices and their impact on per-
formance metrics.
4.1 Types of DNN models
The Deep learning model consists of various deep neural networks. Some of the prominent
models and techniques are the following:
4.1.1 Fully connected neural network (FCNN)
FCNNs are an extension of the traditional artificial neural network (ANN). ANNs have a shallow structure compared to other DNNs, whereas FCNNs usually have a deeper layer structure for performing complicated learning tasks. As shown in Fig. 5, they have three types of layers: an input layer, multiple hidden layers, and an output layer. The neurons in one layer are connected to the neurons in the next layer, but no neurons are connected to each other within the same layer. The output of each layer is fed to the next layer through activation functions. Many activation functions are available in the literature; some of the popular ones are the binary step function, linear function, sigmoid function, hyperbolic tangent function, rectified linear unit (ReLU), leaky ReLU, parameterized ReLU, and exponential linear unit. At the output layer, the result of the model prediction is produced. The computation of the FCNN's output proceeds according to the following steps. Let $w_{ji}^{k}$ be the weight for perceptron $i$ in layer $k$ for incoming node $j$, $o_i^{k}$ be the output of node $i$ in layer $k$, $b_i^{k}$ be the bias for perceptron $i$ in layer $k$, and $g$ be the activation function.
(a) Initialize the input layer $l_0$: set the outputs $o_i^{0}$ of the nodes in the input layer to their associated inputs in the vector $X = (x_1, \ldots, x_n)$; that is, $o_i^{0} = x_i$.
(b) Calculate the product sums and outputs of each hidden layer in order from $l_1$ to $l_m$, assuming $m$ hidden layers and $r_k$ nodes in layer $k$. For $k$ from 1 to $m$, and $i$ from 1 to $r_k$, compute
$$h_i^{k} = b_i^{k} + \sum_{j=1}^{r_{k-1}} w_{ji}^{k}\, o_j^{k-1} \qquad (1)$$
$$o_i^{k} = g(h_i^{k}) \qquad (2)$$
(c) Compute the output $y$ of the first node of the output layer $l_m$:
$$h_1^{m} = b_1^{m} + \sum_{j=1}^{r_{m-1}} w_{j1}^{m}\, o_j^{m-1} \qquad (3)$$
$$o_1^{m} = g(h_1^{m}) \qquad (4)$$
Fig. 5 Fully Connected Neural Network (FCNN)
The model is fitted to a particular dataset with the help of training algorithms like sto-
chastic gradient descent (SGD), Levenberg-Marquardt algorithm, conjugate gradient, etc.
The learning consists of continuously updating the values of weights and bias to minimize
the mean squared error (MSE). These types of DNNs are high in complexity and low in
performance and convergence. Hence, they can be mainly used for feature extractions, and
approximation (Cho and Jang 2020).
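The layer-by-layer computation of Eqs. (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the 2-3-1 network, its weights and biases, and the choice of the sigmoid as the activation function g are all assumptions made for the example.

```python
import math


def sigmoid(h):
    """One possible choice for the activation function g."""
    return 1.0 / (1.0 + math.exp(-h))


def forward(x, weights, biases, g=sigmoid):
    """Forward pass of a fully connected network.

    weights[k][j][i] is the weight from node j of the previous layer to
    node i of the current layer; biases[k][i] is the bias of node i.
    """
    o = list(x)  # outputs of the input layer equal the inputs
    for w, b in zip(weights, biases):
        # Eq. (1): bias plus weighted sum of the previous layer's outputs
        h = [b[i] + sum(w[j][i] * o[j] for j in range(len(o)))
             for i in range(len(b))]
        # Eq. (2): apply the activation function
        o = [g(h_i) for h_i in h]
    return o


# Hypothetical 2-3-1 network (values chosen only for illustration).
weights = [
    [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],  # 2 inputs -> 3 hidden nodes
    [[0.7], [-0.8], [0.9]],                # 3 hidden nodes -> 1 output node
]
biases = [[0.0, 0.1, -0.1], [0.2]]

y = forward([1.0, 2.0], weights, biases)[0]  # Eqs. (3)-(4): first output node
print(y)
```

Passing `g=lambda h: h` turns the network into a purely linear map, which makes the weighted sums easy to verify by hand.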
4.1.2 Convolutional neural network (CNN)
CNNs are mainly designed to process data that comes in multidimensional arrays, such as images. The basic structure of a CNN, given in Fig. 6, consists of two main parts: (a) feature extraction, where convolution is used to extract the characteristics of the input images, and (b) classification, in which the learned features are used to predict the class of the given image. The different types of layers given below are stacked to form a CNN, and each layer has its own function:
1. Convolutional layer: The convolution operation is a computation between the input image and a filter of a given size (M × M). The filter slides over the image, computing the dot product at each position to obtain feature maps. At this stage, the feature maps give an idea of the corners and edges of the input image. They are then fed to the next layer to learn the high-level features of the image.
2. Pooling layer: The pooling operation reduces the size of the convolved feature map, thereby reducing the computational cost. There are different types of pooling operations. In max pooling, the largest value within a predefined (N × N) selection of the feature map is taken, whereas in average pooling, the average value of the elements within the selection is computed and passed on to the next layer. The pooling layer acts as a bridge between the convolutional and fully connected layers.
3. Fully connected (FC) layers: These are usually the last few layers of a CNN architecture. They consist of neurons, weights, and biases that connect the different layers. The input to the FC layer is always flattened before it is fed in. The classification of the image takes place at the end of the FC layers.
4. Additional layers: These include dropout layers, which reduce over-fitting. Some neurons are dropped from the neural network during training, to reduce the size of the model. The drop ratio can be set manually (usually between 0.3 and 0.6). Other layers include batch-normalization layers, activation layers, depth concatenation layers, etc.
In general, CNNs can extract features automatically with reduced complexity as compared to FCNNs (Zeiler and Fergus 2014).
Fig. 6 The basic structure of a CNN
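The convolution and pooling operations described above can be illustrated with a small pure-Python sketch. It assumes a single-channel image, stride 1 with no padding for the convolution, and non-overlapping windows for pooling; the 6 × 6 image and the vertical-edge filter are hypothetical values chosen so that the result is easy to follow. As in most deep learning frameworks, the filter is applied without flipping (cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide an M x M kernel over the image (stride 1, no padding)."""
    m = len(kernel)
    rows = len(image) - m + 1
    cols = len(image[0]) - m + 1
    return [
        [sum(image[r + a][c + b] * kernel[a][b]
             for a in range(m) for b in range(m))
         for c in range(cols)]
        for r in range(rows)
    ]


def pool2d(fmap, n, mode="max"):
    """Non-overlapping N x N pooling; mode is "max" or "avg"."""
    agg = max if mode == "max" else (lambda window: sum(window) / len(window))
    pooled = []
    for r in range(0, len(fmap) - n + 1, n):
        row = []
        for c in range(0, len(fmap[0]) - n + 1, n):
            window = [fmap[r + a][c + b] for a in range(n) for b in range(n)]
            row.append(agg(window))
        pooled.append(row)
    return pooled


# Hypothetical 6 x 6 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hypothetical 3 x 3 filter that responds to vertical edges.
kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = conv2d_valid(image, kernel)   # 4 x 4 feature map
max_pooled = pool2d(feature_map, 2, "max")  # 2 x 2 after max pooling
avg_pooled = pool2d(feature_map, 2, "avg")  # 2 x 2 after average pooling
print(feature_map, max_pooled, avg_pooled)
```

The feature map is non-zero only in the columns around the edge, and pooling halves each spatial dimension.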
4.1.3 Recurrent neural network (RNN)
RNNs are built to handle sequential data: each neuron receives data from the previous layer and from its own previous output. Even though RNNs are good at predicting sequential data, they suffer from gradient explosion, which occurs when the gradients keep growing during the execution of the backpropagation algorithm. This causes very large weight updates and makes gradient descent diverge. The shortcoming is handled by enhancing the RNN architecture with a gate structure and a memory cell that control the flow of information, forming the long short-term memory (LSTM) network (Sak et al. 2014). In the LSTM model, the forget gate operates on the cell state and picks what to hold in memory. During training, the stored details in the memory cells are not affected, which helps in attaining better performance. LSTMs and RNNs are widely used in applications such as natural language processing (NLP) and human activity recognition (HAR).
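The growth of gradients during backpropagation through time can be illustrated with a deliberately simplified model. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1} + x_t, which is not an architecture discussed in this review; it is used only because the gradient of the last state with respect to the first is then exactly w raised to the number of time steps.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for h_t = w * h_{t-1} + x_t."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # the chain rule contributes one factor of w per time step
    return grad


for w in (0.5, 1.0, 1.5):
    print(w, gradient_through_time(w, 50))
```

With a recurrent weight above 1 the gradient grows exponentially with the sequence length, which is the explosion described above; with a weight below 1 it shrinks towards zero, the opposite regime commonly called the vanishing gradient.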
4.1.4 Transformer networks
Transformers are multi-layered architectures constructed by stacking transformer blocks on top of one another (Vaswani et al. 2017). Fig. 7 shows the main mechanisms incorporated in a transformer architecture: a multi-head self-attention process, a position-wise feed-forward neural network (FFNN), layer normalization (LN) modules, and residual connections (RC). The input to the transformer is a tensor of shape B × N, where B and N are the batch size and sequence length, respectively. Initially, the inputs are fed to an embedding layer to form a tensor of shape B × N × D, where D is the embedding dimension. It is then additively combined with the positional encodings and fed through a multi-headed self-attention module. The positional encodings are sinusoidal inputs added to the trainable embeddings. The inputs and outputs of the multi-headed self-attention module are coupled by RCs and an LN. The output of the multi-headed self-attention module is then passed to a two-layered FFNN, whose inputs and outputs are connected in a residual fashion with layer normalization.
13
Fig. 7 Working of the Transformer network (Vaswani et al. 2017)
In summary, each transformer block can be represented as:

XA = Layer Norm(Multihead Self Attention(X)) + X (5)
XB = Layer Norm(Position FFNN(XA)) + XA (6)

where X is the input given to the transformer block and X_B is the output of the transformer
block. The applications of transformers include (a) classification using the encoder module,
(b) language modeling using the decoder module, and (c) machine translation with the
encoder-decoder module.
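The two equations of the transformer block can be traced in a few lines of code. The following
is a minimal sketch only, written against Eqs. (5) and (6) as stated in this section; it assumes
a single attention head, omits biases and the learned scale and shift of layer normalization,
and handles one sequence of N tokens rather than a batch. All function names and the weight
matrices (wq, wk, wv, w1, w2) are hypothetical and chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalize every token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention on an N x D input."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)

def position_ffnn(x, w1, w2):
    """Two-layer feed-forward network with ReLU, applied to each position."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def transformer_block(x, wq, wk, wv, w1, w2):
    x_a = add(layer_norm(self_attention(x, wq, wk, wv)), x)  # Eq. (5)
    x_b = add(layer_norm(position_ffnn(x_a, w1, w2)), x_a)   # Eq. (6)
    return x_b

# Illustrative input: N = 3 tokens with D = 2 features, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = transformer_block(x, identity, identity, identity, identity, identity)
```

The output keeps the N × D shape of the input, which is what allows blocks to be stacked on top
of one another.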
4.1.5 Auto‑encoder (AE)
An auto-encoder is an artificial neural network trained to produce an output similar to its
input. It has two parts, namely an encoder and a decoder. The encoder extracts the n input
features and converts them into m coded features, where n and m are the numbers of neurons
in the input and hidden layers respectively. The decoder reconstructs the n inputs from the m
features.
In Fig. 8, let X be the input vector with features (x1, x2, ..., xn). The output obtained
from the hidden layer after the encoding phase, Y = (y1, y2, ..., ym), is given by the
following equations. Here w is the weight or kernel vector, and b is the bias
value.
\[ f(x) = Y = \xi(wX + b) \tag{7} \]
\[ f(x) = \frac{1}{1 + e^{-x}} \tag{8} \]
Reconstructed input after the decoding phase is obtained by using the equation:
\[ f(Y) = Z = \zeta(wY + b) \approx X \tag{9} \]
The encoding algorithm reduces the reconstruction error between the input X and the output
Z. It also helps to attain a set of parameters for the encoding and decoding phases. J(X, Z)
represents the reconstruction error, xi represents the input value, and zi represents the
output obtained.
\[ J(X, Z) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - z_i \rVert^2 \tag{10} \]
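The encoding, decoding, and error computations above can be sketched as follows. This is an
illustrative example rather than a reference implementation: it assumes a sigmoid activation in
both phases and separate weight matrices for the encoder and decoder (the equations use the same
symbols w and b for both), and every weight value below is made up for the demonstration.

```python
import math

def sigmoid(v):
    """Logistic activation, as in Eq. (8)."""
    return 1.0 / (1.0 + math.exp(-v))

def dense(weights, bias, vec):
    """Compute sigmoid(W . vec + b) for one fully connected layer."""
    return [sigmoid(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def encode(x, w_enc, b_enc):
    """Map n input features to m coded features, as in Eq. (7)."""
    return dense(w_enc, b_enc, x)

def decode(y, w_dec, b_dec):
    """Reconstruct n outputs from the m coded features, as in Eq. (9)."""
    return dense(w_dec, b_dec, y)

def reconstruction_error(x, z):
    """Mean squared difference between input and reconstruction, as in Eq. (10)."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / len(x)

# Hypothetical untrained weights for n = 3 inputs and m = 2 hidden neurons.
w_enc = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]   # shape m x n
b_enc = [0.0, 0.1]
w_dec = [[0.3, -0.4], [0.1, 0.2], [-0.5, 0.6]]  # shape n x m
b_dec = [0.0, 0.0, 0.0]

x = [0.9, 0.1, 0.5]
y = encode(x, w_enc, b_enc)
z = decode(y, w_dec, b_dec)
error = reconstruction_error(x, z)
```

Training would adjust the weights and biases to drive `error` towards zero over a dataset; that
optimization loop is left out here.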
An AE can form a deep architecture by stacking multiple hidden layers, which yields
the stacked autoencoder (SAE). It involves deep variants of neural networks that offer
a higher level of complexity and performance. Designing a deep learning model by
stacking autoencoders has shown excellent results in different applications (Sun et al. 2021).
SAE deep neural networks have shown strong computational capabilities when they are
trained with a suitable training function.
Fig. 8 Autoencoder Model
4.1.6 Deep belief network (DBN)
The main building block of a DBN is the restricted Boltzmann machine (RBM). An RBM is a
stochastic neural network with a visible layer and a hidden layer, containing the inputs and
the latent variables respectively. The two layers are usually organized as a bipartite graph
in which visible neurons and hidden neurons are connected. Multiple RBMs are stacked to form a
DBN. The training procedure follows a layer-by-layer approach where each layer is treated as an
RBM trained on top of the previously trained layer (Deng et al. 2020). Some of the applications
of DBNs are emotional feature extraction, security alert systems, and fault detection.
Individual applications of RBMs include network anomaly detection and collaborative
filtering (Fischer and Igel 2012).
4.1.7 Generative adversarial network (GAN)
GANs belong to the family of generative models and, unlike RBMs, do not require Markov chains.
They make use of a game-theoretic concept called the minimax game. The basic structure of a GAN
comprises a generator and a discriminator: the generator produces samples, while
the discriminator tries to distinguish between samples drawn from the training data and
samples drawn from the generator, as seen in Fig. 9. During training, a fixed-length input
vector is given to the generator model to generate a sample image. If the discriminator
classifies the image correctly, the discriminator is rewarded and the generator is penalized
and must update its weights. If the image is classified incorrectly, the discriminator is
penalized and must update its weights. Both networks keep generating and distinguishing in this
adversarial process until they reach the Nash equilibrium (Goodfellow et al. 2014).
Fig. 9 Generative adversarial network
4.1.8 Deep reinforcement learning
Deep reinforcement learning (DRL) is a combination of deep learning (DL) and reinforcement
learning (RL) (Mnih et al. 2015). The goal of RL is to enable an agent to choose the best
actions over a set of states through interaction with the environment, so as to maximize
the long-term reward. This process is modeled as a Markov decision process
(MDP). With a deep learning model, DRL approximates the value function of reinforcement
learning and solves high-dimensional MDPs. An MDP in DRL is defined as a tuple (S, A,
P, R, D) denoting the states, actions, transition probability, reward, and discount factor
respectively. Typically, the agent decides on an action based on the policy at the current
state. The environment receives the action, computes a reward, and moves to the next state
according to the preset transition probability. This process is repeated until the exit
condition is reached. The final aim is to maximize the expected discounted cumulative reward.
The discount factor balances future and immediate rewards (Chen et al. 2021).
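As a small worked illustration of the discounted cumulative reward, the sketch below sums the
rewards of one episode, weighting the reward received t steps in the future by the discount
factor raised to the power t. The function name and the reward values are assumptions made for
this example.

```python
def discounted_return(rewards, discount):
    """Sum of rewards where the reward at step t is weighted by discount**t."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future return.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total

episode_rewards = [1.0, 1.0, 1.0]                        # hypothetical rewards of one episode
short_sighted = discounted_return(episode_rewards, 0.0)  # only the immediate reward counts
balanced = discounted_return(episode_rewards, 0.5)       # 1 + 0.5 + 0.25
far_sighted = discounted_return(episode_rewards, 1.0)    # all rewards count equally
```

A discount factor close to 0 makes the agent favor immediate rewards, while a value close to 1
gives future rewards nearly the same weight as immediate ones.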
DRL can be divided into two types based on the agent's ability to learn a model of the
environment: (a) The model-based method, which aims to estimate the transition function and
the reward function. Here the agent studies the environment and plans its actions in advance.
(b) The model-free method, which estimates the value function or policy directly from
experience. Because it relies on experience alone, it should be comprehensively developed and
tested.
DRL can also be categorized into three types based on how the agent learns the policy, as
shown in Fig. 10: (a) Value-based methods, where the agent updates the value function to
learn a policy. Examples include Deep Q-Learning (DQL), Double DQL, and
Dueling DQL (Hasselt et al. 2016). (b) Policy-based methods, which learn the policy directly;
an example is deep deterministic policy gradient (DDPG). (c) Hybrid methods, also
known as actor-critic methods, which use both value-based and policy-based methods with
the help of two different networks. An example in this category is asynchronous
advantage actor-critic (A3C) (Mnih et al. 2016).
4.1.9 Transfer learning and knowledge distillation
Transfer learning (TL) is the transfer of knowledge from the source domain (any resource-
rich environment) to the target domain (resource-constrained environment) to accelerate
training, attain better performance as well as reduce development costs at the target plat-
form. The pretrained model is loaded along with the weights. Further training can be done
by freezing the initial layers and only retraining the last few layers, as shown in Fig 11(a)
(Pan and Yang 2010).
Fig. 10 Working of the DRL approach
Fig. 11 a Transfer learning and b Knowledge distillation
Knowledge distillation (KD) is a form of TL (Hinton et al. 2015). KD extracts information
from a high-performing pre-trained model, called the teacher, and transfers it
to a smaller DL model, called the student. The transfer is done using various
optimization techniques such as pruning and quantization, such that the accuracy is preserved,
as shown in Fig. 11(b).
DNN models used in the end devices can be further divided into two types based on
their parameters and complexity of the computation; they are accuracy-oriented models
and lightweight models.
4.1.10 Accuracy oriented models
Accuracy-oriented models focus mainly on accuracy without considering the complexity of the
underlying operations. An important DNN in this category is AlexNet, which marked a
breakthrough in the history of CNNs after winning the
ImageNet classification competition (Krizhevsky et al. 2012). It also popularized the ReLU
operation, which remains a widely used activation function. Another model is ZefNet, which
uses multi-layer deconvolution to monitor the activities of neurons (Zeiler and Fergus
2014). VGG16 and VGG19 enhance AlexNet with deeper networks of sixteen and nineteen
layers, as shown in Fig. 12(a).
GoogLeNet has inception blocks (Fig. 12c) that apply filters of different sizes
(3x3, 1x1) in parallel; their outputs are grouped, concatenated, and sent to the next layer
(Szegedy et al. 2016). Inception versions 2, 3, and 4 differ in the number of layers relative
to the basic block. An example of the basic module is given in Fig. 12(b). Compared with the
basic block, the inception module shows a tenfold reduction in
computation. ResNet has a residual block, shown in Fig. 12(d), with an identity shortcut
connection that skips multiple convolution layers (He et al. 2016). An extension of ResNet is
ResNeXt, which splits the residual block into multiple branches without increasing the
complexity (Xie et al. 2017). Another modification of the residual block was proposed in
DenseNet, where each layer receives and groups the data from the preceding layers within the
block (Huang et al. 2017). SeNet introduced the squeeze-and-excitation block, which can be
added to the inception block and the residual block (Hu et al. 2020).
Table 4 summarizes the popular accuracy-oriented deep learning models. It
includes the year of publication, depth in layers, number of parameters,
number of operations, and accuracy on the ImageNet benchmark
dataset.
4.1.11 Lightweight models
Lightweight models are mainly designed for resource-constrained environments such as
embedded, IoT, or edge devices. They are shallower than
accuracy-oriented models and are optimized in terms of the number of computations,
parameters, and accuracy.
An important CNN in this category is SqueezeNet, which achieves the same accuracy
as AlexNet with fifty times fewer parameters (Iandola et al. 2016). It introduces the fire
module, which consists of a squeeze stage and an expansion stage. The module squeezes the input
using 1x1 filters and then expands the result using parallel 1x1 and 3x3 filters. A
non-linearity is added, and the results are concatenated, as shown in Fig. 13(a), (b). An
extension of SqueezeNet is SqueezeNext, which proposes a separate module block for the fire
module (Gholami et al. 2018).
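To see why squeezing with 1x1 filters saves parameters, the sketch below counts the weights and
biases of a fire module and of a plain 3x3 convolution with the same number of output channels.
It assumes the usual count for a convolution layer, (filter width × filter height × input
channels + 1) × number of filters. The channel sizes are illustrative assumptions, not values
taken from this section.

```python
def conv_params(filter_w, filter_h, in_channels, num_filters):
    """Weights plus one bias per filter for a standard convolution layer."""
    return (filter_w * filter_h * in_channels + 1) * num_filters

def fire_params(in_channels, squeeze, expand_1x1, expand_3x3):
    """Parameters of a fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""
    squeeze_p = conv_params(1, 1, in_channels, squeeze)
    expand_1x1_p = conv_params(1, 1, squeeze, expand_1x1)
    expand_3x3_p = conv_params(3, 3, squeeze, expand_3x3)
    return squeeze_p + expand_1x1_p + expand_3x3_p

# Illustrative sizes: 96 input channels, 16 squeeze filters, 64 + 64 expand filters.
fire = fire_params(in_channels=96, squeeze=16, expand_1x1=64, expand_3x3=64)

# A plain 3x3 convolution producing the same 128 output channels.
plain = conv_params(3, 3, in_channels=96, num_filters=128)

saving = plain / fire
```

With these sizes the fire module needs roughly nine times fewer parameters than the plain
convolution, because the expensive 3x3 filters only see the 16 squeezed channels.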
Group convolution (Gconv) is used in ShuffleNet to reduce the number of parameters and
operations (Zhang et al. 2018). It applies different convolutions to different groups of input
channels instead of applying each convolution to all channels. ShuffleNet V2 uses the same
principle, but the channels are split into two before the convolution (Ma et al. 2018a).
Fig. 13(e) and (f) show the modules of ShuffleNet V1 and V2 respectively. CondenseNet also uses
group convolution, but instead of shuffling channels, it groups channels of the same size for
training (Huang et al. 2018). A hyper-parameter C is defined such that 1/C of the parameters
are pruned at the end of each condensing stage. The MobileNet series achieves its accuracy and
reduction in complexity using depth-wise separable convolution (DWConv) (Sandler et al.
2018). Fig. 13(c) and (d) show the modules of MobileNet V1 and V2 respectively.

Table 4 Accuracy-oriented DNN models

Reference                       Model         Depth  Parameters (M)  Operations (M)  Accuracy (%)
Krizhevsky et al. (2012)        AlexNet       8      61              650             79.06
Zeiler and Fergus (2014)        ZefNet        8      60              650             88.8
Simonyan and Zisserman (2015)   VGG19         19     138             7800            90.37
Szegedy et al. (2016)           GoogLeNet     22     4               750             87.52
Szegedy et al. (2015)           Inception-v3  48     24              5700            93.59
Szegedy et al. (2017)           Inception-v4  77     43              6250            95.3
He et al. (2016)                ResNet-101    101    45              3800            92.93
Huang et al. (2017)             DenseNet161   161    28              1500            93.6
Xie et al. (2017)               ResNeXt       101    83              4000            94.7
Hu et al. (2020)                SeNet154      154    115             10500           95.53

Fig. 12 a VGG-19, b Basic module, c Inception module and d Residual module
An automatic method of generating CNNs without human intervention is the neural
architecture search network (NASNet) (Zoph et al. 2018). This series uses an RNN
to generate the required model, guided by a reinforcement-learning search algorithm that
reduces the given loss function. Replacing the search algorithm with sequential model-based
optimization or with a tournament-selection evolutionary algorithm yields PNASNets
(Liu et al. 2018) and AmoebaNets (Real et al. 2019) respectively. The NASNet search was also
used to design a new baseline network, EfficientNet, which is scaled up to obtain a family of
models (B0 to B7). Compound scaling is applied, in which the model uniformly scales all three
dimensions (width, depth, and resolution) with a fixed ratio. Table 5 compares the discussed
models based on the number of parameters, operations, and accuracy.
Table 5 Lightweight DNN models

Reference               Model             Parameters (M)  Operations (G)  Top-5 error (%)
Iandola et al. (2016)   SqueezeNet        1.2             1.72            19.7
Gholami et al. (2018)   SqueezeNext       3.2             1.42            11.8
Zhang et al. (2018)     ShuffleNet        5.4             1.05            10.2
Sandler et al. (2018)   MobileNet-v1      4.2             1.15            10.5
Sandler et al. (2018)   MobileNet-v2      3.5             0.60            9
Howard et al. (2019)    MobileNet-v3 L    5.4             0.44            7.8
Howard et al. (2019)    MobileNet-v3 S    2.9             0.11            12.3
Huang et al. (2018)     CondenseNet C=4   2.9             0.55            10
Huang et al. (2018)     CondenseNet C=8   4.8             1.06            8.3
Zoph et al. (2018)      NASNet-A          5.3             1.13            8.4
Zoph et al. (2018)      NASNet-B          5.3             0.98            8.7
Zoph et al. (2018)      NASNet-C          4.9             1.12            9.0
Liu et al. (2018)       PNASNet           5.1             1.18            8.1
Tan et al. (2019)       MNASNet-A1        3.9             0.62            7.5
Tan et al. (2019)       MNASNet-small     2.0             0.14            –
Real et al. (2019)      AmoebaNet-A       5.1             1.11            8.0
Real et al. (2019)      AmoebaNet-C       6.4             1.14            7.6
Xiong et al. (2019)     ANTNets           3.7             0.64            8.8
Tan and Le (2019)       EfficientNet B0   5.3             0.39            6.7

Fig. 13 a SqueezeNet, b Fire module, c MobileNet V1 module, d MobileNet V2 module, e ShuffleNet V1
module, and f ShuffleNet V2 module
4.1.12 Equations for computing the parameters, number of operations, and accuracy in a conventional CNN
The DNNs given in Tables 4 and 5 are compared based on the following criteria:
(a) Depth: The depth of the CNN is the total number of layers; let it be L.
(b) Total number of parameters of the CNN: It is calculated layer-wise using the following
equations.
Let P_i be the number of parameters in layer i.
Let f_w, f_h, and N be the width, height, and number of the filters used for the convolution,
and let I_c be the number of input channels. Then,
\[ P_i = \left( (f_w \cdot f_h \cdot I_c) + 1 \right) \cdot N \tag{11} \]
The total number of parameters of the network is:
\[ \mathrm{Total}_P = \sum_{i=1}^{L} P_i \tag{12} \]
(c) Total number of operations: The majority of the operations in a DNN are layer-wise
multiply-and-accumulate operations. They are computed as follows:
Let I_w and I_h be the width and height of the input given to layer i.
Then the number of operations executed in layer i is:
\[ O_i = (I_w \cdot I_h \cdot I_c) \cdot (f_w \cdot f_h \cdot N) \tag{13} \]
The total number of operations of the network is:
\[ \mathrm{Total}_O = \sum_{i=1}^{L} O_i \tag{14} \]
(d) Accuracy: Accuracy is the ratio between the number of correct predictions and the
total number of predictions. Under top-5 accuracy, a prediction counts as correct when any of
the model's five highest-probability answers matches the expected answer.
In Fig. 14, the trade-off between the accuracy of the DNN models, their size in MB, and
the number of parameters (in millions) is shown. It shows that reducing the
number of parameters or the size within a certain threshold range does not drastically
degrade accuracy. Both accuracy-oriented and lightweight models are included in
the figure.
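The comparison criteria above can be turned into a short calculation. The sketch below follows
Eqs. (11) to (14) literally and adds a top-k accuracy helper; the two-layer network, the score
vectors, and all function names are hypothetical examples. Note that Eq. (13) as written does
not account for stride or padding, and neither does this sketch.

```python
def layer_params(fw, fh, in_channels, num_filters):
    """Eq. (11): weights of all filters plus one bias per filter."""
    return (fw * fh * in_channels + 1) * num_filters

def layer_ops(iw, ih, in_channels, fw, fh, num_filters):
    """Eq. (13): operation count of one convolution layer."""
    return (iw * ih * in_channels) * (fw * fh * num_filters)

def network_totals(layers):
    """Eqs. (12) and (14): sum parameters and operations over all layers."""
    total_p = sum(layer_params(l["fw"], l["fh"], l["ic"], l["n"]) for l in layers)
    total_o = sum(layer_ops(l["iw"], l["ih"], l["ic"], l["fw"], l["fh"], l["n"])
                  for l in layers)
    return total_p, total_o

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Hypothetical two-layer CNN on a 32 x 32 RGB input with 3x3 filters.
layers = [
    {"iw": 32, "ih": 32, "ic": 3, "fw": 3, "fh": 3, "n": 16},
    {"iw": 32, "ih": 32, "ic": 16, "fw": 3, "fh": 3, "n": 32},
]
total_params, total_ops = network_totals(layers)

# Hypothetical class scores for two samples with three classes each.
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 1]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

In this toy case neither sample is correct under top-1, but both are correct under top-2, which
illustrates why top-k figures are always at least as high as top-1 figures.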
4.2 Hardware and software requirements for DNN
Implementing an AI accelerator needs both hardware and software components. This
section discusses some of the crucial components of each.
Hardware platforms available for AI accelerators are mainly classified into
general-purpose processors, application-specific integrated circuits, and reconfigurable
devices.
4.2.1 Hardware platforms
(a) General-purpose processors (GPP): General-purpose processors such as CPUs and GPUs are
essential; without them, acceleration tasks are difficult to implement. Usually, GPPs are
built on the von Neumann architecture, which has an ALU as its computing core and follows the
sequence of fetching, decoding, and executing instructions. AI acceleration cannot be done
exclusively on the CPU, since the CPU must also handle many applications with multifaceted
control flow, resulting in multiple jumps and interrupts. Moreover, instruction throughput
depends on the control logic of the program as well as the cache hit ratio. Hence CPUs process
data more slowly than GPUs.
Fig. 14 Trade-off between accuracy, parameters and size
Unlike CPU cores, GPUs do not depend on big cache memories to hide the latency of DRAM
accesses; they instead rely on high explicit parallelism and are optimized for throughput
(Li et al. 2012). GPUs also use many SIMD (single instruction,
multiple data) computing units to increase parallelism and speed up AI
algorithms by up to several hundred times (Cheng and Wang 2011). This kind of processor is
more suitable for executing similar, repetitive, large-scale operations such as training, at
the cost of high power consumption. These qualities have made GPUs increasingly useful for AI
acceleration.
GPPs require repeated data exchange between registers and memory, and between on-chip
cache and off-chip storage, when computing complex AI operations. These memory
accesses and exchanges reduce performance and increase
energy consumption. Hence researchers are compelled to find alternative solutions that
overcome these limitations by designing AI accelerators for specific applications.
(b) Reconfigurable devices: Reconfigurable devices include the field-programmable gate
array (FPGA) and the coarse-grained reconfigurable array (CGRA). FPGAs are devices that
contain configurable logic blocks (CLBs) and programmable interconnects. Usually, CLBs
also include memory elements such as local RAM arrays or simple flip-flops for
storing local data. According to the requirements of the application, the CLBs can be
programmed and reprogrammed with complex functions such as multiplication, as well as logic
functions built from universal gates. This can be done quickly, resulting in a shorter
development cycle. FPGAs also offer a significant amount
of computing and storage resources for computationally intensive applications, such as deep
learning algorithms, that exploit massive parallelism.
CGRA was proposed to overcome the chip-area and computing-speed limitations of FPGAs.
It integrates the computing units into processing elements (PEs) and alters the
connections between the PEs and memory through configuration information, thus
achieving dynamic configuration of the hardware structure. Unlike an FPGA, it fixes
the internal hardware circuitry of each PE, which reduces the cost of configuring the
interconnect. For these reasons, it ranks next to ASICs in terms of energy
efficiency and power consumption.
(c) Application-specific integrated circuit (ASIC): ASICs are processors designed for a
specific task, such as AI acceleration. They occupy less area and offer high reliability, low
power requirements, and fast computation, which makes them suitable for accelerating AI
algorithms in tiny devices (Faraone et al. 2018). They use dedicated hardware circuit paths for
computing fixed types of tasks, attaining a good energy-efficiency ratio at power
consumption in the milliwatt (mW) range. However, their cost and development cycle time are
high. Hence ASICs are more suitable when the AI algorithm and its resource constraints
are already fixed, or when the target is low power and area.
Usually, before ASICs are manufactured for a specific application, FPGAs are used
for prototyping. The application's program logic is first tested by uploading the program
code to an FPGA, and the ASICs are manufactured after the model has been verified. This
procedure helps researchers and manufacturers reduce testing time and
cost. Since an ASIC design cannot be revised after fabrication, ASICs are not appropriate
when the AI model design needs modification after deployment.
Table 6 compares the discussed hardware platforms on ten
features, so that researchers can get a basic idea of which one to choose based on their
requirements. GPUs are implemented at the software level and are restricted by
the underlying hardware, whereas FPGAs and ASICs are implemented at the hardware
level. The optimizations in FPGAs and ASICs are mainly performed in the design phase,
which makes them faster at execution time. As discussed earlier, ASICs are not flexible like
FPGAs, which remain open to changes even after implementation. Since the implementation time
of FPGAs is shorter than that of ASICs, FPGAs are more suitable for prototyping and
verification even if the final target of the AI model is an ASIC.
The high fabrication cost makes ASICs more expensive than GPUs and FPGAs.
GPU prices for acceleration range from $50 for the NVIDIA GeForce GT 730 to $900 for the
Tesla K40. FPGA boards used for acceleration range in price
from $900 for the Xilinx Virtex-5 to $7000 for the Altera Stratix-V GXA7,
and their power consumption ranges from 1.75 to 20 W. ASIC implementations
require less power than the others, varying from 6.67 mW to 15.97 W (Talib
et al. 2020). In summary, FPGAs lie between GPPs and ASICs in flexibility, performance,
and speed.
(d) Resource-limited accelerators: Some AI algorithms are accelerated by pairing
general-purpose or primary processors such as CPUs or GPUs with AI coprocessors, which
makes the acceleration solution more independent. A coprocessor is a computer processor,
and a basic physical type of AI accelerator, that supplements the functions of the primary
processor. Coprocessors are often considered accessories that can be plugged into a host
machine for task acceleration. A typical example of a coprocessor is the Edge TPU used in
the Coral USB accelerator (Park et al. 2020). Other examples include the Neural-network
Processing Unit (NPU) by HiSilicon, the Neural Compute Engine manufactured by Intel, the
Artificial Intelligence Engine (AIE) by Qualcomm, and the AI Processing Unit (APU) by MediaTek.
AI accelerators can also be designed in the form of a System on a Chip (SoC), System on
Module (SoM), or single-board computer (SBC).
Table 6 Hardware platforms for DNN implementation

Features                  GPP                     FPGA            ASIC
Implementation level      Software                Hardware        Hardware
Required technical skill  GPU programming skills  VHDL / Verilog  VHDL / Verilog
Developing time           Medium                  Medium          Long
Area                      Big                     Medium          Small
Cost                      Low                     Medium          High
Parallelism               Medium-Low              Medium          High
Speed                     Slow                    Fast            Medium
Flexibility               Medium                  High            Low
Power required            High                    Low             Low
Performance               Medium to Low           Medium          High

1. System on a Chip (SoC): Typical SoC AI accelerators contain general-purpose processors,
memory controllers, and more than one coprocessor to execute the given AI model.
An example of an SoC is the Kirin 970 by HiSilicon, which contains CPUs, GPUs, and two
similar NPU coprocessors (Li and Liewig 2020).
2. System on Module (SoM): An AI accelerator designed as an SoM contains a microprocessor,
memory, and I/O in a single module. An SoM can include an SoC and needs to be paired
with a carrier board to be completely functional. SoM AI accelerators have a
plug-and-play advantage; hence, the AI accelerator can be upgraded without changing the
underlying carrier board. An example of an SoM is the Jetson Nano from NVIDIA, which can be
plugged into the carrier board provided by its developer kit (Mittal 2019).
3. Single-board computer (SBC): Unlike an SoM, an SBC is a computer built on a single
circuit board with microprocessor(s), memory, I/O, and the other features required for direct
use. It does not need any external hardware components to be fully functional.
An example of an SBC is the Coral Dev Board by Google (Li and Liewig 2020). It has a CPU
and GPU, RAM, flash, a network interface, and an Edge TPU for acceleration purposes
(Jouppi et al. 2018). Another typical example of an SBC is the Raspberry Pi (Shah et al. 2016).
SBCs are often seen as a good choice for AI acceleration due to their tiny size and low
power requirements.
4.2.2 Software frameworks
The basic software required for developing a deep learning accelerator includes an SDK
for developing and executing DNN models, together with support for languages such as Python,
C, and C++. Further criteria are the supported layers and operations and, finally, the
data types used in the accelerator, for example floating-point 32 (FP32), fixed-point 16
(FiP16), and integer 8 (Int8). If the accelerator supports only low-precision data types,
the model needs to be designed with low-precision data and should be compressed.
Table 7 shows libraries that support edge devices and other hardware platforms.
4.3 Optimization techniques
Optimization techniques require hardware-software co-design to achieve better
performance. Typically, researchers implement DNN models with a focus on
maximizing accuracy, which is feasible in a resource-rich environment.
However, such models are difficult to deploy in resource-constrained environments.
This problem can be addressed by the optimization techniques listed below,
with examples in Table 8. Optimizations are applied to a model so as to reduce
execution time, memory consumption, and other performance parameters
(Tamizharasan and Ramasubramanian 2018).
Table 7 Potential DNN frameworks and libraries for the edge

Libraries        Edge support  Android  iOS  Arm  FPGA  DSP  GPU  Mobile GPU  Training support
CNTK             –             –        –    –    –     –    ✓    –           ✓
Chainer          –             –        –    –    –     –    ✓    –           ✓
TensorFlow       ✓             –        –    ✓    –     –    ✓    –           ✓
DL4J             ✓             ✓        –    ✓    –     –    ✓    –           ✓
TensorFlow-Lite  ✓             ✓        –    ✓    –     –    ✓    –           –
MXNet            ✓             ✓        ✓    ✓    –     –    ✓    –           ✓
PyTorch          ✓             ✓        ✓    ✓    ✓     –    ✓    –           ✓
CoreML           ✓             –        ✓    –    –     –    –    ✓           –
SNPE             ✓             ✓        –    ✓    –     ✓    –    ✓           –
NCNN             ✓             ✓        ✓    ✓    –     –    –    ✓           –
MNN              ✓             ✓        ✓    ✓    –     –    –    ✓           –
Paddle-Mobile    ✓             ✓        ✓    ✓    ✓     –    –    ✓           –
MACE             ✓             ✓        ✓    ✓    –     –    –    ✓           –
FANN             ✓             –        –    ✓    –     –    –    –           ✓
Edge Impulse     ✓             ✓        –    –    –     –    ✓    ✓           ✓

1. Reduction in the precision of the data: Precision can be reduced by lowering the number
of bits, thus reducing storage cost and computation demands. It is mainly of two types:
uniform quantization and non-uniform quantization (Sze et al. 2017).
– Uniform quantization involves converting data and operations from floating point
to fixed point. For example, a 32-bit floating-point value is converted into an 8-bit
fixed-point value to reduce the precision.
– Non-uniform quantization is also considered since the distribution of weights is not
uniform, and hence moving from uniform to non-uniform quantization can improve accuracy.
Two varieties of non-uniform quantization are log domain quantization and learned
quantization.
2. Computation graph optimization: Deep learning workloads are represented as a compu-
tational graph, in which the edges represent the data and vertices represent the opera-
tions. The objective is to transform the computation graph into a more simplified version
by pre-computing constants in the batch normalization layer. It also removes the drop-
out layer since it acts like an identity function during inference (Jiang et al. 2018).
3. Kernel optimization: An efficient kernel has low algorithmic complexity as well
as an effective schedule that maximizes resource utilization. A schedule can make use of
the following options:
– Tiling: Divide each input into sections or blocks that fit into the cache, then compute
block by block.
– Reordering: Rearrange the order of 'for' loops to improve memory locality.
– Unrolling: Replace a for loop with repeated copies of its body.
– Vectorization: Replace a for loop with vector instructions.
– Parallelization: Execute a for loop in parallel.
4. Reduction in the number of operations and network size: Optimization can also be done on
the network by implementing the following methods (Shahshahani et al. 2018):
– Network pruning: Removes the redundant connections in the network and discards
weights below a certain threshold. Retraining might be required at this stage to
preserve accuracy or keep the loss minimal.
– Compact network architecture: The network can be condensed by replacing a larger
filter with multiple smaller filters. This approach can be applied during
model design (before training) or by decomposing the filters of a pre-trained network (after
training). The latter is easier but less flexible than the former (Sze et al.
2017).

Table 8 Different optimization techniques with applications

Optimization technique   Methodology                  Application                  Reference
Reduction in precision   Uniform quantization:
                         Dynamic fixed points         With & without fine tuning   Ma et al. (2019)
                         Reduced weights              Binary connect               Courbariaux et al. (2015)
                                                      Binary weight network        Rastegari et al. (2016)
                                                      Ternary weight network       Feng et al. (2019)
                         Reduced weights              Binarized neural network     Li and Liewig (2020)
                                                      XNOR Net                     Rastegari et al. (2016)
                         Non-uniform quantization:
                         Log domain quantization      Incremental quantization
                                                      LogNet                       Lee et al. (2017)
                         Learned quantization         Deep compression             Han et al. (2016)
Reduction in operations  Network pruning              Optimal brain damage         Chaber and Ławryńczuk (2018)
                                                      ADMM-based pruning
                         Compact networks:                                         Talib et al. (2020)
                         Before training              MobileNet
                         After training               CP decomposition
Graph optimization       Constant folding,            DL inference techniques      Jiang et al. (2018)
                         Graph simplification,
                         Kernel fusion,
                         Pre-computing,
                         Layout transformation
Kernel optimization      Tiling                       Tiling in FPGAs              Shahshahani et al. (2018)
                                                      Tiling in GPUs               Mittal and Vaishay (2019)
                         Unrolling                    Unrolling in FPGA            Shahshahani et al. (2018)
                                                      Unrolling in GPU             Mittal and Vaishay (2019)
                         Reordering, Vectorization,   DL inference techniques      Jiang et al. (2018)
                         Parallelization
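Two of the techniques listed above, uniform quantization and threshold-based network pruning,
are simple enough to sketch directly. The code below is an illustration under stated
assumptions, not a description of any particular framework: it uses symmetric quantization with
a single shared scale to map floating-point weights to 8-bit signed integers, and it prunes by
zeroing weights whose magnitude falls below a chosen threshold. The function names, weights, and
threshold are hypothetical.

```python
def quantize_uniform(values, num_bits=8):
    """Map floats to signed integers on a uniform grid with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

def prune_by_magnitude(weights, threshold):
    """Zero out weights whose absolute value is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.5, -1.0, 0.02, 0.25]                # hypothetical layer weights

codes, scale = quantize_uniform(weights, num_bits=8)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

pruned = prune_by_magnitude(weights, threshold=0.1)
sparsity = pruned.count(0.0) / len(pruned)
```

The quantization error per weight is bounded by half the grid step (`scale / 2`), which is one
way to reason about how many bits a layer can afford to lose. After pruning, retraining may be
needed to recover accuracy, as noted in the list above.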
4.4 Resource constraints and performance measures
This section examines the primary types of resource limitations encountered while modeling
learning algorithms for resource-restricted environments such as end devices and lightweight
devices. The main differences between executing deep learning algorithms on high-end devices
and on lightweight devices lie in the number of model parameters and the hardware
resources required (Dhar et al. 2019).
Table 9 compares different popular computing devices in terms of processing speed, memory,
power requirements, and applications.
4.4.1 Processing speed
Processing speed is an essential factor in measuring the usability of any on-device
application. One of the columns in Table 9 reports the computation speed of the processor
in different units, including:
(a) The clock frequency (in MHz or GHz), which is the number of cycles per second
of a processor. In each cycle, the processor can compute a limited number of operations,
depending on the underlying hardware architecture and the DNN model. Moreover,
processing speed dictates the run time of an application. For example, the Arduino Nano 33 BLE
and Uno Rev3 run at 64 MHz and 16 MHz respectively, whereas the Raspberry Pi 3 runs
at 1.2 GHz.
(b) Floating-point operations per second (FLOPS), giga floating-point operations per
second (GFLOPS), and tera operations per second (TOPS), which are common performance
metrics for high-performance SoCs. These metrics are used for lightweight
devices because the executed operations are measured either
in tera or giga operations (TOP, GOP) in fixed-point implementations or in giga
floating-point operations (GFLOP) in floating-point implementations. For example, TOPS is used
for Google Coral and the Qualcomm AI Engine, whereas T/GFLOPS is used for NVIDIA Jetson
Nano and Intel Movidius (Marantos et al. 2018).
4.4.2 Memory
Most of the devices reported in Table 9 use RAM (random access memory) in parallel
with flash memory to achieve better performance. Flash memory is used primarily for
storage, while RAM holds the data retrieved from storage during computation. Designing
DNN models for a lightweight device requires detailed knowledge of the memory
specifications of the target device. Due to area, cost, and power limitations, most
edge devices come with limited memory. For example, the memory ranges from 2 KB in
Arduino Uno Rev3 to 256 MB in Amazon Echo and a few GBs in Qualcomm AI Engine.
However, typical DNN models take hundreds of megabytes or a few gigabytes (Dhar
et al. 2019). Such an amount of memory is not available on lightweight devices. Thus,
there is a trend towards developing small DNN architectures that can be deployed easily
in resource-constrained environments. Memory performance also depends on how the model
parameters (weights and other hyperparameters) are retrieved from on-chip and off-chip
memory. On-chip and off-chip accesses differ in both energy consumption and access time.
For example, a MAC operation needs three memory reads and one memory write. In the worst
case, these reads and writes go to the off-chip memory rather than the on-chip buffer,
which results in a significant throughput bottleneck and orders of magnitude higher
energy consumption (Sze et al. 2020). Both on-chip and off-chip memory accesses are
important since they are closely related to power consumption (SP 2019). A design cannot
be considered successful if system power consumption is not evaluated correctly. Hence,
without accounting for the off-chip memory access, one cannot execute a DNN model on a
processor and claim low cost, high throughput, high accuracy, and low chip power (Sze
et al. 2020).
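The effect of memory placement on a MAC-dominated layer can be sketched in Python as follows. The example uses the access pattern stated above (three reads and one write per MAC) and compares the worst case, where every access goes off-chip, with the case where every access hits an on-chip buffer. The relative energy costs are illustrative assumptions (normalized so that one MAC costs 1 unit), not measurements from the cited works, and the layer dimensions are hypothetical.

```python
# Assumed relative energy costs (1 unit = one MAC operation).
# Real values depend on the process technology and the memory hierarchy.
ENERGY_MAC = 1.0
ENERGY_ON_CHIP_ACCESS = 6.0     # assumption: on-chip buffer access
ENERGY_OFF_CHIP_ACCESS = 200.0  # assumption: off-chip DRAM access

READS_PER_MAC = 3   # filter weight, input activation, partial sum
WRITES_PER_MAC = 1  # updated partial sum

def layer_energy(num_macs, access_cost):
    """Energy of a layer when every memory access costs `access_cost` units."""
    accesses = num_macs * (READS_PER_MAC + WRITES_PER_MAC)
    return num_macs * ENERGY_MAC + accesses * access_cost

def conv_macs(out_h, out_w, out_ch, in_ch, k):
    """MAC count of a standard convolution layer with a k x k kernel."""
    return out_h * out_w * out_ch * in_ch * k * k

# Hypothetical 3x3 convolution layer with 64 input and 64 output channels.
macs = conv_macs(out_h=56, out_w=56, out_ch=64, in_ch=64, k=3)
worst = layer_energy(macs, ENERGY_OFF_CHIP_ACCESS)
best = layer_energy(macs, ENERGY_ON_CHIP_ACCESS)

print(f"MACs in layer: {macs}")
print(f"all-off-chip vs all-on-chip energy ratio: {worst / best:.1f}")
```

Under these assumed costs the memory accesses, not the MACs themselves, dominate the energy in both cases, which is why reporting only the arithmetic cost of a design can be misleading.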
4.4.3 Power requirements
Power consumption is defined as the amount of energy consumed per unit of time and
is often reported in watts, as for devices like the Raspberry Pi, Amazon Echo, and Jetson
NANO. However, some devices report power consumption in terms of the number of
operations per second per watt, for example Intel Movidius (16-bit operations per second
within 500 mW), and ARM ML and Google Coral (tera operations per second per watt).
When evaluating the performance of DNN models, measuring only the power consumption
of a device is not sufficient, since the power draw depends on the workload and reaches
its maximum when computation is intensive. Hence, a fair assessment of power consumption
requires energy consumption to be reported along with throughput. These quantities in
turn depend on several factors, such as run time and memory usage, and capturing these
dependencies is rarely deterministic. Ongoing research therefore quantifies power and
energy efficiency through functions that typically depend on the memory and run time of
an application (Dhar et al. 2019).
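One simple way to combine power with throughput is to report energy per inference. The Python sketch below assumes a constant average power draw during inference, which real devices rarely exhibit. The function names and the two sets of measurements are hypothetical and only illustrate the arithmetic.

```python
def energy_per_inference(avg_power_watts, throughput_inferences_per_s):
    """Joules consumed per inference at a constant average power draw."""
    return avg_power_watts / throughput_inferences_per_s

def inferences_per_joule(avg_power_watts, throughput_inferences_per_s):
    """Energy efficiency: number of inferences obtained per joule."""
    return throughput_inferences_per_s / avg_power_watts

# Hypothetical measurements for two devices running the same model.
device_a = {"power_w": 5.0, "throughput": 25.0}  # higher power, higher throughput
device_b = {"power_w": 0.5, "throughput": 2.0}   # lower power, lower throughput

e_a = energy_per_inference(device_a["power_w"], device_a["throughput"])
e_b = energy_per_inference(device_b["power_w"], device_b["throughput"])

print(f"device A: {e_a:.2f} J per inference")
print(f"device B: {e_b:.2f} J per inference")
```

In this made-up comparison, the device that draws ten times more power still spends less energy per inference, which illustrates why power alone is not a fair basis for comparison.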
4.4.4 Other performance measures
Another measure is the model’s accuracy, which should not fall below a particular
threshold after model optimization. All the performance metrics are necessary for a fair
evaluation of the design trade-offs. For example, without reporting the accuracy for a
specific data set and task, one could run a simple DNN accelerator and claim low power,
high throughput, and low cost. However, the model might not be usable for a meaningful
task. Similarly, without reporting the amount of off-chip memory access, one could build
a model with only MACs and claim low cost, high throughput, high accuracy, and low chip
power. However, when evaluating system power, the off-chip memory access would be
significant (Sze et al. 2017).
Most DNN model implementations on lightweight devices report run-time measures, which
serve as a proxy for both throughput and latency, the two commonly used measurements.
Throughput is defined as the rate at which the input data is processed, and latency as
the time interval between a single input and its response. Throughput is often reported
in inferences per second and latency in seconds per inference. The equation for
calculating throughput is:
Table 9 Comparison of different lightweight devices

| Device | Coprocessor or microcontroller | Processing speed | Memory (RAM; Flash Memory) | Power requirements | Applications |
| --- | --- | --- | --- | --- | --- |
| Raspberry Pi3 | Quad Cortex A53; 400 MHz VideoCore IV | 1.2 GHz | 1 GB SDRAM; 32 GB Flash Memory | 0.58 W | Video analysis |
| SparkFun-Edge | 32-bit ARM Cortex-M4F | 48 MHz | 384 KB; 1 MB | 6 uA/MHz | Speech recognition |
| Intel Movidius Myriad 2 VPU | High Performance VPU | 4 TFLOPS | 1 GB; 4 GB | 2 trillion 16 bit ops/s within 500 mW | Computer vision Barry et al. (2015) |
| Google Coral Dev Board | Quad Cortex-A53, Cortex-M4F; GC7000 Lite Graphics + Edge TPU coprocessor | 4 TOPS | 1 GB LPDDR4; 8 GB LPDDR4 | 2 TOPS per Watt (Processor level) | Image processing Cass (2019) |
| ARM ML | ARM ML Processor | - | 1 GB RAM | 4 TOPs/W | Image and voice recognition |
| Amazon Echo | TI DM3725 ARM Cortex-A8 | up to 1 GHz | 256 MB | 4 W (peak) | Smart Home |
| Qualcomm AI Engine | Hexagon 685 and Adreno 615 | 2.1 TOPS | 2-4 GB | 1 W | General IoT |
| Arduino Uno Rev3 | ATmega328P | 16 MHz | 2 KB | 0.3 W | General IoT |
| Arduino Nano 330 BLE | nRF52840 | 64 MHz | 256 KB; 1 MB Flash | 2uA-1 HzODR | General IoT |
| NVIDIA Jetson NANO | Quad ARM A57 MPCore; Maxwell 128 CUDA cores | 472 GFLOPS | 4 GB 64 bit LPDDR4 25.6 GB/s; 6 GB eMMC 5.1 Flash | 5-10 W | Computer vision applications, audio analysis |
$$\frac{\text{inferences}}{\text{second}} = \frac{\text{operations}}{\text{second}} \times \frac{1}{\frac{\text{operations}}{\text{inference}}} \qquad (15)$$
Here operations/second depends on the DNN model as well as the underlying hardware,
whereas the operations/inference depends only on the DNN model.
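Equation 15 can be evaluated directly, as in the Python sketch below. It assumes that one inference is processed at a time (batch size 1), so that latency is simply the reciprocal of throughput. The sustained hardware rate and the per-inference operation count are hypothetical values.

```python
def throughput(ops_per_second, ops_per_inference):
    """Inferences per second, following Eq. 15."""
    return ops_per_second / ops_per_inference

def latency(ops_per_second, ops_per_inference):
    """Seconds per inference, assuming one inference is processed at a time."""
    return 1.0 / throughput(ops_per_second, ops_per_inference)

# Hypothetical example: hardware sustaining 100 GOP/s on a model
# that needs 2 GOP per inference.
sustained_ops_per_second = 100e9
model_ops_per_inference = 2e9

fps = throughput(sustained_ops_per_second, model_ops_per_inference)
delay = latency(sustained_ops_per_second, model_ops_per_inference)

print(f"throughput: {fps:.1f} inferences/s")
print(f"latency: {delay * 1e3:.1f} ms per inference")
```

With batching or pipelining, throughput can increase without reducing the latency of an individual inference, so the reciprocal relation used here should not be assumed in general.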
4.5 Applications of DNN models
Deep learning is applied in a vast set of fields. With the migration of deep learning to
the edge, a new set of applications with latency constraints is now possible. Given the
limited computational capacity of end devices, some applications adopt a hybrid method
that uses both edge and cloud. This section deals with the applications of deep learning
models in different domains like multimedia applications, smart city, smart transport,
the industrial domain, and others.
4.5.1 Multimedia applications
Most DNNs are applied in the image processing, video analytics, and NLP domains. Many
of these applications need to be performed in real time and hence must be executed at
the edge.
– Image processing: The applications of image processing mainly span image
classification, feature extraction, and object detection (Sreenu and Durai 2019). Some
of its applications in different sectors include: (i) Agriculture: Smart agriculture is
a new method developed to enhance productivity in agriculture. Deep learning is applied
in several agriculture areas like leaf disease prediction, crop type classification,
plant recognition, and land analysis using images (Kavitha and Rubini 2021). (ii) Medical
sector: Medical image classification and hidden feature extraction are the two popular
applications in this domain. Medical imaging analysis with DNNs can detect different
kinds of cancer from histopathology and MRI images, and classify fractures from femur
X-ray images (Tanzi et al. 2020). Similarly, Salama and Aly (2021) proposed an automated
CNN approach for segmentation and classification of mammography images. In genetic
engineering, DNNs are used to extract hidden features from genetic information, which
allows the prediction of diseases like autism (Tayara and to Chong 2020).
– Video analytics: Video analytics with deep learning models is usually applied in
augmented reality (AR) and surveillance cameras. Traditional video analytics depends
heavily on the cloud for processing, i.e., the videos are streamed to cloud servers
where the actual computation takes place, and the processed results are sent back to
the edge devices. This causes problems in time-critical applications like autonomous
cars. The development of lightweight deep learning models and their deployment to end
devices allows the video data to be processed near the source with a rapid response.
Amazon has released the first deep learning-enabled camera, AWS DeepLens, which can
detect objects in real time without cloud service intervention.
(i) Adaptive streaming: This is a subarea of video analytics that determines the
quality of experience in video delivery. The videos are streamed at a bit rate
compatible with the network state, stability, fairness, the user's preferred video
quality, etc. Traditional adaptive streaming approaches rely on a client-based
technique, which aims to adapt to bandwidth variations based on predicted metrics like
buffer size and bandwidth situation. Recent methods that combine DNN models with
reinforcement learning have resulted in an edge computing-assisted framework that uses
DRL to intelligently assign users to suitable edge servers and thus deliver adequate
video streaming services (Wang et al. 2020).
– Natural language processing: Applications of NLP and machine translation use deep
learning models like RNNs, LSTMs, and transformers for better performance and are
deployed on embedded GPUs like the NVIDIA Jetson TX2 (Marco et al. 2020). In contrast,
the voice assistant applications of NLP include (i) Alexa, Alexa Voice Service (2016),
and (ii) Siri, Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks
for Hybrid Unit Selection Synthesis (2017). These use a hybrid solution in which a DNN
on the device detects specific words, and the detected words are then sent to the cloud
for further processing.
4.5.2 Smart city
– Smart home and building: The smart home gathers and handles data from home devices
or handheld devices to deliver a set of services. Some examples are human activity
monitoring and home robotics (Erol et al. 2018). Since this domain requires high
privacy, it is essential to offload the computation from the cloud to the end devices.
Ubiquitous wireless signals are used with deep learning models for applications
including gesture and sign language recognition. SignFi, proposed by Ma et al. (2018b),
used WiFi signals and a CNN model to recognize 276 sign language gestures, whereas Wang
et al. (2019) used both CNN and LSTM to identify activities and gestures, which are then
used to control home devices like lights, fans, and televisions. Smart buildings require
more efficient data processing than smart homes. For example, Zheng et al. (2019)
proposed optimizing electricity consumption using multi-task learning.
– Smart grid: The smart grid is defined as an electricity distribution network (EDN)
with smart meters installed at various points to estimate real-time consumption. DNN
models deployed at the edge level support the control and management of EDNs. The
attacks and false data injection problems on these networks are handled by deep belief
networks (Yan et al. 2017). Another application considers the battery energy and power
consumption of vehicles to derive an automated charging policy (Wan et al. 2019).
4.5.3 Transportation domain
The formation of the Internet of Vehicles (IoV), combined with the latest techniques in
deep learning, will allow a more efficient transportation system (Wang et al. 2019). It
includes autonomous driving, traffic and car parking predictions, traffic signal
control, and traffic flow management (Zantalis et al. 2019).
– Autonomous driving: In autonomous driving, automobiles first gather information
from multiple sensors, cameras, and radars, and then analyze it efficiently. Earlier,
this analysis was done by a model deployed in the cloud. This may not fulfill
requirements like a rapid response to a real-time situation and may compromise security.
Deploying DNN models to the end devices or edge servers provides a solution to this
problem. One DNN model applied in this domain is SqueezeDet, which was designed for
better object detection accuracy with reduced complexity and energy consumption (Wu
et al. 2017).
– Traffic analysis: The mobility patterns of vehicles and pedestrians are the two
important factors in traffic analysis and signal control. The conventional methods used
to analyze them are time-series analysis and probabilistic graph analysis, which cannot
efficiently identify the spatiotemporal relationships among the values. DNN models are
far more capable in this scenario: LSTM models capture mobility patterns on a large
scale and predict future movements (Song et al. 2016). Yao et al. (2019) extended this
work by incorporating spatiotemporal features with LSTM and CNN to form a
spatial-temporal dynamic network (STDN), achieving higher prediction accuracy than the
existing models. A multi-step prediction model named Spatial-Temporal Attention Wavenet
(STAWnet) was proposed to capture the complex spatial-temporal dependencies and predict
traffic on road networks (Tian and Chan 2021).
– Traffic signal control systems: Traffic control models are required to minimize the
waiting time of automobiles and pedestrians while reducing accidents and congestion.
Before DNN models, this domain depended mainly on fuzzy and optimization algorithms.
Recent advances in DRL have led to the integration of the actor-critic and
multi-agent-critic DRL algorithms, combining their features for better traffic signal
control (Chu et al. 2020).
4.5.4 Industrial applications
Any smart industrial application requires smart data analysis and efficient product
automation.
– Smart data analysis: Smart industrial data analysis estimates the remaining lifespan
of an engineered product or system. A vanilla LSTM model proposed by Wu et al. (2018)
predicts the remaining lifetime accurately and efficiently. Similarly, Wang et al.
(2019) apply this estimation to batteries so that a backup can be requested in advance.
– Product automation: Product automation with deep learning techniques accelerates
productivity in the industry. A health monitoring system was developed by Huang et al.
(2019) using a CNN with bi-directional LSTM. Similarly, Li et al. (2018) offloaded the
computations from the cloud to the edge level for inspecting raw data without human
intervention.
4.5.5 Others
Other areas include a myriad of applications like gaming, where systems with deep
learning reach a proficiency level above that of human players. Recent examples are
chess and Go (Gao and Wu 2021). Several other areas are benefiting from deep learning:
credit risk prediction (Leo et al. 2019), solar power forecasting and analysis (Trappey
et al. 2019), speech and signal processing, customer service (Xu et al. 2020b), digital
marketing (Miklosik et al. 2019), and fraud detection (Divya et al. 2020).
4.6 Comparative study
Twenty-five peer-reviewed studies were carefully selected for the comparative analysis
because they address all four research questions mentioned in section 3. Specifically,
the papers cover real-time applications of DNN models realized on different lightweight
devices using various optimization techniques to enhance performance.
The reviewed papers are from Scopus-indexed journals and conferences, with journal
papers in the majority: 17 of the 25 are journal papers (68%) and 8 are conference
papers (32%). Papers from 2019 and 2020 constitute 40% and 48% respectively, followed by
2021 with the remaining 12%. The systematic-review lens covers the four aspects shown in
Table 10.
Recent trends in DNN model acceleration in resource-constrained environments are
towards several software-based and hardware-based methods. The DNN model considered
here is the CNN, and hence most of the applications handled by the models are image and
video processing, except for a few like machine translation, electroencephalography
signal classification, and human activity recognition. Most of the computing devices
used by the models are FPGAs, followed by embedded GPUs like the NVIDIA Jetson TX2. The
authors also experimented with hardware-based accelerators.
Software-based optimization techniques like pruning, compression, quantization,
transfer learning, parameter tuning, computational kernel optimization, task
parallelism, and trading precision for time are also applied to achieve software-based
acceleration. Since a single model is unlikely to meet all the constraints of accuracy,
inference time, and energy consumption across inputs, it is desirable to know how each
model handles the existing challenges. Table 11 summarizes all the above-discussed
aspects of the selected papers and their comparative analysis.
5 Review results and discussions
This section addresses the common design options that we found after conducting the
analysis. It covers popular DNN models, commonly used hardware options, common problems
that utilize both hardware and DNN models, and the resources and computations commonly
optimized to achieve edge-level AI. The following subsections explain the results and
inferences obtained from the descriptive analysis of the selected papers.
5.1 DNN model perspective
To identify the types of DNN models used on different hardware platforms, we considered
both accuracy-oriented and lightweight DNN models. The analysis shows that the majority
(62%) of the studies use lightweight DNN models, and only 38% use accuracy-oriented
models and their subcategories. Fig. 15 shows the distribution of the DNN models that
are widely used in different applications. Among the lightweight DNN models, MobileNet
is the most common (18%), followed by SqueezeNet with 7%. Other models include NASNet,
SparkNet, ZynqNet, and SparseNet with 4% each. Simple CNN models are also listed with
21%; these consist of different primitive CNNs with fewer hidden layers and less
computation. Among the accuracy-oriented models, the popular ones are YOLO and VGG-16
with 11% each, followed by AlexNet (7%). The remaining DNN models in this category are
Inception, ResNet, and RCNN, constituting 3% each.
5.2 Hardware perspective
The analysis shows that the hardware platforms can be GPUs, FPGAs, ASICs, or other
resource-constrained accelerators like Raspberry Pis. Most of the papers used FPGAs
(16 studies) as their primary computing platform because they are reconfigurable and
compute quickly. Widely used examples of FPGAs in this study are Altera boards (Arria-10
series) and Xilinx boards (Zynq-7000 series). The next most popular computing platform
is the embedded GPU, like the Jetson NANO TX2 and NVIDIA GTX1080, constituting 16% of
the studies. GPUs are extremely useful when highly parallel computation and flexibility
in the application are required even after implementation. The analysis also shows that
SBCs like the Raspberry Pi (3, 4) are helpful for hardware acceleration. The remaining
computing devices considered in the study include embedded devices and intelligent chips
like Ascend 310. Fig. 16 shows the composition of the different computing platforms used
in the descriptive analysis.
5.3 Resource constraints and optimization perspective
The statistics of the optimization techniques used for implementing DNN models on
different hardware platforms are as follows. Quantization techniques constitute 28%,
followed by varieties of convolution techniques with 19%. In addition, network
compression models and variants of pruning methods (7 studies) enable efficient
implementation of DNN models. Other optimization techniques used in the analyzed papers
are pipelining methods and feature compression, with 8% each. Transfer learning methods
and encoding make up a minor proportion of the studies. Fig. 17 outlines the essential
optimization techniques and the number of associated studies in which DNN adoption is
highly acceptable. The studies show that the optimization methods applied to DNN models
significantly impact resources like memory usage, power requirements, number of
computations, and computation speed.
Increased computation speed is addressed as an objective in 15 studies. Similarly, the
other objectives considered are reduced power consumption and minimal computation, in
8 studies. Memory utilization and reduction account for 12% of the papers considered for
the analysis. In addition, accuracy is considered in 10 studies.
Table 10 Types of Study Fields

| Study fields | Description |
| --- | --- |
| Computing devices | The review includes papers based on types of light weight computing devices. |
| Key contribution | Focused on various application domains to get a birds-eye view. |
| Challenging methodologies | The challenging issues addressed by the researchers. |
| Performance evaluated | The amount of resource in terms of accuracy, speed, memory and power utilized by the model. |
Table 11 Comparative study of different DNN models with their computing environments, key contribution and performance evaluation

| Base model, year, reference | Computing devices | Key contributions | Challenging methodologies addressed | Performance (Accuracy, Speed, Memory & Power) |
| --- | --- | --- | --- | --- |
| MobileNet, ResNet, Marco et al. (2020) | NVIDIA Jetson TX2 | Optimized inference time | Dynamic selection of the optimal model | 1.34x and 1.8x reduction in inference in image classification. Machine translation with a 7.52% improvement in accuracy. |
| EfficientNet, Tan and Le (2019) | Any lightweight device | Scaled up ConvNet | Optimize model size, compute capability | 8.1x smaller and 6.1x faster than the best existing ConvNet. |
| CNN, Zhu and yuan Ge (2020) | FPGA Board | Image-quality assessment based on DL by extracting discriminative image quality features. | Cache optimization, saves storage resources, speed up operation rate | Outperforms the benchmarks (10% more on Spearman correlation compared to the next best) |
| SqueezeNet, ZynqNet, Mousouliotis and Petrou (2020) | Xilinx XC7Z020 | HLS for mobile CNN on FPGAs | Task level pipelining data rearrangement in max-pooling layer before merging | More than 10 fps CNN inference at 100 MHz using a batch size: 1. |
| AlexNet, VGG 16, Véstias et al. (2020a) | Zynq7020, Zynq7045 | Optimization for inference of specific CNNs in FPGAs. | Optimization of fixed-point quantization, efficient calculation of conv-layers | Inferred an image in 4.3 ms in a ZYNQ7020 and 1.2 ms in a ZYNQ7045. |
| Faster CNN, HyperNet, Han et al. (2020) | NVIDIA GTX 1080ti used on a camera in ROV | RPN to optimize the feature extraction capability. | Model optimization to be suitable for feature extraction | Improved recall, detection accuracy with a speed of 17 fps on a GPU |
| CNN, Lei et al. (2019) | Xilinx Zynq XC7Z045 board | Parallel processing strategies for the acceleration of CNNs | Optimization and quantization of decision’s classification | Classification accuracy of 87.8%, speedup of 30.42% and accuracy loss of 0.27% compared to GPUs |
| Icing-EdgeNet, Bo et al. (2021) | Huawei-Atlas 200DK with Ascend 310 intelligent chip | An ice thickness monitoring system created for real-time recognition of ice thickness. | The discrimination-aware channel pruning | 74.5% recognition accuracy, the model size is 11 MB |
| AlexNet, VGG 16, Struharik et al. (2020) | Xilinx Zynq UltraScale FPGA family | Designed a co-processor soft-IP core | Processed the compressed feature and kernel maps, by skipping all ineffectual computations | 3x-11x times faster than the existing accelerator models. |
| VGG CNN, Ma et al. (2019) | PYNQ-Z1, programmable chip is zynq7020 | FPGA-Based rapid electroencephalography signal classification system | Signals are processed as a series of multichannel images; the CNN model used is 16-bit fixed point | Accuracy of 80.5%, less power consumption, and 8x faster acceleration than that of PCs |
| MobileNet V1, V2, Aguiar et al. (2020) | Edge TPU, Jetson TX2 | A real-time approach to compute the detection of vine trunks | Transfer learning | Tiny YOLO-V3 outperforms by achieving higher inference accuracy and runtime performance |
| SparkNet, Xia et al. (2021) | Arria-10 GX1150 FPGA | Mapping to dedicated hardware unit for pipelined work | Model compression of CNN | 337.2 GOP/s performance under the energy efficiency of 44.48 GOP/s/W, outperforms the previous methods |
| ESPNet V2, Mehta et al. (2019) | NVIDIA Jetson TX2 | Introduced a light-weight, power efficient, and general purpose CNN | Encoded the spatial information in images by learning representations from a large effective receptive field. | Outperforms ESPNet, YOLO2 by 4-5% and has 2-4×, 6× fewer FLOPs on the PASCAL VOC, Cityscapes dataset, and the MS-COCO object detection respectively. |
| CondenseNeXt, Kalgaonkar and El-Sharkawy (2021) | Any Embedded System | An ultra-efficient DNN for embedded systems | Depth-wise separable convolutions, group-wise pruning | Size less than 3.0 MB and accuracy trade-off resulting in an unprecedented computational efficiency. |
| Inception V3, MobileNet, Kristiani et al. (2020) | NCS 2 and Raspberry Pi 4 | Optimization of deep learning inference on edge devices | Intermediate values of the network are generated using model | The speed and accuracy are 9 fps, 41.28% and 24 fps, 71.29% on Inception V3 & MobileNet respectively. |
| YOLO V2, Jinguji et al. (2019) | Xilinx ZCU104 FPGA board | Pedestrian and object detection using surveillance footage | Image splitting to detect small objects | 3.9 times faster than mobile GPU |
| OCR NN, Vreča et al. (2020) | Zynq-7000 System-on-a-Chip | Accelerating deep learning inference hardware Loops and a dot product unit | Advancements in compression of NN, coupled with an optimized ISA | Average clock cycle count reduced by 73%, 72% and 78% for a small-scale CNN, FC-layers and Conv-layers respectively. Energy consumption reduced by 73%. |
| YOLO V2, Tiny YOLO, Xu et al. (2020c) | Arria-10 GX1150 FPGA | A dedicated hardware accelerator for real-time acceleration of YOLOv2 | Model quantization, scalable kernel pipeline design, full 8-bit fixed-point data path. | Peak throughput of 566 GOP/s. Inference computation. YOLOv2 and tiny YOLOv2 of 35 and 71 fps, respectively. |
| STD-CNN, DS-CNN, Ding et al. (2019) | Intel Arria 10 FPGA | Designing efficient accelerator of depth-wise separable CNN on FPGA | The flow of data between neighboring layers using double-buffering based memory | 98.9 GOP/s which is equivalent to a speedup of 17.6 at 29.4 times lower power than CPU, GPU-based designs. |
| Sparse CNN, Lu et al. (2019) | Xilinx ZCU102 | An efficient hardware accelerator for sparse CNNs | Deals with the irregular connections in the sparse Conv-layers, element-matrix multiplication as the key operation. | 2.4 to 12.9 speedup over dense CNN |
| ResCoNN (SqueezeNet), Lechner et al. (2019) | Zynq SoC (xc7z020clg484-1) | Resource-efficient FPGA-accelerated CNN for traffic sign classification | Optimization by avoiding complex computation and adopting binary weights and integer activations | Classification accuracy of >96% on real-world images at a framerate of 36 fps on a Zynq SoC with 90% reduced weights compared to state-of-the-art CNNs. |
| YOLO CNN, Nguyen et al. (2019) | VC707 FPGA | A high-throughput & power-efficient implementation of YOLO CNN for object detection | Techniques involving manipulation of binary weights and low-bit activation are used | mAP of 64.16%, 1.88 TOPS throughput at 200 MHz, 18.29 W on-chip power needed |
| CNN, Kim et al. (2019) | Xilinx Kintex UltraScale FPGA KCU105 board | Real-time CNN-based SR HW that upscale 2K full high-definition video to UHD video at 60 fps. | Simple quantization scheme for weight parameters and activations | Upscaling at 60 fps |
| SS CNN, Si et al. (2019) | Cyclone IVE FPGA | An SS-CNN on an FPGA for handwritten digit recognition | Using parameters with 8 bits precision, with no accuracy-loss | Requires less power and area. 67 to 355 times power savings potential when compared to the software solution. |
| MobileNet V2, Pham et al. (2020) | Raspberry Pi 3 | Deep learning-Based bearing fault diagnosis method for embedded systems | Changing the scaling rate and pruning rate. | The accuracy up to 99.58% with less computation overhead |
5.4 Application perspective
The selected research papers for analysis are grouped according to the implemented appli-
cation. They mainly come under the multimedia domain, followed by signal processing and
natural language processing. The categorized studies are given below:
– Image classification: Many of the discussed optimized DNN models, like EfficientNet
(Tan and Le 2019), CondenseNeXt (Kalgaonkar and El-Sharkawy 2021), depthwise
separable CNN (Ding et al. 2019), optical character recognition neural network (Vreča
et al. 2020), and super skinny CNN (Si et al. 2019), are used for image classification
with benchmark datasets like MNIST, CIFAR-10, CIFAR-100, and ImageNet on lightweight
devices like FPGAs and Raspberry Pis (Ma et al. 2016). Highly optimized DNN accelerators
like CNN-Grinder (Mousouliotis and Petrou 2020), CoNNa (Struharik et al. 2020), SparkNoC
(Xia et al. 2021), the sparse CNN accelerator (Lu et al. 2019), and a scalable
architecture (Véstias et al. 2020a) are also tested on image classification problems
using the same datasets.
Applications of image classification in the transportation domain include traffic
sign classification with an FPGA-accelerated model called ResCoNN (Lechner et al.
2019), which uses a German traffic sign recognition and detection dataset. Another
application classifies ten cars using Inception V3 and MobileNet, where the dataset is
based on the hottest wheels (most stolen cars) in the US for the year 2017 (Kristiani
et al. 2020). An industrial application to classify faulty bearings is implemented on a
Raspberry Pi 3 using MobileNet V2 (Pham et al. 2020).
Fig. 15 Summary of DNN Perspective outcome
– Object detection: An industrial application of object detection is Icing-EdgeNet,
proposed by Bo et al. (2021) to detect the thickness of ice on power transmission
lines. Object detection was also implemented for pedestrian detection using
surveillance footage (Jinguji et al. 2019). In the agriculture scenario, the visual
detection of vine trunks using MobileNets was implemented on an Edge TPU (Aguiar et al.
2020). YOLO (you only look once) CNN-based object detectors were proposed by Nguyen
et al. (2019) and Xu et al. (2020c) and are efficiently implemented on FPGA boards.
– Feature extraction: Image quality assessment (IQA) based on DNN models was
implemented on an FPGA by extracting discriminative image quality features that are
highly correlated with human visual perception (Zhu and yuan Ge 2020).
– Signal processing: An FPGA-based rapid electroencephalography (EEG) signal
classification system was built using VGG16 (Courbariaux et al. 2015). Another
application, under video analytics, was proposed by Kim et al. (2019), who used a
CNN-based super-resolution dedicated hardware (SR-HW) model to upscale 2K full HD videos
to 4K ultra-high-definition videos. CNNs are also used in the embedded application of
millimeter-wave radar-based human activity recognition (Lei et al. 2019).
Fig. 16 Summary of hardware perspective outcome
Fig. 17 Summary of optimization perspective outcome
– Multiple applications: Some applications use both object detection and classification
in a single model. One example is marine organism detection and classification from
underwater vision based on a deep CNN method (Han et al. 2020). Similarly, some DNN
models are tested on different applications. For instance, ESPNet V2 is evaluated on
four different tasks: object classification, object detection, semantic segmentation,
and language modeling (Mehta et al. 2019). Another work by Marco et al. (2020)
dynamically chooses the pre-trained DNN model based on the given input, required
accuracy, and inference time. The applications tested are machine translation and image
classification.
The analysis found that a total of 82% of the studies come under image processing. The
subcategories of image processing applications include object classification, feature
extraction, and object detection. The remaining applications are signal processing
(11%), followed by natural language processing with 7%. The graphical representation of
the analysis is shown in Fig. 18.
6 Challenges and possible solutions
This section explains some of the prominent challenges in on-device intelligence at
different stages: data collection, estimation of resource constraints, implementation of
optimization techniques, model training for limited environments, deployment of the
developed model to lightweight devices, inference, and ensuring privacy and security.
It also discusses possible solutions for overcoming the described limitations.
Finally, Table 12 summarizes the discussed challenges and solutions.
1. A large demand for computing and memory resources: DNN algorithm designs focused
on achieving high accuracy require millions of parameters and billions of floating-point
operations. This large volume of data and operations poses a challenge for edge
devices that have limited computing resources. For example, training the accuracy-oriented
model ResNet-50, which has 25 million parameters and 5.71 billion operations, requires
5.71 GiB of memory when the batch size is set to 64. Similarly, a lightweight model,
MobileNetV2, which has 4.2 million parameters and 1.15 billion operations, still requires
4.18 GiB of memory to train from scratch.
The solution to the above problem is to use:
(a) Transfer learning, in which a pre-trained model and its weights are loaded to solve
a similar problem. The top layers are retrained on the new input while the bottom
layers are kept frozen. In this scenario, the memory requirement is reduced from
5.71 GiB to 870 MiB for ResNet-50 and from 4.18 GiB to 489 MiB for MobileNetV2.
(b) Compression techniques that involve (i) quantization, which reduces precision
by lowering the number of bits; (ii) pruning, which removes redundant connections
in the network and discards weights below a certain threshold; and (iii)
encoding (e.g., Huffman encoding), working together in a pipeline so that the
model size and memory requirements are reduced by 35x to 49x and the model can be
deployed efficiently to lightweight devices (Han et al. 2016).
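The pruning and quantization steps described in (b) can be illustrated with a minimal Python sketch. It assumes a flat list of weights, a hand-picked magnitude threshold, and symmetric 8-bit quantization with a single scale factor. The function names are hypothetical, and this is not the trained-quantization and Huffman-coding pipeline of Han et al. (2016), only the basic idea behind its first two stages.

```python
def prune_by_magnitude(weights, threshold):
    """Discard (zero out) weights whose absolute value is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using one scale factor for the whole list."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


# Illustrative weights only; real layers hold thousands to millions of values.
weights = [0.82, -0.03, 0.40, 0.01, -0.65, 0.002]
pruned = prune_by_magnitude(weights, threshold=0.05)
quantized, scale = quantize_symmetric(pruned)
restored = dequantize(quantized, scale)
sparsity = pruned.count(0.0) / len(pruned)
```

Each restored value differs from its pruned original by at most half a quantization step (`scale / 2`), which is the precision given up in exchange for storing 8-bit integers instead of 32-bit floats.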
2. Constrained battery life of edge devices: Battery-powered edge devices rely on
power-hungry sensors to collect high-quality real-time data. For example, the image
sensors inside mobile cameras consume more energy to capture images of the same quality
as those from professional cameras. Hence, reducing energy consumption is another
challenge. To address it, sensors should be turned on only when required. For applications
that require sensors to be always available (streaming applications), the following two
methods can be adopted (Zhang et al. 2020):
(a) An intelligent data sub-sampling method can reduce energy consumption. Since
streaming data requires the DNN models to be executed continuously, sub-sampling
helps to extract relevant samples and exclude redundant data points.
(b) The second option is to redesign sensor hardware to reduce the energy consumed in
sensing by the analog-to-digital converter (ADC). Recent research on replacing the
ADC and using analog sensor signals as input to the DNN has shown promising results.
It was also found that downscaling a high-resolution input image to a lower
resolution reduces the number of DNN computations, which in turn reduces energy
consumption.
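A minimal sketch of the sub-sampling idea in (a), assuming that frames are flat lists of pixel intensities and that a frame is redundant when its mean absolute difference from the last kept frame is below a threshold. The redundancy criterion, the threshold, and the function names are illustrative assumptions rather than the method of Zhang et al. (2020).

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def subsample_stream(frames, threshold):
    """Return the indices of frames that differ enough from the last kept frame."""
    kept_indices = []
    last_kept = None
    for index, frame in enumerate(frames):
        if last_kept is None or mean_abs_diff(frame, last_kept) > threshold:
            kept_indices.append(index)
            last_kept = frame
    return kept_indices


# Five tiny 4-pixel "frames"; only the first and fourth carry new information.
frames = [
    [10, 10, 10, 10],
    [10, 11, 10, 10],
    [10, 10, 11, 10],
    [50, 52, 49, 51],
    [50, 52, 50, 51],
]
kept = subsample_stream(frames, threshold=2.0)
```

Only the kept frames would be passed to the DNN, so in this toy stream the model runs twice instead of five times.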
3. Real-time input data quality: The input data collected at edge devices might not
be of good quality and can be poorly consistent with the training data. Image data may
suffer from undefined backgrounds, blurriness, or shading, whereas speech data can be
noisy when recorded against a heavily crowded background. Different sensors collecting
the same data also produce varying results because of differences in their efficiency,
sampling rate, and sensitivity. Since the performance of deep learning algorithms is
highly dependent on the training data, these inconsistencies and noisy data pose a
threat to the model. Possible research directions on these issues are:
Fig. 18 Summary of application perspective outcome
(a) Data augmentation techniques that mimic the discrepancies occurring in real-time
data scenarios. Example: adding noise-robust loss functions, which allow flexibility
between training and testing data.
(b) Representation learning with a translation module, used to handle data collected
from different sources. The module translates the real-time data representation
between two sensors working on the same data source (Xu et al. 2020a).
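As a rough illustration of the data augmentation idea in (a), the sketch below adds Gaussian noise to clean training samples so that the training set resembles noisier sensor readings. Gaussian noise, the noise level `sigma`, the value range [0, 1], and the function names are all assumptions made for this example; the source does not prescribe a specific augmentation.

```python
import random


def add_gaussian_noise(sample, sigma, rng, low=0.0, high=1.0):
    """Return a noisy copy of the sample, clipped to the valid value range."""
    return [min(high, max(low, x + rng.gauss(0.0, sigma))) for x in sample]


def augment(samples, copies_per_sample, sigma, seed=0):
    """Keep every clean sample and append noisy copies of it."""
    rng = random.Random(seed)  # fixed seed so the augmentation is repeatable
    augmented = []
    for sample in samples:
        augmented.append(list(sample))
        for _ in range(copies_per_sample):
            augmented.append(add_gaussian_noise(sample, sigma, rng))
    return augmented


clean = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
training_set = augment(clean, copies_per_sample=2, sigma=0.05)
```

The same pattern extends to other discrepancies mentioned above, such as blur or shading, by swapping the noise function for a different transformation.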
4. Lack of adequate data at the edge: Edge devices use self-generated data, or data
captured from the surrounding environment, to generate an effective model. The data
thus obtained are sparse, unannotated, and of low quality (Wang et al. 2020). Many
existing models do not consider this challenge and assume that the data are of good
quality. Active learning is one solution, but it requires human intervention to label
the unannotated data and is therefore applicable only when the dataset is very small.
Other solutions to data scarcity include:
(a) Transfer learning, in which the knowledge of an already trained model is reused in
the new model to solve similar problems. This solution is applicable when adequate
training data are scarce. An example is few-shot learning.
(b) Shallow models such as the naive Bayes classifier or decision trees, which are more
suitable than DNNs for learning from small datasets.
(c) Incremental learning-based methods, which continue training a pre-trained model by
adding new data instances incrementally. In this case, only a few instances are
required to generate the new customized model.
Table 12 Summary of challenges and their possible solutions
Sl. No. | Challenges | Possible solutions
1 | A large demand for computing and memory resources | Transfer learning methods, deep compression technique
2 | Constrained battery life of edge devices | Data sub-sampling techniques, redesigning sensor hardware
3 | Real-time input data quality | Data augmentation techniques, representation learning
4 | Lack of adequate data at the edge | Transfer models, shallow learning, incremental learning
5 | Heterogeneity in computing units | Architecture-aware compiler
6 | Resource-constrained algorithms | Implementing cross-platform models, libraries, and frameworks
7 | Dynamic resources | Deploy optimal algorithms based on heuristic search
8 | On-device inference | Model partition, early exit of inference
9 | Lack of flexibility in trained models at the edge | Lifelong machine learning
10 | Privacy and security | Adversarial training, homomorphic encryption
5. Heterogeneity in computing units: An edge device can have different computing units,
ranging from traditional processors such as CPUs and GPUs to new domain-specific
processors such as Google’s TPU. Hence, mapping DNN models to the available set of
heterogeneous computing units is still a challenging issue. It can be addressed by an
architecture-aware compiler that understands the type of operations being executed in
DNN models. These operations are of two types: parallel operations, as seen in CNNs, and
sequential operations, as seen in RNNs. CPUs are optimized for sequential computation
and are thus more suitable for executing RNNs with strong sequential dependencies. In
contrast, GPUs are more suitable for CNN models, whose operations, such as matrix
multiplications, run in parallel (Chaber and Ławryńczuk 2018). Thus, the compiler should
map each operation to a suitable computing unit for better results.
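The mapping rule described for heterogeneous computing units can be sketched as a small lookup, shown below. The operation names, the two operation kinds, and the preference order (GPU first for parallel operations, CPU for sequential ones) are assumptions chosen to mirror the text; a real architecture-aware compiler would rely on cost models and hardware profiles rather than a fixed table.

```python
# Hypothetical classification of operations by their dominant execution pattern.
OP_KIND = {
    "conv2d": "parallel",
    "matmul": "parallel",
    "lstm_step": "sequential",
    "gru_step": "sequential",
}

# Preferred computing units for each kind, in order of preference.
PREFERRED_UNITS = {
    "parallel": ["GPU", "CPU"],
    "sequential": ["CPU"],
}


def map_ops_to_units(ops, available_units):
    """Assign each operation to the first preferred unit that the device has."""
    placement = {}
    for op in ops:
        kind = OP_KIND.get(op, "sequential")  # unknown ops default to the CPU path
        for unit in PREFERRED_UNITS[kind]:
            if unit in available_units:
                placement[op] = unit
                break
        else:
            raise ValueError(f"no suitable computing unit available for {op!r}")
    return placement


placement = map_ops_to_units(["conv2d", "lstm_step", "matmul"], {"CPU", "GPU"})
cpu_only = map_ops_to_units(["conv2d", "lstm_step"], {"CPU"})
```

When the device has no GPU, the parallel operations fall back to the CPU instead of failing, which is the kind of decision the compiler has to make per device.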
6. Hardware-software dependencies: Designing a DNN model within a given resource limit
is extremely challenging. The resource requirements and the execution of DNN models
depend heavily on the platform, framework, and computing library used. Among all the
resource constraints, memory footprint is especially hard to quantify, since each
framework and library uses different memory allocation and optimization strategies. For
example, on the NVIDIA Jetson TX1 platform, the memory requirements of DNN models vary
with the framework (Caffe, TensorFlow), because each framework provides different
optimization techniques for different resources. Hence, proposing an optimal algorithm
that satisfies given resource constraints is exceptionally challenging.
A cross-platform model that can transfer knowledge from one hardware platform to another
is the best solution to this issue. To achieve this, we need a cross-library or
cross-framework model that can estimate the resource requirements of different
kinds of implementation (Wang et al. 2020).
7. Dynamic resources: Most papers assume that resources are static; however, they can
be dynamic in some situations. On mobile phones, the resource capacity changes with
events such as the opening and closing of applications. Problems arise when the
available resources are smaller than the model requires. To solve this issue,
researchers proposed generating five DNN models with various accuracy-latency trade-offs
and shared weights to reduce the memory footprint. Profilers calculate the resource
requirements of each generated model, and during the deployment phase, a greedy
heuristic chooses the best models among them based on the given resource constraints
(Dhar et al. 2019). Thus, the chosen models execute multiple applications at runtime
within the specified memory, latency, and performance limits.
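The greedy selection among profiled model variants can be sketched as follows. The variant names, the profiled numbers, and the rule used here (for each application in turn, pick the most accurate variant that fits the remaining memory and that application's latency limit) are illustrative assumptions, not the exact heuristic of the cited work.

```python
from collections import namedtuple

# One profiled model variant: values would come from the profilers.
Variant = namedtuple("Variant", "name accuracy memory_mb latency_ms")


def greedy_select(app_variants, latency_limits_ms, memory_budget_mb):
    """Pick one variant per application without exceeding the shared memory budget."""
    remaining_mb = memory_budget_mb
    chosen = {}
    for app, variants in app_variants.items():
        feasible = [
            v for v in variants
            if v.memory_mb <= remaining_mb and v.latency_ms <= latency_limits_ms[app]
        ]
        if not feasible:
            chosen[app] = None  # nothing fits the current resources
            continue
        best = max(feasible, key=lambda v: v.accuracy)
        chosen[app] = best
        remaining_mb -= best.memory_mb
    return chosen


app_variants = {
    "detector": [
        Variant("det-xl", 0.80, 400, 90),
        Variant("det-l", 0.78, 250, 60),
        Variant("det-m", 0.74, 150, 40),
        Variant("det-s", 0.70, 80, 25),
        Variant("det-xs", 0.62, 40, 15),
    ],
    "classifier": [
        Variant("cls-l", 0.76, 200, 30),
        Variant("cls-s", 0.70, 60, 10),
    ],
}
latency_limits_ms = {"detector": 50, "classifier": 40}
selection = greedy_select(app_variants, latency_limits_ms, memory_budget_mb=300)
```

Re-running the selection whenever the memory budget changes, for example after another application opens, is one way to follow the dynamic resources described above.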
8. On-device inference: Most successful real-time applications using DNN models are
deployed on a centralized cloud server. The recent trend of moving inference from the
cloud to the edge is driven by the high latency, security, and privacy issues of cloud
deployment. Since the edge in turn poses issues due to resource constraints, an
alternative solution for model deployment is needed that can tackle both challenges.
Two methods have been proposed to cope with these challenges:
(a) Model partitioning, which executes parts of the DNN at the end device, edge server,
and cloud respectively so that the execution delay is reduced. An automatic scheduler
was developed to perform this partitioning at different levels (Kang et al. 2017).
(b) Early exit of inference (EEOI) of DNN models, in which the algorithm can stop
earlier than the usual model, thus saving time and energy. An inference exit is
possible only when a predefined model of the same kind ensures the performance
(Teerapittayanon et al. 2016).
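Both methods can be sketched in a few lines of Python. The first function picks the layer at which to cut a model between a device and a server, assuming per-layer latencies and upload costs have already been profiled; it uses two tiers only, whereas the text also mentions an edge server. The second function exits at the first branch whose prediction entropy is below a threshold. The entropy criterion, all numbers, and all names are assumptions made for illustration.

```python
import math


def best_partition(device_ms, server_ms, upload_ms):
    """Layers [0, cut) run on the device, layers [cut, n) on the server.

    upload_ms[cut] is the cost of sending the data that crosses the cut
    (cut = 0 uploads the raw input; cut = n keeps everything on the device).
    """
    best_cut, best_total = 0, float("inf")
    for cut in range(len(device_ms) + 1):
        total = sum(device_ms[:cut]) + upload_ms[cut] + sum(server_ms[cut:])
        if total < best_total:
            best_cut, best_total = cut, total
    return best_cut, best_total


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def early_exit_predict(exit_heads, entropy_threshold):
    """Evaluate exits from shallow to deep; stop at the first confident one.

    exit_heads is a list of zero-argument callables, so deeper (costlier)
    exits are only computed when the shallower ones are not confident.
    """
    if not exit_heads:
        raise ValueError("at least one exit is required")
    for index, head in enumerate(exit_heads):
        probs = softmax(head())
        is_last = index == len(exit_heads) - 1
        if is_last or entropy(probs) < entropy_threshold:
            return index, probs.index(max(probs))


cut, total_ms = best_partition(
    device_ms=[40, 30, 20, 10],
    server_ms=[4, 3, 2, 1],
    upload_ms=[100, 80, 10, 5, 0],
)

calls = []


def shallow_exit():
    calls.append("shallow")
    return [8.0, 0.0, 0.0]   # confident prediction for class 0


def deep_exit():
    calls.append("deep")
    return [0.1, 5.0, 0.2]


exit_index, label = early_exit_predict([shallow_exit, deep_exit], 0.5)
```

In this example the best cut is after the second layer, where the intermediate data is small enough to upload cheaply, and the confident shallow exit means the deeper part of the network is never executed.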
9. Lack of flexibility in trained models at the edge: In most DNN applications,
when a pre-trained model is deployed on low-end devices, there is no provision to
retrain it. Such models might work well in a small local area (a familiar
environment), but when the service broadens, the model is expected to work with new data
or tasks in an unfamiliar environment. This situation leads to poor user experience
and low performance. To tackle this problem, an advanced learning paradigm called
lifelong machine learning (LML) can be used (Crowder et al. 2019). This method
handles the challenge by teaching machines to learn newly accumulated knowledge by
themselves without human intervention. Since the technique was not designed for edge
devices, the devices must be power efficient enough to execute it.
10. Privacy and security: Common attacks on DNN models include (i) adversarial
attacks, which manipulate the input data to make the model misclassify. The one-pixel
adversarial attack is an example, where the added noise is barely perceptible to
the human eye. (ii) Data poisoning attacks, where malicious users inject a set of
training samples into the computing task, which results in a backdoor during inference.
(iii) Exploratory attacks, where the privacy of the model is compromised such that the
model parameters and private input data can be extracted from the user. Hence, the
challenge is to provide security and ensure privacy against attackers (Rezaei
and Liu 2019). The following methods can enhance privacy and security:
(a) Adversarial training is a solution to adversarial attacks. Here, the DNN models
are trained with adversarial samples (input images with added perturbations) to gain
resistance to adversarial attacks.
(b) Homomorphic encryption, where computations are applied to ciphertext to
generate encrypted results. After decryption, the result is identical to the result
obtained from the unencrypted data. In summary, both training and inference can be
done on encrypted data, which also provides resistance to data poisoning and
exploratory attacks (Rezaei and Liu 2019).
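To make the homomorphic property in (b) concrete, the sketch below uses a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying ciphertexts adds the underlying plaintexts. Paillier is chosen here only for illustration (the source does not name a scheme), the primes are far too small to be secure, and the function names are hypothetical. The example computes a dot product between encrypted inputs and plaintext integer weights, which is the core operation of a linear layer.

```python
import math
import random


def generate_keypair(p, q):
    """Build a toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # With g = n + 1, L(g^lam mod n^2) equals lam mod n, so mu is its inverse mod n.
    mu = pow(lam % n, -1, n)
    return (n, g), (lam, mu)


def encrypt(public_key, message, rng):
    n, g = public_key
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(public_key, private_key, ciphertext):
    n, _ = public_key
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def encrypted_dot(public_key, encrypted_inputs, plain_weights):
    """Compute Enc(sum(w_i * x_i)) from Enc(x_i) without decrypting any x_i."""
    n, _ = public_key
    n_sq = n * n
    accumulator = 1  # 1 is a valid encryption of 0
    for cipher, weight in zip(encrypted_inputs, plain_weights):
        # Raising a ciphertext to w multiplies its plaintext by w;
        # multiplying ciphertexts adds their plaintexts.
        accumulator = (accumulator * pow(cipher, weight, n_sq)) % n_sq
    return accumulator


rng = random.Random(0)
public_key, private_key = generate_keypair(61, 53)
inputs = [4, 5, 6]
weights = [2, 3, 1]
encrypted_inputs = [encrypt(public_key, x, rng) for x in inputs]
encrypted_result = encrypted_dot(public_key, encrypted_inputs, weights)
result = decrypt(public_key, private_key, encrypted_result)
```

The decrypted result equals the dot product computed on the plain data, as long as it stays below the modulus `n`. Non-linear layers need more capable schemes or approximations, which this sketch does not cover.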
7 Future scopes
This section lists important future research directions for on-device intelligence.
1. DNN Model Training on edge
The increase in AI applications and the high requirement for privacy and security have
resulted in rising demand for training at the edge level. Researchers are working on this
area by incorporating technologies such as federated learning, incremental training, and
accelerator arrays (since a single accelerator can no longer support huge DNN models)
(Song et al. 2020).
2. Hardware-software co-design on DNN models
Typically, DNN models are designed to maximize accuracy without considering
implementation complexity. Hence, they can be challenging to deploy on edge devices.
This issue is addressed by hardware-DNN model co-design. The approaches are of two
types: (a) reduced precision of operators and operations (quantization, weight sharing,
reduced bandwidth) and (b) reduced number of operations and model size (compression,
pruning). Hardware-aware software optimizations are promising and need further effort
to succeed (Dhar et al. 2019).
3. Neuromorphic hardware and event-based spiking neural networks
Even though information about the operating mechanisms of the biological brain is
limited, studies show that neuromorphic hardware simulates the biological neural
architecture to obtain high speed and low power consumption. The applications of these
technologies still have scope for enhancement.
A spiking neural network (SNN) is energy efficient compared to a digital DNN because
power consumption is high only when a spike is fired. The results also show that
SNNs have great potential to lower energy consumption when executed on neuromorphic
hardware. Further research on this event-based processing on architecture chips and
accelerators is promising (Li and Liewig 2020).
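The event-based behavior of a spiking neuron can be illustrated with a leaky integrate-and-fire (LIF) model, sketched below. The LIF model, the leak factor, the threshold, and the input values are assumptions made for this example; the source does not specify a neuron model.

```python
def simulate_lif(input_current, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps."""
    potential = 0.0
    spikes = []
    for current in input_current:
        # The membrane potential decays (leaks) and integrates the new input.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)   # the neuron fires an event
            potential = 0.0    # and resets
        else:
            spikes.append(0)   # no event, so no downstream work is triggered
    return spikes


inputs = [0.5, 0.5, 0.5, 0.0, 0.0, 1.2]
spikes = simulate_lif(inputs)
active_steps = sum(spikes)
```

Downstream computation is needed only at the time steps where a spike occurs, two out of six in this example, which is the source of the energy saving described above.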
4. On-chip memory and in-memory processing
DNN workloads are memory intensive; hence, the connection between computing units and
memory devices in existing solutions has become a performance bottleneck. A future
research direction is to use on-chip memory. Implementing it in AI accelerators, together
with recent advances in in-memory processing, opens the possibility of using both in
ANN processing (Li and Liewig 2020).
5. Future trends in Optimization
Some possible aspects of optimization of the DNN model include the following:
– Optimizing computational performance by reducing the number of parameters and
calculations. Pruning and quantization should keep precision loss minimal while making
the model a small-scale version for easy deployment on embedded devices.
– Optimizing memory access performance: Optimization techniques such as pruning and
compression can improve memory access performance and calculation speed.
Nevertheless, storage speed that cannot keep up with calculation speed is still a
challenging problem.
– Optimizing power consumption and chip area: Multipliers are the units that consume
the most area and power in computational units. Therefore, further studies on data
organization forms can reduce the use of hardware resources, the power consumption of
accelerators, and data exchange. Other techniques such as approximate computing,
pruning, and compression are also useful (Song et al. 2020).
6. Choice of Key metrics
Key metrics quantify the efficiency of the proposed DNN models when deployed on
hardware platforms.
– Accuracy determines how effectively a DNN model performs the assigned task. It is
affected by the difficulty of the given task and the dataset used. For example, a high
accuracy obtained for MNIST image classification is not comparable to a high accuracy
obtained for ImageNet image classification because of the difference in complexity of
the dataset and classification task.
– Latency and throughput measure how fast the model can execute in real time. Latency is
the time taken to generate the output and is reported as seconds per inference.
Throughput is the number of tasks completed per unit of time and is expressed as
inferences per second.
– Energy efficiency is the number of tasks executed per unit of energy and is reported
as inferences per joule. In contrast, energy consumption is often expressed as joules
per inference.
– Cost, which is mainly determined by chip area and off-chip memory bandwidth
requirements.
– Flexibility and scalability: Flexibility ensures that the model can execute a wide
variety of tasks, whereas scalability ensures that the same design can be deployed in
multiple domains (cloud or edge) by increasing the model size.
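The latency, throughput, and energy metrics defined above follow directly from three measurements, as the sketch below shows. It assumes inferences are run one after another (batch size 1), so that average latency is simply total time divided by the number of inferences; the function name and the sample numbers are hypothetical.

```python
def inference_metrics(num_inferences, total_seconds, total_joules):
    """Derive the reporting metrics from raw measurements of a benchmark run."""
    return {
        "latency_s_per_inference": total_seconds / num_inferences,
        "throughput_inferences_per_s": num_inferences / total_seconds,
        "energy_efficiency_inferences_per_joule": num_inferences / total_joules,
        "energy_consumption_joules_per_inference": total_joules / num_inferences,
    }


# Example run: 200 sequential inferences taking 4 s and consuming 10 J in total.
metrics = inference_metrics(num_inferences=200, total_seconds=4.0, total_joules=10.0)
```

With batching or pipelining, latency and throughput are no longer reciprocals of each other, so both should be measured and reported separately.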
8 Experimental setup
Efficient processing of DNN models and hardware optimization involve diverse options,
including various hardware platforms, operating systems, Deep Learning (DL) frameworks,
lightweight DNN models, datasets, optimized DL libraries, and DL benchmarks. Many
combinations of these options are possible when designing inference at the edge level.
(a) Hardware Platforms: Diverse platforms ranging from embedded GPUs, ASICs, and FPGAs
to other accelerators and computing devices (Tables 6 and 9).
(b) Operating Systems: Linux, Windows, macOS, Android, RTOS.
(c) DL Frameworks: A dozen ML frameworks are commonly used for developing deep learning
models, such as Caffe/Caffe2, Chainer, CNTK, Keras, MXNet, TensorFlow, and PyTorch
(Table 7).
(d) DNN Models: SqueezeNet, CondenseNet, MobileNet, etc. (Tables 4 and 5).
(e) Benchmark Datasets: ImageNet, COCO, and VOC can be used for image classification
and machine translation techniques.
(f) Optimized DL Libraries: Many optimized libraries, such as cuDNN, Intel MKL, and
FBGEMM, support various inference runtimes, such as Apple CoreML, Intel OpenVINO,
NVIDIA TensorRT, ONNX Runtime, Qualcomm SNPE, and TF-Lite.
(g) DNN Profiling tools: (1) TensorBoard is the most popular visualization tool used by
data scientists and applied researchers using TensorFlow. Other available tools include
(2) TensorboardX for PyTorch (Chen et al. 2015) and (3) MXNet Data Visualization
for MXNet (Stevens and Antiga 2020) users. Energy requirements can be estimated
using (4) Accelergy (Wu et al. 2019).
(h) DNN Benchmarks: Benchmarks such as EmBench (Almeida et al. 2019) and MLPerf (Mattson
et al. 2020) evaluate the designed models from various perspectives, such as FLOP
analysis, batch size, accuracy vs. time, and per-layer analysis, to verify the model’s
performance.
(i) Applications: Computer vision, signal processing, language translation.
9 Conclusion
This paper deals with key topics in the design of DNN models for deployment at edge
devices. The design ideas include the types of DNN models, hardware and software
requirements for development, resource constraints imposed by the computing devices, and
the optimization techniques required for the efficient processing of DNNs. The study also
conducts a systematic literature review of different real-time applications of DNN
models. This review answers the four previously defined key research questions from the
following four perspectives:
1. DNN model perspective: identified the DNN algorithms most commonly used on
different hardware computing platforms.
2. Hardware and software perspective: surveyed the available options in hardware
and software resources.
3. Resources and optimization perspective: analyzed the types of resource constraints
in hardware platforms and the use of optimization techniques to overcome
performance issues.
4. Application perspective: examined the real-time uses of DNN models in different
domains.
Table 13 summarizes the defined research questions and the findings of the
systematic literature review.
Moreover, based on the comparative analysis of the applications in recent
research papers, we have also identified the developments in on-device intelligence.
Table 13 Research questions and conclusions
Research question | Research question description | Survey results
RQ1 | What are the DNN models that are suitable for different types of hardware? | The review identifies two kinds of DNN models and those commonly used in the literature: (a) Accuracy-oriented DNN models: VGG-16 and YOLO CNN. (b) Lightweight DNN models: MobileNet and SqueezeNet.
RQ2 | What are the most commonly used hardware platforms for achieving on-device intelligence? | A widely used hardware platform is FPGA (64%), in which the standard FPGA boards used are Altera (Arria-10 series) and Xilinx (Zynq-7000 series) boards. They use customized CNN designs (SparkNet, ZynqNet) by mapping them to their specific FPGA architecture. The majority of the remaining applications use hardware devices such as NVIDIA Jetson TX2 (embedded GPUs).
RQ3 | What are the optimization techniques adopted and what kind of resources are optimized? | The most adopted optimization technique is quantization, in which 32-bit floating-point values are converted to 8-bit fixed-point values. Other techniques include optimized convolutions, neural net compression, and pruning. The resources optimized by these techniques are memory utilization, power efficiency, and computation.
RQ4 | What are the main application domains that utilize DNN models in different hardware models? | Image processing applications (82%) are the main domain that uses DNN models in hardware. Examples include analysis of surveillance footage, traffic signal classification, monitoring ice thickness, vine trunk recognition, fault bearing detection, etc. Other applications are signal processing, NLP, video enhancement, etc.
Declarations
Conflicts of interest The authors declare that they have no conflict of interest.
References
Aguiar A, Santos FN, Sousa AJMD, Oliveira PM, Santos LC (2020) Visual trunk detection using trans-
fer learning and a deep learning-based coprocessor. IEEE Acc 8:77308–77320
Almeida M, Laskaridis S, Leontiadis I, Venieris SI, Lane N (2019) Embench: Quantifying performance
variations of deep neural networks across modern commodity devices. ArXiv:abs/1905.07346
Amodei D, Ananthanarayanan S, Anubhai R, Bai J, Battenberg E, Case C, Casper J, Catanzaro B, Cheng
Q, Chen G, Chen J, Chen J, Chen Z, Chrzanowski M, Coates A, Diamos G, Ding K, Du N, Elsen
E, Engel J, Fang W, Fan L, Fougner C, Gao L, Gong C, Hannun A, Han T, Johannes L, Jiang B,
Ju C, Jun B, LeGresley P, Lin L, Liu J, Liu Y, Li W, Li X, Ma D, Narang S, Ng A, Ozair S, Peng
Y, Prenger R, Qian S, Quan Z, Raiman J, Rao V, Satheesh S, Seetapun D, Sengupta S, Srinet K,
Sriram A, Tang H, Tang L, Wang C, Wang J, Wang K, Wang Y, Wang Z, Wang Z, Wu S, Wei L,
Xiao B, Xie W, Xie Y, Yogatama D, Yuan B, Zhan J, Zhu Z (2016) Deep speech 2 : End-to-end
speech recognition in english and mandarin. In: Balcan MF, Weinberger KQ (eds) Proceedings
of The 33rd International Conference on Machine Learning, PMLR, New York, New York, USA,
Proceedings of Machine Learning Research, vol 48, p 173–182,
http://proceedings.mlr.press/v48/amodei16.html
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and trans-
late. In: Bengio Y, LeCun Y (eds) 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, arXiv:1409.0473
Barry B, Brick C, Connor F, Donohoe D, Moloney D, Richmond R, ORiordan M, Toma V (2015) Always-
on vision processing unit for mobile applications. IEEE Micro 35:56–66
Bo W, Ma F, Ge L, Ma H, Hongxia W, Mohamed MA (2021) Icing-edgenet: a pruning lightweight edge
intelligent method of discriminative driving channel for ice thickness of transmission lines. IEEE
Trans on Instrum and Measur 70:1–12
Capra M, Bussolino B, Marchisio A, Shafique M, Masera G, Martina M (2020) An updated survey of effi-
cient hardware architectures for accelerating deep convolutional neural networks. Future Int 12:113
Cass S (2019) Taking ai to the edge: googles tpu now comes in a maker-friendly package. IEEE Spectrum
56:16–17
Chaber P, Ławryńczuk M (2018) Pruning of recurrent neural models: an optimal brain damage approach.
Nonlin Dyn 92:763–780
Chen T, Li M, Li Y, Lin M, Wang N, Wang M, Xiao T, Xu B, Zhang C, Zhang Z (2015) Mxnet: A flexible
and efficient machine learning library for heterogeneous distributed systems. ArXiv:abs/1512.01274
Chen X, Yao L, McAuley J, Zhou G, Wang X (2021) A survey of deep reinforcement learning in recom-
mender systems: A systematic review and future directions. ArXiv:abs/2109.03540
Chen Y, Xie Y, Song L, Chen F, Tang T (2020) A survey of accelerator architectures for deep neural net-
works. Engineering 6:264–274
Cheng K, Wang YC (2011) Using mobile gpu for general-purpose computing - a case study of face
recognition on smartphones. In: Proceedings of 2011 International Symposium on VLSI Design,
Automation and Test p 1–4
Cho K, Jang HJ (2020) Comparison of different input modalities and network structures for deep learning-
based seizure detection. Scientific Reports 10
Choudhary T, Mishra V, Goswami A, Jagannathan S (2020) A comprehensive survey on model compression
and acceleration. Art Intell Rev p 1–43
Chu T, Wang J, Codecà L, Li Z (2020) Multi-agent deep reinforcement learning for large-scale traffic signal
control. IEEE Trans Intell Transp Syst 21:1086–1095
Courbariaux M, Bengio Y, David JP (2015) Binaryconnect: Training deep neural networks with binary
weights during propagations. In: NIPS
Crowder JA, Carbone J, Friess S (2019) Methodologies for continuous, life-long machine learning for ai
systems. Artificial Psychology
Deng W, Liu H, Xu J, Zhao H, Song Y (2020) An improved quantum-inspired differential evolution algo-
rithm for deep belief network. IEEE Trans Instrum Measur 69:7319–7327
Dhar S, Guo J, Liu J, Tripathi S, Kurup U, Shah M (2019) On-device machine learning: An algorithms and
learning theory perspective. ArXiv:abs/1911.00623
Ding W, Huang Z, Huang Z, Tian L, Wang H, Feng S (2019) Designing efficient accelerator of depthwise
separable convolutional neural network on fpga. J Syst Archit 97:278–286
Divya P, Rajan DP, Kumar NS (2020) Analysis of machine and deep learning approaches for credit card
fraud detection. ICCCE 2020. p 243–254
Erickson BJ (2019) Deep learning and machine learning in imaging: Basic principles. Deep learning and
machine learning in imaging: Basic principles. P 39–46
Erol B, Majumdar A, Lwowski J, Benavidez P, Rad P, Jamshidi M (2018) Improved Deep Neural Network
Object Tracking System for Applications in Home Robotics, p 369–395
Faraone J, Gambardella G, Fraser NJ, Blott M, Leong PHW, Boland D (2018) Customizing low-precision
deep neural networks for fpgas. In: 2018 28th International Conference on Field Programmable Logic
and Applications (FPL) p 97–973
Feng J, Li D, Chen J, Zhang X, Tang X, Wu X (2019) Hyperspectral band selection based on ternary weight
convolutional neural network. In: IGARSS 2019 - 2019 IEEE International Geoscience and Remote
Sensing Symposium p 3804–3807
Fischer A, Igel C (2012) An introduction to restricted boltzmann machines. In: CIARP
Gao Y, Wu L (2021) Efficiently mastering the game of nogo with deep reinforcement learning supported by
domain knowledge. Electronics. 10(13):1533
Gholami A, Kwon K, Wu B, Tai Z, Yue X, Jin PH, Zhao S, Keutzer K (2018) Squeezenext: Hardware-aware
neural network design. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW) p 1719–171909
Goel A, Tung C, Lu YH, Thiruvathukal GK (2020) A survey of methods for low-power deep learning and
computer vision. In: 2020 IEEE 6th World Forum on Internet of Things (WF-IoT) p 1–6
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y
(2014) Generative adversarial nets. In: Proceedings of the 27th International Conference on
Neural Information Processing Systems - Volume 2, MIT Press, Cambridge, MA, USA, NIPS’14,
p 2672–2680
Goodfellow IJ, Bengio Y, Courville A (2016) Deep Learning. MIT Press, Cambridge, MA, USA,
http://www.deeplearningbook.org
Han F, Yao J, Zhu H, Wang C (2020) Marine organism detection and classification from underwater
vision based on the deep cnn method. Math Probl Eng 2020:1–11
Han S, Pool J, Tran J, Dally WJ (2015) Learning both weights and connections for efficient
neural networks. In: Proceedings of the 28th International Conference on Neural Information
Processing Systems - Volume 1, MIT Press, Cambridge, MA, USA, NIPS’15, p 1135–1143
Han S, Mao H, Dally W (2016) Deep compression: Compressing deep neural network with pruning,
trained quantization and huffman coding. arXiv Computer Vision and Pattern Recognition
Hasselt HV, Guez A, Silver D (2016) Deep reinforcement learning with double q-learning.
ArXiv:abs/1509.06461
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), p 770–778,
https://doi.org/10.1109/CVPR.2016.90
Hinton GE, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network.
ArXiv:abs/1503.02531
Howard AG, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, Le
QV, Adam H (2019) Searching for mobilenetv3. In: 2019 IEEE/CVF International Conference on
Computer Vision (ICCV) p 1314–1324
Hu J, Shen L, Albanie S, Sun G, Wu E (2020) Squeeze-and-excitation networks. IEEE Trans Patt Anal
Mach Intell 42:2011–2023
Huang G, Liu Z, Weinberger KQ (2017) Densely connected convolutional networks. In: 2017 IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR) p 2261–2269
Huang G, Liu S, Maaten LVD, Weinberger KQ (2018) Condensenet: An efficient densenet using learned
group convolutions. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
p 2752–2761
Huang SM, Chan YW, Chang CH, Kang TC, Yang CT, Tsai YT (2019) A holistic and local feature
learning method for machine health monitoring with convolutional bi-directional lstm networks.
International Conference on Frontier Computing. p 382–388
Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K (2016) Squeezenet: Alexnet-level
accuracy with 50x fewer parameters and < 0.5 mb model size
Jiang Z, Chen T, Li M (2018) Efficient deep learning inference on edge devices. ACM SysML
Jinguji A, Sada Y, Nakahara H (2019) Real-time multi-pedestrian detection in surveillance camera using
fpga. In: 2019 29th International Conference on Field Programmable Logic and Applications
(FPL) p 424–425
Jouppi N, Young C, Patil N, Patterson DA (2018) Motivation for and evaluation of the first tensor pro-
cessing unit. IEEE Micro 38:10–19
Kalgaonkar P, El-Sharkawy M (2021) Condensenext: An ultra-efficient deep neural network for embed-
ded systems. In: 2021 IEEE 11th Annual Computing and Communication Workshop and Confer-
ence (CCWC) p 0524–0528
Kang Y, Hauswald J, Gao C, Rovinski A, Mudge TN, Mars J, Tang L (2017) Neurosurgeon:
Collaborative intelligence between the cloud and mobile edge. In: Proceedings of the
Twenty-Second International Conference on Architectural Support for Programming Languages
and Operating Systems
Kavitha P, Rubini P (2021) A comprehensive literature survey for deep learning approaches to agricul-
tural applications. World Rev Sci, Technol Sustain Develop 1:1
Khan A, Sohail A, Zahoora U, Qureshi AS (2020) A survey of the recent architectures of deep convolutional
neural networks. Artif Intell Rev p 1–62
Kim Y, Choi JS, Kim M (2019) A real-time convolutional neural network for super-resolution on fpga with
applications to 4k uhd 60 fps video services. IEEE Trans Circ Syst Video Technol 29:2521–2534
Kristiani E, Yang C, Nguyen KLP (2020) Optimization of deep learning inference on edge devices. In:
2020 International Conference on Pervasive Artificial Intelligence (ICPAI) p 264–267
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural
networks. Commun ACM 60:84–90
Lechner M, Jantsch A, Dinakarrao SMP (2019) Resconn: Resource-efficient fpga-accelerated cnn for
traffic sign classification. In: 2019 Tenth International Green and Sustainable Computing Confer-
ence (IGSC) p 1–6
Lee EG, Miyashita D, Chai E, Murmann B, Wong S (2017) Lognet: Energy-efficient neural networks using
logarithmic computation. In: 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) p 5900–5904
Lei P, Liang J, Guan Z, Wang J, Zheng T (2019) Acceleration of fpga based convolutional neural network
for human activity classification using millimeter-wave radar. IEEE Acc 7:88917–88926
Leo MD, Sharma S, Maddulety K (2019) Machine learning in banking risk management: a literature review.
Risks 7(1):29
Li E, Yang L, Wang B, Li J, ti Peng Y (2012) Surf cascade face detection acceleration on sandy bridge
processor. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops p 41–47
Li L, Ota K, Dong M (2018) Deep learning for smart industry: efficient manufacture inspection system with
fog computing. IEEE Trans Industr Inform 14:4665–4673
Li W, Liewig M (2020) A survey of ai accelerators for edge environment. In: WorldCIST
Liu C, Zoph B, Shlens J, Hua W, Li L, Fei-Fei L, Yuille A, Huang J, Murphy K (2018) Progressive neural
architecture search. In: ECCV
Lu L, Xie J, Huang R, Zhang J, Lin W, Liang Y (2019) An efficient hardware accelerator for sparse convo-
lutional neural networks on fpgas. In: 2019 IEEE 27th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM) p 17–25
Ma N, Zhang X, Zheng H, Sun J (2018a) Shufflenet v2: Practical guidelines for efficient cnn architecture
design. In: ECCV
Ma X, Zheng W, Peng Z, Yang J (2019) Fpga-based rapid electroencephalography signal classification sys-
tem. In: 2019 IEEE 11th International Conference on Advanced Infocomm Technology (ICAIT) p
223–227
Ma Y, Suda N, Cao Y, sun Seo J, Vrudhula S (2016) Scalable and modularized rtl compilation of convo-
lutional neural networks onto fpga. In: 2016 26th International Conference on Field Programmable
Logic and Applications (FPL) p 1–8
Ma Y, Zhou G, Wang S, Zhao H, Jung W (2018) Signfi: Sign language recognition using wifi. Proc ACM
Interact Mob Wearable Ubiquitous Technol 2(23):1–21
Marantos C, Karavalakis N, Leon V, Tsoutsouras V, Pekmestzi K, Soudris D (2018) Efficient support vector
machines implementation on intel/movidius myriad 2. In: 2018 7th International Conference on Mod-
ern Circuits and Systems Technologies (MOCAST) p 1–4
Marchisio A, Hanif M, Khalid F, Plastiras G, Kyrkou C, Theocharides T, Shafique M (2019) Deep learning
for edge computing: Current trends, cross-layer optimizations, and open research challenges. In: 2019
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) p 553–559
Marco VS, Taylor B, Wang Z, Elkhatib Y (2020) Optimizing deep learning inference on embedded systems
through adaptive model selection. ACM Trans Emb Comput Syst (TECS) 19:1–28
Mattson P, Tang H, Wei GY, Wu CJ, Reddi V, Cheng C, Coleman CA, Diamos G, Kanter D, Micikevicius P,
Patterson D, Schmuelling G (2020) Mlperf: an industry standard benchmark suite for machine learn-
ing performance. IEEE Micro 40:8–16
Mehta S, Rastegari M, Shapiro L, Hajishirzi H (2019) Espnetv2: A light-weight, power efficient, and gen-
eral purpose convolutional neural network. In: 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) p 9182–9192
Miklosik A, Kuchta M, Evans N, Zak S (2019) Towards the adoption of machine learning-based analytical
tools in digital marketing. IEEE Acc 7:85705–85718
Mittal S (2019) A survey on optimized implementation of deep learning models on the nvidia jetson plat-
form. J Syst Archit 97:428–442
Mittal S, Vaishay S (2019) A survey of techniques for optimizing deep learning on gpus. J Syst Archit
99:101635
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller MA, Fidjeland
A, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg
S, Hassabis D (2015) Human-level control through deep reinforcement learning. Nature 518:529–533
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap TP, Harley T, Silver D, Kavukcuoglu K (2016) Asynchro-
nous methods for deep reinforcement learning. In: ICML
Mohamed EA, Ahmed I, Mehdi RAK, Hussain H (2021) Impact of corporate performance on stock price
predictions in the uae markets: Neuro-fuzzy model. Int J Intell Syst Acc, Fin Manag 28:52–71
Moher D, Liberati A, Tetzlaff J, Altman D (2009) Preferred reporting items for systematic reviews and
meta-analyses: the prisma statement. The BMJ 339
Mousouliotis PG, Petrou L (2020) Cnn-grinder: from algorithmic to high-level synthesis descriptions of
cnns for low-end-low-cost fpga socs. Microproc Microsyst 73:102990
Nguyen DT, Nguyen T, Kim H, Lee H (2019) A high-throughput and power-efficient fpga imple-
mentation of yolo cnn for object detection. IEEE Trans Very Large Scale Integr (VLSI) Syst
27:1861–1873
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans on Knowl and Data Eng 22(10):1345–
1359. https://doi.org/10.1109/TKDE.2009.191
Park K, Jang W, Lee W, Nam K, Seong K, Chai K, Li W (2020) Real-time mask detection on google
edge tpu. ArXiv:abs/2010.04427
Pham M, Kim J, Kim C (2020) Deep learning-based bearing fault diagnosis method for embedded sys-
tems. Sensors 20(23):6886
Rastegari M, Ordonez V, Redmon J, Farhadi A (2016) Xnor-net: Imagenet classification using binary
convolutional neural networks. In: ECCV, p 525–542
Real E, Aggarwal A, Huang Y, Le QV (2019) Regularized evolution for image classifier architecture
search. In: AAAI
Rezaei S, Liu X (2019) Security of deep learning methodologies: Challenges and opportunities.
ArXiv:abs/1912.03735
Sak H, Senior A, Beaufays F (2014) Long short-term memory based recurrent neural network architec-
tures for large vocabulary speech recognition. ArXiv:abs/1402.1128
Salama WM, Aly MH (2021) Deep learning in mammography images segmentation and classification:
automated cnn approach. Alexandria Eng J 60:4701–4709
Sandler M, Howard AG, Zhu M, Zhmoginov A, Chen LC (2018) Mobilenetv2: Inverted residuals and
linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
p 4510–4520
Shah AA, Zaidi Z, Chowdhry BS, Daudpoto J (2016) Real time face detection/monitor using raspberry
pi and matlab. In: 2016 IEEE 10th International Conference on Application of Information and
Communication Technologies (AICT) p 1–4
Shahshahani M, Goswami P, Bhatia D (2018) Memory optimization techniques for fpga based cnn
implementations. In: 2018 IEEE 13th Dallas Circuits and Systems Conference (DCAS) p 1–6
Si J, Yfantis E, Harris S (2019) A ss-cnn on an fpga for handwritten digit recognition. In: 2019 IEEE
10th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, UEMCON
p 0088–0093
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition.
CoRR abs/1409.1556
Song J, Wang X, Zhao Z, Li W, Zhi T (2020) A survey of neural network accelerator with software
development environments. J Semicond 41:021403
Song M, Hu Y, Chen H, Li T (2017) Towards pervasive and user satisfactory cnn across gpu microarchi-
tectures. In: 2017 IEEE International Symposium on High Performance Computer Architecture
(HPCA), p 1–12, https://doi.org/10.1109/HPCA.2017.52
Song X, Kanasugi H, Shibasaki R (2016) Deeptransport: Prediction and simulation of human mobility
and transportation mode at a citywide level. In: IJCAI
SP T (2019) Enhanced data parallelism for irregular memory access optimization on gpu. J Parallel
Distrib Comp 73(1):42–51
Sreenu G, Durai MAS (2019) Intelligent video surveillance: a review through deep learning techniques
for crowd analysis. J Big Data 6:1–27
Stevens E, Antiga L (2020) Deep learning with pytorch: A practical approach to building neural network
models using PyTorch. Packt Publishing Ltd
Struharik R, Vukobratovic B, Erdeljan A, Rakanovic D (2020) Conna-hardware accelerator for com-
pressed convolutional neural networks. Microproc Microsyst 73:102991
Sun T, Ding S, Xu X (2021) An iterative stacked weighted auto-encoder. Soft Comput 25:4833–4843
Sze V, Chen Y, Yang TJ, Emer J (2017) Efficient processing of deep neural networks: a tutorial and sur-
vey. Proc of the IEEE 105:2295–2329
Sze V, Chen YH, Yang TJ, Emer J (2020) How to evaluate deep neural network processors: tops alone
considered harmful. IEEE Solid-State Circ Magazine 12:28–41
Szegedy C, Liu W, Jia Y, Sermanet P, Reed SE, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A
(2015) Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) p 1–9
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for
computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
p 2818–2826
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of
residual connections on learning. In: AAAI
Talib M, Majzoub S, Nasir Q, Jamal D (2020) A systematic literature review on hardware implementation of
artificial intelligence algorithms. The J Supercomp 77:1897–1938
Tamizharasan P, Ramasubramanian N (2018) Analysis of large deviations behavior of multi-gpu memory
access in deep learning. The J Supercomp 74:2199–2212
Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks.
ArXiv:abs/1905.11946
Tan M, Chen B, Pang R, Vasudevan V, Le QV (2019) Mnasnet: Platform-aware neural architecture search
for mobile. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) p
2815–2823
Tanzi L, Vezzetti E, Moreno R, Aprato A, Audisio A, Massè A (2020) Hierarchical fracture classification of
proximal femur x-ray images using a multistage deep learning approach. European journal of radiol-
ogy 133:109373
Tanzi L, Piazzolla P, Porpiglia F, Vezzetti E (2021) Real-time deep learning semantic segmentation dur-
ing intra-operative surgery for 3d augmented reality assistance. Int J Comp Ass Radiol Surg
16:1435–1445
Tayara H, to Chong K (2020) Improved predicting of the sequence specificities of rna binding proteins by
deep learning. In: IEEE/ACM transactions on computational biology and bioinformatics
Teerapittayanon S, McDanel B, Kung HT (2016) Branchynet: Fast inference via early exiting from deep
neural networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR) p 2464–2469
Tian C, Chan WKV (2021) Spatial-temporal attention wavenet: a deep learning framework for traffic predic-
tion considering spatial-temporal dependencies. Iet Intell Transp Syst 15:549–561
Trappey AJC, Chen PPJ, Trappey CV, Ma L (2019) A machine learning approach for solar power technol-
ogy review and patent evolution analysis. Appl Sci 9(7):1478
Twinanda AP, Shehata S, Mutter D, Marescaux J, de Mathelin M, Padoy N (2017) Endonet: a deep architec-
ture for recognition tasks on laparoscopic videos. IEEE Trans Med Imag 36:86–97
Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Atten-
tion is all you need. ArXiv:abs/1706.03762
Véstias M (2019) A survey of convolutional neural networks on edge with reconfigurable computing. Algo-
rithms 12:154
Véstias M, Duarte R, Sousa J, Neto H (2020) A fast and scalable architecture to run convolutional neural
networks in low density fpgas. Microprocess Microsystems 77:103136
Véstias M, Duarte R, Sousa J, Neto H (2020) Moving deep learning to the edge. Algorithms 13:125
Vreča J, Sturm KJX, Gungl E, Merchant F, Bientinesi P, Leupers R, Brezočnik Z (2020) Accelerating deep
learning inference in constrained embedded devices using hardware loops and a dot product unit.
IEEE Acc 8:165913–165926
Wan Z, Li H, He H, Prokhorov DV (2019) Model-free real-time ev charging scheduling based on deep rein-
forcement learning. IEEE Trans on Smart Grid 10:5246–5257
Wang F, Fan X, Wang F, Liu J (2019) Backup battery analysis and allocation against power outage for cel-
lular base stations. IEEE Trans Mob Comp 18:520–533
Wang F, Gong W, Liu J (2019) On spatial diversity in wifi-based human activity recognition: A deep learn-
ing-based approach. IEEE Int of Things J 6:2035–2047
Wang F, Wang F, Ma X, Liu J (2019) Demystifying the crowd intelligence in last mile parcel delivery for
smart cities. IEEE Netw 33:23–29
Wang F, Zhang C, Wang F, Liu J, Zhu Y, Pang H, Sun L (2020) Deepcast: Towards personalized qoe for
edge-assisted crowdcast with deep reinforcement learning. IEEE/ACM Trans Netw 28:1255–1268
Wang F, Zhang M, Wang X, Ma X, Liu J (2020) Deep learning for edge computing applications: a state-of-
the-art survey. IEEE Acc 8:58322–58336
Wang X, Han Y, Leung VC, Niyato D, Yan X, Chen X (2020) Convergence of edge computing and deep
learning: a comprehensive survey. IEEE Commun Surv Tutor 22(2):869–904
Wu B, Iandola FN, Jin PH, Keutzer K (2017) Squeezedet: Unified, small, low power fully convolutional
neural networks for real-time object detection for autonomous driving. In: 2017 IEEE Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW) p 446–454
Wu Y, Yuan M, Dong S, Lin L, Liu Y (2018) Remaining useful life estimation of engineered systems using
vanilla lstm neural networks. Neurocomputing 275:167–179
Wu YN, Emer JS, Sze V (2019) Accelergy: An Architecture-Level Energy Estimation Methodology for
Accelerator Designs. In: IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Xia M, Huang Z, Tian L, Wang H, Chang VI, Zhu Y, Feng S (2021) Sparknoc: An energy-efficiency fpga-
based accelerator using optimized lightweight cnn for edge computing. J Syst Archit 115:101991
Xie S, Girshick RB, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural net-
works. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) p 5987–5995
Xiong Y, Kim H, Hedau V (2019) Antnets: Mobile convolutional neural networks for resource efficient
image classification. ArXiv:abs/1904.03775
Xu D, Li T, Li Y, Su X, Tarkoma S, Jiang T, Crowcroft J, Hui P (2020a) Edge intelligence: Architectures,
challenges, and applications. arXiv Networking and Internet Architecture
Xu K, Fu C, Zhang X, Chen C, Zhang YL, Rong W, Wen Z, Zhou J, Li X, Qiao Y (2020b) admscn: A novel
perspective for user intent prediction in customer service bots. In: Proceedings of the 29th ACM
International Conference on Information & Knowledge Management
Xu K, Wang X, Liu X, Cao C, Li H, Peng H, Wang D (2020c) A dedicated hardware accelerator for real-
time acceleration of yolov2. J Real-Time Image Proc 1–12
Yan J, He H, Zhong X, Tang Y (2017) Q-learning-based vulnerability analysis of smart grid against sequen-
tial topology attacks. IEEE Trans on Inform Forens and Secur 12:200–210
Yao H, Tang X, Wei H, Zheng G, Li ZJ (2019) Revisiting spatial-temporal similarity: a deep learning frame-
work for traffic prediction. In: AAAI
Zantalis F, Koulouras GE, Karabetsos S, Kandris D (2019) A review of machine learning and iot in smart
transportation. Future Int 11:94
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: ECCV
Zhang M, Zhang F, Lane ND, Shu Y, Zeng X, Fang B, Yan S, Xu H (2020) Deep Learning in the Era of
Edge Computing: Challenges and Opportunities. John Wiley and Sons Ltd, chap 3:67–78
Zhang X, Zhou X, Lin M, Sun J (2018) Shufflenet: An extremely efficient convolutional neural network
for mobile devices. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition p
6848–6856
Zheng Z, Chen Q, Fan C, Guan N, Vishwanath A, Wang D, Liu F (2019) An edge based data-driven chiller
sequencing framework for hvac electricity consumption reduction in commercial buildings. IEEE
Transactions on Sustainable Computing
Zhu M, yuan Ge D (2020) Image quality assessment based on deep learning with fpga implementation.
Signal Process Image Commun 83:115780
Zitar RA, Nachouki M, Hussain H, Alzboun F (2020) Recurrent neural networks for signature generation.
In: 2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and
Informatics (CISP-BMEI) p 1093–1097
Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transferable architectures for scalable image rec-
ognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition p 8697–8710
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.