
SN Computer Science (2020) 1:133

https://doi.org/10.1007/s42979-020-00128-9

REVIEW ARTICLE

FPGA Implementations of SVM Classifiers: A Review


Shereen Afifi1 · Hamid GholamHosseini1 · Roopak Sinha2

* Corresponding author: Shereen Afifi, safifi@aut.ac.nz; Hamid GholamHosseini, hgholamh@aut.ac.nz; Roopak Sinha, rsinha@aut.ac.nz
1 Department of Electrical and Electronic Engineering, Auckland University of Technology, Auckland 1010, New Zealand
2 Department of IT and Software Engineering, Auckland University of Technology, Auckland 1010, New Zealand

Received: 28 April 2019 / Accepted: 30 March 2020 / Published online: 23 April 2020
© Springer Nature Singapore Pte Ltd 2020

Abstract
Support vector machine (SVM) is a robust machine learning model with high classification accuracy. SVM is widely utilized
for online classification in various real-time embedded applications. However, implementing SVM classification algorithm
for an embedded system is challenging due to intensive and complicated computations required. Several works attempted
to optimize performance and cost by implementing SVM in hardware, especially on field-programmable gate array (FPGA)
as it is a promising platform for meeting challenging embedded systems constraints. This article presents a comprehensive
survey of hardware architectures used for implementing SVM on FPGA over the period 2010–2019. We performed a criti-
cal analysis and comparison of existing works with in-depth discussions around limitations, challenges, and research gaps.
We concluded that the primary research gap is overcoming the challenging trade-off between meeting critical embedded
systems constraints and achieving efficient and precise classification. Finally, some future research directions are proposed,
aiming to address such research gaps.

Keywords  SVM · FPGA implementation · FPGA hardware architecture · Embedded system · System on chip

Introduction

Support vector machine (SVM) is a robust machine learning model that is widely used in different classification problems. SVM is recognized for its high classification accuracy in several applications such as face recognition, image classification, object detection, bioinformatics, and cancer classification [1]. An SVM model is trained using a training dataset of data samples. Then, this trained model is utilized for classifying/predicting any new data using defined support vectors obtained from the training process. SVM has found use in numerous real-world applications/problems and has shown very high classification accuracy that outperforms other known classifiers [2–5].

Several existing software implementations of the SVM algorithm provide high classification accuracy. However, these implementations cannot be used in embedded systems and applications because of the intensive computations required by the SVM algorithm. Also, embedded systems development requires meeting challenging constraints such as real-time operation, low cost, limited resources, and low power consumption. Accordingly, special-purpose hardware or reconfigurable computing is required to achieve the necessary high-performance computing with low cost and power consumption [6].

Field-programmable gate arrays (FPGAs) are massively parallel reconfigurable processing devices. FPGAs have been widely used for achieving the required performance of embedded systems, while effectively utilizing hardware resources and achieving low power consumption. For various applications, FPGAs have demonstrated significant performance and acceleration with efficient hardware implementation results, outperforming other comparable platforms such as general-purpose processors and graphics processing units (GPUs) [7–12]. Consequently, FPGAs have been widely used for implementing and accelerating SVM in hardware for online and embedded classification.


This paper reviews existing hardware implementations of the SVM classification algorithm on FPGAs. The primary aim of this study is to identify the strengths and limitations of the surveyed works, research gaps, and future directions for investigation. To the best of our knowledge, this is the first comprehensive survey of this topic, and it can drive future hardware implementations of SVM. This study extends our previous work [13], which reviewed papers related to FPGA implementations of both the training and classification phases of SVM for the period 2010 to 2015. However, this study emphasizes the implementation of the classification phase only. In this study, the methodological framework is built around identifying and classifying different FPGA implementations of the SVM classification phase. Accordingly, we propose the following six categories of hardware implementation of SVM classifiers based on well-established hardware architectures:

A. Parallel pipelined architectures
B. Systolic array architectures
C. Dynamic partial reconfiguration-based architectures
D. Multiplier-less architectures
E. Development tool-based architectures
F. Cascaded classification-based architectures

A total of 44 papers published during 2010–2019 are reviewed and discussed. In addition, this study presents an extensive analysis and comprehensive comparison with in-depth discussions around the limitations, challenges, and research gaps of the surveyed work. Moreover, leading research groups in the area are identified, and key future research directions are proposed, which could be considered by hardware designers for future implementations. Specifically, we seek to answer the following research questions:

1. How to find an optimum hardware architecture for the SVM classifier to meet embedded systems constraints?
2. How do existing FPGA-based implementations of SVM compare with respect to efficiency and precision?
3. What is the best trade-off (optimum solution) for achieving higher classification accuracy while meeting embedded systems constraints?
4. To what extent can we achieve satisfactory flexibility and scalability of a hardware design for big data problems?
5. How to integrate an embedded SVM classifier efficiently into a complete detection/classification system for real-world problems/applications?

We find that the primary challenge in this area is the trade-off between meeting vital embedded systems constraints and achieving efficient classification with high accuracy. The rest of this article is organized as follows. In the next section, we present basic background information about SVM. Then, the research methodology used in this study is presented in Sect. 3. Section 4 presents different architectures for implementing SVM on FPGAs and classifies these into the proposed categories. Section 5 provides the critical analysis, comparisons, and discussions, followed by the conclusion and future research directions in Sect. 6.

SVM Background

Support vector machine is a powerful machine learning algorithm, which shows high accuracy in different classification problems [1]. SVM is based on decision boundary theory, which efficiently differentiates between two different classes of data samples. Two main phases exist in such a supervised learning model: the learning/training phase and the classification phase. In the training phase, a trained model is developed using an input training dataset, in which a decision boundary is formed from an optimum separating hyperplane that best separates the data samples of the two classes. Support vectors (SVs) are data samples that lie on the decision boundary; they are defined in the training phase and are then used for the classification task in the classification phase.

In the classification phase, the main classification/decision function (1) is used to classify new test data $\vec{x}$ (a dimensional vector), and it depends on the number of SVs $n$ ($\alpha$, $y$, and $b$ are parameters specified in the training phase):

$$f(\vec{x}) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i \, y_i \, K(\vec{x}_i, \vec{x}) + b\right) \quad (1)$$

The function consists of a complicated calculation between each SV, denoted $\vec{x}_i$, and the test vector $\vec{x}$, which is performed by one of the kernel tricks $K$. The commonly used kernel functions are as follows [14, 15]:

• Linear: $K(\vec{x}_i, \vec{x}) = \vec{x}_i \cdot \vec{x}$
• Polynomial: $K(\vec{x}_i, \vec{x}) = (\vec{x}_i \cdot \vec{x})^n$
• Sigmoid: $K(\vec{x}_i, \vec{x}) = \tanh(\vec{x}_i \cdot \vec{x} + \theta)$
• Gaussian (radial basis function, RBF): $K(\vec{x}_i, \vec{x}) = e^{-\|\vec{x}_i - \vec{x}\|^2 / 2\sigma^2}$
• Hardware-friendly: $K(\vec{x}_i, \vec{x}) = 2^{-\gamma \|\vec{x}_i - \vec{x}\|_1}$
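As a purely illustrative software sketch of decision function (1) — the names, data layout, and floating-point arithmetic below are assumptions for readability and are not taken from any of the surveyed designs, which typically use fixed-point datapaths and dedicated kernel circuits — the evaluation can be written as:

```cpp
#include <cmath>
#include <vector>

// Hypothetical software model of decision function (1) with a linear or RBF kernel.
struct SvmModel {
    std::vector<std::vector<double>> sv;  // support vectors x_i (n x d)
    std::vector<double> alpha_y;          // precomputed alpha_i * y_i
    double bias = 0.0;                    // b
    double gamma = 1.0;                   // RBF parameter (1 / (2*sigma^2))
    bool use_rbf = false;                 // false: linear kernel
};

static double kernel(const SvmModel& m, const std::vector<double>& s,
                     const std::vector<double>& x) {
    double dot = 0.0, dist2 = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) {
        dot += s[j] * x[j];
        const double diff = s[j] - x[j];
        dist2 += diff * diff;
    }
    return m.use_rbf ? std::exp(-m.gamma * dist2) : dot;  // RBF or linear kernel
}

// sign( sum_i alpha_i * y_i * K(x_i, x) + b )
int classify(const SvmModel& m, const std::vector<double>& x) {
    double sum = m.bias;
    for (std::size_t i = 0; i < m.sv.size(); ++i)
        sum += m.alpha_y[i] * kernel(m, m.sv[i], x);
    return sum >= 0.0 ? +1 : -1;
}
```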
SVM is originally a binary classifier, while multiclass classification is based on combining multiple binary SVMs. Different methods are used to realize multiclass SVM classification, such as "one-against-one" and "one-against-all" [16].

Research Methodology

This review began with a comprehensive search involving thousands of publications within the initial scope of SVM implementation on FPGA. Five scientific databases were


considered for the searching process: IEEE Xplore, Scopus, Google Scholar, ACM Digital Library, and ScienceDirect. The keywords used for the search task were support vector machine, SVM classifier, SVM classification, FPGA, embedded system, and hardware implementation. The resulting collection of papers was narrowed down to include only conference and journal publications from 2010 onward. This cutoff ensures a current view of the technology, given the significant advances in FPGA technology since 2010.

Additional refinement was applied manually based on a screening of the remaining articles, so that only works focusing on the FPGA implementation of standalone online SVM classification were considered. Hardware implementation studies for the training phase have been explicitly excluded from this study. We also excluded works that implement large systems for specific applications and integrate an SVM classifier as only a small part of a whole system, or that do not provide a separate design and detailed implementation results for the SVM. We also focused on image processing and classification applications. Finally, a total of 44 papers were studied, reviewed, and analyzed in this study.

Main FPGA Architectures for SVM Classifiers

Numerous techniques implement the SVM classification stage in hardware on FPGAs. In order to implement the classification stage in hardware, the SVM is first trained offline in software, and the trained model is then extracted for the FPGA implementation of the classifier. MATLAB was the most commonly used platform for the SVM training stage. The 44 works reviewed and discussed in this section present different hardware architectures and implementation techniques. We have classified the implemented architectures into six main categories; some works apply multiple architectures and are therefore assigned to multiple categories.

Parallel Pipelined Architectures

The pipelining technique breaks a process into smaller stages, which allows for the concurrent execution of jobs. Pipelining can significantly speed up the whole process and increase data throughput. Many implementations use the pipelining technique by exploiting the parallel processing capabilities of FPGAs, aiming to achieve efficient parallel pipelined architectures.
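To make the idea concrete, the fragment below shows how the accumulation over support vectors is typically expressed in a high-level synthesis C/C++ description so the tool can pipeline the multiply-accumulate loop; it is a generic illustration with assumed sizes and names (N_SV, DIM, svm_decision), not the code of any specific work surveyed here.

```cpp
// Illustrative HLS-style C++ (e.g., for Vivado HLS); sizes and names are assumptions.
#define N_SV 256  // number of support vectors
#define DIM  32   // feature dimension

// Linear-kernel decision value computed with a pipelined MAC loop.
float svm_decision(const float sv[N_SV][DIM], const float alpha_y[N_SV],
                   const float x[DIM], float bias) {
    float sum = bias;
    for (int i = 0; i < N_SV; ++i) {
        float dot = 0.0f;
        for (int j = 0; j < DIM; ++j) {
#pragma HLS PIPELINE   // let the tool overlap successive multiply-accumulate iterations
            dot += sv[i][j] * x[j];
        }
        sum += alpha_y[i] * dot;
    }
    return sum;   // classify with sign(sum)
}
```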
A fully pipelined architecture is proposed for implementing and accelerating SVM classification, presenting three different configurations to implement the RBF, polynomial, and sigmoid kernel functions [17]. A new processor is designed in a fully pipelined architecture by effectively exploiting the embedded DSP slices and block RAMs of the FPGA. The proposed architecture uses 768 DSPs and 800 BRAMs for implementing an SVM classifier with 760 SVs. A high throughput of 2.89 × 10⁶ classifications per second is achieved at 370.096 MHz for classifying 128 feature dimensions.

Another fully pipelined structure is proposed in [11, 18], emphasizing a comparison between FPGA and GPU implementations. Regarding the FPGA implementation, the complicated hardware modules are designed in a fully pipelined architecture using a traditional hardware description language (HDL), whereas other basic modules, finite-state machines (FSMs), FIFOs, and interfaces are implemented using a high-level language (HLL), Impulse C. For a small number of image pixels, the FPGA implementation is faster than the GPU and CPU implementations [11]. However, for a larger number of pixels, the GPU implementation is the fastest, but it dissipates very high power, which makes it infeasible for embedded applications.

A pipelined architecture is proposed in [19] to realize a multipurpose SVM with high flexibility in terms of input data and kernel type. A pipelined design is presented for a complete SVM core that allows dynamic selection of the linear, polynomial, or RBF kernel at run-time. The dot-product operation is computed with parallel embedded DSP-based MAC units and a LUT-based adder tree. The Xilinx Coordinate Rotation Digital Computer (CORDIC) [20] IP core is used to implement the exponential function. Also, two different number formats are used in the core: a fixed-point format for the dot-product calculation, and a single-precision floating-point format for the other computations. This flexible SVM core is simulated and verified at 50 MHz for the RBF kernel. (A maximum frequency of 92 MHz might be achieved.)

An FPGA pipelined design is introduced in [21], where a simplified algorithm based on posterior probability is implemented. The proposed pipelined structure exploits a LUT method for computing the sigmoid function, while for other computations adders, multipliers, and dividers are utilized. A mean absolute error on the order of 10⁻⁴ is achieved by the FPGA implementation of the proposed simplified algorithm (fixed-point) compared to the original C algorithm (floating-point), showing little loss in the recognition rate. In addition, the computational complexity is reduced and a 0.7 ms time delay is achieved, which is promising for meeting the real-time constraints of the targeted application.

A hardware architecture exploiting the inherent FPGA parallelism and pipelining features is proposed in [22]. A block diagram is drawn for the proposed hardware architecture using the standard single-precision floating-point format. This architecture is based on using counters for address generation to access the different data stored in BRAMs for


performing the required calculations. A maximum clock frequency of 200 MHz is achieved with promising hardware results based on synthesis, while 97.87% accuracy is obtained based on simulation results.

A pipelined architecture is presented in [23], aiming to reach a universal coarse-grained reconfigurable architecture that implements one of three types of machine learning tools, including SVM. The proposed pipelined structure is designed as a 1D or 2D array of simple reconfigurable blocks in order to implement one of the three classifiers. For the SVM implementation, the classification process is divided into partial sums controlled by an FSM model based on reconfigurable blocks utilizing adders and multipliers (and a subtractor in the case of a radial kernel). The proposed implementation gained an acceleration of 1–2 orders of magnitude compared to a software implementation (R project-based), while achieving acceptable utilization of hardware resources. (Additional classification speedup experiments/comparisons are provided in [24].)

A parallel architecture with a two-stage pipeline is presented in [25]. It uses resource sharing to realize a unified circuit for both linear and nonlinear SVM classification. The proposed architecture includes shared adders and multipliers for the inner-product computations required by linear and nonlinear SVM. The table-driven algorithm proposed in [26] is exploited for computing the RBF kernel, aiming to speed up the exponential function calculation (fixed-point) and increase accuracy. The synthesized circuit utilizes 661,261 gates at a maximum frequency of 152 MHz and achieves a speed of 33.8 fps.

Another two-pipelined-stage architecture implements a three-class SVM identification system [27, 28]. The first pipelined stage computes the inner product, while the second stage calculates the summation. The proposed architecture is implemented using a fixed-point number format and is evaluated for different bit lengths, demonstrating the trade-off between identification accuracy and hardware area. A customized architecture is also presented that is based on their previous implementation in [29], which combines two-class classifiers. As a result, an 18% increase in processing speed for the inner-product computations is achieved from the overlap of input data. Finally, an improved two-pipelined-stage architecture is introduced to avoid duplication in the inner-product process, where common inner products are calculated in the first stage (similar to the architecture described above [25]). The implementation results show a system throughput of > 21.2 fps at 100 MHz [27], with sufficient identification accuracy (> 90%) and good hardware size.

A pipelined adder is proposed to execute the accumulation operation, replacing traditional adders in the main processing element and targeting acceleration of the SVM classification process [30]. The proposed pipelined adder-based processing element outperforms RCA and KS adders by factors of 1.44 and 1.21, respectively, while utilizing few extra resources. This implementation is realized on an ASIC; however, 3.5× the GMACs of other FPGA implementations are achieved.

A pipelined digital architecture is proposed in [31], where two SVMs work in parallel for three-class classification using fixed-point arithmetic. The proposed implementation is based on using subtraction, multiplication, and addition operations with comparison blocks. Pipelining with the time-multiplexing technique is exploited to decrease hardware resource utilization and processing clock cycles. The implemented system shows high accuracy rates (> 81%) with low resource utilization, while processing a hyperspectral cube in 25 ms at 100 MHz with a low dynamic power consumption of 67 mW.

A parallel hardware architecture is presented in [32], which uses feature extraction algorithms with SVM for real-time image classification. The implemented SVM uses multipliers and accumulators for parallel kernel processing using fixed-point numbers, while a Xilinx CORDIC IP is utilized for performing the exponential function. The proposed hardware system achieves 75% and 85% classification accuracy for the two challenging datasets used, with a 3% loss in accuracy compared to the software implementation. Also, a speedup of 5.7× is achieved compared to the software implementation, while moderate hardware resources are utilized with a 0.25 ms processing time per image.

Table 1 summarizes the architectures presented in category A.

Systolic Array Architectures

The systolic array architecture is a configurable modular platform that combines both parallelism and pipelining techniques to enhance computing speed. The systolic architecture is designed as an array of simple processors or processing elements (PEs), which provides efficient data flow and memory management. A 2D systolic array architecture is depicted in Fig. 1 [33]; it is a promising platform for achieving scalable designs and decreasing hardware complexity. FPGAs are widely used for implementing the systolic array architecture for numerous applications, especially matrix multiplication-based applications. Several SVM implementations exist in the literature that exploit the FPGA-based systolic array architecture to gain accelerated parallel processing.

Fig. 1  Systolic array architecture
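As a purely illustrative software model of the dataflow (the sizes and names are assumptions, not drawn from the surveyed designs), the sketch below mimics a 1D systolic chain of PEs computing one kernel dot-product: each PE holds a stationary support-vector element, and the partial sum travels from PE to PE each cycle.

```cpp
#include <array>
#include <cstdio>

// Illustrative 1D systolic chain for one dot-product; DIM PEs, one per feature.
constexpr int DIM = 8;                 // assumed feature dimension

struct PE {
    float weight = 0.0f;               // stationary support-vector element
};

float systolic_dot(const std::array<float, DIM>& sv,
                   const std::array<float, DIM>& x) {
    std::array<PE, DIM> pe{};
    for (int j = 0; j < DIM; ++j) pe[j].weight = sv[j];

    float partial = 0.0f;              // register travelling along the chain
    for (int j = 0; j < DIM; ++j)      // j plays the role of the clock cycle
        partial = partial + pe[j].weight * x[j];
    return partial;                    // dot(sv, x) after DIM cycles
}

int main() {
    std::array<float, DIM> sv{1, 2, 3, 4, 5, 6, 7, 8}, x{1, 1, 1, 1, 1, 1, 1, 1};
    std::printf("dot = %f\n", systolic_dot(sv, x));   // prints 36
}
```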
A parallel architecture proposed in [34] is based on a systolic array of PEs [14] (an initial implementation work), aiming to reach a scalable, flexible, and adaptive array processing system. The proposed systolic architecture consists


Table 1  Category A: parallel pipelined architectures

SVM | References | Kernel | App | FPGA | Tool
Binary | [17] | RBF, Polynomial, Sigmoid | – | Xilinx Virtex-6 (6VLX240T-FF1156) | Xilinx ISE 14.1
Binary | [11, 18] | Gaussian | Skin classification | Xilinx Virtex-5 (LX220), Xilinx Spartan-6 (XC6SLX75) | –
Binary | [19] | Linear, RBF, Polynomial | Pedestrian detection | Xilinx ML505 (XC5VLX110T) | –
Binary | [23] | RBF, Polynomial | 18 UCI datasets | Xilinx Virtex-7 | Vivado 2014.2
Binary | [25] | RBF | – | – | –
Binary | [30] | Linear | MNIST dataset | ASIC | Synopsys
Multiclass | [21] | Linear, Sigmoid | Language recognition | Xilinx Virtex-5 | ISE 10.1
Multiclass | [22] | Linear | Facial expression classification | Xilinx ML510 (Virtex-5 FXT) | ModelSim 10.1C, Xilinx ISE 14.2
Multiclass | [27, 28] | Linear | Colorectal cancer detection | Altera Stratix-IV (EP4SE360F35C2) | –
Multiclass | [31] | Linear | Bruised apple detection | Xilinx Zynq Z-7010 | –
Multiclass | [32] | RBF | Image classification (Caltech-256, Belgium Traffic Sign datasets) | Xilinx ML509 (Virtex-5 LX110T) | Xilinx ISE

of three main sections: memory control, vector processing, and scalar processing. The implemented array architecture is evaluated for three types of object detection applications. High performance of 40, 46, and 122 fps is achieved for the three tested applications. Interestingly, compared to the software implementation results, no loss in detection accuracy is incurred by the hardware implementation of the three applications (76, 77, 78%).

In addition, different implementations of a systolic array architecture of PEs, responsible for the complicated matrix multiplication required for the kernel computation, are presented in [35–38]. Besides gaining advantages from using the systolic architecture, these implementations use an additional hardware technique, the partial reconfiguration technology (presented in the following category).

Table 2 summarizes the architectures presented in category B.

Table 2  Category B: systolic array architectures

SVM | References | Kernel | App | FPGA | Tool
Binary | [34] | RBF, Polynomial | Object detection | Xilinx ML505 (Virtex 5-LX110T) | –

DPR-Based Architectures

Dynamic partial reconfiguration (DPR) technology allows dynamically selected areas on the FPGA to be reconfigured on-the-fly while other parts are still working; this is also called run-time reconfiguration (RTR) [39]. Not all FPGAs support the partial reconfiguration technology. DPR offers design flexibility, design space expansion, and power and area savings with speedups. However, for performance-critical cases, the reconfiguration speed, which governs switching between different cores/modules, should be taken into consideration as a significant factor. Two styles of PR exist: module-based and difference-based. Module-based PR is widely used for various designs and applications, where reconfigurable modules (RMs) are reconfigured at run-time as depicted in Fig. 2. Difference-based PR is efficient for small designs and not recommended for big design changes; a partial reconfiguration bitstream is provided to program only the difference between the two modules/designs instead of the full configuration bitstream of each module.

DPR is used to implement an SVM classifier in [35], in addition to using the systolic array architecture (category B)


that is designed in four blocks. The kernel calculation is implemented in three sub-blocks designed in two pipelined stages (category A). Compared to an equivalent GPP implementation, an acceleration of up to 85× is achieved. Additionally, by applying DPR for changing the SVM parameters, a speedup of 8× is realized compared to full-chip reconfiguration.

Fig. 2  DPR architecture

This work is extended later in [36], where two systolic array-based architectures are proposed to handle various dataset sizes (the number of SVs higher than the dimension, and vice versa). The two architectures gain speedups of ~61× and ~49×, respectively, compared to equivalent GPP implementations. Furthermore, two DPR implementations are presented: a single-core (same as in [35]) and a multicore SVM architecture (quad-core), where different copies of SVM cores with various parameters are swapped. The same speedup of 8× is achieved by using DPR, as mentioned above [35].

Later, Hussain et al. [37] extended their work by presenting a DPR implementation of an adaptive multiclassifier architecture that dynamically interchanges different copies of SVM and K-nearest neighbor (KNN) classifiers with various parameters (based on their previous SVM and KNN implementations [35, 36, 40, 41]). The proposed architecture allows users to configure a specific region on the FPGA to work as either an SVM or a KNN classifier, which could be extended in the future to perform as an ensemble classifier. The multiclassifier DPR implementation shows a speedup of 8× in reconfiguration time compared to the single-classifier implementation (the same value as for the previous implementations [35, 36]) and occupies half the FPGA area compared to placing both classifiers on the same chip.

Besides using the systolic array architecture (category B), the difference-based partial reconfiguration technique is exploited in [38], targeting low area and power consumption. A power-aware SVM classifier prototype is realized by using PR, achieving a power drop of 3 to 5% (total power = 2.021 W).

Table 3 summarizes the architectures presented in category C.

Multiplier-less Architectures

The multiplier-less approach aims to reduce hardware complexity by avoiding the use of computationally intensive multipliers. Different architectures based on the multiplier-less technique are implemented in the literature. Many such implementations adopt the hardware-friendly kernel function proposed in [15], which is based on a simple multiplier-free design and offers acceptable classification performance compared to the traditional Gaussian kernel.
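As a rough software illustration of why this kernel is attractive in hardware — the fixed-point format and names below are assumptions, not taken from [15] or the surveyed designs — the value 2^(−γ‖x_i − x‖₁) can be evaluated with an absolute-difference accumulation followed by a simple shift for the integer part of the exponent:

```cpp
#include <cstdint>

// Illustrative fixed-point sketch of the hardware-friendly kernel
// K = 2^(-gamma * ||xi - x||_1), assuming Q16.16 data and gamma = 2^gamma_shift.
constexpr int FRAC_BITS = 16;
constexpr int32_t ONE_Q16 = 1 << FRAC_BITS;

int32_t hw_friendly_kernel(const int32_t* xi, const int32_t* x, int dim,
                           int gamma_shift) {
    // 1-norm via a sum of absolute differences (SAD) -- no multipliers needed.
    int64_t l1 = 0;
    for (int j = 0; j < dim; ++j) {
        int64_t d = static_cast<int64_t>(xi[j]) - x[j];
        l1 += (d < 0) ? -d : d;
    }
    // Exponent e = gamma * l1 reduces to a shift when gamma is a power of two.
    int64_t e_int = (l1 << gamma_shift) >> FRAC_BITS;   // integer part of e
    // 2^(-e) approximated by shifting 1.0 right by the integer exponent;
    // real designs refine the fractional part with a small LUT or CORDIC-like step.
    return (e_int >= 31) ? 0 : (ONE_Q16 >> e_int);
}
```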
An embedded hardware SVM implementation is proposed in [42], which exploits the proposed hardware-friendly kernel. A simple six-block design is proposed that consists of sum of absolute differences (SAD) [43] and shift operations instead of the resource-consuming multiplications. As a result, a low resource utilization of 167 slices is demonstrated.

Another hardware-friendly kernel-based architecture is presented in [44]. A SAD-based tree structure is used for the 1-norm computation between vectors, targeting clock cycle reduction and processing speed improvement. A preliminary simulation study is presented on the bit precision of the fixed-point arithmetic with respect to the classification accuracy level.

An additional implementation that adopts the hardware-friendly kernel is introduced in [45] for a simple hardware design. The proposed architecture utilizes the CORDIC iterative algorithm [20], which exploits shift and add operations that replace complex multipliers. The implemented SVM classification system utilizes an external memory for storing support vectors, with a 2 ms speed limitation. From the hardware simulation, the implemented system demonstrates a 4% error rate, while consuming 75% of the device logic. These results suggest that there is still room for further research and improvement.

An FPGA implementation of a fast SVM is presented in [46], following the CORDIC-like algorithm implementation of the proposed hardware-friendly kernel [15]. The proposed architecture is designed in three sub-circuits for the kernel calculations. An iterative algorithm is proposed for simplifying the CORDIC method by using adders and shifters only. The implemented system is faster than their previous CORDIC circuit implementation in [47] by a factor of 6, while very few hardware resources are utilized.

An FPGA core generator tool for embedded SVM is proposed in [48], which automatically generates an optimized hardware description for a digital implementation, meeting user requirements and the target device constraints. The proposed hardware-friendly kernel [15] is also adopted, in addition to the proposal in [49], in which the bottleneck 1-norm (Manhattan norm) calculation is implemented by exploiting the parallel tree structure of the common SAD modules [43]. For the kernel implementation, three architectures


Table 3  Category C: DPR-based architectures

SVM | References | Kernel | App | FPGA | Tool
Binary | [35–37] | Linear | Classifying microarray/biomedical data | Xilinx ML403 (XC4VSX12) | Xilinx ISE, PlanAhead, ChipScope 12.2
Multiclass | [38] | Polynomial | Facial expression recognition | Xilinx Virtex-6 (6vlx240tff1156-2) | Xilinx ISE, Power analyzer EDA

are proposed. The first architecture uses multiple cascaded pipelined stages of multiply-and-accumulate (MAC) blocks (category A) in order to implement the polynomial approximation method. The second architecture employs the CORDIC-like iterative algorithm with a multiplier-free implementation as in [15]. The third architecture uses look-up tables (LUTs) or memory blocks for storing kernel data. The tool is tested for the trade-off between latency, hardware resources, and maximum clock frequency when implementing the three architectures on low-cost, intermediate-class, and high-end FPGAs. The second (CORDIC-like) and third (LUT-based) architectures demonstrate higher classification figures.

A hardware architecture is presented in [50], which employs a multiplier-less kernel that is similar to the previous hardware-friendly kernel [15] and behaves like the common radial kernel. The proposed architecture utilizes simple shifters for implementing the multiplications, while the CORDIC algorithm is used for implementing the complicated exponential function. The proposed kernel achieves a classification performance comparable to the original radial kernel. (No FPGA implementation is presented.)

A multiplier-less kernel implementation for boosting SVM is presented in [51], which is based on the parallel pipelined systolic array architecture (category B). Similarly, simple shift and add operations implement a simplified multiplier-less kernel to decrease hardware complexity and power consumption. In addition, a different approach is introduced by applying the CSD (canonic signed digit) and CSE (common sub-expression elimination) representation methods to reduce the number of required adders, which decreases hardware complexity [52]. A comparative study is presented on the resource utilization of three different implemented classifiers (binary linear, binary nonlinear, and multiclass), which employ the proposed CSD-based multiplier-less kernel against the normal vector-product kernel [51]. Moreover, the three implemented classifiers achieve power reductions of 1%, 2.7%, and 3.5% compared to using the traditional vector-product kernel.

Table 4 summarizes the architectures presented in category D.

Development Tool-based Architectures

Most existing FPGA implementations in the literature are designed using the traditional HDL approach. However, very few implementations exploit the latest software design tools on the market, which are mostly very powerful and simplify the complicated hardware design process as well as decreasing the development time. In this section, the software tools used for SVM implementations on FPGA are introduced, as well as the proposed designs.

Xilinx System Generator is a common tool that allows high-performance system modeling and automatic code generation from MATLAB/Simulink [53]. The System Generator is used in [54] to design and implement a simple hardware architecture. A parallel hardware architecture is designed using a combination of serial and parallel designs of simple design blocks and functions of the System Generator. Additionally, the available CORDIC block of the tool is instantiated for implementing the complicated exponential function. Fixed-point numbers are used for quantization. The simulation results report a maximum frequency of 202.840 MHz for linear classification and a 1.33% error rate for nonlinear classification (98.67% accuracy). Also, compared to the MATLAB implementation, less computation time is achieved for both implemented classifiers.

In addition, the Xilinx System Generator is used in [55], where a DSP slice-based processor is designed in parallel (category A). The proposed hardware design is divided into three main blocks: kernel implementation, inner-product accumulation, and threshold comparison. The design employs Xilinx DSP slices, block RAMs, multipliers, and exponential blocks. The simulation results demonstrate the same accuracy of 93.6% as the software implementation with a lower processing time of 0.02 ms and reasonable hardware resource utilization.

The System Generator is also used in [56], where a mix of parallel and serial techniques is used to design a hardware architecture of a linear SVM for three-class facial expression classification (category A). An architecture is designed for class 1 using System Generator blocks and is replicated for the other two classes to be integrated with an output logic,


Table 4  Category D: multiplier-less architectures

SVM | References | Kernel | App | FPGA | Tool
Binary | [42] | Hardware-friendly | Satellite onboard / NASA database | Xilinx Virtex-5, Spartan-3E | ModelSim
Binary | [44] | Hardware-friendly | UCI standard dataset, breast cancer | Altera Cyclone III | ModelSim
Binary | [46] | Hardware-friendly | – | – | –
Binary | [48] | Hardware-friendly | Automotive app/pedestrian detection, Daimler-Chrysler dataset | Xilinx Spartan-IIE, Spartan-3, Virtex-II Pro, Virtex-4 | –
Binary | [50] | Digital kernel | 4 UCI datasets | – | –
Multiclass | [45] | Hardware-friendly | Image recognition, COIL dataset | Cyclone II (EP2C20) | –
Binary and multiclass | [51] | Linear, polynomial | Fisher's iris dataset | Xilinx Virtex-7 (7vx485tffg1157-2) | Xilinx XPE 14.1

forming the full architecture for three-class classification on the FPGA. Synthesis results show low resource utilization with a high accuracy of 97.87% at a frequency of 132.7 MHz.

The Synopsys Signal Processing Workbench (SPW) software is used in [57] to design a digital hardware implementation using the available building blocks. A fixed-point number representation is used, and decision logic is utilized to implement a priority scheme required for multiclass classification, targeting accuracy enhancement. The SPW-generated VHDL code is analyzed using the Xilinx Simulator. Compared to the software implementation, the hardware implementation achieves approximately a 2.53× speedup, while a 16.2% accuracy loss is realized.

The Synopsys software is also used in [30] (category A) to realize an ASIC implementation, as discussed earlier.

The high-level synthesis (HLS) design methodology has recently been used to decrease FPGA development time and effort by using HLLs instead of HDLs. The Xilinx Vivado HLS tool [58] is exploited to implement a low-cost SVM IP [59–64] on a recent FPGA platform, the Zynq SoC, with a hybrid architecture combining a processor with the FPGA in a single device [65]. An initial hardware/software co-design is implemented in [59] as an SVM accelerator to run the complicated processing task on the FPGA. It is considered the first hardware/software system for SVM implemented on an FPGA/single device in the current period. The proposed HLS design is successfully extended in [60, 61, 64] to implement an IP running the full SVM algorithm on the FPGA part of the Zynq SoC. By using the available directives of the HLS tool, some hardware optimization techniques are simply applied to the proposed design, such as pipelining (category A) and loop unrolling, which accelerate the original design/code by 37×. The experimental results demonstrate high performance, low area, and low power consumption, meeting embedded systems constraints while preserving the classification accuracy rate without any loss. In addition, the embedded SVM classification system implemented on the recent hybrid Zynq SoC using the modern UltraFast HLS design methodology is a low-power system compared to other existing implementations in the literature (the latest publication [64] reported the lowest total power consumption of 1.5 watts).
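For readers unfamiliar with such directives, the fragment below illustrates how loop unrolling and array partitioning are typically expressed in Vivado HLS so that several multiply-accumulate operations are scheduled per clock cycle; the bounds and names are assumptions and this is not the code of [59–64].

```cpp
// Illustrative Vivado HLS fragment; DIM and the function name are assumptions.
#define DIM 27

float dot_unrolled(const float a[DIM], const float b[DIM]) {
// Split the arrays so four elements of each can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4 dim=1
    float acc = 0.0f;
    for (int j = 0; j < DIM; ++j) {
#pragma HLS UNROLL factor=4   // replicate the multiply-add datapath four times
        acc += a[j] * b[j];
    }
    return acc;
}
```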
Another similar design has recently been proposed in [62], which is based on using BRAM interfaces for passing the required data instead of streaming data through the stream interface/bus with the DMA IP as in [60]. Some extra resources are utilized compared to [60] and the power is slightly increased (2.7 W); however, both area and power show lower values compared to some related implementations in the literature. Moreover, a high acceleration factor of 32× is achieved compared to a similar software implementation, while a 97.9% accuracy is achieved with a processing time of 11.46 µs at 250 MHz.

Another HLS-based implementation with a hardware/software co-design is presented in [66], focusing on design space exploration. A systematic two-level methodology and prototype framework is proposed for realizing an efficient HLS-based SVM IP. The proposed method first optimizes the original code structure, exploiting data- and instruction-level parallelism (category A). Then, a second-level optimization applies a derived set of design space pruning guidelines for refining the applied HLS directives. The proposed methodology is evaluated based on extensive analysis and validation results with a case study of an ECG-based arrhythmia detection system. Experimental results show execution latency gains of up to 98.78% compared to the default Vivado HLS optimization of the original SVM code. A hardware/software co-design including the proposed IP is implemented on the Zynq SoC for the detection system, achieving speedups of up to 78× at 25 MHz over an equivalent software implementation.

Also, HLS techniques are used for examining the design space to optimize a proposed approximate SVM FPGA accelerator [67]. Two algorithmic approximation


techniques, precision scaling and loop perforation, are applied to implement an approximate HLS-based SVM classifier, aiming to decrease the computational latency. For the design space exploration, several HLS optimization techniques/directives have been examined: loop pipelining and unrolling with array partition and reshape (category A). The proposed approximate HLS-SVM classifier achieves an acceleration of 15× with a detection accuracy of 96.7% for detecting arrhythmia from ECG signals.

The HLS tool is exploited to efficiently implement a proposed pipeline design for an energy-efficient embedded binarized SVM architecture with binarized inputs and weights [68] (category A). The proposed binarized SVM replaces the kernel dot-product floating-point multiplication with bitwise XNOR (category D), while using Hamming weights to compute the binarized vectors, improving speed and power consumption. The proposed embedded binarized SVM is evaluated using the MNIST and CIFAR-10 datasets, showing lower power consumption and execution time compared to CPU and GPU implementations with a slight loss of accuracy.
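To illustrate the binarized arithmetic (a generic sketch with an assumed 64-bit packing and a GCC/Clang builtin for the population count, not the exact datapath of [68]), a dot product between ±1 vectors packed into machine words reduces to XNOR plus a popcount:

```cpp
#include <cstdint>

// Illustrative XNOR/popcount dot product for binarized (+1/-1) vectors packed
// 64 per word; n_bits = n_words * 64 is assumed (no padding bits).
int binarized_dot(const uint64_t* a, const uint64_t* b, int n_words, int n_bits) {
    int matches = 0;
    for (int w = 0; w < n_words; ++w) {
        uint64_t agree = ~(a[w] ^ b[w]);          // XNOR: 1 where the bits agree
        matches += __builtin_popcountll(agree);   // Hamming weight of the word
    }
    // Each agreement contributes +1 and each disagreement -1 to the dot product.
    return 2 * matches - n_bits;
}
```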
Table 5 summarizes the architectures presented in category E.

Table 5  Category E: development tool-based architectures

SVM | References | Kernel | App | FPGA | Tool
Binary | [55] | RBF | Ultrasonic flaw detection | Xilinx Zynq-7000 | Xilinx System Generator
Binary | [59–64] | Linear | Melanoma detection | Xilinx Zynq-7 ZC702 (XC7Z020CLG484-1) | Xilinx Vivado 2016.1 Design Suite, Vivado HLS
Binary | [66] | RBF | ECG-based arrhythmia detection | Zedboard Zynq (xc7z020clg484-1) | Xilinx Vivado HLS 2015.2, Xilinx SDK 2014.4, PetaLinux 2014.4
Binary | [67] | RBF | ECG-based arrhythmia detection | Zynq-7 ZC706 | Xilinx Vivado HLS 2018.2
Multiclass | [54] | Linear, Gaussian | Persian handwritten digits dataset | Xilinx Virtex-4 (XC4VSX35) | System Generator
Multiclass | [56] | Linear | Female facial expression classification, JAFFE database | Xilinx Virtex-5 | System Generator
Multiclass | [57] | RBF | Multispeaker phoneme recognition, TIMIT corpus | Xilinx Virtex-II (XC2V3000) | Synopsys SPW, Xilinx Simulator
Multiclass | [68] | Linear | MNIST and CIFAR-10 datasets | Xilinx ML605 board (Virtex-6 LX240T) | Xilinx ISE Design Suite, XPower Analyzer

Cascaded Classification-Based Architectures

Fig. 3  Cascaded classifier scheme [69]

The cascaded classification architecture consists of multiple classification stages (classifiers) designed in a cascading structure, targeting an accelerated classification process. The cascaded scheme is depicted in Fig. 3 [69]: the majority of data are rejected at the early stages, which are based on simple classifiers with low complexity, leaving very little data to be classified in the later stages, which are more complex with higher accuracy. Consequently, a significant speedup could be achieved by using the cascaded classification architecture rather than a single SVM classifier. This motivates implementing the cascaded architecture in hardware/FPGA to realize a real-time embedded classification system with high performance and low cost.
The first FPGA-based cascaded SVM classifier that exploits the custom-arithmetic feature of the device's heterogeneous nature is presented in [70] (following their previous proposal [71] for accelerating the training phase). A parallel structure of multipliers with a pipelined adder tree is proposed for implementing the kernel computations (category A). The proposed data path is divided into two domains: fixed-point and single-precision floating-point. The proposed heterogeneous architecture efficiently exploits the parallel processing power of the device's heterogeneous resources and the dynamic range diversities among the classification problem's features. Additionally, a cascade scheme of two classifiers is implemented that combines a simple, faster low-precision classifier with a complicated high-precision classifier of higher area cost. Compared to the software implementation,


Table 6  Category F: cascaded classification architectures

SVM | References | Kernel | App | FPGA | Tool
Binary | [70, 72] | Gaussian, polynomial, sigmoid | MNIST dataset | Altera Stratix III (EP3SE260) | Altera tools
Binary | [69, 73] | Linear, polynomial | Face detection | Xilinx ML505 (Virtex-5-LX110T) | –
Binary | [74] | Linear, polynomial | Face and pedestrian detection | Xilinx Spartan-6 XC6SLX150T | –
Binary | [75] | Linear, polynomial | Face detection | Xilinx Spartan-6 XC6SLX150T | –
Binary | [62, 63] | Linear | Melanoma detection | Xilinx Zynq-7 ZC702 (XC7Z020CLG484-1) | Xilinx Vivado 2016.1 Design Suite

the implemented fully scalable heterogeneous architecture demonstrates an acceleration of 2–3 orders of magnitude, in addition to a speedup of 7× compared to other previous implementations on FPGAs and GPUs.

As an extension to the previous architecture [70], the FPGA reconfigurability feature is applied in [72] (category C) in order to switch from the low-precision to the high-precision classifier in the cascade. Accordingly, higher performance is achieved, in addition to expanding the potential design space.

Another cascaded architecture is proposed in [69, 73], where a hardware reduction method is proposed to decrease area and power dissipation. This method follows the multiplier-less approach (category D) to implement the early simple cascade stages, where all multiplication operations are replaced with shift operations by rounding the data to the nearest power-of-two values. A hybrid cascaded architecture (a four-stage cascade classifier) is designed to combine parallel and sequential processing (as a series of pipelined PEs (category A)) for the simple early stages (three classifiers) and the higher-complexity SVM (one classifier), respectively. The implemented cascaded architecture demonstrates an average performance of 70 fps, while an acceleration of 5× is achieved compared to an implemented single parallel SVM classifier. By using the proposed hardware reduction method, resource utilization is reduced by 43% and 20% less power is consumed, while experiencing only a 0.7% loss in classification accuracy (84%).

Next, the previous work [69] is extended in [74, 75], where a feature extraction mechanism (local binary pattern descriptors) is utilized within the proposed architecture to apply feature extraction before the final stage, aiming to enhance the detection accuracy. In [75], an additional novel response evaluation method is incorporated into the previously proposed architecture. Another type of classifier (a neural network) is used in the proposed response evaluation process to classify the responses of the early stages in the cascade. Accordingly, the number of tested samples is reduced before reaching the final high-complexity classification stage, resulting in improved classification speed. The proposed hybrid architecture with the added methods demonstrates real-time processing of 40 fps with 80% detection accuracy [75] (similar behavior for the accuracy loss as in [69]). Also, a reduction of 25% in area and 20% in power is achieved. Compared to their previous work [69], lower values are realized for performance and accuracy with higher area and power; however, a big test set of higher-resolution images is evaluated, aiming to reach real-time embedded processing of online video classification.

Recently, a cascade SVM classifier has been introduced in [62] based on a proposed scalable multicore architecture, which is formed from the implemented HLS-based SVM IPs (categories A and E). A two-stage cascade SVM classifier is implemented using a simplified design of the previously proposed SVM IP [62] (as described in the category E section), using the control bus for passing the required data in place of the BRAM interface. As a result of the simplified design, lower resource utilization and power consumption are demonstrated compared to a single SVM IP implementation, while enhancing the classification accuracy and speed (1.8 µs) as well as the diagnosis verification. Next, the hardware implementation results were optimized by using the powerful DPR technology (category C), where a very low resource utilization of 1% of slices and a power consumption of 1.55 watts were achieved, while gaining flexibility, adaptability, scalability, and applicability [63]. The SVM classification systems implemented on the Zynq SoC using the proposed hardware designs have shown the lowest power consumption results among other related implementations, in addition to significantly low hardware resource utilization and processing time, with significant speedups and high classification accuracy rates at low cost.

Table 6 summarizes the architectures presented in category F.

Critical Analysis and Discussion

Preliminary Analysis

Most existing implementations target binary SVM classification (31 works), while only 12 works implement multiclass


classification and only one implements both [51]. This is probably because implementing a multiclass classifier is based on combining binary classifiers, and so numerous works focus on implementing the basic binary classifier, to be extended later if required. Also, various classification or detection applications have been targeted by the existing architectures/implementations, especially different object detection tasks. Some SVM implementations are tested using available or application-specific datasets, while other implementations are general, without validation for a particular application/dataset.

Implementing kernels, with their inherent complicated calculations, is the main challenge and focus when implementing SVM on FPGA. Some researchers use and implement more than one type of kernel in their study. A total of 23 studies implement the simple linear kernel, while others opt for different nonlinear kernels that include the basic dot-product calculation of the linear kernel but with additional complexity (12 existing implementations of the polynomial kernel and 4 of the sigmoid kernel). Moreover, 15 papers implement the Gaussian RBF kernel, which essentially requires the complex exponential function. The proposed simplified hardware-friendly kernel is widely used as an alternative to the RBF for simple FPGA/hardware implementation (5 papers). Another digital kernel similar to the hardware-friendly kernel has been proposed, but no FPGA or hardware implementation exists using that kernel [50].

Most existing implementations are realized on old versions of FPGA technology using traditional design methods, even while using modern development tools. Only four research works utilize the recent Xilinx series-7 devices [23, 31, 51, 55]. One unique line of research is distinguished by exploiting the hybrid architecture of the recent Zynq-7 SoC platform and using the latest UltraFast HLS design methodology [59–64], followed by recent studies of design space exploration based on SVM implementations in [66, 67].

Besides the standard FPGA parallel design, researchers have used various hardware techniques to improve their SVM implementations, which we classify into six categories in this study. The standard parallel pipelined architecture (category A) is widely implemented, appearing in 30 papers. The multiplier-less approach (category D) is commonly applied in 11 implementations, reflecting high interest in multiplier-free architectures with reduced hardware complexity. Also, the parallel pipelined systolic array architecture (category B) is implemented in 6 papers, with considerable interest in reducing multiplication complexity. Only 6 implementations exploit the powerful FPGA-based DPR feature (category C) for enhancing designs and hardware results. Moreover, only 14 existing implementations are developed using software system-design/development tools (category E), replacing the traditional HDL-based design method to decrease FPGA development effort and time. The cascaded SVM classification (category F) is implemented on FPGA by only one research group [69, 70, 72–75] (6 papers), which introduces a heterogeneous architecture and a hybrid architecture implemented using different techniques from categories A, C, and D. Recently, another research group [62, 63] added a cascade SVM implementation to the literature, based on a proposed multicore architecture using different hardware techniques from categories A, C, and E.

In addition, some works have studied the impact of quantization on the classification accuracy rate resulting from applying fixed-point number formatting (bit-width precision) in the hardware implementation [23, 27, 28, 32, 44, 74]. They aim to reduce hardware area while maintaining high accuracy. Moreover, Afifi et al. [60, 61] have studied the trade-off between speed/acceleration and hardware resource utilization/area when applying different hardware architectures (HLS optimization directives), followed by similar analysis studies focusing on design space exploration presented in [66, 67]. As a result of applying different techniques for reducing hardware complexity and gaining acceleration, some implementations reported some loss in the classification accuracy rate [21, 32, 57, 67–69, 73–75]. Furthermore, many works report classification speedup results compared to similar software implementations [23, 32, 35, 36, 54, 55, 57, 61–63, 66–68, 70, 72], and few works present a comparison with related FPGA implementations in the literature [32, 60–63, 70, 72, 74, 75].
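As a minimal illustration of what such a bit-width study involves (the Q-format and names below are assumptions, not those of the cited works), a floating-point model parameter can be quantized to a fixed-point word and the resulting error, and ultimately the classification accuracy, re-measured for several fractional widths:

```cpp
#include <cmath>
#include <cstdint>

// Illustrative fixed-point quantization helpers; the Q-format is an assumption.
// A bit-width study sweeps frac_bits and re-measures classification accuracy.
int32_t to_fixed(double v, int frac_bits) {
    return static_cast<int32_t>(std::lround(v * (1 << frac_bits)));
}

double from_fixed(int32_t q, int frac_bits) {
    return static_cast<double>(q) / (1 << frac_bits);
}

// Quantization error introduced by a given fractional width, e.g. for an SVM
// weight: larger frac_bits -> smaller error but wider multipliers and adders.
double quant_error(double v, int frac_bits) {
    return std::fabs(v - from_fixed(to_fixed(v, frac_bits), frac_bits));
}
```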
Results Comparison and Discussion

Table 7 presents some results provided by selected reviewed papers (17 papers). Samples of results have been carefully extracted from the papers; some parameters are either unclear or too ambiguous to place in the proposed table, or are not provided/applicable. Also, some articles are excluded because few or no hardware implementation results were explicitly provided. This comparison includes different architectures and implementations for various SVM types and kernels implemented for different applications.

From Table 7, it is clear that reasonably low hardware resource utilization is achieved in [36, 38, 60, 62, 64], with [38] being the lowest. Of the limited number of works that report power consumption results, only five implementations [31, 51, 60, 62, 64] achieve low values of around 1.7 watts, which are realized on the latest FPGA generation.

Table 7  Comparison of results

Reference | Slices | LUTs | BRAM | DSP | Power (W) | FPGA | #SVs | Dimension | Accuracy % | Processing time/speed | Frequency (MHz) | Category | SVM
[17] | 58,688 | – | 800 | 768 | – | Virtex-6 | 760 | 128 | – | 0.9 µs | 370.096 | A | Binary RBF
[11] | 59,208 | 122,637 | 2049 | – | 15 | Virtex-5 | 16 | – | – | 0.02 s | 200 | A | Binary Gaussian
[19] | 12,674 | 41,135 | 132 | 64 | – | ML505 | 2048 | 2048 | – | 712.66 µs | 92 | A | Binary linear
[27] | 42,600 | 250,552 | 1001 | 502 | – | Stratix-IV | 474 | 500 | 98 | 47.2 ms, 21.2 fps | 100 | A | Multiclass linear
[31] | 7228 | 5698 | 14.5 | 8 | 1.693 | Zynq Z-7010 | – | – | 96.4 | 25 ms | 100 | A | Multiclass linear
[32] | 9646 | 38,179 | 60 | 52 | – | ML509 | 100 | 500 | 82 | 0.25 ms | 50 | A | Multiclass RBF
[34] | 23,220 | 8887 | 74 | 64 | – | ML505 | 818 | 400 | 76–78 | 40–122 fps | 100 | B | Binary RBF polynomial
[36] | 1810 | 1705 | 21 | 21 | – | ML403 | 1024 | 20 | – | 7.34 µs | 142.9 | ABC | Binary linear
[38] | 461 | 461 | – | 7 | 2.021 | Virtex-6 | 192 | 120 | – | – | – | BC | Multiclass polynomial
[51] | 33,360 | 33,360 | – | 0 | 1.703 | Virtex-7 | 74 | – | – | – | – | BD | Multiclass
[60] | 2676 | 2267 | 12 | 5 | 1.752 | Zynq-7 | 248 | 27 | – | 141.38 µs | 100 | AE | Binary linear
[64] | 1046 | 858 | 1 | 5 | 1.54 | Zynq-7 | 61 | 27 | 97.9 | 1.5 µs | 100 | AE | Binary linear
[54] | 11,589 | 9141 | 99 | 81 | – | Virtex-4 | 18–35 | – | 98.67 | 0.27 ms | 151.286 | E | Multiclass Gaussian
[55] | 21,305 | 14,028 | 106 | 152 | – | Zynq-7000 | 1024 | – | 93.6 | 0.02 ms | – | AE | Binary RBF
[57] | 6373 | 11,943 | – | 64 | – | Virtex-II | – | – | 60.6 | 14.18 ms | 42.012 | E | Multiclass RBF
[69] | 13,038 | 31,854 | 131 | 59 | 3.2 | ML505 | 254 | 400 | 84 | 70 fps | 84 | ADF | Cascaded linear polynomial
[75] | 20,153 | 35,532 | 256 | 59 | 4.9 | Spartan-6 | 122 | 400, 1062 | 80 | 40 fps | 70 | ADF | Cascaded linear polynomial
[62] | 4304 | 3414 | 2 | 10 | 1.74 | Zynq-7 | 200 | 27 | 97.9, 72.5 | 1.8 µs | 250 | AEF | Cascaded linear

time of 0.9 µs is achieved with a high frequency of 370 MHz in [17], while others report different real-time values in the microsecond and millisecond range (the longest being 47.2 ms [27]). Some papers report the processing speed as a frame rate, with real-time image processing achieved from 40 fps up to the highest reported 122 fps [34]. Overall, the presented implementations achieve a variety of acceptable results, and nine papers combine more than one architectural category in their implementations. Nevertheless, further optimization is required to reach the best trade-off between the different parameters discussed in this comparison study.

Limitations and Challenges of Existing Works

This study reveals critical limitations and existing challenges faced by the surveyed works:

• Various simplification methods (such as the multiplierless method) are used to reduce hardware complexity and achieve efficient/optimized hardware implementation results, but these affect classification accuracy.
• Fully parallelized processing is not effectively addressed for SVMs of very large scale/dimensionality, which require an excessive number of processing cycles when implemented on limited hardware resources.
• For SVMs with a very high number of SVs, as needed in large-scale problems/applications, the memory storage and management problem is not effectively studied.
• Many SVM implementations target a specific application, cannot be extended, and are not easily adapted to other applications.
• Some architectures are not simply designed and fall short of the flexibility and scalability required to realize an efficient embedded system.
• Meeting all important embedded systems constraints simultaneously, such as real-time operation, high performance, low cost, and low power, remains difficult.
• Very few implementations report power consumption results, even though achieving a low-power embedded system is very challenging.
• Most existing implementations target older FPGA devices and are designed using traditional methods and tools/technologies.
• The trade-off between classification performance/speed and hardware resource utilization needs more optimized solutions/methods.
• The primary trade-off is between meeting embedded systems constraints and preserving/maintaining the classification accuracy level.

Leading Research Groups

From reviewing the presented papers, we identify three leading research groups that are actively working on implementing SVM on FPGA.

Group 1: One unique research group works exclusively on FPGA-based cascaded SVM classification architectures for object detection [69, 70, 72–75] (category F). (A second group, S. Afifi et al., discussed as Group 3 below, has recently implemented a cascade SVM classifier for melanoma detection [62].) Group 1 first proposed a heterogeneous architecture of low-precision and high-precision classifiers in a cascade (the first cascaded SVM on FPGA), with DPR then applied for switching between them (category C). Next, a hybrid cascaded architecture was proposed that combines parallel and sequential processing modules in a pipelined design (category A). In addition, a hardware reduction method (category D) was proposed, based on replacing expensive multipliers with simple shift operations, achieving reductions in area and power consumption. They later proposed an optimized architecture that adds a response evaluation method (an NN classifier) to improve classification prior to the final stage of the cascade, boosting the overall classification; they also apply a feature extraction phase before the complicated final stage to improve the accuracy rate. Good results are achieved for real-time processing of high-resolution images in terms of fps. However, their architectures for object detection suffer a slight loss in detection accuracy, and their reported power consumption of more than 3 W is considered high for embedded systems deployment. These results again demonstrate the challenging trade-off between efficient classification accuracy and meeting significant embedded systems constraints, for which they manage to present an adequate compromise in their latest publication.
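To make the two ideas above concrete, the following C++ sketch shows a two-stage cascade in which a cheap first stage, whose weights are quantised to signed powers of two so that every multiplication reduces to a shift and an add (the multiplier-less reduction), filters the inputs, and only the samples it accepts are passed to a full-precision RBF stage. The data types, quantisation and thresholds are illustrative assumptions, not values taken from the designs in [62, 69, 70, 72–75].

```cpp
// Illustrative two-stage cascade SVM classifier (a sketch, not the reviewed RTL).
// Stage 1: linear SVM whose weights are quantised to signed powers of two, so
//          each multiply is replaced by a shift and an add (multiplier-less idea).
// Stage 2: full-precision RBF SVM, evaluated only for samples stage 1 accepts.
#include <cmath>
#include <cstdint>
#include <vector>

struct Pow2Weight { int shift; bool negative; };       // w[j] ~ +/- 2^shift

// Shift-add dot product of the stage-1 classifier (integer/fixed-point input).
int32_t stage1_score(const std::vector<Pow2Weight>& w,
                     const std::vector<int32_t>& x, int32_t bias) {
    int32_t acc = bias;
    for (size_t j = 0; j < w.size(); ++j) {
        int32_t term = x[j] << w[j].shift;              // replaces a full multiplier
        acc += w[j].negative ? -term : term;
    }
    return acc;
}

// Full RBF decision value: sum_i alpha_i * exp(-gamma * ||x - sv_i||^2) + b.
float stage2_score(const std::vector<std::vector<float>>& sv,
                   const std::vector<float>& alpha,
                   float gamma, float bias, const std::vector<float>& x) {
    float acc = bias;
    for (size_t i = 0; i < sv.size(); ++i) {
        float d2 = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {
            float d = x[j] - sv[i][j];
            d2 += d * d;
        }
        acc += alpha[i] * std::exp(-gamma * d2);
    }
    return acc;
}

// Cascade control: most negatives are rejected cheaply by stage 1; the costly
// stage 2 runs only for the (few) candidates that pass the first threshold.
bool cascade_classify(const std::vector<Pow2Weight>& w1, int32_t b1,
                      const std::vector<int32_t>& x_fixed,
                      const std::vector<std::vector<float>>& sv2,
                      const std::vector<float>& alpha2, float gamma2, float b2,
                      const std::vector<float>& x_float) {
    if (stage1_score(w1, x_fixed, b1) < 0) return false;   // early rejection
    return stage2_score(sv2, alpha2, gamma2, b2, x_float) >= 0.0f;
}
```

In the reviewed hardware the two stages are separate processing modules (or DPR-swapped configurations) rather than function calls, but the control flow is the same: the expensive classifier only processes the small fraction of samples that the cheap one cannot reject.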


Group 2: Another research group, led by H. Hussain [35–37], works on parallel systolic array implementations (category B), exploiting the powerful DPR feature of the FPGA to reach flexibility and scalability (category C). They implement various copies with different SVM parameters to fulfill user requirements at run-time, and achieve large speedups compared with software implementations on a GPP. These implementations are then used to realize a multicore architecture of different SVM copies with DPR applied, which offers the flexibility of changing cores on-the-fly at run-time for later use in ensemble classification. However, no specific application is employed to exercise the proposed architectures, so no classification accuracy is measured. They later implemented a DPR-based adaptive multiclassifier that can be used as an SVM or KNN classifier with different parameters, saving more space in the device, and that could be extended in the future into an ensemble classifier targeting classification of bioinformatics microarray data; again, however, no real application is applied to verify the usage and classification of the implemented multiclassifier. Their DPR implementations gain an 8× reduction in reconfiguration time over reconfiguring the whole FPGA device. In summary, they gained flexibility and adaptability with good speedups in reconfiguration time, but without verification of the classification functionality. Also, no optimization methods are used to reduce hardware resource utilization, and no power consumption measurements are presented against embedded systems constraints.
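As a rough illustration of what a category B design computes, the following behavioural C++ model sketches a one-dimensional systolic chain for a linear SVM decision value w·x + b: each processing element (PE) stores one weight, partial sums ripple one PE to the right per beat, and once the pipeline is full a finished dot product emerges every beat. The chain length, data type and scheduling are assumptions for illustration; this is not the RTL of [35–37].

```cpp
// Behavioural model of a 1-D systolic chain evaluating w.x + b for a stream of
// input vectors: PE j stores w[j]; partial sums ripple right one PE per beat,
// so after the pipeline fills, one finished dot product emerges every beat.
#include <array>
#include <vector>

constexpr int D = 4;                        // features = number of PEs (assumed)

struct SystolicChain {
    std::array<float, D> weight{};          // one resident weight per PE
    float bias = 0.0f;

    // Feed one input vector per beat; returns the finished decision values.
    std::vector<float> run(const std::vector<std::array<float, D>>& inputs) {
        std::vector<float> done;
        std::array<float, D> psum{};         // psum[j]: partial sum sitting in PE j
        std::array<int, D> owner{};          // which input each partial sum belongs to
        owner.fill(-1);
        int total_beats = static_cast<int>(inputs.size()) + D;
        for (int beat = 0; beat < total_beats; ++beat) {
            // Drain: the last PE finished an input D beats after it entered.
            if (owner[D - 1] >= 0) done.push_back(psum[D - 1] + bias);
            // Shift partial sums one PE to the right (all PEs act in parallel;
            // iterating from high j to low j preserves the neighbours' old values).
            for (int j = D - 1; j > 0; --j) {
                owner[j] = owner[j - 1];
                psum[j]  = (owner[j] >= 0)
                         ? psum[j - 1] + weight[j] * inputs[owner[j]][j]
                         : 0.0f;
            }
            // PE 0 accepts a new input vector, if any remain in the stream.
            if (beat < static_cast<int>(inputs.size())) {
                owner[0] = beat;
                psum[0]  = weight[0] * inputs[beat][0];
            } else {
                owner[0] = -1;
                psum[0]  = 0.0f;
            }
        }
        return done;                          // done[i] = w . inputs[i] + b
    }
};
```

The attraction of this structure for FPGA implementation is that all PEs are identical and communicate only with their neighbours, so the array maps naturally onto replicated arithmetic slices with short, local routing.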
Group 3: An interesting and unique line of work is presented by another research group, led by S. Afifi [59–64], which exploits the recent hybrid Zynq SoC platform and the modern UltraFast HLS design methodology (category E). They developed the first HLS-based SVM IP/core on a Zynq SoC reported in the literature, targeting low-cost and real-time melanoma detection. A hardware/software co-design is first proposed to realize a Zynq accelerator, implemented in a high-level language (C/C++) and applying the hardware directives/techniques available through the HLS tool, which simplifies FPGA design and reduces development effort and time. The implemented system is considered the first such hardware/software SoC reported in the recent period. The hardware design is then extended to a full SVM IP running on the Zynq SoC with high performance and low cost. They target meeting challenging embedded systems constraints while achieving efficient classification without any loss in accuracy caused by the hardware implementation. Their experimental results demonstrate optimized hardware with high performance, low resource utilization, and low power consumption, as well as high classification accuracy. Their low-power embedded system meets the critical power constraint and shows the lowest power dissipation, 1.5 W [64], compared with the other reported implementations (Table 7). Still, future testing is required to validate the online classification accuracy rate for melanoma detection. Lately, they proposed a scalable multicore architecture based on adding more SVM IPs to implement a cascade SVM classifier on a single device/SoC for improving diagnosis verification. The implemented cascade classifier achieved lower resource utilization and power consumption than a single SVM IP, in addition to improvements in classification accuracy and speed. Consequently, this group added a new FPGA-based cascaded SVM implementation to the literature alongside the first such implementation by Group 1. Moreover, the powerful DPR feature of the FPGA was recently applied to the proposed cascade SVM classifier, where further optimization of the implementation results was achieved while gaining flexibility, adaptability, scalability, and applicability.
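For readers unfamiliar with the category E flow, the fragment below gives a minimal Vivado HLS-style C++ kernel for a binary linear SVM decision function, of the kind such a flow synthesises into an IP core. The fixed-point type, the 27-element feature vector (the dimension reported for [60, 62, 64] in Table 7) and the pragmas are illustrative assumptions, not the published IP of this group.

```cpp
// Minimal Vivado HLS-style sketch of a binary linear SVM decision kernel.
// The ap_fixed word length, loop pragmas and array partitioning are assumed
// for illustration; they are not the directives used in the reviewed designs.
#include <ap_fixed.h>

typedef ap_fixed<18, 6> data_t;   // assumed 18-bit fixed-point format
const int DIM = 27;               // feature dimension (as reported in Table 7)

int svm_decision(const data_t x[DIM], const data_t w[DIM], data_t bias) {
#pragma HLS ARRAY_PARTITION variable=x complete dim=1
#pragma HLS ARRAY_PARTITION variable=w complete dim=1
    data_t acc = bias;
dot_product:
    for (int j = 0; j < DIM; ++j) {
#pragma HLS PIPELINE II=1
        acc += w[j] * x[j];        // one MAC per feature, pipelined by HLS
    }
    return (acc >= 0) ? 1 : -1;    // sign of w.x + b gives the class label
}
```

The point of the flow is that directives like these, rather than hand-written RTL, control how the loop is pipelined and how the arrays are mapped to registers or BRAM, which is what enables the short development times reported for this category.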
In addition, Table 7 shows that this HLS-based implementation of a cascaded SVM [62] has remarkably lower hardware resource utilization and power consumption than Group 1's implementations [69, 75], which is promising for meeting challenging embedded systems constraints.

Although this survey clarifies the work that has been done to address the research questions, further effort is required to find answers to all of the research questions (listed in Section I). In the next section, we suggest promising future work in the subject area to fill the identified gaps and answer the research questions.

Conclusion and Future Directions

This paper is a unique survey study that presents and classifies the different hardware architectures used for implementing SVM classifiers on FPGAs. In addition, it provides critical analysis and a detailed comparison with discussions, and identifies leading research groups, limitations, challenges, and research gaps. In conclusion, the main research gap is finding an optimum solution for the challenging trade-off between achieving efficient classification accuracy and meeting significant embedded systems constraints of high performance, low area, cost, and power/energy consumption.

Finally, the following research directions are suggested for future use by hardware designers in order to address the research questions, limitations, gaps, and challenges identified in this review:

• An optimized hardware architecture is required to overcome the challenging trade-off, aiming to reach efficient real-time embedded classification with high performance and low cost.
• The existing techniques could be combined efficiently to achieve optimized results.
• New hardware-based methods are needed that target improving the online classification accuracy rate on hardware.
• The hardware-friendly kernel needs further improvement with respect to maintaining the classification accuracy rate.
• Effective techniques need to be studied for decreasing memory requirements and hardware resource utilization in large-scale applications.
• The powerful DPR feature should be widely and efficiently exploited to gain more flexibility, adaptability, and scalability, in addition to further improvements in speed, area, and power consumption.
• Additional research work should focus on implementing multiclass classifiers and nonlinear kernel-based classifiers.
• The evolvable hardware method needs to be investigated for achieving adaptive classification systems.
• Flow control and memory management should be studied effectively for efficient data transfer in the classification system.
• The multicore architecture should be investigated for working as an ensemble, multiclass, or cascaded classifier (by adding a voting/controller mechanism; a minimal sketch of such a voter follows this list).
• Using the latest FPGA devices/SoCs and technologies should be considered in hardware implementations, in addition to exploiting modern development tools and design methodologies (e.g., HLS), to reach efficient designs with optimized results.
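As a concrete, deliberately simple illustration of the voting/controller idea mentioned above, an ensemble decision over several classifier cores could be organised as in the following C++ sketch; the ClassifierCore interface is hypothetical and stands in for whatever SVM IP instances a multicore architecture exposes.

```cpp
// Hypothetical majority-vote controller over several classifier cores (e.g.
// multiple SVM IP instances); the classify() interface is assumed for
// illustration only and is not an API from the reviewed implementations.
#include <vector>

struct ClassifierCore {
    virtual int classify(const std::vector<float>& x) const = 0;  // returns +1 / -1
    virtual ~ClassifierCore() = default;
};

// Ensemble decision: each core votes and the sign of the vote sum wins.
int ensemble_classify(const std::vector<const ClassifierCore*>& cores,
                      const std::vector<float>& x) {
    int votes = 0;
    for (const ClassifierCore* core : cores)
        votes += core->classify(x);          // +1 or -1 from each core
    return (votes >= 0) ? 1 : -1;            // majority vote (ties -> +1)
}
```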
Conflict of interest  The authors declare that they have no conflict of interest.
References

1. Nayak J, Naik B, Behera H. A comprehensive survey on support vector machine in data mining tasks: applications & challenges. Int J Database Theory Appl. 2015;8(1):169–86.
2. Sabouri P, Gholam Hosseini H, Larsson T, Collins J, A cascade classifier for diagnosis of melanoma in clinical images. In: 36th annual international conference of the IEEE engineering in medicine and biology society (EMBC), 2014, pp. 6748–6751. IEEE.
3. Foody GM, Mathur A. A relative evaluation of multiclass image classification by support vector machines. IEEE Trans Geosci Remote Sens. 2004;42(6):1335–43.
4. Entezari-Maleki R, Rezaei A, Minaei-Bidgoli B. Comparison of classification methods based on the type of attributes and sample size. J Converg Inf Technol. 2009;4(3):94–102.
5. Kim J, Kim B-S, Savarese S. Comparing image classification methods: k-nearest-neighbor and support-vector-machines. Ann Arbor. 2012;1001:48109–2122.
6. Véstias MP, High-performance reconfigurable computing granularity. In: Encyclopedia of information science and technology, pp. 3558–3567, 2015.
7. Hussain HM, Benkrid K, Seker H, The role of FPGAs as high performance computing solution to bioinformatics and computational biology data. In: AIHLS2013, p. 102, 2013.
8. Asano S, Maruyama T, Yamaguchi Y, Performance comparison of FPGA, GPU and CPU in image processing. In: International conference on field programmable logic and applications, 2009. FPL 2009, 2009, pp. 126–131. IEEE.
9. Fowers J, Brown G, Cooke P, Stitt G, A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. In: Proceedings of the ACM/SIGDA international symposium on field programmable gate arrays, 2012, pp. 47–56. ACM.
10. Cope B, Cheung PY, Luk W, Howes L. Performance comparison of graphics processors to reconfigurable logic: a case study. IEEE Trans Comput. 2010;59(4):433–48.
11. Pietron M, Wielgosz M, Zurek D, Jamro E, Wiatr K, Comparison of GPU and FPGA implementation of SVM algorithm for fast image segmentation. In: Architecture of computing systems–ARCS, Springer, 2013, pp. 292–302.
12. Fykse E, Performance comparison of GPU, DSP and FPGA implementations of image processing and computer vision algorithms in embedded systems, M.Sc. thesis, Department of Electronics and Telecommunications, Norwegian University of Science and Technology, 2013.
13. Afifi SM, GholamHosseini H, Sinha R. Hardware implementations of SVM on FPGA: a state-of-the-art review of current practice. Int J Innov Sci Eng Technol (IJISET). 2015;2(11):733–52.
14. Kyrkou C, Theocharides T. SCoPE: Towards a systolic array for SVMs. IEEE Embed Syst Lett. 2009;1(2):46–9.
15. Anguita D, Pischiutta S, Ridella S, Sterpi D. Feed-forward support vector machine without multipliers. IEEE Trans Neural Netw. 2006;17(5):1328–31.
16. Hsu C-W, Lin C-J. A comparison of methods for multi-class support vector machines. IEEE Trans Neural Netw. 2002;13(2):415–25.
17. Ago Y, Nakano K, Ito Y, A classification processor for a support vector machine with embedded DSP slices and block RAMs in the FPGA. In: 2013 IEEE 7th international symposium on embedded multicore socs (MCSoC), 2013, pp. 91–96.
18. Wielgosz M, Jamro E, Zurek D, Wiatr K. FPGA implementation of the selected parts of the fast image segmentation. Stud Comput Intell. 2012;390:203–16.
19. Berberich M, Doll K, Highly flexible FPGA-architecture of a support vector machine. In: MPC-workshop 45, 2014, pp. 25–32.
20. Andraka R, A survey of CORDIC algorithms for FPGA based computers. In: Proceedings of the 1998 ACM/SIGDA sixth international symposium on field programmable gate arrays, 1998, pp. 191–200. ACM.
21. Nie Z, Zhang X, Yang Z, An FPGA implementation of multi-class support vector machine classifier based on posterior probability. In: Proceedings of the 2010 3rd international conference on computer and electrical engineering (ICCEE 2010), no. 2, 2010.
22. Saurav S, Singh S, Saini R, Saini AK, Hardware accelerator for facial expression classification using linear SVM. In: SIRS, 2015, pp. 39–50.
23. Vranjković VS, Struharik RJR, Novak LA. Reconfigurable hardware for machine learning applications. J Circuit Syst Comput. 2015;24(5):1550064.
24. Vranjković V, Struharik R, Coarse-grained reconfigurable hardware accelerator of machine learning classifiers. In: 2016 International conference on systems, signals and image processing (IWSSIP), 2016, pp. 1–5. IEEE.
25. Kim S, Lee S, Cho K. Design of high-performance unified circuit for linear and non-linear SVM classifications. J Semicond Technol Sci. 2012;12(2):162–7.
26. Tang P-TP. Table-driven implementation of the exponential function in IEEE floating-point arithmetic. ACM Trans Math Softw (TOMS). 1989;15(2):144–57.
27. Koide T, et al. FPGA implementation of type identifier for colorectal endoscopic images with NBI magnification. IEEE Asia Pac Conf Circuit Syst (APCCAS). 2014;2014:651–4.
28. Shigemi S, et al., Customizable hardware architecture of support vector machine in CAD system for colorectal endoscopic images with NBI magnification. In: SASIMI 2013 proceedings, the 18th workshop on synthesis and system integration of mixed information technologies, pp. 298–203, 2013.
29. Shigemi S, An FPGA implementation of support vector machine identifier for colorectal endoscopic images with NBI magnification. In: Proceedings of the 28th international conference on circuits/systems, computers and communications (ITC-CSCC 2013), pp. 571–572.
30. Liu C, Qiao F, Yang X, Yang H, Hardware acceleration with pipelined adder for support vector machine classifier. In: 2014 fourth international conference on digital information and communication technology and its applications (DICTAP), 2014, pp. 13–16. IEEE.

31. Cárdenas J, Figueroa M, Pezoa JE, A custom hardware classifier for bruised apple detection in hyperspectral images. In: SPIE Optical Engineering + Applications, 2015, pp. 95992K-95992K-11. International Society for Optics and Photonics.
32. Qasaimeh M, Sagahyroon A, Shanableh T. FPGA-based parallel hardware architecture for real-time image classification. IEEE Trans Comput Imaging. 2015;1(1):56–70.
33. Vucha M, Rajawat A. Design and FPGA implementation of systolic array architecture for matrix multiplication. Int J Comput Appl. 2011;26(3):18–22.
34. Kyrkou C, Theocharides T. A parallel hardware architecture for real-time object detection with support vector machines. IEEE Trans Comput. 2012;61(6):831–42.
35. Hussain HM, Benkrid K, Seker H, Reconfiguration-based implementation of SVM classifier on FPGA for classifying microarray data. In: 2013 35th annual international conference of the IEEE engineering in medicine and biology society (EMBC), 2013, pp. 3058–3061. IEEE.
36. Hussain H, Benkrid K, Şeker H. Novel dynamic partial reconfiguration implementations of the support vector machine classifier on FPGA. Turk J Electr Eng Comput Sci. 2016;24(5):3371–87.
37. Hussain HM, Benkrid K, Seker H, Dynamic partial reconfiguration implementation of the SVM/KNN multi-classifier on FPGA for bioinformatics application. In: 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC), 2015, pp. 7667–7670. IEEE.
38. Patil R, Gupta G, Sahula V, Mandal A, Power aware hardware prototyping of multiclass SVM classifier through reconfiguration. In: 2012 25th international conference on VLSI design (VLSID), 2012, pp. 62–67. IEEE.
39. Sasamal TN, Prasad R. Module based and difference based implementation of partial reconfiguration on FPGA: a review. Int J Eng Res Appl (IJERA). 2011;1(4):1898–903.
40. Hussain HM, Benkrid K, Seker H, An adaptive implementation of a dynamically reconfigurable K-nearest neighbour classifier on FPGA. In: NASA/ESA conference on adaptive hardware and systems (AHS), 2012, pp. 205–212. IEEE.
41. Hussain H, Benkrid K, Hong C, Seker H, An adaptive FPGA implementation of multi-core K-nearest neighbour ensemble classifier using dynamic partial reconfiguration. In: 22nd international conference on field programmable logic and applications (FPL), 2012, pp. 627–630. IEEE.
42. Jallad AHM, Mohammed LB. Hardware support vector machine (SVM) for satellite on-board applications. NASA/ESA Conf Adapt Hardw Syst (AHS). 2014;2014:256–61.
43. Wong S, Vassiliadis S, Cotofana S, A sum of absolute differences implementation in FPGA hardware. In: Euromicro conference, 2002. Proceedings. 28th, 2002, pp. 183–188. IEEE.
44. Pan X, Yang H, Li L, Liu Z, Hou L, FPGA implementation of SVM decision function based on hardware-friendly kernel. In: 2013 international conference on computational and information sciences, ICCIS 2013 proceedings, 2013, pp. 133–136.
45. Ruiz-Llata M, Guarnizo G, Yébenes-Calvino M, FPGA implementation of a support vector machine for classification and regression. In: The 2010 international joint conference on neural networks (IJCNN), 2010, pp. 1–5. IEEE.
46. Gimeno Sarciada J, Lamel Rivera H, Jiménez M, CORDIC algorithms for SVM FPGA implementation. In: Proceedings of SPIE – the international society for optical engineering, 2010, vol. 7703.
47. Lamela H, Gimeno J, Jiménez M, Ruiz M, Performance evaluation of a FPGA implementation of a digital rotation support vector machine. In: SPIE defense and security symposium, 2008, pp. 697908–697908–8. International Society for Optics and Photonics.
48. Anguita D, Carlino L, Ghio A, Ridella S. A FPGA core generator for embedded classification systems. J Circuits Syst Comput. 2011;20(2):263–82.
49. Anguita D, Ghio A, Pischiutta S, Ridella S. A support vector machine with integer parameters. Neurocomputing. 2008;72(1):480–9.
50. Vranjkovic V, Struharik R, New architecture for SVM classifier and its application to telecommunication problems. In: 2011 19th Telecommunications Forum (TELFOR), 2011, pp. 1543–1545.
51. Mandal B, Sarma MP, Sarma KK, Implementation of systolic array based SVM classifier using multiplierless kernel. In: 2014 international conference on signal processing and integrated networks (SPIN), pp. 35–39. IEEE.
52. Mandal B, Sarma MP, Sarma KK, Design of a systolic array based multiplierless support vector machine classifier. In: 2014 international conference on signal processing and integrated networks (SPIN), 2014, pp. 35–39. IEEE.
53. Xilinx System Generator for DSP. https://au.mathworks.com/products/connections/product_detail/product_35567.html
54. Mahmoodi D, Soleimani A, Khosravi H, Taghizadeh M. FPGA simulation of linear and nonlinear support vector machine. J Softw Eng Appl. 2011;4(05):320–8.
55. Jiang Y, Virupakshappa K, Oruklu E, FPGA implementation of a support vector machine classifier for ultrasonic flaw detection. In: 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), 2017, pp. 180–183.
56. Saini R, Saurav S, Gupta DC, Sheoran N, Hardware implementation of SVM using system generator. In: 2017 2nd IEEE international conference on recent trends in electronics, information & communication technology (RTEICT), 2017, pp. 2129–2132. IEEE.
57. Cutajar M, Gatt E, Grech I, Casha O, Micallef J. Hardware-based support vector machine for phoneme classification. IEEE EuroCon. 2013;2013:1701–8.
58. Vivado High-Level Synthesis. Available: https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html
59. Afifi S, GholamHosseini H, Sinha R, Hardware acceleration of SVM-based classifier for melanoma images. In: Huang F, Sugimoto A (eds) Image and Video Technology – PSIVT 2015 Workshops: RV 2015, GPID 2013, VG 2015, EO4AS 2015, MCBMIIA 2015, and VSWS 2015, Auckland, New Zealand, November 23–27, 2015. Revised Selected Papers, Cham: Springer International Publishing, 2016, pp. 235–245.
60. Afifi S, GholamHosseini H, Sinha R. A low-cost FPGA-based SVM classifier for melanoma detection. IEEE EMBS conference on biomedical engineering and sciences (IECBES). 2016;2016:631–6.
61. Afifi S, GholamHosseini H, Sinha R. A system on chip for melanoma detection using FPGA-based SVM classifier. Microprocess Microsyst. 2019;65:57–68.
62. Afifi S, GholamHosseini H, Sinha R, SVM classifier on chip for melanoma detection. In: The 39th annual international conference of the IEEE engineering in medicine and biology society (EMBC'17), 2017.
63. Afifi S, GholamHosseini H, Sinha R, Dynamic hardware system for cascade SVM classification of melanoma. Neural Computing and Applications, 2018.
64. Afifi S, Gholamhosseini H, Sinha R, Lindén M. A novel medical device for early detection of melanoma. Stud Health Technol Inform. 2019;261:122–7.
65. Zynq-7000 All Programmable SoC. https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
66. Tsoutsouras V, Koliogeorgi K, Xydis S, Soudris D, An exploration framework for efficient high-level synthesis of support vector machines: case study on ECG arrhythmia detection for Xilinx Zynq SoC. J Signal Process Syst, pp. 1–21, 2017.
67. Koliogeorgi K, Zervakis G, Anagnostos D, Zompakis N, Siozios K, Optimizing SVM classifier through approximate and high level synthesis techniques. In: 2019 8th international conference on modern circuits and systems technologies (MOCAST), 2019, pp. 1–4. IEEE.
68. Elgawi O, Mutawa A, Ahmad A, Energy-efficient embedded inference of SVMs on FPGA. In: 2019 IEEE computer society annual symposium on VLSI (ISVLSI), 2019, pp. 164–168. IEEE.
69. Kyrkou C, Theocharides T, Bouganis C-S, An embedded hardware-efficient architecture for real-time cascade support vector machine classification. In: 2013 international conference on embedded computer systems: architectures, modeling, and simulation (SAMOS XIII), 2013, pp. 129–136. IEEE.
70. Papadonikolakis M, Bouganis C-S, A novel FPGA-based SVM classifier. In: International conference on field-programmable technology (FPT), 2010, pp. 283–286. IEEE.
71. Papadonikolakis M, Bouganis CS. A heterogeneous FPGA architecture for support vector machine training. Proc IEEE Symp Field-Programm Custom Comput Mach, FCCM. 2010;2010:211–4.
72. Papadonikolakis M, Bouganis C. Novel cascade FPGA accelerator for support vector machines classification. IEEE Trans Neural Netw Learn Syst. 2012;23(7):1040–52.
73. Kyrkou C, Theocharides T, Bouganis CS, A hardware-efficient architecture for embedded real-time cascaded support vector machines classification. In: Proceedings of the 23rd ACM international conference on great lakes symposium on VLSI, 2013, pp. 341–342. ACM.
74. Kyrkou C, Theocharides T, Bouganis C-S, Polycarpou M. Boosting the hardware-efficiency of cascade support vector machines for embedded classification applications. Int J Parallel Programm. 2017;46:1–27.
75. Kyrkou C, Bouganis C-S, Theocharides T, Polycarpou MM. Embedded hardware-efficient real-time classification with cascade support vector machines. IEEE Trans Neural Netw Learn Syst. 2015;27:90–112.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
