https://doi.org/10.1007/s42514-020-00055-4
REGULAR PAPER
Received: 18 June 2020 / Accepted: 15 October 2020 / Published online: 26 January 2021
© China Computer Federation (CCF) 2021
Abstract
Deep convolutional neural networks (CNNs), which are widely applied in image tasks, can also achieve excellent performance in acoustic tasks. However, activation data in a convolutional neural network is usually represented in floating-point format, which is both time-consuming and power-consuming to compute. Quantization can turn activation data into fixed-point form, replacing floating-point computation with faster and more energy-saving fixed-point computation. Based on this method, this article proposes a design space searching method to quantize a binary weight neural network. A specific accelerator is built on an FPGA platform; it has a layer-by-layer pipeline design and higher throughput and energy efficiency than CPU and other hardware platforms.
An energy-efficient convolutional neural network accelerator for speech classification…
and GPUs supporting 8-bit integer data, etc. These designs reduce power consumption greatly and achieve speed-up ratios of up to hundreds compared with CPU platforms. It turns out that hardware paired with correspondingly quantized CNN models can achieve excellent computing performance and high energy efficiency.

Compared with CPUs and GPUs, ASICs and FPGAs are more suitable for accelerating a specific task. These platforms can be customized by setting up pipelines and expanding the degree of parallelism, lowering power consumption and raising computing performance. Although an ASIC owns huge advantages over an FPGA in power and running frequency, expensive design and manufacturing costs limit its general application. In contrast, the FPGA keeps a good balance between performance, power, flexibility and expense thanks to its programmable nature and mature industry design flow. FPGAs are now widely used in cloud computing and intelligent computing by Microsoft, Amazon and Alibaba (Gwennap 2017; Turan et al. 2020), becoming an important part of high-performance computing.

The speech classification model, which focuses on specific speech instructions or acoustic signals, is a basic component of intelligent scenario analysis both in the cloud and at the edge. Such application circumstances especially need a low-power, high-performance computing platform. Typical deep convolutional neural networks can process coarse speech classification tasks well. However, there is still room to accelerate the CNN model and reduce the computing platform's energy consumption through quantization and customized hardware design. To implement this power-efficient speech classification platform, we choose a typical CNN-based speech classification model whose weight values are +1 or −1 and whose activation data is in full-precision floating-point format (Bo et al. 2018). We design an accelerator based on the Xilinx XCKU-115 FPGA platform and run this binary weight network (BWN) model on it. Compared with state-of-the-art CPU platforms, our accelerator achieves an 18–300× throughput speed-up ratio and high energy efficiency. The main contributions of this work are as follows:

1. We turn floating-point feature data into fixed-point data with a common quantization method. However, we separate intermediate results and feature data, and use a design space searching method to find the best quantization settings for each. For hyper-parameters such as batch-normalization arguments, we take a different bitwise allocation method and validate its influence on final performance. On top of the common quantization method, these operations help decrease the accuracy loss caused by quantization and achieve prediction accuracy close to that of the non-quantized model.
2. We design a novel layer-by-layer pipeline structure in our multi-PE (process element) BWN accelerator. The architecture has shared weight storage, balanced pipeline partitioning and bitwise expansion. The performance, power consumption and energy efficiency of this accelerator are discussed in Sect. 5.
3. The target speech classification model is tested on CPU platforms to get a performance baseline. Compared with these test results, our design has an absolute advantage in performance per watt and throughput.

2 Related work

2.1 Neural network inference quantization

When training a deep neural network, researchers usually choose a full-precision data format to ensure the best model accuracy. However, in inference tasks these parameters will not be changed, and we can therefore compress them offline. Bo et al. (2018) propose an algorithm to compress floating-point DNN parameters into binary data consisting of +1 and −1. Compared with a common DNN with floating-point weights and activations, this compression method not only sharply reduces parameter storage, but also replaces multiplication and division with addition and subtraction. Less storage space and fewer multiplications mean less memory-access energy and fewer computing cycles, leading to lower power consumption and faster operation.

Matthieu et al. (2016) bring out a method to turn activations into binary format. Unlike the parameters of a neural network, activation data fluctuates numerically with different inputs (e.g., an input image or an input audio feature map). Although Matthieu et al. (2016) still keep good model accuracy on very deep CNNs like VGGNet, binary activation data brings great numerical precision loss. This loss may have a vital influence on some small-size CNNs (Jacob et al. 2017; Xu et al. 2018). In this situation, turning floating-point data into fixed-point format can keep a good balance between computing performance and model accuracy. Fixed-point computation needs fewer computing cycles than floating-point computation, and a fixed-point format can adapt to the data's numerical distribution through flexible allocation of integer and decimal bits.

When the integer part is allocated more bits, the representable range expands; when the decimal part is given a longer word length, numerical precision improves correspondingly. For example, take the decimal number 10.06. With a 4-bit integer part and a 5-bit decimal part, this number is represented as 1010.00010 in binary, which is 10.0625 when translated back into a decimal number. The numerical precision loss here is generated by the discrete nature of the binary representation. With the same integer part length and 10 bits allocated to the decimal part, the number 10.06 becomes 1010.0000111101, which is 10.0595703125 when translated
into the original format. This example shows that quantized data can keep better numerical precision with a longer decimal bit length. However, increasing the length of the fixed-point data format adds cycle counts to computation and occupies more hardware resources on the FPGA. Taking both speed and computation precision into account, it is important to find the relatively best data format.

2.2 Acoustic models based on convolutional neural networks

In 2016, Tom et al. introduced acoustic models based on convolutional neural networks to large-scale datasets. They proposed a new CNN design without padding or pooling along the time dimension. Their team also introduced batch normalization, which helps train very deep CNNs on sequential data. Although this structure made their model slightly suboptimal in accuracy, they still benefitted from the high computing efficiency of convolutional operations. Compared with RNNs, their solution could be trained and inferenced faster.

In 2019, QuartzNet (Samuel et al. 2020) was published by Samuel et al. QuartzNet is a 1D time-channel CNN with residual blocks between layers. The model operates on time-channel data and completely decouples the time-wise and channel-wise parts of convolution. This idea greatly decreases the parameter scale, because the same parameters are reused across different time-channels. This CNN-based speech recognition model achieves near state-of-the-art performance while using only 5.3–82.6% of the parameters of other RNN-based and CNN-based models.

Built on QuartzNet, Samuel et al. proposed the latest CNN-based model for speech recognition in 2020: MatchboxNet (Samuel and Boris 2020). Via 1 × 1 pointwise convolution and batch normalization in time-channels, MatchboxNet achieves state-of-the-art performance in this territory. The new architecture still keeps a small parameter scale, taking only 38.1% of the parameters of Attention RNN (Douglas et al. 2018) and 32.4% of those of ResNet-15, yet achieves better prediction accuracy.

To conclude, previous CNN-based speech recognition models own significantly smaller size, faster training speed, higher computation efficiency and similar or even better prediction accuracy compared with RNNs. By using these advantages of CNNs, we can deploy speech recognition or speech classification models on compute- and memory-limited hardware platforms with high inference throughput.

3 Speech classification model

3.1 Model architecture and weight binarization

This CNN-based speech classification model is trained on the TensorFlow speech command dataset. It classifies six sorts of short speech segments: "up", "down", "yes", "right", "left" and "unknown words". The model first uses the MFCC algorithm to turn an audio file into a floating-point tensor of dimension 20 × 49 × 1. This tensor is then sent into a convolutional neural network. The network consists of two convolutional layers, three fully-connected layers and binary weight parameters. Detailed information on the model architecture is shown in Fig. 1. All convolutional kernel sizes are 3 and the convolutional stride is 1. There is no padding or expansion operation in this network, which is convenient for acceleration. Note that the activations are still in floating-point format at this stage. Via the softmax function, the model outputs the probability of each of the six labels.

After the model parameters are fixed in the training process, we transfer the floating-point weight values into (−1, 1). Figure 2 shows how we process the weight data. We assume the distribution of the primitive parameters is normal; the numerical distribution range is then modified by the tanh function and a series of scaling methods. Finally, all parameters are discretized to −1 or +1. The BWN model at this stage (activation data still in floating-point format) provides a top-1 accuracy of 85.31%.

3.2 Quantizing feature data and intermediate results

Once we change the floating-point input features into fixed-point format, the intermediate results, activations and other hyper-parameters (such as the bias and batch-normalization parameters) will naturally be in fixed-point format. The data computed in the
network can be divided into two parts: the intermediate results after multiply-accumulate (MAC) operations, and the batch-normalized output data. The batch-normalized output data is the feature data that will be transferred to the next layer of the neural network. In some cases, the absolute value and variance of the intermediate results can be huge, which means the data distribution of intermediate results is rather fluctuant; this brings difficulty to quantization. These intermediate results are then normalized, narrowing down their absolute value and variance. Unlike the intermediate result data, the feature data follows a normal distribution and is easy to quantize. To quantize data into fixed-point format, the approximate distribution range of these two kinds of data needs to be determined first, because the numerical distribution range decides the data's integer-part bit width. Quantizing intermediate results demands a longer integer representation because of their fluctuant data distribution.

In some integrated development environments (IDEs) running on CPU platforms, such as Matlab, the quantization function libraries are usually not "real" quantization operation libraries. The intermediate results' bit width and precision loss cannot be reflected by such libraries, because CPU platforms cannot handle fixed-point operations well. When running quantization operations, the CPU does the floating-point computation first and then converts the result into fixed-point format. However, on FPGA platforms, hardware design and resource utilization are influenced by the intermediate results, because they usually demand a longer word length. As a result, the FPGA has to use more LUTs to complete such fixed-point computation and more registers to store intermediate results.

To quantize feature data, we usually use 8 bits for the integer part (one for the sign and seven for the value) and 8 bits for the decimal part, in saturation mode. Some layers' feature data may need more decimal bits to increase representation precision. However, this quantization setting is obviously not suitable for intermediate results, whose absolute value can go far beyond 127. Thus, a quantization format for intermediate results should be arranged separately. We try several bit-width schemes with a design space search method, and the experimental results are discussed in Sect. 5.1.

3.3 Quantizing batch-normalization's parameters

The batch-normalization operation, which greatly improves data distribution among neural network layers, has been proven vital to the model's final prediction accuracy (Santurkar et al. 2018; Liu et al. 2017). However, the multiplication and division in this process not only depend on the intermediate results, but are also sensitive to the parameters' numerical precision (Giri et al. 2016). The experiment in Sect. 5.1 shows the
Table 1 The detailed information of the parameters

Layer    Filter  Kernel     Parameters  Parameter size
Conv-1   32      3*3*1      288         36 B
Conv-2   32      3*3*32     9216        1152 B
FC-1     32      16*45*32   737,280     92,160 B
FC-2     –       –          1024        128 B
FC-3     –       –          192         24 B

in the batch-normalization step, it will lead to unnecessary loss in the final prediction.

To handle this situation, we introduce a bitwise expansion method. Bitwise expansion gives extra decimal bits to both the feature data and the normalization parameters during computation. After the DSP outputs the multiplication results, these extra decimal bits are cut and the data returns to the original input format.

The batch-normalization operation can be divided into four steps: subtracting the mean value, multiplying the offset, dividing by the square root of the variance (which is substituted by multiplying the reciprocal of the variance's square root in our hardware design) and adding the bias. As Fig. 6 shows, we implement these four parts of normalization in a pipelined way. The negative impact of multiply computation on the FPGA is reduced in this way.

Similarly, the intermediate results of the MAC process have the same need for bitwise expansion. The difference is that MAC results usually need more bits in the integer part rather than in the decimal part. We apply expansion to both the normalization step and the accumulator in the MAC adder tree; it turns out that this method ensures computing accuracy on the hardware.

4.3 Layer-by-layer pipeline

In deep convolutional neural networks such as VGG-16 and AlexNet (Simonyan and Zisserman 2014; Krizhevsky et al. 2012), it is difficult to keep this balance because, as networks go deeper, deep layers demand that the layers ahead generate feature results at a faster speed, which is beyond current hardware's computing power limit.

The target neural network is shallow and tiny, so it is possible to keep the balance between different layers. The key point of the layer-by-layer pipeline's implementation is to balance
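The four batch-normalization steps described above (subtract the mean, multiply the offset, multiply the reciprocal of the standard deviation, add the bias) can be sketched as a straight-line stage. This is a hedged Python illustration of the dataflow only; names such as `bn_stage` are hypothetical, and on the real hardware each step runs in its own pipeline stage on fixed-point data. The per-channel reciprocal `inv_std` is precomputed offline so that the hardware never divides.

```python
import math

def bn_stage(x, mean, offset, inv_std, bias):
    t = x - mean       # step 1: subtract the mean value
    t = t * offset     # step 2: multiply the offset
    t = t * inv_std    # step 3: multiply the reciprocal of sqrt(variance)
    return t + bias    # step 4: add the bias

# Per-channel reciprocal computed once, offline, instead of dividing on chip.
inv_std = 1.0 / math.sqrt(4.0)
print(bn_stage(3.0, mean=1.0, offset=2.0, inv_std=inv_std, bias=0.5))  # 2.5
```

Replacing the division by a precomputed reciprocal is exactly the substitution the text describes; it trades one offline division per channel for a cheap multiplication per activation.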
Fig. 6 Batch-normalization unit [figure: per-channel Mean, Offset, Variance and Bias values for 32 channels feeding DSP multipliers]

5 Experiment and discussion

5.1 Quantized model's performance

We use Matlab-2018a's quantizer function to turn the feature and batch-normalization parameters into fixed-point format in saturation mode. In order to find the best bit allocation scheme, we conduct a couple of quantization experiments and compare their accuracy. The horizontal axis (IRDB, NRDB) denotes the combination of the intermediate result decimal bit width and the normalization result (feature data) decimal bit width.

To find the best quantization bit allocation scheme, we conduct several experiments under the conditions discussed above. The experimental results are illustrated in Figs. 8 and 9. In Fig. 8, the vertical axis shows the model's prediction accuracy for different combinations of intermediate result decimal bit width (IRDB) and normalization result decimal bit width (NRDB). It turns out that when the intermediate data uses an 8-bit decimal part and the normalization parameters use a 9-bit decimal part, the model achieves the best accuracy. We also ran the experiment in which the intermediate results own a longer decimal part than the normalization data, which violates our searching principle; the final result supports our idea effectively.

It is also noticeable that when (IRDB, NRDB) goes beyond (8, 9), the model's final accuracy does not rise with better data representation precision. Some works have proven that neural networks have numerical robustness in low-precision data formats (Zeng et al. 2020; Cheng et al. 2020); this feature is widely exploited in deep neural network quantization and can keep the model's original prediction performance well. On the other hand, final accuracy can be improved by better representation precision at the initial stage of the experiment, which corresponds to common sense. We consider that the point (IRDB, NRDB) = (8, 9) lies at a balance point where both robustness and numerical precision function well. When the decimal bits are increased beyond this balancing point, the model's robustness is weakened by the more dedicated data representation, and the boost in numerical precision cannot cover the loss of robustness. This is why the model's prediction cannot be enhanced in the last couple of experiments.

The experimental results on the normalization parameters' quantization scheme are demonstrated in Fig. 9. We set (IRDB, NRDB) to (8, 9) and keep the other variables unchanged. It is clearly shown that the model's prediction accuracy is improved by better normalization parameter representation precision. However, the word length of these parameters cannot exceed the upper boundary of 32 bits, so it is suitable to allocate 12 decimal bits for the batch-normalization parameters. One thing needs to be noticed: hardware cannot handle division as easily as Matlab code. Thus, we turn the variance's division in batch normalization into a reciprocal multiplication and expand the decimal bit width to ensure accuracy.

We run the non-quantized version of our Matlab code on an Intel Core i7-8700K with and without the multi-thread accelerating library. We divide the whole program into several function segments and test their running times. Table 2a shows that, ignoring the MFCC segment, the second convolutional layer is the performance bottleneck on CPU platforms and has a vital effect on the model. This phenomenon corresponds to Conv-2 having the largest computing scale. The data in Table 2b also illustrate that although the parallelism function library is used and the running time is reduced sharply, Conv-2 is still the most time-consuming layer on the CPU.

The quantized version of the code runs 9.85% slower than the non-quantized version; details are shown in Table 2c. As discussed in Sect. 3.2, although we apply quantization functions and fixed-point computing in the code, they are still executed in floating-point mode on the CPU, because CPU platforms cannot process fixed-point data well. This procedure brings extra work converting data between floating-point and fixed-point formats, which is responsible for the worse performance of the quantized code. For a similar reason, different quantization formats make no noticeable impact on CPU running time. To truly reflect our accelerator's performance, we choose the non-quantized version's running time, not the quantized version's, as our performance baseline.

5.2 Accelerator's implementation and performance

To implement our layer-by-layer pipeline, it is necessary to expand Conv-1's degree of parallelism to handle nine 3 × 3
sliding windows on a 5 × 5 feature area. Not only is the Conv-1 layer's computing scale expanded, but the data-address generating strategy is also modified to fetch the input feature pixels in a specific order.

By adopting a pipeline across layers, the target neural network can be accelerated without putting between-layer results into DRAM, thus reducing memory access cost. In other words, we keep the data streaming in the layer pipeline from the input audio features to the final prediction results without stopping. The accelerator only communicates with DRAM for input fetching and the final write-back. Inside each layer's pipeline, we also divide all computing into function parts, such as the vector computing unit and the normalization unit, in a pipelined fashion, which helps raise the hardware running frequency.

We run this speech classification model on an Intel Core i7-8700K (3.7 GHz, 6 cores, 95 W) with a single thread, on an Intel Core i7-8700K with multiple threads, and on multi-node Intel Xeon 5220 machines (2.2 GHz, 18 cores, 125 W) with the Matlab distributed parallel library; the whole dataset we used contains 1512 audio files. We only test the running
time of the neural network's forwarding part, for we do not implement the MFCC and data preprocessing on the FPGA. Figure 10 shows that, compared with state-of-the-art CPU platforms, our single-PE version of the accelerator achieves an 18–300× throughput speed-up ratio. Table 2 shows that the Conv-2 layer is the most time-consuming function. However, our accelerator can eliminate the bottlenecks found on CPU platforms and achieve excellent accelerating performance by using a balanced layer-by-layer pipeline. Also, the data stream inside the pipeline reduces DDR bus communication and eliminates the DRAM limit seen on the CPU platform.

Fig. 8 Accuracy experiment result [bar chart; x-axis: (IRDB, NRDB) over (7, 7), (8, 7), (8, 8), (8, 9), (9, 9), (8, 10), (10, 10), (8, 11); y-axis: prediction accuracy, values in the range of roughly 83.06–83.62%]

Fig. 9 The relationship between batch-normalization parameters' decimal bits and model's prediction accuracy [bar chart; x-axis: decimal bits 7–13; y-axis: prediction accuracy, rising from 82.72% at 7 bits to 85.10% at 13 bits]

reliability, we conduct an extensive pressure test for up to 2 h. The implementation results are shown in Table 3.

We compute the energy efficiency of state-of-the-art CPU platforms and of ours. The results in Table 4 show that our accelerator greatly improves energy efficiency on this speech classification task by customizing the circuit design and replacing floating-point data with fixed-point data.

In Table 5, we compare our accelerator with BWCNN-based (binary weight convolutional neural network) ASIC architectures, fully binarized CNN FPGA architectures and sparse RNN-based FPGA accelerators.

Table 5a illustrates the comparison between BWCNN-based ASICs and ours. The two ASIC architectures own huge advantages in power consumption and power efficiency; however, our FPGA platform surpasses these ASIC designs in expense, flexibility and peak performance.

The detailed information of typical fully binary CNN-based FPGA accelerators is shown in Table 5b, together with ours. Our design performs better in peak performance and model accuracy loss, and has larger throughput than two of the targeted architectures. Since fully binarized accelerators binarize both the weight and the activation data and take fewer resources (which enables designers to set far more PEs than we do), it is understandable that some designs outperform ours in throughput. However, this fully binarized scheme bears more prediction precision loss; in contrast, ours keeps the model's classification performance better.

Current state-of-the-art RNN-based (recurrent neural network) solutions usually use the networks' sparse features to cut down useless computation, thus accelerating the original algorithm. For this reason, sparse accelerators can sometimes have excellent computing performance. Nevertheless, our BWN-based convolutional neural network accelerator still has the shortest computing latency, thanks to the CNN's highly efficient architecture, while RNNs usually have complex recurrent structures which increase time delay greatly.
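The design space search of Sect. 5.1 amounts to a small grid search over candidate bit allocations. The sketch below is hypothetical scaffolding, not the authors' Matlab scripts: `evaluate` stands in for a full quantized-inference run returning prediction accuracy, and the candidate list mirrors the (IRDB, NRDB) pairs on the x-axis of Fig. 8.

```python
CANDIDATES = [(7, 7), (8, 7), (8, 8), (8, 9), (9, 9), (8, 10), (10, 10), (8, 11)]

def search_best(evaluate, candidates=CANDIDATES):
    """Return the (IRDB, NRDB) pair with the highest measured accuracy."""
    best_pair, best_acc = None, float("-inf")
    for irdb, nrdb in candidates:
        acc = evaluate(irdb, nrdb)  # accuracy of the model quantized this way
        if acc > best_acc:
            best_pair, best_acc = (irdb, nrdb), acc
    return best_pair, best_acc
```

With the accuracies reported in Fig. 8, such a loop would land on (8, 9), the balance point discussed above.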
Table 2 Running time on the i7-8700K CPU platform

Segment       Pre-process  Conv-1  Conv-2  FC-1   FC-2    FC-3    Total
(a) Non-quantized version without parallelism
Time (s)      9.43         3.08    114.58  0.02   < 0.01  < 0.01  127.11
Portion (%)   7.40         2.43    90.14   < 1    < 1     < 1     100
(b) Non-quantized version with parallelism
Time (s)      9.8          1.8     38.03   0.03   < 0.01  < 0.01  49.68
Portion (%)   19.74        3.63    76.56   < 1    < 1     < 1     100
(c) Quantized version without parallelism
Time (s)      9.53         3.48    126.62  0.02   < 0.01  < 0.01  139.64
Portion (%)   6.82         2.49    90.67   < 1    < 1     < 1     100
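Two figures quoted in the text can be recovered directly from the totals in Table 2. This is a quick arithmetic check, not the authors' code:

```python
serial_total    = 127.11  # Table 2a: non-quantized, single thread
parallel_total  = 49.68   # Table 2b: non-quantized, with parallelism
quantized_total = 139.64  # Table 2c: quantized, single thread

speedup  = serial_total / parallel_total       # gain from the parallel library
overhead = quantized_total / serial_total - 1  # cost of emulated fixed-point on CPU
print(f"parallel speed-up ~{speedup:.2f}x, quantization overhead ~{overhead:.1%}")
```

The overhead comes out at roughly 9.9%, matching the reported 9.85% up to rounding of the per-segment times.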
6 Conclusion

Table 4 Energy efficiency on different platforms

                        8700K  5220, 1 node  5220, 2 nodes  5220, 4 nodes  1 PE BWN  2 PE BWN
Perf. per Watt (fps/W)  0.23   0.47          0.43           0.4            471.63    753.41
Board power (W)         95     125           250            500            7.794     9.758
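Since the perf-per-watt figures in Table 4 are throughput divided by board power, multiplying the two rows recovers each platform's throughput, and the accelerator-to-CPU ratio falls inside the 18–300× window claimed earlier. This is a quick check, under the assumption that both rows use the same board-level power measurement:

```python
cpu_fps  = 0.23 * 95         # i7-8700K: ~21.9 fps
bwn1_fps = 471.63 * 7.794    # 1 PE BWN accelerator: ~3676 fps
ratio = bwn1_fps / cpu_fps
print(round(ratio))          # ~168x, within the claimed 18-300x range
```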
Table 5 (a) Comparison with binary weight convolutional neural networks' architectures on ASIC; (b) comparison with fully binary convolutional neural networks' architectures on FPGA; (c) comparison with sparse recurrent neural networks' architectures on FPGA

(a) Accelerator   Bo et al. (2018)   Renzo et al. (2018)   1 PE BWN   3 PE BWN
Acknowledgements This work is supported by National Science and Technology Major Project 2018ZX01028101.

Compliance with ethical standards

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

Alemdar, H., Leroy, V., Prost-Boucle, A., Pétrot, F.: Ternary neural networks for resource-efficient AI applications. Int. Jt. Conf. Neural Netw. (IJCNN), 2547–2554 (2017). https://doi.org/10.1109/IJCNN.2017.7966166
Alessandro, A., Hesham, M., Enrico, C., et al.: NullHop: a flexible convolutional neural network accelerator based on sparse representations of feature maps. IEEE Trans. Neural Netw. Learn. Syst. 1–13 (2018). https://doi.org/10.1109/TNNLS.2018.2852335
Baicheng, L., Song, C., Yi, K., Feng, W., et al.: An energy-efficient systolic pipeline architecture for binary convolutional neural network. IEEE Int. Conf. ASIC (2019). https://doi.org/10.1109/ASICON47005.2019.8983637
Bo, L., Hai, Q., Yu, G., et al.: EERA-ASR: an energy-efficient reconfigurable architecture for automatic speech recognition with hybrid DNN and approximate computing. IEEE Access 6, 52227–52237 (2018)
Chen, T., Du, Z., Sun, N., et al.: A high-throughput neural network accelerator. IEEE Micro 35(3), 24–32 (2015)
Cheng, G., Yao, C., Ye, L., Tao, L., Cong, H., et al.: VecQ: minimal loss DNN model compression with vectorized weight quantization. IEEE Trans. Comput. (2020). https://doi.org/10.1109/TC.2020.2995593
Cheung, K., Schultz, S.R., Luk, W.: A large-scale spiking neural network accelerator for FPGA systems. In: Proceedings of the 22nd international conference on artificial neural networks and machine learning, volume part I. Springer, Berlin, Heidelberg (2012)
Conti, F., Schiavone, P.D., Benini, L.: XNOR neural engine: a hardware accelerator IP for 21.6-fJ/op binary neural network inference. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37(11), 2940–2951 (2018)
Douglas, C., Sabato, L., Martin, L., et al.: A neural attention model for speech command recognition. arXiv:1808.08929v1 [eess.AS] (2018)
Dundar, G., Rose, K.: The effects of quantization on multilayer neural networks. IEEE Trans. Neural Netw. 6(6), 1446–1451 (1995)
Geoffrey, H., Li, D., Dong, Y., et al.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 12, 82–98 (2012)
Giri, E.P., Fanany, M.I., Arymurthy, A.M., et al.: Ischemic stroke identification based on EEG and EOG using 1D convolutional neural network and batch normalization. ICACSIS. IEEE (2016)
Guo, P., Ma, H., Chen, R., et al.: A high-efficiency FPGA-based accelerator for binarized neural networks. J. Circuits Syst. Comput. (2019). https://doi.org/10.1142/S0218126619400048
Gwennap, L.: Microsoft Brainwave uses FPGAs. Microprocess. Rep. 31(11), 25–27 (2017)
Han et al.: 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings (2016)
Jacob, B., Kligys, S., Chen, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference (2017). https://doi.org/10.1109/CVPR.2018.00286
Kivinen, J., Warmuth, M.K.: Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132(1), 1–63 (1997)
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. NIPS. Curran Associates Inc. (2012)
Liang, S., Yin, S., Liu, L., et al.: FP-BNN: binarized neural network on FPGA. Neurocomputing 275, 1072–1086 (2018)
Liu, S., Pattabiraman, K., Moscibroda, T., et al.: Flikker: saving DRAM refresh-power through critical data partitioning. Comput. Arch. News 39(1), 213–224 (2011)
Liu, M., Wu, W., Gu, Z., et al.: Deep learning based on batch normalization for P300 signal detection. Neurocomputing (2017). https://doi.org/10.1016/j.neucom.2017.08.039
Matthieu, C., Itay, H., Daniel, S., et al.: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv:1602.02830v3 [cs.LG] (2016)
Muckenhirn, H., Magimai-Doss, M., Marcel, S.: Towards directly modeling raw speech signal for speaker verification using CNNs. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Calgary, AB, pp. 4884–4888 (2018)
Pakyurek, M., Atmis, M., Kulac, S., et al.: Extraction of novel features based on histograms of MFCCs used in emotion classification from generated original speech dataset. Electron. Electr. Eng. 26, 46–51 (2020)
Palaz, D., Magimai-Doss, M., Collobert, R.: Convolutional neural networks-based continuous speech recognition using raw speech signal. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE (2015)
Perkins, S., Lacker, K., Theiler, J.: Grafting: fast, incremental feature selection by gradient descent in function space. J. Mach. Learn. Res. 3(3), 1333–1356 (2003)
Renzo, A., Lukas, C., Davide, R., Luca, B.: YodaNN: an architecture for ultralow power binary-weight CNN acceleration. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37(1) (2018). https://doi.org/10.1109/TCAD.2017.2682138
Samuel, K., Boris, G.: MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition. arXiv:2004.08531 [eess.AS] (2020)
Samuel, K., Stanislav, B., Boris, G., et al.: QuartzNet: deep automatic speech recognition with 1D time-channel separable convolutions. In: 2020 IEEE international conference on acoustics, speech and signal processing (2020)
Santurkar, S., Tsipras, D., Ilyas, A., et al.: How does batch normalization help optimization? In: NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2488–2498 (2018)
Shijie, C., Chen, Z., Zhuliang, Y., et al.: Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In: FPGA'19: Proceedings of the 2019 ACM/SIGDA international symposium on field-programmable gate arrays, pp. 63–72 (2019)
Shuo, W., Zhe, L., Caiwen, D., et al.: C-LSTM: enabling efficient LSTM using structured compression techniques on FPGAs. In: FPGA'18: Proceedings of the 2018 ACM/SIGDA international symposium on field-programmable gate arrays, pp. 21–30 (2018)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings (2014)
Song, H., Junlong, K., Huizi, M., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: FPGA'17: Proceedings of the 2017 ACM/SIGDA international symposium on field-programmable gate arrays, pp. 75–84 (2017)
Tom, S., Vaibhava, G.: Advances in very deep convolutional neural networks for LVCSR. arXiv:1604.01792v2 [cs.CL] (2016)
Turan, F., Roy, S.S., Verbauwhede, I.: HEAWS: an accelerator for homomorphic encryption on the Amazon AWS FPGA. IEEE Trans. Comput. (2020). https://doi.org/10.1109/TC.2020.2988765
Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M., Vissers, K.: FINN: a framework for fast, scalable binarized neural network inference. In: ACM/SIGDA international symposium on field-programmable gate arrays, pp. 65–74 (2017)
Wan, H., Guo, S., Yin, K., et al.: CTS-LSTM: LSTM-based neural networks for correlated time series prediction. Knowl. Based Syst. 191 (2019). https://doi.org/10.1016/j.knosys.2019.105239
Wei, Z., Jingyi, Q., Renbiao, W.: Straight convolutional neural networks algorithm based on batch normalization for image classification. J. Comput. Aided Des. Comput. Graph. 29(9), 1650–1657 (2017)
Xu, Y., Wang, Y., Zhou, A., et al.: Thirty-Second AAAI Conference on Artificial Intelligence. 32 (2018)
Zeng, X., Zhi, T., Zhou, X., Du, Z., Guo, Q., Liu, S., et al.: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach. IEEE Trans. Comput. 69(7), 968–985 (2020)

Dong Wen received the B.S. degree in Computer Science from Northeastern University, Shenyang, China. He is currently working towards the M.S. degree in Computer Science with the School of Computer of National University of Defense Technology, Changsha, China. His interested fields include Computer Architecture and Reconfigurable Computing.

Jingfei Jiang received the B.S. degree, the M.S. degree and the Ph.D. degree in Computer Science from National University of Defense Technology, Changsha, China. She is now a research fellow of the National Laboratory of Parallel and Distributed Computing of National University of Defense Technology. Her interested fields include Computer Architecture, Reconfigurable Computing and Artificial Intelligence.

Yong Dou received the B.S. degree, the M.S. degree and the Ph.D. degree in Computer Science from National University of Defense Technology, Changsha, China. He is now a research fellow of the National Laboratory of Parallel and Distributed Computing of National University of Defense Technology. His interested fields include Computer Architecture and Artificial Intelligence.

Jinwei Xu received the B.S. degree in Communication Engineering from Tsinghua University, and received the M.S. degree and the Ph.D. degree in Computer Science from National University of Defense Technology, Changsha, China. He is now a research assistant of the National Laboratory of Parallel and Distributed Computing of National University of Defense Technology. His interested field includes Reconfigurable Computing.

Tao Xiao received the B.S. degree in Computer Science from Hunan Normal University, Changsha, China. He is currently working towards the M.S. degree in Computer Science with the School of Computer of National University of Defense Technology, Changsha, China. His interested fields include Computer Architecture and Reconfigurable Computing.