
FPGA AND ASIC IMPLEMENTATION OF MEMORY EFFICIENT CONVOLUTIONAL NEURAL NETWORK FOR

EXTREME LEARNING MACHINE


Abstract:
In recent years, advanced machine learning algorithms have tended to utilize more memory when used in pattern recognition applications. As memory utilization increases, the device on which the algorithm is implemented requires more memory, which increases the cost of the device. This paper presents a hardware implementation of a memory-efficient convolutional neural network for encoding in the Extreme Learning Machine (ELM) algorithm, based on the Receptive Field (RF) approach. The proposed neural network utilizes less memory than the previous implementation for generating the stimulus to the hidden layer neurons. The neural network was simulated using ModelSim-Altera 10.1d, synthesized using Quartus II 13.0 sp1, and implemented on an Altera Cyclone V FPGA. It was also implemented as an ASIC using Synopsys Design Compiler and Synopsys IC Compiler with a 180 nm SCL library.
Keywords: Neuromorphic Computing, Extreme Learning Machine (ELM), Receptive Field (RF), Pattern Recognition, Convolutional Neural Network (CNN), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC)
I. INTRODUCTION:

There are several types of neural networks, but the feed-forward neural network is one of the most prevalent. A feed-forward neural network contains one input layer, one or more hidden layers and one output layer. Input stimuli from external sources enter at the input layer and the output signal leaves from the output layer. Feed-forward neural networks can be trained with the gradient-descent-based back-propagation algorithm [1]. Additive hidden nodes are the most frequently used in such networks. For an additive hidden node, the output function of the $i$-th node in the $l$-th hidden layer is [2]

$y_i = g\left(a_i^{(l)} \cdot x^{(l)} + b_i^{(l)}\right), \quad b_i^{(l)} \in \mathbb{R}$   (1)

with sigmoid activation function

$g(x): \mathbb{R} \to \mathbb{R}, \quad g(x) = \frac{1}{1 + \exp(-x)}$   (2)

for nodes present in the hidden layer.

Here $a_i^{(l)}$ is the weight vector linking the $(l-1)$-th layer to the $i$-th node of the $l$-th layer, and $b_i^{(l)}$ is the bias of the $i$-th node of the $l$-th layer [2]. The architecture of a feed-forward network is shown in Fig. 1. The expression $a_i^{(l)} \cdot x^{(l)}$ denotes the inner product of $a_i^{(l)}$ and $x^{(l)}$ [2]. However, this training algorithm requires considerable computation time.
Fig. 1. Architecture of a Feed-Forward Neural Network
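For illustration, the following Python sketch (with hypothetical weight, input and bias values) evaluates the additive node output of Equation (1) with the sigmoid activation of Equation (2).

import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)), as in Equation (2)
    return 1.0 / (1.0 + np.exp(-x))

def additive_node_output(a, x, b):
    # y_i = g(a_i . x + b_i), Equation (1): inner product of weights and input plus bias
    return sigmoid(np.dot(a, x) + b)

# Example with hypothetical weight, input and bias values
a = np.array([0.2, -0.5, 0.1])
x = np.array([1.0, 0.3, -0.7])
print(additive_node_output(a, x, b=0.05))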

Extreme Learning Machines are time efficient and less complex than conventional gradient-based algorithms. ELM also employs methods such as weight decay and early stopping, which are used to prevent issues such as local minima and an improper learning rate in single-hidden-layer feed-forward neural networks [3]. Unlike gradient-based learning algorithms, ELM can be trained with non-differentiable activation functions [3].
Section II presents a literature survey of the Extreme Learning Machine. The proposed neural network is a memory-efficient convolutional neural network based on the receptive field approach for encoding. This neural network is implemented on an FPGA, and its logic utilization and block memory utilization are compared with those of a previous implementation. The neural network is also implemented as an Application Specific Integrated Circuit (ASIC), and the area, power and timing results are tabulated.

First, a short summary of the RF approach is presented. Then, the difficulties met when implementing the technique at the hardware level are discussed. This is followed by the hardware-optimised RF approach, the results and a discussion.
II. LITERATURE SURVEY:
A. Algorithm

The ELM is a supervised learning machine based on a single-hidden-layer feed-forward neural network (SLFN) architecture [3]. In an ELM, the weights connecting to the output layer can be systematically resolved through a generalized inverse operation when the weights and biases of the hidden neurons are arbitrarily allocated [4]. This evades the necessity to tune these hidden neuron parameters [4].
Assume that

$(x_i, t_j) \in \mathbb{R}^n \times \mathbb{R}^m; \quad f_N(x_j) = t_j; \quad i = 1, \ldots, n \text{ and } j = 1, \ldots, m$   (3)

is a set of N patterns, where $x_i$ is an $n \times 1$ input vector and $t_i$ is an $m \times 1$ output vector [4]. For an SLFN with N hidden neurons and activation function $g(x)$, there exist $B_i$, $a_i$ and $b_i$ such that
$\sum_{i=1}^{N} B_i \, G(a_i, b_i, x_j) = t_j; \quad j = 1, \ldots, m$   (4)

where $a_i$ and $b_i$ are the learning parameters of the hidden nodes ($a_i$ is the weight vector connecting the input nodes to the $i$-th hidden node and $b_i$ is the bias of the $i$-th hidden node), $B_i$ is the $i$-th output weight, and $G(a_i, b_i, x_j)$ is the output of the $i$-th hidden node with respect to the input vector $x_j$ [4]. If the hidden neuron is additive, it follows that [4]

$G(a_i, b_i, x_j) = g(a_i \cdot x_j + b_i)$   (5)
Fig. 2. Architecture of the ELM. Adapted from [7]. The neurons in the input layer are linked to a large number of non-linear neurons in the hidden layer through random weights and controllable biases, $b(1)$ to $b(M)$ [7]. The neurons in the output layer have linear characteristics [7]. The links from the hidden layer neurons to the output layer neurons carry trainable weights, and each output layer neuron produces a linear sum of its inputs [7].

Then, Equation (4) can be written in matrix form as

$H B = T$   (6)
where

$H = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_N, b_N, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_N, b_N, x_N) \end{bmatrix}$   (7)

is known as the hidden layer output matrix (the $i$-th row of $H$ being the output of the hidden layer with respect to the input vector $x_i$, and the $j$-th column being the outputs of the $j$-th hidden node with respect to the input vectors $x_1$ to $x_N$) [4].

Thus, the matrix of output weights $B$ can be estimated as

$B = H^{+} T$   (8)

where $H^{+}$ is the Moore-Penrose generalised inverse (pseudo-inverse) of the hidden layer output matrix $H$ [4].
Input: A training set $(x_i, t_j) \in \mathbb{R}^n \times \mathbb{R}^m$, $f_N(x_j) = t_j$, $i = 1, \ldots, n$ and $j = 1, \ldots, m$, and $N$ hidden neurons with sigmoid activation function $g(x)$.

Step 1: Randomly assign the input weights $a_i$ and biases $b_i$, $i = 1, \ldots, N$.

Step 2: Calculate the hidden layer output matrix $H$ using $G(a_i, b_i, x_j) = g(a_i \cdot x_j + b_i)$ and Equation (7).

Step 3: Calculate the output weight matrix $B = H^{+} T$, where $H^{+}$ is the Moore-Penrose generalised inverse (pseudo-inverse) of the hidden layer output matrix $H$ and $T = [t_1, \ldots, t_m]^{T}$ is the target matrix from Equation (4).

Output: Output weight matrix $B$.

Fig. 3. Training Procedure of the Extreme Learning Machine
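As a reference point for this algorithm, the following NumPy sketch implements the three training steps with hypothetical array shapes; it is a software illustration of the procedure in Fig. 3, not the hardware encoder described later.

import numpy as np

def sigmoid(x):
    # Sigmoid activation g(x) = 1 / (1 + exp(-x)), as in Equation (2)
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(X, T, n_hidden, seed=0):
    """Train an ELM: random hidden-layer parameters, output weights via pseudo-inverse.

    X: (num_samples, n) input patterns, T: (num_samples, m) targets.
    Returns (A, b, B), where A and b are the random hidden-layer parameters
    and B is the solved output weight matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n_hidden, n))   # random input weights a_i (Step 1)
    b = rng.uniform(-1.0, 1.0, size=(n_hidden,))     # random biases b_i (Step 1)
    H = sigmoid(X @ A.T + b)                         # hidden layer output matrix (Step 2)
    B = np.linalg.pinv(H) @ T                        # B = H^+ T, Moore-Penrose inverse (Step 3)
    return A, b, B

def elm_predict(X, A, b, B):
    # Linear output layer: weighted sum of the hidden-neuron outputs
    return sigmoid(X @ A.T + b) @ B

# Tiny demonstration with random data (hypothetical shapes)
X = np.random.rand(100, 784)                          # 100 flattened 28x28 patterns
T = np.eye(10)[np.random.randint(0, 10, 100)]         # one-hot targets
A, b, B = elm_train(X, T, n_hidden=256)
print(elm_predict(X[:5], A, b, B).shape)              # (5, 10)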

In spite of their efficiency, ELM neural networks are hard to implement at the hardware level because each input neuron is connected to every neuron in the hidden layer. This requires a considerable amount of hardware resources, which grows as the number of hidden layer neurons increases. In [6] and [7], the authors assume a fixed number of input layer and hidden layer neurons, which limits them from using complex datasets. In order to remove these issues, a neural network based on the receptive field (RF) approach [8][9] is used.

The idea of using receptive fields [9] (see Fig. 5) for pattern recognition originates from neuroscience, where neurons frequently react to input stimuli generated from a restricted three-dimensional range of data, as shown in Fig. 4. Here, the primary afferent is a pseudo-unipolar neuron whose central end synapses on a second order neuron and whose peripheral end is connected to receptors. The tissue from which the primary afferent receives sensory information is the receptive field of that primary afferent. Multiple primary afferents synapse on the second order neuron (dorsal horn neuron). Hence, the combined receptive field of multiple primary afferents is the receptive field of the second order neuron.
Fig. 4. Receptive Field of a Second Order Neuron [20]

Incorporating this concept into the hardware implementation of pattern recognition improves its performance [9]. Receptive-field-based ELM gives accuracy similar to that of traditional ELM algorithms [9], with the additional benefit of utilizing fewer hardware resources. Also, the positions of the receptive fields [9] can differ based on the size of the input data, with corresponding changes required in the connections between the hidden layer neurons and their receptive fields.
Fig. 5. Illustration of the receptive field methodology [9]. Each neuron in the hidden layer obtains its inputs from an arbitrary receptive field [9], which is either square or rectangular in shape.

However, implementing the RF approach at the hardware level is complex, since the location and dimensions of a receptive field are either chosen arbitrarily or derived from patterns, both of which access the input data arbitrarily.

For the hardware implementation of the RF approach (as shown in [5]), the input image in binary format and a shift register array are used. Regions are selected from the input image by shifting an index variable across the memory array. These generated regions are the receptive fields, which are either square or rectangular in shape.

Fig. 6 shows the functioning of the RF approach on an image from the MNIST dataset [10] of size 28 x 28 pixels. Each pixel is represented by an 8-bit binary value. First, the image is restructured from 28 x 28 pixels to 784 x 1 pixels. Then, the RF is shifted by 16 pixels across the restructured image; these 16 pixels correspond to the step size. Here, the size of the receptive field is 128 pixels.
Fig. 6. Illustration of the RF Approach. Adapted from [5].

Fig. 6 depicts the restructured image on the left and the region covered by the receptive field over the image on the right, for different regions covered by the index variable.

Fig. 6a shows the receptive field covering index 1 to index 128 of the restructured image. The region covered by the receptive field (as shown in Fig. 6a), multiplied by the weights present in the input layer, generates the stimulus for the first hidden layer neuron. These weights are arbitrarily generated for each neuron in the hidden layer. In Fig. 6b, the starting position of the index over the restructured image is shifted by one step and the corresponding receptive field is shown. The region covered by this receptive field (as shown in Fig. 6b), multiplied by the weights present in the input layer, generates the stimulus for the second hidden layer neuron.

This process is repeated several times to generate the stimulus for the other neurons in the hidden layer by multiplying with the corresponding random weights. Fig. 6c shows the region covered by the receptive field after shifting the starting index by 16 steps over the restructured image. After the 49th step, the final step over the restructured image, the index returns to its initial position.
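The receptive-field selection described above can be sketched in a few lines of Python; the sketch assumes the window simply wraps around the 784-pixel vector when it passes the end, and uses a random image as a stand-in for an MNIST digit.

import numpy as np

IMG_PIXELS = 28 * 28   # 784 pixels after restructuring to 784x1
RF_SIZE = 128          # receptive-field size in pixels
STEP = 16              # shift of the starting index per hidden neuron

def receptive_field(flat_image, neuron_index):
    """Return the 128-pixel receptive field feeding one hidden neuron.

    flat_image: 1-D array of 784 8-bit grey-scale pixels.
    The start index advances by 16 pixels per neuron and wraps around
    after the 49th position (49 * 16 = 784), as described above.
    """
    start = (neuron_index * STEP) % IMG_PIXELS
    idx = (start + np.arange(RF_SIZE)) % IMG_PIXELS   # modular indexing = wrap-around
    return flat_image[idx]

# Example: receptive fields for the first three hidden neurons of a random image
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
flat = image.reshape(-1)                              # restructure 28x28 -> 784x1
rf0, rf1, rf2 = (receptive_field(flat, k) for k in range(3))
print(rf0.shape, rf1.shape, rf2.shape)                # (128,) each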

III. TECHNICAL GAP

In a previous work, the authors presented a highly parallel system for pattern recognition of handwritten digits, which comprises an input layer (the encoder), a hidden layer with 8192 neurons and an output layer, and implemented it on an FPGA [11]. In this system, the hardware implementation of the encoder assumes that every input layer neuron is connected to every hidden layer neuron. Under this assumption, more Adaptive Logic Modules are consumed when the system is implemented on an FPGA.

In [5], the authors proposed an SRAM-based convolutional neural network for encoding using the receptive field approach and implemented it on an FPGA. However, it consumes a large amount of block memory on the FPGA. Hence, reducing this memory usage is the motivation for the proposed convolutional neural network.

IV. PROPOSED METHODOLOGY


Fig. 7. Topology of the Convolutional Neural Network for Encoding

In Fig. 7, a single-port block RAM of size 9 x 128 bits (1152 bits) is used; it can therefore store 144 8-bit pixels. A shift register array [5] consisting of eight shift registers, connected sequentially, is used; each shift register can store 128 bits. The output of the single-port block RAM is connected to the input of the shift register array. The outputs of these shift registers are connected to the input of a weighted pixel array of size 128 x 9 bits. The input of the parallel adder is connected to the outputs of the 128 9-bit registers of the weighted pixel array.

Assume the single-port block RAM is empty initially. The input pattern is then written into the single-port block RAM on its arrival. Next, the shift register array is initialised by reading 128 pixels from the single-port block RAM over eight clock cycles, with the shift registers shifted eight times.

After the eight initial shifts, on each clock cycle the global counter shifts the register array by one. For example, if the hidden layer has 8192 neurons, this RF generation process is executed 8192 times. As it is a pipelined design, the output fed to the parallel adder is still generated every clock cycle (with a latency of one clock cycle).
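The following Python sketch is a simplified behavioural model of this datapath. It assumes that one 128-bit word holds sixteen 8-bit pixels and that each shift moves the array by one such word (which matches the 16-pixel RF step described earlier); it is only an illustration of the shifting behaviour, not the RTL.

from collections import deque

PIXELS_PER_WORD = 16   # one 128-bit word = sixteen 8-bit pixels (assumption)
NUM_REGISTERS = 8      # eight 128-bit shift registers -> one 128-pixel receptive field

class ShiftRegisterArray:
    """Behavioural model: one block-RAM word is shifted in per clock cycle."""

    def __init__(self):
        # Oldest word drops off automatically once eight words are held
        self.regs = deque(maxlen=NUM_REGISTERS)

    def shift_in(self, word):
        assert len(word) == PIXELS_PER_WORD
        self.regs.append(list(word))

    def initialise(self, word_source):
        # Eight initialisation cycles fill the array with 128 pixels
        for _ in range(NUM_REGISTERS):
            self.shift_in(next(word_source))

    def receptive_field(self):
        # Concatenated contents of the eight registers = current 128-pixel RF
        return [p for word in self.regs for p in word]

def word_stream(flat_image):
    # Cyclic stream of 16-pixel words taken from the flattened 784-pixel image
    i = 0
    while True:
        yield flat_image[i:i + PIXELS_PER_WORD]
        i = (i + PIXELS_PER_WORD) % len(flat_image)

flat = list(range(784))        # stand-in for the 8-bit grey-scale pixels
words = word_stream(flat)
sra = ShiftRegisterArray()
sra.initialise(words)          # the eight initial shifts
for clk in range(3):           # afterwards: one shift and one new RF per clock
    rf = sra.receptive_field()
    print(clk, rf[0], rf[-1])  # first and last pixel index of the current RF
    sra.shift_in(next(words))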

C. Random weight generator

The random weight generator generates a uniformly distributed binary random weight (-1 or +1) for every pixel of the input digit. These weighted pixels are summed to generate the stimulus for each neuron in the hidden layer. The use of binary weights saves significant hardware resources in FPGAs and ASICs; otherwise, 128 multipliers would be required to compute the multiplication between all pixels and their corresponding random weights.
For digital implementations, the most effective way to generate random numbers is to use linear feedback shift registers (LFSRs). Therefore, we use LFSRs to generate the binary random weights, and the output of each LFSR is interpreted following the 2's complement rule: 0 for +1 and 1 for -1. The input pixel values (8-bit grey-scale) are simply concatenated with the binary weights, resulting in weighted pixel values in 2's complement notation (9-bit values).

Although the LFSR goes through all possible values over its cycle, its output is not well balanced at each shift. In other words, the numbers of 0's and 1's are frequently not similar, which disturbs the performance considerably, because the weighted pixels will be nearly all negative or all positive and the generated stimulus for the hidden neurons will become very large in amplitude. For the generation of balanced binary random weights, instead of a naïve implementation using one 128-bit LFSR, we use twelve 11-bit LFSRs with dissimilar seeds, each of which generates an 11-bit random number. For most arbitrary seeds, this leads to a more evenly distributed number of 0's and 1's.

All these LFSRs reload their own initial seed on the arrival of an input pattern and then keep generating random numbers until a new input pattern arrives. In this way, the encoder is guaranteed to generate exactly the same set of random weights for every arriving pattern. This "on the fly" generation structure also decreases the memory usage considerably, as there is no need to store the random weights; only the LFSR seeds need to be stored.
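The sketch below illustrates this scheme in Python. The 11-bit LFSR tap positions and the seed values are assumptions (the paper does not give the feedback polynomial or the seeds), and the ±1 weighting is applied behaviourally rather than by bit concatenation.

def lfsr11(seed, taps=(10, 8)):
    """11-bit Fibonacci LFSR; the tap positions (bits 11 and 9) are an assumed
    maximal-length pair, since the paper does not give the polynomial.

    Yields the full 11-bit state once per clock cycle.
    """
    state = seed & 0x7FF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        feedback = ((state >> taps[0]) ^ (state >> taps[1])) & 1
        state = ((state << 1) | feedback) & 0x7FF
        yield state

def weight_bits(states):
    # Concatenate twelve 11-bit words (132 bits) and keep the first 128 bits:
    # one binary weight per pixel of the receptive field
    bits = []
    for s in states:
        bits.extend((s >> i) & 1 for i in range(11))
    return bits[:128]

def weighted_pixel(pixel, bit):
    # Convention from the text: LFSR bit 0 -> weight +1, bit 1 -> weight -1.
    # In hardware the bit is prepended to the 8-bit pixel to form a 9-bit
    # two's-complement value; behaviourally we just apply the +/-1 weight here.
    return (1 if bit == 0 else -1) * int(pixel)

# Twelve LFSRs with distinct (hypothetical) seeds, reloaded for every input pattern
seeds = [0x1A5 + 37 * k for k in range(12)]
lfsrs = [lfsr11(s) for s in seeds]
bits = weight_bits(next(g) for g in lfsrs)      # 128 weight bits for one receptive field
print(sum(bits), "ones out of", len(bits))      # roughly balanced for most seeds
print(weighted_pixel(200, bits[0]))             # one weighted pixel in the 9-bit range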

D. Parallel adder array

The parallel adder array adds 128 weighted pixels to generate the input to a hidden layer neuron. A naïve implementation would require a 128-input 9-bit parallel adder and introduce a large delay (~20 ns). As an alternative, we use a 3-stage pipeline consisting of sixteen 8-input 9-bit adders, four 4-input 12-bit adders and one 4-input 14-bit adder, as shown in Fig. 8. Due to the pipelined design, the input to the hidden layer is still generated every clock cycle, but with a latency of three clock cycles.
Fig. 8. Structure of the three-stage 128-input 9-bit pipelined parallel adder
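A behavioural sketch of this three-stage reduction is shown below with hypothetical input values. The grouping into sixteen 8-input adders, four 4-input adders and one final 4-input adder follows the description above, and only the 9-bit input range is checked with an assertion.

import random

def pipelined_adder_tree(weighted_pixels):
    """Three-stage reduction of 128 signed 9-bit weighted pixels.

    Stage 1: sixteen 8-input adders, stage 2: four 4-input adders,
    stage 3: one 4-input adder producing the hidden-neuron stimulus.
    """
    assert len(weighted_pixels) == 128
    assert all(-256 <= p <= 255 for p in weighted_pixels)                # 9-bit signed range

    stage1 = [sum(weighted_pixels[i:i + 8]) for i in range(0, 128, 8)]   # 16 partial sums
    stage2 = [sum(stage1[i:i + 4]) for i in range(0, 16, 4)]             # 4 partial sums
    return sum(stage2)                                                   # final stimulus

# Example with hypothetical weighted pixels in the signed 9-bit range
pixels = [random.randint(-255, 255) for _ in range(128)]
print(pipelined_adder_tree(pixels))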
V. RESULTS

Fig. 9. Generation of input patterns from the MNIST dataset

The MNIST dataset [10] was converted into 28 x 28 pixel images using Python, as shown in Fig. 9. The flow chart of the conversion of the MNIST dataset to image files is shown in Fig. 10. First, the dataset is downloaded. Since the image and label files are downloaded in archive format, they need to be extracted; they are extracted as u-byte files in .idx format, which store vectors and multi-dimensional matrices. The u-byte files are then converted into NumPy n-dimensional arrays, and the data in each NumPy array are saved as images in class-specific directories.
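A hedged Python sketch of this conversion pipeline is shown below; it assumes the extracted u-byte files follow the standard IDX layout (a big-endian magic number and dimensions followed by raw bytes), and the file names are hypothetical.

import os
import struct
import numpy as np
from PIL import Image

def read_idx(path):
    """Read an IDX (u-byte) file into a NumPy n-dimensional array."""
    with open(path, "rb") as f:
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))   # magic number
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))    # big-endian dimensions
        data = np.frombuffer(f.read(), dtype=np.uint8)              # assumes unsigned-byte data
    return data.reshape(dims)

# Hypothetical paths to the extracted MNIST u-byte files
images = read_idx("train-images.idx3-ubyte")    # shape (60000, 28, 28)
labels = read_idx("train-labels.idx1-ubyte")    # shape (60000,)

# Save each image into a directory named after its class label (first 100 as a demo)
for i, (img, lab) in enumerate(zip(images[:100], labels[:100])):
    os.makedirs(str(lab), exist_ok=True)
    Image.fromarray(img).save(os.path.join(str(lab), f"{i}.png"))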

Fig. 10. Flow chart of the conversion of the MNIST dataset to image files


Then, an image file is chosen and converted into 8-bit grey-scale values using MATLAB. Fig. 11 shows the flow chart of the conversion of an image file to 8-bit grey-scale values.

Fig. 11. Flow chart of the conversion of an image file to 8-bit grey-scale values

These 8-bit grey-scale values are provided as input to the SRAM-based convolutional neural network for encoding.
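For reference, the same conversion can be sketched in Python (the paper uses MATLAB for this step); the file names are hypothetical.

import numpy as np
from PIL import Image

# Load a saved digit image and convert it to 8-bit grey-scale values (0-255)
img = Image.open("3/42.png").convert("L")         # hypothetical path from the step above
grey = np.array(img, dtype=np.uint8)              # 28x28 array of 8-bit pixels
flat = grey.reshape(-1)                           # 784x1 input vector for the encoder
np.savetxt("input_pixels.txt", flat, fmt="%d")    # write pixels out in one possible hand-off format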

A. FPGA Implementation:

The SRAM-based convolutional neural network using the RF approach was simulated using ModelSim-Altera 10.1d, as shown in Fig. 12.
Fig. 12. Simulation result of the proposed encoder using ModelSim-Altera 10.1d

Then, the encoder block was synthesized using Quartus II 13.0 sp1 and implemented on an Altera Cyclone V FPGA (device 5CGXFC5C6F27C7). This FPGA has Adaptive Logic Modules (ALMs) instead of the conventional Logic Elements (LEs) present in lower-end FPGAs [19]. An ALM can add up to three bits at a time, while a logic element can add only two bits simultaneously [19] (as shown in Fig. 13). Here, the parallel adder is implemented using ALMs.
Fig. 13. 3-bit addition with an Adaptive Logic Module using the shared arithmetic mode [19]

The logic utilization and block memory utilization of the proposed encoder are compared with those of the encoder implemented in [5] (shown in Fig. 14 and Fig. 15).
Fig. 14. Comparison of the logic utilization (number of ALMs) of the encoder in [5] and the proposed encoder. ALM: Adaptive Logic Module.

Fig. 15. Comparison of the block memory utilization (number of bits) of the encoder in [5] and the proposed encoder.

The logic utilization of the encoder is about 5% of the available Adaptive Logic Modules on the Cyclone V FPGA, which is slightly higher than that of the encoder implemented in [5]. The total number of block memory bits used is less than 1% of the available block memory on the FPGA, which is far less than that of the encoder implemented in [5]. The device utilization of the encoder on the Cyclone V FPGA is tabulated in Table I.

Table I: Device utilization of the proposed encoder

Logic utilization (in ALMs): 1604 / 29,080
Total block memory bits:     1152 / 4,567,040
Total DSP blocks:            0 / 150



B. ASIC Implementation

The SRAM-based implementation of the convolutional neural network was simulated and synthesized with Synopsys Design Compiler using the 180 nm SCL library, with the clock frequency set to 50 MHz. The area, power and delay values of the encoder are tabulated as follows:

Fig. 16. Simulation result of the proposed encoder using Synopsys Design Compiler

Table II: Area, power and delay values of the proposed encoder

Area consumed (µm²):  335927.85
Power consumed (mW):  7.64
Delay (ns):           3680

The area, power and delay values tabulated in Table II correspond to generating one stimulus for a hidden layer neuron. The neural network was then implemented as an ASIC using Synopsys IC Compiler.
Fig. 17. ASIC implementation of the proposed encoder using Synopsys IC Compiler
VI. CONCLUSIONS

We have presented an SRAM-based convolutional neural network using the RF approach for encoding [13]. Our method uses limited hardware resources because it uses SRAM instead of logic gates [13]. We envision a large-scale, fully reconfigurable neuromorphic structure that is capable of executing more complex pattern recognition tasks [13].

VII. REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.

[2] G.-B. Huang, D. H. Wang, and Y. Lan, “Extreme learning machines: a survey,” International
Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107–122, May 2011.

[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,”
Neurocomputing, vol. 70, no. 1–3, pp. 489–501, Dec. 2006.
[4] J. V. Frances-Villora, A. Rosado-Munoz, M. Bataller-Mompean, J. Barrios-Aviles, and J. F. Guerrero-
Martinez, “Moving Learning Machine towards Fast Real-Time Applications: A High-Speed FPGA-
Based Implementation of the OS-ELM Training Algorithm,” Electronics, vol. 7, no. 11, p. 308, Nov.
2018.

[5] R. Wang, G. Cohen, S. Thakur, J. Tapson, and A. van Schaik, "An SRAM-based implementation of a convolutional neural network."

[6] C. S. Thakur, R. Wang, T. J. Hamilton, J. Tapson, and A. van Schaik, “A Low Power Trainable
Neuromorphic Integrated Circuit That Is Tolerant to Device Mismatch,” IEEE Transactions on Circuits
and Systems I: Regular Papers, vol. 63, no. 2, pp. 211–221, Feb. 2016.

[7] C. S. Thakur, T. J. Hamilton, R. Wang, J. Tapson, and A. van Schaik, "A neuromorphic hardware framework based on population coding," 2015 International Joint Conference on Neural Networks (IJCNN), 2015. [Online]. Available: https://www.academia.edu/17633387/A_neuromorphic_hardware_framework_based_on_population_coding.

[8] G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong, “Local Receptive Fields Based Extreme Learning
Machine,” IEEE Computational Intelligence Magazine, vol. 10, no. 2, pp. 18–29, May 2015.

[9] M. D. McDonnell, M. D. Tissera, T. Vladusich, A. van Schaik, and J. Tapson, “Fast, Simple and
Accurate Handwritten Digit Classification by Training Shallow Neural Network Classifiers with the
‘Extreme Learning Machine’ Algorithm,” PLOS ONE, vol. 10, no. 8, p. e0134254, Aug. 2015.

[10] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[11] R. Wang, C. S. Thakur, G. Cohen, T. J. Hamilton, J. Tapson, and A. van Schaik, “Neuromorphic
Hardware Architecture Using the Neural Engineering Framework for Pattern Recognition,” IEEE
Transactions on Biomedical Circuits and Systems, vol. 11, no. 3, pp. 574–584, Jun. 2017.

[12] R. M. Wang, T. J. Hamilton, J. C. Tapson, and A. van Schaik, "A mixed-signal implementation of a polychronous spiking neural network with delay adaptation," Frontiers in Neuroscience, vol. 8, Mar. 2014.

[13] R. M. Wang, T. J. Hamilton, J. C. Tapson, and A. van Schaik, "A neuromorphic implementation of multiple spike-timing synaptic plasticity rules for large-scale neural networks," Frontiers in Neuroscience, vol. 9, May 2015.

[14] R. Wang, G. Cohen, K. M. Stiefel, T. J. Hamilton, J. Tapson, and A. van Schaik, “An FPGA
Implementation of a Polychronous Spiking Neural Network with Delay Adaptation,” Frontiers in
Neuroscience, vol. 7, 2013.

[15] R. Wang, T. J. Hamilton, J. Tapson, and A. van Schaik, “A compact reconfigurable mixed-signal
implementation of synaptic plasticity in spiking neurons,” IEEE International Symposium on Circuits
and Systems (ISCAS), pp. 862–865, 2014.
[16] R. Wang, T. J. Hamilton, J. Tapson, and A. van Schaik, “An FPGA design framework for large-scale
spiking neural networks,” IEEE International Symposium on Circuits and Systems (ISCAS), pp. 457–
460, 2014.

[17] R. Wang, T. J. Hamilton, J. Tapson, and A. van Schaik, "A compact neural core for digital implementation of the Neural Engineering Framework," BioCAS, 2014.

[18] “Quartus II Handbook Version 13.1”, Altera Corporation, 2013

[19] “Stratix III Device Handbook”, Altera Corporation, 2013

[20] J. Hunter, "receptive field," www.youtube.com. [Online]. Available: https://www.youtube.com/watch?v=6xs8FF8A1F0. [Accessed: 26-Feb-2020]
