

VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing
Arash Ardakani, Student Member, IEEE, François Leduc-Primeau, Member, IEEE,
Naoya Onizawa, Member, IEEE, Takahiro Hanyu, Senior Member, IEEE,
and Warren J. Gross, Senior Member, IEEE

Abstract— The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing (SC) has shown promising results for low-power area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral SC. The proposed architecture has been implemented on a Virtex7 field-programmable gate array, resulting in 45% and 62% average reductions in area and latency compared with the best reported architecture in the literature. We also synthesize the circuits in a 65-nm CMOS technology, and we show that the proposed integral stochastic architecture results in up to 21% reduction in energy consumption compared with the binary radix implementation at the same misclassification rate. Due to the fault-tolerant nature of stochastic architectures, we also consider a quasi-synchronous implementation that yields a 33% reduction in energy consumption with respect to the binary radix implementation without any compromise on performance.

Index Terms— Deep neural network (DNN), hardware implementation, integral stochastic computation, machine learning, pattern recognition, VLSI.

Manuscript received August 2, 2016; revised December 1, 2016; accepted December 26, 2016. Date of publication February 1, 2017; date of current version September 25, 2017.
A. Ardakani and W. J. Gross are with the Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 0E9, Canada (e-mail: arash.ardakani@mail.mcgill.ca; warren.gross@mcgill.ca).
F. Leduc-Primeau was with the Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 0E9, Canada. He is now with the Electronics Department, IMT Atlantique, Brest, France, and also with CNRS Lab-STICC (e-mail: francois.leduc-primeau@mail.mcgill.ca).
N. Onizawa and T. Hanyu are with the Research Institute of Electrical Communication, Tohoku University, Sendai 980-8577, Japan (e-mail: nonizawa@m.tohoku.ac.jp; hanyu@ngc.riec.tohoku.ac.jp).
Digital Object Identifier 10.1109/TVLSI.2017.2654298

I. INTRODUCTION

RECENTLY, the implementation of biologically inspired artificial neural networks such as the restricted Boltzmann machine (RBM) has aroused great interest due to their high performance in approximating complicated functions. A variety of applications can benefit from them, in particular machine learning algorithms. They can be split into two phases, which are referred to as the learning and inference phases [2]. The learning engine finds a proper configuration to map the learning input data to their desired outputs, while the inference engine uses the extracted configuration to compute outputs for new data.

Deep neural networks (DNNs), especially deep belief networks (DBNs), have shown state-of-the-art results on various computer vision and recognition tasks [3]–[8]. A DBN can be formed by stacking RBMs on top of each other to construct a deep network, as shown in Fig. 1 [4]. The RBMs used in a DBN are pretrained using gradient-based contrastive divergence algorithms, followed by gradient descent and backpropagation algorithms for classification and fine-tuning of the results [4], [5].

Fig. 1. N-layer DBN, where W and N denote the weights of each layer and the number of layers, respectively.

In the past few years, general purpose processors have mainly been used for software realization of both the training and inference engines of DBNs. However, large power consumption and high resource utilization have pushed researchers to explore application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) implementations of neural networks. The emergence of the Internet of Things paradigm provides a motivation for building machine learning into low-power mobile devices. In this scenario, the complex neural network training algorithms can be computed in cloud servers, with the extracted weights sent to mobile devices equipped with an ASIC or FPGA implementation of the inference engine tailored for low-energy operation.

DBNs are constructed of multiple layers of RBMs, with a classification layer at the end. The main computation kernel consists of hundreds of vector–matrix multiplications followed by nonlinear functions in each layer. Since multiplications are costly to implement in hardware, existing parallel or

semiparallel VLSI implementations of such a network suffer from high silicon area and power consumption [9]. The nonlinearity function is also implemented using lookup tables (LUTs), requiring large memories. Moreover, the hardware implementation of this network results in large silicon area: this is caused by the connections between layers that lead to severe routing congestion. Therefore, the efficient VLSI implementation of DBN is still an open problem.

Recently, stochastic computing (SC) has shown promising results for ultralow-cost and fault-tolerant hardware implementation of various systems [10]–[19]. Using SC, many computational units have simple implementations. For instance, using unipolar SC, the multiplication and addition are implemented using an AND gate and a multiplexer (MUX), respectively [20], [21]. However, the MUX-based adder introduces a scaling factor that can cause a precision loss [22], resulting in the failure of SC for DNNs, which require many additions. An OR gate can provide a good approximation to addition if its input values are small [21]. However, using OR gates to perform addition in DBNs results in a huge misclassification error compared with its fixed-point hardware implementation. Therefore, an efficient stochastic implementation that maintains the performance of DBN is still missing.

In this paper, an integral stochastic computation is introduced to solve the precision loss issue of the conventional scaled adder, while also reducing the latency compared with conventional binary stochastic computation. A novel finite state machine (FSM)-based tanh function is then proposed as the nonlinearity function used in DBN. Finally, an efficient stochastic implementation of DBN based on the aforementioned techniques with an acceptable misclassification error is proposed, resulting in 45% smaller area on average compared with the state-of-the-art stochastic architecture.

We also show that the proposed architectures can tolerate a fault rate of up to 16% when timing violations are allowed to occur, making them suitable for quasi-synchronous implementation. The quasi-synchronous implementation yields a 33% reduction in energy consumption with respect to the binary radix implementation without any compromise on performance.

This paper can be divided into two major parts: the proposed algorithms and their hardware implementation results. In the first part, we analyze elementary computational units. Some simulation results and examples are also provided to shed light on the proposed algorithm in comparison with the existing methods. In the second part, design aspects of a DNN based on the proposed method are studied, and implementation results under different conditions are provided.

The rest of this paper is organized as follows. Section II provides a review of SC and its computational elements. Section III introduces the proposed integral stochastic computation and operations in this domain. Section IV describes the integral stochastic implementation of DBN. Implementation results of the proposed architecture are provided in Section V; in that section, a quasi-synchronous implementation is also studied, which yields further energy savings without any compromise on performance. In Section VI, we conclude this paper and discuss future research.

II. STOCHASTIC COMPUTING AND ITS COMPUTATIONAL ELEMENTS

In stochastic computation, numbers are represented as sequences of random bits. The information content of the sequence does not depend on the particular value of each bit, but rather on their statistics. Let us denote by X ∈ {0, 1} a bit in the random sequence. To represent a real number x ∈ [0, 1], we simply generate the sequence such that

E[X] = x (1)

where E[X] denotes the expected value of the random variable X. This is known as the unipolar format. The bipolar format is another commonly used format, where x ∈ [−1, 1] is represented by setting

E[X] = (x + 1)/2. (2)

Note that any real number can be represented in one of these two formats by scaling it down to fit within the appropriate interval. In this paper, we use upper case letters to represent elements of a stochastic stream, while lower case letters represent the real value associated with that stream. It is also worth mentioning that a stochastic stream of a real value x is usually generated by a linear feedback shift register (LFSR) and a comparator. This unit is hereafter referred to as a binary to stochastic (B2S) convertor [23].
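To make the B2S conversion concrete, the following Python sketch emulates it in software. The 11-bit LFSR width matches the converters used later in this paper, but the feedback taps, seeds, and helper names are our own illustrative assumptions.

```python
import numpy as np

def lfsr11(seed, n):
    """11-bit Fibonacci LFSR; taps (11, 9) give a maximal-length sequence."""
    state = seed & 0x7FF
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        out[i] = state
        fb = ((state >> 10) ^ (state >> 8)) & 1  # XOR of tap bits 11 and 9
        state = ((state << 1) | fb) & 0x7FF
    return out

def b2s(x, length, seed=0x1A5):
    """B2S convertor sketch: compare the LFSR value against the scaled input.

    x is a real value in [0, 1]; each output bit is 1 with probability
    close to x, so the stream satisfies E[X] = x as in (1).
    """
    return (lfsr11(seed, length) < x * 2047).astype(np.int8)

print(b2s(0.75, 1024).mean())  # prints a value close to 0.75
```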
Fig. 2. Stochastic multiplications using (a) AND gate in unipolar format and (b) XNOR gate in bipolar format.

A. Multiplication in SC

Multiplication of two stochastic streams is performed using AND and XNOR gates in the unipolar and bipolar encoding formats, respectively, as illustrated in Fig. 2(a) and (b). In unipolar format, the multiplication of two input stochastic streams A and B is computed as

Y = AND(A, B) = A · B (3)

where "·" denotes bit-wise AND, and if the input sequences are independent, we have

y = E[Y] = a × b. (4)

Multiplication in bipolar format can be performed as

Y = XNOR(A, B) = OR(A · B, (1 − A) · (1 − B)) (5)
E[Y] = E[A · B] + E[(1 − A) · (1 − B)]. (6)

If the input streams are independent

E[Y] = E[A] × E[B] + E[1 − A] × E[1 − B]. (7)

By simplifying the above equation, we have

y = 2E[Y] − 1 = (2E[A] − 1) × (2E[B] − 1). (8)
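As a quick numerical check of (4) and (8), the sketch below reuses the b2s() helper introduced in Section II; the seeds are arbitrary and only serve to keep the streams decorrelated.

```python
import numpy as np

N = 4096
a, b = 0.8, 0.5
A, B = b2s(a, N, seed=0x2B7), b2s(b, N, seed=0x5D1)
print((A & B).mean(), a * b)               # unipolar AND, eq. (4)

ab, bb = -0.4, 0.6                          # bipolar values in [-1, 1]
Ab = b2s((ab + 1) / 2, N, seed=0x133)       # E[X] = (x + 1)/2, eq. (2)
Bb = b2s((bb + 1) / 2, N, seed=0x6F1)
Yb = 1 - (Ab ^ Bb)                          # XNOR, eq. (5)
print(2 * Yb.mean() - 1, ab * bb)           # eq. (8)
```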
B. Addition in SC

Additions in SC are usually performed using either scaled adders or OR gates [20], [21]. The scaled adder uses a MUX to perform addition. The output Y of a MUX is given by

Y = A · S + B · (1 − S). (9)

As a result, the expected value of Y would be (E[A] + E[B])/2 when the select signal S is a stochastic stream with a probability of 0.5, as illustrated in Fig. 3(a). This two-input scaled adder ensures that its output is in the legitimate range of each encoding format by scaling it down by a factor of two. Therefore, an L-input addition can be performed using a tree of multiple two-input MUXs. In general, the result of an L-input scaled adder is scaled down L times, which can decrease the precision of the stream. To achieve the desired accuracy, longer bit-streams must be used, resulting in larger latency.

Fig. 3. Stochastic additions using (a) MUX and (b) OR gate.

OR gates can also be used as approximate adders, as shown in Fig. 3(b). The output Y of an OR gate with inputs A and B can be expressed as

Y = A + B − A · B. (10)

OR gates function as adders only if E[A · B] is close to zero. Therefore, the inputs should first be scaled down to ensure that the aforementioned condition is met. This type of adder still requires long bit-streams to overcome the precision loss incurred by the scaling factor.
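The behavior of both adders can be reproduced with the same helper; the sketch below shows the factor-of-two scaling of the MUX adder of (9) and the output of the OR adder of (10). Values and seeds are arbitrary.

```python
import numpy as np

N = 4096
A, B = b2s(0.9, N, seed=0x111), b2s(0.3, N, seed=0x333)
S = b2s(0.5, N, seed=0x555)                # random select stream

Y_mux = np.where(S == 1, A, B)             # scaled adder, eq. (9)
print(Y_mux.mean(), (0.9 + 0.3) / 2)       # output is (a + b)/2

Y_or = A | B                               # OR adder, eq. (10)
print(Y_or.mean(), 0.9 + 0.3 - 0.9 * 0.3)  # a + b - a*b, not a + b
```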
To overcome this precision loss, which could potentially lead to inaccurate results, the accumulative parallel counter (APC) is proposed in [22]. The APC takes N parallel bits as inputs and adds them to a counter in each clock cycle of the system. Therefore, this adder results in lower latency due to the small variance of its sum. It is also worth mentioning that this adder converts the stochastic stream to binary form [22]. Therefore, this adder is restricted to cases where additions are performed to obtain the final result, or where an intermediate result in binary format is required.

C. FSM-Based Functions in SC

Hyperbolic tangent and exponentiation functions are computations required by many applications. These functions are implemented in the stochastic domain using an FSM [24]. Fig. 4(a) and (b) shows the state transition diagram of the FSM implementing the tanh and exponentiation functions. The FSM is constructed such that

tanh(nx/2) ≈ 2 × E[Stanh(n, X)] − 1 (11)
exp(−2Gx) ≈ E[Sexp(n, G, X)], x > 0 (12)

where n denotes the number of states in the FSM, G is the linear gain of the exponentiation function, and Y is the stochastic output sequence. Here, Stanh and Sexp denote the approximated functions of tanh and exp in the stochastic domain. It is worth mentioning that both the input and output of the Stanh function are in bipolar format, while the input and output of the Sexp function are in bipolar and unipolar formats, respectively.

Fig. 4. State transition diagram of the FSM implementing (a) tanh and (b) exponentiation functions.
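A software model of the Stanh FSM makes (11) concrete: an n-state saturating counter walks up or down with the incoming bits, and the output bit indicates whether the counter sits in the upper half of the states. The initial state and output rule below are conventional choices, not taken verbatim from [24].

```python
import numpy as np

def stanh(n, X):
    """n-state saturating up/down counter driven by the bit-stream X."""
    state, out = n // 2, np.empty(len(X), dtype=np.int8)
    for i, bit in enumerate(X):
        state = min(n - 1, state + 1) if bit else max(0, state - 1)
        out[i] = 1 if state >= n // 2 else 0  # upper half of the states
    return out

x = 0.6                                      # bipolar input value
X = b2s((x + 1) / 2, 8192, seed=0x2F1)       # E[X] = (x + 1)/2
Y = stanh(8, X)
print(2 * Y.mean() - 1, np.tanh(8 * x / 2))  # eq. (11) with n = 8
```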
D. Extended SC

In [25], another stochastic methodology named extended stochastic logic (ESL) was proposed, in which the information is encoded as the probability ratio of two binary stochastic streams in bipolar format. A real value x* ∈ [−∞, ∞] is represented such that

x* = p/q = E[P]/E[Q] (13)

where p and q are real values in [−1, 1] and P and Q denote a bit in a random sequence. Basic arithmetic operations such as addition for ESL were also introduced in [25]. However, the presented adder still uses the conventional scaled adder, making ESL inappropriate for applications that require many additions, as discussed earlier in Section II-B. Small-size neural networks based on ESL were also implemented, showing the high noise immunity of the ESL implementation compared with the conventional binary radix implementation.

III. PROPOSED INTEGRAL STOCHASTIC COMPUTING

A. Generation of Integer Stochastic Stream

An integer stochastic stream is a sequence of integer numbers that are represented by either two's complement or sign and magnitude. The average value of this stream is a real number s ∈ [0, m] for unipolar format and s ∈ [−m, m] for bipolar format, where m ∈ {1, 2, . . .}. In other words, the real value s is the summation of two or more binary stochastic stream probabilities. For instance, 1.5 can be expressed as 0.75 + 0.75. Each of these probabilities can be represented by a conventional binary stochastic stream, as shown in Fig. 5(a). Therefore, the integer stochastic representation of 1.5 can be

Fig. 5. (a) Stochastic representations of 0.75. (b) Integer stochastic representation of 1.5.

readily achieved as a summation of the generated binary stochastic streams, as illustrated in Fig. 5(b). In general, the integer stochastic stream S representing the real value s is a sequence with elements S_i, i = {1, 2, . . . , N}, where

S_i = Σ_{j=1}^{m} X_i^j (14)

and X_i^j denotes each element of a binary stochastic sequence representing a real value x_j. The expected value of the integer stochastic stream is then given by

s = E[S_i] = Σ_{j=1}^{m} x_j. (15)

We can also generate integer stochastic streams in the bipolar format. In that case, the elements S_i of the stream are given by

S_i = Σ_{j=1}^{m} (2 × X_i^j) − m (16)

and the value represented by the stream is

s = E[S_i] = Σ_{j=1}^{m} (2 × E[X_i^j]) − m = 2 × Σ_{j=1}^{m} x_j − m. (17)

Any real number can be approximated using an integer stochastic stream without prior scaling, as opposed to a conventional stochastic stream that is restricted to the [−1, 1] interval. In integral SC, computation on two streams with different effective lengths is also possible, while conventional SC fails to provide this property. For instance, representations of 0.875 and 0.5625 require effective bit-stream lengths of 8 and 16, respectively, using conventional SC. Therefore, an effective bit-stream length of 16 is used to generate the conventional stochastic bit-streams of these two numbers for operations. However, the second number, which requires the higher effective length, i.e., 0.5625 in this example, can be generated using the proposed integral SC with m = 2, as shown in Fig. 6(a). In this case, a bit-stream length of 8 is used for both numbers, and operations can be performed using lower lengths with respect to conventional SC. This technique potentially reduces the latency brought by stochastic computations, making integral SC suitable for throughput-intensive applications. It is worth mentioning that integral SC is different from the conventional parallelized SC [26]. In parallelized SC, multiple bits of each stochastic stream are considered in parallel to speed up the computations. For the sake of clarity, the aforementioned example is illustrated in Fig. 6(b) using the conventional parallelized SC by a factor of two. This is due to the fact that if several copies of a binary SC system are instantiated, the inputs still need to have the same effective length.

Fig. 6. (a) Increasing the range value m of the integer stochastic stream reduces computations' latency. (b) Parallelized stochastic computation by a factor of two.

In summary, a real number s ∈ [0, m] is first divided into the summation of multiple numbers that are in the [0, 1] interval. Then, the integer stochastic stream of this number is generated using column-wise addition [see (14) and (15)]. The bipolar format of the integer stochastic stream is generated in a similar way. The binary to integer stochastic convertor is hereafter referred to as B2IS, and it is composed of m B2S convertors followed by an adder, as shown in Fig. 5.

B. Implicit Scaling of Integer Stochastic Stream

The integer stochastic representation of a real number s ∈ [0, 1] can also be generated using an implicit scaling factor. In this method, the expected value of the individual binary streams is chosen as x_j = s, and the value s represented by the integer stream is given by

s = E[S_i] / m. (18)

This method avoids the need to divide s by m to obtain x_j, and can be easily taken into account in subsequent computations. For instance, a real number 9/16 can be represented using an integer stream length of 8 with m = 2. We can set x_j = 9/16 (with an implicit scaling factor of 1/2) and generate two binary sequences of length 8. These sequences are then added together to form the integer sequence S. We obtain E[S_i] = 9/8, which corresponds to s = 9/16 because of the implicit scaling factor of 1/2 [see Fig. 6(a)].
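Both the column-wise generation of (14) and (15) and the implicit scaling of (18) are easy to emulate; the sketch below again builds on the b2s() helper, with arbitrary seeds. (The paper's 9/16 example uses streams of length 8; a longer stream is used here only so that the printed estimate is stable.)

```python
import numpy as np

def b2is(x_parts, length, seeds):
    """B2IS sketch: column-wise sum of m binary streams, eq. (14)."""
    return sum(b2s(x, length, seed=s) for x, s in zip(x_parts, seeds))

S = b2is([0.75, 0.75], 1024, [0x0A1, 0x3C7])  # represents s = 1.5
print(S.mean())                                # E[S_i] close to 1.5, eq. (15)

# Implicit scaling with m = 2: both sub-streams carry x_j = 9/16 directly
S2 = b2is([9/16, 9/16], 4096, [0x0A1, 0x3C7])
print(S2.mean() / 2)                           # eq. (18): E[S_i]/m = 9/16
```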
C. Multiplication in Integral SC

The main advantage of SC compared with its binary radix format is the low-complexity implementation of mathematical

operations. It is shown that multiplication can be implemented using AND or XNOR gates depending on the coding format. However, integer stochastic multipliers make use of binary radix multipliers [see Fig. 7(a)]. The multiplication of two real numbers s_1 ∈ [0, m] and s_2 ∈ [0, m′] with integer stochastic streams S^1 and S^2 in unipolar format is performed as follows:

y = s_1 × s_2 = E[S_i^1 × S_i^2] = E[S_i^1] × E[S_i^2] (19)

if S_i^1 and S_i^2 are independent.

Fig. 7. (a) Integer stochastic multiplier with m = 2. (b) Multiplication of an integer stochastic stream with a binary stochastic bit-stream using an AND gate or a MUX.

The above equation holds true for integer stochastic multiplication in bipolar format as well. The implementation cost of this multiplier strongly depends on m and m′. Considering one of these two values to be equal to "1," the multiplication can be implemented using a bit-wise AND gate or a MUX, as depicted in Fig. 7(b). The range of y is [0, m × m′] in the unipolar case and [−m × m′, m × m′] in the bipolar case.

D. Addition in Integral SC

Conventional SC suffers from the precision loss incurred by the scaled adder, making SC inappropriate for applications that require many additions. On the other hand, integral SC uses binary radix adders to perform additions in this domain, preserving all information. Using (15), addition in unipolar format is performed as follows:

y = s_1 + s_2 = E[S_i^1 + S_i^2] = E[S_i^1] + E[S_i^2] (20)

since the expected value operator is linear.

Equation (20) also remains valid in the bipolar case, while the range of y is [0, m + m′] and [−(m + m′), m + m′] for the unipolar and bipolar formats, respectively. This adder provides some advantages similar to the APC. First of all, due to the fact that it retains all the information provided as inputs, it reduces the variance of the sum. Second, it potentially reduces the bit-stream length required for computations compared with conventional SC [22]. Moreover, the output of this adder is still an integer stochastic stream, which can be used by subsequent stochastic computational units, as opposed to the APC.
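The following sketch checks (19) and (20) numerically, choosing m = 2 and m′ = 1 so that the multiplier reduces to the AND/MUX case of Fig. 7(b); b2s() and b2is() are the helpers from the earlier sketches.

```python
import numpy as np

N = 4096
S1 = b2is([0.6, 0.6], N, [0x051, 0x1B3])  # s1 = 1.2, range m = 2
S2 = b2s(0.7, N, seed=0x6E5)              # s2 = 0.7, range m' = 1

print((S1 * S2).mean(), 1.2 * 0.7)        # eq. (19), element-wise product
print((S1 + S2).mean(), 1.2 + 0.7)        # eq. (20), element-wise sum
```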
TABLE I
HARDWARE COMPLEXITY OF THE BINARY RADIX ADDERS FOR m = 1 IN A 65-nm CMOS TECHNOLOGY

Table I compares the complexity of the binary radix adder used in integral SC with the conventional stochastic scaled adder, for 1-bit inputs. For the scaled adder with 2^p inputs, the circuit includes a p-bit LFSR used to select a random input. The binary radix adder requires a tree of adders, with a bit-width that increases by one at every level of the tree, whereas the scaled adder requires a tree of MUXs with constant bit-width. Based on this, it is easy to show that the complexity of the binary radix adder becomes twice that of the stochastic adder as the number of inputs goes to infinity. For a moderate number of inputs, the results in Table I show that the complexity ratio can be much less than two. As confirmed by the results presented in Section V, the increase in complexity caused by the use of the binary radix adder in fact decreases the energy consumption of the system by allowing the use of shorter stochastic streams or smaller networks for the same precision.

E. FSM-Based Functions in Integral SC

The nonlinear functions play a crucial role in both the inference and training engines. Since the gradient of the nonlinear function is required in the backpropagation algorithm, a common approach in the training phase is to use an exact nonlinear function that has a known and easily computed gradient [4]. Therefore, the exact nonlinear function, which is usually implemented using an FSM in the stochastic domain, is also used in the inference engine to perform the classification tasks.

The inputs of the stochastic FSM-based tanh and exponentiation functions are restricted to real values in the [−1, 1] interval. Therefore, a desired tanh or exponentiation function can be achieved by scaling down the inputs and adjusting the term n in (11) and (12), which potentially increases the bit-stream length and results in long latency. The transition between the states of the FSM is performed according to the input value in bipolar format, which is either 1 or 0. This state transition can be formulated as shown in Algorithm 1 for conventional SC. According to Algorithm 1, the input value in bipolar format is first converted to either 1 or −1 for an input of 1 or 0, respectively. Then, the counter of the FSM is added with the new encoded values, which are similar to the values in an integral stochastic stream with m = 1. Therefore, the values of the conventional stochastic stream can be viewed as hard values of an integral stochastic stream. The FSM-based functions in integral SC can be achieved by extending the conventional FSM-based functions to support soft values in integral SC, as explained in the following.

Algorithm 1 Pseudocode of the Conventional Algorithm for FSM-Based Functions

The integer stochastic tanh and exponentiation functions are proposed by generalizing Algorithm 1. In integral SC, each element of a stochastic stream is represented using the two's complement or sign-magnitude representation in {−m, . . . , m} for the bipolar format. A state counter is increased or decreased according to the integer input value S_i ∈ {−m, . . . , m}, where i ∈ {1, 2, . . . , N}. Therefore, the state counter is incremented or decremented by up to m in each clock cycle, as opposed to conventional FSM-based functions, which are restricted to one-step transitions. The algorithm for the integer FSM-based functions is proposed as shown in Algorithm 2.

Algorithm 2 Pseudocode of the Proposed Algorithm for Integer Stochastic FSM-Based Functions

The output of the proposed integer FSM-based functions in the integral SC domain and its encoding format are similar to the conventional FSM-based functions. For instance, the output of the integer tanh function is in bipolar format, while the output of the integer exponentiation function is in unipolar format. Moreover, the integer FSM-based functions require m times more states compared with their conventional counterparts. Therefore, the approximate transfer functions of the integer tanh and exponentiation functions, which are referred to as NStanh and NSexp, respectively, are

tanh(ns/2) ≈ 2 × E[NStanh(m × n, S)] − 1 (21)
exp(−2Gs) ≈ E[NSexp(m × n, m × G, S)], s > 0. (22)

In order to show the validity of the proposed algorithm, Monte Carlo simulation is used. Fig. 8 illustrates two examples of the proposed NStanh function compared with its corresponding Stanh and tanh functions for different values of m. Simulation results show that NStanh is more accurate than Stanh for m > 1 and that the accuracy improves as the value of m increases. Moreover, NStanh is able to approximate tanh for input values outside of the [−1, 1] range with negligible performance loss, while Stanh does not work. The proposed NStanh function can also approximate tanh functions with a fractional scaling factor, e.g., tanh(3/2x) ≈ NStanh(3 × m, S), as long as the value m is even, to make sure that the number of states is even. The aforementioned statements also hold true for NSexp, unlike with Sexp, as shown in Fig. 9. The proposed FSM-based functions in integral SC also result in a better approximation as the value of n increases, similar to conventional stochastic FSM-based functions. The synthesis results of the proposed FSM-based functions in a 65-nm CMOS technology are summarized in Table II. The implementation results show that the proposed FSM-based functions consume more power but also reduce the latency, which results in a reduction of the energy consumption ranging from 9% for m = 2 to 80% for m = 8. Note that the stream length of the FSM-based functions determines the latency. Of course, the proposed FSM-based functions are not restricted to neural networks and can be used for other applications that require more accurate nonlinear functions.
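Since the pseudocode of Algorithm 2 did not survive extraction here, the following sketch models NStanh from the description above: the saturating counter has m × n states and moves by the signed integer input S_i instead of by a single step per cycle. The saturation and output rules mirror the Stanh sketch of Section II-C and are our reading of the text rather than the exact algorithm.

```python
import numpy as np

def nstanh(n_states, S):
    """Saturating counter with n_states states, stepped by integers S_i."""
    state, out = n_states // 2, np.empty(len(S), dtype=np.int8)
    for i, s in enumerate(S):
        state = min(n_states - 1, max(0, state + s))
        out[i] = 1 if state >= n_states // 2 else 0
    return out

# s = 1.2 in bipolar integer form with m = 2: each sub-stream carries
# probability (s/m + 1)/2 = 0.8, and S_i = 2*(X_i^1 + X_i^2) - m, eq. (16)
m, n, s = 2, 8, 1.2
X = b2is([(s / m + 1) / 2] * m, 8192, [0x0F3, 0x5A9])
Y = nstanh(m * n, 2 * X - m)
print(2 * Y.mean() - 1, np.tanh(n * s / 2))   # eq. (21)
```

Note that the input value 1.2 lies outside [−1, 1], which a conventional Stanh stream cannot even represent.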

IV. INTEGER STOCHASTIC IMPLEMENTATION OF DBN

A. Review of the DBN Algorithm

DBNs are hierarchical graphical models obtained by stacking RBMs on top of each other and training them in a greedy unsupervised manner [4], [5]. DBNs take low-level inputs and construct higher level abstractions through the composition of layers. Both the number of layers and the number of neurons in each layer can be adjusted. Increasing the number of layers and their size tends to improve the performance of the network.

In this paper, we exploit a DBN constructed using two layers of RBM, which are also called hidden layers, followed by a classification layer at the end, for handwritten digit recognition. As a benchmark, we use the Mixed National Institute of Standards and Technology (MNIST) data set [27]. This data set provides thousands of 28 × 28 pixel images for both the training and testing procedures. Each pixel is represented by an integer number between 0 and 255, requiring 8 bits for digital representation. As mentioned in Section I, the training procedure can be performed on remote servers in the cloud. Therefore, the extracted weights are stored in a memory for the hardware inference engine to classify the input images in real time.

Fig. 8. (a) Integer stochastic implementation of tanh(s). (b) Integer stochastic implementation of tanh(2s).

Fig. 9. (a) Integer stochastic implementation of exp(−s). (b) Integer stochastic implementation of exp(−2s).

TABLE II
HARDWARE COMPLEXITY OF THE PROPOSED FSM-BASED FUNCTIONS AT 400 MHz IN A 65-nm CMOS TECHNOLOGY

Fig. 10 shows the DBN used for handwritten digits' classification in this paper. The inputs of the DBN and the outputs of a hidden layer are hereafter referred to as visible nodes and hidden nodes, respectively. Each hidden node is also called a neuron. The hierarchical computations of each neuron are performed as follows:

z_j = Σ_{i=1}^{M} W_ij × v_i + b_j (23)

h_j = 1 / (1 + exp(−z_j)) = σ(z_j) (24)

where M denotes the number of visible nodes, v_i denotes the value of the visible nodes, W_ij denotes the extracted weights, b_j denotes the bias term, z_j denotes the intermediate value, h_j denotes the output value of each hidden node, and j denotes an index to each hidden node. The nonlinearity function used in the DBN, i.e., (24), is called a sigmoid function. The classification layer does not require a sigmoid function as it is only used for quantization. In other words, the maximum value of the output denotes the recognized label.

B. Proposed Stochastic Architecture of a DBN

VLSI implementations of a DBN network in binary form are computationally expensive since they require many

Fig. 11. Proposed integer stochastic neuron.

Fig. 10. High-level architecture of two-layer DBN.

TABLE III
MISCLASSIFICATION ERROR OF THE PROPOSED ARCHITECTURES FOR DIFFERENT NETWORK SIZES AND STREAM LENGTHS

matrix multiplications. Moreover, there is no straightforward way to implement the sigmoid function in hardware. Therefore, this unit is normally implemented by LUTs, which require additional memory in addition to the memory used for storing the weights. Considering 10 bits for the weights, 78 400 10-bit × 8-bit multipliers are required to do the matrix multiplications of the first hidden layer for a parallel implementation of a network with a configuration of 784-100-200-10, meaning 784 visible nodes, 100 first-layer hidden nodes, 200 second-layer hidden nodes, and 10 output nodes. Note that the parallel implementation of such a network results in huge silicon area, in part due to the routing congestion caused by the layer interconnection.

A stochastic implementation of DBN is a promising approach to perform the mentioned complex arithmetic operations using simple and low-cost elements. In order to find the output value of the first hidden node, 784 multiplications are required, which can be easily performed using AND gates in unipolar format. Then, the addition of the multiplier outputs should be performed using a scaled adder or an OR gate. Using a scaled adder to sum 784 numbers requires an extremely long bit-stream, due to the fact that the output result of this adder is scaled down by 784 times, a very small number to be represented by a short stream length. In [28], an OR gate is used as an adder to perform this computation while the inputs are first scaled down to make the term "A · B" in (10) close to zero, which potentially increases the required stream length for computations. An APC is also proposed in [22] to realize the matrix operations. Despite its good performance on additions, it is not a suitable approach for a stochastic DBN, since it converts the results to a binary form [22].

We have shown in Section III-A that the integer stochastic stream can be generated by adding conventional stochastic streams. Considering that the multiplications of the first layer of a DBN are performed in the conventional stochastic domain, the nature of the algorithm is to add the multiplication results together. Exploiting a binary tree adder, the addition result remains in integer stochastic form without any precision loss. The sigmoid function can also be implemented in the integer stochastic domain.

It is well known that the sigmoid function σ(x) can be computed using the tanh function as

σ(x) = (1 + tanh(x/2)) / 2. (25)

The NStanh FSM can therefore be used to compute σ(x). Given an integral stochastic stream X in bipolar format representing a value x, NStanh(m, X) produces a stochastic output Y such that 2 × E[Y] − 1 ≈ tanh(x/2). Therefore, using (25), σ(x) ≈ E[Y], which corresponds to interpreting the output of the FSM in unipolar format.

Fig. 11 shows the proposed integer stochastic architecture of a single neuron. The input signal stream is generated using conventional binary stochastic streams. However, the weights and biases are represented in two's complement format in the integer stochastic domain with a range of m, which requires log2(m) + 1 bits for representation. The multiplications are performed bit-wise by AND gates, since the pixels and weights are represented by binary stochastic streams and integral stochastic streams, respectively. More precisely, 1-bit stochastic streams are ANDed with m-bit integral stochastic streams in this paper. A tree adder and an NStanh unit are used to perform the additions and the nonlinearity function, respectively. The output of the integer stochastic sigmoid function is represented by a single wire in unipolar format. Therefore, the input and output formats are the same. The integer stochastic architecture of the DBN is formed by stacking the proposed single neuron architecture (see Fig. 10).

TABLE IV
PLACE AND ROUTE RESULTS OF THE PROPOSED ARCHITECTURE ON FPGA VIRTEX-7

TABLE V
SYNTHESIS RESULTS FOR A 784-100-200-10 NETWORK AT 400 MHz AND 1 V IN A 65-nm CMOS TECHNOLOGY

The input images require a minimum bit-stream length of 256, but since the weights lie in the [−4, 4] interval and are represented using 10 bits in the binary radix domain, they require a minimum bit-stream length of 1024 in the conventional stochastic domain. Therefore, the latency of the proposed integer stochastic implementation of the DBN is equal to 1024 for m = 1.
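A behavioral model of this neuron ties the preceding pieces together: binary pixel streams are gated by integral weight streams, a tree adder accumulates the products, and nstanh() (from the sketch in Section III-E) squashes the sum, with its output read in unipolar format as in (25). The stream generator, seeds, toy dimensions, and the FSM range m_fsm are illustrative assumptions; the paper selects this range from the histogram of the adder outputs, as discussed next.

```python
import numpy as np

rng = np.random.default_rng(7)        # stand-in for the per-neuron LFSRs

def bipolar_integral_stream(x, m, length):
    """Integral bipolar stream for x in [-m, m], following (16) and (17)."""
    p = (x / m + 1) / 2               # probability of each binary sub-stream
    bits = (rng.random((m, length)) < p).astype(np.int32)
    return 2 * bits.sum(axis=0) - m   # elements in {-m, ..., m}

def neuron(v, w, b, m, m_fsm, length=16384):
    V = (rng.random((len(v), length)) < v[:, None]).astype(np.int32)
    W = np.stack([bipolar_integral_stream(wi, m, length) for wi in w])
    Z = (V * W).sum(axis=0) + bipolar_integral_stream(b, m, length)
    return nstanh(m_fsm, Z).mean()    # sigma(z_j) read in unipolar format

v = np.array([0.2, 0.9, 0.5])         # pixels, already scaled to [0, 1]
w = np.array([1.5, -1.0, 0.5])        # weights in [-m, m]
print(neuron(v, w, b=0.5, m=2, m_fsm=6),
      1 / (1 + np.exp(-(v @ w + 0.5))))  # software sigmoid for reference
```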
Fig. 12. Histogram of integer values as inputs of the NStanh function at the first layer of a 784-100-200-10 DBN for all the neurons, which are denoted by different colors.

The input range of the NStanh function, i.e., the value of m′ in Fig. 11, is selected through simulation. The histogram of the adder outputs identifies this range by taking a window that covers 95% of the data. For instance, Fig. 12 shows the histogram of integer values as inputs of the NStanh function at the first layer of a 784-100-200-10 DBN. This diagram is generated based on noncorrelated stochastic inputs, and the selected range for this network is six, i.e., the value of m′ in Fig. 11. This range strongly depends on the correlations among the stochastic inputs: the range becomes bigger as the correlation increases. For instance, the summation of two correlated stochastic streams, {1, 1, 0, 0, 1, 0} and {1, 1, 0, 1, 0, 0}, representing a real value of 0.5, results in the integral stochastic stream {2, 2, 0, 1, 1, 0} and an input range of 2, while the summation of two uncorrelated stochastic streams, {0, 0, 1, 0, 1, 1} and {1, 1, 0, 1, 0, 0}, representing a real value of 0.5, results in the integral stochastic stream {1, 1, 1, 1, 1, 1} and an input range of 1. Correlation among the inputs is introduced when the same LFSR units are shared among several inputs, in order to reduce hardware area. In this paper, the set of LFSR units that is used for one neuron is shared for all the other neurons. More precisely, (m + 1) × 785 11-bit LFSRs with different seeds are used in total to generate all the inputs and weights of the proposed DBN architectures. Therefore, in the first layer, the streams only interact with other streams that are part of the same neuron, and in subsequent layers, the fact that each neuron is provided with different inputs provides enough decorrelation. This amount of LFSR units consumes less than 5% of the total silicon area.

V. IMPLEMENTATION AND SIMULATION RESULTS

A. Misclassification Error Rate Comparison

The misclassification error rate of DBNs plays a crucial role in the performance of the system. In this section, the misclassification errors of the proposed integer stochastic architectures of DBNs with different configurations are summarized in Table III. Simulation results have been obtained using MATLAB on 10 000 MNIST handwritten test digits [27], for both the floating point code and the proposed architecture using LFSRs as the stream generators. The method proposed in [29] is used as our training core to extract the network weights. In fixed-point format, a precision of 10 bits is used to represent the weights. A stochastic stream of equivalent precision requires a length of 1024. The length of the stream can be reduced by increasing m. For example, using m = 2, the length can be reduced to 512, and using m = 4, it can be reduced to 256. Because the input pixels only require 8 bits of precision, they can be represented using a binary (m = 1) stochastic stream of length 256. Therefore, using m = 1 for the pixels and m = 4 for the weights, it is possible to reduce the stream length to 256 while still using AND gates to implement the multiplications. The simulation results show the negligible performance loss of the proposed integer stochastic DBN for different sizes compared with their floating point versions. The reported misclassification errors for the proposed integral stochastic architecture were obtained using LFSR units as random number generators in MATLAB.

B. FPGA Implementation

As mentioned previously, a fully or semiparallel VLSI implementation of DBN in binary form requires a lot of hardware resources.

TABLE VI
SYNTHESIS RESULTS FOR A 784-300-600-10 NETWORK BASED ON INTEGRAL SC AT 400 MHz AND 1 V IN A 65-nm CMOS TECHNOLOGY

Therefore, many works target FPGAs [30]–[35], but none manage to fit a fully parallel DNN architecture in a single FPGA board. Recently, a fully pipelined FPGA architecture of a factored RBM (fRBM) was proposed in [9], which could implement a single-layer neural network consisting of 4096 nodes using a virtualization technique, i.e., a time-multiplexed sharing technique, on a Virtex-6 FPGA board. However, the largest fRBM neural network achievable without virtualization is on the order of 256 nodes.

In [28], a stochastic implementation of DBN on an FPGA board is presented for different network sizes; however, this architecture cannot achieve the same misclassification error rate as a software implementation. Table IV shows both the hardware implementation (place and route) and performance results of the proposed integer stochastic architecture of DBN for different network sizes on a Virtex7 xc7v2000t Xilinx FPGA. The implementation results show that the misclassification error of the proposed architectures for the network size of 784-100-200-10 is the same as for the largest network presented in [28], i.e., the network size of 784-500-1000-10, while the areas of the proposed designs are reduced by 66%, 47%, and 21% for m = 1, m = 2, and m = 4. Moreover, the latencies of the proposed architectures are also reduced by 40%, 63%, and 84% for m = 1, m = 2, and m = 4. Therefore, as the value of m increases, the latency of the integer stochastic hardware is reduced and becomes suitable for throughput-intensive applications. Depending on the value of m, there is a tradeoff between area and latency in the proposed architectures. Note that the reported areas in Table IV include the costs of the B2S and B2IS units. The misclassification rates reported for the FPGA and ASIC implementations were obtained using functional circuit simulations and confirmed by the MATLAB software model.

C. ASIC Implementation

Table V shows the synthesis results for a fixed-point implementation of the network size of 784-100-200-10 in a 65-nm CMOS technology. Despite the improvements that the proposed architectures provide over previously proposed stochastic implementations, the stochastic implementations still use more energy than the fixed-point implementation in 65-nm CMOS, even if the power consumption and area of a stochastic neuron are smaller. A similar result was also obtained in [17] for stochastic implementations of image processing circuits.

In order to improve the energy consumption of the proposed stochastic architectures, we select a bigger network size with a better misclassification rate and reduce the stream length to achieve roughly the same misclassification error rate as the binary radix implementation in Table V. The implementation results of a 784-300-600-10 neural network based on integral SC for different stream lengths and values of m are summarized in Table VI. The implementation results show that the integral stochastic architecture with a value of m = 4 and a stream length of 16, at a misclassification error rate of 2.3%, consumes 21% less energy as well as 34% less area compared with the binary radix implementation.

TABLE VII
DEVIATIONS OF LAYER-1 AND LAYER-2 NEURONS FOR A 784-300-600-10 NETWORK

D. Quasi-Synchronous Implementations

In order to further reduce the energy consumption of the system, we also consider a quasi-synchronous implementation, in which the supply voltage of the circuit is reduced beyond the critical voltage by permitting some timing violations to occur. Timing violations introduce deviations in the computations, but because the stochastic architecture is fault tolerant, we can obtain the same classification performance by slightly increasing the length of the streams. This yields further energy savings without any compromise on performance.

We characterize the effect of timing violations on the algorithm by studying small test circuits that can be simulated quickly, using the same approach as in [36]. In the proposed architecture, the same processing circuit can be replicated several times to form each layer, depending on the required degree of parallelism. Therefore, we characterize the effect of timing violations on these small processing circuits: each neuron processor (one for each layer) is synthesized in a 65-nm CMOS technology, and deviations are measured at different voltages, from 0.7 to 1.0 V in 0.05 V increments, as shown in Table VII. Note that no deviations are observed when the supply voltage is larger than 0.8 V. The output of the first and second layers is binary, while the output of the classification layer has 6 bits. B2S converter units are also considered for each neuron, and the weights are hard coded for the implementations.
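As a toy illustration of why a stochastic representation tolerates such deviations (this is not the paper's deviation model, which comes from gate-level simulation at reduced supply voltages): flipping a random fraction of the bits in a stream shifts its value only gradually, whereas a single flip in the most significant bit of a binary radix word changes the value drastically.

```python
import numpy as np

rng = np.random.default_rng(1)
X = (rng.random(1024) < 0.75).astype(np.int8)  # stochastic stream for 0.75
flips = rng.random(1024) < 0.16                # 16% random bit flips
print(np.where(flips, 1 - X, X).mean())        # degrades gracefully toward 0.5

word = 768                                     # 0.75 as a 10-bit radix word
print((word ^ (1 << 9)) / 1024)                # MSB flip: 0.75 -> 0.25
```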

TABLE VIII
ASIC IMPLEMENTATION RESULTS FOR A 784-300-600-10 NETWORK AT 400 MHz IN A 65-nm CMOS TECHNOLOGY UNDER FAULTY CONDITIONS

The deviation error of the layer-3 neuron for 0.7 and 0.75 V results in a huge misclassification error. It is not beneficial to allow large deviations to occur in that layer: since there are only ten neurons in the third layer, we do not expect the supply voltage of the layer-3 processing circuits to have a big impact on the overall energy consumption. Therefore, the layer-3 neurons are supplied with 0.8 V. Note that no deviations are observed in the layer-3 neurons when the supply voltage is 0.8 V.

The performance results for a 784-300-600-10 network and m = 4 at different supply voltages are provided in Table VIII. The misclassification performance obtained by the quasi-synchronous system is very similar to the performance of the reliable system, despite the fact that the deviation rate is up to 9% in the layer-1 neurons and 16% in the layer-2 neurons. This results in up to a 14% lower energy consumption without any compromise on performance. On the other hand, introducing bit-wise deviations at a rate of 1% in the fixed-point system results in an 87% misclassification rate. Note that the reported implementation results in this paper include the costs of the B2S and B2IS units.

VI. CONCLUSION

Integral SC makes the hardware implementation of precision-intensive applications feasible in the stochastic domain, and allows computations to be performed with streams of different lengths, which can improve the latency of the system. An efficient stochastic implementation of a DBN is proposed using integral SC. The simulation and implementation results show that the proposed design reduces the area occupation by up to 66% and the latency by up to 84% with respect to the state of the art. We also showed that the proposed design consumes 21% less energy than its binary radix counterpart when using a bigger network size with a better misclassification rate and reducing the stream length to achieve roughly the same misclassification error rate as the binary radix implementation. Moreover, the proposed architectures can save up to 33% in energy consumption with respect to the binary radix implementation using a quasi-synchronous implementation without any compromise on performance.

ACKNOWLEDGMENT

The authors would like to thank C. Condo for his helpful suggestions.

REFERENCES

[1] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, "VLSI implementation of deep neural networks using integral stochastic computing," in Proc. 9th Int. Symp. Turbo Codes Iterative Inf. Process. (ISTC), Sep. 2016, pp. 216–220.
[2] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, "A 1.93 TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
[3] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
[4] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[6] M. A. Arbib, Ed., The Handbook of Brain Theory and Neural Networks, 2nd ed. Cambridge, MA, USA: MIT Press, 2002.
[7] P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable deep network for pedestrian detection," in Proc. IEEE Conf. CVPR, Jun. 2014, pp. 899–906.
[8] X. Zeng, W. Ouyang, and X. Wang, "Multi-stage contextual deep learning for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 121–128.
[9] L.-W. Kim, S. Asaad, and R. Linsker, "A fully pipelined FPGA architecture of a factored restricted Boltzmann machine artificial neural network," ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 1, Feb. 2014, Art. no. 5.
[10] A. Alaghi, C. Li, and J. P. Hayes, "Stochastic circuits for real-time image-processing applications," in Proc. 50th ACM/EDAC/IEEE Design Autom. Conf. (DAC), May 2013, pp. 1–6.
[11] S. S. Tehrani, S. Mannor, and W. J. Gross, "Fully parallel stochastic LDPC decoders," IEEE Trans. Signal Process., vol. 56, no. 11, pp. 5692–5703, Nov. 2008.
[12] Y. Ji, F. Ran, C. Ma, and D. J. Lilja, "A hardware implementation of a radial basis function neural network using stochastic logic," in Proc. Design, Autom. Test Eur. Conf. Exhibit. (DATE), Mar. 2015, pp. 880–883.
[13] Y. Liu and K. K. Parhi, "Architectures for recursive digital filters using stochastic computing," IEEE Trans. Signal Process., vol. 64, no. 14, pp. 3705–3718, Jul. 2016.
[14] B. Yuan and K. K. Parhi, "Successive cancellation decoding of polar codes using stochastic computing," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2015, pp. 3040–3043.
[15] W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, "An architecture for fault-tolerant computation with stochastic logic," IEEE Trans. Comput., vol. 60, no. 1, pp. 93–105, Jan. 2011.
[16] P. Li and D. J. Lilja, "Using stochastic computing to implement digital image processing algorithms," in Proc. IEEE 29th Int. Conf. Comput. Design, Oct. 2011, pp. 154–161.
[17] P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. D. Riedel, "Computation on stochastic bit streams digital image processing case studies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 3, pp. 449–462, Mar. 2014.
[18] A. Alaghi, C. Li, and J. P. Hayes, "Stochastic circuits for real-time image-processing applications," in Proc. 50th Annu. Design Autom. Conf., New York, NY, USA, May 2013, pp. 1–6.
[19] J. L. Rosselló, V. Canals, and A. Morro, "Hardware implementation of stochastic-based neural networks," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2010, pp. 1–4.
[20] J. A. Dickson, R. D. McLeod, and H. C. Card, "Stochastic arithmetic implementations of neural networks with in situ learning," in Proc. IEEE Int. Conf. Neural Netw., vol. 2, Mar. 1993, pp. 711–716.
[21] B. R. Gaines, "Stochastic computing systems," in Advances in Information Systems Science, J. T. Tou, Ed. Boston, MA, USA: Springer, 1969, pp. 37–172.
[22] P.-S. Ting and J. P. Hayes, "Stochastic logic realization of matrix operations," in Proc. 17th Euromicro Conf. Digital Syst. Design (DSD), Aug. 2014, pp. 356–364.
[23] P. Li, W. Qian, and D. J. Lilja, "A stochastic reconfigurable architecture for fault-tolerant computation with sequential logic," in Proc. IEEE 30th Int. Conf. Comput. Design (ICCD), Sep. 2012, pp. 303–308.

[24] B. D. Brown and H. C. Card, "Stochastic neural computation. I. Computational elements," IEEE Trans. Comput., vol. 50, no. 9, pp. 891–905, Sep. 2001.
[25] V. Canals, A. Morro, A. Oliver, M. L. Alomar, and J. L. Rosselló, "A new stochastic computing methodology for efficient neural network implementation," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 3, pp. 551–564, Mar. 2016.
[26] D. Cai, A. Wang, G. Song, and W. Qian, "An ultra-fast parallel architecture using sequential circuits computing on random bits," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 2215–2218.
[27] Y. LeCun and C. Cortes. The MNIST Database of Handwritten Digits, accessed on 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[28] B. Li, M. H. Najafi, and D. J. Lilja, "An FPGA implementation of a restricted Boltzmann machine classifier using stochastic bit streams," in Proc. IEEE 26th Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Jul. 2015, pp. 68–69.
[29] M. Tanaka and M. Okutomi, "A novel inference of a restricted Boltzmann machine," in Proc. 22nd Int. Conf. Pattern Recognit. (ICPR), Aug. 2014, pp. 1526–1531.
[30] C. E. Cox and W. E. Blanz, "GANGLION—A fast hardware implementation of a connectionist classifier," in Proc. IEEE Custom Integr. Circuits Conf., May 1991, pp. 6.5/1–6.5/4.
[31] J. Zhao and J. Shawe-Taylor, "Stochastic connection neural networks," in Proc. 4th Int. Conf. Artif. Neural Netw., Jun. 1995, pp. 35–39.
[32] M. Skubiszewski, "An exact hardware implementation of the Boltzmann machine," in Proc. 4th IEEE Symp. Parallel Distrib. Process., Dec. 1992, pp. 107–110.
[33] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, "A highly scalable restricted Boltzmann machine FPGA implementation," in Proc. Int. Conf. Field Program. Logic Appl., Aug. 2009, pp. 367–372.
[34] D. L. Ly and P. Chow, "A multi-FPGA architecture for stochastic restricted Boltzmann machines," in Proc. Int. Conf. Field Program. Logic Appl., Aug. 2009, pp. 168–173.
[35] D. L. Ly and P. Chow, "High-performance reconfigurable hardware architecture for restricted Boltzmann machines," IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1780–1792, Nov. 2010.
[36] F. Leduc-Primeau, F. R. Kschischang, and W. J. Gross. (Mar. 2015). "Modeling and energy optimization of LDPC decoder circuits with timing violations." [Online]. Available: https://arxiv.org/abs/1503.03880

Arash Ardakani (S'16) received the B.Sc. degree in electrical engineering from the Sadjad University of Technology, Mashhad, Iran, in 2011, and the M.Sc. degree from the Sharif University of Technology, Tehran, Iran, in 2013. He is currently pursuing the Ph.D. degree with McGill University, Montréal, QC, Canada.
His current research interests include the VLSI implementation of signal processing algorithms, in particular channel coding schemes and machine learning algorithms.

François Leduc-Primeau (M'16) received the B.Eng., M.Eng., and Ph.D. degrees in computer engineering from McGill University, Montréal, QC, Canada, in 2007, 2010, and 2016, respectively.
In 2016, he joined IMT Atlantique, Brest, France (formerly Télécom Bretagne), as a Post-Doctoral Researcher. His current research interests include algorithms and systems for telecommunications and signal processing, error-correction codes, and novel approaches for improving the energy efficiency of digital systems.
Dr. Leduc-Primeau was awarded the doctoral scholarship from the Fonds de recherche du Québec Nature et technologies (FRQNT) and the Post-Doctoral Fellowship from the Natural Sciences and Engineering Research Council of Canada (NSERC). He is a member of the IEEE.

Naoya Onizawa (M'09) received the B.E., M.E., and D.E. degrees in electrical and communication engineering from Tohoku University, Sendai, Japan, in 2004, 2006, and 2009, respectively.
He was a Post-Doctoral Fellow at the University of Waterloo, Waterloo, ON, Canada, in 2011, and at McGill University, Montréal, QC, Canada, from 2011 to 2013. In 2015, he joined the University of Southern Brittany, Lorient, France, as a Visiting Associate Professor. He is currently an Assistant Professor with the Frontier Research Institute for Interdisciplinary Sciences, Tohoku University. His current research interests include energy-efficient VLSI design based on asynchronous circuits and probabilistic computation, and their applications such as associative memories and brain-like computers.
Dr. Onizawa was a recipient of the Best Paper Award at the 2010 IEEE ISVLSI, the Best Paper Finalist at the 2014 IEEE ASYNC, the 20th Research Promotion Award of the Aoba Foundation for the Promotion of Engineering in 2014, and the Kenneth C. Smith Early Career Award for Microelectronics Research at the 2016 IEEE ISMVL.

Takahiro Hanyu (SM'11) received the B.E., M.E., and D.E. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1984, 1986, and 1989, respectively.
He is currently a Professor with the Research Institute of Electrical Communication, Tohoku University. His current research interests include nonvolatile logic circuits and their applications to ultralow-power and/or PVT-variation-free VLSI processors, and multiple-valued current-mode circuits and their application to power-aware asynchronous Network-on-Chip systems.
Dr. Hanyu received the Sakai Memorial Award from the Information Processing Society of Japan in 2000, the Judges Special Award at the 9th LSI Design of the Year from the Semiconductor Industry News of Japan in 2002, the Special Feature Award at the University LSI Design Contest from ASP-DAC in 2007, the APEX Paper Award of the Japan Society of Applied Physics in 2009, the Excellent Paper Award of IEICE, Japan, in 2010, the Ichikawa Academic Award in 2010, the Best Paper Award of the IEEE ISVLSI 2010, the Paper Award of SSDM 2012, and the Best Paper Finalist of the IEEE ASYNC 2014.

Warren J. Gross (SM'10) received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996, and the M.A.Sc. and Ph.D. degrees from the University of Toronto, Toronto, ON, Canada, in 1999 and 2003, respectively.
He is currently a Professor with the Department of Electrical and Computer Engineering, McGill University, Montréal, QC, Canada. His current research interests include the design and implementation of signal processing systems and custom computer architectures.
Dr. Gross served as the Chair of the IEEE Signal Processing Society Technical Committee on Design and Implementation of Signal Processing Systems and as the Technical Program Co-Chair of the IEEE Workshop on Signal Processing Systems (SiPS 2012), and will serve as the General Chair of the IEEE SiPS 2017. He served as an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING and is currently a Senior Area Editor. He is a licensed Professional Engineer in the Province of Ontario.
