
Snowflake: An Efficient Hardware Accelerator for

Convolutional Neural Networks

Vinayak Gokhale∗, Aliasger Zaidy∗, Andre Xian Ming Chang∗, Eugenio Culurciello†
∗School of Electrical and Computer Engineering
†Weldon School of Biomedical Engineering
Purdue University, West Lafayette, IN 47907
Email: {vgokhale,azaidy,amingcha,euge}@purdue.edu

Abstract—Deep learning is becoming increasingly popular for a wide variety of applications including object detection, classification, semantic segmentation and natural language processing. Convolutional neural networks (CNNs) are a type of deep neural network that achieve high accuracy for these tasks. CNNs are hierarchical mathematical models comprising billions of operations to produce an output. The high computational complexity combined with the inherent parallelism in these models makes them an excellent target for custom accelerators. In this work we present Snowflake, a scalable, efficient, low-power accelerator that is agnostic to CNN architectures. Our design is able to achieve an average computational efficiency of 91%, which is significantly higher than comparable architectures. We implemented Snowflake on a Xilinx Zynq XC7Z045 APSoC. On this platform, Snowflake is capable of achieving 128 G-ops/s while consuming 9.48 W of power. Snowflake achieves a throughput and energy efficiency of 98 frames per second and 10.3 frames per joule, respectively, on AlexNet, and 34 frames per second and 3.6 frames per joule on GoogLeNet.

[Figure 1 diagram: an RGB input feeding convolutional layers L1-L5 with 96, 256, 384, 384 and 256 output maps, followed by fully-connected layers of 4096, 4096 and 1000 outputs; the legend distinguishes outputs of convolutional layers, outputs of max-pooling and fully-connected layers.]
Fig. 1: Shown here is a CNN with five convolutional layers and three fully-connected layers. While not shown, all layers (except max-pooling) are followed by a rectified linear unit.
I. INTRODUCTION

Convolutional neural networks (CNNs) have emerged as the model of choice for object detection, classification, semantic segmentation and natural language processing tasks. CNNs achieving high accuracy in these tasks have been published in [1]–[3]. CNNs are hierarchical models named after the eponymous convolution operator. The levels of hierarchy in a CNN are called the layers of the network. The inputs and outputs of CNN layers are usually three dimensional. Typical CNNs use as input an image, with its red, green and blue channels split into three separate two dimensional input feature maps. Three convolutional filters, one per channel, perform two dimensional convolution on the input maps to produce three outputs. Finally, these outputs are combined by pixel-wise addition into a single two dimensional output called an output feature map. Multiple such filters are convolved with the same input to produce multiple output maps. The output of one layer becomes the input to the next.

Convolutional layers can be interspersed with some form of spatial pooling operation that downsamples the feature maps. The most commonly used pooling operator is spatial max-pooling. Max-pooling performs the max operation on a spatial region within an input feature map and produces a single output. Finally, a non-linear function is applied to the result. The most common non-linear function used is the rectified linear unit (ReLU). ReLU sets negative inputs to zero and keeps positive inputs unchanged. An example of a complete CNN, nicknamed AlexNet [1], is shown in figure 1.

While CNNs report high accuracy for object detection and classification tasks, they also have high computational complexity. This fact, combined with their recent success, has resulted in a large number of custom architectures that take advantage of the inherent parallelism of the convolution operator. However, the highly varied data access patterns in CNNs make it difficult for custom architectures to achieve high computational efficiency while processing a network. In this context, we define computational efficiency as the ratio of the actual number of operations processed to the theoretical maximum number of operations the design can process, expressed as a percentage. This translates to the ratio of actual FPS to theoretical FPS for a given CNN. Most custom accelerators are capable of achieving over 90% efficiency in some part of the network's computation but have lower efficiencies in other parts. We discuss computational efficiency because it is one of two factors influencing computational throughput, the other being bandwidth. Certain CNN layers are entirely bandwidth dependent and cannot be made compute bound by changing data access patterns alone. For these layers, a variety of compression techniques have been presented which reduce their size significantly [4], [5]. However, network compression techniques are orthogonal to the work presented here. This is because compression techniques can be used with most accelerator architectures by fetching compressed data from memory and decompressing it before writing to on-chip buffers. However, if the computations within a network do not map efficiently to the compute resources available in the design, computational efficiency will decrease. In this paper, we present Snowflake, an efficient architecture that is able to achieve an efficiency of 91% on AlexNet and GoogLeNet, two CNNs with significantly varied data access patterns and layer sizes.
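As an illustration of the definition above, the short sketch below computes computational efficiency from achieved and theoretical throughput. The code is our own and purely illustrative; the function name and arguments are not part of any Snowflake software interface.

    # Sketch only: computational efficiency as the ratio of achieved to
    # theoretical throughput, expressed as a percentage.
    def computational_efficiency(achieved_gops_per_s, peak_gops_per_s):
        return 100.0 * achieved_gops_per_s / peak_gops_per_s

    # Using the numbers reported later for Snowflake (116.5 G-ops/s achieved
    # against a 128 G-ops/s theoretical maximum), this evaluates to roughly 91%.
    print(computational_efficiency(116.5, 128.0))  # ~91.0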

II. RELATED WORK

Most recent accelerators comprise a collection of processing elements (PEs) and some scratchpad memory. Eyeriss is an ASIC implementation that contains a 12 × 14 grid of PEs and 108 KB of scratchpad memory [5]. Eyeriss achieves a maximum efficiency of 92% on AlexNet layer 1, 80% on layer 2 and 93% on layers 3-5. However, AlexNet layer 2 has 50% more computation than the next largest layer (layer 3) and represents 33% of the entire workload. The Origami architecture comprises multiple grids of 49 multipliers and adders. Origami is able to achieve an efficiency of 74% [6]. The design proposed by Qiu et al. is able to achieve an efficiency of 80.3% [7]. By comparison, Snowflake is able to achieve 91% efficiency when processing AlexNet and GoogLeNet.

III. DATA ORGANIZATION

[Figure 2 diagram: a kernel trace (KH by KW*N) and a map trace (KW*N), each subdivided into M vectors labeled Vec 1, Vec 2, ..., Vec M.]
Fig. 2: A trace is a contiguous two dimensional region of the input operand volumes that is part of the computation of an output pixel. Traces are accessed from on-chip buffers at the granularity of 256-bit blocks called vectors.

Snowflake organizes computation into traces. A trace is defined as a contiguous region of memory that must be accessed as part of the computation necessary to produce a single output pixel. An example of a trace is shown in figure 2. Usually, an output pixel requires multiple traces to be processed and accumulated. The length of a trace depends on the size of the kernel and the number of input channels in that layer. For example, a layer with 256 input maps and 3 × 3 kernels would require 3 traces, each of length 256 × 3 = 768, to produce one output. Splitting computation into traces allows us to hide the latency associated with branches, loads, stores and other non-compute instructions. This is because computation of a trace requires a single instruction that usually takes several cycles to process. Data that is part of a trace is accessed from on-chip buffers at 256-bit granularity. These smaller blocks of data are called vectors. Snowflake uses 16-bit fixed point words. As a result, each vector contains 16 words. The choice of 16-bit words was made because, for CNNs, there is insignificant loss in accuracy between networks using 16-bit fixed point and 32-bit floating point [8].
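The example above can be restated as a short sketch (our own illustrative code, not part of the Snowflake toolchain): the number of traces per output pixel equals the kernel height, and each trace spans one kernel row across all input maps.

    # Sketch: trace count and trace length for a convolutional layer, following
    # the example in the text (256 input maps, 3 x 3 kernels).
    WORDS_PER_VECTOR = 16  # one 256-bit vector holds 16 16-bit fixed-point words

    def trace_shape(input_maps, kernel_h, kernel_w):
        traces_per_output = kernel_h              # one trace per kernel row
        trace_length = input_maps * kernel_w      # words per trace
        # assumes the trace length is a multiple of 16 words
        vectors_per_trace = trace_length // WORDS_PER_VECTOR
        return traces_per_output, trace_length, vectors_per_trace

    print(trace_shape(256, 3, 3))  # (3, 768, 48)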
The 16 words of a vector are split up among a group of 16 multiply-accumulate (MAC) units called a vector MAC unit (vMAC). This splits up the computation of a single output among several MAC units. A separate adder, one per vMAC, performs reduction on the partial sums stored in the internal accumulate register of each MAC to produce an output pixel. Since multiple MAC units work together to produce an output, this method of mapping is called the cooperative mode of computation. The cooperative mode is illustrated in figure 3a. The benefits of this type of mapping are twofold. First, it maximizes MAC utilization. In a 2D systolic grid, maximum utilization can only be achieved if the size of the grid is equal to or an integer multiple of the kernel size. In Snowflake, MAC utilization is 100% any time the number of input maps is an integer multiple of 16. This is usually the case for all but the first layer of a CNN [1]–[3]. The second benefit of this type of mapping is that it reduces the on-chip SRAM required per MAC. Each MAC requires maps and kernels to process a layer. Maps can be shared across MACs to some extent; in that case, however, the MACs require different kernels. For large networks, the size of a single kernel can be large. For example, the size of a single kernel in AlexNet's layer 3 is 6.75 KB. Using our mapping, such a kernel is split across 16 MAC units instead of one, resulting in a lower amount of SRAM necessary per MAC.
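A functional sketch of the cooperative mode is given below. It is our own illustration and models only the arithmetic, not the hardware pipeline: each of the 16 MACs in a vMAC accumulates the products for its word position across all vectors of the traces belonging to one output pixel, and the per-vMAC reduction adder then sums the 16 partial results.

    # Sketch: cooperative-mode accumulation across the 16 MACs of a vMAC.
    # map_vectors and kernel_vectors are lists of 16-element vectors that all
    # contribute to the same output pixel (one or more traces).
    def vmac_cooperative(map_vectors, kernel_vectors):
        partial = [0] * 16                       # per-MAC accumulate registers
        for m_vec, k_vec in zip(map_vectors, kernel_vectors):
            for lane in range(16):               # word m is sent to MAC m
                partial[lane] += m_vec[lane] * k_vec[lane]
        return sum(partial)                      # reduction adder -> one output pixel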
The cooperative mode does not work for layers whose number of input maps is not a multiple of 16. For that scenario, the data organization changes slightly. Maps are still stored depth-first. However, the weights are stored in a kernel-first order. Consider the case of AlexNet layer 1 [1] with 64 kernels, each of which is 3 × 11 × 11. These will be stored such that the first element in the weights array is the first pixel of the first red kernel's first row. This is illustrated in figure 3b. The second element will be the first pixel of the second red kernel's first row. Similarly, the sixteenth element will be the first pixel of the sixteenth red kernel's first row. The seventeenth element will hold the first pixel of the first green kernel's first row. In this type of mapping, each MAC unit produces an output pixel. Since each MAC unit works independently of the others, this type of mapping is called the independent mode of computation. The cooperative mode is used any time the number of input maps is a multiple of 16, while the independent mode is used for all other cases.
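The kernel-first ordering described above can be sketched as follows. This is our own illustration: the text only fixes the order of the first entries (kernel index varies fastest within a group of 16, then channel), so the ordering of the remaining dimensions and the grouping of kernels below are assumptions.

    # Sketch: kernel-first weight ordering for independent mode, illustrated for
    # AlexNet layer 1 (64 kernels of 3 x 11 x 11). weights is indexed as
    # weights[kernel][channel][row][col].
    def kernel_first_layout(weights):
        K, C, KH, KW = 64, 3, 11, 11
        out = []
        for group in range(0, K, 16):            # assumed: 16 kernels per MAC group
            for y in range(KH):                  # assumed: row, then column ordering
                for x in range(KW):
                    for c in range(C):           # channel (red, green, blue)
                        for k in range(group, group + 16):  # kernel index fastest
                            out.append(weights[k][c][y][x])
        return out

    # out[0]  -> first pixel of the first red kernel's first row
    # out[1]  -> first pixel of the second red kernel's first row
    # out[15] -> first pixel of the sixteenth red kernel's first row
    # out[16] -> first pixel of the first green kernel's first row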

Fig. 3: An illustration of how computation maps to MAC units in independent and cooperative modes. (a) In cooperative mode, the mth word of the vector is sent to the mth MAC within the vMAC. (b) In independent mode, the mth word of the maps vector is broadcast to all MACs.

[Figure 4 diagram: four load/store units on memory ports HP0-HP3 feeding a data distribution network; a control core (fetch, decode, dispatch, register file, ALU) issuing instructions to compute units, each containing weights buffers (W$), vector MACs, a maps buffer (M$) and a max-pool unit.]
Fig. 4: An overview of the Snowflake architecture. The four major components of Snowflake - the memory interface, the data distribution network, the control core and the compute core - are shown in the figure.

IV. ARCHITECTURE

The Snowflake coprocessor has four components - the memory interface, the data distribution network, the control core and the compute core. An overview of the system is shown in figure 4.

1) Control Core: The control core is designed as a simple RISC-style, five stage pipeline. It contains a 4 KB direct-mapped instruction cache, an instruction decoder, a two-ported register file and an ALU containing a multiplier, an adder and a comparator. The traditional instruction decode stage is split into two parts, with the second being the dispatch stage. The dispatch stage is responsible for preventing read access to a buffer that is also the destination for a load instruction. By doing this, it prevents the vMACs from accessing a buffer before data from memory has finished writing to it. The control core is responsible for supplying compute instructions to the compute core. These compute instructions are in the form of the traces described in section III. Each trace instruction contains the trace length, the mode of computation (independent or cooperative) and the addresses of the maps and weights buffers where the first element of the trace resides. Other instructions are loads, stores, ALU operations that increment the trace start addresses, and branches. Branch instructions are followed by four branch delay slots. Most loops can make full use of these slots. The latencies associated with non-trace instructions can usually be hidden because a single trace instruction keeps the vMACs occupied for several cycles.
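As a sketch, a trace instruction can be thought of as carrying the fields below. The field names and types are ours, chosen for illustration; the actual instruction encoding is not described here.

    # Sketch: the information carried by a single trace instruction, as
    # described above. Field names are illustrative, not the real encoding.
    from dataclasses import dataclass

    @dataclass
    class TraceInstruction:
        trace_length: int         # length of the trace (unit of measure assumed, not specified here)
        cooperative: bool         # True: cooperative mode, False: independent mode
        maps_buffer_addr: int     # buffer address of the trace's first map element
        weights_buffer_addr: int  # buffer address of the trace's first weight element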
2) Compute Core: The compute core contains the vMACs, the maps and weights buffers, and comparators to perform the max-pool and ReLU operations. Four vMACs and four comparators are grouped together to form one compute unit (CU). Each CU also contains a reduction adder, a 128 KB maps buffer and a 32 KB weights buffer. The compute core contains four CUs, which together contain a total of 256 MAC units, 512 KB of maps buffer and 128 KB of weights buffer. The 4 CUs together provide a maximum throughput of 128 G-ops/s.

3) Data Distribution Network: The data distribution network (DDN) is responsible for forwarding data returned from memory to the correct on-chip buffers and for forwarding results produced by the coprocessor back to memory. The bus from memory carries 64-bit words while Snowflake vectors are 256 bits wide. As a result, the DDN packs four memory words into a single vector. When receiving data from memory that is destined for the instruction cache, the DDN unpacks each 64-bit memory word into two 32-bit instructions.
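A behavioural sketch of this packing and unpacking is shown below. It is illustrative only; in particular, which memory word lands in which bit positions of the vector, and which half of a word holds the first instruction, are our assumptions.

    # Sketch: the DDN packs four 64-bit memory words into one 256-bit vector,
    # and unpacks a 64-bit word into two 32-bit instructions for the i-cache.
    def pack_vector(words64):                  # words64: four 64-bit integers
        assert len(words64) == 4
        vec = 0
        for i, w in enumerate(words64):        # bit placement is an assumption
            vec |= (w & (2**64 - 1)) << (64 * i)
        return vec                             # one 256-bit vector

    def unpack_instructions(word64):
        lo = word64 & 0xFFFFFFFF               # first 32-bit instruction (assumed)
        hi = (word64 >> 32) & 0xFFFFFFFF       # second 32-bit instruction
        return lo, hi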
4) Memory Interface: The memory interface contains four load-store units. Each unit is capable of sustaining 1 GB/s in full duplex. The load units supply data to the DDN. Each memory word is accompanied by its physical memory address, the buffer address at which the word is to be written and a destination ID. The destination ID is used by the DDN to forward data to the correct buffer. The memory address is necessary to write to the instruction cache. Finally, a single-bit last signal is included, which signals the dispatch stage of the control core to unlock read access to the corresponding buffer so that any trace instructions that are being held back can proceed.
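The sideband information that accompanies each returned memory word can be summarized as in the sketch below; the names are ours and are only meant to restate the fields listed above.

    # Sketch: metadata returned with each 64-bit memory word, as described
    # above. Field names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class LoadResponse:
        data: int          # the 64-bit memory word
        phys_addr: int     # physical memory address (needed for i-cache writes)
        buffer_addr: int   # on-chip buffer address at which to write the word
        dest_id: int       # destination ID used by the DDN for routing
        last: bool         # unlocks dispatch-stage read access to the buffer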
V. RESULTS

A. System Specifications

We implemented Snowflake on a Xilinx Zynq XC7Z045 APSoC device. We used the ZC706 platform, which has the Zynq device and 1 GB of DDR3 memory. This Zynq device has two ARM cores running at 800 MHz and a Kintex-7 FPGA. The implemented Snowflake system had 4 CUs, each with 64 MAC units, a 128 KB maps buffer, a 32 KB weights buffer and four comparators. The entire system was clocked at 250 MHz. At these specifications, the ZC706 platform consumes 9.48 W. This system is capable of a theoretical maximum throughput of 128 G-ops/s.
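The 128 G-ops/s figure follows directly from this configuration. The back-of-the-envelope check below uses the definition of an op given in the next subsection (a multiplication or an addition), so each MAC contributes two ops per cycle.

    # Sketch: theoretical peak throughput of the implemented system.
    mac_units = 256              # 4 CUs x 64 MAC units
    clock_hz = 250e6             # 250 MHz
    ops_per_mac_per_cycle = 2    # one multiply and one add per cycle
    peak_gops = mac_units * ops_per_mac_per_cycle * clock_hz / 1e9
    print(peak_gops)             # 128.0 G-ops/s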

TABLE I: A comparison of computational efficiency with other FPGA accelerators.

                          | Eyeriss [5]    | Origami [6]    | Zhang et al. [9] | Caffeine [10]    | Qiu et al. [7] | This work
Platform                  | ASIC 65nm CMOS | ASIC 65nm CMOS | Virtex-7 VX485T  | Ultrascale KU060 | Zynq XC7Z045   | Zynq XC7Z045
Clock (MHz)               | 200            | 250            | 100              | 200              | 150            | 250
Precision                 | 16-bit fixed   | 12-bit fixed   | 32-bit float     | 16-bit fixed     | 16-bit fixed   | 16-bit fixed
MAC units                 | 168            | 196            | 448              | 1058             | 780            | 256
Performance (G-ops/s)     | 46.1           | 145.0          | 61.6             | 310.0            | 187.8          | 116.5
Theoretical max (G-ops/s) | 67.2           | 196.0          | 89.6             | 423.2            | 234            | 128.0
Power (W)                 | 0.28           | 0.45           | 18.61            | 25.00            | 9.63           | 9.48
Energy Eff. (G-ops/J)     | 164.64         | 322.2          | 3.3              | 12.4             | 19.5           | 12.3
Computational Eff.        | 68.6%          | 74.0%          | 68.8%            | 73.3%            | 80.3%          | 91%

B. Performance Comparison

We benchmarked both AlexNet [1] and GoogLeNet [2] on this system and achieved an average efficiency of 91% on both networks. Snowflake is able to achieve 98 frames per second on AlexNet and 34 frames per second on GoogLeNet. Our benchmarks do not take the models' large fully connected layers into account because, as mentioned in section I, these layers add very little computational complexity and are bandwidth limited. Compression techniques can reduce their size significantly, increasing throughput. We compare the efficiency of our design with other works in the literature in table I. In the table, an op is defined as a multiplication or an addition. As opposed to the other works, [5] does not explicitly state performance; we therefore computed it as the ratio of the number of ops to the latency. Snowflake's efficiency of 91% is 10.7 percentage points higher than that of [7], the next most computationally efficient system. [5] and [6] are ASIC implementations, which explains their higher energy efficiency. However, they are 22.4 and 17 percentage points lower, respectively, in terms of computational efficiency than Snowflake. [7] and [10] are FPGA implementations that are 10.7 and 17.7 percentage points lower in computational efficiency, respectively, than Snowflake. While [7] achieves 1.6× more throughput, it also uses 3× more MAC units. Similarly, the design in [10] is 2.7× faster but uses 4× more MAC units. Our simulations show that scaling our design up to 768 MAC units would result in a system with a throughput of 349 G-ops/s. Such a system would be 1.9× faster than [7] and 1.1× faster than [10] while using fewer MAC units than either. Snowflake's high computational efficiency is attributed to the efficient mapping of compute in CNNs to the vMACs in our design using the independent and cooperative computation modes. In terms of computational throughput, the current system can process both AlexNet and GoogLeNet in real time.
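The energy-efficiency figures follow from measured throughput and power; the short check below (our own arithmetic, rounded to one decimal place) reproduces the values quoted in the abstract and in table I.

    # Sketch: energy efficiency from measured throughput and power.
    power_w = 9.48
    print(round(116.5 / power_w, 1))  # 12.3 G-ops/J (table I)
    print(round(98 / power_w, 1))     # 10.3 frames/J on AlexNet
    print(round(34 / power_w, 1))     # 3.6 frames/J on GoogLeNet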
VI. FUTURE WORK AND CONCLUSION

We presented an efficient implementation of a CNN coprocessor. Our design, titled Snowflake, is capable of achieving a maximum throughput of 128 G-ops/s on CNN workloads while using 256 MAC units clocked at 250 MHz. Snowflake is able to process AlexNet at 98 frames per second and GoogLeNet at 34 frames per second while consuming 9.48 W. This demonstrates the real-time processing capabilities of our design on multiple modern deep CNNs. We plan on exploiting Snowflake's high computational efficiency and scaling our design up to 768 MAC units on the current Zynq device and further on larger FPGAs.

ACKNOWLEDGMENT

This work was partly supported by Office of Naval Research (ONR) grants N000141210167, N000141512791 and MURI N000141010278.

REFERENCES

[1] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," CoRR, vol. abs/1404.5997, 2014. [Online]. Available: http://arxiv.org/abs/1404.5997
[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[4] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," International Conference on Learning Representations (ICLR), 2016.
[5] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers, 2016, pp. 262–263.
[6] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, "Origami: A convolutional network accelerator," in Proceedings of the 25th Edition on Great Lakes Symposium on VLSI, ser. GLSVLSI '15. New York, NY, USA: ACM, 2015, pp. 199–204. [Online]. Available: http://doi.acm.org/10.1145/2742060.2743766
[7] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded fpga platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 26–35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265
[8] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in ICML, 2015, pp. 1737–1746.
[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing fpga-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '15. New York, NY, USA: ACM, 2015, pp. 161–170. [Online]. Available: http://doi.acm.org/10.1145/2684746.2689060
[10] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," in Proceedings of the 35th International Conference on Computer-Aided Design, ser. ICCAD '16. New York, NY, USA: ACM, 2016, pp. 12:1–12:8. [Online]. Available: http://doi.acm.org/10.1145/2966986.2967011

