
Energy-efficient Scheduling Method with Cross-loop Model for Resource-limited CNN Accelerator Designs


Kaiyi Yang, Shihao Wang, Jianbin Zhou and Takeshi Yoshimura
Graduate School of Information, Production and Systems, Waseda University
Kitakyushu, Japan
yangkaiyi1222@fuji.waseda.jp, wshh1216@moegi.waseda.jp

Abstract—The state-of-the-art customized accelerators for convolutional neural networks (CNN) have achieved high throughput, but the huge amount of data movement still dominates the total energy cost. In this paper, we propose an energy-efficient scheduling approach that finds a dataflow minimizing data movements under limited hardware resource budgets. In detail, two-level nested loop transformations are proposed to separate the memory and computing resource constraints. This allows us to fully exploit the potential of the available memory resources for reducing off-chip memory traffic. Further, the proposed cross-loop model is capable of capturing the data locality across nested loops in CNN algorithms. Finally, the energy-delay product is employed as the evaluation criterion to balance energy and throughput performance. The experimental results show that our cross-loop model can reduce the off-chip data movements by 11-21% and achieve the theoretical optimum. The proposed scheduling method can thereby increase the energy efficiency by at least 8.7 times.

Keywords—convolutional neural network; memory bandwidth; energy efficiency; loop tiling; loop transformation; accelerator

I. INTRODUCTION

Deep convolutional neural networks (CNN) have shown great performance in many computer vision tasks in recent years, such as AlexNet [1], VGG [2] and ResNet [3]. In these networks, convolutional computations take the major part of the total workload for feature extraction, and the trend is toward more convolution layers for better accuracy, at the expense of higher computational complexity.

The increasing computational complexity of the state-of-the-art CNNs has urged many researchers to build hardware accelerators for speeding up these computations. Techniques like high-level synthesis greatly reduce the design period for highly parallel CNN accelerators. However, the large number of convolutions also involves a huge number of memory accesses to move operands between the memories and the processor core. These data movements have been identified as the major part of the total power consumption [4][5].

Many research groups have tried to build energy-efficient ASIC/FPGA-based CNN accelerators by reducing the data movements, which are tied tightly to the chosen dataflow. Some works [6][9] use the loop parameters of CNN algorithms to model the memory bandwidth requirement and search for efficient schedules by loop transformations under limited resource budgets. [7] gives an improved model that can accurately represent the bandwidth. [4] addresses this problem from the hardware side by building a memory hierarchy and trying to allocate as many computations as possible at the register level. Although these works improve energy efficiency, they lack a systematic analysis for evaluating performance, so their results may be local optima.

This paper introduces a novel model and scheduling method for finding an energy-efficient dataflow under a limited resource budget. The contributions are summarized below:

(a) Under the same constraints, the proposed cross-loop model achieves the theoretically minimal number of memory accesses, an 11-21% reduction of the memory bandwidth requirement.

(b) The scheduling is based on two-level nested loop transformations that exploit the full potential of on-chip resources, achieving an 8.7x energy-efficiency improvement.

(c) The approach can be used to systematically analyze most existing CNN accelerators, as it covers data parallelism at various levels, including the batch level.

II. BACKGROUND AND CHALLENGES

A. CNN overviews

Generally, CNNs contain multiple layers, and the convolutional layers contain the majority of the computations. For convolutional layers, the input/output of each image is organized as channels of two-dimensional input/output feature maps (ifmaps/ofmaps). For each pair of ifmap and ofmap, a unique convolutional kernel exists for feature extraction. Each ofmap is the sum of the features generated from all ifmaps. Several images are organized as a batch, within which the kernels are invariant. These computations can be represented as the nested loops in Fig. 1. We do not unroll the convolution operations, as they are usually designed to be processed in parallel [5] or processed continuously [8] in many customized CNN accelerators.

978-1-4673-6853-7/17/$31.00 ©2017 IEEE
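For reference, the control loops of a convolutional layer (Fig. 1) can be written out as a plain Python loop nest. The paper gives no code, so the array layout and argument names below are our own; the inner k x k window is shown explicitly even though hardware usually processes it in parallel [5] or continuously [8].

```python
# Naive convolutional-layer loop nest over batch (b), input channels (i),
# output channels (o) and output pixels (x, y), with kernel size k.
# Layouts (illustrative): ifmaps[b][i][x][y], kernels[i][o][kx][ky].
def conv_layer(ifmaps, kernels, Bb, Bi, Bo, Bx, By, k):
    ofmaps = [[[[0.0] * By for _ in range(Bx)]
               for _ in range(Bo)] for _ in range(Bb)]
    for b in range(Bb):            # batch level
        for i in range(Bi):        # input feature maps
            for o in range(Bo):    # output feature maps
                for x in range(Bx):
                    for y in range(By):
                        # convolution window (usually not unrolled in software,
                        # but mapped to parallel hardware)
                        for kx in range(k):
                            for ky in range(k):
                                ofmaps[b][o][x][y] += (
                                    ifmaps[b][i][x + kx][y + ky]
                                    * kernels[i][o][kx][ky])
    return ofmaps
```

The five outer loops (b, i, o, x, y) are the control loops that the scheduling method tiles and reorders.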


Fig. 2. General architecture of customized CNN accelerators.

for each TileD in loopTransformation(B):
    // Calculate DRAM BW (Sec. III.A)
    DramBW = cross_loop_model(TileD)
    for each TileS in loopTransformation(TileD):
        // Calculate SRAM BW (Sec. III.B)
        SramBW = cross_loop_model(TileS)
        thisEDP = calcEDP(DramBW, SramBW, TileD, TileS)
        // Record results with minimum EDP (Sec. III.C)
        if thisEDP < minEDP:
            optRes = thisRes  // May contain TileD, TileS, ...
return optRes

Fig. 3. Pseudocode of the hardware-oriented scheduling algorithm.
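The two-level search of Fig. 3 can be sketched in runnable Python. The tile enumeration below is a deliberately simplified stand-in (divisor tiles only, no loop reordering), and the bandwidth model and EDP function are passed in as callables rather than the paper's exact ones:

```python
from itertools import product

def loop_transformation(ranges):
    # Simplified stand-in: enumerate tile sizes that evenly divide each
    # loop range. The paper also traverses loop orders, omitted here.
    choices = [[t for t in range(1, r + 1) if r % t == 0] for r in ranges]
    return product(*choices)

def schedule(layer_ranges, bw_model, edp):
    best, best_edp = None, float("inf")
    for tile_d in loop_transformation(layer_ranges):      # memory-constrained level
        dram_bw = bw_model(layer_ranges, tile_d)          # Sec. III.A
        for tile_s in loop_transformation(tile_d):        # computing-constrained level
            sram_bw = bw_model(tile_d, tile_s)            # Sec. III.B
            this_edp = edp(dram_bw, sram_bw, tile_d, tile_s)  # Sec. III.C
            if this_edp < best_edp:                       # keep minimum-EDP result
                best, best_edp = (tile_d, tile_s), this_edp
    return best, best_edp
```

The point of the structure is that TileD is fixed before TileS is explored, so the DRAM-level decision never competes with the computing-resource constraint.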

We define the nested loops as control loops with three attributes (order, range and tile). The order and range of a control loop are the iteration order, counted from the outermost loop, and the trip count, as shown in Fig. 1. A tile, introduced in Sec. II.B, defines a subset of the computations (shown in blue).

B. Design Challenges and Previous Works

A general FPGA/ASIC-based CNN accelerator system with limited on-chip resource budgets is shown in Fig. 2. The memory resource constraint limits the maximum available on-chip SRAM, which can be used to exploit data locality and reduce DRAM accesses for energy saving (a 200x versus 6x energy cost according to [4]). Meanwhile, the computing resource constraint limits the maximum number of parallel PEs, which directly affects the throughput of the CNN accelerator.

The design challenge is to find an energy-efficient accelerator under these two constraints. In particular, as FPGA/ASIC designs can easily stack PEs for high parallelism, the remaining difficulty is finding a dataflow that reduces the memory bandwidth requirement, since data movements consume the major part of the energy cost. Loop tiling is widely employed in prior works like [6][7][9] and is solved as a mathematical optimization problem as below (k is the kernel size):

Minimize DramBWModel(T_b, T_i, T_o, T_x, T_y)
subject to
    T_b T_i (T_x + k - 1)(T_y + k - 1) + T_b T_o T_x T_y + T_i T_o <= MemoryResource
    T_b T_i T_o T_x T_y <= ComputingResource

Although this method can find an efficient dataflow, we can find a better solution by taking a hardware-oriented view. Firstly, the above constraints regard the on-chip SRAM and the processor core as an indivisible system, resulting in locally optimal results. Secondly, existing models overestimate the DRAM bandwidth as they cannot fully exploit the on-chip SRAM. Thirdly, the energy efficiency is affected not only by the DRAM bandwidth, but also by the SRAM bandwidth and the processing time. These issues lead previous works to local optima, and our challenge is to incorporate these hardware-oriented considerations into our design.

III. HARDWARE-ORIENTED SCHEDULING METHOD

This section presents a detailed discussion of the proposed scheduling methodology for solving the three challenges listed at the end of the last section, and summarizes these optimizations from a hardware point of view in Sec. III.D.

A. Framework of Scheduling Methodology

Two-level nested loop transformations form the framework of our proposed scheduling algorithm, as shown in Fig. 3. We define TileD and TileS to represent the candidates at each loop transformation level. TileD is generated under the memory resource constraint, while TileS is limited by the computing resource constraint. Instead of combining the two constraints, we separate them to enlarge the exploration space. We first guarantee that the DRAM bandwidth is minimized with the available memory resources, due to its huge energy cost. Then, we explore a dataflow within TileD that achieves minimal SRAM bandwidth traffic while fully utilizing the computing resources.

Notice that the loop transformations contain not only loop tiling but also loop reordering. All possible combinations of the order and tile attributes are considered. Unlike a predefined order for the control loops, traversing the possible orders can fully exploit the data locality in CNN algorithms to further reduce the memory bandwidth, together with the proposed cross-loop model in Sec. III.B.

B. Cross-loop Model of Memory Bandwidth

Given the order and tile attributes, many works [6][9] use a simple model for the memory bandwidth, the product of the bandwidth per tile and the number of tiles:

BWModel = (#Tiles) x (BW/Tile)
        = ceil(B_b/T_b) ceil(B_i/T_i) ceil(B_o/T_o) ceil(B_x/T_x) ceil(B_y/T_y) x BW/Tile    (1)

where:
    ifmaps:  BW/Tile = T_b T_i (T_x + k - 1)(T_y + k - 1)
    ofmaps:  BW/Tile = 2 x T_b T_o T_x T_y
    kernels: BW/Tile = T_i T_o

We claim that this model may overestimate the bandwidth requirement as it ignores the data locality among tiles. While [7] proposed an inter-tile model to eliminate the locality between adjacent tiles, there is still potential for data reuse across outer control loops.

The cross-loop model is proposed to provide an accurate estimation and thus fully exploit the caching ability of the on-chip memories. This model starts from the simple model and keeps removing the overlaps between the operands of a new tile and the data already buffered in the on-chip SRAM. The removal is based on the concept of Reuse Potential (RP) defined in Table I. A reuse potential is marked on every subscript of each operand, with three candidates: none, half and full.
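Since the cross-loop model starts from the simple model, it helps to have Equation (1) in executable form. The dict-based layout of the loop ranges B and tiles T below is our own; the per-tile bandwidth terms follow Equation (1) directly, including the paper's factor-of-two for ofmaps reads and writes:

```python
from math import ceil

# Sketch of the simple model in Eq. (1): total BW = (#tiles) x (BW per tile).
# B and T map the subscripts b, i, o, x, y to loop ranges and tile sizes;
# k is the kernel size.
def simple_bw_model(B, T, k):
    n_tiles = 1
    for s in ("b", "i", "o", "x", "y"):
        n_tiles *= ceil(B[s] / T[s])
    bw_per_tile = {
        "ifmaps":  T["b"] * T["i"] * (T["x"] + k - 1) * (T["y"] + k - 1),
        "ofmaps":  2 * T["b"] * T["o"] * T["x"] * T["y"],   # read + write
        "kernels": T["i"] * T["o"],                          # per Eq. (1)
    }
    return {name: n_tiles * bw for name, bw in bw_per_tile.items()}
```

The cross-loop model of Sec. III.B then subtracts the overlaps that this model double-counts.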
The reuse potential indicates how much of the data of a new tile already resides in the on-chip SRAM. Full RP means all data of the current tile are already cached, while none RP means all the data have to be fetched from off-chip DRAM. In addition, there are cases (half RP) for ifmaps where only part of the data is inside the SRAM, as shown in Fig. 4.

Fig. 4. An example of half reuse potential (RP).

Table I. Reuse potentials of the subscripts of the control loops

Data Type          |           Reuse Potential
                   |  b     i     o     x     y
ifmaps[b][i][x][y] | none  none  full  half  half
ofmaps[b][o][x][y] | none  full  none  none  none
kernels[i][o]      | full  none  none  full  full

We give the pseudocode of our cross-loop model in Fig. 5. The function receives a tile candidate after loop transformation and calculates the bandwidth with the simple model in Equation (1). Then, two optimizations are executed sequentially. (Opt.A) Traverse the control loops from innermost to outermost. As long as the RP is not none, we can remove the overlapped data according to Table II. The traversal for each operand is interrupted as soon as the RP is not full, since even half RP cannot guarantee that the data will be reused by the following outer control loops. (Opt.B) Notice that the ofmaps term in Equation (1) is doubled, as ofmaps require extra read operations compared to the other operands. We can omit these read accesses only if all the calculations contributing to a specific ofmap are executed continuously. As shown in Fig. 5, we must make sure that the innermost control loop carries the subscript i of ifmaps and that its range equals the number of ifmap channels of the current convolutional layer.

Fig. 5. Pseudocode of the function cross_loop_model.

Table II shows how to remove the overlaps when the current control loop has a range of B and a tile size of T. With full RP, all the tiles share the same data, so we simply divide the bandwidth by the number of tiles. Half RP occurs only in ifmaps, according to Table I. Here the locality exists across the whole range B: instead of treating this control loop as being divided into many tiles, we can rewrite the bandwidth by regarding the whole control loop as one tile.

Table II. Removal of overlapped data in the cross-loop model

Reuse Potential | Before              | After
full            | ceil(B/T) x BW/tile | BW/tile
half            | ceil(B/T) x BW/tile | (B + k - 1)/(T + k - 1) x BW/tile

C. Memory-centric Energy-Delay Product

Instead of considering only the off-chip DRAM bandwidth as in prior works, this work also includes the energy cost of the on-chip SRAM to give a systematic analysis. The power consumption is written as the weighted sum of these two energy sources, where the second term represents the SRAM energy cost and B denotes the loop ranges of the current convolutional layer:

Energy cost = 200 x BW(Dram) + 6 x BW(Sram)
            = 200 x BW(Dram) + 6 x prod(ceil(B/TileD)) x BW(Sram)/tile

Moreover, we employ the energy-delay product as the cost function so that the loop transformation not only achieves high energy efficiency but also fully utilizes the PEs. The delay is calculated as:

Delay = (#TileS) x (delay per TileS)
      = (prod(ceil(B/TileD)) x prod(ceil(TileD/TileS))) x (1/ActivePE)

The delay of each TileS is defined as the reciprocal of the number of active PEs. An improper tiling may not fully utilize the available computing resources. For example, when T_o is tiled as 3 in TileS and at most two ofmaps are processed in parallel, half of the available PEs will be idle in the last iteration of loop T_o, so only 75% of the peak throughput is achieved.

D. Hardware-oriented Analysis

Firstly, the two-level nested loop transformation can fully exploit the potential of on-chip resources compared to prior works. The intertwined constraints in Sec. II.B may cause local optima: when a tiling candidate is constrained by the computing resources, the on-chip memory resources may be under-utilized. The unused memory could actually buffer more data to reduce the DRAM bandwidth, but this possibility is ignored by prior works. The two-level nested loop transformation eliminates this problem by dividing the two constraints into two sub-tasks. There is no dependency between the sub-tasks, so a better solution can be expected from our approach.

Secondly, we introduce loop reordering, the cross-loop model, batch-level parallelism and the energy-delay product to enlarge the solution space. Most of these keep the same hardware design complexity as prior works, because they only search for a more energy-efficient dataflow. Only the half reuse potential in the cross-loop model requires the on-chip SRAM to be controlled more carefully.
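The per-operand removal of Opt.A (Table II) can be sketched in Python. The flat list of loops, the function name and the argument layout are our own simplifications of the Fig. 5 pseudocode, and Opt.B is not modeled here:

```python
from math import ceil, prod

# Cross-loop bandwidth for one operand (Opt.A in Fig. 5 / Table II).
# `loops` lists the control loops from innermost to outermost as
# (reuse_potential, B, T) with reuse_potential in {"full", "half", "none"}
# taken from Table I; k is the kernel size, used only in the "half" case.
def cross_loop_bw(bw_per_tile, loops, k):
    factors = []
    removing = True
    for rp, B, T in loops:                 # innermost -> outermost
        if removing and rp == "full":
            factors.append(1)              # fully cached: drop ceil(B/T)
        elif removing and rp == "half":
            factors.append((B + k - 1) / (T + k - 1))  # Table II, half RP
            removing = False               # half RP stops further removal
        else:
            factors.append(ceil(B / T))    # simple-model factor is kept
            removing = False               # RP not full: stop removing
    return bw_per_tile * prod(factors)
```

With all reuse potentials set to none the result degenerates to the simple model of Equation (1), which matches the claim that the cross-loop model only removes overlaps from it.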
A circular buffer with a registered index can support this more sophisticated control, as in [8].

Fig. 6. DRAM bandwidth versus the memory resource constraint (off-chip bandwidth traffic in MByte/image against the on-chip memory constraint in KByte, comparing this work, DATE'15 [7], FPGA'15 [6] and the theoretical optimum).

Table III. Experimental results with different methods and configurations

Metric            | A (this work)              | B                 | C
TileD (b,i,o,x,y) | (4,8,128,13,13)            | (2,64,64,13,13)   | (1,16,32,13,13)
TileS             | (b,x,y,i,o) = (1,1,1,8,64) | Para(x=8, y=64)   | Para(x=16, y=32)
BW(Dram)          | 1.8 MB/image               | 4.19 MB/image     | 10.5 MB/image
BW(Sram)          | 26.8 MB/image              | 624.0 MB/image    | 625.1 MB/image
Active # of MAC   | 512                        | 512               | 512
Active SRAM       | 3.3 Mb                     | 2.8 Mb            | 435.7 Kb
EDP               | 1x                         | 8.7x              | 11.1x
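The memory-centric EDP cost of Sec. III.C, which ranks the candidates compared in Table III, can be sketched as follows. The energy weights 200 (DRAM) and 6 (SRAM) are the relative access costs cited from [4]; the function name and argument layout are our own:

```python
from math import ceil, prod

# Sketch of the EDP cost (Sec. III.C). B, tile_d and tile_s are tuples of the
# five loop ranges/tile sizes; active_pe is the number of PEs actually busy
# for the given TileS.
def edp_cost(bw_dram, bw_sram_per_tile, B, tile_d, tile_s, active_pe):
    n_tile_d = prod(ceil(b / t) for b, t in zip(B, tile_d))
    n_tile_s = prod(ceil(td / ts) for td, ts in zip(tile_d, tile_s))
    energy = 200 * bw_dram + 6 * n_tile_d * bw_sram_per_tile
    delay = (n_tile_d * n_tile_s) / active_pe  # delay per TileS = 1/ActivePE
    return energy * delay
```

Because delay divides by the active rather than the available PEs, a tiling that leaves PEs idle (the 75% example above) is penalized even if its bandwidth is low.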
IV. EXPERIMENTAL RESULTS

A. Evaluation of the Cross-loop Model for DRAM Bandwidth

The DRAM bandwidth requirement is the most expensive part of a CNN accelerator. Fig. 6 evaluates the performance of our cross-loop model. The test bench is the 3rd layer of AlexNet, with a loop range of (4,256,384,13,13) for (b,i,o,x,y) as used in [4], since it is the most deeply nested layer of AlexNet.

The opt. line in Fig. 6 marks the theoretical minimum DRAM bandwidth, assuming an infinite on-chip memory. Our cross-loop model achieves 11-21% lower bandwidth than the inter-tile model [7], as we consider reuse possibilities across control loops. In general, 1 MB of on-chip memory suffices to reduce the data movements to the theoretical minimum.

B. Evaluation of the Proposed Scheduling Method

We conduct several experiments to show how this approach finds an efficient dataflow while fully exploiting the on-chip memories. As in [6][9], a Virtex-7 485t is chosen as the platform, with 2800 DSPs (equivalent to 560 32-bit multiplier-accumulators/MACs) and 37 Mb of on-chip memory. We use these constraints for fair comparison and evaluate three scheduling methods in Table III: (A) scheduling by this work; (B) two-level nested loop transformation with the simple model of [6]; (C) scheduling by [6]. Notice that in B/C, computing parallelism over the subscripts x and y is considered.

Table III shows that our approach A needs much fewer memory accesses than B/C with their different tiling parameters. In C, a solution is usually constrained by the available computing resources while the memories are not fully exploited: even though 37 Mb of memory is available, only 436 Kb is used. We therefore run B with our proposed two-level nested loop transformation to remove this issue, as discussed in Sec. III.D. B can use much more memory and reduces the redundant DRAM bandwidth requirement by 60% compared to C. This partly explains why half of the memory resources are left unused in [6]. In A, we further enable the optimization schemes with the cross-loop model and multiple parallelisms to exploit the data locality beyond B. The scheduling produces the optimal parallelism shown in TileD and TileS. This solution reduces 57.0% of the DRAM accesses and 95.7% of the SRAM accesses, improving the energy-delay product by at least 8.7 times.

V. CONCLUSION

This paper provides an energy-efficient scheduling approach for finding an optimal dataflow that minimizes data movements from memories. Based on an understanding of hardware behaviors, we propose several optimization schemes, including the two-level nested loop transformation and the cross-loop model, and introduce the energy-delay product as the cost function. These schemes generate more efficient dataflows and improve the energy efficiency by at least 8.7 times.

ACKNOWLEDGMENT

This work is partly supported by KAKENHI (26420323) and by the Waseda University Graduate Program for Embodiment Informatics (FY2013-FY2019).

REFERENCES

[1] A. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 25, pp. 1097-1105, 2012.
[2] K. Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, pp. 1-14, Sep. 2014.
[3] K. He et al., "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770-778, 2016.
[4] Y.-H. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in Proc. 43rd Annu. Int. Symp. Comput. Archit., pp. 367-379, June 2016.
[5] S. Wang et al., "Chain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks," in Proc. Design, Autom. Test Eur. Conf. Exhibit., 2017.
[6] C. Zhang et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 161-170, 2015.
[7] M. Peemen et al., "Inter-tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators," in Proc. Design, Autom. Test Eur. Conf. Exhibit., pp. 169-174, 2015.
[8] T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," in Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst., pp. 269-284, 2014.
[9] M. Motamedi et al., "Design Space Exploration of FPGA-based Deep Convolutional Neural Networks," in Proc. 21st Asia and South Pacific Design Autom. Conf., pp. 575-580, 2016.
