Energy-Efficient Scheduling Method With Cross-Loop Model For Resource-Limited CNN Accelerator Designs
We define the nested loops as control loops with three attributes: order, range and tile. The order is the iteration order of the control loops, starting from the outermost loop, and the range is the trip count of each loop, as shown in Fig. 1. A tile is a subset of the computations (marked in blue) and is introduced in Sec. II.B.

B. Design Challenges and Previous Works

A general FPGA/ASIC-based CNN accelerator system with limited on-chip resource budgets is shown in Fig. 2. The memory resource constraint limits the maximum available on-chip SRAM, which is used to exploit data locality and reduce DRAM accesses for energy saving (200x versus 6x energy cost according to [4]). Meanwhile, the computing resource constraint limits the maximum number of PEs working in parallel, which directly affects the throughput of the CNN accelerator.

The design challenge is to find an energy-efficient accelerator under these two constraints. In particular, since FPGA/ASIC designs can easily stack PEs for high parallelism, the remaining difficulty is finding a dataflow that reduces the memory bandwidth requirement, because data movement accounts for the major part of the energy cost. Loop tiling is widely employed in prior works such as [6][7][9] and is solved as a mathematical optimization problem as below (k is the kernel size):

\[
\begin{aligned}
\text{Minimize}\quad & \mathit{DramBWModel}(T_b, T_n, T_m, T_x, T_y) \\
\text{while}\quad & T_b T_n (T_x + k - 1)(T_y + k - 1) + T_b T_m T_x T_y + T_n T_m \le \mathit{MemoryResource} \\
& T_b T_n T_m T_x T_y \le \mathit{ComputingResource}
\end{aligned}
\]

Although this method can find an efficient dataflow, a better solution can be found with hardware-oriented considerations. Firstly, the above constraints regard the on-chip SRAM and the processor core as an indivisible system, which leads to a locally optimal result. Secondly, existing models overestimate the DRAM bandwidth because they cannot fully exploit the on-chip SRAM. Thirdly, energy efficiency is affected not only by the DRAM bandwidth but also by the SRAM accesses and the processing time. These issues lead previous works to locally optimal results, and the challenge is to bring these hardware-oriented considerations into our design.

III. HARDWARE-ORIENTED SCHEDULING METHOD

This section presents a detailed discussion of the proposed scheduling methodology for solving the three challenges listed at the end of the last section, and summarizes the optimizations from a hardware view in Sec. III.D.

A. Framework of Scheduling Methodology

Two-level nested loop transformations form the framework of our proposed scheduling algorithm, as shown in Fig. 3. We define TileD and TileS to represent the candidates of each loop transformation level. TileD is generated under the memory resource constraint, while TileS is limited by the computing resource constraint. Instead of combining the two constraints, we separate them to enlarge the exploration space. We first guarantee that the DRAM bandwidth is minimized with the available memory resources, because of its huge energy cost. Then, we explore a dataflow within TileD that minimizes the SRAM bandwidth while fully utilizing the computing resources.

Notice that the loop transformations contain not only loop tiling but also loop reordering. All possible combinations of the order and tile attributes are considered. Unlike a predefined order for the control loops, traversing the possible orders can fully exploit the data locality in CNN algorithms and, together with the proposed cross-loop model in Sec. III.B, further reduce the memory bandwidth.

B. Cross-loop Model of Memory Bandwidth

Given the order and tile attributes, many works [6][9] use a simple model to estimate the memory bandwidth. This model is the product of the bandwidth per tile and the number of tiles:

\[
\mathit{BWModel} = (\#\mathit{Tiles}) \times \mathit{BW}/\mathit{Tile}
= \prod \left\lceil \frac{B}{\mathit{Tile}} \right\rceil \times \mathit{BW}/\mathit{Tile}
= \left\lceil \frac{B_b}{T_b} \right\rceil \left\lceil \frac{B_n}{T_n} \right\rceil \left\lceil \frac{B_m}{T_m} \right\rceil \left\lceil \frac{B_x}{T_x} \right\rceil \left\lceil \frac{B_y}{T_y} \right\rceil \times \mathit{BW}/\mathit{Tile} \tag{1}
\]

where:
ifmaps: \( \mathit{BW}/\mathit{Tile} = T_b T_n (T_x + k - 1)(T_y + k - 1) \)
ofmaps: \( \mathit{BW}/\mathit{Tile} = 2 \times T_b T_m T_x T_y \)
kernel: \( \mathit{BW}/\mathit{Tile} = T_n T_m \)

Here B and T with a loop subscript denote the range and the tile size of that control loop; the subscripts b, n, m, x and y index the batch, input-channel, output-channel and the two spatial loops.

We claim that this model may overestimate the bandwidth requirement because it ignores the data locality among tiles. Although [7] proposed an inter-tile model that exploits the locality between adjacent tiles, there is still potential for data reuse across outer control loops.

The cross-loop model is proposed to provide an accurate estimation so as to fully exploit the caching ability of the on-chip memories. This model starts with the simple model and keeps removing the overlaps between the operands of a new tile and the data already buffered in the on-chip SRAM. The removal is based on the concept of Reuse Potential (RP) defined in Table I. A reuse potential is assigned to every subscript of each operand and takes one of three values: none, half or full. It refers to how

Table I.

The locality exists across the whole range B. Instead of considering this control loop as divided into many tiles, we can rewrite the bandwidth by imagining that this control loop is just one tile.
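The simple model of Eq. (1) and the "one tile" rewrite described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the full cross-loop model, whose reuse rules are given in Tables I and II. The loop names b, n, m, x, y, the function names, and the example ranges and tile sizes are all assumptions made for illustration. The choice of the kernel operand and the batch loop in the usage example rests on the assumption that kernel weights do not depend on the batch index.

```python
from math import ceil, prod

# Assumed loop names: batch, input channel, output channel, and two spatial loops.
LOOPS = "bnmxy"

def num_tiles(ranges, tile):
    """Number of tiles: product of ceil(B_d / T_d) over all control loops."""
    return prod(ceil(ranges[d] / tile[d]) for d in LOOPS)

def bw_per_tile(tile, k):
    """Per-tile traffic of each operand, following the 'where' clauses of Eq. (1)."""
    b, n, m, x, y = (tile[d] for d in LOOPS)
    return {
        "ifmaps": b * n * (x + k - 1) * (y + k - 1),
        "ofmaps": 2 * b * m * x * y,  # factor 2 as written in Eq. (1)
        "kernel": n * m,
    }

def simple_bw(ranges, tile, k):
    """Simple model: (#Tiles) x BW/Tile for every operand."""
    count = num_tiles(ranges, tile)
    return {op: count * bw for op, bw in bw_per_tile(tile, k).items()}

def bw_loop_as_one_tile(ranges, tile, k, operand, loop):
    """Hypothetical helper: re-evaluate the simple model for one operand
    while treating `loop` as a single tile (T_loop = B_loop)."""
    merged = dict(tile)
    merged[loop] = ranges[loop]
    return num_tiles(ranges, merged) * bw_per_tile(merged, k)[operand]

# Made-up layer shape and tile sizes, for illustration only.
ranges = dict(b=4, n=8, m=16, x=14, y=14)
tile = dict(b=2, n=4, m=8, x=7, y=7)

print(simple_bw(ranges, tile, k=3))
# {'ifmaps': 20736, 'ofmaps': 50176, 'kernel': 1024}
print(bw_loop_as_one_tile(ranges, tile, 3, "kernel", "b"))
# 512
```

In this example the kernel traffic halves, because the factor ceil(B_b / T_b) = 2 drops out of the tile count once the batch loop is treated as one tile. Whether such a rewrite is valid for a given operand and loop depends on the loop order and on what is still buffered in the SRAM, which this sketch does not model.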
Table II. Removal of overlapped data in cross-loop model