Chao Li · Junjie Wu (Eds.)

Communications in Computer and Information Science 908

Advanced
Computer Architecture
12th Conference, ACA 2018
Yingkou, China, August 10–11, 2018
Proceedings

Communications
in Computer and Information Science 908
Commenced Publication in 2007
Founding and Former Series Editors:
Phoebe Chen, Alfredo Cuzzocrea, Xiaoyong Du, Orhun Kara, Ting Liu,
Dominik Ślęzak, and Xiaokang Yang

Editorial Board
Simone Diniz Junqueira Barbosa
Pontifical Catholic University of Rio de Janeiro (PUC-Rio),
Rio de Janeiro, Brazil
Joaquim Filipe
Polytechnic Institute of Setúbal, Setúbal, Portugal
Igor Kotenko
St. Petersburg Institute for Informatics and Automation of the Russian
Academy of Sciences, St. Petersburg, Russia
Krishna M. Sivalingam
Indian Institute of Technology Madras, Chennai, India
Takashi Washio
Osaka University, Osaka, Japan
Junsong Yuan
University at Buffalo, The State University of New York, Buffalo, USA
Lizhu Zhou
Tsinghua University, Beijing, China
More information about this series at http://www.springer.com/series/7899
Chao Li Junjie Wu (Eds.)

Advanced
Computer Architecture
12th Conference, ACA 2018
Yingkou, China, August 10–11, 2018
Proceedings

Editors
Chao Li
Shanghai Jiao Tong University
Shanghai, China

Junjie Wu
National University of Defense Technology
Changsha, China

ISSN 1865-0929 ISSN 1865-0937 (electronic)


Communications in Computer and Information Science
ISBN 978-981-13-2422-2 ISBN 978-981-13-2423-9 (eBook)
https://doi.org/10.1007/978-981-13-2423-9

Library of Congress Control Number: 2018954068

© Springer Nature Singapore Pte Ltd. 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

It is a great pleasure and honor to present the proceedings of ACA 2018, the 12th
Conference on Advanced Computer Architecture. ACA is sponsored by the China
Computer Federation (CCF) and it is the flagship conference of the CCF Technical
Committee on Computer Architecture (TCArch). It has been one of the most important
academic conferences in the field of computer architecture in China since 1995.
The 2018 edition of ACA was held in the scenic area of Yingkou, a port city on the
Bohai Sea. The theme this year was “Intelligent Architecture: From the Cloud to the
Edge.” ACA 2018 created a forum for academic researchers and industry practitioners
in China to share their insights on next-generation computing systems. We continued
the trend of making ACA an inclusive and interactive event that features invited
keynotes, top-paper presentations, a poster showcase, a design competition, and more.
This year, we received over 120 paper registrations, of which 80 resulted in complete
submissions. Each submission was reviewed by three Program Committee (PC) members
on average. In all, 13 papers were rejected immediately in the first round of review
and 67 papers were sent out for a second round of review. Only papers with an average
score of at least 3 (borderline) were considered for final inclusion, and almost all
accepted papers had positive reviews or at least one review with a score of 5 (accept)
or higher. In the end, the PC decided to accept 47 submissions, including 17 papers
in English and 30 in Chinese. We asked the authors of all accepted papers to submit
a revised version based on the review reports.
This program would not have been possible without the efforts of the PC, the
external reviewers, and the authors. We would like to express our gratitude to all the
authors who submitted their papers. We would like to convey our deepest and sincerest
appreciation for all the hard work and dedication of our PC members and external
reviewers. We also gratefully acknowledge the kind support from our general chair,
Prof. Yong Dou, organization chair, Prof. Kuanjiu Zhou, and our Steering Committee.
Our thanks also go to the China Computer Federation (CCF), Technical Committee on
Computer Architecture of CCF, Dalian University of Technology, the City of Yingkou,
Xilinx, Baidu, and all the other institutes that kindly helped us. Finally, we greatly
appreciate the steady support provided by Springer.

August 2018 Chao Li


Junjie Wu
Organization

General Chair
Yong Dou National University of Defense Technology, China

Organization Chair
Kuanjiu Zhou Dalian University of Technology, China

Program Chair
Chao Li Shanghai Jiao Tong University, China

Steering Committee
Zhenzhou Ji Harbin Institute of Technology, China
Chenggang Wu Institute of Computing Technology, CAS, China
Dongsheng Wang Tsinghua University, China
Junjie Wu National University of Defense Technology, China
Xingwei Wang Northeastern University, China
Gongxuan Zhang Nanjing University of Science and Technology, China

Program Committee
Quan Chen Shanghai Jiao Tong University, China
Zidong Du Institute of Computing Technology, CAS, China
Binzhang Fu Huawei
Yu Hua Huazhong University of Science and Technology, China
Weixing Ji Beijing Institute of Technology, China
Jingwen Leng Shanghai Jiao Tong University, China
Dongsheng Li National University of Defense Technology, China
Duo Liu Chongqing University, China
Yuhang Liu Institute of Computing Technology, CAS, China
Youyou Lu Tsinghua University, China
Guojie Luo Beijing University, China
Bo Mao Xiamen University, China
Songwen Pei University of Shanghai for Science and Technology, China
Minghua Shen Sun Yat-sen University, China
Wei Song Institute of Information Engineering, CAS, China
Guangyu Sun Beijing University, China
Jing Wang Capital Normal University, China
Lei Wang National University of Defense Technology, China

Ying Wang Institute of Computing Technology, CAS, China


Junjie Wu National University of Defense Technology, China
Yubing Xia Shanghai Jiao Tong University, China
Zichen Xu Nanchang University, China
Fengyuan Xu Nanjing University, China
Hailong Yang Beihang University, China
Zhibin Yu Shenzhen Institute of Advanced Technology, China
Jingling Yuan Wuhan University of Technology, China
Fengkai Yuan Institute of Information Technology, CAS, China
Jidong Zhai Tsinghua University, China
Weihua Zhang Fudan University, China
Long Zheng Huazhong University of Technology, China
Wenli Zheng Shanghai Jiao Tong University, China
Junlong Zhou Nanjing University of Science and Technology, China
Bo Wu Colorado School of Mines, USA
Hongwen Dai Apple Inc., USA
Lizhong Chen Oregon State University, USA
Ruijin Zhou VMware, USA
Shaolei Ren University of California, Riverside, USA
Yakun Shao NVIDIA Research, USA
Xiaoyi Lu Ohio State University, USA
Xuehai Qian University of Southern California, USA
Yang Hu University of Texas at Dallas, USA
Yanqi Zhou Baidu Silicon Valley AI Lab, USA

Additional Reviewers
Qiang Cao Huazhong University of Technology, China
Li Jiang Shanghai Jiao Tong University, China
Naifeng Jing Shanghai Jiao Tong University, China
Cheng Li University of Science and Technology of China
Tao Li Nankai University, China
Yao Shen Shanghai Jiao Tong University, China
Shuang Song University of Texas at Austin, USA
Rui Wang Beihang University, China
Chentao Wu Shanghai Jiao Tong University, China
Qiaosha Zhou Zhejiang Sci-Tech University, China
Contents

Accelerators

A Scalable FPGA Accelerator for Convolutional Neural Networks . . . . . . . . 3
Ke Xu, Xiaoyun Wang, Shihang Fu, and Dong Wang

Memory Bandwidth and Energy Efficiency Optimization of Deep
Convolutional Neural Network Accelerators . . . . . . . . . . . . . . . . . . . . . . . . 15
Zikai Nie, Zhisheng Li, Lei Wang, Shasha Guo, and Qiang Dou

Research on Parallel Acceleration for Deep Learning Inference Based
on Many-Core ARM Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Keqian Zhu and Jingfei Jiang

Research on Acceleration Method of Speech Recognition Training . . . . . . . . 42
Liang Bai, Jingfei Jiang, and Yong Dou

New Design Explorations

A Post-link Prefetching Based on Event Sampling. . . . . . . . . . . . . . . . . . . . 53
Hongmei Wei, Fei Wang, and Zhongsheng Li

The Design of Reconfigurable Instruction Set Processor Based
on ARM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Jinyong Yin, Zhenpeng Xu, Xinmo Fang, and Xihao Zhou

Stateful Forward-Edge CFI Enforcement with Intel MPX . . . . . . . . . . . . . . . 79
Jun Zhang, Rui Hou, Wei Song, Zhiyuan Zhan, Boyan Zhao,
Mingyu Chen, and Dan Meng

Analytical Two-Level Near Threshold Cache Exploration for Low
Power Biomedical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Yun Liang, Shuo Wang, Tulika Mitra, and Yajun Ha

DearDRAM: Discard Weak Rows for Reducing DRAM’s
Refresh Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Xusheng Zhan, Yungang Bao, and Ninghui Sun

Towards Efficient ML/AI

EffectFace: A Fast and Efficient Deep Neural Network Model for Face
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Weicheng Li, Dan Jia, Jia Zhai, Jihong Cai, Han Zhang, Lianyi Zhang,
Hailong Yang, Depei Qian, and Rui Wang

A Power Efficient Hardware Implementation of the IF Neuron Model . . . . . . 140
Shuquan Wang, Shasha Guo, Lei Wang, Nan Li, Zikai Nie, Yu Deng,
Qiang Dou, and Weixia Xu

paraSNF: An Parallel Approach for Large-Scale Similarity Network Fusion . . . 155
Xiaolong Shen, Song He, Minquan Fang, Yuqi Wen, Xiaochen Bo,
and Yong Dou

An Experimental Perspective for Computation-Efficient Neural
Networks Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Lujia Yin, Xiaotao Chen, Zheng Qin, Zhaoning Zhang, Jinghua Feng,
and Dongsheng Li

Parallel Computing System

Distributed Data Load Balancing for Scalable Key-Value Cache Systems. . . . 181
Shanshan Chen, Xudong Zhou, Guiping Zhou, and Richard O. Sinnott

Performance Analysis and Optimization of Cryo-EM Structure
Determination in RELION-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Xin You, Hailong Yang, Zhongzhi Luan, and Depei Qian

The Checkpoint-Timing for Backward Fault-Tolerant Schemes . . . . . . . . . . . 210
Min Zhang

Quota-constrained Job Submission Behavior at Commercial Supercomputer . . . 219
Jinghua Feng, Guangming Liu, Zhiwei Zhang, Tao Li, Yuqi Li,
and Fuxing Sun

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233


Accelerators
A Scalable FPGA Accelerator for
Convolutional Neural Networks

Ke Xu1,2, Xiaoyun Wang1,2, Shihang Fu1,2, and Dong Wang1,2(B)

1 Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
2 Beijing Key Laboratory of Advanced Information Science and Network Technology,
Beijing 100044, China
{17112071,16120304,17125155,wangdong}@bjtu.edu.cn

Abstract. Convolutional Neural Networks (CNNs) have achieved undisputed success
in many practical applications, such as image classification, face detection, and
speech recognition. As is well known, FPGA-based CNN prediction is more efficient
than GPU-based schemes, especially in terms of power consumption. In addition,
OpenCL-based high-level synthesis tools for FPGAs are widely utilized due to their
fast verification and implementation flows. In this paper, we propose an FPGA
accelerator with a scalable architecture of deeply pipelined OpenCL kernels. The
design is verified by implementing three representative large-scale CNNs, AlexNet,
VGG-16, and ResNet-50, on an Altera OpenCL DE5-Net FPGA board. Our design achieves
a peak performance of 141 GOPS for convolution operations, and 103 GOPS for the
entire VGG-16 network performing ImageNet classification on the DE5-Net board.

Keywords: FPGA · OpenCL · Convolution Neural Networks · Optimization

1 Introduction

Convolutional Neural Networks (CNNs) are a widely used class of algorithms in the
field of artificial intelligence. They have achieved great success in image
classification [1], object detection [2], and speech recognition [3]. In the past
decade, CNNs have significantly improved the accuracy and performance of image
classification, mainly due to the continuous improvement of data sets and the
successive enhancement of neural network structures. Because CNNs are
compute-intensive, GPUs are now widely used to train them. However, GPUs, with
their high power dissipation, are not the best choice at the deployment stage of
CNNs. FPGA-based hardware accelerators, which provide massive processing elements,
reconfigurable interconnections, and lower power dissipation, are naturally suitable
for implementing neural network circuits.
The traditional FPGA development method uses a hardware description language (HDL).
The work of [7,8] proposes efficient CNN accelerators on embedded FPGA platforms.
However, traditional register-transfer-level (RTL) design flows take a lot of time
to simulate and compile before the hardware accelerators can actually run. With the
development of FPGA high-level synthesis (HLS) tools, high-level programming
languages (C/C++) are used to replace low-level HDL, which improves the speed of
FPGA implementation and verification flows, greatly shortens the development cycle,
and brings great convenience to FPGA design. In recent years, the use of HLS to
design CNN architectures has continued to grow. The work of [9] uses the Vivado-HLS
tool on a Xilinx VC707 FPGA board; however, only the convolution layers of
AlexNet [1] are implemented. In [10], the authors present a systematic methodology
for maximizing the throughput of an FPGA-based accelerator. In that work, an entire
CNN model is implemented, consisting of all CNN layer types: convolution,
normalization, pooling, and classification layers. However, the scalability of that
accelerator architecture is only demonstrated on networks such as AlexNet and
VGG [4]; feedforward neural networks with shortcut connections, such as ResNet [5],
do not work. The main contributions of this work are:

(1) We propose an FPGA accelerator with a scalable architecture of deeply pipelined
OpenCL kernels;
(2) The design is verified by implementing three representative large-scale CNNs:
AlexNet, VGG-16, and ResNet-50;
(3) The design space of the proposed architecture is fully explored on a Stratix-V
A7 FPGA.

2 Background

2.1 Classic Convolution Neural Network

AlexNet. AlexNet achieved record-breaking object recognition results on the ImageNet
challenge in 2012. It consists of eight layers in total, 5 convolutional and 3 fully
connected, as depicted in Fig. 1. The 3-dimensional (3-D) convolution operation can
be defined by

D_o(f_o, y, x) = \sum_{f_i=1}^{C_l} \sum_{k_y=0}^{K-1} \sum_{k_x=0}^{K-1} W_l(f_o, f_i, k_y, k_x) \cdot D_i(f_i, y + k_y, x + k_x)    (1)

where D_i(f_i, y, x) and D_o(f_o, y, x) denote the neurons at position (x, y) in
input feature map f_i and output feature map f_o, respectively. W_l(f_o, f_i, k_y, k_x)
represents the corresponding weights in the l-th layer that are convolved with f_i.
The size of the convolution filters is K × K, while the total number of input feature
maps is C_l. In addition, AlexNet uses the ReLU nonlinearity instead of saturating
nonlinearities such as sigmoids, and employs dropout during training and Local
Response Normalization (LRN) to reduce overfitting.
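For a concrete reference point, the following plain-C sketch computes Eq. (1) directly, assuming a stride of 1 and no padding. It is only an illustrative software model; the function and array names are ours, not part of the accelerator.

```c
/* Reference software implementation of the 3-D convolution in Eq. (1).
 * Illustrative sketch only: stride 1, no padding, and the layout of the
 * arrays are assumptions made to keep the example minimal. */
#include <stddef.h>

/* Di: Cl x Hin x Win input fmaps; Wl: Fout x Cl x K x K filters;
 * Do: Fout x Hout x Wout output fmaps, with Hout = Hin - K + 1. */
void conv3d_reference(const float *Di, const float *Wl, float *Do,
                      int Cl, int Hin, int Win, int Fout, int K)
{
    int Hout = Hin - K + 1, Wout = Win - K + 1;
    for (int fo = 0; fo < Fout; ++fo)
        for (int y = 0; y < Hout; ++y)
            for (int x = 0; x < Wout; ++x) {
                float acc = 0.0f;
                for (int fi = 0; fi < Cl; ++fi)          /* sum over input channels */
                    for (int ky = 0; ky < K; ++ky)       /* sum over filter rows    */
                        for (int kx = 0; kx < K; ++kx)   /* sum over filter columns */
                            acc += Wl[((fo * Cl + fi) * K + ky) * K + kx] *
                                   Di[(fi * Hin + (y + ky)) * Win + (x + kx)];
                Do[(fo * Hout + y) * Wout + x] = acc;
            }
}
```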

Fig. 1. AlexNet architecture. Figure reproduced from [1]

VGG. VGG achieves its depth by simply stacking more layers while following the
standard practices introduced with AlexNet. The sizes of its convolution kernels are
more regular: AlexNet uses 11 × 11, 5 × 5, and 3 × 3 filters, whereas VGG only uses
3 × 3 filters throughout the entire network. Notably, while using smaller filters,
VGG requires far more filters per layer, so the amount of computation and the number
of parameters of VGG are much larger than those of AlexNet.

ResNet. Deeper neural networks are more difficult to train, so in [5] the authors
proposed a residual learning framework that reduces the vanishing-gradient problem
and eases the training of deep networks. Residual learning mainly relies on shortcut
connections, illustrated in Fig. 2, that connect components of different layers with
an identity mapping [6]. In particular, ResNet is built such that each layer learns
an incremental transformation, F(x), on top of the input, x, according to

H(x) = F(x) + x    (2)

instead of learning the transformation H(x) directly as done in other standard
CNN architectures.

Fig. 2. Schematic of residual learning. Figure reproduced from [5]
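The shortcut connection of Eq. (2) boils down to an element-wise addition, which is essentially what the Eltwise kernel described later in Sect. 3.1 implements in hardware. The function below is only a hedged sketch; the names and the use of float data are our assumptions.

```c
/* Minimal sketch of the residual (shortcut) connection H(x) = F(x) + x. */
void eltwise_add(const float *Fx,   /* F(x): output of the residual branch */
                 const float *x,    /* identity shortcut                   */
                 float *Hx,         /* H(x) = F(x) + x                     */
                 int n)             /* number of elements in the fmap      */
{
    for (int i = 0; i < n; ++i)
        Hx[i] = Fx[i] + x[i];
}
```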

2.2 OpenCL Framework on FPGA


OpenCL is an open, cross-platform parallel programming language that can be used in
CPU, DSP, GPU, and FPGA development. Recently, FPGA vendors such as Xilinx and Intel
have released OpenCL SDKs for programming FPGAs. The Intel OpenCL environment, in
which code can be a mixture of C, C++, and OpenCL, provides a complete CPU/GPU-like
development and run-time experience on a CPU/FPGA platform, including a complete
software workflow spanning multiple target devices, with x86 emulation and
cycle-accurate FPGA hardware models.

3 Architecture Design and Optimization


3.1 Accelerator Architecture

As shown in Fig. 3, our OpenCL-based FPGA design consists of a group of OpenCL
kernels that are cascaded by using Altera's OpenCL extension, Channels. Two data
mover kernels, namely MemRD and MemWR, transfer feature map and weight data from/to
the global memory, feeding the other kernels with high-throughput data streams. The
cascaded kernels form a deep computation pipeline that can implement a series of
basic CNN operations without the need to store interlayer data back to global
memory, which significantly reduces the bandwidth requirement compared to the work
of [10]. The Convolution kernel is designed to implement both the convolution layer
and the fully connected layer, which are the most compute-intensive operations in
CNNs. The Pooling kernel is controlled by the synchronization signal of the MemWR
kernel: when the synchronization signal is set to one, the pooling operation is
performed. This technique is mainly used to achieve overlap between the two kernels.
The BatchNorm kernel, used in [5], loads the mean, variance, α, and β from global
memory and performs normalization directly on the output data streams of the
Convolution kernel. The Local Response Normalization (LRN) kernel, used in [1],
fetches data from global memory and performs normalization on the feature maps of
neighboring neurons in the depth direction. The Eltwise kernel, which maps the
Eltwise layer used in [5], loads data from global memory and adds the elements
pairwise, mainly to implement shortcut connections.
This architecture has the following advantages:

(1) The cascaded and overlapped kernels form a deep pipeline architecture.
(2) A single hardware kernel implements both the convolution and fully connected
layers.
(3) The hardware structure is scalable and implements many classic CNN operations,
such as the LRN kernel for AlexNet, and the BatchNorm and Eltwise kernels for ResNet.
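The channel-based cascading described above can be sketched in OpenCL as follows. This is an illustration only: the kernel and channel names, the float8 data type, the placeholder computation, and the channel depth are assumptions, and the extension/intrinsic spelling (cl_intel_channels / read_channel_intel vs. the older cl_altera_channels / read_channel_altera) depends on the SDK version.

```c
/* Two cascaded kernels connected by a channel, so intermediate data never
 * returns to global memory between them. */
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float8 conv_out_ch __attribute__((depth(64)));   /* Conv -> MemWR */

__kernel void conv_kernel(__global const float8 *restrict ifmap,
                          __global const float8 *restrict weights,
                          int n_results)
{
    for (int i = 0; i < n_results; ++i) {
        float8 result = ifmap[i] * weights[i];      /* placeholder compute      */
        write_channel_intel(conv_out_ch, result);   /* stream to the next stage */
    }
}

__kernel void memwr_kernel(__global float8 *restrict ofmap, int n_results)
{
    for (int i = 0; i < n_results; ++i)
        ofmap[i] = read_channel_intel(conv_out_ch); /* drain the pipeline */
}
```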

Convolution Kernel. A single work-item kernel with parallel convolution data paths
is designed to implement the functions of both the convolution and FC layers. In
this paper, we propose to flatten the 3-D convolution operation into a 1-D
convolution operation and integrate it with the fully connected operation as follows:

D_o(f_o) = \sum_{f_i=1}^{C_l \times K \times K} W_l(f_o, f_i) \cdot D_i(f_i)    (3)

Fig. 3. The top-level architecture of the CNN accelerator.

In this way, both data vectorization and a parallel CU structure are explored in the
design. Vectorized input features D_i and weights W_l are streamed through multiple
channels. A design parameter VEC_SIZE determines the degree of data vectorization
and controls the input throughput. Another design parameter used to accelerate the
convolution operation, CU_NUM, represents the parallelism factor for weights and the
reuse factor for data. Because the loops are efficiently pipelined by the OpenCL
compiler, we propose an efficient convolution pipeline structure consisting of a
multiplier-adder tree with a delayed buffer, as shown in Fig. 4.

Fig. 4. The hardware architecture of the convolution kernel.
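A minimal single work-item sketch of the flattened convolution of Eq. (3) might look as follows, with VEC_SIZE-wide vectorization along the input depth and CU_NUM replicated compute units sharing the same data stream. The macro values, channel names, and the one-output-pixel-per-call simplification are assumptions for illustration, not the authors' exact kernel.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define VEC_SIZE 8       /* assumed degree of data vectorization      */
#define CU_NUM   16      /* assumed number of parallel compute units  */

typedef struct { float v[VEC_SIZE]; } vec_t;

channel vec_t data_ch;                 /* vectorized ifmap pixels from MemRD */
channel vec_t weight_ch[CU_NUM];       /* one weight stream per compute unit */
channel float result_ch[CU_NUM];       /* one output per compute unit        */

__kernel void conv_kernel(int dot_len  /* = Cl*K*K / VEC_SIZE */)
{
    float acc[CU_NUM];
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; ++cu) acc[cu] = 0.0f;

    for (int i = 0; i < dot_len; ++i) {
        vec_t d = read_channel_intel(data_ch);       /* data reused by all CUs */
        #pragma unroll
        for (int cu = 0; cu < CU_NUM; ++cu) {        /* parallel over filters  */
            vec_t w = read_channel_intel(weight_ch[cu]);
            float sum = 0.0f;
            #pragma unroll
            for (int k = 0; k < VEC_SIZE; ++k)       /* multiplier-adder tree  */
                sum += d.v[k] * w.v[k];
            acc[cu] += sum;                          /* delayed accumulation   */
        }
    }
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; ++cu)              /* one pixel per filter   */
        write_channel_intel(result_ch[cu], acc[cu]);
}
```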



Data Mover Kernel. Two multi-mode single work-item kernels are designed to
fetch/store data from/to the global memory for the computation pipelines. The MemRD
kernel, whose detailed scheme is given in [11], fetches data from global memory in
either convolution mode or FC mode. We propose a design parameter FT_NUM to
determine the size of the local memory, which in turn influences the reuse of input
data. The MemWR kernel is mainly used to receive the output of the convolution
kernel through the channel and arrange it into the storage structure required for
the next convolution or pooling operation. In convolution mode, the data received
from the channel is arranged with a depth of CU_NUM, and the MemWR kernel needs to
divide this depth into VEC_SIZE-wide pieces and write them back to global memory.
In pooling mode, the kernel simply transfers the data received from the channel
directly back to global memory. In pooling mode, the MemWR kernel also needs to send
the synchronization signal to the pooling kernel at the right time so that the two
kernels can overlap their work. Note that all memory operations should be committed
before the token is sent to the pooling kernel. The detailed MemWR scheme is
illustrated in Fig. 5.

Fig. 5. The hardware architecture of the memWR kernel.

Fig. 6. The hardware architecture of the maxpool kernel.

Pooling Kernel. A shift-register-based hardware structure is proposed for the
pooling kernel, as shown in Fig. 6. The kernel first fetches the synchronization
signal from the blocking channel; only after the synchronization signal arrives can
the pooling kernel start working. When the synchronization signal comes, the pooling
kernel reads data from global memory into the shift register. During this data
transfer, whenever the pooling point is reached, data is extracted from the shift
register into the pooling logic. Similarly, we designed a parameter PT_NUM for
adjusting the size of the local memory in the pooling kernel to exploit input data
reuse. In the pooling strategy, the first line is processed first, then the second
line is compared with the first line, and so on. The final result of the pooling
calculation is stored in a ping-pong buffer. During the pooling calculation, the
result of the previous calculation is also divided into VEC_SIZE-wide pieces and
returned to global memory for the next convolution calculation. This process is
similar to MemWR.
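The synchronization-token and shift-register ideas can be sketched as below: the kernel blocks on a token channel before it starts, then shifts pixels through a register and emits a maximum per window. The window size, register depth, and channel names are assumptions, and the real kernel additionally handles 2-D windows, PT_NUM-sized local buffers, and ping-pong output.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define POOL_K   2     /* pooling window size (assumed)                */
#define SR_DEPTH 64    /* shift-register depth, enough for one line   */

channel int   pool_sync_ch;    /* token from MemWR: data is committed  */
channel float pool_in_ch;      /* pixel stream                         */
channel float pool_out_ch;     /* pooled results                       */

__kernel void maxpool_kernel(int n_pixels)
{
    /* Blocking read: the kernel cannot start before MemWR sends the token. */
    int token = read_channel_intel(pool_sync_ch);
    (void)token;

    float sr[SR_DEPTH];
    #pragma unroll
    for (int i = 0; i < SR_DEPTH; ++i) sr[i] = -INFINITY;

    for (int p = 0; p < n_pixels; ++p) {
        /* Shift in one new pixel per iteration. */
        #pragma unroll
        for (int i = SR_DEPTH - 1; i > 0; --i) sr[i] = sr[i - 1];
        sr[0] = read_channel_intel(pool_in_ch);

        /* When a full window has arrived, extract it from the register. */
        if (p % POOL_K == POOL_K - 1) {
            float m = sr[0];
            #pragma unroll
            for (int i = 1; i < POOL_K; ++i) m = fmax(m, sr[i]);
            write_channel_intel(pool_out_ch, m);
        }
    }
}
```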

Other Kernels. Besides the most compute-intensive convolution and fully connected
kernel, we also designed some common OpenCL kernels, such as LRN, BatchNorm, and
Eltwise, for the scalability and integrity of the CNN accelerator's overall design.
In this architecture, one can choose the basic units used in a network and piece
them together to implement different network structures. For example, implementing
AlexNet only requires the convolution, pooling, and LRN kernels. Therefore, this
scalable architecture can process the complete CNN forward computation flow with
little involvement of the host CPU.

Table 1. Operations in AlexNet model

Index Layer dx dy dz wx wy wn wm GOPS


1 Conv1 227 227 3 11 11 3 96 0.281
2 Conv2 55 55 96 5 5 48 256 0.448
3 Conv3 27 27 256 3 3 256 384 0.299
4 Conv4 13 13 384 3 3 192 384 0.224
5 Conv5 13 13 384 3 3 192 256 0.032
6 FC1 6 6 256 6 6 256 4096 0.075
7 FC2 1 1 4096 1 1 4096 4096 0.034
8 FC3 1 1 4096 1 1 4096 1024 0.008
Output 1 1 1024 Total Ops 1.40

4 Design Space Exploration


In this section, we present an analytical performance model and a resource
utilization model to choose the best combination of the design parameters (VEC_SIZE,
CU_NUM, FT_NUM, PT_NUM) that maximizes the performance of the CNN accelerator while
still fitting in the limited FPGA resources.

4.1 Performance Model

Convolution and Fully Connected Time. The execution time of convolution and fully
connected layer i is modeled as follows:

\text{Conv or FC Runtime}_i = \frac{\text{No. of Conv or FC Ops}_i}{\text{VEC\_SIZE} \times \text{CU\_NUM} \times \text{Frequency}}    (4)
Table 1 gives a summary of the operations in each layer of the AlexNet model. Note
that d_x, d_y, and d_z represent the size of the output feature map of the previous
layer, not the input size of the current layer. As described in Sect. 3.1, the
convolution and fully connected operations have two levels of parallelism: one is
the degree of parallelism VEC_SIZE along the depth dimension of the input feature
map, and the other is the degree of parallelism CU_NUM over the number of
convolution filters. So the speedup ratio for the convolution kernel is
VEC_SIZE × CU_NUM. The execution times of AlexNet, VGG-16, and ResNet-50 as a
function of CU_NUM are shown in Fig. 7.
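Taking Eq. (4) literally with numbers reported elsewhere in this paper (1.40 GOP for AlexNet in Table 1, VEC_SIZE = 8 and CU_NUM = 48 from Table 3, 197.9 MHz from Table 4) predicts roughly 18.4 ms, close to the measured 18.08 ms. The small C program below simply carries out this arithmetic; treating the whole network as one operation count is a simplification of the per-layer model.

```c
/* Worked example of the runtime model in Eq. (4) using the paper's numbers. */
#include <stdio.h>

int main(void)
{
    double ops      = 1.40e9;    /* total AlexNet operations (Table 1)    */
    double vec_size = 8.0;       /* VEC_SIZE for AlexNet (Table 3)        */
    double cu_num   = 48.0;      /* CU_NUM for AlexNet (Table 3)          */
    double freq_hz  = 197.9e6;   /* achieved clock frequency (Table 4)    */

    double runtime_s = ops / (vec_size * cu_num * freq_hz);   /* Eq. (4)  */
    printf("estimated runtime: %.2f ms\n", runtime_s * 1e3);  /* ~18.42   */
    return 0;
}
```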

Other Layers Time. Due to the pipelining and overlapping in the overall hardware
design, the execution time of the other kernels is negligible relative to the
convolution and fully connected operations.

Fig. 7. Execution time empirical models for CU NUM.

Memory Bandwidth. In order to reduce the pressure on external memory bandwidth, we
use 8-bit fixed-point calculations and propose a sliding-window-based data buffering
scheme. Using fixed-point instead of floating-point calculations reduces hardware
synthesis costs and memory bandwidth requirements. Fortunately, research shows that
using 8-bit fixed-point numbers instead of full-precision floating-point numbers
causes less than 1% loss in top-1/5 accuracy for AlexNet/VGG predictions. As shown
in Fig. 8, the sliding-window-based data buffering scheme is used in the MemRD
kernel and the maxpool kernel to cache data fetched from global memory. The filter
stride S of the filter window is usually smaller than the filter size K; therefore,
a large portion of the data can be reused during the convolution and maxpool
computations. To exploit this data reuse, the MemRD kernel provides an FT_NUM
parameter and the maxpool kernel a PT_NUM parameter. These kernels fetch a window of
data that covers the area of FT_NUM or PT_NUM filters each time, and cache the data
in the on-chip buffers or shift register.

Fig. 8. The sliding-window-based data buffering scheme.
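The 8-bit fixed-point arithmetic mentioned above can be illustrated by the following sketch. The paper does not state the exact fixed-point format, so the number of fractional bits and the saturation behavior here are assumptions; the point is only that activations and weights are stored as 8-bit integers and accumulated in a wider register, which halves (or better) the off-chip traffic relative to 32-bit floats.

```c
#include <stdint.h>

#define FRAC_BITS 6   /* assumed number of fractional bits */

/* Quantize a floating-point value to signed 8-bit fixed point with saturation. */
static inline int8_t to_fixed8(float x)
{
    int32_t q = (int32_t)(x * (1 << FRAC_BITS) + (x >= 0 ? 0.5f : -0.5f));
    if (q >  127) q =  127;
    if (q < -128) q = -128;
    return (int8_t)q;
}

/* 8-bit x 8-bit multiply-accumulate with a 32-bit accumulator. */
static inline int32_t mac8(int32_t acc, int8_t a, int8_t w)
{
    return acc + (int32_t)a * (int32_t)w;
}
```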

Fig. 9. Resource utilization empirical models for CU NUM on VGG-16.

4.2 Resource Utilization Model


In this subsection, we analyze the resource utilization model on the DE5-Net board.
As discussed in Sect. 3.1, two design parameters, VEC_SIZE and CU_NUM, are used to
control the hardware cost of the CNN accelerator. Therefore, we mainly consider the
impact of these two parameters in the resource utilization model. Figure 9 shows the
model as a function of CU_NUM on VGG-16. As the parameter CU_NUM gradually
increases, both the logic element and DSP utilization models show a linear increase.
However, the on-chip memory utilization model shows a small discrepancy due to the
complexity of the load/store units.

5 Experimental Results
In this section, we present experimental results to validate the scalability of this
CNN accelerator by implementing three large-scale CNN models, AlexNet, VGG-16, and
ResNet-50, on the DE5-Net platform.

5.1 Experimental Setup

We use the DE5-Net FPGA development board from Altera and compare it with the
DE5a-Net, whose specification is listed in Table 2. The OpenCL kernel code is
compiled using Altera OpenCL SDK 16.1, and Quartus 16.1 is used as the FPGA
implementation tool. The host machine is equipped with an Intel i7-5930K CPU and
64 GB of memory. The image data are first loaded from hard disk by the host program
and then sent to the FPGA accelerator to perform the CNN forward computations.

Table 2. Comparison of FPGA accelerator boards.

Specification DE5-Net DE5a-Net


FPGA Stratix-V GXA7 Arria-10 GX1150
Logic elements 622 k 1150 k
DSP blocks 256 1518
M20K RAMs 2560 2560

5.2 Results and Discussion

In this subsection, we first list the best parameter configurations for the
different networks. Then we show the benchmark results of our CNN accelerator.
Finally, we discuss the scalability of this hardware architecture. As discussed in
Sect. 4, four design parameters, VEC_SIZE, CU_NUM, FT_NUM, and PT_NUM, are used to
control the hardware cost and throughput of the FPGA accelerator. Therefore, design
space exploration can be performed quantitatively by implementing the accelerator
with different parameter configurations. The final design variables for the three
networks optimized on the DE5-Net board are shown in Table 3.
In Table 4, we summarize the resource utilization, execution time, and performance
of the different networks with their best parameters. We can see that different
networks have different parameters and achieve different performance on the same
FPGA board. To show how much this accelerator can speed up CNN computations, we also
compare with a CPU using the Caffe deep learning framework. The CPU execution time
for AlexNet, VGG-16, and ResNet-50 is 189 ms, 1547 ms, and 1238 ms, respectively. We
can see that the FPGA-based accelerator achieves more than a 10× speedup on average
when implementing CNN-based

Table 3. Optimized parameters.

AlexNet VGG-16 ResNet-50


VEC SIZE 8 8 16
CU NUM 48 32 16
FT NUM 7 7 7
PT NUM 2 4 4

Table 4. Summary of the resource utilization, execution time and throughput on


different networks.

AlexNet VGG-16 ResNet-50


Logic elements 491.3 k 368.5 k 532.6 K
DSP blocks 236 170 256
M20K RAM 2252 1133 1537
Frequency 197.9 MHz 219.7 MHz 223.6 MHz
Execution time 18.08 ms 355.92 ms 102.97 ms
Throughput 77.5 GOPS 103 GOPS 75.7 GOPS

image classification applications. In future work, we will explore sparse
convolution algorithms and Winograd transformations to reduce the number of
computations and further improve the performance of this accelerator.

6 Conclusion
In this work, we implemented a scalable FPGA accelerator for convolutional neural
networks using the OpenCL framework. An efficient and scalable hardware architecture
with deeply pipelined kernels was presented. We proposed and explored four design
parameters under hardware cost and bandwidth constraints, and implemented three
large-scale CNNs, AlexNet, VGG-16, and ResNet-50, on the DE5-Net FPGA board.

Acknowledgment. This work was supported by NNSF of China Grants No. 61574013 and 61532005.

References
1. Krizhevsky, A., Sutskever, I., Hinton, G.E., et al.: Imagenet classification with
deep convolutional neural networks. In: Advances in Neural Information Processing
Systems, pp. 1097–1105 (2012)
2. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region pro-
posal networks. In: Advances in Neural Information Processing Systems, pp. 91–99
(2015)

3. Abdel-Hamid, O., et al.: Applying convolutional neural networks concepts to


hybrid NN-HMM model for speech recognition. In: Acoustics, Speech and Signal
Processing, pp. 4277–4280 (2012)
4. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
6. Hadji, I., Wildes, R.P.: What Do We Understand About Convolutional Networks?
arXiv preprint arXiv:1803.08834 (2018)
7. Qiu, J., Wang, J., Yao, S., et al.: Going deeper with embedded FPGA platform for
convolutional neural network. In: Proceedings of the 2016 ACM/SIGDA Interna-
tional Symposium on Field-Programmable Gate Arrays, pp. 26–35 (2016)
8. Wang, C., Gong, L., Yu, Q., Li, X., Xie, Y., Zhou, X.: DLAU: a scalable deep learn-
ing accelerator unit on FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits
Syst. 36(3), 513–517 (2017)
9. Zhang, C., et al.: Optimizing FPGA-based accelerator design for deep convolutional
neural networks. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 161–170 (2015)
10. Suda, N., et al.: Throughput-optimized OpenCL-based FPGA accelerator for large-
scale convolutional neural networks. In: Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pp. 16–25 (2016)
11. Wang, D., Xu, K., Jiang, D.: PipeCNN: an OpenCL-based open-source FPGA
accelerator for convolution neural networks. In: Field Programmable Technology
(ICFPT), pp. 279–282 (2017)
Memory Bandwidth and Energy
Efficiency Optimization of Deep
Convolutional Neural Network
Accelerators

Zikai Nie, Zhisheng Li, Lei Wang(B) , Shasha Guo, and Qiang Dou

National University of Defense Technology, Changsha, China


zikai93@163.com, leiwang@nudt.edu.cn

Abstract. Deep convolutional neural networks (DNNs) achieve state-of-the-art
accuracy, but at the cost of massive computation and memory operations. Although
highly parallel devices effectively meet the computational requirements, energy
efficiency remains a tough nut to crack.
In this paper, we present two novel computation sequences, NHWCfine and NHWCcoarse,
for DNN accelerators, and we combine the two computation sequences with appropriate
data layouts. The proposed modes enable continuous memory access patterns and reduce
the number of memory accesses, which is achieved by leveraging and transforming the
local data reuse of weights and feature maps in high-dimensional convolutions.
Experiments with various convolutional layers show that the proposed modes, made up
of computation sequences and data layouts, are more energy efficient than the
baseline mode on various networks. The reduction in total energy consumption is up
to 4.10×, and the reduction in off-chip memory access latency is up to 5.11×.

Keywords: Deep learning · Convolutional neural network · Acceleration · Memory efficiency · Data layout

1 Introduction
Deep Neural Networks (DNNs) are Machine Learning (ML) methods that can learn a
generic but effective representation of an input space from large amounts of data.
They extract high-level features from raw data using previously learned models to
infer the final data distribution.
Over the past decade, DNNs, especially deep convolutional neural networks, have
developed enormously thanks to outstanding experimental achievements in multiple
related fields, including computer vision [14], speech recognition [9], and natural
language processing [8]. In many specific situations, DNNs are now able to surpass
humans in both accuracy and speed.
The success of DNNs can be attributed to three main factors: the availability of
large-scale data sets, the development of deep structures and training
algorithms, and the utilization of highly parallel computational resources. All
three factors create a demand for high computation throughput and memory bandwidth,
which has driven the development of general-purpose and specialized accelerators
based on highly parallel platforms such as GPUs [6,7], FPGAs [12,21], and
ASICs [4,6].
After the race for accuracy, almost all modern accelerators, especially those
hosting deep networks, now pay more attention to reducing power consumption [18]. To
achieve better energy efficiency, eliminating unnecessary off-chip memory accesses
and optimizing memory access patterns are effective methods.
Previous studies have proposed methods to mitigate the memory access bottleneck,
such as dataflows [2,5], compression [10,13], and data quantization [12]. These
works achieve outstanding performance, but some of them are biased toward adjusting
data formats and processing orders at either the on-chip or the off-chip end, and
fail to consider both simultaneously. The interactions between computation sequences
and data layouts have not been considered.
Based on comparisons between different combinations of computation sequences and
data layouts, we propose two optimized computation sequences and pair them with
favorable data layouts in convolutional layers to collaboratively improve energy
efficiency.
To enhance energy efficiency, previous works like [5] focus on the distribution of
all types of data movement, such as input data reuse or partial sum accumulation, at
different levels of the memory hierarchy. Based on the fact that the data sizes and
the numbers of off-chip accesses of input feature maps and weights differ across
convolutional layers, our computation sequences, which focus on the transformation
and balance among different forms of data reuse, can deliver significant energy
savings.
The main contributions of our work include:
– A framework that can model multiple accelerator architectures and evaluate various
combinations of computation sequences and data layouts in different convolutional
layers.
– Two novel sequences of the convolutional computation, called NHWCfine and
NHWCcoarse, and their corresponding data layouts, which maximize memory access
coalescence and provide sequential memory access patterns in order to optimize the
performance and energy efficiency of the memory system.
The experimental results show that our two computation modes, NHWCfine and
NHWCcoarse, achieve higher efficiency in various convolutional layers compared to
the basic convolution, with reductions of off-chip latency of up to 5.11× and 4.95×,
respectively. The two modes also achieve up to 4.10× and 3.98× reductions in the
total energy consumption of a single convolutional layer. As the networks go deeper,
the reduction ratio increases accordingly.
The rest of the paper is organized as follows. Section 2 gives the background of
CNNs and introduces the related work. Section 3 gives the motivation of this work.
We introduce the proposed data layouts and optimizations in Sect. 4. Sections 5
and 6 provide a series of experiments to evaluate the memory efficiency of the
different modes, and the paper concludes in Sect. 7.

2 Background and Related Works


2.1 Convolutional Neural Networks
In machine learning, convolutional neural networks, inspired by the organization of
animal neurons with local sensitivity and direction selection, are members of the
family of multi-layer feed-forward artificial neural networks. The working principle
of CNNs is to extract local features with special data distributions from
high-resolution feature maps and combine them into more abstract low-resolution
feature maps. Feature extraction and feature mapping are performed by two types of
layers: convolutional and pooling layers. The last few layers are fully connected
(FC) classifiers that combine all local features to produce the final results.
A convolutional layer extracts local features from input feature maps (ifmaps) via
trained filters, and then combines them into more abstract intermediate activations
called output feature maps (ofmaps). Each element of a feature map is indexed by
three dimensions: height (H), width (W), and channel index (C). When the batch size
is larger than one, to leverage parallelism among different images, there is another
dimension N that must be considered, which identifies the different sets of ifmaps
in their contexts. The computation in the convolutional layers is defined as

ofmaps[z][u][x][y] = \sum_{k=0}^{C_i-1} \sum_{i=0}^{F_h-1} \sum_{j=0}^{F_w-1} ifmaps[z][k][Sx+i][Sy+j] \times weights[u][k][i][j] + bias[u],
0 \le z \le N, \; 0 \le u \le C_o, \; 0 \le x \le H_o, \; 0 \le y \le W_o    (1)
N is the batch size. Ci and Co are the channel numbers of the ifmaps and ofmaps. Ho
and Wo are the height and width of the ofmaps. Fh and Fw represent the size of the
convolution filters. Both the filters and the fmaps are 4-D arrays, meaning that
each filter or ifmap is a 3-D structure consisting of 2-D planes. In summary, for
each output channel, a 3-D ifmap is processed by a 3-D filter in a convolutional
layer. Figure 1 shows the main computation process.
To differentiate the data layouts of the 4-D arrays of fmaps and filters, we use the
following notation in this paper: NCHW for fmaps and CoCiFhFw for filters. There are
also two subtypes of the NCHW data layout, for ifmaps and ofmaps respectively:
NCiHiWi and NCoHoWo. A data layout determines the distance in memory between any two
elements along the same dimension. For example, in the NCHW data layout, the
elements along the lowest dimension W are stored consecutively, while consecutive
elements along the H dimension have a stride of W, elements along the C dimension
have a stride of H*W, and so on.
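The difference between the two layouts is easiest to see in the address arithmetic. The two helper functions below compute the linear offset of element (n, c, h, w) under NCHW and NHWC; they are illustrative only, but they show why elements that are adjacent along C become contiguous in NHWC.

```c
#include <stddef.h>

/* NCHW: stride of W is 1, of H is W, of C is H*W, of N is C*H*W. */
static inline size_t idx_nchw(int n, int c, int h, int w,
                              int C, int H, int W)
{
    return ((size_t)n * C + c) * H * W + (size_t)h * W + w;
}

/* NHWC: stride of C is 1, of W is C, of H is W*C, of N is H*W*C. */
static inline size_t idx_nhwc(int n, int c, int h, int w,
                              int C, int H, int W)
{
    return (((size_t)n * H + h) * W + w) * C + c;
}
```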

2.2 Related Works


Many works have been proposed to improve memory utilization from various angles,
such as compression, zero-skip computation [1], data layouts, dataflows, and so on.

Fig. 1. Computation of a CONV/FC layer. Hi and Wi are the height and width of the
ifmaps. The other notations are explained after Eq. (1).

DRAM accesses can be reduced by compression techniques such as pruning and
quantization [13]. Note that [10] compresses a 2-D filter into a 1-D row for storage
at the cost of a slower convergence rate, which benefits the optimizations below.
The zero-skip computation in [1] ignores zero values in the activations to eliminate
useless MACs and reduce R/W operations on psums; however, this technique introduces
additional energy and area overheads. Dataflows can be divided into two categories:
intra-layer and inter-layer dataflows. [5] presents various intra-layer dataflows
and proposes a dataflow, called row-stationary, that leverages almost all forms of
data reuse and achieves outstanding memory efficiency. [2] proposes an inter-layer
dataflow applied to consecutive convolutional layers to reduce the R/W operations on
intermediate activations. Data layout, the aspect we focus on, can serialize DRAM
accesses to better exploit bandwidth and coalescence. From the viewpoint of
parallelism, [16] discusses the impact of all kinds of basic data layouts on GPUs,
but besides neglecting other underlying structures, it also overlooks the
relationship between data layouts and computation sequences.

3 Motivation

3.1 Limited On-Chip Storage Resources

Our baseline accelerator architecture is an FPGA-based CNN accelerator design for
LeNet called Laius [17]. LeNet [15] is one of the most traditional neural networks,
targeting small-scale data compared with advanced networks like VGG; Laius is an
FPGA-based accelerator for LeNet.
Benefiting from ping-pong buffering, 8-bit weight precision, and weight compression,
Laius's inter-layer buffers can hold all the output data of the previous layer and
provide the input to the next layer. If the network model is changed to AlexNet, the
data size and the depth of the network both increase, and
then several problems arise. First, on-chip storage is no longer sufficient, and we
have to leverage data reuse better to reduce the number of off-chip memory accesses
and save energy. Second, as network structures go deeper, there are more
convolutional layers with small fmaps and filters. Third, the number of psums
produced by the parallel contexts is too large for all psums to stay in on-chip
buffers.

3.2 Data Movement

In most widely used CNNs, such as LeNet [15], AlexNet [14], and VGG [20],
convolutional layers account for over 90% of the overall operations and produce a
large amount of data movement [5]. Thus, the convolutional layers are the key for
CNNs to attain high throughput and energy efficiency.
Two issues limit the throughput and energy efficiency of convolution. First, every
MAC operation creates read requests for ifmaps and weights stored in off-chip DRAM,
which results in high bandwidth requirements and energy consumption. Second, a
significant number of partial sums (psums) are produced simultaneously by the
limited parallel contexts, which introduces additional read/write pressure and
access energy if the psums are not accumulated within an acceptable time.
To deal with the first issue, we should leverage different types of data reuse:

– Sliding reuse. The stride S is always smaller than Fh and Fw in convolutional
layers, so that the downsampling is not too coarse and each output gains more
information from its neighbors. This characteristic makes a small number of ifmap
pixels shared across many MAC kernels: each pixel of an ifmap can be used up to
Fh × Fw times (with padding) in a 2-D fmap, along both directions.
– Ifmap reuse
– Intra-image filter reuse. According to Eq. (1), each 2-D filter is identified by a
pair of an input channel and an output channel. Therefore, the convolutional
operations of an input channel use only one 2-D filter to generate the psums of a
given output channel. This kind of reuse does not exist in FC layers.
– Inter-image filter reuse. Each filter can be further reused across the batch
of N ifmaps.

The second issue can be handled by scheduling the order of operations so that the
psums reach their final values as soon as possible.
Nevertheless, maximum 2-D data reuse and immediate psum reduction cannot be fully
realized at the same time. The pixels of a specified output channel are the products
of a group of Ci 2-D ifmap windows and a group of Ci 2-D filters of the same size.
All filters in the group must be taken into the computation if we want to obtain an
output pixel immediately. But we then have to read this group of 2-D filters in
sequence again to compute the value of the next pixel in the same output channel,
which conflicts with the aim of maximum data reuse in the 2-D view. As we can see,
the reason why these two issues cannot be solved simultaneously is mainly the
independence of each channel.

Fig. 2. The number of DRAM accesses for various buffer sizes. The 5 bars in each
cluster represent the 5 CONV layers of AlexNet. With various sizes of the on-chip
buffers, we crawl the output requests of the buffer and record the number of
off-chip accesses.

The number of DRAM accesses is an essential factor that directly influences the
performance and energy efficiency of a convolutional layer. We analyze this metric
together with its three components, ifmaps/weights/psums, and observe the proportion
of each component. As shown in Fig. 2, we find that accesses to ifmap pixels have
the most significant impact on the total number of accesses in each layer. Failing
to leverage access coalescence and ifmap data reuse under a limited buffer size
causes a large number of repetitive ifmap accesses. When a buffer cannot hold a
whole 2-D ifmap plane, the two-direction sliding reuse of the plane always produces
repetitive access requests in the second direction. Therefore, methods are needed to
convert redundant accesses of ifmap pixels into accesses of weights, to keep the
balance and let the buffer always hold a 3-D filter.

4 Data Layout Optimization


4.1 Computation and Data Layout
When the buffer size is not large enough to hold all the pixels of an input channel,
all the psums of an output channel, and a 2-D filter at the same time, ifmap pixels
are repeatedly read by sliding reuse in an HW ifmap plane. According to Eq. (1), the
different heights and widths of filters and ifmaps lead to the two-direction sliding
reuse in convolutional layers, which creates a long stride for the sliding reuse in
the second dimension.
Therefore, among the dimensions N/H/W/C, we try to find a dimension of the same
length in both ifmaps and weights, so that single-direction sliding can be applied
on a 2-D plane. Ci is such a dimension, and we thus read pixels along the Ci
dimension first within a specific WC plane. In these planes, kernels slide along
dimension W. To obtain a continuous DRAM access sequence, data values along
dimension C (Ci/Co) should be stored close to each other in memory. We therefore
change the data layout of fmaps to NHWC, based on this single-direction sliding
computation mode.
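A rough software sketch of this single-direction sliding order is given below: with NHWC ifmaps and filters whose Ci dimension is innermost, the innermost loop walks contiguous addresses, and the window slides only along W inside a W-C plane. This illustrates the computation order only; it is not the authors' NHWCfine or NHWCcoarse implementation, and the weight layout and loop bounds are assumptions.

```c
#include <stddef.h>

/* Compute one output row (fixed n, co, ho) with NHWC ifmaps.
 * ifmap:  [N][Hi][Wi][Ci]   weight: [Co][Fh][Fw][Ci]   psum: [Wo] */
void conv_row_nhwc(const float *ifmap, const float *weight, float *psum,
                   int n, int co, int ho,
                   int Ci, int Hi, int Wi, int Wo,
                   int Fh, int Fw, int S)
{
    for (int wo = 0; wo < Wo; ++wo) {                      /* slide along W only  */
        float acc = 0.0f;
        for (int fh = 0; fh < Fh; ++fh) {                  /* filter rows         */
            int hi = ho * S + fh;
            for (int fw = 0; fw < Fw; ++fw) {              /* filter columns      */
                int wi = wo * S + fw;
                const float *in = &ifmap[((size_t)(n * Hi + hi) * Wi + wi) * Ci];
                const float *wt = &weight[((size_t)(co * Fh + fh) * Fw + fw) * Ci];
                for (int ci = 0; ci < Ci; ++ci)            /* contiguous along C  */
                    acc += in[ci] * wt[ci];
            }
        }
        psum[wo] += acc;
    }
}
```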
Another random document with
no related content on Scribd:
[180]subversion of their political liberties. The dogma of self-denial
has not prevented our financial pharisees from amassing fortunes
that would dwarf the spoils of a Roman triumphator; but the
hospitality of Mæcenas has not survived the religion of Nature. Our
philosophers have to study the problems of life in a personal struggle
for existence; our poets have to choose between starvation and
hypocrisy. Patriots are left to the consoling reflection that virtue is its
own reward. The endowers of theological seminaries seem to rely on
the mercy of Christ to cancel the odium of their shortcomings in the
recognition of secular merit. Kepler, Campanella, and Spinoza
perished in penury. Locke and Rousseau, the recognized primates of
the intellectual world, were left to languish in exile, admired and
neglected by a host of “friends”—Christian friends—in every city of
the civilized world. Schubert, Buerger, and Frederick Schiller, the
idols of a poetry-loving nation, were left to fight the bitter struggle for
existence to an extreme of which all the records of pagan antiquity
furnish only a single parallel. Anaxagoras, the founder of a
philosophic school counting its disciples by thousands, was left to
languish in exile, till the rumor of his extreme distress brought the
most illustrious of those disciples to the sick-bed of his neglected
teacher. “Do not, do not leave us!” he cried, in an agony of remorse;
“we cannot afford to lose the light of our life!”

“O Pericles,” said the dying exile, “those who need a lamp should
take care to supply it with oil!”

But how many lights of our latter-day lives have [181]thus been
extinguished before their time! Not one of the plethoric British
aristocrats who spiced their leisure with the sweets of poetry ever
dreamed of relieving the cruel distress of Robert Burns, or of cutting
the knot of the financial embroglio that strangled out the life of Sir
Walter Scott.
[Contents]

E.—REFORM.

Time is the test of truth; and the fallacies of the “Brotherhood in


Christ” plan have been abundantly demonstrated by their
consequences. Instead of being a bond of union, the doctrine of
renunciation has been found to be a root of discord and rancor, and
in times of need the fellowship of its converts has proved a most
rotten staff. Even the wretches who betrayed their friends to the
spies of the Holy Inquisition had no difficulty in palliating the infamy
of their conduct with the sanction of scriptural precepts. For centuries
the appeals of martyrs to the cause of freedom and Freethought
have been answered with the advice of Christian submission to the
“powers that be,” and our modern pharisees rarely fail to reprove the
“worldliness” of a poor neighbor’s lament for the loss of his earthly
possessions.

The founder of “Positivism,” the Religion of Humanity, proposes to


dedicate the days of the year to the leaders of progress, and inscribe
our places of worship with the names of discoverers, reformers, and
philosophers rather than of bigots and world-despising saints. And
for the sake of those who would not wish to repeat the mistake of
sacrificing [182]the present to the past, the builders of those
sanctuaries should add a temple of Friendship. From the adoration
of self-torturing fanatics, from the worship of sorrow and the love of
enemies, mankind will at last revert to the ancestral plan of elective
affinities, and the dread of preferring natural to theological duties will
not much longer prevent our fellow-men from recognizing their
obligations to their earthly benefactors.
[Contents]
CHAPTER XV.
EDUCATION.

[Contents]

A.—LESSONS OF INSTINCT.

The doctrine of Pythagoras, the philosophic Messiah of Paganism,


included the strange tenet of metempsychosis. After death, held the
confessors of that dogma, the souls of men and brutes would
reappear in new forms, higher or lower, according to the character-
traits of the dying individual. Thus the soul of a wealthy glutton might
be reborn in a pig-sty, that of a high-minded peasant perhaps on the
throne of a king. Death and rebirth are the upper and lower spokes
of a wheel that turns and turns forever, and in the persons of their
neighbors the Pythagoreans saw wanderers that might have walked
this earth thousands of years ago.

The strangeness of such a theory is still increased by the
circumstance that its teacher was an eminent astronomer, an
accomplished mathematician, and the leader of a memorable
hygienic reform. Our astonishment is not lessened by the well-
established fact that, under some form or other, the doctrine of soul-
migration has for ages been the accepted creed of a large plurality of
our fellow-men. It is well known, however, that to his trusted disciples
Pythagoras imparted an esoteric or explanatory version of his
dogmas; and if we learn that the great philosopher attached a
special importance to the influence of hereditary dispositions, the
truth at last dawns upon us that the doctrine of metempsychosis
referred to the reappearance of individual types, passions, and
dispositions in the bodily and mental characteristics of the next
generation. “Parents live in their children.” The instinctive recognition
of that truth reconciles our dumb fellow-creatures to the prospect of
death. At the end of summer the night-moth carefully deposits her
eggs in a silver cradle, hidden safe in the crevice of some sheltering
nook, where they will survive the rigor of the winter and answer the
first summons of spring. Having thus, as it were, insured the
resurrection of her type, the parent moth quietly resigns herself to
the fate of sleeping her own winter-slumber in the arms of death. On
the Orinoco wounded river-turtles will use their last strength to climb
the slope of some bush-hidden sand-bank, and after intrusting their
eggs to the protection of the deep drift sand, will reënter the water
and quietly float off with the seaward currents. In the virgin-woods of
Southern Mexico, where the harpy-eagle fills the maws of her hungry
brood by incessant raids on the small denizens of the tree-tops, the
traveler D’Armand once witnessed a curious scene. An eagle
had pounced upon a nursing mother monkey, who at first struggled
desperately to free herself from the claws of the murderer; but,
finding resistance in vain, she loosened her grasp on the branches,
and, just as the eagle carried her off, she disengaged the arm of her
baby from her neck, and shaking off the little creature with a swing of
her arm, she deliberately flung it back into the sheltering foliage of
the tree-top, thus taking the last possible chance of surviving in her
child.

The “dread of annihilation” reveals itself in the instincts of a dying
philosopher as plainly as in the instincts of a wounded animal; but,
on self-examination, that fear would prove to have but little in
common with a special solicitude for the preservation of material
forms or combinations—conditions which the process of organic
change constantly modifies in the cradle as well as in the grave. It is
rather the type of the body and its correlated mental dispositions
which the hope of resurrection yearns to preserve, and even
childless men have often partly realized that hope by impressing the
image of their soul on a younger mind, and transmitting their
cherished projects and theories through the medium of education. In
the consciousness of that accomplished task Socrates could as
calmly die in the arms of his disciples as the Hebrew patriarch in the
arms of his children and grandchildren. “You kill a sower,” cried St.
Adalbert under the clubs of his assassins, “but the seed he has
planted will rise and survive both his love and your hatred.”

Even the influence of a great practical example has often
impressed the mental type of a reformer or patriot on a series of
subsequent generations. The Buddhist Calanus, preaching the
doctrine of renunciation to an audience of scoffers, deeply affected
the most thoughtless of his witnesses by proving his personal
convictions in the flames of a funeral pile. “I leave no sons,” were the
last words of Epaminondas, “but two immortal daughters, the battles
of Leuctra and Mantinea.” Rousseau smiled when he learned the
intrigues of his enemies who were trying their utmost to enlist the
coöperation of a violent pulpit-orator. “They are busy recruiting their
corps of partisans,” said he, “but Time will raise me an ally in every
intelligent reader of the next generation.”


B.—REWARDS OF CONFORMITY.

In the simple lives of the lower animals every day may bring the
sufficient reward of its toil; but the problem of progress, even from
the first dawn of civilization, involves tasks too apt to extend beyond
the span of individual existence. The forest-clearing husbandman,
the state-founding patriot, the scientific inquirer, all risk to receive the
summons of night before the completion of their labor. Before
reaching the goal of their hopes their earthly pilgrimage may end at
the brink of the unknown river, and education alone can bridge that
gulf, and make every day the way-station of an unbroken road.
Children or children’s children will take up the staff from the last
resting-place of their pilgrim father; and, moreover, all progress is
cumulative. Every laborer works with the experience of his
forefathers, as well as his own; every son stands on the
shoulders of his father. Even the failure of individual efforts
contributes a helpful lesson to the success of the next attempt:

Freedom’s brave battle, once begun,
Bequeathed from bleeding sire to son,
Though often baffled, e’er is won.

Persistent adherence to the programme of a traditional policy has
often made the work of successive centuries the triumphant
execution of a single plan. The empire of Islam sprung from the seed
which the prophet of Mecca had planted in the soil of his native land.
The storm of the Protestant revolt rose from the anathemas of a poor
Wittenberg friar; the unquenchable fire of the French Revolution was
kindled by the burning indignation of a Swiss recluse, and his fervid
appeals:

Those oracles that set the world aflame,
Nor ceased to burn till kingdoms were no more;

and the vast fabric of our republican federation was founded by the
poor colonists who sought independence in the freedom of the
wilderness, and combined against the power of a selfish despot.
Education sows a seed which may sprout even during the life-time of
the sower, and bless individual life with the sweets of a guaranteed
triumph over the power of death. Resurgam, “I shall live after death,”
expresses the significance of that triumph, and of the “esoteric
doctrine of Pythagoras.”

C.—PERVERSION.

The Christian church has constantly perverted the purpose of
education, but has never yet deserved the reproach of having
neglected its means. From the very beginning the sect of the
apostle-training Galilean has been a sect of assiduous educators.
They were not satisfied with founding schools and opening their
doors to all comers, but went forth in quest of new converts, and
pursued their aim with a persistence of zeal and a versatility of skill
that could not fail to accomplish its purpose. As soon as a sufficient
increase of power enabled them to control the institutes of primary
instruction they turned their chief attention to the dogmatical
education of the young. They derived no aid from the attractiveness
and still less from the plausibility of their doctrine, but they realized
Schopenhauer’s remark that “there is in childhood a period
measured by six, or at most by ten years, when any well-inculcated
dogma, no matter how extravagantly absurd, is sure to retain its hold
for life.” And though the propagation of an unnatural creed is not
favored by natural fertility, the naturally barren doctrine of
renunciation was thus successfully propagated by a system of
incessant grafting. By the skilful application of that process the most
dissimilar plants were made subservient to its purpose. The
“Worship of Sorrow” with its whining renunciation of worldly
enjoyments, and its indifference to health and physical education,
was grafted on the manful naturalism of the Hebrew law-giver. Saint-
worship, the veneration of self-torturing fanatics, was grafted on a
stem of pagan mythology, and dozens of Christian martyrs have thus
usurped the honor and the sacrifices of pagan temples. Christian
holidays were grafted on the festivals of the nature-loving
Saxons. But persuasion failing, the missionaries of the cross did not
hesitate to resort to more conclusive measures. Like refractory
children cudgeled along the path of knowledge, the obstinate
skeptics of northern Europe were harassed with fire and sword till
they could not help admitting the dangers of unbelief. The garden-
lands of the Albigenses were wasted till they found no difficulty in
yearning for the peace of a better world. Philosophers were tortured
in the prisons of the Holy Inquisition till the sorrows of life favored the
renunciation of its hopes.

For thirteen centuries the sunshine of millions of human hearts was
ruthlessly sacrificed to promote the task of luring mankind from life to
ghost-land, and during all those ages education was systematically
turned from a blessing into an earth-blighting curse.


D.—PENALTIES OF NEGLECT.

There is a story of a Portuguese slave-dealer who carried a private
chaplain on his pay-roll, and frequently expressed his solicitude for
the spiritual welfare of his shackled captives. A very similar kind of
spiritual duty has for centuries been made the excuse for an almost
total neglect of secular education. Divorced from the control of
common sense, religion soon degenerated into mere ceremonialism.
A priest who would travel twenty miles through a snow-storm to
supply a dying man with a consecrated wafer had no sympathy with
the needs of the living. He would extort the last penny of his
tithes at the risk of starving a village full of needy parishioners.
He would groan at the sight of an unbaptized child, but had not a
drop of water to cool the brows of burning Moors or Jews. He would
rave about the cruelty of a prince who had deprived the clergy of
their mass-shillings, but had no ear for the laments of the exiled
Moriscos or the curses of starving serfs.

Such was the morality which arrogated the right of suppressing that
system of physical and intellectual education which had filled the
homes of the Mediterranean nations with all the blessings of health,
science, and beauty. Theological training had failed to kindle the
dawn of a supernatural millennium, but had thoroughly succeeded in
extinguishing the light of human reason. Not absolute ignorance
only, but baneful superstition—worse than ignorance by just as much
as poison is worse than hunger—was for centuries the inevitable
result of all so-called school-training; and the traditions of that age of
priest-rule have made religion almost a synonyme of cant. It also
gave book-learning its supposed tendency to mental aberration. Can
we wonder at that result of an age when the literary products of
Christian Europe were confined almost exclusively to ghost-stories
and manuals of ceremony? Can we wonder that delusions of the
most preposterous kind assumed the virulence of epidemic
diseases? Maniacs of self-mutilation, of epileptic contortions, of
were-wolf panics, traversed Europe from end to end. Men gloried in
ignorance, and boasted their neglect of worldly science till the
consequences of that neglect avenged its folly in actual
madness.

The saddest of all the sad “it might have beens” is, perhaps, a
reverie on the probable results of earlier emancipation—of the
employment of thirteen worse than wasted centuries in scientific
inquiries, agricultural improvements, social and sanitary reforms. We
might have failed to enter the portals of the New Jerusalem, but we
would probably have regained our earthly paradise.

E.—REFORM.

The days of the Holy Inquisition are past; but the restless
propaganda of Jesuitry still shames the inactivity of Rationalism. Our
friends sit listless, relying on the theoretical advantages of their
cause, while the busy intrigues of our enemies secure them all
practical advantages.

Even in our model republic only primary education stands neutral,
while private enterprise has made nearly every higher college a
stronghold of dogmatism. And even the semi-secularism of primary
instruction is more than offset by the ultra-orthodoxy of “Sunday-
schools.” Millions of factory children have to sacrifice their only day
of leisure at the bidding of their dogmatic task-master and with the
timid connivance of their parents. “We cannot row against the
stream,” I have heard even Freethinkers say. “Let the youngsters join
the crowd; if it does them no good, it can do no harm.” But it will do
harm, even beyond the waste of time and the wasted opportunities
for health-giving exercise. The process of dogmatic inoculation
may fail to serve its direct purpose, but the weekly repetition of the
experiment is sure to contaminate the moral organism with unsound
humors which may become virulent at unexpected times and, likely
enough, undermine that very peace of the household which a short-
sighted mother hoped to promote by driving her boys to Sunday-
school, as she would drive troublesome cattle to a public pasture.

The Freethinkers of every community should combine to engage a
teacher, or at least facilitate home instruction by collecting text-books
of Secularism, such as Voltaire’s “Philosophical Cyclopedia;”
Rousseau’s “Emile;” Hallam’s “History of the Middle Ages;”
Ingersoll’s pamphlets; Paine’s “Age of Reason;” Lecky’s “History of
Rationalism” and “History of Morals;” Lessing’s “Nathan;” Goethe
and Schiller’s “Xenions;” Darwin’s “Descent of Man;” Plutarch’s
Biographies; Trelawney’s “Last Days of Shelley and Byron;”
McDonnell’s Freethought novels; Parker Pillsbury’s “Review of
Sabbatarian Legislation;” Reade’s “Martyrdom of Man;” Bennett’s
“Gods and Religions of Ancient and Modern Times;” Gibbon’s
“History of Christianity;” Keeler’s “Short History of the Bible;” “Bible
Myths and Their Parallels in Other Religions;” “Supernatural
Religion;” Greg’s “Creed of Christendom;” Lord Amberley’s “Analysis
of Religious Belief;” “Religion Not History.”

We should have Freethought colleges and Secular missions, and
even isolated Liberals might do better than “drift with the stream.”
They might let their children pass their Sundays in the freedom of
the forests and mountains to worship the God of Nature in his own
temple, and learn a lesson from the parental devotion of their dumb
fellow-creatures. She-wolves, deprived of their whelps, have been
known to enter human habitations at night to suckle their young
through the bars of a heavy cage. Thrushes and fly-catchers will
enter an open window to feed or rescue their captive nestlings, and
with a still wider sympathy a Liberal friend of mine tries to aid his
neighbors’ children, as well as his own. Renouncing the hope of
abolishing Sabbatarianism, he conceived the idea of controlling it,
and induced his neighbors to send their children to a “Sunday
Garden” with a free museum of pictures and stuffed birds, gymnastic
contrivances, and a little restaurant of free temperance refreshments
—apples, peanuts, and lemonade. He defrays the expenses of the
establishment, which his neighbors consider a sort of modified
kindergarten; and under the name of “Sunday books” circulates a
private library of purely secular literature.

“If life shall have been duly rationalized by science,” says Herbert
Spencer, “parents will learn to consider a sound physical constitution
as an entailed estate, which should be transmitted unimpaired, if not
improved;” and with a similar recognition of social obligations
Freethinkers should endeavor to transmit to their children a bequest
of unimpaired common sense. Loyalty to their Protestant ancestors,
loyalty to posterity, and to the majesty of truth herself, should prompt
us to stand bravely by our colors and train our children to
continue the struggle for light and independence.

By the far-reaching influence of education Secularists should bridge
the chasm which orthodoxy hopes to cross on the wings of faith.
Secularism shall preach the gospel of immortality on earth.

IV—OBJECTIVE MAXIMS.

CHAPTER XVI.
FOREST CULTURE.


A.—LESSONS OF INSTINCT.

It is wonderful how often instinct has anticipated the practical lessons
of science. Long before comparative anatomy taught us the
characteristics of our digestive organs the testimony of our natural
predilections indicated the advantages of a frugal diet. Long before
modern hygiene pointed out the perils of breathing the atmosphere
of unventilated dormitories the evidence of our senses warned us
against the foulness of that atmosphere. And ages before the
researches of agricultural chemistry began to reveal the protective
influence of arboreal vegetation, an instinct which the child of
civilization shares with the rudest savage revolted against the
reckless destruction of fine woodlands, and sought to retrieve the
loss by surrounding private homes with groves and parks. The love
of forests is as natural to our species as the love of rocks to the
mountain-goat. Trees and tree-shade are associated with our
traditions of paradise, and that the cradle of the human race was not
a brick tenement or a wheat-farm, but a tree-garden, is one of the
few points on which the genesis of Darwin agrees with that of the
Pentateuch. The happiest days of childhood would lose half their
charm without the witchery of woodland rambles, and, like an echo
of the foreworld, the instincts of our forest-born ancestors often
awake in the souls of their descendants. City children are
transported with delight at the first sight of a wood-covered
landscape, and the evergreen arcades of a tropical forest would
charm the soul of a young Esquimaux as the ever-rolling waves of
the ocean would awaken the yearnings of a captive sea-bird. The
traveler Chamisso mentions an interview with a poor Yakoot, a
native of the North Siberian ice-coast, who happened to get hold of
an illustrated magazine with a woodcut of a fine southern landscape:
a river-valley, rocky slopes rising toward a park-like lawn with a
background of wooded highlands. With that journal on his knees the
Yakoot squatted down in front of the traveler’s tent, and thus sat
motionless, hour after hour, contemplating the picture in silent
rapture. “How would you like to live in a country of that kind?” asked
the professor. The Yakoot folded his hands, but continued his
reverie. “I hope we shall go there if we are good,” said he at last, with
a sigh of deep emotion.

The importance of hereditary instincts can be often measured by the
degree of their persistence. Man is supposed to be a native of the
trans-Caucasian highlands—Armenia, perhaps, or the terrace-lands
of the Hindookoosh. Yet agriculture has succeeded in developing a
type of human beings who would instinctively prefer a fertile plain to
the grandest highland paradise of the East. Warfare has in like
manner engendered an instinctive fondness for a life of perilous
adventure, as contrasted with the arcadian security of the Golden
Age. There are men who prefer slavery to freedom, and think pallor
more attractive than the glow of health, but a millennium of
unnaturalism has as yet failed to develop a species of human beings
who would instinctively prefer the dreariness of a treeless plain to the
verdure of a primeval forest.


B.—REWARDS OF CONFORMITY.

The love of forest-trees is a characteristic of the nature-abiding
nations of the North, and has rewarded itself by an almost complete
reversion of the original contrast between the garden lands of the
South and the inhospitable wilderness of the higher latitudes. Forest
destruction has turned Southern Europe into a desert, while the
preservation of forests has made the homes of the hyperborean
hunters an Eden of beauty and fertility. “One-third to the hunter, two-
thirds to the husbandman,” was the rule of Margrave Philip in his
distribution of forest and fields, and expresses the exact proportion
which modern science indicates as most favorable to the perennial
fertility of our farm-lands. In a single century the forest-destroying
Spaniards turned many of their American colonies from gardens into
sand-wastes, while, after fourteen hundred years of continuous
cultivation, the fields of the Danubian Valley are still as fertile as in
the days of Trajan and Tacitus. Along the river-banks and half-way
up the foot-hills the arable land has been cleared, but higher up
the forest has been spared. All the highlands from Ratisbon to Buda-
Pesth still form a continuous mountain park of stately oaks and
pines, and, as a consequence, springs never fail; crops are safe
against winter floods and summer drouths; song-birds still return to
their birthland, and reward their protectors by the destruction of
noxious insects; meadows, grain-fields, and orchards produce their
abundant harvest year after year; famine is unknown, and
contagious diseases rarely assume an epidemic form. In Switzerland
and Prussia the preservation of the now remaining woodlands is
guaranteed by strict protective laws; Scandinavia requires her forest-
owners to replant a certain portion of every larger clearing; in Great
Britain the parks of the ancient mansions are protected like sacred
monuments of the past, and landowners vie in lining their field-trails
with rows of shade-trees. The fertility of those lands is a constant
surprise to the American traveler disposed to associate the idea of
eastern landscapes with the picture of worn-out fields. Surrounded
by Russian steppes and trans-Alpine deserts, the homes of the
Germanic nations still form a Goshen of verdure and abundance.
Forest protectors have not lost their earthly paradise.


C.—PERVERSION.

Sixteen hundred years ago the highlands of the European continent
were still covered with a dense growth of primeval forests. The
healthfulness and fertility of the Mediterranean coastlands surpassed
that of the most favored regions of the present world, and the
dependence of those blessings on the preservation of the spring-
sheltering woodlands was clearly recognized by such writers as Pliny
and Columella, though their own experience did not enable them to
suspect all the ruinous consequences of that wholesale forest
destruction, which modern science has justly denounced as the ne
plus ultra folly of human improvidence. Practical experiments had,
however, demonstrated such facts as the failing of springs on
treeless slopes, and the violence of winter floods in districts
unprotected by rain-absorbing forests, and tree culture was practiced
as a regular branch of rational husbandry. But with the triumph of the
Galilean church came the millennium of unnaturalism. Rational
agriculture became a tradition of the past; the culture of secular
science was fiercely denounced from thousands of pulpits;
improvidence, “unworldliness,” and superstitious reliance on the
efficacy of prayer were systematically inculcated as supreme virtues;
the cultivators of the soil were treated like unclean beasts, and for a
series of centuries the garden regions of the East were abandoned
to the inevitable consequences of neglect and misculture.
