
Received August 7, 2019, accepted August 17, 2019, date of publication August 21, 2019, date of current version September 3, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2936591

A Fourier Domain Training Framework for Convolutional Neural Networks Based on the Fourier Domain Pyramid Pooling Method and Fourier Domain Exponential Linear Unit

JINHUA LIN 1,2, LIN MA 3, AND YU YAO 1, (Member, IEEE)
1 School of Applied Technology, Changchun University of Technology, Changchun 130012, China
2 Machinery and Electronics Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3 FAW Foundry Company Ltd., Changchun 130000, China

Corresponding authors: Jinhua Lin (ljh3832@163.com) and Yu Yao (282765569@qq.com)


This work was supported in part by the National Natural Science Foundation of China under Grant 51705032, and in part by the National
High-Tech Research and Development Program under Grant 2018AA7031010B.

ABSTRACT Training convolutional neural networks (CNNs) in the frequency domain is of great significance for extending the deep learning principle to the frequency domain. However, the frequency domain representation of the convnet architecture is highly demanding due to its complicated Fourier domain training features. Therefore, accurate and unambiguous representation strategies are needed for training convolutional neural networks entirely in the Fourier domain. Founded on the bin decomposition mechanism and the non-saturated activation theory, this paper proposes an accurate, stable and efficient Fourier domain training framework for convolutional neural networks. The framework contains two important Fourier domain representations: one is the Fourier domain exponential linear unit, and the other is the pyramid pooling layer. The former alleviates the vanishing gradient phenomenon and makes CNNs easier to converge in the Fourier domain; the latter avoids the original cropping or warping steps and improves the classification accuracy. With the framework, the Fourier domain training accuracy is improved without sacrificing the throughput of the graphics processing unit (GPU). With the Re-50 as the backbone, the top-1 and top-5 classification errors are reduced from 28.85 and 9.55 to 18.63 and 4.05, respectively, while the speedup ratios of the framework reach up to 4.9877 and 1.8997, respectively, at a batch size of 128 on an NVIDIA GEFORCE RTX 2080 GPU (8.92 TFLOPS). The average difference between the classification value and the ground truth value is only 0.21 on the MetaGram-1 set, which indicates the great goodness-of-fit and robustness of the framework. This investigation illustrates that the proposed Fourier domain CNN framework using the sophisticated Fourier domain representation strategy is highly efficient and accurate. Therefore, it may serve as a baseline framework for establishing the training pipelines of Fourier domain CNNs, which can improve the deep learning accuracy of CNNs and extend the Fourier domain representation strategy to other deep learning networks.

INDEX TERMS Convolutional neural networks, deep learning, Fourier domain training, exponential linear
unit, pyramid pooling.

I. INTRODUCTION
The convolutional neural network is a significant deep learning framework for image classification, object detection, natural language processing, etc. [1]–[6]. Therefore, accurate training and inference of the CNN have always been important topics in the artificial intelligence field [7], [8]. Benefitting from decades of rapid growth, the spatial domain training and inference of CNNs have developed far beyond the level of recognizing characters, and they are now able to classify and detect objects in a much more accurate manner [9]–[12].

The associate editor coordinating the review of this article and approving it for publication was Simone Bianco.


FIGURE 1. Three advanced frameworks: (a) the conventional time domain framework, (b) the FFT-based CNN framework [27], and (c) our framework.

In the past ten years, great theoretical and applied progress has been made in the spatial domain training and inference of CNNs [13]–[16]. They have been widely used in deep learning fields as baseline frameworks for studying large-scale datasets, complex computations, semantic detection, and so forth [17]–[19]. Several popular spatial domain CNN frameworks are the cuDNN [20], Lavin's fast algorithm [21] and Cong's minimizing-computation framework [22], [23]. Training networks in the spatial domain is significant for accurately learning the input features; however, the spatial domain convolution operations also increase the computational complexity. Therefore, the accurate learning of the input features generally comes with huge computational expense. An expedient strategy is to use the Fast Fourier Transform (FFT) [24]–[26] to transform the spatial domain convolution operations into Fourier domain product operations. FFT-based CNN frameworks [27]–[30] have been proposed for training networks with big kernels, but the advanced convnet frameworks use small kernels [31]–[33]. Jong Hwan Ko et al. proposed an energy-efficient accelerator for the CNN using Fourier domain computations (abbreviated as koCNN) [34], in which the spectral pooling strategy [35] and the discrete sinc interpolation operation are the two important methods for Fourier domain training. Nevertheless, the costs of Fourier domain training are quite expensive. In addition, the koCNN provides inaccurate results, since it is difficult for the spectral pooling strategy to transmit the incomplete kernel spectrum to the previous neurons, and the tanh and sigmoid operations employed by the koCNN are saturated activation operations, which decreases the precision of the weights in the back propagation. In this case, accuracy often has to be sacrificed to achieve low computational complexity.

Therefore, two important accuracy improvement strategies are indispensable for the fast training of Fourier domain networks. First, an effective Fourier domain pooling function, which removes the fixed-size constraint of CNNs by avoiding the cropping and warping steps, is important for improving the accuracy of Fourier domain CNNs [36]–[38]. Second, an unsaturated Fourier domain activation function is of great significance in alleviating the vanishing gradient problem and making it easier to converge in the frequency domain training phase [39]–[43].

In this study, on the basis of fully considering the above two points, an accurate frequency domain training architecture is proposed for training CNNs entirely in the Fourier domain. The significance of the proposed method involves two points. First, with respect to the classification precision, it employs accurate representations of the non-saturated activation operation and the pyramid pooling operation [44]. Second, with respect to the speedup performance, it is a fast Fourier domain baseline architecture for training Fourier domain CNNs. In this work, a Fourier domain architecture is established for training and testing networks completely in the frequency domain (see Fig. 1). The proposed architecture achieves the Fourier domain training and testing of networks on the basis of the bin decomposition mechanism and non-saturated activation functions, instead of sacrificing the classification accuracy or depending on the time-consuming time-frequency transformation strategy. First, a Fourier domain exponential linear unit is proposed to alleviate the vanishing phenomenon in neuron backpropagation and make it easier to converge in the frequency domain training stage. Second, a frequency domain pyramid pooling layer is designed to pool the features regardless of the input size/scale, to avoid the original cropping or warping steps and improve the classification accuracy. Finally, the Fourier domain inverse exponential linear unit and inverse pyramid pooling layer are proposed.


FIGURE 2. The structure of the FPPE.

The structure of this paper is as follows. The proposed framework is named the FPPE ("F" means Fourier domain, "PP" means Pyramid Pooling, and "E" means Exponential, from the Exponential Linear Unit). An overall introduction of the FPPE is provided in Section 2. In Section 3, the Fourier domain exponential linear unit (FELU) and pyramid pooling method (FPP) are introduced. The Fourier domain inverse exponential linear unit (FELU$^{-1}$) and inverse pyramid pooling method (FPP$^{-1}$) are introduced in Section 4. The results and discussion are presented in Section 5. Finally, the conclusions are drawn in Section 6.

II. OVERALL FRAMEWORK
The FPPE framework is able to train CNNs entirely in the Fourier domain. It boosts the accuracy of CNNs using a Fourier domain pyramid pooling method and a Fourier domain exponential linear unit. The Fourier domain pyramid pooling method can pool the properties in the frequency domain regardless of the input size/scale and remove the fixed-size constraint of CNNs. The Fourier domain exponential linear unit can introduce non-linearity into Fourier domain CNNs and alleviate the saturation problem [1]. The structure of the FPPE is presented in Fig. 2.

The FPPE-integrated CNN (abbreviated as the FPPE net) learns the features using two passes. In the first pass, each output feature map is calculated as a sum of the input feature maps multiplied by the corresponding weight filters. In the second pass, the loss gradients of the inputs are calculated by multiplying the transposed weight filter by the gradients of the outputs, and the loss gradients of the weight filter are calculated by multiplying each input feature map by the loss gradients of the outputs. Note that the original convolution operations in the spatial domain are converted to product operations in the frequency domain and that all product operations consist of multiplications between multi-sized feature maps. This Fourier domain architecture inherits the advantages of the product operations in the traditional FFT-based CNNs [27], [28], [30] and eliminates the time-consuming Fourier transforms at every layer of traditional FFT-based CNNs.

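The correspondence between spatial convolution and Fourier domain products that underpins this two-pass design can be checked in a few lines. The following NumPy sketch is our illustration of the convolution theorem (not code from this work), using a toy 1-D signal:

```python
import numpy as np

# A minimal check of the convolution theorem the FPPE relies on:
# conv(x, w) == IFFT(FFT(x) * FFT(w)), after zero-padding both signals
# to the full output length n_x + n_w - 1.
x = np.random.randn(32)          # toy input feature map (1-D for brevity)
w = np.random.randn(5)           # toy weight kernel
n = x.size + w.size - 1          # full convolution length

spectral = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(w, n)).real
spatial = np.convolve(x, w)      # reference spatial-domain convolution

assert np.allclose(spectral, spatial)  # the two routes agree
```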

In this Fourier domain framework, two remarkable Fourier domain representation strategies are proposed in this paper to accelerate the training and improve the classification accuracy of CNNs. First, a Fourier domain exponential linear unit (named "FELU") is established for catching the output features, and an inverse FELU (named "FELU$^{-1}$") is established for extracting the features of the loss gradients. The FELU and inverse FELU are Fourier domain activation functions in the CNN pipeline that alleviate the saturation problem; they alleviate the vanishing gradient problem in neuron backpropagation and allow the FPPE net to converge more easily in the training phase. Second, a Fourier domain pyramid pooling layer (named "FPP" in Fig. 2) is proposed for generating fixed-size outputs without the cropping or warping steps, which reduce the classification accuracy of CNNs. Correspondingly, the inverse FPP (named "FPP$^{-1}$" in Fig. 2) is proposed for removing the fixed-length constraint of CNNs. The proposed FPP has two significant meanings in this paper. First, the FPP generates fixed-size feature maps without the need for cropping or warping steps; it is a global pooling type of layer that fits the feature maps' outputs into an FC layer. Second, the Fourier domain maximum/average pooling operations are applied in the FPP layer and the inverse FPP layer. That is, the FPP layer and the inverse FPP layer are respectively equipped with the proposed maximum/average pooling functions to decrease the dimensions of the feature maps in the frequency domain. The proposed Fourier domain framework integrated with the FELU and FPP achieves the training of networks entirely in the frequency domain and leads to better accuracy without reducing the training speed. The details of the FPPE are introduced in the following sections.

III. FOURIER DOMAIN FORWARD PROPAGATION PASS (FFPASS)
The Ffpass receives the initial input information (x), then propagates the information forward to the hidden units of each layer in the frequency domain, and finally produces the output (y). First, the Fast Fourier Transform (FFT) operation is implemented for the inputs with arbitrary sizes and for the filters. The FFT results of the inputs and filters are denoted as F(x) and F(w), respectively. The product operations are implemented between F(w) and F(x), and the product operation results are denoted as F(y). Second, the Fourier domain exponential linear unit takes F(y) as input and generates the output feature maps with arbitrary sizes. Note that the output feature maps have the same aspect ratio as the inputs. Finally, the output feature maps with arbitrary sizes are decomposed into a fixed number of Fourier bins. The Fourier domain pyramid pooling (FPP) method pools in these Fourier bins, thereby yielding the fixed-dimensional outputs, which are the input of the fully connected (FC) layer. In this section, we introduce the implementation of the Ffpass in detail.

A. FOURIER DOMAIN EXPONENTIAL LINEAR UNIT (FELU)
In the spatial domain, the activation function is a mathematical function of the output of the convolution operation, and it introduces non-linear characteristics into the spatial domain CNNs. The popular spatial domain activation functions are the sigmoid function, the tanh function, the rectified linear unit and the exponential linear unit. In the spatial domain training, these functions can extract the non-linear properties. However, these functions cannot be used for the Fourier domain training of CNNs. There is an urgent requirement for frequency domain activation operations that can extract the non-linear properties in the frequency domain training. In this paper, a Fourier domain exponential linear unit is proposed that introduces non-linearity into the Fourier domain CNNs.

Before presenting the proposed method, we first present some notation. For a given convolutional layer, the input feature map $x_{qp}$ is indexed by $qp$, where $q$ denotes the convnet layer and $p$ denotes the input feature map in convnet layer $q$. The input feature maps are of size $n_{qp1} \times n_{qp2}$. The number of input feature maps in each convnet layer is labelled as $f$. The number of convnet layers is labelled as $L$; note that $q$ ranges from 1 to $L$. Correspondingly, the output $y_{qp'}$ is indexed by $qp'$, where $p'$ denotes the output feature map in layer $q$. The output feature maps are of size $m_{qp'1} \times m_{qp'2}$. The number of outputs in each convnet layer is labelled as $f'$. In addition, the weight kernel $w_{qp''}$ is indexed by $qp''$, where $p''$ denotes the filter kernel in convnet layer $q$. The weight kernels are of size $k_{qp''1} \times k_{qp''2}$, and the set of weight kernels in each convnet layer is denoted as $K_q$.

In the spatial domain, the exponential linear unit extracts the unsaturated status of the neurons using negative exponential values, which can ensure the effective backpropagation of neurons and make the CNN converge easily in the training stage. Correspondingly, we establish the frequency domain exponential linear unit that can extract the unsaturated status of the neurons. The frequency domain exponential linear unit function is denoted as follows:

$$A_{qp'} = FE(Y_{qp'}) \quad (1)$$

where $Y_{qp'}$ denotes the result of the pointwise product operation, $A_{qp'}$ denotes the output in each layer, and $FE(\cdot)$ is the Fourier domain exponential linear unit function. In contrast to other unsaturated functions, the exponential linear unit function has both positive and negative values in the time domain. Because these values correspond to the Fourier coefficients of the discrete Fourier series of the output feature map, we can divide the Fourier domain exponential linear unit function into a positive unit function and a negative unit function. The positive unit function preserves the positive coefficient terms in the output $Y_{qp'}$, while the negative unit function preserves the negative coefficient terms in the output $Y_{qp'}$. Therefore, we rewrite the Fourier domain exponential linear unit function in formula (1) as follows:

$$A_{qp'} = FE^{+}(Y_{qp'}) + FE^{-}(Y_{qp'}) \quad (2)$$

where $FE^{+}(\cdot)$ is the positive unit function and $FE^{-}(\cdot)$ is the negative unit function. Because the output $Y_{qp'}$ is the discrete Fourier transform result of the exponential linear unit, we model the Fourier expansion equation of $Y_{qp'}$ as follows:

$$Y_{qp'}(U) = y_{qp'}(0) + y_{qp'}(1)e^{-j\frac{2\pi}{m_{qp'}}U} + y_{qp'}(2)e^{-j\frac{2\pi}{m_{qp'}}2U} + \cdots + y_{qp'}(m_{qp'}-1)e^{-j\frac{2\pi}{m_{qp'}}(m_{qp'}-1)U} \quad (3)$$

where $m_{qp'}$ is the output size, i.e., $m_{qp'} = m_{qp'1} \times m_{qp'2}$, and $U$ denotes the output factor, i.e., the frequency index of the expansion.

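Equation (3) is simply the discrete Fourier transform of the flattened output map, which can be confirmed numerically; the following sketch (our toy 1-D check, not code from this work) compares the expansion term by term with a library FFT:

```python
import numpy as np

# Numerical check that the expansion of Eq. (3) is the ordinary DFT of the
# flattened output y: Y(U) = sum_n y(n) * exp(-j*2*pi*n*U / m).
y = np.random.randn(9)           # toy flattened output map, m = 9
m = y.size
expansion = np.array([np.sum(y * np.exp(-2j * np.pi * np.arange(m) * u / m))
                      for u in range(m)])

assert np.allclose(expansion, np.fft.fft(y))
```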

FIGURE 3. Fourier domain exponential linear unit.

As shown in Fig. 3, the output of the exponential linear unit function is saturated in the linear incremental sequence in the positive interval and in the exponential decreasing sequence in the negative interval. Correspondingly, the output of the Fourier domain exponential linear unit function is saturated at the equiangular sampling points along a spiral line in the Z plane. The output $y_{qp'}$ is the distance from the zero point to the equiangular sampling point in the Z plane, and it represents the coefficient of the Fourier term of the output of the $FE^{+}$ function. This indicates that the output of the $FE^{+}$ function only contains the positive coefficient terms in the output $Y_{qp'}$ instead of the whole element. To extract the positive coefficient terms in the output $Y_{qp'}$, the $FE^{+}$ function is used as follows:

$$FE^{+}(Y_{qp'}(U)) = \frac{1}{2}\left(Y_{qp'}(U) + \left\|Y_{qp'}(U)\right\|\right) \quad (4)$$

where $\|\cdot\|$ represents the absolute value of the Fourier term. In contrast to the $FE^{+}$ function, the output of the $FE^{-}$ function only contains the negative coefficient terms in the output $Y_{qp'}$. However, when the negative coefficients of the Fourier term of the output contain the exponential values, formula (3) cannot express the negative exponential coefficients of the Fourier term. Therefore, to extract the negative coefficient terms in the output, we first rewrite formula (3) as follows:

$$\bar{Y}_{qp'}(U) = a(e^{y_{qp'}(0)}-1) + a(e^{y_{qp'}(1)}-1)e^{-j\frac{2\pi}{m_{qp'}}U} + a(e^{y_{qp'}(2)}-1)e^{-j\frac{2\pi}{m_{qp'}}2U} + \cdots + a(e^{y_{qp'}(m_{qp'}-1)}-1)e^{-j\frac{2\pi}{m_{qp'}}(m_{qp'}-1)U} \quad (5)$$

where $a$ is a predefined parameter in the exponential linear unit function. Then, the $FE^{-}$ function is presented as follows:

$$FE^{-}(Y_{qp'}(U)) = \frac{1}{2}\left(\bar{Y}_{qp'}(U) - \left\|\bar{Y}_{qp'}(U)\right\|\right) \quad (6)$$

In the end, formulas (4) and (6) are integrated to generate the FELU function as follows:

$$A_{qp'}(U) = FE^{+}(Y_{qp'}(U)) + FE^{-}(\bar{Y}_{qp'}(U)) = \frac{1}{2}\left(Y_{qp'}(U) + \left\|Y_{qp'}(U)\right\| + \bar{Y}_{qp'}(U) - \left\|\bar{Y}_{qp'}(U)\right\|\right) \quad (7)$$

Note that the outputs of the positive unit function only contain the positive coefficient terms; however, the Fourier domain exponential linear units have negative coefficient terms, which ensure that the average unit activations approximate zero to achieve lower computational complexity. For instance, in Fig. 3, when the output of $FE(\cdot)$ contains negative coefficient terms exclusively, i.e., $a_{01} = a(e^{-3}-1)e^{-j\frac{2\pi}{9}} + a(e^{-2}-1)e^{-j\frac{2\pi}{9}3} + a(e^{-1}-1)e^{-j\frac{2\pi}{9}4} + a(e^{-2}-1)e^{-j\frac{2\pi}{9}8}$, the elements numbered 0, 2, 5, 6, and 7 are eliminated from the output of the negative unit function. Correspondingly, when the third element of the output of $FE(\cdot)$ contains positive coefficient terms exclusively, i.e., $a_{02} = a(e^{3}-1) + a(e^{1}-1)e^{-j\frac{2\pi}{9}4} + a(e^{0}-1)e^{-j\frac{2\pi}{9}10} + a(e^{3}-1)e^{-j\frac{2\pi}{9}12} + a(e^{2}-1)e^{-j\frac{2\pi}{9}14}$, the elements numbered 1, 3, 4 and 8 are eliminated from the output of the positive unit function. Therefore, the proposed negative unit function eliminates the positive coefficient terms of $Y_{qp'}$ instead of the whole element, while the positive unit function only eliminates the negative coefficient terms of $Y_{qp'}$. This suggests that the proposed Fourier domain exponential linear unit only uses the existence of positive and negative terms in the output instead of their absence. This strategy reduces the computational complexity of the whole element and improves the learning accuracy by normalizing the average negative unit to zero.

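To make the pipeline of Eqs. (4)-(7) concrete, the following NumPy sketch implements one reading of the FELU, in which $\|\cdot\|$ is taken as the modulus of each Fourier coefficient and the bar-variant of Eq. (5) is computed as the DFT of the ELU-mapped response; it is our illustration under those assumptions, not the authors' released code:

```python
import numpy as np

def felu(y, a=1.0):
    """Sketch of the FELU of Eqs. (4)-(7) for a flattened output map y.

    Assumptions: y is the real spatial-domain response whose DFT is
    Y_qp' (Eq. (3)); Eq. (5) is the DFT of a*(exp(y)-1); ||.|| is the
    coefficient modulus."""
    Y = np.fft.fft(y)                          # Eq. (3)
    Y_bar = np.fft.fft(a * (np.exp(y) - 1.0))  # Eq. (5)
    fe_pos = 0.5 * (Y + np.abs(Y))             # Eq. (4): positive unit function
    fe_neg = 0.5 * (Y_bar - np.abs(Y_bar))     # Eq. (6): negative unit function
    return fe_pos + fe_neg                     # Eq. (7)

A = felu(np.random.randn(64))                  # toy usage on a 64-point map
```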

FIGURE 4. The Fourier domain pyramid pooling layer.

B. FOURIER DOMAIN PYRAMID POOLING METHOD (FPP)
The popular CNN architecture requires a fixed input size (e.g., 224×224 in AlexNet). However, the current FFT-based CNN architectures mostly fit the input to a fixed size via cropping or warping, and the cropped or warped result may lead to information loss or distortion. To this end, there is a strong requirement for an accurate FPPE without the fixed-size constraint. In this work, a pyramid pooling method is used to pool the features and yield fixed-size outputs, which are then input to the FC layer. For instance, when the FFT-based CNNs are constrained by a fixed input size, an input with an arbitrary size needs to be cropped or warped to the fixed size, which harms the recognition accuracy. The Fourier domain pyramid pooling method can decompose an input with an arbitrary size into a fixed number of Fourier bins whose sizes are proportional to the input size. Then, the FPP pools in these Fourier bins and generates the fixed-size outputs. Finally, the fixed-size outputs are fed to the FC layer, and a SoftMax value is yielded as the output of the FPPE net.

In our Fourier domain pyramid pooling layer, each output feature map in the last convnet layer $Y_{Lp'}$ (with an arbitrary size) is divided into a set of Fourier bins $Y_{Lp'k'}$, where $k'$ indexes the Fourier bins within the output feature map $Y_{Lp'}$, each bin is of size $l_{Lp'k'1} \times l_{Lp'k'2}$, and the number of Fourier bins is denoted as $M_{p'}$. $F(\cdot)$ denotes the fast Fourier transform function. The FPP layer is placed between the last convnet layer and the fully connected layer to avoid the initial cropping or warping steps, and the output of the last convnet layer is divided into a set of Fourier bins whose aspect ratios are proportional to the output size. Therefore, the $Lp'k'$-th Fourier bin of $Y_{Lp'}$ is denoted as follows:

$$Y_{Lp'k'} = F(y_{Lp'k'}(n)) = \begin{cases} F(y_{Lp'}(n)), & n \in [(k'-1)(l_{Lp'k'1} \times l_{Lp'k'2}),\ k'(l_{Lp'k'1} \times l_{Lp'k'2})] \\ 0, & \text{other } n \end{cases} \quad (8)$$

where $y_{Lp'k'}$ is the spatial bin that corresponds to the Fourier bin $Y_{Lp'k'}$. Then, the output feature map $Y_{Lp'}$ is able to be represented as follows:

$$Y_{Lp'} = F(y_{Lp'}(n)) = \sum_{k'=1}^{M_{p'}} F(y_{Lp'k'}(n)) \quad (9)$$

Therefore, the Fourier domain downsampling operation for each output feature map is transformed into the pooling operation for several Fourier bins. The Fourier domain downsampling operation for Fourier bin $Y_{Lp'k'}$ is represented as follows:

$$Y_{Lp'k'} = F(y_{Lp'k'}) = Fdown(A_{Lp'k'}) \quad (10)$$

where $Y_{Lp'k'}$ denotes the pooled Fourier bin, $A_{Lp'k'}$ denotes the activated Fourier bin, and $Fdown(\cdot)$ denotes the Fourier domain downsampling function. Specifically, the Fourier domain downsampling operation is divided into the maximum and average operations. The maximum/average pooling operation is an effective downsampling method that can reduce the dimension of the outputs. Note that the average pooling function calculates the average feature value from a specific neighbourhood of the input and replaces the other feature values in this neighbourhood with the average feature value. Therefore, we construct the maximum/average pooling operation for each Fourier bin in the FPP layer.

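As a concrete reading of the bin decomposition of Eqs. (8) and (9), the sketch below splits a flattened feature map into a fixed number of contiguous index ranges and transforms each range separately; it is our toy illustration (1-D, with a bin size that divides the map length), not the reference implementation:

```python
import numpy as np

def fourier_bins(y, num_bins):
    """Sketch of Eqs. (8)-(9): cut the flattened map y into num_bins
    contiguous index ranges, each zero-padded back to full length, so
    that the bin spectra sum to the full spectrum."""
    n = y.size
    size = n // num_bins                      # bin size, assumed to divide n
    bins = []
    for k in range(num_bins):
        yk = np.zeros(n)
        yk[k * size:(k + 1) * size] = y[k * size:(k + 1) * size]
        bins.append(np.fft.fft(yk))           # F(y_Lp'k'(n)) of Eq. (8)
    return bins

y = np.random.randn(36)
bins = fourier_bins(y, num_bins=4)
assert np.allclose(sum(bins), np.fft.fft(y))  # Eq. (9): bins sum to F(y)
```

Because the FFT is linear and the zero-padded segments sum to the original map, the decomposition loses no spectral information.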

The maximum pooling operation collects the maximum Fourier coefficient from certain neighbourhoods in each Fourier bin and replaces the other coefficients of the specific neighbourhood in the Fourier bin with the extracted maximum Fourier coefficient, as shown in Fig. 4. The maximum Fourier coefficient is equivalent to the maximum feature value that is collected by the maximum pooling function in the spatial domain. The specific neighbourhood is the minimum pooling unit in each Fourier bin. Therefore, formula (10) is rewritten as follows:

$$Y_{Lp'k'}(U) \equiv FMP(A_{Lp'k'}(\cap)) = \sum_{n=\beta} \max(y_{Lp'k'}(n)) \cdot e^{-j\frac{2\pi}{l_{Lp'k'}}n\cap} \quad (11)$$

where $\beta$ denotes the positive Fourier coefficients in the activated Fourier bin and $\cap$ denotes the maximum Fourier coefficient of the specific neighbourhood in the Fourier bin. The pooled Fourier bin is of size $((l_{Lp'k'1} - k_{qp''1})/str_1 + 1) \times ((l_{Lp'k'2} - k_{qp''2})/str_2 + 1)$, i.e., $U \in [0, (l_{Lp'k'} - k_{qp''})/str]$. $l_{Lp'k'}$ is the size of the $Lp'k'$-th Fourier bin, i.e., $l_{Lp'k'} = l_{Lp'k'1} \times l_{Lp'k'2}$. $str_1$ is the horizontal step length, $str_2$ is the vertical step length, and $str$ denotes the step size $str_1 \times str_2$.

Correspondingly, the Fourier domain average pooling operation extracts the average Fourier coefficient from the specific neighbourhoods in each Fourier bin and replaces the other coefficients of the specific neighbourhood in the Fourier bin with the extracted average Fourier coefficient, as shown in Fig. 4. The average Fourier coefficient is equivalent to the average feature value. Therefore, formula (10) is further written as follows:

$$Y_{Lp'k'}(U) \equiv FAP(A_{Lp'k'}(\cap)) = \sum_{n=\beta} \mathrm{avg}(y_{Lp'k'}(n)) \cdot e^{-j\frac{2\pi}{l_{Lp'k'}}n\cap} \quad (12)$$

where $FAP(\cdot)$ denotes the average pooling function. The pooled Fourier elements are fixed regardless of the value of $\cap$; $\cap$ is equal to 1 in this paper. In summary, the output of the FPP layer is denoted as follows:

$$Y_{Lp'}(U) = \sum_{k'=1}^{M_{p'}} Y_{Lp'k'}(U) \quad (13)$$

The FPP layer can pool the input feature maps with arbitrary sizes and yield the fixed-size outputs $Y_{Lp'}$, which are the input to the FC layer. In addition, the size of the Fourier bin $Y_{Lp'k'}$ is equal to half the size of its previous variant, which suggests that our Fourier domain pooling operation corresponds to the spatial domain pooling operation. In other words, the FPP layer has two notable characteristics for the FFT-based CNNs: 1) the FPP can yield a fixed-size output regardless of the size of the input feature map, and 2) the FPP is able to maintain the Fourier information by pooling the Fourier bins. This multi-scale Fourier domain pooling operation is robust to geometric distortion.
(see Section 3.1) and FPP (see Section 3.2) methods, we sum- ation parameters in back propagation pass. In this section,
marize the detailed implementation steps of the FPPE net as an inverse exponential linear unit is proposed for transmitting
follows. the gradients to the previous convnet neuro and modify the
Step 1. To avoid confusion in the base-2 DIT FFT, we first feature values of the neurons.
pad zero values to the multi-size input feature maps xqp and For a given convnet layer, the input gradient is denoted
filter kernels wqp00 . The padded results are as follows: ∂8
as ∂x qp
. Correspondingly, the output gradient is denoted
∂8 ∂8
(
xqp (n), 0 ≤ n ≤ mqp1 × mqp2 − 1 as ∂yqp0 , and the weight gradient is denoted as ∂wqp00 .
xqp (n) =
0, mqp1 × mqp2 ≤ n ≤ Ñ − 1 The inverse exponential linear unit function is denoted

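The seven steps above can be strung together in a few lines. The following sketch is a one-dimensional, single-channel caricature of the Ffpass (our illustration only: the FELU follows the reading used earlier, and the FPP stage is reduced to a plain fixed-count average pooling of the activated map), not the CUDA/Caffe implementation used in this work:

```python
import numpy as np

def ffpass(x, w, a=1.0, num_bins=4):
    """Toy 1-D walk through Steps 1-7 of the Ffpass."""
    # Steps 1-3: zero-pad to the next power of two (N~ of Eq. (14)),
    # then transform input and kernel.
    n_full = x.size + w.size - 1
    N = 1 << (n_full - 1).bit_length()
    X, W = np.fft.fft(x, N), np.fft.fft(w, N)
    # Step 4: the Fourier domain product replaces spatial convolution.
    Y = X * W
    # Step 5: FELU on the spectrum (positive/negative unit functions).
    y = np.fft.ifft(Y).real
    Y_bar = np.fft.fft(a * (np.exp(y) - 1.0))
    A = 0.5 * (Y + np.abs(Y)) + 0.5 * (Y_bar - np.abs(Y_bar))
    # Steps 6-7: split into a fixed number of bins, pool each to one
    # value, and hand the fixed-length vector to the FC layer.
    act = np.fft.ifft(A).real
    size = act.size // num_bins
    return np.array([act[k * size:(k + 1) * size].mean()
                     for k in range(num_bins)])

print(ffpass(np.random.randn(30), np.random.randn(5)))  # fixed-length output
```

Whatever the length of the input, the returned vector always has num_bins entries, which is the property the FPP layer provides to the FC layer.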

IV. FOURIER DOMAIN BACKWARD PROPAGATION PASS (FBPASS)
The Fbpass takes the gradient of the output ($\partial L/\partial y$), and then it propagates the loss deviation information backward to the input ($\partial L/\partial x$) and the weight ($\partial L/\partial w$). First, the fast Fourier transform is implemented for the transposed weight kernel and the loss gradient of the output, and the product operations are implemented between them; the product operation results are denoted as $F(\partial L/\partial x)$. Second, the FELU$^{-1}$ takes $F(\partial L/\partial x)$ as input and activates the weight gradient. Finally, the activated gradients of the inputs are integrated into a fixed number of Fourier bins. The Fourier domain inverse pyramid pooling (FPP$^{-1}$) method pools these Fourier bins, thus yielding the weight gradient. In this section, we introduce the implementation of the Fbpass in detail.

A. FOURIER DOMAIN INVERSE EXPONENTIAL LINEAR UNIT (FELU$^{-1}$)
The inverse activation function is able to solve the increasing gradient problem and quickly propagate the loss deviation parameters in the back propagation pass. In this section, an inverse exponential linear unit is proposed for transmitting the gradients to the previous convnet neurons and modifying the feature values of the neurons.

For a given convnet layer, the input gradient is denoted as $\partial\Phi/\partial x_{qp}$. Correspondingly, the output gradient is denoted as $\partial\Phi/\partial y_{qp'}$, and the weight gradient is denoted as $\partial\Phi/\partial w_{qp''}$. The inverse exponential linear unit function is denoted as follows:

$$A^{-1}_{qp} = FE^{-1}\!\left(\frac{\partial\Phi}{\partial X_{qp}}\right) \quad (15)$$

where $\partial\Phi/\partial X_{qp}$ is the input gradient and $FE^{-1}(\cdot)$ is the inverse exponential linear unit function. The inverse exponential linear unit function has both positive and negative values in the spatial domain. Because these values correspond to the Fourier coefficients of the discrete Fourier series of $\partial\Phi/\partial X_{qp}$, we divide the Fourier domain inverse exponential linear unit function into the inverse positive unit function and the inverse negative unit function. The inverse positive unit function preserves the positive coefficient terms in $\partial\Phi/\partial X_{qp}$, while the inverse negative unit function preserves the negative coefficient terms in $\partial\Phi/\partial X_{qp}$. Therefore, we rewrite the Fourier domain inverse exponential linear unit function in formula (15) as follows:

$$A^{-1}_{qp} = FE^{-1}_{+}\!\left(\frac{\partial\Phi}{\partial X_{qp}}\right) + FE^{-1}_{-}\!\left(\frac{\partial\Phi}{\partial X_{qp}}\right) \quad (16)$$

where $FE^{-1}_{+}(\cdot)$ is the inverse positive unit function and $FE^{-1}_{-}(\cdot)$ is the inverse negative unit function. Because the gradient of the input $\partial\Phi/\partial X_{qp}$ is the discrete Fourier transform result of the inverse exponential linear unit, we present the Fourier expansion equation of $\partial\Phi/\partial X_{qp}$ as follows:

$$\frac{\partial\Phi}{\partial X_{qp}}(U) = \frac{\partial\Phi}{\partial x_{qp}}(0) + \frac{\partial\Phi}{\partial x_{qp}}(1)e^{-j\frac{2\pi}{n_{qp}}U} + \frac{\partial\Phi}{\partial x_{qp}}(2)e^{-j\frac{2\pi}{n_{qp}}2U} + \cdots + \frac{\partial\Phi}{\partial x_{qp}}(n_{qp}-1)e^{-j\frac{2\pi}{n_{qp}}(n_{qp}-1)U} \quad (17)$$

where $n_{qp} = n_{qp1} \times n_{qp2}$. The output of $FE^{-1}_{+}(\cdot)$ only contains the positive coefficient terms of the input $\partial\Phi/\partial X_{qp}$ instead of the whole element. To extract the positive coefficient terms in the input $\partial\Phi/\partial X_{qp}$, the inverse positive unit function is presented as follows:

$$FE^{-1}_{+}\!\left(\frac{\partial\Phi}{\partial X_{qp}}(U)\right) = \sum_{n=\varepsilon_0}^{\varepsilon_j} \frac{\partial\Phi}{\partial x_{qp}}(n)\, e^{-j\frac{2\pi}{n_{qp}}nU} \quad (18)$$

where $\varepsilon_0$ and $\varepsilon_j$ denote the positive factors of the original feature map. In contrast to $FE^{-1}_{+}$, the output of $FE^{-1}_{-}$ only contains the negative coefficient terms of the input $\partial\Phi/\partial X_{qp}$. However, when the negative coefficients of the Fourier term of the gradient map contain the exponential values, formula (17) cannot express the negative exponential coefficients of the Fourier term. Therefore, to extract the negative coefficient terms in the gradient map of the input, we rewrite formula (17) as follows:

$$\frac{\partial\Phi}{\partial \bar{X}_{qp}}(U) = \ln\!\left(\frac{1}{a}\frac{\partial\Phi}{\partial x_{qp}}(0) + 1\right) + \ln\!\left(\frac{1}{a}\frac{\partial\Phi}{\partial x_{qp}}(1) + 1\right)e^{-j\frac{2\pi}{n_{qp}}U} + \ln\!\left(\frac{1}{a}\frac{\partial\Phi}{\partial x_{qp}}(2) + 1\right)e^{-j\frac{2\pi}{n_{qp}}2U} + \cdots + \ln\!\left(\frac{1}{a}\frac{\partial\Phi}{\partial x_{qp}}(n_{qp}-1) + 1\right)e^{-j\frac{2\pi}{n_{qp}}(n_{qp}-1)U} \quad (19)$$

where $a$ is a predefined parameter in the exponential linear unit function. Then, the $FE^{-1}_{-}$ function is presented as follows:

$$FE^{-1}_{-}\!\left(\frac{\partial\Phi}{\partial \bar{X}_{qp}}(U)\right) = \sum_{n=\varepsilon_0}^{\varepsilon_j} \ln\!\left(\frac{1}{a}\frac{\partial\Phi}{\partial x_{qp}}(n) + 1\right) e^{-j\frac{2\pi}{n_{qp}}nU} \quad (20)$$

In the end, formulas (18) and (20) are integrated to generate the Fourier domain inverse exponential unit function as follows:

$$A^{-1}_{qp}(U) = FE^{-1}_{+}\!\left(\frac{\partial\Phi}{\partial X_{qp}}(U)\right) + FE^{-1}_{-}\!\left(\frac{\partial\Phi}{\partial \bar{X}_{qp}}(U)\right) \quad (21)$$

Note that the elements of the inverse positive unit function only contain the positive coefficient terms; however, the Fourier domain inverse exponential linear units have negative coefficient terms, which ensure that the average unit activations approximate zero to achieve lower computational complexity. For instance, in Fig. 5, when the output of $FE^{-1}$ contains negative coefficient terms exclusively, i.e., $a^{-1}_{01} = \ln(\frac{1}{a}\times(-1)+1)e^{-j\frac{2\pi}{9}} + \ln(\frac{1}{a}\times 1+1)e^{-j\frac{2\pi}{9}3} + \ln(\frac{1}{a}\times 0.1+1)e^{-j\frac{2\pi}{9}4} + \ln(\frac{1}{a}\times 2+1)e^{-j\frac{2\pi}{9}8}$, the elements numbered 0, 2, 5, 6, and 7 are eliminated from the output of the inverse negative unit function. Correspondingly, when the third element of the output of $FE^{-1}$ contains only positive coefficient terms, i.e., $a^{-1}_{02} = 2.5 + 1.5e^{-j\frac{2\pi}{9}4} - 3e^{-j\frac{2\pi}{9}10} + 1.7e^{-j\frac{2\pi}{9}12} - 2e^{-j\frac{2\pi}{9}14}$, the elements numbered 1, 3, 4 and 8 are eliminated from the output of the inverse positive unit function. Therefore, the proposed inverse negative unit function eliminates the positive coefficient terms of $\partial\Phi/\partial X_{qp}$ instead of the whole element, while the inverse positive unit function only eliminates the negative coefficient terms of $\partial\Phi/\partial X_{qp}$. This suggests that the proposed inverse exponential linear unit only includes the existence of the positive and negative terms in the gradient map instead of their absence. This strategy reduces the computational complexity of the whole element and solves the exploding gradient problem.

B. FOURIER DOMAIN INVERSE PYRAMID POOLING METHOD (FPP$^{-1}$)
In this section, a Fourier domain inverse pyramid pooling method is proposed for pooling the gradient maps and generating fixed-size gradient bins, which are then propagated to the neurons in the convnet layer. For instance, when the FFT-based CNNs are constrained by a fixed-size gradient map, a gradient map with an arbitrary size needs to be cropped or warped to the fixed size, which is suboptimal for the gradient backpropagation accuracy. The Fourier domain inverse pyramid pooling method can decompose a gradient map with an arbitrary size into a fixed number of gradient bins whose sizes are proportional to the size of the gradient map. Then, the FPP$^{-1}$ pools these gradient bins and generates the fixed-size outputs. Finally, the fixed-size outputs are propagated to the neurons in the convnet layer, and the loss deviation values are generated to update the weight parameters of the FPPE net.

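Before the gradient bins are pooled, each gradient map passes through the FELU$^{-1}$ of Eqs. (18)-(21). Under the same reading as the forward FELU (library FFTs for the Fourier sums, the log mapping of Eq. (19) applied elementwise, and the $\varepsilon$-indexed sums taken over the positive entries of the gradient map), a sketch could look as follows; the clipping guard and the interpretation of the sums are our assumptions:

```python
import numpy as np

def felu_inv(grad_x, a=1.0):
    """Sketch of FELU^-1, Eqs. (18)-(21), for a flattened spatial
    gradient grad_x. Eq. (18) is read as 'transform only the positive
    entries'; Eq. (20) keeps the negative log-mapped terms."""
    pos = np.where(grad_x > 0, grad_x, 0.0)            # Eq. (18)
    fe_pos = np.fft.fft(pos)
    # Eq. (19): log-map the gradient; the clip keeps ln(.) real here.
    mapped = np.log(np.clip(grad_x / a + 1.0, 1e-12, None))
    fe_neg = np.fft.fft(np.where(mapped < 0, mapped, 0.0))  # Eq. (20)
    return fe_pos + fe_neg                             # Eq. (21)

g = felu_inv(np.random.randn(64))                      # toy gradient map
```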

FIGURE 5. Fourier domain inverse exponential linear unit.

FIGURE 6. The Fourier domain inverse pyramid pooling layer.

In our Fourier domain inverse pyramid pooling layer, each gradient map in the last convnet layer $\partial\Phi/\partial X_{Lp}$ is divided into a set of gradient bins $\partial\Phi/\partial X_{Lpk}$, where $k$ indexes the gradient bins within the gradient map $\partial\Phi/\partial X_{Lp}$, each bin is of size $l_{Lpk1} \times l_{Lpk2}$, and the number of gradient bins for each gradient map is denoted as $M_p$; note that $k$ ranges from 1 to $M_p$. The FPP$^{-1}$ layer is placed between the FC layer and the last convnet layer to avoid the initial cropping or warping steps, as shown in Fig. 6. The $Lpk$-th gradient bin with respect to the gradient map $\partial\Phi/\partial X_{Lp}$ is denoted as follows:

$$\frac{\partial\Phi}{\partial X_{Lpk}} = F\!\left(\frac{\partial\Phi}{\partial x_{Lpk}}\right) = \begin{cases} F\!\left(\frac{\partial\Phi}{\partial x_{Lp}}(n)\right), & n \in [(k-1)(l_{Lpk1} \times l_{Lpk2}),\ k(l_{Lpk1} \times l_{Lpk2})] \\ 0, & \text{other } n \end{cases} \quad (22)$$


where $\partial\Phi/\partial x_{Lpk}$ is the spatial domain gradient bin that corresponds to the Fourier domain gradient bin $\partial\Phi/\partial X_{Lpk}$. The input gradient map $\partial\Phi/\partial X_{Lp}$ is able to be represented as follows:

$$\frac{\partial\Phi}{\partial X_{Lp}} = F\!\left(\frac{\partial\Phi}{\partial x_{Lp}}(n)\right) = \sum_{k=1}^{M_p} F\!\left(\frac{\partial\Phi}{\partial x_{Lpk}}(n)\right) \quad (23)$$

Therefore, the Fourier domain inverse downsampling operation for each gradient map is transformed into the inverse pooling operation for several gradient bins. The Fourier domain inverse pooling operation for the gradient bin $\partial\Phi/\partial X_{Lpk}$ is represented as follows:

$$\frac{\partial\Phi}{\partial X_{Lpk}} = F\!\left(\frac{\partial\Phi}{\partial x_{Lpk}}\right) = Fdown^{-1}(A^{-1}_{Lpk}) \quad (24)$$

where $\partial\Phi/\partial X_{Lpk}$ denotes the inverse pooled gradient bin, $A^{-1}_{Lpk}$ denotes the Fourier domain inverse exponential unit function with the gradient bin as the input (i.e., the inverse activated gradient bin), and $Fdown^{-1}(\cdot)$ denotes the Fourier domain inverse downsampling function. The Fourier domain inverse pooling operation collects the maximum Fourier coefficient from certain neighbourhoods in each gradient bin and replaces the other coefficients of the specific neighbourhood in the gradient bin with the extracted maximum Fourier coefficient, as shown in Fig. 6. The maximum Fourier coefficient is equivalent to the maximum feature value that is collected by the inverse pooling function in the spatial domain. The specific neighbourhood is the minimum inverse pooling unit in each gradient bin. In the ordinary case, the neighbourhood is equal to the transposed weight kernel. Therefore, the inverse maximum pooling function for each gradient bin in formula (24) is rewritten as follows:

$$\frac{\partial\Phi}{\partial X_{Lpk}}(U) \equiv FMP^{-1}(A^{-1}_{Lpk}(\cap)) = \sum_{n=\beta^{-1}} \frac{\partial\Phi}{\partial x_{Lpk}}(n) \cdot e^{-j\frac{2\pi}{l_{Lpk}}n\cap} \quad (25)$$

where $\beta^{-1}$ denotes the positive Fourier coefficients in the inverse activated gradient bin, $U$ denotes the elements of the inverse pooled gradient bin, and $\cap$ denotes the maximum Fourier coefficient of the specific neighbourhood in the gradient bin. The inverse pooled gradient bin is of size $((l_{Lpk1} - k_{qp''1})/str_1 + 1) \times ((l_{Lpk2} - k_{qp''2})/str_2 + 1)$, i.e., $U \in [0, (l_{Lpk} - k_{qp''})/str]$. $l_{Lpk}$ is the size of the $Lpk$-th gradient bin, i.e., $l_{Lpk} = l_{Lpk1} \times l_{Lpk2}$.

Correspondingly, the inverse average pooling operation extracts the average Fourier coefficient from the specific neighbourhoods in each gradient bin and replaces the other coefficients of the specific neighbourhood in the gradient bin with the extracted average Fourier coefficient, as shown in Fig. 6. The average Fourier coefficient is equivalent to the average feature value that is collected by the inverse pooling function. Then, the inverse pooling operation for each gradient bin is written as follows:

$$\frac{\partial\Phi}{\partial X_{Lpk}}(U) \equiv FAP^{-1}(A^{-1}_{Lpk}(\cap)) = \sum_{n=0}^{\delta^{-1}-1} \frac{1}{\delta^{-1}}\frac{\partial\Phi}{\partial x_{Lpk}}(n) \cdot e^{-j\frac{2\pi}{l_{Lpk}}n\cap} \quad (26)$$

where $FAP^{-1}(\cdot)$ denotes the inverse average pooling operation and $\delta^{-1}$ denotes the size of the neighbourhood. The index value of the average coefficient in the gradient bin is NULL; therefore, the inverse pooled Fourier elements are fixed regardless of the value of $\cap$. $\cap$ is equal to 1 in this paper. In summation, the input gradient in the FPP$^{-1}$ layer is denoted as follows:

$$\frac{\partial\Phi}{\partial X_{Lp}}(U) = \sum_{k=1}^{M_p} \frac{\partial\Phi}{\partial X_{Lpk}}(U) \quad (27)$$

The FPP$^{-1}$ layer can pool a gradient map with an arbitrary size and yield the fixed-size outputs $\partial\Phi/\partial X_{Lp}$, which are propagated to the deviation parameters of the FPPE net. In addition, the size of the gradient bin is equal to half the size of its previous variant, which suggests that our inverse pooling operation corresponds to the corresponding spatial domain inverse pooling operation.
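As with the forward FPP, the inverse average pooling of Eq. (26) can be sketched directly. The version below is our toy reading (1-D, with $\cap = 1$ and a neighbourhood size $\delta^{-1}$ that divides the bin length): it replaces each non-overlapping neighbourhood of the spatial gradient bin by its mean before transforming back to the Fourier domain:

```python
import numpy as np

def fap_inv(grad_bin, delta):
    """Sketch of the inverse average pooling of Eq. (26): average every
    non-overlapping neighbourhood of size delta in the spatial gradient
    bin, then take the DFT of the reduced sequence."""
    pooled = grad_bin.reshape(-1, delta).mean(axis=1)  # one mean per neighbourhood
    return np.fft.fft(pooled)

g_bin = np.random.randn(16)                            # toy gradient bin
print(fap_inv(g_bin, delta=2))                         # 8 pooled coefficients
```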
V. RESULTS AND DISCUSSION
A. DATASETS AND TRAINING CONFIGURATION
We train the CNN on two data sets: ImageNet [31] and MetaGram-1 [45], [46]. The former is an open source training set that contains 1000 categories. The latter is made up of in-house data sets with manually labelled images, in which the typical and representative images were carefully examined and chosen. Before the training phase, we discarded from the training data set the redundant images and those that failed the initialization of the FPPE. The final database contains 1200 labelled images. Because CUDA [47], [48] and the deep learning frameworks (e.g., Caffe [49]) run with fixed-size inputs, we incorporate the spatial domain pyramid pooling method [44] to establish our training solution for inputs with arbitrary sizes in this section, which ensures that our Fourier domain pyramid pooling layer is trained and tested under the CUDA and Caffe implementations.

In our single-size training phase, we pre-calculate the sizes of the Fourier bins that are the input of the FPP layer. The output feature maps in the last convnet layer (e.g., conv5) are of size $\alpha \times \alpha$ (e.g., 13×13), and the Fourier bin in a Fourier pyramid level is of size $m \times m$. To train a specific Fourier pyramid level, we model this Fourier pyramid pooling as an original sliding window pooling, where the size of the sliding window is computed as $\lceil \alpha/m \rceil$ and the stride length is computed as $\lfloor \alpha/m \rfloor$. With respect to the CNN of an $l$-level Fourier pyramid, we implement $l$ sliding window pooling layers. The FC layer after the last convnet layer integrates the $l$ outputs. In Table 1, we present the configuration of the proposed Fourier domain pyramid pooling layer using CUDA.

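For the conv5 output of size α = 13 and the 3-level pyramid of Table 1, the window and stride of each level follow directly from the ceiling/floor rule above; a quick check:

```python
import math

# Window size and stride for each Fourier pyramid level (Table 1 setup):
# a 13x13 conv5 output pooled into bins of size m x m per level.
alpha = 13
for m in (3, 2, 1):                       # FPool3x3, FPool2x2, FPool1x1
    window = math.ceil(alpha / m)         # sliding window size
    stride = math.floor(alpha / m)        # sliding window stride
    print(f"level {m}x{m}: window={window}, stride={stride}")
# m=3 -> window 5, stride 4, giving a 3x3 bin grid; m=2 -> 7/6 -> 2x2;
# m=1 -> 13/13 -> 1x1, i.e., global pooling.
```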

TABLE 1. A 3-level Fourier domain pyramid pooling layer using CUDA.

Our configuration is for the conv5 architecture, and therefore the output is of size 13 × 13. The first pooling layer of the FPP is denoted as FPool3×3, and the Fourier bins of FPool3×3 are of size 3 × 3. Correspondingly, the Fourier bins of FPool2×2 are of size 2 × 2, and the Fourier bins of FPool1×1 are of size 1 × 1.

In our multi-size training phase, our FPPE framework is proposed to train the CNNs using inputs with arbitrary sizes in the Fourier domain. To simulate the multi-size training, we adjust the input images to two sizes (180 × 180 and 224 × 224). In contrast to the initial cropping or warping steps that are used by other non-FPP CNNs, the images of the two sizes differ only in their resolution, not in their content or layout. We implement the above single-size training phase in the FPPE net using inputs of size 180 × 180. The output feature maps in the last convnet layer are then of size 10 × 10 (i.e., α = 10), and the Fourier bin in a Fourier pyramid level is of size m × m; the size of the sliding window is ⌈10/m⌉ and the stride length is ⌊10/m⌋. The above settings are also applicable to the inputs of size 224 × 224. Despite the two different input sizes, the FPPE net possesses the same fixed-size output. In summary, the multi-size training phase consists of two single-size training phases with the same parameters. Therefore, we implement a multi-size input FPPE net using two single-size FPPE nets that share the same parameters. In addition, we also tested inputs with various sizes in the experimental section, i.e., with the input size between [180, 224].

B. EXPERIMENTS ON IMAGE CLASSIFICATION
We train the proposed FPPE net on the MetaGram-1 dataset and the ImageNet dataset. Our Fourier domain training algorithm is the same as that of the previous FFT-based CNNs in references [27]–[30]. The learning rate is set to 0.01 at the beginning, and it is divided by 10 at the peak error. The training and inference of the proposed FPPE net are implemented using the open source code of CUDA and Caffe. The FPPE net is trained using an NVIDIA GEFORCE RTX 2080 GPU (8 GB memory) within one month.

1) BACKBONE ARCHITECTURES
The proposed FPPE framework can be integrated into any convolutional neural network architecture. We use three advanced CNNs as the backbones of the FPPE; the FPPE improves the accuracy of these CNNs in the Fourier domain. The convnet architectures of the three backbones are given in references [31], [32], and [33]. The FPPE with the backbone from [31] is denoted as FPPE-Al-7, that with the backbone from [32] is denoted as FPPE-Vg-19, and that with the backbone from [33] is denoted as FPPE-Re-50. In these FPPE nets, the FPP layer after the last convnet layer outputs feature maps of size 6×6, the FC layer generates 4096-dimensional outputs, and the SoftMax layer generates 1200-way values (i.e., classification labels).

2) FPPE IMPROVES ACCURACY WHEN USING THE METAGRAM-1 DATASET
In these FPPE nets, the Fourier domain pyramid pooling layer is placed after the last convnet layer to replace the original pooling layer, and the proposed FELU is placed in each convnet layer to replace the original activation function. In contrast, the non-FPPE network is a Fourier domain CNN structure that does not embed the FPP layer after the last convnet layer. The level of the FPP layer is set to 4. The first pooling level of the FPP is denoted as FPool6×6, and the Fourier bins of FPool6×6 are of size 6 × 6. Correspondingly, the Fourier bins of FPool3×3 are of size 3 × 3, the Fourier bins of FPool2×2 are of size 2 × 2, and the Fourier bins of FPool1×1 are of size 1 × 1. The number of Fourier bins is 50.

In Table 2, we show the results of the single-size and multi-size training of these FPPE nets. The FPPE net provides a considerable accuracy improvement over the non-FPPE networks. The smallest top-1 error is presented by the backbones with deep layers and small filters, which indicates that our FPPE is more applicable to advanced CNN architectures with deep layers and small filters. Because the inputs of the FPPE net and the non-FPPE net are both 224 × 224, the accuracy improvement in Table 2 is attributed to the embedded FPP layer. For example, with the Re-50 as the backbone, the top-1 and top-5 classification errors for the multi-size FPPE net are only 18.63 and 4.05, respectively, while the top-1 and top-5 classification errors for the non-FPPE net increase by 10.22 and 5.50 percentage points, respectively. In addition, this accuracy gain is due not only to the increase of the training parameters but also to the robustness of our FPPE to input distortion. For example, when the FPP layer is set to a different 4-level structure {4 × 4, 3 × 3, 2 × 2, 1 × 1}, the number of Fourier bins is 30. This FPPE net uses fewer parameters than the corresponding non-FPPE net, and the top-1 and top-5 errors are 20.02 and 6.20, respectively. These results are almost the same as those of the above 50-Fourier-bin architecture, and they are significantly better than those of the non-FPPE net.
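The bin counts quoted above follow from summing the squared level sizes of each pyramid; a one-line check:

```python
# The number of Fourier bins for a pyramid is the sum of m*m over its
# levels: {6,3,2,1} gives 36+9+4+1 = 50; {4,3,2,1} gives 16+9+4+1 = 30.
for levels in ((6, 3, 2, 1), (4, 3, 2, 1)):
    print(levels, sum(m * m for m in levels))   # -> 50 and 30
```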

TABLE 2. Error rates using the MetaGram-1 validation data set.

TABLE 3. Classification accuracy (mAP) using the MetaGram-1 data set.

In Table 3, we compare the classification accuracy of the FPPE to several popular spatial domain networks: the cuDNN [20], Lavin's fast algorithm [21] and Cong's minimizing-computation framework [22]. In this experiment, we conduct training using the union of 800 training categories, 300 validation categories and 100 test categories of the MetaGram-1 data set. Each category includes 221 images. The standard metrics, including mAP180, mAP224, mAP364, mAPS, mAPM, and mAPH, are used to measure the classification accuracy. mAP is the abbreviation of mean Average Precision, and it is the average value over the classification thresholds. The mAPs with different subscripts represent the average precision values at different image scales, where the subscript is the input image size. As shown in Table 3, all implementations of our FPPE outperform the baseline variants of the spatial domain networks. The classification results of the FPPE (83.1%) exceed those of the spatial domain network (cuDNN) (67.8%) by a substantial margin (15.3 percentage points). This is due to the two proposed Fourier domain representations: one is the Fourier domain exponential linear unit, and the other is the pyramid pooling layer. The former alleviates the vanishing phenomenon in the back propagation pass, and the latter avoids the original cropping or warping steps and improves the classification accuracy.

In Fig. 7, we plot the violin graphs for the frameworks with different backbones, where the mean absolute errors are calculated for three datasets: the training, validation and test sets. A violin graph is a kind of drawing that adds rotated kernel density plots to the neighbourhood of the box diagram. In the middle of each violin drawing, we mark the median using a black dot; the upper edge of the black rectangle in each violin graph represents the 75th percentile of the classification error vector, and the lower edge represents the 25th percentile of the classification error vector. In general, the violin graphs of the three FPPE nets (FPPE-Al-7, FPPE-Vg-19 and FPPE-Re-50) provide almost the same error distribution results, which indicates that the proposed FPPE framework has stable and robust performance. However, despite the higher accuracy on the test set, the accuracy of the training set in the FPPE-Re-50 net is not as good as that of the FPPE-Al-7 net. This is because we increase the weight parameters of the large-error data in the training phase so that these data are re-selected for further training in the next iteration. In Fig. 7, the violin graphs of the other two networks reflect the instability, poor accuracy and overfitting problems on the MetaGram-1 dataset. We compare the violin graphs of the FPPE-Al-7 net to those of the cuDNN-Al-7 (the cuDNN with the backbone from [31]); the test set of our FPPE-Al-7 has a more complete violin shape, which is flatter than that of the cuDNN-Al-7. This indicates that the FPPE-Al-7 is more accurate, which is consistent with the numerical results in Table 2. For the Fourier domain network (the koCNN with the backbone from [31]), there is over-fitting on the training set. This indicates that our FPPE-Al-7 is more accurate than the other Fourier domain training network, without over-fitting. In other words, the results for the training data sets shown in Fig. 7 reflect the superiority of the proposed FPPE framework, and the same holds for the validation and test data sets.
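For readers reproducing this style of figure, a violin graph of MAE distributions can be drawn in a few lines of matplotlib; the data below are random placeholders (the paper's figures use the real per-image classification errors):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder MAE samples for the training/validation/test subsets.
rng = np.random.default_rng(0)
maes = [np.abs(rng.normal(0.2, 0.05, 500)) for _ in range(3)]

fig, ax = plt.subplots()
ax.violinplot(maes, showmedians=True)   # kernel-density "violins"
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["train", "val", "test"])
ax.set_ylabel("mean absolute error")
plt.show()
```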

FIGURE 7. The violin graphs of the MAEs for the FPPE, cuDNN and koCNN frameworks for the MetaGram-1 data set.

To further compare the prediction error distributions of the FPPE net and the other Fourier domain networks, Fig. 8 shows the scatter graphs of the ground truth values versus the classification values of the two networks for the test data set. The average difference between the classification value and the ground truth value is only 0.21 for the MetaGram-1 data set, which indicates the great goodness-of-fit and robustness of the FPPE framework. In addition, the scatter plot can also intuitively show the classification accuracy of the CNNs. As shown in Fig. 8, most classification values, whether for the FPPE-Al-7 net or the koCNN-Al-7 net, are either on the same line or very close to it, which indicates that the test results of the Fourier domain networks are excellent. Except for a few abnormal values, most of the classification values of the FPPE net are distributed in a narrow band along the diagonal line, which indicates that the classification results of the FPPE net are better than those of the other Fourier domain networks.


FIGURE 8. Classification versus ground truth values of the MetaGram-1 test sets for the three frameworks.

TABLE 4. Error rates using the ImageNet validation data set.

TABLE 5. Classification mAPs for the ImageNet data set.

3) FPPE IMPROVES ACCURACY WHEN USING THE IMAGENET DATASET
The experimental settings in this section are the same as those in Section 5.2.2. In Table 4, we show the results of the single-size and multi-size training of the FPPE nets and the non-FPPE net on the ImageNet dataset. The FPPE net provides a considerable accuracy improvement over the non-FPPE networks. Because the inputs of the FPPE nets and the non-FPPE net are both 224 × 224, the accuracy improvement in Table 4 is attributed to the embedded FPP layer. For example, with the Re-50 as the backbone, the top-1 and top-5 classification errors for the multi-size FPPE net are only 15.77 and 2.87, respectively, while the top-1 and top-5 classification errors for the non-FPPE net increase by 10 and 4.38 percentage points, respectively. However, the top-1 and top-5 classification errors for the multi-size FPPE net trained using the MetaGram-1 dataset reach 18.63 and 4.05, respectively, which are 2.86 and 1.18 percentage points higher than those of the multi-size FPPE net trained using the ImageNet dataset. This suggests that, compared with the net trained using the MetaGram-1 data set, the FPPE net trained using the ImageNet data set is more accurate and stable.

In Table 5, we compare the classification accuracy of the FPPE to that of the state-of-the-art spatial domain network (cuDNN). In this experiment, we conduct training using the union of 700 training categories, 200 validation categories and 100 test categories from the ImageNet data set. Each category includes 681 images. The accuracy performance is evaluated using the mAP, which was defined in Section 5.2.2. To compare the FPPE and the cuDNN accurately, we reset the size of the input image and crop the image to 224 × 224 from the centre of the image. For the ImageNet dataset, the deeper the convnet layer is, the better the classification result. In Tables 5(a) and (b), we provide the results for the cuDNN and the FPPE net after training using the ImageNet dataset, respectively. The results of the FC structure are better than those of the previous conv4 and conv5 structures. For example, the mAP of the FC layer is at least 16.3% and 10.8% higher than those of the conv4 and conv5 layers, respectively.

FIGURE 9. The violin graphs of the MAEs for the FPPE, cuDNN and koCNN frameworks for the ImageNet data sets.

Table 5(c) shows the classification results for full images, where we reset the shorter edge length of the image to 224. Compared to the cropped images, the classification results using full images are considerably improved. For example, the mAP of the classification result of cuDNN-Re-50 is 75.1 on the cropped 224 × 224 images, while the mAP of the classification result of FPPE-Re-50 is 79.7 on the full images, which is a mAP improvement of 4.6. This is due to the proposed FPP layer, which maintains the complete content of the input images. Because the proposed FPPE does not depend on the input size, we reset the size of the input image so that the smaller size is x. Table 5(d) shows that x = 364 provides the best classification result (83.5%) using the ImageNet dataset. This is because the object occupies a smaller area in the ImageNet dataset but a larger area in the MetaGram-1 dataset; therefore, the relative scales of the two groups of objects are different. These results suggest that the scale problem in Fourier domain classification tasks can be partially solved by the proposed FPPE networks.
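The two input-preparation modes compared in Tables 5(c) and 5(d) come down to simple geometry; a minimal sketch follows, where the helper names are ours and the actual interpolation is left to any image library.

def full_image_size(h, w, shorter=224):
    # Rescale so the shorter edge equals `shorter`, keeping the aspect
    # ratio (the full-image setting of Table 5(c)); nothing is cropped.
    scale = shorter / min(h, w)
    return round(h * scale), round(w * scale)

def center_crop_box(h, w, size=224):
    # Coordinates of a size x size crop taken from the image centre
    # (the cropped setting used for the cuDNN comparison).
    top, left = (h - size) // 2, (w - size) // 2
    return top, left, top + size, left + size

print(full_image_size(480, 640))   # (224, 299): all content kept
print(center_crop_box(480, 640))   # (128, 208, 352, 432): borders discarded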


FIGURE 10. Classification versus ground truth values of the ImageNet test sets for the three frameworks.

In Fig. 9, we plot the violin graphs for the frameworks with different backbones, where the mean absolute errors are calculated using the ImageNet datasets. The ImageNet dataset is divided into three subsets: the training, validation and test subsets. The training subset includes 681,000 images in 1000 categories. A total of 476,700 images are used for training, 136,200 images are used for validation, and the remaining images are used for testing. The violin graphs of the three FPPE nets (FPPE-Al-7, FPPE-Vg-19 and FPPE-Re-50) provide almost the same error distribution results as those in Fig. 7. However, the training results for the three FPPE nets using the ImageNet dataset are more accurate than those using the MetaGram-1 dataset. This suggests that the error distribution of the proposed FPPE framework when using the ImageNet dataset is better than that when using the MetaGram-1 dataset. In Fig. 9, the violin graphs of the cuDNN and koCNN nets also reflect the instability and overfitting problems when using the ImageNet dataset, which are almost the same as those when using the MetaGram-1 dataset. However, the FPPE nets are more stable and show less overfitting when using the ImageNet dataset. We compare the violin graphs of the FPPE-Al-7 net to those of the cuDNN-Al-7. The training set of our FPPE-Al-7 has a more complete violin shape, which is flatter than that of the cuDNN-Al-7. This suggests that the FPPE-Al-7 that is trained using the ImageNet dataset is more accurate than the cuDNN-Al-7, which is consistent with the numerical results in Table 4. For the Fourier domain network (koCNN-Al-7), there is over-fitting when using the training set. This suggests that our FPPE-Al-7 that is trained using the ImageNet dataset is more accurate than the other Fourier domain training network and does not over-fit. In summary, the results for the ImageNet training data sets that are shown in Fig. 9 reflect the superiority of the proposed FPPE framework, and the same holds for the ImageNet validation and ImageNet test data sets.

To further compare the prediction error distributions of the FPPE net and the other Fourier domain networks, Fig. 10 shows the scatter graphs of the ground truth values versus the classification values of the two networks for the ImageNet test data set. The average difference between the classification value and the ground truth value is only 0.16 for the ImageNet test data set, which further supports the great goodness-of-fit and robustness of the FPPE framework. As shown in Fig. 10, most classification values, whether for the FPPE-Re-50 net or the koCNN-Re-50 net, are either on the diagonal line or very close to it. This indicates that the test results of the Fourier domain networks are excellent when using the ImageNet dataset. Except for a few abnormal values, most of the classification values of the FPPE net are distributed in a narrow band along the diagonal line. This indicates that the classification results of the FPPE net are better than those of the other Fourier domain networks for the ImageNet test data set.
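The goodness-of-fit statistics summarized by Figs. 9 and 10 reduce to simple aggregates over (classification value, ground truth) pairs. A minimal sketch follows; the values are synthetic stand-ins for the real network outputs, and the band width of 0.5 is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 10, 1000)           # synthetic ground truth values
pred = truth + rng.normal(0, 0.2, 1000)    # synthetic classification values

# Mean absolute error: the quantity summarized by the violin plots (Fig. 9).
mae = np.mean(np.abs(pred - truth))

# Fraction of points inside a narrow band around the diagonal y = x line:
# the visual criterion behind the scatter plots (Fig. 10).
band = np.mean(np.abs(pred - truth) < 0.5)

print(f"MAE = {mae:.2f}, fraction within band = {band:.3f}")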


FIGURE 11. The speedup ratio of the FPPE nets to the four frameworks on the MetaGram-1 dataset.

TABLE 6. The speedup performance of the FPPE (Aacr# is the reduction ratio of the average arithmetic complexity).

C. THE TIME PERFORMANCE OF FPPE
In this section, the time performance of the FPPE is evaluated using two parameters: the speedup ratio and the throughput. When the speedup ratio is less than 1, the arithmetic complexity of the FPPE is higher than that of the comparison method. In addition, when the throughput of the FPPE is bigger than the GPU peak throughput, it is considered to have effective time performance.

1) SPEEDUP RATIO
The speedup ratio of the FPPE is calculated by dividing the run time of the comparison framework by the complex multiplication operation time of the FPPE. Because the FPPE is a type of FFT-based framework that implements product operations instead of convolution operations, and the proposed Fourier domain exponential linear unit divides the output feature maps with arbitrary sizes into fixed-sized Fourier bins (gradient bins in Fbpass), the FPPE requires $(\max(l_{L_{p'}k'1} \times l_{L_{p'}k'2}) + k_{q_{p''}1} \times k_{q_{p''}2} - 1)(1 + \log(\max(l_{L_{p'}k'1} \times l_{L_{p'}k'2}) + k_{q_{p''}1} \times k_{q_{p''}2} - 1))$ complex multiplication operations in a convnet layer, where $\max(\cdot)$ represents the maximum size of the Fourier bins. The speedup ratio of the FPPE is calculated as follows:

$$S_p = \sum_{i=1}^{128} \frac{T}{S_i \cdot f'_i \cdot M \cdot \left(\max(l_{L_{p'}k'1} \times l_{L_{p'}k'2}) + k_{q_{p''}1} \times k_{q_{p''}2} - 1\right)\left(1 + \log\left(\max(l_{L_{p'}k'1} \times l_{L_{p'}k'2}) + k_{q_{p''}1} \times k_{q_{p''}2} - 1\right)\right)} \tag{28}$$

where $T$ is the implementation time of the framework to be compared and $S_i$ is the minibatch size of the GPU. Fig. 11 shows the speedup ratios of the three FPPE nets on the MetaGram-1 dataset, and the frameworks to be compared are the spatial domain frameworks (cuDNN-Al-7/Vg-19/Re-50 and FA-Al-7/Vg-19/Re-50) and the Fourier domain frameworks (koCNN-Al-7/Vg-19/Re-50 and fbFFT-Al-7/Vg-19/Re-50). Judging from the speedup performance, whether against the FFT-based Fourier domain frameworks (fbFFT-Al-7/Vg-19/Re-50) or the complete Fourier domain frameworks (koCNN-Al-7/Vg-19/Re-50), there is a downward trend: the arithmetic complexity of the Fourier domain frameworks decreases as the number of training layers increases. However, the speedup ratio of the FPPE nets is still bigger than 1. For example, Table 6 shows that the speedup ratio of FPPE-Al-7 to fbFFT-Al-7 is 3.9999 at batch 128 and the speedup ratio of FPPE-Al-7 to koCNN-Al-7 is 1.4998 at batch 128. This suggests that even if the other Fourier domain frameworks benefit from the deep structure of CNNs, the training speed of the proposed FPPE is still 0.5 to 3 times higher than theirs. This is due to the Fourier domain pyramid pooling layer of the FPPE. The FPP layer divides the feature map into Fourier bins in the last convnet layer, and each bin is placed in parallel in the LEU unit of the GPU, which reduces the arithmetic complexity of the FPPE framework. By contrast, in the spatial domain, the arithmetic complexity of the framework is inversely proportional to the number of training layers. For example, the speedup ratio of FPPE-Al-7 to cuDNN-Al-7 is 5.1120 at batch 1 and 6.0122 at batch 128, indicating an increase of 0.9002 points. This fact also applies to the other spatial domain frameworks. The speedup ratio of the FPPE nets to the spatial domain frameworks reaches its maximum value at batch 128. Regardless of the depth of the spatial domain frameworks, the speedup ratio of the FPPE is the highest. This is because the FPPE net is trained and tested entirely in the Fourier domain, and the transformation time is eliminated. In other words, the arithmetic complexity of the FPPE is lower than that of the other frameworks using the Al-7, Vg-19 and Re-50 backbones on the MetaGram-1 dataset.
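To make the structure of Eq. (28) concrete, a minimal sketch of the computation is given below. The grouping of terms follows our reading of the reconstructed Eq. (28), and every numeric input is a placeholder for illustration, not a measured value.

import numpy as np

def fourier_layer_ops(max_bin, k1, k2):
    # Complex multiplications per convnet layer, following the count
    # (N + k1*k2 - 1)(1 + log(N + k1*k2 - 1)) quoted before Eq. (28),
    # where N is the maximum Fourier-bin size.
    n = max_bin + k1 * k2 - 1
    return n * (1.0 + np.log(n))

def speedup_ratio(t_cmp, s, f, m, max_bin, k1, k2, batches=128):
    # Eq. (28)-style speedup: the comparison framework's run time divided
    # by the FPPE complex multiplication time, summed over 128 minibatches.
    fppe_time = s * f * m * fourier_layer_ops(max_bin, k1, k2)
    return sum(t_cmp / fppe_time for _ in range(batches))

# Placeholder inputs (illustrative only): T, S_i, f'_i, M, bin and kernel sizes.
print(speedup_ratio(t_cmp=2.0e9, s=128, f=1.0, m=64, max_bin=56 * 56, k1=3, k2=3))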


FIGURE 12. The Al-7 net effective TFLOPS vs. batch size for the koCNN and FPPE on an 8.92 TFLOPS GPU.


2) THE THROUGHPUT
The single-layer throughput of the FPPE is calculated by dividing the floating-point operation count (TFLOPs) of a layer by its GPU calculation time. When the single-layer throughput is higher than the GPU peak throughput (8.92 TFLOPS in our experiment), the convnet layer has effective computation complexity. The total throughput is calculated by dividing the total TFLOPs by the total run time, where the total TFLOPs and run time are computed by weighting the TFLOPs and GPU implementation time of each convnet layer by its depth. Fig. 12 shows the throughputs for the koCNN and FPPE with the Al-7 backbone for double precision at different batch sizes. For the double precision data, the FPPE is 1.59X as fast at f = 256 and 1.84X as fast at f = 1. The throughput is 14.0887 TFLOPS at f = 256 and 18.0416 TFLOPS at f = 1. This indicates that the speedup performance of the FPPE is stable for minibatches with different input sizes. In Fig. 12, the throughputs of the koCNN and FPPE are shown by layer. The koCNN's sinc interpolation operation performs poorly after the third layer. For example, the throughput for layer 3 is 7.7328 TFLOPS, which is less than the peak throughput. This indicates that despite the koCNN being trained entirely in the Fourier domain, its time-consuming downsampling operation leads to inefficient multiplication stages unless the network is shallow. However, the koCNN still performs worse than the FPPE even with fewer layers. The FPPE performs better than the koCNN in both the training and testing phases. Generally, the FPPE is superior to the only other complete Fourier domain framework (koCNN) unless the input feature map is very large. This is because the koCNN truncates the higher spectrum region of the frequency representation of the input in the initial training phase, whereas the FPPE implements Fourier pyramid pooling operations in the last convnet layer, which requires more GPU bandwidth to divide a large input feature map into more Fourier bins. However, the koCNN's truncated region may not contain the complete spectrum information, which results in lower training accuracy. The FPPE is able to balance accuracy and speed well; the training and testing CNNs are highly accurate but not at the expense of speed. For example, the FPPE improves the training speed by at least 11.9754 TFLOPS, while the classification error (top-5) is still 4.84 points lower than that of the non-FPP framework, and the gains are bigger for large batches.
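The depth-weighted totals described above can be computed directly; the following sketch uses made-up per-layer work and time figures, and the depth weights are our illustrative reading of the weighting rule, not the paper's measurements.

# Per-layer floating-point work (TFLOPs) and GPU run times (s); made-up values.
layer_tflop = [12.0, 9.5, 15.0, 8.0]
layer_time = [0.8, 1.1, 2.0, 1.3]
depths = [1, 2, 3, 4]                    # layer depth, used as the weight

# Total throughput = depth-weighted work divided by depth-weighted run time.
total_work = sum(d * w for d, w in zip(depths, layer_tflop))
total_time = sum(d * t for d, t in zip(depths, layer_time))
print(f"total throughput = {total_work / total_time:.2f} TFLOPS")

# Per-layer effective throughput, judged against the GPU peak (8.92 TFLOPS).
peak = 8.92
for i, (w, t) in enumerate(zip(layer_tflop, layer_time), start=1):
    tput = w / t
    print(f"layer {i}: {tput:.2f} TFLOPS, effective = {tput > peak}")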
VI. CONCLUSION
In this study, a Fourier domain convolutional neural network framework that incorporates the Fourier domain exponential linear unit and the Fourier domain pyramid pooling method is explored. Using different backbone networks, the proposed framework shows the advantages of training the networks completely in the Fourier domain and the powerful Fourier domain representation ability for improving the classification accuracy. One of the best FPPE nets in our experiments, the FPPE-Re-50 network, possesses strong robustness and accuracy when using the MetaGram-1 dataset. In addition, all the other FPPE nets also obtain great classification results. This suggests that the proposed framework is a promising architecture for establishing accurate Fourier domain representations for CNNs, which can accelerate the development of new Fourier domain deep learning frameworks and provide a shortcut for researchers studying their newly designed Fourier domain networks.


JINHUA LIN received the B.S. and M.S. degrees in computer science and technology from Xi'an Jiao Tong University, Xi'an, China, in 2004 and 2008, respectively, and the Ph.D. degree in mechatronic engineering from the University of Chinese Academy of Sciences, Beijing, China, in 2017. She is currently an Associate Professor with the Changchun University of Technology, Changchun, China. Her research interests include computational neuroscience, machine learning, computer vision, and the engineering applications of artificial intelligence.

LIN MA received the B.S. degree in material shaping and control engineering from Sichuan University, Chengdu, China, in 2000. He is currently a Senior Engineer with FAW Foundry Company Ltd., Changchun, China. His research interests include casting process simulation and the engineering applications of artificial intelligence.

YU YAO received the B.S. and M.S. degrees in mechanical engineering from the Changchun University of Technology, Changchun, China, in 2006 and 2009, respectively, and the Ph.D. degree in mechanical engineering from Jilin University, Changchun, in 2016, specializing in tracked vehicle engineering. Her main research interests include computational neuroscience, industrial robotics, and vehicle engineering.

