Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Article history: Received 2 April 2019; Revised 9 December 2019; Accepted 11 February 2020; Available online 20 February 2020.
Communicated by Dr Li Zhifeng
Keywords: Convolutional neural network; Complex; Quaternion; Octonion; Image classification

Abstract: Deep learning is a hot research topic in the field of machine learning methods and applications. Real-valued neural networks (Real NNs), especially deep real networks (DRNs), have been widely used in many research fields. In recent years, deep complex networks (DCNs) and deep quaternion networks (DQNs) have attracted more and more attention. The octonion algebra, which is an extension of complex algebra and quaternion algebra, can provide more efficient and compact expressions. This paper constructs a general framework of deep octonion networks (DONs) and provides the main building blocks of DONs, such as octonion convolution, octonion batch normalization, and octonion weight initialization; DONs are then used in image classification tasks on the CIFAR-10 and CIFAR-100 data sets. Compared with DRNs, DCNs, and DQNs, the proposed DONs have better convergence and higher classification accuracy. The success of DONs is also explained by multi-task learning.

https://doi.org/10.1016/j.neucom.2020.02.053
0925-2312/© 2020 Elsevier B.V. All rights reserved.
180 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191
which can be expressed in the real matrix form as follows:

\begin{bmatrix}
R(W_o \ast_o x_o)\\
I(W_o \ast_o x_o)\\
J(W_o \ast_o x_o)\\
K(W_o \ast_o x_o)\\
E(W_o \ast_o x_o)\\
L(W_o \ast_o x_o)\\
M(W_o \ast_o x_o)\\
N(W_o \ast_o x_o)
\end{bmatrix}
=
\begin{bmatrix}
W_0 & -W_1 & -W_2 & -W_3 & -W_4 & -W_5 & -W_6 & -W_7\\
W_1 &  W_0 & -W_3 &  W_2 & -W_5 &  W_4 &  W_7 & -W_6\\
W_2 &  W_3 &  W_0 & -W_1 & -W_6 & -W_7 &  W_4 &  W_5\\
W_3 & -W_2 &  W_1 &  W_0 & -W_7 &  W_6 & -W_5 &  W_4\\
W_4 &  W_5 &  W_6 &  W_7 &  W_0 & -W_1 & -W_2 & -W_3\\
W_5 & -W_4 &  W_7 & -W_6 &  W_1 &  W_0 &  W_3 & -W_2\\
W_6 & -W_7 & -W_4 &  W_5 &  W_2 & -W_3 &  W_0 &  W_1\\
W_7 &  W_6 & -W_5 & -W_4 &  W_3 &  W_2 & -W_1 &  W_0
\end{bmatrix}
\ast
\begin{bmatrix}
x_0\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6\\ x_7
\end{bmatrix},
\qquad (6)

where $\ast_o$ and $\ast$ denote octonion convolution and real convolution, respectively, $x_i \in R^N$ and $W_i \in R^{N \times N}$ with $i = 0, 1, \ldots, 7$. $R(\cdot)$ denotes the real part of its argument; $I(\cdot)$, $J(\cdot)$, $K(\cdot)$, $E(\cdot)$, $L(\cdot)$, $M(\cdot)$ and $N(\cdot)$ denote its seven imaginary parts, respectively. The implementation of the octonion convolution operation is shown in Fig. 2, where M_r, M_i, M_j, M_k, M_e, M_l, M_m, M_n refer to the eight parts of the feature maps, K_r, K_i, K_j, K_k, K_e, K_l, K_m, K_n refer to the eight parts of the kernels, and M_{p1} ∗ K_{p2} (p1, p2 = r, i, j, k, e, l, m, n) refers to the result of a real convolution between a feature map part and a kernel part. The real representations of complex convolution, quaternion convolution, and octonion convolution are shown in Appendix 1. From the latter, we can see that octonion convolution is a kind of mixed convolution, similar to a mixture of standard convolution and depthwise separable convolution, with certain links to the original convolution [76].

Traditional real-valued convolution simply multiplies each channel of the kernel by the corresponding channel of the image. The goal of octonion convolution, by contrast, is to generate a unique linear combination of the axes from the result of each single axis, allowing each axis of the kernel to interact with each axis of the image and hence allowing the channel depths to be mixed linearly, as dictated by the structure of octonion multiplication. For example, using 8 kernels (m × n × 8) to convolve 8 channels of feature maps (M × N × 8) finally generates one feature map. Conventional convolution applies one convolution kernel to one feature map and adds the result to that of the previous operation, regardless of the correlation between the feature maps. Octonion convolution, following the octonion arithmetic rule, applies eight convolution kernels to each feature map; applying a 1 × 1 convolution to the result of the previous operation then yields the linear interactions of the feature maps and thus the new feature map space.

3.3. Octonion batch normalization module

Batch normalization [31] can accelerate deep network training by reducing internal covariate shift. It allows us to use much higher learning rates and to be less careful about initialization. When applying batch normalization to real numbers, it is sufficient to translate and scale these numbers such that their mean is zero and their variance is one. However, when applying batch normalization to complex or quaternion numbers, this cannot ensure equal variance in the real and imaginary components. To overcome this problem, a whitening approach is used in [65,67], which scales the data by the square root of their variances
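As a sanity check on the real representation in Eq. (6), the 8 × 8 block pattern can be exercised with scalar components in place of the real convolutions (a minimal sketch; the index/sign tables below are transcribed from Eq. (6), everything else is our own illustration). Since the octonions form a composition algebra, the norm of a product must equal the product of the norms, which the script verifies:

```python
import numpy as np

# Index and sign pattern of the 8x8 real matrix in Eq. (6):
# entry (p, q) of the matrix is SGN[p][q] * W[IDX[p][q]].
IDX = [[0, 1, 2, 3, 4, 5, 6, 7],
       [1, 0, 3, 2, 5, 4, 7, 6],
       [2, 3, 0, 1, 6, 7, 4, 5],
       [3, 2, 1, 0, 7, 6, 5, 4],
       [4, 5, 6, 7, 0, 1, 2, 3],
       [5, 4, 7, 6, 1, 0, 3, 2],
       [6, 7, 4, 5, 2, 3, 0, 1],
       [7, 6, 5, 4, 3, 2, 1, 0]]
SGN = [[+1, -1, -1, -1, -1, -1, -1, -1],
       [+1, +1, -1, +1, -1, +1, +1, -1],
       [+1, +1, +1, -1, -1, -1, +1, +1],
       [+1, -1, +1, +1, -1, +1, -1, +1],
       [+1, +1, +1, +1, +1, -1, -1, -1],
       [+1, -1, +1, -1, +1, +1, +1, -1],
       [+1, -1, -1, +1, +1, -1, +1, +1],
       [+1, +1, -1, -1, +1, +1, -1, +1]]

def octonion_matrix(w):
    """8x8 real matrix representing left multiplication by the octonion w."""
    return np.array([[SGN[p][q] * w[IDX[p][q]] for q in range(8)]
                     for p in range(8)])

def octonion_product(w, x):
    """Octonion product via the real representation of Eq. (6)."""
    return octonion_matrix(w) @ x

rng = np.random.default_rng(0)
w, x = rng.standard_normal(8), rng.standard_normal(8)
# Composition-algebra property: |w * x| = |w| |x|.
assert np.isclose(np.linalg.norm(octonion_product(w, x)),
                  np.linalg.norm(w) * np.linalg.norm(x))
```

In the octonion convolution itself, each scalar product SGN[p][q] * W[IDX[p][q]] * x[q] becomes a real convolution between a kernel part and a feature map part, as illustrated in Fig. 2.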
Fig. 2. Illustration of the real convolution (a) and octonion convolution (b).
along each principal component. In this section, we use a similar approach, but treat the issue as the whitening of an 8D vector.

Firstly, whitening is accomplished by multiplying the zero-centered data (x − E[x]) by the inverse square root of the covariance matrix V:

\tilde{x} = \frac{x - E[x]}{\sqrt{V}}, \qquad (7)

and

V =
\begin{bmatrix}
V_{rr} & V_{ri} & V_{rj} & V_{rk} & V_{re} & V_{rl} & V_{rm} & V_{rn}\\
V_{ir} & V_{ii} & V_{ij} & V_{ik} & V_{ie} & V_{il} & V_{im} & V_{in}\\
V_{jr} & V_{ji} & V_{jj} & V_{jk} & V_{je} & V_{jl} & V_{jm} & V_{jn}\\
V_{kr} & V_{ki} & V_{kj} & V_{kk} & V_{ke} & V_{kl} & V_{km} & V_{kn}\\
V_{er} & V_{ei} & V_{ej} & V_{ek} & V_{ee} & V_{el} & V_{em} & V_{en}\\
V_{lr} & V_{li} & V_{lj} & V_{lk} & V_{le} & V_{ll} & V_{lm} & V_{ln}\\
V_{mr} & V_{mi} & V_{mj} & V_{mk} & V_{me} & V_{ml} & V_{mm} & V_{mn}\\
V_{nr} & V_{ni} & V_{nj} & V_{nk} & V_{ne} & V_{nl} & V_{nm} & V_{nn}
\end{bmatrix}, \qquad (8)

where E[x] refers to the average value of each batch of training data x ∈ O^N, and V is the 8 × 8 covariance matrix of each batch of data x.

In order to avoid calculating $(\sqrt{V})^{-1}$, Eq. (7) can be computed as follows:

\tilde{x} = U(x - E[x]), \qquad (9)

where U is one of the matrices from the Cholesky decomposition of V^{−1}, and each entry of the matrix U is given in Appendix 2.

Secondly, the forward conduction formula of the octonion batch normalization layer is defined as

\mathrm{OctonionBN}(\tilde{x}) = \gamma \tilde{x} + \beta, \qquad (10)

where β = E(x) is a learned parameter with eight real parameters (one real part and seven imaginary parts) and γ = \sqrt{V} is also a learned parameter with only 36 independent real parameters,

\gamma =
\begin{bmatrix}
\gamma_{rr} & \gamma_{ri} & \gamma_{rj} & \gamma_{rk} & \gamma_{re} & \gamma_{rl} & \gamma_{rm} & \gamma_{rn}\\
\gamma_{ir} & \gamma_{ii} & \gamma_{ij} & \gamma_{ik} & \gamma_{ie} & \gamma_{il} & \gamma_{im} & \gamma_{in}\\
\gamma_{jr} & \gamma_{ji} & \gamma_{jj} & \gamma_{jk} & \gamma_{je} & \gamma_{jl} & \gamma_{jm} & \gamma_{jn}\\
\gamma_{kr} & \gamma_{ki} & \gamma_{kj} & \gamma_{kk} & \gamma_{ke} & \gamma_{kl} & \gamma_{km} & \gamma_{kn}\\
\gamma_{er} & \gamma_{ei} & \gamma_{ej} & \gamma_{ek} & \gamma_{ee} & \gamma_{el} & \gamma_{em} & \gamma_{en}\\
\gamma_{lr} & \gamma_{li} & \gamma_{lj} & \gamma_{lk} & \gamma_{le} & \gamma_{ll} & \gamma_{lm} & \gamma_{ln}\\
\gamma_{mr} & \gamma_{mi} & \gamma_{mj} & \gamma_{mk} & \gamma_{me} & \gamma_{ml} & \gamma_{mm} & \gamma_{mn}\\
\gamma_{nr} & \gamma_{ni} & \gamma_{nj} & \gamma_{nk} & \gamma_{ne} & \gamma_{nl} & \gamma_{nm} & \gamma_{nn}
\end{bmatrix}. \qquad (11)

Similar to [65] and [67], the diagonal of γ is initialized to $1/\sqrt{8}$, and the off-diagonal terms of γ and all components of β are initialized to 0.

3.4. Octonion weight initialization method

Before starting to train the network, we need to initialize its parameters. If the weights are initialized to the same value, the updated weights will also be the same, which means that the network cannot learn distinct features. For deep neural networks, such an initialization makes the depth meaningless and cannot outperform a linear classifier. Therefore, the initial weight values should all be different, close to but not equal to 0, which not only ensures a difference between input and output but also allows the model to converge stably and quickly. In view of this, we provide an initialization method for octonion weights. The 8 parts of every octonion weight W_o ∈ O^{N×N} are assumed to be independent Gaussian random variables with zero mean and the same variance σ². Then, the variance of W_o is given by:

\mathrm{Var}(W_o) = E\big[|W_o|^2\big] - \big|E[W_o]\big|^2. \qquad (12)

As the 8 parts are zero-mean, E[W_o] = 0, and as they have the same variance, Var(W_o) = 8σ². The value of the standard deviation σ is set according to the following:

\sigma =
\begin{cases}
1\big/\big(2\sqrt{n_{in} + n_{out}}\big), & \text{if Glorot's initialization [77] is used}\\[4pt]
1\big/\big(2\sqrt{n_{in}}\big), & \text{if He's initialization [78] is used}
\end{cases} \qquad (13)

4. Implementation and experimental results

Similar to the 110-layer deep real networks [25], we designed an octonion convolutional neural network, named deep octonion networks, whose schematic diagram is shown in Fig. 3. Fig. 3(a) shows the detailed convolution structure of the four stages, and Fig. 3(b) shows the entire structure, including the input and output modules. We then performed the image classification tasks on CIFAR-10 and CIFAR-100 [66] to verify the validity of the proposed deep octonion networks. The following experiment was implemented using Keras (with TensorFlow as backend) on a PC running Ubuntu 16.04 with an Intel(R) Core(TM) i7-2600 CPU at 3.40 GHz, 64 GB of RAM, and two NVIDIA GeForce GTX 1080 Ti GPUs.

4.1. Model configurations

4.1.1. Octonion input construction

Since the images in the CIFAR-10 and CIFAR-100 datasets are real-valued while the input of the proposed deep octonion networks needs to be an octonion matrix, the latter has to be derived first. The octonion has one real part and seven imaginary parts; we put the original N training real images into the real part and, similar to [65] and [67], obtain the seven imaginary parts of the octonion matrix by performing a single real-valued residual block (BN → ReLU → Conv → BN → ReLU → Conv) [25] 7 times in parallel. The 8 vectors are then concatenated along a given axis to form a new octonion vector.

4.1.2. The structure of deep octonion networks

The OctonionConv → OctonionBN → ReLU operation is performed on the obtained octonion input, where OctonionConv denotes the octonion convolution module of Section 3.2 and OctonionBN denotes the octonion batch normalization module of Section 3.3. The octonion output is then sent to the next three stages. In each stage, there are several residual blocks with double convolution layers. The shapes of the feature maps within each of the three stages are the same, and their number is increased gradually to ensure the expressive ability of the output features. To speed up training, the following layer is an AveragePooling2D layer, followed by a fully connected layer, called Dense, to classify the input. The deep octonion network model sets the number of residual blocks in the three stages to 10, 9, and 9, respectively, and the number of convolution filters to 32, 64, and 128. The batch size is set to 64.

4.1.3. The training of deep octonion networks

The deep octonion networks are then compiled; the cross-entropy loss function and the stochastic gradient descent method are chosen for training the model. The Nesterov momentum is set to 0.9 in the back propagation of stochastic gradient descent.
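The whitening of Eqs. (7)–(9) can be sketched numerically. The snippet below (our own illustration with synthetic data, not code from the paper) builds U from the Cholesky factor of V^{-1} and checks that the whitened octonion components are decorrelated with unit variance; the learned γ and β of Eq. (10) would then be applied to the whitened output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic batch of octonion activations: 10000 samples, 8 correlated parts.
x = rng.standard_normal((10000, 8)) @ rng.standard_normal((8, 8))

mu = x.mean(axis=0)                       # E[x], one mean per octonion part
V = np.cov(x, rowvar=False)               # 8x8 covariance matrix of Eq. (8)
L = np.linalg.cholesky(np.linalg.inv(V))  # V^{-1} = L L^T (L lower triangular)
U = L.T                                   # then U^T U = V^{-1}
x_tilde = (x - mu) @ U.T                  # Eq. (9): x~ = U (x - E[x])

# The whitened parts are zero-mean, decorrelated, and of unit variance.
assert np.allclose(np.cov(x_tilde, rowvar=False), np.eye(8))
```

Using the upper-triangular factor U keeps the per-sample cost at a single 8 × 8 matrix–vector product, exactly as Eq. (9) intends.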
Table 2
The learning rate of the octonion convolutional neural network.

Epoch        Learning rate
(0, 20)      0.01
(20, 60)     0.1
(60, 80)     0.01
(80, 110)    0.001
(110, 120)   0.0001

Table 3
The classification error rate (%) of the models on the two datasets. FLOPs and MACCs denote floating point operations and multiply-accumulate operations, respectively.

Architecture      Params      FLOPs           MACCs        CIFAR-10   CIFAR-100
Real [25]         3,619,844   1,081,333,248   340,380,416  6.37       –
Complex [65]      1,823,620   541,132,288     270,273,792  5.60       27.09
Quaternion [67]   932,792     271,922,688     135,662,848  5.44       26.01
Octonion          481,150     137,350,144     68,368,896   5.35       24.60
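The schedule in Table 2 can be expressed as a simple epoch-indexed function (a sketch in plain Python; in the Keras setup described above it could be attached via a LearningRateScheduler callback, which is our suggestion rather than a detail given in the paper):

```python
def learning_rate(epoch):
    """Piecewise-constant learning-rate schedule from Table 2."""
    if epoch < 20:
        return 0.01
    if epoch < 60:
        return 0.1
    if epoch < 80:
        return 0.01
    if epoch < 110:
        return 0.001
    return 0.0001

# First epoch of each phase of Table 2.
print([learning_rate(e) for e in (0, 20, 60, 80, 110)])
# → [0.01, 0.1, 0.01, 0.001, 0.0001]
```

The warm-up at 0.01 for the first 20 epochs before raising the rate to 0.1 mirrors the schedule used by the deep complex and quaternion baselines.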
models to perform our desired task, and the output of the system is real. When facing complex learning problems, traditional methods chose a similar approach: firstly, decompose the complex learning problem into simple and independent sub-problems; then study each sub-problem separately; finally, establish the mathematical model of the complex problem by combining the sub-problem learning results, refining the model through fine-tuning until the performance no longer improves. These operations seem reasonable but are not accurate, because many problems in the real world cannot be decomposed into independent sub-problems, and the rich interrelated information between the sub-problems cannot be ignored. Even when such a decomposition is possible, the sub-problems remain interrelated, connected by shared factors or shared representations. To solve this problem, multi-task learning (MTL) was born [73]. Compared to single-task learning (STL, learning just one task at a time), MTL is a kind of joint learning that learns multiple related tasks together based on shared representations. The purpose of the shared representation is to improve generalization. Multiple tasks are learned in parallel, and their results affect each other.

5.1. The relationship between DONs and MTL

MTL can be seen as a method inspired by human learning: we often learn tasks to acquire the necessary skills to master more complex problems. There are many forms of MTL that can further improve CNN performance. Fig. 6(a) and (b) show a single-task learning and a multi-task feedforward neural network with one input layer, two hidden layers, and one output layer, respectively. In single-task learning, learning between tasks is independent of each other; in MTL, parameters are shared between multiple tasks. MTL methods can be divided into two categories based on how parameters are shared between different task models. In the soft parameter sharing category, each task has its own model and its own parameters, and the methods focus on how to design weight-sharing approaches; the most common way is to share all convolutional layers and split the fully connected layers for the losses of the specific tasks, as in Cross-Stitch Networks [81], the Sluice Network [82], etc. In the hard parameter sharing category, all task models share exactly the same feature extractor, and each branch head executes its own task. In the context of deep learning, MTL is usually done by sharing hard or soft parameters of the hidden layers [74]. Fig. 6(b) shows the MTL mode of hard parameter sharing. We refer to the structure between the input layer and the output layer as the shared layer.

However, it is currently difficult to determine the best position for shared features [83] or the best sharing/splitting scheme [81], and existing CNN structures only accept feature tensors with a fixed number of feature channels. If multi-tasking is grafted onto an existing CNN structure, the number of channels grows with the number of tasks, and the concatenated features become unusable for the subsequent layers of the CNN. There are multiple solutions to this problem. One is the NDDR layer proposed in [84], which extends existing CNN architectures in a plug-and-play fashion and uses feature transformations to discriminate cascaded features and to reduce their dimensions. The DONs proposed in this paper solve the channel number mismatch by improving the network structure, constructing a new octonion convolutional neural network for eight-task learning.

Generally, when considering the optimization of more than one loss function, we are effectively dealing with an MTL problem. Even if we focus on only one task, that is, there is only one optimization goal, the learning of auxiliary tasks may still help improve the learning performance of the main task. Auxiliary tasks can provide an inductive bias that makes the model more inclined towards solutions that explain multiple tasks at the same time, and the generalization performance of the model is then better. The DONs shown in Fig. 6(c) use a network structure with hard parameter sharing, similar to Fig. 6(b), to learn eight related tasks together. The input of the last seven tasks is learned through the input of the first task, and they play a supporting role for the first task; there are therefore both relevant and irrelevant parts in these eight tasks. The details of the DONs' shared layer follow the rules in Fig. 2.

Fig. 6. The comparison of single-task learning (a), multi-task learning (b) and deep octonion networks (c).

5.2. Effectiveness of DONs

In [75], it has been proven that the number of parameters in a multi-task model is smaller than the number of parameters needed to establish multiple separate models; the joint optimization reduces the risk of over-fitting, and the generalization ability is then stronger. The effectiveness of MTL is mainly reflected in the following five aspects: implicit data augmentation, attention focusing, eavesdropping, representation bias, and regularization. Considering the DONs as a specific form of MTL is therefore supported by the following:

- Neural networks can help the hidden layers avoid local minima through the interaction between different tasks during learning. When learning the main task, the parts unrelated to the task produce noise during the learning process. Since different tasks have different noise patterns, a model that learns eight tasks simultaneously is able to learn a more general representation: the eight tasks learned at the same time average out the noise patterns, which makes the model more representative of the data. This is similar to implicit data augmentation in MTL.

- DONs take the first of the eight tasks as the main one; the latter seven tasks are learned through the input of the first task and assist it. The gap between tasks is not particularly large, so the model can focus on those features that do have an impact, as in attention focusing for MTL. Studies have shown that if auxiliary tasks and main tasks use the same characteristics for decision-making, they benefit more from MTL; therefore, we need to find suitable auxiliary tasks to benefit from MTL. The choice of auxiliary tasks is diverse [74].

- DONs restrict the model by using the octonion operation rules, so that models that are more in line with real rules can be selected from the hypothesis space. This kind of regularization reduces the risk of overfitting as well as the complexity of the model.
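The roughly eightfold parameter reduction from the real to the octonion model visible in Table 3 follows directly from the weight sharing in Eq. (6): the eight real kernels W0, …, W7 are reused across all eight input/output component pairs. A back-of-envelope check (the layer size below is illustrative, not taken from the paper):

```python
# A real conv layer: c_in * c_out * k * k independent weights.
c_in, c_out, k = 256, 256, 3
real_params = c_in * c_out * k * k

# The same layer as an octonion conv: the 256 real channels form 32
# octonion channels, and each (input, output) octonion channel pair
# shares the eight kernels W0..W7 of Eq. (6).
oct_in, oct_out = c_in // 8, c_out // 8
octonion_params = 8 * oct_in * oct_out * k * k

print(real_params // octonion_params)  # → 8
```

The measured counts in Table 3 deviate slightly from an exact factor of 8 because the input stage, the Dense classifier, and the batch normalization parameters are not shared in this way.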
In addition, as described in the previous section, hard parameter sharing (and DONs can be considered such models) is the most common method of MTL in neural networks and can greatly reduce the risk of overfitting [74]. It has been demonstrated in [75] that the risk of overfitting the shared parameters is smaller than that of overfitting the task-specific parameters. What is more, regarding the number of tasks as N, a larger N means that more tasks are learned simultaneously, so the model can better find a representation that captures all tasks, and the original task is less likely to overfit. This also explains why the performance can be improved when the relationships between convolution kernels are modeled by complex algebra, quaternion algebra, and octonion algebra, and why deep neural networks can achieve better results in the octonion domain.

6. Conclusion

Declaration of Competing Interest

The authors declare that they have no known conflicts of interest.

CRediT authorship contribution statement

Jiasong Wu: Conceptualization, Methodology, Writing - review & editing. Ling Xu: Writing - original draft, Software, Visualization. Fuzhi Wu: Software, Validation, Data curation. Youyong Kong: Validation, Project administration. Lotfi Senhadji: Formal analysis, Writing - review & editing. Huazhong Shu: Supervision, Resources.
Table A1
The real representations of complex convolution, quaternion convolution, and octonion convolution. $e_i$ denotes an imaginary unit, where $e_i^2 = -1$, $e_i e_j = -e_j e_i$, $(e_i e_j)e_k = -e_i(e_j e_k)$, $\forall i \neq j \neq k$, $1 \le i, j, k \le 7$. $\ast$, $\ast_c$, $\ast_q$, and $\ast_o$ denote real convolution, complex convolution, quaternion convolution, and octonion convolution, respectively. $R(\cdot)$ denotes the real component of its argument; $I(\cdot)$, $J(\cdot)$, $K(\cdot)$, $E(\cdot)$, $L(\cdot)$, $M(\cdot)$ and $N(\cdot)$ denote its different imaginary components, respectively. $x_r \in R^N$, $x_c \in C^N$, $x_q \in Q^N$, $x_o \in O^N$, $W_r \in R^{N \times N}$, $W_c \in C^{N \times N}$, $W_q \in Q^{N \times N}$, $W_o \in O^{N \times N}$, $x_i \in R^N$, $W_i \in R^{N \times N}$, $i = 0, 1, \ldots, 7$, where R, C, Q, and O denote the real, complex, quaternion, and octonion domains, respectively.

Real [25]:
input $x_r$, kernel $W_r$;
$R(W_r \ast x_r) = W_r \ast x_r = \sum_i \sum_j W_r[i, j]\, x_r[m - i, n - j]$.

Complex [65]:
$x_c = x_0 + x_1 e_1$, $W_c = W_0 + W_1 e_1$,
$W_c \ast_c x_c = (W_0 + W_1 e_1) \ast_c (x_0 + x_1 e_1)$;
\begin{bmatrix} R(W_c \ast_c x_c)\\ I(W_c \ast_c x_c) \end{bmatrix}
=
\begin{bmatrix} W_0 & -W_1\\ W_1 & W_0 \end{bmatrix}
\ast
\begin{bmatrix} x_0\\ x_1 \end{bmatrix}.

Quaternion [67]:
$x_q = x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3$, $W_q = W_0 + W_1 e_1 + W_2 e_2 + W_3 e_3$,
$W_q \ast_q x_q = (W_0 + W_1 e_1 + W_2 e_2 + W_3 e_3) \ast_q (x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3)$;
\begin{bmatrix}
R(W_q \ast_q x_q)\\ I(W_q \ast_q x_q)\\ J(W_q \ast_q x_q)\\ K(W_q \ast_q x_q)
\end{bmatrix}
=
\begin{bmatrix}
W_0 & -W_1 & -W_2 & -W_3\\
W_1 &  W_0 & -W_3 &  W_2\\
W_2 &  W_3 &  W_0 & -W_1\\
W_3 & -W_2 &  W_1 &  W_0
\end{bmatrix}
\ast
\begin{bmatrix} x_0\\ x_1\\ x_2\\ x_3 \end{bmatrix}.

Octonion:
$x_o = x_0 + x_1 e_1 + \cdots + x_7 e_7$, $W_o = W_0 + W_1 e_1 + \cdots + W_7 e_7$,
$W_o \ast_o x_o = (W_0 + W_1 e_1 + W_2 e_2 + W_3 e_3 + W_4 e_4 + W_5 e_5 + W_6 e_6 + W_7 e_7) \ast_o (x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3 + x_4 e_4 + x_5 e_5 + x_6 e_6 + x_7 e_7)$;
\begin{bmatrix}
R(W_o \ast_o x_o)\\ I(W_o \ast_o x_o)\\ J(W_o \ast_o x_o)\\ K(W_o \ast_o x_o)\\
E(W_o \ast_o x_o)\\ L(W_o \ast_o x_o)\\ M(W_o \ast_o x_o)\\ N(W_o \ast_o x_o)
\end{bmatrix}
=
\begin{bmatrix}
W_0 & -W_1 & -W_2 & -W_3 & -W_4 & -W_5 & -W_6 & -W_7\\
W_1 &  W_0 & -W_3 &  W_2 & -W_5 &  W_4 &  W_7 & -W_6\\
W_2 &  W_3 &  W_0 & -W_1 & -W_6 & -W_7 &  W_4 &  W_5\\
W_3 & -W_2 &  W_1 &  W_0 & -W_7 &  W_6 & -W_5 &  W_4\\
W_4 &  W_5 &  W_6 &  W_7 &  W_0 & -W_1 & -W_2 & -W_3\\
W_5 & -W_4 &  W_7 & -W_6 &  W_1 &  W_0 &  W_3 & -W_2\\
W_6 & -W_7 & -W_4 &  W_5 &  W_2 & -W_3 &  W_0 &  W_1\\
W_7 &  W_6 & -W_5 & -W_4 &  W_3 &  W_2 & -W_1 &  W_0
\end{bmatrix}
\ast
\begin{bmatrix}
x_0\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6\\ x_7
\end{bmatrix}.
References

[1] W.S. McCulloch, W. Pitts, A logical calculus of ideas immanent in nervous activity, Bull. Math. Biophys. 5 (4) (1943) 115–133.
[2] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, 1998, ISBN 0-13-273350-1.
[3] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[4] Y. LeCun, Y. Bengio, G.E. Hinton, Deep learning, Nature 521 (2015) 436–444.
[5] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[6] L. Deng, D. Yu, Deep learning: methods and applications, Found. Trends Signal Process. 7 (3–4) (2014) 197–387.
[7] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.
[8] H. Wang, B. Raj, On the origin of deep learning, ArXiv e-prints, arXiv:1702.07800, 2017.
[9] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuaib, et al., Recent advances in convolutional neural networks, Pattern Recognit. 77 (2018) 354–377.
[10] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
[11] A. Prieto, B. Prieto, E.M. Ortigosa, E. Ros, F. Pelayo, J. Ortega, et al., Neural networks: an overview of early research, current frameworks and new challenges, Neurocomputing 214 (2016) 242–268.
[12] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[13] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, Washington DC, 1961.
[14] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323 (6088) (1986) 533–536.
[15] G. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[16] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[17] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[18] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of NIPS, 2012.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, Going deeper with convolutions, in: Proceedings of CVPR, 2015.
[20] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of CVPR, 2016, pp. 2818–2826.
[22] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 4278–4284.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, ArXiv e-prints, arXiv:1409.1556, 2014.
[24] R.K. Srivastava, K. Greff, J. Schmidhuber, Training very deep networks, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2377–2385.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of CVPR, 2016.
[26] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of CVPR, 2017, pp. 1492–1500.
[27] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of CVPR, 2017, pp. 4700–4708.
[28] G. Larsson, M. Maire, G. Shakhnarovich, FractalNet: ultra-deep neural networks without residuals, in: International Conference on Learning Representations (ICLR), 2017.
[29] X. Zhang, Z. Li, C.C. Loy, D. Lin, PolyNet: a pursuit of structural diversity in very deep networks, in: Proceedings of CVPR, 2017, pp. 718–726.
[30] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of CVPR, 2017, pp. 7132–7141.
[31] Y. Yang, Z. Zhong, T. Shen, Z. Lin, Convolutional neural networks with alternately updated clique, in: Proceedings of CVPR, 2018, pp. 2413–2422.
[32] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks, in: Proceedings of NIPS, 2016, pp. 4107–4115.
[33] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, in: Proceedings of ICLR, 2017.
[34] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., MobileNets: efficient convolutional neural networks for mobile vision applications, ArXiv e-prints, arXiv:1704.04861, 2017.
[35] J.L. Elman, Finding structure in time, Cogn. Sci. 14 (2) (1990) 179–211.
[36] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[37] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Netw. 18 (5) (2005) 602–610.
[38] K. Cho, B.V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[39] W.R. Hamilton, Lectures on quaternions, Nature 57 (1462) (2010) 7.
[40] T.A. Ell, S.J. Sangwine, Hypercomplex Fourier transforms of color images, IEEE Trans. Image Process. 16 (1) (2007) 22–35.
[41] C.C. Took, D.P. Mandic, The quaternion LMS algorithm for adaptive filtering of hypercomplex processes, IEEE Trans. Signal Process. 57 (2009) 1316–1327.
[42] R. Zeng, J.S. Wu, Z.H. Shao, Y. Chen, B.J. Chen, L. Senhadji, H.Z. Shu, Color image classification via quaternion principal component analysis network, Neurocomputing 216 (2016) 416–428.
[43] S. Okubo, Introduction to Octonion and Other Non-Associative Algebras in Physics, Cambridge University Press, 1995.
[44] H.Y. Gao, K.M. Lam, From quaternion to octonion: feature-based image saliency detection, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[45] Ł. Błaszczyk, K.M. Snopek, Octonion Fourier transform of real-valued functions of three variables - selected properties and examples, Signal Process. 136 (2017) 29–37.
[46] B. Widrow, J. McCool, M. Ball, The complex LMS algorithm, Proc. IEEE 63 (1975) 719–720.
[47] A. Hirose, Complex-Valued Neural Networks: Advances and Applications, John Wiley & Sons, 2013.
[48] Q. Song, H. Yan, Z. Zhao, Y. Liu, Global exponential stability of complex-valued neural networks with both time-varying delays and impulsive effects, Neural Netw. 79 (2016) 108–116.
[49] P. Arena, L. Fortuna, L. Occhipinti, M.G. Xibilia, Neural networks for quaternion-valued function approximation, Int. Symp. Circuits Syst. 6 (1994) 307–310.
[50] T. Isokawa, T. Kusakabe, N. Matsui, F. Peper, Quaternion neural network and its application, in: Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 2003, pp. 318–324.
[51] H. Kusamichi, T. Isokawa, N. Matsui, Y. Ogawa, K. Maeda, A new scheme for color night vision by quaternion neural network, in: Proceedings of the International Conference on Autonomous Robots and Agents, 2004, pp. 101–106.
[52] C. Jahanchahi, C. Took, D. Mandic, On HR calculus, quaternion valued stochastic gradient, and adaptive three dimensional wind forecasting, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE, 2010, pp. 1–5.
[53] T. Isokawa, T. Kusakabe, N. Matsui, F. Peper, Quaternion neural network and its application, in: Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 2003, pp. 318–324.
[54] N. Matsui, T. Isokawa, H. Kusamichi, F. Peper, H. Nishimura, Quaternion neural network with geometrical operators, J. Intell. Fuzzy Syst. Appl. Eng. Technol. 15
[68] T. Parcollet, M. Morchid, P.M. Bousquet, R. Dufour, G. Linarès, R. De Mori, Quaternion neural networks for spoken language understanding, in: IEEE Spoken Language Technology Workshop, 2016, pp. 362–368.
[69] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, R. De Mori, Speech recognition with quaternion neural networks, in: Conference on Neural Information Processing Systems, 2018.
[70] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, R. De Mori, Quaternion recurrent neural networks, in: Proceedings of ICLR, 2019.
[71] T. Parcollet, M. Morchid, G. Linarès, R. De Mori, Bidirectional quaternion long short-term memory recurrent neural networks for speech recognition, in: ICASSP, 2019, pp. 8519–8523.
[72] T. Parcollet, M. Morchid, G. Linarès, Quaternion convolutional neural networks for heterogeneous image processing, in: ICASSP, 2019, pp. 8514–8518.
[73] R. Caruana, Multitask learning, Auton. Agent Multi Agent Syst. 27 (1) (1998) 95–133.
[74] S. Ruder, An overview of multi-task learning in deep neural networks, 2017.
[75] J. Baxter, A Bayesian/information theoretic model of learning to learn via multiple task sampling, Mach. Learn. 28 (1997) 7–39.
[76] F. Chollet, Xception: deep learning with depthwise separable convolutions, in: Proceedings of CVPR, 2017, pp. 1251–1258.
[77] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[78] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[79] K. He, J. Sun, Convolutional neural networks at constrained time cost, in: Proceedings of CVPR, 2015, pp. 5353–5360.
[80] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
[81] I. Misra, A. Shrivastava, A. Gupta, M. Hebert, Cross-stitch networks for multi-task learning, in: Proceedings of CVPR, 2016, pp. 3994–4003.
[82] S. Ruder, J. Bingel, I. Augenstein, A. Søgaard, Learning what to share between loosely related tasks, ArXiv e-prints, arXiv:1705.08142, 2017.
[83] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2014, pp. 818–833.
[84] Y. Gao, J. Ma, M. Zhao, W. Liu, A.L. Yuille, NDDR-CNN: layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction, in: Proceedings of CVPR, 2019, pp. 3205–3214.

Jiasong Wu received the B.S. degree in Biomedical Engineering from the University of South China, Hengyang, China, in 2005, and a joint Ph.D. degree with the Laboratory of Image Science and Technology (LIST), Southeast University, Nanjing, China, and the Laboratoire Traitement du Signal et de l'Image (LTSI), University of Rennes 1, Rennes, France, in 2012. He is now working in the LIST as a lecturer. His research interests mainly include deep learning, fast algorithms of digital signal processing and its
(3, 4) (2004) 149–164. applications. Dr. Wu received the Eiffel doctorate schol-
[55] C.-A. Popa, Octonion-Valued Neural networks, Artificial Neural Networks and arship of excellence (2009) from the French Ministry of
Machine Learning, ICANN, 2016. Foreign Affairs and also the Chinese government award
[56] C.-A. Popa, Global exponential stability of neutral-type octonion-valued neural for outstanding self-financed student abroad (2010) from
networks with time-varying delays, Neurocomputing 309 (2) (2018) 177. the China Scholarship Council.
[57] C.-A. Popa, Global exponential stability of octonion-valued neural networks
with leakage delay and mixed delays, Neural Netw. 105 (2018) 277–293. Ling Xu received the B.S. degree in Computer Science and
[58] J. Pearson, D. Bisset, Neural networks in the Clifford domain, in: Proceed- technology from Hefei University of Technology, Hefei,
ings of the International Conference on Neural Networks, 3, IEEE, 1994, China, in 2017. Now she is currently pursuing the M.S. de-
pp. 1465–1469. gree in Computer Science and technology, Southeast Uni-
[59] S. Buchholz, G. Sommer, On Clifford neurons and Clifford multi-layer percep- versity. Her research interests lie in deep learning and
trons, Neural Netw. 21 (7) (2008) 925–935. pattern recognition.
[60] Y. Kuroe, S. Tanigawa, H. Iima, Models of Hopfield-type Clifford neural net-
works and their energy functions –hyperbolic and dual valued networks, Proc.
Int. Conf. Neural Inf. Process. 7062 (2011) 560–569.
[61] D.P. Reichert, T. Serre, Neuronal synchrony in complex-valued deep networks,
in: Proceedings of the ICLR, 2014.
[62] R. Haensch, O. Hellwich, Complex-valued convolutional neural networks for
object detection in PolSAR data, in: Proceedings of the EUSAR, 2010, pp. 1–4.
[63] Z. Zhang, H. Wang, F. Xu, Y.Q. Jin, Complex-valued convolutional neural net- Fuzhi Wu received the B.S. degree from Anhui Normal
work and its application in polarimetric SAR image classification, IEEE Trans. University in 2017 and now is studying for Ph.D. degree
Geosci. Remote Sens. 55 (12) (2017) 7177–7188. in School of Computer Science and Engineering, Southeast
[64] C.-A. Popa, Complex-valued convolutional neural networks for real-valued im- University. His-research interests lie in deep learning and
age classification, in: Proceedings of the IJCNN, 2017. pattern recognition, signal and image processing.
[65] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J.F. Santos, et al.,
Deep Complex Networks, in: Proceedings of ICLR, 2018.
[66] Alex Krizhevsky, H. Geoffrey, Learning Multiple Layers of Features From Tiny
Images, Learning Multiple Layers of Features From Tiny Images, 1, University
of Toronto, 2009 Technical report.
[67] C. Gaudet, A. Maida, Deep quaternion networks, in: Proceedings of the IJCNN,
2018.
Youyong Kong received the B.S. and M.S. degrees in computer science and engineering from Southeast University, Nanjing, China, in 2008 and 2011, respectively, and the Ph.D. degree in imaging and diagnostic radiology from the Chinese University of Hong Kong, Hong Kong, in 2014. He is currently an Assistant Professor with the College of Computer Science and Engineering, Southeast University. His current research interests include machine learning, medical image processing and brain network analysis.

Huazhong Shu received the B.S. degree in Applied Mathematics from Wuhan University, China, in 1987, and the Ph.D. degree in Numerical Analysis from the University of Rennes 1, Rennes, France, in 1992. He is a professor of the LIST Laboratory and the Codirector of the CRIBs. His recent work concentrates on image analysis, pattern recognition and fast algorithms of digital signal processing. Dr. Shu is a senior member of the IEEE Society.