You are on page 1of 13

Neurocomputing 397 (2020) 179–191

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Deep octonion networks


Jiasong Wu, Ph.D. a,b,c,∗, Ling Xu a,c, Fuzhi Wu a,c, Youyong Kong, Ph.D. a,c,
Lotfi Senhadji, Ph.D. b,c, Huazhong Shu, Ph.D. a,c
a
Laboratory of Image Science and Technology, Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of
Education, Nanjing 210096, China
b
Université de Rennes, INSERM, LTSI - UMR 1099, Rennes F-35000, France
c
Centre de Recherche en Information Biomédicale Sino-français (CRIBs), Université de Rennes, Inserm, Southeast University, Rennes, France, Nanjing, China

a r t i c l e i n f o a b s t r a c t

Article history: Deep learning is a hot research topic in the field of machine learning methods and applications. Real-
Received 2 April 2019 value neural networks (Real NNs), especially deep real networks (DRNs), have been widely used in many
Revised 9 December 2019
research fields. In recent years, the deep complex networks (DCNs) and the deep quaternion networks
Accepted 11 February 2020
(DQNs) have attracted more and more attentions. The octonion algebra, which is an extension of com-
Available online 20 February 2020
plex algebra and quaternion algebra, can provide more efficient and compact expressions. This paper
Communicated by Dr Li Zhifeng constructs a general framework of deep octonion networks (DONs) and provides the main building blocks
of DONs such as octonion convolution, octonion batch normalization and octonion weight initialization;
Keywords:
Convolutional neural network DONs are then used in image classification tasks for CIFAR-10 and CIFAR-100 data sets. Compared with
Complex the DRNs, the DCNs, and the DQNs, the proposed DONs have better convergence and higher classification
Quaternion accuracy. The success of DONs is also explained by multi-task learning.
Octonion
© 2020 Elsevier B.V. All rights reserved.
Image classification

1. Introduction Although Real CNNs have achieved great success in various


applications, the correlations between convolution kernels are
Real-value neural networks (Real NNs) [1–12] attracted the generally not taken into consideration, that is, there is no con-
attention of many researchers and recently made major break- nection or special relationship established between convolution
throughs in many areas such as signal processing, image process- kernels. On the opposite of Real CNNs, real-value recurrent neural
ing, natural language processing, etc. Many models of Real NNs networks (Real RNNs) [35–38], obtains the correlations by estab-
have been constructed and reported in the literature. These mod- lishing connections between convolution kernels and by learning
els can generally be categorized into two kinds: non-deep models their weights. This approach increases significantly the training
and deep models. The non-deep models are mainly constructed by difficulty and has poor convergence. Thus, a first question raised:
multilayer perceptron module [13] and hard to train, if we only Can we consider correlations between convolution kernels by mean of
use the real-valued back propagation (BP) algorithm [14], when some special relationships, which do not require learning, instead of
their layers are larger than 4. The deep models can be roughly adding the connections between convolution kernels?
constructed by the following two strategies: multilayer percep- Many researchers showed that the performance can be im-
tron models assisted by the unsupervised pretrained methods (for proved when the relationships between convolution kernels are
example, deep belief nets [15], deep auto-encoder [16], etc.) and modeled by complex algebra, quaternion algebra [39–42], and also
real-value convolutional neural networks (Real CNNs), including octonion algebra [43–45]. Therefore, they attached a lot of atten-
LeNet-5 [17], AlexNet [18], Inception [19-22], VGGNet [23], High- tion for extending neural networks from real domain to complex,
wayNet [24], ResNet [25], ResNeXt [26], DenseNet [27], Fractal- quaternion, and also octonion domains. These extension models, as
Net [28], PolyNet [29], SENet [30], CliqueNet [31], BinaryNet [32], in Real CNNs, can also be divided into two categories: non-deep
SqueezeNet [33], MobileNet [34], etc. models [46–60] and deep models [61–71]. There are many research
work focusing on the non-deep models, for example, Widrow et al.

[46] first introduced the complex-valued neural networks (Com-
Corresponding author at: Laboratory of Image Science and Technology, Key Lab-
oratory of Computer Network and Information Integration, Southeast University,
plex NNs), which have been widely used in recent years in radar
Ministry of Education, Nanjing 210096, China. imaging, image processing, communication signal processing, and
E-mail addresses: jswu@seu.edu.cn (J. Wu), lotfi.senhadji@univ-rennes1.fr many others [47]. Compared to the Real NNs, the Complex NNs
(L. Senhadji), shu.list@seu.edu.cn (H. Shu).

https://doi.org/10.1016/j.neucom.2020.02.053
0925-2312/© 2020 Elsevier B.V. All rights reserved.
180 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191

have better generalization ability due to their time-varying delays Table 1


The multiplication table of the unit octonions.
and impulse effects [48]. Arena et al. [49] then extended the neu-
ral networks from complex to quaternion domain and proposed ej
quaternion-valued neural networks (Quaternion NNs), which have ei ej 1 e1 e2 e3 e4 e5 e6 e7
been applied to color image compression [50], color night vision
ei 1 1 e1 e2 e3 e4 e5 e6 e7
[51], and 3D wind forecasting [52]. Furthermore, the quaternion- e1 e1 −1 e3 −e2 e5 −e4 −e7 e6
valued BP algorithms achieve correct geometrical transformations e2 e2 −e3 −1 e1 e6 e7 −e4 −e5
in color space for an image compression problem, whereas real- e3 e3 e2 −e1 −1 e7 −e6 e5 −e4
valued BP algorithms fail [53,54]. Popa [55] further extended the e4 e4 −e5 −e6 −e7 −1 e1 e2 e3
e5 e5 e4 −e7 e6 −e1 −1 −e3 e2
neural networks from quaternion to octonion domain and pro-
e6 e6 e7 e4 −e5 −e2 e3 −1 −e1
posed octonion-valued neural networks (Octonion NNs), whose e7 e7 −e6 e5 e4 −e3 −e2 e1 −1
leakage delays, time-varying delays, and distributed delays were
also introduced [56, 57]. Moreover, Clifford-valued neural networks
[58–60] were also proposed for the extension of complex NNs 2. Octonion representation
and quaternion NNs. For the deep models, Reichert and Serre
[61] proposed complex-valued deep networks, which interpreted An octonion number x is a hypercomplex number, which is an
half of the cell state as an imaginary part and used complex val- extension of complex number and quaternion number, consists of
ues to simulate the phase dependence of biologically sound neu- one real part and seven imaginary parts:
rons. Then, some researchers proposed complex-valued convolu-
tional neural networks (Complex CNNs) [62–65]. Among them, Tra- x = x0 e0 + x1 e1 + x2 e2 + x3 e3 + x4 e4 + x5 e5 + x6 e6 + x7 e7 ∈ O (1)
belsi et al. [65] proposed deep complex networks (DCNs) which where O denotes the octonion number field, xi ∈ R, i = 1, 2, ..., 7 (R
provide the key atomic components (complex convolution, com- denotes the real number field), e0 =1, and ei , i = 1,2,…,7, are seven
plex batch normalization, and complex weight initialization strat- imaginary units obeying the following rules:
egy, etc.) for the construction of DCNs and also obtain lower er- ⎧ 2
ror rate than corresponding deep real networks (DRNs) [25] in ⎨ei = −1
CIFAR-10 and CIFAR-100 [66]. Then, Gaudet and Maida [67] ex- ei e j = −e j ei , ∀i = j = k, 1 ≤ i, j, k ≤ 7, (2)
tended the deep networks from complex to quaternion domain and

(ei e j )ek = −ei (e j ek )
proposed deep quaternion networks (DQNs), which achieve better
image classification performance than DCNs [65] in CIFAR-10 and The above equation shows that the octonion multiplication is nei-
CIFAR-100. Meanwhile, Titouan et al. also proposed deep quater- ther commutative nor associative. The multiplication tables of the
nion neural network (QDNN) [68, 69], quaternion recurrent neu- imaginary units are also shown in Table 1.
ral network (QRNN) [70], bidirectional quaternion long-short term The conjugate of this octonion x ∈ O is given by
memory (BQLSTM) [71], and quaternion convolutional autoencoder x∗ = x0 e0 − x1 e1 − x2 e2 − x3 e3 − x4 e4 − x5 e5 − x6 e6 − x7 e7 . (3)
(QCAE) [72]. Then, a second question raised: Can we extend the deep
The unit norm octonion of x ∈ O is
networks from quaternion to octonion domain to obtain a further ben-
x
efit? How to explain the success of these deep networks on these var- x =  . (4)
ious domains (complex, quaternion, octonion)? x20 + x21 + x22 + x23 + x24 + x25 + x26 + x27
In an attempt to solving this second question, in this paper, we
For a complete review of the properties of octonion, the reader
propose deep octonion networks which can be seen as an exten-
can refer to [43].
sion of deep networks from quaternion domain to octonion do-
main. The contributions of the paper are as follows: 3. Deep octonion networks

This section introduces the methods and the modules required


(1) The key atomic components of deep octonion networks, to construct the deep octonion networks and to initialize them:
such as octonion convolution module, octonion batch octonion internal representation method (Section 3.1), octonion
normalization module and octonion weight initialization convolution module (Section 3.2), octonion batch normalization
method. module (Section 3.3), and octonion weight initialization method
(2) When applying the proposed deep octonion networks on the (Section 3.4).
classification tasks on CIFAR-10 and CIFAR-100, the classifi-
cation results are better than deep real networks, deep com- 3.1. Octonion internal representation method
plex networks and deep quaternion networks.
(3) The explanation of deep complex networks, deep quaternion We represent the real part and seven imaginary parts of an oc-
networks, and deep octonion networks behaviors from the tonion number as logically distinct real valued entities and sim-
perspective of multi-task learning [73–75]. ulate octonion arithmetic using real-valued arithmetic internally.
If we assume that an octonion convolutional layer has N feature
maps where N is divisible by 8, then, these feature maps can be
split into 8 parts to form an octonion representation. Specifically,
The rest of the paper is organized as follows. In Section 2, oc- as shown in Fig. 1, we allocate the first N/8 feature maps to the
tonion representation, its main properties and characteristics are real part and the remaining seven N/8 feature maps to the seven
briefly introduced. The architectural components needed to build imaginary parts.
deep octonion networks are described in Section 3. The classifica-
tion performance of deep octonion networks is analyzed and also 3.2. Octonion convolution module
compared to the deep real networks, deep complex networks, and
deep quaternion networks in Section 4. Then, Section 5 explains Octonion convolution can be implemented by convolving an oc-
deep networks behaviors on these domains from the perspective tonion vector xo ∈ ON by an octonion filter matrix Wo ∈ ON × N as
of multi-task learning. The conclusions are formulated in Section 6. follows:
J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191 181

Fig. 1. The octonion internal representation.

Wo∗oxo = (W0 + W1 e1 + W2 e2 + W3 e3 +W4 e4 + W5 e5 + W6 e6 + W7 e7 )


∗o (x0 + x1 e1 + x2 e2 + x3 e3 +x4 e4 + x5 e5 + x6 e6 + x7 e7 )
= ( W0 ∗ x0 − W1 ∗ x1 − W2 ∗ x2 − W3 ∗ x3 − W4 ∗ x4 − W5 ∗ x5 − W6 ∗ x6 − W7 ∗ x7 )
+ ( W 0 ∗ x 1 + W 1 ∗ x 0 + W 2 ∗ x 3 − W 3 ∗ x 2 + W 4 ∗ x 5 − W 5 ∗ x 4 − W 6 ∗ x 7 + W 7 ∗ x 6 )e 1
+ ( W 0 ∗ x 2 − W 1 ∗ x 3 + W 2 ∗ x 0 + W 3 ∗ x 1 + W 4 ∗ x 6 + W 5 ∗ x 7 − W 6 ∗ x 4 − W 7 ∗ x 5 )e 2
(5)
+ ( W 0 ∗ x 3 + W 1 ∗ x 2 − W 2 ∗ x 1 + W 3 ∗ x 0 + W 4 ∗ x 7 − W 5 ∗ x 6 + W 6 ∗ x 5 − W 7 ∗ x 4 )e 3
+ ( W 0 ∗ x 4 − W 1 ∗ x 5 − W 2 ∗ x 6 − W 3 ∗ x 7 + W 4 ∗ x 0 + W 5 ∗ x 1 + W 6 ∗ x 2 + W 7 ∗ x 3 )e 4
+ ( W 0 ∗ x 5 + W 1 ∗ x 4 − W 2 ∗ x 7 + W 3 ∗ x 6 − W 4 ∗ x 1 + W 5 ∗ x 0 − W 6 ∗ x 3 + W 7 ∗ x 2 )e 5
+ ( W 0 ∗ x 6 + W 1 ∗ x 7 + W 2 ∗ x 4 − W 3 ∗ x 5 − W 4 ∗ x 2 + W 5 ∗ x 3 + W 6 ∗ x 0 − W 7 ∗ x 1 )e 6
+ ( W 0 ∗ x 7 − W 1 ∗ x 6 + W 2 ∗ x 5 + W 3 ∗ x 4 − W 4 ∗ x 3 − W 5 ∗ x 2 + W 6 ∗ x 1 + W 7 ∗ x 0 )e 7 ,

which can be expressed as the real matrix form as follows: the image. The goal of the octonion convolution is to generate a
⎡ ⎤ ⎡ ⎤ ⎡ ⎤ unique linear combination of each axis based on the results of a
R ( W o ∗o x o ) W0 −W1 −W2 −W3 −W4 −W5 −W6 −W7 x0
⎢ I ( W o ∗ o x o ) ⎥ ⎢W 1 single axis, allowing each axis of the kernel to interact with each
⎢ ⎥ ⎢ W0 −W3 W2 −W5 W4 ⎢x 1 ⎥
W7
⎢ ⎥ −W6 ⎥

⎢ J ( W ∗ x ) ⎥ ⎢W ⎥ ⎢ ⎥ axis of the image, thereby allowing the linear depth of the channel
⎢ o o o ⎥ ⎢ 2 W3 W0 −W1 −W6 −W7 W4 W5 ⎥ ⎢x2 ⎥
to be mixed, depending on the structure of the octonion multi-
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ K ( W o ∗ o x o ) ⎥ ⎢W 3 −W2 W1 W0 −W7 W6 −W5 W4 ⎥ ⎢x3 ⎥ plication. For example, using 8 kernels (m × n × 8) to convolve
⎢ ⎥ ⎢ ⎥ ∗ ⎢ ⎥,
⎢ E ( W o ∗ o x o ) ⎥ = ⎢W 4 W5 W6 W7 W0 −W1 −W2 −W3 ⎥ ⎢ ⎥ an 8 channels of feature maps (M × N × 8), finally generates one
⎢ ⎥ ⎢ ⎥ ⎢x 4 ⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ L ( W o ∗ o x o ) ⎥ ⎢W 5 −W4 W7 −W6 W1 W0 W3 −W2 ⎥ ⎢x5 ⎥ feature map. Conventional convolution is one convolution kernel
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎣M(Wo∗oxo )⎦ ⎣W6 −W7 −W4 W5 W2 −W3 W0 W1 ⎦ ⎣x6 ⎦ applied to one feature map, and then added to the result of the
N ( W o ∗o x o ) W7 W6 −W5 −W4 W3 W2 −W1 W0 x7 previous operation, regardless of the correlation between the fea-
ture maps. The octonion convolution, using the octonion arithmetic
(6)
rule, applies eight convolution kernels to each feature map. Then
∗ ∗
where o and denote octonion convolution and real convolution, applying a 1 × 1 convolution to the result of the previous opera-
respectively. xi ∈ RN and Wi ∈ RN × N with i = 1,2,…,7. R(• ) de- tion allows to obtain the linear interaction of the feature map and
notes the real part of •, I (• ), J (• ), K (• ), E (• ), L(• ), M(• ) and thus to derive the new feature map space.
N (• ) denote the seven different imaginary parts of •, respectively.
The implementation of octonion convolutional operation is shown 3.3. Octonion batch normalization module
in Fig. 2, where Mr , Mi , Mj , Mk , Me , Ml , Mm , Mn refer to eight parts
of feature maps and Kr , Ki , Kj , Kk , Ke , Kl , Km , Kn refer to eight parts Batch normalization [31] can accelerate deep network train-
of kernels, and Mp1 ∗ Kp2 (p1 , p2 = r, i, j, k, e, l, m, n ) refer to the re- ing by reducing internal covariate shift. It allows us to use much
sult of a real convolution between the feature maps and the ker- higher learning rates and be less careful about initialization. When
nels. The real representations of complex convolution, quaternion applying the batch normalization to real numbers, it is sufficient
convolution, and octonion convolution are shown in Appendix 1. to translate and scale these numbers such that their mean is zero
From this latter, we can see that the octonion convolution is a kind and their variance is one. However, when applying the batch nor-
of mixed convolution, similar to a mixture of standard convolution malization to complex or quaternion numbers, this can’t ensure
and depth separable convolution, with certain links to the original equal variance in both the real and imaginary components. In or-
convolution [76]. Traditional real-valued convolution simply multi- der to overcome this problem, a whitening approach is used in
plies each channel of the kernel by the corresponding channel of [65, 67], which scales the data by the square root of their variances
182 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191

Fig. 2. Illustration of the real convolution (a) and octonion convolution (b).
J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191 183

along each principle components. In this section, we use a simi- to converge stably and quickly. In view of this, we provide an ini-
lar approach, but treating this issue as a “whitening” of 8D vector tialization method for octonion weight. The 8 parts of every octo-
problem. nion weight Wo ∈ ON × N are assumed to be independent Gaussian
Firstly, whitening is accomplished by multiplying the zero- random variables with zero-mean and the same variance σ 2 . Then,
centered data (x-E[x]) by the inverse square root of the covariance the variance of Wo is provided by:
matrix V:  
V ar (Wo ) = E |Wo|2 − |E[Wo]|2 (12)
( x − E [x ] )
x˜ = √ , (7) As the 8 parts are zero-mean then E[Wo] = 0 and as they have
V
the same variance then V ar (Wo ) = 8σ 2 . The value of the standard
and deviation σ is set according to the following:
⎡ ⎤   √ 
Vrr Vri Vrj Vrk Vre Vrl Vrm Vrn
1/ 2 nin + nout , ifGlorot sinitialization[77]isused
⎢ Vir Vii Vij Vik Vie Vil Vim Vin ⎥ σ= √ (13)
⎢ ⎥ 2/ nin , ifHe sinitialization[78]isused
⎢V ⎥
⎢ jr Vji Vjj Vjk Vje Vjl Vjm Vjn ⎥
⎢ ⎥
⎢ Vkr Vki Vkj Vkk Vke Vkl Vkm Vkn ⎥ 4. Implementation and experimental results
V=⎢
⎢ Ver
⎥, (8)
⎢ Vei Vej Vek Vee Vel Vem Ven ⎥

⎢ ⎥ Similar to the 110-layer deep real networks [25], we designed
⎢ Vlr Vli Vlj Vlk Vle Vll Vlm Vln ⎥ an octonion convolutional neural network named deep octonion
⎢ ⎥
⎣Vmr Vmi Vmj Vmk Vme Vml Vmm Vmn ⎦ networks, whose schematic diagram are shown in Fig. 3. Fig. 3(a)
shows the detailed convolution structure of the four stages, and
Vnr Vni Vnj Vnk Vne Vnl Vnm Vnn
Fig. 3(b) shows the entire structure including the input and out-
where E[x] refers to the average value of each batch of training put modules. Then we performed the image classification tasks
data x ∈ ON , and V ∈ O8 × 8 is the covariance matrix of each batch of CIFAR-10 and CIFAR-100 [66] to verify the validity of the pro-
of data x. posed deep octonion networks. The following experiment was im-

In order to avoid calculating the ( V )−1 , Eq. (7) can be com- plemented using Keras (Tensorflow as backend) on a PC machine,
puted as follows which sets up Ubuntu 16.04 operating system and has an Intel(R)
Core(TM) i7-2600 CPU with speed of 3.40 GHz and 64 GB RAM,
x˜ = U(x − E[x] ), (9) and has also two NVIDIA GeForce GTX1080-Ti GPUs.
where U is one of the matrices from the Cholesky decomposition
of V−1 , and each item of the matrix U is shown in Appendix 2. 4.1. Models configurations
Secondly, the forward conduction formula of the octonion batch
normalization layer is defined as 4.1.1. Octonion input construction
Since the images in datasets of CIFAR-10 and CIFAR-100 are
OctonionBN(x˜ ) = γ x˜ + β, (10)
real-valued, however, the input of the proposed deep octonion net-
where β = E (x ) ∈ O8
is a learned parameter with eight real
√ param- works needs to be an octonion matrix, which we have to derive
eters (one real part and seven imaginary parts) and γ = V ∈ O8×8 first. The octonion has one real part and seven imaginary parts, we
is also a learned parameter with only 36 independent real param- put the original N training real images into the real part, and simi-
eters, lar to [65] and [67], the seven imaginary parts of the octonion ma-
⎛ ⎞ trix are obtained by performing a single real-valued residual block
γrr γri γrj γrk γre γrl γrm γrn (BN → ReLU → Conv → BN → ReLU → Conv) [25] 7 times at the
⎜ γir γii γij γik γie γil γim γin ⎟ same time. Then, the 8 vectors are connected according to a given
⎜ ⎟
⎜γ γji γjj γjk γje γjl γjm γjn ⎟
axis to form a brand new octonion vector.
⎜ jr ⎟
⎜ ⎟
⎜ γkr γki γkj γkk γke γkl γkm γkn ⎟
γ =⎜
⎜ γer
⎟. 4.1.2. The structure of deep octonion networks
⎜ γei γej γek γee γel γem γen ⎟
⎟ The OctonionConv → OctonionBN → ReLU operation is per-
⎜ ⎟ formed on the obtained octonion input, where OctonionConv de-
⎜ γlr γli γlj γlk γle γll γlm γln ⎟
⎜ ⎟ notes the octonion convolution module shown in Section 3.2 and
⎝γmr γmi γmj γmk γme γml γmm γmn ⎠ OctonionBN denotes the octonion batch normalization module
γnr γni γnj γnk γne γnl γnm γnn shown in Section 3.3. Then the octonion output is sent to the next
three stages. In each stage, there are several residual blocks with
(11)
double convolution layers. The shape of the feature maps in three

Similar to [65] and [67], the diagonal of γ is initialized to 1/ 8, stages are the same, and the number of them are increased grad-
the off diagonal terms of γ and all components of β are initialized ually to ensure the expressive ability of the output features. To
to 0. speed up the training, the following layer is an AveragePooling2D
layer, which is then followed by a fully connected layer called
3.4. Octonion weight initialization method Dense to classify the input. The deep octonion network model sets
the number of residual blocks in the three stages to 10, 9, and 9,
Before starting to train the network, we need to initialize its respectively, and the number of convolution filters is set to 32, 64,
parameters. If the weights are initialized to the same value, the and 128. The batch size is set to 64.
updated weights will be the same, which means that the network
can’t learn the features. For deep neural networks, such initializa- 4.1.3. The training of deep octonion networks
tion will make deeper meaningless and will not match the effects Deep octonion networks are then compiled, the cross entropy
of linear classifiers. Therefore, the initial weight values are all dif- loss function and the stochastic gradient descent method are cho-
ferent and close but not equal to 0, which not only ensures the dif- sen for training the model. The Nesterov Momentum is set to 0.9
ference between the input and output, but also allows the model in the back propagation of stochastic gradient descent in order to
184 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191

Fig. 3. The implementation details of deep octonion networks (DONs).


J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191 185

Fig. 4. Results of different learning rate settings.

Table 2 Table 3
The learning rate (%) of octonion convolution neural network. The classification error rate of three models in two types of datasets. FLOPs and
MACCs denote floating point operations and multiply-accumulate operations, re-
Epoch Learning-rate spectively.
(0, 20) 0.01
Architecture Params FLOPs MACCs CIFAR-10 CIFAR-100
(20, 60) 0.1
(60, 80) 0.01 Real [25] 3,619,844 1,081,333,248 340,380,416 6.37 –
(80,110) 0.001 Complex [65] 1,823,620 541,132,288 270,273,792 5.60 27.09
(110, 120) 0.0001 Quaternion [67] 932,792 271,922,688 135,662,848 5.44 26.01
Octonion 481,150 137,350,144 68,368,896 5.35 24.60

increase the stability and speed up the convergence. The learn-


floating point operations (FLOPs) and multiply-accumulate opera-
ing rate is shown in Table 2. Using a custom learning rate sched-
tions (MACCs) to statistically count the model’s calculation vol-
ule, different learning rates are used in different epochs in order
ume to obtain the speed of the model. Regarding the calculation of
to make the network more stable. Here, the learning rate from 0
FLOPs and MACCs, we followed the same steps as in [79,80]. The
to 80 th epoch is divided into three stages. The first 20 epochs
calculation performances show that DONs use less computations
are preheated at a learning rate of 0.01, and for the middle 40
and achieve the best results under the same conditions.
epochs we increase the learning rate by a factor ten. For the latter
For the classification tasks of CIFAR-10 and CIFAR-100: Firstly,
20 epochs, we restore it to 0.01. The deep octonion networks are
we use the 10-fold cross-validation method, that is, we divide the
trained on 120 epochs, which is less than 200 epochs in [65] and
data set into ten parts, and take nine parts as training data and
[67], because the convergence speed of the deep octonion net-
one part as testing data, and then the Top-1 error rate is obtained
works is higher than the deep real networks [25], deep complex
by averaging the 10 results. Secondly, the 10-fold cross-validation
networks [65], and deep quaternion networks [67].
are performed 10 times, and the mean value is obtained as an esti-
mation of the Top-1 error rate of the algorithm. Table 3 shows the
4.2. Experimental results and analysis Top-1 error rate of the deep octonion networks, deep real networks
[25], deep complex networks [65], and deep quaternion networks
There are two methods for choosing the learning rate. One is [67] on CIFAR-10 and CIFAR-100. It can be seen from Table 3 that
to have the same learning rate during training which is called the deep octonion networks can achieve lower error rate compared
“smooth” learning rate (blue line in Fig. 4), and the other is to to the other deep networks and the advantage becomes more ap-
adjust the learning rate during training which is called “convex” parent when there are more classes to distinguish. From the fourth
learning rate (red line in Fig. 4). It is worth noting that two initial- column of Table 3, we can see that, compared with deep quater-
ization methods of Glorot and Bengio [77] and He et al. [78] were nion networks [67], the improvement of the deep octonion net-
used in the experiment, which are shown in the form of solid lines works is not significant as the relative improvement is 1.7%. How-
and dotted lines, respectively. Experiments show that the “convex” ever, as shown in the fifth column of Table 3, when the number
learning rate setting from epoch 0 to epoch 80 is better than the of classes increases, the proposed deep octonion networks the rel-
“smooth” learning rate. Although the accuracy will fluctuate during ative improvement becomes 5.4% when compared to deep quater-
the 20–60 epochs, the accuracy will increase slightly in the next nion networks [67]. This also implies that the deep octonion net-
20 epochs. Therefore, in the subsequent experiments, the “convex” works have better generalization ability than the other compared
learning rate was used for scheduling. The accuracy of the deep deep networks. The phenomenon is also explained by the multi-
octonion networks, deep complex networks [65], and deep quater- task learning in the next section.
nion networks [67] in the first 20 epochs are compared in Fig. 5,
from which we can see that the proposed deep octonion networks 5. Explanation of deep octonion networks via multi-task
perform better compared to the other deep networks. Besides, as learning
shown in Table 3, the deep octonion networks have less learning
parameters than the other compared deep networks. In addition, In Machine Learning, the standard algorithm is to learn one task
as shown in the second and third column of Table 3, we also use at a time, we generally train a single model or an ensemble of
186 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191

Fig. 5. Accuracy curves of four models in the first 20 epochs.

models to perform our desired task and the output of the system of hard parameter sharing. We refer to the structure between the
is real. When facing the complex learning problems, traditional input layer and the output layer as the shared layer.
methods chose a similar approach. Firstly, decompose the com- However, currently it is difficult to determine the best shared
plex learning problems into simple and independent sub-problems, feature position [83], the best sharing / splitting scheme [81] and
and then study each sub-problem separately. Finally, establish the the existing CNN structure only receives feature tensors with a
mathematical model of complex problems by combining the sub- fixed number of feature channels. If multitasking is used for the
problem learning results, and the model is refined through fine existing CNN structure, the number of channels will increase as the
tuning until the performance no longer improves. These operations number of multitasking tasks increases. The tandem features will
seem reasonable but not accurate, for the reason that many prob- not be available to subsequent layers of the CNN. There are multi-
lems in the real world cannot be decomposed into independent ple solutions to this problem. One is the NDRR layer proposed in
sub-problems, the rich interrelated information between the sub- [84], which is plug-and-play extended to the existing CNN archi-
problems cannot be ignored. Even if it can be decomposed, the tecture and uses feature transformation to discriminate cascaded
sub-problems are interrelated, and connected by sharing factors or features and to reduce their dimensions. The DONs network pro-
share representations. In order to solve this problem, multi-task posed in this paper solves the problem of channel number mis-
learning (MTL) was born [73]. Compared to single-task learning match by improving the network structure, and constructs a new
(STL—learning just one task at a time), MTL is a kind of joint learn- octonion convolutional neural network for eight-task learning.
ing method which learns multiple related tasks together based on Generally, when considering the optimization of more than one
shared representations. The purpose of the shared representation loss function, we are effectively dealing with a MTL problem. Next,
is to improve the generalization. Multiple tasks are learned in par- we only focus on one task, that is, there is only one optimization
allel, and the results affect each other. goal, and the learning of auxiliary tasks may still help improving
the learning performance of the main task. Auxiliary tasks can pro-
vide inductive bias, which makes the model more inclined to those
5.1. The relationship between DONs and MTL solutions that can explain multiple tasks at the same time, and the
generalization performance of the model is then better. The DONs
MTL can be seen as a method that inspired by human learn- shown in Fig. 6(c) uses a network structure with hard parameter
ing. We often learn tasks to acquire the necessary skills in order sharing similar to Fig. 6(b) to learn eight related tasks together.
to master more complex problems. There are many forms of MTL The input of the last seven tasks is learned through the input of
that can further improve CNN performance. Fig. 6(a) and (b) shows the first task, which plays a supporting role in the first task. There-
a single-task learning and multi-task feedforward neural network fore, there are relevant and irrelevant parts in these eight tasks.
with one input layer, two hidden layers and one output layer, re- The details of the DONs shared layer follow the rules in Fig. 2.
spectively. In single-task learning, learning between tasks is inde-
5.2. Effectiveness of DONs
pendent of each other. In MTL, parameters between multiple tasks
are shared. MTL methods can be divided into two categories based
In [75], it has been proven that the number of parameters in
on how parameters are shared between different task models. In
the multi-task model is less than the number of parameters for es-
the soft parameter sharing category, each task has its own model
tablishing multiple models, and the task is optimized to reduce the
and its own parameters. The methods in this category focus on
risk of over-fitting and the generalization ability is then stronger.
how to design weight-sharing approaches. The most common way
The effectiveness of MTL is mainly reflected in the following five
is to share all convolutional layers and split on fully connected lay-
aspects: implicit data augmentation, attention focusing, eavesdrop-
ers for heuristic decisions for loss of specific tasks, such as Cross-
ping, representation bias and regularization. Therefore, considering
Stitch Networks [81], Sluice Network [82], etc. In hard parameter
the DONs as a specific MTL is supported by the following:
sharing category, all task models share exactly the same feature
extractor, and each branch head executes its own task. In the con- - Neural networks can help the hidden layer to avoid local min-
text of deep learning, MTL is usually done by sharing hard or soft ima through the interaction between different tasks during
parameters of the hidden layer [74]. Fig. 6(b) shows the MTL mode learning. When learning the main task, the parts that are not
J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191 187

Fig. 6. The comparison of Single-task learning (a), Multi-tasking learning (b) and Deep octonion networks (c).

related to the task will produce noise during the learning pro- do have an impact, as in the process of attention focusing for
cess. Since different tasks have different noise patterns, a model MTL. Studies have shown that if auxiliary tasks and main tasks
that learns eight tasks simultaneously is able to learn a more use the same characteristics for decision-making, they will ben-
general representation. Eight tasks learned at the same time efit more from MTL. Therefore, we need to find suitable auxil-
can average the noise patterns, which can make the model bet- iary tasks to benefit from MTL. The choice of auxiliary tasks is
ter representative of the data. This is similar to implicit data diverse [74].
augmentation for MTL. - DONs restrict the model by using octonion operation rules, so
- DONs takes the first of the eight tasks as the main one, and the that models that are more in line with real rules can be se-
latter seven tasks are learned through the input of the first task, lected from the hypothesis space. This kind of regularization is
which assists the first task. The gap between tasks is not partic- such that the risk of overfitting as well as the complexity of the
ularly large, so the model can be focused on those features that mode is reduced.
188 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191

In addition, as described in the previous section, hard parame- Declaration of Competing Interest
ter sharing (DONs can be considered as such models) is the most
common method of MTL in neural networks, which can greatly The authors declare that they have no known conflicts of
reduce the risk of overfitting [74]. It has been demonstrated in interest.
[75] that the risk of overfitting the shared parameters is smaller
than overfitting the specific parameters for each task. What’s more,
regards the number of tasks as N, a larger N means that the more
tasks are learned simultaneously, the more the model can find a CRediT authorship contribution statement
representation that captures all tasks, and the less likely the orig-
inal task is to overfitting. This also explains that the performance Jiasong Wu: Conceptualization, Methodology, Writing - review
can be improved when the relationships between convolution ker- & editing. Ling Xu: Writing - original draft, Software, Visualization.
nels are modeled by complex algebra, quaternion algebra, and also Fuzhi Wu: Software, Validation, Data curation. Youyong Kong:
octonion algebra and why deep neural networks can achieve better Validation, Project administration. Lotfi Senhadji: Formal analysis,
results on octonion domains. Writing - review & editing. Huazhong Shu: Supervision, Resources.

6. Conclusion

In this paper, we propose deep octonion networks (DONs) as Acknowledgment


an extension of DRNs, the DCNs, and the DQNs. The main building
blocks of DONs are given, such as octonion convolution, octonion This work was supported in part by the National Natural
batch normalization, and octonion weight initialization. DONs were Science Foundation of China under Grants 61876037, 31800825,
applied to the image classification tasks of CIFAR-10 and CIFAR- 61871117, 61871124, 61773117, 31571001, 61572258, and in part by
100 to verify their validity. Experiments showed that compared the National Key Research and Development Program of China un-
with the DRNs, the DCNs, and the DQNs, the proposed DONs have der Grants 2017YFC0107903, 2017YFC0109202, 2018ZX10201002-
better convergence, less parameters, and higher classification ac- 003, and in part by the Short-Term Recruitment Program of For-
curacy. The success of DONs is also explained through multi-task eign Experts under Grant WQ20163200398, and in part by INSERM
learning approach. under the Grant call IAL.

Table A1
The real representations of complex convolution, quaternion convolution, and octonion convolution. e denotes imaginary component, where e2i = −1, ei e j =
−e j ei = e j ei , (ei e j )ek = −ei (e j ek ) = ei e j ek ,∀i = j = k, 1 ≤ i, j, k ≤ 7. ∗ , ∗ c , ∗ q , and ∗ o denote real convolution, complex convolution, quaternion convolution, and oc-
tonion convolution, respectively. R(• ) denotes the real components of •, I (• ), J (• ), K (• ), E (• ), L (• ), M (• ) and N (• ) denote the different imaginary components
of • respectively. xr ∈ RN , xc ∈ CN , xq ∈ QN , xo ∈ ON , Wr ∈ RN × N , Wc ∈ CN × N , Wq ∈ QN , Wo ∈ ON , xi ∈ RN , Wi ∈ RN×N , i = 0, 1, ..., 7, where R, C, Q, and O denote
real, complex, quaternion, and octonion domain, respectively.

Multidimensional Input Filter Various convolutions


domains vector matrix between W and h Implementation of various convolutions in real domain

Wr ∗ xr
Real [25] xr Wr  R (W ∗ x ) = W ∗ x
= Wr [i, j]xr [m − i, n − j]
i j
xc Wc      
Wc ∗c xc R(Wc ∗c xc ) W0 −W1 x
Complex [65] = x0 = W0 = ∗ 0
= (W0 + W1 e1 )∗c (x0 + x1 e1 ) I (Wc ∗c xc ) W1 W0 x1
+x1 e1 +W1 e1
xq Wq ⎡ ⎤ ⎡ ⎤ ⎡ ⎤
R(Wq ∗q xq ) W0 −W1 −W2 −W3 x0
= x0 = W0
⎢ I (Wq ∗q xq ) ⎥ ⎢W1 W0 −W3 W 2 ⎥ ⎢x 1 ⎥
Quaternion [67] +x1 e1 +W1 e1 ⎣J (W ∗ x )⎦ = ⎣W W3 W0 −W1
⎦ ∗⎣ ⎦
x2
+x 2 e 2 +W2 e2 Wq ∗q xq q q q 2

= ( W0 + W1 e1 + W2 e2 + W3 e3 ) K (Wq ∗q xq ) W3 −W2 W1 W0 x3
+x3 e3 +W3 e3
∗q (x0 + x1 e1 + x2 e2 + x3 e3 )
xo Wo ⎡ R (W ∗ x ) ⎤
o o o
= x0 = W0
+x1 e1 +W1 e1 ⎢ I (Wo ∗o xo ) ⎥
⎢ J (Wo ∗o xo ) ⎥
+x2 e2 +W2 e2 ⎢ ⎥
⎢ K (Wo ∗o xo ) ⎥
⎢ ⎥
Octonion +x3 e3 +W3 e3 ⎢ E (Wo ∗o xo ) ⎥ =
+x4 e4 +W4 e4 Wo ∗o xo ⎢ ⎥
= ( W0 + W1 e1 + W2 e2 + W3 e3
⎢ L (Wo ∗o xo ) ⎥
+x5 e5 +W5 e5 ⎣ ⎦
+W4 e4 + W5 e5 + W6 e6 + W7 e7 ) M (Wo ∗o xo )
+x6 e6 +W6 e6
∗o (x0 + x1 e1 + x2 e2 + x3 e3 N ( W o ∗o x o )
+x7 e7 +W7 e7 ⎡W ⎤
+x4 e4 + x5 e5 + x6 e6 + x7 e7 ) 0 −W1 −W2 −W3 −W4 −W5 −W6 −W7
⎢W 1 W0 −W3 W2 −W5 W4 W7 −W6 ⎥
⎢W 2 −W1 −W6 −W7 W5 ⎥
⎢ W3 W0 W4 ⎥
⎢W 3 −W2 W1 W0 −W7 W6 −W5 W4 ⎥
⎢ ⎥∗
⎢W 4 W5 W6 W7 W0 −W1 −W2 −W3 ⎥
⎢ ⎥
⎢W 5 −W4 W7 −W6 W1 W0 W3 −W2 ⎥
⎣ ⎦
W6 −W7 −W4 W5 W2 −W3 W0 W1
W7 W6 −W5 −W4 W3 W2 −W1 W0
⎡x ⎤
0
⎢x 1 ⎥
⎢x 2 ⎥
⎢ ⎥
⎢x 3 ⎥
⎢ ⎥
⎢x 4 ⎥
⎢ ⎥
⎢x 5 ⎥
⎣ ⎦
x6
x7
J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191 189

Appendix 2. The matrix U

 Vri Vr j V Vre V Vrm Vrn


Urr = Vrr ; Uri = ; Ur j = ; U = rk ; Ure = ; U = rl ; Urm = ; Urn = ;
Urr Urr rk Urr Urr rl Urr Urr Urr
 1 1 1 1
Uii = Vii − Uri 2 ; Ui j = (Vi j − (UriUr j )); Uik = (Vik − (Ur jUrk )); Uie = (Vie − (UrkUre )); Uil = ∗ (Vil − (UreUrl ));
Uii Uii Uii Uii
1 1
Uim = (Vim − (Url Urm )); Uin = (Vin − (UrmUrn ))
Uii Uii

1 1 1
Uj j = V j j − (Ui j 2 + Ur j 2 ); U jk = (V − (Ui jUik + Ur jUrk )); U je = (V je − (UikUie + UrkUre )); U jl = (V − (UieUil + UreUrl ));
U j j jk Uj j U j j jl
1 1
U jm = (V jm − (Uil Uim + Url Urm )); U jn = (V jn − (UimUin + UrmUrn ));
Uj j Uj j

1 1
Ukk = Vkk − (U jk 2 + Uik 2 + Urk 2 ); Uke = (V − (U jkU je + UikUie + UrkUre )); Ukl = (V − (U jeU jl + UieUil + UreUrl ));
Ukk ke Ukk kl
1 1
Ukm = (V − (U jl U jm + Uil Uim + Url Urm )); Ukn = (V − (U jmU jn + UimUin + UrmUrn ));
Ukk km Ukk kn

1
Uee = Vee − (Uke 2 + U je 2 + Uie 2 + Ure 2 ); Uel = (V − (UkeUkl + U jeU jl + UieUil + UreUrl ));
Uee el
1 1
Uem = (Vem − (Ukl Ukm + U jl U jm + Uil Uim + Url Urm )); Uen = (Ven − (UkmUkn + U jmU jn + UimUin + UrmUrn ));
Uee Uee

1
Ull = Vll − (Uel 2 + Ukl 2 + U jl 2 + Uil 2 + Url 2 ); Ulm = (V − (Uel Uem + Ukl Ukm + U jl U jm + Uil Uim + Url Urm ));
Ull lm
1
Uln = (Vln − (UemUen + UkmUkn + U jmU jn + UimUin + UrmUrn ));
Ull

1
Umm = Vmm − (Ulm 2 + Uem 2 +Ukm 2 +U jm 2 + Uim 2 + Urm 2 ); Umn = (Vmn − (UlmUln + UemUen + UkmUkn + U jmU jn + UimUin + UrmUrn ));
Umm

Unn = Vnn − (Umn 2 + Uln 2 + Uen 2 + Ukn 2 + U jn 2 + Uin 2 + Urn 2 ).

References [19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, Going deeper with convolutions, in: Pro-
ceedings of the CVPR, 2015.
[20] S. Ioffe, C Szegedy, Batch Normalization: Accelerating Deep Network Training
[1] W.S. McCulloch, W. Pitts, A logical calculus of ideas immanent in nervous ac-
by Reducing Internal Covariate Shift„ in: Proceedings of the 32nd International
tivity, Bull. Math. Biophys. 5 (4) (1943) 115–133.
Conference on Machine Learning (ICML), 2015, pp. 448–456.
[2] S. Haykin, Neural Networks: A Comprehensive Foundation, 2 ed., Prentice Hall,
[21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception
1998 ISBN 0-13-273350-1.
architecture for computer vision, in: Proceedings of the IEEE Conference on
[3] Y. Bengio, A. Courville, P.Vincent, Representation learning: a review and new
Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[22] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, Inception-ResNet
[4] Y. LeCun, Y. Bengio, G.E. Hinton, Deep learning, Nature 521 (2015) 436–
and the Impact of Residual Connections on Learning, Thirty-first AAAI confer-
4 4 4.
ence on artificial intelligence, 2017, pp. 4278–4284.
[5] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale
[6] L. Deng, D. Yu, Deep learning: methods and applications, Found. Trends Signal
image recognition, ArXiv e-prints, Arxiv:1409.1556, 2014.
Process. 7 (3–4) (2014) 197–387.
[24] R.K. Srivastava, K. Greff, J. Schmidhuber, Training very deep networks, in: Ad-
[7] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw.
vances in neural information processing systems (NIPS), 2015, pp. 2377–2385.
61 (2015) 85–117.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition,
[8] H. Wang, B. Raj, On the Origin of Deep Learning, ArXiv e-prints Arxiv:1702.
in: Proceedings of the CVPR, 2016.
07800, 2017.
[26] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations
[9] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuaib, et al., Recent advances
for deep neural networks, in: Proceedings of the IEEE conference on computer
in convolutional neural networks, Pattern Recognit. 77 (2018) 354–377.
vision and pattern recognition (CVPR), 2017, pp. 1492–1500.
[10] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual
[27] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely Connected Con-
understanding: A review, Neurocomputing 187 (2016) 27–48.
volutional Networks, in: Proceedings of the IEEE conference on computer vi-
[11] A. Prieto, B. Prieto, E.M. Ortigosa, E. Ros, F. Pelayo, J. Ortega, et al., Neural net-
sion and pattern recognition (CVPR), 2017, pp. 4700–4708.
works: An overview of early research, current frameworks and new challenges,
[28] G. Larsson, M. Maire, G. Shakhnarovich, FractalNet: Ultra-Deep Neural Net-
Neurocomputing 214 (2016) 242–268.
works without Residuals, International Conference on Learning Representa-
[12] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neu-
tions (ICLR), 2017.
ral network architectures and their applications, Neurocomputing 234 (2017)
[29] X. Zhang, Z. Li, C.C. Loy, D. Lin, Polynet: a pursuit of structural diversity in very
11–26.
deep networks, in: Proceedings of the IEEE Conference on Computer Vision
[13] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of
and Pattern Recognition (CVPR), 2017, pp. 718–726.
Brain Mechanisms, Spartan Books, Washington DC, 1961.
[30] J. Hu, L. Shen, G. Sun, Squeeze-and-Excitation Networks, in: Proceedings of
[14] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by
the IEEE conference on computer vision and pattern recognition (CVPR), 2017,
back-propagating errors, Nature 323 (6088) (1986) 533–536.
pp. 7132–7141.
[15] G. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets,
[31] Y. Yang, Z. Zhong, T. Shen, Z. Lin, Convolutional Neural Networks with Al-
Neural Comput. 18 (7) (2006) 1527–1554.
ternately Updated Clique, in: Proceedings of the IEEE CVPR, 2018, pp.
[16] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neu-
2413–2422.
ral networks, Science 313 (5786) (2006) 504–507.
[32] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized Neural
[17] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient based learning applied to
Networks, in: Proceedings of NIPS, 2016, pp. 4107–4115.
document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[33] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer,
[18] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep con-
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB
volutional neural networks, in: Proceedings of the NIPS, 2012.
model size, in: Proceedings of ICLR, 2017.
190 J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191

[34] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., [68] T. Parcollet, M. Morchid, P.M. Bousquet, R. Dufour, G. Linarès, R. De Mori, Qua-
MobileNets: Efficient convolutional neural networks for mobile vision applica- terion Neural Networks for Spoken Language Understanding, Spoken Language
tions, ArXiv e-prints, ArXiv:1704.04861, 2017. Technology Workshop IEEE (2016) 362–368.
[35] J.L. Elman, Finding structure in time, Cogn. Sci. 14 (2) (1990) 179–211. [69] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, R. De Mori, Speech Recogni-
[36] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) tion with Quaternion Neural Networks, in: Conference on Neural Information
(1997) 1735–1780. Processing Systems, 2018.
[37] A. Graves, Jürgen Schmidhuber, Framewise phoneme classification with bidi- [70] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, R. De Mori, Quaternion Re-
rectional LSTM and other neural network architectures, neural networks, IJCNN current Neural Networks, in: Proceedings of ICLR, 2019.
18 (5) (2005) 602–610. [71] T. Parcollet, M. Morchid, G. Linarès, R. De Mori, Bidirectional Quaternion
[38] K. Cho, B.V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Long-Short Term Memory Recurrent Neural Networks for Speech Recognition,
Y. Bengio, Learning phrase representations using RNN encoder-decoder for sta- in: ICASSP (2019) 8519–8523.
tistical machine translation„ in: Proceedings of the Conference on Empirical [72] T. Parcollet, M. Morchid, G. Linarès, Quaternion Convolutional Neural Networks
Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734. for Heterogeneous Image Processing, in: ICASSP (2019) 8514–8518.
[39] W.R. Hamilton, Lectures on quaternions, Nature 57 (1462) (2010) 7. [73] R. Caruana, Multitask learning, Auton. Agent Multi Agent Syst. 27 (1) (1998)
[40] T.A. Ell, S.J. Sangwine, Hypercomplex Fourier transforms of color images, IEEE 95–133.
Trans. Image Process. 16 (1) (2007) 22–35. [74] S. Ruder, An Overview of Multi-Task Learning in Deep Neural Networks, 2017.
[41] C.C. Took, D.P. Mandic, The quaternion LMS algorithm for adaptive filtering of [75] J. Baxter, A Bayesian/information theoretic model of learning to learn via mul-
hypercomplex processes, IEEE Trans. Signal Process. 57 (2009) 1316–1327. tiple task sampling, Mach. Learn. 28 (1997) 7–39.
[42] R. Zeng, J.S. Wu, Z.H. Shao, Y. Chen, B.J. Chen, L. Senhadji, H.Z. Shu, Color image [76] F. Chollet, Xception: deep learning with depthwise separable convolutions, in:
classification via quaternion principal component analysis network, Neurocom- Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
puting 216 (2016) 416–428. tion, 2017, pp. 1251–1258.
[43] S. Okubo, Introduction to Octonion and Other Non-Associative Algebras in [77] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward
Physics, Cambridge University Press, 1995. neural networks, in: Proceedings of the International Conference on Artificial
[44] H.Y. Gao, K.M. Lam, From quaternion to octonion: feature-based image saliency Intelligence and Statistics, 2010.
detection, in: Proceedings of the IEEE International Conference on Acoustics, [78] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing hu-
speech and signal processing, 2014. man-level performance on ImageNet classification, in: Proceedings of the IEEE
[45] Ł. Błaszczyk, K.M. Snopek, Octonion Fourier transform of real-valued func- International Conference on Computer Vision, 2015.
tions of three variables - selected properties and examples, Signal Process. 136 [79] K. He, J. Sun, Convolutional neural networks at constrained time cost, in: Pro-
(2017) 29–37. ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
[46] B. Widrow, J. Mccool, M. Ball, The complex LMS algorithm, in: Proceedings of 2015, pp. 5353–5360.
the IEEE, 63, 1975, pp. 719–720. [80] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for ef-
[47] A. Hirose, Complex-Valued Neural Networks: Advances and Applications, John ficient neural network, in: Advances in neural information processing systems
Wiley & Sons, 2013. (NIPS), 2015, pp. 1135–1143.
[48] Q. Song, H. Yan, Z. Zhao, Y. Liu, Global exponential stability of complex-valued [81] I. Misra, A. Shrivastava, A. Gupta, M. Hebert, Cross-stitch networks for multi–
neural networks with both time-varying delays and impulsive effects, Neural task learning„ in: Proceedings of the IEEE Conference on Computer Vision and
Netw. 79 (2016) 108–116. Pattern Recognition (CVPR), 2016, pp. 3994–4003.
[49] P. Arena, L. Fortuna, L. Occhipinti, M.G. Xinilia, Neural networks for quater- [82] S. Ruder, J. Bingel, I. Augenstein, A. Søgaard. Learning what to share between
nion-valued function approximation, Int. Symp. Circuits Syst. 6 (1994) loosely related tasks, ArXiv e-prints, ArXiv:1705.08142, 2017.
307–310. [83] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks,
[50] T. Isokawa, T. Kusakabe, N. Matsui, F. Peper, Quaternion neural network in: Proceedings of the European Conference on Computer Vision, 2014,
and its application, in: Proceedings of the International Conference on pp. 818–833.
Knowledge-Based and Intelligent Information and Engineering Systems, 2003, [84] Y. Gao, J. Ma, M. Zhao, W. Liu, A.L. Yuille, NDDR-CNN: Layerwise Feature Fus-
pp. 318–324. ing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction„ in:
[51] H. Kusamichi, T. Isokawa, N. Matsui, Y. Ogawa, K. Maeda, A new scheme for Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
color night vision by quaternion neural network, in: Proceedings of the Inter- tion (CVPR), 2019, pp. 3205–3214.
national Conference on Autonomous Robots and Agents, 2004, pp. 101–106.
[52] C. Jahanchahi, C. Took, D. Mandic, On HR calculus, quaternion valued stochastic Jiasong Wu received the B.S. degree in Biomedical En-
gradient, and adaptive three dimensional wind forecasting, in: Proceedings of gineering from the University of South China, Hengyang,
the International Joint Conference on Neural Networks, IEEE, 2010, pp. 1–5. China, in 2005, and joint Ph.D. degree with the Labora-
[53] T. Isokawa, T. Kusakabe, N. Matsui, F. Peper, Quaternion neural network tory of Image Science and Technology (LIST), Southeast
and its application, in: Proceedings of the International Conference on University, Nanjing, China, and Laboratoire Traitement du
Knowledge-Based and Intelligent Information and Engineering Systems, 2003, signal et de l’Image (LTSI), University of Rennes 1, Rennes,
pp. 318–324. France in 2012. He is now working in the LIST as a lec-
[54] N. Matsui, T. Isokawa, H. Kusamichi, F. Peper, H. Nishimura, Quaternion neural turer. His-research interest mainly includes deep learn-
network with geometrical operators, J. Intell. Fuzzy Syst. Appl. Eng. Technol. 15 ing, fast algorithms of digital signal processing and its
(3, 4) (2004) 149–164. applications. Dr. Wu received the Eiffel doctorate schol-
[55] C.-A. Popa, Octonion-Valued Neural networks, Artificial Neural Networks and arship of excellence (2009) from the French Ministry of
Machine Learning, ICANN, 2016. Foreign Affairs and also the Chinese government award
[56] C.-A. Popa, Global exponential stability of neutral-type octonion-valued neural for outstanding self-financed student abroad (2010) from
networks with time-varying delays, Neurocomputing 309 (2) (2018) 177. the China Scholarship Council.
[57] C.-A. Popa, Global exponential stability of octonion-valued neural networks
with leakage delay and mixed delays, Neural Netw. 105 (2018) 277–293. Ling Xu received the B.S. degree in Computer Science and
[58] J. Pearson, D. Bisset, Neural networks in the Clifford domain, in: Proceed- technology from Hefei University of Technology, Hefei,
ings of the International Conference on Neural Networks, 3, IEEE, 1994, China, in 2017. Now she is currently pursuing the M.S. de-
pp. 1465–1469. gree in Computer Science and technology, Southeast Uni-
[59] S. Buchholz, G. Sommer, On Clifford neurons and Clifford multi-layer percep- versity. Her research interests lie in deep learning and
trons, Neural Netw. 21 (7) (2008) 925–935. pattern recognition.
[60] Y. Kuroe, S. Tanigawa, H. Iima, Models of Hopfield-type Clifford neural net-
works and their energy functions –hyperbolic and dual valued networks, Proc.
Int. Conf. Neural Inf. Process. 7062 (2011) 560–569.
[61] D.P. Reichert, T. Serre, Neuronal synchrony in complex-valued deep networks,
in: Proceedings of the ICLR, 2014.
[62] R. Haensch, O. Hellwich, Complex-valued convolutional neural networks for
object detection in PolSAR data, in: Proceedings of the EUSAR, 2010, pp. 1–4.
[63] Z. Zhang, H. Wang, F. Xu, Y.Q. Jin, Complex-valued convolutional neural net- Fuzhi Wu received the B.S. degree from Anhui Normal
work and its application in polarimetric SAR image classification, IEEE Trans. University in 2017 and now is studying for Ph.D. degree
Geosci. Remote Sens. 55 (12) (2017) 7177–7188. in School of Computer Science and Engineering, Southeast
[64] C.-A. Popa, Complex-valued convolutional neural networks for real-valued im- University. His-research interests lie in deep learning and
age classification, in: Proceedings of the IJCNN, 2017. pattern recognition, signal and image processing.
[65] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J.F. Santos, et al.,
Deep Complex Networks, in: Proceedings of ICLR, 2018.
[66] Alex Krizhevsky, H. Geoffrey, Learning Multiple Layers of Features From Tiny
Images, Learning Multiple Layers of Features From Tiny Images, 1, University
of Toronto, 2009 Technical report.
[67] C. Gaudet, A. Maida, Deep quaternion networks, in: Proceedings of the IJCNN,
2018.
J. Wu, L. Xu and F. Wu et al. / Neurocomputing 397 (2020) 179–191 191

Youyong Kong received the B.S. and M.S. degrees in com- Huazhong Shu received the B.S. degree in Applied Math-
puter science and engineering from Southeast University, ematics from Wuhan University, China, in 1987, and a
Nanjing, China, in 2008 and 2011, respectively, and the Ph.D. degree in Numerical Analysis from the University
Ph.D. degree in imaging and diagnostic radiology from the of Rennes 1, Rennes, France in 1992. He is a professor of
Chinese University of Hong Kong, Hong Kong, in 2014. the LIST Laboratory and the Codirector of the CRIBs. His-
He is currently an Assistant Professor with the College of recent work concentrates on the image analysis, pattern
Computer Science and Engineering, Southeast University. recognition and fast algorithms of digital signal process-
His-current research interests include machine learning, ing. Dr. Shu is a senior member of IEEE Society.
and medical image processing and brain network analy-
sis.

Lotfi Senhadji received the Ph.D. degree from the Univer-


sity of Rennes 1, Rennes, France, in signal processing and
telecommunications in 1993. He is a Professor and the
Head of the INSERM Research Laboratory LTSI. He is
also Co-Director of the French-Chinese Laboratory CRIBs
“Centre de Recherche en Information Biomédicale Sino-
Français”. His recent works are focused on signal
processing and modeling, orthogonal transforms, mixture
separation, learning methods, with particular empha-
sis on analysis, detection, classification, and interpretation
of biosignals. He has published more than 140 research
papers in journals and conferences, and he contributed
to five handbooks. Prof. Senhadji is a Senior member of
the IEEE EMBS and the IEEE Signal Processing Society.

You might also like