Digital Object Identifier 10.1109/ACCESS.2019.2913205

Self Residual Attention Network For Deep Face Recognition
HEFEI LING¹, JIYANG WU¹, LEI WU¹, JUNRUI HUANG¹, JIAZHONG CHEN¹, AND PING LI¹
¹School of Computer Science and Technology, Huazhong University of Science and Technology, China
Corresponding author: Hefei Ling (e-mail: lhefei@163.com).
This work was supported in part by the Natural Science Foundation of China under Grant U1536203, in part by the National Key Research and Development Program of China (2016QY01W0200), and in part by the Major Scientific and Technological Project of Hubei Province (2018AAA068).

ABSTRACT Discriminative feature embedding is of essential importance in the field of large-scale face recognition. In this paper, we propose a self residual attention-based convolutional neural network (SRANet) for discriminative face feature embedding, which aims to learn the long-range dependencies of face images by decreasing the information redundancy among channels and focusing on the most informative components of spatial feature maps. More specifically, the proposed attention module consists of a self channel attention (SCA) block and a self spatial attention (SSA) block, which adaptively aggregate the feature maps in both the channel and spatial domains to learn the inter-channel relationship matrix and the inter-spatial relationship matrix; matrix multiplications are then conducted to obtain a refined and robust face feature. With the proposed attention module, standard convolutional neural networks (CNNs), such as ResNet-50 and ResNet-101, gain more discriminative power for deep face recognition. Experiments on Labelled Faces in the Wild (LFW), Age Database (AgeDB), Celebrities in Frontal Profile (CFP) and MegaFace Challenge 1 (MF1) show that our proposed SRANet structure consistently outperforms naive CNNs and achieves state-of-the-art performance.

INDEX TERMS Discriminative Face Feature Embedding, Self Residual Channel Attention, Self Residual Spatial Attention

I. INTRODUCTION
Deep convolutional neural networks (CNNs) [10], [12], [19], [33] have significantly improved the state-of-the-art performance on a variety of visual tasks in recent years, e.g., [17], [42], and especially on deep face recognition. Thanks to advanced network architectures [10], [12], [13], [33] and discriminative learning methods [4], [6], [23], [37], deep CNNs have boosted face recognition performance to an unprecedented level. Typically, face recognition comprises face verification and face identification [14], [18]: the former compares a pair of faces to determine whether they belong to the same identity, while the latter identifies a probe face against a gallery of faces.

In the face recognition field, the main goal is to obtain a discriminative feature satisfying the criterion that the maximal intra-class distance is smaller than the minimal inter-class distance under a certain metric space. Much research has been conducted to obtain such discriminative features, including the contrastive loss [32], center loss [40] and triplet loss [29]. Recently, angular margin-based loss functions [4], [23], [37] have begun to shine brilliantly and have achieved state-of-the-art results in face recognition tasks. For example, SphereFace [23] proposes a multiplicative angular margin cos(mθ), CosFace [37] introduces an additive cosine margin cos(θ) − m, and ArcFace [4] uses an additive angular margin cos(θ + m) as the supervision signal. All these methods have achieved great success in face recognition tasks.
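For clarity, the three margin formulations above can be summarized in a short sketch, applied only to the target-class cosine logit before the softmax cross-entropy. This is a minimal illustration, not the authors' implementation: the function name and the clamping are ours, and typical margins differ per method (e.g., m = 4 for SphereFace, m ≈ 0.35 for CosFace, m ≈ 0.5 for ArcFace).

```python
import torch

def margin_target_logit(cos_theta, kind, m):
    """Sketch of the three angular-margin formulations cited above,
    applied to the target-class cosine logit."""
    if kind == "sphereface":   # multiplicative angular margin: cos(m * theta)
        return torch.cos(m * torch.acos(cos_theta.clamp(-1.0, 1.0)))
    if kind == "cosface":      # additive cosine margin: cos(theta) - m
        return cos_theta - m
    if kind == "arcface":      # additive angular margin: cos(theta + m)
        return torch.cos(torch.acos(cos_theta.clamp(-1.0, 1.0)) + m)
    raise ValueError(f"unknown margin kind: {kind}")
```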

Apart from loss functions, some methods attempt to combine the attention mechanism with video face recognition. Rao et al. [27] propose an attention-based method to discard misleading and confounding frames and then produce a compact refined feature for person recognition in face videos.

However, in image-based face recognition, feature refinement during training is rarely seen; many methods simply use standard network structures as their backbone. For example, SphereFace [23] and CosFace [37] use a residual-style architecture, and ArcFace [4] adopts ResNet-50 and ResNet-101 [10] as the backbone networks. The feature maps generated by standard CNN structures often have a large number of channels, which inevitably results in information redundancy among channels and even puts the feature at risk of over-fitting. Though regularization techniques such as dropout [11] exist to prevent this, the result is still not satisfying. Furthermore, different parts of a face image should have different importance in real face recognition, but convolution kernels treat them all in the same way. Attention is a mechanism that learns the most important parts of an input signal; it has shown great success in image classification [12], [36], [41], saliency detection [16], [20] and semantic segmentation [7], [39], and it can work well for the problems mentioned above.

FIGURE 1: Flow-chart of our proposed SRANet architecture. The top of the figure shows the standard ResNet pipeline (residual bottlenecks followed by a global average pooling layer and the loss functions), and the bottom shows ours (residual bottlenecks with self residual attention modules, followed by a fully connected layer and the loss functions). Given an intermediate feature map F_M, the attention module first generates a channel refined feature F_C, then yields a spatial refined feature F_S. The face feature vector is extracted from the fully connected layer. ⊕ denotes element-wise summation and ⊗ denotes matrix multiplication.

To address these problems, in this paper we propose a self residual attention-based network (SRANet) to learn the inter-channel and inter-spatial relationships via a self residual attention mechanism. It aims to reduce the information redundancy along the channel dimension and to focus on the most important parts of the spatial dimension of face feature maps. The proposed attention module consists of two blocks that generate a channel attention matrix and a spatial attention matrix in a sequential way, and then performs matrix multiplications to obtain refined face features. With these modifications, the features learned by SRANet make good use of long-range dependencies. In addition, the proposed self residual attention module can be integrated with existing standard CNN structures and can also collaborate with margin-based loss functions for a performance boost. The general framework is shown in Fig. 1. In summary, our contributions are as follows:

(1) We propose a self residual attention-based network (SRANet) for discriminative face feature embedding, which captures global feature dependencies in both spatial and channel dimensions. To the best of our knowledge, this is the first work to introduce the self-attention mechanism to refine visual features in image-based face recognition.

(2) A self residual spatial attention block is proposed to learn the inter-spatial relationships of features, and a self residual channel attention block is used to learn the inter-channel relationships. The two attention blocks decrease the redundancy among channels and focus on the most important parts of a face image.

(3) Experimental studies show that our proposed self attention-based network (SRANet) yields state-of-the-art face recognition performance on four benchmark datasets.

II. RELATED WORKS
A. FACE RECOGNITION ARCHITECTURES
Deep face recognition has been one of the most active research areas in the past few years and has achieved significant progress thanks to the great success of deep CNN models. In early research, DeepFace [34] and DeepID [32] treat face recognition as a multi-class classification problem and use a shallow network with only 9 layers, consisting of 4 convolutional layers, 3 max pooling layers and 2 fully connected layers. This shallow architecture performs well when the training and testing data are small. However, in the work of VGGFace [26], the scale of the training data becomes large and such a shallow architecture is no longer suitable, so a new structure, GoogleNet [33], with a deeper and wider 22-layer inception architecture, is used as the backbone to learn more global semantic face features. Later, in VGGFace2 [3], the authors adopt VGGNet [31] as the backbone architecture; though VGGNet has fewer layers than GoogleNet, it still has more parameters.

With the introduction of ResNet [10], training deep convolutional neural networks finally became practical, and more and more researchers have turned to deep CNN models to handle ever larger training datasets. For example, SphereFace [23], AM-Softmax [6] and CosFace [37] all use a residual-style architecture with 64 layers, which stacks several residual blocks to learn deep face features. More specifically, ArcFace [4] uses ResNet-50 [10] and ResNet-101 [10] for its experiments; in these studies, ResNet-50 and ResNet-100 use an improved residual bottleneck described in [9]. Very recently, the work of SV-Softmax [38] adopts the Attention-56 [36] network to train deep face recognition models; it uses an encoder-decoder style attention module to enhance the feature representation of standard convolutional neural networks.


B. ATTENTION MECHANISMS
Attention can be viewed as a tool to learn the most informative components of an input signal, focusing on important features and suppressing unnecessary ones; it has shown great success in machine translation [35]. Recently, there have been several attempts to incorporate the attention mechanism into large-scale classification tasks [12], [36], [41]. Wang et al. [36] introduce a powerful attention module named the Residual Attention Network, which uses an encoder-decoder style module. Hu et al. [12] propose a robust module called SENet to learn the inter-channel relationships of convolutional features. Woo et al. [41] propose a Convolutional Block Attention Module for feed-forward CNNs. In addition, inspired by the classical non-local means method [2], Wang et al. [39] propose a non-local neural network for video classification, which provides insight by relating the self-attention model to classic computer vision tasks. Zhang et al. [45] introduce self-attention modules to efficiently find global dependencies within internal representations for better image generation. The work of [7] proposes a dual attention network for scene segmentation to model semantic interdependencies.

Recently, attention mechanisms have also been introduced to video face recognition. Yang et al. [43] propose an attention-based method to find the weights of features using the information from the features themselves after feature embedding. Rao et al. [27] use an attention-aware deep reinforcement learning method to discard misleading and confounding frames and find the focuses of attention in face videos.

Unlike them, our research interest is image-based face recognition, and the proposed self residual attention module mainly focuses on the inter-channel and inter-spatial relationships of the feature maps, which captures the long-range dependencies of a face image and obtains a more discriminative face feature.

III. OUR PROPOSED METHODS
The framework of our proposed SRANet is illustrated in Fig. 1. As can be seen, we use ResNet-50 and ResNet-101 [10] as our backbones.

Building upon the original CNN architectures, we add attention modules on top of each residual bottleneck of the ResNet structure to obtain a refined face feature. Specifically, the proposed attention module consists of two blocks, named the self residual channel attention module and the self residual spatial attention module, which learn the channel relationship matrix and the spatial relationship matrix in a sequential way and then obtain the refined feature by matrix multiplications.

For example, given an intermediate feature map F_M, we sequentially obtain the channel refined feature F_C and the spatial refined feature F_S. In addition, we argue that features extracted from a global average pooling layer are not discriminative enough for deep face recognition, so we use a fully connected layer instead. With the modifications mentioned above, we can reduce the information redundancy among channels and learn the most important parts of face images.

A. SELF RESIDUAL SPATIAL ATTENTION MODULE
In traditional CNN architectures, a convolutional feature map usually comes with three dimensions, corresponding to channel, height and width. As the name states, the Self Residual Spatial Attention Module (SSA) is aimed at modeling the inter-dependencies of the spatial dimension and then learning the most important parts of a face image. Inspired by the work of [39], we design our self residual spatial attention module as shown in Fig. 2.

Given an input feature F_I ∈ ℝ^{C×H×W}, our proposed SSA first feeds it into two convolution layers with kernel size 1 × 1, yielding two new feature maps {θ(F_I), φ(F_I)} ∈ ℝ^{(C/r)×H×W}. In some previous work [39], matrix multiplication is performed directly between θ(F_I) and φ(F_I), producing a matrix of dimension HW × HW. However, such an operation is time- and resource-consuming in both training and testing, because the computation cost of the spatial matrix multiplication is huge. To relieve the burden of matrix computation, it is natural to reduce the dimension of the feature maps. Inspired by the contributions of CBAM [41], we add two pooling layers before the matrix calculation, which greatly reduces the GPU memory needed in the forward and backward passes. So we have:

Pool(F_I)^1 = Avg(θ(F_I)) ⊕ Max(θ(F_I))    (1)

Then we reshape φ(F_I) and Pool(F_I)^1 separately to ℝ^{(C/r)×N_1} and ℝ^{(C/r)×N_2}, where N_1 = H × W is the number of spatial positions and N_2 = (H × W)/4 is the number of pooled spatial positions. After that, we perform matrix multiplication between the two feature maps and apply the softmax function on each row of the matrix, giving our self residual spatial attention matrix:

A_S = Softmax(φ(F_I)^T ⊗ Pool(F_I)^1)    (2)

In detail, we have

A_S(i,j) = exp(φ(F_I)_i^T (Pool(F_I)^1)_j) / Σ_{j'=1}^{N_2} exp(φ(F_I)_i^T (Pool(F_I)^1)_{j'})    (3)

where A_S ∈ ℝ^{HW×(HW/4)} is a 2-D matrix representing the inter-spatial relationship between each pair of positions of the input feature maps. Specifically, A_S(i,j) is the i-th position's impact on the j-th position.

After that, we also give F_I a linear transformation and pooling operation to get ρ(F_I) ∈ ℝ^{C×(H/2)×(W/2)} and reshape it to ρ(F_I) ∈ ℝ^{C×(HW/4)}.


FIGURE 2: Our proposed self residual spatial attention module (Light SSA). θ, φ, ρ denote convolutional operations and σ denotes the softmax operation. r is a constant to reduce the number of channels, which we empirically set to 16, as SENet [12] did. ⊕ denotes element-wise summation and ⊗ denotes matrix multiplication. The softmax operation is performed on each row.

Finally, we obtain the self spatial refined feature F_S by matrix multiplication and element-wise summation:

F_S = F_I ⊕ (ρ(F_I) ⊗ A_S^T)    (4)

In all the above equations, ρ, φ, θ denote convolutional operations and σ denotes the softmax operation; ⊕ denotes element-wise summation and ⊗ denotes matrix multiplication. The final residual connection allows us to plug the new module into any existing CNN structure and enables stable convergence.
AC = Sof tmax P ool(FI )1 ⊗ P ool(FI )2 (7)
B. SELF RESIDUAL CHANNEL ATTENTION MODULE
In the forward pass of CNN structures, each channel of the feature map is calculated by a convolution kernel, so information redundancy is inevitable when the number of channels is high. To investigate the inter-channel relationships of an input feature map and reduce its redundancy, we generate a weight matrix with the self residual channel attention module. Similar to the self spatial attention module, the proposed self channel attention module aims to learn the inter-relationships among the channels of face feature maps. Details of our proposed Self Residual Channel Attention Module (Light SCA) are illustrated in Fig. 3.

To compute the inter-relationship matrix of channels, it is natural to aggregate the spatial dimension to reduce the computation cost of matrix multiplication, and we also take channel number reduction into consideration for faster forward and backward passes. Given an input feature map F_I ∈ ℝ^{C×H×W}, we first feed it into two convolution layers that keep the spatial size unchanged, one of which halves the number of channels: {θ(F_I), φ(F_I)} ∈ {ℝ^{C×H×W}, ℝ^{(C/2)×H×W}}. Then we apply average pooling and max pooling to retain a quarter of the spatial dimension. Now we have:

Pool(F_I)^1 = Avg(θ(F_I)) ⊕ Max(θ(F_I))    (5)

Pool(F_I)^2 = Avg(φ(F_I)) ⊕ Max(φ(F_I))    (6)

where {Pool(F_I)^1, Pool(F_I)^2} ∈ {ℝ^{C×(H/2)×(W/2)}, ℝ^{(C/2)×(H/2)×(W/2)}}. After reshape and transpose operations, we have our self channel attention matrix:

A_C = Softmax(Pool(F_I)^1 ⊗ (Pool(F_I)^2)^T)    (7)

In detail:

A_C(i,j) = exp(Pool(F_I)^1_i (Pool(F_I)^2_j)^T) / Σ_{j'=1}^{C/2} exp(Pool(F_I)^1_i (Pool(F_I)^2_{j'})^T)    (8)

where A_C ∈ ℝ^{C×(C/2)} and A_C(i,j) is the i-th channel's impact on the j-th channel. The softmax operation is performed on each row. After the softmax on the channel matrix, we perform matrix multiplication and a residual connection with another branch of the input signal. Finally, we obtain our channel refined feature:

F_C = F_I ⊕ (A_C ⊗ ρ(F_I))    (9)

In the above equations, θ, φ, ρ denote convolutional operations, ⊕ denotes element-wise summation and ⊗ denotes matrix multiplication.
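As with the spatial block, a minimal PyTorch sketch of Eqs. (5)-(9) may help; the names are again ours, the fused average/max pooling with stride 2 realizes Eqs. (5)-(6), and even H and W are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightSCA(nn.Module):
    """Sketch of the self residual channel attention block (Light SCA)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)  # theta: C -> C
        self.phi = nn.Conv2d(channels, half, kernel_size=1)        # phi:   C -> C/2
        self.rho = nn.Conv2d(channels, half, kernel_size=1)        # rho:   C -> C/2

    def forward(self, x):
        b, c, h, w = x.size()
        n2 = (h // 2) * (w // 2)
        # Eqs. (5)-(6): fused pooling keeps a quarter of the spatial dimension.
        t = self.theta(x)
        p1 = (F.avg_pool2d(t, 2) + F.max_pool2d(t, 2)).view(b, c, n2)       # (b, C, n2)
        p = self.phi(x)
        p2 = (F.avg_pool2d(p, 2) + F.max_pool2d(p, 2)).view(b, c // 2, n2)  # (b, C/2, n2)
        # Eqs. (7)-(8): channel attention matrix A_C in R^{C x C/2}, softmax per row.
        attn = torch.softmax(torch.bmm(p1, p2.transpose(1, 2)), dim=-1)     # (b, C, C/2)
        # Eq. (9): weighted rho branch plus residual connection.
        r = self.rho(x).view(b, c // 2, h * w)                              # (b, C/2, HW)
        out = torch.bmm(attn, r).view(b, c, h, w)
        return x + out
```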
C. FEATURE EMBEDDING WITH SELF ATTENTION MODULE
The two proposed self residual attention modules are individual blocks that can be combined with any existing convolutional neural networks.

FIGURE 3: Our proposed self residual channel attention module (Light SCA). θ, φ, ρ denote convolutional operations and σ denotes the softmax operation; ⊕ denotes element-wise summation and ⊗ denotes matrix multiplication. The softmax operation is performed on each row.

In this paper, we aggregate the two modules in a sequential way and append them after every residual bottleneck of the backbone. For an intuitive understanding, we plot the residual block of our proposed self residual attention network (SRANet) in Fig. 4.

FIGURE 4: Comparison of the original ResNet block and our proposed self residual attention network (SRANet) block. ⊕ denotes element-wise summation and ⊗ denotes matrix multiplication. X is the input feature and X_R is the output feature.

As can be seen in the figure, after computing the self channel attention matrix and the self spatial attention matrix, we obtain the corresponding refined features by matrix multiplication and element-wise summation. Because the inter-channel and inter-spatial relationships are calculated by matrix multiplication, we must apply the same operation to obtain the attention-refined feature while keeping the size of the feature map unchanged. In addition, element-wise summation helps to train a deep neural network and avoids gradient vanishing or gradient explosion, which makes the feature refinement process stable.

For example, given an intermediate feature map X, the attention module first generates a channel attention matrix, obtains the weighted feature by matrix multiplication, and then yields the channel refined feature using element-wise summation. Sequentially, the spatial refined feature is obtained in the same way. In addition, Batch Normalization [15], [28] is a widely used technique to stabilize the training process, and we also adopt it for fast convergence. Finally, we obtain the refined feature X_R by residual shortcut learning.
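For concreteness, the following PyTorch sketch shows one way to wire the two blocks into a residual unit as Fig. 4 describes, reusing the LightSCA and LightSSA sketches above. The exact placement of the final BN is our reading of the figure, and the class name is ours.

```python
import torch.nn as nn
# Assumes LightSCA and LightSSA from the sketches above are in scope.

class SRABlock(nn.Module):
    """Sketch: one SRANet residual unit = bottleneck -> SCA -> SSA -> BN."""
    def __init__(self, bottleneck, channels, reduction=16):
        super().__init__()
        self.bottleneck = bottleneck               # any existing residual bottleneck
        self.sca = LightSCA(channels)              # channel refinement first...
        self.ssa = LightSSA(channels, reduction)   # ...then spatial refinement
        self.bn = nn.BatchNorm2d(channels)         # BN for fast, stable convergence

    def forward(self, x):
        out = self.bottleneck(x)  # standard residual computation
        out = self.sca(out)       # F_C: channel refined feature (Eq. 9)
        out = self.ssa(out)       # F_S: spatial refined feature (Eq. 4)
        return self.bn(out)       # X_R: refined output feature
```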
IV. EXPERIMENTS
A. DATASETS
Training Data. We use the publicly available web-collected training datasets CASIA-WebFace [44] and MS-Celeb-1M [8] to train our CNN models. CASIA-WebFace has 494,414 face images belonging to 10,575 identities, with large variations in pose, age, illumination and profession. Since CASIA-WebFace is quite clean, we use it directly without data refinement. The original MS-Celeb-1M dataset contains about 100k identities with 10 million images, but it contains many noisy face images, so we use a well-cleaned version provided by the DeepGlint corporation, which has 86,876 identities and 3,923,399 aligned images.
Validation Data. We employ Labelled Faces in the Wild (LFW) [14], Age Database (AgeDB) [24] and Celebrities in Frontal Profile (CFP) [30] as the validation datasets. The LFW dataset includes 13,233 face images from 5,749 different identities, and verification accuracy is evaluated on 6,000 face pairs.

TABLE 1: Comparison of different self residual spatial attention models. (CASIA-WebFace, ResNet50, ArcFace Loss)
Method LFW Acc. (%) AgeDB-30 Acc. (%) CFP-FP Acc. (%) Inference Speed (ms) Model-Size (MB) GPU Memory (MB)
ResNet50 (Baseline) 99.42 94.45 95.24 14.31 170.51 3583
ResNet50 + Naive SSA 99.55 94.91 95.74 24.69 185.66 15422
ResNet50 + AvgPool SSA 99.52 94.88 95.62 19.53 185.66 7521
ResNet50 + MaxPool SSA 99.53 95.03 95.66 19.70 185.66 7647
ResNet50 + Light SSA 99.55 94.95 95.75 19.71 185.67 7683

TABLE 2: Comparison of different self residual channel attention models. (CASIA-WebFace, ResNet50, ArcFace Loss)
Method LFW Acc. (%) AgeDB-30 Acc. (%) CFP-FP Acc. (%) Inference Speed (ms) Model-Size (MB) GPU Memory (MB)
ResNet50 (Baseline) 99.42 94.45 95.24 14.31 170.51 3583
ResNet50 + Naive SCA 99.50 94.72 95.63 23.13 191.85 7335
ResNet50 + AvgPool SCA 99.48 94.69 95.58 18.21 184.77 5937
ResNet50 + MaxPool SCA 99.49 94.71 95.60 18.13 184.77 6105
ResNet50 + Light SCA 99.49 94.72 95.62 18.34 184.78 6119

The AgeDB dataset contains 16,488 images of 568 distinct identities; each image is annotated with identity and age attributes. There are four groups of AgeDB data with different year gaps, corresponding to 5, 10, 20 and 30 years respectively; each group has ten splits of face images and follows the same face verification protocol as LFW. In this paper, we report performance on the most challenging subset, AgeDB-30. The CFP dataset consists of 500 identities with 10 frontal and 4 profile images per person. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification, each with 10 folds of 350 positive and 350 negative pairs, and accuracy is evaluated by ten-fold cross-validation. Similar to AgeDB, we choose the most challenging subset, CFP-FP, to report the performance of our algorithms.
Test Data. We use the MegaFace [18] dataset as our test data. MegaFace is a large publicly available testing benchmark that aims to evaluate the performance of face recognition algorithms at the million scale of distractors. It includes a gallery set and a probe set. The gallery set, a subset of Flickr photos from Yahoo, consists of more than one million images from 690,000 different identities. The probe set has two databases: FaceScrub [25] and FGNet [1]. In this paper, we use FaceScrub as the probe set, which contains 100,000 images of 530 unique individuals. The MegaFace dataset also contains noisy images, which have a significant effect on the final result; fortunately, a refined version is provided by ArcFace [4].

B. EXPERIMENTAL SETTINGS
Image Processing. We use the open-source face detection algorithm MTCNN [46] to detect faces and localize five landmarks through a cascaded CNN. The cropped faces are obtained by similarity transformation and resized to 112 × 112. Each pixel in the RGB images is normalized by subtracting 127.5 and then dividing by 128. During training, no data augmentation is used except that all images are horizontally flipped with probability 0.5.
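As a concrete illustration, a minimal sketch of this normalization, assuming the face has already been detected, aligned and cropped by MTCNN (the function name is ours):

```python
import numpy as np
import torch

def preprocess(aligned_face):
    """Normalize a 112x112 aligned RGB face crop as described above."""
    img = np.asarray(aligned_face, dtype=np.float32)  # (112, 112, 3)
    img = (img - 127.5) / 128.0                       # per-pixel normalization
    return torch.from_numpy(img).permute(2, 0, 1)     # HWC -> CHW tensor
```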
CNN Architecture. In the face recognition field, many kinds of architectures [4], [23], [37] are used to train models. For a fair comparison, the CNN structures should be identical when evaluating different algorithms. Inspired by the work of [4], we use ResNet-50 [10] and ResNet-100 [10] as our backbones to train the baseline models. We also adopt a BN-Conv-BN-PReLU-Conv-BN module as the residual bottleneck, and all convolution kernels in the residual bottlenecks have a size of 3 × 3, just as ArcFace [4] did. In this paper, the output features of all ResNet models are fixed to 512 dimensions by a fully connected layer.
Training and Testing Details. We implement the algorithms in the PyTorch framework. All CNN models are trained from scratch with the stochastic gradient descent (SGD) algorithm, with a total batch size of 128 on four NVIDIA TITAN XP (12GB) GPUs. In all experiments, we use the additive angular margin loss, also named the ArcFace loss [4], to train our face recognition models. On the CASIA database, the learning rate starts from 0.1 and is divided by 10 at 30K, 50K and 65K iterations, and training finishes at 80K iterations. On DeepGlint-MS-Celeb-1M, we divide the learning rate at 100K, 180K and 240K iterations and finish at 280K iterations. The weight decay is set to 5e−4 and the momentum to 0.9. During testing, for the LFW [14], AgeDB-30 [24] and CFP-FP [30] datasets, the features of the original image and the flipped image are concatenated together to compose the final face representation, while for MegaFace [18] testing only the original feature is used. The cosine distance of features is computed as the similarity score, and all reported results in this paper are evaluated with a single model.
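The sketch below reproduces the CASIA-WebFace optimizer settings and the flip-and-concatenate test protocol described above; `model` is assumed to be an SRANet backbone returning 512-dimensional features, and the helper names are ours.

```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer(model):
    """CASIA-WebFace schedule from the text: lr 0.1, divided by 10 at 30K/50K/65K."""
    opt = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    # Milestones are counted in iterations, so scheduler.step() is called per batch.
    sched = MultiStepLR(opt, milestones=[30000, 50000, 65000], gamma=0.1)
    return opt, sched

@torch.no_grad()
def verification_feature(model, img):
    """Test-time feature: concatenate embeddings of the image and its horizontal flip."""
    batch = torch.stack([img, torch.flip(img, dims=[2])])  # flip along the width axis
    feats = model(batch)                                   # (2, 512)
    return torch.cat([feats[0], feats[1]]).unsqueeze(0)    # (1, 1024)

def cosine_score(f1, f2):
    """Cosine similarity used as the verification score."""
    return F.cosine_similarity(f1, f2).item()
```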
C. ABLATION STUDIES ON SSA AND SCA
To evaluate the performance of our proposed method, we design a set of ablation studies on the two attention modules. In all ablation experiments, we use ResNet-50 as the backbone and train all models on the CASIA-WebFace dataset with the additive angular margin loss [4].

The verification accuracy on the LFW [14], AgeDB-30 [24] and CFP-FP [30] datasets is reported in these experiments.

Self Residual Spatial Attention (SSA). We experimentally verify that using both average-pooling and max-pooling features relieves the self spatial attention module's computation cost without sacrificing verification accuracy. Apart from the baseline, we compare four different SSA varieties: No-Pooling SSA (called Naive SSA hereafter), AvgPool SSA, MaxPool SSA and Light SSA (Light SSA means both AvgPool and MaxPool). Results are shown in Table 1. From the table we observe that, compared to the baseline, all four SSA models boost the three test benchmarks to some extent. Specifically, the accuracy on LFW improves from 99.42% to 99.55%, AgeDB-30 improves from 94.45% to 95.03%, and CFP-FP improves by about 0.51%, from 95.24% to 95.75%. Note that all three verification accuracies are already very high, so such improvements are not easy to obtain. The four kinds of SSA differ little in test accuracy, and all bring an improvement. Compared with Naive SSA, using only global average pooling or only max pooling to reduce the feature map size slightly decreases the results, but combining them compensates: Light SSA shows almost the same results and even performs better. Moreover, Light SSA has a faster inference speed and lower GPU memory. As can be seen in the table, Light SSA reduces the GPU memory by more than half compared with Naive SSA, which greatly reduces the computation cost during training and testing. (GPU memory is measured with a batch size of 128 during training.) Also, in practice we find that Naive SSA does not converge easily, so we choose Light SSA in later experiments.

Self Residual Channel Attention (SCA). Following the same strategy, we also design four kinds of SCA varieties. The test results are illustrated in Table 2. Compared to the baseline, all three benchmarks again obtain a performance improvement: LFW (from 99.42% to 99.50%), AgeDB-30 (from 94.45% to 94.72%) and CFP-FP (from 95.24% to 95.63%). Similar to SSA, Light SCA performs roughly at the same level as Naive SCA, but Light SCA has a smaller model size, faster inference speed and lower GPU memory. The comparison between Naive SCA and Light SCA confirms the effectiveness of our computation cost reduction strategies.

Combining SSA and SCA. Based on the ablation analysis above, we finally choose Light SCA and Light SSA as our proposed attention module. In this subsection, we analyze the performance of the individual attention module networks (SCANet, SSANet) and their combination (SRANet) under different loss functions. We train ResNet-50 on CASIA-WebFace with the SoftMax loss and the ArcFace loss separately. The results are shown in Table 3.

TABLE 3: Combining methods of SSA and SCA. (CASIA-WebFace, ResNet50, Softmax loss and ArcFace loss)
Loss Method LFW (%) AgeDB-30 (%) CFP-FP (%)
SoftMax Loss Baseline 98.78 87.58 89.54
SoftMax Loss SCANet 99.03 88.53 90.42
SoftMax Loss SSANet 99.07 88.39 90.54
SoftMax Loss SRANet 99.10 88.75 90.83
ArcFace Loss Baseline 99.42 94.45 95.24
ArcFace Loss SCANet 99.49 94.72 95.62
ArcFace Loss SSANet 99.55 94.95 95.75
ArcFace Loss SRANet 99.58 95.05 95.82

As can be seen from Table 3, when constrained with the Softmax loss, the proposed attention modules greatly improve the test results; in particular, AgeDB-30 and CFP-FP both improve by over 1% accuracy, AgeDB-30 from 87.58% to 88.75% and CFP-FP from 89.54% to 90.83%. There is also an obvious pattern: integrating the two self attention modules together performs better than either one individually.

SphereFace [23] and ArcFace [4] have been the most popular face recognition algorithms in the past two years; both focus on loss functions and pursue discriminative face features in an angular-margin way. Using these two loss functions alone already yields high performance on these tests. Yet with the proposed self attention modules we still obtain a 0.16% improvement on LFW and an almost 0.6% improvement on AgeDB-30 and CFP-FP, which supports our argument that the proposed attention module can be combined with the existing state-of-the-art algorithms for better face recognition performance.

Analysis on Computation Cost. In Table 1 and Table 2, we list the inference speed, model size and GPU memory needed with a batch size of 128 during training, and give a short comparison in the ablation studies above. Now we give a more detailed analysis of our proposed self attention modules. Fig. 5 plots the corresponding results for SSA.

FIGURE 5: Computation cost of our proposed SSA: inference speed (ms), model size (MB) and GPU memory (MB) for the baseline, Naive SSA, AvgPool SSA, MaxPool SSA and Light SSA. The GPU memory is divided by 10 for a better visual effect.

As can be seen from Fig. 5, compared to the baseline, the proposed SSA indeed requires more inference time and GPU memory.


But with the pooling strategies used in the proposed module, we can reduce the computation cost to a large extent and keep it at an acceptable level, especially the GPU memory. This strongly supports our argument that max pooling or average pooling can greatly relieve the burden of matrix multiplication. In real applications, we can further increase the kernel size of the pooling layers and reduce the number of attention modules to balance performance and computation cost.

Visualization of Attention Module. We visualize the feature maps of different residual blocks in Fig. 6. From the results, we observe that the feature maps with the attention module have more distinctive contours than those without it, which makes the person easier to identify. This shows that our proposed attention module retains the most important information of face images and suppresses the unnecessary parts, resulting in a more discriminative feature for deep face recognition.

FIGURE 6: Feature maps of different residual blocks (input image and blocks 1, 3 and 5, with and without the attention module). For an intermediate feature map of dimension C × H × W, we calculate the average along the C channels.
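A minimal sketch of this channel-averaging visualization (the helper name is ours; it assumes an intermediate feature map of shape C × H × W captured, e.g., with a forward hook):

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def show_channel_mean(feature_map, title):
    """Average a (C, H, W) feature map over its C channels, as in Fig. 6."""
    fmap = feature_map.float().mean(dim=0).cpu().numpy()  # (H, W)
    plt.imshow(fmap, cmap="viridis")
    plt.title(title)
    plt.axis("off")
    plt.show()
```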
D. TEST RESULTS
Results on LFW, AgeDB-30 and CFP-FP. To report better performance on the three verification benchmarks, we adopt ResNet-100 as the backbone and train models on DeepGlint-MS1M, which has 3.9M images of 86k individual identities. The results are shown in Table 4; the bold number in each column represents the best performance.

TABLE 4: Verification performance (%) of different methods. (DeepGlint-MS1M, ResNet100, ArcFace Loss)
Methods Image LFW (%) AgeDB-30 CFP-FP
VGG Face [26] 2.6M 98.95 - -
FaceNet [29] 200M 99.63 - -
Baidu [21] 1.3M 99.13 - -
Center Loss [40] 0.7M 99.28 - -
Range Loss [47] 3.8M 99.52 - -
Marginal Loss [5] 3.8M 99.48 - -
SphereFace [23] 0.5M 99.43 91.70 94.38
SphereFace+ [22] 0.5M 99.47 - -
SphereFace [23] 5.8M 99.76 97.56 93.75
CosFace [37] 5M 99.80 97.91 94.42
ArcFace [4] 5.8M 99.83 98.08 94.53
SRANet, Res100 (ours) 3.9M 99.83 98.47 95.60

From Table 4, we find that accuracy on LFW has become saturated and all competitors achieve over 99%. We obtain the same accuracy as ArcFace [4], 99.83%, but with less training data. As for AgeDB-30 and CFP-FP, we set a new state-of-the-art result without any bells and whistles.

Results on MegaFace Challenge. MegaFace [18] is a very challenging testing benchmark, recently released for large-scale face identification, which aims to evaluate the performance of face recognition algorithms at the million scale of distractors. In the MegaFace experiments, we use the ResNet-50 network with CASIA-WebFace as the training dataset for the small test protocol, while for the large test protocol we use the ResNet-100 network with the refined DeepGlint-MS1M as the training data. Because the original ArcFace [4] uses a private clean MS1M dataset of 5.8M images while we have only 3.9M, we reimplement SphereFace [23], CosFace [37] and ArcFace [4] for a fair comparison.

TABLE 5: Face identification and verification results of different methods on MegaFace Challenge 1. "Id" refers to the rank-1 face identification accuracy with 1M distractors, and "Ver" refers to the face verification TAR at 10^-6 FAR. All results are reported on the refined version of the data.
Methods Id Acc. (%) Ver Acc. (%) Protocol
Contrastive Loss [32] 78.86 79.63 small
Triplet Loss [29] 78.79 80.32 small
Center Loss [40] 80.49 80.14 small
SphereFace [23] 86.56 88.74 small
AM-Softmax [6] 87.32 89.44 small
ArcFace [4] 88.17 90.24 small
SRANet, Res50 (ours) 88.79 91.16 small
SV-Softmax [38] 97.20 97.38 large
SphereFace [23] 97.43 97.50 large
CosFace [37] 97.62 97.76 large
ArcFace [4] 98.05 97.94 large
SRANet, Res100 (ours) 98.34 98.38 large

As can be seen in Table 5, our proposed attention module performs best under both the small and the large protocol. When evaluated under the small protocol, our ResNet-50 SRANet improves rank-1 identification accuracy by about 0.62% and verification accuracy by 0.92%.


The improvement on such a large testing benchmark confirms the effectiveness of our proposed attention module. As for the large protocol, it is clear that, compared with ArcFace [4], our proposed attention module boosts the refined rank-1 identification accuracy by 0.29% (from 98.05% to 98.34%) and the refined verification accuracy by 0.44% (from 97.94% to 98.38%), which reduces the identification error rate by 14.87% and the verification error rate by 21.35%. To sum up, our proposed self attention module learns the most important parts of face images and reduces the redundancy among channels, leading to a more discriminative feature for large-scale face recognition. For a better view, we draw CMC curves to evaluate face identification performance and ROC curves to evaluate face verification performance in Fig. 7, from which the advantage of our proposed methods can be seen more clearly.

FIGURE 7: CMC and ROC curves of different methods on MegaFace: (a) CMC, identification with 1M distractors (identification rate vs. rank); (b) ROC, verification with 1M distractors (true positive rate vs. false positive rate). Results are evaluated on the refined MegaFace dataset.

V. CONCLUSION
In this paper, we propose a self residual attention-based convolutional neural network (SRANet) to learn the long-range dependencies of aligned face images, which aims to decrease the information redundancy among channels and focus on the most informative components of face feature maps. In detail, the proposed attention module consists of two individual blocks named the self residual channel attention block and the self residual spatial attention block. With the proposed attention blocks, we achieve state-of-the-art performance on three validation benchmarks, LFW, AgeDB-30 and CFP-FP, and on one large-scale test benchmark, MegaFace Challenge 1. The test accuracy reported in this paper shows that our proposed self residual attention module can work as an auxiliary tool that integrates with all the existing popular loss functions for a performance boost in the face recognition area.

REFERENCES
[1] Fg-net aging database. http://www.fgnet.rsunit.com/. 2010.
[2] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 60–65. IEEE, 2005.
[3] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018.
[4] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
[5] Jiankang Deng, Yuxiang Zhou, and Stefanos Zafeiriou. Marginal loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 60–68, 2017.
[6] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 2018.
[7] Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983, 2018.
[8] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[9] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5927–5935, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE CVPR, pages 770–778, 2016.
[11] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Computer Science, 3(4):212–223, 2012.
[12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[13] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
[14] Gary B Huang and Erik Learned-Miller. Labeled faces in the wild: Updates and new reporting procedures. Dept. Comput. Sci., Univ. Massachusetts Amherst, Amherst, MA, USA, Tech. Rep. 14-003, 2014.
[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[16] M. Jian, K. M. Lam, J. Dong, and L. Shen. Visual-patch-attention-aware saliency detection. IEEE Transactions on Cybernetics, 45(8):1575, 2015.
[17] Bin Jiang, Jiachen Yang, Zhihan Lv, and Houbing Song. Wearable vision assistance system based on binocular sensors for visually impaired users. IEEE Internet of Things Journal, 2018.
[18] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
[20] Jason Kuen, Zhenhua Wang, and Gang Wang. Recurrent attentional networks for saliency detection. In Computer Vision and Pattern Recognition, pages 3668–3677, 2016.
[21] Jingtuo Liu, Yafeng Deng, Tao Bai, Zhengping Wei, and Chang Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint arXiv:1506.07310, 2015.
[22] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In Advances in Neural Information Processing Systems, pages 6225–6236, 2018.
[23] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on CVPR, 2017.
[24] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–59, 2017.
[25] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347. IEEE, 2014.

[26] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face
recognition. In BMVC, volume 1, page 6, 2015.
[27] Yongming Rao, Jiwen Lu, and Jie Zhou. Attention-aware deep reinforce-
ment learning for video face recognition. In Proceedings of the IEEE
Conference on CVPR, pages 3931–3940, 2017.
[28] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry.
How does batch normalization help optimization?(no, it is not about
internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.
[29] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A
unified embedding for face recognition and clustering. In Proceedings of
the IEEE conference on CVPR, pages 815–823, 2015.
[30] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel,
Rama Chellappa, and David W Jacobs. Frontal to profile face verification
in the wild. In 2016 IEEE Winter Conference on Applications of Computer
Vision (WACV), pages 1–9. IEEE, 2016.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. arXiv preprint arXiv:1409.1556,
2014.
[32] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning
face representation by joint identification-verification. In Advances in
neural information processing systems, pages 1988–1996, 2014.
[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, et al. Going deeper with convolutions. In Proceedings of the IEEE conference on CVPR, pages 1–9, 2015.
[34] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deep-
face: Closing the gap to human-level performance in face verification. In
Proceedings of the IEEE conference on CVPR, pages 1701–1708, 2014.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In Advances in Neural Information Processing Systems,
pages 5998–6008, 2017.
[36] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang
Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for
image classification. arXiv preprint arXiv:1704.06904, 2017.
[37] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao
Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for
deep face recognition. In Proceedings of the IEEE Conference on CVPR,
pages 5265–5274, 2018.
[38] Xiaobo Wang, Shuo Wang, Shifeng Zhang, Tianyu Fu, Hailin Shi, and
Tao Mei. Support vector guided softmax loss for face recognition. arXiv
preprint arXiv:1812.11317, 2018.
[39] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-
local neural networks. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 7794–7803, 2018.
[40] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative
feature learning approach for deep face recognition. In European Confer-
ence on Computer Vision, pages 499–515. Springer, 2016.
[41] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon.
Cbam: Convolutional block attention module. In European Conference
on Computer Vision, pages 3–19. Springer, 2018.
[42] Jiachen Yang, Yinghao Zhu, Bin Jiang, Lei Gao, Liping Xiao, and Zhihui
Zheng. Aircraft detection in remote sensing images based on a deep
residual network and super-vector coding. Remote Sensing Letters,
9(3):228–236, 2018.
[43] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen,
Hongdong Li, and Gang Hua. Neural aggregation network for video face
recognition. In CVPR, volume 4, page 7, 2017.
[44] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face
representation from scratch. Computer Science, 2014.
[45] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena.
Self-attention generative adversarial networks. arXiv preprint
arXiv:1805.08318, 2018.
[46] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face
detection and alignment using multitask cascaded convolutional networks.
IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[47] Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao.
Range loss for deep face recognition with long-tailed training data. In
Proceedings of the IEEE International Conference on Computer Vision,
pages 5409–5418, 2017.


HEFEI LING obtained the B.S., M.S. and Ph.D. degrees from Huazhong University of Science and Technology (HUST), China, in 1999, 2002 and 2005, respectively. He is currently a professor in the School of Computer Science and Technology, HUST. Prof. Ling served as a visiting professor at University College London from 2008 to 2009. He has published more than 100 papers. He now serves as director of the Digital Media and Intelligent Technology Research Institute.

JIYANG WU received the B.E. degree in Computer Science and Technology from Central South University, Changsha, China, in 2016. He is currently pursuing the M.S. degree at Huazhong University of Science and Technology, Wuhan, China. His research interests include computer vision and multimedia analysis, such as face recognition and object detection.

LEI WU received the B.E. degree in Information and Computing Science from Wuhan University of Science and Technology, Wuhan, China, in 2016. He is currently pursuing the Ph.D. degree at Huazhong University of Science and Technology, Wuhan, China. His research interests include information retrieval and non-convex optimization.

JUNRUI HUANG received the B.E. degree from the School of EIC, Huazhong University of Science and Technology, Wuhan, China, in 2018. He is currently pursuing the M.S. degree at the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China. His research interests include computer vision, such as face recognition.

JIAZHONG CHEN received his M.S. and Ph.D. degrees from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2003. He is currently an associate professor in the School of Computer Science and Technology, HUST. His current research interests include computer vision, image processing, machine learning, and multimedia communications.

PING LI is a lecturer in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). He received his Ph.D. degree in Computer Application from HUST in 2009. His research interests include multimedia security, image retrieval and machine learning.