ages, but seldom explored for face recognition. The pioneer work [35] keeps the feature (mean) fixed from an existing model. It only estimates the variance and relies on an ad-hoc and costly metric. Thus, it is not easy to use, and it is unclear how uncertainty affects feature learning.

This work applies data uncertainty learning to face recognition, such that the feature (mean) and uncertainty (variance) are learnt simultaneously, for the first time. Two learning methods are proposed. They are easy to use and outperform existing deterministic methods as well as [35] on challenging unconstrained scenarios. We also provide an insightful analysis of how incorporating uncertainty estimation helps reduce the adverse effects of noisy samples and affects the feature learning.

Figure 1: (a) A deterministic model gives a point embedding without considering the data uncertainty; (b) a probabilistic model gives a distributional estimation parameterised by an estimated mean and an estimated variance. PFE leverages the pre-trained point embedding as the mean µ and only learns the uncertainty σ for each sample; (c) our method learns σ as well as µ simultaneously, leading to better intra-class compactness and inter-class separability for µ in the latent space. Different classes are marked in blue or red. Best viewed in color.

1. Introduction

Data uncertainty1 captures the "noise" inherent in the data. Modeling such uncertainty is important for computer vision applications [22], e.g., face recognition, because noise widely exists in images.

Most face recognition methods represent each face image as a deterministic point embedding in the latent space [7, 27, 41, 42, 33]. Usually, high-quality images of the same ID are clustered. However, it is difficult to estimate an accurate point embedding for a noisy face image, which usually falls outside its cluster and has larger uncertainty in the embedding space. This is exemplified in Fig. 1 (a): the positive example is far from its class and close to a noisy negative example, causing a mismatch.

Probabilistic face embeddings (PFE) [35] is the first work to consider data uncertainty in face recognition. For each sample, it estimates a Gaussian distribution, instead of a fixed point, in the latent space. Specifically, given a pre-trained FR model, the mean of the Gaussian for each sample is fixed as the embedding produced by the FR model. An extra branch is appended to the FR model and trained to estimate the variance. The training is driven by a new similarity metric, the mutual likelihood score (MLS), which measures the "likelihood" between two Gaussian distributions. It is shown that PFE estimates small variance for high-quality samples but large variance for noisy ones. Together with the MLS metric, PFE can reduce the mismatches on noisy samples. This is illustrated in Fig. 1 (b). While effective, PFE is limited in that it does not learn the embedded feature (mean) but only the uncertainty. As a result, it is unclear how uncertainty affects feature learning. Also, conventional similarity metrics such as cosine distance cannot be used; the more complex MLS metric is required, which takes more runtime and memory.

For the first time, this work applies data uncertainty learning (DUL) to face recognition such that the feature (mean) and uncertainty (variance) are learnt simultaneously. As illustrated in Fig. 1 (c), this improves the features such that instances in the same class are more compact and instances in different classes are more separated. In this case, the learned feature is directly usable with conventional similarity metrics; the MLS metric is no longer necessary.

Specifically, we propose two learning methods. The first is classification based: it learns a model from scratch. The second is regression based: it improves an existing model, similar to PFE. We discuss how the learned uncertainty affects the model training in the two methods, from the perspective of image noise. We provide an insightful analysis that the learned uncertainty improves the learning of identity embeddings by adaptively reducing the adverse effects of noisy training samples.

Comprehensive experiments demonstrate that our proposed methods improve face recognition performance over existing deterministic models and PFE on most public benchmarks. The improvement is more remarkable on benchmarks with low-quality face images, indicating that a model with data uncertainty learning is better suited to unconstrained face recognition scenarios, and thus important for practical tasks.

1 Uncertainty can be characterised into two main categories; the other type is model uncertainty.

Figure 2: (a) The target y in each observed data pair (red dots) is corrupted by x-dependent noise; data uncertainty regression yields the "noise level" (green shading) in addition to the particular predicted value (green line). (b) Samples labeled with the same ID are presented in each row; samples in red boxes are regarded as noisy data compared with the other intra-class samples. Best viewed in color.

…we experimentally demonstrate that the proposed methods perform more robustly on noisy datasets.
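The classification-based method (DUL_cls) learns µ and σ jointly by sampling a stochastic embedding and regularizing the predicted distribution. A minimal numpy sketch of those two ingredients, the reparameterized sample µ + εσ and a KL-divergence regularizer; the standard-normal KL target and the function names are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def sample_embedding(mu, log_var, rng):
    """Reparameterized sample s = mu + eps * sigma, with eps ~ N(0, I).

    mu, log_var: (D,) arrays from the two head branches. Predicting
    log(sigma^2) keeps the variance positive without a constraint.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2 I) || N(0, I) ): the regularizer that, weighted
    by a trade-off coefficient lambda, discourages collapsing sigma to 0."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
```

With log_var driven very negative, the sample collapses onto µ, which is the "nearly deterministic" regime discussed later for λ = 0.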
Constructing a New Mapping Space for FR. We construct a new target space for face data, which is continuous. Most importantly, it is nearly equivalent to the original discrete target space Y, which encourages the correct mapping relationship. Specifically, we pre-train a classification-based deterministic FR model and then utilize the weights of its classifier layer, W ∈ R^{D×C}, as the expected target vectors3. Since each w_i ∈ W can be treated as the typical center of the embeddings of the same class, {X, W} can thus be regarded as the new, equivalent mapping space. Similar to the uncertainty in the continuous mapping space described in Section 3.1, {X, W} has inherent noise. We can formulate the mapping from x_i ∈ X to w_i ∈ W as w_i = f(x_i) + n(x_i), where f(x_i) is the "ideal" identity feature and each observed w_i is corrupted by input-dependent noise.

Distributional Representation. Next, we estimate the above f(x_i) and n(x_i) by data uncertainty regression. Specifically, a Gaussian distribution is assumed for the likelihood, p(z_i | x_i) = N(z_i; µ_i, σ_i^2 I), where µ_i as well as σ_i are parameterised by the weights of neural networks4 (see Fig. 4). If we take each w_c as the target, we should maximize the following likelihood for each x_i:

    p(w_c | x_{i∈c}, θ) = (1 / √(2π σ_i^2)) exp( −(w_c − µ_i)^2 / (2σ_i^2) ).    (5)

In practice, we take the log-likelihood:

    ln p(w_c | x_{i∈c}, θ) = −(w_c − µ_i)^2 / (2σ_i^2) − (1/2) ln σ_i^2 − (1/2) ln 2π.    (6)

Assuming that the x_i, i ∈ {1, 2, ...}, are independently and identically distributed (i.i.d.), the log-likelihood over all data points is Σ_c Σ_i ln p(w_c | x_{i∈c}, θ). Practically, we train the network to predict the log variance, r_i := ln σ_i^2, to stabilize the numerics during stochastic optimization. Last, the likelihood maximization is reformulated as the minimization of the corresponding negative log-likelihood loss.

Loss Attenuation Mechanism. By qualitatively analyzing Eq. 6, our learned variance σ_i can be regarded as an uncertainty score measuring the confidence that the learned identity embedding µ_i belongs to the c-th class. Specifically, for an ambiguous µ_i located far away from its class center w_c, DUL_rgs will estimate a large variance to temper the error term (w_c − µ_i)^2 / (2σ_i^2), instead of overfitting to these noisy samples. DUL_rgs is discouraged from predicting large variance for all samples, which could underfit (w_c − µ_i)^2, because the larger ln σ_i^2 term would punish the model in turn. Simultaneously, DUL_rgs is also discouraged from predicting very small variance for all samples, which would exponentially increase the error term. Thus, Eq. 6 allows DUL_rgs to adapt the weighting of the error term. This makes the model learn to attenuate the effect of the ambiguous µ_i caused by poor-quality samples.

3.4. Discussion of Related Works

We first discuss the connection between DUL_cls and the variational information bottleneck (VIB) [1]. VIB [1] is a variational approximation to the information bottleneck (IB) principle [38] under the framework of deep learning. VIB seeks a stochastic mapping from input data X to a latent representation Z, in terms of the fundamental trade-off between making Z as concise as possible while retaining enough ability to predict the label Y [38]. It is noted that L_cls is similar to the objective function in VIB. However, we analyze this classification method from the data uncertainty perspective, while VIB derives this objective function from the view of the information bottleneck.

We next clarify some differences between DUL_rgs and PFE [35]. Although both PFE and DUL_rgs formally encode the input uncertainty as a variance representation, PFE essentially measures the likelihood of each positive pair {x_i, x_j} sharing the same latent embedding, p(z_i = z_j), while DUL_rgs interprets a conventional least-squares regression technique as maximum likelihood estimation with a data uncertainty regression model.

Last, both DUL_cls and DUL_rgs learn the identity representation µ as well as the uncertainty representation σ, which ensures that our predicted µ can be directly evaluated by commonly used matching metrics. In contrast, PFE has to use the mutual likelihood score (MLS) as the matching metric to improve on the performance of the deterministic model, because the identity representation is not learnt in PFE.

3 Here D refers to the dimension of the embedding and C to the number of classes in the training set.
4 Note that here µ_i ≈ f(x_i) and σ_i ≈ n(x_i).
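Eq. 6 translates directly into a per-sample training loss. A minimal numpy sketch of the negative log-likelihood with the log variance r_i = ln σ_i^2 predicted directly, as described above; the constant (1/2) ln 2π is dropped since it does not affect training, and the function name is illustrative:

```python
import numpy as np

def dul_rgs_loss(w_c, mu, log_var):
    """Negative log-likelihood of Eq. 6, summed over embedding dimensions.

    exp(-log_var) down-weights the squared error of ambiguous samples
    (loss attenuation), while the 0.5 * log_var penalty discourages the
    trivial solution of predicting large variance everywhere.
    """
    sq_err = (w_c - mu) ** 2
    return float(np.sum(0.5 * np.exp(-log_var) * sq_err + 0.5 * log_var))
```

For a fixed residual, raising log_var shrinks the error term but pays a growing (1/2) ln σ^2 penalty; predicting a very small variance instead blows the error term up exponentially. This is exactly the trade-off the Loss Attenuation Mechanism paragraph describes.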
4. Experiments

In this section, we first evaluate the proposed methods on standard face recognition benchmarks. Then we provide qualitative and quantitative analysis to explore the meaning of the learned data uncertainty and how data uncertainty learning affects the learning of FR models. Last, we conduct experiments on the noisy MS-Celeb-1M dataset to demonstrate that our methods perform more robustly than deterministic methods.
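The noisy-data study mentioned above pollutes a randomly chosen proportion of training images with Gaussian blur. A self-contained numpy sketch of that corruption step; the separable kernel construction, the fixed σ, and the selection logic are our illustrative choices, not the paper's exact protocol:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma=2.0):
    """Separable Gaussian blur of a 2-D grayscale image (edge-padded)."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    pad = len(k) // 2
    conv = lambda v: np.convolve(np.pad(v, pad, mode="edge"), k, mode="valid")
    out = np.apply_along_axis(conv, 1, img)   # blur rows
    out = np.apply_along_axis(conv, 0, out)   # blur columns
    return out

def pollute(dataset, proportion, rng, sigma=2.0):
    """Blur a random `proportion` of samples in-place; returns their indices."""
    n = len(dataset)
    idx = rng.choice(n, size=int(proportion * n), replace=False)
    for i in idx:
        dataset[i] = blur(dataset[i], sigma)
    return idx
```

Running this with proportions of 0.1 to 0.4 mirrors the 10%-40% noise settings used in the robustness experiment of Section 4.6.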
…baseline model at different settings of the matching threshold. Additionally, the lower the FPR is set, the better DUL_cls performs on TPR. Fig. 5 also shows the cases in which the vast majority of false acceptances happened for the baseline model and for DUL_cls, respectively. We can see that DUL_cls resolves more FP cases with extreme noise, which typically occur in the baseline model. This indicates that a model with data uncertainty learning is more applicable to the unconstrained face recognition scenario than a deterministic model. We reach a similar conclusion for DUL_rgs.

4.3. Comparing DUL with PFE

For comparison, we re-implemented PFE on all baseline models according to the recommended implementation settings in [35]8. We note that our re-implementation achieves similar or slightly better results than those in [35]. Our DUL_cls/rgs use average-pooling aggregation for the features in a template and are evaluated by cosine similarity. Compared with PFE, our proposed DUL_cls achieves better performance in all cases, and the proposed DUL_rgs also shows competitive performance. Results are reported in Table 1.

PFE interprets the point embedding learned by a deterministic FR model as the mean of its output distributional estimation and only learns the uncertainty (variance) for each sample. Thus, PFE has to use the MLS metric, which takes the predicted variance into account. Although PFE achieves better results with the help of this more precise matching measurement, it suffers higher computational complexity for matching. Specifically, for the verification of 6000 face pairs (LFW), the standard cosine metric takes less than 1 second via matrix multiplication, while MLS takes 1 min 28 s, on two GTX-1080 GPUs.

Method | Training Data | LFW | YTF | MegaFace | CFP-FP
FaceNet [33] | 200M | 99.63 | 95.1 | - | -
DeepID2+ [37] | 300K | 99.47 | 93.2 | - | -
CenterFace [42] | 0.7M | 99.28 | 94.9 | 65.23 | 76.52
SphereFace [27] | 0.5M | 99.42 | 95.0 | 75.77 | 89.14
ArcFace [7] | 5.8M | 99.83 | 98.02 | 81.03 | 96.98
CosFace [41] | 5M | 99.73 | 97.6 | 77.11 | 89.88
L2-Face [32] | 3.7M | 99.78 | 96.08 | - | -
Yin et al. [49] | 1M | 98.27 | - | - | 94.39
PFE [35] | 4.4M | 99.82 | 97.36 | 78.95 | 93.34
Baseline | 3.6M | 99.83 | 96.50 | 98.30 | 98.75
PFE_rep | 3.6M | 99.82 | 96.50 | 98.48 | 97.28
DUL_cls | 3.6M | 99.78 | 96.78 | 98.60 | 98.67
DUL_rgs | 3.6M | 99.83 | 96.84 | 98.12 | 98.78

Table 2: Comparison with the state-of-the-art methods on LFW, YTF, MegaFace (MF) and CFP-FP. "-" indicates that the authors did not report performance on the corresponding protocol. "PFE_rep" means we reproduced PFE ourselves. Backbone: ResNet64.

4.4. Comparison with State-Of-The-Art

To compare with the state of the art, we use a deeper and stronger backbone, ResNet64, trained with AM-Softmax loss on the MS-Celeb-1M dataset, as our baseline model. Then we train the proposed DUL models following the settings described in Section 4.1.

The results are illustrated in Table 2. Note that the performance of the baseline model is saturated on LFW and CFP-FP, where the merit of data uncertainty learning is not obvious. However, DUL_cls/rgs still slightly improve the accuracy on YTF and MegaFace9. Table 3 reports the results of different methods on IJB-C. Both PFE and DUL achieve much better performance than the baseline models.

4.5. Understand Uncertainty Learning

In this part, we qualitatively and quantitatively analyze the proposed DUL to gain more insight into data uncertainty learning.

8 This means we use the mutual likelihood score (MLS) for matching and its proposed fusion strategy for feature aggregation on template/video benchmarks, i.e., YTF and IJB-C.
9 Note that our MegaFace dataset is refined, while the previously reported SOTA results in Table 2 usually use the non-refined MegaFace.
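The runtime gap reported in Section 4.3 has a simple explanation: all-pairs cosine scores reduce to one matrix multiplication of L2-normalized embeddings, whereas MLS must combine the mean and variance vectors of every pair. A numpy sketch of both, using the diagonal-Gaussian MLS formulation reported in [35]; variable names are ours:

```python
import numpy as np

def cosine_scores(A, B):
    """All-pairs cosine similarity: a single matrix multiply
    after L2-normalizing the (N, D) embedding matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def mls(mu1, var1, mu2, var2):
    """Mutual likelihood score between two diagonal-Gaussian embeddings,
    following the form in [35]: each dimension pays a squared-difference
    term scaled by the summed variances, plus a log-variance term."""
    s = var1 + var2
    d = mu1.shape[0]
    return -0.5 * float(np.sum((mu1 - mu2) ** 2 / s + np.log(s))) \
           - 0.5 * d * np.log(2 * np.pi)
```

Because the per-pair variance combination cannot be folded into one GEMM, MLS over a benchmark is far slower than cosine matching, which matches the 1 min 28 s vs. sub-second timing reported above.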
Method | Training Data | IJB-C TPR@FPR=0.001% | TPR@FPR=0.01% | TPR@FPR=0.1% | AUC
Yin et al. [48] | 0.5M | - | - | 69.3 | -
Cao et al. [5] | 3.3M | 74.7 | 84.0 | 91.0 | -
Multicolumn [46] | 3.3M | 77.1 | 86.2 | 92.7 | -
DCN [45] | 3.3M | - | 88.5 | 94.7 | -
PFE [35] | 4.4M | 89.64 | 93.25 | 95.49 | -
Baseline | 3.6M | 83.06 | 92.16 | 95.83 | 91.97
PFE_rep | 3.6M | 89.77 | 94.14 | 96.37 | 93.74
DUL_cls | 3.6M | 88.18 | 94.61 | 96.70 | 93.97
DUL_rgs | 3.6M | 90.23 | 94.21 | 96.32 | 93.88

Table 3: Comparison with the state-of-the-art methods on IJB-C. Backbone: ResNet64.

Figure 7: Bad case analysis between the baseline model and DUL_cls ("easy", "semi-hard" and "hard" cases). Best viewed in color.
…∈ R512 as the approximated measurement of the estimated uncertainty.
11 Note that these average distances are further averaged over all classes in the loss.
Figure 9: (a) Blurred genuine and imposter pairs; (b) cosine similarity score obtained by the baseline and the proposed DUL for each pair (percent vs. blur kernel size, 0-70). Best viewed in color.

4.6. Other Experiments

Impact of the hyper-parameter of DUL_cls. In this part, we qualitatively analyze what the trade-off hyper-parameter λ controls in DUL_cls. As mentioned in VIB [1], the KLD term works as a regularization to trade off the conciseness and the richness of the information preserved in the bottleneck embeddings. We experimentally find that the KL divergence in our method affects the representational capacity of σ. As illustrated in Table 4, DUL_cls without optimization of the KLD term (λ = 0) performs close to the baseline model. In this case, DUL_cls estimates relatively small σ_i for all samples, which makes the sampled representation µ_i + εσ_i nearly deterministic. As the optimization strength of the KLD term increases (λ ↑), DUL_cls is prone to "assign" larger variance to noisy samples and smaller variance to high-quality ones (as illustrated in Fig. 7). However, overly minimizing the KLD term (λ = 1) prompts the model to predict large variance for all samples, which makes L_cls in Eq. 3 hard to converge; thus the performance deteriorates rapidly (see Table 4).

DUL performs more robustly on noisy training data. Based on the analysis in Section 3.4 of how the learned variance affects the model training, we further conduct experiments on noisy MS-Celeb-1M to verify it. We randomly select different proportions of samples from MS-Celeb-1M and pollute them with Gaussian blur noise. Table 5 demonstrates that our proposed DUL_cls/rgs perform more robustly on noisy training data.

Noise | Model | MegaFace | LFW | YTF | IJB-C TPR@FPR=0.001% | =0.01% | =0.1%
0% | baseline | 97.11 | 99.63 | 96.09 | 75.32 | 88.65 | 94.73
10% | baseline | 96.64 | 99.63 | 96.16 | 64.96 | 86.00 | 94.82
10% | PFE [35] | 97.02 | 99.63 | 96.1 | 83.39 | 91.33 | 95.54
10% | DUL_cls | 96.88 | 99.75 | 96.44 | 88.04 | 93.21 | 95.96
10% | DUL_rgs | 96.05 | 99.71 | 96.46 | 84.74 | 91.56 | 95.30
20% | baseline | 96.20 | 99.61 | 96.00 | 43.52 | 80.48 | 94.22
20% | PFE [35] | 96.90 | 99.61 | 95.86 | 82.03 | 90.89 | 95.38
20% | DUL_cls | 96.37 | 99.71 | 96.68 | 89.01 | 93.24 | 95.97
20% | DUL_rgs | 95.51 | 99.66 | 96.64 | 81.10 | 90.91 | 95.27
30% | baseline | 95.72 | 99.60 | 95.45 | 31.51 | 76.09 | 93.11
30% | PFE [35] | 96.82 | 99.61 | 96.12 | 80.92 | 90.31 | 95.29
30% | DUL_cls | 95.86 | 99.73 | 96.38 | 86.05 | 91.80 | 95.02
30% | DUL_rgs | 94.96 | 99.66 | 96.66 | 81.54 | 91.20 | 95.32
40% | baseline | 95.14 | 99.56 | 95.51 | 39.69 | 77.12 | 93.73
40% | PFE [35] | 96.59 | 99.59 | 95.94 | 77.72 | 89.46 | 94.82
40% | DUL_cls | 95.33 | 99.66 | 96.54 | 84.15 | 92.60 | 95.85
40% | DUL_rgs | 94.28 | 99.58 | 96.68 | 78.13 | 87.64 | 94.67

Table 5: Comparison of the baseline model and the proposed DUL_cls/rgs trained on noisy MS-Celeb-1M. Backbone: ResNet18 with AM-Softmax loss.

5. Conclusion

In this work, we propose two general learning methods to further develop data uncertainty learning (DUL) for face recognition: DUL_cls and DUL_rgs. Both methods give a Gaussian distributional estimation for each face image in the latent space and simultaneously learn the identity feature (mean) and the uncertainty (variance) of the estimated mean. Comprehensive experiments demonstrate that our proposed methods perform better than deterministic models on most benchmarks. Additionally, we discuss how the learned uncertainty affects the training of the model from the perspective of image noise, through both qualitative analysis and quantitative results.

6. Acknowledgement

This paper is supported by the National Key R&D Plan of the Ministry of Science and Technology (Project Name: "Grid function expansion technology and equipment for community risk prevention", Project No. 2018YFC0809704).
References

[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations, 2017.
[2] Christopher M Bishop and Cazhaow S Quazaz. Regression with input-dependent noise: A bayesian treatment. In Advances in Neural Information Processing Systems, pages 347-353, 1997.
[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[4] Axel Brando, Jose A Rodríguez-Serrano, Mauricio Ciprian, Roberto Maestre, and Jordi Vitrià. Uncertainty modelling in deep networks: Forecasting short and noisy series. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 325-340. Springer, 2018.
[5] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67-74. IEEE, 2018.
[6] Jiwoong Choi, Dayoung Chun, Hyun Kim, and Hyuk-Jae Lee. Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690-4699, 2019.
[8] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? does it matter? Structural Safety, 31(2):105-112, 2009.
[9] Michael Havbro Faber. On the treatment of uncertainties and probabilities in engineering decision analysis. Journal of Offshore Mechanics and Arctic Engineering, 127(3):243-248, 2005.
[10] Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
[11] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059, 2016.
[12] Paul W Goldberg, Christopher KI Williams, and Christopher M Bishop. Regression with input-dependent noise: A gaussian process treatment. In Advances in Neural Information Processing Systems, pages 493-499, 1998.
[13] Sixue Gong, Vishnu Naresh Boddeti, and Anil K Jain. On the capacity of face representation. arXiv preprint arXiv:1709.10433, 2017.
[14] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87-102. Springer, 2016.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132-7141, 2018.
[17] Wei Hu, Yangyu Huang, Fan Zhang, and Ruirui Li. Noise-tolerant paradigm for training face recognition cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11887-11896, 2019.
[18] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[19] Shuya Isobe and Shuichi Arai. Deep convolutional encoder-decoder network with model uncertainty for semantic segmentation. In 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pages 365-370. IEEE, 2017.
[20] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873-4882, 2016.
[21] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. BMVC, 2015.
[22] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584, 2017.
[23] Salman Khan, Munawar Hayat, Syed Waqas Zamir, Jianbing Shen, and Ling Shao. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 103-112, 2019.
[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
[25] Florian Kraus and Klaus Dietmayer. Uncertainty estimation in one-stage object detection. arXiv preprint arXiv:1905.10296, 2019.
[26] Quoc V Le, Alex J Smola, and Stéphane Canu. Heteroscedastic gaussian process regression. In Proceedings of the 22nd International Conference on Machine Learning, pages 489-496. ACM, 2005.
[27] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212-220, 2017.
[28] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. Iarpa janus benchmark-c: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158-165. IEEE, 2018.
[29] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343-347. IEEE, 2014.
[30] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55-60. IEEE, 1994.
[31] M Elisabeth Paté-Cornell. Uncertainties in risk analysis: Six levels of treatment. Reliability Engineering & System Safety, 54(2-3):95-111, 1996.
[32] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[33] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[34] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chellappa, and David W Jacobs. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1-9. IEEE, 2016.
[35] Yichun Shi, Anil K Jain, and Nathan D Kalka. Probabilistic face embeddings. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
[36] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464-472. IEEE, 2017.
[37] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892-2900, 2015.
[38] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1-5. IEEE, 2015.
[39] Fei Wang, Liren Chen, Cheng Li, Shiyao Huang, Yanjie Chen, Chen Qian, and Chen Change Loy. The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), pages 765-780, 2018.
[40] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926-930, 2018.
[41] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265-5274, 2018.
[42] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499-515. Springer, 2016.
[43] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011.
[44] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884-2896, 2018.
[45] Weidi Xie, Li Shen, and Andrew Zisserman. Comparator networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 782-797, 2018.
[46] Weidi Xie and Andrew Zisserman. Multicolumn networks for face recognition. In BMVC, 2018.
[47] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[48] Bangjie Yin, Luan Tran, Haoxiang Li, Xiaohui Shen, and Xiaoming Liu. Towards interpretable face recognition. arXiv preprint arXiv:1805.00611, 2018.
[49] Xi Yin and Xiaoming Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing, 27(2):964-975, 2017.
[50] Tianyuan Yu, Da Li, Yongxin Yang, Timothy M Hospedales, and Tao Xiang. Robust person re-identification by modelling feature uncertainty. In Proceedings of the IEEE International Conference on Computer Vision, pages 552-561, 2019.
[51] Umara Zafar, Mubeen Ghafoor, Tehseen Zia, Ghufran Ahmed, Ahsan Latif, Kaleem Razzaq Malik, and Abdullahi Mohamud Sharif. Face recognition with bayesian convolutional networks for robust surveillance systems. EURASIP Journal on Image and Video Processing, 2019(1):10, 2019.