
caPAD - A context-aware model for face presentation attack detection

Pedro C. Neto (1,2), pedro.d.carneiro@inesctec.pt
Ana F. Sequeira (1), ana.f.sequeira@inesctec.pt
Jaime S. Cardoso (2,1), jaime.cardoso@inesctec.pt

(1) INESC TEC, Porto, Portugal
(2) Faculty of Engineering, University of Porto, Porto, Portugal

Abstract

Presentation attacks are among the most frequent vulnerabilities of biometric systems: to perform them, impostors attempt to bypass the biometric vision system. The human visual cortex can leverage distinct information from both the background and the main focus of a scene. However, researchers still rely on the idea that the background is, in the majority of cases, harmful to machine learning algorithms, and thus face presentation attack detection models are trained on tight crops of the face. We argue that this limits the model and its performance. We further show that a binary classification system aware of the background is capable of outperforming its counterpart that receives no information regarding the background. The proposed methodology beats current approaches and achieves an equal error rate (EER) of just 0.9%. We also analyse the predictions from an interpretability point of view and argue that the background elements used by the model are similar to the ones used by humans.

1 Introduction

There has been an unintentional limitation of the capabilities of face presentation attack detection (PAD) systems over the years: the input information that these systems receive is restricted. Most biometric systems are targeted by presentation attacks, and the methods used to defend these systems against such attacks rely on tight face crops [10]. This means that, besides the face, all other information (the background) is removed. This brings some advantages to biometric systems; for instance, in face recognition several faces can be processed independently. On the other hand, it removes contextual and spatial information that might be useful for the defence against presentation attacks. The human visual cortex can process this spatial and contextual information, which allows some attacks to be identified by the human eye, although some replay attacks can fool humans if the replay device has a high enough resolution. We argue that machine vision systems can likely learn to leverage the extra information when it is available, and can even decide whether the background information is useful for the prediction or not. Hence, we believe that instead of limiting the information given to the model, researchers must aim to develop novel and robust models that are capable of leveraging contextual information. Even if it remains in a more philosophical domain, we propose that the researchers' goal should be to approximate models to human vision, or even to surpass it.

Deep neural networks sometimes learn undesired patterns that are then used for predictions. Thus, through the use of explainable artificial intelligence (XAI) methods, we conducted a qualitative assessment of the spatial information used by the model to predict whether an image represents an attack. To avoid errors due to opaqueness, we used visualization methods such as Grad-CAM++ [4]. Their output indicated the spatial areas used by the model, and we verified that these correlate with the ones used by humans. On this basis, we speculate about the influence of the background on the future of face PAD algorithms.

The ROSE-Youtu dataset [12], differently from the majority of other datasets, includes a high diversity of attacks, covering both two-dimensional and three-dimensional information. Hence, it was used to study the impact of the background on model performance and whether the background affects the capability of generalizing between attacks.
2 Dataset

The dataset selected for the experiments in this paper is the ROSE-Youtu dataset [12]. Its public version contains 3350 videos of 20 different subjects, with an average clip length of 10 seconds. The clips were collected with five mobile devices (each with a distinct camera resolution) under five lighting conditions, using the front-facing camera at a distance of about 30 to 50 centimetres between face and camera.

Each subject is associated with eight distinct types of videos, and each type corresponds to a label. The first label, 0, represents genuine samples, whereas the remaining seven each represent one of seven types of attacks: the first two are print attacks, the third and fourth are replay attacks, and the remaining three are based on paper masks. These attacks are described in Table 1.

Table 1: List of attacks present in the ROSE-Youtu dataset [12].

Attack  Description
-       Genuine (bona fide)
#1      Still printed paper
#2      Quivering printed paper
#3      Video which records a Lenovo LCD display
#4      Video which records a Mac LCD display
#5      Paper mask with two eyes and mouth cropped out
#6      Paper mask without cropping
#7      Paper mask with the upper part cut in the middle
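Since the model described later (Section 3) is a binary classifier, the eight ROSE-Youtu labels must be collapsed into bona fide vs. attack targets. The following is a minimal Python sketch of that mapping; the function name and the assertion-based usage example are illustrative, not part of the dataset tooling.

```python
# Minimal sketch: collapsing the eight ROSE-Youtu video labels into the
# binary bona fide / attack targets used for training.

GENUINE_LABEL = 0  # label 0 marks a genuine (bona fide) capture

def to_binary_label(rose_youtu_label: int) -> int:
    """Map a ROSE-Youtu type label (0-7) to 0 = bona fide, 1 = attack."""
    if not 0 <= rose_youtu_label <= 7:
        raise ValueError(f"unexpected label: {rose_youtu_label}")
    return 0 if rose_youtu_label == GENUINE_LABEL else 1

# Labels #1-#2 (print), #3-#4 (replay) and #5-#7 (paper mask)
# all collapse into the positive "attack" class.
assert to_binary_label(0) == 0
assert all(to_binary_label(k) == 1 for k in range(1, 8))
```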
3 Methodology

Despite the distinct types of attacks that endanger biometric systems, in practice it is only necessary to infer whether a given image comes from an impostor or from a genuine person. Therefore, the face PAD problem is, in its essence, formulated as a binary classification task. To tackle this problem we trained a MobileNetV2 backbone, optimized to minimize the probability of the wrong class and maximize the probability of the correct class. The two outputs of the network are activated with the softmax nonlinearity, and weight optimization is done using the binary cross-entropy loss (Eq. 1):

BCE(y, p) = -(y log(p) + (1 - y) log(1 - p))    (1)

where y is the binary ground-truth label and p is the predicted probability of the positive class.
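No training code accompanies this abstract; the PyTorch sketch below merely illustrates the setup just described (MobileNetV2 backbone, two outputs, softmax with cross-entropy, which for two classes matches Eq. 1). The optimizer choice and learning rate are assumptions.

```python
# Sketch of the described training setup, assuming PyTorch and
# torchvision (>= 0.13 for the `weights` argument).
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights=None)
# Replace the ImageNet head with a 2-way output (bona fide vs. attack).
model.classifier[1] = nn.Linear(model.last_channel, 2)

# CrossEntropyLoss applies log-softmax internally, so the model emits
# raw logits; the explicit softmax is only needed at inference time.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed LR

def train_step(frames: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (N, 3, H, W) frames."""
    model.train()
    optimizer.zero_grad()
    logits = model(frames)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the attack probability is the softmax over the logits:
# p_attack = torch.softmax(model(frames), dim=1)[:, 1]
```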
4 Results & Discussion

The initial experiments were intended to evaluate the effect of including or excluding the contextual background. As expected, performance improved considerably when the background was available: the extra information led to major improvements in APCER and EER, with a 29% relative reduction in EER, at the cost of a small increase in BPCER. The results can be seen in Table 2.

Table 2: Comparison of the binary classification system with and without background. Attack Presentation Classification Error Rate (APCER), Bona Fide Presentation Classification Error Rate (BPCER) and Equal Error Rate (EER) are displayed as %; lower values are better. In bold is the best result per column.

Method                 Background  APCER  BPCER  EER
Binary classification  No          0.493  2.199  1.319
Binary classification  Yes         0.123  3.051  0.935
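The APCER, BPCER and EER in Table 2 follow their usual ISO/IEC 30107-3 style definitions. The NumPy sketch below shows one way to compute them from per-sample attack scores; the uniform threshold sweep used to approximate the EER is a simplification.

```python
# Minimal sketch: APCER, BPCER and EER from attack scores. `scores` are
# softmax attack probabilities; `labels` are 1 for attack, 0 for bona fide.
import numpy as np

def apcer_bpcer(scores, labels, threshold):
    """APCER: attacks accepted as bona fide; BPCER: bona fide rejected."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    attacks, bona_fide = scores[labels == 1], scores[labels == 0]
    apcer = np.mean(attacks < threshold)     # attack classified as bona fide
    bpcer = np.mean(bona_fide >= threshold)  # bona fide classified as attack
    return apcer, bpcer

def eer(scores, labels, num_thresholds=1000):
    """Approximate EER: error rate where APCER and BPCER are closest."""
    best_gap, best_eer = 1.0, 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        a, b = apcer_bpcer(scores, labels, t)
        if abs(a - b) < best_gap:
            best_gap, best_eer = abs(a - b), (a + b) / 2
    return best_eer
```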
Figure 1: Samples collected from the ROSE-Youtu dataset [12], containing images from attacks (#4, #1, #6) and genuine captures. The top row, (a) to (f), displays cropped images, whereas the bottom row, (g) to (l), contains the exact same images with all the background information included.
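To make the two input variants of Figure 1 concrete, the sketch below produces a tight face crop and a full-frame input from the same image. The paper does not specify its preprocessing pipeline, so the `detect_face` helper, the bounding-box format, and the input resolution are assumptions.

```python
# Sketch of the two input variants compared in this work: a tight face
# crop vs. the full frame with background. The `detect_face` helper is
# hypothetical; any face detector returning (left, top, right, bottom)
# pixel coordinates would do.
from PIL import Image

INPUT_SIZE = (224, 224)  # assumed network input resolution

def tight_crop(frame: Image.Image, box: tuple) -> Image.Image:
    """Background-free variant: crop to the face bounding box."""
    return frame.crop(box).resize(INPUT_SIZE)

def full_frame(frame: Image.Image) -> Image.Image:
    """Context-aware variant: keep the whole frame, background included."""
    return frame.resize(INPUT_SIZE)

# Usage (box from a hypothetical detector):
# frame = Image.open("frame.jpg")
# box = detect_face(frame)  # e.g. (x0, y0, x1, y1)
# cropped, contextual = tight_crop(frame, box), full_frame(frame)
```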

When compared to the state-of-the-art (Table 3), the results are even more impressive: the inclusion of the background boosts the performance beyond that of the other published methods.

Table 3: Comparison of the proposed approach with the state-of-the-art. EER is displayed as %. In bold is the best result per column.

Method                     EER
Color LBP [1, 5]           27.6
CoALBP (YCbCr) [12]        17.1
CoALBP (HSV) [12]          16.4
Color [2, 5]               13.9
De-Spoofing [5, 9]         12.3
RCTR-all spaces [5]        10.7
ResNet-18 [7]               9.3
SE-ResNet18 [8]             8.6
AlexNet [12]                8.0
DR-UDA (SE-ResNet18) [13]   8.0
DR-UDA (ResNet-18) [13]     7.2
3D-CNN [11]                 7.0
Blink-CNN [6]               4.6
DRL-FAS [3]                 1.8
Ours                        0.9
Finally, we produced explanations of our model for an example of each category of attack. For the replay attack, the explanation in Figure 2a shows that the model leveraged the presence of reflections in the attack image. Figure 2b shows the explanation for a paper mask attack; as expected, the explanation does not rely on the background, and the model instead directs its focus to the mask area for the final prediction. Finally, the print attack explanation is seen in Figure 2c: the model understands the conditions of the given image and directs its focus to an important background artefact, the pin holding the printed image.

Figure 2: Explanations produced for predictions on frames from a video of subject #23: (a) replay attack, (b) paper mask attack, (c) print attack. Colors closer to pink represent areas with larger relevance for the decision; bluish colors represent less important pixels.
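The maps in Figure 2 were generated with Grad-CAM++ [4]. As an illustration of the general mechanism, the sketch below implements the simpler plain Grad-CAM weighting over the MobileNetV2 features from Section 3; Grad-CAM++ differs only in how the channel weights are computed.

```python
# Illustrative sketch: a plain Grad-CAM relevance map over the last
# convolutional features of the MobileNetV2 from Section 3. The paper
# uses the related Grad-CAM++ method [4].
import torch

def grad_cam(model, frame, target_class=1):
    """Return a (h, w) relevance map for `target_class` (1 = attack)."""
    feats, grads = [], []
    layer = model.features[-1]  # last conv block of torchvision MobileNetV2
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(frame.unsqueeze(0))  # frame: (3, H, W) tensor
        logits[0, target_class].backward()
    finally:
        h1.remove()
        h2.remove()
    a, g = feats[0][0], grads[0][0]      # activations/gradients, (C, h, w)
    weights = g.mean(dim=(1, 2))         # global-average-pooled gradients
    cam = torch.relu((weights[:, None, None] * a).sum(dim=0))
    return cam / (cam.max() + 1e-8)      # normalize to [0, 1]
```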
5 Conclusions

In this work, we explored our belief that researchers have been imposing limitations on face presentation attack detection models by cropping the face from the frame. Our experiments corroborated the view that a face PAD model is capable of leveraging both background and face elements to make a correct prediction. The proposed approach surpassed the state-of-the-art results on the ROSE-Youtu dataset, with a lightweight model providing impressive results, and the interpretability analysis corroborated our beliefs regarding the usage of background elements.

Acknowledgements: This work was partially funded by the Project TAMI - Transparent Artificial Medical Intelligence (NORTE-01-0247-FEDER-045905), financed by ERDF - European Regional Development Fund through the North Portugal Regional Operational Program - NORTE 2020, by the Portuguese Foundation for Science and Technology - FCT under the CMU-Portugal International Partnership, and within the PhD grant "2021.06872.BD".

References

[1] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. Face anti-spoofing based on color texture analysis. In 2015 IEEE International Conference on Image Processing (ICIP), pages 2636–2640. IEEE, 2015.
[2] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. Face spoofing detection using colour texture analysis. IEEE Transactions on Information Forensics and Security, 11(8):1818–1830, 2016. doi: 10.1109/TIFS.2016.2555286.
[3] Rizhao Cai, Haoliang Li, Shiqi Wang, Changsheng Chen, and Alex C. Kot. DRL-FAS: A novel framework based on deep reinforcement learning for face anti-spoofing. IEEE Transactions on Information Forensics and Security, 16:937–951, 2020.
[4] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018. doi: 10.1109/WACV.2018.00097.
[5] Yuting Du, Tong Qiao, Ming Xu, and Ning Zheng. Towards face presentation attack detection based on residual color texture representation. Security and Communication Networks, 2021:6652727, 2021. doi: 10.1155/2021/6652727.
[6] Md. Mehedi Hasan, Md. Salah Uddin Yusuf, Tanbin Islam Rohan, and Shidhartho Roy. Efficient two stage approach to detect face liveness: Motion based and deep learning based. In 2019 4th International Conference on Electrical Information and Communication Technology (EICT), pages 1–6, 2019. doi: 10.1109/EICT48899.2019.9068813.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[9] Amin Jourabloo, Yaojie Liu, and Xiaoming Liu. Face de-spoofing: Anti-spoofing via noise modeling. In Proceedings of the European Conference on Computer Vision (ECCV), pages 290–306, 2018.
[10] Dakshina Ranjan Kisku and Rinku Datta Rakshit. Face spoofing and counter-spoofing: a survey of state-of-the-art algorithms. Transactions on Machine Learning and Artificial Intelligence, 5(2):31, 2017.
[11] Haoliang Li, Peisong He, Shiqi Wang, Anderson Rocha, Xinghao Jiang, and Alex C. Kot. Learning generalized deep feature representation for face anti-spoofing. IEEE Transactions on Information Forensics and Security, 13(10):2639–2652, 2018. doi: 10.1109/TIFS.2018.2825949.
[12] Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C. Kot. Unsupervised domain adaptation for face anti-spoofing. IEEE Transactions on Information Forensics and Security, 13(7):1794–1809, 2018. doi: 10.1109/TIFS.2018.2801312.
[13] Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen. Unsupervised adversarial domain adaptation for cross-domain face presentation attack detection. IEEE Transactions on Information Forensics and Security, 16:56–69, 2021. doi: 10.1109/TIFS.2020.3002390.